Contents

Python Pandas For Your Grandpa - 3.3 DataFrame Basic Indexing

In this section, we’ll look at some basic indexing operations for subsetting and modifying DataFrames. This is really similar to indexing a Series, so if you didn’t see my lecture on indexing a Series, you might wanna go back and watch that one.

To start, let’s build a DataFrame called df with six rows and three columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'shrimp': [10, 20, 30, 40, 50, 60],
    'crab': [5, 10, 15, 20, 25, 30],
    'red fish': [2, 3, 5, 7, 11, 13]
})
print(df)
##    shrimp  crab  red fish
## 0      10     5         2
## 1      20    10         3
## 2      30    15         5
## 3      40    20         7
## 4      50    25        11
## 5      60    30        13

We can access an individual column in its Series form using either square brackets with the column name passed in

df['shrimp']
## 0    10
## 1    20
## 2    30
## 3    40
## 4    50
## 5    60
## Name: shrimp, dtype: int64

or dot notation like this

df.shrimp
## 0    10
## 1    20
## 2    30
## 3    40
## 4    50
## 5    60
## Name: shrimp, dtype: int64

Personally I prefer using dot notation, but sometimes column names include a space, in which case you’re pretty much forced to use square bracket notation, unless you rename your columns. For example, you can do df['red fish'] but not df.red fish.

If you wanted to access multiple columns, you can do that too, using a list of column names like

df[['shrimp', 'crab']]
##    shrimp  crab
## 0      10     5
## 1      20    10
## 2      30    15
## 3      40    20
## 4      50    25
## 5      60    30

in which case you’ll get back a DataFrame object whose data is a copy of the original DataFrame.

If you want to pick out certain rows of a DataFrame by their position, you can do so using iloc[] just like you can with Series. For example, if you want to get the 1st, 3rd, and 5th rows of df, you could do

df.iloc[[0, 2, 4]]
##    shrimp  crab  red fish
## 0      10     5         2
## 2      30    15         5
## 4      50    25        11

Or you could use slicing, like

df.iloc[0:5:2]
##    shrimp  crab  red fish
## 0      10     5         2
## 2      30    15         5
## 4      50    25        11

If you wanted to pick out an individual row, like the 2nd row, you might try something like

df.iloc[1]
## shrimp      20
## crab        10
## red fish     3
## Name: 1, dtype: int64

which works, but returns the result as a Series. In most cases, you probably want to get this back as a 1-row DataFrame, and to do that just wrap the 1 inside a list like

df.iloc[[1]]
##    shrimp  crab  red fish
## 1      20    10         3

You can also use iloc to subset columns so you could do something like df.iloc[:, [0, 1]] to get back columns 0 and 1.

df.iloc[:, [0, 1]]
##    shrimp  crab
## 0      10     5
## 1      20    10
## 2      30    15
## 3      40    20
## 4      50    25
## 5      60    30

In english you could describe this expression as saying “give me every row of the DataFrame, but only columns 0 and 1”.

If you did df.iloc[[0, 2], [1, 2]], it’d be like saying “give me rows 0 and 2, columns 1 and 2”.

df.iloc[[0, 2], [1, 2]]
##    crab  red fish
## 0     5         2
## 2    15         5

As you probably guessed, just like iloc can be used to subset a DataFrame, so can loc. To make the distinction a little more clear, let’s start by replacing df's index with the letters ‘a’ through ‘f’.

df.index = ['a','b','c','d','e','f']
print(df)
##    shrimp  crab  red fish
## a      10     5         2
## b      20    10         3
## c      30    15         5
## d      40    20         7
## e      50    25        11
## f      60    30        13

Now if we want to pick out rows with index labels ‘b’ and ‘e’, we can do

df.loc[['b', 'e']]
##    shrimp  crab  red fish
## b      20    10         3
## e      50    25        11

We could combine row and column subsetting to pick out rows ‘a’, ‘c’, and ‘f’ with columns ‘crab’ and ‘shrimp’ like

df.loc[['a', 'c', 'f'], ['crab', 'shrimp']]
##    crab  shrimp
## a     5      10
## c    15      30
## f    30      60

And we can even use slicing to pick out every row between ‘b’ and ‘e’ for column ‘crab’.

df.loc['b':'e', ['crab']]
##    crab
## b    10
## c    15
## d    20
## e    25

Now, what if we wanted to pick out the first three rows with columns shrimp and red fish? This is one of those times where, frankly, Pandas is kind of disappointing because what you really want to do is something like df.loc[:3, ['shrimp', 'red fish']] but this doesn’t work because loc searches for index labels, so it doesn’t know how to interpret the positional slicing :3. On the other hand df.iloc[:3, ['shrimp', 'red fish']] doesn’t work because iloc uses positional indexing so it doesn’t know how to interpret the list of column names ['shrimp', 'red fish'].

Fortunately, there’s a not-so-terrible workaround for this. The trick is to convert the column names ‘shrimp’, and ‘red fish’ to their corresponding positional indices 0 and 2, and then just use iloc as usual.

Essentially, we want to do df.iloc[:3, [0, 2]], but to allow for column names instead of column indices, we can replace [0, 2] with [df.columns.get_loc(c) for c in ['shrimp', 'red fish']].

df.iloc[:3, [df.columns.get_loc(c) for c in ['shrimp', 'red fish']]]
##    shrimp  red fish
## a      10         2
## b      20         3
## c      30         5

Here, df.columns.get_loc() returns the positional index of a column given its name. So we just combine that with a simple list comprehension to convert a list of column names to positional indices on the fly.

I should also mention that df.columns returns the column index of the DataFrame.

df.columns
## Index(['shrimp', 'crab', 'red fish'], dtype='object')

and in this case it’s just like the row index since our row index is also made up of strings.

df.index
## Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

You can also apply boolean indexing to DataFrames just as you can with Series. For example, if you wanted to get the rows of df where column shrimp is less than 40, you can start by building a boolean Series like

mask = df.shrimp < 40

If we print(mask), notice it has the same index as our DataFrame.

print(mask)
## a     True
## b     True
## c     True
## d    False
## e    False
## f    False
## Name: shrimp, dtype: bool

So then we can use this boolean series with df.loc to pick out the rows where shrimp is less than 40.

df.loc[mask]
##    shrimp  crab  red fish
## a      10     5         2
## b      20    10         3
## c      30    15         5

More commonly you’ll see this as a one liner like

df.loc[df.shrimp < 40]
##    shrimp  crab  red fish
## a      10     5         2
## b      20    10         3
## c      30    15         5

And just like Series, you can combine logical conditions to create more intricate filters like, “rows where shrimp is less than 50 and crab is not divisible by 10”.

df.loc[(df.shrimp < 50) & ~(df.crab % 10 == 0)]
##    shrimp  crab  red fish
## a      10     5         2
## c      30    15         5

Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Indexing
    2.3 Series Basic Operations
    2.4 Series Boolean Indexing
    2.5 Series Missing Values
    2.6 Series Vectorization
    2.7 Series apply()
    2.8 Series View vs Copy
    2.9 Challenge: Baby Names
    2.10 Challenge: Bees Knees
    2.11 Challenge: Car Shopping
    2.12 Challenge: Price Gouging
    2.13 Challenge: Fair Teams
  3. DataFrame
    3.1 DataFrame Creation
    3.2 DataFrame To And From CSV
    3.3 DataFrame Basic Indexing
    3.4 DataFrame Basic Operations
    3.5 DataFrame apply()
    3.6 DataFrame View vs Copy
    3.7 DataFrame merge()
    3.8 DataFrame Aggregation
    3.9 DataFrame groupby()
    3.10 Challenge: Hobbies
    3.11 Challenge: Party Time
    3.12 Challenge: Vending Machines
    3.13 Challenge: Cradle Robbers
    3.14 Challenge: Pot Holes
  4. Advanced
    4.1 Strings
    4.2 Dates And Times
    4.3 Categoricals
    4.4 MultiIndex
    4.5 DataFrame Reshaping
    4.6 Challenge: Class Transitions
    4.7 Challenge: Rose Thorn
    4.8 Challenge: Product Volumes
    4.9 Challenge: Session Groups
    4.10 Challenge: OB-GYM
  5. Final Boss
    5.1 Challenge: COVID Tracing
    5.2 Challenge: Pickle
    5.3 Challenge: TV Commercials
    5.4 Challenge: Family IQ
    5.5 Challenge: Concerts