Python Pandas For Your Grandpa - 3.3 DataFrame Basic Indexing
In this section, we’ll look at some basic indexing operations for subsetting and modifying DataFrames. This is really similar to indexing a Series, so if you didn’t see my lecture on indexing a Series, you might wanna go back and watch that one.
To start, let’s build a DataFrame called df
with six rows and three columns.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'shrimp': [10, 20, 30, 40, 50, 60],
'crab': [5, 10, 15, 20, 25, 30],
'red fish': [2, 3, 5, 7, 11, 13]
})
print(df)
## shrimp crab red fish
## 0 10 5 2
## 1 20 10 3
## 2 30 15 5
## 3 40 20 7
## 4 50 25 11
## 5 60 30 13
We can access an individual column in its Series form using either square brackets with the column name passed in
df['shrimp']
## 0 10
## 1 20
## 2 30
## 3 40
## 4 50
## 5 60
## Name: shrimp, dtype: int64
or dot notation like this
df.shrimp
## 0 10
## 1 20
## 2 30
## 3 40
## 4 50
## 5 60
## Name: shrimp, dtype: int64
Personally I prefer using dot notation, but sometimes column names include a space, in which case you’re pretty much forced to use square bracket notation, unless you rename your columns. For example, you can do df['red fish']
but not df.red fish
.
If you wanted to access multiple columns, you can do that too, using a list of column names like
df[['shrimp', 'crab']]
## shrimp crab
## 0 10 5
## 1 20 10
## 2 30 15
## 3 40 20
## 4 50 25
## 5 60 30
in which case you’ll get back a DataFrame object whose data is a copy of the original DataFrame.
If you want to pick out certain rows of a DataFrame by their position, you can do so using iloc[]
just like you can with Series. For example, if you want to get the 1st, 3rd, and 5th rows of df
, you could do
df.iloc[[0, 2, 4]]
## shrimp crab red fish
## 0 10 5 2
## 2 30 15 5
## 4 50 25 11
Or you could use slicing, like
df.iloc[0:5:2]
## shrimp crab red fish
## 0 10 5 2
## 2 30 15 5
## 4 50 25 11
If you wanted to pick out an individual row, like the 2nd row, you might try something like
df.iloc[1]
## shrimp 20
## crab 10
## red fish 3
## Name: 1, dtype: int64
which works, but returns the result as a Series. In most cases, you probably want to get this back as a 1-row DataFrame, and to do that just wrap the 1 inside a list like
df.iloc[[1]]
## shrimp crab red fish
## 1 20 10 3
You can also use iloc
to subset columns so you could do something like df.iloc[:, [0, 1]]
to get back columns 0 and 1.
df.iloc[:, [0, 1]]
## shrimp crab
## 0 10 5
## 1 20 10
## 2 30 15
## 3 40 20
## 4 50 25
## 5 60 30
In english you could describe this expression as saying “give me every row of the DataFrame, but only columns 0 and 1”.
If you did df.iloc[[0, 2], [1, 2]]
, it’d be like saying “give me rows 0 and 2, columns 1 and 2”.
df.iloc[[0, 2], [1, 2]]
## crab red fish
## 0 5 2
## 2 15 5
As you probably guessed, just like iloc
can be used to subset a DataFrame, so can loc
. To make the distinction a little more clear, let’s start by replacing df
's index with the letters ‘a’ through ‘f’.
df.index = ['a','b','c','d','e','f']
print(df)
## shrimp crab red fish
## a 10 5 2
## b 20 10 3
## c 30 15 5
## d 40 20 7
## e 50 25 11
## f 60 30 13
Now if we want to pick out rows with index labels ‘b’ and ‘e’, we can do
df.loc[['b', 'e']]
## shrimp crab red fish
## b 20 10 3
## e 50 25 11
We could combine row and column subsetting to pick out rows ‘a’, ‘c’, and ‘f’ with columns ‘crab’ and ‘shrimp’ like
df.loc[['a', 'c', 'f'], ['crab', 'shrimp']]
## crab shrimp
## a 5 10
## c 15 30
## f 30 60
And we can even use slicing to pick out every row between ‘b’ and ‘e’ for column ‘crab’.
df.loc['b':'e', ['crab']]
## crab
## b 10
## c 15
## d 20
## e 25
Now, what if we wanted to pick out the first three rows with columns shrimp and red fish? This is one of those times where, frankly, Pandas is kind of disappointing because what you really want to do is something like df.loc[:3, ['shrimp', 'red fish']]
but this doesn’t work because loc
searches for index labels, so it doesn’t know how to interpret the positional slicing :3
. On the other hand df.iloc[:3, ['shrimp', 'red fish']]
doesn’t work because iloc
uses positional indexing so it doesn’t know how to interpret the list of column names ['shrimp', 'red fish']
.
Fortunately, there’s a not-so-terrible workaround for this. The trick is to convert the column names ‘shrimp’, and ‘red fish’ to their corresponding positional indices 0 and 2, and then just use iloc
as usual.
Essentially, we want to do df.iloc[:3, [0, 2]]
, but to allow for column names instead of column indices, we can replace [0, 2]
with [df.columns.get_loc(c) for c in ['shrimp', 'red fish']]
.
df.iloc[:3, [df.columns.get_loc(c) for c in ['shrimp', 'red fish']]]
## shrimp red fish
## a 10 2
## b 20 3
## c 30 5
Here, df.columns.get_loc()
returns the positional index of a column given its name. So we just combine that with a simple list comprehension to convert a list of column names to positional indices on the fly.
I should also mention that df.columns
returns the column index of the DataFrame.
df.columns
## Index(['shrimp', 'crab', 'red fish'], dtype='object')
and in this case it’s just like the row index since our row index is also made up of strings.
df.index
## Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
You can also apply boolean indexing to DataFrames just as you can with Series. For example, if you wanted to get the rows of df
where column shrimp is less than 40, you can start by building a boolean Series like
mask = df.shrimp < 40
If we print(mask)
, notice it has the same index as our DataFrame.
print(mask)
## a True
## b True
## c True
## d False
## e False
## f False
## Name: shrimp, dtype: bool
So then we can use this boolean series with df.loc
to pick out the rows where shrimp is less than 40.
df.loc[mask]
## shrimp crab red fish
## a 10 5 2
## b 20 10 3
## c 30 15 5
More commonly you’ll see this as a one liner like
df.loc[df.shrimp < 40]
## shrimp crab red fish
## a 10 5 2
## b 20 10 3
## c 30 15 5
And just like Series, you can combine logical conditions to create more intricate filters like, “rows where shrimp is less than 50 and crab is not divisible by 10”.
df.loc[(df.shrimp < 50) & ~(df.crab % 10 == 0)]
## shrimp crab red fish
## a 10 5 2
## c 30 15 5
Course Curriculum
- Introduction
1.1 Introduction - Series
2.1 Series Creation
2.2 Series Basic Indexing
2.3 Series Basic Operations
2.4 Series Boolean Indexing
2.5 Series Missing Values
2.6 Series Vectorization
2.7 Seriesapply()
2.8 Series View vs Copy
2.9 Challenge: Baby Names
2.10 Challenge: Bees Knees
2.11 Challenge: Car Shopping
2.12 Challenge: Price Gouging
2.13 Challenge: Fair Teams - DataFrame
3.1 DataFrame Creation
3.2 DataFrame To And From CSV
3.3 DataFrame Basic Indexing
3.4 DataFrame Basic Operations
3.5 DataFrameapply()
3.6 DataFrame View vs Copy
3.7 DataFramemerge()
3.8 DataFrame Aggregation
3.9 DataFramegroupby()
3.10 Challenge: Hobbies
3.11 Challenge: Party Time
3.12 Challenge: Vending Machines
3.13 Challenge: Cradle Robbers
3.14 Challenge: Pot Holes - Advanced
4.1 Strings
4.2 Dates And Times
4.3 Categoricals
4.4 MultiIndex
4.5 DataFrame Reshaping
4.6 Challenge: Class Transitions
4.7 Challenge: Rose Thorn
4.8 Challenge: Product Volumes
4.9 Challenge: Session Groups
4.10 Challenge: OB-GYM - Final Boss
5.1 Challenge: COVID Tracing
5.2 Challenge: Pickle
5.3 Challenge: TV Commercials
5.4 Challenge: Family IQ
5.5 Challenge: Concerts