Python Pandas For Your Grandpa - 2.8 Series View vs Copy
In this section, we’ll see when Series operations create a copy and when they create a view. Suppose you have this Series, x.
import numpy as np
import pandas as pd
x = pd.Series(
    data=[2, 3, 5, 7, 11],
    index=[2, 11, 12, 30, 30]
)
print(x)
## 2 2
## 11 3
## 12 5
## 30 7
## 30 11
## dtype: int64
and then you set a new variable y equal to x.
y = x
Then you modify the first element of y to be 99.
y.iloc[0] = 99
Obviously this modifies y, but you might be surprised to see it also modifies x.
print(x)
## 2 99
## 11 3
## 12 5
## 30 7
## 30 11
## dtype: int64
This happens because when we set y equal to x, Pandas didn’t make a copy of x; it merely made y a reference to x. In other words, the variables y and x are two names for the same Series object, so they point to the same block of data. This is sometimes called “assignment by reference,” and some people would loosely call y a “view” of x.
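One quick way to confirm that y and x really are the same object is Python's built-in `is` operator. This is a minimal sketch that rebuilds x from the example above rather than continuing from it:

```python
import pandas as pd

# Rebuild the Series from the example above
x = pd.Series(data=[2, 3, 5, 7, 11], index=[2, 11, 12, 30, 30])

# Plain assignment binds a second name to the same Series object
y = x
print(y is x)  # True

# A change made through y is therefore visible through x
y.iloc[0] = 99
print(x.iloc[0])  # 99
```

Since `y is x` is True, there is only one Series here with two names attached to it.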
To avoid this, we can explicitly set y equal to a copy of x using something like
y = x.copy()
Now if we change y, x is unchanged because y points to a completely separate block of data.
y.iloc[0] = 123
print(x)
## 2 99
## 11 3
## 12 5
## 30 7
## 30 11
## dtype: int64
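The same `is` check can be used to see that .copy() produces a separate object. This sketch starts from a fresh x, so its first element is 2 rather than the 99 shown in the output above:

```python
import pandas as pd

# Fresh copy of the example Series (first element is 2 here)
x = pd.Series(data=[2, 3, 5, 7, 11], index=[2, 11, 12, 30, 30])

# .copy() builds a new Series with its own data
y = x.copy()
print(y is x)  # False

# Changing y leaves x alone
y.iloc[0] = 123
print(x.iloc[0])  # 2
print(y.iloc[0])  # 123
```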
One reason this is so confusing is that assignment by reference only happens under some circumstances, which aren’t clearly documented and aren’t always obvious. For example, if we have the Series,
foo = pd.Series(['a', 'b', 'c', 'd'], dtype='string')
print(foo)
## 0 a
## 1 b
## 2 c
## 3 d
## dtype: string
and we set bar = foo.loc[foo <= 'b']
bar = foo.loc[foo <= 'b']
print(bar)
## 0 a
## 1 b
## dtype: string
Then we modify bar, setting the first element equal to ‘z’.
bar.iloc[0] = 'z'
print(bar)
## 0 z
## 1 b
## dtype: string
foo doesn’t get changed, which means that under the hood, Pandas copied the data in foo to create bar.
print(foo)
## 0 a
## 1 b
## 2 c
## 3 d
## dtype: string
Now, if we set baz = foo.iloc[:2], which is the exact same subset we used when we built bar, except here we use slicing
baz = foo.iloc[:2]
print(baz)
## 0 a
## 1 b
## dtype: string
and then, just like with bar, we set the first element of baz equal to ‘z’.
baz.iloc[0] = 'z'
This time, in addition to baz changing, foo also gets changed.
print(foo)
## 0 z
## 1 b
## 2 c
## 3 d
## dtype: string
As far as I can tell, when it comes to Series, selecting with a boolean mask (as in A = B.loc[mask]) returns a copy, while slicing returns a view. But this isn’t clearly documented, it can differ between Pandas versions (the Pandas documentation describes a Copy-on-Write mode, slated to become the default in Pandas 3.0, under which modifying baz would not change foo), and the rules change when we start using DataFrames. So I don’t recommend memorizing any hard and fast rules. Instead, play around with things, use .copy() to be safe, and be aware that this quirky behavior exists. I know it sounds weird, but this is the kind of thing you get a feel for over time.
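If you'd rather check than guess, one option is to ask NumPy whether two Series are backed by the same memory. The sketch below is only an illustration: nums, by_mask, and by_slice are made-up names, and it uses an integer Series (not the string Series above) on the assumption that to_numpy() exposes the underlying buffer without copying for plain numeric data.

```python
import numpy as np
import pandas as pd

# Hypothetical integer Series, used so to_numpy() can expose the underlying buffer
nums = pd.Series([10, 20, 30, 40])

by_mask = nums.loc[nums <= 20]  # boolean indexing
by_slice = nums.iloc[:2]        # positional slicing

# np.shares_memory reports whether two arrays overlap in memory
mask_shares = np.shares_memory(nums.to_numpy(), by_mask.to_numpy())
slice_shares = np.shares_memory(nums.to_numpy(), by_slice.to_numpy())

print(mask_shares)
print(slice_shares)
```

Here I'd expect the boolean-mask result to report no shared memory and the slice to report shared memory, which matches the bar and baz behavior above. Shared memory alone doesn't tell you whether a later write will propagate on every Pandas version, so .copy() is still the safe choice.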
Another situation where it’s important to understand whether Pandas is copying data is pretty much any Pandas function that modifies a Series. For example, every Series has a method called replace() which lets you replace values with other values. Suppose you have a Series of strings like this one, called zoo
zoo = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
print(zoo)
## 0 tiger
## 1 lion
## 2 zebra
## 3 lion
## dtype: object
If you want to replace every instance of ‘lion’ with ‘hamster’ and every instance of ‘tiger’ with ‘bunny’, you could do
zoo.replace({'lion':'hamster', 'tiger':'bunny'})
## 0 bunny
## 1 hamster
## 2 zebra
## 3 hamster
## dtype: object
The result of this method is a copy of zoo with the replaced values. So we’re not actually modifying zoo; we’re just building a brand new Series from it.
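To see that the original really is left alone, you can store the result under a different name and compare the two. A minimal sketch, where swapped is just an illustrative variable name:

```python
import pandas as pd

zoo = pd.Series(['tiger', 'lion', 'zebra', 'lion'])

# Without inplace, replace() returns a new Series
swapped = zoo.replace({'lion': 'hamster', 'tiger': 'bunny'})

print(swapped is zoo)    # False
print(zoo.tolist())      # ['tiger', 'lion', 'zebra', 'lion']
print(swapped.tolist())  # ['bunny', 'hamster', 'zebra', 'hamster']
```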
If you wanted to update zoo with these replacements, you could just overwrite the variable like zoo = zoo.replace({'lion':'hamster', 'tiger':'bunny'}), which would work, but it can be inefficient since Pandas creates a whole new Series, reassigns zoo to it, and then discards the old Series. To circumvent this, lots of Pandas functions have a parameter called inplace which, when True, tells Pandas to modify the object you’re operating on rather than return a modified copy of the data.
So, if we wanted our replacements to stick, we could call
zoo.replace({'lion':'hamster', 'tiger':'bunny'}, inplace=True)
print(zoo)
## 0 bunny
## 1 hamster
## 2 zebra
## 3 hamster
## dtype: object
and now the data inside zoo actually gets updated with our replacements.
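The difference between overwriting the variable and using inplace=True shows up most clearly when a second name refers to the same Series. The sketch below ties this back to the y = x example from the start of the section. The names mapping, alias, zoo2, and alias2 are made up for illustration.

```python
import pandas as pd

mapping = {'lion': 'hamster', 'tiger': 'bunny'}

# Case 1: reassignment rebinds the name zoo; alias still refers to the old Series
zoo = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
alias = zoo
zoo = zoo.replace(mapping)
print(alias.tolist())  # ['tiger', 'lion', 'zebra', 'lion']

# Case 2: inplace=True updates the existing object, so alias2 sees the change
zoo2 = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
alias2 = zoo2
zoo2.replace(mapping, inplace=True)
print(alias2.tolist())  # ['bunny', 'hamster', 'zebra', 'hamster']
```

So the choice between the two styles is not only about efficiency. It also decides whether other references to the same Series see the update.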