Python Pandas For Your Grandpa - 2.8 Series View vs Copy

In this section, we’ll see when Series operations create a copy and when they create a view. Suppose you have this series, x.

import numpy as np
import pandas as pd

x = pd.Series(
    data=[2, 3, 5, 7, 11],
    index=[2, 11, 12, 30, 30]
)
print(x)
## 2      2
## 11     3
## 12     5
## 30     7
## 30    11
## dtype: int64

Then you set a new variable, y, equal to x.

y = x

Next, you change the first element of y to 99.

y.iloc[0] = 99

Obviously this modifies y, but you might be surprised to see it also modifies x.

print(x)
## 2     99
## 11     3
## 12     5
## 30     7
## 30    11
## dtype: int64

This happens because when we set y equal to x, no data was copied. Python simply made y a second name for the same Series object, so y points to the same block of data stored by x. This is sometimes called “assignment by reference”, and some people loosely call y a “view” of x.
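A quick way to confirm that y and x are the same object is Python's identity operator. This is a minimal sketch that rebuilds the Series from above; the variable name same_object is only for illustration.

```python
import pandas as pd

x = pd.Series([2, 3, 5, 7, 11], index=[2, 11, 12, 30, 30])
y = x  # plain assignment: no data is copied

# `is` tests object identity, not equality of values
same_object = y is x
print(same_object)  # True
```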

To avoid this, we can explicitly set y equal to a copy of x using something like

y = x.copy()

Now if we change y, x is unchanged because y points to a completely separate block of data.

y.iloc[0] = 123
print(x)
## 2     99
## 11     3
## 12     5
## 30     7
## 30    11
## dtype: int64
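If you want to check this without modifying anything, one option is to ask NumPy whether two Series are backed by the same memory. The sketch below assumes a plain integer Series, where to_numpy() exposes the underlying buffer rather than a converted copy; the names alias and duplicate are made up for this example.

```python
import numpy as np
import pandas as pd

x = pd.Series([2, 3, 5, 7, 11], index=[2, 11, 12, 30, 30])
alias = x              # second name for the same Series
duplicate = x.copy()   # independent Series with its own data

# np.shares_memory reports whether two arrays overlap in memory
alias_shares = np.shares_memory(x.to_numpy(), alias.to_numpy())
copy_shares = np.shares_memory(x.to_numpy(), duplicate.to_numpy())
print(alias_shares)  # True
print(copy_shares)   # False
```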

One of the reasons this is so confusing is that data is only shared under some circumstances, which aren’t clearly documented and aren’t always obvious. For example, if we have the Series,

foo = pd.Series(['a', 'b', 'c', 'd'], dtype='string')
print(foo)
## 0    a
## 1    b
## 2    c
## 3    d
## dtype: string

and we set bar = foo.loc[foo <= 'b']

bar = foo.loc[foo <= 'b']
print(bar)
## 0    a
## 1    b
## dtype: string

Then we modify bar, setting the 1st element equal to ‘z’.

bar.iloc[0] = 'z'
print(bar)
## 0    z
## 1    b
## dtype: string

foo doesn’t get changed, which means that under the hood, Pandas copied the data in foo to create bar.

print(foo)
## 0    a
## 1    b
## 2    c
## 3    d
## dtype: string

Now, if we set baz = foo.iloc[:2], which is the same exact subset we used when we built bar, except here we use slicing

baz = foo.iloc[:2]
print(baz)
## 0    a
## 1    b
## dtype: string

and then, just like with bar, we set the first element of baz equal to ‘z’.

baz.iloc[0] = 'z'

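This time, in addition to baz changing, foo also gets changed. (This is the behavior without Copy-on-Write. With Copy-on-Write enabled, which the Pandas documentation describes as the default from version 3.0, foo is left alone.)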

print(foo)
## 0    z
## 1    b
## 2    c
## 3    d
## dtype: string
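One way to see the difference between the two selections is to compare their memory with the parent's. This sketch uses a hypothetical integer Series, nums, instead of foo, on the assumption that to_numpy() on a plain numeric Series exposes the underlying buffer (a string Series may be converted on the way out, which would hide the sharing). Note that shared memory only tells you a copy has not been made yet; with Copy-on-Write enabled, Pandas would make the copy at the moment you write.

```python
import numpy as np
import pandas as pd

nums = pd.Series([10, 20, 30, 40])

by_mask = nums.loc[nums <= 20]   # boolean indexing, like bar
by_slice = nums.iloc[:2]         # positional slice, like baz

# Both selections hold the same values...
print(by_mask.tolist())   # [10, 20]
print(by_slice.tolist())  # [10, 20]

# ...but only the slice is backed by the parent's memory
mask_shares = np.shares_memory(nums.to_numpy(), by_mask.to_numpy())
slice_shares = np.shares_memory(nums.to_numpy(), by_slice.to_numpy())
print(mask_shares)   # False
print(slice_shares)  # True
```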

As far as I can tell, when it comes to Series, the behavior mostly follows NumPy: selecting with a boolean mask or a list of labels returns a copy, while selecting with a slice returns a view. This isn’t clearly documented, the rules change when we start using DataFrames, and they change again when Copy-on-Write is enabled. So I don’t recommend memorizing any hard and fast rules. Instead, you kind of just have to play around with things. Use .copy() to be safe, and just be aware that this quirky behavior exists. I know it sounds weird, but this is the kind of thing you get a feel for over time.
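Here is a small sketch of the "use .copy() to be safe" advice applied to the slicing example. It rebuilds foo with the default dtype, and the result should not depend on whether the slice alone would have been a view.

```python
import pandas as pd

foo = pd.Series(['a', 'b', 'c', 'd'])

# Copy the slice explicitly so later edits cannot reach foo
baz = foo.iloc[:2].copy()
baz.iloc[0] = 'z'

print(baz.tolist())  # ['z', 'b']
print(foo.tolist())  # ['a', 'b', 'c', 'd']
```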

It’s also important to know whether Pandas is copying data when you call pretty much any method that modifies a Series. For example, every Series has a method called replace(), which lets you replace values with other values. Suppose you have a Series of strings like this one, called zoo.

zoo = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
print(zoo)
## 0    tiger
## 1     lion
## 2    zebra
## 3     lion
## dtype: object

If you want to replace every instance of ‘lion’ with ‘hamster’ and every instance of ‘tiger’ with ‘bunny’, you could do

zoo.replace({'lion':'hamster', 'tiger':'bunny'})
## 0      bunny
## 1    hamster
## 2      zebra
## 3    hamster
## dtype: object

The result of this method is a copy of zoo with the replaced values. So we’re not actually modifying zoo, we’re just building a brand new Series from it.
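To see that replace() hands back a new object, you can capture its result and compare it with the original. The name replaced is just for illustration.

```python
import pandas as pd

zoo = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
replaced = zoo.replace({'lion': 'hamster', 'tiger': 'bunny'})

print(replaced is zoo)    # False
print(zoo.tolist())       # ['tiger', 'lion', 'zebra', 'lion']
print(replaced.tolist())  # ['bunny', 'hamster', 'zebra', 'hamster']
```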

If you wanted to update zoo with these replacements, you could just overwrite the variable like zoo = zoo.replace({'lion':'hamster', 'tiger':'bunny'}). This works, but internally Pandas creates a whole new Series, reassigns zoo to it, and then discards the old Series. As an alternative, lots of Pandas functions have a parameter called ‘inplace’ which, when True, tells Pandas to modify the object you’re operating on rather than return a modified copy. Keep in mind that inplace=True doesn’t always avoid a copy internally, so treat it as a convenience rather than a guaranteed performance win.

So, if we wanted our replacements to stick, we could call

zoo.replace({'lion':'hamster', 'tiger':'bunny'}, inplace=True)
print(zoo)
## 0      bunny
## 1    hamster
## 2      zebra
## 3    hamster
## dtype: object

and now the data inside zoo actually gets updated with our replacements.
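For comparison, here is a sketch that applies the same replacements both ways, by reassignment and with inplace=True, and checks that the two results match. The names mapping, a, and b are made up for this example.

```python
import pandas as pd

mapping = {'lion': 'hamster', 'tiger': 'bunny'}

# Option 1: build a new Series and rebind the name to it
a = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
a = a.replace(mapping)

# Option 2: ask Pandas to update the existing Series
b = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
b.replace(mapping, inplace=True)

print(a.equals(b))  # True
```

Either way you end up with the same values, so the choice is mostly about style and whether other variables still refer to the original object.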

