Contents

Python Pandas For Your Grandpa - 3.7 DataFrame merge()

In this section, we’ll look at one of the most common and important operations regarding dataframes - merging them together.

To do this in pandas, the workhorse function is merge(). To motivate it, suppose we’re doing some analysis for a veterinary office and the office has two tables of data

  1. pets which has one row per pet - here the index represents each pet_id.
import numpy as np
import pandas as pd

pets = pd.DataFrame(
    data={
        'name': ['Mr. Snuggles', 'Honey Chew Chew', 'Professor', 'Chairman Meow', 'Neighbelline'],
        'type': ['cat', 'dog', 'dog', 'cat', 'horse']
    },
    index=[71, 42, 11, 98, 42]
)
print(pets)
##                name   type
## 71     Mr. Snuggles    cat
## 42  Honey Chew Chew    dog
## 11        Professor    dog
## 98    Chairman Meow    cat
## 42     Neighbelline  horse
  1. visits which has one row per visit - where the index represents each visit_id.
visits = pd.DataFrame(
    data={
        'pet_id': [42, 31, 71, 42, 98, 42],
        'date': ['2019-03-15', '2019-03-15', '2019-04-05', '2019-04-06', '2019-04-12', '2019-04-12']
    }
)
print(visits)
##    pet_id        date
## 0      42  2019-03-15
## 1      31  2019-03-15
## 2      71  2019-04-05
## 3      42  2019-04-06
## 4      98  2019-04-12
## 5      42  2019-04-12

merge() has four types of joins: left, right, outer, and inner, with the default being inner. The join type determines which keys show up in the result.

Let’s start with inner join. The result of an inner join only includes keys that exist in both the left and right tables, although a single key can appear multiple times in the result if it appears multiple times in one of the input tables.

This makes more sense when you see it, so let’s look at an inner join between pets and visits based on pet ids. We’ll designate

  • pets as the left table,
  • visits as the right table,
  • the join type as inner,
  • left_index=True because in the left table i.e. the pets table, pet ids are stored in the row index, and
  • right_on='pet_id' because in the right table i.e. the visits table pet ids are stored in a column called pet_id.

In other words, we want to join the row index in the left table to the pet_id column in the right table.

pd.merge(left=pets, right=visits, how='inner', left_index=True, right_on='pet_id')
##               name   type  pet_id        date
## 2     Mr. Snuggles    cat      71  2019-04-05
## 0  Honey Chew Chew    dog      42  2019-03-15
## 3  Honey Chew Chew    dog      42  2019-04-06
## 5  Honey Chew Chew    dog      42  2019-04-12
## 0     Neighbelline  horse      42  2019-03-15
## 3     Neighbelline  horse      42  2019-04-06
## 5     Neighbelline  horse      42  2019-04-12
## 4    Chairman Meow    cat      98  2019-04-12

Notice a couple things about the result.

  1. pet_id 11 is excluded because it only existed in the pets table and pet_id 31 is excluded because it only existed in the visits table.
  2. pet_id 42 occurs six times in result because it occurred three times in the visits table, each of which matched to two occurrences in the pets table.

When your DataFrame’s row index has some specific meaning (kind of like ours), it’s a good practice to give it a name. So let’s go ahead and rename the row index in pets to ‘pet_id’.

pets.index.rename('pet_id', inplace=True)
print(pets)
##                    name   type
## pet_id                        
## 71         Mr. Snuggles    cat
## 42      Honey Chew Chew    dog
## 11            Professor    dog
## 98        Chairman Meow    cat
## 42         Neighbelline  horse

and let’s rename the row index in visits to ‘visit_id’.

visits.index.rename('visit_id', inplace=True)
print(visits)
##           pet_id        date
## visit_id                    
## 0             42  2019-03-15
## 1             31  2019-03-15
## 2             71  2019-04-05
## 3             42  2019-04-06
## 4             98  2019-04-12
## 5             42  2019-04-12

A convenient benefit to naming the pet_id index is that we can rebuild the inner join from before using on='pet_id' instead of setting left_index = True and right_on = 'pet_id'.

pd.merge(left=pets, right=visits, how='inner', on='pet_id')
##    pet_id             name   type        date
## 0      71     Mr. Snuggles    cat  2019-04-05
## 1      42  Honey Chew Chew    dog  2019-03-15
## 2      42  Honey Chew Chew    dog  2019-04-06
## 3      42  Honey Chew Chew    dog  2019-04-12
## 4      42     Neighbelline  horse  2019-03-15
## 5      42     Neighbelline  horse  2019-04-06
## 6      42     Neighbelline  horse  2019-04-12
## 7      98    Chairman Meow    cat  2019-04-12

When you use the on keyword like this, Pandas searches both tables for a matching column name and/or row index to merge the tables with. Also notice that this result is slightly different than our previous version - specifically the resulting row index is different and the resulting column order is different. I’ll touch on that a bit later.

Now let’s look at a left join. The result of a left join retains all and only the keys in the left table. Let’s see it in action, again making pets the left table and visits the right table, and joining on pet_id.

pd.merge(left=pets, right=visits, how='left', on='pet_id')
##    pet_id             name   type        date
## 0      71     Mr. Snuggles    cat  2019-04-05
## 1      42  Honey Chew Chew    dog  2019-03-15
## 2      42  Honey Chew Chew    dog  2019-04-06
## 3      42  Honey Chew Chew    dog  2019-04-12
## 4      11        Professor    dog         NaN
## 5      98    Chairman Meow    cat  2019-04-12
## 6      42     Neighbelline  horse  2019-03-15
## 7      42     Neighbelline  horse  2019-04-06
## 8      42     Neighbelline  horse  2019-04-12

In this case, the left table, pets, includes pet_id 11 which doesn’t have a matching record in the right table, visits, but the result still includes it. And for unmatched records like this one, Pandas fills in the columns from the right table with NaN. Also note that pet_id 42 again expands into 6 resultant records.

A right join is exactly the same thing as a left join, just reversed.

pd.merge(left=pets, right=visits, how='right', on='pet_id')
##    pet_id             name   type        date
## 0      42  Honey Chew Chew    dog  2019-03-15
## 1      42     Neighbelline  horse  2019-03-15
## 2      31              NaN    NaN  2019-03-15
## 3      71     Mr. Snuggles    cat  2019-04-05
## 4      42  Honey Chew Chew    dog  2019-04-06
## 5      42     Neighbelline  horse  2019-04-06
## 6      98    Chairman Meow    cat  2019-04-12
## 7      42  Honey Chew Chew    dog  2019-04-12
## 8      42     Neighbelline  horse  2019-04-12

So, in this case, pet_id 11 is excluded but pet_id 31 is retained in the result.

The last type of join is an outer join which includes every key from both tables in the output.

pd.merge(left=pets, right=visits, how='outer', on='pet_id')
##    pet_id             name   type        date
## 0      71     Mr. Snuggles    cat  2019-04-05
## 1      42  Honey Chew Chew    dog  2019-03-15
## 2      42  Honey Chew Chew    dog  2019-04-06
## 3      42  Honey Chew Chew    dog  2019-04-12
## 4      42     Neighbelline  horse  2019-03-15
## 5      42     Neighbelline  horse  2019-04-06
## 6      42     Neighbelline  horse  2019-04-12
## 7      11        Professor    dog         NaN
## 8      98    Chairman Meow    cat  2019-04-12
## 9      31              NaN    NaN  2019-03-15

Somewhat disappointing is the fact that Pandas doesn’t have a native anti-join which basically answers the question, “What records from table A don’t match any records from table B?” But there’s a couple ways we can cook this up.

Suppose we want to see which records in the pets table don’t have a matching record in the visits table via pet_id. We can start by doing an outer join with the indicator=True.

outer = pd.merge(left=pets, right=visits, how='outer', on='pet_id', indicator=True)
print(outer)
##    pet_id             name   type        date      _merge
## 0      71     Mr. Snuggles    cat  2019-04-05        both
## 1      42  Honey Chew Chew    dog  2019-03-15        both
## 2      42  Honey Chew Chew    dog  2019-04-06        both
## 3      42  Honey Chew Chew    dog  2019-04-12        both
## 4      42     Neighbelline  horse  2019-03-15        both
## 5      42     Neighbelline  horse  2019-04-06        both
## 6      42     Neighbelline  horse  2019-04-12        both
## 7      11        Professor    dog         NaN   left_only
## 8      98    Chairman Meow    cat  2019-04-12        both
## 9      31              NaN    NaN  2019-03-15  right_only

That’ll produce an output where the column ‘_merge’ indicates if the resulting record came from the left table, right table or both.

From there it’s pretty obvious how to get the records in pets not in visits - just filter the result where _merge is “left_only”.

a_not_in_b = outer.loc[outer._merge == 'left_only']
print(a_not_in_b)
##    pet_id       name type date     _merge
## 7      11  Professor  dog  NaN  left_only

The nice thing about this is that it’s simple, the downside is that it’s extremely memory inefficient. We’re unnecessarily creating this big intermediate table before we filter down to the one row that we care about.

The second method is a little trickier but it’s more memory efficient. The basic idea is, for each pet_id in the pets table, we check if it exists in the visits table using the isin() method. Then we negate the result and use that to subset the pets table.

So if we do pets.index.isin(visits.pet_id) we get back a boolean array indicating whether each pet_id in the pets table also exists in the visits table.

pets.index.isin(visits.pet_id)
## array([ True,  True, False,  True,  True])

Then we can negate that array and use it as a boolen index to subset the pets table.

pets.loc[~pets.index.isin(visits.pet_id)]
##              name type
## pet_id                
## 11      Professor  dog

Since the visits table includes duplicate pet_ids, we can slightly improve this by using only its unique pet_ids.

pets.loc[~pets.index.isin(visits.pet_id.unique())]
##              name type
## pet_id                
## 11      Professor  dog

Before I close out this lecture I want to briefly mention a few things..

First off, if you want to join DataFrames using multiple columns, you can do that just by passing in a list of column names.

Secondly, if you join two tables A and B using A’s index and a column from B, the resulting DataFrame will inherit its index from B. If you use columns from both tables, the resulting DataFrame will create a brand new row index.

Lastly, let’s say you left join table B onto table A, meaning A is the left table and B is the right table. The rows in the result will be ordered by: first, when they ocurred in A, and second, when they occurred in B. The kicker is that, this behavior isn’t documented - so while it can be useful in some situations, you might not want to rely on it. I asked about this on github, and kind of frustratingly no one got back to me.

Also, this behavior doesn’t work for right joins, which some people have already classified as a bug and is supposed to be fixed soon.


Course Curriculum

  1. Introduction
    1.1 Introduction
  2. Series
    2.1 Series Creation
    2.2 Series Basic Indexing
    2.3 Series Basic Operations
    2.4 Series Boolean Indexing
    2.5 Series Missing Values
    2.6 Series Vectorization
    2.7 Series apply()
    2.8 Series View vs Copy
    2.9 Challenge: Baby Names
    2.10 Challenge: Bees Knees
    2.11 Challenge: Car Shopping
    2.12 Challenge: Price Gouging
    2.13 Challenge: Fair Teams
  3. DataFrame
    3.1 DataFrame Creation
    3.2 DataFrame To And From CSV
    3.3 DataFrame Basic Indexing
    3.4 DataFrame Basic Operations
    3.5 DataFrame apply()
    3.6 DataFrame View vs Copy
    3.7 DataFrame merge()
    3.8 DataFrame Aggregation
    3.9 DataFrame groupby()
    3.10 Challenge: Hobbies
    3.11 Challenge: Party Time
    3.12 Challenge: Vending Machines
    3.13 Challenge: Cradle Robbers
    3.14 Challenge: Pot Holes
  4. Advanced
    4.1 Strings
    4.2 Dates And Times
    4.3 Categoricals
    4.4 MultiIndex
    4.5 DataFrame Reshaping
    4.6 Challenge: Class Transitions
    4.7 Challenge: Rose Thorn
    4.8 Challenge: Product Volumes
    4.9 Challenge: Session Groups
    4.10 Challenge: OB-GYM
  5. Final Boss
    5.1 Challenge: COVID Tracing
    5.2 Challenge: Pickle
    5.3 Challenge: TV Commercials
    5.4 Challenge: Family IQ
    5.5 Challenge: Concerts

Additional Content

  1. Python NumPy For Your Grandma
  2. Neural Networks For Your Dog
  3. Introduction To Google Colab