Python Pandas For Your Grandpa - 4.1 Strings

Earlier in the course I said a Pandas Series is like a souped up version of a NumPy 1-D array. Perhaps the best example of this is when you’re dealing with a Series of strings. In this video, we’ll look at Pandas methods for processing a Series of strings.

Let’s start by building a Series of strings that we’ll call cajun.

import numpy as np
import pandas as pd

cajun = pd.Series(['gumbo', 'crawfish boil', 'Mardi Gras', 'pirogue', pd.NA, 'Zatarains'], dtype='string')
print(cajun)
## 0            gumbo
## 1    crawfish boil
## 2       Mardi Gras
## 3          pirogue
## 4             <NA>
## 5        Zatarains
## dtype: string

Keep in mind, we’re setting dtype='string' here to tell Pandas we want to use the StringDtype extension type, which represents missing values as pd.NA and requires every element to be a string.
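To make the difference concrete, here's a minimal sketch comparing an object Series with a 'string' Series built from the same data. The names words, as_object, and as_string are just illustrative, and the exact text printed for each dtype may vary between Pandas versions.

```python
import pandas as pd

words = ['gumbo', None, 'pirogue']

# Explicit object dtype: each element is an arbitrary Python object.
as_object = pd.Series(words, dtype=object)

# Explicit string dtype: each element is a string or pd.NA.
as_string = pd.Series(words, dtype='string')

print(as_object.dtype)
print(as_string.dtype)

# The None we passed in is stored as pd.NA in the string Series.
print(as_string.isna().tolist())
```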

The .str accessor of a Series of strings gives us back a StringMethods object, from which we can call each of the string processing methods.

cajun.str
## <pandas.core.strings.accessor.StringMethods object at 0x7ffa58ab9520>
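Note that the .str accessor is only available when the Series holds strings. As a quick sketch (the numbers Series here is made up for illustration), trying to use it on a Series of integers should raise an AttributeError.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])

try:
    # .str is not defined for a Series of integers
    numbers.str.upper()
    raised = False
except AttributeError as err:
    raised = True
    print(err)
```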

These methods are pretty simple and self-explanatory, so in this case the challenge is more about knowing what methods exist rather than how they work. So in this lecture, we’re just gonna speed through a bunch of string methods and if something piques your interest, you can pause the video and do some more research on your own. Without further ado…

We can use upper() to get back a Series of upper case strings

cajun.str.upper()
## 0            GUMBO
## 1    CRAWFISH BOIL
## 2       MARDI GRAS
## 3          PIROGUE
## 4             <NA>
## 5        ZATARAINS
## dtype: string

and lower() to get back a Series of lower case strings.

cajun.str.lower()
## 0            gumbo
## 1    crawfish boil
## 2       mardi gras
## 3          pirogue
## 4             <NA>
## 5        zatarains
## dtype: string

We can use len() to get the number of characters in each string.

cajun.str.len()
## 0       5
## 1      13
## 2      10
## 3       7
## 4    <NA>
## 5       9
## dtype: Int64

We can use split() to split each string along some specified delimiter and put the resulting substrings in a list.

cajun.str.split(' ')
## 0             [gumbo]
## 1    [crawfish, boil]
## 2       [Mardi, Gras]
## 3           [pirogue]
## 4                <NA>
## 5         [Zatarains]
## dtype: object
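If you'd rather have the substrings in separate columns than in lists, split() also accepts expand=True. Here's a small sketch on a shortened version of cajun; the variable name parts is just for illustration.

```python
import pandas as pd

cajun = pd.Series(['gumbo', 'crawfish boil', 'Mardi Gras', pd.NA], dtype='string')

# expand=True returns a DataFrame with one column per substring
# instead of a Series of lists. Shorter rows are padded with <NA>.
parts = cajun.str.split(' ', expand=True)
print(parts)
```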

Chaining that with .str.get() lets us pick out the ith substring in each list. For example,

cajun.str.split(' ').str.get(0)
## 0        gumbo
## 1     crawfish
## 2        Mardi
## 3      pirogue
## 4         <NA>
## 5    Zatarains
## dtype: object

We can use replace() to replace part of a string with another string. Here we replace all spaces with dashes.

cajun.str.replace(pat=' ', repl='-', regex=False)
## 0            gumbo
## 1    crawfish-boil
## 2       Mardi-Gras
## 3          pirogue
## 4             <NA>
## 5        Zatarains
## dtype: string

In Pandas versions before 2.0, replace() assumes by default that you’re passing in a regular expression; from 2.0 onward the default is regex=False. Either way, setting regex explicitly makes your intent clear. If you don’t know what a regular expression is, it’s basically a widely used syntax that lets you do advanced string matching. It’s well worth learning but it’s beyond the scope of this course.
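To see why the regex flag matters, here's a minimal sketch using a made-up Series s. With regex=False the pattern is matched literally, and with regex=True it's interpreted as a regular expression, so a character class like [.+] matches either a dot or a plus sign.

```python
import pandas as pd

s = pd.Series(['a.b.c', 'x+y'], dtype='string')

# Literal match: only actual '.' characters are replaced
literal = s.str.replace('.', '-', regex=False)
print(literal.tolist())
## ['a-b-c', 'x+y']

# Regular expression: the class [.+] matches '.' or '+'
pattern = s.str.replace(r'[.+]', '-', regex=True)
print(pattern.tolist())
## ['a-b-c', 'x-y']
```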

We can use cat() to concatenate the strings together using some specified separator.

cajun.str.cat(sep='_')
## 'gumbo_crawfish boil_Mardi Gras_pirogue_Zatarains'

Or we can use cat() to concatenate a Series of strings with another, same-sized Series or list of strings.

cajun.str.cat(['1', '2', '3', '4', '5', '6'], sep=' ')
## 0            gumbo 1
## 1    crawfish boil 2
## 2       Mardi Gras 3
## 3          pirogue 4
## 4               <NA>
## 5        Zatarains 6
## dtype: string
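Notice how the missing value gets dropped in the first case and stays missing in the second. If you'd rather keep a placeholder, cat() has a na_rep parameter. Here's a quick sketch on a shortened Series; the '?' placeholder is an arbitrary choice.

```python
import pandas as pd

s = pd.Series(['gumbo', pd.NA, 'pirogue'], dtype='string')

# Without others: missing values are replaced by na_rep before joining
joined = s.str.cat(sep='_', na_rep='?')
print(joined)
## gumbo_?_pirogue

# With others: the missing element is replaced by na_rep, then concatenated
numbered = s.str.cat(['1', '2', '3'], sep=' ', na_rep='?')
print(numbered.tolist())
## ['gumbo 1', '? 2', 'pirogue 3']
```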

You can use string indexing to pick out a certain character or group of characters by position. For example, here we pick out the first character in each string.

cajun.str[0]
## 0       g
## 1       c
## 2       M
## 3       p
## 4    <NA>
## 5       Z
## dtype: string

Here we pick out characters up to (but excluding) the 3rd character.

cajun.str[:2]
## 0      gu
## 1      cr
## 2      Ma
## 3      pi
## 4    <NA>
## 5      Za
## dtype: string
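Slicing with .str[] follows the usual Python slice rules, and there's also an explicit slice() method that does the same job. A small sketch, using a shortened Series and illustrative variable names:

```python
import pandas as pd

cajun = pd.Series(['gumbo', 'pirogue', pd.NA], dtype='string')

# These two are equivalent ways to take the first two characters
first_two = cajun.str.slice(0, 2)
same = cajun.str[:2]

# A step works too: here we take every other character
every_other = cajun.str[::2]
print(every_other.tolist()[:2])
## ['gmo', 'prge']
```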

And here we pick out the last character in each string.

cajun.str[-1]
## 0       o
## 1       l
## 2       s
## 3       e
## 4    <NA>
## 5       s
## dtype: string

We can use startswith() to check if each string starts with the letter “p”

cajun.str.startswith("p")
## 0    False
## 1    False
## 2    False
## 3     True
## 4     <NA>
## 5    False
## dtype: boolean

or endswith() to check if each string ends with the letter “s”.

cajun.str.endswith("s")
## 0    False
## 1    False
## 2     True
## 3    False
## 4     <NA>
## 5     True
## dtype: boolean

We can use contains() to check whether each string contains some other string or regular expression.

cajun.str.contains('bo', regex=False)
## 0     True
## 1     True
## 2    False
## 3    False
## 4     <NA>
## 5    False
## dtype: boolean
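A common use for these boolean results is filtering. As a sketch, passing na=False tells contains() to treat missing values as non-matches, which gives us a clean mask for boolean indexing. The names mask and matches are just for illustration.

```python
import pandas as pd

cajun = pd.Series(['gumbo', 'crawfish boil', 'Mardi Gras', 'pirogue', pd.NA, 'Zatarains'], dtype='string')

# na=False fills the result with False wherever the string is missing
mask = cajun.str.contains('bo', regex=False, na=False)

# Keep only the strings that contain 'bo'
matches = cajun[mask]
print(matches.tolist())
## ['gumbo', 'crawfish boil']
```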

We can use extract() to pull out the first matching substring using a regular expression with at least one capture group. For example, here we extract the first word that starts with a capital letter.

cajun.str.extract(r'(\b[A-Z][a-z]+\b)')
##            0
## 0       <NA>
## 1       <NA>
## 2      Mardi
## 3       <NA>
## 4       <NA>
## 5  Zatarains

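Or we can use extractall() to do the same thing, except it returns every matching substring. The output’s in a slightly different format too: strings with no match are dropped, and the row index gets a second level, match, that numbers the matches within each string.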

cajun.str.extractall(r'(\b[A-Z][a-z]+\b)')
##                  0
##   match           
## 2 0          Mardi
##   1           Gras
## 5 0      Zatarains
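If your pattern has more than one capture group, you get one column per group, and named groups become the column names. Here's a minimal sketch; the group names first and second are made up for this example.

```python
import pandas as pd

cajun = pd.Series(['gumbo', 'crawfish boil', 'Mardi Gras', pd.NA], dtype='string')

# Two named capture groups separated by a space.
# Strings without two space-separated words get <NA> in both columns.
words = cajun.str.extract(r'(?P<first>\w+) (?P<second>\w+)')
print(words)
```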

If you don’t know regular expressions, this is probably pretty cryptic, but if you do know them, hopefully you see this is incredibly useful.

Also, if you want to add a prefix or suffix to each element in the Series, you can do that simply by adding a string prefix or string suffix directly to the Series. For example

'i like ' + cajun + ' a lot'
## 0            i like gumbo a lot
## 1    i like crawfish boil a lot
## 2       i like Mardi Gras a lot
## 3          i like pirogue a lot
## 4                          <NA>
## 5        i like Zatarains a lot
## dtype: string
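Notice that the missing value stays missing after the concatenation. If you'd prefer a default there, one option is to fill it in first with fillna(). A quick sketch, where 'something' is an arbitrary placeholder:

```python
import pandas as pd

cajun = pd.Series(['gumbo', pd.NA, 'pirogue'], dtype='string')

# Replace missing values before adding the prefix and suffix
sentences = 'i like ' + cajun.fillna('something') + ' a lot'
print(sentences.tolist())
## ['i like gumbo a lot', 'i like something a lot', 'i like pirogue a lot']
```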
