March 18, 2020

1. Introduction
2. Series
2.1 Series Creation
2.2 Series Basic Operations
2.3 Series Basic Indexing
2.4 Series Overwriting Data
2.5 Series Apply
2.6 Series Concatenation
2.7 Series Boolean Indexing
2.8 Series View Vs Copy
2.9 Series Missing Values
2.10 Series Challenges

pandas is a vast library of data wrangling tools, but all those tools are centered around two fundamental data structures: Series and DataFrame. If you imagine a table of data, you can think of each column as a Series and the structured collection of every column as a DataFrame.

Now you might be wondering, what’s the difference between a NumPy 1d array and a pandas Series? After all, they both represent a 1-dimensional set of values, and in fact a pandas Series typically stores its data in a NumPy array. The difference lies in the additional functionality that a Series has. You can think of a Series as a souped-up version of a NumPy 1d array. You’ll see what I mean as we go through the course, but to get started, we need to learn how to make a Series from scratch.

# Import

Before we do anything, we need to import pandas. The common convention is to import pandas as pd.

import pandas as pd

# Series From A List

The easiest way to make a Series is from a list, just like making a NumPy array.

x = pd.Series([5, 10, 15, 20, 25])

If we print the Series, we get back something like this:

print(x)
## 0     5
## 1    10
## 2    15
## 3    20
## 4    25
## dtype: int64

Notice how it already looks a bit different from a NumPy array. See that first column of values? That’s called the Series index, and you can use it to access the Series’ elements in creative and meaningful ways. More on that a bit later.

Also notice how the output includes ‘dtype: int64’, which tells us the datatype of the elements in the Series. Just like a NumPy array, each element in the Series should be of the same type. You can have a Series of ints, a Series of floats, or a Series of strings, but you shouldn’t have a Series of ints, floats, and strings together. (Technically it’s possible, but it’s not recommended.)
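To make the single-type rule concrete, here is a small sketch of what pandas reports when the element types agree and when they don't. The variable names are only for illustration.

```python
import pandas as pd

ints = pd.Series([5, 10, 15])              # all ints
floats = pd.Series([5, 10.5, 15])          # ints are upcast to floats
mixed = pd.Series([5, 10.5, "fifteen"])    # no common numeric type

print(ints.dtype)    # an integer dtype (int64)
print(floats.dtype)  # float64
print(mixed.dtype)   # object
```

The mixed Series still works, but it falls back to the generic object dtype, which is why mixing types is discouraged.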

You can use Python’s type() function to check that x is indeed a Series object.

type(x)
## <class 'pandas.core.series.Series'>

And if you wanted to check the internal data type of the Series elements without printing the whole Series, you can use the .dtype attribute.

x.dtype
## dtype('int64')

If you wanted to access the underlying NumPy array, you can use the .to_numpy() method.

x.to_numpy()
## array([ 5, 10, 15, 20, 25])

You might also see people using the .values attribute to get the underlying NumPy array. It still works, but the pandas documentation recommends .to_numpy() instead.

# don't do this
x.values
## array([ 5, 10, 15, 20, 25])
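One practical reason to prefer .to_numpy() is that it accepts optional dtype and copy arguments. The sketch below shows both; the variable names are only for illustration.

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25])

arr = x.to_numpy()                        # may share memory with the Series
arr_float = x.to_numpy(dtype="float64")   # convert the dtype while extracting
arr_copy = x.to_numpy(copy=True)          # always an independent copy

arr_copy[0] = -1    # changing the copy leaves the Series untouched
print(x[0])         # 5
print(arr_float)    # [ 5. 10. 15. 20. 25.]
```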

# Series From A Dictionary

You can also make a Series from a Python dictionary.

data = {'a' : 0., 'b' : 1., 'c' : 2., 'e': 3.}
y = pd.Series(data)
print(y)
## a    0.0
## b    1.0
## c    2.0
## e    3.0
## dtype: float64

In this case, pandas uses the dictionary keys for the Series index and the dictionary values for the Series values. Again, we’ll cover the index and its purpose shortly. For now, just know it’s a thing.
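As a quick check of that mapping, this sketch pulls the index and the values back out of the Series separately and looks up one value by its key.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2., 'e': 3.}
y = pd.Series(data)

print(list(y.index))   # ['a', 'b', 'c', 'e']  (the dictionary keys)
print(y.to_numpy())    # [0. 1. 2. 3.]         (the dictionary values)
print(y['c'])          # 2.0
```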

# Series From A NumPy Array

Since pandas makes heavy use of NumPy, it’s often handy to import both packages. We’ll do that here, and for future lectures, you can assume I have both packages imported.

import numpy as np

So far, we’ve seen how you can make a Series from a list and a dictionary, but perhaps the most powerful way to make a Series from scratch is to make it from a NumPy array. Like this.

hw = np.array(['hello', 'world', 'hello', 'world'])
z = pd.Series(hw)
print(z)
## 0    hello
## 1    world
## 2    hello
## 3    world
## dtype: object

In this example, we build an array of strings and use that to initialize our Series. Notice how the output of print(z) shows dtype ‘object’. That seems awfully generic. Why wouldn’t it display something like ‘string’?

This is more of a NumPy thing than a pandas thing, but it’s an important concept, so let’s walk through it.

Arrays are stored as contiguous, fixed-size blocks of memory. For example, an array of 32-bit integers like [3,0,1] would internally be stored as three consecutive 32-bit binary blocks, one per element.

If you wanted to access the 3rd element in the array, you can essentially say “hey computer, give me the 3rd element of this array”, and then your computer starts at the beginning of the array and knows exactly how many bits to jump across to get to the 3rd element. In this case, your computer would know to jump across 64 bits (two elements of 32 bits each) to get to the 3rd element. Knowing exactly how many bits to jump across in order to access a requested element is what makes arrays fast for data access.
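The jump arithmetic can be sketched with NumPy's own bookkeeping. This is only an illustration: the variable names are made up, and the printed bit patterns are conceptual (they ignore the byte order your machine actually uses).

```python
import numpy as np

arr = np.array([3, 0, 1], dtype=np.int32)

bits_per_element = arr.itemsize * 8    # 4 bytes per element -> 32 bits
position = 2                           # the 3rd element has zero-based index 2
offset_bits = position * bits_per_element

print(bits_per_element)   # 32
print(arr.strides)        # (4,)  bytes to step from one element to the next
print(offset_bits)        # 64

# Conceptual 32-bit pattern for each element
for value in arr:
    print(format(int(value), '032b'))
```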

Unlike integers, strings are objects that vary in size. For example, consider the strings ['hello', 'i', 'am', 'a', 'banana']. At 8 bits per character, the string “hello” is 40 bits, “i” is 8 bits, “am” is 16 bits, “a” is 8 bits, and “banana” is 48 bits. So, we can’t represent these strings with contiguous, fixed-size memory blocks because the strings are different sizes. And if you just stored them back to back in contiguous memory, things would start to go slow as soon as you asked for elements, because your computer wouldn’t know how far to jump from the beginning of the array to get to some requested element. Not to mention a host of other problems, like what happens if we try to swap two elements with different sizes?

To get around this complexity, the strings themselves are stored elsewhere in memory and, instead of storing the strings, NumPy stores an array of fixed-size pointers to the strings. Now, pointers are just integers that specify some location in your computer’s memory, but we’re already way beyond what’s important for this course. The main takeaway here is that, when you see pandas report a datatype as object and you think to yourself that’s not very helpful, it’s because the underlying array is storing pointers, and technically those pointers could be pointing to anything, including a mixture of different types of objects.
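Here is a rough sketch of the pointer idea using the banana example. Two caveats: left to its own defaults, NumPy stores a string array as fixed-width Unicode rather than pointers, so the sketch asks for dtype=object explicitly, both for the array and for the Series, to avoid depending on version-specific type inference. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

words = ['hello', 'i', 'am', 'a', 'banana']

fixed = np.array(words)                 # NumPy default: fixed-width Unicode
boxed = np.array(words, dtype=object)   # array of pointers to Python strings

print(fixed.dtype)      # <U6 on typical machines: every slot fits 6 characters
print(boxed.dtype)      # object
print(boxed.itemsize)   # size of one pointer in bytes (8 on typical 64-bit builds)

s = pd.Series(words, dtype=object)      # explicitly request object storage
print(s.dtype)          # object
```

Every slot in the object array is the same size no matter how long the string it points to is, which is what keeps element lookups fast.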

So far, we’ve seen how to make a Series from a list, a dictionary, and a NumPy array. You might be wondering, how do I whip up a more complex Series from scratch, like a sequence of integers or a random sample of values from a normal distribution? The somewhat lame, but practical answer is to use NumPy. NumPy has lots of great tools for making arrays from scratch, and we’ve already seen how to wrap them into a Series. So, for example, we could make the sequential Series [10, 20, 30, 40, 50] by calling the np.arange() function and then wrapping the resulting NumPy array with pd.Series().

sequence = pd.Series(np.arange(start=10, stop=60, step=10))
print(sequence)
## 0    10
## 1    20
## 2    30
## 3    40
## 4    50
## dtype: int64

Similarly, we could make a Series of standard normal random values using NumPy’s np.random.normal() function. Your values will differ from the ones shown here, since they’re random.

randnorm = pd.Series(np.random.normal(size = 5))
print(randnorm)
## 0    0.502734
## 1   -0.702550
## 2    0.567036
## 3   -0.104345
## 4    0.399541
## dtype: float64
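If you want the same random values on every run, one option is NumPy's seeded generator, np.random.default_rng(). This is a sketch of an alternative to the np.random.normal() call above, not a replacement for it, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)          # arbitrary seed
randnorm = pd.Series(rng.normal(size=5))

rng_again = np.random.default_rng(seed=42)    # same seed, fresh generator
same_values = pd.Series(rng_again.normal(size=5))

print(randnorm.equals(same_values))   # True
```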

So, in case you can’t see where this is going: one of the keys to mastering pandas is actually to master NumPy, since most of pandas is built on top of it.