Python Pandas For Your Grandpa - 2.1 Series Creation

Pandas is a vast library of data wrangling tools, but all those tools are centered around two fundamental data structures: Series and DataFrame. If you imagine a table of data, you can think of each column as a Series and the structured collection of every column as a DataFrame.

If you know NumPy, you might be wondering, what’s the difference between a Series and a NumPy 1-d array? After all, they both represent a 1-dimensional set of values. And in fact, a Pandas Series typically stores its data in a NumPy 1-d array.

The difference lies in the additional functionality that Series has. You can think of a Series as a souped up version of a NumPy 1-d array. You’ll see what I mean as we go through the course, but to get started, we need to learn how to make a Series from scratch, so that’ll be the focus of this section.

Before we do anything, we need to install and import pandas. The easiest way to install it is with pip, by running pip install pandas. If you’re using Google Colab like me, you would run !pip install pandas, except pandas is already installed there by default, so you don’t even need to do that.

Once it’s installed we need to import it. The common convention is to

import pandas as pd

The easiest way to make a Series is from a list, just like making a NumPy array.

x = pd.Series([5, 10, 15, 20, 25, 30, 35])

If we print the Series, we get back something like this

## 0     5
## 1    10
## 2    15
## 3    20
## 4    25
## 5    30
## 6    35
## dtype: int64

Notice how it already looks a bit different from a NumPy array. See that first column of values? That’s called the Series index and you can use it to access the Series elements in creative and meaningful ways. More on that a bit later.

Also notice the output includes ‘dtype: int64’, which tells us the data type of the elements in the Series. Just like a NumPy array, a Series has a single dtype shared by all of its elements. You can have a Series of ints, a Series of floats, or a Series of strings, but if you mix ints, floats, and strings together, Pandas won’t keep them as separate types; it falls back to the generic object dtype.
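One thing worth seeing for yourself: Pandas won’t raise an error if you hand it mixed types. It picks one dtype that can hold everything. Here’s a minimal sketch of that behavior (the variable names and values are made up for illustration).

```python
import pandas as pd

# Ints mixed with a float: everything is upcast to float
ints_and_floats = pd.Series([1, 2.5, 3])

# Numbers mixed with a string: no common numeric type exists,
# so pandas falls back to the generic object dtype
mixed = pd.Series([1, 2.5, 'three'])

print(ints_and_floats.dtype)  # float64
print(mixed.dtype)            # object
```

Either way, the whole Series ends up with exactly one dtype.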

You can use Python’s type() function to check that x is indeed a Series object.

type(x)
## <class 'pandas.core.series.Series'>

But if you want to check the data type of the Series elements without printing the whole Series, you can use the .dtype attribute, like

x.dtype
## dtype('int64')

If you want to access the underlying NumPy array, you can use the .to_numpy() method.

x.to_numpy()
## array([ 5, 10, 15, 20, 25, 30, 35])

You might also see people using the .values attribute here. It still works, but the Pandas documentation recommends .to_numpy() instead.

x.values  # works, but prefer x.to_numpy()
## array([ 5, 10, 15, 20, 25, 30, 35])
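To tie these inspection tools together, here’s a small self-contained sketch that rebuilds the same Series and checks the container type, the element type, and the extracted NumPy array in one go. The variable name arr is just for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25, 30, 35])

print(type(x))    # the container: a pandas Series
print(x.dtype)    # the type of the elements inside it

arr = x.to_numpy()                  # pull the values out as a NumPy array
print(isinstance(arr, np.ndarray))  # True
print(arr.sum())                    # 140
```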

You can also use the highly popular head() and tail() methods to pick out the first and last elements of a Series. By default, x.head() returns the first 5 elements while x.tail() returns the last 5 elements.

x.head()
## 0     5
## 1    10
## 2    15
## 3    20
## 4    25
## dtype: int64

x.tail()
## 2    15
## 3    20
## 4    25
## 5    30
## 6    35
## dtype: int64

And if you wanted to pick out just the first 3 elements, you would do x.head(3).

x.head(3)
## 0     5
## 1    10
## 2    15
## dtype: int64
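The same n argument works for tail(), and both methods also accept a negative n, which means "everything except that many elements". Here’s a quick sketch using the same Series (the variable names are just for illustration).

```python
import pandas as pd

x = pd.Series([5, 10, 15, 20, 25, 30, 35])

# Last 2 elements; they keep their original index labels (5 and 6)
last_two = x.tail(2)

# Negative n: everything except the last 2 elements
all_but_last_two = x.head(-2)

print(last_two.tolist())          # [30, 35]
print(all_but_last_two.tolist())  # [5, 10, 15, 20, 25]
```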

Another way you can make a Series is from a Python dictionary, like this

data = {'a': 0., 'b': 1., 'c': 2., 'd': 3.}
y = pd.Series(data)
## a    0.0
## b    1.0
## c    2.0
## d    3.0
## dtype: float64

In this case, Pandas uses the dictionary keys for the Series index and the dictionary values for the Series values. Again, we’ll cover the index and its purpose shortly. For now, just know it’s a thing.
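If you want to confirm where the keys and values ended up, here’s a minimal sketch. Looking up a value by its label is a small preview of indexing, which gets proper treatment later in the course.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2., 'd': 3.}
y = pd.Series(data)

print(list(y.index))  # the dictionary keys: ['a', 'b', 'c', 'd']
print(y.tolist())     # the dictionary values: [0.0, 1.0, 2.0, 3.0]
print(y['b'])         # look up a value by its label: 1.0
```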

Also notice that in the first example we created a Series of integers, but in this example we added a decimal point to each value, so we got back a Series of floats. If we wanted to make a Series of strings, we could do that too with something like

z = pd.Series(['frank', 'dee', 'dennis'])

If we print(z), notice the dtype is listed as “object”.

## 0     frank
## 1       dee
## 2    dennis
## dtype: object

I’ll explain why that’s so cryptic in a future section, but in the meantime I want to show you the newer, and probably better, way to create a Series of strings. If we modify this statement by explicitly setting dtype='string',

z = pd.Series(['frank', 'dee', 'dennis'], dtype='string')

now when we print(z), it lists the dtype as “string”.

## 0     frank
## 1       dee
## 2    dennis
## dtype: string
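If you already have a Series of strings that was built without dtype='string', one way to convert it after the fact is astype('string'). This is a sketch with illustrative variable names. The dtype you get by default for strings can vary with your Pandas version, so the example only checks the converted Series.

```python
import pandas as pd

# Built without an explicit dtype
z = pd.Series(['frank', 'dee', 'dennis'])

# Convert to the dedicated string dtype
z_str = z.astype('string')

print(isinstance(z_str.dtype, pd.StringDtype))  # True
print(z_str.tolist())  # ['frank', 'dee', 'dennis']
```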

Again, I’m sweeping a lot under the rug here, but we’ll cover these things in a future section.

Now, you might be wondering how I knew that pd.Series() included a parameter for dtype. If we look at the documentation for Series, we can see the full signature of its constructor, which includes parameters for data, index, dtype, and some other things. So, the documentation is usually the first place to look when you want to learn about a function or have a question about it.
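To make those parameters a bit more concrete, here’s a small sketch that passes data, index, dtype, and name explicitly. The values and the name 'demo' are made up for illustration.

```python
import pandas as pd

s = pd.Series(
    data=[1, 2, 3],         # the values
    index=['a', 'b', 'c'],  # custom labels instead of 0, 1, 2
    dtype='float64',        # force floats even though we passed ints
    name='demo',            # an optional name for the Series
)

print(s)
```

You can also pull up the same documentation without leaving your notebook by running help(pd.Series).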

So far, we’ve seen how you can make a Series from a list and a dictionary, but perhaps the most powerful way to make a Series from scratch is to make it from a NumPy array. Before we do that, we need to import numpy.

import numpy as np

So now, if we create a NumPy array like

x = np.array([10, 20, 30, 40])

we can convert it to a Series just by wrapping x in pd.Series().

pd.Series(x)
## 0    10
## 1    20
## 2    30
## 3    40
## dtype: int64
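As a quick sanity check, here’s a sketch showing that the round trip from NumPy array to Series and back preserves both the values and the dtype. The names arr and s are just for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])
s = pd.Series(arr)

# The Series takes on the array's dtype
print(s.dtype == arr.dtype)               # True

# Going back to NumPy gives the same values
print(np.array_equal(s.to_numpy(), arr))  # True
```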

So, why is this so “powerful”? Well, suppose you wanted to make a complex Series from scratch, like an unusual sequence of floats or a random sample of values from a normal distribution. The somewhat lame, but practical way to do this is to use NumPy. NumPy has lots of great tools for making arrays from scratch, and we’ve already seen how you can wrap them into a Series.

So, for example, we could make a Series with 10 evenly spaced values between 1 and 2 using the np.linspace() function, passing in start, stop, and num parameters, and then just wrapping the resulting NumPy array with pd.Series().

pd.Series(np.linspace(start=1, stop=2, num=10))
## 0    1.000000
## 1    1.111111
## 2    1.222222
## 3    1.333333
## 4    1.444444
## 5    1.555556
## 6    1.666667
## 7    1.777778
## 8    1.888889
## 9    2.000000
## dtype: float64
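np.linspace() is handy when you know how many values you want. If you’d rather specify the step size, np.arange() is one alternative. Here’s a sketch comparing the two, with variable names chosen just for this example. Note that np.arange() excludes the stop value, while np.linspace() includes it by default.

```python
import numpy as np
import pandas as pd

# 5 evenly spaced values from 1 to 2, stop included
by_count = pd.Series(np.linspace(start=1, stop=2, num=5))

# Values from 1 up to (but not including) 2, in steps of 0.25
by_step = pd.Series(np.arange(1, 2, 0.25))

print(by_count.tolist())  # [1.0, 1.25, 1.5, 1.75, 2.0]
print(by_step.tolist())   # [1.0, 1.25, 1.5, 1.75]
```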

Similarly, we could make a Series of standard normal random values using NumPy’s np.random.normal() function, and wrapping that result with pd.Series().

pd.Series(np.random.normal(size=5))
## 0    0.018256
## 1    1.594655
## 2   -0.451890
## 3   -0.358798
## 4   -0.317518
## dtype: float64
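Since these values are random, you’ll get different numbers every time you run that code. If you want reproducible results, one option is NumPy’s seeded generator, np.random.default_rng(). Here’s a minimal sketch. The seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Two generators created with the same seed produce the same draws
s1 = pd.Series(np.random.default_rng(seed=42).normal(size=5))
s2 = pd.Series(np.random.default_rng(seed=42).normal(size=5))

print(s1.equals(s2))  # True
print(len(s1))        # 5
```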

So, if you can’t see where this is going, one of the keys to mastering Pandas is actually to master NumPy, since most of Pandas is built on top of it. And if you don’t feel good about your NumPy skills, lucky for you I happen to have a course on that too called Python NumPy For Your Grandma.
