Python Pandas For Your Grandpa - 4.2 Dates And Times
Inevitably, at some point you’ll need to work with dates and times. Pandas helps ease the pain, but due to the hundreds of date and time formats, timezones, daylight saving, leap years, and even leap seconds, dates and times can be a frustrating beast. In this section, we’ll do our best to tackle that beast, although we’re not going cover everything because doing so would take hours. We’re about to go on a journey together, so grab a cup of coffee and buckle up.
For this section I’m using Google Colab connected to a local python runtime, version 3.9.1. Unfortunately, at the time of writing this article, Colab is stuck on python 3.6.9 and some of the features in this lecture require version 3.9 and up. So, keep that in mind if you’re following along on Google Colab.
How Time Works
The first thing we need to discuss are timezones. So, when I was writing this script, I was sitting in Mandeville, Lousiana and it was Thursday, December 10th, 2020 @ 7:45PM, keeping in mind that I was in the U.S. Central timezone. If, at the same point in time, my friend in Melbourne looked at his clock, he’d say it was Friday December 11th @ 12:45PM because he’s located in the Australia Melbourne timezone.
If you look at a map of timezones you can see how jagged and un-uniform they are. Not to mention, there’s a lot of historical messiness like Russia going from 11 timezones to 9 timezones then back to 11, and Nepal not even adopting a standard timezone until 1986.
Nonetheless, suppose you were building a software application where people can schedule and share meetings. How would you store the datetime each meeting is scheduled to start? Well, a logical thing to do would be to store all the data relative to a single timezone, and then make your users (or basically the display of your application) responsible for converting those datetimes into their respective local timezones.
For example, let’s suppose you decide you want to use US Eastern as your master key timezone, from which all other timezones are referenced. If I schedule a meeting for 10:30 AM of May 1st 2030, US Central Time, then under the hood, your calendar application would convert that to 11:30 AM of May 1st 2030 and store that value into your database of all calendar events. Note that you don’t need to store the timezone because you know that every time in your database is in US Eastern Time. Now if I invite my Aussie friend to join my meeting, he can do a quick conversion to figure out the meeting starts at 1:30 AM of May 2nd 2030 local Melbourne time. This is a nice architecture because you’ve made storing datetimes very simple, and you’ve decoupled the hard work of actually converting between timezones to the end users. There’s just one little problem..
Like most states in the U.S. and many countries around the world New York (US Eastern Timezone) observes Daylight Saving Time which means the clock rolls forward an hour in the Spring and rolls backward an hour in the Fall. Now I think most people know this but they don’t really know this. I mean how, and when, does the clock actually roll forward and backward? It happens so early in the morning that you’ve probably never witnessed it. Assume you’re in New York; the date is Sunday, March 14, 2021 @ 1:00AM. If you actually stare at a digital clock, you’ll see it jumpy from 1:59AM back to 1:00AM, which means the times in the middle actually occurred twice on this date. This is a fatal flaw in your calendar software! Remember, you were storing data in US Eastern Time. So let’s say someone scheduled a meeting that your software logged for March 14, 2021 @ 1:30AM. When is that meeting supposed to occur? The 1st occurrence of 1:30AM or the 2nd occurrence of 1:30AM??
The solution to this problem? Pick a better timezone for which to store your data. Specifically, pick Coordinated Universal Time aka UTC aka Greenwich Mean Time. This is the golden child of timezones. It does not observe daylight saving. It has never observed daylight saving. It is not going to observe daylight saving. It is the timezone which all of world agrees to measure relative to.
Now if we’re gonna pass around times relative to UTC, it can be useful to have a standard format for doing so. One such format, that you’ve probably seen is ISO 8601. For example, this ‘2000-06-01 12:00:00−05:00’ is in ISO 8601, and in English it means “June 1st, 2000 @ 12PM, 5 hours behind UTC” meaning, at that same instance in time, it was 5PM in UTC. So whenever you see a datetime like this ‘2000-06-01 12:00:00−00:00’ with all 0s after the dash, it’s safe to assume you’re looking at UTC time. A somewhat strange but common alternative is to use a capital Z in place of the UTC offset, like ‘2000-06-01 12:00:00Z’. Also note that you can kind of infer where an ISO datetime is, based on the UTC offset, but it’s not perfect because something like “5 hours beind UTC” could refer to Chicago during Daylight Saving Time, or New York during Standard Time, or Bogota, Columbia on any day because Bogota doesn’t observe Daylight Saving.
Alright, now that we got that out of the way, let’s talk about Python. We’re gonna cover three datetime packages / modules and they’re all inter-related. The first is Python’s datetime module, the second is NumPy and the third is Pandas.
Let’s start with Python’s datetime module which is probably the most widely used datetime library among python developers. And we’re not gonna cover everything because this course is about Pandas, but we are gonna cover a few really fundamental concepts.
The best way to import the module is a slight point of contention between developers, but I like to
import datetime as dt.
import datetime as dt
With datetime, you can make a date like
dt.date(2019, 12, 4) ## datetime.date(2019, 12, 4)
Now, dates are always timezone naive which means they don’t know anything about timezones - they’re just dates.
So, what about datetimes? Well, you can make a timezone naive datetime like
dtnaive1 = dt.datetime(2020, 1, 1, 8, 15, 20) print(dtnaive1) ## 2020-01-01 08:15:20
dtnaive2 = dt.datetime.strptime('5/19/2020 8PM', '%m/%d/%Y %I%p') print(dtnaive2) ## 2020-05-19 20:00:00
Strptime is a universal format for parsing dates and times from strings. Here I’m telling python, “the first portion of this string is the month, follow by a slash, followed by the day, followed by a slash followed by a 4-digit year, followed by a space, followed by a 12-unit hour, follow by AM or PM”. If you’re wondering how I knew what which format codes to use, the docs for datetime include this really handy reference table.
You can also make timezone aware datetimes like
dt.datetime.fromisoformat() and then pass in a valid ISO 8601 string.
dtaware1 = dt.datetime.fromisoformat('2021-11-06 12:00:00-05:00') print(dtaware1) ## 2021-11-06 12:00:00-05:00
Or, we can import the ZoneInfo module and then do
dt.datetime.strptime() just like before, and then chain that with
.replace(tzinfo=ZoneInfo()) passing in a valid geographic timezone name.
from zoneinfo import ZoneInfo dtaware2 = dt.datetime.strptime('11/6/2021 12PM', '%m/%d/%Y %I%p').replace(tzinfo=ZoneInfo('America/New_York')) print(dtaware2) ## 2021-11-06 12:00:00-04:00
By the way, a fantastic reference for timezone names is this Wikipedia article titled “List of tz database time zones”.
Now that we have some datetimes, we can use the timedelta class to add 24 hours to each datetime. For example, if add 24 hours to
dtnaive1, unsurprisingly we get back “January 2nd @ 8:15AM and 20 seconds”
dtnaive1 + dt.timedelta(hours=24) ## datetime.datetime(2020, 1, 2, 8, 15, 20)
Watch what happens if we add 24 hours to
print(dtaware1 + dt.timedelta(hours=24)) ## 2021-11-07 12:00:00-05:00 print(dtaware2 + dt.timedelta(hours=24)) ## 2021-11-07 12:00:00-05:00
In both cases we get back November 7th, 2021 @ 12PM. The issue with this is that, if you’re in New York, 24 hours after 12PM on Nov 6th 2021, the clock’s actually going read 11AM, not 12PM, because Daylight Saving ended and the clock rolled back an hour. This information is embedded in
.tzinfo.dst(). So if you do
dtaware2.tzinfo.dst(dtaware2) it tells you that the daylight saving adjusted time is 1 hour behind what’s being reported.
dtaware2.tzinfo.dst(dtaware2) ## datetime.timedelta(seconds=3600)
However, when we call
dtaware1.tzinfo.dst(dtaware1) we get back
This brings up an important distinction between iso UTC offsets and geographic timezones. When you think about the datetime ‘2021-11-06 12:00:00-05:00’, that’s a perfectly valid representation of a datetime in New York which observes Daylight Saving, but it’s also a perfectly valid representation of a datetime in Bogota Columbia which doesn’t observe Daylight Saving. So, there’s really no reason for Python to assume that this datetime is being used somewhere that observes Daylight Saving vs somewhere that doesn’t. So, if you want your datetimes to be “Daylight Saving aware”, make sure you use actual geographic timezones and not UTC offsets.
Next up, we have NumPy. Things are a little simpler in NumPy. For one, there’s no date type, only a datetime type. Two - NumPy datetimes are timezone naive, meaning there’s no support timezones in NumPy. There was, but it was actually deprecated in v1.11.0. Yeah, maintaining timezone support sucks.
So if you want to make a datetime in NumPy you could do
import numpy as np np.datetime64('2005-02-25T03:30') ## numpy.datetime64('2005-02-25T03:30')
or if you wanted to make an array of datetimes you could do np.array(), then pass in a list of datetime strings and set
np.array(['2001-01-01T12:00', '2002-02-03T13:56:03.172'], dtype='datetime64') ## array(['2001-01-01T12:00:00.000', '2002-02-03T13:56:03.172'], ## dtype='datetime64[ms]')
So, how are these datetimes represented internally? Well, datetime64 is really just a wrapper for int64. In fact, we can add and integer to a datetime64. For example, if we add 1 to this datetime, we get back a new datetime that’s one minute ahead of the original datetime.
np.datetime64('2005-02-25T03:30') + 1 ## numpy.datetime64('2005-02-25T03:31')
And if we add 1 to this array of datetimes, we get back new datetimes which are are one millisecond ahead of the original datetimes.
np.array(['2001-01-01T12:00', '2002-02-03T13:56:03.172'], dtype='datetime64') + 1 ## array(['2001-01-01T12:00:00.001', '2002-02-03T13:56:03.173'], ## dtype='datetime64[ms]')
This brings up an important concept which is that the datetime64 datatype in NumPy has an associated ‘unit’ that determines the step size between sequential datetimes. For example, if we call
np.dtype() on the first datetime, you can see that the associated unit is ’m' which stands for ‘minutes’, which is why when we add 1 to it, we jump by a minute.
np.dtype(np.datetime64('2005-02-25T03:30')) ## dtype('<M8[m]')
If we explicitly set the unit as seconds like this,
np.datetime64('2005-02-25T03:30', 's') ## numpy.datetime64('2005-02-25T03:30:00')
then when we add 1 we jump by a second.
np.datetime64('2005-02-25T03:30', 's') + 1 ## numpy.datetime64('2005-02-25T03:30:01')
Now keep in mind datetime64 is 64 bits which means is can represent 2^64, or about 18 million trillion different datetimes. Which datetimes, you ask? Well, it’s the 9 million trillion datetimes before and after Junary 1st, 1970 UTC.
So if we make an array of datetimes and set our unit to seconds, it means we could have datetimes that range from about 290M years before 1970, up to 290M years after 1970. If our unit was nanoseconds, we could only have datetimes that range from about 292 years before 1970, up to 292 years after 1970. And if you tried to create a datetime outside that range, NumPy would convert it to something totally different.
np.array(['2100-01-01', '2200-01-01', '2300-01-01'], dtype='datetime64[ns]') ## array(['2100-01-01T00:00:00.000000000', '2200-01-01T00:00:00.000000000', ## '1715-06-13T00:25:26.290448384'], dtype='datetime64[ns]')
By the way, when you have datetimes centered around Junary 1st, 1970 UTC, it’s known as “Unix Time” or “POSIX Time”.
Also NumPy reserves one of those bits for ‘NaT’ or “not a time” so you can represent missing or invalid values, like
np.array(['2020-01-01', 'NaT', '2021-01-01'], dtype='datetime64') ## array(['2020-01-01', 'NaT', '2021-01-01'], dtype='datetime64[D]')
Also notice the unit here is ‘day’, so even though there’s no date type in NumPy, we can kind of represent dates using datetimes with daily units.
Alas we get to Pandas. The workhorse datetime type in Pandas is Timestamp which is really just a wrapper for NumPy’s datetime64 type. Similar to NumPy, Pandas doesn’t have a proper date type, so if you want to represent a date like “January 3rd, 2001”, you’d typically use the timestamp “January 3rd, 2001, hour 0, minute 0, second 0”.
There’s also something called a Period Type, but we’re not gonna cover it because it’s seldom used, especially compared to timestamps.
the workhorse function to create a timestamp is the
to_datetime() function, where you usually pass in a string like
import pandas as pd pd.to_datetime('2020-03-01') ## Timestamp('2020-03-01 00:00:00')
Note that unlike NumPy, Pandas timestamps are always stored with nanosecond units, so you can’t create a timestamp before 1678 or after 2262. If you try it, you’ll get an error.
pd.to_datetime('2263-01-01') # ERROR
Also, unlike NumPy, Pandas timestamps can be timezone aware. If you want to create a timestamp like “March 8th, 2001 @ 3PM, US Pacific timezone”, you could do
pd.to_datetime('March 8, 2001 3PM').tz_localize('America/Los_Angeles') ## Timestamp('2001-03-08 15:00:00-0800', tz='America/Los_Angeles')
By the way, notice how the
to_datetime() function smartly interpretted our datetime string without us telling it how. Cool! Although, it’s better to be explicit about the format which you can do by passing in a strptime format using the
format parameter, in this case like
mar8 = pd.to_datetime('March 8, 2001 3PM', format='%B %d, %Y %I%p').tz_localize('America/Los_Angeles') print(mar8) ## 2001-03-08 15:00:00-08:00
The timezone information is stored in the
.tz attribute of the timestamp.
mar8.tz ## <DstTzInfo 'America/Los_Angeles' PST-1 day, 16:00:00 STD>
If you wanna change the timezone you can use the
tz_convert() method, like
mar8.tz_convert('Europe/London') ## Timestamp('2001-03-08 23:00:00+0000', tz='Europe/London')
If we pass a list of datetime strings into the
to_datetime() function, we get back a DatetimeIndex that we could use as the index of a Series or DataFrame.
pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03']) ## DatetimeIndex(['2020-01-01', 'NaT', '2020-01-03'], dtype='datetime64[ns]', freq=None)
If we wrap that with
pd.Series(), we get back a Series of datetime64s
pd.Series(pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03'])) ## 0 2020-01-01 ## 1 NaT ## 2 2020-01-03 ## dtype: datetime64[ns]
Now let’s say we wanted to add 12 hours to each of those datetimes. If we assign that Series to a variable
x, then we could do
x = pd.Series(pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03'])) x + pd.Timedelta(value=12, unit='H') ## 0 2020-01-01 12:00:00 ## 1 NaT ## 2 2020-01-03 12:00:00 ## dtype: datetime64[ns]
So if you want to add or subtract time from a Series of datetimes, you have to do so using an instance of Timedelta, specifying a
value and a
unit. By the way, if you subtract two timestamps like this, what you get back is a timedelta.
pd.to_datetime('2020-03-02') - pd.to_datetime('2020-03-01') ## Timedelta('1 days 00:00:00')
Sometimes you’ll want to build a sequence of datetimes based on a regular interval. For example, you can make a DatetimeIndex with 10 values starting on March 8th 2020, iterating by 12 hours, using the
pd.date_range('2020-03-08', periods=10, freq='12H') ## DatetimeIndex(['2020-03-08 00:00:00', '2020-03-08 12:00:00', ## '2020-03-09 00:00:00', '2020-03-09 12:00:00', ## '2020-03-10 00:00:00', '2020-03-10 12:00:00', ## '2020-03-11 00:00:00', '2020-03-11 12:00:00', ## '2020-03-12 00:00:00', '2020-03-12 12:00:00'], ## dtype='datetime64[ns]', freq='12H')
If we did the same thing, but with
tz='America/Denver', notice the output changes slightly because Daylight Saving in Denver started @ 2AM on March 8th in 2020, meaning the clock in Denver went from 1:59AM to 3:00AM, basically skipping the whole 2 O’clock hour.
pd.date_range('2020-03-08', periods=10, freq='12H', tz='America/Denver') ## DatetimeIndex(['2020-03-08 00:00:00-07:00', '2020-03-08 13:00:00-06:00', ## '2020-03-09 01:00:00-06:00', '2020-03-09 13:00:00-06:00', ## '2020-03-10 01:00:00-06:00', '2020-03-10 13:00:00-06:00', ## '2020-03-11 01:00:00-06:00', '2020-03-11 13:00:00-06:00', ## '2020-03-12 01:00:00-06:00', '2020-03-12 13:00:00-06:00'], ## dtype='datetime64[ns, America/Denver]', freq='12H')
Let’s go ahead and use this datetime index as the index of a Series of stock prices.
foostock = pd.Series( data = np.round(np.random.normal(size=10), 2), index = pd.date_range('2020-03-08', periods=10, freq='12H', tz='America/Denver') ) print(foostock) ## 2020-03-08 00:00:00-07:00 -0.66 ## 2020-03-08 13:00:00-06:00 0.71 ## 2020-03-09 01:00:00-06:00 0.34 ## 2020-03-09 13:00:00-06:00 -0.69 ## 2020-03-10 01:00:00-06:00 -0.93 ## 2020-03-10 13:00:00-06:00 -0.68 ## 2020-03-11 01:00:00-06:00 -0.07 ## 2020-03-11 13:00:00-06:00 1.03 ## 2020-03-12 01:00:00-06:00 -0.39 ## 2020-03-12 13:00:00-06:00 0.68 ## Freq: 12H, dtype: float64
One of the neat things about this datetime index is that we can index the Series in some simple but power ways. If we say
foostock.loc['2020-03-09'], we actually get back both records that happened on March 9th.
foostock.loc['2020-03-09'] ## 2020-03-09 01:00:00-06:00 0.34 ## 2020-03-09 13:00:00-06:00 -0.69 ## Freq: 12H, dtype: float64
If we did
foostock.loc['2020-03'], we’d get back every record that happened in March.
foostock.loc['2020-03'] ## 2020-03-08 00:00:00-07:00 -0.66 ## 2020-03-08 13:00:00-06:00 0.71 ## 2020-03-09 01:00:00-06:00 0.34 ## 2020-03-09 13:00:00-06:00 -0.69 ## 2020-03-10 01:00:00-06:00 -0.93 ## 2020-03-10 13:00:00-06:00 -0.68 ## 2020-03-11 01:00:00-06:00 -0.07 ## 2020-03-11 13:00:00-06:00 1.03 ## 2020-03-12 01:00:00-06:00 -0.39 ## 2020-03-12 13:00:00-06:00 0.68 ## Freq: 12H, dtype: float64
If we passed in a particular timestamp, we could pick out a singlue value from the series, like the value on March 9th @1AM.
foostock.loc[pd.to_datetime('2020-03-09 1:00:00').tz_localize('America/Denver')] ## 0.34
And we could even select a slice of values with something like
foostock.loc['2020-03-09':'2020-03-10'] ## 2020-03-09 01:00:00-06:00 0.34 ## 2020-03-09 13:00:00-06:00 -0.69 ## 2020-03-10 01:00:00-06:00 -0.93 ## 2020-03-10 13:00:00-06:00 -0.68 ## Freq: 12H, dtype: float64
Note that both endpoints here are included.
So, that’s about all we’re gonna cover on dates and times in Python. There’s actually a lot of stuff we didn’t cover, but I think we hit all the fundamental things. It’s no secret that the real key to learning this stuff is to practice using it, so check out my challenge problems to get started.
2.1 Series Creation
2.2 Series Basic Indexing
2.3 Series Basic Operations
2.4 Series Boolean Indexing
2.5 Series Missing Values
2.6 Series Vectorization
2.8 Series View vs Copy
2.9 Challenge: Baby Names
2.10 Challenge: Bees Knees
2.11 Challenge: Car Shopping
2.12 Challenge: Price Gouging
2.13 Challenge: Fair Teams
3.1 DataFrame Creation
3.2 DataFrame To And From CSV
3.3 DataFrame Basic Indexing
3.4 DataFrame Basic Operations
3.6 DataFrame View vs Copy
3.8 DataFrame Aggregation
3.10 Challenge: Hobbies
3.11 Challenge: Party Time
3.12 Challenge: Vending Machines
3.13 Challenge: Cradle Robbers
3.14 Challenge: Pot Holes
4.2 Dates And Times
4.5 DataFrame Reshaping
4.6 Challenge: Class Transitions
4.7 Challenge: Rose Thorn
4.8 Challenge: Product Volumes
4.9 Challenge: Session Groups
4.10 Challenge: OB-GYM
- Final Boss
5.1 Challenge: COVID Tracing
5.2 Challenge: Pickle
5.3 Challenge: TV Commercials
5.4 Challenge: Family IQ
5.5 Challenge: Concerts