BC dates in Python - Part 1 - Numpy/Pandas
An approach for expressing BC dates in Python using Numpy and Pandas
The problem
A while ago, I was doing a small visualisation project on an ancient history topic, and as part of that, I needed to express dates and times around 400BC. Sounds simple enough, right? Python has a datetime module, just use that and we're golden.
Well, not so fast. datetime's MINYEAR is 1 (i.e. 1AD), so we can't express BC dates like that. We'll need a different solution.
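A quick sanity check of that limitation against the standard library:

```python
from datetime import datetime, MINYEAR

# The standard library cannot represent any year before 1 AD
print(MINYEAR)  # 1

try:
    datetime(-400, 1, 2)
except ValueError as e:
    print(f"No BC dates: {e}")
```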
What are our requirements for a good solution? What functionality are we looking for?
- Expressing BC as well as AD dates
- Creating dates from string and/or numeric parameters
- Printing dates and times
- Getters for individual components (year, month, etc.)
- Adding/subtracting time spans and computing time deltas
- Lightweight objects
- Usable in Pandas?
1. Data science libraries
Numpy - datetime64 + timedelta64
Numpy's datetime64 and timedelta64 store times as 64-bit integers, which widens the representable range well beyond Python's base datetime/timedelta, and they add some extra functionality on top.
import numpy as np
# Example code for creating BC and AD times
ad_date = np.datetime64("2020-01-02T03:04:05")
print(ad_date)
bc_date = np.datetime64("-00400-01-02T03:04:05")
print(bc_date)
Numpy: Getters
Numpy doesn't provide extraction functions out of the box, but we can roll our own based on the string representation (inspired by this stack overflow answer).
def dt2cal(dt):
    """
    Convert datetime64 to a calendar array of year, month, day, hour, minute,
    second, microsecond.

    Parameters
    ----------
    dt : datetime64
        datetime

    Returns
    -------
    cal : int32 array (7)
        calendar array representing year, month, day, hour, minute, second,
        microsecond
    """
    # allocate output
    out = np.empty(7, dtype="i4")
    # decompose calendar floors
    Y, M, D, h, m, s = [dt.astype(f"M8[{x}]") for x in "YMDhms"]
    out[0] = Y.astype(int) + 1970                   # Gregorian year
    out[1] = (M - Y).astype(int) + 1                # month
    out[2] = (D - M).astype(int) + 1                # day
    out[3] = (dt - D).astype("m8[h]").astype(int)   # hour
    out[4] = (dt - h).astype("m8[m]").astype(int)   # minute
    out[5] = (dt - m).astype("m8[s]").astype(int)   # second
    out[6] = (dt - s).astype("m8[us]").astype(int)  # microsecond
    return out
print(bc_date)
cal = dt2cal(bc_date)
cal
def dt2cal_columns(dt):
    """
    Convert array of datetime64 to a calendar array of year, month, day, hour,
    minute, second, microsecond with these quantities indexed on the last axis.

    Parameters
    ----------
    dt : datetime64 array (...)
        numpy.ndarray of datetimes of arbitrary shape

    Returns
    -------
    cal : int32 array (..., 7)
        calendar array with last axis representing year, month, day, hour,
        minute, second, microsecond
    """
    # allocate output
    out = np.empty(dt.shape + (7,), dtype="i4")
    # decompose calendar floors
    Y, M, D, h, m, s = [dt.astype(f"M8[{x}]") for x in "YMDhms"]
    out[..., 0] = Y.astype(int) + 1970                   # Gregorian year
    out[..., 1] = (M - Y).astype(int) + 1                # month
    out[..., 2] = (D - M).astype(int) + 1                # day
    out[..., 3] = (dt - D).astype("m8[h]").astype(int)   # hour
    out[..., 4] = (dt - h).astype("m8[m]").astype(int)   # minute
    out[..., 5] = (dt - m).astype("m8[s]").astype(int)   # second
    out[..., 6] = (dt - s).astype("m8[us]").astype(int)  # microsecond
    return out
dates = np.array([bc_date, ad_date], dtype='datetime64')
print(dates)
cal_arr = dt2cal_columns(dates)
cal_arr
Numpy: Time spans
timedelta64 handles addition and subtraction as expected; the only thing worth calling out is the concept of a time scale inherent in Numpy's time classes: to operate on two timedelta64 objects, their time scales must be reconcilable. Numpy can combine compatible scales like [D] and [s] on its own, but it can't mix year/month units with days or finer, because years and months have no fixed length. You can convert between scales explicitly using .astype('timedelta64[X]').
This also applies to printing: the first print in the code below shows seconds, which isn't very readable, but you can cast to a coarser time scale such as years for better readability. Similarly, you need to specify the time scale when creating a time delta from scratch.
delta = ad_date - bc_date
print(delta)
print(delta.astype("timedelta64[D]"))
print(delta.astype("timedelta64[Y]"))
new_delta = np.timedelta64(123, 'D')
print(new_delta)
print(delta - new_delta)
print(ad_date + new_delta)
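To see the scale restriction in action, here's a small sketch of what happens when you mix year-based and day-based deltas directly, and how an explicit cast (which uses an average year length) resolves it:

```python
import numpy as np

years = np.timedelta64(1, 'Y')
days = np.timedelta64(30, 'D')

# Direct arithmetic across non-linear units raises a TypeError
try:
    years - days
except TypeError as e:
    print(f"Incompatible units: {e}")

# Casting to a common unit first makes the operation legal
print(years.astype('timedelta64[D]') - days)
```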
Conclusion
Numpy is a solid choice if all you need is to cover a large range of times. Creating and dealing with dates and time spans is fairly straightforward, and the objects are as lean as can be. As we've seen, we can compensate for the lack of built-in extraction functions with our own workarounds, but these aren't as efficient as native support would be, and they're prone to breaking as the library evolves.
Pandas
Pandas is the standard framework many data scientists use when dealing with large amounts of data. Its to_datetime function is a great way to convert columns of date strings into Pandas Timestamps, which wrap Numpy's datetime64.
#Imports
import pandas as pd
df = pd.DataFrame({'date': ['1970-1-2 03:04:05', '2020-6-7 08:09:10'],
                   'value': [2, 3]})
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d %H:%M:%S")
df
Pandas: Creating times BC/AD
Since Pandas internally uses datetime64, we should just be able to use to_datetime on both BC and AD dates and we're all good to go, right? Wrong. While Pandas does use 64 bits to represent time, it does so at a precision of nanoseconds, which severely limits the available date range as discussed here.
# This will result in an error:
#date = pd.to_datetime('-400-01-02T03:04:05', format="%Y-%m-%dT%H:%M:%S")
If you try to parse a date this far before 1677AD, you'll get an error like this:
ValueError: time data '-400-01-02T03:04:05' does not match format '%Y-%m-%dT%H:%M:%S' (match)
The error message is a bit misleading: the structure of the input is fine, but %Y has no way to match a negative year, and even if it did, the date itself falls outside the supported range. So to_datetime is out.
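The supported range can be inspected directly: with nanoseconds packed into 64 bits, pd.Timestamp covers only about 584 years.

```python
import pandas as pd

# The full representable range of a nanosecond-resolution 64-bit Timestamp
print(pd.Timestamp.min)  # 1677-09-21 ...
print(pd.Timestamp.max)  # 2262-04-11 ...
```

Anything outside this window, BC dates very much included, cannot be a Timestamp.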
What if we populated our data frames directly with datetime64?
df = pd.DataFrame({'date': [np.datetime64('-400-01-02T03:04:05'), np.datetime64('2020-06-07T08:09:10')],
                   'value': [2, 3]})
df
This works beautifully! Looks like we can get our BC dates into Pandas data frames after all.
So we can use datetime64 directly; we just lose the convenience of assembling Timestamps straight from string column(s). Here's a utility function that converts a column of datetime64-compatible strings in place:
def custom_to_datetime_from_date(df):
    df['date'] = df.apply(lambda row: np.datetime64(row['date']), axis=1)
    return df
test_df = pd.DataFrame({'date': ['-400-01-02T03:04:05', '2020-06-07T08:09:10'],
                        'value': [2, 3]})
print(test_df)
test_df = custom_to_datetime_from_date(test_df)
test_df
Another great use of to_datetime is its ability to assemble Timestamps from multiple columns, e.g. year, month, etc. Here's what a workaround for this could look like:
helper_df = pd.DataFrame({'column_names': ['month', 'day', 'hour', 'minute', 'second', 'ms', 'ns'],
                          'date_args': ['M', 'D', 'h', 'm', 's', 'ms', 'ns'],
                          'offsets': [1, 1, 0, 0, 0, 0, 0]})
print(helper_df)
def row_to_date_from_time_columns(row):
    if 'year' not in row:
        return None
    date = np.datetime64(row['year'], 'Y')
    for idx, r in helper_df.iterrows():
        name = r['column_names']
        if name in row:
            arg = r['date_args']
            # cast to the finer unit first; Numpy can't add e.g. a day
            # delta to a year-resolution datetime directly
            date = date.astype(f'M8[{arg}]')
            date += np.timedelta64(int(row[name]), arg)
            offset = r['offsets']
            if offset != 0:
                date -= np.timedelta64(offset, arg)
    return date
def custom_to_datetime_from_time_columns(df, drop_source_columns=False):
    df['date'] = df.apply(lambda row: row_to_date_from_time_columns(row), axis=1)
    if drop_source_columns:
        df = df.drop(helper_df['column_names'], axis=1, errors='ignore')
    return df
test_df = pd.DataFrame({'year': ['-400', '2020'],
                        'month': ['01', '02'],
                        'day': ['03', '04'],
                        'value': [2, 3]})
print(test_df)
test_df = custom_to_datetime_from_time_columns(test_df, True)
test_df
For efficiency's sake, I've moved the helper table outside the function so it isn't rebuilt on every call. You should further adjust this by removing the columns you don't need, but it should be a good starting point.
And finally, a combined version that converts the date column if it exists, or assembles time-related columns if it doesn't:
helper_df = pd.DataFrame({'column_names': ['month', 'day', 'hour', 'minute', 'second', 'ms', 'ns'],
                          'date_args': ['M', 'D', 'h', 'm', 's', 'ms', 'ns'],
                          'offsets': [1, 1, 0, 0, 0, 0, 0]})
def row_to_date(row):
    if 'date' in row:
        return np.datetime64(row['date'])
    if 'year' not in row:
        return None
    date = np.datetime64(row['year'], 'Y')
    for idx, r in helper_df.iterrows():
        name = r['column_names']
        if name in row:
            arg = r['date_args']
            # cast to the finer unit first; Numpy can't add e.g. a day
            # delta to a year-resolution datetime directly
            date = date.astype(f'M8[{arg}]')
            date += np.timedelta64(int(row[name]), arg)
            offset = r['offsets']
            if offset != 0:
                date -= np.timedelta64(offset, arg)
    return date
def custom_to_datetime(df, drop_source_columns=False):
    df['date'] = df.apply(lambda row: row_to_date(row), axis=1)
    if drop_source_columns:
        df = df.drop(helper_df['column_names'], axis=1, errors='ignore')
    return df
Great, so we've worked around that limitation in Pandas! However, this still doesn't solve the Timestamp limitation in DatetimeIndex and PeriodIndex.
Since we can create columns of datetime64 from strings now, maybe we can use the origin parameter of to_datetime and express our dates as day counts in the Julian calendar? That would mean counting days from the Julian origin of 4713BC. Sounds like a good solution? Can we trick Timestamp into accepting a valid datetime64 and have it not modify it?
julian_origin = np.datetime64('-4713-01-01T12:00:00')
test_input = np.datetime64('-400-01-02')
# This will error out with:
# OutOfBoundsDatetime: 1574925 is Out of Bounds for origin='julian'
#pd.to_datetime((test_input - julian_origin).astype('timedelta64[D]').astype(int), unit='D', origin='julian')
Nope, doesn't work. There is simply no way to get Timestamp to extend its range, even if you pass in a Julian datetime64. Pandas takes that representation and converts it to datetime64[ns] which again only goes as far back as 1677AD.
Pandas gives you false hope by stating that you can provide a different origin to to_datetime, like this: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01")). However, for that to work, the origin Timestamp needs to be valid in the first place. We'd need a BC Timestamp as the origin in order to produce BC Timestamps, and BC Timestamps are exactly what we can't create. Great. I guess we just can't use DatetimeIndex.
Pandas: Getters
Timestamp and DatetimeIndex have a great selection of component getters (as described here). Sadly, we can't use any of them, since we can't use Timestamp/DatetimeIndex...
The best we can do is use the getters defined above as a starting point and write a per-component extractor utility that we can then apply to each row to produce a new column.
def get_time_component(dt, comp):
    """
    Extract a time component (year, month, day, hour, minute, second,
    microsecond) from a datetime64.

    Parameters
    ----------
    dt : datetime64
        datetime to extract from
    comp : string
        time component to extract ('y', 'm', 'd', 'h', 'min', 's' or 'us')

    Returns
    -------
    res : int32
        extracted time component, or None if comp is unknown
    """
    components = ['y', 'm', 'd', 'h', 'min', 's', 'us']
    if comp not in components:
        return None
    arr = dt2cal(dt)
    return arr[components.index(comp)]
df = pd.DataFrame({'date': [np.datetime64('-400-01-02T03:04:05'), np.datetime64('2020-06-07T08:09:10')],
                   'value': [2, 3]})
df['year'] = df.apply(lambda row: get_time_component(row['date'], 'y'), axis=1)
df
Pandas: Time spans
One would think that if you have a working column of datetime64 objects, applying timedelta64 operations to the whole column would just work. However, Pandas helpfully attempts to convert the column to Timestamps before applying the operation, which of course fails since our data is outside the Timestamp limits.
Instead, we have to apply operations row-by-row, not using the column shorthand.
(Again, note that a type cast is required to make the second-resolution datetime64 compatible with the timedelta64, which is year-based due to how it was constructed.)
test_df = pd.DataFrame({'date': ['-400-01-02T03:04:05', '2020-06-07T08:09:10'],
                        'value': [2, 3]})
test_df = custom_to_datetime(test_df)
sub_ts = np.timedelta64(100, 'Y')
# Pandas will attempt to convert the date column into a Timestamp and crash as a result
#test_df['date'] -= sub_ts
test_df['date'] = test_df.apply(lambda row: row['date'] - sub_ts.astype('m8[s]'), axis=1)
test_df
Conclusion
You can make Pandas work with Numpy's 64-bit time classes; it just needs some workarounds and a careful approach to keep Pandas from accidentally attempting an automatic conversion. Pandas normally adds a lot of utility on top of Numpy's time handling, but since it's all tied to range-limited classes, we can't use most of it.
Some articles that discuss the advantages of Pandas' time handling can be found here and here, just so you know what you're missing out on.