The problem

A while ago, I was doing a small visualisation project on an ancient history topic, and as part of that, I needed to express dates and times around 400BC. Sounds simple enough, right? Python has a datetime module, just use that and we're golden.

Well, not so fast. datetime's MINYEAR is 1 (i.e. 1 AD), so BC dates can't be expressed that way. We'll need a different solution.
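
To see the limitation in action, here's a quick sketch:

```python
import datetime

# datetime's documented lower bound is year 1 (1 AD)
print(datetime.MINYEAR)  # 1

# Anything before that is rejected outright
try:
    datetime.datetime(-400, 1, 2)
except ValueError as e:
    print(e)
```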

What are our requirements for a good solution? What functionality are we looking for?

  • Expressing BC as well as AD dates
  • Creating dates from string and/or numeric parameters
  • Printing times
  • Getters for individual components
  • Adding/subtracting time spans and computing deltas between dates
  • Lightweight objects
  • Usable in pandas?

1. Data science libraries

Numpy - datetime64 + timedelta64

Numpy's datetime64 and timedelta64 extend Python's base datetime/timedelta from 32 to 64 bits and also add some extra functionality.

import numpy as np

Numpy: Creating times BC/AD

Both classes support a range of +/-2.9e11 years at second precision, which should be enough for most purposes.

# Example code for creating BC and AD times
ad_date = np.datetime64("2020-01-02T03:04:05")
print(ad_date)

bc_date = np.datetime64("-00400-01-02T03:04:05")
print(bc_date)

2020-01-02T03:04:05
-400-01-02T03:04:05
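
Besides ISO strings, a datetime64 can also be constructed from an integer count of units relative to the 1970 epoch, which is another convenient way to reach deep into BC territory:

```python
import numpy as np

# -10000 years relative to the 1970 epoch: year -8030 (8031 BC)
deep_past = np.datetime64(-10000, "Y")
print(deep_past)  # -8030
```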

Numpy: Getters

Numpy doesn't provide extraction functions out of the box, but we can roll our own based on the string representation (inspired by this stack overflow answer).

def dt2cal(dt):
    """
    Convert datetime64 to a calendar array of year, month, day, hour, minute, seconds, microsecond.

    Parameters
    ----------
    dt : datetime64
        datetime

    Returns
    -------
    cal : int32 array (7)
        calendar array representing year, month, day, hour, minute, second, microsecond
    """

    # allocate output 
    out = np.empty(7, dtype="i4")
    
    # decompose calendar floors
    Y, M, D, h, m, s = [dt.astype(f"M8[{x}]") for x in "YMDhms"]
    out[0] = Y.astype(int) + 1970 # Gregorian year
    out[1] = (M - Y) + 1 # month
    out[2] = (D - M).astype(int) + 1 # day
    out[3] = (dt - D).astype("m8[h]").astype(int) # hour
    out[4] = (dt - h).astype("m8[m]").astype(int) # minute
    out[5] = (dt - m).astype("m8[s]").astype(int) # second
    out[6] = (dt - s).astype("m8[us]").astype(int) # microsecond
    
    return out

print(bc_date)
cal = dt2cal(bc_date)
cal

-400-01-02T03:04:05
array([-400,    1,    2,    3,    4,    5,    0], dtype=int32)

def dt2cal_columns(dt):
    """
    Convert array of datetime64 to a calendar array of year, month, day, hour,
    minute, seconds, microsecond with these quantites indexed on the last axis.

    Parameters
    ----------
    dt : datetime64 array (...)
        numpy.ndarray of datetimes of arbitrary shape

    Returns
    -------
    cal : int32 array (..., 7)
        calendar array with last axis representing year, month, day, hour,
        minute, second, microsecond
    """

    # allocate output 
    out = np.empty(dt.shape + (7,), dtype="i4")
    # decompose calendar floors
    Y, M, D, h, m, s = [dt.astype(f"M8[{x}]") for x in "YMDhms"]
    out[..., 0] = Y.astype(int) + 1970 # Gregorian year
    out[..., 1] = (M - Y) + 1 # month
    out[..., 2] = (D - M).astype(int) + 1 # day
    out[..., 3] = (dt - D).astype("m8[h]").astype(int) # hour
    out[..., 4] = (dt - h).astype("m8[m]").astype(int) # minute
    out[..., 5] = (dt - m).astype("m8[s]").astype(int) # second
    out[..., 6] = (dt - s).astype("m8[us]").astype(int) # microsecond
    return out

dates = np.array([bc_date, ad_date], dtype='datetime64')
print(dates)

cal_arr = dt2cal_columns(dates)
cal_arr

['-400-01-02T03:04:05' '2020-01-02T03:04:05']
array([[-400,    1,    2,    3,    4,    5,    0],
       [2020,    1,    2,    3,    4,    5,    0]], dtype=int32)

Numpy: Time spans

timedelta64 handles addition and subtraction as expected; the only thing worth calling out is the concept of time scale inherent in Numpy's time classes: to operate on two timedelta64 objects, both need to have the same time scale, i.e. the same [Y], [D], etc. unit. You can easily convert between scales using .astype("timedelta64[X]").

This also holds for printing: The first print in the code below prints as seconds, which isn't very useful, but you can cast it to a better time scale such as years for better readability. Similarly, you need to specify the time scale when creating a time delta from scratch.

delta = ad_date - bc_date
print(delta)
print(delta.astype("timedelta64[D]"))
print(delta.astype("timedelta64[Y]"))

new_delta = np.timedelta64(123, 'D')
print(new_delta)

print(delta - new_delta)

print(ad_date + new_delta)

76367836800 seconds
883887 days
2420 years
123 days
76357209600 seconds
2020-05-04T03:04:05
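
One wrinkle with the unit rule worth noting: calendar units ([Y], [M]) and fixed-size units ([D] and finer) can't be mixed directly in arithmetic, since a month or year has no fixed length; a cast to a common unit is needed first. A quick sketch:

```python
import numpy as np

years = np.timedelta64(2, "Y")
days = np.timedelta64(30, "D")

# Mixing calendar and fixed-size units raises a TypeError...
try:
    print(years + days)
except TypeError as e:
    print(e)

# ...so cast to a common unit first (note: the cast uses an
# average year length, so the result is approximate)
print(years.astype("timedelta64[D]") + days)
```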

Conclusion

Numpy is a solid choice if all you need is to cover a large range of times. Creating and dealing with dates and time spans is fairly straightforward, and the objects are as lean as can be. As we've seen, we can compensate for the lack of easy extraction functions with our own workarounds, but it's not as efficient as native support would be, and such workarounds are prone to breaking as the library evolves.

Pandas

Pandas is the standard framework many data scientists use when dealing with large amounts of data. Its to_datetime method is a great way to convert columns of dates from strings to Pandas Timestamps, which wrap Numpy's datetime64.

#Imports
import pandas as pd

df = pd.DataFrame({'date': ['1970-1-2 03:04:05', '2020-6-7 08:09:10'],
                   'value': [2, 3]})
df['date'] = pd.to_datetime(df['date'], format="%Y-%d-%m %H:%M:%S")
df

                 date  value
0 1970-02-01 03:04:05      2
1 2020-07-06 08:09:10      3

Pandas: Creating times BC/AD

Since Pandas internally uses datetime64, we should just be able to use to_datetime on both BC and AD dates and we're all good to go, right? Wrong. While Pandas does use 64 bits to represent time, it does so at nanosecond precision, which severely limits the available date range, as discussed here.
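
You can see the resulting window directly on the Timestamp class itself:

```python
import pandas as pd

# The nanosecond representation confines Timestamp to roughly 584 years
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```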

# This will result in an error:
#date = pd.to_datetime('-400-01-02T03:04:05', format="%Y-%d-%m %H:%M:%S")

If you try to parse our BC date, you'll get this error:

ValueError: time data '-400-01-02T03:04:05' does not match format '%Y-%d-%m %H:%M:%S' (match).

The error message is a bit misleading: the format isn't the real problem, the date itself falls outside the supported range (and %Y can't parse negative years in any case). So to_datetime is out.

What if we populated our data frames directly with datetime64?

df = pd.DataFrame({'date': [np.datetime64('-400-01-02T03:04:05'), np.datetime64('2020-06-07T08:09:10')],
                   'value': [2, 3]})
df

date value
0 -400-01-02T03:04:05 2
1 2020-06-07T08:09:10 3

This works beautifully! Looks like we can get our BC dates into Pandas data frames after all.

So we can use datetime64 directly, we just don't have the convenience of assembling Timestamps straight from string column(s). Here's a utility function that converts a column of datetime64-compatible strings and puts the results back into the column.

def custom_to_datetime_from_date(df):
    df['date'] = df.apply(lambda row: np.datetime64(row['date']), axis=1)
    return df

test_df = pd.DataFrame({'date': ['-400-01-02T03:04:05', '2020-06-07T08:09:10'],
                   'value': [2, 3]})
print(test_df)

test_df = custom_to_datetime_from_date(test_df)
test_df

                  date  value
0  -400-01-02T03:04:05      2
1  2020-06-07T08:09:10      3
                  date  value
0  -400-01-02T03:04:05      2
1  2020-06-07T08:09:10      3

Another great use of to_datetime is its ability to assemble Timestamps from multiple columns, e.g. year, month, etc. Here's what a workaround for this could look like:

helper_df = pd.DataFrame({'column_names': ['month', 'day', 'hour', 'minute', 'second', 'ms', 'ns'],
                          'date_args': ['M', 'D', 'h', 'm', 's', 'ms', 'ns'],
                          'offsets': [1, 1, 0, 0, 0, 0, 0]})

print(helper_df)

def row_to_date_from_time_columns(row):
    if not 'year' in row:
        return None
    
    date = np.datetime64(row['year'], 'Y')
    
    for idx, r in helper_df.iterrows():
        name = r['column_names']
        if name in row:
            arg = r['date_args']
            date += np.timedelta64(int(row[name]), arg)  # int() so string-valued columns work too
            
            offset = r['offsets']
            if offset != 0:
                date -= np.timedelta64(offset, arg)
    
    return date

def custom_to_datetime_from_time_columns(df, drop_source_columns=False):
    df['date'] = df.apply(lambda row: row_to_date_from_time_columns(row), axis=1)
    
    if drop_source_columns:
        df = df.drop(helper_df['column_names'], axis=1, errors='ignore')
    
    return df

  column_names date_args  offsets
0        month         M        1
1          day         D        1
2         hour         h        0
3       minute         m        0
4       second         s        0
5           ms        ms        0
6           ns        ns        0

test_df = pd.DataFrame({'year': ['-400', '2020'],
                        'month': ['01', '02'],
                        'day': ['03', '04'],
                        'value': [2, 3]})
print(test_df)

test_df = custom_to_datetime_from_time_columns(test_df, True)
test_df

   year month day  value
0  -400    01  03      2
1  2020    02  04      3
   year  value        date
0  -400      2  -400-01-03
1  2020      3  2020-02-04

For efficiency's sake, I've moved the helper table outside the function. You should further adjust this by removing the components you don't need, but it should be a good starting point.

And finally, a combined version that converts the date column if it exists, or assembles time-related columns if it doesn't:

helper_df = pd.DataFrame({'column_names': ['month', 'day', 'hour', 'minute', 'second', 'ms', 'ns'],
                          'date_args': ['M', 'D', 'h', 'm', 's', 'ms', 'ns'],
                          'offsets': [1, 1, 0, 0, 0, 0, 0]})

def row_to_date(row):
    if 'date' in row:
        return np.datetime64(row['date'])
    
    if not 'year' in row:
        return None
    
    date = np.datetime64(row['year'], 'Y')
    
    for idx, r in helper_df.iterrows():
        name = r['column_names']
        if name in row:
            arg = r['date_args']
            date += np.timedelta64(int(row[name]), arg)  # int() so string-valued columns work too
            
            offset = r['offsets']
            if offset != 0:
                date -= np.timedelta64(offset, arg)
    
    return date

def custom_to_datetime(df, drop_source_columns=False):
    df['date'] = df.apply(lambda row: row_to_date(row), axis=1)
    
    if drop_source_columns:
        df = df.drop(helper_df['column_names'], axis=1, errors='ignore')
    
    return df

Great, so we've worked around that limitation in Pandas! However, this still doesn't solve the Timestamp limitation in DatetimeIndex and PeriodIndex.

Since we can now create columns of datetime64 from strings, maybe we can use the origin parameter of to_datetime and turn them into dates based on the Julian calendar? This would mean counting days from the Julian origin of 4713BC. Sounds like a good solution? Can we trick Timestamp into accepting a valid datetime64 and have it not modify it?

julian_origin = np.datetime64('-4713-01-01T12:00:00')
test_input = np.datetime64('-400-01-02')

# This will error out with:
# OutOfBoundsDatetime: 1574925 is Out of Bounds for origin='julian'
#pd.to_datetime((test_input - julian_origin).astype('timedelta64[D]').astype(int), unit='D', origin='julian')

Nope, doesn't work. There is simply no way to get Timestamp to extend its range, even if you pass in a Julian datetime64. Pandas takes that representation and converts it to datetime64[ns] which again only goes as far back as 1677AD.

Pandas gives you false hope by stating that you can provide a different origin to to_datetime like this: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01")). However, for that to work, the origin Timestamp needs to be valid in the first place. So we'd need a BC Timestamp as an origin point to get BC Timestamps, which we can't create because BC Timestamps aren't valid. Great. I guess we just can't use DatetimeIndex.

Pandas: Getters

Timestamp and DatetimeIndex have a great selection of component getters (as described here). Sadly, we can't use any of them since we can't use Timestamp/DatetimeIndex...

The best we can do is use the getters defined above as a starting point and write a per-component extractor utility that we can then apply to each row to get a new column.

def get_time_component(dt, comp):
    """
    Extract a time component (year, month, day, hour, minute, seconds, microsecond) from datetime64.

    Parameters
    ----------
    dt : datetime64
        datetime to extract from
    comp: string
        time component to extract

    Returns
    -------
    res : int32
        extracted time component
    """
    
    # list.index raises ValueError for unknown components, so check membership first
    comps = ['y', 'm', 'd', 'h', 'min', 's', 'ms']
    if comp not in comps:
        return None

    arr = dt2cal(dt)
    return arr[comps.index(comp)]

df = pd.DataFrame({'date': [np.datetime64('-400-01-02T03:04:05'), np.datetime64('2020-06-07T08:09:10')],
                   'value': [2, 3]})
                   'value': [2, 3]})

df['year'] = df.apply(lambda row: get_time_component(row['date'], 'y'), axis=1)
df
                  date  value  year
0  -400-01-02T03:04:05      2  -400
1  2020-06-07T08:09:10      3  2020

Pandas: Time spans

One would think that if you have a working column of datetime64 objects, applying timedelta64 operations to the whole column would just work. However, Pandas helpfully attempts to convert the column to Timestamps before applying the operation, which of course fails since our data is outside the Timestamp limits. Instead, we have to apply operations row by row, without the column shorthand.

(Again, note that a type cast is required to make the year-unit timedelta64, which gets that unit from its construction, compatible with the second-unit datetime64 column.)

test_df = pd.DataFrame({'date': ['-400-01-02T03:04:05', '2020-06-07T08:09:10'],
                   'value': [2, 3]})

test_df = custom_to_datetime(test_df)

sub_ts = np.timedelta64(100, 'Y')

# Pandas will attempt to convert the date column into a Timestamp and crashes as a result
#test_df['date'] -= sub_ts

test_df['date'] = test_df.apply(lambda row: row['date'] - sub_ts.astype('m8[s]'), axis=1)

test_df

Conclusion

You can make Pandas work with Numpy's 64-bit time classes, it just needs some workarounds and a careful approach to keep Pandas from accidentally attempting an automatic conversion. Pandas usually adds a lot of utility to extend Numpy's time handling, but since they're all tied into range-limited classes, we can't use most of them.

Some articles that discuss the advantages of Pandas' time handling can be found here and here, just so you know what you're missing out on.