pandas dataframe add missing date from range in a multi-dimensional structure with duplicate index

This solution demonstrates how to fill in the missing dates in a given range in a multi-index pandas dataframe. The complexity is added by the presence of duplicate dates in the given data, where the date is considered as an index.
In case you try to reindex a data frame with duplicate indexes, you will get the following error.

ValueError: cannot reindex from a duplicate axis

To resolve this situation and to achieve the end goal of refitting dataset with missing indexes, following pseudo code can be used.


  1. Read multidimensional data into pandas dataframe (dataset), with date column as an index (only one index).
  2. Transform dataframe index created above into datetime index type
  3. Create a new dataframe (d) with the required date range, and value of other records as null
  4. Append 'd' into 'dataset'
  5. Set index of 'dataset' to include more column to create a multi-level index
  6. Reindex to 'dataset', and fill the desired value.

The example python code is as follows

import MySQLdb as db
import pandas as pd
import numpy as np

date_range = pd.date_range(start='02-01-2018',end='03-21-2018',name='DAY')
cnx = db.connect(user='*****', passwd='****',
                              host='******',
                              db='*******')
dataset = pd.read_sql('select DAY,NODE,CLASS, TALLY FROM TABLE WHERE', con=cnx, index_col=['DAY'])
cnx.close()

dataset.index = pd.to_datetime(dataset.index)

d = pd.DataFrame({"NODE":[np.nan],"CLASS":[np.nan],"TALLY":[np.nan]},index=date_range)

d = dataset.append(d)

d=d.set_index([d.index,'NODE','CLASS'])

indices = pd.MultiIndex.from_product(d.index.levels)

d = d.reindex(indices,fill_value=0)

Example Input:

DAY NODE CLASS TALLY
3-02-2018 A X 12
3-02-2018 B Y 2
4-02-2018 A X 23
6-02-2018 B Y 34

Required Output

DAY NODE CLASS TALLY
3-02-2018 A X 12
3-02-2018 A Y 0
3-02-2018 B X 0
3-02-2018 B Y 12
4-02-2018 A X 23
4-02-2018 A Y 0
4-02-2018 B X 0
4-02-2018 B Y 0
5-02-2018 A X 0
5-02-2018 A Y 0
5-02-2018 B X 0
5-02-2018 B Y 0
6-02-2018 A X 0
6-02-2018 A Y 0
6-02-2018 B X 0
6-02-2018 B Y 34
.......

Comments

Popular posts from this blog

Caused by: java.sql.SQLTimeoutException: ORA-01013: user requested cancel of current operation

HashiCorp Vault Integration with Ansible Etower using approle

utility to extract date from text with java