pandas dataframe add missing date from range in a multi-dimensional structure with duplicate index
This solution demonstrates how to fill in the missing dates in a given range in a multi-index pandas dataframe. The complexity is added by the presence of duplicate dates in the given data, where the date is considered as an index.
In case you try to reindex a data frame with duplicate indexes, you will get the following error.
To resolve this situation and to achieve the end goal of refitting dataset with missing indexes, following pseudo code can be used.
The example python code is as follows
Example Input:
In case you try to reindex a data frame with duplicate indexes, you will get the following error.
ValueError: cannot reindex from a duplicate axis
To resolve this situation and to achieve the end goal of refitting dataset with missing indexes, following pseudo code can be used.
- Read multidimensional data into pandas dataframe (dataset), with date column as an index (only one index).
- Transform dataframe index created above into datetime index type
- Create a new dataframe (d) with the required date range, and value of other records as null
- Append 'd' into 'dataset'
- Set index of 'dataset' to include more column to create a multi-level index
- Reindex to 'dataset', and fill the desired value.
The example python code is as follows
import MySQLdb as db import pandas as pd import numpy as np date_range = pd.date_range(start='02-01-2018',end='03-21-2018',name='DAY') cnx = db.connect(user='*****', passwd='****', host='******', db='*******') dataset = pd.read_sql('select DAY,NODE,CLASS, TALLY FROM TABLE WHERE', con=cnx, index_col=['DAY']) cnx.close() dataset.index = pd.to_datetime(dataset.index) d = pd.DataFrame({"NODE":[np.nan],"CLASS":[np.nan],"TALLY":[np.nan]},index=date_range) d = dataset.append(d) d=d.set_index([d.index,'NODE','CLASS']) indices = pd.MultiIndex.from_product(d.index.levels) d = d.reindex(indices,fill_value=0)
Example Input:
DAY | NODE | CLASS | TALLY |
3-02-2018 | A | X | 12 |
3-02-2018 | B | Y | 2 |
4-02-2018 | A | X | 23 |
6-02-2018 | B | Y | 34 |
Required Output
DAY | NODE | CLASS | TALLY | |
3-02-2018 | A | X | 12 | |
3-02-2018 | A | Y | 0 | |
3-02-2018 | B | X | 0 | |
3-02-2018 | B | Y | 12 | |
4-02-2018 | A | X | 23 | |
4-02-2018 | A | Y | 0 | |
4-02-2018 | B | X | 0 | |
4-02-2018 | B | Y | 0 | |
5-02-2018 | A | X | 0 | |
5-02-2018 | A | Y | 0 | |
5-02-2018 | B | X | 0 | |
5-02-2018 | B | Y | 0 | |
6-02-2018 | A | X | 0 | |
6-02-2018 | A | Y | 0 | |
6-02-2018 | B | X | 0 | |
6-02-2018 | B | Y | 34 |
.......
Comments
Post a Comment