Tutorial#

Introduction#

This notebook will provide a step-by-step tutorial for using the caterpillard package to create the Caterpillar Diagram as proposed in the research article.

A Caterpillar Diagram is a visualization technique used for analyzing univariate time-series data. It consists of a series of colored circles with varying radii. The circle’s color represents the direction of change in the time-series data, and the circle’s size shows its variation.

It implements the innovative and intuitive Difference of Differences (DoD) approach to create a color schema. As proposed, it segregates the time-series data under analysis into a cohort of three consecutive time units. Further, it utilizes the unsigned differences between observations to assign a size to each cohort. This novel visualization technique can segregate the time-series data using seven colors or five stages of Aggressive, Ascent, Descent, Controlled, and Status Quo.

Further, the proposed mechanism utilizes the accumulated color information regarding each cohort to forecast the next step transition using a stationary matrix of Markov Chains.

[1]:
import numpy as np
import pandas as pd

from caterpillard import CaterpillarDiagram

Relative Analysis#

The Caterpillar Diagram can analyze a collection of time-series dataset together and produce a visualization for the user-defined data series out of them. This kind of analysis is termed here as relative analysis.

An example dataset is provided with the package which can be imported using from caterpillard import load_dataframe

[2]:
from caterpillard import load_dataframe

DataFrame info#

As shown in the output of info command above, the dataframe is a wide form dataset where each column stores the repeated response of each item (row) over the years. The package requires the dataframe in this format to create the diagram. This dataframe has 205 distinct entities in the index which will serve as an identification information for selecting the collection of time-series to work with. There are 49 columns in the dataset each reporting a numeric observation.

[3]:
df = load_dataframe()
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 4 to 499
Data columns (total 49 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1970    205 non-null    int64
 1   1971    205 non-null    int64
 2   1972    205 non-null    int64
 3   1973    205 non-null    int64
 4   1974    205 non-null    int64
 5   1975    205 non-null    int64
 6   1976    205 non-null    int64
 7   1977    205 non-null    int64
 8   1978    205 non-null    int64
 9   1979    205 non-null    int64
 10  1980    205 non-null    int64
 11  1981    205 non-null    int64
 12  1982    205 non-null    int64
 13  1983    205 non-null    int64
 14  1984    205 non-null    int64
 15  1985    205 non-null    int64
 16  1986    205 non-null    int64
 17  1987    205 non-null    int64
 18  1988    205 non-null    int64
 19  1989    205 non-null    int64
 20  1990    205 non-null    int64
 21  1991    205 non-null    int64
 22  1992    205 non-null    int64
 23  1994    205 non-null    int64
 24  1995    205 non-null    int64
 25  1996    205 non-null    int64
 26  1997    205 non-null    int64
 27  1998    205 non-null    int64
 28  1999    205 non-null    int64
 29  2000    205 non-null    int64
 30  2001    205 non-null    int64
 31  2002    205 non-null    int64
 32  2003    205 non-null    int64
 33  2004    205 non-null    int64
 34  2005    205 non-null    int64
 35  2006    205 non-null    int64
 36  2007    205 non-null    int64
 37  2008    205 non-null    int64
 38  2009    205 non-null    int64
 39  2010    205 non-null    int64
 40  2011    205 non-null    int64
 41  2012    205 non-null    int64
 42  2013    205 non-null    int64
 43  2014    205 non-null    int64
 44  2015    205 non-null    int64
 45  2016    205 non-null    int64
 46  2017    205 non-null    int64
 47  2018    205 non-null    int64
 48  2019    205 non-null    int64
dtypes: int64(49)
memory usage: 80.1 KB
None

Initialization#

The generation of the Caterpillar diagram requires a series of steps. The first step is to initialize the CaterpillarDiagram class with the required parameters, data, relative, and the optional output_path.

Select the sub-collection of data series from the data frame as the input for the data parameter and assign True to the relative parameter because we are performing the relative analysis.

Here, we select indexes (4, 19, 25, 92, 122, 129, 141, 153, 186) from the data frame for relative analysis.

[4]:
df_subset = df[df.index.isin([4, 19, 25, 92, 122, 129, 141, 153, 186])]
print(df_subset.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 4 to 186
Data columns (total 49 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1970    9 non-null      int64
 1   1971    9 non-null      int64
 2   1972    9 non-null      int64
 3   1973    9 non-null      int64
 4   1974    9 non-null      int64
 5   1975    9 non-null      int64
 6   1976    9 non-null      int64
 7   1977    9 non-null      int64
 8   1978    9 non-null      int64
 9   1979    9 non-null      int64
 10  1980    9 non-null      int64
 11  1981    9 non-null      int64
 12  1982    9 non-null      int64
 13  1983    9 non-null      int64
 14  1984    9 non-null      int64
 15  1985    9 non-null      int64
 16  1986    9 non-null      int64
 17  1987    9 non-null      int64
 18  1988    9 non-null      int64
 19  1989    9 non-null      int64
 20  1990    9 non-null      int64
 21  1991    9 non-null      int64
 22  1992    9 non-null      int64
 23  1994    9 non-null      int64
 24  1995    9 non-null      int64
 25  1996    9 non-null      int64
 26  1997    9 non-null      int64
 27  1998    9 non-null      int64
 28  1999    9 non-null      int64
 29  2000    9 non-null      int64
 30  2001    9 non-null      int64
 31  2002    9 non-null      int64
 32  2003    9 non-null      int64
 33  2004    9 non-null      int64
 34  2005    9 non-null      int64
 35  2006    9 non-null      int64
 36  2007    9 non-null      int64
 37  2008    9 non-null      int64
 38  2009    9 non-null      int64
 39  2010    9 non-null      int64
 40  2011    9 non-null      int64
 41  2012    9 non-null      int64
 42  2013    9 non-null      int64
 43  2014    9 non-null      int64
 44  2015    9 non-null      int64
 45  2016    9 non-null      int64
 46  2017    9 non-null      int64
 47  2018    9 non-null      int64
 48  2019    9 non-null      int64
dtypes: int64(49)
memory usage: 3.5 KB
None
[5]:
cd = CaterpillarDiagram(data=df_subset, relative=True, output_path="/tmp/tut_output/")

Data Summary#

Use the data_summary method available in the class object to get info regarding the input data for analysis

[6]:
cd.data_summary()
Summarizing Data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 4 to 186
Data columns (total 49 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   1970    9 non-null      int64
 1   1971    9 non-null      int64
 2   1972    9 non-null      int64
 3   1973    9 non-null      int64
 4   1974    9 non-null      int64
 5   1975    9 non-null      int64
 6   1976    9 non-null      int64
 7   1977    9 non-null      int64
 8   1978    9 non-null      int64
 9   1979    9 non-null      int64
 10  1980    9 non-null      int64
 11  1981    9 non-null      int64
 12  1982    9 non-null      int64
 13  1983    9 non-null      int64
 14  1984    9 non-null      int64
 15  1985    9 non-null      int64
 16  1986    9 non-null      int64
 17  1987    9 non-null      int64
 18  1988    9 non-null      int64
 19  1989    9 non-null      int64
 20  1990    9 non-null      int64
 21  1991    9 non-null      int64
 22  1992    9 non-null      int64
 23  1994    9 non-null      int64
 24  1995    9 non-null      int64
 25  1996    9 non-null      int64
 26  1997    9 non-null      int64
 27  1998    9 non-null      int64
 28  1999    9 non-null      int64
 29  2000    9 non-null      int64
 30  2001    9 non-null      int64
 31  2002    9 non-null      int64
 32  2003    9 non-null      int64
 33  2004    9 non-null      int64
 34  2005    9 non-null      int64
 35  2006    9 non-null      int64
 36  2007    9 non-null      int64
 37  2008    9 non-null      int64
 38  2009    9 non-null      int64
 39  2010    9 non-null      int64
 40  2011    9 non-null      int64
 41  2012    9 non-null      int64
 42  2013    9 non-null      int64
 43  2014    9 non-null      int64
 44  2015    9 non-null      int64
 45  2016    9 non-null      int64
 46  2017    9 non-null      int64
 47  2018    9 non-null      int64
 48  2019    9 non-null      int64
dtypes: int64(49)
memory usage: 3.5 KB

Generate Color Schema#

[7]:
cd.color_schema()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 423 entries, 0 to 46
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   d11         423 non-null    int64
 1   d12         423 non-null    int64
 2   d2          423 non-null    int64
 3   data_index  423 non-null    int64
 4   Cohort      423 non-null    object
 5   color       423 non-null    object
 6   level       423 non-null    object
 7   n_color     423 non-null    int64
dtypes: int64(5), object(3)
memory usage: 29.7+ KB

The color_schema method has implemented the DoD approach and created a data frame complete_cohort_df available as the instance attribute to the class object cd as shown below.

The column \(d_{11}\), \(d_{12}\) are the first differences and \(d_2\) is the second difference, respectively. The data_index column shows the index of the subset of the data frame that we selected earlier.

The Cohort column marks the cohort number for which the row of this data frame (complet_cohort_df) is storing the information. The columns color, level, and n_color stores the color and level information for the particular cohort as assigned by the DoD approach.

[8]:
cd.complete_cohort_df.head()
[8]:
d11 d12 d2 data_index Cohort color level n_color
0 0 0 0 4 Cohort1 grey level7 7
1 0 2 2 4 Cohort2 red level1 1
2 2 -2 -4 4 Cohort3 cyan level4 4
3 -2 0 2 4 Cohort4 blue level5 5
4 0 0 0 4 Cohort5 grey level7 7

Caterpillar Size#

The class object has a method named caterpillar_size which facilitates the size assignment to each cohort. Only four radii lengths can be assigned to a cohort based on the dispersion in the unsigned differences of consecutive observations in the time series. Please refer to the original article or source code for implementation details.

[9]:
cd.caterpillar_size()
Calculating sizes for each cohort

The above method will update the instance attribute complete_cohort_df with size information for each cohort.

[10]:
cd.complete_cohort_df.head()
[10]:
d11 d12 d2 data_index Cohort color level n_color d11_radius d12_radius final_cohort_radius
0 0 0 0 4 Cohort1 grey level7 7 4 4 4.0
1 0 2 2 4 Cohort2 red level1 1 4 4 4.0
2 2 -2 -4 4 Cohort3 cyan level4 4 4 4 4.0
3 -2 0 2 4 Cohort4 blue level5 5 4 4 4.0
4 0 0 0 4 Cohort5 grey level7 7 4 4 4.0

Since each cohort contains two first differences, \(d_{11}\) and \(d_{12}\), the package will assign the radius to each of them. The final radius of the cohort is the mean of these two radii stored in final_cohort_radius.

Caterpillar Diagram Generation#

The class object provides the generate method to create and store the caterpillar diagram of the user-defined entity from the collection of time-series data which was provided as input.

[11]:
# Caterpillar diagram for entity 92 based on relative analysis
# Showing last 11 cohorts out of available 47 cohorts
cd.generate(data_index=92, n_last_cohorts=11)
../_images/notebooks_tutorial_23_0.png
[12]:
# Caterpillar diagram for entity 153 based on relative analysis
# Showing last seven cohorts out of available 47 cohorts
cd.generate(data_index=153, n_last_cohorts=7)
../_images/notebooks_tutorial_24_0.png

Color Schema Transitions#

The schema_transitions method will count the occurrence of transition from a particular color in a cohort to another color in the next consecutive cohort and creates a transition matrix for applying Markov Chains.

The relative analysis will utilize the color transitions of all the entities under analysis to forecast the next state transition.

[13]:
cd.schema_transitions()
Finding transitions
[14]:
# Instance attribute transition_mat stores the matrix
cd.transition_mat
[14]:
red orange yellow cyan blue green grey
red 13 13 0 36 0 0 0
orange 10 1 0 16 0 0 0
yellow 21 13 0 25 1 0 1
cyan 0 0 40 0 28 8 1
blue 3 0 16 0 5 3 11
green 0 0 6 0 4 1 1
grey 15 0 0 0 0 0 130

For example, the matrix above shows that red has transitioned 36 times to the cyan color.

The above matrix facilitates creating the stationary matrix for the Markov chain.

Stationary Matrix#

[15]:
# n_sim_iter is the number of times the transition_mat
# will be multiplied by itself to reach a stationary state

cd.stationary_matrix(n_sim_iter=10**4)
Finding stationary matrix
100% (10000 of 10000) |##################| Elapsed Time: 0:00:12 Time:  0:00:12
[15]:
red orange yellow cyan blue green grey
red 0.148306 0.065532 0.150192 0.186501 0.092087 0.029069 0.328313
orange 0.148306 0.065532 0.150192 0.186501 0.092087 0.029069 0.328313
yellow 0.148306 0.065532 0.150192 0.186501 0.092087 0.029069 0.328313
cyan 0.148306 0.065532 0.150192 0.186501 0.092087 0.029069 0.328313
blue 0.148306 0.065532 0.150192 0.186501 0.092087 0.029069 0.328313
green 0.148306 0.065532 0.150192 0.186501 0.092087 0.029069 0.328313
grey 0.148306 0.065532 0.150192 0.186501 0.092087 0.029069 0.328313

The above matrix is the final stationary matrix for all the entities taken together as input for the current relative analysis.

Absolute Analysis#

The Caterpillar Diagram can analyze a single data series and produce a visualization. This kind of analysis is termed here as absolute analysis. The transition matrix of the single time series data will facilitate the forecast of the next-step transition based on this single data series only.

The caterpillard package provides the single data series as an example dataset with the load_series function.

[16]:
from caterpillard import load_series
[17]:
data_ser = load_series()
[18]:
print(data_ser)
1970       0
1971       0
1972       1
1973       0
1974       0
1975      13
1976       1
1977       1
1978       0
1979     134
1980      74
1981     100
1982     307
1983     445
1984    1110
1985     279
1986    1279
1987    2137
1988    4307
1989    3743
1990    4136
1991    5026
1992    4622
1994    1687
1995    1878
1996    2872
1997    4168
1998    1702
1999    2198
2000    3020
2001    3478
2002    3231
2003    2909
2004    2119
2005    2837
2006    4533
2007    3274
2008    4919
2009    4190
2010    4259
2011    3296
2012    2408
2013    3208
2014    3524
2015    3090
2016    3673
2017    3681
2018    3152
2019    2336
Name: 92, dtype: int64
[30]:
data_ser.index
[30]:
Index(['1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978',
       '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987',
       '1988', '1989', '1990', '1991', '1992', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006',
       '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015',
       '2016', '2017', '2018', '2019'],
      dtype='object')

Initialization#

Similar to the approach discussed for relative analysis, the CaterpillarDiagram class accepts three arguments, viz. data for providing Pandas Series. The parameter relative should be False to mark this as an Absolute analysis. The parameter output_path is optional. When undefined by the user, the package will create an output directory, caterpillard_output.

[19]:
cd_absolute = CaterpillarDiagram(data=data_ser, relative=False, output_path="/tmp/tut_output/")

Data Summary#

[20]:
cd_absolute.data_summary()
Summarizing Data

Generate Color Schema#

The color_schema method will evaluate the color for each cohort.

[22]:
cd_absolute.color_schema()

The complete_cohort_df has first and second differences for the input data series with the assigned color, level, and n_color for each cohort.

[23]:
cd_absolute.complete_cohort_df.head()
[23]:
d11 d12 d2 Cohort color level n_color
0 0.0 1.0 1.0 Cohort1 red level1 1
1 1.0 -1.0 -2.0 Cohort2 cyan level4 4
2 -1.0 0.0 1.0 Cohort3 blue level5 5
3 0.0 13.0 13.0 Cohort4 red level1 1
4 13.0 -12.0 -25.0 Cohort5 cyan level4 4

Caterpillar Size#

[24]:
cd_absolute.caterpillar_size()
Calculating sizes for each cohort

The caterpillar_size method will update the complete_cohort_df instance attribute.

[25]:
cd_absolute.complete_cohort_df.head()
[25]:
d11 d12 d2 Cohort color level n_color d11_radius d12_radius final_cohort_radius
0 0.0 1.0 1.0 Cohort1 red level1 1 2 2 2.0
1 1.0 -1.0 -2.0 Cohort2 cyan level4 4 2 2 2.0
2 -1.0 0.0 1.0 Cohort3 blue level5 5 2 2 2.0
3 0.0 13.0 13.0 Cohort4 red level1 1 2 2 2.0
4 13.0 -12.0 -25.0 Cohort5 cyan level4 4 2 2 2.0

Caterpillar Diagram Generation#

In the Absolute analysis, the generate method should not provide the data_index as it is a single data series. Further, if the data series is very long, then the user can provide the integer n_last_cohort to generate only the specified number of the cohort from the end.

[26]:
cd_absolute.generate(data_index=None, n_last_cohorts=7)
../_images/notebooks_tutorial_50_0.png

Color Schema Transitions#

The schema_transitions method will count the color change in the consecutive cohorts and create a matrix, as shown below. The generated matrix is available for the user by transition_mat as the instance attribute.

[27]:
cd_absolute.schema_transitions()
Finding transitions
[28]:
cd_absolute.transition_mat
[28]:
red orange yellow cyan blue green grey
red 1 2 0 7 0 0 0
orange 2 0 0 3 0 0 0
yellow 5 3 0 3 0 0 0
cyan 0 0 7 0 3 3 0
blue 1 0 1 0 0 1 0
green 0 0 3 0 0 1 0
grey 0 0 0 0 0 0 0

Stationary Matrix#

The next task is to calculate the stationary matrix of the Markov chain using the method stationary_matrix. The n_sim_iter parameter required in the method will run the search for the stationarity for the specified number of times.

[29]:
cd_absolute.stationary_matrix(n_sim_iter=10**4)
Finding stationary matrix
100% (10000 of 10000) |##################| Elapsed Time: 0:00:13 Time:  0:00:13
[29]:
red orange yellow cyan blue green grey
red 0.197381 0.107704 0.250169 0.271017 0.062542 0.111186 0.0
orange 0.197381 0.107704 0.250169 0.271017 0.062542 0.111186 0.0
yellow 0.197381 0.107704 0.250169 0.271017 0.062542 0.111186 0.0
cyan 0.197381 0.107704 0.250169 0.271017 0.062542 0.111186 0.0
blue 0.197381 0.107704 0.250169 0.271017 0.062542 0.111186 0.0
green 0.197381 0.107704 0.250169 0.271017 0.062542 0.111186 0.0
grey 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0

Miscellaneous#

Logging#

This package is configure with loggers. The user can take advantage of logging in their scripts.

Cite#

If you are utilizing this package in your work, then consider citing the same with the following citation:

DOI

    1. Singh and D. Philip, “An innovative color-coding scheme for terrorism threat advisory system,” Methodological Innovations, p. 20597991221144576, Dec. 2022, doi: 10.1177/20597991221144577.