Source Code Details#

Method definitions#

class caterpillar.CaterpillarDiagram(data, relative: bool, output_path=None)#

Main class for generating a Caterpillar Diagram and for subsequent forecasting

__init__(data, relative: bool, output_path=None) → None#

Constructor

The constructor stores the input data, a boolean relative flag that selects either a relative or an individual analysis, and an optional output directory for the generated data.

Variables:

data – input data
relative – boolean variable
output_path – path for writing output

Parameters:

data (Pandas Series or DataFrame) –

Univariate input data in wide format.

If the input is a DataFrame, each column must represent a point on the time axis (years, months, etc.) and each row a unique entity in the dataset. Refer to the example dataset or the tutorial section for further details.
relative (bool) – Boolean argument for executing a relative analysis or an individual analysis
output_path (str) –
User-defined path for output data.

When the user does not specify an output path, the constructor creates a caterpillard_output directory in the current working directory.

Return type:

None

data_summary()#

Initial data summary

This method provides an initial data summary by logging the head of the data and its length (the length of the transposed DataFrame, because the input is in wide format), and by calculating the length of the cohorts.

It also derives the number of cohorts from the number of columns in the wide-format input.

Variables:

data –
Pandas Series or Pandas DataFrame.

This method uses the self.data attribute set by caterpillar.CaterpillarDiagram.__init__()
n_cohorts –
int

n_cohorts stores the number of cohorts available in the input data as an integer

schema(data, out='color')#

Method for color assignment using proposed color schema.

This method assigns a color or a level based on the color schema proposed in the original article, available at https://doi.org/10.1177/20597991221144577. It returns exactly one of three values, selected by the out argument: the color, the level or the color number.

The proposed color schema is based on the combination of signs of \(d_{11}\), \(d_{12}\) and \(d_2\):

\(d_{11}\)	\(d_{12}\)	\(d_2\)	level	color	n_color
+	+	+	level1	red	1
0	+	+	level1	red	1
+	+	-	level2	orange	2
+	0	-	level2	orange	2
-	+	+	level3	yellow	3
+	-	-	level4	cyan	4
-	-	+	level5	blue	5
-	0	+	level5	blue	5
-	-	-	level6	green	6
0	-	-	level6	green	6
0	0	0	level7	grey	7

Parameters:

data (Pandas Series) – The input data series that contains three values of \(d_{11}\), \(d_{12}\) and \(d_2\)
out (str) – One of color, level or n_color; selects which value this method returns.

Returns:

color (str)
level (str)
n_color (int)
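
The schema table above can be read as a lookup keyed on the signs of the three differences. The following is a minimal sketch of that idea, not the package's implementation: the names SCHEMA, sign and lookup_schema are hypothetical, and returning None for a sign combination that the table does not list is an assumption.

```python
# Hypothetical sketch of the schema table as a sign-based lookup.
# SCHEMA, sign and lookup_schema are illustrative names, not package API.
SCHEMA = {
    ("+", "+", "+"): ("level1", "red", 1),
    ("0", "+", "+"): ("level1", "red", 1),
    ("+", "+", "-"): ("level2", "orange", 2),
    ("+", "0", "-"): ("level2", "orange", 2),
    ("-", "+", "+"): ("level3", "yellow", 3),
    ("+", "-", "-"): ("level4", "cyan", 4),
    ("-", "-", "+"): ("level5", "blue", 5),
    ("-", "0", "+"): ("level5", "blue", 5),
    ("-", "-", "-"): ("level6", "green", 6),
    ("0", "-", "-"): ("level6", "green", 6),
    ("0", "0", "0"): ("level7", "grey", 7),
}


def sign(value):
    """Map a number to '+', '-' or '0'."""
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "0"


def lookup_schema(d11, d12, d2, out="color"):
    """Return the color, level or n_color for one combination of differences."""
    entry = SCHEMA.get((sign(d11), sign(d12), sign(d2)))
    if entry is None:
        # Assumption: combinations absent from the table yield None.
        return None
    level, color, n_color = entry
    return {"level": level, "color": color, "n_color": n_color}[out]
```

For example, lookup_schema(-1.0, 0.0, 1.0, out="level") matches the row with signs -, 0, + in the table.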

color_schema()#

Generate the color schema using DoD

This method generates the color schema using the Difference of Differences (DoD) approach described in https://doi.org/10.1177/20597991221144577.

Input: A series of values or a Pandas DataFrame

Pre-processing: Missing values (NAs) in the input data are filled with zero

This method uses caterpillar.CaterpillarDiagram.schema() to assign a color and a level to the combination of first and second differences in each cohort.

It writes the complete cohort details to the filesystem at the given output_path.

Variables:

cohort_df –

Pandas DataFrame

Stores all cohort details like color of the cohort, level of the cohort, and the respective first and second differences for each cohort in a class variable
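
The text above does not spell out how the differences are formed, so the following sketch rests on stated assumptions: a cohort is taken to be three consecutive values, \(d_{11}\) and \(d_{12}\) are the two consecutive first differences, \(d_2 = d_{12} - d_{11}\), and cohorts overlap by sliding one step at a time. The function name difference_of_differences is hypothetical, and the package may define cohorts differently.

```python
import math


def difference_of_differences(values):
    """Return (d11, d12, d2) per cohort for one series of values.

    Assumptions (not confirmed by the documentation): a cohort is a sliding
    window of three consecutive values, and d2 = d12 - d11.
    """
    # Pre-processing as described above: missing values become zero.
    filled = [
        0 if v is None or (isinstance(v, float) and math.isnan(v)) else v
        for v in values
    ]
    cohorts = []
    for i in range(len(filled) - 2):
        d11 = filled[i + 1] - filled[i]      # first difference, first pair
        d12 = filled[i + 2] - filled[i + 1]  # first difference, second pair
        d2 = d12 - d11                       # second difference
        cohorts.append((d11, d12, d2))
    return cohorts
```

Under these assumptions a series of n values yields n - 2 cohorts, and each triple can then be passed to a sign-based color lookup.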

caterpillar_assign_radius(diff, quartiles_threshold)#

This function assigns a radius to each cohort in a Caterpillar Diagram.

Using the box-plot thresholds of the quartiles computed from the absolute first differences, this function assigns the radius as follows:

Radius of 2 units: \(\text{box-plot minimum} \le x < \text{first quartile}\)

Radius of 4 units: \(\text{first quartile} \le x < \text{second quartile}\)

Radius of 6 units: \(\text{second quartile} \le x < \text{third quartile}\)

Radius of 8 units: \(\text{third quartile} \le x\)

Parameters:

diff (float) – First-difference value, \(d_{11}\) or \(d_{12}\)
quartiles_threshold (dictionary) – Dictionary holding the box-plot statistics (keys and their threshold values) of the data

Returns:

radius – radius of the cohort

Return type:

int
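
The quartile rule above can be sketched as a chain of comparisons. This is an illustration only: the dictionary keys "q1", "q2" and "q3" and the function names are hypothetical (the real key names of quartiles_threshold are not documented here), the quartiles are computed with the standard library rather than with the package's own box-plot routine, and any value below the first quartile is given radius 2.

```python
import statistics


def quartile_thresholds(first_differences):
    """Compute quartiles of the absolute first differences (hypothetical helper)."""
    abs_values = [abs(d) for d in first_differences]
    q1, q2, q3 = statistics.quantiles(abs_values, n=4)
    return {"q1": q1, "q2": q2, "q3": q3}


def assign_radius(diff, thresholds):
    """Map one first difference to a radius of 2, 4, 6 or 8 units."""
    x = abs(diff)  # the rule is stated on absolute first differences
    if x < thresholds["q1"]:
        return 2
    if x < thresholds["q2"]:
        return 4
    if x < thresholds["q3"]:
        return 6
    return 8
```

A value equal to a quartile falls into the larger bucket, matching the \(\le\) on the left-hand side of each interval.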

caterpillar_size()#

This method assigns a size to each cohort of the Caterpillar Diagram based on the first differences \(d_{11}\) and \(d_{12}\), using caterpillar.CaterpillarDiagram.caterpillar_assign_radius().

A radius is assigned to \(d_{11}\) and \(d_{12}\) separately. Since a cohort comprises two first differences, the method takes the mean of the two radii as the final radius of the cohort.

The method writes the complete_cohort_df attribute to the filesystem at the given output_path.

Variables:

complete_cohort_df –

Pandas DataFrame

The complete_cohort_df is an instance attribute that contains the color and radius for each cohort
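
As a small worked example of the averaging step, the sketch below combines two per-difference radii into one cohort radius. The function name cohort_radius is hypothetical, and the arithmetic mean is assumed from the description above.

```python
def cohort_radius(radius_d11, radius_d12):
    """Final cohort radius: the mean of the radii assigned to d11 and d12."""
    return (radius_d11 + radius_d12) / 2
```

With radii of 2 and 8 units for the two first differences, the cohort is drawn with a radius of (2 + 8) / 2 = 5 units, so the final radius need not be one of the four base values.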

schema_transitions()#

This method collects the consecutive transitions between cohorts across the complete dataset

Variables:

transition_count –
dictionary

It contains the number of times a particular transition was observed between two consecutive years
transition_mat –
Pandas DataFrame

Stores the consecutive color transitions as a Pandas Dataframe
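
One way to picture the two variables is to count adjacent pairs of colors and then arrange the counts in a square table. The sketch below uses plain dictionaries rather than a Pandas DataFrame, and the function names count_transitions and transition_matrix are hypothetical.

```python
from collections import Counter


def count_transitions(colors):
    """Count how often each (from_color, to_color) pair occurs consecutively."""
    return Counter(zip(colors, colors[1:]))


def transition_matrix(transition_count, states):
    """Arrange the counts as a nested dict: matrix[from_color][to_color]."""
    return {
        src: {dst: transition_count.get((src, dst), 0) for dst in states}
        for src in states
    }
```

For the sequence red, red, orange, red, the pairs (red, red), (red, orange) and (orange, red) are each counted once.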

stationary_matrix(n_sim_iter=10000)#

This method generates the stationary transition matrix using a simulation approach

Variables:

trans_mat_prob –

Pandas DataFrame

Stores the probability of color transitions in a DataFrame

Parameters:

n_sim_iter (int) – Number of iterations for finding the stationary transition matrix

Returns:

stationary_mat_final_df – Final stationary matrix

Return type:

Pandas DataFrame
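
The documentation does not describe the simulation itself. A common way to approximate a stationary matrix, shown here purely as an assumption about what such a routine might do, is to convert the transition counts into row probabilities and multiply the probability matrix by itself repeatedly until the rows stop changing. All function names below are hypothetical, and plain lists stand in for the DataFrame.

```python
def row_normalize(counts):
    """Turn a matrix of transition counts into row-wise probabilities."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def approximate_stationary(prob, n_iter=100):
    """Repeatedly multiply by prob; rows converge for a well-behaved chain."""
    result = prob
    for _ in range(n_iter):
        result = matmul(result, prob)
    return result
```

For a two-state chain with probabilities [[0.9, 0.1], [0.5, 0.5]], every row of the result approaches (5/6, 1/6), which is the distribution that the chain settles into regardless of its starting state.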

generate(data_index=None, n_last_cohorts=None)#

This method fetches the specified data and creates the Caterpillar visualization. It computes the x-axis coordinates of the cohort circles and the start and end coordinates of the lines between consecutive cohort circles.

The method writes the Caterpillar image to the filesystem and exposes the figure object as an instance attribute for downstream use.

Variables:

lx_s – List of coordinates of start of lines in Caterpillar
lx_e – List of coordinates of end of lines in Caterpillar
cx – List of coordinates for center point of Caterpillar
caterpillar_fig – Caterpillar figure object

Parameters:

data_index (int) – Choose the row for which Caterpillar Diagram needs to be generated. In case of individual analysis, this parameter is not required.
n_last_cohorts (int) – Specify the number of last cohorts for which the Caterpillar Diagram needs to be generated
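
To illustrate how cx, lx_s and lx_e could relate to one another, the sketch below lays circles out from left to right and joins neighbors with a short line. The layout rule is an assumption made for illustration: the gap parameter and the function name layout_x are hypothetical, and the package may position its circles differently.

```python
def layout_x(radii, gap=1.0):
    """Compute circle centers and connecting-line endpoints along the x axis.

    Assumed layout: circles sit side by side, separated by `gap`, and each
    line runs from the right edge of one circle to the left edge of the next.
    """
    cx, lx_s, lx_e = [], [], []
    cursor = 0.0          # left edge of the circle being placed
    previous_right = None
    for r in radii:
        center = cursor + r
        cx.append(center)
        if previous_right is not None:
            lx_s.append(previous_right)  # line starts at previous right edge
            lx_e.append(center - r)      # line ends at this circle's left edge
        previous_right = center + r
        cursor = previous_right + gap
    return cx, lx_s, lx_e
```

Under this rule, n circles produce n - 1 connecting lines, each exactly gap units long.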