Source Code Details
Method definitions
- class caterpillar.CaterpillarDiagram(data, relative: bool, output_path=None)
Main class for generating a Caterpillar Diagram and for subsequent forecasting
- __init__(data, relative: bool, output_path=None) → None
Constructor
The class constructor initializes the data variable for input and a boolean relative variable that selects either relative or individual analysis, and lets users choose an output directory for storing the generated data.
- Variables:
data – input data
relative – boolean variable
output_path – path for writing output
- Parameters:
data (Pandas Series or DataFrame) –
Univariate data in wide format, used as input for the package.
If the input is a DataFrame, the columns must represent the time axis (years, months, etc.) and each row must represent a unique entity in the dataset. Refer to the example dataset or the tutorial section for further details.
relative (bool) – Boolean flag that selects either a relative analysis or an individual analysis
output_path (str) –
User-defined path for output data.
When the user does not specify an output path, the constructor creates a caterpillard_output directory in the current working directory.
- Return type:
None
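The wide-format layout described above can be illustrated with a small sketch. The entity names, years, and values are invented for illustration, and the constructor call is left commented out because it needs the caterpillar package installed; which value of relative suits which input is not stated in this section, so the value shown is only a placeholder.

```python
import pandas as pd

# Hypothetical wide-format input: one row per entity, one column per time point.
wide = pd.DataFrame(
    {
        2018: [10.0, 7.5, 3.2],
        2019: [12.0, 7.0, 3.9],
        2020: [15.0, 6.0, 4.1],
        2021: [15.5, 6.5, 4.0],
    },
    index=["entity_a", "entity_b", "entity_c"],
)

# A single entity taken from the table is a pandas Series indexed by time.
single = wide.loc["entity_a"]

# Constructor call following the signature documented above
# (commented out: requires the caterpillar package; relative=True is a placeholder).
# from caterpillar import CaterpillarDiagram
# diagram = CaterpillarDiagram(data=wide, relative=True, output_path="output")
```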
- data_summary()
Initial data summary
This method provides an initial data summary by logging the head of the data and the length of the data (the length of the transposed DataFrame, since the input is in wide format), and by calculating the length of cohorts.
This method also evaluates the number of cohorts based on the number of columns in the wide-format input data.
- Variables:
data –
Pandas Series or Pandas DataFrame.
This method uses the self.data variable instantiated by caterpillar.CaterpillarDiagram.__init__()
n_cohorts –
int
n_cohorts stores the number of cohorts possible in the input data as an integer
- schema(data, out='color')
Method for color assignment using the proposed color schema.
This method assigns a color or a level based on the color schema proposed in the original article, available at https://doi.org/10.1177/20597991221144577. The method returns exactly one of three values: the color, the level, or the color number.
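The following is the proposed color schema based on the combination of \(d_{11}\), \(d_{12}\) and \(d_2\).
| \(d_{11}\) | \(d_{12}\) | \(d_2\) | level | color | n_color |
| --- | --- | --- | --- | --- | --- |
| + | + | + | level1 | red | 1 |
| 0 | + | + | level1 | red | 1 |
| + | + | - | level2 | orange | 2 |
| + | 0 | - | level2 | orange | 2 |
| - | + | + | level3 | yellow | 3 |
| + | - | - | level4 | cyan | 4 |
| - | - | + | level5 | blue | 5 |
| - | 0 | + | level5 | blue | 5 |
| - | - | - | level6 | green | 6 |
| 0 | 0 | 0 | level7 | grey | 7 |
| 0 | - | - | level6 | green | 6 |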
- Parameters:
data (Pandas Series) – The input data series that contains three values of \(d_{11}\), \(d_{12}\) and \(d_2\)
out (str) – One of color, level, or n_color; selects which value this method returns.
- Returns:
color (str)
level (str)
n_color (int)
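The schema listed above amounts to a lookup on the signs of the three differences. A minimal sketch follows, assuming the signs are taken directly from the numeric values; the names SCHEMA, sign and lookup_schema are hypothetical and not part of the package, and sign combinations that the schema does not list return None here.

```python
# Hypothetical lookup table transcribed from the schema above.
# Keys are the signs of (d11, d12, d2); values are (level, color, n_color).
SCHEMA = {
    ("+", "+", "+"): ("level1", "red", 1),
    ("0", "+", "+"): ("level1", "red", 1),
    ("+", "+", "-"): ("level2", "orange", 2),
    ("+", "0", "-"): ("level2", "orange", 2),
    ("-", "+", "+"): ("level3", "yellow", 3),
    ("+", "-", "-"): ("level4", "cyan", 4),
    ("-", "-", "+"): ("level5", "blue", 5),
    ("-", "0", "+"): ("level5", "blue", 5),
    ("-", "-", "-"): ("level6", "green", 6),
    ("0", "-", "-"): ("level6", "green", 6),
    ("0", "0", "0"): ("level7", "grey", 7),
}


def sign(value):
    """Return '+', '-' or '0' for a numeric value."""
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "0"


def lookup_schema(d11, d12, d2, out="color"):
    """Return the color, level or n_color for one cohort (illustrative only)."""
    entry = SCHEMA.get((sign(d11), sign(d12), sign(d2)))
    if entry is None:
        return None  # combination not listed in the schema
    level, color, n_color = entry
    return {"level": level, "color": color, "n_color": n_color}[out]
```

For instance, lookup_schema(-1.0, 0.0, 1.0, out="level") selects the "-, 0, +" row of the schema.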
- color_schema()
Generate the color schema using DoD
This method generates the color schema using the Difference of Differences (DoD) approach, as explained in https://doi.org/10.1177/20597991221144577
Input: A series of values or a Pandas DataFrame
Pre-processing: NAs in the input data are filled with zero
This method uses caterpillar.CaterpillarDiagram.schema() to assign a color and level to the combination of first and second differences in each cohort.
Writes complete cohort details to the filesystem at the given output_path
- Variables:
cohort_df –
Pandas DataFrame
Stores all cohort details like color of the cohort, level of the cohort, and the respective first and second differences for each cohort in a class variable
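The first and second differences can be sketched as below. This rests on assumptions not confirmed in this section: each cohort is taken to span three consecutive observations, with \(d_{11}\) and \(d_{12}\) as the two successive first differences and \(d_2 = d_{12} - d_{11}\). The function name is hypothetical; the cited article is the authority on the exact definition.

```python
import math


def difference_of_differences(values):
    """Return a list of (d11, d12, d2) tuples, one per assumed cohort."""
    # Pre-processing: missing values (None or NaN) are filled with zero.
    clean = [
        0.0 if v is None or (isinstance(v, float) and math.isnan(v)) else float(v)
        for v in values
    ]
    cohorts = []
    # Assumption: a cohort is a sliding window of three consecutive observations.
    for i in range(len(clean) - 2):
        d11 = clean[i + 1] - clean[i]      # first difference, earlier pair
        d12 = clean[i + 2] - clean[i + 1]  # first difference, later pair
        d2 = d12 - d11                     # second difference (difference of differences)
        cohorts.append((d11, d12, d2))
    return cohorts
```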
- caterpillar_assign_radius(diff, quartiles_threshold)
This function assigns a radius to each cohort in a Caterpillar Diagram.
Using the box-plot quartile thresholds computed from the absolute first differences, this function assigns the radius as follows:
Radius of 2 units: \(\text{box-plot minimum} \le x < \text{first quartile}\)
Radius of 4 units: \(\text{first quartile} \le x < \text{second quartile}\)
Radius of 6 units: \(\text{second quartile} \le x < \text{third quartile}\)
Radius of 8 units: \(\text{third quartile} \le x\)
- Parameters:
diff (float) – First-difference value, \(d_{11}\) or \(d_{12}\)
quartiles_threshold (dictionary) – Dictionary containing the keys and values of the box plot of the data
- Returns:
radius – radius of the cohort
- Return type:
int
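The radius rule above could be sketched as follows. The dictionary keys "q1", "q2" and "q3" are hypothetical, since the actual keys of quartiles_threshold are not documented here, and values below the box-plot minimum are assumed to fall into the smallest radius class.

```python
def assign_radius(diff, quartiles_threshold):
    """Map one first difference to a radius of 2, 4, 6 or 8 units (illustrative)."""
    x = abs(diff)  # thresholds are derived from absolute first differences
    if x < quartiles_threshold["q1"]:
        return 2  # assumption: anything below the first quartile gets the smallest radius
    if x < quartiles_threshold["q2"]:
        return 4
    if x < quartiles_threshold["q3"]:
        return 6
    return 8
```

With thresholds {"q1": 1.0, "q2": 2.0, "q3": 3.0}, a difference of -1.0 has absolute value 1.0, which sits in the second band and yields a radius of 4.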
- caterpillar_size()
This method assigns a size to each cohort of the caterpillar diagram based on the first differences \(d_{11}\) and \(d_{12}\), using caterpillar.CaterpillarDiagram.caterpillar_assign_radius().
The method assigns a radius to \(d_{11}\) and \(d_{12}\) separately. Since a cohort comprises two first differences, \(d_{11}\) and \(d_{12}\), the method takes the mean of the two radii as the final radius for the cohort.
The method writes the complete_cohort_df attribute to the filesystem at the given output_path
- Variables:
complete_cohort_df –
Pandas DataFrame
The complete_cohort_df is an instance attribute that contains the color and radius for each cohort
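Averaging the two radii might look like the sketch below. How the package derives its box-plot thresholds is not documented here, so the use of statistics.quantiles from the standard library is an assumption, and all function names are hypothetical.

```python
from statistics import quantiles


def quartile_thresholds(abs_diffs):
    """Return (q1, q2, q3) of the absolute first differences (assumed method)."""
    q1, q2, q3 = quantiles(abs_diffs, n=4)
    return q1, q2, q3


def radius_for(diff, thresholds):
    """Radius of 2, 4, 6 or 8 units for a single first difference."""
    q1, q2, q3 = thresholds
    x = abs(diff)
    if x < q1:
        return 2
    if x < q2:
        return 4
    if x < q3:
        return 6
    return 8


def cohort_radius(d11, d12, thresholds):
    """Final cohort radius: the mean of the radii assigned to d11 and d12."""
    return (radius_for(d11, thresholds) + radius_for(d12, thresholds)) / 2
```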
- schema_transitions()
This method collects the consecutive transitions between cohorts across the complete dataset
- Variables:
transition_count –
dictionary
It contains the number of times a particular transition was observed between two consecutive years
transition_mat –
Pandas DataFrame
Stores the consecutive color transitions as a Pandas Dataframe
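Counting consecutive color transitions can be sketched with the standard library alone. The function names and the list-of-lists matrix are illustrative; the package itself stores the result as a Pandas DataFrame.

```python
from collections import Counter


def count_transitions(colors):
    """Count how often each (from_color, to_color) pair occurs consecutively."""
    return Counter(zip(colors, colors[1:]))


def transition_matrix(counts, states):
    """Arrange the counts as a square matrix: rows are 'from', columns are 'to'."""
    return [[counts.get((a, b), 0) for b in states] for a in states]
```

For the sequence red, red, orange, red, orange, the pair (red, orange) is counted twice and (orange, orange) never occurs.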
- stationary_matrix(n_sim_iter=10000)
This method generates the stationary transition matrix using a simulation approach
- Variables:
trans_mat_prob –
Pandas DataFrame
Stores the probability of color transitions in a DataFrame
- Parameters:
n_sim_iter (int) – Number of iterations for finding the stationary transition matrix
- Returns:
stationary_mat_final_df – Final stationary matrix
- Return type:
Pandas DataFrame
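One common way to approximate a stationary matrix is to multiply the transition probability matrix by itself repeatedly until it stops changing. The sketch below uses that technique, which may differ from the simulation the package actually runs; the function names are hypothetical, and every row of the count matrix is assumed to contain at least one observed transition.

```python
def row_normalize(counts):
    """Turn a matrix of transition counts into row-wise probabilities."""
    # Assumption: every row has at least one observed transition (non-zero sum).
    return [[value / sum(row) for value in row] for row in counts]


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def stationary_by_powers(prob, n_sim_iter=10000, tol=1e-12):
    """Repeatedly multiply by prob until successive powers agree within tol."""
    current = prob
    for _ in range(n_sim_iter):
        nxt = matmul(current, prob)
        change = max(
            abs(nxt[i][j] - current[i][j])
            for i in range(len(prob))
            for j in range(len(prob))
        )
        current = nxt
        if change < tol:
            break
    return current
```

When the chain has a unique stationary distribution, every row of the returned matrix approaches that distribution.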
- generate(data_index=None, n_last_cohorts=None)
This method fetches the specified data and creates the caterpillar visualization. It evaluates the x-axis coordinates of the cohort circles and the start and end coordinates of the lines between cohort circles.
This method writes the Caterpillar image to the filesystem and exposes the figure object as an instance attribute for any downstream use.
- Variables:
lx_s – List of coordinates of start of lines in Caterpillar
lx_e – List of coordinates of end of lines in Caterpillar
cx – List of coordinates for center point of Caterpillar
caterpillar_fig – Caterpillar figure object
- Parameters:
data_index (int) – Choose the row for which Caterpillar Diagram needs to be generated. In case of individual analysis, this parameter is not required.
n_last_cohorts (int) – Specify the number of last cohorts for which the Caterpillar Diagram needs to be generated
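The coordinate lists described above could be derived along the lines of the sketch below. The layout rule is purely an assumption for illustration (circles placed left to right, joined by connecting lines of fixed length), and the function name and gap parameter are hypothetical; only the names cx, lx_s and lx_e come from this section.

```python
def layout_caterpillar(radii, gap=1.0):
    """Return (cx, lx_s, lx_e) for circles laid out left to right (illustrative)."""
    cx, lx_s, lx_e = [], [], []
    for i, r in enumerate(radii):
        r = float(r)
        if i == 0:
            center = r  # the first circle touches x = 0 on its left edge
        else:
            start = cx[-1] + float(radii[i - 1])  # right edge of the previous circle
            end = start + gap                     # connecting line of fixed length
            lx_s.append(start)
            lx_e.append(end)
            center = end + r                      # the next circle begins where the line ends
        cx.append(center)
    return cx, lx_s, lx_e
```

Under this rule there is always one fewer connecting line than there are circles.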