Source Code Details

Method definitions

class caterpillar.CaterpillarDiagram(data, relative: bool, output_path=None)

Main class for generating a Caterpillar Diagram and for the subsequent forecasting.

__init__(data, relative: bool, output_path=None) → None

Constructor

The class constructor stores the input data, a boolean relative flag that selects either a relative or an individual analysis, and an optional output directory where the generated data is written.

Variables:
  • data – input data

  • relative – boolean variable

  • output_path – path for writing output

Parameters:
  • data (Pandas Series or DataFrame) –

    Univariate data in wide format, used as the input for the package.

    If the input is a DataFrame, the columns must represent the time axis (years, months, etc.) and each row must represent a unique entity in the dataset. Refer to the example dataset or the tutorial section for further details.

  • relative (bool) – Boolean argument for executing a relative analysis or an individual analysis

  • output_path (str) –

    User-defined path for output data.

    When the user does not specify an output path, the constructor creates a caterpillard_output directory in the current working directory.

Return type:

None

data_summary()

Initial data summary

This method provides an initial data summary by logging the head of the data, the length of the data (the length of the transposed DataFrame, because the input is in wide format) and the length of the cohorts.

It also determines the number of cohorts from the number of columns in the wide-format input data.

Variables:
  • data

    Pandas Series or Pandas DataFrame.

    This method uses the self.data variable set by caterpillar.CaterpillarDiagram.__init__().

  • n_cohorts

    int

    n_cohorts stores the number of cohorts possible in the input data as an integer.

schema(data, out='color')

Method for color assignment using proposed color schema.

This method assigns a color or a level based on the color schema proposed in the original article, available at https://doi.org/10.1177/20597991221144577. The method returns exactly one of three values: the color, the level or the color number.

The proposed color schema, based on the combination of the signs of \(d_{11}\), \(d_{12}\) and \(d_2\), is as follows.

\(d_{11}\)   \(d_{12}\)   \(d_2\)   level    color    n_color
----------   ----------   -------   ------   ------   -------
+            +            +         level1   red      1
0            +            +         level1   red      1
+            +            -         level2   orange   2
+            0            -         level2   orange   2
-            +            +         level3   yellow   3
+            -            -         level4   cyan     4
-            -            +         level5   blue     5
-            0            +         level5   blue     5
-            -            -         level6   green    6
0            0            0         level7   grey     7
0            -            -         level6   green    6

Parameters:
  • data (Pandas Series) – The input data series that contains three values of \(d_{11}\), \(d_{12}\) and \(d_2\)

  • out (str) – Selects the value returned by this method. Accepted values are color, level and n_color.

Returns:

  • color (str)

  • level (str)

  • n_color (int)
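The schema table can be read as a lookup keyed on the signs of the three differences. The following is a minimal sketch of that idea, not the package's own implementation: the names SCHEMA, sign and lookup_schema are hypothetical, and sign combinations that the table does not list simply raise a KeyError here.

```python
# Illustrative lookup built from the schema table; names are hypothetical.
SCHEMA = {
    ("+", "+", "+"): ("level1", "red", 1),
    ("0", "+", "+"): ("level1", "red", 1),
    ("+", "+", "-"): ("level2", "orange", 2),
    ("+", "0", "-"): ("level2", "orange", 2),
    ("-", "+", "+"): ("level3", "yellow", 3),
    ("+", "-", "-"): ("level4", "cyan", 4),
    ("-", "-", "+"): ("level5", "blue", 5),
    ("-", "0", "+"): ("level5", "blue", 5),
    ("-", "-", "-"): ("level6", "green", 6),
    ("0", "-", "-"): ("level6", "green", 6),
    ("0", "0", "0"): ("level7", "grey", 7),
}


def sign(value):
    """Map a number to '+', '-' or '0'."""
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "0"


def lookup_schema(d11, d12, d2, out="color"):
    """Return the color, level or n_color for one cohort.

    Raises KeyError for a sign combination missing from the table
    or for an unknown value of `out`.
    """
    level, color, n_color = SCHEMA[(sign(d11), sign(d12), sign(d2))]
    return {"level": level, "color": color, "n_color": n_color}[out]
```

For example, lookup_schema(2, 3, 1) falls in the first row of the table, while lookup_schema(2, 0, -2, out="level") falls in the fourth.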

color_schema()

Generate the color schema using DoD

This method generates the color schema using the Difference of Differences (DoD) approach described in https://doi.org/10.1177/20597991221144577.

Input: A series of values or a Pandas DataFrame

Pre-processing: missing values (NA) in the input data are filled with zero

This method utilizes caterpillar.CaterpillarDiagram.schema() to assign a color and level to the combination of first and second differences in each cohort.

The method writes the complete cohort details to the filesystem at the given output_path.

Variables:

cohort_df

Pandas DataFrame

Stores all cohort details like color of the cohort, level of the cohort, and the respective first and second differences for each cohort in a class variable
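The computation of the first and second differences can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: it assumes a cohort is a sliding window of three consecutive observations, that \(d_{11}\) and \(d_{12}\) are the two first differences inside the window, and that \(d_2 = d_{12} - d_{11}\) (which agrees with the sign combinations in the schema table). The function name difference_of_differences is hypothetical, and None stands in for a missing value.

```python
def difference_of_differences(values):
    """Return (d11, d12, d2) for each window of three consecutive values.

    Assumptions (not confirmed by the package documentation):
    cohorts overlap and advance by one observation at a time,
    and d2 is the difference of the two first differences.
    """
    # Mirror the documented pre-processing: missing values become zero.
    filled = [0 if v is None else v for v in values]

    cohorts = []
    for i in range(len(filled) - 2):
        x1, x2, x3 = filled[i:i + 3]
        d11 = x2 - x1          # first difference, first half of the cohort
        d12 = x3 - x2          # first difference, second half of the cohort
        d2 = d12 - d11         # second difference (difference of differences)
        cohorts.append((d11, d12, d2))
    return cohorts
```

Under these assumptions a series with n observations yields n - 2 cohorts, and each resulting triple can be passed to the schema lookup to obtain a color or level.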

caterpillar_assign_radius(diff, quartiles_threshold)

This function assigns a radius to each cohort in a Caterpillar Diagram.

Using the box-plot thresholds (minimum and quartiles) computed from the absolute first differences, the function assigns the radius as follows:

Radius of 2 units: \(\text{box-plot minimum} \le x < \text{first quartile}\)

Radius of 4 units: \(\text{first quartile} \le x < \text{second quartile}\)

Radius of 6 units: \(\text{second quartile} \le x < \text{third quartile}\)

Radius of 8 units: \(\text{third quartile} \le x\)

Parameters:
  • diff (float) – First difference value, \(d_{11}\) or \(d_{12}\)

  • quartiles_threshold (dictionary) – Dictionary that contains the keys and threshold values of the box-plot of the data

Returns:

radius – radius of the cohort

Return type:

int
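The thresholding rule above can be sketched as a chain of comparisons. This is a hedged illustration only: the dictionary keys "q1", "q2" and "q3" are hypothetical (the actual keys of quartiles_threshold are not documented here), the absolute value is taken inside the function, and values below the first quartile, including any below the box-plot minimum, are all given a radius of 2.

```python
def assign_radius(diff, thresholds):
    """Map one first difference to a radius of 2, 4, 6 or 8 units.

    `thresholds` is assumed to hold the quartiles of the absolute
    first differences under the hypothetical keys 'q1', 'q2', 'q3'.
    """
    x = abs(diff)  # the rule is stated for absolute first differences
    if x >= thresholds["q3"]:
        return 8
    if x >= thresholds["q2"]:
        return 6
    if x >= thresholds["q1"]:
        return 4
    # Everything below the first quartile (assumption: also values
    # below the box-plot minimum) gets the smallest radius.
    return 2
```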

caterpillar_size()

This method assigns a size to each cohort of the Caterpillar Diagram based on the first differences \(d_{11}\) and \(d_{12}\), using caterpillar.CaterpillarDiagram.caterpillar_assign_radius().

The method assigns a radius to \(d_{11}\) and \(d_{12}\) separately. Since a cohort consists of two first differences, \(d_{11}\) and \(d_{12}\), the method takes the mean of the two radii and uses it as the final radius of the cohort.

The method writes the complete_cohort_df attribute to the filesystem at the given output_path.

Variables:

complete_cohort_df

Pandas DataFrame

The complete_cohort_df is an instance attribute that contains the color and radius for each cohort
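The averaging step can be illustrated with a small, self-contained sketch. It is not the package's implementation: the helper names are hypothetical, and the quartiles are computed with statistics.quantiles using the inclusive method, which is an assumption; the package may derive its box-plot thresholds differently.

```python
from statistics import mean, quantiles


def _radius(x, q1, q2, q3):
    """Radius of 2, 4, 6 or 8 units for an absolute first difference."""
    if x >= q3:
        return 8
    if x >= q2:
        return 6
    if x >= q1:
        return 4
    return 2


def cohort_radii(first_differences):
    """Return the mean radius per cohort.

    `first_differences` is a list of (d11, d12) pairs, one per cohort.
    Quartiles are taken over all absolute first differences (assumption).
    """
    abs_diffs = [abs(d) for pair in first_differences for d in pair]
    q1, q2, q3 = quantiles(abs_diffs, n=4, method="inclusive")
    return [
        mean((_radius(abs(d11), q1, q2, q3), _radius(abs(d12), q1, q2, q3)))
        for d11, d12 in first_differences
    ]
```

With the pairs (1, -2), (3, 4) and (-5, 1), the quartiles are 1.25, 2.5 and 3.75, so the first cohort averages radii 2 and 4, the second 6 and 8, and the third 8 and 2.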

schema_transitions()

This method collects the consecutive transitions between cohorts across the complete dataset.

Variables:
  • transition_count

    dictionary

    It contains the number of times a particular transition was observed between two consecutive years

  • transition_mat

    Pandas DataFrame

    Stores the consecutive color transitions as a Pandas Dataframe
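Counting consecutive transitions amounts to pairing each cohort's color with the color of the next cohort. The sketch below shows one way to do that with the standard library; the function names are hypothetical, and the package stores its result in a Pandas DataFrame rather than in the nested dictionary used here.

```python
from collections import Counter


def count_transitions(colors):
    """Count how often each (from_color, to_color) pair occurs in sequence."""
    return Counter(zip(colors, colors[1:]))


def transition_matrix(counts, states):
    """Arrange the counts as a nested dict: matrix[from_color][to_color]."""
    return {
        src: {dst: counts.get((src, dst), 0) for dst in states}
        for src in states
    }
```

For a single entity with colors red, red, orange, red, the pairs (red, red), (red, orange) and (orange, red) are each counted once. For a dataset with several rows, the counters of the individual rows can be added together.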

stationary_matrix(n_sim_iter=10000)

This method generates the stationary transition matrix using a simulation approach.

Variables:

trans_mat_prob

Pandas DataFrame

Stores the probability of color transitions in a DataFrame

Parameters:

n_sim_iter (int) – Number of iterations for finding the stationary transition matrix

Returns:

stationary_mat_final_df – Final stationary Matrix

Return type:

Pandas DataFrame
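One common way to approximate a stationary matrix is to normalize the transition counts row by row and then multiply the resulting probability matrix by itself repeatedly. The sketch below uses that repeated-multiplication technique as a stand-in; it is an assumption that this resembles the package's simulation, and the function names are hypothetical. Convergence to identical rows is only expected for a chain that is irreducible and aperiodic.

```python
def row_normalize(counts):
    """Turn a square matrix of transition counts into row probabilities."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def approximate_stationary(prob, n_iter=1000):
    """Multiply `prob` by itself n_iter times (power iteration sketch)."""
    result = prob
    for _ in range(n_iter):
        result = matmul(result, prob)
    return result
```

For the two-state matrix with rows (0.9, 0.1) and (0.5, 0.5), every row of the result approaches (5/6, 1/6), which is the distribution that satisfies \(\pi = \pi P\).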

generate(data_index=None, n_last_cohorts=None)

This method fetches the specified data and creates the caterpillar visualization. It computes the X-axis coordinates of the cohort circles and the start and end coordinates of the lines between consecutive cohort circles.

The method writes the Caterpillar image to the filesystem and exposes the figure object as an instance attribute for any downstream use.

Variables:
  • lx_s – List of coordinates of start of lines in Caterpillar

  • lx_e – List of coordinates of end of lines in Caterpillar

  • cx – List of coordinates for center point of Caterpillar

  • caterpillar_fig – Caterpillar figure object

Parameters:
  • data_index (int) – Row for which the Caterpillar Diagram is generated. This parameter is not required for an individual analysis.

  • n_last_cohorts (int) – Number of most recent cohorts to include in the Caterpillar Diagram
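The coordinate bookkeeping can be pictured with a simple one-dimensional layout. The following sketch is purely illustrative and makes its own layout assumptions: circles are placed left to right, separated by a fixed gap, and each connecting line runs from the right edge of one circle to the left edge of the next. The function name layout_caterpillar and the gap parameter are hypothetical; the returned lists merely mirror the documented lx_s, lx_e and cx variables.

```python
def layout_caterpillar(radii, gap=1.0, n_last_cohorts=None):
    """Return (cx, lx_s, lx_e) for circles laid out along the X axis.

    cx   : X coordinate of each circle center
    lx_s : X coordinate where each connecting line starts
    lx_e : X coordinate where each connecting line ends
    """
    if n_last_cohorts is not None:
        radii = radii[-n_last_cohorts:]  # keep only the most recent cohorts

    cx, lx_s, lx_e = [], [], []
    cursor = 0.0  # left edge of the circle about to be placed
    for i, r in enumerate(radii):
        if i > 0:
            lx_s.append(cursor - gap)  # right edge of the previous circle
            lx_e.append(cursor)        # left edge of the current circle
        cx.append(cursor + r)
        cursor += 2 * r + gap
    return cx, lx_s, lx_e
```

With radii 2 and 4 and a gap of 1, the centers land at 2 and 9, and the single connecting line spans from 4 to 5.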