survivalstan package
Submodules
survivalstan.models module
survivalstan.sim module
Functions to simulate failure-time data for testing & model checking purposes
survivalstan.sim.sim_data_exp(N, censor_time, rate)
Simulate true lifetimes (t) according to an exponential model.
Parameters:
- N (int): number of observations
- censor_time (float): uniform censor time for each observation
- rate (float, positive): hazard rate used to parameterize failure times
Returns:
pandas DataFrame with N observations and 3 columns:
- true_t: “actual” simulated failure time
- t: observed failure/censor time, given censor_time
- event: boolean indicating whether the failure event was observed (True) or censored (False)
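The simulation described above can be sketched with the standard library alone. This is a minimal illustration and not survivalstan's implementation: the function name `simulate_exponential` and the `seed` argument are hypothetical, and the result is a list of dicts rather than a DataFrame (passing the list to `pandas.DataFrame` would give the three-column frame).

```python
import random

def simulate_exponential(n, censor_time, rate, seed=None):
    """Return a list of dicts with keys true_t, t and event (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        true_t = rng.expovariate(rate)   # exponential lifetime with hazard `rate`
        event = true_t <= censor_time    # True if the failure is observed before censoring
        t = min(true_t, censor_time)     # observed time is capped at censor_time
        rows.append({"true_t": true_t, "t": t, "event": event})
    return rows

rows = simulate_exponential(n=200, censor_time=20.0, rate=0.1, seed=1)
```

Every record with `event` equal to False has `t` equal to `censor_time`, which is what "uniform censor time" means here.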
Simulate true lifetimes (t) according to an exponential model whose rate is computed from covariates.
Parameters:
- N (int): number of observations
- censor_time (float): uniform censor time for each observation
- rate_form: names of variables to use when estimating rate. Defaults to ‘1 + age + sex’
- rate_coefs: inputs to rate-calc (coefs used to estimate log-rate). Defaults to [-3, 0.3, 0]
Returns:
pandas DataFrame with N observations and the following columns:
- true_t: “actual” simulated failure time
- t: observed failure/censor time, given censor_time
- event: boolean indicating whether the failure event was observed (True) or censored (False)
- age: simulated age in years (Poisson random variable, expectation = 55)
- sex: simulated sex, as ‘female’ or ‘male’ (uniform 50/50 split)
- rate: simulated rate value for each observation
survivalstan.sim.sim_data_jointmodel(N, p=0.5, **kwargs)
Simulate data for joint model
Returns a dictionary of 4 objects:
- params: parameter values used to simulate data
- covars: dataframe of covariates per subject_id
- events: dataframe of multiple-event data, per subject_id
- biomarker: dataframe of longitudinal biomarker values simulated
survivalstan.survivalstan module
class survivalstan.survivalstan.SurvivalStanData(df, formula, event_col, time_col=None, sample_id_col=None, sample_col=None, group_id_col=None, group_col=None, timepoint_id_col=None, timepoint_end_col=None, drop_intercept=True, **kwargs)
Bases: object
Input data representing a survival model in survivalstan
survivalstan.survivalstan.extract_baseline_hazard(results, element='baseline', timepoint_id_col='timepoint_id', timepoint_end_col='end_time')
If model results contain a baseline object, extract & summarize it
survivalstan.survivalstan.extract_grp_baseline_hazard(results, timepoint_id_col='timepoint_id', timepoint_end_col='end_time')
If model results contain a grp_baseline object, extract & summarize it
survivalstan.survivalstan.fit_stan_survival_model(df=None, formula=None, event_col=None, model_code=None, file=None, model_cohort='survival model', time_col=None, sample_id_col=None, sample_col=None, group_id_col=None, group_col=None, timepoint_id_col=None, timepoint_end_col=None, make_inits=None, stan_data={}, grp_coef_type=None, FIT_FUN=<function fit>, drop_intercept=True, input_data=None, *args, **kwargs)
Prepare data & fit a survival model using Stan
This function wraps a number of steps into one function:
- Prepare input data dictionary for Stan
  - calls SurvivalStanData with user-provided formulas & df
  - (can be overridden using the input_data parameter)
- Compiles & optionally caches compiled stan code
- Fits model to data
- Tries the following functions on the resulting fit object:
  - stanity.psisloo to summarize model fit using LOO-PSIS approximation
  - extract posterior draws for beta coefficients (if model contains beta parameter)
  - extract posterior draws for grouped-beta coefficients (if applicable)
- Parameters:
- df (pandas DataFrame): the data frame containing input data to the survival model
- formula (chr): Patsy formula to use for covariates, e.g. ‘met_status + pd_l1’
- event_col (chr): name of column containing event status. Will be coerced to boolean
- model_code (chr): Stan model code to use
- file (chr): path to Stan file (if model_code not given)
- *args, **kwargs: passed to FIT_FUN (stanity.fit or replacement)
- model_cohort (chr): description of this model fit, to be used when plotting or summarizing output
- time_col (chr): name of column containing event time (used for parametric models)
- sample_id_col (chr): name of column containing numeric sample ids (1-indexed & sequential)
- sample_col (chr): name of column containing sample descriptions; will be converted to an ID
- group_id_col (chr): name of column containing numeric group ids (1-indexed & sequential)
- group_col (chr): name of column containing group descriptions; will be converted to an ID
- timepoint_id_col (chr): name of column containing timepoint ids (1-indexed & sequential)
- timepoint_end_col (chr): name of column containing end times for each timepoint (will be converted to an ID)
- stan_data (dict): extra params passed to stan data object
- grp_coef_type (chr): type of group coef specified, if using a varying-coef model. Can be one of:
  - None (default): guess group coef orientation from data. Works except in the case where M (num covariates) == G (num groups)
  - ‘matrix’: grp_beta defined as matrix[M, G] grp_beta;
  - ‘vector-of-vectors’: grp_beta defined as vector[M] grp_beta[G];
- drop_intercept (bool): whether to drop the intercept term from the model matrix (default: True)
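The reason the default guess for grp_coef_type fails when M == G can be seen in a small sketch. It assumes that extracted draws of `matrix[M, G] grp_beta` have shape (draws, M, G) and that draws of `vector[M] grp_beta[G]` have shape (draws, G, M); the helper `guess_grp_coef_type` below is hypothetical and is not the logic survivalstan itself uses.

```python
def guess_grp_coef_type(shape, M, G):
    """Guess the layout of grp_beta draws from their shape (hypothetical helper).

    shape is (n_draws, d1, d2), as for an array of extracted posterior draws.
    """
    _, d1, d2 = shape
    if M == G:
        # Both layouts give the same shape, so the data cannot disambiguate them.
        raise ValueError("ambiguous: M == G, pass grp_coef_type explicitly")
    if (d1, d2) == (M, G):
        return "matrix"
    if (d1, d2) == (G, M):
        return "vector-of-vectors"
    raise ValueError("shape %r does not match M=%d, G=%d" % (shape, M, G))

kind_a = guess_grp_coef_type((1000, 3, 5), M=3, G=5)
kind_b = guess_grp_coef_type((1000, 5, 3), M=3, G=5)
```

When the number of covariates equals the number of groups, passing ‘matrix’ or ‘vector-of-vectors’ explicitly avoids the ambiguity.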
- Returns:
dictionary of results objects.
- Contents:
- df: Pandas data frame containing input data, filtered to non-missing obs & with ID variables created
- x_df: Covariate matrix passed to Stan
- x_names: Column names for the covariate matrix passed to Stan
- data: List passed to Stan; contains dimensions, etc.
- fit: pystan fit object returned from Stan call
- coefs: posterior draws for coefficient values
- loo: psis-loo object returned for fit model. Used for model comparison & summary
- model_cohort: description of this model and/or cohort on which the model was fit
- df_all: input df given, with calculated values included
- sample_col: name of column (in df_all) used to identify the sample
- sample_id_col: name of column containing numeric id derived from the sample
- timepoint_end_col: name of column (in df_all) used to determine end-time of ‘long’ data, if relevant
- timepoint_id_col: name of column containing numeric id derived from timepoint_end_col
- Raises:
- AttributeError, KeyError
Example:
>>> testfit = fit_stan_survival_model(
...     model_code = stanmodels.stan.pem_survival_model,
...     formula = '~ met_status + pd_l1',
...     df = dflong,
...     sample_col = 'patient_id',
...     timepoint_end_col = 'end_time',
...     event_col = 'end_failure',
...     model_cohort = 'PEM survival model',
...     iter = 30000,
...     chains = 4,
... )
>>> print(testfit['fit'])
>>> seaborn.boxplot(x = 'value', y = 'variable', data = testfit['coefs'])
survivalstan.survivalstan.prep_data_long_surv(df, time_col, event_col, sample_col=None, event_name=None)
Convert wide survival dataframe (df) to long format, in preparation for modeling using PEM models.
- Returns a pandas DataFrame with original records duplicated for each unique failure time observed.
- Each record will have two new columns: ‘end_failure’ and ‘end_time’, indicating the event status (end_failure) for each unique timepoint (end_time).
- Parameters:
- df (pandas.DataFrame):
- Input data containing survival time & status for each subject
- time_col (str):
- name of column containing time to censor/event
- event_col (str or list of strings):
- name of column containing status (1 or True: event, 0 or False: censor) If a list is provided, these will be processed as multiple event types.
- sample_col (str):
- (optional) column containing sample or subject identifier. If given, result will be de-duped so that multiple events within a sample are handled correctly.
- event_name (str):
- (optional) column containing description of event type, if more than one type of event is observed. If given, then multiple events per subject will be processed.
- Returns:
pandas.DataFrame with original records duplicated for each unique failure time observed.
Each record will _include all original covariate values_, plus two new columns: ‘end_failure’ and ‘end_time’, indicating the timepoint-specific event status for each record.
If multiple events are given (either via a list of event_cols or by providing an event_name), the result will contain multiple end_failure columns, one for each event type.
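The wide-to-long conversion for a single event type can be sketched as follows. This is a standard-library illustration, not the library's code: `to_long` is a hypothetical name, records are plain dicts rather than DataFrame rows, and every distinct value of time_col is treated as a timepoint (restricting to event times only would be a one-line change).

```python
def to_long(records, time_col, event_col):
    """Duplicate each record once per distinct time up to its own time."""
    times = sorted({r[time_col] for r in records})
    long_rows = []
    for r in records:
        for end_time in times:
            if end_time > r[time_col]:
                break  # the subject is no longer at risk after its own time
            row = dict(r)  # keep all original covariate values
            row["end_time"] = end_time
            # status is True only at the subject's own time, and only if an event occurred
            row["end_failure"] = bool(r[event_col]) and end_time == r[time_col]
            long_rows.append(row)
    return long_rows

wide = [
    {"id": "a", "t": 2.0, "event": True},
    {"id": "b", "t": 5.0, "event": False},
    {"id": "c", "t": 7.0, "event": True},
]
long_rows = to_long(wide, time_col="t", event_col="event")
```

Subject "b" is censored at time 5.0, so it contributes rows at end_time 2.0 and 5.0, both with end_failure False.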
survivalstan.utils module
survivalstan.utils.extract_params_long(models, element, rename_vars=None, varnames=None)
Helper function to extract & reformat params
- models (list):
- List of model objects
- element (string):
- Which element to extract, e.g. ‘coefs’. Other options (depending on model type) include:
  - ‘grp_coefs’
  - ‘baseline_hazard’
- rename_vars (dict, optional):
- dictionary mapping from integer positions (0, 1, 2) to variable names
- varnames (list of strings, optional):
- list of variable names to apply to columns from the extracted object
- Returns:
Pandas dataframe containing posterior draws per iteration
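As an illustration of the long format and of how rename_vars maps integer positions to variable names, here is a minimal sketch. The helper `draws_to_long` is hypothetical, and the column names iter, variable and value are assumptions based on the columns referenced elsewhere in this documentation.

```python
def draws_to_long(draws, rename_vars=None):
    """Reshape per-iteration draws into one record per (iteration, variable)."""
    rename_vars = rename_vars or {}
    long_rows = []
    for it, draw in enumerate(draws):
        for pos, value in enumerate(draw):
            long_rows.append({
                "iter": it,
                # fall back to the integer position when no name is given
                "variable": rename_vars.get(pos, pos),
                "value": value,
            })
    return long_rows

# three posterior draws of two coefficients (made-up numbers)
draws = [[0.10, -0.30], [0.20, -0.25], [0.15, -0.35]]
long_rows = draws_to_long(draws, rename_vars={0: "age", 1: "sex"})
```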
survivalstan.utils.extract_time_betas(models, element='beta_time', value_name='beta', **kwargs)
Extract posterior draws for values of time-varying element from each model given in the list of models.
Returns a pandas.DataFrame containing one record for each posterior draw of each parameter, where the parameter varies over time.
Columns include:
- model_cohort: description of the model or cohort from which the draw was taken
- <value-column>: the value of the posterior draw, named according to given parameter value_name
- coef: description of the coefficient estimated, as per patsy formula provided
- iter: integer indicator of the draw from which that estimate was taken
- <timepoint-id-column>: integer identifier for each unique time at which betas are estimated (default column name is set by fit_stan_survival_model, typically as “timepoint_id”)
- <timepoint-end-column>: time at which this beta was estimated (default column name is set by fit_stan_survival_model, typically as “end_time”)
** Parameters **:
param models: list of model-fit objects returned by survivalstan.fit_stan_survival_model.
type models: list
param element: name of parameter to extract. Defaults to “beta_time”, the parameter name used in the example time-varying stan model.
type element: str
param value_name: what you would like the “value” column called in the resulting dataframe
type value_name: str
param **kwargs: **kwargs are passed to _extract_time_betas_single_model, allowing user to customize “default” values which would otherwise be read from each model object. Examples include: coefs, timepoint_id_col, and timepoint_end_col.
** Returns **:
returns: pandas.DataFrame containing posterior draws of parameter values.
survivalstan.utils.filter_stan_summary(stan_fit, pars=None, remove_nan=False)
Filter stan fit summary, for the set of parameters in pars. See ?pystan.summary for details about summary stats given.
- stan_fit:
- StanFit object whose posterior draws are to be summarized
- pars: (list, optional)
- list of strings used to filter parameters. Passed directly to pystan.summary. Default: return all parameters
- remove_nan: (bool, optional)
- whether to remove (and report on) NaN values for Rhat. These are problematic for distplot.
- Returns:
pandas dataframe containing summary stats for posterior draws of selected parameters
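The effect of remove_nan can be sketched without fitting a model. The input dict below is a toy stand-in with made-up numbers; its keys (summary, summary_rownames, summary_colnames) are assumed to mirror what a PyStan 2 fit summary provides, and `filter_summary` is a hypothetical helper that returns a plain dict instead of a DataFrame.

```python
import math

def filter_summary(summary, remove_nan=False):
    """Return {parameter: {stat: value}}, optionally dropping NaN-Rhat rows."""
    colnames = summary["summary_colnames"]
    rhat_col = colnames.index("Rhat")
    rows = list(zip(summary["summary_rownames"], summary["summary"]))
    if remove_nan:
        dropped = [name for name, vals in rows if math.isnan(vals[rhat_col])]
        if dropped:
            # report on what is removed, as the remove_nan description suggests
            print("Dropping %d parameter(s) with NaN Rhat: %s" % (len(dropped), dropped))
        rows = [(name, vals) for name, vals in rows if not math.isnan(vals[rhat_col])]
    return {name: dict(zip(colnames, vals)) for name, vals in rows}

toy = {
    "summary_colnames": ["mean", "sd", "Rhat"],
    "summary_rownames": ["alpha", "beta[1]", "lp__"],
    "summary": [
        [0.0, 0.0, float("nan")],   # a constant parameter has an undefined Rhat
        [0.31, 0.12, 1.01],
        [-120.4, 1.9, 1.00],
    ],
}
kept = filter_summary(toy, remove_nan=True)
```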
survivalstan.utils.plot_coefs(models, element='coefs', force_direction=None, trans=None, **kwargs)
Plot coefficients for models listed
- models (list):
- List of model objects
- element (string, optional):
- Which element to plot. Defaults to ‘coefs’. Other options (depending on model type) include:
  - ‘grp_coefs’
  - ‘baseline’
  - ‘beta_time’
- force_direction (string, optional):
- Takes values ‘h’ or ‘v’
- if ‘h’: forces horizontal orientation, (variable names along the x axis)
- if ‘v’: forces vertical orientation (variable names along the y axis)
if None (default), coef plots default to ‘v’ for all plots except baseline hazard.
- trans (function, optional):
- If present, transforms value of value column
- example: np.exp to plot exp(beta)
if None (default), plots raw value
survivalstan.utils.plot_observed_survival(df, event_col, time_col, label='observed', *args, **kwargs)
survivalstan.utils.plot_pp_survival(models, time_element='y_hat_time', event_element='y_hat_event', num_ticks=10, step_size=None, ticks_at=None, time_col='event_time', event_col='event_status', fill=True, by=None, alpha=0.5, pal=None, subplot=None, **kwargs)
Plot KM curve estimates from posterior-predicted values by group, for each model given in the list of models.
See prep_pp_survival_data for details regarding process of extracting posterior-predicted values.
** Parameters controlling data extraction **:
param models: list of fit_stan_survival_model results from which to extract posterior-predicted values
type models: list
param by: additional column or columns by which to summarize posterior-predicted values. Default is None, which results in draws summarized by [iter and model_cohort]. Values can include any covariates provided in the original df.
type by: str or list of strings
param time_element: (optional) name of parameter containing posterior-predicted event time for each subject Defaults to standard used in survivalstan models: y_hat_time.
type time_element: str
param event_element: (optional) name of parameter containing posterior-predicted event status for each subject Defaults to the standard used in survivalstan models: y_hat_event.
type event_element: str
param event_col: (optional) name to use for column containing posterior draw for event_status
type event_col: str
param time_col: (optional) name to use for column containing posterior draw for time to event
type time_col: str
param **kwargs: **kwargs are passed to _prep_pp_data_single_model, allowing user to override or specify default values given in the original call to fit_stan_survival_model. Parameters include: sample_col, sample_id_col to define names of sample description & id columns, as well as join_with giving name of dataframe to join with (options include df_nonmiss, x_df, or None). Use join_with = None to disable merge with original dataframe.
** Parameters controlling plot orientation/presentation **:
param pal: (optional) palette to use for plotting.
type pal: list of colors, matching length of by groups
param ticks_at: (optional) exact locations for placement of ticks
param num_ticks: (optional) control number of ticks, if ticks_at not given.
param step_size: (optional) control tick spacing, if ticks_at or num_ticks not given
param alpha: (optional) level of transparency for boxplots
param fill: (optional) whether to fill in boxplots or just show outlines. Defaults to True
param subplot: (optional) pyplot.subplots object to use, if provided. Useful if you want to overlay observed or true survival on the same plot.
param xlabel: (optional) label for x-axis (defaults to “Days”)
param ylabel: (optional) label for y-axis (defaults to “Survival %”)
param label: (optional) legend-label for this plot group (defaults to “posterior predictions”, model-cohort, or by-group label, depending on options)
param **kwargs: (optional) args passed to set properties of boxes, medians & whiskers (e.g. color)
** Returns **:
returns: Nothing. Plotted object is a side-effect.
survivalstan.utils.plot_stan_summary(stan_fit, pars=None, metric='Rhat')
Plot distribution of values in stan fit summary, for the set of parameters in pars.
Primary use case is to summarize Rhat estimates for set of parameters, as a quick check of convergence.
- stan_fit:
- StanFit object whose posterior draws are to be summarized
- pars: (list of str, optional)
- list of strings used to filter parameters. Passed directly to pystan.summary. Default: return all parameters
- metric: (str, optional)
- the name of the metric to plot, as one of: [‘mean’, ‘se_mean’, ‘sd’, ‘2.5%’, ‘50%’, ‘97.5%’, ‘Rhat’]. Default: Rhat
survivalstan.utils.plot_time_betas(models=None, df=None, element='beta_time', y='beta', trans=None, coefs=None, x='timepoint_end_col', by=['model_cohort', 'coef'], timepoint_id_col=None, timepoint_end_col=None, subplot=None, ticks_at=None, ylabel=None, xlabel='time', num_ticks=10, step_size=None, fill=True, alpha=0.5, pal=None, value_name='beta', **kwargs)
Plot posterior draws of time-varying parameters (element) from each model given in the list of models.
See also
extract_time_betas to return the dataframe used by this function to plot data.
Note
this function can optionally take a df argument (the result of extract_time_betas) to support data-extraction & plotting in a two-step operation.
** Parameters controlling data extraction **:
param models: list of model-fit objects returned by survivalstan.fit_stan_survival_model.
type models: list
param element: name of parameter to extract. Defaults to “beta_time”, the parameter name used in the example time-varying stan model.
type element: str
param value_name: what you would like the “value” column called in the resulting dataframe
type value_name: str
param coefs: (optional) parameter passed to extract_time_betas, to override coefficient names captured in fit_stan_survival_model.
param timepoint_id_col: (optional) parameter passed to extract_time_betas, to override timepoint_id_col captured in fit_stan_survival_model.
param timepoint_end_col: (optional) parameter passed to extract_time_betas, to override timepoint_end_col captured in fit_stan_survival_model.
** Parameters controlling plot orientation/presentation **:
param trans: (optional) function to transform y-values plotted. Example: np.log
type trans: function
param by: (optional) list of columns by which to aggregate & color boxplots. Defaults to: [‘model_cohort’, ‘coef’]
type by: list
param pal: (optional) palette to use for plotting.
type pal: list of colors, matching length of by groups
param y: (optional) column to put on the y-axis. Defaults to ‘beta’
type y: str
param x: (optional) column to put on the x-axis. Defaults to ‘timepoint_end_col’
type x: str
param num_ticks: (optional) how many ticks to show on the x-axis. See _plot_time_betas for details.
param alpha: (optional) level of transparency for boxplots
param fill: (optional) whether to fill in boxplots or just show outlines. Defaults to True
param subplot: (optional) pyplot.subplots object to use, if provided. Useful if you want to overlay multiple values on the same plot.
** Returns **:
returns: Nothing. Plotted object is a side-effect.
survivalstan.utils.prep_pp_data(models, time_element='y_hat_time', event_element='y_hat_event', event_col='event_status', time_col='event_time', **kwargs)
Extract posterior-predicted values from each model included in the list of models given, optionally merged with covariates & meta-data provided in the input df.
Parameters:
param models: list of fit_stan_survival_model results from which to extract posterior-predicted values
type models: list
param time_element: (optional) name of parameter containing posterior-predicted event time for each subject Defaults to standard used in survivalstan models: y_hat_time.
type time_element: str
param event_element: (optional) name of parameter containing posterior-predicted event status for each subject Defaults to the standard used in survivalstan models: y_hat_event.
type event_element: str
param event_col: (optional) name to use for column containing posterior draw for event_status
type event_col: str
param time_col: (optional) name to use for column containing posterior draw for time to event
type time_col: str
param **kwargs: **kwargs are passed to _prep_pp_data_single_model, allowing user to override or specify default values given in the original call to fit_stan_survival_model. Parameters include: sample_col, sample_id_col to define names of sample description & id columns, as well as join_with giving name of dataframe to join with (options include df_nonmiss, x_df, or None). Use join_with = None to disable merge with original dataframe.
Returns:
returns: pandas.DataFrame with one record per posterior draw (iter) for each subject, from each model, optionally joined with original input data.
survivalstan.utils.prep_pp_survival_data(models, time_element='y_hat_time', event_element='y_hat_event', time_col='event_time', event_col='event_status', by=None, **kwargs)
Summarize posterior-predicted values into KM survival/censor rates by group, for each model given in the list of models.
See prep_pp_data for details regarding process of extracting posterior-predicted values.
Parameters:
param models: list of fit_stan_survival_model results from which to extract posterior-predicted values
type models: list
param by: additional column or columns by which to summarize posterior-predicted values. Default is None, which results in draws summarized by [iter and model_cohort]. Values can include any covariates provided in the original df.
type by: str or list of strings
param time_element: (optional) name of parameter containing posterior-predicted event time for each subject Defaults to standard used in survivalstan models: y_hat_time.
type time_element: str
param event_element: (optional) name of parameter containing posterior-predicted event status for each subject Defaults to the standard used in survivalstan models: y_hat_event.
type event_element: str
param event_col: (optional) name to use for column containing posterior draw for event_status
type event_col: str
param time_col: (optional) name to use for column containing posterior draw for time to event
type time_col: str
param **kwargs: **kwargs are passed to _prep_pp_data_single_model, allowing user to override or specify default values given in the original call to fit_stan_survival_model. Parameters include: sample_col, sample_id_col to define names of sample description & id columns, as well as join_with giving name of dataframe to join with (options include df_nonmiss, x_df, or None). Use join_with = None to disable merge with original dataframe.
Returns:
returns: pandas.DataFrame with one record per posterior draw (iter), timepoint, model_cohort, and by-groups.
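To make the KM summary concrete, the sketch below computes a Kaplan-Meier survival curve from the predicted times and event indicators of a single posterior draw. It is a plain-Python illustration with a hypothetical function name (`km_survival`), not the code used by this package; in practice the same calculation would be repeated for each iter, model_cohort and by-group.

```python
def km_survival(times, events):
    """Kaplan-Meier survival estimate at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        n_here = 0
        # gather all subjects (events and censorings) tied at time t
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            n_here += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= n_here  # events and censored subjects both leave the risk set
    return curve

curve = km_survival([1, 2, 2, 3, 4], [True, True, False, True, False])
```

The result is a list of (time, survival probability) pairs, one per time at which at least one event occurs.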
survivalstan.utils.print_stan_summary(stan_fit, pars=None)
Convenience function to print stan fit summary, for the set of parameters in pars.
- stan_fit:
- StanFit object whose posterior draws are to be summarized
- pars: (optional)
- list of strings used to filter parameters. Passed directly to pystan.summary. Default: return all parameters
survivalstan.utils.read_files(path, pattern='*.stan', encoding='utf-8', resource=None)
Reads file contents from a directory path into memory. Returns a dictionary of file names: file contents.
Is intended to be used to load a directory of stan files into an object.
- path (string):
- directory path (can be relative or absolute)
- pattern (string, optional):
- glob-style pattern applied to files on import. Defaults to “*.stan”
- encoding (string, optional):
- encoding to use when importing files. Defaults to “UTF-8”
- resource (string, optional):
- if given, path is relative to package install root. Used to load stan files provided by packages (e.g. those within a package library)
- Returns:
The specifics of the return type depend on the value of resource.
- if resource is None, returns contents of file as a character string
- otherwise, returns a “resource_string” which acts as a character string but technically isn’t one.
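For the simple case where resource is None, the behaviour described above can be approximated with the standard library. This is a sketch under assumptions rather than the library's code: `read_dir` is a hypothetical name, the dictionary is keyed by the full file name, and pattern matching uses `fnmatch`.

```python
import fnmatch
import io
import os
import tempfile

def read_dir(path, pattern="*.stan", encoding="utf-8"):
    """Return {file name: file contents} for files in `path` matching `pattern`."""
    contents = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full) and fnmatch.fnmatch(name, pattern):
            with io.open(full, encoding=encoding) as handle:
                contents[name] = handle.read()
    return contents

# demonstrate on a temporary directory holding one Stan file and one other file
with tempfile.TemporaryDirectory() as tmp:
    with io.open(os.path.join(tmp, "weibull.stan"), "w", encoding="utf-8") as handle:
        handle.write("data { int N; }")
    with io.open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as handle:
        handle.write("ignored")
    models = read_dir(tmp)
```

Only the file matching the pattern is loaded, so `models` holds a single entry keyed by "weibull.stan".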