Why it works

1 Introduction

1.1 What are we going to do in this Vignette

In this vignette, we’ll explain the underlying statistical model that the primarycensoreddist package. We’ll cover the following key points:

Introduction to the censored data problems in time to event analysis.
Discuss the relevant issues from censoring in epidemiological data.
Introduce the statistical model used in primarycensoreddist.
Distributions where we can derive the censored delay distribution analytically.

2 Censoring and right truncation problems in time to event analysis

Time to event analysis, also known as survival analysis, concerns estimating the distribution of delay times between events. A distinctive feature of the field are the methodological techniques used to deal with the missing data problems common in data sets of delay times^[1]. In primarycensoreddist we focus on two particular missing data problems:

Interval censoring. The primary (start) time and/or the secondary (end) time of the delay are unobserved but both are known to be within intervals.
Right truncation. Truncated delay data items are only reported if their secondary events are less than a known time, for example the current data collection time.

In statistical epidemiology, these missing data problems occur frequently in both data analysis and theoretical modelling. For a more detailed description of these problems in an epidemiological context see^[2,3].

In data analysis, events in epidemiology are commonly reported as occurring on a particular day or week (interval censoring). In an emerging outbreak, datasets can be incompletely observed (right truncation) and their can be a great deal of uncertainty around the precise timing of events (interval censoring).

In theoretical epidemiological modelling, it is often appropriate to model the evolution of an infectious disease as occurring in discrete time, for example in the EpiNow2 and EpiEstim modelling packages. This means appropriately discretising continuous distributions, such as the generation interval distribution. In primarycensoreddist we treat the discretisation of intrinsically continuous distributions as an interval censoring problem which allows us to simultaneously provide methods for both applied and theoretical contexts.

3 Statistical model used in `primarycensoreddist`

As described in Getting Started with primarycensoreddist, primarycensoreddist focuses on a subset of methods from time to event analysis that address data missingness problems commonly found in epidemiological datasets. We present the statistical problem as a double interval censoring problem, where both the primary event time and the secondary event times are interval censored. We can recover single interval censoring problems by reducing one of the intervals to a point. In particular, a key assumption we make is that the censoring window for events is known and independent of the event time. This is known as non-informative censoring.

The target for inference is the distribution of the delay time between the primary and secondary events. We assume that the delay time is a random variable \(T = S - P_{u}\) with distribution function \(F_T(t) = Pr(T < t)\) and density function \(f_T(t)\). In this treatment we assume that the delay time is shift-invariant, that is, the distribution of the delay time is the same regardless of the primary event time.

The (unconditional) primary event time is a random variable \(P_{u}\) with distribution function \(F_{P_{u}}(t)\) and density function \(f_{P_{u}}(t)\). The secondary event time is a random variable \(S\), but in this treatment we construct the secondary event time from the primary time and delay, therefore the marginal distribution of \(S\) is not considered.

The censoring window for each event is the interval within which each event is known to have occurred, respectively, \(P \in [t_P, t_P + w_P]\) and \(S \in [t_S, t_S + w_S]\). The lengths of the censoring windows are \(w_P\) and \(w_S\) respectively. The precise event times within their windows are unobserved. Note that since the primary event time is known up to the censoring window, we are predominantly interested in the conditional primary time \(P = P_{u} | \{ P_{u} \in [t_P, t_P + w_P]\}\) which has density function:

\[ \begin{aligned} f_{P}(p) &= {f_{P_{u}}(p) \over F_{P_{u}}(t_P + w_P) - F_{P_{u}}(t_P)}, \qquad &p \in [t_P, t_P + w_P],\\ f_{P}(p) &= 0, \qquad &\text{otherwise}. \end{aligned} \]

We measure the censored delay time \(T_c\) between the primary and secondary event windows from endpoint to endpoint: \(T_c = t_S + w_S - (t_P + w_P)\).

In our treatment below we focus on the survival function of the time after the end of the primary window to the secondary event time which we denote \(S_{+}\). We then use this to derive the distribution of the censored delay time \(T_c\). This is equivalent to, but differ in mathematical approach from other treatments of the censoring problems in epidemiology, such as^[2].

3.1 Censored delay time distribution

In this section, we explain how to derive the distribution of the censored delay time \(T_c\) from the distribution of the delay time \(T\) and the condition distribution of the primary event time \(P\).

3.1.1 Survival function of time from the end of the primary censoring window to the secondary event time

In order to reason upon the distribution of the censored delay time \(T_c\), it is useful to consider the time from the end (right) point of the primary censoring interval to the secondary time as a random variable,

\[ S_+ = S - (t_P + w_P) = T - ((t_P + w_P) - P) = T - C_P. \]

Where \(T\) is the delay distribution of interest and \(C_P = (t_P + w_P) - P\) is interval between the end (right) point of the primary censoring window and the primary event time; note that by definition \(C_P\) is not observed but we can relate its distribution to the distribution of \(P\): \(F_{C_P}(p) = Pr(C_P < p) = Pr(P > t_P + w_P - p)\).

With non-informative censoring, it is possible to derive the upper distribution function of \(S_+\), or survival function of \(S_+\), from the distribution of \(T\) and the distribution of \(C_P\).

\[ \begin{equation} \begin{split} Q_{S_+}(t) &= Pr(S_+ > t) \\ &= Pr(T > C_P + t) \\ &= \mathbb{E}_{C_P} \Big[Q_T(t + C_P)\Big] \\ &= \int_0^{w_P} Q_T(t + p) f_{C_P}(p) dp. \end{split} \end{equation} \]

Using integration by parts gives: \[ Q_{S_+}(t) = Q_T(t + w_P) + \int_0^{w_P} f_T(t+p) F_{C_P}(p) dp. \tag{3.1} \]

Where we have used that \(Q^{'}_{T} = - f_T\), \(Q_T\) is the survival function of the actual delay distribution of interest and \(w_P\) is the length of the primary censoring window.

Equation (3.1) is the key equation in this note and is used to derive the distribution of the censored delay time \(T_c\). It has the interpretation that the probability that the secondary event time is greater than \(t\) after the end of the primary censoring window is the sum of two disjoint event probabilities:

The probability that the actual delay time \(T\) is greater than \(t + w_P\).
The probability that the actual delay time \(T\) is between \(t\) and \(t + w_P\), and the primary event time \(P\) occurred sufficiently close to the end of the primary censor window that the secondary even occurred more than time \(t\) after the end of the primary window.

3.1.2 Probability of secondary event time within a secondary censoring window

Having constructed the survival function of \(S_+\) with equation (3.1), using numerical quadrature or in some other way, we can calculate the probability mass of a secondary event time falling within a observed secondary censoring window of length \(w_S\) that begins at time \(n\) after the primary censoring window. This gives the censored delay time probability by integrating over censored values:

\[ Pr(S_+ \in [n, n + w_S)) = Q_{S_+}(n) - Q_{S_+}(n + w_S). \tag{3.2} \]

Note that the censored secondary event time can also occur within the primary censoring window. This happens with probability, \[ Q_{S_+}(-w_P) - Q_{S_+}(0) = 1 - Q_T(w_P) - \int_0^{w_P} f_T(p) F_{C_P}(p) dp = Pr(T< C). \]

3.1.3 Discrete censored delay distribution

A common situation is when every primary and secondary event window is a single time unit, \(w_P = w_S = 1\), and data arrives in non-overlapping intervals. For example, in the context of infectious disease modelling, we might have a data set of symptom onsets (primary event) and time of seeking health care (secondary event) which are both reported in days.

In this situation, we are interested in the probability of censored delays \(T_c\) in units of the event window. In this case equation (3.2) is:

\[ f_n = Pr(T_c = n) = Pr(S_+ \in [n - 1, n)) = Q_{S_+}(n-1) - Q_{S_+}(n )\qquad n = 0, 1, \dots. \tag{3.3} \]

Which is a discrete probability mass function of the secondary event time relative to the primary event time in units of the event window.

3.2 Analytic solutions for exponentially tilted primary event times

In epidemiological analysis, it is common for primary events to be arriving at exponentially increasing or decreasing rates, for example, incidence of new infections in a growing epidemic. In this case, the distribution of the primary event time within its censoring window is biased by the exponential growth or decay. If we assume a reference uniform distribution within a primary censoring window \([k, k + w_P)\) then the distribution of the primary event time within the censoring window is the exponential tilted uniform distribution:

\[ f_P(t) \propto \exp(r t) \mathbb{1}_{[k, k + w_P]}(t). \]

In this case, the distribution function for \(C_P\), that is the length of time left in the primary censor window after the primary event time, is given by:

\[ F_{C_P}(p; r) = { 1 - \exp(-r p) \over 1 - \exp(-r w_P)}, \qquad p \in [0, w_P]. \tag{3.4} \] Note that taking the limit \(\lim_r \rightarrow 0\) in equation (3.4) gives the uniform distribution function \(F_{C_P}(p, 0) = p / w_P\).

In the following it is convenient to use a (linear) difference operator defined as:

\[ \Delta_{w}f(t) = f(t + w) - f(t). \]

3.2.1 Uniform primary event time (\(r=0\))

Applying a uniform primary event time distribution to equation (3.1) gives:

\[ Q_{S_+}(t) = Q_T(t + w_P) + { 1 \over w_P} \int_0^{w_P} f_T(t+p) p~ dp. \]

This is analytically solvable whenever the upper distribution function of \(T\) is known and the mean of \(T\) is analytically solvable from its integral definition.

In each case considered below it is easier to change the integration variable:

\[ \begin{aligned} Q_{S_+}(t) &= Q_T(t + w_P) + { 1 \over w_P} \int_t^{t+w_P} f_T(z) (z-t)~ dz \\ &= Q_T(t + w_P) + { 1 \over w_P} \Big[ \int_t^{t+w_P} f_T(z) z~ dz - t \Delta_{w_P}F_T(t) \Big]. \end{aligned} \tag{3.5} \]

General partial expectation

Note that for any distribution with an analytically available distribution function \(F_T\) equation (3.5) can be solved so long as the partial expectation

\[ \int_t^{t+w_P} f_T(z) z~ dz \tag{3.6} \]

can be reduced to an analytic expression.

The insight here is that this will be possible for any distribution where the average of the distribution can be calculated analytically, which includes commonly used non-negative distributions such as the Gamma, Log-Normal and Weibull distributions.

General Discrete censored delay distribution

First, we note that equation (3.3) can be written using the difference operator: \(f_n = -\Delta_1 Q_{S_+}(n-1)\). We can insert this expression into equation (3.5) to give the discrete censored delay distribution for uniformly distributed primary event times:

\[ \begin{aligned} f_n &= \Delta_1\Big[(n-1) \Delta_1F_T(n-1)\Big] - \Delta_1Q_T(n) - \Delta_1\Big[ \int_{n-1}^n f_T(z) z ~dz \Big] \\ &= (n+1)F_T(n+1) + (n-1)F_T(n-1) - 2nF_T(n) - \Delta_1\Big[ \int_{n-1}^n f_T(z) z ~dz \Big]. \end{aligned} \tag{3.7} \]

We now consider some specific delay distributions.

3.2.1.1 Gamma distributed delay times

The Gamma distribution has the density function:

\[ f_T(z;k, \theta) = {1 \over \Gamma(k) \theta^k} z^{k-1} \exp(-z/\theta). \] Where \(\Gamma\) is the Gamma function.

The Gamma distribution has the distribution function:

\[ \begin{aligned} F_T(z;k, \theta) &= {\gamma(k, z/\theta) \over \Gamma(k)}, \qquad z\geq 0,\\ F_T(z;k, \theta) &= 0, \qquad z < 0. \end{aligned} \] Where \(\gamma\) is the lower incomplete gamma function.

Gamma partial expectation

We know that the full expectation of the Gamma distribution is \(\mathbb{E}[T] = k\theta\), which can be calculated as a standard integral. Doing the same integral for the partial expectation gives:

\[ \begin{aligned} \int_t^{t+w_P} f_T(z) z~ dz &= {1 \over \Gamma(k) \theta^k} \int_t^{t+w_P}z z^{k-1} \exp(-z/\theta)~dz \\ &= {\Gamma(k+1) \theta^{k+1} \over \Gamma(k) \theta^k} {1 \over \Gamma(k+1) \theta^{k+1}} \int_t^{t+w_P}z^{k} \exp(-z/\theta)~dz\\ &= k\theta \Delta_{w_P} F_T(t; k + 1, \theta). \end{aligned} \tag{3.8} \]

Survival function of \(S_{+}\) for Gamma distribution

By substituting equation (3.8) into equation (3.7) we can solve for the survival function of \(S_+\) in terms of analytically available functions:

\[ \begin{aligned} Q_{S_+}(t; k, \theta) &= Q_T(t + w_P; k, \theta) + { 1 \over w_P} \left( {1 \over \Gamma(k) \theta^k} \int_t^{t+w_P} z^{k-1} \exp(-z/\theta) (z-t)~ dz \right), \\ &= Q_T(t + w_P; k, \theta) + { 1 \over w_P} \big[ k \theta \Delta_{w_P}F_T(t; k+1, \theta) - t \Delta_{w_P}F_T(t; k, \theta) \big]. \end{aligned} \tag{3.9} \]

Gamma discrete censored delay distribution

By substituting (3.9) into (3.3) we get the discrete censored delay distribution in terms of analytically available functions: \[ \begin{aligned} f_n &= (n+1) F_T(n+1; k, \theta) + (n-1) F_T(n-1; k, \theta) - 2 n F_T(n; k, \theta) - k \theta \Delta_1^{(2)}F_T(n-1; k+1, \theta)\\ &= (n+1) F_T(n+1; k, \theta) + (n-1) F_T(n-1; k, \theta) - 2 n F_T(n; k, \theta) \\ &+ k \theta \Big( 2 F_T(n; k+1, \theta) - F_T(n-1; k+1, \theta) - F_T(n+1; k+1,\theta) \Big) \qquad n = 0, 1, \dots. \end{aligned} \]

Which was also found by Cori et al^[4].

3.2.1.2 Log-Normal distribution

The Log-Normal distribution has the density function:

\[ f_T(z;\mu, \sigma) = {1 \over z \sigma \sqrt{2\pi}} \exp\left( - {(\log(z) - \mu)^2 \over 2 \sigma^2} \right). \] And distribution function:

\[ F_T(z;\mu, \sigma) = \Phi\left( {\log(z) - \mu \over \sigma} \right). \] Where \(\Phi\) is the standard normal distribution function.

Log-Normal partial expectation

We know that the full expectation of the Log-Normal distribution is \(\mathbb{E}[T] = e^{\mu + \frac{1}{2} \sigma^2}\), which can be calculated by integration with the integration substitution \(y = (\ln z - \mu) / \sigma\). This has transformation Jacobian:

\[ \frac{dz}{dy} = \sigma e^{\sigma y + \mu}. \]

Doing the same integral for the partial expectation, and using the same integration substitution gives:

\[ \begin{aligned} \int_t^{t+w_P} z~ f_T(z; \mu, \sigma) dz &= {1 \over \sigma \sqrt{2\pi}} \int_t^{t+w_P} \exp\left( - {(\log(z) - \mu)^2 \over 2 \sigma^2} \right) dz \\ &= {1 \over \sqrt{2\pi}} \int_{(\ln t - \mu)/\sigma}^{(\ln(t+w_P) - \mu)/\sigma} e^{\sigma y + \mu} e^{-y^2/2} dy\\ &= {e^{\mu + \frac{1}{2} \sigma^2} \over \sqrt{2 \pi} } \int_{(\ln t - \mu)/\sigma}^{(\ln(t+w_P) - \mu)/\sigma} e^{-(y- \sigma)^2/2} dy \\ &= e^{\mu + \frac{1}{2} \sigma^2} \Big[\Phi\Big({\ln(t+w_P) - \mu \over \sigma} - \sigma\Big) - \Phi\Big({\ln(t) - \mu \over \sigma} - \sigma\Big) \Big]\\ &= e^{\mu + \frac{1}{2} \sigma^2} \Delta_{w_P}F_T(t; \mu + \sigma^2, \sigma). \end{aligned} \tag{3.10} \]

Survival function of \(S_{+}\) for Log-Normal distribution

By substituting equation (3.10) into equation (3.7) we can solve for the survival function of \(S_+\) in terms of analytically available functions:

\[ \begin{aligned} Q_{S+}(t ;\mu, \sigma) &= Q_T(t + w_P;\mu, \sigma) + { 1 \over w_P} \int_t^{t+w_P} {1 \over z \sigma \sqrt{2\pi}} \exp\left( - {(\log(z) - \mu)^2 \over 2 \sigma^2} \right) (z-t)~ dz, \\ &= Q_T(t + w_P;\mu, \sigma) + { 1 \over w_P} \Big[ e^{\mu + \frac{1}{2} \sigma^2} \Delta_{w_P}F_T(t; \mu + \sigma^2, \sigma) - t\Delta_{w_P}F_T(t; \mu, \sigma) \Big] \end{aligned} \]

Log-Normal discrete censored delay distribution

By substituting (3.10) into (3.3) we get the discrete censored delay distribution in terms of analytically available functions:

\[ \begin{aligned} f_n &= (n+1) F_T(n+1; \mu, \sigma) + (n-1) F_T(n-1; \mu, \sigma) - 2 n F_T(n; \mu, \sigma) \\ &- e^{\mu + \frac{1}{2} \sigma^2} \Delta_1^{(2)}F_T(n-1;\mu + \sigma^2, \sigma) \\ &= (n+1) F_T(n+1; \mu, \sigma) + (n-1) F_T(n-1; \mu, \sigma) - 2 n F_T(n; \mu, \sigma) \\ &+ e^{\mu + \frac{1}{2} \sigma^2} \Big( 2 F_T(n; \mu + \sigma^2, \sigma) - F_T(n+1; \mu + \sigma^2, \sigma) - F_T(n-1; \mu + \sigma^2, \sigma) \Big)\qquad n = 0, 1, \dots. \end{aligned} \]

3.2.1.3 Weibull distribution

The Weibull distribution has the density function:

\[ f_T(z;\lambda,k) = \begin{cases} \frac{k}{\lambda}\left(\frac{z}{\lambda}\right)^{k-1}e^{-(z/\lambda)^{k}}, & z\geq0 ,\\ 0, & z<0, \end{cases} \] And distribution function:

\[ F_T(z;\lambda,k))=\begin{cases}1 - e^{-(z/\lambda)^k}, & z\geq0,\\ 0, & z<0.\end{cases} \] Where \(\Phi\) is the standard normal distribution function.

Weibull partial expectation

We know that the full expectation of the Weibull distribution is \(\mathbb{E}[T] = \lambda \Gamma(1 + 1/k)\), which can be calculated by integration using the integration substitution \(y = (z / \lambda)^k\), which has transformation Jacobian:

\[ \frac{dz}{dy} = \frac{\lambda}{k}y^{1/k - 1}. \]

Doing the same integral for the partial expectation, and using the same integration substitution gives:

\[ \begin{aligned} \int_t^{t+w_P} z~ f_T(z; \lambda,k) dz &= \int_t^{t+w_P} \frac{kz}{\lambda}\left(\frac{z}{\lambda}\right)^{k-1}e^{-(z/\lambda)^{k}} dz \\ &= k\int_t^{t+w_P} \left(\frac{z}{\lambda}\right)^{k}e^{-(z/\lambda)^{k}} dz \\ &= \lambda k \int_{(t / \lambda)^k}^{((t + w_P) / \lambda)^k} y y^{1/k - 1} e^{-y} dy \\ &= \lambda\int_{(t / \lambda)^k}^{((t + w_P) / \lambda)^k} y^{1/k} e^{-y} dy\\ &= \lambda \Delta_{w_P} g(t; \lambda,k) \end{aligned} \tag{3.11} \]

Where

\[ g(t; \lambda, k) = \gamma(1 + 1/k, (t/\lambda)^k) = \frac{1}{k}\gamma(1/k, (t/\lambda)^k) - \frac{t}{\lambda}e^{-(t/\lambda)^k}. \] is a reparametrisation of the lower incomplete gamma function.

Survival function of \(S_{+}\) for Weibull distribution

By substituting equation (3.11) into equation (3.7) we can solve for the survival function of \(S_+\) in terms of analytically available functions:

\[ Q_{S+}(t ;\lambda,k) = Q_T(t + w_P; \lambda,k) + { 1 \over w_P} \Big[ \lambda \Delta_{w_P} g(t; \lambda,k) - t\Delta_{w_P}F_T(t; \lambda,k)\Big]. \]

Weibull discrete censored delay distribution

By substituting (3.11) into (3.3) we get the discrete censored delay distribution in terms of analytically available functions:

\[ \begin{aligned} f_n &= (n+1)F_T(n+1) + (n-1)F_T(n-1) - 2nF_T(n) - \Delta_1\Big[ \int_{n-1}^n f_T(z) z ~dz \Big] \\ &= (n+1)F_T(n+1) + (n-1)F_T(n-1) - 2nF_T(n) - \lambda \Delta_1^{(2)} g(n-1; \lambda,k) \\ &= (n+1)F_T(n+1) + (n-1)F_T(n-1) - 2nF_T(n) \\ &+ \lambda [2 g(n; \lambda,k) - g(n+1; \lambda,k) - g(n-1; \lambda,k)] \qquad n = 0, 1, \dots. \end{aligned} \]

Which was also found by Cori et al^[4].

References

1. Leung, K.-M., Elashoff, R. M., & Afifi, A. A. (1997). Censoring issues in survival analysis. Annual Review of Public Health, 18(1), 83–104.

2. Park, S. W., Akhmetzhanov, A. R., Charniga, K., Cori, A., Davies, N. G., Dushoff, J., Funk, S., Gostic, K., Grenfell, B., Linton, N. M., Lipsitch, M., Lison, A., Overton, C. E., Ward, T., & Abbott, S. (2024). Estimating epidemiological delay distributions for infectious diseases. bioRxiv. https://doi.org/10.1101/2024.01.12.24301247

3. Charniga, K., Park, S. W., Akhmetzhanov, A. R., Cori, A., Dushoff, J., Funk, S., Gostic, K. M., Linton, N. M., Lison, A., Overton, C. E., Pulliam, J. R. C., Ward, T., Cauchemez, S., & Abbott, S. (2024). Best practices for estimating and reporting epidemiological delay distributions of infectious diseases using public health surveillance and healthcare data. arXiv [Stat.ME]. https://arxiv.org/abs/2405.08841

4. Cori, A., Ferguson, N. M., Fraser, C., & Cauchemez, S. (2013). A new framework and software to estimate time-varying reproduction numbers during epidemics. American Journal of Epidemiology, 178(9), 1505–1512.

1 Introduction

1.1 What are we going to do in this Vignette

2 Censoring and right truncation problems in time to event analysis

3 Statistical model used in primarycensoreddist

3.1 Censored delay time distribution

3.1.1 Survival function of time from the end of the primary censoring window to the secondary event time

3.1.2 Probability of secondary event time within a secondary censoring window

3.1.3 Discrete censored delay distribution

3.2 Analytic solutions for exponentially tilted primary event times

3.2.1 Uniform primary event time (\(r=0\))

3.2.1.1 Gamma distributed delay times

3.2.1.2 Log-Normal distribution

3.2.1.3 Weibull distribution

References

3 Statistical model used in `primarycensoreddist`