Introduction to mcblog package

Introduction

This vignette shows how to use the mcblog package to estimate worse-entity logistic regression models for bilateral diseases, adjusting for entity-specific misclassification and for missing disease classifications in single entities, as introduced in Guenther et al. (2020) [1].

Sample data of true and error-prone binary worse-entity and single-entity disease stages

To illustrate the usage of the mcblog package and the consequences of ignoring misclassification in logistic regression of a binary worse-entity outcome, we sample synthetic entity-specific data for 2000 subjects. The observed disease stages are subject to misclassification, and we assume that the true entity-specific outcomes are available for a subset of 500 subjects.

In a first step, we sample the true entity-specific binary disease stages based on the true data model, an assumed logistic regression for the worse-entity outcome \(Y:=\max(Z_1, Z_2), \; Z_1, Z_2 \in \{0,1\}\), where \(Y\) is associated with \(X\) based on \(P(Y=1 \mid X) = 1/(1+\exp(-X))\). To derive the true entity-specific disease stages, we assume that \(P(Z_1=1, Z_2=1 \mid Y=1) = \delta\), with \(\delta=0.75\) and symmetric probabilities in the single entities, \(P(Z_1=1,Z_2=0 \mid Y=1)=P(Z_1=0,Z_2=1 \mid Y=1)\).
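The sampling of the true stages could look roughly like the following base R sketch. The seed, the variable names, and the standard normal distribution of \(X\) are assumptions made for illustration; they are not stated in this vignette.

```r
# Sketch of the true data model.
# Assumptions: seed, variable names, and X ~ N(0, 1) are illustrative only.
set.seed(1)
n <- 2000
delta <- 0.75

x <- rnorm(n)                               # assumed covariate distribution
y <- rbinom(n, size = 1, prob = plogis(x))  # P(Y = 1 | X) = 1 / (1 + exp(-X))

# Among subjects with Y = 1: both entities are affected with probability delta,
# otherwise exactly one entity is affected, each with equal probability.
both  <- rbinom(n, 1, delta) == 1
first <- rbinom(n, 1, 0.5) == 1
z1 <- as.integer(y == 1 & (both | first))
z2 <- as.integer(y == 1 & (both | !first))

# The worse-entity outcome is recovered as the maximum of the two stages.
stopifnot(all(pmax(z1, z2) == y))
```

Drawing `both` and `first` for every subject and masking with `y == 1` keeps the code vectorised; the draws for subjects with \(Y=0\) are simply ignored.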

From the true entity-specific disease stages \((Z_1, Z_2)\) we sample error-prone entity-specific disease stages \((Z_1^*, Z_2^*)\) under two different scenarios of the misclassification process and derive the error-prone worse-entity outcome \(Y^*=\max(Z_1^*, Z_2^*)\). In the first scenario, we sample \(Z_l^*\), \(l \in \{1,2\}\), with a fixed sensitivity \(P(Z_l^*=1 \mid Z_l=1)=0.8\) and a fixed specificity \(P(Z_l^*=0 \mid Z_l=0)=0.8\). In the second scenario, the sensitivity is still \(0.8\), but the specificity depends on the covariate \(X\) via an assumed logistic regression model \(P(Z_l^*=0 \mid Z_l=0, X) = 1/(1+\exp(-(1.5+0.5X)))\).
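The two misclassification scenarios can be sketched in base R as follows. The helper `misclassify()` is hypothetical, and the seed, the variable names, and the distribution \(X \sim N(0,1)\) are assumptions.

```r
# Sketch of both misclassification scenarios.
# Assumptions: seed, names, X ~ N(0, 1); misclassify() is a hypothetical helper.
set.seed(2)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
both  <- rbinom(n, 1, 0.75) == 1
first <- rbinom(n, 1, 0.5) == 1
z1 <- as.integer(y == 1 & (both | first))
z2 <- as.integer(y == 1 & (both | !first))

# Draw an error-prone stage for one entity.
# sens and spec may be scalars or vectors of length(z).
misclassify <- function(z, sens, spec) {
  p_pos <- ifelse(z == 1, sens, 1 - spec)  # P(Z* = 1 | Z, X)
  rbinom(length(z), 1, p_pos)
}

# Scenario 1: constant sensitivity and specificity.
z1_s1 <- misclassify(z1, 0.8, 0.8)
z2_s1 <- misclassify(z2, 0.8, 0.8)

# Scenario 2: specificity depends on X through a logistic model.
spec_x <- plogis(1.5 + 0.5 * x)
z1_s2 <- misclassify(z1, 0.8, spec_x)
z2_s2 <- misclassify(z2, 0.8, spec_x)

# Error-prone worse-entity outcomes.
ystar_1 <- pmax(z1_s1, z2_s1)
ystar_2 <- pmax(z1_s2, z2_s2)
```

The sketch draws the two entities independently given their true stages, which is one reading of the description above.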

We assume that the true single-entity disease stages are available for 25% of the subjects (validation data) and set the true single-entity disease stages of the other 75% to missing. Furthermore, we remove one of the two error-prone single-entity disease stages in 100 randomly selected individuals (corresponding to a missing single-entity classification).
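A minimal sketch of this masking step is given below. The data frame layout and column names are hypothetical, the stage columns are placeholder coin flips rather than the sampled data, and two details are assumptions: the 100 individuals are drawn from all subjects, and the removed entity is chosen at random.

```r
# Sketch of the validation subset and the missing single-entity classification.
# Assumptions: column names are hypothetical; the stage columns are placeholders.
set.seed(3)
n <- 2000
d <- data.frame(
  z1      = rbinom(n, 1, 0.4),  # placeholder for the true stage of entity 1
  z2      = rbinom(n, 1, 0.4),  # placeholder for the true stage of entity 2
  z1_star = rbinom(n, 1, 0.5),  # placeholder for the error-prone stage of entity 1
  z2_star = rbinom(n, 1, 0.5)   # placeholder for the error-prone stage of entity 2
)

# Keep the true stages only for a validation subset of 500 subjects (25%).
val <- sample(n, 500)
d$z1[-val] <- NA
d$z2[-val] <- NA

# Remove one error-prone entity in 100 randomly selected subjects.
miss <- sample(n, 100)
drop_first <- rbinom(100, 1, 0.5) == 1  # assumed: removed entity chosen at random
d$z1_star[miss[drop_first]]  <- NA
d$z2_star[miss[!drop_first]] <- NA
```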

We can now compare the true person-specific worse-entity outcome \(Y\) (rows) to the error-prone outcomes \(Y^*_1\) and \(Y^*_2\) (columns):

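\(Y\) vs. \(Y^*_1\) (scenario 1):

         Y* = 0   Y* = 1
Y = 0       661      359
Y = 1        79      901

\(Y\) vs. \(Y^*_2\) (scenario 2):

         Y* = 0   Y* = 1
Y = 0       640      380
Y = 1        80      900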

The empirical person-specific sensitivity and specificity are given by:

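\(Y\) vs. \(Y^*_1\) (scenario 1), row proportions:

         Y* = 0   Y* = 1
Y = 0      0.65     0.35
Y = 1      0.08     0.92

\(Y\) vs. \(Y^*_2\) (scenario 2), row proportions:

         Y* = 0   Y* = 1
Y = 0      0.63     0.37
Y = 1      0.08     0.92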
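The person-specific values for scenario 1 can be checked against the entity-specific parameters. The following worked calculation assumes that the two entities are misclassified independently given their true stages, and writes sens and spec for the entity-specific sensitivity and specificity of \(0.8\). A subject with \(Y=0\) is classified correctly only if both healthy entities are classified correctly:

\[ P(Y^*=0 \mid Y=0) = \text{spec}^2 = 0.8^2 = 0.64 \]

A subject with \(Y=1\) has either two affected entities (probability \(\delta\)) or one affected and one healthy entity (probability \(1-\delta\)), and is missed only if no entity is classified as diseased:

\[ P(Y^*=1 \mid Y=1) = \delta \left[1-(1-\text{sens})^2\right] + (1-\delta) \left[1-(1-\text{sens}) \cdot \text{spec}\right] = 0.75 \cdot 0.96 + 0.25 \cdot 0.84 = 0.93 \]

Both values are close to the empirical specificity of \(0.65\) and sensitivity of \(0.92\) for \(Y^*_1\).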

While the person-specific sensitivity and specificity are similar in both scenarios, the specificity of \(Y^*_2\) is associated with \(X\). This becomes visible when the fraction of false positives is compared across values of \(X\).

For \(Y^*_1\), the specificity is roughly constant in \(X\), so the fraction of false positives fluctuates around \(0.35\). For \(Y^*_2\), the specificity increases with \(X\), and the fraction of false positives clearly decreases for higher values of \(X\).
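The comparison described above can be approximated numerically by binning \(X\). The sketch below only simulates subjects with \(Y=0\), for whom both entities are healthy; the seed, the bin boundaries, the names, and \(X \sim N(0,1)\) are assumptions.

```r
# Sketch: fraction of false-positive worse-entity outcomes by bins of X.
# Assumptions: seed, bin boundaries, names, X ~ N(0, 1).
set.seed(4)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
x0 <- x[y == 0]  # subjects with Y = 0 have Z_1 = Z_2 = 0

# A false-positive Y* arises if at least one healthy entity is misclassified.
false_pos <- function(spec) {
  pmax(rbinom(length(x0), 1, 1 - spec), rbinom(length(x0), 1, 1 - spec))
}
fp_1 <- false_pos(0.8)                      # scenario 1: constant specificity
fp_2 <- false_pos(plogis(1.5 + 0.5 * x0))   # scenario 2: specificity depends on X

bins <- cut(x0, breaks = c(-Inf, -1, 0, 1, Inf))
fp_by_x <- rbind(
  scenario_1 = tapply(fp_1, bins, mean),
  scenario_2 = tapply(fp_2, bins, mean)
)
round(fp_by_x, 2)
```

In scenario 1 the rows should stay near \(1 - 0.8^2 = 0.36\) in every bin, whereas in scenario 2 the fraction should fall from the lowest to the highest bin.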

Estimate logistic regression models

Simple logistic regression

To illustrate the effect of ignoring response misclassification in (bilateral) logistic regression, we first fit standard logistic regressions of the outcomes \(Y\) (true outcome, typically unobserved in the studies we are concerned with), \(Y^*_1\) (non-differential misclassification), and \(Y^*_2\) (differential misclassification) on \(X\).

The regression of \(Y\) on \(X\) yields an association estimate of \(0.95\), close to the true value of \(1\) in the data generating process.

              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)      -0.07        0.05    -1.41      0.16
x                 0.95        0.06    16.62      0.00

The association estimates of \(Y^*_1\) and \(Y^*_2\) are both downward biased compared to the true outcome \(Y\):

Regression of \(Y^*_1\) on \(X\):

              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)       0.55        0.05    11.54         0
x                 0.51        0.05    10.16         0

Regression of \(Y^*_2\) on \(X\):

              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)       0.58        0.05    12.32         0
x                 0.28        0.05     5.88         0

The bias for \(Y^*_2\) is larger than for \(Y^*_1\) because of the differential misclassification, even though the average sensitivity and specificity of \(Y^*_1\) and \(Y^*_2\) are similar. The larger number of false positives for low values of \(X\) conceals the true, positive association of \(X\) and \(Y\), on top of the attenuation already caused by non-differential misclassification as in \(Y^*_1\).
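The attenuation can be reproduced with base R's `glm()`. The sketch below regenerates data under the stated model with an arbitrary seed and an assumed \(X \sim N(0,1)\), so the numbers will differ from the tables above; `misclassify()` is a hypothetical helper.

```r
# Sketch: naive logistic regressions that ignore misclassification.
# Assumptions: seed, names, X ~ N(0, 1); misclassify() is a hypothetical helper.
set.seed(5)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
both  <- rbinom(n, 1, 0.75) == 1
first <- rbinom(n, 1, 0.5) == 1
z1 <- as.integer(y == 1 & (both | first))
z2 <- as.integer(y == 1 & (both | !first))

# Error-prone stage of one entity with sensitivity 0.8 and given specificity.
misclassify <- function(z, spec) {
  rbinom(length(z), 1, ifelse(z == 1, 0.8, 1 - spec))
}
spec_x  <- plogis(1.5 + 0.5 * x)
ystar_1 <- pmax(misclassify(z1, 0.8), misclassify(z2, 0.8))        # scenario 1
ystar_2 <- pmax(misclassify(z1, spec_x), misclassify(z2, spec_x))  # scenario 2

fits <- list(
  true       = glm(y ~ x, family = binomial),
  scenario_1 = glm(ystar_1 ~ x, family = binomial),
  scenario_2 = glm(ystar_2 ~ x, family = binomial)
)
slopes <- sapply(fits, function(f) coef(f)[["x"]])
round(slopes, 2)
```

Calling `summary(fits$scenario_1)$coefficients` returns a coefficient table in the same layout as the ones shown above.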

Developed MLA

We developed a maximum likelihood approach to adjust for entity-specific response misclassification in bilateral disease data. It is implemented in the mcblog::est_mcblog function which takes the following arguments:

MLA1 (assuming constant misclassification probabilities)

In a first step, we fit the MLA assuming constant misclassification probabilities (and a constant \(\delta\)) in both misclassification scenarios.

                     beta    se  t_val     p
y_(Intercept)       -0.05  0.08  -0.69  0.49
y_x                  0.97  0.08  11.86  0.00
delta_(Intercept)    1.22  0.14   8.70  0.00
sens_(Intercept)     1.40  0.09  14.85  0.00
spec_(Intercept)     1.42  0.08  18.10  0.00

The MLA yields coefficient estimates for all four parts of the model, estimated on the logit scale. The estimated association of \(X\) and worse-entity disease \(Y\) is \(\hat{\beta}_{y\_x}=0.97\), close to the true value of \(1\), but it has a larger standard error than the logistic regression of the true \(Y\) on \(X\) (0.08 vs. 0.06). To interpret the estimated coefficients for \(\delta\), the sensitivity, and the specificity, the estimated intercepts have to be transformed from the logit scale to probabilities. This yields \(\hat{\delta}=\text{Logist}(1.22)=0.77\), \(\widehat{\text{sens}}=0.80\), \(\widehat{\text{spec}}=0.81\), corresponding closely to the true parameters of the data generating process.
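The back-transformation from the logit scale can be done with base R's `plogis()`, the logistic distribution function. The vector below simply copies the three intercepts from the table above; its name and element names are illustrative.

```r
# Intercepts on the logit scale, copied from the MLA1 table for scenario 1.
est_logit <- c(delta = 1.22, sens = 1.40, spec = 1.42)

# plogis(q) = 1 / (1 + exp(-q)) maps logits to probabilities.
est_prob <- round(plogis(est_logit), 2)
est_prob
# delta = 0.77, sens = 0.80, spec = 0.81
```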

                     beta    se  t_val     p
y_(Intercept)       -0.15  0.08  -1.88  0.06
y_x                  0.77  0.08   9.81  0.00
delta_(Intercept)    1.21  0.14   8.54  0.00
sens_(Intercept)     1.44  0.10  14.47  0.00
spec_(Intercept)     1.25  0.07  17.03  0.00

In misclassification scenario 2, the estimated sensitivity \(\widehat{\text{sens}}=0.81\) and specificity \(\widehat{\text{spec}}=0.78\) correspond to the average entity-specific sensitivity and specificity in the data. However, the estimated association of \(X\) and worse-entity disease \(Y\), \(\hat{\beta}_{y\_x}=0.77\), is still biased downward because the association of \(X\) with the specificity is left unaccounted for.

MLA2 (assuming association of misclassification probabilities with X)

We now estimate the MLA allowing for an association of the sensitivity/specificity with covariate \(X\).

                     beta    se  t_val     p
y_(Intercept)       -0.03  0.08  -0.44  0.66
y_x                  0.96  0.09  10.41  0.00
delta_(Intercept)    1.22  0.14   8.70  0.00
sens_(Intercept)     1.34  0.10  13.22  0.00
sens_x               0.12  0.10   1.22  0.22
spec_(Intercept)     1.46  0.09  16.09  0.00
spec_x               0.08  0.08   1.03  0.30

In misclassification scenario 1, there is no strong evidence for an association of the sensitivity or specificity with \(X\) (as expected, given that the data generating process contains no such association). The estimated association of \(X\) with the worse-entity disease \(Y\), \(\hat{\beta}_{y\_x}=0.96\), is close to the true value of \(1\) but has a larger standard error than in MLA1 (0.09 vs. 0.08) because of the more complex model.

                     beta    se  t_val     p
y_(Intercept)       -0.11  0.08  -1.38  0.17
y_x                  1.01  0.09  11.23  0.00
delta_(Intercept)    1.19  0.14   8.61  0.00
sens_(Intercept)     1.42  0.11  13.01  0.00
sens_x               0.03  0.10   0.28  0.78
spec_(Intercept)     1.53  0.09  16.11  0.00
spec_x               0.53  0.08   6.71  0.00

In scenario 2, we detect the association of \(X\) with the specificity (\(\hat{\beta}_{spec\_x}=0.53\), compared to a true value of \(0.5\)) and obtain an estimate of \(\hat{\beta}_{y\_x}=1.01\), close to the true value of \(1\).
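One way to put these estimates in context is to form approximate 95% Wald intervals from the reported coefficients and standard errors. This is a sketch that assumes a normal approximation for the estimators; it is not necessarily how the package computes its own test statistics.

```r
# Estimates and standard errors copied from the MLA2 table for scenario 2.
beta <- c(y_x = 1.01, spec_x = 0.53)
se   <- c(y_x = 0.09, spec_x = 0.08)

# Approximate 95% Wald intervals (assumption: normal approximation).
q  <- qnorm(0.975)
ci <- cbind(lower = beta - q * se, upper = beta + q * se)
round(ci, 2)
```

Both intervals cover the corresponding values of the data generating process, \(1\) for the association of \(X\) with \(Y\) and \(0.5\) for the association of \(X\) with the specificity.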


  1. Guenther, F., Brandl, C., Winkler, T. W., Wanner, V., Stark, K., Küchenhoff, H., & Heid, I. M. (2020). Chances and challenges of machine learning based disease classification in genetic association studies illustrated on age-related macular degeneration. Genetic Epidemiology.