Introduction to mcblog package

Introduction

This vignette introduces the usage of the mcblog package for the estimation of worse-entity logistic regression in bilateral diseases adjusted for entity-specific misclassification and missing disease classification in single entities as introduced in Guenther et al. (2020)¹.

Sample data of true and error-prone binary worse-entity and single-entity disease stages

To illustrate the usage of the mcblog package and the consequences of ignoring misclassification in logistic regression of a binary worse-entity outcome we sample synthetic (entity-specific) data for 2000 subjects that suffer from misclassification and assume that the true (entity-specific) outcomes are available for a subset of 500 subjects.

In a first step, we sample the true entity-specific binary disease stages based on the true data model, an assumed logistic regression for the worse-entity outcome \(Y:=max(Z_1, Z_2), \; Z_1, Z_2 \in \{0,1\}\), where \(Y\) is associated with \(X\) based on \(P(Y=1|X) = 1/(1+exp(-X))\). To derive the true entity-specific disease stages, we assume that \(P(Z_1=1, Z_2=1|Y=1) = \delta\), with \(\delta=0.75\) and symmetric probabilities in the single entities, \(P(Z_1=1,Z_2=0|Y=1)=P(Z_1=0,Z_2=1|Y=1)\):

library(tidyverse)
set.seed(21479)
# Sample standard normal continuous covariate x
x = rnorm(2000, 0, 1)
# Sample true worse-entity outcome y based on logistic regression model with logOR(X)=1
y = rbinom(2000, 1, plogis(x))
# Sample true single-entity outcomes z_1, z_2 based on P(Z_1=1, Z_2=2|Y=1)=delta
delta=.75
# Select all subjects with (sampled) true worse-entity outcome y=1
ind_y_1 = which(y==1)
# Sample fraction of subjects with disease in both entities (z1=1, z2=1) from 
# all subjects with y=1
ind_z1z2_1 = sample(ind_y_1, replace = FALSE, size = floor(delta*length(ind_y_1)))
# Get all individuals with disease in one of two entities
ind_onez_1 = setdiff(ind_y_1, ind_z1z2_1)
# Sample whether z1 or z2 equals 1
ind_z1_1_z2_0 = sample(ind_onez_1, replace = FALSE, size = floor(length(ind_onez_1)/2))
ind_z1_0_z2_1 = setdiff(ind_onez_1, ind_z1_1_z2_0)
# Define true entity-specific observations z1 and z2
z1 = rep(0, 2000)
z2 = rep(0, 2000)
z1[ind_z1z2_1] = 1
z2[ind_z1z2_1] = 1
z1[ind_z1_1_z2_0] = 1
z2[ind_z1_0_z2_1] = 1

From the true entity-specific disease stages \((Z_1, Z_2)\) we sample error-prone entity-specific disease stages \((Z_1^*, Z_2^*)\) based on two different scenarios of the misclassification process and derive the error-prone worse-entity outcome \(Y^*=max(Z_1^*, Z_2^*)\). In the first scenario we sample \(Z_l^*\) based on a fixed sensitivity and specificity \(P(Z_l^*=1|Z_l=1)=0.8\) and \(P(Z_l^*=0|Z_l=0)=0.8\). In the second scenario, the sensitivity is still \(0.8\) but the specificity depends on the covariate X via an assumed logistic regression model \(P(Z_l^*=0|Z_l=0, X) = 1/(1+exp(-(1.5+0.5*X)))\).

# Sample error-prone single entity ourcomes z*_1, z*_2 based on two different assumptions 
# regarding sensitivity/specificity:
# Scenario 1: sens=spec=.8
z1_star_1 = rep(NA, 2000)
z2_star_1 = rep(NA, 2000)
z1_star_1[which(z1==1)] = rbinom(sum(z1==1), 1, .8)
z1_star_1[which(z1==0)] = rbinom(sum(z1==0), 1, 1-0.8)
z2_star_1[which(z2==1)] = rbinom(sum(z2==1), 1, .8)
z2_star_1[which(z2==0)] = rbinom(sum(z2==0), 1, 1-0.8)
# Scenario2: sens=.8, spec=plogis(1.5+0.5*x)
z1_star_2 = rep(NA, 2000)
z2_star_2 = rep(NA, 2000)
z1_star_2[which(z1==1)] = rbinom(sum(z1==1), 1, .8)
z1_star_2[which(z1==0)] = rbinom(sum(z1==0), 1, 1-plogis(1.5+.5*x[which(z1==0)]))
z2_star_2[which(z2==1)] = rbinom(sum(z2==1), 1, .8)
z2_star_2[which(z2==0)] = rbinom(sum(z2==0), 1, 1-plogis(1.5+.5*x[which(z2==0)]))

We assume that the true disease single-entity disease stages are available for 25% of the subjects (validation data) and set the true single-entity disease stages for the other 75% with as missing. Furthermore, we remove one of the two error-prone single entity disease stages in 100 randomly selected individuals (corresponding to a missing single entity classification).

# Select 25% of subjects into validation data (i.e., remove true single-entity 
# observations for other 75%)
main_stud_ind = sample(1:2000, 1500, replace = FALSE)
z1[main_stud_ind] = NA
z2[main_stud_ind] = NA
# Remove error-prone single-entity disease of 50 subjects in entity 1 and entity 2
rm_z1star = sample(1:1000, 50, replace = FALSE)
rm_z2star = sample(1001:2000, 50, replace = FALSE)
z1_star_1[rm_z1star] = NA
z1_star_2[rm_z1star] = NA
z2_star_1[rm_z2star] = NA
z2_star_2[rm_z2star] = NA
# Derive error-prone worse-entity ourcome for both scenarios
y_star_1 = pmax(z1_star_1, z2_star_1, na.rm = TRUE)
y_star_2 = pmax(z1_star_2, z2_star_2, na.rm = TRUE)

We can now compare the true person-specific worse-entity outcome \(Y\) (rows) to the error-prone outcomes \(Y^*_1\) and \(Y^*_2\) (columns):

# Compare true worse-entity disease y and error-prone versions y_star_1/y_star_2
kable(table(y, y_star_1))

	0	1
0	661	359
1	79	901

kable(table(y, y_star_2))

	0	1
0	640	380
1	80	900

The empirical person-specific sensitivity and specificity are given by:

# Compare true worse-entity disease y and error-prone versions y_star_1/y_star_2
kable(prop.table(table(y, y_star_1), margin = 1), digits = 2)

	0	1
0	0.65	0.35
1	0.08	0.92

kable(prop.table(table(y, y_star_2), margin = 1), digits = 2)

	0	1
0	0.63	0.37
1	0.08	0.92

While the person-specific sensitivity and specificity in both scenarios are similar, note that the specificity of \(Y^*_2\) is associated with \(X\) as can be seen by comparing the fraction of false positives in association with X:

tibble(y, y_star_1, y_star_2, x) %>%
  filter(y==0) %>% group_by(x_cut = cut(x, breaks = seq(-4,4, by = 1))) %>%
  summarise(fp_y_star_1 = mean(y_star_1==1),
            fp_y_star_2 = mean(y_star_2==1)) %>% 
  pivot_longer(cols = c("fp_y_star_1", "fp_y_star_2")) %>%
  ggplot() + geom_bar(aes(x_cut, value, fill = name, group = name), stat="identity", 
                      width=.75, position = "dodge", alpha = .5) +
  xlab("X") + ylab("P(false-positive), person-specific") + 
  theme_bw()

For \(Y^*_1\) the specificity and consequently the fraction of false-positives varies rather constant around \(0.35\), for \(Y^*_2\), the specificity increases with \(X\) and the probability of false-positives clearly decreases with higher values of \(X\).

Estimate logistic regression models

Simple logistic regression

To illustrate the effect of ignoring response misclassification in (bilateral) logistic regression, we firstly estimate standard logistic regression of the outcomes \(Y\) (true outcome, typically unobserved in studies we are concerned with), \(Y^*_1\) (non-differential misclassification), and \(Y^*_2\) (differential misclassification) on \(X\).

logreg_true = glm(y~x, family = "binomial")
logreg_y_star_1 = glm(y_star_1~x, family = "binomial")
logreg_y_star_2 = glm(y_star_2~x, family = "binomial")

The regression of \(Y\) on \(X\) yields an unbiased association estimate of \(\sim 1\) as expected based on the data generating process.

kable(summary(logreg_true)$coef, digits = 2)

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-0.07	0.05	-1.41	0.16
x	0.95	0.06	16.62	0.00

confint(logreg_true)[2,]
#>     2.5 %    97.5 % 
#> 0.8389864 1.0629615

The association estimates of \(Y^*_1\) and \(Y^*_2\) are both downward biased compared to the true outcome \(Y\):

kable(summary(logreg_y_star_1)$coef, digits = 2)

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	0.55	0.05	11.54	0
x	0.51	0.05	10.16	0

confint(logreg_y_star_1)[2,]
#>     2.5 %    97.5 % 
#> 0.4091556 0.6043304
kable(summary(logreg_y_star_2)$coef, digits = 2)

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	0.58	0.05	12.32	0
x	0.28	0.05	5.88	0

confint(logreg_y_star_2)[2,]
#>     2.5 %    97.5 % 
#> 0.1860043 0.3714191

Bias in \(Y^*_2\) is bigger than in \(Y^*_1\) due to the differential misclassification, even if the average sensitivity and specificity of \(Y^*_1\) and \(Y^*_2\) are similar. The bigger number of false-positives for low values of \(X\) conceils the true, positive association of \(X\) and \(Y\) additionally to the non-differential misclassification in \(Y_1^*\).

Developed MLA

We developed a maximum likelihood approach to adjust for entity-specific response misclassification in bilateral disease data. It is implemented in the mcblog::est_mcblog function which takes the following arguments:

z1_star/z2_star: a vector of observed binary, error-prone single entity outcomes of length \(N\), NA if missing in one of two entities
z1/z2: a vector of observed binary gold-standard single entity outcomes of length \(N\) where available (validation data). Should be NA for each subject that is not part of the validation data or if disease stage is missing in one of two entites
formula/formula_delta/formula_sens/formula_spec: formulas to specify the model for the worse-entity outcome, the probability for disease in both entites given disease in at least one, the sensitivity (probability of true positive single-entity classification) and specificity (probability of true negative single-entity classification). All probabilities are modelled based on the logit link (corresponding to standard logistic regression for the worse-entity outcome in case of no misclassification)
data: a dataframe with \(N\) rows containing all covariates specified in the formulas

MLA1 (assuming constant misclassification probabilities)

In a first step, we estimate the maximum likelihood approach assuming constant misclassification probabilities (and a constant \(\delta\)) for both misclassification scenarios.

library(mcblog)
# Estimate MC-adjusted logistic regression assuming constant
# delta, sens, and spec estimated from validation data
# Scenario 1
mla1_y_star_1 = est_mcblog(z1_star = z1_star_1, 
                           z2_star = z2_star_1, 
                           z1 = z1, 
                           z2 = z2, 
                           formula = ~x,
                           formula_delta = ~1, #constant delta, sens, spec
                           formula_sens = ~1,
                           formula_spec = ~1, 
                           data = data.frame(x))
# There exists also an mcblog:::summary.blogmc method, but for compilation it is better
# to directly access the estimated coefficients in the blogmc objects (first list entry)
kable(mla1_y_star_1[[1]], digits = 2)

	beta	se	t_val	p
y_(Intercept)	-0.05	0.08	-0.69	0.49
y_x	0.97	0.08	11.86	0.00
delta_(Intercept)	1.22	0.14	8.70	0.00
sens_(Intercept)	1.40	0.09	14.85	0.00
spec_(Intercept)	1.42	0.08	18.10	0.00

The MLA yields coefficient estimates for all four parts of the model, estimated on logit scale. The estimated association of \(X\) and worse-entity disease \(Y\) is given by \(\hat{\beta}_{y\_x}=0.97\) and is unbiased but has a bigger associated standard error compared to logistic regression of the true \(Y\) on \(X\). To interpret the estimated coeficients for \(\delta\), and the sensitivity and specificity it is necessary to transform the etimated intercept from logit-scale to a probability. This yields \(\hat{\delta}=\text{Logist}(1.22)=0.77\), \(\widehat{\text{sens}}=0.80\), \(\widehat{\text{spec}}=0.81\), corresponding closely to the true parameters of the data generating process.

# Scenario 2
mla1_y_star_2 = est_mcblog(z1_star = z1_star_2, 
                           z2_star = z2_star_2, 
                           z1 = z1, 
                           z2 = z2, 
                           formula = ~x,
                           formula_delta = ~1,
                           formula_sens = ~1,
                           formula_spec = ~1, 
                           data = data.frame(x))
kable(mla1_y_star_2[[1]], digits = 2)

	beta	se	t_val	p
y_(Intercept)	-0.15	0.08	-1.88	0.06
y_x	0.77	0.08	9.81	0.00
delta_(Intercept)	1.21	0.14	8.54	0.00
sens_(Intercept)	1.44	0.10	14.47	0.00
spec_(Intercept)	1.25	0.07	17.03	0.00

In misclassification scenario 2, the estimated \(\widehat{\text{sens}}=0.81\) and specificity \(\widehat{\text{spec}}=0.78\) correspond to the average entity-specific sensitivity/specificity in the data, but the estimated association of X and worse-entity disease \(Y\), \(\hat{\beta}_{y\_x}=0.77\) is still downward biased since the association of \(X\) and the specifificty is left unaccounted.

MLA2 (assuming association of misclassification probabilities with X)

We now estimate the MLA allowing for an association of the sensitivity/specificity with covariate \(X\).

# Estimate MC-adjusted logistic regression assuming varying
# sens, and spec estimated from validation data and constant delta
# Scenario 1
mla2_y_star_1 = est_mcblog(z1_star = z1_star_1, 
                            z2_star = z2_star_1, 
                            z1 = z1, 
                            z2 = z2, 
                            formula = ~x,
                            formula_delta = ~1,
                            formula_sens = ~x,
                            formula_spec = ~x, 
                            data = data.frame(x))
kable(mla2_y_star_1[[1]], digits = 2)

	beta	se	t_val	p
y_(Intercept)	-0.03	0.08	-0.44	0.66
y_x	0.96	0.09	10.41	0.00
delta_(Intercept)	1.22	0.14	8.70	0.00
sens_(Intercept)	1.34	0.10	13.22	0.00
sens_x	0.12	0.10	1.22	0.22
spec_(Intercept)	1.46	0.09	16.09	0.00
spec_x	0.08	0.08	1.03	0.30

In misclassification scenario 1, no strong evidence was found for an association of the sensitivity/specificity with \(X\) (as expected given no association in the data generating process). The estimated association of \(X\) with the worse-entity disease \(Y\) is unbiased with \(\hat{\beta}_{y\_x}=0.96\) but has a bigger standard error compared to MLA1 (0.09 vs. 0.08) due to the more complex model.

# Scenario 2
mla2_y_star_2 = est_mcblog(z1_star = z1_star_2, 
                            z2_star = z2_star_2, 
                            z1 = z1, 
                            z2 = z2, 
                            formula = ~x,
                            formula_delta = ~1,
                            formula_sens = ~x,
                            formula_spec = ~x, 
                            data = data.frame(x))
kable(mla2_y_star_2[[1]], digits = 2)

	beta	se	t_val	p
y_(Intercept)	-0.11	0.08	-1.38	0.17
y_x	1.01	0.09	11.23	0.00
delta_(Intercept)	1.19	0.14	8.61	0.00
sens_(Intercept)	1.42	0.11	13.01	0.00
sens_x	0.03	0.10	0.28	0.78
spec_(Intercept)	1.53	0.09	16.11	0.00
spec_x	0.53	0.08	6.71	0.00

In scenario 2, we successfully detect the association of \(X\) with the specificity and obtain an unbiased estimate of \(\hat{\beta}_{y\_x}=1.01\).

Guenther, F., Brandl, C., Winkler, T. W., Wanner, V., Stark, K., Küchenhoff, H., & Heid, I. M. (2020). Chances and challenges of machine learning based disease classification in genetic association studies illustrated on age-related macular degeneration. Genetic Epidemiology.↩