# Introduction to mcblog package

## Introduction

This vignette introduces the usage of the mcblog package for the estimation of worse-entity logistic regression in bilateral diseases, adjusted for entity-specific misclassification and for missing disease classification in single entities, as introduced in Guenther et al. (2020)^1.

### Sample data of true and error-prone binary worse-entity and single-entity disease stages

To illustrate the usage of the mcblog package and the consequences of ignoring misclassification in logistic regression of a binary worse-entity outcome, we sample synthetic (entity-specific) data for 2000 subjects that are subject to misclassification, and assume that the true (entity-specific) outcomes are available for a subset of 500 subjects.

As a first step, we sample the true entity-specific binary disease stages based on the true data model, an assumed logistic regression for the worse-entity outcome $$Y := \max(Z_1, Z_2), \; Z_1, Z_2 \in \{0,1\}$$, where $$Y$$ is associated with $$X$$ via $$P(Y=1|X) = 1/(1+\exp(-X))$$. To derive the true entity-specific disease stages, we assume that $$P(Z_1=1, Z_2=1|Y=1) = \delta$$, with $$\delta=0.75$$, and symmetric probabilities for disease in a single entity, $$P(Z_1=1,Z_2=0|Y=1)=P(Z_1=0,Z_2=1|Y=1)$$:

library(tidyverse)
library(knitr) # for kable() used below
set.seed(21479)
# Sample standard normal continuous covariate x
x = rnorm(2000, 0, 1)
# Sample true worse-entity outcome y based on logistic regression model with logOR(X)=1
y = rbinom(2000, 1, plogis(x))
# Sample true single-entity outcomes z_1, z_2 based on P(Z_1=1, Z_2=1|Y=1)=delta
delta=.75
# Select all subjects with (sampled) true worse-entity outcome y=1
ind_y_1 = which(y==1)
# Sample fraction of subjects with disease in both entities (z1=1, z2=1) from
# all subjects with y=1
ind_z1z2_1 = sample(ind_y_1, replace = FALSE, size = floor(delta*length(ind_y_1)))
# Get all individuals with disease in one of two entities
ind_onez_1 = setdiff(ind_y_1, ind_z1z2_1)
# Sample whether z1 or z2 equals 1
ind_z1_1_z2_0 = sample(ind_onez_1, replace = FALSE, size = floor(length(ind_onez_1)/2))
ind_z1_0_z2_1 = setdiff(ind_onez_1, ind_z1_1_z2_0)
# Define true entity-specific observations z1 and z2
z1 = rep(0, 2000)
z2 = rep(0, 2000)
z1[ind_z1z2_1] = 1
z2[ind_z1z2_1] = 1
z1[ind_z1_1_z2_0] = 1
z2[ind_z1_0_z2_1] = 1

From the true entity-specific disease stages $$(Z_1, Z_2)$$ we sample error-prone entity-specific disease stages $$(Z_1^*, Z_2^*)$$ based on two different scenarios of the misclassification process and derive the error-prone worse-entity outcome $$Y^*=max(Z_1^*, Z_2^*)$$. In the first scenario we sample $$Z_l^*$$ based on a fixed sensitivity and specificity $$P(Z_l^*=1|Z_l=1)=0.8$$ and $$P(Z_l^*=0|Z_l=0)=0.8$$. In the second scenario, the sensitivity is still $$0.8$$ but the specificity depends on the covariate X via an assumed logistic regression model $$P(Z_l^*=0|Z_l=0, X) = 1/(1+exp(-(1.5+0.5*X)))$$.

# Sample error-prone single-entity outcomes z*_1, z*_2 based on two different assumptions
# regarding sensitivity/specificity:
# Scenario 1: sens=spec=.8
z1_star_1 = rep(NA, 2000)
z2_star_1 = rep(NA, 2000)
z1_star_1[which(z1==1)] = rbinom(sum(z1==1), 1, .8)
z1_star_1[which(z1==0)] = rbinom(sum(z1==0), 1, 1-0.8)
z2_star_1[which(z2==1)] = rbinom(sum(z2==1), 1, .8)
z2_star_1[which(z2==0)] = rbinom(sum(z2==0), 1, 1-0.8)
# Scenario 2: sens=.8, spec=plogis(1.5+0.5*x)
z1_star_2 = rep(NA, 2000)
z2_star_2 = rep(NA, 2000)
z1_star_2[which(z1==1)] = rbinom(sum(z1==1), 1, .8)
z1_star_2[which(z1==0)] = rbinom(sum(z1==0), 1, 1-plogis(1.5+.5*x[which(z1==0)]))
z2_star_2[which(z2==1)] = rbinom(sum(z2==1), 1, .8)
z2_star_2[which(z2==0)] = rbinom(sum(z2==0), 1, 1-plogis(1.5+.5*x[which(z2==0)]))

We assume that the true single-entity disease stages are available for 25% of the subjects (validation data) and set the true single-entity disease stages of the other 75% to missing. Furthermore, we remove one of the two error-prone single-entity disease stages in 100 randomly selected individuals (corresponding to a missing single-entity classification).

# Select 25% of subjects into validation data (i.e., remove true single-entity
# observations for other 75%)
main_stud_ind = sample(1:2000, 1500, replace = FALSE)
z1[main_stud_ind] = NA
z2[main_stud_ind] = NA
# Remove error-prone single-entity disease stage for 50 subjects in entity 1
# and 50 subjects in entity 2
rm_z1star = sample(1:1000, 50, replace = FALSE)
rm_z2star = sample(1001:2000, 50, replace = FALSE)
z1_star_1[rm_z1star] = NA
z1_star_2[rm_z1star] = NA
z2_star_1[rm_z2star] = NA
z2_star_2[rm_z2star] = NA
# Derive error-prone worse-entity outcome for both scenarios
y_star_1 = pmax(z1_star_1, z2_star_1, na.rm = TRUE)
y_star_2 = pmax(z1_star_2, z2_star_2, na.rm = TRUE)

We can now compare the true person-specific worse-entity outcome $$Y$$ (rows) to the error-prone outcomes $$Y^*_1$$ and $$Y^*_2$$ (columns):

# Compare true worse-entity disease y and error-prone versions y_star_1/y_star_2
kable(table(y, y_star_1))

|   |   0|   1|
|:--|---:|---:|
|0  | 661| 359|
|1  |  79| 901|

kable(table(y, y_star_2))

|   |   0|   1|
|:--|---:|---:|
|0  | 640| 380|
|1  |  80| 900|

The empirical person-specific sensitivity and specificity are given by:

# Compare true worse-entity disease y and error-prone versions y_star_1/y_star_2
kable(prop.table(table(y, y_star_1), margin = 1), digits = 2)

|   |    0|    1|
|:--|----:|----:|
|0  | 0.65| 0.35|
|1  | 0.08| 0.92|

kable(prop.table(table(y, y_star_2), margin = 1), digits = 2)

|   |    0|    1|
|:--|----:|----:|
|0  | 0.63| 0.37|
|1  | 0.08| 0.92|

While the person-specific sensitivity and specificity are similar in both scenarios, the specificity of $$Y^*_2$$ is associated with $$X$$, as can be seen by comparing the fraction of false positives across values of $$X$$:

tibble(y, y_star_1, y_star_2, x) %>%
  filter(y == 0) %>%
  group_by(x_cut = cut(x, breaks = seq(-4, 4, by = 1))) %>%
  summarise(fp_y_star_1 = mean(y_star_1 == 1),
            fp_y_star_2 = mean(y_star_2 == 1)) %>%
  pivot_longer(cols = c("fp_y_star_1", "fp_y_star_2")) %>%
  ggplot() +
  geom_bar(aes(x_cut, value, fill = name, group = name), stat = "identity",
           width = .75, position = "dodge", alpha = .5) +
  xlab("X") + ylab("P(false-positive), person-specific") +
  theme_bw()

For $$Y^*_1$$, the specificity, and consequently the fraction of false positives, remains roughly constant around $$0.35$$; for $$Y^*_2$$, the specificity increases with $$X$$, and the probability of a false positive clearly decreases with higher values of $$X$$.

### Estimate logistic regression models

#### Simple logistic regression

To illustrate the effect of ignoring response misclassification in (bilateral) logistic regression, we first estimate standard logistic regressions of the outcomes $$Y$$ (true outcome, typically unobserved in the studies we are concerned with), $$Y^*_1$$ (non-differential misclassification), and $$Y^*_2$$ (differential misclassification) on $$X$$.

logreg_true = glm(y~x, family = "binomial")
logreg_y_star_1 = glm(y_star_1~x, family = "binomial")
logreg_y_star_2 = glm(y_star_2~x, family = "binomial")

The regression of $$Y$$ on $$X$$ yields an unbiased association estimate of $$\sim 1$$ as expected based on the data generating process.

kable(summary(logreg_true)$coef, digits = 2)

|            | Estimate| Std. Error| z value| Pr(>\|z\|)|
|:-----------|--------:|----------:|-------:|----------:|
|(Intercept) |    -0.07|       0.05|   -1.41|       0.16|
|x           |     0.95|       0.06|   16.62|       0.00|

confint(logreg_true)[2,]
#>     2.5 %    97.5 %
#> 0.8389864 1.0629615

The association estimates of $$Y^*_1$$ and $$Y^*_2$$ are both downward biased compared to the true outcome $$Y$$:

kable(summary(logreg_y_star_1)$coef, digits = 2)

|            | Estimate| Std. Error| z value| Pr(>\|z\|)|
|:-----------|--------:|----------:|-------:|----------:|
|(Intercept) |     0.55|       0.05|   11.54|          0|
|x           |     0.51|       0.05|   10.16|          0|

confint(logreg_y_star_1)[2,]
#>     2.5 %    97.5 %
#> 0.4091556 0.6043304

kable(summary(logreg_y_star_2)$coef, digits = 2)

|            | Estimate| Std. Error| z value| Pr(>\|z\|)|
|:-----------|--------:|----------:|-------:|----------:|
|(Intercept) |     0.58|       0.05|   12.32|          0|
|x           |     0.28|       0.05|    5.88|          0|

confint(logreg_y_star_2)[2,]
#>     2.5 %    97.5 %
#> 0.1860043 0.3714191

The bias in $$Y^*_2$$ is larger than in $$Y^*_1$$ due to the differential misclassification, even though the average sensitivity and specificity of $$Y^*_1$$ and $$Y^*_2$$ are similar. The larger number of false positives for low values of $$X$$ conceals the true, positive association of $$X$$ and $$Y$$, in addition to the attenuation caused by the non-differential misclassification in $$Y_1^*$$.
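The attenuation mechanism can be illustrated with a simplified person-level sketch (illustrative only, not package code): with a fixed sensitivity and specificity, the observed outcome follows $$P(Y^*=1|X) = \text{sens} \cdot P(Y=1|X) + (1-\text{spec}) \cdot (1-P(Y=1|X))$$, which flattens the logistic curve and shrinks the estimated log-OR toward zero.

```r
# Illustrative sketch: attenuation under non-differential misclassification,
# assuming (for simplicity) a person-level sensitivity and specificity of 0.8
set.seed(1)
n <- 1e5
x <- rnorm(n)
p_true <- plogis(x)                          # true P(Y = 1 | X), logOR = 1
p_obs  <- 0.8 * p_true + 0.2 * (1 - p_true)  # P(Y* = 1 | X)
y_star <- rbinom(n, 1, p_obs)
fit <- glm(y_star ~ x, family = "binomial")
coef(fit)[["x"]]  # clearly attenuated below the true log-OR of 1
```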

### Developed MLA

We developed a maximum likelihood approach (MLA) to adjust for entity-specific response misclassification in bilateral disease data. It is implemented in the mcblog::est_mcblog function, which takes the following arguments:

• z1_star/z2_star: vectors of observed binary, error-prone single-entity outcomes of length $$N$$; NA if missing in one of the two entities
• z1/z2: vectors of observed binary gold-standard single-entity outcomes of length $$N$$ where available (validation data). Should be NA for each subject that is not part of the validation data or whose disease stage is missing in one of the two entities
• formula/formula_delta/formula_sens/formula_spec: formulas specifying the models for the worse-entity outcome, the probability of disease in both entities given disease in at least one, the sensitivity (probability of a true-positive single-entity classification), and the specificity (probability of a true-negative single-entity classification). All probabilities are modelled via the logit link (corresponding to standard logistic regression for the worse-entity outcome in the case of no misclassification)
• data: a dataframe with $$N$$ rows containing all covariates specified in the formulas

#### MLA1 (assuming constant misclassification probabilities)

In a first step, we estimate the maximum likelihood approach assuming constant misclassification probabilities (and a constant $$\delta$$) for both misclassification scenarios.

library(mcblog)
# Estimate MC-adjusted logistic regression assuming constant
# delta, sens, and spec estimated from validation data
# Scenario 1
mla1_y_star_1 = est_mcblog(z1_star = z1_star_1,
z2_star = z2_star_1,
z1 = z1,
z2 = z2,
formula = ~x,
formula_delta = ~1, #constant delta, sens, spec
formula_sens = ~1,
formula_spec = ~1,
data = data.frame(x))
# There exists also an mcblog:::summary.blogmc method, but for compilation it is better
# to directly access the estimated coefficients in the blogmc objects (first list entry)
kable(mla1_y_star_1[[1]], digits = 2)

|                  |  beta|   se| t_val|    p|
|:-----------------|-----:|----:|-----:|----:|
|y_(Intercept)     | -0.05| 0.08| -0.69| 0.49|
|y_x               |  0.97| 0.08| 11.86| 0.00|
|delta_(Intercept) |  1.22| 0.14|  8.70| 0.00|
|sens_(Intercept)  |  1.40| 0.09| 14.85| 0.00|
|spec_(Intercept)  |  1.42| 0.08| 18.10| 0.00|

The MLA yields coefficient estimates for all four parts of the model, estimated on the logit scale. The estimated association of $$X$$ and worse-entity disease $$Y$$ is given by $$\hat{\beta}_{y\_x}=0.97$$ and is unbiased, but has a larger associated standard error compared to logistic regression of the true $$Y$$ on $$X$$. To interpret the estimated coefficients for $$\delta$$, the sensitivity, and the specificity, the estimated intercepts have to be transformed from the logit scale to probabilities. This yields $$\hat{\delta}=\text{logit}^{-1}(1.22)=0.77$$, $$\widehat{\text{sens}}=0.80$$, $$\widehat{\text{spec}}=0.81$$, corresponding closely to the true parameters of the data generating process.
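The back-transformation from the logit scale can be done with plogis (the inverse logit); plugging in the estimated intercepts from the output above:

```r
# Transform estimated intercepts from the logit scale to probabilities
plogis(1.22)  # delta, approx. 0.77
plogis(1.40)  # sensitivity, approx. 0.80
plogis(1.42)  # specificity, approx. 0.81
```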

# Scenario 2
mla1_y_star_2 = est_mcblog(z1_star = z1_star_2,
z2_star = z2_star_2,
z1 = z1,
z2 = z2,
formula = ~x,
formula_delta = ~1,
formula_sens = ~1,
formula_spec = ~1,
data = data.frame(x))
kable(mla1_y_star_2[[1]], digits = 2)

|                  |  beta|   se| t_val|    p|
|:-----------------|-----:|----:|-----:|----:|
|y_(Intercept)     | -0.15| 0.08| -1.88| 0.06|
|y_x               |  0.77| 0.08|  9.81| 0.00|
|delta_(Intercept) |  1.21| 0.14|  8.54| 0.00|
|sens_(Intercept)  |  1.44| 0.10| 14.47| 0.00|
|spec_(Intercept)  |  1.25| 0.07| 17.03| 0.00|

In misclassification scenario 2, the estimated sensitivity $$\widehat{\text{sens}}=0.81$$ and specificity $$\widehat{\text{spec}}=0.78$$ correspond to the average entity-specific sensitivity/specificity in the data, but the estimated association of $$X$$ and worse-entity disease $$Y$$, $$\hat{\beta}_{y\_x}=0.77$$, is still downward biased since the association of $$X$$ with the specificity is left unaccounted for.
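That $$\widehat{\text{spec}}=0.78$$ roughly matches the average specificity among truly negative single entities can be checked against the data generating process (a rough sanity check under the generating model from above, where $$P(Z_l=0|X) = (1-P(Y=1|X)) + P(Y=1|X)(1-\delta)/2$$):

```r
# Sanity check: average specificity among truly negative single entities,
# weighting P(Z*_l = 0 | Z_l = 0, X) = plogis(1.5 + 0.5 x) by P(Z_l = 0 | X)
set.seed(1)
x <- rnorm(1e5)
spec_x <- plogis(1.5 + 0.5 * x)
p_z0   <- (1 - plogis(x)) + plogis(x) * (1 - 0.75) / 2
weighted.mean(spec_x, w = p_z0)  # approx. 0.78
```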

#### MLA2 (assuming association of misclassification probabilities with X)

We now estimate the MLA allowing for an association of the sensitivity/specificity with covariate $$X$$.

# Estimate MC-adjusted logistic regression assuming varying
# sens, and spec estimated from validation data and constant delta
# Scenario 1
mla2_y_star_1 = est_mcblog(z1_star = z1_star_1,
z2_star = z2_star_1,
z1 = z1,
z2 = z2,
formula = ~x,
formula_delta = ~1,
formula_sens = ~x,
formula_spec = ~x,
data = data.frame(x))
kable(mla2_y_star_1[[1]], digits = 2)

|                  |  beta|   se| t_val|    p|
|:-----------------|-----:|----:|-----:|----:|
|y_(Intercept)     | -0.03| 0.08| -0.44| 0.66|
|y_x               |  0.96| 0.09| 10.41| 0.00|
|delta_(Intercept) |  1.22| 0.14|  8.70| 0.00|
|sens_(Intercept)  |  1.34| 0.10| 13.22| 0.00|
|sens_x            |  0.12| 0.10|  1.22| 0.22|
|spec_(Intercept)  |  1.46| 0.09| 16.09| 0.00|
|spec_x            |  0.08| 0.08|  1.03| 0.30|

In misclassification scenario 1, no strong evidence was found for an association of the sensitivity/specificity with $$X$$ (as expected, given no such association in the data generating process). The estimated association of $$X$$ with the worse-entity disease $$Y$$ is unbiased with $$\hat{\beta}_{y\_x}=0.96$$, but has a larger standard error compared to MLA1 (0.09 vs. 0.08) due to the more complex model.

# Scenario 2
mla2_y_star_2 = est_mcblog(z1_star = z1_star_2,
z2_star = z2_star_2,
z1 = z1,
z2 = z2,
formula = ~x,
formula_delta = ~1,
formula_sens = ~x,
formula_spec = ~x,
data = data.frame(x))
kable(mla2_y_star_2[[1]], digits = 2)

|                  |  beta|   se| t_val|    p|
|:-----------------|-----:|----:|-----:|----:|
|y_(Intercept)     | -0.11| 0.08| -1.38| 0.17|
|y_x               |  1.01| 0.09| 11.23| 0.00|
|delta_(Intercept) |  1.19| 0.14|  8.61| 0.00|
|sens_(Intercept)  |  1.42| 0.11| 13.01| 0.00|
|sens_x            |  0.03| 0.10|  0.28| 0.78|
|spec_(Intercept)  |  1.53| 0.09| 16.11| 0.00|
|spec_x            |  0.53| 0.08|  6.71| 0.00|

In scenario 2, we successfully detect the association of $$X$$ with the specificity and obtain an unbiased estimate of $$\hat{\beta}_{y\_x}=1.01$$.

1. Guenther, F., Brandl, C., Winkler, T. W., Wanner, V., Stark, K., Küchenhoff, H., & Heid, I. M. (2020). Chances and challenges of machine learning based disease classification in genetic association studies illustrated on age-related macular degeneration. Genetic Epidemiology.