class: center, middle, inverse, title-slide .title[ # PSY 504: Advanced Statistics ] .subtitle[ ## Missing Data in R ] .author[ ### Jason Geller, Ph.D. (he/him/his) ] .institute[ ### Princeton University ] .date[ ### Updated:2023-02-13 ] --- # Today - Screening data for missingness - Missingness patterns - Diagnosing missing data mechanisms in R - Missing data methods in R - Listwise deletion - Casewise deletion - Nonconditional and conditional imputation - Multiple imputation --- # Packages Install the **mice** package ```r install.packages("mice") ``` Load these packages: ```r library(tidyverse) library(broom) #tidy statistics library(ggmice) #graph missing data library(mice) # dealing and visualizing missing data library(naniar) # missing data + visualization library(finalfit)# missing data visualization ``` --- # `nhanes` Data .pull-left[
age
bmi
hyp
chl
1
2
22.7
1
187
1
1
187
3
1
20.4
1
113
3
184
] .pull-right[ - `age`: age group (1=20-39, 2=40-59, 3=60+) - `bmi`: body mass index (kg/m**2) - `hyp`: hypertensive (1=no,2=yes) - `chl`: total serum cholesterol (mg/dL) ] --- # Exploratory Data Analysis (EDA) - Always look at your data - Look for errors - Explore data using descriptive statistics and graphs - What variables have missing data? How much is missing in total? By variable? --- # EDA ```r #library(naniar) vis_miss(nhanes) ``` <img src="missing_data_in_r_files/figure-html/unnamed-chunk-5-1.png" width="70%" style="display: block; margin: auto;" /> --- # Missing Patterns ```r #library(mice) md.pattern(nhanes, plot=FALSE) ``` ``` ## age hyp bmi chl ## 13 1 1 1 1 0 ## 3 1 1 1 0 1 ## 1 1 1 0 1 1 ## 1 1 0 0 1 2 ## 7 1 0 0 0 3 ## 0 8 9 10 27 ``` --- # Missing Patterns .pull-left[ ```r #md.pattern will give you a table nhanes %>% # create missing data pattern plot plot_pattern() ``` <img src="missing_data_in_r_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" /> ] <br> <br> .pull-right[ <img src="missing_patterns.png" width="100%" style="display: block; margin: auto;" /> ] --- # Missing Patterns .pull-left[ - Univariate: one variable with missing data - Monotone: patterns in the data can be arranged - Associated with a longitudinal studies where members drop out and never return - Non-monotone: missingness of one variable does not affect the missingness of any other variables - Look for islands of missingness ] .pull-right[ <img src="new_patterns.webp" width="100%" style="display: block; margin: auto;" /> ] --- background-image: url(cake.jpeg) background-position: center background-size: 40% --- # Is it MCAR or MAR? – Create a dummy coded variable where 1 = score missing and 0 = score not missing on missing variable - If these variables are related to other variables in dataset - MAR ```r nhanes_r <- nhanes %>% #can also use case_when #if missing 1 else 0 mutate(bmi_1 = ifelse(is.na(bmi), 1, 0)) head(nhanes_r) ```
age
bmi
hyp
chl
bmi_1
1
1
2
22.7
1
187
0
1
1
187
1
3
1
1
20.4
1
113
0
3
184
1
```r #library(naniar) head(bind_shadow(nhanes)) ```
age
bmi
hyp
chl
age_NA
bmi_NA
hyp_NA
chl_NA
1
!NA
NA
NA
NA
2
22.7
1
187
!NA
!NA
!NA
!NA
1
1
187
!NA
NA
!NA
!NA
3
!NA
NA
NA
NA
1
20.4
1
113
!NA
!NA
!NA
!NA
3
184
!NA
NA
NA
!NA
--- # Is it MCAR or MAR? .pull-left[ ```r library(finalfit) explanatory = c("hyp", "age") dependent = "bmi" misspairs <- nhanes %>% missing_pairs(explanatory, dependent) # mice function visualize data ``` ] .pull-right[ <img src="missing_data_in_r_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Categorical Missingness - `nhanes2` treats `age` and `hyp` as factors <img src="missing_data_in_r_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> --- # Testing - Logistic regression approach ```r #glm approch nhanes_r_glm <-glm(bmi_1~age+hyp, family=binomial,data=nhanes_r) #easystats package to get paramters from glm model in a tidy way parameters::model_parameters(nhanes_r_glm) ```
Parameter
Coefficient
SE
CI
CI_low
CI_high
z
df_error
p
(Intercept)
16.8
1.34e+04
0.95
-1.75e+03
0.00126
Inf
0.999
age
-18.3
8.57e+03
0.95
1.77e+03
-0.00213
Inf
0.998
hyp
-0.484
1.59e+04
0.95
-3.05e-05
Inf
1
--- # Testing - *t*-test ```r model <- t.test(age~bmi_1,data=nhanes_r) tidy(model) ```
estimate
estimate1
estimate2
statistic
p.value
parameter
conf.low
conf.high
method
alternative
0.146
1.81
1.67
0.41
0.688
16.2
-0.608
0.9
Welch Two Sample t-test
two.sided
--- class: middle # Methods for dealing with MCAR --- # Listwise Deletion .pull-left[ ```r #create data frame nhanes ```
age
bmi
hyp
chl
1
2
22.7
1
187
1
1
187
3
1
20.4
1
113
3
184
1
22.5
1
118
1
30.1
1
187
2
22
1
238
2
1
2
3
21.7
1
206
2
28.7
2
204
1
29.6
1
1
3
27.2
2
284
2
26.3
2
199
1
35.3
1
218
3
25.5
2
1
1
33.2
1
229
1
27.5
1
131
3
24.9
1
2
27.4
1
186
] .pull-right[ ```r nhanes %>% drop_na() ```
age
bmi
hyp
chl
2
22.7
1
187
1
20.4
1
113
1
22.5
1
118
1
30.1
1
187
2
22
1
238
3
21.7
1
206
2
28.7
2
204
3
27.2
2
284
2
26.3
2
199
1
35.3
1
218
1
33.2
1
229
1
27.5
1
131
2
27.4
1
186
] --- # Listwise Deletion: Pros and Cons - Pros: - Produces the correct parameter estimates if missingness is MCAR - If not, biased - Cons: - Can result in a lot of data loss --- # Casewise (Pairwise) Deletion - In each comparison, delete only observations if the missing data is relevant to this comparison ```r #create data frame nhanes ```
age
bmi
hyp
chl
1
2
22.7
1
187
1
1
187
3
1
20.4
1
113
3
184
1
22.5
1
118
1
30.1
1
187
2
22
1
238
2
1
2
3
21.7
1
206
2
28.7
2
204
1
29.6
1
1
3
27.2
2
284
2
26.3
2
199
1
35.3
1
218
3
25.5
2
1
1
33.2
1
229
1
27.5
1
131
3
24.9
1
2
27.4
1
186
--- # Casewise Deletion: Pros and Cons Pros: - Avoids data loss - Non-biased - Only for MCAR Cons: - But, results not completely consistent or comparable--based on different observations --- class: middle # Methods for MAR --- # Unconditional (Mean) Imputation - Replace missing values with the mean of the observed values - Reduces variance - Increases Type 1 error rate .pull-left[ ```r d <- c(5, 8, 3, NA, NA) #calc mean remove NAs d_mean <- mean(d, na.rm = TRUE) mean(d_mean) ``` ``` ## [1] 5.333333 ``` ```r sd(d, na.rm=TRUE) ``` ``` ## [1] 2.516611 ``` ] .pull-right[ ```r d_mean_imp <- ifelse(is.na(d), d_mean, d) # add mean sd(d_mean_imp) ``` ``` ## [1] 1.779513 ``` ```r #NANIRE Package #impute_mean(d) %>% head() # Do this in Mice #complete(mice(data, m=1, method="mean")) ``` ] --- class: center middle # DO NOT DO THIS! --- # Conditional Imputation (Regression) .pull-left[ - Use the Y to replace the variable - All the other related variables in the data set are used to predict the values of the variable with missing data - Missing scores have the predicted values provided to replace them ```r library(mice) library(ggmice) m=1 imp <- mice(nhanes, m = m, seed = 24415, method="norm.predict", maxit=1) ``` ``` ## ## iter imp variable ## 1 1 bmi hyp chl ``` ] .pull-right[ <div class="figure" style="text-align: center"> <img src="reg_imp.png" alt="Baraldi & Enders (2010)" width="100%" /> <p class="caption">Baraldi & Enders (2010)</p> </div> ] --- # Stochastic Regression - Regression imputation with added error variance to predicted values .pull-left[ ```r m=1 imp <- mice(nhanes, m = m, seed = 24415, method="norm.nob", maxit=1) ``` ``` ## ## iter imp variable ## 1 1 bmi hyp chl ``` ] .pull-right[ <div class="figure" style="text-align: center"> <img src="stoch_reg.png" alt="Baraldi & Enders (2010)" width="100%" /> <p class="caption">Baraldi & Enders (2010)</p> </div> ] --- # Multiple Imputation - Basically doing conditional imputation several times 1. We make several multiply imputed data sets with the `mice()` function 2. We fit our model of choice to each version of the data with the `with()` function 3. We then pool (i.e., combine) the results with the `pool()` function. <img src="MIsteps.png" width="80%" style="display: block; margin: auto;" /> --- # 1. Impute with `Mice` ```r m=5 # impute several data sets imp <- mice(nhanes, m = m, seed = 24415, method="pmm", print = FALSE) ``` - What is `imp`? ```r str(imp, max.level = 1) ``` ``` ## List of 22 ## $ data :'data.frame': 25 obs. of 4 variables: ## $ imp :List of 4 ## $ m : num 5 ## $ where : logi [1:25, 1:4] FALSE FALSE FALSE FALSE FALSE FALSE ... ## ..- attr(*, "dimnames")=List of 2 ## $ blocks :List of 4 ## ..- attr(*, "calltype")= Named chr [1:4] "type" "type" "type" "type" ## .. ..- attr(*, "names")= chr [1:4] "age" "bmi" "hyp" "chl" ## $ call : language mice(data = nhanes, m = m, method = "pmm", printFlag = FALSE, seed = 24415) ## $ nmis : Named int [1:4] 0 9 8 10 ## ..- attr(*, "names")= chr [1:4] "age" "bmi" "hyp" "chl" ## $ method : Named chr [1:4] "" "pmm" "pmm" "pmm" ## ..- attr(*, "names")= chr [1:4] "age" "bmi" "hyp" "chl" ## $ predictorMatrix: num [1:4, 1:4] 0 1 1 1 1 0 1 1 1 1 ... ## ..- attr(*, "dimnames")=List of 2 ## $ visitSequence : chr [1:4] "age" "bmi" "hyp" "chl" ## $ formulas :List of 4 ## $ post : Named chr [1:4] "" "" "" "" ## ..- attr(*, "names")= chr [1:4] "age" "bmi" "hyp" "chl" ## $ blots :List of 4 ## $ ignore : logi [1:25] FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ seed : num 24415 ## $ iteration : num 5 ## $ lastSeedValue : int [1:626] 10403 409 -1884349714 -394216038 1011282966 -909482523 1913636358 -1791339674 1393248822 -1412788497 ... ## $ chainMean : num [1:4, 1:5, 1:5] NaN 27.31 1.25 178.1 NaN ... ## ..- attr(*, "dimnames")=List of 3 ## $ chainVar : num [1:4, 1:5, 1:5] NA 35.271 0.214 1503.211 NA ... ## ..- attr(*, "dimnames")=List of 3 ## $ loggedEvents : NULL ## $ version :Classes 'package_version', 'numeric_version' hidden list of 1 ## $ date : Date[1:1], format: "2023-02-13" ## - attr(*, "class")= chr "mids" ``` --- # 1. Impute with `Mice` What is `imp` within `imp`? ```r str(imp$imp, max.level = 1) ``` ``` ## List of 4 ## $ age:'data.frame': 0 obs. of 5 variables: ## $ bmi:'data.frame': 9 obs. of 5 variables: ## $ hyp:'data.frame': 8 obs. of 5 variables: ## $ chl:'data.frame': 10 obs. of 5 variables: ``` ```r #get the imputed values for that var imp$imp$bmi ```
1
2
3
4
5
22.7
27.4
22
27.2
27.2
27.5
26.3
26.3
33.2
28.7
26.3
27.5
21.7
30.1
22.5
26.3
22.5
25.5
22.5
27.2
29.6
22
22
27.2
29.6
22
30.1
22.5
35.3
27.2
27.2
20.4
25.5
20.4
24.9
22
29.6
27.4
33.2
29.6
29.6
35.3
28.7
33.2
22
```r #get the imputed datasets out #complete(imp, "full") ``` --- # PMM .pull-left[ - Predictive mean matching - PMM involves selecting a data point from the original, non-missing data which has a predicted value close to the predicted value of the missing sample. ] <br> <br> .pull-right[ <img src="PMM.PNG" width="100%" style="display: block; margin: auto;" /> ] --- # 2. Model with `Mice` We'll fit a simple statistical model ```r #fit the model to each set of imputaed data fit <- with(data = imp, expr = lm(bmi ~ age)) summary(fit) ```
term
estimate
std.error
statistic
p.value
nobs
(Intercept)
27.7
1.81
15.3
1.56e-13
25
age
-0.757
0.934
-0.81
0.426
25
(Intercept)
30.7
1.87
16.4
3.34e-14
25
age
-2.3
0.963
-2.39
0.0254
25
(Intercept)
28.4
1.75
16.2
4.25e-14
25
age
-1.44
0.901
-1.6
0.124
25
(Intercept)
32.2
2
16.1
5.3e-14
25
age
-2.65
1.03
-2.57
0.0172
25
(Intercept)
29.2
1.7
17.2
1.31e-14
25
age
-1.51
0.877
-1.72
0.0992
25
--- # 3. `Mice` Pool Results ```r #combine the results result <- pool(fit) tidy(result) ```
term
estimate
std.error
statistic
p.value
b
df
dfcom
fmi
lambda
m
riv
ubar
(Intercept)
29.6
2.7
11
4.72e-05
3.27
5.7
23
0.646
0.54
5
1.17
3.34
age
-1.73
1.25
-1.38
0.205
0.563
7.72
23
0.538
0.432
5
0.76
0.889
--- # Plot Imputations - Make sure they look similar to real data ```r # create stripplot with boxplot overlay ggmice(imp, aes(x = .imp, y = bmi)) + geom_jitter(height = 0) + geom_boxplot(fill = "white", alpha = 0.75, outlier.shape = NA) + labs(x = "Imputation number") ``` <img src="missing_data_in_r_files/figure-html/unnamed-chunk-33-1.png" width="60%" style="display: block; margin: auto;" /> --- # Distinguish Between NMAR and MAR - Pray you don't have to 😂 - It's complicated - Not many good techniques - Try and track down the missing data - Auxiliary variables - Collect more data for explaining missingness --- # Reporting Missing Data - Template from [Stepf van Buuren](https://stefvanbuuren.name/fimd/sec-reporting.html) The *percentage of missing values* across the nine variables varied between *0 and 34%*. In total *1601 out of 3801 records (42%)* were incomplete. Many girls had no score because the nurse felt that the measurement was “unnecessary,” or because the girl did not give permission. Older girls had many more missing data. We used *multiple imputation* to create and analyze *40 multiply imputed datasets*. Methodologists currently regard multiple imputation as a state-of-the-art technique because it improves accuracy and statistical power relative to other missing data techniques. *Incomplete variables were imputed under fully conditional specification, using the default settings of the mice 3.0 package (Van Buuren and Groothuis-Oudshoorn 2011)*. The parameters of substantive interest were estimated in each imputed dataset separately, and combined using Rubin’s rules. For comparison, *we also performed the analysis on the subset of complete cases.* --- # Report - *Amount of missing data* - *Reasons for missingness* - *Consequences* - *Method* - *Imputation model* - *Pooling* - *Software* - *Complete-case analysis* --- #Is it MCAR, MAR, NMAR? The post-experiment manipulation-check questionnaires for five participants were accidentally thrown away. -- - MCAR --- # Is it MCAR, MAR, NMAR? In a 2-day memory experiment, people who know they would do poorly on the memory test are discouraged and don’t want to return for the second session. -- - NMAR --- # Is it MCAR, MAR, NMAR? There was a problem with one of the auditory stimulus files in an ERP study of speech comprehension during noise, so we discarded any data from item #43. -- - MAR --- <br> <br> - Make sure you can run through all the code - Read Chapter 18 in *R in Action* - Missing lab on Wednesday