class: center, middle, inverse, title-slide .title[ # PSY 504: Advanced Statistics ] .subtitle[ ## Robust Methods or F**K My Data Violates Assumptions ] .author[ ### Jason Geller, Ph.D. (he/him/his) ] .institute[ ### Princeton University ] .date[ ### Updated:2023-02-07 ] --- # Today - Adding more tools to your toolbox - Trimming - Winsorizing - **Bootstrapping** - **Permutation Tests** - How to do these things in R --- # The Problem - Commonly used statistical tests make assumptions - No outliers -- - Normally distributed (residuals) -- - Homogeneity/Homoscedasticity --- # Population Paramter Estimates We know: - Best estimate of a population mean `$$\mu = \bar X$$` - The sd of sampling distribution of means is the standard error of mean `$$SE_{mean} = \frac{s_x}{\sqrt(n)}$$` - We can use this to generate 95% CIs `$$\bar X +- t_{\alpha}(s_{\bar X})$$` --- # Bias - Extreme scores and non-normality can bias our results - Expected value > or < population parameter - Bias population parameters: - `\(\mu\)` - `\(\sigma_2\)` - `\(SE_M\)` --- # Example <br> <br> .pull-left[ <img src="images/flu_study.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <div class="figure" style="text-align: center"> <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-3-1.png" alt="Carpenter and Geller (2019). Foreign language learning with pictures vs. word translations" width="60%" /> <p class="caption">Carpenter and Geller (2019). Foreign language learning with pictures vs. word translations</p> </div> ] --- <br> <br> <br> .pull-left[ ```r # visualize the distribution of the continuous variable for each group using a boxplot with jittered raw data points # and highlighting the outlier with a red square rt_fig <- rts_wide %>% ggplot(aes(name, value, colour = name, fill = name)) + stat_slab(trim = FALSE, colour = "white", position = position_nudge(y = 0.3), scale = 0.5) + geom_boxplot(width = 0.1, fill = "white", colour = "black", size = 0.75, outlier.colour = NA, position = position_nudge(y = 0.3)) + geom_point(shape = 1, stroke = 1, size = 2.5, alpha = 0.5, position = position_jitter(height = 0.1), show.legend = FALSE) + labs(x = "Condition", y = "RTs") + guides(colour = "none", fill = "none") + theme(axis.title.y = element_blank()) + labs(x = "Condition", y = "RTs") ``` ] .pull-right[ <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # What can we do? -- - Transforming data? -- - Trimming the data? -- - Winsorizing? -- - Bootstrapping? -- - Permutation tests? --- # Trimming Data - Order data - Remove highest and lowest 5%, 10%, or 20% - Can use base `mean` function with trim argument `mean(x, trim=.2)` <img src="images/trim.jpg" width="90%" style="display: block; margin: auto;" /> --- # Back to our example - What happens if we trim 20%? ```r rts_wide_trim <- rts_wide %>% group_by(name) %>% summarise(mean=round(mean(value), 0), trim_20=round(mean(value, trim=.2), 0)) as.data.frame(rts_wide_trim, scientific = FALSE) ```
name
mean
trim_20
RTPicMean
8.48e+03
7.11e+03
RTWordMean
7.93e+03
6.93e+03
--- <br> <br> .pull-left[ <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-9-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Winsorizing - Instead of removing points, we can replace them with nearest score <img src="images/winsorize.jpg" width="100%" style="display: block; margin: auto;" /> --- # Winsorizing ```r library(datawizard) rts_wide_table <- rts_wide %>% group_by(name) %>% summarise(win=datawizard::winsorize(value)) knitr::kable(head(rts_wide_table), format = 'html') ``` <table> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:right;"> win </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> RTPicMean </td> <td style="text-align:right;"> 3797.62 </td> </tr> <tr> <td style="text-align:left;"> RTPicMean </td> <td style="text-align:right;"> 3797.62 </td> </tr> <tr> <td style="text-align:left;"> RTPicMean </td> <td style="text-align:right;"> 8029.71 </td> </tr> <tr> <td style="text-align:left;"> RTPicMean </td> <td style="text-align:right;"> 7186.90 </td> </tr> <tr> <td style="text-align:left;"> RTPicMean </td> <td style="text-align:right;"> 6852.52 </td> </tr> <tr> <td style="text-align:left;"> RTPicMean </td> <td style="text-align:right;"> 12521.29 </td> </tr> </tbody> </table> --- # Winsorized Correlations .pull-left[ ```r library(ggplot2) # Create a dataset with two variables that are not correlated set.seed(0) x <- rnorm(100) y <- rnorm(100) # Add an outlier to the dataset x <- c(x, 3) y <- c(y, 3) ``` ] .pull-right[ <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-13-1.png" width="100%" /> ] --- # Winsorized Correlations ```r #winsorized correlation 20% #easystats correlation package correlation::correlation(df) ```
Parameter1
Parameter2
r
CI
CI_low
CI_high
t
df_error
p
Method
n_Obs
x
y
0.211
0.95
0.016
0.39
2.15
99
0.0343
Pearson correlation
101
```r correlation::correlation(df, winsorize = .2) ```
Parameter1
Parameter2
r
CI
CI_low
CI_high
t
df_error
p
Method
n_Obs
x
y
0.144
0.95
-0.0527
0.33
1.45
99
0.15
Winsorized Pearson correlation
101
--- class: middle ## Always report robust correlations along with Pearson's *r* --- # Heteroskedasticity .pull-left[ - Trimming and winsorizing are good options for non-normal data with outliers - Not very good if error is not constant - **Linear modeling assumes predictor not correlated with residuals** ] .pull-right[ <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Fit a lm model `$$sav_i = \beta_0 + \beta_1 inc_i + \epsilon_i$$` ```r library(broom) # Estimate the model model <- lm(sav ~ inc, data = saving) tidy(model) ```
term
estimate
std.error
statistic
p.value
(Intercept)
316
462
0.684
0.496
inc
0.141
0.0467
3.01
0.00361
```r # easystats check_heteroscedasticity(model) ``` ``` ## Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.010). ``` --- # Heteroskedasticity - We can use robust SEs - They work by changing the covariance matrix (the diagonal is the variance) `$$\boldsymbol{\sigma^2_B} = \begin{bmatrix}\mathrm{Var}(b_0) & \mathrm{Cov}(b_0,b_1) \\ \mathrm{Cov}(b_0,b_1) & \mathrm{Var}(b_1) \end{bmatrix}$$` --- # Heteroskedasticity Consistent Estimator - Downweights large residuals <img src="images/HC3-robust-standard-errors.png" width="50%" style="display: block; margin: auto;" /> --- # Getting Robust SEs in R ```r library(easystats) # model_paramters function #fit model first then read into function mp <- model_parameters(model, vcov = "HC3") mp ```
Parameter
Coefficient
SE
CI
CI_low
CI_high
t
df_error
p
(Intercept)
316
443
0.95
-567
1.2e+03
0.713
73
0.478
inc
0.141
0.0525
0.95
0.0359
0.245
2.68
73
0.00916
--- # Getting Robust SEs in R - Fit a model that includes robust estimation ```r #install.packages(c('estimatr')) library(estimatr) m1 <- lm_robust(sav ~ inc, data = saving, se_type = "HC3") tidy(m1) ```
term
estimate
std.error
statistic
p.value
conf.low
conf.high
df
outcome
(Intercept)
316
443
0.713
0.478
-567
1.2e+03
73
sav
inc
0.141
0.0525
2.68
0.00916
0.0359
0.245
73
sav
```r check_heteroscedasticity(m1) ``` ``` ## OK: Error variance appears to be homoscedastic (p > .999). ``` --- class: middle # Hey Teacher, leave those means alone (Bootstrapping and Permutation Tests)! --- # Bootstrap: Overview - What it is - A computer based method for deriving the probability distribution for any random variable - When to use it - You do not know the distribution of your variable(s) - How to do it - Run your analysis a bunch of times with a slightly different set of observations each time --- # Bootstrap: Overview .pull-left[ - Core mechanism: Sampling with replacement - Inference based "only" on collected data - Nonparametric - It is not magic though ] .pull-right[ <img src="images/bootstrap.jpg" width="100%" style="display: block; margin: auto;" /> ] --- # Percentile Bootstrap - Using the bootstrap sample we can calculate 95% CI .pull-left[ 1. Sort values 2. Find the 2.5th and 97.5th percentiles of the distribution of the statistic ] .pull-right[ <img src="images/bootstrap.jpg" width="100%" style="display: block; margin: auto;" /> ] --- # Bias Corrected and Accelerated (Bca) Bootstrap Confidence Intervals - Percentile method good when using robust estimates, but... 1. Biased (too wide or narrow) - When data is not representative of population 2. Skewed (un-trimmed) - **If original sample is skewed the bootstrap will be as well** --- # Bias Corrected and Accelerated (Bca) Bootstrap Confidence Intervals - Bias-correction `$$\begin{align*} \hat{z}_0 &= \Phi^{-1}\Bigg(\frac{1}{B}\sum_i \hat{\theta}^*_i < \hat\theta \Bigg) \end{align*}$$` ```r #$$\phi^-1 (.975)= #$$\hat \theta_{r}$$ = Mean of bootstraps #$$\hat \theta$$ = Mean of sample#$$ B $$ = # of bootstraps theat.hat <- mean(x) z0 <- qnorm(sum(boot1$t < theat.hat)/1000) ``` --- # Bias Corrected and Accelerated (Bca) bootstrap confidence intervals - Skew - Jacknifing `$$\begin{align*} \hat{a} &= \frac{\sum_i (\bar{\theta}^{-i} - \hat{\theta}^{-i} )^3 }{6 \cdot \big[\sum_i (\bar{\theta}^{-i} - \hat{\theta}^{-i} )^2 \big]^{3/2}} \end{align*}$$` ```r #jackknife throws x out and calulates mean then does it again and again jack = jackknife(samp, mean) # jacknife it # Compute the acceleration estimate uu <- mean(jack$jack.values)-jack$jack.values #jackknife mean - jackknife estimates a <- sum(uu*uu*uu) / (6 * sum(uu*uu))^1.5 ``` --- # BCa In R `$$\begin{align*} [\hat\theta_l, \hat\theta_u ] &= [ \hat\theta^*_{\alpha_1}, \hat\theta^*_{\alpha_2} ] \\ \alpha_1 &= \Phi\Bigg(\hat{z}_0 + \frac{\hat{z}_0+z_{\alpha}}{1-\hat{a}(\hat{z}_0+z_{\alpha})}\Bigg) \\ \alpha_2 &= \Phi\Bigg(\hat{z}_0 + \frac{\hat{z}_0+z_{1-\alpha}}{1-\hat{a}(\hat{z}_0+z_{1-\alpha})}\Bigg) \end{align*}$$` ```r #.05 two-tailed alpha = c(0.025, .975) # given an area, find the boundary value that determines this area. zalpha <- qnorm(alpha) tt <- pnorm(z0+ (z0+zalpha)/(1-a*(z0+zalpha))) confint.bca <- quantile(boot1$t, probs=tt) ``` --- class: middle # To be continued... - Bca will make an appearance when we talk about mediation --- # Bootstrapping Pros - You can use it to determine the probability of observing values of variables that come from any distribution - Including times when you have no clue which distribution describes your variable and you don’t even want to guess - Including distributions that are not currently “known” to science --- - The lack of assumptions about a variable’s distribution makes the bootstrapped probability estimates more accurate - Far better than trying to pigeon-hole data into a distribution that doesn’t describe them - Minimizes the influence of outliers without trivializing the inferential value of them (when using robust estimators like bca) --- # Bootstrapping Cons - It can be time-intensive - But mostly not an issue with modern day computers - Note: You can use Princeton HPC servers - RStudio Cloud! - https://researchcomputing.princeton.edu/support/knowledge-base/rrstudio --- # Permutation Tests - Extension of bootstrapping when you want to test hypotheses - Does not rely on classical probability distributions - Find statistic of interest `\(H_0\)` = Drug has no effect `\(H_1\)` = Drug has an effect --- # Permutation Tests - Simulate the null - Randomly reconstitute the two groups, disregarding their original group membership (**EXCHANGABILITY**) - Random without replacement - Do it many times (10,000) - How many times did you get a test statistic as large or larger? - *p*-value --- # Demonstration ```r knitr::include_url("https://www.jwilber.me/permutationtest/") ``` <iframe src="https://www.jwilber.me/permutationtest/" width="100%" height="400px" data-external="1"></iframe> --- # Permutation Test - Pros: - Makes very limited assumptions - Cons: 1. Permutation tests are inefficient compared to asymptotic tests. When you have only a few observations, it might be impossible to control alpha at the level you want and still do a permutation test 2. You have to assume that both samples have the exact same distribution not just the same value for one parameter 3. A permutation test assumes that the only difference between the two groups is the random assignment 4. Permutation tests can be intensive (depending on design) --- class: middle # Bootstrapping in R --- # Bootstrapping - Sampling with replacement ```r library(boot) # package dedicated to bootstrap methods library(simpleboot) # package with wrapper functions to boot, making user's life easier ``` ```r n <- 6 # sample size samp <- 1:n samp ``` ``` ## [1] 1 2 3 4 5 6 ``` ```r set.seed(21) # for reproducible results boot_samp <- sample(samp, size = n, replace = TRUE) # sample with replacement boot_samp ``` ``` ## [1] 1 3 1 2 5 3 ``` --- # Again ```r boot_samp <- sample(samp, size = n, replace = TRUE) # sample with replacement boot_samp ``` ``` ## [1] 3 4 2 6 6 6 ``` --- # Again ```r boot_samp <- sample(samp, size = n, replace = TRUE) # sample with replacement boot_samp ``` ``` ## [1] 3 6 2 3 4 5 ``` --- # Again ```r boot_samp <- sample(samp, size = n, replace = TRUE) # sample with replacement boot_samp ``` ``` ## [1] 6 2 6 1 2 2 ``` --- # Bootstrap Mean Estimates ```r set.seed(666) # define the function that will be used to calculate the mean mean_fun = function(data, indices) { return(mean(data[indices])) #indices to do bootstrapping } # use boot to bootstrap means 1000 times (R) #input is df, function for mean, and # of boot samples results = boot(samp, mean_fun, R=1000) # get means means=results$t results ``` ``` ## ## ORDINARY NONPARAMETRIC BOOTSTRAP ## ## ## Call: ## boot(data = samp, statistic = mean_fun, R = 1000) ## ## ## Bootstrap Statistics : ## original bias std. error ## t1* 3.5 0.0335 0.7032939 ``` ] --- # Plot Histogram <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-33-1.png" width="80%" style="display: block; margin: auto;" /> ] --- # Visualzie Bootstrap <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-34-1.png" width="80%" style="display: block; margin: auto;" /> --- # `Simpleboot` - # of bootstrap functions (regression and t.tests) .pull-left[ ```r set.seed(666) # use to bootstrap mean or another test statistics x=one.boot(samp,mean, R=1000) ``` ] .pull-right[ ```r hist(x) ``` <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Percentile CIs ```r alpha <- 0.05 nboot=1000 bvec <- sort(means) # sort bootstrap means in ascending order # define quantiles low <- round((alpha/2)*nboot) # 25 up <- nboot-low # 975 low <- low+1 # get confidence interval ci <- c(bvec[low],bvec[up]) round(ci, digits = 2) ``` ``` ## [1] 2.17 4.83 ``` --- # Percentile CIs ```r #can use boot.ci to get percentile and bca CIs set.seed(666) results_ci = boot.ci(results, type = "perc", R=10000) results_ci ``` ``` ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 1000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = results, type = "perc", R = 10000) ## ## Intervals : ## Level Percentile ## 95% ( 2.167, 4.833 ) ## Calculations and Intervals on Original Scale ``` --- # BCa CIs ```r #can use boot.ci to get percentile and bca CIs set.seed(666) results_ci_bca = boot.ci(results, type = "bca", R=1000) results_ci_bca ``` ``` ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 1000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = results, type = "bca", R = 1000) ## ## Intervals : ## Level BCa ## 95% ( 2.000, 4.667 ) ## Calculations and Intervals on Original Scale ``` --- # Plot Means and CIs <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-40-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Illustrate 2 groups .pull-left[ ```r # Group 1 n1 <- 50 m <- 400 s <- 50 location <- log(m^2 / sqrt(s^2 + m^2)) shape <- sqrt(log(1 + (s^2 / m^2))) g1 <- rlnorm(n1, location, shape) # Group 2 n2 <- 70 m <- 500 s <- 70 location <- log(m^2 / sqrt(s^2 + m^2)) shape <- sqrt(log(1 + (s^2 / m^2))) g2 <- rlnorm(n2, location, shape) ``` ] .pull-right[ <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-42-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Mean Differences ```r # bootstrap data use function from simpleboot set.seed(1) #two.boot from simpleboot boot.res <- two.boot(g1, g2, FUN=mean, R=nboot, trim=0.2) # get 95% confidence interval (percentile) boot.ci(boot.res, type = "perc") # get percentile bootstrap CI ``` ``` ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 1000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = boot.res, type = "perc") ## ## Intervals : ## Level Percentile ## 95% (-114.79, -65.34 ) ## Calculations and Intervals on Original Scale ``` --- # Challenge - Research Question: - Are smaller pepper hotter than longer peppers? - Pepper Joe measured the length and heat of 85 chillies peppers - DV - Heat of each chili (0-11) - IV - Length of each Chili ```r library(boot) library(data.table) library(simpleboot) chili<- fread("/Users/jasongeller/Documents/03-Robust_Methods/data/chillis.csv") ``` --- # Challenge - Task: - Create a function to run a regression with the `boot` function - Bootstrap a regression predicting chili heat from length - Use 10,000 bootstraps - Calculate BCa CIs --- # Report Bootstrap - We hypothesized that the length of a chilli was related to its hotness. We chose to bootstrap the distributions of model parameters to account for any deviations from normality in the data. We ran 10,000 bootstrap resamples for each parameter with a *random seed of 666* using the *boot package (Canty & Ripley, 2017)* in *R (R Core Team, 2022)*. The estimates for all parameters are provided in Table 1. Relevant to our hypothesis, the median value for the influence of length on hotness was -.15, 95% BCa CIs [-0.27, -0.08]. This interval does not include 0, which, when taken with the sign of the slope, suggests that shorter chillies are hotter. --- class: middle # Permutation Tests --- # Hypothesis Testing - Two Sample ```r g_control <- c(87,90,82,77,71,81,77,79,84,86,78,84,86,69,81,75,70,76,75,93) g_drug <- c(74,67,81,61,64,75,81,81,81,67,72,78,83,85,56,78,77,80,79,74) # our statistic of interest here is the difference between means df <- cbind(g_control,g_drug) %>% as.data.frame() %>% pivot_longer(g_control:g_drug, names_to = "Group", values_to = "Value") # get mean diff mean_diff <- df %>% group_by(Group) %>% summarise(mean_group = mean(Value)) %>% summarise(mean_diff=diff(mean_group)) # this will give you one value the mean diff value ``` --- # One Permutation Example ```r # Example of doing one permutation # use sample function one.perm <- df %>% mutate(perm_treatment = sample(Group, size = n(), replace = FALSE)) one.perm ```
Group
Value
perm_treatment
g_control
87
g_control
g_drug
74
g_control
g_control
90
g_control
g_drug
67
g_control
g_control
82
g_drug
g_drug
81
g_drug
g_control
77
g_control
g_drug
61
g_drug
g_control
71
g_control
g_drug
64
g_control
g_control
81
g_control
g_drug
75
g_drug
g_control
77
g_drug
g_drug
81
g_control
g_control
79
g_control
g_drug
81
g_drug
g_control
84
g_drug
g_drug
81
g_control
g_control
86
g_control
g_drug
67
g_drug
g_control
78
g_drug
g_drug
72
g_drug
g_control
84
g_drug
g_drug
78
g_drug
g_control
86
g_drug
g_drug
83
g_drug
g_control
69
g_control
g_drug
85
g_control
g_control
81
g_drug
g_drug
56
g_control
g_control
75
g_control
g_drug
78
g_drug
g_control
70
g_drug
g_drug
77
g_drug
g_control
76
g_control
g_drug
80
g_drug
g_control
75
g_control
g_drug
79
g_drug
g_control
93
g_control
g_drug
74
g_control
--- # Now Do This Many Times ```r library(infer) sample_size <- nrow(df) # length of dataset perm_reps <- 10000 # number of permutations you want to do many.perm <- df %>% # this function is in the infer package. What it is doing is creating rep_sample_n(size = sample_size, replace = FALSE, reps = perm_reps) %>% mutate(perm_treatment = sample(Group, size = n(), replace = FALSE)) %>% group_by(replicate, perm_treatment) many.perm ``` ``` ## # A tibble: 400,000 × 4 ## # Groups: replicate, perm_treatment [20,000] ## replicate Group Value perm_treatment ## <int> <chr> <dbl> <chr> ## 1 1 g_drug 79 g_drug ## 2 1 g_drug 81 g_control ## 3 1 g_control 69 g_drug ## 4 1 g_control 81 g_drug ## 5 1 g_control 75 g_control ## 6 1 g_drug 74 g_drug ## 7 1 g_drug 75 g_drug ## 8 1 g_control 86 g_control ## 9 1 g_control 78 g_control ## 10 1 g_control 81 g_control ## # … with 399,990 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- ```r many.perm.means <- many.perm %>% summarise(mean_group = mean(Value), .groups = "drop")%>% group_by(replicate) many.perm.means ``` ``` ## # A tibble: 20,000 × 3 ## # Groups: replicate [10,000] ## replicate perm_treatment mean_group ## <int> <chr> <dbl> ## 1 1 g_control 77.0 ## 2 1 g_drug 77.7 ## 3 2 g_control 79.1 ## 4 2 g_drug 75.6 ## 5 3 g_control 78.1 ## 6 3 g_drug 76.6 ## 7 4 g_control 77.4 ## 8 4 g_drug 77.4 ## 9 5 g_control 77.4 ## 10 5 g_drug 77.4 ## # … with 19,990 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # Calcuate Test Statistic .pull-left[ ```r many.perm.diffs <- many.perm.means %>% summarise(diff_value = diff(mean_group)) ``` ] .pull-right[ ```r ggplot(many.perm.diffs, aes(x = diff_value)) + geom_histogram(bins = 32, color = "white")+ labs(title = "Permuted sampling distribution") ``` <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-50-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # *p*-value .pull-left[ ```r many.perm.diffs_1 <- many.perm.diffs %>% mutate(abs_obs_dif = abs(pull(mean_diff)), # diff from our sample abs_perm_dif = abs(diff_value), #diffs from permutations as_or_more_extreme = abs_perm_dif >= abs_obs_dif) # count how many are as or more extreme ``` ```r # get pvalue mean(many.perm.diffs_1$as_or_more_extreme) ``` ``` ## [1] 0.0254 ``` ] .pull-right[ <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-53-1.png" width="100%" style="display: block; margin: auto;" /> ] --- # Let's do this using less code :) - `infer` package (https://infer.netlify.app/) - Step 1 ```r # use df above # take df above specify what you want df_diff <- df %>% specify(Value ~ Group) %>% calculate(stat = "diff in means") ``` --- # Step 2 - Create the null hypothesis ```r null_distn <- df %>% specify(Value ~ Group) %>% hypothesize(null = "independence") %>% generate(reps = 10000, type = "permute") %>% calculate(stat = "diff in means") ``` --- # Step 2 ```r null_distn %>% visualize() +shade_p_value(obs_stat = df_diff, direction = "two-sided") ``` <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-56-1.png" width="50%" style="display: block; margin: auto;" /> --- # Step 3 - Calculate p value
p_value
0.0284
--- # Permutations Base R ```r stat_obs <- mean(g_control)-mean(g_drug) n_perm = 10000 # how many simulated experiments? stat_perm = array(NA, n_perm) # create a list to store our permutation test values g_control_n = length(g_control) # length of g_control var g_drug_n = length(g_drug) # length of g_drug var g_bucket = c(g_control, g_drug) # combine g_control and g_drug g_bucket_n = length(g_bucket) # length of g_bucket var for (i in 1:n_perm) { # reconstitute both groups, ignoring original labels permuted_bucket <- sample(g_bucket,g_bucket_n,replace=FALSE) # sample both perm_control <- permuted_bucket[1:g_control_n] perm_drug <- permuted_bucket[(g_control_n+1):(g_control_n+g_drug_n)] stat_perm[i] <- mean(perm_control) - mean(perm_drug) } ``` --- # Histogram ```r # visualize the empirical permutation distribution of our statistic of interest x=hist(stat_perm, 50, xlab="mean(control) - mean(drug)", main="Permutation Test") abline(v=stat_obs, col="red", lwd=2) abline(v=-stat_obs, col="red", lwd=2) ``` <img src="02_Robust-Methods_files/figure-html/unnamed-chunk-59-1.png" width="50%" style="display: block; margin: auto;" /> ```r # how many times in the permutation tests did we observe a stat_perm bigger than or smaller than the stat_obs? p_perm0 <- length(which(stat_perm >= abs(stat_obs))) / n_perm p_perm1 <- length(which(stat_perm <= -1*abs(stat_obs))) / n_perm p_perm2 <- p_perm0 + p_perm1 ``` --- # `Permuco` R Package - No need to calculate by hand! ```r library(permuco) #lmperm() # linear models #aovperm() # anova models #perm.lmer() # lmms d <- data.frame(control=g_control, drug=g_drug) %>% pivot_longer(control:drug, names_to = "group") permuco::lmperm(value~group,data=d,np=10000) ``` ``` ## Table of marginal t-test of the betas ## Resampling test using freedman_lane to handle nuisance variables and 10000 permutations. ## Estimate Std. Error t value parametric Pr(>|t|) resampled Pr(<t) ## (Intercept) 80.05 1.624 49.283 4.762e-36 ## groupdrug -5.35 2.297 -2.329 2.528e-02 0.0122 ## resampled Pr(>t) resampled Pr(>|t|) ## (Intercept) ## groupdrug 0.9898 0.0251 ``` --- # Summary - Robust methods provide a useful tool for handling data that deviates from assumptions of normality and homoscedasticity - Robust methods can be more reliable and accurate than traditional methods when dealing with data that is highly skewed or has outliers - Despite their strengths, robust methods are not a panacea and can still be affected by certain types of departures from normality