Introduction to Clustering (with R)

Princeton University

Jason Geller, Ph.D.

4/18/23

Clustering

  • Unsupervised learning

    • No labels or correct answers

    • Goal: find structure

      • Clustering or Dimensionality Reduction

(Chelsea Parlett-Pelleriti)

Goal of clustering?

  • Given a set of data points, each described by a set of attributes, find clusters such that:

    • Intra-cluster similarity is maximized

    • Inter-cluster similarity is minimized

Two types of clustering

  • Hierarchical (agglomerative): Create a hierarchical decomposition of the set of objects using some criterion

  • Partitional (k-means): Construct various partitions (k) and then evaluate them by some criterion

Hierarchical relationships

  • Bottom-Up (agglomerative) clustering: Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together
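The bottom-up procedure can be sketched in a few lines of base R. This is only an illustration, not code from the slides: the helper name `agglomerate()`, the toy points, and the use of centroid distance as the "best pair" criterion are all assumptions made for the example.

```r
# Hypothetical helper: bottom-up clustering using centroid distance
agglomerate <- function(x) {
  x <- as.matrix(x)
  clusters <- as.list(seq_len(nrow(x)))  # start: every item is its own cluster
  merges <- list()
  while (length(clusters) > 1) {
    # centroid (column means) of each current cluster, one row per cluster
    centers <- do.call(rbind, lapply(clusters, function(idx) {
      colMeans(x[idx, , drop = FALSE])
    }))
    d <- as.matrix(dist(centers))
    diag(d) <- Inf                       # ignore self-distances
    pair <- sort(unname(which(d == min(d), arr.ind = TRUE)[1, ]))
    i <- pair[1]
    j <- pair[2]
    merges[[length(merges) + 1]] <- list(
      members = rownames(x)[c(clusters[[i]], clusters[[j]])],
      distance = d[i, j]
    )
    clusters[[i]] <- c(clusters[[i]], clusters[[j]])  # fuse cluster j into i
    clusters[[j]] <- NULL
  }
  merges
}

# Toy data (made up): two tight pairs that are far from each other
pts <- rbind(a = c(0, 0), b = c(0, 1), c = c(5, 5), d = c(5, 7))
steps <- agglomerate(pts)
merge_distances <- sapply(steps, function(s) s$distance)
print(merge_distances)
```

Each pass merges the closest pair, so the recorded distances show how far apart the fused clusters were at each step.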

Hierarchical steps

Distance metrics

  • Euclidean

\[d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}\]

  • Manhattan

\[d = |x_2 - x_1| + |y_2 - y_1|\]
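As a quick check of the two formulas, the sketch below computes both distances by hand and compares them with base R's `dist()`. The two points are made-up values chosen only for illustration.

```r
# Two hypothetical points in the plane (illustrative values only)
p1 <- c(x = 1, y = 2)
p2 <- c(x = 4, y = 6)

# Euclidean: square root of the summed squared differences
euclid <- sqrt(sum((p2 - p1)^2))

# Manhattan: sum of the absolute differences
manhat <- sum(abs(p2 - p1))

# The same quantities from dist(), which works on the rows of a matrix
pts <- rbind(p1, p2)
d_euclid <- as.numeric(dist(pts, method = "euclidean"))
d_manhat <- as.numeric(dist(pts, method = "manhattan"))

print(c(euclidean = euclid, manhattan = manhat))
```

For these points the Euclidean distance is 5 and the Manhattan distance is 7; Manhattan is never smaller than Euclidean for the same pair.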

Linkage

  • How close clusters are to one another

  • Ward's method

    • Joins the pair of clusters whose merger gives the smallest increase in the within-cluster sum of squares
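To see how the linkage choice changes the result, the sketch below runs `hclust()` with several linkage methods on the same distances. The four one-dimensional values are made up, and comparing Ward's method (`"ward.D2"` in `hclust()`) against single, complete, and average linkage is an assumption of this example rather than something taken from the slides.

```r
# Toy data (hypothetical): two nearby pairs on a line
x <- matrix(c(0, 1, 5, 7), ncol = 1,
            dimnames = list(c("a", "b", "c", "d"), "value"))
d <- dist(x)

# Height of the last merge under each linkage rule
linkages <- c("single", "complete", "average", "ward.D2")
final_height <- sapply(linkages, function(m) {
  max(hclust(d, method = m)$height)
})
print(final_height)
```

All four methods first join a with b and c with d, but they disagree on how far apart the two resulting clusters are: single linkage uses the closest pair of members (4), complete linkage the farthest pair (7), and average linkage the mean of all between-cluster pairs (5.5).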

Hierarchical steps

  • Calculate a distance matrix containing the distance between every pair of observations

    Important

    • Preprocessing:

      • Rows are observations and columns are variables

      • Standardize variables (if not on same scale)

      • No missing data
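The preprocessing checklist above might look like this in base R. The data frame, its column names, and its values are hypothetical; `na.omit()` is just one simple way to handle missing data (dropping incomplete rows), not necessarily the approach used for the slides.

```r
# Hypothetical data: rows are observations, columns are variables
raw <- data.frame(
  height_cm = c(150, 160, 170, NA, 180),
  weight_kg = c(50, 60, 65, 70, 80),
  row.names = c("p1", "p2", "p3", "p4", "p5")
)

# Drop rows with missing values (dist() would otherwise rescale around the NAs)
complete_rows <- na.omit(raw)

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(complete_rows)

# Distances now weight both variables equally
d <- dist(scaled, method = "euclidean")
print(round(d, 3))
```

Without `scale()`, the variable with the larger numeric range would dominate the distances.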

Student   Math  Music  Biology
StudentA     2      3        2
StudentB     1      3        2
StudentC     1      2        1
StudentD     2      4        4
StudentE     3      4        3

Hierarchical steps

students <- data.frame(
  Student = c("StudentA", "StudentB", "StudentC", "StudentD", "StudentE"),
  Math = c(2, 1, 1, 2, 3),
  Music = c(3, 3, 2, 4, 4),
  Biology = c(2, 2, 1, 4, 3)
)

rownames(students) <- students$Student # use student names as row names

# drop the non-numeric Student column so it does not distort the distances
diststudents <- dist(students[, -1], method = "euclidean") # create a distance matrix

diststudents
         StudentA StudentB StudentC StudentD
StudentB 1.000000                           
StudentC 1.732051 1.414214                  
StudentD 2.236068 2.449490 3.741657         
StudentE 1.732051 2.449490 3.464102 1.414214

Hierarchical steps

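  • Find the two points closest together (StudentA and StudentB) and merge them into Cluster1, represented by their mean

# Cluster1 = mean of StudentA and StudentB; the other rows are unchanged
students2 <- matrix(c(1.5, 3, 2, 1, 2, 1, 2, 4, 4, 3, 4, 3),
  nrow = 4, byrow = TRUE)
students2 <- as.data.frame(students2)
rownames(students2) <- c("Cluster1", "StudentC", "StudentD", "StudentE")
diststudents2 <- dist(students2, method = "euclidean")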

diststudents2
         Cluster1 StudentC StudentD
StudentC 1.500000                  
StudentD 2.291288 3.741657         
StudentE 2.061553 3.464102 1.414214

Hierarchical steps

# StudentD and StudentE are now closest (1.414214): merge them into Cluster2 at their mean
students3 <- matrix(c(1.5, 3, 2, 1, 2, 1, 2.5, 4, 3.5),
                    nrow = 3, byrow = TRUE)
students3 <- as.data.frame(students3)
rownames(students3) <- c("Cluster1", "StudentC", "Cluster2")
diststudents3 <- dist(students3,
                      method = "euclidean")

diststudents3
         Cluster1 StudentC
StudentC 1.500000         
Cluster2 2.061553 3.535534
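The manual merge-and-recompute steps can be cross-checked with `hclust()`. This is a sketch under one assumption: that the hand calculation corresponds to centroid linkage, since each merged cluster is replaced by the mean of its members. `hclust()` documents that the `"centroid"` method is meant to be used with squared Euclidean distances, so the distances are squared going in and the heights are square-rooted coming out.

```r
# Same student scores as in the slides, with names stored as row names
students <- data.frame(
  Math = c(2, 1, 1, 2, 3),
  Music = c(3, 3, 2, 4, 4),
  Biology = c(2, 2, 1, 4, 3),
  row.names = c("StudentA", "StudentB", "StudentC", "StudentD", "StudentE")
)

d <- dist(students, method = "euclidean")

# Centroid linkage on squared distances reproduces "merge, then take the mean"
hc <- hclust(d^2, method = "centroid")

# Back-transform heights to ordinary Euclidean distances between centroids
merge_dist <- sqrt(hc$height)

print(hc$merge)      # negative entries are single students, positive are earlier merges
print(merge_dist)
```

The second and third merge distances match the smallest entries of `diststudents2` (1.414214) and `diststudents3` (1.5).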

Reading a dendrogram



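  • Height (y-axis):

    • Dissimilarity (distance) at which two clusters are merged; the lower the merge, the more similar the clusters

  • x-axis: not meaningful (leaf order is chosen only for a readable layout)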
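A small sketch of how these points show up in R, using the five students' scores from the worked example. Average linkage is an arbitrary choice for illustration. `cophenetic()` returns, for every pair of items, the height at which they first end up in the same cluster, which is exactly what is read off the y-axis.

```r
students <- data.frame(
  Math = c(2, 1, 1, 2, 3),
  Music = c(3, 3, 2, 4, 4),
  Biology = c(2, 2, 1, 4, 3),
  row.names = c("StudentA", "StudentB", "StudentC", "StudentD", "StudentE")
)

# Average linkage is an assumption made for this example
hc <- hclust(dist(students), method = "average")

# Draw the dendrogram; the y-axis is the merge height
plot(hc, ylab = "Height (distance)", main = "Students")

# Height at which each pair of students is first joined
coph <- as.matrix(cophenetic(hc))
print(coph["StudentA", "StudentB"])  # joined early: very similar
print(coph["StudentA", "StudentE"])  # joined only at the final merge

# Left-to-right leaf order used in the plot (a layout choice, not a distance)
print(hc$labels[hc$order])
```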

Cutting a dendrogram
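Cutting a dendrogram turns the tree into flat cluster labels. A minimal sketch with base R's `cutree()`, again using the five students from the worked example and average linkage (an assumption made for illustration): the tree can be cut either into a chosen number of clusters `k` or at a chosen height `h`.

```r
students <- data.frame(
  Math = c(2, 1, 1, 2, 3),
  Music = c(3, 3, 2, 4, 4),
  Biology = c(2, 2, 1, 4, 3),
  row.names = c("StudentA", "StudentB", "StudentC", "StudentD", "StudentE")
)

hc <- hclust(dist(students), method = "average")

# Ask for a fixed number of clusters
groups_k <- cutree(hc, k = 2)
print(groups_k)

# Or cut at a fixed height: merges above h are undone
groups_h <- cutree(hc, h = 1.45)
print(groups_h)
```

With `k = 2`, StudentA, StudentB, and StudentC fall in one cluster and StudentD and StudentE in the other. Cutting at `h = 1.45` keeps only the two lowest merges, so StudentC is left in a cluster of its own.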

HAC in action

  • N = 84

  • The talkers represented three American English regional dialects (New England dialect, the Southern dialect), three international English dialects (British English, Australian English, and Afrikaans), and nine nonnative accents (Mandarin, Korean, and Japanese from East Asia; Bengali, Gujarati, and Urdu from South Asia; and Indonesian, Tagalog, and Thai from Southeast Asia)

library(pacman) # provides p_load(), which installs (if needed) and loads packages
p_load(tidyverse, factoextra, dendextend, easystats) # tidyverse supplies read_csv() and %>%

clust_data <- read_csv("https://raw.githubusercontent.com/jgeller112/clustering_project/main/data/class_wide_1.csv")

clust_data <- dplyr::select(clust_data, -...1, -`54`) # remove extra index col and col 54 (sub 54 has weird formatting)

clust_data <- as.data.frame(clust_data)

rownames(clust_data) <- clust_data$speaker # use speaker as row names

clust_data <- dplyr::select(clust_data, -speaker) # drop speaker column now that it is stored in the row names

head(clust_data)
[Output: first six rows of clust_data, one row per speaker and one numeric column per ID]

HAC in action

  • Calculate distance matrix

    dist_mat <- clust_data %>% 
      dist(., method="euclidean")
    
    dist_mat
                  bengali_9 bengali_13 bengali_16 gujarati_5 gujarati_13
    bengali_13    31.272992                                             
    bengali_16    30.528675  27.964263                                  
    gujarati_5    32.969683  24.677925  26.851443                       
    gujarati_13   31.176915  24.535688  28.178006  23.216374            
    gujarati_14   35.468296  31.464265  30.199338  26.962938   29.765752
    urdu_2        32.202484  30.248967  28.513155  22.671568   28.792360
    urdu_15       30.967725  27.477263  28.757608  21.540659   21.000000
    urdu_27       30.512293  27.910571  29.240383  22.090722   30.545049
    indonesian_1  40.755368  37.403208  35.085610  31.968735   36.537652
    indonesian_8  42.567593  40.373258  36.523965  34.770677   40.987803
    indonesian_10 37.215588  32.817678  31.480152  31.336879   33.749074
    tagalog_6     38.948684  37.403208  33.955854  35.972211   36.674242
    tagalog_9     40.137264  36.864617  32.449961  35.665109   37.483330
    tagalog_18    35.651087  36.728735  33.361655  35.355339   37.509999
    thai_2        37.854986  37.000000  36.207734  37.094474   37.696154
    thai_6        39.204592  38.275318  37.188708  37.762415   37.775654
    thai_7        41.448764  40.743098  37.788887  39.179076   38.470768
    japanese_11   37.934153  37.563280  33.808283  37.309516   37.027017
    japanese_12   38.600518  36.221541  31.937439  35.651087   38.026307
    japanese_26   38.948684  39.711459  35.227830  38.832976   41.629317
    korean_2      39.974992  35.552778  35.298725  37.349699   39.774364
    korean_24     39.370039  42.355637  35.298725  37.986840   40.963398
    korean_30     41.121770  38.794329  35.482390  35.213634   41.024383
    mandarin_14   42.708313  39.089641  36.932371  37.563280   39.749214
    mandarin_53   38.574603  41.036569  35.496479  37.054015   39.166312
    mandarin_63   40.472213  38.807216  36.823905  35.312887   39.623226
    english_21    45.803930  40.938979  38.961519  37.269290   38.858718
    english_89    47.801674  42.579338  40.779897  40.348482   41.073106
    english_103   48.031240  42.906876  40.779897  39.547440   41.629317
    english_428   45.552168  40.311289  39.204592  39.038443   39.179076
    english_212   49.122296  40.558600  40.877867  38.987177   39.204592
    english_357   48.672374  41.097445  40.804412  40.890097   40.062451
    english_288   48.259714  42.626283  41.773197  40.422766   40.755368
    english_171   43.977267  42.166337  40.669399  39.812058   41.448764
    english_126   48.197510  43.370497  41.749251  42.284749   42.059482
    english_3     50.318983  49.355851  47.010637  43.829214   47.476310
    english_73    44.988888  45.760245  44.226689  42.626283   43.174066
    english_153   49.325450  48.877398  46.206060  45.431267   46.893496
    english_2     50.695167  48.867167  46.368092  43.231933   46.561787
    english_38    49.618545  49.959984  45.122057  43.023250   47.979162
    english_460   51.361464  48.805737  46.368092  44.933284   47.265209
    africaans_2   50.606324  49.162994  45.749317  45.122057   46.936127
    africaans _5  50.438081  49.497475  46.216880  43.554563   47.413078
    africaans _42 51.623638  47.738873  47.106263  45.122057   46.421978
                  gujarati_14    urdu_2   urdu_15   urdu_27 indonesian_1
    bengali_13                                                          
    bengali_16                                                          
    gujarati_5                                                          
    gujarati_13                                                         
    gujarati_14                                                         
    urdu_2          28.053520                                           
    urdu_15         29.068884 25.845696                                 
    urdu_27         23.853721 21.400935 29.933259                       
    indonesian_1    29.748950 32.802439 32.000000 31.811947             
    indonesian_8    33.045423 35.623026 36.235342 36.510273    23.811762
    indonesian_10   29.782545 31.016125 34.146742 30.495901    29.966648
    tagalog_6       34.365681 37.013511 33.704599 37.121422    27.856777
    tagalog_9       31.984371 40.767634 36.166283 34.842503    29.933259
    tagalog_18      35.902646 35.888717 37.841776 35.327043    38.496753
    thai_2          32.908965 37.363083 40.422766 34.322005    33.526109
    thai_6          34.452866 38.832976 39.293765 35.888717    33.555923
    thai_7          34.813790 37.907783 41.436699 35.930488    33.541020
    japanese_11     37.349699 38.262253 37.336309 38.340579    35.383612
    japanese_12     37.067506 31.859065 33.541020 36.041643    31.764760
    japanese_26     34.161382 37.496667 40.199502 34.727511    29.866369
    korean_2        35.440090 36.810325 38.923001 34.597688    34.770677
    korean_24       34.985711 38.587563 37.696154 38.379682    32.634338
    korean_30       35.986108 34.957117 38.209946 35.496479    30.659419
    mandarin_14     33.645208 38.535698 40.162171 34.799425    29.681644
    mandarin_53     33.615473 36.891733 38.974351 33.481338    30.577770
    mandarin_63     33.105891 37.403208 37.134889 33.541020    31.160873
    english_21      36.932371 37.000000 38.613469 38.327536    36.290495
    english_89      39.230090 39.370039 40.149720 41.255303    37.682887
    english_103     39.635842 39.673669 41.279535 40.348482    38.236109
    english_428     38.974351 39.496835 40.274061 40.496913    39.293765
    english_212     37.456642 39.724048 40.718546 39.724048    40.914545
    english_357     38.948684 40.914545 42.355637 40.669399    41.036569
    english_288     38.820098 40.447497 41.761226 39.799497    40.644803
    english_171     39.522146 38.223030 38.691084 39.484174    38.065733
    english_126     38.483763 41.880783 43.657760 40.546270    41.133928
    english_3       44.181444 45.508241 45.530210 46.032597    42.860238
    english_73      43.104524 43.737855 42.953463 44.418465    41.677332
    english_153     44.710178 45.847574 46.195238 46.497312    42.473521
    english_2       43.034870 45.155288 46.032597 45.574115    43.462628
    english_38      43.794977 44.687806 46.076024 44.933284    42.579338
    english_460     43.772137 46.850827 48.425200 45.661800    43.783559
    africaans_2     45.000000 47.201695 46.130250 46.904158    44.000000
    africaans _5    43.840620 45.022217 45.749317 45.376205    42.836900
    africaans _42   43.806392 47.623524 47.265209 46.086874    42.801869
                  indonesian_8 indonesian_10 tagalog_6 tagalog_9 tagalog_18
    bengali_13                                                             
    bengali_16                                                             
    gujarati_5                                                             
    gujarati_13                                                            
    gujarati_14                                                            
    urdu_2                                                                 
    urdu_15                                                                
    urdu_27                                                                
    indonesian_1                                                           
    indonesian_8                                                           
    indonesian_10    29.137605                                             
    tagalog_6        28.442925     29.189039                               
    tagalog_9        31.096624     32.124757 27.712813                     
    tagalog_18       37.269290     32.984845 32.496154 33.316662           
    thai_2           38.483763     35.665109 35.972211 37.229021  37.841776
    thai_6           37.242449     34.409301 34.467376 35.805028  38.314488
    thai_7           37.389838     35.114100 34.626579 35.171011  34.132096
    japanese_11      34.626579     33.316662 33.970576 31.591138  26.645825
    japanese_12      32.434549     34.828150 33.301652 35.341194  34.568772
    japanese_26      34.539832     33.045423 29.732137 32.832910  32.310989
    korean_2         36.110940     33.926391 34.684290 33.301652  38.587563
    korean_24        32.124757     36.124784 31.511903 29.342802  32.109189
    korean_30        32.511536     34.785054 32.893768 35.665109  36.606010
    mandarin_14      34.146742     35.028560 32.388269 31.448370  34.423829
    mandarin_53      35.665109     33.985291 33.837849 35.623026  38.613469
    mandarin_63      29.427878     33.421550 30.773365 29.410882  33.837849
    english_21       35.693137     37.269290 36.592349 37.242449  40.012498
    english_89       37.854986     40.024992 37.682887 38.935845  42.166337
    english_103      38.794329     38.832976 39.012818 38.574603  42.520583
    english_428      40.385641     37.841776 36.959437 39.166312  40.841156
    english_212      40.410395     40.099875 39.597980 39.698866  43.081318
    english_357      42.178193     39.242834 38.522721 39.089641  42.379240
    english_288      40.187063     41.060930 40.249224 40.249224  43.104524
    english_171      38.340579     38.091994 35.566838 39.433488  41.194660
    english_126      41.000000     40.447497 39.446166 39.924930  43.358967
    english_3        40.767634     43.370497 43.943145 45.749317  46.270941
    english_73       42.332021     43.806392 40.755368 43.000000  43.646306
    english_153      41.097445     45.188494 42.544095 44.586994  45.978256
    english_2        41.231056     43.988635 44.373415 46.054316  46.335731
    english_38       40.987803     43.393548 43.943145 45.705580  46.421978
    english_460      43.312816     44.350874 44.215382 44.124823  45.683695
    africaans_2      42.438190     44.899889 42.261093 44.158804  45.934736
    africaans _5     40.693980     43.508620 43.255058 45.354162  45.880279
    africaans _42    42.508823     44.721360 43.703547 43.335897  46.840154
                     thai_2    thai_6    thai_7 japanese_11 japanese_12 japanese_26
    bengali_13                                                                     
    bengali_16                                                                     
    gujarati_5                                                                     
    gujarati_13                                                                    
    gujarati_14                                                                    
    urdu_2                                                                         
    urdu_15                                                                        
    urdu_27                                                                        
    indonesian_1                                                                   
    indonesian_8                                                                   
    indonesian_10                                                                  
    tagalog_6                                                                      
    tagalog_9                                                                      
    tagalog_18                                                                     
    thai_2                                                                         
    thai_6        19.183326                                                        
    thai_7        22.956481 21.656408                                              
    japanese_11   35.411862 33.136083 31.448370                                    
    japanese_12   36.619667 38.013156 35.014283   31.953091                        
    japanese_26   32.249031 29.832868 23.000000   28.495614   31.192948            
    korean_2      32.295511 28.301943 30.000000   32.787193   34.380227   29.883106
    korean_24     33.955854 32.171416 29.495762   29.171904   28.809721   27.294688
    korean_30     31.906112 30.594117 27.110883   32.280025   29.681644   24.124676
    mandarin_14   26.476405 26.286879 24.738634   32.634338   34.698703   29.103264
    mandarin_53   26.851443 25.942244 29.495762   35.369478   37.709415   31.859065
    mandarin_63   32.357379 27.730849 29.732137   29.274562   31.464265   29.748950
    english_21    38.923001 36.510273 36.742346   38.923001   38.755645   40.435133
    english_89    40.298883 39.471509 39.761791   41.665333   39.610605   43.428102
    english_103   41.761226 40.743098 40.902323   42.237424   41.436699   44.226689
    english_428   38.366652 36.878178 38.948684   40.546270   40.286474   42.684892
    english_212   39.572718 37.229021 38.587563   41.400483   43.347434   44.339599
    english_357   38.935845 37.682887 38.665230   41.880783   42.343831   43.772137
    english_288   39.242834 37.496667 39.025633   42.237424   43.965896   45.497253
    english_171   40.012498 37.802116 39.698866   40.681691   38.652296   41.194660
    english_126   38.392708 36.138622 38.353618   42.190046   44.011362   43.817805
    english_3     46.701178 46.335731 45.650849   45.596052   46.151923   47.212287
    english_73    43.023250 43.874822 44.519659   43.783559   43.497126   45.727453
    english_153   45.628938 45.276926 43.988635   45.299007   44.955534   45.541190
    english_2     45.088801 45.617979 45.563143   46.572524   47.770284   48.466483
    english_38    45.066617 46.572524 46.432747   47.169906   46.690470   48.238988
    english_460   43.920383 45.376205 44.136153   45.836667   47.560488   47.989582
    africaans_2   46.303348 46.432747 45.376205   45.891176   46.636895   47.286362
    africaans _5  45.398238 45.836667 45.453273   46.615448   46.626173   47.613023
    africaans _42 45.033321 45.276926 45.066617   46.882833   48.445846   48.104054
                   korean_2 korean_24 korean_30 mandarin_14 mandarin_53 mandarin_63
    bengali_13                                                                     
    bengali_16                                                                     
    gujarati_5                                                                     
    gujarati_13                                                                    
    gujarati_14                                                                    
    urdu_2                                                                         
    urdu_15                                                                        
    urdu_27                                                                        
    indonesian_1                                                                   
    indonesian_8                                                                   
    indonesian_10                                                                  
    tagalog_6                                                                      
    tagalog_9                                                                      
    tagalog_18                                                                     
    thai_2                                                                         
    thai_6                                                                         
    thai_7                                                                         
    japanese_11                                                                    
    japanese_12                                                                    
    japanese_26                                                                    
    korean_2                                                                       
    korean_24     32.588341                                                        
    korean_30     24.474477 27.404379                                              
    mandarin_14   25.961510 31.176915 29.681644                                    
    mandarin_53   29.461840 33.734256 33.301652   26.305893                        
    mandarin_63   27.276363 27.531800 29.916551   24.819347   26.944387            
    english_21    36.414283 39.089641 35.397740   34.380227   37.202150   31.906112
    english_89    39.509493 41.557190 38.548671   36.097091   39.761791   34.856850
    english_103   40.509258 42.461747 39.446166   37.563280   40.360872   36.097091
    english_428   39.686270 42.011903 39.749214   37.775654   38.923001   36.235342
    english_212   38.170669 43.806392 39.824616   37.429935   38.768544   35.028560
    english_357   39.382737 44.328321 41.400483   37.000000   38.794329   36.124784
    english_288   39.306488 43.692105 41.036569   36.455452   37.669616   34.655447
    english_171   38.574603 40.472213 38.091994   36.986484   38.470768   34.234486
    english_126   38.535698 43.439613 40.890097   36.152455   36.537652   34.568772
    english_3     45.563143 45.803930 41.581246   43.243497   41.809090   37.523326
    english_73    45.628938 45.519227 42.930176   42.801869   39.949969   37.040518
    english_153   45.088801 45.661800 41.689327   42.555846   40.902323   37.107951
    english_2     45.891176 47.116876 42.883563   43.289722   40.398020   38.000000
    english_38    46.065171 46.065171 42.130749   43.863424   39.395431   39.623226
    english_460   45.011110 46.840154 43.508620   41.472883   40.174619   37.603191
    africaans_2   46.850827 46.076024 43.150898   43.324358   40.828911   38.091994
    africaans _5  46.281746 46.173586 42.035699   43.127717   40.792156   37.894591
    africaans _42 46.249324 48.383882 44.788391   43.092923   40.975602   38.249183
                  english_21 english_89 english_103 english_428 english_212
    bengali_13                                                             
    bengali_16                                                             
    gujarati_5                                                             
    gujarati_13                                                            
    gujarati_14                                                            
    urdu_2                                                                 
    urdu_15                                                                
    urdu_27                                                                
    indonesian_1                                                           
    indonesian_8                                                           
    indonesian_10                                                          
    tagalog_6                                                              
    tagalog_9                                                              
    tagalog_18                                                             
    thai_2                                                                 
    thai_6                                                                 
    thai_7                                                                 
    japanese_11                                                            
    japanese_12                                                            
    japanese_26                                                            
    korean_2                                                               
    korean_24                                                              
    korean_30                                                              
    mandarin_14                                                            
    mandarin_53                                                            
    mandarin_63                                                            
    english_21                                                             
    english_89      9.746794                                               
    english_103    13.076697  12.569805                                    
    english_428    18.520259  18.547237   18.920888                        
    english_212    14.933185  16.733201   18.493242   17.549929            
    english_357    18.357560  17.663522   18.220867   11.832160   12.409674
    english_288    14.662878  16.852300   18.547237   20.149442   10.099505
    english_171    16.613248  17.578396   19.773720   21.702534   20.615528
    english_126    16.763055  19.183326   20.000000   17.888544   13.038405
    english_3      27.018512  27.622455   26.664583   33.985291   31.352831
    english_73     30.757113  30.675723   31.701735   31.733263   33.361655
    english_153    27.549955  27.459060   27.202941   32.924155   31.527766
    english_2      27.640550  28.407745   27.258026   33.926391   29.816103
    english_38     31.208973  32.109189   30.675723   36.207734   33.985291
    english_460    29.120440  30.215890   28.687977   32.572995   31.128765
    africaans_2    27.946377  27.712813   27.129320   32.588341   30.463092
    africaans _5   27.349589  27.184554   26.400758   33.391616   30.149627
    africaans _42  30.610456  30.757113   28.565714   33.406586   30.133038
                  english_357 english_288 english_171 english_126 english_3
    bengali_13                                                             
    bengali_16                                                             
    gujarati_5                                                             
    gujarati_13                                                            
    gujarati_14                                                            
    urdu_2                                                                 
    urdu_15                                                                
    urdu_27                                                                
    indonesian_1                                                           
    indonesian_8                                                           
    indonesian_10                                                          
    tagalog_6                                                              
    tagalog_9                                                              
    tagalog_18                                                             
    thai_2                                                                 
    thai_6                                                                 
    thai_7                                                                 
    japanese_11                                                            
    japanese_12                                                            
    japanese_26                                                            
    korean_2                                                               
    korean_24                                                              
    korean_30                                                              
    mandarin_14                                                            
    mandarin_53                                                            
    mandarin_63                                                            
    english_21                                                             
    english_89                                                             
    english_103                                                            
    english_428                                                            
    english_212                                                            
    english_357                                                            
    english_288     16.492423                                              
    english_171     22.869193   19.824228                                  
    english_126     13.784049   10.295630   20.952327                      
    english_3       34.249088   30.577770   30.724583   32.202484          
    english_73      34.044089   34.161382   29.866369   34.336569 22.671568
    english_153     32.588341   31.685959   31.921779   31.749016 13.527749
    english_2       32.969683   29.206164   32.031235   30.773365 10.954451
    english_38      36.428011   32.817678   33.466401   33.926391 15.491933
    english_460     33.060551   30.479501   33.852622   30.870698 17.776389
    africaans_2     32.093613   29.899833   31.543621   31.336879 14.317821
    africaans _5    32.878564   29.580399   30.033315   31.416556  9.380832
    africaans _42   30.757113   30.298515   35.284558   30.724583 21.470911
                  english_73 english_153 english_2 english_38 english_460
    bengali_13                                                           
    bengali_16                                                           
    gujarati_5                                                           
    gujarati_13                                                          
    gujarati_14                                                          
    urdu_2                                                               
    urdu_15                                                              
    urdu_27                                                              
    indonesian_1                                                         
    indonesian_8                                                         
    indonesian_10                                                        
    tagalog_6                                                            
    tagalog_9                                                            
    tagalog_18                                                           
    thai_2                                                               
    thai_6                                                               
    thai_7                                                               
    japanese_11                                                          
    japanese_12                                                          
    japanese_26                                                          
    korean_2                                                             
    korean_24                                                            
    korean_30                                                            
    mandarin_14                                                          
    mandarin_53                                                          
    mandarin_63                                                          
    english_21                                                           
    english_89                                                           
    english_103                                                          
    english_428                                                          
    english_212                                                          
    english_357                                                          
    english_288                                                          
    english_171                                                          
    english_126                                                          
    english_3                                                            
    english_73                                                           
    english_153    19.416488                                             
    english_2      23.194827   14.594520                                 
    english_38     24.657656   19.052559 13.564660                       
    english_460    21.307276   18.138357 17.088007  19.442222            
    africaans_2    21.610183   13.190906 14.035669  16.462078   17.804494
    africaans _5   21.679483   14.106736  8.944272  13.190906   17.549929
    africaans _42  26.438608   18.439089 18.734994  22.869193   21.563859
                  africaans_2 africaans _5
    bengali_13                            
    bengali_16                            
    gujarati_5                            
    gujarati_13                           
    gujarati_14                           
    urdu_2                                
    urdu_15                               
    urdu_27                               
    indonesian_1                          
    indonesian_8                          
    indonesian_10                         
    tagalog_6                             
    tagalog_9                             
    tagalog_18                            
    thai_2                                
    thai_6                                
    thai_7                                
    japanese_11                           
    japanese_12                           
    japanese_26                           
    korean_2                              
    korean_24                             
    korean_30                             
    mandarin_14                           
    mandarin_53                           
    mandarin_63                           
    english_21                            
    english_89                            
    english_103                           
    english_428                           
    english_212                           
    english_357                           
    english_288                           
    english_171                           
    english_126                           
    english_3                             
    english_73                            
    english_153                           
    english_2                             
    english_38                            
    english_460                           
    africaans_2                           
    africaans _5    11.618950             
    africaans _42   18.330303    19.313208

HAC in action

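# hclust() is from the stats package
# Note: despite its name, hclust_avg is fit with Ward's linkage (ward.D2),
# not average linkage
hclust_avg <- hclust(dist_mat, method = "ward.D2")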

plot(hclust_avg, cex = 0.6, hang = -1)
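
To make the merge steps concrete, here is a minimal sketch on the five-student toy data from earlier. It assumes only the three score columns go into dist() (the ID is kept as row names), and the object names `d`, `hc`, and `groups` are illustrative, not part of the original analysis.

```r
# Toy data: scores only, student IDs stored as row names
students <- data.frame(
  Math    = c(2, 1, 1, 2, 3),
  Music   = c(3, 3, 2, 4, 4),
  Biology = c(2, 2, 1, 4, 3),
  row.names = c("StudentA", "StudentB", "StudentC", "StudentD", "StudentE")
)

d  <- dist(students, method = "euclidean")  # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")         # bottom-up merging with Ward's linkage

# cutree() cuts the dendrogram into k groups and returns one label per row
groups <- cutree(hc, k = 2)
groups
```

hc$merge lists the order in which items were fused, which is what the dendrogram draws.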

Determine optimal # clusters

Elbow method

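\[ \text{minimize}\Bigg(\sum_{j=1}^{k} W(C_j)\Bigg) \]

  • \(W(C_j)\) is the within-cluster sum of squares of cluster \(C_j\); the total always shrinks as k grows, so choose the k at the "elbow", where adding another cluster stops reducing it by much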
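
A rough sketch of what the elbow plot computes, using base R only. It assumes the built-in iris data (standardized with scale()) as a stand-in for the clustering data, and `total_wss` is a hypothetical helper written for this illustration.

```r
x  <- scale(iris[, -5])                      # standardized numeric columns
hc <- hclust(dist(x), method = "ward.D2")

# Total within-cluster sum of squares for one set of cluster labels
total_wss <- function(x, labels) {
  sum(sapply(split(as.data.frame(x), labels), function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)  # squared deviations from cluster mean
  }))
}

ks  <- 1:10
wss <- sapply(ks, function(k) total_wss(x, cutree(hc, k = k)))

# Look for the k after which the curve flattens out
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```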

Silhouette

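  • Measure of how good your clusters are (ranges from -1 to 1; higher is better)
    • Cohesion \(a(i)\): mean distance from point i to the other points in its own cluster

    • Separability \(b(i)\): mean distance from point i to the points in the nearest other cluster

\[ s(i) = \frac{b(i)-a(i)}{\max\{a(i), b(i)\}} \]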
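
As a sanity check on the formula, the sketch below computes s(i) by hand for one observation and compares it with silhouette() from the cluster package. The iris data and the two-cluster Ward solution are assumptions made only for this example.

```r
library(cluster)  # silhouette()

x  <- scale(iris[, -5])
d  <- dist(x)
cl <- cutree(hclust(d, method = "ward.D2"), k = 2)

i  <- 1                 # observation to inspect
dm <- as.matrix(d)

# a(i): mean distance to the other members of i's own cluster
a_i <- mean(dm[i, cl == cl[i] & seq_along(cl) != i])

# b(i): smallest mean distance to the members of any other cluster
other <- cl != cl[i]
b_i <- min(tapply(dm[i, other], cl[other], mean))

s_i <- (b_i - a_i) / max(a_i, b_i)

sil <- silhouette(cl, d)
c(by_hand = s_i, from_package = sil[i, "sil_width"])
```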

Gap

  • Compares total within-cluster variation for different values of k (number of clusters) with the expected variation if the data were randomly distributed (null)

\[ \mathrm{Gap}_n(k) = E^*_n\{\log(W_k)\} - \log(W_k) \]
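
One way to see the two terms of the gap statistic is clusGap() from the cluster package, which returns log(W_k), its expected value under the reference distribution, and their difference. This is a sketch under assumptions: iris stands in for the data, `hclust_k` is a hypothetical wrapper, and B = 50 reference samples is chosen only to keep it quick.

```r
library(cluster)  # clusGap(), maxSE()

set.seed(123)     # the reference data sets are simulated, so fix the seed
x <- scale(iris[, -5])

# clusGap() needs a function(x, k) that returns a list with a `cluster` element
hclust_k <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k = k))
}

gap <- clusGap(x, FUNcluster = hclust_k, K.max = 6, B = 50, verbose = FALSE)

gap$Tab  # columns: logW, E.logW, gap (= E.logW - logW), SE.sim

# One common rule for choosing k from the gap curve
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```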

Determine optimal # clusters

# Plot cluster results
p1 <- fviz_nbclust(clust_data, FUN = hcut, method = "wss", 
                   k.max = 10) +
  ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(clust_data, FUN = hcut, method = "silhouette", 
                   k.max = 10) +
  ggtitle("(B) Silhouette method")

p3 <- fviz_nbclust(clust_data, FUN = hcut, method = "gap_stat", 
                   k.max = 10, nboot=100) +
  ggtitle("(C) Gap statistic")

Determine optimal # clusters

# Display plots side by side
gridExtra::grid.arrange(p1, p2,p3,  nrow = 1)

Determine optimal # clusters

  • Bootstrapping
# n_clusters_hclust() is from the parameters package (easystats)
n <- n_clusters_hclust(clust_data, standardize = FALSE,
                       distance_method = "euclidean",
                       hclust_method = "ward.D2", iterations = 500)

plot(n)

Visualize clusters

fviz_dend(x = hclust_avg, cex = 0.8, lwd = 0.8, k = 2,
          rect = TRUE, 
          rect_border = "gray", 
          rect_fill = FALSE)

Visualize clusters

fviz_dend(x = hclust_avg, cex = 0.8, lwd = 0.8, k = 2,
          rect = TRUE,
          k_colors = "jco",
          rect_border = "jco",
          rect_fill = TRUE,
          type = "circular")

Goodness of fit

hclust_avg <- hcut(dist_mat,k=2)
fviz_silhouette(hclust_avg)
  cluster size ave.sil.width
1       1   27          0.20
2       2   18          0.42

Interpretation

fviz_dend(x = hclust_avg, cex = 0.8, lwd = 0.8, k = 4,
          rect = TRUE, 
          rect_border = "gray", 
          rect_fill = FALSE)
  • Cluster 1: English

  • Cluster 2: Non-English

Pros and cons: HAC

  • Pros:

    • Shows all possible linkages between clusters

      • Helps you understand the data much better
    • No need to specify the number of clusters in advance

  • Cons:

    • Subjective

    • Scalability (can be computationally intensive with lots of data)

K-means

  1. Choose k random points to be the cluster centers
  2. For each data point, assign it to the cluster whose center is closest
  3. Recalculate centers
  4. Repeat 2 and 3 until:
    • Cluster membership does not change
    • Centers change only a tiny amount
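
The four steps above can be sketched in a few lines of base R. This is a teaching sketch, not what kmeans() runs internally; `simple_kmeans` and its arguments are hypothetical names introduced here.

```r
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # Step 1: pick k random observations as the starting centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every point to every center,
    # then assign each point to its closest center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")

    # Step 3: recalculate each center as the mean of its members
    new_centers <- centers
    for (j in seq_len(k)) {
      members <- new_cluster == j
      if (any(members)) new_centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }

    # Step 4: stop when membership is unchanged or centers barely move
    converged <- all(new_cluster == cluster) ||
      max(abs(new_centers - centers)) < tol
    cluster <- new_cluster
    centers <- new_centers
    if (converged) break
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

set.seed(1)
fit <- simple_kmeans(scale(iris[, -5]), k = 2)
table(fit$cluster)
```

Because step 1 is random, different seeds can give different solutions.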

K-means in action

data("iris") # load the built-in iris data

# Remove the Species column (5) and standardize the remaining variables
# with standardize() from datawizard (easystats)
iris.scaled <- datawizard::standardize(iris[, -5])
  • Iris dataset

    • We know there are 3 distinct species of flowers

EDA

Clustering

km.res <- kmeans(iris.scaled, 3)

km.res
K-means clustering with 3 clusters of sizes 33, 21, 96

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.8135055   1.3145538   -1.2825372  -1.2156393
2   -1.3232208  -0.3718921   -1.1334386  -1.1111395
3    0.5690971  -0.3705265    0.6888118   0.6609378

Clustering vector:
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
  1   2   2   2   1   1   1   1   2   2   1   1   2   2   1   1   1   1   1   1 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
  1   1   1   1   1   2   1   1   1   2   2   1   1   1   2   2   1   1   2   1 
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
  1   2   2   1   1   2   1   2   1   1   3   3   3   3   3   3   3   2   3   3 
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
  2   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
 81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
  3   3   3   3   3   3   3   3   3   3   3   3   3   2   3   3   3   3   2   3 
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
  3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
  3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
141 142 143 144 145 146 147 148 149 150 
  3   3   3   3   3   3   3   3   3   3 

Within cluster sum of squares by cluster:
[1]  17.33362  23.15862 149.25899
 (between_SS / total_SS =  68.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
cluster_analysis(iris.scaled, n=3, method="kmeans") %>%
  plot()
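
k-means starts from random centers, so a single run can settle on a poor local solution (the very uneven cluster sizes in the output above may be one such case). A hedged sketch of the usual remedy, fixing the seed and trying several starts with `nstart`; it uses base scale() as a stand-in for the standardized data, and the seed and `nstart = 25` are arbitrary choices.

```r
x <- scale(iris[, -5])   # standardized iris measurements

set.seed(42)                                      # make the random starts reproducible
km_single <- kmeans(x, centers = 3)               # one random start
km_multi  <- kmeans(x, centers = 3, nstart = 25)  # keep the best of 25 starts

# Lower total within-cluster sum of squares means a tighter solution
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)
km_multi$size
```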

Visualizing clusters

fviz_cluster(km.res, data = iris.scaled)

Determining Clusters

# Plot cluster results (all three panels use kmeans as the clustering function)
p1 <- fviz_nbclust(iris.scaled, FUN = kmeans, method = "wss") +
  ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(iris.scaled, FUN = kmeans, method = "silhouette",
                   k.max = 10) +
  ggtitle("(B) Silhouette method")

p3 <- fviz_nbclust(iris.scaled, FUN = kmeans, method = "gap_stat",
                   k.max = 10) +
  ggtitle("(C) Gap statistic")

Clusters

# Display plots side by side
gridExtra::grid.arrange(p1, p2,p3, nrow = 1)

Cluster Inference

rez_kmeans <- cluster_analysis(iris.scaled, standardize = FALSE, method = "kmeans")
Using solution with 2 clusters, supported by 15 out of 29 methods.
rez_kmeans
Cluster n_Obs Sum_Squares Sepal.Length Sepal.Width Petal.Length Petal.Width
1       100   174          0.506       -0.425       0.65         0.625
2        50    47.4       -1.01         0.85       -1.3         -1.25
plot(rez_kmeans)

  • Cluster 1 (n = 100): Versicolor + Virginica

  • Cluster 2 (n = 50): Setosa

Silhouette score

library(cluster)
km.res <- kmeans(iris.scaled, 2)
sil <- silhouette(km.res$cluster,dist(iris.scaled))

fviz_silhouette(sil)
  cluster size ave.sil.width
1       1   50          0.68
2       2  100          0.53

Getting cluster assignments

predict(rez_kmeans)
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1
#add cluster assignment back into df 
iris.scaled$clus <- predict(rez_kmeans)
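
Because iris comes with known species labels, one quick check of the cluster assignments is a cross-tabulation against `Species`. A minimal sketch, assuming a fresh two-cluster kmeans() fit on scale()-standardized data rather than the `rez_kmeans` object; cluster numbers are arbitrary and may differ between runs.

```r
x <- scale(iris[, -5])

set.seed(123)
km2 <- kmeans(x, centers = 2, nstart = 25)

# Rows: true species; columns: assigned cluster
tab <- table(Species = iris$Species, Cluster = km2$cluster)
tab
```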

Pros and cons: K-means

  • Pros:

    • Simple

    • Fast

  • Cons:

    • Sensitive to outliers

    • Results can change depending on the starting centers and the order of the data

    • Assumes roughly spherical clusters; alternatives when that does not hold:

      • DBSCAN

      • k-medoids
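
As a pointer for the k-medoids alternative, pam() from the cluster package uses actual observations (medoids) as cluster centers, which tends to make it less sensitive to outliers than cluster means. A minimal sketch on the iris data, with k = 2 assumed for illustration.

```r
library(cluster)  # pam()

x <- scale(iris[, -5])

pam_fit <- pam(x, k = 2)   # partitioning around medoids

pam_fit$medoids            # the two representative observations (rows of x)
table(pam_fit$clustering)  # cluster sizes
```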

Acknowledgements

Special thanks to Chelsea Parlett-Pelleriti for some of the content contained herein