Princeton University
4/18/23
Unsupervised learning
No labels or correct answers
Goal: find structure
(Chelsea Parlett-Pelleriti)
Given a set of data points, each described by a set of attributes, find clusters such that:
Intra-cluster similarity is maximized
Inter-cluster similarity is minimized
Hierarchical (agglomerative): Create a hierarchical decomposition of the set of objects using some criterion
Partitional (k-means): Construct various partitions (k) and then evaluate them by some criterion
\[d = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}\]
\[d = |x_2 - x_1| + |y_2 - y_1|\]
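A minimal sketch of the two distance formulas above (Euclidean, then Manhattan), computed by hand and checked against base R's `dist()`. The two points are hypothetical and chosen only for illustration.

```r
# Two hypothetical 2-D points (not from the slides)
p1 <- c(x = 1, y = 2)
p2 <- c(x = 4, y = 6)

# Euclidean distance: square root of the summed squared differences
euclidean <- sqrt(sum((p2 - p1)^2))

# Manhattan distance: sum of the absolute differences
manhattan <- sum(abs(p2 - p1))

# The same two distances from base R's dist()
pts <- rbind(p1, p2)
euclidean_dist <- as.numeric(dist(pts, method = "euclidean"))
manhattan_dist <- as.numeric(dist(pts, method = "manhattan"))

c(euclidean = euclidean, manhattan = manhattan)  # 5 and 7
```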
Ward's method
Calculate a distance matrix containing the distance between every pair of observations
Student | Math | Music | Biology |
---|---|---|---|
StudentA | 2 | 3 | 2 |
StudentB | 1 | 3 | 2 |
StudentC | 1 | 2 | 1 |
StudentD | 2 | 4 | 4 |
StudentE | 3 | 4 | 3 |
students <- data.frame(
  Student = c("StudentA", "StudentB", "StudentC", "StudentD", "StudentE"),
  Math = c(2, 1, 1, 2, 3),
  Music = c(3, 3, 2, 4, 4),
  Biology = c(2, 2, 1, 4, 3)
)
rownames(students) <- students$Student # use student names as row names
# use the numeric columns only: if the character Student column is left in,
# dist() coerces it to NA and rescales (inflates) every distance
diststudents <- dist(students[, c("Math", "Music", "Biology")], method = "euclidean") # create a distance matrix
diststudents
StudentA StudentB StudentC StudentD
StudentB 1.000000
StudentC 1.732051 1.414214
StudentD 2.236068 2.449490 3.741657
StudentE 1.732051 2.449490 3.464102 1.414214
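# StudentA and StudentB are the closest pair, so merge them into Cluster1,
# represented here by the mean of their scores (1.5, 3, 2)
students2 <- matrix(c(1.5, 3, 2, 1, 2, 1, 2, 4, 4, 3, 4, 3),
                    nrow = 4, byrow = TRUE)
students2 <- as.data.frame(students2)
rownames(students2) <- c("Cluster1", "StudentC", "StudentD", "StudentE")
diststudents2 <- dist(students2, method = "euclidean") # recompute distances after the merge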
diststudents2
Cluster1 StudentC StudentD
StudentC 1.500000
StudentD 2.291288 3.741657
StudentE 2.061553 3.464102 1.414214
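The merge step in the student example can be sketched programmatically: find the closest pair in the distance matrix, replace the pair with its mean, and recompute the distances. Representing the merged cluster by its mean follows the worked example here; the linkage methods in `hclust()` update distances by their own rules, so treat this as an illustration of one step rather than a full algorithm. The object names (`scores`, `merged`) are made up for the sketch.

```r
# Scores from the student example
scores <- matrix(c(2, 3, 2,
                   1, 3, 2,
                   1, 2, 1,
                   2, 4, 4,
                   3, 4, 3),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("Student", LETTERS[1:5]),
                                 c("Math", "Music", "Biology")))

# Full symmetric distance matrix; ignore the zero self-distances
d <- as.matrix(dist(scores, method = "euclidean"))
diag(d) <- Inf

# Row/column position of the smallest distance (first match if tied)
closest <- which(d == min(d), arr.ind = TRUE)[1, ]
pair <- sort(rownames(d)[closest])

# Replace the closest pair with the mean of its members
centroid <- colMeans(scores[pair, ])
merged <- rbind(Cluster1 = centroid,
                scores[!rownames(scores) %in% pair, ])

# Distances after the first merge
dist(merged, method = "euclidean")
```

Repeating the same step on `merged` until one cluster remains gives the full agglomerative sequence.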
Height (y-axis): the distance (dissimilarity) at which two clusters are merged
x-axis: order is not meaningful (leaves are just arranged so the tree looks tidy)
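A short sketch of how a dendrogram like this can be built and cut in base R, using the student scores. The choice of `"ward.D2"` linkage and of k = 2 are assumptions made for illustration.

```r
# Scores from the student example
scores <- matrix(c(2, 3, 2,
                   1, 3, 2,
                   1, 2, 1,
                   2, 4, 4,
                   3, 4, 3),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("Student", LETTERS[1:5]),
                                 c("Math", "Music", "Biology")))

# Agglomerative clustering with Ward's criterion on Euclidean distances
hc <- hclust(dist(scores, method = "euclidean"), method = "ward.D2")

# The y-axis shows the height (distance) at which each merge happens
plot(hc, main = "Student dendrogram", xlab = "", sub = "",
     ylab = "Height (merge distance)")

# Merge heights, in the order the merges occur
hc$height

# Cut the tree into two clusters
groups <- cutree(hc, k = 2)
groups
```

With these five students, the two-cluster cut puts StudentA, StudentB, and StudentC in one group and StudentD and StudentE in the other.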
N = 84
The talkers included three American English regional dialects (New England dialect, the Southern dialect), three international English dialects (British English, Australian English, and Africaans), and nine nonnative accents (Mandarin, Korean, and Japanese from East Asia, Bengali, Gujarati, and Urdu from South Asia, and Indonesian, Tagalog, and Thai from Southeast Asia)
pacman::p_load(tidyverse, factoextra, dendextend, easystats) # tidyverse supplies read_csv() and dplyr
clust_data <- read_csv("https://raw.githubusercontent.com/jgeller112/clustering_project/main/data/class_wide_1.csv")
clust_data <- dplyr::select(clust_data, -...1, -`54`) # drop the index column and subject 54 (formatting problems)
clust_data <- as.data.frame(clust_data)
rownames(clust_data) <- clust_data$speaker # use speaker IDs as row names
clust_data <- dplyr::select(clust_data, -speaker) # drop the speaker column now that it is stored in the row names
head(clust_data)
8 | 7 | 1 | 10 | 11 | 12 | 14 | 15 | 16 | 17 | 18 | 19 | 2 | 20 | 23 | 25 | 26 | 27 | 28 | 29 | 3 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 38 | 4 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 5 | 50 | 51 | 52 | 53 | 55 | 56 | 58 | 59 | 6 | 78 | 87 | 90 | 91 | 96 | 105 | 110 | 111 | 115 | 121 | 123 | 125 | 132 | 133 | 135 | 148 | 151 | 152 | 153 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5 | 5 | 1 | 11 | 2 | 1 | 8 | 4 | 2 | 2 | 1 | 9 | 7 | 4 | 5 | 1 | 1 | 1 | 11 | 5 | 1 | 7 | 7 | 9 | 8 | 10 | 8 | 1 | 3 | 1 | 10 | 1 | 1 | 12 | 1 | 5 | 1 | 5 | 8 | 1 | 3 | 7 | 1 | 8 | 9 | 1 | 1 | 5 | 9 | 1 | 11 | 1 | 11 | 1 | 6 | 11 | 1 | 2 | 1 | 9 | 1 | 3 | 1 | 1 | 1 | 5 | 1 | 8 | 4 | 8 | 6 | 5 | 1 | 5 | 1 | 7 | 4 | 1 | 5 | 1 | 7 | 1 | 2 |
6 | 5 | 5 | 7 | 14 | 4 | 2 | 7 | 4 | 2 | 6 | 3 | 9 | 1 | 4 | 5 | 4 | 2 | 3 | 11 | 12 | 1 | 8 | 11 | 9 | 8 | 10 | 11 | 1 | 4 | 12 | 1 | 1 | 8 | 11 | 1 | 5 | 4 | 1 | 8 | 5 | 4 | 7 | 3 | 8 | 9 | 1 | 8 | 11 | 7 | 1 | 11 | 1 | 11 | 1 | 6 | 11 | 7 | 8 | 6 | 9 | 1 | 6 | 2 | 3 | 6 | 9 | 1 | 8 | 4 | 10 | 6 | 5 | 2 | 1 | 14 | 7 | 6 | 4 | 5 | 1 | 4 | 2 | 4 |
1 | 5 | 5 | 7 | 7 | 4 | 3 | 6 | 2 | 8 | 3 | 3 | 3 | 1 | 3 | 4 | 6 | 1 | 3 | 10 | 2 | 1 | 7 | 8 | 9 | 8 | 10 | 11 | 6 | 3 | 8 | 8 | 1 | 8 | 11 | 1 | 1 | 2 | 5 | 7 | 5 | 4 | 7 | 3 | 8 | 9 | 8 | 1 | 6 | 7 | 5 | 11 | 2 | 6 | 1 | 6 | 10 | 6 | 8 | 6 | 10 | 10 | 4 | 2 | 3 | 1 | 7 | 4 | 8 | 11 | 8 | 6 | 3 | 2 | 3 | 14 | 7 | 7 | 1 | 5 | 1 | 7 | 4 | 4 |
4 | 5 | 5 | 1 | 14 | 4 | 1 | 7 | 4 | 9 | 9 | 1 | 9 | 4 | 3 | 5 | 4 | 2 | 4 | 8 | 9 | 1 | 7 | 9 | 9 | 6 | 10 | 8 | 1 | 2 | 8 | 7 | 1 | 8 | 8 | 3 | 11 | 2 | 2 | 8 | 10 | 3 | 7 | 11 | 8 | 9 | 7 | 3 | 11 | 7 | 1 | 11 | 2 | 10 | 1 | 1 | 11 | 6 | 8 | 6 | 9 | 10 | 4 | 3 | 2 | 3 | 5 | 3 | 2 | 5 | 9 | 6 | 3 | 1 | 3 | 14 | 7 | 5 | 6 | 6 | 1 | 4 | 2 | 5 |
1 | 5 | 5 | 1 | 15 | 4 | 2 | 8 | 4 | 2 | 6 | 1 | 9 | 4 | 5 | 5 | 6 | 2 | 4 | 8 | 5 | 1 | 7 | 9 | 9 | 8 | 10 | 2 | 1 | 2 | 12 | 1 | 1 | 8 | 11 | 1 | 11 | 6 | 5 | 8 | 2 | 4 | 7 | 1 | 8 | 9 | 7 | 10 | 5 | 7 | 1 | 11 | 2 | 11 | 1 | 6 | 11 | 2 | 8 | 6 | 9 | 10 | 4 | 2 | 1 | 1 | 5 | 1 | 2 | 5 | 9 | 6 | 3 | 1 | 4 | 14 | 7 | 5 | 6 | 5 | 1 | 7 | 3 | 6 |
5 | 5 | 5 | 1 | 7 | 4 | 1 | 8 | 4 | 9 | 9 | 3 | 9 | 5 | 7 | 5 | 4 | 4 | 6 | 1 | 5 | 1 | 6 | 9 | 9 | 6 | 10 | 9 | 1 | 4 | 13 | 2 | 1 | 8 | 11 | 6 | 3 | 4 | 5 | 8 | 3 | 3 | 7 | 15 | 8 | 8 | 8 | 6 | 5 | 6 | 2 | 11 | 8 | 11 | 1 | 6 | 11 | 6 | 9 | 6 | 9 | 8 | 4 | 2 | 1 | 2 | 5 | 3 | 8 | 5 | 9 | 6 | 3 | 8 | 4 | 5 | 3 | 7 | 6 | 7 | 6 | 7 | 5 | 1 |
Calculate distance matrix
bengali_9 bengali_13 bengali_16 gujarati_5 gujarati_13
bengali_13 31.272992
bengali_16 30.528675 27.964263
gujarati_5 32.969683 24.677925 26.851443
gujarati_13 31.176915 24.535688 28.178006 23.216374
gujarati_14 35.468296 31.464265 30.199338 26.962938 29.765752
urdu_2 32.202484 30.248967 28.513155 22.671568 28.792360
urdu_15 30.967725 27.477263 28.757608 21.540659 21.000000
urdu_27 30.512293 27.910571 29.240383 22.090722 30.545049
indonesian_1 40.755368 37.403208 35.085610 31.968735 36.537652
indonesian_8 42.567593 40.373258 36.523965 34.770677 40.987803
indonesian_10 37.215588 32.817678 31.480152 31.336879 33.749074
tagalog_6 38.948684 37.403208 33.955854 35.972211 36.674242
tagalog_9 40.137264 36.864617 32.449961 35.665109 37.483330
tagalog_18 35.651087 36.728735 33.361655 35.355339 37.509999
thai_2 37.854986 37.000000 36.207734 37.094474 37.696154
thai_6 39.204592 38.275318 37.188708 37.762415 37.775654
thai_7 41.448764 40.743098 37.788887 39.179076 38.470768
japanese_11 37.934153 37.563280 33.808283 37.309516 37.027017
japanese_12 38.600518 36.221541 31.937439 35.651087 38.026307
japanese_26 38.948684 39.711459 35.227830 38.832976 41.629317
korean_2 39.974992 35.552778 35.298725 37.349699 39.774364
korean_24 39.370039 42.355637 35.298725 37.986840 40.963398
korean_30 41.121770 38.794329 35.482390 35.213634 41.024383
mandarin_14 42.708313 39.089641 36.932371 37.563280 39.749214
mandarin_53 38.574603 41.036569 35.496479 37.054015 39.166312
mandarin_63 40.472213 38.807216 36.823905 35.312887 39.623226
english_21 45.803930 40.938979 38.961519 37.269290 38.858718
english_89 47.801674 42.579338 40.779897 40.348482 41.073106
english_103 48.031240 42.906876 40.779897 39.547440 41.629317
english_428 45.552168 40.311289 39.204592 39.038443 39.179076
english_212 49.122296 40.558600 40.877867 38.987177 39.204592
english_357 48.672374 41.097445 40.804412 40.890097 40.062451
english_288 48.259714 42.626283 41.773197 40.422766 40.755368
english_171 43.977267 42.166337 40.669399 39.812058 41.448764
english_126 48.197510 43.370497 41.749251 42.284749 42.059482
english_3 50.318983 49.355851 47.010637 43.829214 47.476310
english_73 44.988888 45.760245 44.226689 42.626283 43.174066
english_153 49.325450 48.877398 46.206060 45.431267 46.893496
english_2 50.695167 48.867167 46.368092 43.231933 46.561787
english_38 49.618545 49.959984 45.122057 43.023250 47.979162
english_460 51.361464 48.805737 46.368092 44.933284 47.265209
africaans_2 50.606324 49.162994 45.749317 45.122057 46.936127
africaans _5 50.438081 49.497475 46.216880 43.554563 47.413078
africaans _42 51.623638 47.738873 47.106263 45.122057 46.421978
gujarati_14 urdu_2 urdu_15 urdu_27 indonesian_1
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2 28.053520
urdu_15 29.068884 25.845696
urdu_27 23.853721 21.400935 29.933259
indonesian_1 29.748950 32.802439 32.000000 31.811947
indonesian_8 33.045423 35.623026 36.235342 36.510273 23.811762
indonesian_10 29.782545 31.016125 34.146742 30.495901 29.966648
tagalog_6 34.365681 37.013511 33.704599 37.121422 27.856777
tagalog_9 31.984371 40.767634 36.166283 34.842503 29.933259
tagalog_18 35.902646 35.888717 37.841776 35.327043 38.496753
thai_2 32.908965 37.363083 40.422766 34.322005 33.526109
thai_6 34.452866 38.832976 39.293765 35.888717 33.555923
thai_7 34.813790 37.907783 41.436699 35.930488 33.541020
japanese_11 37.349699 38.262253 37.336309 38.340579 35.383612
japanese_12 37.067506 31.859065 33.541020 36.041643 31.764760
japanese_26 34.161382 37.496667 40.199502 34.727511 29.866369
korean_2 35.440090 36.810325 38.923001 34.597688 34.770677
korean_24 34.985711 38.587563 37.696154 38.379682 32.634338
korean_30 35.986108 34.957117 38.209946 35.496479 30.659419
mandarin_14 33.645208 38.535698 40.162171 34.799425 29.681644
mandarin_53 33.615473 36.891733 38.974351 33.481338 30.577770
mandarin_63 33.105891 37.403208 37.134889 33.541020 31.160873
english_21 36.932371 37.000000 38.613469 38.327536 36.290495
english_89 39.230090 39.370039 40.149720 41.255303 37.682887
english_103 39.635842 39.673669 41.279535 40.348482 38.236109
english_428 38.974351 39.496835 40.274061 40.496913 39.293765
english_212 37.456642 39.724048 40.718546 39.724048 40.914545
english_357 38.948684 40.914545 42.355637 40.669399 41.036569
english_288 38.820098 40.447497 41.761226 39.799497 40.644803
english_171 39.522146 38.223030 38.691084 39.484174 38.065733
english_126 38.483763 41.880783 43.657760 40.546270 41.133928
english_3 44.181444 45.508241 45.530210 46.032597 42.860238
english_73 43.104524 43.737855 42.953463 44.418465 41.677332
english_153 44.710178 45.847574 46.195238 46.497312 42.473521
english_2 43.034870 45.155288 46.032597 45.574115 43.462628
english_38 43.794977 44.687806 46.076024 44.933284 42.579338
english_460 43.772137 46.850827 48.425200 45.661800 43.783559
africaans_2 45.000000 47.201695 46.130250 46.904158 44.000000
africaans _5 43.840620 45.022217 45.749317 45.376205 42.836900
africaans _42 43.806392 47.623524 47.265209 46.086874 42.801869
indonesian_8 indonesian_10 tagalog_6 tagalog_9 tagalog_18
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2
urdu_15
urdu_27
indonesian_1
indonesian_8
indonesian_10 29.137605
tagalog_6 28.442925 29.189039
tagalog_9 31.096624 32.124757 27.712813
tagalog_18 37.269290 32.984845 32.496154 33.316662
thai_2 38.483763 35.665109 35.972211 37.229021 37.841776
thai_6 37.242449 34.409301 34.467376 35.805028 38.314488
thai_7 37.389838 35.114100 34.626579 35.171011 34.132096
japanese_11 34.626579 33.316662 33.970576 31.591138 26.645825
japanese_12 32.434549 34.828150 33.301652 35.341194 34.568772
japanese_26 34.539832 33.045423 29.732137 32.832910 32.310989
korean_2 36.110940 33.926391 34.684290 33.301652 38.587563
korean_24 32.124757 36.124784 31.511903 29.342802 32.109189
korean_30 32.511536 34.785054 32.893768 35.665109 36.606010
mandarin_14 34.146742 35.028560 32.388269 31.448370 34.423829
mandarin_53 35.665109 33.985291 33.837849 35.623026 38.613469
mandarin_63 29.427878 33.421550 30.773365 29.410882 33.837849
english_21 35.693137 37.269290 36.592349 37.242449 40.012498
english_89 37.854986 40.024992 37.682887 38.935845 42.166337
english_103 38.794329 38.832976 39.012818 38.574603 42.520583
english_428 40.385641 37.841776 36.959437 39.166312 40.841156
english_212 40.410395 40.099875 39.597980 39.698866 43.081318
english_357 42.178193 39.242834 38.522721 39.089641 42.379240
english_288 40.187063 41.060930 40.249224 40.249224 43.104524
english_171 38.340579 38.091994 35.566838 39.433488 41.194660
english_126 41.000000 40.447497 39.446166 39.924930 43.358967
english_3 40.767634 43.370497 43.943145 45.749317 46.270941
english_73 42.332021 43.806392 40.755368 43.000000 43.646306
english_153 41.097445 45.188494 42.544095 44.586994 45.978256
english_2 41.231056 43.988635 44.373415 46.054316 46.335731
english_38 40.987803 43.393548 43.943145 45.705580 46.421978
english_460 43.312816 44.350874 44.215382 44.124823 45.683695
africaans_2 42.438190 44.899889 42.261093 44.158804 45.934736
africaans _5 40.693980 43.508620 43.255058 45.354162 45.880279
africaans _42 42.508823 44.721360 43.703547 43.335897 46.840154
thai_2 thai_6 thai_7 japanese_11 japanese_12 japanese_26
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2
urdu_15
urdu_27
indonesian_1
indonesian_8
indonesian_10
tagalog_6
tagalog_9
tagalog_18
thai_2
thai_6 19.183326
thai_7 22.956481 21.656408
japanese_11 35.411862 33.136083 31.448370
japanese_12 36.619667 38.013156 35.014283 31.953091
japanese_26 32.249031 29.832868 23.000000 28.495614 31.192948
korean_2 32.295511 28.301943 30.000000 32.787193 34.380227 29.883106
korean_24 33.955854 32.171416 29.495762 29.171904 28.809721 27.294688
korean_30 31.906112 30.594117 27.110883 32.280025 29.681644 24.124676
mandarin_14 26.476405 26.286879 24.738634 32.634338 34.698703 29.103264
mandarin_53 26.851443 25.942244 29.495762 35.369478 37.709415 31.859065
mandarin_63 32.357379 27.730849 29.732137 29.274562 31.464265 29.748950
english_21 38.923001 36.510273 36.742346 38.923001 38.755645 40.435133
english_89 40.298883 39.471509 39.761791 41.665333 39.610605 43.428102
english_103 41.761226 40.743098 40.902323 42.237424 41.436699 44.226689
english_428 38.366652 36.878178 38.948684 40.546270 40.286474 42.684892
english_212 39.572718 37.229021 38.587563 41.400483 43.347434 44.339599
english_357 38.935845 37.682887 38.665230 41.880783 42.343831 43.772137
english_288 39.242834 37.496667 39.025633 42.237424 43.965896 45.497253
english_171 40.012498 37.802116 39.698866 40.681691 38.652296 41.194660
english_126 38.392708 36.138622 38.353618 42.190046 44.011362 43.817805
english_3 46.701178 46.335731 45.650849 45.596052 46.151923 47.212287
english_73 43.023250 43.874822 44.519659 43.783559 43.497126 45.727453
english_153 45.628938 45.276926 43.988635 45.299007 44.955534 45.541190
english_2 45.088801 45.617979 45.563143 46.572524 47.770284 48.466483
english_38 45.066617 46.572524 46.432747 47.169906 46.690470 48.238988
english_460 43.920383 45.376205 44.136153 45.836667 47.560488 47.989582
africaans_2 46.303348 46.432747 45.376205 45.891176 46.636895 47.286362
africaans _5 45.398238 45.836667 45.453273 46.615448 46.626173 47.613023
africaans _42 45.033321 45.276926 45.066617 46.882833 48.445846 48.104054
korean_2 korean_24 korean_30 mandarin_14 mandarin_53 mandarin_63
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2
urdu_15
urdu_27
indonesian_1
indonesian_8
indonesian_10
tagalog_6
tagalog_9
tagalog_18
thai_2
thai_6
thai_7
japanese_11
japanese_12
japanese_26
korean_2
korean_24 32.588341
korean_30 24.474477 27.404379
mandarin_14 25.961510 31.176915 29.681644
mandarin_53 29.461840 33.734256 33.301652 26.305893
mandarin_63 27.276363 27.531800 29.916551 24.819347 26.944387
english_21 36.414283 39.089641 35.397740 34.380227 37.202150 31.906112
english_89 39.509493 41.557190 38.548671 36.097091 39.761791 34.856850
english_103 40.509258 42.461747 39.446166 37.563280 40.360872 36.097091
english_428 39.686270 42.011903 39.749214 37.775654 38.923001 36.235342
english_212 38.170669 43.806392 39.824616 37.429935 38.768544 35.028560
english_357 39.382737 44.328321 41.400483 37.000000 38.794329 36.124784
english_288 39.306488 43.692105 41.036569 36.455452 37.669616 34.655447
english_171 38.574603 40.472213 38.091994 36.986484 38.470768 34.234486
english_126 38.535698 43.439613 40.890097 36.152455 36.537652 34.568772
english_3 45.563143 45.803930 41.581246 43.243497 41.809090 37.523326
english_73 45.628938 45.519227 42.930176 42.801869 39.949969 37.040518
english_153 45.088801 45.661800 41.689327 42.555846 40.902323 37.107951
english_2 45.891176 47.116876 42.883563 43.289722 40.398020 38.000000
english_38 46.065171 46.065171 42.130749 43.863424 39.395431 39.623226
english_460 45.011110 46.840154 43.508620 41.472883 40.174619 37.603191
africaans_2 46.850827 46.076024 43.150898 43.324358 40.828911 38.091994
africaans _5 46.281746 46.173586 42.035699 43.127717 40.792156 37.894591
africaans _42 46.249324 48.383882 44.788391 43.092923 40.975602 38.249183
english_21 english_89 english_103 english_428 english_212
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2
urdu_15
urdu_27
indonesian_1
indonesian_8
indonesian_10
tagalog_6
tagalog_9
tagalog_18
thai_2
thai_6
thai_7
japanese_11
japanese_12
japanese_26
korean_2
korean_24
korean_30
mandarin_14
mandarin_53
mandarin_63
english_21
english_89 9.746794
english_103 13.076697 12.569805
english_428 18.520259 18.547237 18.920888
english_212 14.933185 16.733201 18.493242 17.549929
english_357 18.357560 17.663522 18.220867 11.832160 12.409674
english_288 14.662878 16.852300 18.547237 20.149442 10.099505
english_171 16.613248 17.578396 19.773720 21.702534 20.615528
english_126 16.763055 19.183326 20.000000 17.888544 13.038405
english_3 27.018512 27.622455 26.664583 33.985291 31.352831
english_73 30.757113 30.675723 31.701735 31.733263 33.361655
english_153 27.549955 27.459060 27.202941 32.924155 31.527766
english_2 27.640550 28.407745 27.258026 33.926391 29.816103
english_38 31.208973 32.109189 30.675723 36.207734 33.985291
english_460 29.120440 30.215890 28.687977 32.572995 31.128765
africaans_2 27.946377 27.712813 27.129320 32.588341 30.463092
africaans _5 27.349589 27.184554 26.400758 33.391616 30.149627
africaans _42 30.610456 30.757113 28.565714 33.406586 30.133038
english_357 english_288 english_171 english_126 english_3
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2
urdu_15
urdu_27
indonesian_1
indonesian_8
indonesian_10
tagalog_6
tagalog_9
tagalog_18
thai_2
thai_6
thai_7
japanese_11
japanese_12
japanese_26
korean_2
korean_24
korean_30
mandarin_14
mandarin_53
mandarin_63
english_21
english_89
english_103
english_428
english_212
english_357
english_288 16.492423
english_171 22.869193 19.824228
english_126 13.784049 10.295630 20.952327
english_3 34.249088 30.577770 30.724583 32.202484
english_73 34.044089 34.161382 29.866369 34.336569 22.671568
english_153 32.588341 31.685959 31.921779 31.749016 13.527749
english_2 32.969683 29.206164 32.031235 30.773365 10.954451
english_38 36.428011 32.817678 33.466401 33.926391 15.491933
english_460 33.060551 30.479501 33.852622 30.870698 17.776389
africaans_2 32.093613 29.899833 31.543621 31.336879 14.317821
africaans _5 32.878564 29.580399 30.033315 31.416556 9.380832
africaans _42 30.757113 30.298515 35.284558 30.724583 21.470911
english_73 english_153 english_2 english_38 english_460
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2
urdu_15
urdu_27
indonesian_1
indonesian_8
indonesian_10
tagalog_6
tagalog_9
tagalog_18
thai_2
thai_6
thai_7
japanese_11
japanese_12
japanese_26
korean_2
korean_24
korean_30
mandarin_14
mandarin_53
mandarin_63
english_21
english_89
english_103
english_428
english_212
english_357
english_288
english_171
english_126
english_3
english_73
english_153 19.416488
english_2 23.194827 14.594520
english_38 24.657656 19.052559 13.564660
english_460 21.307276 18.138357 17.088007 19.442222
africaans_2 21.610183 13.190906 14.035669 16.462078 17.804494
africaans _5 21.679483 14.106736 8.944272 13.190906 17.549929
africaans _42 26.438608 18.439089 18.734994 22.869193 21.563859
africaans_2 africaans _5
bengali_13
bengali_16
gujarati_5
gujarati_13
gujarati_14
urdu_2
urdu_15
urdu_27
indonesian_1
indonesian_8
indonesian_10
tagalog_6
tagalog_9
tagalog_18
thai_2
thai_6
thai_7
japanese_11
japanese_12
japanese_26
korean_2
korean_24
korean_30
mandarin_14
mandarin_53
mandarin_63
english_21
english_89
english_103
english_428
english_212
english_357
english_288
english_171
english_126
english_3
english_73
english_153
english_2
english_38
english_460
africaans_2
africaans _5 11.618950
africaans _42 18.330303 19.313208
\[ \text{minimize}\Bigg(\sum_{k=1}^{K} W(C_k)\Bigg) \]
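A sketch of the quantity being minimized, assuming \(W(C_k)\) is the within-cluster sum of squared distances to the cluster mean. It uses the five-student example and a Ward tree; the helper name `total_wss` is made up for the sketch.

```r
# Scores from the student example
scores <- matrix(c(2, 3, 2,
                   1, 3, 2,
                   1, 2, 1,
                   2, 4, 4,
                   3, 4, 3),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("Student", LETTERS[1:5]),
                                 c("Math", "Music", "Biology")))

hc <- hclust(dist(scores), method = "ward.D2")

# Total within-cluster sum of squares for a given cluster assignment
total_wss <- function(data, groups) {
  per_cluster <- sapply(split(as.data.frame(data), groups), function(members) {
    centroid <- colMeans(members)
    sum(sweep(as.matrix(members), 2, centroid)^2)  # W(C_k) for one cluster
  })
  sum(per_cluster)
}

# Total WSS for k = 1, ..., 4 clusters cut from the same tree
wss_by_k <- sapply(1:4, function(k) total_wss(scores, cutree(hc, k = k)))
names(wss_by_k) <- paste0("k=", 1:4)
wss_by_k
```

The elbow method plots these totals against k and looks for the point where adding another cluster stops reducing the total by much.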
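Cohesion \(a(i)\): mean distance from point \(i\) to the other points in its own cluster
Separation \(b(i)\): mean distance from point \(i\) to the points in the nearest other cluster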
\[ s(i) = \frac{b(i)-a(i)}{\max\{a(i), b(i)\}} \]
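A sketch of the silhouette width for a single observation, computed by hand and compared with `silhouette()` from the `cluster` package (a recommended package that ships with standard R installs). The data (standardized `iris`), the Ward linkage, and k = 2 are assumptions chosen only to have something to cluster.

```r
library(cluster)

x <- scale(iris[, 1:4])          # standardized numeric columns
d <- dist(x)
groups <- cutree(hclust(d, method = "ward.D2"), k = 2)

dm <- as.matrix(d)
i <- 1                            # observation to inspect
own <- groups == groups[i]        # members of i's own cluster

# a(i): mean distance to the other members of i's own cluster
a_i <- mean(dm[i, own & seq_along(groups) != i])

# b(i): smallest mean distance to the members of any other cluster
b_i <- min(tapply(dm[i, !own], groups[!own], mean))

s_i <- (b_i - a_i) / max(a_i, b_i)

# The same quantity for every observation
sil <- silhouette(groups, d)
all.equal(s_i, unname(sil[i, "sil_width"]))

# Average silhouette width, the value plotted by the silhouette method
mean(sil[, "sil_width"])
```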
\[ \mathrm{Gap}_n(k) = E^*_n\{\log(W_k)\} - \log(W_k) \]
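A sketch of the gap statistic using `clusGap()` from the `cluster` package. It swaps in `kmeans` on standardized `iris` as the clustering function; the seed, `K.max`, `B`, and `nstart` values are arbitrary assumptions.

```r
library(cluster)

x <- scale(iris[, 1:4])

set.seed(504)  # arbitrary seed; the reference data sets are simulated
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               verbose = FALSE, nstart = 10)

# Columns: logW (observed), E.logW (expected under the reference
# distribution), gap = E.logW - logW, and SE.sim
gap$Tab

# One common rule for picking k from the gap curve and its standard errors
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
best_k
```

The `gap` column is the formula above evaluated at each k, with the expectation estimated from `B` simulated reference data sets.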
# Compare three criteria for choosing the number of clusters
p1 <- fviz_nbclust(clust_data, FUN = hcut, method = "wss",
                   k.max = 10) +
  ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(clust_data, FUN = hcut, method = "silhouette",
                   k.max = 10) +
  ggtitle("(B) Silhouette method")
p3 <- fviz_nbclust(clust_data, FUN = hcut, method = "gap_stat",
                   k.max = 10, nboot = 100) +
  ggtitle("(C) Gap statistic")
Cluster 1: English
Cluster 2: Non-English
Pros:
Shows all possible linkages between clusters
No need to specify the number of clusters in advance
Cons:
Subjective (choosing where to cut the tree is a judgment call)
Scalability (can be computationally intensive with lots of data)
Iris dataset
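A sketch of a k-means fit on the iris measurements. It assumes `iris.scaled` holds the four numeric columns after standardizing, and the seed is arbitrary, so cluster sizes and labels from this run may differ from any printed output.

```r
# Standardize the four numeric measurements (drop the Species column)
iris.scaled <- scale(iris[, -5])

set.seed(504)                            # arbitrary seed, an assumption
km <- kmeans(iris.scaled, centers = 3)   # k = 3, single random start

km                                       # sizes, centers, assignments, sums of squares

# Compare the clusters with the known species labels
table(cluster = km$cluster, species = iris$Species)
```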
K-means clustering with 3 clusters of sizes 33, 21, 96
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 -0.8135055 1.3145538 -1.2825372 -1.2156393
2 -1.3232208 -0.3718921 -1.1334386 -1.1111395
3 0.5690971 -0.3705265 0.6888118 0.6609378
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 2 2 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 2 1 1 1 2 2 1 1 1 2 2 1 1 2 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 2 2 1 1 2 1 2 1 1 3 3 3 3 3 3 3 2 3 3
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
141 142 143 144 145 146 147 148 149 150
3 3 3 3 3 3 3 3 3 3
Within cluster sum of squares by cluster:
[1] 17.33362 23.15862 149.25899
(between_SS / total_SS = 68.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
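# Compare three criteria for choosing k
# (iris.scaled is assumed to hold the standardized numeric iris columns, e.g. scale(iris[, -5]))
p1 <- fviz_nbclust(iris.scaled, kmeans, method = "wss",
                   k.max = 10) +
  ggtitle("(A) Elbow method")
p2 <- fviz_nbclust(iris.scaled, kmeans, method = "silhouette",
                   k.max = 10) +
  ggtitle("(B) Silhouette method")
# kmeans rather than hcut, so all three panels evaluate the same algorithm
p3 <- fviz_nbclust(iris.scaled, kmeans, method = "gap_stat",
                   k.max = 10) +
  ggtitle("(C) Gap statistic")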
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1
Pros:
Simple
Fast
Cons:
Sensitive to outliers
Cluster assignments can change depending on the starting centroids (and so on the order of the data)
Assumes roughly spherical clusters of similar size
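A sketch of the sensitivity to starting centroids: several single-start runs can end at different solutions, while `nstart` asks `kmeans()` to try many starts and keep the best one. The seeds and the number of runs are arbitrary assumptions.

```r
x <- scale(iris[, 1:4])

# Ten single-start runs, each from a different (arbitrary) seed
single_wss <- sapply(1:10, function(seed) {
  set.seed(seed)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})

# One run that tries 25 random starts and keeps the lowest total WSS
set.seed(504)
multi <- kmeans(x, centers = 3, nstart = 25)

range(single_wss)     # spread of total within-cluster SS across single starts
multi$tot.withinss    # total within-cluster SS of the best of 25 starts
```

If the range of `single_wss` is wide, the single-start runs are landing in different local optima, which is the usual reason to set `nstart` above 1.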
DBSCAN
k-medoids
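A sketch of k-medoids with `pam()` from the `cluster` package; DBSCAN is not shown because it needs a separate package. The data (standardized `iris`) and k = 3 are assumptions for illustration.

```r
library(cluster)

x <- scale(iris[, 1:4])

# Partitioning Around Medoids: each cluster center is an actual observation,
# which makes the result less sensitive to outliers than a mean
pm <- pam(x, k = 3)

pm$id.med             # row indices of the medoid observations
pm$medoids            # their (standardized) measurements
table(pm$clustering)  # cluster sizes
```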
Special thanks to Chelsea Parlett-Pelleriti for some of the content contained herein
PSY 504: Advanced Statistics