TY  - BOOK
AU  - Macabenta, Mel Zha Leah M.
TI  - Clustering of data sets with missing values using statistical imputation methods
PY  - 2000///
KW  - Undergraduate Thesis
KW  - AMAT200
N1  - Thesis (BS Applied Mathematics) -- University of the Philippines Mindanao, 2000
N2  - The K-means clustering algorithm is the most widely used clustering algorithm in the field of data analysis. One major drawback of this algorithm is that it cannot accommodate data sets with missing values. In reality, however, the occurrence of missing values cannot be avoided. Imputation methods are used more extensively than deletion in treating missing values. Several imputation methods have been suggested, but each has advantages and disadvantages relative to the others, so a proper choice of imputation method is necessary. Two statistical imputation methods, namely hot deck imputation and imputation using a prediction model, were used to treat the incomplete data sets. The treated data sets were then clustered using the K-means clustering algorithm. To allow a clear comparison, five data sets were used with two kinds of missingness, missing completely at random (MCAR) and missing at random (MAR), at five different levels of degradation ranging from 1% missing values. The resulting clusters were evaluated using the adjusted Rand index. The two methods were compared to the modified K-means algorithm, particularly the modified Euclidean distance. Results showed that hot deck imputation, the regression method, and the modified K-means clustering algorithm attained high recovery of clusters, especially with big data sets, up to the 30% level of missing values. In small data sets, good recovery was attained only up to the 10% level of missing values.
ER  -