A modified K-modes algorithm for clustering categorical data sets with missing values using Bhattacharyya distance function /

Gabiana, Marie Lou Manalili.

A modified K-modes algorithm for clustering categorical data sets with missing values using Bhattacharyya distance function / Marie Lou Manalili Gabiana. - 2006 - 61 leaves.

Thesis (BS Computer Science) -- University of the Philippines Mindanao, 2006

Clustering can be defined as the process of organizing objects in a database into clusters (groups) such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity. This study clustered categorical data sets using the K-modes algorithm. However, this algorithm is designed only for complete data sets and not for data sets that contain missing values. This led to a modification of the K-modes algorithm that incorporates the Bhattacharyya distance. There were two modifications: the first was available case analysis, which uses the information that remains available in the data set, while the second was adaptive imputation, which imputes missing data during the clustering stage. The performance of these modifications was compared with that of existing methods, namely attribute deletion, mode imputation, KNN imputation, and K-modes clustering using the Chi-square distance. The two modifications produced good-quality clustering results compared with K-modes after attribute deletion and K-modes after mode imputation. They were also competitive with K-modes after KNN imputation. The first modification using the Bhattacharyya distance produced higher-quality results than the first modification using the Chi-square distance. The second modification using the Bhattacharyya distance, on the other hand, produced poorer-quality results than the second modification using the Chi-square distance. However, the differences between the results of the second modification under the two distance functions were small. The two modifications using the Bhattacharyya distance were later used to cluster an actual incomplete data set to further verify their clustering performance.
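
The abstract does not state how the Bhattacharyya distance is plugged into K-modes, so the following is only a minimal Python sketch of the two ideas it names, not the thesis's method. It shows the standard Bhattacharyya distance between two discrete category distributions, and a hypothetical available-case dissimilarity that compares two objects only on attributes observed in both. The names MISSING, category_distribution, bhattacharyya_distance and available_case_mismatch, and the use of None to encode a missing value, are assumptions made for illustration.

```python
import math
from collections import Counter

MISSING = None  # assumption: missing values are encoded as None


def category_distribution(values):
    """Relative frequencies of the observed (non-missing) categories."""
    observed = [v for v in values if v is not MISSING]
    counts = Counter(observed)
    total = len(observed)
    return {cat: n / total for cat, n in counts.items()} if total else {}


def bhattacharyya_distance(p, q):
    """Standard definition: -ln(sum_i sqrt(p_i * q_i)) for discrete p, q."""
    # Categories absent from either distribution contribute zero.
    bc = sum(math.sqrt(p[c] * q[c]) for c in p.keys() & q.keys())
    return math.inf if bc == 0 else -math.log(bc)


def available_case_mismatch(x, y):
    """Fraction of mismatches over attributes observed in both objects.

    Hypothetical helper illustrating available case analysis: attributes
    with a missing value in either object are skipped, not imputed.
    """
    pairs = [(a, b) for a, b in zip(x, y)
             if a is not MISSING and b is not MISSING]
    if not pairs:
        return None  # no attribute is observed in both objects
    return sum(a != b for a, b in pairs) / len(pairs)
```

For example, available_case_mismatch(["a", None, "b"], ["a", "c", "d"]) skips the second attribute and compares the remaining two, one of which differs. Adaptive imputation, as described in the abstract, would instead fill in the missing value during clustering; that step is not sketched here because the abstract gives no detail on it.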


Bhattacharyya distance.
Clustering.
K-modes algorithm.
Categorical data.
Missing values.


Undergraduate Thesis --AMAT200,
 
University of the Philippines Mindanao