A modified k-means clustering algorithm with Mahalanobis distance for clustering incomplete data sets / Iresh Granada Moreno.

By: Iresh Granada Moreno
Material type: Text
Language: English
Publication details: 2009
Description: 94 leaves
Dissertation note: Thesis (BS Applied Mathematics) -- University of the Philippines Mindanao, 2009

Abstract: Cluster analysis is the art of finding groups in data such that objects in the same group are similar to each other, whereas objects in different groups are as dissimilar as possible. The most commonly used clustering algorithm is K-means with Euclidean distance. However, this distance function neglects the covariance among the variables when calculating distances. To account for this, the Mahalanobis distance is used. Missing values, however, are inevitable in practice, and data sets that contain them cannot be clustered directly. Existing methods for treating missing values, such as case deletion and mean imputation, are prone to producing erroneous conclusions, because mean imputation introduces unreliable estimates and case deletion significantly reduces the data set. To avoid these problems, the two most essential elements of the K-means clustering algorithm, allocation and representation, were modified. Allocation, which is defined by the Mahalanobis distance, was modified to compute distances between two vectors and to compute variances when some values are unknown. Representation, which is defined by the arithmetic mean, was modified to estimate the mean when one or more values of a given attribute are unknown. The proposed algorithm was applied to incomplete Iris and Bupa data sets simulated under MCAR (missing completely at random) and MAR (missing at random) assumptions with different levels of missing values. Under MAR, case deletion had the highest cluster recovery at 5% of the samples. However, it was clearly outperformed by the proposed algorithm as the proportion of missing values in the sample increased. In general, the modified K-means with Mahalanobis distance outperformed the other algorithms on both data sets.
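As a rough illustration of the two modified steps the abstract names, the Python sketch below shows one way allocation and representation could tolerate missing values (encoded as NaN). It is not the thesis's algorithm, whose exact formulas are not given in this record. The function names, the pairwise-complete covariance, the single global covariance matrix, the rescaling of the distance by the share of observed attributes, and the small ridge term are all assumptions made for illustration.

```python
import numpy as np

def pairwise_cov(X):
    """Covariance from pairwise-complete observations (NaN = missing).

    Assumption: each entry S[i, j] uses only the rows where both
    attributes i and j are observed. The result is not guaranteed
    to be positive definite when many values are missing.
    """
    p = X.shape[1]
    D = X - np.nanmean(X, axis=0)   # deviations; NaN where a value is missing
    S = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            both = ~np.isnan(D[:, i]) & ~np.isnan(D[:, j])
            S[i, j] = (D[both, i] * D[both, j]).sum() / max(both.sum() - 1, 1)
    return S

def partial_mahalanobis_sq(x, mu, S):
    """Squared Mahalanobis distance over the attributes observed in both
    x and mu, rescaled by (total attributes / observed attributes).
    The rescaling is a hypothetical choice, not taken from the thesis."""
    obs = ~np.isnan(x) & ~np.isnan(mu)
    if not obs.any():
        return np.inf               # nothing to compare
    d = x[obs] - mu[obs]
    S_oo = S[np.ix_(obs, obs)]      # covariance of the observed attributes
    return (d @ np.linalg.solve(S_oo, d)) * x.size / obs.sum()

def kmeans_missing(X, k, n_iter=50, seed=0):
    """K-means loop with NaN-aware allocation and representation.
    Assumption: one global covariance matrix, computed once."""
    rng = np.random.default_rng(seed)
    S = pairwise_cov(X) + 1e-6 * np.eye(X.shape[1])   # ridge for invertibility
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Allocation: assign each point to the nearest center.
        dist = np.array([[partial_mahalanobis_sq(x, c, S) for c in centers]
                         for x in X])
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Representation: per-attribute mean over the observed values only.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = np.nanmean(members, axis=0)
    return labels, centers
```

With an identity covariance, the point (1, NaN, 3) measured against a zero centroid has squared distance 1 + 9 = 10 over its two observed attributes, which the rescaling turns into 10 × 3/2 = 15. A fixed global covariance makes this equivalent to K-means on whitened data. A per-cluster covariance would be a different design choice, and the record does not say which the thesis used.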
List(s) this item appears in: BS Applied Mathematics
Holdings
Location and collection | Call number | Status | Barcode
University Library, Theses, Room-Use Only | LG993.5 2009 A64 M67 | Not For Loan | 3UPML00012377
University Library, Archives and Records, Preservation Copy | LG993.5 2009 A64 M67 | Not For Loan | 3UPML00032503

University of the Philippines Mindanao
The University Library, UP Mindanao, Mintal, Tugbok District, Davao City, Philippines
Email: library.upmindanao@up.edu.ph
Contact: (082)295-7025
Copyright © 2022 | All Rights Reserved