Owing to the poor scalability of feature selection algorithms when applied in a centralized manner, most classification algorithms perform sub-optimally on high-dimensional datasets (large feature size, few instances), especially in the presence of irrelevant and redundant features. Although it is imperative to remove insignificant features to improve learning, the process is complex and time-consuming. This paper proposed a distributed learning method (DLM) that horizontally partitions the data while maintaining the class distribution, and that measures the data complexity and the stability of the feature subsets, towards achieving scalability. Min-max normalization, mean imputation, and minority oversampling were applied to the dataset to create a balanced feature-to-sample ratio. Three common feature selection algorithms (information gain, gain ratio, and chi-square) and three classifiers (SVM, decision tree, and Naive Bayes) were used to demonstrate the adequacy of the model. The study obtained 99.67% accuracy with a significant reduction in runtime and reduct (feature subset) size when compared with existing models. The findings suggest that the model holds promise for further improvements in accuracy, runtime, and reduct size.
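The abstract does not give implementation details, so the following Python sketch only illustrates one plausible way to assemble the described pipeline with scikit-learn and imbalanced-learn. The oversampler (SMOTE), the chi-square scorer, the partition count, the per-partition reduct size, and the vote-based aggregation of feature subsets are all illustrative assumptions, not the authors' method.

```python
# Minimal sketch of the pipeline described in the abstract.
# Assumptions (not specified in the abstract): SMOTE as the minority
# oversampler, scikit-learn's chi2 score as the chi-square ranker, and
# voting across partitions to aggregate the per-partition reducts.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

N_PARTITIONS = 4   # hypothetical partition count
K_FEATURES = 20    # hypothetical reduct size per partition

# Synthetic stand-in for a high-dimensional, small-sample, imbalanced dataset.
X, y = make_classification(n_samples=300, n_features=500,
                           n_informative=25, weights=[0.9, 0.1],
                           random_state=0)

# Preprocessing: mean imputation, min-max scaling, minority oversampling.
X = SimpleImputer(strategy="mean").fit_transform(X)
X = MinMaxScaler().fit_transform(X)           # chi2 needs non-negative input
X, y = SMOTE(random_state=0).fit_resample(X, y)

# Horizontal partitioning that preserves the class distribution:
# StratifiedKFold's test folds form disjoint, stratified partitions.
votes = Counter()
for _, part_idx in StratifiedKFold(n_splits=N_PARTITIONS, shuffle=True,
                                   random_state=0).split(X, y):
    selector = SelectKBest(chi2, k=K_FEATURES)
    selector.fit(X[part_idx], y[part_idx])
    votes.update(np.flatnonzero(selector.get_support()))

# Aggregate: keep features selected in at least half of the partitions.
reduct = [f for f, v in votes.items() if v >= N_PARTITIONS // 2]
print(f"aggregated reduct size: {len(reduct)}")

# Evaluate one of the abstract's classifiers on the reduced feature set.
acc = cross_val_score(SVC(), X[:, reduct], y, cv=5).mean()
print(f"mean CV accuracy on reduct: {acc:.4f}")
```

Stratified folds are used here purely as a convenient way to obtain disjoint horizontal partitions that preserve the class distribution; the paper's actual partitioning scheme, and its stability and data-complexity measures, may differ.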