Exploration of Graph-based and Traditional Machine Learning Algorithms in Regards to the Amount of Labeled Training Data Required

Ashwin Balaji, Ekaterina Rapinchuk, Department of Mathematics, Michigan State University, 619 Red Cedar Road, East Lansing, Michigan 48824

Data classification and segmentation, defined as the process of categorizing data into a pre-specified number of clusters, is a machine learning task that is vital in creating algorithms with predictive capabilities and has applications in virtually every field. The task is challenging because models typically rely on large, labeled training data sets to perform well. In fact, modern machine learning techniques, such as support vector machines (SVM) and neural networks, require large amounts of cleaned and processed labeled data to create accurate models. State-of-the-art deep learning techniques, such as convolutional neural networks, require tuning of millions of free parameters to produce optimal results. In recent years, graph-based semi-supervised approaches have been developed that require less parameter tuning and less labeled data to perform accurately. This project examines a graph-based adaptation of the classical numerical Merriman-Bence-Osher (MBO) scheme and the Ginzburg-Landau functional, and compares multiple modern machine learning techniques against their newer graph-based counterparts. We compare two traditional classification methods, support vector machines and neural networks, against three graph-based approaches: random forests, k-nearest neighbors, and the graph-based MBO adaptation. Models are assessed on six datasets, three for binary classification and three for multiclass classification, ranging from 600 to 20,000 elements. All models were implemented in Python using the Keras, scikit-learn, and LIBSVM packages. For each method, accuracies from 10 different subsets of each dataset were averaged. We then analyzed the relationship between training size and accuracy for each model. Some graph-based techniques required as little as 1/6 of the training data needed to consistently produce accuracies of over 98%. Which graph-based model performed most accurately depended on the number of attributes and classes in each dataset. This research can provide insight into how the new field of graph-based techniques can create more accurate semi-supervised classification models.
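The evaluation protocol described above (averaging accuracy over 10 subsets at each training size) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the synthetic dataset, the `mean_accuracy` helper, the training fractions, the default model settings, and the reading of "10 different subsets" as 10 random stratified splits are all assumptions. Only scikit-learn models are shown; the graph-based MBO method is not included.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def mean_accuracy(make_model, X, y, train_fraction, n_repeats=10, seed=0):
    """Average test accuracy over n_repeats random labeled subsets.

    Hypothetical helper: each repeat draws a new stratified training
    subset of the given size and scores the model on the remaining points.
    """
    scores = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_fraction, stratify=y, random_state=seed + r
        )
        model = make_model().fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return float(np.mean(scores))


# Synthetic stand-in for one of the binary datasets (assumption: 600 elements).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

# Two of the methods named in the abstract, with default settings (assumption).
models = {
    "SVM": lambda: SVC(),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=5),
}
fractions = [0.05, 0.1, 0.2, 0.4]  # illustrative training sizes

# Accuracy as a function of training size, averaged over 10 subsets.
results = {
    name: [mean_accuracy(factory, X, y, frac) for frac in fractions]
    for name, factory in models.items()
}
for name, accs in results.items():
    print(name, [round(a, 3) for a in accs])
```

Plotting each list in `results` against `fractions` would give a training-size versus accuracy curve of the kind the abstract analyzes.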

Additional Abstract Information

Presenter: Ashwin Balaji

Institution: Novi High School

Type: Oral

Subject: Computer Science

Status: Approved

Time and Location

Session: Oral 6
Date/Time: Tue 2:00pm-3:00pm
Session Number: 611