As a simplified example of character recognition, we will compare several supervised learning classifiers, with validation, on a larger version of the MNIST digit recognition dataset. In this assignment we will use a much larger dataset than the one used for Assignment 1; this should better represent the natural variability in handwritten 8s and 9s.
Download NumberRecognitionBigger.mat from Moodle. Note that the dataset includes data samples for all handwritten digits 0 to 9, but we will be using only 8 and 9 for this assignment. You can implement your assignment in either Matlab or Python, with details to follow:
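As a rough sketch of restricting a digit dataset to the 8s and 9s in Python: the example below writes a tiny stand-in .mat file and reads it back with scipy.io.loadmat. The file name toy_digits.mat, the variable names "X" and "y", and the array shapes are all assumptions made for illustration; inspect the keys of the real file to find its actual variable names and layout.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Write a tiny stand-in .mat file. The variable names "X" and "y" are
# hypothetical and may differ from those in NumberRecognitionBigger.mat.
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "toy_digits.mat")
toy_labels = np.arange(50) % 10            # five samples of each digit 0..9
savemat(path, {"X": rng.random((50, 784)), "y": toy_labels})

data = loadmat(path)
X_all = data["X"]                          # (n_samples, n_pixels) in this toy file
y_all = data["y"].ravel()                  # loadmat returns 2-D arrays; flatten labels

keep = np.isin(y_all, [8, 9])              # boolean mask selecting only 8s and 9s
X, y = X_all[keep], y_all[keep]
print(X.shape, np.unique(y))
```

Printing sorted(data.keys()) after loadmat is a quick way to see which variables a .mat file actually contains.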
Coding
Example Matlab and Python functions that can be relied upon are already outlined in Assignment 1. Assignment 2 may also benefit from the following commands. You are expected to read documentation on the commands available and try to get them working, prior to asking for assistance. Please address questions to the course
Python:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB as NB
Also strongly consider using:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
and using the random_state argument for either StratifiedShuffleSplit or StratifiedKFold.
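The random_state suggestion above can be illustrated with a small sketch. It uses synthetic labels (an assumption: 60 eights and 40 nines) and a hypothetical helper make_folds to show that a fixed seed reproduces the same folds and that stratification keeps the class proportions in each fold. Note that StratifiedKFold only uses random_state when shuffle=True.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in labels (assumption): 60 eights and 40 nines.
y = np.array([8] * 60 + [9] * 40)
X = np.zeros((len(y), 1))  # feature values do not affect how folds are drawn


def make_folds(seed):
    """Hypothetical helper: return the list of (train, test) index pairs."""
    # shuffle=True is needed for random_state to have any effect.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(skf.split(X, y))


folds_a = make_folds(0)
folds_b = make_folds(0)

# The same seed should give exactly the same test indices on every run.
same = all(np.array_equal(ta, tb)
           for (_, ta), (_, tb) in zip(folds_a, folds_b))
print("identical test folds across runs:", same)

for _, test in folds_a:
    print("fold size:", len(test), "eights:", int(np.sum(y[test] == 8)))
```

Fixing the seed makes the reported error rates repeatable from one run to the next, which helps when comparing classifiers.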
Machine learning Question 1:
Implement K-Fold cross validation (K=5). Within the validation, you will train and compare a Linear Discriminant Analysis Classifier, a Quadratic Discriminant Analysis Classifier, a Bayesian Classifier (Naïve Bayes) and a K-NN (K=1, K=5 and K=10) classifier. The validation loop will train these models for predicting 8s and 9s. NOTE: for a fair comparison, K-Fold randomization should only be performed once, with the samples selected for training in each fold applied to the creation of all classifier types (LDA, QDA, Bayes, KNN) in an identical manner (i.e. the exact same set of training data will be used to construct each model being compared).
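One possible way to organize such a loop is sketched below. It is not a reference solution: it runs on synthetic two-class data from make_classification (an assumption standing in for the 8s and 9s), and the dictionary names models and error_rates are illustrative. The folds are drawn once, stored in a list, and the same list is passed to cross_validate for every classifier so that each model is trained and tested on identical splits.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB as NB
from sklearn.neighbors import KNeighborsClassifier as KNN

# Synthetic two-class data (assumption); swap in the 8s and 9s from the .mat file.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)

# Randomize once and materialize the folds so every model sees the same splits.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

models = {
    "LDA": LDA(),
    "QDA": QDA(),
    "Naive Bayes": NB(),
    "KNN (K=1)": KNN(n_neighbors=1),
    "KNN (K=5)": KNN(n_neighbors=5),
    "KNN (K=10)": KNN(n_neighbors=10),
}

error_rates = {}
for name, model in models.items():
    # cv accepts an iterable of (train_idx, test_idx) pairs.
    scores = cross_validate(model, X, y, cv=folds, scoring="accuracy")
    # Error rate = 1 - mean accuracy over the five test folds.
    error_rates[name] = 1.0 - scores["test_score"].mean()
    print(f"{name:12s} error rate = {error_rates[name]:.3f}")
```

Storing the folds in a list, rather than passing the splitter object to each call, makes it explicit that the randomization happened only once.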
Provide a K-Fold validated error rate for each of the classifiers. Provide a printout of your code (Matlab or Python). Answer the following questions:
- a) Which classifier performs the best in this task?
- b) Why do you think this classifier outperforms the others?
- c) How does KNN compare to the results obtained in Assignment 1? Why do you observe this comparative pattern?
It was previously announced on multiple occasions that each student is required to assemble their own dataset compatible with supervised learning based classification (i.e. a collection of measurements across many samples/instances/subjects that include a group of interest distinct from the rest of the samples).
If you are happy with your choice from Assignment 1, then re-provide your answer to Assignment 1 Question 2 below. If you want to change your dataset for this assignment, for a future assignment or for your graduate project, you are free to do so, but you have to update your answer to Question 2 based on your new dataset choice.
Machine learning Question 2:
(Repeat) Describe the dataset you have collected: total number of samples, total number of measurements, brief description of the measurements included, nature of the group of interest and what differentiates it from the other samples, sample counts for your group of interest and sample count for the group not of interest. Write a program that analyzes each measurement/feature individually.
For each measurement, compute Cohen’s d statistic (the difference between the average value of the group of interest and the average value of the group not of interest, divided by the standard deviation of the joint distribution that includes both groups). Provide a printout of the 10 leading measurements (d statistic furthest from zero), with their corresponding d statistic, making it clear what those measurements represent in your dataset (these are the measurements with the most obvious potential to inform prediction in any given machine learning algorithm). Provide a printout of this code.
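A minimal sketch of the per-measurement d statistic as defined above (difference of group means divided by the standard deviation of all samples together) might look like the following. The data are synthetic, and the names cohens_d, in_group and feature_j are hypothetical placeholders; feature 3 is deliberately shifted so that one measurement stands out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 200 samples, 25 measurements, binary group label.
n_samples, n_features = 200, 25
X = rng.normal(size=(n_samples, n_features))
in_group = rng.random(n_samples) < 0.4      # True = group of interest
X[in_group, 3] += 2.0                       # give feature 3 a clear group difference
names = [f"feature_{j}" for j in range(n_features)]  # hypothetical measurement names


def cohens_d(X, in_group):
    """d per column: mean difference over the std of both groups combined."""
    diff = X[in_group].mean(axis=0) - X[~in_group].mean(axis=0)
    return diff / X.std(axis=0, ddof=1)


d = cohens_d(X, in_group)

# Rank measurements by |d| and keep the 10 furthest from zero.
top10 = np.argsort(-np.abs(d))[:10]
for j in top10:
    print(f"{names[j]:>12s}  d = {d[j]:+.3f}")
```

With a real dataset, replacing the placeholder names with descriptive measurement labels makes the printout self-explanatory.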
Machine learning Question 3:
Adapt your code from Question 1 to be applied to the dataset that you’ve organized for yourself. Provide a printout of the error rates for the different classifiers and your code. Answer the following question: is the best-performing classifier from Question 1 the same in Question 3? Elaborate on those similarities/differences: what about your dataset may have contributed to the differences/similarities observed?