Distance is a key notion underlying many data mining algorithms, such as k-nearest neighbor (k-NN).
Why can it be a problem to compare customers using regular Euclidean distance when, for example, they are described by age (in years), income (in dollars), and number of credit cards? How can this problem be fixed?
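To make the question concrete, the sketch below computes Euclidean distances between three made-up customers, first on the raw attributes and then after z-score standardization. The customer values and the helper names (`euclidean`, `standardize`) are assumptions for illustration only, and standardization is just one common rescaling choice (min-max scaling is another).

```python
import math
from statistics import mean, pstdev

# Hypothetical customers: (age in years, income in dollars, number of credit cards)
customers = [
    (25, 50_000, 1),  # A
    (60, 51_000, 8),  # B: similar income to A, very different age and card count
    (26, 58_000, 1),  # C: similar age and card count to A, different income
]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

# Raw distances: the dollar-valued income column dominates the sum of squares.
raw_ab = euclidean(customers[0], customers[1])
raw_ac = euclidean(customers[0], customers[2])

# Distances after putting every attribute on a comparable scale.
scaled = standardize(customers)
std_ab = euclidean(scaled[0], scaled[1])
std_ac = euclidean(scaled[0], scaled[2])

print(f"raw:          d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"standardized: d(A,B)={std_ab:.2f}  d(A,C)={std_ac:.2f}")
```

On this toy data the raw distance ranks B as closer to A than C is, purely because income is measured in much larger units; after standardization the ranking reverses.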
You currently work for Aperture Science, a small company that sells information technology (IT) products.
The lone data scientist at Aperture approaches you one day and proposes to use k-NN estimation to build a model to predict the IT budget of companies to identify potential new clients. They would like your help building and deploying the model.
The only data you have on hand is a sample of companies across the United States, which includes their IT budget for last year, their total revenue last year, their total number of employees last year, and their industry classification.
This data will make up your database of potential neighbors. Ultimately, as a first true test of the model, you want to predict the IT budget of Acme Corp., a potential client whose IT budget you do not know (but whose total revenue, number of employees, and industry classification you do know).
- Given the information above, explain how you could estimate Acme’s IT budget using k-NN.
- If you chose k=N, the total number of training examples, what would be the effect?
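As a rough illustration of the setup in the two questions above, here is a minimal k-NN estimation sketch. All company figures, the `knn_estimate` helper, and the choice to treat an industry mismatch as a 0/1 distance component are assumptions made for this example, not part of the assignment data.

```python
import math
from statistics import mean, pstdev

# Hypothetical training sample: (revenue in $M, employees, industry, IT budget in $M)
companies = [
    (120.0, 400, "retail", 2.0),
    (150.0, 550, "retail", 2.6),
    (900.0, 3000, "finance", 30.0),
    (1000.0, 3500, "finance", 34.0),
    (130.0, 450, "finance", 4.0),
]

# Hypothetical query company: IT budget unknown.
acme = (140.0, 500, "retail")

def knn_estimate(train, query, k):
    """Average the IT budgets of the k training companies nearest to the query."""
    revs = [row[0] for row in train]
    emps = [row[1] for row in train]
    mu_r, sd_r = mean(revs), pstdev(revs)
    mu_e, sd_e = mean(emps), pstdev(emps)

    def scale(rev, emp):
        # Standardize numeric attributes so neither dominates the distance.
        return (rev - mu_r) / sd_r, (emp - mu_e) / sd_e

    q_rev, q_emp = scale(query[0], query[1])

    def distance(row):
        r_rev, r_emp = scale(row[0], row[1])
        # Assumed encoding: 0 if the industry matches, 1 otherwise.
        industry_gap = 0.0 if row[2] == query[2] else 1.0
        return math.sqrt((r_rev - q_rev) ** 2 + (r_emp - q_emp) ** 2 + industry_gap ** 2)

    nearest = sorted(train, key=distance)[:k]
    return mean([row[3] for row in nearest])

estimate_k2 = knn_estimate(companies, acme, k=2)
estimate_kN = knn_estimate(companies, acme, k=len(companies))
overall_mean = mean([row[3] for row in companies])

print(f"k=2 estimate: {estimate_k2:.2f}")
print(f"k=N estimate: {estimate_kN:.2f} (overall mean: {overall_mean:.2f})")
```

With k equal to the number of training examples, every company is a neighbor, so this unweighted estimate is simply the mean IT budget of the whole sample regardless of the query.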
After seeing a presentation on the power of data mining and hearing about your past consulting work, you are approached to create a model for a university’s admissions office.
The members of the admissions review board just heard about a technique called “clustering”, and they think it would be a good idea to try this technique on the newest batch of applying students for the upcoming academic year.
By sorting applicants into groups, they believe it will be easier to then make application and recruiting decisions.
For confidentiality reasons, you are provided with only a few select attributes for each of the university’s roughly 4,000 undergraduate applicants.
- hsgpa – the applicant’s high school GPA (out of 4)
- sat – the applicant’s SAT score (out of 1600)
- hsize – the size of the applicant’s graduating class
- athlete – whether the student participated in high school athletics for at least 1 year
- Create a cluster model. Use the default k-means algorithm and set the number of clusters (k) to 3. Here is the model URL (you don’t have to create it): https://bigml.com/shared/cluster/eqaIdcHWrODBlC2P07tORoP9ZYZ
In about 5-8 sentences, briefly describe each of the 3 clusters and explain any overarching takeaways about the groups of applicants this university receives.
- Based on your findings in Part A, what type of supervised learning could help further explore the data and these clusters? Do you believe this would provide any meaningful information, and do you have any concerns about a university making application decisions using these types of models (supervised or unsupervised)? Explain why or why not.
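The clustering task above relies on a hosted BigML model. For intuition about what k-means with k = 3 does, the following is a minimal sketch of the standard assign-then-update loop on a handful of made-up applicants described by `hsgpa` and `sat` only. The data values, the `kmeans` helper, and the deterministic choice of initial centroids are assumptions for illustration and do not reproduce BigML's defaults.

```python
import math
from statistics import mean, pstdev

# Hypothetical applicants: (hsgpa out of 4, sat out of 1600)
applicants = [
    (3.9, 1500), (3.8, 1450),
    (3.0, 1100), (3.1, 1150),
    (2.2, 850), (2.3, 900),
]

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def kmeans(points, k, iterations=20):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    # Assumed deterministic initialization: evenly spaced points from the input.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return labels, centroids

labels, centroids = kmeans(standardize(applicants), k=3)
for applicant, label in zip(applicants, labels):
    print(applicant, "-> cluster", label)
```

Standardizing first matters here for the same reason as with k-NN: without it, the SAT column (hundreds of points) would outweigh GPA (fractions of a point) in every distance.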