K-Means Clustering in Python: University Group Case

Erwindra Rusli
Sep 15, 2020

For this project, I will use K-Means clustering to group universities into two clusters, Private and Public. It is important to note that we actually have the labels for this data set, but we will NOT use them when fitting the model, since K-Means is an unsupervised learning algorithm. Under normal circumstances, you use K-Means precisely because you don’t have labels. Here, we will use the labels only to get an idea of how well the algorithm performed. You won’t usually be able to do this with K-Means, so the classification report and confusion matrix at the end of this project don’t truly make sense in a real-world setting!

I will work with the university CSV file from the class and load it into a dataframe called college_data. This dataset contains the following features:

  • Private: A factor with levels No and Yes indicating private or public university
  • Apps: Number of applications received
  • Accept: Number of applications accepted
  • Enroll: Number of new students enrolled
  • Top10perc: Pct. new students from top 10% of H.S. class
  • Top25perc: Pct. new students from top 25% of H.S. class
  • F.Undergrad: Number of full-time undergraduates
  • P.Undergrad: Number of part-time undergraduates
  • Outstate: Out-of-state tuition
  • Room.Board: Room and board costs
  • Books: Estimated book costs
  • Personal: Estimated personal spending
  • PhD: Pct. of faculty with Ph.D.’s
  • Terminal: Pct. of faculty with a terminal degree
  • S.F.Ratio: Student/faculty ratio
  • perc.alumni: Pct. alumni who donate
  • Expend: Instructional expenditure per student
  • Grad.Rate: Graduation rate
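
Loading the file and taking a first look could go roughly as follows. This is a minimal sketch: the helper name load_college_data and the file name in the comments are placeholders of mine, since the article does not name the CSV file.

```python
import pandas as pd


def load_college_data(path):
    """Read the university CSV into a DataFrame.

    `path` is a placeholder: the article does not name the file.
    """
    return pd.read_csv(path)


# Typical first look in a notebook (the file name is hypothetical):
# college_data = load_college_data("College_Data.csv")
# college_data.head()      # first few rows
# college_data.info()      # column types and missing values
# college_data.describe()  # summary statistics for numeric columns
```

If the file stores the school name in its first column, passing index_col=0 to pd.read_csv would keep that text out of the numeric features later on.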

It’s time to create some data visualizations. I created a scatterplot of Grad.Rate versus Room.Board where the points are colored by the Private column.

Then I created a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.
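
Both scatterplots can be sketched with one small seaborn helper. The function name scatter_by_private, the figure size, and the transparency value are my own choices for illustration, not taken from the original script.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def scatter_by_private(df, x, y):
    """Scatter `y` against `x`, coloring the points by the Private column."""
    fig, ax = plt.subplots(figsize=(8, 6))
    # alpha makes overlapping points easier to see (illustrative value)
    sns.scatterplot(data=df, x=x, y=y, hue="Private", alpha=0.6, ax=ax)
    ax.set_title(f"{y} vs. {x}")
    return ax


# Usage with the dataframe from this article:
# scatter_by_private(college_data, x="Room.Board", y="Grad.Rate")
# scatter_by_private(college_data, x="Outstate", y="F.Undergrad")
# plt.show()
```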

Next, I created a stacked histogram of out-of-state tuition (Outstate) split by the Private column. Try doing this using sns.FacetGrid.

Create a similar histogram for the Grad.Rate column.

Notice that there seems to be a private school with a graduation rate higher than 100%.

I set that school’s graduation rate to 100 so it makes sense. I got a warning (not an error) when doing this operation, so I checked the dataframe directly or simply re-did the histogram to make sure the change actually went through.
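
The article does not show the exact assignment, so the following is only a sketch of one way to make the fix. My assumption is that the warning was pandas’ SettingWithCopyWarning, which typically comes from chained indexing; a single .loc assignment avoids it. The helper name cap_grad_rate is hypothetical.

```python
import pandas as pd


def cap_grad_rate(df, cap=100):
    """Return a copy of `df` with Grad.Rate values above `cap` set to `cap`."""
    fixed = df.copy()
    # One .loc call selects the rows and the column together,
    # so pandas writes to `fixed` itself and not to a temporary copy.
    fixed.loc[fixed["Grad.Rate"] > cap, "Grad.Rate"] = cap
    return fixed


# Usage with the dataframe from this article:
# college_data[college_data["Grad.Rate"] > 100]   # find the offending school
# college_data = cap_grad_rate(college_data)
# (college_data["Grad.Rate"] > 100).sum()         # should now be 0
```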

Now it is time to create the cluster labels! Import KMeans from scikit-learn. Create an instance of a K-Means model with 2 clusters. Fit the model to all the data except for the Private label.
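
These steps can be sketched as below. The helper name fit_kmeans is hypothetical, and the n_init and random_state values are assumptions added so the run is reproducible; the article does not say which settings were used.

```python
import pandas as pd
from sklearn.cluster import KMeans


def fit_kmeans(df, n_clusters=2, random_state=42):
    """Fit K-Means on every column except the Private label."""
    # All remaining columns must be numeric for KMeans to accept them.
    features = df.drop(columns="Private")
    # n_init and random_state are illustrative choices, not from the article.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    model.fit(features)
    return model


# Usage with the dataframe from this article:
# kmeans = fit_kmeans(college_data)
# kmeans.cluster_centers_   # one row of feature means per cluster
# kmeans.labels_            # cluster assigned to each university
```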

There is no perfect way to evaluate clustering if we don’t have the labels. However, since this is just an exercise project, we do have the labels, so we can take advantage of them to evaluate our clusters. Keep in mind that we usually won’t have this luxury in the real world. Create a new column in the dataframe called ‘Cluster’, which is 1 for a private school and 0 for a public school.
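
A minimal sketch of building that column, relying on the Yes/No levels of Private listed in the feature description. The helper name add_cluster_column is hypothetical.

```python
import pandas as pd


def add_cluster_column(df):
    """Return a copy of `df` with Cluster = 1 for private, 0 for public."""
    out = df.copy()
    # Private holds the strings "Yes" and "No"; the comparison yields booleans
    out["Cluster"] = (out["Private"] == "Yes").astype(int)
    return out


# Usage with the dataframe from this article:
# college_data = add_cluster_column(college_data)
```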

Create a confusion matrix and classification report to see how well the K-Means clustering worked without being given any labels.
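
The comparison can be sketched with scikit-learn’s metrics module. The helper name evaluate_clusters is hypothetical, and the usage lines assume a fitted model stored in a variable called kmeans, which is also a name of my choosing.

```python
from sklearn.metrics import classification_report, confusion_matrix


def evaluate_clusters(true_labels, cluster_labels):
    """Compare the known 0/1 labels against the K-Means cluster IDs."""
    # Rows of the matrix are the true labels, columns the cluster IDs
    matrix = confusion_matrix(true_labels, cluster_labels)
    report = classification_report(true_labels, cluster_labels)
    return matrix, report


# Usage with the dataframe from this article:
# matrix, report = evaluate_clusters(college_data["Cluster"], kmeans.labels_)
# print(matrix)
# print(report)
```

K-Means numbers its clusters arbitrarily, so cluster 0 may correspond to the private schools. If the diagonal of the matrix looks nearly empty, the cluster IDs are probably swapped relative to the Cluster column and not genuinely wrong.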

Not so bad, considering the algorithm is purely using the features to cluster the universities into 2 distinct groups! Hopefully, we can begin to see how K-Means is useful for clustering unlabeled data!

You can see the full Python script on my GitHub.

Thank You

See you in another Data Exploration.

BR,
Erwindra Rusli
Data Scientist Student in Purwadhika School
