## DS in the Real World

# Applying Machine Learning to Address a Problem That Is More Than Academic

## Breaking down a research article by Himabindu Lakkaraju, et al.

As a data science student, most of the lectures and labs I encountered covered topics in the abstract. We spent more time discussing irises, Titanic passengers, real estate in King County, WA, and a seemingly endless array of Pokemon than I ever expected I would. However, the more I explored research papers and publications, the more I found real-world examples of data science and machine learning in action that I could relate to. This one, in particular, leapt out at me.

## A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes by Himabindu Lakkaraju, et al.

Before studying data science, I spent over fifteen years working in education and academic administration. In fact, one of the first “real” jobs I had was working on a fine arts program at an alternative high school in Chicago.

There are many reasons a student may not finish high school on time (or at all), and other studies have shown that this can have a far-reaching impact on the rest of their life, especially with regard to future career prospects. Additionally, students who do not graduate on time can strain the resources of an already stretched school.

The authors of this paper designed a machine learning framework to identify students at risk of not completing high school on time. This framework “also lays a foundation for future work on other adverse academic outcomes.”

Many schools have intervention programs in place to help at-risk students get back on track so they graduate on time. What they likely do not have is a framework to identify those students who will need an intervention.

The success of these individualized intervention programs depends on schools’ ability to accurately identify and prioritize at-risk students with enough time to implement existing intervention programs to get them back on track.

**THE DATA**

The authors partnered with two U.S. school districts to undertake this study:

District A is in the mid-Atlantic region, with 150,000 students across 40 schools.

District B is on the East Coast, with 30,000 students across 39 schools.

The student attributes they had access to are listed in Table 1 of the paper.

**METHODOLOGY**

The final question in Table 1 is the target of this problem:

## Did the student graduate on time?

The problem of identifying students who are at risk of not graduating on time can thus be formulated as a binary classification problem with ‘no_grad’ as the outcome variable. All other variables in Table 1 can be used as predictors.

So the target is defined as a binary outcome where 1 = No and 0 = Yes. The authors then ran their experiment with several different machine learning models: Logistic Regression, Decision Trees, Random Forests, AdaBoost, and Support Vector Machines.

As the authors explain: “We carry out 100 runs with each of these models and average the predictions (and/or probabilities) to compute the final estimate.”

These models are all fed the data from the *predictors* — “attributes for these students such as their GPAs, absence rates, tardiness, gender, etc.” — to determine whether any of the machine learning models can correctly predict if a student graduated on time.
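To make the setup concrete, here is a minimal sketch of this kind of model comparison in scikit-learn. The data, the number of runs, and every column meaning are stand-ins of my own, not the authors' actual pipeline; only the pattern — fit each of the five model families repeatedly and average the predicted probabilities — mirrors the paper.

```python
# Sketch: compare the five model families from the paper on toy data,
# averaging predicted probabilities across several runs. Everything here
# (features, target, run count) is synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for predictors such as GPA, absence rate, tardiness, ...
X = rng.normal(size=(500, 4))
# Binary target: 1 = did NOT graduate on time ('no_grad'), 0 = graduated on time
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) < 0).astype(int)

n_runs = 5  # the paper uses 100 runs; 5 keeps this sketch fast
models = {
    "logistic_regression": lambda seed: LogisticRegression(max_iter=1000),
    "decision_tree": lambda seed: DecisionTreeClassifier(random_state=seed),
    "random_forest": lambda seed: RandomForestClassifier(n_estimators=100, random_state=seed),
    "adaboost": lambda seed: AdaBoostClassifier(random_state=seed),
    "svm": lambda seed: SVC(probability=True, random_state=seed),
}

avg_risk = {}
for name, make_model in models.items():
    # Average P(no_grad) for every student across the repeated runs
    runs = [make_model(s).fit(X, y).predict_proba(X)[:, 1] for s in range(n_runs)]
    avg_risk[name] = np.mean(runs, axis=0)
```

Averaging probabilities over repeated runs smooths out the randomness that tree ensembles and data shuffling introduce, which is why the authors report the mean of 100 runs rather than a single fit.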

*A WORD ABOUT DATASETS AND MACHINE LEARNING*

In the machine learning process, the data is divided into two or more sets. The largest, the *training set*, is used to train the model: the model reads the data, produces predictions, and its parameters are adjusted to improve those predictions. The data may be passed through many times during training, so there is a risk that the model becomes too familiar with the nuances of this particular set.

A second set, called the *validation set*, is sometimes used during training. As the name suggests, it is used to validate the quality of the training results and to guide fine-tuning of the model.

Finally, the model is given the *test set*, which it has never seen before. The results are an unbiased evaluation of how well the model can predict the outcome in question — namely, did the student graduate on time. This also indicates how well the model will perform on future data.

In this study, the data from District B is used as the test set. Within District A, there are two cohorts of approximately equal size: students scheduled to graduate in 2012, and students scheduled to graduate in 2013. These cohorts are used for training and validation, respectively.
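The split described above can be sketched in a few lines of pandas. The DataFrame layout and column names here are hypothetical; only the design — District A's 2012 cohort for training, its 2013 cohort for validation, and District B held out entirely as the test set — follows the study.

```python
# Sketch of the study's split: train on District A's 2012 cohort,
# validate on its 2013 cohort, and hold out District B for testing.
# DataFrames and column names are hypothetical.
import pandas as pd

def split_by_cohort(district_a: pd.DataFrame, district_b: pd.DataFrame):
    """Return (train, validation, test) frames following the study's design."""
    train = district_a[district_a["cohort"] == 2012]
    validation = district_a[district_a["cohort"] == 2013]
    test = district_b  # a separate district gives an unbiased final evaluation
    return train, validation, test

district_a = pd.DataFrame({"cohort": [2012, 2012, 2013, 2013],
                           "gpa": [3.1, 2.2, 3.5, 1.9]})
district_b = pd.DataFrame({"cohort": [2013, 2013], "gpa": [2.8, 3.0]})

train, validation, test = split_by_cohort(district_a, district_b)
```

Holding out an entire second district, rather than a random slice of the same data, is a stronger test: it checks whether the model generalizes to a different student population, not just to unseen rows from the same one.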

**THE RESULTS**

After running each model 100 times, then averaging the results, it became clear that the Random Forest model (the red line on both graphs) performed the best on data from both districts. It achieved approximately 90% accuracy in predicting students who would not graduate on time. The other models performed comparably on the District A data but did not perform as well on District B.

*ADDITIONAL QUESTION OF RANKING AND PRIORITIZING STUDENTS*

The Random Forest model can thus be applied to predict the probability of a student not graduating on time.

The educators of these school districts asked for a further method whereby they could list their students “according to some measure of *risk* such that students at the top of the list are verifiably at a higher risk.” This would allow the schools to prioritize their interventions with students based on the available resources.
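A risk-ranked list like the one the educators asked for falls out of the model's predicted probabilities. The sketch below shows the idea with synthetic data; the variable names and the cutoff are my own, not taken from the paper.

```python
# Sketch: rank students by predicted risk so that limited intervention
# resources go to the highest-risk students first. Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] < 0).astype(int)  # 1 = at risk of not graduating on time

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
risk = model.predict_proba(X)[:, 1]  # estimated P(no_grad) per student
ranking = np.argsort(risk)[::-1]     # student indices, highest risk first
top_10 = ranking[:10]                # e.g. prioritize the ten riskiest
```

Schools can then walk down the ranked list and intervene with as many students as their resources allow, which is exactly the prioritization the districts requested.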

The authors were able to do this by looking at the *feature importance* of the variables used by each of their models. Feature importance is a method “to understand which factors contribute most heavily to the predictions.” There were variations from one model to the next, but a pattern emerged: a student’s GPA and absence rate in their 8th-grade year could be used to measure their likelihood of getting off track for their scheduled high school graduation date.
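For a random forest, scikit-learn exposes these importances directly. In the sketch below I fabricate data in which GPA and absence rate really do drive the outcome, so the reported importances reflect that; the feature names and generating process are mine, and only the inspection technique mirrors the study.

```python
# Sketch: inspect feature importance with a random forest. The data is
# synthetic, constructed so GPA and absence rate drive the risk, loosely
# echoing the paper's finding about 8th-grade GPA and absences.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 1000
gpa = rng.normal(2.8, 0.6, n)
absence_rate = rng.uniform(0.0, 0.3, n)
tardiness = rng.poisson(2, n).astype(float)
X = np.column_stack([gpa, absence_rate, tardiness])
# Synthetic ground truth: risk rises as GPA falls and absences rise
y = ((2.5 - gpa) + 5 * absence_rate + rng.normal(0, 0.5, n) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["gpa", "absence_rate", "tardiness"], model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

The importances sum to one, so they can be read as the share of the model's predictive power each attribute contributes, which is what lets the authors point schools at a small set of early-warning signals.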

**CONCLUSION**

The study involves more math, models, and calculations than I’ve covered here. I was surprised at how much I enjoyed reading this paper (is that strange?), and I highly recommend reading it in full if you are interested in either machine learning or academic success.

**CREDITS**

*A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes*

Himabindu Lakkaraju, Stanford University, himalv@cs.stanford.edu

Everaldo Aguiar, University of Notre Dame, eaguiar@nd.edu

Carl Shan, University of Chicago, carlshan@uchicago.edu

David Miller, Northwestern University, dmiller@u.northwestern.edu

Nasir Bhanpuri, University of Chicago, nbhanpuri@uchicago.edu

Rayid Ghani, University of Chicago, rayid@uchicago.edu

Kecia L. Addison, Montgomery County Public Schools, Kecia_L_Addison@mcpsmd.org

Publication: KDD ’15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2015, Pages 1909–1918, https://doi.org/10.1145/2783258.2788620