https://github.com/chillu/github-issue-ml-relevancy
Too many GitHub notifications, too little time. The obvious answer is, of course, to spend lots of time building a hyper-personalised prediction engine that can tell me what I'm interested in, and to learn a whole bunch of stuff along the way. This is a tongue-in-cheek experiment, which mostly resulted in the realisation that I'm pretty unpredictable.
See it in action (predictions for the chillu GitHub user):
http://github-issue-ml-relevancy.herokuapp.com
What does this do?
- Collect GitHub events from each repo the viewer has previously interacted with
- Score each issue and pull request based on the number of interactions, if any (see the sketch after this list)
- Train a neural network on both categorical and continuous data, using a regression learner
- Provide a prediction service for this user
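To make the scoring step concrete, here's a minimal sketch of how interaction counts could be turned into a relevancy score. The event types and weights are illustrative assumptions, not the values actually used in this repo:

```python
# Illustrative scoring sketch: weight each GitHub event type and sum the
# interactions per issue. The weights below are assumptions for this example.
EVENT_WEIGHTS = {
    "IssueCommentEvent": 3,
    "PullRequestReviewCommentEvent": 3,
    "IssuesEvent": 2,
    "WatchEvent": 1,
}

def relevancy_score(events):
    """Sum weighted interactions for one issue or pull request."""
    return sum(EVENT_WEIGHTS.get(event["type"], 0) for event in events)

# Example: two comments and a watch add up to a score of 7.
print(relevancy_score([
    {"type": "IssueCommentEvent"},
    {"type": "IssueCommentEvent"},
    {"type": "WatchEvent"},
]))
```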
The approach was also presented in September 2020 at the virtual StripeConEU conference; see the talk recording.
Data Collection
The input parameters are sourced from https://githubarchive.org, a ~6TB data set of every GitHub event ever created. The data is accessible via Google BigQuery. We're only interested in events related to repositories that the user has previously interacted with; in my case, this brought the training data set down to about 20k rows.
See notebook/learn.ipynb for the BigQuery queries run to retrieve the parameters.
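For a rough idea of what such a query looks like, here's a hedged sketch against the public githubarchive tables on BigQuery. The repo name and event filter are placeholders; the actual queries live in notebook/learn.ipynb:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative only: fetch issue-related events for one repo the user
# has interacted with, from a single monthly githubarchive table.
query = """
SELECT type, repo.name AS repo, actor.login AS actor, created_at
FROM `githubarchive.month.202008`
WHERE repo.name = 'silverstripe/silverstripe-framework'
  AND type IN ('IssuesEvent', 'IssueCommentEvent', 'PullRequestEvent')
"""
df = client.query(query).to_dataframe()
```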
Training
Training happens in Python 3 with the fast.ai framework, which builds on awesome libraries like PyTorch, scikit-learn and pandas. We train both a neural network and a random forest.
See notebook/learn.ipynb for a (non-interactive) snapshot of the training process.
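For orientation, here's a minimal fastai v2 tabular sketch of that setup. The file name and column names are placeholders, and the repo may use a different fastai version; the real training code is in notebook/learn.ipynb:

```python
import pandas as pd
from fastai.tabular.all import *

df = pd.read_csv("events.csv")  # hypothetical export of the BigQuery results

# Mixed categorical/continuous tabular data with a regression target ("score").
dls = TabularDataLoaders.from_df(
    df,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=["repo", "event_type"],
    cont_names=["num_comments", "issue_age_days"],
    y_names="score",
    y_block=RegressionBlock(),
)

learn = tabular_learner(dls, metrics=rmse)
learn.fit_one_cycle(5)

# The random forest counterpart would fit sklearn's RandomForestRegressor
# on the same preprocessed columns.
```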
Prediction Frontend
The frontend is a Flask web application served by Gunicorn, running on Python 3. It's hosted on Heroku.
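A minimal sketch of what such a prediction endpoint could look like; the route, model path and feature columns are assumptions about this repo's layout rather than its actual code:

```python
import pandas as pd
from flask import Flask, jsonify, request
from fastai.tabular.all import load_learner

app = Flask(__name__)
learn = load_learner("model.pkl")  # hypothetical path to an exported learner

@app.route("/predict")
def predict():
    # Illustrative only: build a single-row frame from query parameters
    # and run it through the exported fastai model.
    row = pd.DataFrame([{
        "repo": request.args.get("repo", ""),
        "event_type": request.args.get("event_type", ""),
        "num_comments": float(request.args.get("num_comments", 0)),
        "issue_age_days": float(request.args.get("issue_age_days", 0)),
    }])
    dl = learn.dls.test_dl(row)
    preds, _ = learn.get_preds(dl=dl)
    return jsonify({"score": float(preds[0][0])})
```

On Heroku this would typically be served with a Procfile entry along the lines of `web: gunicorn app:app`.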