The DAP Journey: Multi Programming Language Classifier
Machine/Deep Learning Approach to predicting programming language snippets
Introducing the team behind our Data Associate Program (DAP) Project!
Mentor: Yar Khine Phyo
Members: Aw Khai Loong, Felice Png, Lee Yu Hao, Bingyu Yap
Our group would like to sincerely thank Phyo for his support and expertise throughout the semester; this project would not have been possible without his guidance.
How we got started…
We observed that traditional, rule-based code language detectors are often complex. An algorithmic approach requires a very long list of hand-written rules, which must be further tweaked whenever new languages are added.
Hence, we decided to try a machine learning (ML) based approach to see whether our models could successfully predict the language of a code segment after training on existing code! How cool is that! :)
This is a summary of our pipeline, and we will cover each stage in detail below!
Note that the languages we tried to predict are C, HTML, Java, PHP, Python and Ruby, as their features are relatively distinct and easier to tell apart.
Since there were no structured datasets readily available for us to use, we decided to scrape the data ourselves from GitHub repositories using Beautiful Soup.
We first identified the most common coding languages by looking at Stack Overflow surveys. After that, we scraped the repository names from the page HTML and began to clone the repositories from GitHub (for example, we cloned the Python repositories listed at https://github.com/trending/python?since=monthly).
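The scraping step above can be sketched roughly as follows. Our project used Beautiful Soup; this minimal stand-in uses only Python's standard library `html.parser`, and the assumption that repository links on the trending page look like `/owner/repo` is ours for illustration:

```python
import re
from html.parser import HTMLParser

class RepoLinkParser(HTMLParser):
    """Collects hrefs that look like '/owner/repo' repository links."""
    def __init__(self):
        super().__init__()
        self.repos = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Keep only paths with exactly two segments (assumed link shape)
        if re.fullmatch(r"/[\w.-]+/[\w.-]+", href):
            self.repos.append(href)

# A tiny stand-in for the fetched trending-page HTML
sample = ('<h2><a href="/python/cpython">cpython</a></h2>'
          '<h2><a href="/pallets/flask">flask</a></h2>')
parser = RepoLinkParser()
parser.feed(sample)
print(parser.repos)  # repository paths, ready to be cloned
```

Each collected path can then be turned into a `git clone https://github.com{path}` command to pull down the source files.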
Once we obtained the textual data, we realised that the words were useless to us unless we could convert them into numerical data that could be fed into our models.
We first generated the file lists, and for each file we built a token dictionary using a greedy tokenizer that splits text on non-word characters. The tokenizer keeps both the matched words and the delimiters, ensuring that the most common characters are also captured in our dictionary.
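A minimal sketch of this tokenizer, assuming a regex-based implementation (the exact pattern our project used may differ): runs of word characters are matched greedily, and non-word, non-space characters such as `;` and `{` are kept as their own tokens.

```python
import re
from collections import Counter

def tokenize(code):
    # Greedily match runs of word characters, or keep single
    # non-word, non-space characters (delimiters) as tokens
    return re.findall(r"\w+|[^\w\s]", code)

def build_token_dict(files):
    # Count token frequencies across every file
    counts = Counter()
    for code in files:
        counts.update(tokenize(code))
    return counts

counts = build_token_dict(["def f(x): return x",
                           "int main() { return 0; }"])
print(counts.most_common(3))
```

Keeping delimiters matters because characters like `;`, `$` or `<` are strong signals of the underlying language.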
Once that is done, we build the dataset by taking the top tokens from the dictionary and counting how frequently each one appears in every file.
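In code, this step might look like the sketch below (the vocabulary size `top_k` is a placeholder; the real project would use a much larger value):

```python
from collections import Counter

def build_dataset(file_tokens, top_k=3):
    # Pick the top_k most common tokens across all files as the vocabulary
    total = Counter()
    for toks in file_tokens:
        total.update(toks)
    vocab = [tok for tok, _ in total.most_common(top_k)]
    # One row per file: frequency of each vocabulary token in that file
    rows = [[Counter(toks)[tok] for tok in vocab] for toks in file_tokens]
    return vocab, rows

vocab, X = build_dataset([["def", "x", "x"], ["def", "y"]], top_k=2)
print(vocab, X)
```

Each row of `X` is then a fixed-length numerical feature vector, which is exactly what the ML models downstream expect.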
An important point to note is that the data was split into train and test sets prior to model training, to prevent the model from learning any signals from the test set (which it should not!).
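The split itself is simple; in practice one would typically reach for scikit-learn's `train_test_split`, but the idea can be shown with only the standard library (the 80/20 split and fixed seed here are illustrative):

```python
import random

def train_test_split(X, y, test_frac=0.2, seed=42):
    # Shuffle indices with a fixed seed so the split is reproducible,
    # then hold out the last test_frac as the test set
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [X[i] for i in te],
            [y[i] for i in tr], [y[i] for i in te])

X = [[i] for i in range(10)]
y = ["c" if i < 5 else "html" for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))  # 8 2
```

The crucial point is that the split happens before any fitting, so nothing about the test files can leak into the trained model.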
Once we were done with preprocessing, we could begin machine learning on our dataset. We used a mix of models (including Random Forest, Support Vector Machines and Logistic Regression), and even an ensemble of models using a Voting Classifier, to see whether we could obtain an optimal result.
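A minimal sketch of such an ensemble using scikit-learn (the toy feature vectors, labels and hyperparameters below are illustrative, not our actual training data or settings):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy token-frequency features and language labels (illustrative only)
X = [[3, 0, 1], [2, 0, 2], [0, 4, 0], [0, 3, 1], [1, 0, 3], [0, 2, 0]]
y = ["python", "python", "html", "html", "python", "html"]

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X, y)
print(ensemble.predict([[2, 0, 1]]))
```

Soft voting averages each model's class probabilities, so the ensemble can outperform any single model when their errors are uncorrelated.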
Overall, we obtained quite decent results from our models:
However, we decided to go a step further to see if deep learning models would be able to better learn the trends from the tokenizer dictionary. Specifically, we were concerned that our ML pipeline used a simple bag-of-words representation and could not learn from the ordering of tokens, so we turned to neural networks to see whether we could obtain a better result.
To achieve this, we used a pre-made tokenizer (the Keras tokenizer), which splits by characters instead of words. We did this as we felt that the neural network would be able to extract more information from the individual characters present.
Once done, we padded the sequences to account for the different file lengths, and then one-hot encoded the characters to convert the information into numerical data for use by our Long Short-Term Memory (LSTM) model!
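The three steps above (character tokenization, padding, one-hot encoding) can be sketched in plain Python. Our project used Keras utilities for this; the function names and the small `maxlen` below are stand-ins for illustration:

```python
def fit_char_vocab(texts):
    # Map each character to an integer index; 0 is reserved for padding
    chars = sorted(set("".join(texts)))
    return {c: i + 1 for i, c in enumerate(chars)}

def texts_to_sequences(texts, vocab):
    return [[vocab[c] for c in t] for t in texts]

def pad_sequences(seqs, maxlen):
    # Truncate long files, zero-pad short ones to a common length
    return [s[:maxlen] + [0] * max(0, maxlen - len(s)) for s in seqs]

def one_hot(seqs, depth):
    # depth = vocabulary size + 1, to cover the padding index 0
    return [[[1 if i == v else 0 for i in range(depth)] for v in s]
            for s in seqs]

texts = ["int x;", "x = 1"]
vocab = fit_char_vocab(texts)
padded = pad_sequences(texts_to_sequences(texts, vocab), maxlen=8)
encoded = one_hot(padded, depth=len(vocab) + 1)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

After this, every file becomes a tensor of shape (sequence length, vocabulary size), regardless of how long the original file was.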
We made use of LSTM layers as well as dense layers to allow the neural network to properly learn from the data in the input layer.
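An architecture along these lines might be built in Keras as sketched below; the layer sizes (64-unit LSTM, 32-unit dense) and sequence dimensions are placeholders, not our project's actual configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(seq_len, vocab_size, n_classes=6):
    # One-hot character sequences in, one of 6 languages out
    inputs = keras.Input(shape=(seq_len, vocab_size))
    x = layers.LSTM(64)(inputs)                   # learns from ordering
    x = layers.Dense(32, activation="relu")(x)    # placeholder layer size
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(seq_len=100, vocab_size=80)
```

Unlike the bag-of-words models, the LSTM reads the one-hot characters in sequence, which is exactly the ordering information we hoped it would exploit.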
From the LSTM, we obtained an accuracy of 96.88%! Not bad! There are two things to note here.
Firstly, it seems that our Random Forest ML model did better than our LSTM model, suggesting that the neural network's large number of parameters may have caused it to fit some noise instead of actually learning the trend of the data, resulting in a slightly lower accuracy.
More importantly, all of our test results are relatively high (> 90%). While seemingly encouraging, this may simply mean that our models were able to predict new data from GitHub code, and may not necessarily translate to good accuracy for code snippets in general. This is an important point: the accuracy of a result should always be interpreted within the context in which it is measured.
Improvements to note
We noted that when we tested our models on new test data, we managed to get decent results; however, we may need to scrape data from a greater pool of repositories, and also obtain data outside of GitHub, in order to attain a truly reliable programming language classifier.
Feel free to refer to our GitHub repo at https://github.com/yuhaoleeyh/github-DAP
We welcome any suggestions/future improvements to the project! :)