- Posted by João Rodrigues
- On August 28, 2018
This summer, from the 10th to the 14th of September, a group of students will come together at Redglue’s offices to participate in the first edition of the Red Summer Machine Learning Lab. In this summer lab, the participants will use Data Science tools to solve a Credit Card Fraud Classification problem, working with a dataset of credit card transactions. To develop the solution, they will be challenged to use Azure Machine Learning Studio, a cloud-based, GUI-driven integrated development environment for developing and deploying Machine Learning solutions.
It is important that credit card companies are able to recognize and detect fraudulent credit card transactions in order to protect their customers.
The dataset that the participants will work with contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount, which can be used, for example, for example-dependent cost-sensitive learning. The feature ‘Class’ is the response variable and takes the value 1 in case of fraud and 0 otherwise.
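To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above and compares it with the accuracy of a trivial model that labels every transaction as legitimate. The variable names are illustrative only.

```python
# Counts quoted in the dataset description.
n_transactions = 284_807
n_frauds = 492

# Share of transactions that belong to the positive (fraud) class.
fraud_rate = n_frauds / n_transactions
print(f"Fraud rate: {fraud_rate:.4%}")

# A model that always predicts "legitimate" is correct on every non-fraud row,
# so its accuracy is very high even though it never catches a fraud.
baseline_accuracy = (n_transactions - n_frauds) / n_transactions
print(f"Accuracy of an 'always legitimate' baseline: {baseline_accuracy:.4%}")
```

This is why plain accuracy says little on a dataset like this one: the baseline scores above 99.8% while detecting no fraud at all.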
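As a rough sketch of the layout described above, the snippet below lists the input columns and splits one transaction into its features and its label. The column order, the `split_record` helper and the example values are assumptions made for illustration; they are not taken from the lab material.

```python
# Input columns named in the description: Time, V1..V28 and Amount.
# The ordering used here is an assumption.
FEATURE_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]
TARGET_COLUMN = "Class"


def split_record(record):
    """Split one transaction (a dict keyed by column name) into features and label."""
    features = [float(record[name]) for name in FEATURE_COLUMNS]
    label = int(record[TARGET_COLUMN])  # 1 = fraud, 0 = otherwise
    return features, label


# Made-up transaction, used only to exercise the helper.
example = {name: 0.0 for name in FEATURE_COLUMNS}
example["Amount"] = 42.5
example[TARGET_COLUMN] = 0

features, label = split_record(example)
print(len(features), label)
```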
In this lab, the participants will be challenged to work with Azure Machine Learning Studio, a collaborative drag-and-drop tool to build, test and deploy predictive analytics solutions. Azure ML Studio also publishes the developed models as web services that can be easily consumed by custom apps or other tools.
Azure Machine Learning Studio can easily be integrated in data pipelines, making it a good option to integrate with a client’s system in order to offer Machine Learning solutions.
The tool has powerful Data Science capabilities, with drag-and-drop modules for data cleaning/preparation, several ML algorithms, automatic model evaluation and more.
The participants will have to develop a Machine Learning solution that is able to provide analytics about the dataset (Exploratory Data Analysis), apply a Machine Learning algorithm capable of successfully predicting and recognizing fraudulent credit card transactions and evaluate the solution. The solution will then be provided as a web service capable of analyzing and predicting unseen cases.
On the first day of our Red Summer ML Lab, the participants learned how to use Azure Machine Learning Studio by creating a simple credit fraud classification experiment. Their experience was seamless, and they had no issues importing a dataset, choosing different classifiers, and training and evaluating them. They found it very interesting to have such a visual Machine Learning pipeline, with the ability to train models and visualize the results.
Then we switched to Jupyter Notebooks in order to work in a different environment. Using Python, they loaded the credit fraud dataset and started to work on it.
The first task was to develop a classifier evaluation procedure. They learned about different evaluation metrics and chose the ones they wanted to report. With this in place, they were ready to implement different classifiers, work on the dataset and see the results.
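The post does not say which metrics the participants picked. As a minimal sketch, assuming precision, recall and F1 (common choices when the positive class is rare), an evaluation helper in plain Python could look like this. The function names and the toy labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the fraud class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for the example.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Recall answers "how many frauds were caught", while precision answers "how many alerts were real frauds". Both stay informative when almost every transaction is legitimate.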
On the second day, the participants learned how to analyse the dataset and draw conclusions from it. They arrived at different conclusions, but the main points were that the dataset was highly imbalanced (only 0.17% of the cases are fraud), that the features had different scales and that there were some outliers in the data.
To fix these issues, they started by implementing different classifiers and then worked on the scaling of the features, evaluating the impact of these changes on the classifiers.
Over the following days, they will work on the imbalance of the dataset and choose the best approaches, porting them to the Azure ML Studio environment.
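The scaling method used in the lab is not stated. One common option is standardization (subtract the mean, divide by the standard deviation), sketched below for a single column under that assumption. The `standardize` helper and the sample amounts are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Invented transaction amounts with very different magnitudes.
amounts = [1.0, 2.5, 10.0, 250.0, 3.75]
scaled = standardize(amounts)
print(scaled)
```

When a classifier is evaluated on held-out data, the mean and standard deviation should be computed on the training split only and then reused on the test split, so that no information leaks from the test data.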
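Which approaches the participants will end up choosing is not known at this point. Purely as an illustration of one well-known option, the sketch below shows random undersampling: keep every fraud row and draw an equally sized random sample of legitimate rows. The `undersample` function, its `seed` parameter and the toy data are hypothetical.

```python
import random


def undersample(rows, labels, seed=0):
    """Balance a binary dataset by randomly dropping majority-class (label 0) rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    # Keep as many legitimate rows as there are fraud rows.
    keep_legit = rng.sample(legit_idx, k=min(len(fraud_idx), len(legit_idx)))
    keep = fraud_idx + keep_legit
    rng.shuffle(keep)  # avoid having all frauds grouped at the start
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 95 legitimate rows and 5 frauds; each row is just its index.
toy_labels = [0] * 95 + [1] * 5
toy_rows = list(range(100))
balanced_rows, balanced_labels = undersample(toy_rows, toy_labels)
print(len(balanced_rows), sum(balanced_labels))
```

Undersampling discards most of the legitimate transactions, so it trades information for balance. Resampling should be applied to the training split only, leaving the evaluation data at its original class ratio.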