Synthetic financial dataset for fraud detection published on Kaggle

Synthetic datasets generated by the PaySim mobile money simulator have been published on Kaggle so that users can practice machine learning techniques for fraud detection.

There is a lack of publicly available datasets on financial services, especially in the emerging domain of mobile money transactions. When Edgar Lopez started his PhD studies, he had difficulty obtaining datasets to use within the financial domain. Financial datasets are important to many researchers, in particular to those working on fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which keeps such datasets from being released publicly. Financial institutions might have a lot of data, but their main purpose is not to develop effective methods for detecting fraud. The result is a large knowledge gap, which Edgar set out to close by generating synthetic datasets that the research community can use. When he heard of Kaggle, a community driven by data scientists, it seemed an appropriate place to publish the datasets.

– Simulations have some limitations compared with real data, but they also enable researchers to experiment with new scenarios. With the help of simulation you can generate synthetic datasets and study specific fraud phenomena and, more importantly, measure the impact of different controls that we can test before actually implementing them, says Edgar Lopez. Once you have a hint of which control to use, you can simply run another simulation to determine whether your idea was good or not. Finally, you can compare the resulting datasets and find the scenario with the most satisfactory control for fraud prevention.
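The workflow Edgar describes (generate synthetic data, apply a candidate control, compare the outcomes) can be illustrated with a toy sketch. This is not PaySim and does not use its data: the transaction generator, the amount distributions, the 1% fraud rate, and the "block transfers above a threshold" control are all assumptions invented for this illustration.

```python
import random


def simulate_transactions(n, fraud_rate, seed):
    """Generate toy (amount, is_fraud) pairs. Hypothetical stand-in for a simulator."""
    rng = random.Random(seed)
    txs = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        # Assumption: fraudulent transfers tend to be larger than genuine ones.
        mean_amount = 5000.0 if is_fraud else 500.0
        amount = rng.expovariate(1.0 / mean_amount)
        txs.append((amount, is_fraud))
    return txs


def evaluate_control(txs, threshold):
    """Apply a hypothetical control that blocks every transfer above `threshold`.

    Returns (share of fraudulent transfers blocked,
             share of genuine transfers blocked).
    """
    fraud = [amount for amount, is_fraud in txs if is_fraud]
    genuine = [amount for amount, is_fraud in txs if not is_fraud]
    blocked_fraud = sum(amount > threshold for amount in fraud)
    blocked_genuine = sum(amount > threshold for amount in genuine)
    fraud_share = blocked_fraud / len(fraud) if fraud else 0.0
    genuine_share = blocked_genuine / len(genuine) if genuine else 0.0
    return fraud_share, genuine_share


# One simulated scenario, evaluated under three candidate controls.
txs = simulate_transactions(10_000, fraud_rate=0.01, seed=42)
results = {t: evaluate_control(txs, t) for t in (1000, 2500, 5000)}

for threshold, (fraud_share, genuine_share) in results.items():
    print(f"threshold={threshold}: "
          f"fraud blocked={fraud_share:.1%}, genuine blocked={genuine_share:.1%}")
```

A lower threshold blocks more fraud but also more genuine transfers. Comparing the two shares across candidate thresholds is a simple version of the trade-off a simulation lets you examine before deploying a control.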

PaySim simulates mobile money transactions based on a sample of real transactions extracted from a mobile money service implemented in an African country. The original logs were provided by a multinational company that operates a mobile financial service currently running in more than 14 countries around the world.

The synthetic dataset is scaled down to 1/4 of the size of the original dataset and is available on Kaggle: https://www.kaggle.com/ntnu-testimon/paysim1