Abstract:
Recent years have witnessed the emergence of the online social lending market, also known as
peer-to-peer (P2P) lending. Borrowers and lenders interact online through P2P lending platforms
without the presence of a strong intermediary such as a conventional bank. However, while P2P platforms
promote wider financial inclusion, the market is also characterized by higher levels of
information asymmetry than those faced by traditional banks. For this reason, this thesis studies how well
individual investors can deal with information asymmetry by means of machine learning default
prediction modelling, using data provided by the Lending Club P2P platform. To that end, we first examine the
findings of the related literature. We then choose the Random Forest and XGBoost machine learning
classification algorithms for the experimental part of our study, with a Logistic Regression classifier as
the performance benchmark. Our study emphasizes the use of appropriate performance metrics in the presence
of class imbalance, as well as fair and transparent interpretation of the classification results. Next, we
conduct thorough and transparent data preparation. In the experimental results, the performance of the
chosen classifiers is compared, with no significant difference found between them that would justify
ranking them. Additionally, the results of the best-performing classifiers from six related works are
presented, and these results are generally similar to those of our research. However, unlike the related
literature, our study further applies a thresholding technique to the prediction results, which is
shown to reduce the number of misclassified loan defaults, providing the opportunity
for higher and more stable portfolio returns for individual investors. Although we demonstrate how
machine learning classification algorithms combined with the thresholding technique can provide reasonable
results for investors, the observable consistency of the prediction results across the field suggests that
the type of data provided by Lending Club may be insufficient to build machine learning models of high
predictive power. Thus, we underline the need for wider use of alternative data in the P2P lending market.
However, this notion raises a number of questions for further research regarding alternative data
regulations, privacy, and ethics in P2P lending.
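
The thresholding idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the implementation used in the thesis: the function name `confusion_counts`, the toy labels, the predicted default probabilities, and the two candidate thresholds are all hypothetical and chosen only for illustration. The sketch assumes a classifier that outputs a probability of default for each loan and treats default (label 1) as the positive class.

```python
def confusion_counts(y_true, p_default, threshold):
    """Return (tp, fp, fn, tn), treating default (label 1) as the positive class.

    A loan is predicted to default when its predicted probability of
    default is greater than or equal to `threshold`.
    """
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, p_default):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1  # default correctly flagged
        elif predicted == 1 and label == 0:
            fp += 1  # repaid loan wrongly flagged as default
        elif predicted == 0 and label == 1:
            fn += 1  # missed default
        else:
            tn += 1  # repaid loan correctly passed
    return tp, fp, fn, tn


# Hypothetical toy data (not taken from Lending Club): 7 repaid loans, 3 defaults.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
p_default = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45, 0.30, 0.40, 0.60]

# Compare the conventional 0.5 cut-off with a lower, illustrative one.
for threshold in (0.5, 0.3):
    tp, fp, fn, tn = confusion_counts(y_true, p_default, threshold)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: missed defaults={fn}, "
          f"false alarms={fp}, default recall={recall:.2f}")
```

On this toy data, lowering the threshold reduces the number of missed defaults at the cost of rejecting more loans that would have been repaid. Where to set the threshold in practice depends on the relative cost of the two error types for the investor, which this sketch does not model.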