This document provides a more in-depth look at the LoanMate loan approval prediction model.
The LoanMate model utilizes an ensemble learning approach to achieve robust and accurate predictions. It combines the strengths of several individual machine learning algorithms. The core idea is that by aggregating the predictions of multiple models, the overall performance can be improved, and the risk of relying on a single model's potential weaknesses is mitigated.
The ensemble typically includes the following types of models:
- Logistic Regression: A linear model that's good for binary classification tasks and provides interpretable coefficients.
- Random Forest: An ensemble of decision trees that handles non-linear relationships well and is robust to overfitting.
- Support Vector Machine (SVM): Effective in high-dimensional spaces and can model complex decision boundaries.
- XGBoost (Extreme Gradient Boosting): A highly efficient and scalable gradient boosting algorithm known for its performance in competitions.
- LightGBM (Light Gradient Boosting Machine): Another gradient boosting framework that is fast and efficient, especially with large datasets.
Each of these models is trained on the preprocessed data, and their predictions are combined (e.g., through voting or averaging probabilities) to produce the final loan approval decision and probability score.
The model was trained and evaluated on a combination of loan application datasets to ensure generalization across different applicant profiles and loan scenarios.
Other internal or proprietary datasets might have also been used during the development of the model to enhance its robustness.
Consistent and thorough data preprocessing is crucial for the performance of any machine learning model. The following steps are applied to the raw input data before it's fed to the models:
-
Handling Missing Values:
- Numerical features: Missing values are imputed using the mean or median of the respective column.
- Categorical features: Missing values are imputed using the most frequent value (mode) of the respective column.
-
**Feature Engineering (Derived Features):
DebtToIncomeRatio: Calculated as(Existing Debt / Annual Income) * 100.LoanToIncomeRatio: Calculated as(Loan Amount / Annual Income) * 100.IncomeToDebtRatio: Calculated asAnnual Income / (Existing Debt + 1)(1 is added to avoid division by zero).- Credit History is often binarized (e.g., 1 for good history, 0 for bad).
-
Encoding Categorical Features:
- One-Hot Encoding: Categorical features with no inherent order (nominal features like 'Gender', 'Married', 'Self_Employed', 'Property_Area') are converted into numerical format using one-hot encoding. This creates new binary columns for each category.
- Ordinal Encoding/Mapping: Categorical features with an inherent order (ordinal features like 'Education', 'Dependents') might be mapped to numerical values if not handled by one-hot encoding.
-
Scaling Numerical Features:
- StandardScaler: Numerical features (like 'ApplicantIncome', 'LoanAmount', 'Loan_Amount_Term') are scaled to have zero mean and unit variance. This helps algorithms that are sensitive to feature magnitudes (e.g., SVM, Logistic Regression).
These preprocessing steps are encapsulated within a ColumnTransformer and Pipeline from scikit-learn. This ensures that the same transformations are applied consistently during training, evaluation, and prediction.
The build_master_model function in loan_master_model.py is responsible for orchestrating the training process for different dataset types (like 'hugging' and 'loan_approval'). For each dataset type, it defines the specific preprocessor and trains the ensemble model.
The final trained model, which includes the preprocessors and the ensemble models for each dataset type, is saved as a single loan_master_model.pkl file using joblib. joblib is preferred over pickle for saving scikit-learn models as it's more efficient with large NumPy arrays.
When a new loan application is submitted via the Gradio interface:
- The input data is converted into a Pandas DataFrame.
- Derived features are calculated.
- The
predict_loan_approvalfunction is called. - This function selects the appropriate preprocessor and ensemble model based on the
dataset_type(which is 'hugging' for the Space). - The data is transformed using the loaded preprocessor.
- The ensemble model makes a prediction (0 for rejected, 1 for approved) and calculates the probability of approval.
- A risk score is then derived from this probability.
This detailed process ensures that predictions are made consistently with how the model was trained.
