Income Prediction and Validation

Mauricio Prada
7 min readMay 18, 2021

Assessing payment capacity is one of the most important issues financial institutions face when assigning credit limits. Although it might seem trivial, in some cases the information is not available, and due to the informality of some Latin American economies (Colombia in this case study), it is important to build statistical models that can estimate customers' income.

Why use survey information?

We are going to use data from the National Administrative Department of Statistics of Colombia (DANE). The database is the result of a survey of more than 25k households in three major cities in Colombia in 2018.

What data do we have?

All the data and metadata can be found in this link. The data has 331 variables, including spending behaviours and the financial burden of the households.

What question do we want to answer?

1. Can income be modeled from the spending patterns of the household?
2. Can income be modeled from the financial burden of the household?
3. Is there a way financial institutions can include this information in their models?

Data Analysis

The Financial Survey of households and individuals seeks to obtain detailed information on the financial situation, the level of indebtedness and financial education.

Some of the objectives that the information serves are:

  • Know the proportion of households that, according to their socioeconomic situation, have used and currently use the services of the financial sector.
  • Monitor the level of indebtedness of households.
  • Measure the financial burden of households in Colombia.
  • Inquire what real and financial assets households in Colombia own, as well as the financing used to acquire them.
  • Know the level of financial education of Colombian households on issues associated with credit (concepts of interest rate, present value and yield, among others) and with the financial market (operation of the stock market and risk of financial instruments, among others).

The original dataset has 331 variables for 62,838 residents of three major cities in Colombia. It holds some interesting data, such as expenditures, indebtedness and assets.

Reading the database documentation, it can be found that some households do not report their total income, and that there is one entry for every member of the household (including children). For this reason we are going to keep only the household heads, and only the households that report total income.
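The filtering described above can be sketched with pandas. The column names here (`household_head`, `total_income`) are illustrative placeholders; the real survey uses DANE's own variable codes.

```python
import pandas as pd

# Toy frame with hypothetical column names -- the real survey
# uses DANE's own variable codes.
df = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "household_head": [1, 0, 1, 0],   # 1 = head of household
    "total_income": [1_500_000, None, 2_300_000, 800_000],
})

# Keep only household heads that actually report a total income.
heads = df[(df["household_head"] == 1) & (df["total_income"].notna())]
```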

After reviewing the documentation and checking for null values, the following variables were selected.

  1. Income: total income per person.
  2. Expenditures: monthly household spending on food, clothing, utilities, leisure, health and internet.
  3. Credit card info: payment, balance and “term” of the credit card usage.
  4. House value: value of the house, if the person is an owner.
  5. Vehicle value: value of the vehicles (cars or bikes), if the person is an owner.
  6. Savings accounts: balance of the savings accounts.
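The null-value check that drove this selection can be done in one line per variable. The short column names below are illustrative stand-ins for the survey's variable codes.

```python
import pandas as pd

# Hypothetical short names standing in for the survey's variable codes.
columns = ["income", "food_spend", "utilities_spend", "cc_payment",
           "cc_balance", "house_value", "vehicle_value", "savings_balance"]

# Toy data: one missing value per column.
df = pd.DataFrame({c: [1.0, None, 3.0] for c in columns})

# Share of missing values per candidate variable; variables with too
# many nulls would be dropped before modeling.
null_share = df[columns].isna().mean()
```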

Let’s begin by plotting the correlation of the variables…

Correlation matrix
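A correlation matrix like the one above can be produced with pandas and matplotlib; the data and column names below are toy stand-ins for the selected survey variables.

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data with a few of the selected variables (names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["income", "house_value", "cc_balance", "cc_payment"])

corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.tight_layout()
```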

There seem to be some obvious correlations:

  1. Income has a strong correlation with house value.
  2. Credit card balance correlates with credit card payment.

But there are some that are not that simple:

  1. Money spent on utilities is highly correlated with several other spending variables.
  2. Money spent on internet is correlated with money spent on food.

Correlation with Income

Checking correlations with income only, it is clear that house value is strongly correlated with income, and that the number of payments on the credit card is negatively correlated with income. Next, we are going to impute the missing values with the mean.
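Mean imputation is a one-liner with pandas; a minimal sketch with toy values:

```python
import pandas as pd

# Toy frame with missing entries in two of the selected variables.
df = pd.DataFrame({"house_value": [100.0, None, 300.0],
                   "savings_balance": [10.0, 20.0, None]})

# Replace each missing value with the mean of its column.
df_imputed = df.fillna(df.mean())
```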

Modeling and results

Estimating income is basically a regression problem, so for this case we are going to use a regression model. In the industry, income is traditionally already available, because it is easy to extract that information from reliable sources.

We are going to build an XGBoost model, one of the most popular machine learning algorithms for regression problems.

XGBoost is used for supervised learning problems, where we use the training data (with multiple features) to predict a target variable. What is XGBoost? XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform other algorithms, but for tabular data like ours, tree-based ensembles remain very competitive.

The modeling process has the following main steps:

  1. Split data (70/30)
  2. Create a pipeline to perform a grid search over the following parameters:
  • Learning_rate: Step size shrinkage used in updates to prevent overfitting.
  • Max_depth: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.
  • Reg_lambda: L2 regularization term on weights.
  • Gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
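The steps above can be sketched with scikit-learn. This is a minimal sketch on synthetic data, not the exact code used in the analysis; it falls back to sklearn's gradient boosting when the `xgboost` package is not installed, in which case only the parameters both models share are searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# XGBoost if available; otherwise sklearn's gradient boosting as a
# stand-in so the sketch stays runnable.
try:
    from xgboost import XGBRegressor as Booster
    param_grid = {"model__learning_rate": [0.05, 0.1],
                  "model__max_depth": [3, 5],
                  "model__reg_lambda": [1.0, 10.0],
                  "model__gamma": [0.0, 1.0]}
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor as Booster
    param_grid = {"model__learning_rate": [0.05, 0.1],
                  "model__max_depth": [3, 5]}

# Synthetic stand-in for the survey features and income target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=300)

# 1. 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 2. Pipeline + grid search with 3-fold cross-validation.
pipe = Pipeline([("scale", StandardScaler()), ("model", Booster())])
search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)
```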

A grid search was used for hyperparameter optimization.

GridSearch for hyperparameters

Model Evaluation And Validation

Cross-validation was performed to select the best model parameters.

Model trained

These were the resulting parameters:

Model best parameters

The next scatter plot lets us evaluate the performance of the model.

Scatter plot comparing predictions vs real values
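A predicted-vs-observed scatter plot of this kind can be drawn with matplotlib; the values below are synthetic stand-ins for the model's output.

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for observed incomes and model predictions.
rng = np.random.default_rng(1)
y_true = rng.uniform(0.5e6, 5e6, size=200)
y_pred = y_true + rng.normal(scale=3e5, size=200)

fig, ax = plt.subplots()
ax.scatter(y_true, y_pred, alpha=0.4)
# Diagonal reference line: points on it are perfectly predicted.
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, "r--", label="perfect prediction")
ax.set_xlabel("Observed income")
ax.set_ylabel("Predicted income")
ax.legend()
```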

Visually, it seems that our model produced a reasonably good estimate of income.

Variable importance

We can see that spending on housekeeping is a very important predictor of income.
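A variable-importance ranking like the one above can be read off a fitted gradient boosting model. This sketch uses synthetic data whose target depends mostly on the first feature, and illustrative feature names; it falls back to sklearn's gradient boosting when `xgboost` is not installed.

```python
import numpy as np

# XGBoost if available; otherwise sklearn's gradient boosting as a
# stand-in so the sketch stays runnable.
try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Illustrative feature names; the target depends mostly on the first two.
features = ["housekeeping", "cc_payment", "house_value", "internet", "leisure"]
rng = np.random.default_rng(2)
X = rng.normal(size=(300, len(features)))
y = 3 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = Booster().fit(X, y)

# Rank features by the fitted model's importance scores.
ranking = sorted(zip(features, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```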

Metrics

The metrics used to assess the model are based on the impact of the prediction error on the credit-limit decision. They let the policy maker know the impact on the credit decision.

Mean Absolute Error (MAE): a measure of errors between paired observations expressing the same phenomenon. Examples include comparisons of predicted versus observed values, subsequent time versus initial time, or one measurement technique versus an alternative one.

Root Mean Square Error (RMSE): a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. The RMSE is the square root of the second sample moment of the differences between predicted and observed values, or the quadratic mean of these differences.
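Both metrics are available in scikit-learn; a minimal sketch with toy peso values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy observed and predicted incomes (Colombian pesos).
y_true = np.array([1_000_000, 2_000_000, 3_500_000])
y_pred = np.array([1_200_000, 1_700_000, 3_600_000])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
```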

Model MAE: $781,935

Model RMSE: $1,604,596

Regarding our initial questions we have the following:

1. Can income be modeled from the spending patterns of the household?

The most important variable is money spent on housekeeping. The other household expenditures with a high impact on the results are internet, retirement and leisure.

2. Can income be modeled from the financial burden of the household?

The second most important variable is the monthly credit card payment. The other variables related to credit cards are not as important, but the value of the assets has an important effect on income.

3. Is there a way financial institutions can include this information in their models?

The five most important variables can be accessed by banks. Credit card usage, asset values, and some spending patterns can be collected via surveys of some clients.

Conclusion

Publicly available information can be used to identify which variables could make good predictors of income. Financial entities can use this information to decide which variables are worth capturing and including in their models.

Further improvements can be made to both the model and the insights extracted from it.

  • The model: there are almost 300 additional variables to extract information from. The problem is the underrepresentation of some variables.
  • Insights: further analysis can be made on the variables that we found important in explaining income. Transformations and derived variables can be used to do this.

Wrapping up, we found a strong correlation between house value and income (kind of obvious), but the real insight here is that everyday spending patterns, such as housekeeping, turned out to be even more important predictors than the traditional credit variables.

You can find all the files of this analysis on my GitHub :)
