Elo Merchant Category Recommendation: Kaggle competition — A Case Study

Dhananjay Chaudhari
13 min read · Feb 1, 2021

Demo: elo-merchant.herokuapp.com.


Table of Contents:

  1. Business Problem and Objective
  2. Source of Data
  3. Mapping the real-world problem to an ML problem
  4. Exploratory Data Analysis (EDA)
  5. Feature Engineering
  6. The existing approach to the problem
  7. First cut approach to solving the problem
  8. Comparison of the models in tabular format
  9. Kaggle Submission
  10. The final pipeline of the problem
  11. Productionization and Demo
  12. Future Work
  13. References

1. Business Problem and Objective

1.1 What is ELO:

Elo is a purely domestic card brand in Brazil and only allows transactions in Brazilian Real. Elo is the result of a partnership of three of the largest banks in Brazil: Banco do Brasil, Bradesco, and CAIXA. The Elo brand supports credit, debit, and prepaid card transactions. As of March 2019, Elo had issued over 120 million cards.

1.2 Problem Statement:

The Elo merchant category recommendation problem is to predict a customer's loyalty score based on their individual transactions. Elo, one of the largest payment brands in Brazil, has built partnerships with merchants in order to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? We therefore want to develop an algorithm to predict customer loyalty, so that Elo can target offers at loyal customers, which benefits both customers and merchants and also reduces unwanted campaigns for Elo.

1.3 Objectives:

  • The problem statement describes this as a recommendation problem, but what we actually have to predict is a loyalty score, which is a real value. So we can treat it as a regression machine learning problem.
  • We use Root Mean Squared Error (RMSE) as the evaluation metric for this problem; the goal is to minimize RMSE.

2. Source of Data

Link: https://www.kaggle.com/c/elo-merchant-category-recommendation

2.1 Overview:

We have a total of 5 data files, all in CSV format. You will need, at a minimum, the train.csv and test.csv files. These contain the card_ids that we'll be using for training and prediction.

  • Train: Contains 5 features (independent variables): card_id, first_active_month, feature_1, feature_2, feature_3, and a target variable (dependent variable), target.
  • Test: The target variable is not present here; we have to predict the target variable (the loyalty score) from the given 5 features.
  • Merchants: Contains aggregate information for each merchant_id, with a total of 22 columns. Merchants can be joined with the transaction sets to provide additional merchant-level information.
  • Historical_transactions: Contains up to 3 months of transactions for every card at any of the provided merchant_ids, with a total of 14 columns.
  • New_merchant_transactions: Contains the transactions at new merchants (merchant_ids that a particular card_id has not yet visited) over a period of two months, with a total of 14 columns.

3. Mapping the real-world problem to an ML problem:

We will need, at a minimum, the train.csv and test.csv files, which contain the card_ids used for training and prediction. But the data in train.csv alone is not enough to build a good model with a low RMSE score. We also have historical transactions and new merchant transactions for each card_id, so we merge these tables to improve accuracy and reduce the error as much as possible.

historical_transactions.csv and new_merchant_transactions.csv are designed to be joined with train.csv, test.csv, and merchants.csv. They contain information about transactions for each card. Merchant details will also help us understand transactions and customer behavior based on merchant location, merchant type, and so on. Hence we will merge the merchants table as well to form our training data.
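As a rough sketch of how these joins fit together (the file paths, the left-join strategy, and the count feature are illustrative assumptions, not the final pipeline):

```python
import pandas as pd

# File paths are assumptions; adjust to wherever the Kaggle files live.
train = pd.read_csv("train.csv")
hist_trans = pd.read_csv("historical_transactions.csv")
merchants = pd.read_csv("merchants.csv")

# Transactions carry a merchant_id, so merchant-level columns can be joined in.
# (The two files share some column names, hence the suffix for overlaps.)
hist_trans = hist_trans.merge(merchants, on="merchant_id", how="left",
                              suffixes=("", "_mer"))

# Each card has many transactions, so aggregate per card before joining
# back onto train/test via card_id (a simple count shown here).
per_card = (hist_trans.groupby("card_id").size()
            .rename("hist_txn_count").reset_index())
train = train.merge(per_card, on="card_id", how="left")
```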

Evaluation Metric: Here we will use Root Mean Squared Error (RMSE) as an evaluation metric for our regression machine learning problem.
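For reference, RMSE is straightforward to compute with NumPy; a minimal sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Example: rmse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]) ≈ 0.408
```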

4. Exploratory Data Analysis (EDA)

It is very important to understand the data before modeling. EDA helps us understand each variable. It is a key step in the data science life cycle, and around 60–65% of the total time is typically spent on data cleaning, understanding, and visualization.

In EDA, we will observe each variable present in all 5 different data files.

  • The train dataset has a total of 201,917 rows/data points with 5 features and 1 target variable.
  • The test dataset has a total of 123,623 rows/data points with 5 features.
  • The merchants dataset has 334,696 rows with 22 columns.
  • New_merchant_transactions has 1,963,031 rows with 14 columns.
  • Historical_transactions has 29,112,361 rows with 14 columns.

Let's explore datasets and perform EDA one by one:

4.1 Train and Test:

The train and test datasets have a total of 6 columns:

  1. Card_id: Unique card identifier in Alphanumeric form
  2. First_active_month: ‘YYYY-MM’, the month of first purchase.
  3. Feature_1: Anonymized card categorical feature.
  4. Feature_2: Anonymized card categorical feature.
  5. Feature_3: Anonymized card categorical feature.
  6. Target: Loyalty numerical score calculated 2 months after the historical and evaluation period.

The train set has no null values, but the first_active_month column of the test set has one null value.

[2] Train and Test set distribution of features
[3] Target variable distribution in the train set

Observation:

  • As seen in fig[2], feature_1, feature_2, and feature_3 have the same distribution in the train and test sets.
  • For the loyalty score in fig[3], most values range from -10 to +10, but as the graph shows, there are some outliers with a value less than -30.
  • There are 2,207 outliers in the data whose target value is less than -30.
[4] correlation between categorical features for train and test

Observations:

  • In the train set, all features are correlated with each other except feature_2 and feature_3 (value is 0.14).
  • In the test set, all features are highly correlated except feature_3 with feature_1 and feature_2 (values are 0.53 and 0.47 respectively), which are less correlated.
  • It is very difficult to predict the loyalty score from the train set features alone. We have to merge data from merchants.csv, historical_transactions.csv, and new_merchant_transactions.csv for better results.

4.2 Merchants Dataset:

Contains aggregate information for each merchant_id represented in the dataset.

The merchants dataset has a total of 22 columns:

  • merchant_id: Unique merchant identifier
  • merchant_group_id: Merchant group (anonymized)
  • merchant_category_id: Unique identifier for merchant category (anonymized); has 322 unique values
  • subsector_id: Merchant category group (anonymized); has 41 unique values
  • numerical_1: Anonymized measure
  • numerical_2: Anonymized measure
  • category_1: Anonymized category
  • most_recent_sales_range: Range of revenue (monetary units) in the last active month: A > B > C > D > E
  • most_recent_purchases_range: Range of quantity of transactions in the last active month: A > B > C > D > E
  • avg_sales_lag3: Monthly average of revenue in the last 3 months divided by revenue in the last active month; 13 null values
  • avg_purchases_lag3: Monthly average of transactions in the last 3 months divided by transactions in the last active month; 3 null values
  • active_months_lag3: Quantity of active months within the last 3 months; 3 unique values
  • avg_sales_lag6: Monthly average of revenue in the last 6 months divided by revenue in the last active month; 13 null values
  • avg_purchases_lag6: Monthly average of transactions in the last 6 months divided by transactions in the last active month; 3 null values
  • active_months_lag6: Quantity of active months within the last 6 months; 6 unique values
  • avg_sales_lag12: Monthly average of revenue in the last 12 months divided by revenue in the last active month; 13 null values, 12 unique values
  • avg_purchases_lag12: Monthly average of transactions in the last 12 months divided by transactions in the last active month; 3 null values
  • active_months_lag12: Quantity of active months within the last 12 months
  • category_4: Anonymized category; 2 unique values
  • city_id: City identifier (anonymized)
  • state_id: State identifier (anonymized)
  • category_2: Anonymized category; 11,887 null values, 5 unique values

Fill the category_2 null values with a new category, '6.0'.

Fill the numeric columns avg_sales_lag3, avg_sales_lag6, avg_sales_lag12, avg_purchases_lag3, avg_purchases_lag6, and avg_purchases_lag12 with their median values.
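A minimal sketch of this imputation, assuming the merchants data frame is loaded from merchants.csv:

```python
# Replace category_2 nulls with a new category, and lag-column nulls
# with each column's median.
merchants["category_2"] = merchants["category_2"].fillna(6.0)

lag_cols = ["avg_sales_lag3", "avg_sales_lag6", "avg_sales_lag12",
            "avg_purchases_lag3", "avg_purchases_lag6", "avg_purchases_lag12"]
for col in lag_cols:
    merchants[col] = merchants[col].fillna(merchants[col].median())
```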

[5] merchants plot for categorical and numerical values

Observations:

From the above fig[5]

  • For each active_months_lag column, most merchants have the maximum value, i.e. they were active in every month of the window.
  • Looking at avg_sales_lag3, avg_sales_lag6, and avg_sales_lag12, around 50% of merchants' sales are roughly constant (the median value is approximately 1.00).
  • numerical_1 and numerical_2 have 90% of their values less than 0.
  • Looking at most_recent_purchases_range and most_recent_sales_range, sales appear to be increasing.
  • There are some extreme points in the numeric columns, so we will clip each numeric column at its 99th percentile (see the sketch after this list).
  • It looks like numerical_1 and numerical_2 are highly correlated with each other.
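A sketch of the 99th-percentile capping mentioned above (the exact column list is an assumption):

```python
# Cap extreme values at each column's 99th percentile (winsorizing from above).
numeric_cols = ["numerical_1", "numerical_2",
                "avg_sales_lag3", "avg_sales_lag6", "avg_sales_lag12",
                "avg_purchases_lag3", "avg_purchases_lag6", "avg_purchases_lag12"]
for col in numeric_cols:
    merchants[col] = merchants[col].clip(upper=merchants[col].quantile(0.99))
```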
[6] correlation between categorical features for merchants data

Observations:

  • For the categorical features, category_1, category_2, category_4, and the active_months_lag columns are correlated with each other.
  • For the numerical features, we used the Spearman rank correlation coefficient. numerical_1 and numerical_2 are correlated with each other. avg_sales_lag3, avg_sales_lag6, avg_sales_lag12, avg_purchases_lag3, avg_purchases_lag6, and avg_purchases_lag12 are highly correlated with each other. subsector_id is not correlated with any feature, so we can remove it.

4.3 New Merchants Transactions Dataset:

Contains the transactions at new merchants(merchant_ids that this particular card_id has not yet visited) over a period of two months.

The new merchant transactions dataset has a total of 14 columns:

  • card_id: Card identifier
  • month_lag: Month lag to reference date
  • purchase_date: Purchase date
  • authorized_flag: 'Y' if approved, 'N' if denied
  • category_3: Anonymized category; has 55,922 null values
  • installments: Number of installments of purchase
  • category_1: Anonymized category
  • merchant_category_id: Merchant category identifier (anonymized)
  • subsector_id: Merchant category group identifier (anonymized)
  • merchant_id: Merchant identifier (anonymized); has 26,216 null values
  • purchase_amount: Normalized purchase amount
  • city_id: City identifier (anonymized)
  • state_id: State identifier (anonymized)
  • category_2: Anonymized category; has 111,745 null values

Fill the category columns: category_2 with a new category '6.0', category_3 with 'D', and merchant_id with 'M_ID_00a6ca8a8a'.

[7] new merchants transactions plot for categorical and numerical values

Observations:

From the above fig[7]

  • All authorized_flag values are 'Y'. This means all transactions at new merchants are authorized, so we can drop this column from the final data.
  • Around 96.6% (1,829,695 / 1,888,122) of transactions belong to type 'N' in category_1, and 54% (1,021,814 / 1,888,122) belong to type '1' in category_2.
  • 92.7% ((918,880 + 832,601) / 1,888,122) of transactions have 0 or 1 installments.
  • month_lag is almost equally distributed across all transactions.
[8]Relationship between purchase date, first active month, and target for new transactions

Observations:

From the above fig[8]: in the first plot (first_active_month vs. target), the loyalty score confidence interval is much narrower in 2017–2018 compared to previous years, which means a lot of cards were activated in 2017–2018. In the second graph, we can see the number of purchases increased in 2017–2018. So we can conclude that the loyalty score is related to the number of purchases a customer makes.

[9] Correlation test for new merchants transaction data

Observations:

From above fig[9]

  • For the categorical features, month_lag is highly correlated with category_1 and authorized_flag.
  • For the numerical features, purchase_amount and installments are only weakly correlated with each other.

4.4 Historical Transactions Dataset:

Contains up to 3 months of transactions for every card at any of the provided merchant_ids.

The historical transactions dataset has a total of 14 columns:

  • card_id: Card identifier
  • month_lag: Month lag to reference date
  • purchase_date: Purchase date
  • authorized_flag: 'Y' if approved, 'N' if denied
  • category_3: Anonymized category; has 178,159 null values
  • installments: Number of installments of purchase
  • category_1: Anonymized category
  • merchant_category_id: Merchant category identifier (anonymized)
  • subsector_id: Merchant category group identifier (anonymized)
  • merchant_id: Merchant identifier (anonymized); has 138,481 null values
  • purchase_amount: Normalized purchase amount
  • city_id: City identifier (anonymized)
  • state_id: State identifier (anonymized)
  • category_2: Anonymized category; has 2,652,864 null values

Fill the category columns: category_2 with a new category '6.0', category_3 with 'D', and merchant_id with 'M_ID_00a6ca8a8a'.

[10] Historical transactions plot for categorical and numerical values

Observations:

From above fig[10],

  • There are both authorized and unauthorized transactions, but 91.55% (26,235,810 / 28,657,333) are authorized.
  • Around 92.97% (26,641,630 / 28,657,333) of transactions belong to type 'N' in category_1, and 52.3% (14,986,115 / 28,657,333) belong to type '1' in category_2.
  • 94.12% ((15,355,645 + 11,615,799) / 28,657,333) of transactions have 0 or 1 installments.
[11] Number of transactions for each card id

Observations:

There are a few card_ids that made a lot of transactions. It is possible that the loyalty score is related to the number of transactions and the number of declined transactions.

5. Feature Engineering

The features given in the dataset are not enough for accurate prediction, so we have to perform some feature engineering to generate new features from the existing ones.

5.1 From Train and Test Dataset:

We have 5 given features in the train set, and with their help we have generated the features below (a sketch of the date-based ones follows the list).

‘first_active_month’, ‘card_id’, ‘target’, ‘outliers’, ‘quarter’, ’elapsed_time’, ’days_feature1’, ‘days_feature2’, ‘days_feature3’, ‘days_feature1_ratio’, ‘days_feature2_ratio’, ‘days_feature3_ratio’, ‘feature_sum’, ‘feature_mean’, ‘feature_max’, ‘feature_min’
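A sketch of how some of these can be derived (the reference date for elapsed_time is an assumption; public kernels commonly use February 2018):

```python
import datetime

import pandas as pd

# train is the data frame loaded from train.csv.
train["first_active_month"] = pd.to_datetime(train["first_active_month"])
train["quarter"] = train["first_active_month"].dt.quarter
train["elapsed_time"] = (datetime.datetime(2018, 2, 1) - train["first_active_month"]).dt.days

# Interactions between card age and the anonymized features.
for i in (1, 2, 3):
    train[f"days_feature{i}"] = train["elapsed_time"] * train[f"feature_{i}"]
    train[f"days_feature{i}_ratio"] = train[f"feature_{i}"] / train["elapsed_time"]

# Simple row-wise statistics over the three anonymized features.
feats = train[["feature_1", "feature_2", "feature_3"]]
train["feature_sum"] = feats.sum(axis=1)
train["feature_mean"] = feats.mean(axis=1)
train["feature_max"] = feats.max(axis=1)
train["feature_min"] = feats.min(axis=1)
```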

5.2 Merchants Dataset:

We will aggregate all features of the merchants dataset per unique merchant_id.

https://gist.github.com/dhananjaychaudhari26/500411013edca0dead8cf32b7c5eb8bc
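The gist above holds the full aggregation; a minimal sketch of the groupby pattern, with illustrative statistics, looks like this:

```python
# The chosen columns and statistics here are assumptions for illustration.
agg_spec = {
    "numerical_1": ["mean", "max", "min"],
    "numerical_2": ["mean", "max", "min"],
    "avg_sales_lag3": ["mean"],
    "avg_purchases_lag3": ["mean"],
}
merchants_agg = merchants.groupby("merchant_id").agg(agg_spec)

# Flatten the MultiIndex column names, e.g. mer_numerical_1_mean.
merchants_agg.columns = ["mer_" + "_".join(col) for col in merchants_agg.columns]
merchants_agg = merchants_agg.reset_index()
```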

5.3 New Transactions Dataset:

We have 14 given features in the new transactions set, and with their help we have generated the features below (a sketch follows the list).

‘hour’, ‘day’, ‘weekday’, ‘weekend’, ‘month’, ‘weekofyear’, ‘month_diff’, ‘duration’, ‘amount_month_ratio’, ‘price’, ‘Christmas_Day_2017’, ‘Mothers_Day_2017’, ‘fathers_day_2017’, ‘Children_day_2017’, ‘Valentine_Day_2017’, ‘Black_Friday_2017’, ‘Mothers_Day_2018’,
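A sketch of the date-based features (new_trans as the data-frame name and the 100-day holiday window are assumptions):

```python
import datetime

import pandas as pd

# new_trans is the new_merchant_transactions data frame.
new_trans["purchase_date"] = pd.to_datetime(new_trans["purchase_date"])
new_trans["hour"] = new_trans["purchase_date"].dt.hour
new_trans["day"] = new_trans["purchase_date"].dt.day
new_trans["weekday"] = new_trans["purchase_date"].dt.weekday
new_trans["weekend"] = (new_trans["weekday"] >= 5).astype(int)
new_trans["month"] = new_trans["purchase_date"].dt.month
new_trans["weekofyear"] = new_trans["purchase_date"].dt.isocalendar().week.astype(int)

# Months since the purchase, shifted by month_lag (a common construction
# in public kernels for this competition).
new_trans["month_diff"] = ((datetime.datetime.today() - new_trans["purchase_date"]).dt.days // 30
                           + new_trans["month_lag"])

# Holiday features: days until the holiday if the purchase falls in the
# run-up window, else 0.
days_to_xmas = (pd.Timestamp("2017-12-25") - new_trans["purchase_date"]).dt.days
new_trans["Christmas_Day_2017"] = days_to_xmas.where((days_to_xmas >= 0) & (days_to_xmas < 100), 0)
```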

First, we will merge the aggregated merchant features above with the new transactions list with the help of merchant_id.

https://gist.github.com/dhananjaychaudhari26/9f031d0dabe7f3d8224f37433dda3770

After merging the merchants data with the new transactions, we perform aggregation on the new transactions data frame:

https://gist.github.com/dhananjaychaudhari26/3a3b18e613e035e52059edf5200ca83c

5.4 Historical Transactions Dataset:

We have 14 given features in the historical transactions set, and with their help we have generated the same features as for the new transactions:

‘hour’, ‘day’, ‘weekday’, ‘weekend’, ‘month’, ‘weekofyear’, ‘month_diff’, ‘duration’, ‘amount_month_ratio’, ‘price’, ‘Christmas_Day_2017’, ‘Mothers_Day_2017’, ‘fathers_day_2017’, ‘Children_day_2017’, ‘Valentine_Day_2017’, ‘Black_Friday_2017’, ‘Mothers_Day_2018’,

First, we will merge the aggregated merchant features above with the historical transactions list with the help of merchant_id.

https://gist.github.com/dhananjaychaudhari26/53ef38049919248f6f65923a9ab4b33a

After merging the merchants data with the historical transactions, we perform aggregation on the historical transactions data frame:

https://gist.github.com/dhananjaychaudhari26/5d342d4b19f1d78c2c0013e018815039

Now merge new_trans_df and hist_trans_df with the train dataset with the help of card_id, and add some more features to the train set:

https://gist.github.com/dhananjaychaudhari26/ad53be2ac50cf8d23e5d6dd2838777f4

6. The existing approach to the problem

Link: https://www.kaggle.com/brendanhasz/elo-modeling

In this solution, they calculate entropy, mean_diff, and period for each column as aggregations.

Entropy: A feature corresponding to entropy could be informative — for example, it could be that card accounts with high entropy over the merchants they use their card with are more likely to be more loyal card users than those who only use their card with a single merchant (and therefore have low entropy).

mean_diff: Computes the mean difference between consecutive items in a series. For example, given a column with the purchase date, this function would compute the mean time between purchases. This could conceivably be a good predictor of how likely an individual is to be a loyal card user: individuals who use their cards regularly and frequently are probably more likely to be loyal.

period: It is the difference between the minimum and maximum value. For example, customers who have been making purchases over a long period of time (the difference between the date of their first and last purchases is large) may be more likely to be loyal card users.
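These three aggregation functions are easy to express in pandas; a minimal sketch (hist_trans is an assumed name for the historical transactions frame):

```python
import numpy as np

def entropy(series):
    """Shannon entropy of a categorical series, e.g. merchant_id per card."""
    p = series.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def mean_diff(series):
    """Mean gap between consecutive sorted values, e.g. mean time between purchases."""
    return series.sort_values().diff().mean()

def period(series):
    """Span between the minimum and maximum value, e.g. length of purchase history."""
    return series.max() - series.min()

# Per-card aggregation over historical transactions:
# hist_trans.groupby("card_id").agg(
#     merchant_entropy=("merchant_id", entropy),
#     purchase_period=("purchase_date", period),
# )
```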

It also uses mutual information for feature selection; mutual information represents the amount of information that can be gained about one variable by knowing the value of another.
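A sketch of mutual-information-based feature selection with scikit-learn (train_df and the 0.01 threshold are assumptions, and all columns are assumed to be numeric at this point):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# train_df is the fully merged training frame.
X = train_df.drop(columns=["card_id", "target"]).fillna(0)
mi = mutual_info_regression(X, train_df["target"], random_state=42)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)

# Keep only informative features.
selected = mi_scores[mi_scores > 0.01].index.tolist()
```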

I tried entropy, mean_diff, and period in the aggregation functions, but they didn't improve the RMSE score significantly, and they are also time-consuming: calculating the aggregations took me around 5–6 hours.

7. First cut approach to solving the problem

Here we will try to follow the data science lifecycle for our problem.

Credit: Medium

7.1 EDA:

We have to analyze all tables and their columns. card_id and merchant_id are alphanumeric (they look like hexadecimal), so we have to convert them into numeric form. We perform standard preprocessing (one-hot encoding, label encoding) for categorical data and normalization/standardization for numeric data. We also notice some outliers in the dataset (for example, in the merchants table avg_sales_lag3 has some negative values, and we know average sales cannot be less than 0), so we have to find outliers in all data files and remove them to make our model robust. Missing values are handled with standard imputation techniques.

7.2 Feature Engineering/Selection:

As we know, train.csv has only 5 variables, and these are not enough for model prediction, so we have to combine features from the other tables as well. Sometimes we also create our own features based on existing ones (feature engineering). We have also performed aggregation on the merchants, new transactions, and historical transactions tables to get the maximum benefit.

7.3. Training model:

RMSE has no predefined range; it can go from 0 to infinity. Hence we first create a simple model as a baseline to define a boundary line for RMSE, then train more complex models to get an RMSE below the baseline. We will start with LightGBM, which is fast and uses a tree-based training algorithm. We can also use ensemble models, where we train multiple complex models and combine them for the best performance.

For hyperparameter tuning, we use RandomSearch and GridSearch.
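A sketch of random search over LightGBM hyperparameters with scikit-learn (X, y and the search space are assumptions):

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "num_leaves": [31, 63, 127],
    "learning_rate": [0.005, 0.01, 0.05],
    "min_child_samples": [20, 50, 100],
    "subsample": [0.7, 0.85, 1.0],
    "colsample_bytree": [0.7, 0.85, 1.0],
}
search = RandomizedSearchCV(
    LGBMRegressor(n_estimators=1000, random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_root_mean_squared_error",  # sklearn negates RMSE so higher is better
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```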

8. Comparison of the models in tabular format

I tried 5 different models, each with KFold and StratifiedKFold.

LGBM with KFold works best for us and gives the lowest RMSE score.

Model comparison

9. Kaggle submission

Kaggle submission private and public score

10. The final pipeline of the problem

  1. First, import all data files and perform preprocessing. We don't have text data, so no text preprocessing is needed. We have some categorical features, so we preprocess them, e.g. map 'Y' and 'N' to 1 and 0, and 'A', 'B', 'C', 'D' to 1, 2, 3, 4.
  2. Perform aggregation on the merchants data and merge it with the new transactions and historical transactions data frames, then perform feature engineering to create new features on the merged data frames.
  3. Perform aggregation on the new and historical data frames and again add new features derived from existing ones. In total we now have more than 200 features per data point (per card_id).
  4. Train LightGBM with KFold validation with 11 splits, which gives the minimum RMSE value (see the sketch below).
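A minimal sketch of step 4 (train_df/test_df and the hyperparameters are assumptions; the actual run used early stopping and tuned parameters):

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

FEATURES = [c for c in train_df.columns
            if c not in ("card_id", "first_active_month", "target", "outliers")]
X, y = train_df[FEATURES], train_df["target"]

oof = np.zeros(len(train_df))    # out-of-fold predictions on train
preds = np.zeros(len(test_df))   # averaged test predictions
folds = KFold(n_splits=11, shuffle=True, random_state=42)

for trn_idx, val_idx in folds.split(X):
    model = lgb.LGBMRegressor(n_estimators=10000, learning_rate=0.01,
                              num_leaves=63, random_state=42)
    model.fit(X.iloc[trn_idx], y.iloc[trn_idx],
              eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
              callbacks=[lgb.early_stopping(200), lgb.log_evaluation(0)])
    oof[val_idx] = model.predict(X.iloc[val_idx])
    preds += model.predict(test_df[FEATURES]) / folds.n_splits

print("OOF RMSE:", np.sqrt(np.mean((y - oof) ** 2)))
```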

11. Productionization and Demo

I have used Streamlit and the Heroku platform for model deployment.

Demo: elo-merchant.herokuapp.com. On selecting a card_id, it predicts the loyalty score.
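A minimal sketch of what such a Streamlit app can look like (file names and the pickled artifacts are assumptions); it runs with `streamlit run app.py`:

```python
# app.py
import pickle

import pandas as pd
import streamlit as st

model = pickle.load(open("lgbm_model.pkl", "rb"))   # trained LightGBM model
features = pd.read_pickle("card_features.pkl")      # per-card feature table

st.title("Elo Merchant Loyalty Score Prediction")
card_id = st.selectbox("Select a card_id", features["card_id"].unique())

if st.button("Predict"):
    row = features[features["card_id"] == card_id].drop(columns=["card_id"])
    score = model.predict(row)[0]
    st.write(f"Predicted loyalty score: {score:.4f}")
```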

12. Future Work

  • The first prize solution mentions linear stacking, so we can try that to improve our solution.
  • We can generate some more features from the existing ones to get maximum performance.
  • We can also try feature selection using standard methods: forward feature selection, backward feature selection, Recursive Feature Elimination, and mutual information.

13. References/Credits

  1. https://www.mikulskibartosz.name/how-to-reduce-memory-usage-in-pandas/: How to reduce the memory usage of a pandas data frame
  2. https://www.kaggle.com/mfjwr1/simple-lightgbm-without-blending
  3. https://www.kaggle.com/batalov/making-sense-of-elo-data-eda
  4. https://www.kaggle.com/peacemaket/outlier-prediction-eda-fe-model-etc/: Outlier prediction and detection
  5. https://www.kaggle.com/brendanhasz/elo-feature-engineering-and-feature-selection
  6. https://www.appliedaicourse.com/

Github Repo:

Github: https://github.com/dhananjaychaudhari26/Elo_Merchant

LinkedIn: https://www.linkedin.com/in/dhananjay-chaudhari-a69270102/
