Transaction Prediction of Customers

4 minute read

Introduction.

In this project, a company wants us to help them identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this project has the same structure as the real data they have available to solve this problem. The data is anonymized, each row containing 200 numerical values identified only by a number. In the following we will explore the data, prepare it for a model, train a model, predict the target value for the test set, and then prepare a submission.

Prepare for Data Analysis.

Load Packages and Load data

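A minimal sketch of the setup, assuming the competition files sit in an ../input/ directory (the file paths are an assumption):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# load the anonymized train and test data (file locations are assumed)
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')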

Data Exploration

Check the Data

Let’s check the test and train data.
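A quick way to confirm the sizes, using the data frames loaded above:

# number of rows and columns in each data frame
print(train_df.shape)
print(test_df.shape)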


Both train and test data have 200,000 entries; train has 202 columns and test has 201 columns.

Let’s glimpse the train and test datasets.


Basic Details


Train contains:

  • ID_code (string);
  • target;
  • 200 numerical variables, named from var_0 to var_199;

Test contains:

  • ID_code (string);
  • 200 numerical variables, named from var_0 to var_199;

Let’s check whether there is any missing data. We will also check the data types.

We first check the train dataset.
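A minimal sketch of such a check:

# total count of missing values and the column data types
print(train_df.isnull().sum().sum())
print(train_df.dtypes.value_counts())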


Here we check the test dataset in the same way.


There is no missing data in the train and test datasets. Let’s check the numerical values in the train and test datasets.

Describe the Train Data

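A sketch of the summary statistics used for this comparison:

# count, mean, std, min, quartiles and max for every numerical column
print(train_df.describe())
print(test_df.describe())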

We can make a few observations here:

  • the standard deviation is relatively large for both train and test variable data;
  • the min, max, mean and std values for the train and test data look quite close;
  • the mean values are distributed over a large range.

The number of values in the train and test set is the same. Let’s plot the scatter plot of train versus test values for a few of the features.

We will show just 5% of the data. On the x axis we show the train values and on the y axis we show the test values.
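A minimal sketch of such a plot for a handful of features (the selected features and the sampling seed are assumptions):

sample_features = ['var_0', 'var_1', 'var_2', 'var_3', 'var_4']
fig, axes = plt.subplots(1, len(sample_features), figsize=(18, 3))
for ax, feature in zip(axes, sample_features):
    # scatter a 5% random sample of train values against test values
    ax.scatter(train_df[feature].sample(frac=0.05, random_state=42),
               test_df[feature].sample(frac=0.05, random_state=42), s=2)
    ax.set_title(feature)
    ax.set_xlabel('train'); ax.set_ylabel('test')
plt.tight_layout(); plt.show()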


Let’s check the distribution of the target value in the train dataset.
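A sketch of how this can be checked:

# count plot of the target and the share of positive targets
sns.countplot(x='target', data=train_df)
plt.show()
print("{:.3f}% of the target values are 1".format(100 * train_df['target'].mean()))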


Only 10.049% of the target values are 1; the data is unbalanced with respect to the target value.

Density plots of features.

Let’s now show the density plots of the variables in the train dataset.

We use different colors for the distributions of values with target value 0 and target value 1.
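The plots are produced with a small helper, plot_feature_distribution; a minimal sketch of such a helper (the grid layout and styling are assumptions):

def plot_feature_distribution(df1, df2, label1, label2, features):
    # grid of density plots: one panel per feature, one curve per group
    fig, axes = plt.subplots(10, 10, figsize=(18, 22))
    for ax, feature in zip(axes.ravel(), features):
        sns.kdeplot(df1[feature], ax=ax, label=label1)
        sns.kdeplot(df2[feature], ax=ax, label=label2)
        ax.set_title(feature, fontsize=8)
        ax.set_xlabel('')
    plt.tight_layout()
    plt.show()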

t0 = train_df.loc[train_df['target'] == 0]
t1 = train_df.loc[train_df['target'] == 1]
features = train_df.columns.values[2:102]
plot_feature_distribution(t0, t1, '0', '1', features)

We can observe that a considerable number of features have significantly different distributions for the two target values, for example var_0, var_1, var_2, var_5, var_9, var_13, var_106, var_109, var_139 and many others.

Also, some features, like var_2, var_13, var_26, var_55, var_175, var_184 and var_196, show a distribution that resembles a bimodal distribution.

We will take this into consideration in the future for the selection of the features for our prediction model.

Let’s now look at the distribution of the same features in parallel in the train and test datasets.

Distribution of mean and std

Let’s check the distribution of the mean values per row in the train and test set.
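A sketch of this comparison (analogous plots can be drawn for std, min, max, skewness and kurtosis):

feats = train_df.columns.values[2:202]  # the 200 raw features
plt.figure(figsize=(16,6))
plt.title("Distribution of mean values per row in the train and test set")
sns.distplot(train_df[feats].mean(axis=1), color="green", kde=True, bins=120, label='train')
sns.distplot(test_df[feats].mean(axis=1), color="blue", kde=True, bins=120, label='test')
plt.legend(); plt.show()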


Distribution of min and max

Let’s check the distribution of the min values per row in the train and test set.


Here we show the distribution of min values per column in the train set.

plt.figure(figsize=(16,6))
plt.title("Distribution of min values per column in the train set")
sns.distplot(t0[features].min(axis=0),color="red", kde=True,bins=120, label='target = 0')
sns.distplot(t1[features].min(axis=0),color="blue", kde=True,bins=120, label='target = 1')
plt.legend(); plt.show()


Distribution of skew and kurtosis

Let’s now see the distribution of skewness values per row and per column.

Let’s first look at the distribution of skewness calculated per row in the train and test sets.


Features correlation

We now calculate the correlations between the features in the train set. The following table shows the 10 least correlated feature pairs.

# absolute pairwise correlations, flattened and sorted in ascending order
correlations = train_df[features].corr().abs().unstack().sort_values(kind="quicksort").reset_index()
# drop the self-correlation of each feature with itself
correlations = correlations[correlations['level_0'] != correlations['level_1']]
correlations.head(10)

Let’s look at the most correlated feature pairs, excluding the pairs of a feature with itself.
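Because the correlations table is sorted in ascending order, these can be read from its tail:

# the 10 most correlated feature pairs
correlations.tail(10)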

Let’s also look at the least correlated feature pairs.

Feature engineering

To start, let’s calculate a few aggregated values over the existing features.

# the 200 raw features, var_0 ... var_199
idx = features = train_df.columns.values[2:202]
for df in [test_df, train_df]:
    # row-wise aggregate statistics over the raw features
    df['sum'] = df[idx].sum(axis=1)
    df['min'] = df[idx].min(axis=1)
    df['max'] = df[idx].max(axis=1)
    df['mean'] = df[idx].mean(axis=1)
    df['std'] = df[idx].std(axis=1)
    df['skew'] = df[idx].skew(axis=1)
    df['kurt'] = df[idx].kurtosis(axis=1)
    df['med'] = df[idx].median(axis=1)


Let’s check the distribution of these new, engineered features.

We first plot the distribution of the new features, grouped by the value of the corresponding target.

t0 = train_df.loc[train_df['target'] == 0]
t1 = train_df.loc[train_df['target'] == 1]
features = train_df.columns.values[202:]
plot_new_feature_distribution(t0, t1, 'target: 0', 'target: 1', features)


Model

From the train columns list, we drop the ID and target to form the features list.

features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
target = train_df['target']

We define the hyperparameters for the model.

param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average':'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,  
    'metric':'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary',
    'verbosity': 1
}

We run the model.

# 10-fold stratified cross-validation (random_state is omitted because it has no effect when shuffle=False)
folds = StratifiedKFold(n_splits=10, shuffle=False)
oof = np.zeros(len(train_df))           # out-of-fold predictions on the train set
predictions = np.zeros(len(test_df))    # test predictions, averaged over the folds
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train_df.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train_df.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 1000000
    # note: in LightGBM >= 4.0 the verbose_eval and early_stopping_rounds arguments were removed;
    # pass callbacks=[lgb.log_evaluation(1000), lgb.early_stopping(3000)] instead
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 3000)
    oof[val_idx] = clf.predict(train_df.iloc[val_idx][features], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test_df[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))


Let’s check the feature importance.

cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:150].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,28))
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('Features importance (averaged/folds)')
plt.tight_layout()
plt.savefig('FI.png')

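Finally, as mentioned in the introduction, the averaged fold predictions can be written out as a submission; a minimal sketch (the output file name is an assumption):

# build the submission file from the averaged test predictions
sub_df = pd.DataFrame({"ID_code": test_df["ID_code"].values, "target": predictions})
sub_df.to_csv("submission.csv", index=False)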

Code

Thank you for reading