Tianchi O2O Coupon Usage Prediction with XGBoost (AUC 0.5379)

As a complete beginner in algorithm competitions, I want to share my journey here as a record of my progress.

I started reading Tianchi competition code in February, and after almost two months I feel I have the basics down and understand how a competition works. I figured everything out on my own along the way, which was rough at times; more than once a single bug held me up for days with no one to help, so progress was slow.

Anyway, back on topic: here is the O2O competition I worked on recently.

After signing up for the Tianchi beginner contest, I first browsed the forums and found a roughly 100-line baseline for the O2O coupon usage contest. I ran it and it worked smoothly, and the code is fairly simple: it does almost no feature engineering and uses SGDClassifier, scoring an AUC of 0.5287 (rank 412/13500), still far from the leader's 0.81.

So I reworked the code on top of that baseline: I extracted some features and switched to an XGBoost model. The submission scored an AUC of 0.5379, moving me up about 60 places to 350/13500. Below I share the code with a detailed walkthrough.

```python
import os, sys, pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import date
from sklearn.model_selection import KFold, train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve
import xgboost as xgb

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
```

Part 1: Loading the data
First, a quick look at the data sets.

```python
dfoff = pd.read_csv("../input/ccf_offline_stage1_train.csv", keep_default_na=False)
dftest = pd.read_csv("../input/ccf_offline_stage1_test_revised.csv", keep_default_na=False)
dfon = pd.read_csv("../input/ccf_online_stage1_train.csv", keep_default_na=False)

dfoff.head(5)
dftest.head(5)
```

As you can see, the training set has one extra column, Date, compared with the test set. Our job is to train on the training set and then predict, on the test set, whether each user will consume.

The competition provides real online and offline consumption records from 2016-01-01 to 2016-06-30, and asks us to predict whether coupons received in July 2016 will be used within 15 days of receipt.

Cleaning step 1: check for missing values

```python
dfoff.isnull().sum().sort_values(ascending=False).head(10)
```

Date 0
Date_received 0
Distance 0
Discount_rate 0
Coupon_id 0
Merchant_id 0
User_id 0
dtype: int64
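A note on why no missing values show up: the files were read with keep_default_na=False, so the literal string 'null' is kept as a string instead of being parsed as NaN (pandas treats "null" as a missing-value marker by default). A minimal illustration with a made-up two-row CSV:

```python
import io
import pandas as pd

csv_text = "a,b\n1,null\n2,3"

# By default pandas parses the string "null" as NaN...
with_na = pd.read_csv(io.StringIO(csv_text))
print(with_na.isnull().sum().sum())      # 1

# ...but keep_default_na=False keeps it as the literal string 'null'.
without_na = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(without_na.isnull().sum().sum())   # 0
```

This is why the rest of the notebook compares columns against the string 'null' rather than using isnull().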

```python
dfoff.info()
```

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id object
Discount_rate object
Distance object
Date_received object
Date object
dtypes: int64(2), object(5)
memory usage: 93.7+ MB

```python
dfoff["Date_received"].unique()
dfoff["Date"].unique()

print("Coupon received, purchase made:", dfoff[(dfoff["Date_received"] != "null") & (dfoff["Date"] != "null")].shape[0])
print("Coupon received, no purchase: %d" % dfoff[(dfoff['Date_received'] != 'null') & (dfoff['Date'] == 'null')].shape[0])
print("No coupon, purchase made: %d" % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] != 'null')].shape[0])
print("No coupon, no purchase: %d" % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] == 'null')].shape[0])
```

Coupon received, purchase made: 75382
Coupon received, no purchase: 977900
No coupon, purchase made: 701602
No coupon, no purchase: 0
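From the counts above we can already see how rare redemptions are. A quick back-of-the-envelope calculation using the printed numbers:

```python
# Of all offline records where a coupon was received, only ~7% led to a purchase.
used, unused = 75382, 977900
rate = used / (used + unused)
print(round(rate, 4))  # 0.0716
```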

```python
print('1. User_id in test set but not in training set:', set(dftest['User_id']) - set(dfoff['User_id']))
print('2. Merchant_id in test set but not in training set:', set(dftest['Merchant_id']) - set(dfoff['Merchant_id']))
```

1. User_id in test set but not in training set: {2495873, 1286474}
2. Merchant_id in test set but not in training set: {5920}

Part 2: Cleaning the data and extracting features
Tip 1: unique() lists the distinct values of a column.

Tip 2: to work with a single column, use the dfoff["column_name"] accessor followed by a method, as in the examples below.

Feature 1: discount rate

The first feature that comes to mind is the discount rate: the bigger the discount, the more likely the user is to redeem the coupon. First use unique() to see how many discount formats there are.

```python
print('Discount_rate types:', dfoff['Discount_rate'].unique())
```

Discount_rate types: ['null' '150:20' '20:1' '200:20' '30:5' '50:10' '10:5' '100:10' '200:30'
'20:5' '30:10' '50:5' '150:10' '100:30' '200:50' '100:50' '300:30'
'50:20' '0.9' '10:1' '30:1' '0.95' '100:5' '5:1' '100:20' '0.8' '50:1'
'200:10' '300:20' '100:1' '150:30' '300:50' '20:10' '0.85' '0.6' '150:50'
'0.75' '0.5' '200:5' '0.7' '30:20' '300:10' '0.2' '50:30' '200:100'
'150:5']
The output shows three kinds of discount values:

The first is 'null', meaning no discount.

The second has the form 150:20, meaning 20 yuan off a purchase of at least 150 yuan.

The third has the form 0.95, meaning a flat 0.95 (5% off) discount.

Based on this we build four functions to extract four features:

discount type: getDiscountType()

discount rate: convertRate()

threshold ("spend at least"): getDiscountMan()

reduction ("save"): getDiscountJian()

```python
def getDiscountType(row):
    if row == 'null':
        return 'null'
    elif ':' in row:
        return 1
    else:
        return 0

def convertRate(row):
    """Convert discount to rate"""
    if row == 'null':
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1]) / float(rows[0])
    else:
        return float(row)

def getDiscountMan(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0

def processData(df):
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    print(df['discount_rate'].unique())
    df['distance'] = df['Distance'].replace('null', -1).astype(int)
    return df

dfoff = processData(dfoff)
dftest = processData(dftest)
```

[1. 0.86666667 0.95 0.9 0.83333333 0.8
0.5 0.85 0.75 0.66666667 0.93333333 0.7
0.6 0.96666667 0.98 0.99 0.975 0.33333333
0.2 0.4 ]
[0.83333333 0.9 0.96666667 0.8 0.95 0.75
0.98 0.5 0.86666667 0.6 0.66666667 0.7
0.85 0.33333333 0.94 0.93333333 0.975 0.99 ]
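As a standalone sanity check of the conversion logic, here is convertRate restated on its own (hypothetical helper name, same rules as above):

```python
def convert_rate(row):
    # Same logic as convertRate above: 'null' means no discount (rate 1.0),
    # "man:jian" means spend `man` to save `jian`, otherwise a direct rate.
    if row == 'null':
        return 1.0
    if ':' in row:
        man, jian = row.split(':')
        return 1.0 - float(jian) / float(man)
    return float(row)

print(convert_rate('150:20'))  # 1 - 20/150 ≈ 0.8667, visible in the output above
print(convert_rate('0.95'))    # 0.95
print(convert_rate('null'))    # 1.0
```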

```python
dfoff.head(2)
```

Cleaning step 2: type conversion
Handle the "Distance" column (done inside processData above) and the date columns.

```python
dftest.head(2)

date_received = dfoff['Date_received'].unique()
date_received = sorted(date_received[date_received != 'null'])
date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])
print('Coupons received from', date_received[0], 'to', date_received[-1])
print('Purchases from', date_buy[0], 'to', date_buy[-1])
```

Coupons received from 20160101 to 20160615
Purchases from 20160101 to 20160630

Feature 2: day-of-week features. Consumption is likely related to the day of the week.

```python
def getWeekday(row):
    if row == 'null':
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

# weekday_type: 1 for weekend (Saturday/Sunday), 0 otherwise
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x: 1 if x in [6, 7] else 0)
dftest['weekday_type'] = dftest['weekday'].apply(lambda x: 1 if x in [6, 7] else 0)
dfoff.head()
```

Cleaning step 3: one-hot encoding

```python
weekdaycols = ['weekday_' + str(i) for i in range(1, 8)]
print(weekdaycols)

tmpdf = pd.get_dummies(dfoff['weekday'].replace('null', np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace('null', np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf
dfoff.head()
```

['weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']
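One detail worth noting: replacing 'null' with np.nan before get_dummies matters, because NaN values are excluded from the encoding, so rows without a received date end up with all-zero weekday dummies rather than a spurious 'null' column. A toy example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 7, np.nan])      # e.g. weekday values, one missing
dummies = pd.get_dummies(s)
print(dummies.columns.tolist())    # [1.0, 7.0] -- no column for NaN
print(dummies.iloc[2].sum())       # 0 -- the missing row is all zeros
```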

With this simple feature extraction done, we have 14 useful features in total:

discount_rate

discount_type

discount_man

discount_jian

distance

weekday

weekday_type

weekday_1 through weekday_7

Labeling the samples (Label)

With the features in place, we still need to label the training samples, i.e. decide which are positive (y = 1) and which are negative (y = 0). What we want to predict is whether a user consumes within 15 days of receiving a coupon, so there are three cases:

1. Date_received == 'null':

no coupon was received, irrelevant to the task, y = -1

2. (Date_received != 'null') & (Date != 'null') & (Date - Date_received <= 15):

coupon received and used within 15 days, a positive sample, y = 1

3. (Date_received != 'null') & ((Date == 'null') | (Date - Date_received > 15)):

coupon received but not used within 15 days, a negative sample, y = 0

With these rules settled, we can define the labeling function.

Positive and negative samples

```python
def label(row):
    if row['Date_received'] == 'null':
        return -1
    if row['Date'] != 'null':
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):
            return 1
    return 0

dfoff['label'] = dfoff.apply(label, axis=1)
dfoff["label"].unique()

array([-1, 0, 1], dtype=int64)

value_counts() counts the occurrences of each value.

We can use it on the labeled training set to see how many positive and negative samples there are:

```python
print(dfoff['label'].value_counts())
```

0 988887
-1 701602
1 64395
Name: label, dtype: int64

Clearly the classes are highly imbalanced: 64395 positive samples versus 988887 negative ones. This is precisely why AUC, rather than accuracy, is used as the evaluation metric.
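To see why accuracy would be misleading here while AUC is not, consider a toy set with the same kind of imbalance (illustrative numbers, not the contest data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.r_[np.ones(10), np.zeros(990)]   # ~1% positives, like our labels
y_const = np.zeros_like(y_true)              # always predict "coupon not used"

print(accuracy_score(y_true, y_const))       # 0.99 -- looks great, learns nothing
print(roc_auc_score(y_true, y_const))        # 0.5  -- AUC exposes the lack of ranking skill
```

AUC measures how well positives are ranked above negatives, so a degenerate constant classifier cannot score well on it no matter how skewed the classes are.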

dfoff.columns.tolist() lists the column names:

```python
print('Columns so far:', dfoff.columns.tolist())
```

Columns so far: ['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance', 'Date_received', 'Date', 'discount_rate', 'discount_man', 'discount_jian', 'discount_type', 'distance', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7', 'label']

```python
dfoff.head(2)
```

Part 3: Building the model
Now for the main part: the machine learning model. The features are the 14 extracted above. To assess the model we hold out a validation set, split by coupon-received date: training set 20160101-20160515, validation set 20160516-20160615. We use XGBoost.

xgboost
1. Split training and validation sets
Note that the predictions we get later (pred_prob) are probabilities, i.e. the predicted probability that a sample belongs to the positive class.

Then we can compute AUC on the validation set by simply calling sklearn's built-in AUC functions.

```python
df = dfoff[dfoff['label'] != -1].copy()
train = df[(df['Date_received'] < '20160516')].copy()
valid = df[(df['Date_received'] >= '20160516') & (df['Date_received'] <= '20160615')].copy()
print(train['label'].value_counts())
print(valid['label'].value_counts())
```

0 759172
1 41524
Name: label, dtype: int64
0 229715
1 22871
Name: label, dtype: int64

```python
y = train.label
X = train.drop(["User_id", "Merchant_id", "Coupon_id", "Discount_rate", "Distance", "Date", "Date_received", "label"], axis=1)
val_y = valid.label
val_X = valid.drop(["User_id", "Merchant_id", "Coupon_id", "Discount_rate", "Distance", "Date", "Date_received", "label"], axis=1)
tests = dftest.drop(["User_id", "Merchant_id", "Coupon_id", "Discount_rate", "Distance", "Date_received"], axis=1)

val_X["weekday"].unique(), val_X["discount_type"].unique()
```

(array([6, 1, 4, 3, 2, 7, 5], dtype=object), array([1, 0], dtype=object))

```python
X["weekday"] = X["weekday"].astype(int)
X["discount_type"] = X["discount_type"].astype(int)
val_X["weekday"] = val_X["weekday"].astype(int)
val_X["discount_type"] = val_X["discount_type"].astype(int)

tests["weekday"].unique()
```

array([2, 3, 5, 6, 7, 1, 4], dtype=int64)

```python
val_X["weekday"].unique()  # double-check
```

array([6, 1, 4, 3, 2, 7, 5], dtype=int64)

```python
xgb_val = xgb.DMatrix(val_X, label=val_y)
xgb_train = xgb.DMatrix(X, label=y)
xgb_test = xgb.DMatrix(tests)
xgb_val_X = xgb.DMatrix(val_X)
```


```python
def myauc(test):
    """Average per-coupon AUC (the competition metric)."""
    testgroup = test.groupby(["Coupon_id"])
    aucs = []
    for i in testgroup:
        tmpdf = i[1]
        if len(tmpdf['label'].unique()) != 2:
            continue
        fpr, tpr, thresholds = roc_curve(tmpdf['label'], tmpdf['pred'], pos_label=1)
        aucs.append(auc(fpr, tpr))
    return np.average(aucs)
```
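myauc implements the competition metric: AUC is computed per Coupon_id and then averaged, skipping coupons whose labels are all one class (AUC is undefined there). A self-contained toy check of the same logic, on made-up data:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

toy = pd.DataFrame({
    "Coupon_id": [1, 1, 1, 1, 2, 2],
    "label":     [0, 1, 0, 1, 0, 0],     # coupon 2 is single-class -> skipped
    "pred":      [0.1, 0.9, 0.2, 0.8, 0.3, 0.4],
})
aucs = [roc_auc_score(g["label"], g["pred"])
        for _, g in toy.groupby("Coupon_id")
        if g["label"].nunique() == 2]
print(sum(aucs) / len(aucs))  # 1.0 -- coupon 1 is perfectly ranked
```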

The XGBoost training setup

```python
params = {
    'booster': 'gbtree',
    'eval_metric': 'auc',
    'gamma': 0.1,
    'min_child_weight': 1.1,
    'max_depth': 5,
    'lambda': 10,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.7,
    'eta': 0.01,
    'tree_method': 'exact',
    'seed': 0,
    'nthread': 12,
}

# Note: only the training set is watched, so early stopping here
# monitors train AUC rather than validation AUC.
watchlist = [(xgb_train, 'train')]
model = xgb.train(params, xgb_train, num_boost_round=1000, evals=watchlist, early_stopping_rounds=100)
model.save_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')
model = xgb.Booster(params)
model.load_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')
```
```python
val_X.head()
valid.head()

model = xgb.Booster()
model.load_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')
temp = valid[["Coupon_id", "label"]].copy()
temp['pred'] = model.predict(xgb_val)
temp.pred = MinMaxScaler(copy=True, feature_range=(0, 1)).fit_transform(temp['pred'].values.reshape(-1, 1))
print(myauc(temp))
temp.head()
```

0.5518357641394374
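Incidentally, the MinMaxScaler step above only rescales the scores without changing their order, and AUC depends solely on the ranking, so it leaves the metric unchanged. A quick check on toy scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler

y = np.array([0, 1, 0, 1])
p = np.array([0.2, 0.7, 0.1, 0.9])
p_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(p.reshape(-1, 1)).ravel()

print(roc_auc_score(y, p) == roc_auc_score(y, p_scaled))  # True
```

Scaling into [0, 1] is mainly cosmetic here, keeping the submitted scores in a probability-like range.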

```python
tests.head()
val_X.head()

y_test = dftest[['User_id', 'Coupon_id', "Date_received"]].copy()
y_test['label'] = model.predict(xgb_test)
y_test.to_csv("C:/Users/Administrator/o2o.code/notebook/second.csv", index=None, header=None)
y_test.head()
```

Original article: https://blog.csdn.net/yuekangwei/article/details/89375563?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522165277499316782390570638%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=165277499316782390570638&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_ecpm_v1~times_rank-24-89375563-null-null.nonecase&utm_term=%E4%BC%98%E6%83%A0
