XGBoost in practice + the Tianchi O2O coupon competition: AUC 0.80, close to the Tianchi first place's 0.81 (CPU and GPU supported, source code download link)

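0. Preface

1. AUC 0.53

After signing up for the Tianchi newcomer competition, I first went to the Tianchi tech community to study, and found a post titled "100 lines of code to get started with the Tianchi O2O coupon usage newcomer competition [condensed tutorial edition]". I used it for practice: it ran without problems and the code is fairly simple. Part of the code is shown below: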

  1. [Libraries used]

    import os, sys, pickle

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from datetime import date
    from sklearn.linear_model import SGDClassifier, LogisticRegression

    dfoff = pd.read_csv('datalab/1990/data/ccf_offline_stage1_train.csv')
    dftest = pd.read_csv('datalab/1990/data/ccf_offline_stage1_test_revised.csv')
    dfon = pd.read_csv('datalab/1990/data/ccf_online_stage1_train.csv')
    print('data read end.')
  2. [Model used]

    # feature
    original_feature = ['discount_rate', 'discount_type', 'discount_man',
                        'discount_jian', 'distance', 'weekday', 'weekday_type'] + weekdaycols
    print("----train-----")

    model = SGDClassifier(  # lambda:
        loss='log',  # named 'log_loss' in scikit-learn 1.1 and later
        penalty='elasticnet',
        fit_intercept=True,
        max_iter=100,
        shuffle=True,
        alpha=0.01,
        l1_ratio=0.01,
        n_jobs=1,
        class_weight=None
    )
    model.fit(train[original_feature], train['label'])
  3. [Features used]

    # feature
    original_feature = ['discount_rate', 'discount_type', 'discount_man',
                        'discount_jian', 'distance', 'weekday', 'weekday_type'] + weekdaycols
  4. [Score after submission: 0.53]

  5. [Result analysis]
    The score is low because the features are very simple; almost no feature engineering was done.
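
The feature names `discount_rate`, `discount_type`, `discount_man` and `discount_jian` suggest they are derived from the raw discount field of the dataset. A minimal sketch of such a derivation follows, assuming the raw value is either `'null'`, a direct rate such as `'0.95'`, or a threshold form such as `'150:20'` (spend 150, get 20 off). The function name `parse_discount` and the use of -1 for a missing coupon are illustrative choices, not taken from the tutorial.

```python
def parse_discount(rate):
    """Split a raw discount value into simple numeric features.

    Assumed input formats (not confirmed by the post):
      'null'    -> no coupon
      '0.95'    -> direct discount rate
      '150:20'  -> spend 150, get 20 off
    """
    if rate is None or rate == 'null':
        # No coupon: rate 1.0 means "no discount"; type -1 marks it as missing.
        return {'discount_rate': 1.0, 'discount_type': -1,
                'discount_man': 0, 'discount_jian': 0}
    if ':' in rate:
        # Threshold coupon: convert "man:jian" into an equivalent rate.
        man, jian = (int(part) for part in rate.split(':'))
        return {'discount_rate': 1.0 - jian / man, 'discount_type': 1,
                'discount_man': man, 'discount_jian': jian}
    # Direct rate coupon.
    return {'discount_rate': float(rate), 'discount_type': 0,
            'discount_man': 0, 'discount_jian': 0}


features = parse_discount('150:20')
print(features['discount_man'], features['discount_jian'])
```

Applied to a pandas column, the same function could be used with `Series.apply` and the resulting dictionaries expanded into columns.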

2. AUC 0.78

  1. [Libraries used]

    import pandas as pd
    import numpy as np
    import pickle
    import xgboost as xgb
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve
    from sklearn.model_selection import train_test_split
  2. [Model used]

    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'gamma': 0.1,
              'min_child_weight': 1.1,
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.01,
              'tree_method': 'exact',
              'seed': 0,
              'nthread': 12
              }
    watchlist = [(dataTrain, 'train')]
    model = xgb.train(params, dataTrain, num_boost_round=3500, evals=watchlist)
    model.save_model('train_dir_3/xgbmodel')

    model = xgb.Booster(params)
    model.load_model('train_dir_3/xgbmodel')

    # predict test set
    dataset3_preds1 = dataset3_preds.copy()
    dataset3_preds1['label'] = model.predict(dataTest)
  3. [Features used]

    def DataProcess(dataset, feature, TrainFlag):
        # Build each feature group, then join them onto the coupon records
        other_feature = GetOtherFeature(dataset)
        merchant = GetMerchantRelatedFeature(feature)
        user = GetUserRelatedFeature(feature)
        user_merchant = GetUserAndMerchantRelatedFeature(feature)
        coupon = GetCouponRelatedFeature(dataset, feature)

        dataset = pd.merge(coupon, merchant, on='merchant_id', how='left')
        dataset = pd.merge(dataset, user, on='user_id', how='left')
        dataset = pd.merge(dataset, user_merchant, on=['user_id', 'merchant_id'], how='left')
        dataset = pd.merge(dataset, other_feature, on=['user_id', 'coupon_id', 'date_received'], how='left')
        dataset.drop_duplicates(inplace=True)

        # User/merchant pairs with no history get a count of 0
        dataset.user_merchant_buy_total = dataset.user_merchant_buy_total.replace(np.nan, 0)
        dataset.user_merchant_any = dataset.user_merchant_any.replace(np.nan, 0)
        dataset.user_merchant_received = dataset.user_merchant_received.replace(np.nan, 0)

        # Weekend flag and one-hot encoded day of week
        dataset['is_weekend'] = dataset.day_of_week.apply(lambda x: 1 if x in (6, 7) else 0)
        weekday_dummies = pd.get_dummies(dataset.day_of_week)
        weekday_dummies.columns = ['weekday' + str(i + 1) for i in range(weekday_dummies.shape[1])]
        dataset = pd.concat([dataset, weekday_dummies], axis=1)

        if TrainFlag:
            # Training data: derive the label from the "date:date_received" pair
            dataset['date'] = dataset['date'].fillna('null')
            dataset['label'] = dataset.date.astype('str') + ':' + dataset.date_received.astype('str')
            dataset.label = dataset.label.apply(get_label)
            dataset.drop(['merchant_id', 'day_of_week', 'date', 'date_received', 'coupon_count'], axis=1, inplace=True)
        else:
            dataset.drop(['merchant_id', 'day_of_week', 'coupon_count'], axis=1, inplace=True)

        dataset = dataset.replace('null', np.nan)
        return dataset
  4. [Score after submission: 0.78]

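  5. [Result analysis]
    The score is higher than the first attempt because fairly thorough feature engineering was done. However, only the offline data was used and the online data was not, so the final score is only 0.78.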
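
`DataProcess` above calls a `get_label` helper that the post does not show. A minimal sketch of what such a helper might look like follows, assuming the input is the `"date:date_received"` string built in `DataProcess`, that both dates are plain `YYYYMMDD` strings, and that a coupon counts as a positive sample when it is redeemed within 15 days of being received (the competition's definition). The `window_days` parameter and the choice to return 0 for late redemptions are illustrative assumptions, not the author's code.

```python
from datetime import datetime


def get_label(s, window_days=15):
    """Return 1 if the coupon was used within `window_days` of receipt, else 0.

    `s` is assumed to look like '20160610:20160601' (date used : date received),
    with 'null' in the first position when the coupon was never used.
    """
    date_used, date_received = s.split(':')
    if date_used == 'null':
        return 0  # coupon never redeemed
    used = datetime.strptime(date_used, '%Y%m%d')
    received = datetime.strptime(date_received, '%Y%m%d')
    return 1 if (used - received).days <= window_days else 0


print(get_label('20160610:20160601'))  # redeemed 9 days after receipt
print(get_label('null:20160601'))      # never redeemed
```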

3. AUC 0.80

  1. [Libraries used]

    import datetime
    import os
    import time
    from concurrent.futures import ProcessPoolExecutor
    from math import ceil

    from catboost import CatBoostClassifier
    from lightgbm import LGBMClassifier
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
    from sklearn.externals import joblib  # removed in scikit-learn 0.23; use `import joblib` there
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
    from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    from xgboost.sklearn import XGBClassifier
    import xgboost as xgb
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve
  2. [Model used]

    # Train the model with the optimized num_boost_round
    watchlist = [(train_dmatrix, 'train')]
    model = xgb.train(params, train_dmatrix, num_boost_round=3500, evals=watchlist)
    model.save_model('train_dir_2/xgbmodel')

    params['predictor'] = 'cpu_predictor'
    model = xgb.Booster(params)
    model.load_model('train_dir_2/xgbmodel')

    # predict test set
    dataset3_predict = predict_dataset.copy()
    dataset3_predict['label'] = model.predict(predict_dmatrix)
  3. [Features used]

    def get_features(dataset, feature_off, feature_on):
        dataset = get_offline_features(dataset, feature_off)
        return get_online_features(feature_on, dataset)
  4. [num_boost_round=3500, CPU, score after submission: 0.78533]
    Model parameters used:

    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'gamma': 0.1,
              'min_child_weight': 1.1,
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.01,
              'tree_method': 'exact',
              'seed': 0,
              'nthread': 12
              }
  5. [num_boost_round=3800, CPU, score after submission: 0.79182]
    Model parameters used:

    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'gamma': 0.1,
              'min_child_weight': 1.1,
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.01,
              'tree_method': 'exact',
              'seed': 0,
              'nthread': 12
              }
  6. [num_boost_round=4500, CPU, score after submission: 0.79509]
    Model parameters used:

    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'gamma': 0.1,
              'min_child_weight': 1.1,
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.01,
              'tree_method': 'exact',
              'seed': 0,
              'nthread': 12
              }
  7. [num_boost_round=6200, GPU, score after submission: 0.80039]
    Model parameters used:

    params = {'booster': 'gbtree',
              'objective': 'binary:logistic',
              'eval_metric': 'auc',
              'gamma': 0.1,
              'min_child_weight': 1.1,
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.01,
              'tree_method': 'gpu_hist',
              'n_gpus': '-1',
              'seed': 0,
              'nthread': cpu_jobs,
              'predictor': 'gpu_predictor'
              }
  8. [num_boost_round=6558 (value found with xgb.cv), GPU, score after submission: 0.80042]
    Model parameters used:

    params = {'booster': 'gbtree',
              'objective': 'binary:logistic',
              'eval_metric': 'auc',
              'gamma': 0.1,
              'min_child_weight': 1.1,
              'max_depth': 5,
              'lambda': 10,
              'subsample': 0.7,
              'colsample_bytree': 0.7,
              'colsample_bylevel': 0.7,
              'eta': 0.01,
              'tree_method': 'gpu_hist',
              'n_gpus': '-1',
              'seed': 0,
              'nthread': cpu_jobs,
              'predictor': 'gpu_predictor'
              }
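  9. [Result analysis]
    After using both the offline and the online features, the AUC changed from 0.7878 to 0.7853; after raising the number of boosting rounds, it reached 0.79. I then switched to GPU mode and changed the other parameters to 'objective': 'binary:logistic', 'tree_method': 'gpu_hist'; by tuning the number of boosting rounds, the score gradually rose to 0.80042.
    (If you run into problems running on GPU, see this blog post: https://blog.csdn.net/myourdream2/article/details/86603300)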

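Takeaways and reflections

Through repeated practice I found that feature engineering really matters, which bears out the saying that "feature engineering sets the upper bound, and models, algorithms and tuning only approach that bound". I also became familiar with the meaning of the XGBoost parameters and with how to tune the number of boosting rounds. In addition, setting up a GPU XGBoost environment let me experience first-hand how fast GPU training is. Next I will look into whether tuning other parameters or model ensembling can raise the score further.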

Original article: https://blog.csdn.net/myourdream2/article/details/86618120?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522165277499316780357276664%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=165277499316780357276664&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_ecpm_v1~times_rank-28-86618120-null-null.nonecase&utm_term=%E4%BC%98%E6%83%A0
