Machine Learning Notes - XGBoost Tutorial

Background:

  • XGBoost, the leaderboard-crushing workhorse
  • Full name: eXtreme Gradient Boosting | Abbreviation: XGB
  • Author: Tianqi Chen (University of Washington), my icon❤
  • Predecessor: GBDT (Gradient Boosting Decision Tree); XGB is currently the top-of-the-line tree-based model.
    • Note: the figure above dates that conclusion to March 2016, two years ago, and the algorithm itself was released in 2014. As of June 2018 it is still the superstar of the field🌟!
    • On every major data mining competition platform (Kaggle / Tianchi / …), this algorithm is known to everyone and dominates the leaderboards.

Notes:

  1. Intended audience: machine learning (data mining) competition participants / (aspiring) AI engineers / anyone whose model performance has hit a bottleneck / …
  2. Assumptions: the reader understands regression trees, Taylor expansion, gradient descent, and Newton's method; in short, GBDT. It also helps to know AdaBoost.
  3. When learning XGBoost, be calm and be patient.
  4. Because XGB is powerful, this article is long. Take your time, or read it one part at a time; it's ok~


Main content:

[1] A brief overview of the algorithm (based on Tianqi Chen's slides above):

(1) Review of key concepts of supervised learning | Key elements of supervised learning
  • The target Y (the label)
  • Objective function = loss function + regularization (written out in the formula below)
  • The loss function measures how well the model fits the training data: the smaller the loss, the more accurate the predictions.
  • The regularization term measures model complexity: the smaller it is, the simpler the model.
  • The smaller the objective, the better the model.
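In symbols, following the notation in the slides, with the loss summed over the n training examples:

$$\mathrm{Obj}(\Theta) = L(\Theta) + \Omega(\Theta), \qquad L(\Theta) = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr)$$

Here l can be, for example, squared error for regression or logistic loss for classification, and Ω penalizes model complexity.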
(2) Regression Tree and Ensemble | What we talk about when we talk about decision trees

Benefits of tree ensemble methods:

  • Very widely used. Almost half of data mining competitions are won using some variant of tree ensemble methods.
  • Invariant to scaling of inputs, so you do not need to do careful feature normalization.
  • Learn higher-order interactions between features.
  • Scalable, and widely used in industry.

  • On this slide, the model (and hence its complexity) lives in the function space spanned by all the regression trees.
  • What is learned is each tree f_k, not a weight vector w; this is where the "gradient in function space" idea comes in (the ensemble form is written out below).
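Concretely, in the notation of the XGBoost paper and slides, the prediction is a sum of K regression trees, and each tree f with T leaves and leaf weight vector w is penalized through its number of leaves and the L2 norm of its leaf weights:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$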

  • Information gain: decides where to split a node, mainly in order to reduce the loss
  • Tree pruning: mainly reduces model complexity, which is driven by the number of branches
  • Maximum depth: also affects model complexity
  • Smoothing the leaf values: L2 regularization on the leaf weights, to reduce model complexity and improve stability
  • Regression trees are not only for regression; they can also handle classification, ranking, and more, depending mainly on how the objective function is defined
(3) Gradient Boosting (How do we Learn)
  • Bias-variance tradeoff is everywhere
  • The loss + regularization objective pattern applies for regression tree learning (function learning)
  • We want predictive and simple functions; the resulting objective is shown below



Use a second-order Taylor expansion to approximate the loss:

  • The arrow in the slide points at XGB's objective: Obj = loss + regularization + constant. It is an elegant expression, and it is written out after this list.
  • This post only touches on the basic concepts; for the remaining slides, see the official introduction, study notes on Tianqi Chen's slides, or a write-up of the XGBoost algorithm's derivation.
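For reference, the standard derivation from the paper: at round t the new tree f_t is added to the previous prediction, and the loss is expanded to second order around that previous prediction, with g_i and h_i the first and second derivatives of the loss:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

$$\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n}\Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Bigr] + \Omega(f_t) + \mathrm{constant}$$

$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\bigl(y_i, \hat{y}^{(t-1)}\bigr), \qquad h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\bigl(y_i, \hat{y}^{(t-1)}\bigr)$$

Dropping the constant terms leaves an objective that depends on the loss only through g_i and h_i, which is why any twice-differentiable loss can be plugged in.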

[2] Parameters:

XGB has more parameters than any model I have seen; getting asked about them in an interview can be overwhelming. If this is your first time reading through all of them, brace yourself~

Only some commonly used parameters are listed below; for the official documentation of every parameter, see XGBoost Parameters. A hedged example of a typical parameter dictionary is given after the outline below.

(1) General parameters

(2) Booster parameters
  • Parameters for Tree Booster
  • Additional parameters for Dart Booster
  • Parameters for Linear Booster and Tweedie Regression

(3) Learning Task parameters

(4) Command line parameters
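As a quick orientation (not a substitute for the official docs), a typical parameter dictionary for a binary classification task might look like the Python sketch below; the specific values are illustrative assumptions, not recommendations.

# Illustrative only; see the official "XGBoost Parameters" page for the full list
params = {
    # General parameters
    "booster": "gbtree",             # which booster to use: gbtree, gblinear, or dart
    "nthread": 4,                    # number of parallel threads

    # Tree booster parameters
    "eta": 0.1,                      # learning rate (shrinkage applied to each new tree)
    "max_depth": 6,                  # maximum tree depth, controls model complexity
    "min_child_weight": 1,           # minimum sum of instance hessian required in a child
    "gamma": 0.0,                    # minimum loss reduction required to make a split
    "subsample": 0.8,                # row subsampling ratio per tree
    "colsample_bytree": 0.8,         # column subsampling ratio per tree
    "lambda": 1.0,                   # L2 regularization on leaf weights
    "alpha": 0.0,                    # L1 regularization on leaf weights

    # Learning task parameters
    "objective": "binary:logistic",  # loss to optimize
    "eval_metric": "logloss",        # metric reported during training
    "seed": 2018,                    # random seed for reproducibility
}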

[3] Implementation: the R version

(1) Load the data

Download data0.RData; it is only used to demonstrate the XGB workflow, with no data cleaning.
If you are interested in data cleaning, see "Data Cleaning with R (1)".
The sample data is in RData format, R's native storage format, which is convenient and compact~

# load packages
packages <- c("data.table","xgboost","ggplot2","dplyr")
UsePackages <- function(p){
  if (!is.element(p, installed.packages()[,1])){
    install.packages(p)}
  require(p, character.only = TRUE)}
for(p in packages){
  UsePackages(p)
}

library(data.table)
library(xgboost)
library(ggplot2)
library(dplyr)

Load the data:

setwd("D:/Zhang")       # R文件设置路径
load("data/data0.RData") # 导入数据

Split into training and test sets, and convert the data format:

#----------------------------------------------------------
# train & test selected randomly
#----------------------------------------------------------

a = round(nrow(data0)*0.8)
b = sample(nrow(data0), a, replace = FALSE, prob = NULL)

train = data0[b,]  # training set: 80%
test  = data0[-b,] # test set: 20%

# convert the data.frame into xgb.DMatrix format
# the label column is named 'bad'
dtrain <- xgb.DMatrix(data = select(train,-bad) %>% as.matrix, label = train$bad %>% as.matrix)

  • Note: the label column is named 'bad'
(2) Tune parameters with xgb.cv
best_param = list()
best_seednumber = 1234
best_logloss = Inf
best_logloss_index = 0

# random search over parameter combinations
for (iter in 1:50) {
  param <- list(objective = "binary:logistic",  # objective: logistic binary classification, since Y is binary
                eval_metric = c("logloss"),     # evaluation metric: logloss
                max_depth = sample(6:10, 1),    # maximum depth: one integer drawn from 6-10
                eta = runif(1, .01, .3),        # eta (shrinkage / learning rate): drawn from 0.01-0.3
                gamma = runif(1, 0.0, 0.2),     # gamma (minimum loss reduction): drawn from 0-0.2
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8),
                min_child_weight = sample(1:40, 1),
                max_delta_step = sample(1:10, 1)
  )
  cv.nround = 50 # number of boosting rounds: 50
  cv.nfold = 5   # 5-fold cross validation
  seed.number = sample.int(10000, 1)[[1]]
  set.seed(seed.number)
  mdcv <- xgb.cv(data=dtrain, params = param, nthread=6, metrics=c("auc","rmse","error"),
                 nfold=cv.nfold, nrounds=cv.nround,
                 verbose = F, early_stopping_rounds=8, maximize=FALSE)

  min_logloss = min(mdcv$evaluation_log[,test_logloss_mean])
  min_logloss_index = which.min(mdcv$evaluation_log[,test_logloss_mean])

  if (min_logloss < best_logloss) {
    best_logloss = min_logloss
    best_logloss_index = min_logloss_index
    best_seednumber = seed.number
    best_param = param
  }
}

(nround = best_logloss_index)
set.seed(best_seednumber)
best_seednumber
(best_param) # print the best parameter combination; it is reused when training the final model

The resulting best parameter combination:

(3) Plot the auc | rmse | error curves
# mdcv$evaluation_log

xgb_plot = function(input, output){
  history = input
  train_history = history[,1:8]  %>% mutate(id = row.names(history), class = "train")
  test_history  = history[,9:16] %>% mutate(id = row.names(history), class = "test")
  colnames(train_history) = c("logloss.mean","logloss.std","auc.mean","auc.std","rmse.mean","rmse.std","error.mean","error.std","id","class")
  colnames(test_history)  = c("logloss.mean","logloss.std","auc.mean","auc.std","rmse.mean","rmse.std","error.mean","error.std","id","class")

  his = rbind(train_history, test_history)
  his$id = his$id %>% as.numeric
  his$class = his$class %>% factor

  if(output == "auc"){
    auc = ggplot(data = his, aes(x = id, y = auc.mean, ymin = auc.mean - auc.std, ymax = auc.mean + auc.std,
                                 fill = class, linetype = class)) +
      geom_line() +
      geom_ribbon(alpha = 0.5) +
      labs(x = "nround", y = NULL, title = "XGB Cross Validation AUC") +
      theme_bw() +
      theme(title = element_text(size = 15))
    return(auc)
  }

  if(output == "rmse"){
    rmse = ggplot(data = his, aes(x = id, y = rmse.mean, ymin = rmse.mean - rmse.std, ymax = rmse.mean + rmse.std,
                                  fill = class, linetype = class)) +
      geom_line() +
      geom_ribbon(alpha = 0.5) +
      labs(x = "nround", y = NULL, title = "XGB Cross Validation RMSE") +
      theme_bw() +
      theme(title = element_text(size = 15))
    return(rmse)
  }

  if(output == "error"){
    error = ggplot(data = his, aes(x = id, y = error.mean, ymin = error.mean - error.std, ymax = error.mean + error.std,
                                   fill = class, linetype = class)) +
      geom_line() +
      geom_ribbon(alpha = 0.5) +
      labs(x = "nround", y = NULL, title = "XGB Cross Validation ERROR") +
      theme_bw() +
      theme(title = element_text(size = 15))
    return(error)
  }
}
  • auc
    xgb_plot(mdcv$evaluation_log[,-1]%>%data.frame,"auc")


The gap between the training and test curves is fairly large; the model may be overfitting.

  • rmse
    xgb_plot(mdcv$evaluation_log[,-1]%>%data.frame,"rmse")


Training and test performance are fairly consistent, but the value is still on the high side.

  • error
    xgb_plot(mdcv$evaluation_log[,-1]%>%data.frame,"error")


The test-set curve is very unstable, and the error is on the high side.

Overall the model needs further tuning, but since this post is only a demonstration of XGB's features and workflow, we will not fine-tune it here. On to the next step!

(4) Train the model

Train the model from the converted data dtrain, the best parameter combination best_param found by tuning, and the best number of rounds nround:

model <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6, watchlist = list())

(5) Plot the feature importance ranking
importanceRaw <- xgb.importance(feature_names=colnames(dtrain), model = model)

xgb.ggplot.importance(importanceRaw) # importance here is the information gain

# #--------------------------------------------------------------------------------------
# # feature selection: one option is to set an importance threshold here and keep only the features above it
# cum_impt=data.frame(names=importanceRaw$Feature,impt=cumsum(importanceRaw$Importance))
# cum_impt=filter(cum_impt,cum_impt$impt<0.9)
# selected_feature<-cum_impt$names
#
# train=select(train,selected_feature)
# dtrain<- xgb.DMatrix(data=select(train,-bad)%>%as.matrix,label= train$bad%>%as.matrix)
#
# model <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6, watchlist = list())
# #--------------------------------------------------------------------------------------


The plot above ranks features by importance; you can set an importance threshold and use it for feature selection.

(6) Make predictions
dtest=select(test,-bad)    # 'bad' is the label column
yhat=predict(model,as.matrix(dtest),missing=NA)
(7) Save the model file
save(model, file = "model/model_xgb.rda")

Next time, you can load the trained model directly and make predictions.


[4] Implementation: the Python version

xgboost is updated very quickly, and installing it on Windows is currently a headache; stay zen while installing.
The source data is not provided; if you are interested, find any classification dataset and run the code on it.

(1) Split the dataset

For any package that raises a "no module" error, please pip install it yourself.

# imports
import os
os.chdir("C:/Users/Yi/Desktop/abc") # set the working directory

import random
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb

from numpy import sort
from xgboost import plot_importance,XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix,mean_squared_error
from ggplot import *
from sklearn.externals import joblib

# split data into X and Y
X = tmp_df   # feature matrix; supply your own data
Y = label_Y  # label vector; supply your own data

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=7)

(2) API overview

As of June 2018, the xgb model has two interfaces; see the API documentation (a minimal side-by-side sketch follows the list below).
The API docs are worth reading more than once, best consumed together with the parameter documentation above~

  • XGB Learning API ( import xgboost )
  • Scikit-Learn API ( from xgboost import XGBClassifier )
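A minimal sketch of the two interfaces side by side, assuming the X_train / X_test / y_train split from step (1); both wrap the same core library, they just differ in how data and parameters are passed:

import xgboost as xgb
from xgboost import XGBClassifier

# Learning API: data goes into a DMatrix, parameters go into a dict
dtrain = xgb.DMatrix(X_train, label=y_train)
booster = xgb.train({"objective": "binary:logistic", "eta": 0.1}, dtrain, num_boost_round=50)
prob = booster.predict(xgb.DMatrix(X_test))   # returns predicted probabilities

# Scikit-Learn API: works directly on arrays/DataFrames, parameters are constructor arguments
clf = XGBClassifier(objective="binary:logistic", learning_rate=0.1, n_estimators=50)
clf.fit(X_train, y_train)
label = clf.predict(X_test)                   # returns predicted class labels

Note the renamed parameters (eta vs learning_rate, num_boost_round vs n_estimators); this matters again in section (5) method 2 below.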
(3) Tuning XGB
  • Method 1: tune directly by calling XGBClassifier() from the xgboost package;
    its parameters can be modified by hand, and the default parameters are shown below

  • Method 2: random search, using xgb.cv

best_param = {}
best_seednumber = 123
best_logloss = np.Inf
best_logloss_index = 0

dtrain = xgb.DMatrix(X_train, y_train, feature_names = list(X_train))

# random search over parameter combinations ------------------------------------
for iter in range(50):
    param = {'objective' : "binary:logistic",         # objective: logistic binary classification, since Y is binary
             'max_depth' : np.random.randint(6,11),   # maximum depth range
             'eta' : np.random.uniform(.01, .3),      # eta (shrinkage / learning rate) range
             'gamma' : np.random.uniform(0.0, 0.2),   # gamma (minimum loss reduction) range
             'subsample' : np.random.uniform(.6, .9),
             'colsample_bytree' : np.random.uniform(.5, .8),
             'min_child_weight' : np.random.randint(1,41),
             'max_delta_step' : np.random.randint(1,11)}

    cv_nround = 50   # number of boosting rounds: 50
    cv_nfold = 5     # 5-fold cross validation
    seed_number = np.random.randint(0,100)
    random.seed(seed_number)

    mdcv = xgb.cv(params = param, dtrain=dtrain, metrics=["auc","rmse","error","logloss"],
                  nfold=cv_nfold, num_boost_round=cv_nround, verbose_eval = None,
                  early_stopping_rounds=8, maximize=False)

    min_logloss = min(mdcv['test-logloss-mean'])
    min_logloss_index = mdcv.index[mdcv['test-logloss-mean'] == min(mdcv['test-logloss-mean'])][0]

    if min_logloss < best_logloss:
        best_logloss = min_logloss
        best_logloss_index = min_logloss_index
        best_seednumber = seed_number
        best_param = param


random.seed(best_seednumber)
nround = best_logloss_index
print('best_round = %d, best_seednumber = %d' %(nround,best_seednumber))
print('best_param : ------------------------------')
print(best_param) # print the best parameter combination; it is reused when training the final model
  • Method 3: use grid search with cross validation
    (see "Complete Guide to Parameter Tuning in XGBoost"; a hedged sketch is given right below)
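As a rough illustration of method 3 (the grid values below are arbitrary assumptions, not tuned recommendations), GridSearchCV from scikit-learn can be combined with the Scikit-Learn API:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# a small, purely illustrative grid; expand or refine as needed
param_grid = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.05, 0.1, 0.2],
    'min_child_weight': [1, 5, 10],
}

grid = GridSearchCV(
    estimator=XGBClassifier(objective='binary:logistic', n_estimators=50),
    param_grid=param_grid,
    scoring='neg_log_loss',   # consistent with the logloss criterion used in method 2
    cv=5,                     # 5-fold cross validation
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)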

(4) Plot train/test auc/rmse/error

Define the plotting function:

def xgb_plot(input,output):
    history = input
    train_history = history.iloc[:,8:16].assign(id=[i+1 for i in history.index])
    train_history['Class'] = 'train'
    test_history = history.iloc[:,0:8].assign(id=[i+1 for i in history.index])
    test_history['Class'] = 'test'
    train_history.columns = ["auc_mean","auc_std","error_mean","error_std","logloss_mean","logloss_std","rmse_mean","rmse_std","id","Class"]
    test_history.columns = ["auc_mean","auc_std","error_mean","error_std","logloss_mean","logloss_std","rmse_mean","rmse_std","id","Class"]

    his = pd.concat([train_history,test_history])


    if output=="auc":
        his['y_min_auc'] = his['auc_mean']-his['auc_std']
        his['y_max_auc'] = his['auc_mean']+his['auc_std']

        auc = ggplot(his,aes(x='id', y='auc_mean', ymin='y_min_auc', ymax='y_max_auc', fill='Class'))+\
            geom_line()+\
            geom_ribbon(alpha=0.5)+\
            labs(x="nround",y='',title = "XGB Cross Validation AUC")
        return(auc)


    if output=="rmse":
        his['y_min_rmse'] = his['rmse_mean']-his['rmse_std']
        his['y_max_rmse'] = his['rmse_mean']+his['rmse_std']

        rmse = ggplot(his,aes(x='id', y='rmse_mean', ymin='y_min_rmse', ymax='y_max_rmse', fill='Class'))+\
            geom_line()+\
            geom_ribbon(alpha=0.5)+\
            labs(x="nround",y='',title = "XGB Cross Validation RMSE")
        return(rmse)


    if output=="error":
        his['y_min_error'] = his['error_mean']-his['error_std']
        his['y_max_error'] = his['error_mean']+his['error_std']

        error = ggplot(his,aes(x='id', y='error_mean', ymin='y_min_error', ymax='y_max_error', fill='Class'))+\
            geom_line()+\
            geom_ribbon(alpha=0.5)+\
            labs(x="nround",y='',title = "XGB Cross Validation ERROR")
        return(error)

  • The x-axis is the number of boosting rounds, so you can watch whether overfitting appears as the iterations progress
  • The gap between the train and test curves is an indirect indicator of model complexity and a check for overfitting
    xgb_plot(mdcv,'auc')


xgb_plot(mdcv,'rmse')


xgb_plot(mdcv,'error')

(5) Train the model, make predictions, and print evaluation metrics
  • Method 1: using xgboost.train

    # use the tuning result from above: best_param

    md_1 = xgb.train(best_param, dtrain, num_boost_round=nround)

    # predict
    dtest = xgb.DMatrix(X_test, feature_names=list(X_test))
    preds = md_1.predict(dtest)
    print(mean_squared_error(y_test, preds))

    predictions = [round(value) for value in preds]
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)  # assign to a new name so the f1_score function is not shadowed
    print("Accuracy: %.2f%%" %(accuracy * 100.0))
    print("F1 Score: %.2f%%" %(f1 * 100.0))

    # save model
    md_1.save_model('xgb.model')
  • Method 2: using XGBClassifier()

    # xgb.train and XGBClassifier() use slightly different parameter names; see the API docs
    best_param['learning_rate'] = best_param.pop('eta') # rename a key of the parameter dict
    best_param.update({'colsample_bytree': 1}) # turn off column subsampling by overwriting a value

    md_2 = XGBClassifier(**best_param) # the double asterisk unpacks the parameter dict directly
    md_2.fit(X_train, y_train)

    ypred = md_2.predict(X_test)
    predictions = [round(value) for value in ypred]

    # print evaluation metrics
    MSE = mean_squared_error(y_test, predictions)
    print("MSE: %.2f%%" % (MSE * 100.0))
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
    f1 = f1_score(y_test, predictions)  # keep the f1_score function unshadowed
    print("F1 Score: %.2f%%" % (f1 * 100.0))
(6) Plot the feature importance ranking
ax = xgb.plot_importance(md_2, height=0.5)
fig = ax.figure
fig.set_size_inches(25,20) # adjust the figure size and spacing
plt.show()

(7) Feature selection based on importance
# sorted(list(selection_model.booster().get_score(importance_type='weight').values()),reverse = True)

importance_plot = pd.DataFrame({'feature':list(X_train.columns),'importance':md_2.feature_importances_})
importance_plot = importance_plot.sort_values(by='importance')
importance_plot = importance_plot.reset_index(drop=True)
thresholds = importance_plot.importance
thresholds_valid = np.unique(thresholds[thresholds != 0])


for thresh in thresholds_valid:

    # select features using threshold
    selection = SelectFromModel(md_2, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier(**best_param)
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.4f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))


thresh = 0.034
selected_features = list(importance_plot[importance_plot.importance > thresh]['feature'])
print('selected features are :\n %s'%selected_features)
select_X_train = X_train[selected_features] # keep only the features whose importance exceeds the threshold

n_features = select_X_train.shape[1]
print('total: %d features are selected' %n_features)

selection_model = XGBClassifier(**best_param)
selection_model.fit(select_X_train, y_train)

select_X_test = X_test[selected_features]
y_pred = selection_model.predict(select_X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)  # avoid shadowing the f1_score function
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("F1 Score: %.2f%%" % (f1 * 100.0))

Whether you tune first and then select features, select first and then tune, or iterate between the two repeatedly is purely a matter of personal preference.

(8) Plot a decision tree
  • First download graphviz-2.38.zip from Graphviz; I am on Windows, so pick the build for your own system
  • Configure the environment variable
    # add the Graphviz install path to PATH
    os.environ["PATH"] += os.pathsep + 'C:/Users/Yi/Anaconda3/envs/release/bin/' # replace the quoted string with the path to your own dot.exe
    xgb.to_graphviz(md_2, num_trees=0, rankdir='LR') # num_trees picks which tree to draw (0 = first tree); rankdir sets the layout direction (default is top to bottom)

(9) Save and load the model file
# save model
joblib.dump(selection_model,'xgb.model')

# load model
loaded_model = joblib.load('xgb.model')

[5] Advantages of XGB

Pay attention: this is a key interview topic!
Equivalent question: what are the differences between XGB and GBDT?

  1. Loss function: GBDT uses a first-order approximation, while XGB uses a second-order Taylor expansion.
  2. XGB allows a custom loss function; see the objective parameter (a hedged sketch of a custom objective is given after this list).
  3. XGB's objective is augmented with a regularization term, which reduces overfitting and controls model complexity.
  4. Pruning, to prevent overfitting:
    • GBDT: stops splitting as soon as a split yields a negative gain.
    • XGB: splits all the way down to the specified maximum depth (max_depth), then prunes backwards, removing any split that no longer contributes a positive gain. The advantage: if a split with gain -2 is followed by one with gain +10, the sum (-2 + 10 = 8 > 0) is positive, so the split is kept.
  5. XGB supports column subsampling (borrowed from random forests), which reduces overfitting.
  6. Missing values: XGB has built-in rules for handling missing values. The user can pass in a value distinct from all other samples as a parameter to mark missing entries. XGB learns a different default direction for missing values at each node and applies it to missing values seen in the future.
  7. XGB has built-in cross validation (CV), which can evaluate every boosting round so you can find the optimal number of rounds (Boosting_n_round); parameters can also be tuned with grid search plus cross validation.
    GBDT relies on grid search.
  8. XGB runs fast: the data is pre-sorted and stored in a block structure, which favours parallel computation; the sorted blocks are reused across later iterations.
    As for the parallelism, it is not at the tree level but at the feature level: the gain (information gain) of candidate splits is computed over features in parallel to find the best split point.
  9. Flexibility: each base learner in XGB can be customized in depth.
  10. Ease of use: XGB has wrappers for many languages.
  11. Scalability: XGB offers distributed training, with support for Hadoop.
  12. Advantages shared with other tree methods:
    • Tree-based algorithms are more robust when the data is noisy.
    • Trees handle missing values easily.
    • Trees are friendlier to categorical features.
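To make point 2 concrete (and to show the first- and second-order gradients from point 1 in code), here is a minimal sketch of a custom objective passed to xgb.train; the binary log-loss gradient/hessian below is a standard textbook example, not code from the original post, and the dtrain it assumes is a DMatrix as in the earlier sections.

import numpy as np
import xgboost as xgb

def custom_logloss(preds, dtrain):
    # return the per-sample first derivative (grad) and second derivative (hess)
    # of the loss; XGBoost plugs these g_i and h_i into its second-order expansion
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))   # raw margins -> probabilities
    grad = probs - labels                  # d loss / d margin
    hess = probs * (1.0 - probs)           # d^2 loss / d margin^2
    return grad, hess

# hypothetical usage:
# booster = xgb.train({'max_depth': 4, 'eta': 0.1}, dtrain,
#                     num_boost_round=50, obj=custom_logloss)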

XGB is extremely powerful and is constantly being updated; this summary only draws on what is available today, and nobody can predict how it will evolve.


would you buy me a coffee☕~