Advancing on Kaggle

This post marks the first time I broke into the top 10% on Kaggle, and it summarizes and shares the experience along the way. The post will be fully updated by the 26th. I'm glad if you reuse my code; linking back to the source is the best recognition you can give me~

I hope that after reading this post, anyone who, like me, had only a fuzzy understanding of stacking will come away with something clear.

Kaggle House Prices competition: ranked 161/4557 with a score of 0.11421

Competition Overview

As the official page describes, House Prices is one of the best data-mining practice competitions on Kaggle: it is a great workout for feature engineering and regression techniques, its discussion and kernel sections are full of practical experience to learn from, and with plenty of participants (4,929 so far) you will never lack people to exchange ideas with. Although the competition's data has been leaked, there are still many excellent kernels worth studying.

I do not recommend that newcomers submit my submission, or any other Kaggler's, just out of curiosity. Even once you understand it, reproducing the work yourself before submitting will teach you far more.

Overview of the Kaggle House Prices competition

Data Description

The provided files are:

  1. train.csv - the training set: 81 columns, comprising 1 house Id column, 79 feature columns, and 1 SalePrice column
  2. test.csv - the test set: 80 columns, comprising 1 house Id column and 79 feature columns
  3. data_description.txt - descriptions of every feature
  4. sample_submission.csv - a sample submission: 2 columns, comprising 1 house Id column and 1 SalePrice column

The data can be downloaded for local analysis or explored online, and the Kaggle site already provides simple statistics for every feature. As shown in the figure below, it is easy to inspect each feature's mode, missing values, and so on.

Summary statistics for the Kaggle House Prices data
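
If you work locally, here is a minimal sketch of loading the files with pandas (the relative paths are my assumption; adjust them to wherever you saved the data):

import numpy as np
import pandas as pd

# load the competition files; paths assume they sit in the working directory
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print(train.shape, test.shape)   # (1460, 81) and (1459, 80)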

Data Processing

The summary statistics above are not enough on their own to surface valuable features, so we need further visual analysis of the data and have to deal with missing values and outliers.

Outliers

The competition overview mentions that the training set contains a few outliers, so let's first see what they look like.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

Exploring the outliers

In the bottom-right corner of the plot above there are two extreme outliers, and they can be safely deleted.

#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

After removing the outliers

Removing values like these is safe because they are wildly inconsistent (very large living areas at very low prices).

There may well be other outliers in the training data. However, if the test data also contains outliers, removing all of them could seriously hurt our models. That is why, instead of deleting them all, I try to make some of my models more robust to them.

Target Variable

"SalePrice" is the variable we need to predict, so let's analyze it first.

import seaborn as sns
from scipy import stats
from scipy.stats import norm

sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

Distribution of the target variable

The target variable is right-skewed. Since linear models prefer normally distributed data, we log-transform this variable to bring it closer to a normal distribution.

#We use the numpy fuction log1p which  applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])

#Check the new distribution 
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

Distribution of the target variable after the log transform

After the correction, the data is much closer to a normal distribution.
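
As a quick numerical check (my addition, not in the original post), the skewness of the transformed column can be printed; a value near zero indicates a roughly symmetric distribution:

# skewness should be much closer to 0 after the log1p transform
print("Skewness after log1p: {:.3f}".format(train['SalePrice'].skew()))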

Feature Engineering

Handling Missing Values

# all_data is the concatenation of the train and test features (with SalePrice dropped)
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
                 Missing Ratio (%)
PoolQC                      99.691
MiscFeature                 96.400
Alley                       93.212
Fence                       80.425
FireplaceQu                 48.680
LotFrontage                 16.661
GarageQual                   5.451
GarageCond                   5.451
GarageFinish                 5.451
GarageYrBlt                  5.451
GarageType                   5.382
BsmtExposure                 2.811
BsmtCond                     2.811
BsmtQual                     2.777
BsmtFinType2                 2.743
BsmtFinType1                 2.708
MasVnrType                   0.823
MasVnrArea                   0.788
MSZoning                     0.137
BsmtFullBath                 0.069

The missing entries in this table follow a pattern. Take the first one, PoolQC, which describes pool quality: a missing value simply means the house has no pool, so it can be filled with "None".

Features such as "Alley" and "Fence" can likewise be filled with "None".

cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None", inplace=True)

For area-like features such as TotalBsmtSF, a missing value means the house has no basement at all, so missing values are filled with 0.

cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)

In addition, because the lot frontage of a house (the linear feet of street connected to it) is likely to be similar to that of other houses nearby, missing "LotFrontage" values can be filled with the median of the same neighborhood.

all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
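
After these fills, it is worth re-checking what is still missing (a small sanity check that I am adding here); any remaining columns can then be handled case by case, for example with the column mode:

# list the columns that still contain missing values
remaining = all_data.isnull().sum()
print(remaining[remaining > 0].sort_values(ascending=False))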

Models

Base Models

Import the relevant packages.

from sklearn.linear_model import LinearRegression, Ridge, ElasticNet, Lasso, SGDRegressor, BayesianRidge, LassoLarsIC
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
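
The snippets below also refer to X_scaled, y_log, and test_X_scaled, which are prepared outside this section. Here is a minimal sketch of one way to build them, assuming all_data holds the combined features after the missing-value handling above (the variable names and the use of get_dummies are my assumptions, not necessarily the author's exact steps):

# one-hot encode the categorical columns, then scale with RobustScaler
# (robust scaling limits the influence of any remaining outliers)
n_train = train.shape[0]
all_data_encoded = pd.get_dummies(all_data)
scaler = RobustScaler().fit(all_data_encoded[:n_train])
X_scaled = scaler.transform(all_data_encoded[:n_train])
test_X_scaled = scaler.transform(all_data_encoded[n_train:])
y_log = train["SalePrice"].values   # already log1p-transformed above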

Define the evaluation metric, and make absolutely sure it is right, otherwise all the later work is wasted. The competition scores the root mean squared error between the logarithm of the predicted and the logarithm of the observed sale prices, which is why we keep working with the log-transformed target:

$$\sqrt{\frac{\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2}{N}}$$

def rmse_cv(model,X,y):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

My initial plan is to evaluate these 13 algorithms with 5-fold cross-validation:

  • LinearRegression
  • Ridge
  • Lasso
  • Random Forest
  • Gradient Boosting Tree
  • Support Vector Regression
  • Linear Support Vector Regression
  • ElasticNet
  • Stochastic Gradient Descent
  • BayesianRidge
  • KernelRidge
  • ExtraTreesRegressor
  • XGBoost

First, let's look at how each of them performs.
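
The models list used in the loop below is not shown in this section; here is one way it could be assembled to match the 13 algorithms above (the hyperparameters are illustrative defaults of mine, not the tuned values):

models = [LinearRegression(), Ridge(), Lasso(alpha=0.01, max_iter=10000),
          RandomForestRegressor(), GradientBoostingRegressor(), SVR(),
          LinearSVR(), ElasticNet(alpha=0.001, max_iter=10000),
          SGDRegressor(max_iter=1000, tol=1e-3), BayesianRidge(),
          KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5),
          ExtraTreesRegressor(), xgb.XGBRegressor()]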

names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela","SGD","Bay","Ker","Extra","Xgb"]
for name, model in zip(names, models):
    score = rmse_cv(model, X_scaled, y_log)
    print("{}: {:.6f}, {:.4f}".format(name,score.mean(),score.std()))

Then comes more detailed hyperparameter tuning; below is a small helper class for grid search.

class grid():
    def __init__(self,model):
        self.model = model

    def grid_get(self,X,y,param_grid):
        grid_search = GridSearchCV(self.model,param_grid,cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X,y)
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        print(pd.DataFrame(grid_search.cv_results_)[['params','mean_test_score','std_test_score']])

Tune the models one by one.

grid(Lasso()).grid_get(X_scaled,y_log,{'alpha': [0.0004,0.0005,0.0007,0.0006,0.0009,0.0008],'max_iter':[10000]})
{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965
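
The other models can be tuned in the same way. The parameter grids below are illustrative ranges of mine around the finally chosen values, not necessarily the exact grids that were searched:

grid(Ridge()).grid_get(X_scaled, y_log, {'alpha': [35, 40, 45, 50, 55, 60, 65, 70, 80, 90]})
grid(SVR()).grid_get(X_scaled, y_log, {'C': [11, 13, 15], 'kernel': ['rbf'],
                                       'gamma': [0.0003, 0.0004], 'epsilon': [0.008, 0.009]})
grid(KernelRidge()).grid_get(X_scaled, y_log, {'alpha': [0.2, 0.3, 0.4], 'kernel': ['polynomial'],
                                               'degree': [3], 'coef0': [0.8, 1.0]})
grid(ElasticNet()).grid_get(X_scaled, y_log, {'alpha': [0.005, 0.008], 'l1_ratio': [0.08, 0.1, 0.3],
                                              'max_iter': [10000]})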

In the end I settled on these six well-performing models:

lasso = Lasso(alpha=0.0005,max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma= 0.0004,kernel='rbf',C=13,epsilon=0.009)
ker = KernelRidge(alpha=0.2 ,kernel='polynomial',degree=3 , coef0=0.8)
ela = ElasticNet(alpha=0.005,l1_ratio=0.08,max_iter=10000)
bay = BayesianRidge()

Model Ensembling

Weighted Averaging

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self,mod,weight):
        self.mod = mod
        self.weight = weight

    def fit(self,X,y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X,y)
        return self

    def predict(self,X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # for every data point, single model prediction times weight, then add them together
        for data in range(pred.shape[1]):
            single = [pred[model,data]*weight for model,weight in zip(range(pred.shape[0]),self.weight)]
            w.append(np.sum(single))
        return w

Let's try it: the score is about 0.107, better than any single model on its own.
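
The individual weights w1 through w6 are not given in the post; as a neutral placeholder (my assumption, not the tuned values) equal weights can be used, as long as they sum to 1:

# equal weights as a placeholder; in practice they are tuned by hand and should sum to 1
w1 = w2 = w3 = w4 = w5 = w6 = 1.0 / 6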

weight_avg = AverageWeight(mod=[lasso, ridge, svr, ker, ela, bay], weight=[w1, w2, w3, w4, w5, w6])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())

Model ensembling with stacking works as follows:

  1. Split the training set into 5 folds.
  2. Train the model on 4 of the folds and predict on the remaining fold, producing one fold of predictions.
  3. Swap the held-out fold with each of the other 4 folds in turn and repeat step 2, producing 4 more folds of predictions.
  4. Concatenate the predictions from steps 2 and 3 to form the new training set.
  5. Have the 5 models trained in steps 2 and 3 each predict on the test set, then average the results to form the new test set.
  6. Train a model on the new training set and predict on the new test set to obtain the final target column.

So there are two nested loops: the outer loop runs over the base models, and the inner loop runs over the cross-validation folds.

Stacking diagram

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)
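
Before producing the submission, the stacked model can itself be scored with the rmse_cv helper defined earlier (this check is my addition; it assumes X_scaled and y_log are NumPy arrays, since fit indexes them positionally):

stacked_model = StackingAveragedModels(base_models=[lasso, ridge, svr, ker, ela, bay], meta_model=ker)
score = rmse_cv(stacked_model, X_scaled, y_log)
print("Stacking CV score: {:.4f} ({:.4f})".format(score.mean(), score.std()))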

For the final submission I used Lasso, Ridge, SVR, Kernel Ridge, ElasticNet, and BayesianRidge as the first-level base models, and Kernel Ridge as the second-level meta-model.

stack_model = StackingAveragedModels(base_models=[lasso, ridge, svr, ker, ela, bay], meta_model=ker)
stack_model.fit(X_scaled, y_log)                       # training features and log1p-transformed target
pred = np.expm1(stack_model.predict(test_X_scaled))    # expm1 inverts the earlier log1p transform

result = pd.DataFrame({'Id': test.Id, 'SalePrice': pred})
result.to_csv("submission.csv", index=False)

Closing Remarks

That's all I have to share this time; I hope it leaves you with a deeper understanding of stacking and regression.

