Competition URL
Models
A combination of FTRL + FM + LR + GBDT + RNN + Ridge reached the top 4% (78/2091). Still working on improving it.
Defining the RMSLE evaluation function
The RMSLE evaluation function is as follows:
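Written out under the standard definition, which is what the implementation below computes, with $n$ samples, true values $y_i$ and predictions $\hat{y}_i$ (the symbol names are chosen here for illustration):

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( \log(\hat{y}_i + 1) - \log(y_i + 1) \bigr)^2}
```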
Note: this metric penalizes under-prediction more heavily than over-prediction.
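import math

def rmsle(y, y_pred):
    # Root mean squared logarithmic error; both inputs must be non-negative.
    assert len(y) == len(y_pred)
    # Pair each true value with its prediction instead of indexing by position.
    to_sum = [(math.log(p + 1) - math.log(t + 1)) ** 2.0 for t, p in zip(y, y_pred)]
    return (sum(to_sum) * (1.0 / len(y))) ** 0.5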
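To see the asymmetry noted above in numbers, here is a minimal sketch that restates the function and compares an under-prediction and an over-prediction of the same absolute size. The values 100, 50, and 150 are made up for illustration.

```python
import math

def rmsle(y, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    assert len(y) == len(y_pred)
    squared = [(math.log(p + 1) - math.log(t + 1)) ** 2 for t, p in zip(y, y_pred)]
    return (sum(squared) / len(y)) ** 0.5

# Hypothetical true value of 100, missed by 50 in each direction.
under = rmsle([100], [50])    # prediction too low
over = rmsle([100], [150])    # prediction too high

print(f"under-prediction error: {under:.4f}")
print(f"over-prediction error:  {over:.4f}")
```

The logarithm compresses large values, so missing low by 50 moves the log further than missing high by 50, and the under-prediction gets the larger error.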
Loading the data
import pandas as pd

# The input files are tab-separated.
train = pd.read_table("../input/train.tsv")
test = pd.read_table("../input/test.tsv")
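Processing categorical features
import numpy as np
from sklearn.preprocessing import LabelEncoder

# all_df is not defined in this snippet; it is assumed to be train and test concatenated.
# Split "level1/level2/level3" category names into three separate columns.
category_split_result = all_df.category_name.str.split("/", expand=True).astype(str)
all_df['cat1'] = category_split_result[0]
all_df['cat2'] = category_split_result[1]
all_df['cat3'] = category_split_result[2]

# Fit each encoder on train and test together so labels that occur only in test still get an ID.
# Assumes missing values in category_name and brand_name were filled beforehand.
le = LabelEncoder()
le.fit(np.hstack([train.category_name, test.category_name]))
train.category_name = le.transform(train.category_name)
test.category_name = le.transform(test.category_name)

le.fit(np.hstack([train.brand_name, test.brand_name]))
train.brand_name = le.transform(train.brand_name)
test.brand_name = le.transform(test.brand_name)
del le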
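The three-level split can also be sketched in plain Python with an explicit placeholder for absent levels. This is an illustrative alternative, not the original code: the helper name `split_category`, the placeholder `"missing"`, and the sample category names are all hypothetical.

```python
def split_category(name, depth=3, missing="missing"):
    """Split 'a/b/c' into exactly `depth` levels, padding absent levels."""
    parts = name.split("/") if isinstance(name, str) and name else []
    parts = parts[:depth]                          # ignore levels beyond `depth`
    return parts + [missing] * (depth - len(parts))

# Made-up examples: a full name, a two-level name, and a missing value.
examples = ["Men/Tops/T-shirts", "Beauty/Makeup", None]
for name in examples:
    print(name, "->", split_category(name))
```

Padding with a fixed placeholder keeps every row at three levels, so the resulting columns can be label-encoded without special cases.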
Converting text to sequences
import numpy as np
from keras.preprocessing.text import Tokenizer

# Build the vocabulary from the lowercased training descriptions and names.
raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])

print(" Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)

# Replace each text with the list of integer word indices.
print(" Transforming text to seq...")
train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())
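To make the text-to-sequence step concrete, here is a simplified plain-Python stand-in for what a word-index tokenizer does: rank words by frequency, number them from 1, and map each text to a list of those numbers. It is not Keras's implementation (it skips punctuation filtering, for example), and the class name `TinyTokenizer` and the sample texts are invented for illustration.

```python
from collections import Counter

class TinyTokenizer:
    """Simplified word-to-index tokenizer (illustrative, not the Keras class)."""

    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        counts = Counter(word for text in texts for word in text.lower().split())
        # Most frequent word gets index 1; ties are broken alphabetically.
        # Index 0 is left free, e.g. for padding.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        self.word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        # Words never seen during fitting are dropped.
        return [[self.word_index[w] for w in text.lower().split() if w in self.word_index]
                for text in texts]

tok = TinyTokenizer()
tok.fit_on_texts(["new blue shirt", "blue jeans"])
train_seq = tok.texts_to_sequences(["new blue shirt"])
test_seq = tok.texts_to_sequences(["red blue shirt"])   # "red" was never fitted
print(tok.word_index)
print(train_seq, test_seq)
```

Because the tokenizer in the code above is fitted on `train` only, words that appear only in `test` would be dropped in the same way, assuming the default Keras settings with no out-of-vocabulary token.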
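Splitting the dataset
from sklearn.model_selection import train_test_split

# Hold out 1% of the training rows for validation; random_state makes the shuffle reproducible.
dtrain, dvalid = train_test_split(train, random_state=123, train_size=0.99)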
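A plain-Python sketch of what a seeded 99/1 hold-out split amounts to: shuffle the row indices with a fixed seed and cut once. It is an illustrative stand-in, not scikit-learn's algorithm, so it will not select the same rows as `train_test_split`; the function name `holdout_split` and the row count of 1,000 are made up.

```python
import math
import random

def holdout_split(n_rows, train_size=0.99, seed=123):
    """Return (train_idx, valid_idx) from one seeded shuffle of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # same seed gives the same shuffle
    cut = math.floor(train_size * n_rows)      # number of rows kept for training
    return indices[:cut], indices[cut:]

train_idx, valid_idx = holdout_split(1000)
print(len(train_idx), len(valid_idx))
```

With only 1% held out, nearly all rows go to training, at the cost of a noisier validation score.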