Machine Learning - Text Classification (2) - News Text Classification
Reference: https://mp.weixin.qq.com/s/6vkz18Xw4USZ3fldd_wf5g
1. Dataset download links
https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/train_set.csv.zip
https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/test_a.csv.zip
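The dataset comes from a Tianchi competition: the training set has 200,000 samples and test set A has 50,000. The text has been anonymized at the character level, so each document appears as a sequence of character IDs rather than readable Chinese.
Here the training set is simply split into a training part and a test part.
The labels in the dataset correspond to the following categories:
{'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Politics': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home': 8, 'Games': 9, 'Real Estate': 10, 'Fashion': 11, 'Lottery': 12, 'Horoscope': 13}
Evaluation metric: macro-averaged F1, which is what the code in this post computes with f1_score(..., average='macro').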
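To make the metric concrete, here is a small sketch that computes macro F1 by hand and compares it with f1_score. The label arrays and the helper name macro_f1 are made up for illustration; they are not taken from the dataset or the original post.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2])

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    scores = []
    for c in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

manual = macro_f1(y_true, y_pred)
library = f1_score(y_true, y_pred, average='macro')
print(manual, library)
```

Because every class counts equally in the average, a rare class that is predicted badly pulls the score down as much as a common one.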
2. Import the required packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
3. Read the data
train_path="/content/drive/My Drive/nlpdata/news/train_set.csv"
# The file is tab-separated, so the separator must be '\t' (not the letter 't')
train_df = pd.read_csv(train_path, sep='\t', nrows=15000)
train_df['text']
train_df['label']
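The post splits the loaded rows by position (the first 10,000 for training, the rest for validation). As an alternative sketch, a stratified split keeps the class proportions the same in both parts. The in-memory rows below are made up, and the header row with label and text columns is an assumption based on the column names used above.

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up tab-separated rows in an assumed label/text layout.
raw = "label\ttext\n" + "".join(f"{i % 3}\t12 34 56 {i}\n" for i in range(12))
toy_df = pd.read_csv(io.StringIO(raw), sep='\t')

# Stratify on the label so each class is represented in both parts.
train_part, val_part = train_test_split(
    toy_df, test_size=0.25, stratify=toy_df['label'], random_state=0)

train_counts = train_part['label'].value_counts().sort_index().tolist()
val_counts = val_part['label'].value_counts().sort_index().tolist()
print(train_counts, val_counts)
```

With three classes of four rows each, the 9/3 split takes three rows per class for training and one per class for validation.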
4. Text classification
(1) n-gram counts + ridge classifier
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
0.65441877581244
(2) TF-IDF + ridge classifier
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
0.8719372173702
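In the code above the vectorizer is fitted on all 15,000 rows, including the rows later used for validation, so the vocabulary and IDF weights have already seen the validation text. One possible variant (a sketch, not the method used in this post) wraps the two steps in a Pipeline so the vectorizer is fitted on the training rows only. The toy texts and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Made-up ID strings standing in for the real training and validation rows.
train_texts = ["11 22 33", "11 22 44", "55 66 77", "55 66 88"]
train_labels = [0, 0, 1, 1]
val_texts = ["11 22 99", "55 66 99"]
val_labels = [0, 1]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=3000),
    RidgeClassifier(),
)
model.fit(train_texts, train_labels)   # vocabulary and IDF come from training text only
val_pred = model.predict(val_texts)    # validation text is only transformed

score = f1_score(val_labels, val_pred, average='macro')
vocab = model.named_steps['tfidfvectorizer'].vocabulary_
print(score, '99' in vocab)
```

The token "99" appears only in the validation texts, so it never enters the vocabulary and is ignored at prediction time.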
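5. Exploring how parameters affect the model
Take a sample of 5,000 rows, keep the other parameters fixed, increase alpha from 0.15 to 1.5, and plot F1 against alpha.
(1) Ridge classifier: effect of alpha on the model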
sample = train_df[0:5000]
n = int(2*len(sample)/3)
tfidf = TfidfVectorizer(ngram_range=(2,3), max_features=2500)
train_test = tfidf.fit_transform(sample['text'])
train_x = train_test[:n]
train_y = sample['label'].values[:n]
test_x = train_test[n:]
test_y = sample['label'].values[n:]
f1 = []
for i in range(10):
    clf = RidgeClassifier(alpha = 0.15*(i+1), solver = 'sag')
    clf.fit(train_x, train_y)
    val_pred = clf.predict(test_x)
    f1.append(f1_score(test_y, val_pred, average='macro'))
plt.plot([0.15*(i+1) for i in range(10)], f1)
plt.xlabel('alpha')
plt.ylabel('f1_score')
plt.show()
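The plot shows that alpha should be neither too large nor too small. The smaller alpha is, the weaker the regularization: the model fits the training data more closely but generalizes less well. The larger alpha is, the stronger the regularization: the model fits the training data less closely but generalizes better.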
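One way to see the regularization effect directly is to look at the size of the learned coefficients as alpha grows. The sketch below uses synthetic data from make_classification (an assumption for illustration, not the news dataset) and treats the coefficient norm as a rough proxy for how hard the model is allowed to fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic 3-class data, used only to illustrate the shrinkage effect.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

alphas = [0.15, 0.6, 1.5, 15.0]
norms = []
for alpha in alphas:
    clf = RidgeClassifier(alpha=alpha).fit(X, y)
    # Frobenius norm of the coefficient matrix (one row per class)
    norms.append(float(np.linalg.norm(clf.coef_)))

for alpha, norm in zip(alphas, norms):
    print(alpha, norm)
```

The norm shrinks as alpha increases, which matches the intuition that a larger alpha restricts the fit.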
(2) Effect of max_features on the model
Set max_features to 1000, 2000, 3000, and 4000 in turn and see how it affects the model's score.
f1 = []
features = [1000,2000,3000,4000]
for i in range(4):
    tfidf = TfidfVectorizer(ngram_range=(2,3), max_features=features[i])
    train_test = tfidf.fit_transform(sample['text'])
    train_x = train_test[:n]
    train_y = sample['label'].values[:n]
    test_x = train_test[n:]
    test_y = sample['label'].values[n:]
    # Note: alpha also changes with i here (0.1 to 0.4), so the curve reflects both settings
    clf = RidgeClassifier(alpha = 0.1*(i+1), solver = 'sag')
    clf.fit(train_x, train_y)
    val_pred = clf.predict(test_x)
    f1.append(f1_score(test_y, val_pred, average='macro'))
plt.plot(features, f1)
plt.xlabel('max_features')
plt.ylabel('f1_score')
plt.show()
The plot shows that a larger max_features gives a higher score, but once max_features passes a certain value, increasing it further has little additional effect.
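As a quick illustration of what max_features does, the sketch below fits TfidfVectorizer on three made-up documents. The vectorizer keeps only the terms with the highest total count across the corpus, so the terms added by a larger limit are progressively rarer, which is consistent with the diminishing returns seen in the plot.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: "11" occurs 3 times, "22" twice, "33" once.
docs = ["11 22 33", "11 22", "11"]

kept = {}
for k in (1, 2, 3):
    vec = TfidfVectorizer(max_features=k).fit(docs)
    kept[k] = sorted(vec.vocabulary_)  # terms that survived the limit
    print(k, kept[k])
```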
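(3) Effect of ngram_range on the model
ngram_range gives the lower and upper bounds on the number of tokens in each extracted n-gram. Given typical Chinese word lengths, values between 1 and 4 are reasonable choices for ngram_range.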
f1 = []
# n-gram ranges compared at each alpha, in the order they are scored
ngram_ranges = [(1,1), (2,2), (3,3), (1,3)]
for i in range(4):
    for ngram_range in ngram_ranges:
        tfidf = TfidfVectorizer(ngram_range=ngram_range, max_features=2000)
        train_test = tfidf.fit_transform(sample['text'])
        train_x = train_test[:n]
        train_y = sample['label'].values[:n]
        test_x = train_test[n:]
        test_y = sample['label'].values[n:]
        # alpha takes the values 0.1, 0.2, 0.3, 0.4 across the outer loop
        clf = RidgeClassifier(alpha = 0.1*(i+1), solver = 'sag')
        clf.fit(train_x, train_y)
        val_pred = clf.predict(test_x)
        f1.append(f1_score(test_y, val_pred, average='macro'))
print(f1)
[0.7931919639413474, 0.7831242477075827, 0.6293265527038611, 0.8436709720083034, 0.8127288721306228, 0.791639726421815, 0.6425340629702662, 0.8512559206701422, 0.82151852494927, 0.7978544191527702, 0.6500441251723578, 0.8516726763849712, 0.8275245575862662, 0.7963717190315031, 0.6577157272412916, 0.8485051384495732]
6. Other classification models
All of them use TF-IDF for preprocessing.
(1) Logistic regression
from sklearn import linear_model
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=5000)
train_test = tfidf.fit_transform(train_df['text']) # TF-IDF feature matrix, 15000 x max_features
reg = linear_model.LogisticRegression(penalty='l2', C=1.0,solver='liblinear')
reg.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = reg.predict(train_test[10000:])
print('Number of predicted news items per class')
print(pd.Series(val_pred).value_counts())
print('\nF1 score:')
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
Number of predicted news items per class
0     1032
1     1029
2      782
3      588
4      375
5      316
6      224
8      166
7      161
9      123
10     109
11      60
12      23
13      12
dtype: int64
F1 score:
0.8464704900433653
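The predicted counts above are heavily skewed toward a few classes, so it can help to look at F1 per class rather than only the average. The following sketch uses made-up labels in which a rare class is never predicted; passing average=None to f1_score returns one score per class.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: class 2 is rare and is never predicted.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0])

per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
print(per_class, macro, micro)
```

The missed rare class scores 0 and drags the macro average well below the micro average, which is why macro F1 is sensitive to the small classes in this dataset.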
(2) SGDClassifier
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=5000)
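train_test = tfidf.fit_transform(train_df['text']) # TF-IDF feature matrix, 15000 x max_features
# loss="log_loss" is the name in current scikit-learn; releases before 1.1 use loss="log"
reg = linear_model.SGDClassifier(loss="log_loss", penalty='l2', alpha=0.0001,l1_ratio=0.15)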
reg.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = reg.predict(train_test[10000:])
print('Number of predicted news items per class')
print(pd.Series(val_pred).value_counts())
print('\nF1 score:')
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
(3) SVM
from sklearn import svm
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=5000)
train_test = tfidf.fit_transform(train_df['text']) # TF-IDF feature matrix, 15000 x max_features
reg = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto',decision_function_shape='ovr')
reg.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = reg.predict(train_test[10000:])
print('Number of predicted news items per class')
print(pd.Series(val_pred).value_counts())
print('\nF1 score:')
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
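Since the three blocks in this section share the same TF-IDF features and the same evaluation, they can be folded into one loop. The sketch below does this for a subset of the models on made-up data; the toy texts, the split point, and the dictionary keys are illustrative assumptions, and the vectorizer is fitted on all rows to mirror the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.svm import SVC

# Made-up ID strings standing in for train_df['text'] and its labels.
texts = ["11 22 33", "11 22 44", "55 66 77", "55 66 88"] * 5
labels = [0, 0, 1, 1] * 5
split = 16  # first 16 rows for training, last 4 for validation

features = TfidfVectorizer(ngram_range=(1, 3), max_features=5000).fit_transform(texts)

models = {
    'ridge': RidgeClassifier(),
    'logistic': LogisticRegression(C=1.0, solver='liblinear'),
    'svm_linear': SVC(C=1.0, kernel='linear'),
}

scores = {}
for name, model in models.items():
    model.fit(features[:split], labels[:split])
    val_pred = model.predict(features[split:])
    scores[name] = f1_score(labels[split:], val_pred, average='macro')
print(scores)
```

Computing the features once and reusing them keeps the comparison fair, because every model sees exactly the same inputs.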