Training an LDA Topic Model with sklearn, with a Detailed Guide to Parameter Tuning

Date: 2020-03-10

Loading the corpus and preprocessing

This article uses the 20newsgroups corpus that ships with sklearn's dataset API. It contains news posts from many domains such as business, technology, sports, and aerospace, and is well suited to NLP beginners. The sklearn 20newsgroups documentation gives a very detailed introduction.
For preprocessing, NLTK is called directly for lowercasing, tokenization, stopword removal, POS filtering, and stemming. Which of these steps you apply depends entirely on your actual needs and data; for example, I often drop stemming or POS filtering (usually because the results are not good)... The code below loads the 20newsgroups data and performs the text preprocessing.

# 1. Load the data
# The corpus contains news posts from many domains such as business, technology, sports, and aerospace
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:2000]  # take only the amount we need: 2,000 documents
# print(data_samples)



# 2. Text preprocessing (optional steps)
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


# NLTK resources (stopwords, punkt, averaged_perceptron_tagger) must be available on the NLTK data path
def textPrecessing(text):
    # lowercase
    text = text.lower()
    # strip punctuation
    for c in string.punctuation:
        text = text.replace(c, ' ')
    # tokenize
    wordLst = nltk.word_tokenize(text)
    # remove stopwords
    filtered = [w for w in wordLst if w not in stopwords.words('english')]
    # keep only nouns (or whichever POS tags you need)
    refiltered = nltk.pos_tag(filtered)
    filtered = [w for w, pos in refiltered if pos.startswith('NN')]
    # stemming
    ps = PorterStemmer()
    filtered = [ps.stem(w) for w in filtered]

    return " ".join(filtered)


# Run this block only on the first pass to preprocess the texts; comment it out afterwards
# docList = []
# for desc in data_samples:
#     docList.append(textPrecessing(desc))
# with open('D:/data/LDA/20newsgroups(2000).txt', 'a') as f:
#     for line in docList:
#         f.writelines(line + '\n')

# ==============================================================================
# From the second run onwards, load the preprocessed docList directly and comment out the data loading and preprocessing above
docList = []
with open('D:/data/LDA/20newsgroups(2000).txt', 'r') as f:
    for line in f.readlines():
        if line != '':
            docList.append(line.strip())
# ==============================================================================

Counting term frequencies with CountVectorizer

The training data for LDA is not the raw texts themselves but a document-word matrix. It can be a dense array or a sparse matrix of shape n_samples * n_features, where n_features is the number of terms. So before training the LDA topic model, we first use CountVectorizer to count term frequencies and save the result. The code is as follows:

# 3. Count term frequencies
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.externals import joblib  # pickle etc. also works for saving models; sklearn.externals.joblib was removed in sklearn 0.23+, use `import joblib` there

# Build and save the term-count vectorizer; run this only on the first pass. API: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=1500,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(docList)
joblib.dump(tf_vectorizer, 'D:/saved_model/vectorizer_sklearn/vectorizer_sklearn.model')
# ==============================================================================
# # Reload the saved tf_vectorizer to skip recomputation
# tf_vectorizer = joblib.load('D:/saved_model/vectorizer_sklearn/vectorizer_sklearn.model')
# tf = tf_vectorizer.transform(docList)
# print(tf)
# ==============================================================================

For CountVectorizer's full API, see the sklearn documentation. The code above keeps terms that appear in at least min_df=2 documents and retains at most the top max_features=1500 terms as features. The fitted tf_vectorizer is saved to disk with joblib, so from the second run onwards it can be loaded directly to avoid recomputation. The resulting tf matrix is a sparse document-term matrix; tf_vectorizer.get_feature_names() returns the term corresponding to each feature dimension (get_feature_names_out() on sklearn >= 1.0).
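
As a quick sanity check (a small sketch reusing the tf and tf_vectorizer variables from the code above), you can inspect the shape of the document-word matrix and a few entries of the vocabulary:

print(tf.shape)  # (n_samples, n_features), e.g. (2000, 1500)
feature_names = tf_vectorizer.get_feature_names()  # get_feature_names_out() on sklearn >= 1.0
print(feature_names[:10])  # the first ten terms in the vocabulary
print(tf[0].toarray())  # term counts of the first document (dense view of one row)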

Training the LDA topic model

Now we finally reach the key stage: training the LDA topic model. Although this step matters the most, if the data quality is high and none of the earlier steps were cut short, it tends to go smoothly; otherwise, problems accumulate and all surface here. Two important preconditions for a good topic model are data quality and text preprocessing. A quick recommendation for pleasant preprocessing packages: jieba for Chinese and spaCy for English. NLTK is used above only because spaCy would not install on this machine.
Back on topic. The LDA training code is below; for the parameters, see the appendix at the end, sklearn LDA API explained.

# 4. Train the LDA topic model
# API: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
from sklearn.decomposition import LatentDirichletAllocation

# ===============================================================================
# Train the model and save it; run this block on the first pass
lda = LatentDirichletAllocation(n_components=20,  # each document is represented as a 20-dimensional topic vector
                                max_iter=200,
                                learning_method='batch',
                                verbose=True)
lda.fit(tf)  # tf is the sparse document-word matrix
joblib.dump(lda, 'D:/saved_model/LDA_sklearn/LDA_sklearn_main.model')
# ===============================================================================
# Load the saved LDA model; on the first run, execute the training above and keep this line commented out
# lda = joblib.load('D:/saved_model/LDA_sklearn/LDA_sklearn_main.model')

print(lda.perplexity(tf))  # check how well the model has converged

Viewing the results

LDA training time varies a lot with max_iter and with how quickly the data converges. For testing, setting max_iter to a few dozen usually finishes quickly; for real applications, at least a thousand iterations are recommended.

texts = [
    "In this morning's TechBytes, we look back at the technology that changes the world in the past decade.\"5,4...\" As we counted down to 2000, fears Y2K would crash the world's computers had many questioning if we become too dependent on technology. Most of us had no idea just how hooked we get.Google was just a few years old then, a simple search engine with a loyal following. A few months later, it would explode into the world's largest. Today, it is the most visited site on the web, with over 1 billion searches everyday.\"The iPod, it's cute.\" MP3 players were nothing new when the first iPod was introduced in the fall of 2001, but this player from Apple was different.\"You can download 1,000 of your favourite songs from your Apple computer in less than 10 minutes.\"TV was revolutionized, too. HDTV, huge flat screens but the most life changing development— TiVo and the DVR. Now we can watch shows on our time and rewind to see something we missed. Today, more than 38 million US households have a DVR.\"People for 2001 are gonna wanna take it on the roads to see something like the Blackberry.""From this to this tiny thing?""Well...\" Little devices called Blackberries became Crackberries. Now, the office is always at your fingertips.And the decade brought friends closer together. Friendster and MySpace got it started, but Facebook took it mainstream.\"It's everyone's, like Santa, like life.\"At first, it was all college kids, but soon their parents and even grandparents followed. Today, Facebook is the second most visited site on the web with 350 million users.That was a look at some of the biggest tech stories of the past decade. For the latest tech news, log on to the technology page of abcnews.com. Those are your TechBytes. I'm Winnie Tanare.",
    "Movement is usually the sign of a person healthy, because only people who love sports will be healthy. I am a love sports, so I was born to now only had a disease. Of the many sports I like table tennis best.Table tennis is a sport, it does not hurt our friendship don't like football, in front of the play is a pair of inseparable friends, when the play is the enemy, the enemy after the play. When playing table tennis, as long as you aim at the ball back and go. If the wind was blowing when playing, curving, touch you, you can only on the day scold: \"it doesn't help me also. If is another person with technical won, you can only blame yourself technology is inferior to him. Table tennis is also a not injured movement, not like basketball, in play when it is pulled down, injured, or the first prize. When playing table tennis, even if be hit will not feel pain. I'm enjoying this movement at the same time, also met many table tennis masters, let my friends every day.",
    "While starting out on a business endeavour, following a set of rules is crucial for finding success.Without proper rules a business can go spiralling down and without taking too long at it. Following are golden rules that will ensure your success in business.Map it outMap where you want to head. Plant goals and results all across that mental map and keep checking it off once you start achieving them one by one.Care for your peoplePeople are your biggest asset. They are the ones who will drive your business to the top. Treat them well and they will treat you well, too.Aim for greatness.Build a great company. Build great services or products. Instil a fun culture at your workplace. Inspire innovation. Inspire your people to keep coming with great ideas, because great ideas bring great changes.Be wary.Keep a close eye on the people who you partner with. It doesn’t mean you have to be sceptical of them. But you shouldn’t naively believe everything you hear. Be smart and keep your eyes and ears open all the time.Commit and stick to it.Once you make a decision, commit to it and follow through. Give it your all. If for some reason that decision doesn’t work, retract, go back to the drawing board and pick an alternate route. In business, you will have to make lots of sacrifices. Be prepared for that. It will all be worth it in the end.Be proactive.Be proactive. Just having goals and not doing anything about them will not get you anywhere. If you don’t act, you will not get the results you’re looking for.Perfect timing.Anticipation is the key to succeed in business. You should have the skills to anticipate changes in the market place and, the changing consumer preferences. You have to keep a tab on all this. Never rest on your past laurels and always look to inject newness into your business processes.Not giving up.That’s the difference between those who succeed and those who don’t. As a businessman you should never give up, no matter what the circumstance. Keep on persevering. You will succeed sooner or later. The key is to never quit trying.Follow these rules and you'll find yourself scaling up the ladder of succcess."]
# Preprocess the texts first, vectorize them with the fitted CountVectorizer, then pass the vectorized texts to the LDA model to get their topic distributions.
processed_texts = []
for text in texts:
    temp = textPrecessing(text)
    processed_texts.append(temp)
vectorizer_texts = tf_vectorizer.transform(processed_texts)
# print(vectorizer_texts)
print(lda.transform(vectorizer_texts))  # document-topic distribution matrix
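
The matrix returned by transform() has one row per document and one column per topic. A quick way to read it (a small sketch using the variables above) is to take the highest-probability topic for each document:

doc_topic = lda.transform(vectorizer_texts)
for doc_idx, dist in enumerate(doc_topic):
    # dist is the topic distribution of one document; argmax gives its dominant topic
    print("Document %d -> topic %d (p=%.3f)" % (doc_idx, dist.argmax(), dist.max()))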


# 5. Results
def print_top_words(model, feature_names, n_top_words):
    # print the highest-weighted terms for each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    # print the topic-word distribution matrix
    print(model.components_)


n_top_words = 20
tf_feature_names = tf_vectorizer.get_feature_names()  # get_feature_names_out() on sklearn >= 1.0
print_top_words(lda, tf_feature_names, n_top_words)

(Optional) Parameter tuning

Parameters that can be tuned

n_components (n_topics in older sklearn versions): the number of topics
n_features: the number of features, i.e. how many frequent terms CountVectorizer keeps
doc_topic_prior: the parameter α of the Dirichlet prior on the per-document topic distribution θd
topic_word_prior: the parameter η of the Dirichlet prior on the per-topic word distribution βk
learning_method: the inference algorithm for LDA, either 'batch' or 'online'
Other parameters provided by sklearn: depending on the inference algorithm, a few more parameters can be tuned; see the appendix at the end, sklearn LDA API explained.

Two practical tuning approaches

Approach 1: taking the number of topics as an example, pick the best model by perplexity. Of course, different topic counts inevitably give different perplexity values, so perplexity can only serve as a reference; the final number of topics still has to be chosen according to the actual application. The tuning code for the number of topics is as follows:

# With the same number of iterations, vary the number of topics
from time import time
docList = []
with open('D:/data/LDA/20newsgroups(2000).txt', 'r') as f:
    for line in f.readlines():
        if line != '':
            docList.append(line.strip())
from sklearn.externals import joblib
tf_vectorizer = joblib.load('D:/saved_model/vectorizer_sklearn/vectorizer_sklearn.model')
tf = tf_vectorizer.transform(docList)
from sklearn.decomposition import LatentDirichletAllocation
n_topics = range(20, 35, 5)
perplexityLst = [1.0]*len(n_topics)
# Train each LDA model and print the training time
lda_models = []
for idx, n_topic in enumerate(n_topics):
    lda = LatentDirichletAllocation(n_components=n_topic,
                                    max_iter=20,
                                    learning_method='batch',
                                    evaluate_every=200,
#                                    perp_tol=0.1, #default
#                                    doc_topic_prior=1/n_topic, #default
#                                    topic_word_prior=1/n_topic, #default
                                    verbose=0)
    t0 = time()
    lda.fit(tf)
    perplexityLst[idx] = lda.perplexity(tf)
    lda_models.append(lda)
    print("# of Topic: %d, " % n_topics[idx])
    print("done in %0.3fs, N_iter %d, " % ((time() - t0), lda.n_iter_))
    print("Perplexity Score %0.3f" % perplexityLst[idx])

# Report the best model
best_index = perplexityLst.index(min(perplexityLst))
best_n_topic = n_topics[best_index]
best_model = lda_models[best_index]
print("Best # of Topic: ", best_n_topic)

import matplotlib.pyplot as plt
import os
# Plot perplexity against the number of topics
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(n_topics, perplexityLst)
ax.set_xlabel("# of topics")
ax.set_ylabel("Approximate Perplexity")
plt.grid(True)
os.makedirs('lda_result', exist_ok=True)  # make sure the output directory exists
plt.savefig(os.path.join('lda_result', 'perplexityTrend.png'))
plt.show()

Approach 2: if you want to tune all the parameters at once, you can also use sklearn's cross-validation directly, but this will certainly be very time-consuming. The code below is only for reference; add or remove parameters according to your own needs.

from sklearn.model_selection import GridSearchCV
parameters = {'learning_method': ('batch', 'online'),
              'n_components': list(range(20, 75, 5)),
              'perp_tol': (0.001, 0.01, 0.1),
              'doc_topic_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
              'topic_word_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
              'max_iter': [1000]}
lda = LatentDirichletAllocation()
model = GridSearchCV(lda, parameters)
model.fit(tf)

sorted(model.cv_results_.keys())
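
After fitting, the best parameter combination and its score can be read off the search object (a small sketch; note that by default GridSearchCV ranks the candidate LDA models by their score() method, i.e. the approximate log-likelihood, where higher is better):

print(model.best_params_)  # the winning parameter combination
print(model.best_score_)   # its mean cross-validated score
best_lda = model.best_estimator_  # the refitted best model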

Appendix: sklearn LDA API explained

class sklearn.decomposition.LatentDirichletAllocation(n_components=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None)

Parameters:
1) n_components (called n_topics in older sklearn versions): the number of latent topics K, which needs tuning. How large K should be depends on how fine-grained a topic split you need. For a coarse split, say only distinguishing animals, plants, and inanimate things, K can be very small, even single-digit. If the goal is fine-grained distinctions, such as telling apart different animals, different plants, and different inanimate things, K needs to be large, perhaps in the thousands or tens of thousands, which in turn requires a very large number of training documents.
2) doc_topic_prior: the parameter α of the Dirichlet prior on the per-document topic distribution θd. If you have no prior knowledge about the topic distribution, the default value 1/K can be used (see the sketch after this parameter list).
3) topic_word_prior: the parameter η of the Dirichlet prior on the per-topic word distribution βk. If you have no prior knowledge, the default value 1/K can be used.
4) learning_method: the inference algorithm for LDA, either 'batch' or 'online'. 'batch' is the variational-inference EM algorithm, while 'online' is its online variant, which splits the training samples into mini-batches and updates the topic-word distribution batch by batch. Choosing 'online' lets you train incrementally with partial_fit. The default used to be 'online' and was changed back to 'batch' in scikit-learn 0.20. If the corpus is small and you are mainly experimenting, 'batch' is recommended, since it has fewer parameters to tune; for very large corpora, 'online' is the better choice.
5) learning_decay: only meaningful when the algorithm is 'online'. It controls the learning rate of the online algorithm and should be in (0.5, 1.0] to guarantee asymptotic convergence. The default is 0.7 and usually does not need changing.
6) learning_offset: only meaningful when the algorithm is 'online'; it must be greater than 1 and down-weights the influence of the early mini-batches on the final model.
7) max_iter: the maximum number of EM iterations.
8) total_samples: only meaningful when the algorithm is 'online'. It is the total number of documents in the corpus and is required when training incrementally with partial_fit.
9) batch_size: only meaningful when the algorithm is 'online'; the number of documents used in each EM iteration (the mini-batch size).
10) mean_change_tol: the threshold for the variational-parameter updates in the E-step; once all updates fall below this threshold the E-step ends and the M-step begins. The default usually does not need changing.
11) max_doc_update_iter: the maximum number of iterations for the variational-parameter updates in the E-step; once this limit is reached the algorithm moves on to the M-step.
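
As an illustration of points 2) and 3) above (a minimal sketch; the prior values here are arbitrary examples, not recommendations), overriding the 1/K defaults looks like this:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=20,
                                doc_topic_prior=0.05,   # alpha: document-topic Dirichlet prior
                                topic_word_prior=0.01,  # eta: topic-word Dirichlet prior
                                learning_method='batch',
                                max_iter=100,
                                random_state=0)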

Methods:
1) fit(X[, y]): train the model on the training data; X is the document-word count matrix.
2) fit_transform(X[, y]): train the model and return the topic distribution of the training data.
3) get_params([deep]): get the parameters.
4) partial_fit(X[, y]): train the model online on mini-batches of data (see the sketch after this list).
5) perplexity(X[, doc_topic_distr, sub_sampling]): compute the approximate perplexity of X.
6) score(X[, y]): compute the approximate log-likelihood of X.
7) set_params(**params): set parameters.
8) transform(X): use the trained model to obtain the topic distribution of each document in X.
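
Since partial_fit is only listed above but never demonstrated, here is a minimal sketch of online training on mini-batches (the mini-batch size and other values are illustrative assumptions; tf is the document-word matrix from earlier):

from sklearn.decomposition import LatentDirichletAllocation

lda_online = LatentDirichletAllocation(n_components=20,
                                       learning_method='online',
                                       learning_offset=50.0,
                                       total_samples=tf.shape[0],  # total number of documents in the corpus
                                       random_state=0)
mini_batch = 500
for start in range(0, tf.shape[0], mini_batch):
    lda_online.partial_fit(tf[start:start + mini_batch])  # update the model one mini-batch at a time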

Link: https://pan.baidu.com/s/1RrckbSNEs1dZB4NItlg07Q
Extraction code: s5p6

Original article: https://www.cnblogs.com/MaggieForest/p/12457093.html