[NLP] Wordcloud Analysis + LDA Topic Classification

Recently in a course, the instructor shared a paper that applies natural language processing (NLP) models to classify research topics in the transportation field. I found it well written and interesting, so I decided to try the approach myself. This post covers two parts: word-cloud visualization with Wordcloud, and topic classification with a Latent Dirichlet Allocation (LDA) model.

Reference: Sun, L., & Yin, Y. (2017). Discovering themes and trends in transportation research using topic modeling. Transportation Research Part C: Emerging Technologies, 77, 49-66. doi:10.1016/j.trc.2017.01.013

This analysis uses the abstracts of 30 transportation papers on "Mobility-as-a-Service (MaaS)" published between 2017 and 2020.

The collected papers

Word-cloud visualization with wordcloud

  1. First, preprocess the text (a consolidated, runnable sketch appears at the end of this section):

    • Remove punctuation and split the sentences on spaces: mytext = re.sub(r'[.,()]', '', mytext).split(' ')  # regex strips the punctuation

    • Lemmatize the words with NLTK's WordNet module:

        from nltk.stem import WordNetLemmatizer
        wnl = WordNetLemmatizer()
        # lemmatize each token, then rejoin the tokens into one string
        mytext = ' '.join(wnl.lemmatize(word) for word in mytext)
      
      **Note: the wordnet corpus must be installed separately, otherwise you will get the following error:** `Resource wordnet not found. Please use the NLTK Downloader to obtain the resource:`

      ![wordnet missing error](wordnet缺失报错.PNG)
      **How to install the wordnet corpus:**
      (1) Open a command prompt (cmd) and start python;
      (2) Enter `import nltk` and then `nltk.download()` (or fetch the corpus directly with `nltk.download('wordnet')`), as shown below:

      ![wordnet download via cmd](wordnet下载cmd.PNG)
      (3) The download may fail with an error saying the remote host forcibly closed the connection. Workarounds: (a) connect through a VPN, or (b) reuse a wordnet package someone else has already downloaded, as described in this post: [Python NLTK WordNet online and manual installation](https://blog.csdn.net/Charchunchiu/article/details/96436736?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task)
      (4) Normally a download window pops up; select `Corpora/wordnet` and download it.
      ![wordnet download window](wordnet下载界面.PNG)
    • Set the stop words

      from wordcloud import STOPWORDS  # wordcloud's built-in stop word list
      additional_stopwords = []
      with open(stopwordsfile, 'r', encoding='UTF-8') as f:
          for line in f:
              additional_stopwords.append(line.replace("\n", ""))
      stop_words = additional_stopwords + list(STOPWORDS)  # add your own stop words on top
  2. Apply the wordcloud package

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    wordcloud = WordCloud(stopwords=stop_words, background_color='white',
                          collocations=True, colormap='viridis',
                          width=2160, height=1080, max_words=50,
                          prefer_horizontal=1).generate(mytext)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Word cloud result
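
Putting the preprocessing steps together, here is a minimal end-to-end sketch. It assumes the abstracts have been saved to a plain-text file; `abstracts.txt` and `stopwords.txt` are hypothetical file names, not from the original post.

    import re
    from nltk.stem import WordNetLemmatizer
    from wordcloud import STOPWORDS

    # read the raw abstracts ('abstracts.txt' is a placeholder path)
    with open('abstracts.txt', 'r', encoding='UTF-8') as f:
        mytext = f.read()

    # strip punctuation with a regex, then split on spaces
    tokens = re.sub(r'[.,()]', '', mytext).split(' ')

    # lemmatize each token and rejoin into a single string
    wnl = WordNetLemmatizer()
    mytext = ' '.join(wnl.lemmatize(tok) for tok in tokens if tok)

    # merge a custom stop word list ('stopwords.txt' is a placeholder) with wordcloud's
    with open('stopwords.txt', 'r', encoding='UTF-8') as f:
        additional_stopwords = [line.strip() for line in f]
    stop_words = additional_stopwords + list(STOPWORDS)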

LDA topic classification

Here I use sklearn's LDA implementation. Building on the preprocessing above, the text still needs to be converted into count vectors before it can be fed into the LDA model for training. The code is as follows:

from sklearn.feature_extraction.text import CountVectorizer  # text-to-count-vector conversion
from sklearn.decomposition import LatentDirichletAllocation as LDA
import matplotlib.pyplot as plt
import time
from matplotlib.ticker import MultipleLocator

# Initialise the count vectorizer with the stop words
count_vectorizer = CountVectorizer(stop_words=stop_words)  # stop words can also be set here
# Fit and transform the preprocessed abstracts
# (mytext should be a list with one preprocessed string per abstract)
count_data = count_vectorizer.fit_transform(mytext)

def print_topics(model, count_vectorizer, n_top_words):
    # on scikit-learn >= 1.0, use get_feature_names_out() instead
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        # argsort trick: take the indices of the n_top_words largest weights
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_topics = range(3, 7)  # candidate numbers of topics
perplexityLst = [1.0] * len(n_topics)

# train an LDA model for each candidate topic number and time the training
lda_models = []
for idx, n_topic in enumerate(n_topics):
    print(idx, n_topic)
    lda = LDA(n_components=n_topic, max_iter=2000, learning_method='batch',
              # perp_tol=0.1,               # default
              # doc_topic_prior=1/n_topic,  # default
              # topic_word_prior=1/n_topic, # default
              verbose=0)
    t0 = time.time()
    lda.fit(count_data)
    perplexityLst[idx] = lda.perplexity(count_data)
    lda_models.append(lda)
    print("# of Topic: %d, " % n_topics[idx])
    print("done in %0.3fs, N_iter %d, " % ((time.time() - t0), lda.n_iter_))
    print("Perplexity Score %0.3f" % perplexityLst[idx])

# report the model with the lowest perplexity
number_words = 10
best_index = perplexityLst.index(min(perplexityLst))
best_n_topic = n_topics[best_index]
best_model = lda_models[best_index]
print("Best # of Topic: ", best_n_topic)
print_topics(best_model, count_vectorizer, number_words)

# plot perplexity against the number of topics
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(list(n_topics), perplexityLst)
ax.xaxis.set_major_locator(MultipleLocator(1))  # integer ticks on the x-axis
ax.set_xlabel("number of topics")
ax.set_ylabel("Approximate Perplexity")
plt.grid(True)
plt.show()

In the end I went with 3 topics. Although 2 topics gave a slightly better perplexity score, I chose 3 to suit my needs.

Perplexity by number of topics

Topic #1 (urban system perspective): model city public business alliance actor efficiency factor system new

Topic #2 (mode operator perspective): user time market adoption road potential intermediary car operator public

Topic #3 (user mode-choice perspective): car public model mode bundle sharing new travel choice demand
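
To check which topic each abstract leans toward, the fitted model's `transform` method returns the per-document topic distribution (standard scikit-learn API). This is a small sketch I added, not part of the original code:

    # rows of doc_topic are per-document topic distributions (each row sums to 1)
    doc_topic = best_model.transform(count_data)
    for doc_idx, dist in enumerate(doc_topic):
        # report the dominant topic and its probability for each abstract
        print("Doc %d -> Topic #%d (p=%0.2f)" % (doc_idx, dist.argmax(), dist.max()))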

That's all for this post. For the detailed parameter settings, please refer to the official documentation.
