[NLP] Wordcloud Analysis + LDA Topic Classification

Recently in a course, the instructor shared a paper that applies natural language processing (NLP) models to classify research topics in the transportation field. I found it well written and interesting, so I decided to try the approach myself. This post covers two parts: word-cloud visualization with Wordcloud, and topic classification with a Latent Dirichlet Allocation (LDA) model.

Reference: Sun, L., & Yin, Y. (2017). Discovering themes and trends in transportation research using topic modeling. Transportation Research Part C: Emerging Technologies, 77, 49-66. doi:10.1016/j.trc.2017.01.013

This analysis uses the abstracts of 30 transportation papers on "Mobility-as-a-Service (MaaS)" published between 2017 and 2020.

The collected papers

Word-cloud visualization with wordcloud

  1. First, preprocess the text (a consolidated, runnable sketch appears at the end of this section):

    • Remove punctuation and split the sentences on spaces: mytext = re.sub(r'[.,()]', '', mytext).split(' ')  # regex strips the punctuation

    • Lemmatize the words with NLTK's WordNet module:

        from nltk.stem import WordNetLemmatizer
        wnl = WordNetLemmatizer()
        # lemmatize each token, then rejoin the tokens into one string
        mytext = ' '.join(wnl.lemmatize(word) for word in mytext)
      
      **Note: the wordnet corpus must be installed separately, otherwise you will get the following error:** `Resource wordnet not found. Please use the NLTK Downloader to obtain the resource:`

      ![wordnet missing error](wordnet缺失报错.PNG)
      **How to install the wordnet corpus:**
      (1) Open a command prompt (cmd) and start python;
      (2) Enter `import nltk` and then `nltk.download()` (or fetch the corpus directly with `nltk.download('wordnet')`), as shown below:

      ![wordnet download via cmd](wordnet下载cmd.PNG)
      (3) The download may fail with an error saying the remote host forcibly closed the connection. Workarounds: (a) connect through a VPN, or (b) reuse a wordnet package someone else has already downloaded, as described in this post: [Python NLTK WordNet online and manual installation](https://blog.csdn.net/Charchunchiu/article/details/96436736?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task)
      (4) Normally a download window pops up; select `Corpora/wordnet` and download it.
      ![wordnet download window](wordnet下载界面.PNG)
    • Set the stop words

      from wordcloud import STOPWORDS  # wordcloud's built-in stop word list
      additional_stopwords = []
      with open(stopwordsfile, 'r', encoding='UTF-8') as f:
          for line in f:
              additional_stopwords.append(line.replace("\n", ""))
      stop_words = additional_stopwords + list(STOPWORDS)  # add your own stop words on top
  2. Apply the wordcloud package

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    wordcloud = WordCloud(stopwords=stop_words, background_color='white',
                          collocations=True, colormap='viridis',
                          width=2160, height=1080, max_words=50,
                          prefer_horizontal=1).generate(mytext)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

Word cloud result
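
Putting the preprocessing steps together, here is a minimal end-to-end sketch. It assumes the abstracts have been saved to a plain-text file; `abstracts.txt` and `stopwords.txt` are hypothetical file names, not from the original post.

    import re
    from nltk.stem import WordNetLemmatizer
    from wordcloud import STOPWORDS

    # read the raw abstracts ('abstracts.txt' is a placeholder path)
    with open('abstracts.txt', 'r', encoding='UTF-8') as f:
        mytext = f.read()

    # strip punctuation with a regex, then split on spaces
    tokens = re.sub(r'[.,()]', '', mytext).split(' ')

    # lemmatize each token and rejoin into a single string
    wnl = WordNetLemmatizer()
    mytext = ' '.join(wnl.lemmatize(tok) for tok in tokens if tok)

    # merge a custom stop word list ('stopwords.txt' is a placeholder) with wordcloud's
    with open('stopwords.txt', 'r', encoding='UTF-8') as f:
        additional_stopwords = [line.strip() for line in f]
    stop_words = additional_stopwords + list(STOPWORDS)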

LDA topic classification

Here I use sklearn's LDA implementation. Building on the preprocessing above, the text still needs to be converted into count vectors before it can be fed into the LDA model for training. The code is as follows:

from sklearn.feature_extraction.text import CountVectorizer  # text-to-count-vector conversion
from sklearn.decomposition import LatentDirichletAllocation as LDA
import matplotlib.pyplot as plt
import time
from matplotlib.ticker import MultipleLocator

# Initialise the count vectorizer with the stop words
count_vectorizer = CountVectorizer(stop_words=stop_words)  # stop words can also be set here
# Fit and transform the preprocessed abstracts
# (mytext should be a list with one preprocessed string per abstract)
count_data = count_vectorizer.fit_transform(mytext)

def print_topics(model, count_vectorizer, n_top_words):
    # on scikit-learn >= 1.0, use get_feature_names_out() instead
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        # argsort trick: take the indices of the n_top_words largest weights
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_topics = range(3, 7)  # candidate numbers of topics
perplexityLst = [1.0] * len(n_topics)

# train an LDA model for each candidate topic number and time the training
lda_models = []
for idx, n_topic in enumerate(n_topics):
    print(idx, n_topic)
    lda = LDA(n_components=n_topic, max_iter=2000, learning_method='batch',
              # perp_tol=0.1,               # default
              # doc_topic_prior=1/n_topic,  # default
              # topic_word_prior=1/n_topic, # default
              verbose=0)
    t0 = time.time()
    lda.fit(count_data)
    perplexityLst[idx] = lda.perplexity(count_data)
    lda_models.append(lda)
    print("# of Topic: %d, " % n_topics[idx])
    print("done in %0.3fs, N_iter %d, " % ((time.time() - t0), lda.n_iter_))
    print("Perplexity Score %0.3f" % perplexityLst[idx])

# report the model with the lowest perplexity
number_words = 10
best_index = perplexityLst.index(min(perplexityLst))
best_n_topic = n_topics[best_index]
best_model = lda_models[best_index]
print("Best # of Topic: ", best_n_topic)
print_topics(best_model, count_vectorizer, number_words)

# plot perplexity against the number of topics
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(list(n_topics), perplexityLst)
ax.xaxis.set_major_locator(MultipleLocator(1))  # integer ticks on the x-axis
ax.set_xlabel("number of topics")
ax.set_ylabel("Approximate Perplexity")
plt.grid(True)
plt.show()

In the end I went with 3 topics. Although 2 topics gave a slightly better perplexity score, I chose 3 to suit my needs.

Perplexity by number of topics

Topic #1 (urban system perspective): model city public business alliance actor efficiency factor system new

Topic #2 (mode operator perspective): user time market adoption road potential intermediary car operator public

Topic #3 (user mode-choice perspective): car public model mode bundle sharing new travel choice demand
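
To check which topic each abstract leans toward, the fitted model's `transform` method returns the per-document topic distribution (standard scikit-learn API). This is a small sketch I added, not part of the original code:

    # rows of doc_topic are per-document topic distributions (each row sums to 1)
    doc_topic = best_model.transform(count_data)
    for doc_idx, dist in enumerate(doc_topic):
        # report the dominant topic and its probability for each abstract
        print("Doc %d -> Topic #%d (p=%0.2f)" % (doc_idx, dist.argmax(), dist.max()))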

That's all for this post. For the detailed parameter settings, please refer to the official documentation.
