How do I print the LDA topic models from gensim? Python

With gensim I was able to extract topics from a set of documents using LSA, but how do I access the topics generated by the LDA models?

Printing lda.print_topics(10) in the code below raised the following error, because print_topics() returns None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

Code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# I can print out the topics for LSA
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus]

for l,t in izip(corpus_lsi,corpus):
  print l,"#",t
print
for top in lsi.print_topics(2):
  print top

# I can print out the documents and which is the most probable topics for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
corpus_lda = lda[corpus]

for l,t in izip(corpus_lda,corpus):
  print l,"#",t
print

# But I am unable to print out the topics, how should i do it?
for top in lda.print_topics(10):
  print top

Something is missing from your code, namely the computation of corpus_tfidf. Could you please add the remaining piece?

Author: alvas | 2013-02-22

14

After some experimenting, it seems that print_topics(numoftopics) for the ldamodel has a bug. My workaround is to use print_topic(topicid):
```
>>> print lda.print_topics()
None
>>> for i in range(lda.num_topics):
...     print lda.print_topic(i)
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...
```
print_topics is an alias for show_topics with the first five topics. Just write lda.show_topics(); no print is needed.
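
The workaround above can be wrapped in a small helper. This is only a sketch: `topic_strings` is a hypothetical name, and it assumes the model exposes `num_topics` and `print_topic(topic_id, topn)` as used in this thread. `range(model.num_topics)` covers the ids 0 through `num_topics - 1`, so no topic is skipped.

```python
def topic_strings(model, topn=10):
    """Collect the print_topic() string for every topic id of the model.

    `model` is assumed to be a trained gensim LDA model (or any object with
    a `num_topics` attribute and a `print_topic(topic_id, topn)` method).
    """
    return [model.print_topic(topic_id, topn)
            for topic_id in range(model.num_topics)]


# Usage, assuming `lda` is the model from the question:
# for line in topic_strings(lda, topn=10):
#     print(line)
```

Returning the strings instead of printing them keeps the helper independent of the Python 2 versus Python 3 print syntax.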

Author: alvas
7

I think the syntax of show_topics has changed over time:
```
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
```
For num_topics topics, return the num_words most significant words (10 words per topic by default).

The topics are returned as a list: a list of strings if formatted is True, or a list of (probability, word) 2-tuples if it is False.

If log is True, the result is also written to the log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned subset of num_topics <= self.num_topics topics is therefore arbitrary and may change between two LDA training runs.
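
Because the return layout differs between the answers in this thread, a small adapter can help when code has to cope with both. The sketch below assumes the two layouts that appear here for `formatted=False`: a list of (probability, word) pairs per topic, and (topic_id, [(word, probability), ...]) tuples. `normalize_topics` is a hypothetical helper, not part of gensim.

```python
def normalize_topics(raw_topics):
    """Return [(topic_id, [(word, probability), ...]), ...] for either layout.

    For the older layout there is no topic id in the data, so the position
    in the returned list is used instead (an assumption; it is not
    necessarily the model's own topic id).
    """
    normalized = []
    for position, item in enumerate(raw_topics):
        if isinstance(item[0], tuple):
            # Older layout: item is a list of (probability, word) pairs.
            pairs = [(word, prob) for prob, word in item]
            normalized.append((position, pairs))
        else:
            # Newer layout: item is (topic_id, [(word, probability), ...]).
            topic_id, pairs = item
            normalized.append((int(topic_id), list(pairs)))
    return normalized


# Usage, assuming `lda` is a trained model:
# for topic_id, pairs in normalize_topics(lda.show_topics(formatted=False)):
#     print(topic_id, [word for word, _ in pairs])
```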

Author: user2597000
6

Are you using any logging? print_topics prints to the log file, as stated in the docs.

As @mac389 says, lda.show_topics() is the way to go for printing to the screen.

I'm not using any logging, because I need the topics immediately. You're right, lda.show_topics() or lda.print_topic(i) is the way to go.
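
If the topics are written through Python's logging machinery, they only become visible once logging is configured. The following is a minimal sketch using only the standard library; it assumes the topic lines are logged at INFO level, and `enable_info_logging` is a hypothetical helper name.

```python
import logging


def enable_info_logging():
    """Send INFO-level log records to stderr via the root logger."""
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
        force=True,  # replace previously installed handlers (Python 3.8+)
    )


enable_info_logging()
# With logging enabled, a call such as lda.print_topics(10) should show its
# topic lines in the log output (assuming `lda` is a trained LdaModel).
```

This does not change what print_topics() returns; it only makes the logged lines visible.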

Author: zanbri

You can use:

for i in  lda_model.show_topics():
    print i[0], i[1]

Author: xu2mao

Here is sample code to print topics:

import pickle
from gensim import corpora, models

def ExtractTopics(filename, numTopics=5):
    # filename is a pickle file where I have lists of lists containing bag of words
    texts = pickle.load(open(filename, "rb"))

    # generate dictionary
    dict = corpora.Dictionary(texts)

    # remove words with low freq.  3 is an arbitrary number I have picked here
    low_occerance_ids = [tokenid for tokenid, docfreq in dict.dfs.iteritems() if docfreq == 3]
    dict.filter_tokens(low_occerance_ids)
    dict.compactify()
    corpus = [dict.doc2bow(t) for t in texts]
    # Generate LDA Model
    lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)
    i = 0
    # We print the topics
    for topic in lda.show_topics(num_topics=numTopics, formatted=False, topn=20):
        i = i + 1
        print "Topic #" + str(i) + ":",
        for p, id in topic:
            print dict[int(id)],

        print ""

I tried your code, passing a list of bag-of-words texts. I get the following error: TypeError: show_topics() got an unexpected keyword argument 'topics'
Try num_topics. I have corrected the code above.

Author: Shirish Kumar

I think it is still helpful to see the topics as a list of words. The following code snippet helps achieve that. I assume you already have an LDA model called lda_model.

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

In the code above, I chose to show the first 30 words belonging to each topic. For simplicity, I show only the first topics I get.

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

I don't really like the way the topics above look, so I usually change my code as shown:

for idx, topic in lda_model.show_topics(formatted=False, num_words= 30):
    print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))

... and the output (first 2 topics shown) will look like this:

Topic: 0 
Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
Topic: 1 
Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head

I'm sorry my answer comes rather late, but I would like to know what you think.

Author: Nde Samuel Mbah

0

I recently ran into a similar problem while working with Python 3 and Gensim 2.3.0. print_topics() and show_topics() did not raise any error, but they did not print anything either. It turns out that show_topics() returns a list, so you can simply do:
```
topic_list = lda.show_topics()
print(topic_list)
```
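
Printing the whole list at once can be hard to read. As a sketch, the list can be turned into one line per topic, assuming each entry is a (topic_id, description) pair, which is the shape the `i[0], i[1]` answer above relies on. `describe_topics` is a hypothetical helper, and the sample data is hand-written for illustration rather than real model output.

```python
def describe_topics(topic_list):
    """Turn [(topic_id, description), ...] into printable lines."""
    return ["Topic {}: {}".format(topic_id, description)
            for topic_id, description in topic_list]


# Hand-written sample in the assumed shape; with a real model this would be
# topic_list = lda.show_topics()
sample = [(0, '0.083*"response" + 0.083*"interface"'),
          (1, '0.100*"graph" + 0.100*"trees"')]
for line in describe_topics(sample):
    print(line)
```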
Author: Maneet

You can also export the top terms of each topic to a CSV file. topn controls how many words to export for each topic.

import pandas as pd

top_words_per_topic = []
for t in range(lda_model.num_topics):
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn = 5)])

pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv")

The CSV file has the following format:

Topic Word  P  
0     w1    0.004437  
0     w2    0.003553  
0     w3    0.002953  
0     w4    0.002866  
0     w5    0.008813  
1     w6    0.003393  
1     w7    0.003289  
1     w8    0.003197 
...
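
If pandas is not available, roughly the same export can be sketched with the standard library csv module. This assumes, as in the pandas code above, that `show_topic(topic_id, topn=...)` yields (word, probability) pairs; `write_top_words` is a hypothetical helper name.

```python
import csv


def write_top_words(model, out_file, topn=5):
    """Write one (topic, word, probability) row per top word of each topic.

    `model` is assumed to provide `num_topics` and `show_topic(topic_id, topn)`;
    `out_file` is any writable text file object.
    """
    writer = csv.writer(out_file)
    writer.writerow(["Topic", "Word", "P"])
    for topic_id in range(model.num_topics):
        for word, prob in model.show_topic(topic_id, topn=topn):
            writer.writerow([topic_id, word, prob])


# Usage, assuming `lda_model` is a trained model:
# with open("top_words.csv", "w", newline="") as handle:
#     write_top_words(lda_model, handle, topn=5)
```

Unlike the pandas version, this writes no extra index column.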

Author: Feng Mai
