How to use pos_tag in NLTK?

So I was trying to tag a bunch of words in a list (POS tagging, to be exact) like so:

pos = [nltk.pos_tag(i,tagset='universal') for i in lw]

where lw is a list of words (it is really long or I would have posted it, but it looks like [['hello'],['world']], i.e. a list of lists, each containing one word). When I try to run it, I get:

Traceback (most recent call last):
  File "<pyshell#183>", line 1, in <module>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "<pyshell#183>", line 1, in <listcomp>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 134, in pos_tag
    return _pos_tag(tokens, tagset, tagger)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 102, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 240, in normalize
    elif word[0].isdigit():
IndexError: string index out of range

Can anyone tell me why I am getting this error and how to fix it? Thank you.

Author: EighteenthVariable | 2017-11-27

First, use readable variable names, it helps =)

Next, the input to pos_tag is a list of strings, where each list is one tokenized sentence. So it looks like this:

>>> from nltk import pos_tag
>>> sentences = [ ['hello', 'world'], ['good', 'morning'] ]
>>> [pos_tag(sent) for sent in sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]
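
Applied to the data in the question, where every inner list holds a single word, one option is to flatten the nested list into one token list before tagging, so the tagger sees each word together with its neighbours. This is only a sketch, assuming the words belong to one running text; flatten_one_word_lists is a hypothetical helper name, not part of NLTK.

```python
from itertools import chain


def flatten_one_word_lists(list_of_lists):
    """Turn [['hello'], ['world']] into ['hello', 'world'].

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    return list(chain.from_iterable(list_of_lists))


lw = [['hello'], ['world']]  # same shape as the list in the question
words = flatten_one_word_lists(lw)

# The flat list can then be tagged in a single call, for example:
#   nltk.pos_tag(words, tagset='universal')
```

If the words are unrelated to each other, tagging them one list at a time, as in the question, may still be the better choice.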

Also, if your input is raw strings, you can apply word_tokenize before pos_tag:

>>> from nltk import pos_tag, word_tokenize
>>> a_sentence = 'hello world'
>>> word_tokenize(a_sentence)
['hello', 'world']
>>> pos_tag(word_tokenize(a_sentence))
[('hello', 'NN'), ('world', 'NN')]

>>> two_sentences = ['hello world', 'good morning']
>>> [word_tokenize(sent) for sent in two_sentences]
[['hello', 'world'], ['good', 'morning']]
>>> [pos_tag(word_tokenize(sent)) for sent in two_sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]

And if your sentences are in a paragraph, you can use sent_tokenize to split the paragraph into sentences first.

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Hello world. Good morning."
>>> sent_tokenize(text)
['Hello world.', 'Good morning.']
>>> [word_tokenize(sent) for sent in sent_tokenize(text)]
[['Hello', 'world', '.'], ['Good', 'morning', '.']]
>>> [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
[[('Hello', 'NNP'), ('world', 'NN'), ('.', '.')], [('Good', 'JJ'), ('morning', 'NN'), ('.', '.')]]

See also: How to do POS tagging using the NLTK POS tagger in Python?

Thanks for the answer, and it works. The only problem is that I also asked why this was happening. But I appreciate your answer nonetheless.
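
On the "why": the last frame of the traceback is word[0].isdigit() inside normalize, and indexing position 0 raises IndexError only when word is an empty string. So a likely cause, though the full lw is not shown, is that at least one inner list contains ''. The sketch below reproduces that expression in plain Python and filters empty tokens before tagging; drop_empty_tokens and the sample data are hypothetical and only stand in for the real list.

```python
def drop_empty_tokens(sentences):
    """Remove empty-string tokens; sentences left with no tokens are dropped.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    cleaned = []
    for sent in sentences:
        tokens = [tok for tok in sent if tok]  # '' is falsy, so it is skipped
        if tokens:
            cleaned.append(tokens)
    return cleaned


# The expression from the traceback fails the same way on an empty string.
try:
    ''[0].isdigit()
except IndexError as exc:
    error_message = str(exc)

lw = [['hello'], [''], ['world']]  # assumed shape, with one empty token
clean_lw = drop_empty_tokens(lw)

# clean_lw can then be tagged as in the question:
#   [nltk.pos_tag(sent, tagset='universal') for sent in clean_lw]
```

Note that dropping entries shifts the positions in the result relative to the original list, so keep track of the indices if that alignment matters.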

Author: alvas

A common function for tagging a document with POS tags:

import nltk

def get_pos(text):
    # Split the raw string into word tokens, then tag each token.
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

get_pos(sentence)  # sentence: any raw text string

Hope this helps!

Author: Vivek Ananthan

If your input is raw strings, you can apply word_tokenize before pos_tag, and then filter the result by tag, for example to keep only the nouns:

import nltk

is_noun = lambda pos: pos[:2] == 'NN'

lines = 'You can never plan the future by the past'

lines = lines.lower()
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

print(nouns) # ['future', 'past']
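
The same filtering idea works on any pos_tag result, since the output is just a list of (word, tag) pairs. As a small sketch, the helper below selects words by tag prefix; words_with_tag_prefix is a hypothetical name, and the sample pairs are copied from the tagged output shown earlier in this thread so that the example runs without NLTK.

```python
def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with `prefix`.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# (word, tag) pairs as they appear in the pos_tag output quoted above.
tagged = [
    ('Hello', 'NNP'), ('world', 'NN'), ('.', '.'),
    ('Good', 'JJ'), ('morning', 'NN'), ('.', '.'),
]

nouns = words_with_tag_prefix(tagged, 'NN')       # matches NN and NNP
adjectives = words_with_tag_prefix(tagged, 'JJ')
```

With these pairs, nouns is ['Hello', 'world', 'morning'] and adjectives is ['Good'].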

Author: MOHA OUSAID SAAID
