What to download in order to make nltk.tokenize.word_tokenize work?

I am going to use nltk.tokenize.word_tokenize on a cluster where my account is tightly limited by a disk space quota. At home, I downloaded all NLTK resources with nltk.download(), but as I found out, that takes ~2.5 GB.

This seems like overkill to me. Could you suggest the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize? So far I have seen nltk.download('punkt'), but I am not sure whether it is sufficient or how large it is. What exactly should I run in order to make it work?

Slightly unrelated, but you might want to check out spaCy as an alternative to NLTK.

Source author: petrbel | 2016-05-08

nltk python


You are right. You need the Punkt Tokenizer Models. The package is 13 MB, and nltk.download('punkt') should do the trick.
- Also, if you run nltk.download() without arguments, the NLTK Downloader (a GUI application) should open, so you can browse all packages.
- Or use the terminal: python -m nltk.downloader 'punkt'. Note also that the 13 MB is the zip file; the unpacked result is ~36 MB.
Source author: Tulio Casagrande
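Since the concern in the question is a disk quota, it can help to measure what a download actually occupies. The following is a minimal sketch using only the standard library; the helper names `dir_size_bytes` and `format_mb` are made up for illustration, and `~/nltk_data` is assumed to be the data directory (it is the first location listed in the search paths of the error message further down). It sums apparent file sizes, which can differ slightly from what the quota system counts.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that linked data is not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def format_mb(num_bytes):
    """Format a byte count as megabytes with one decimal place."""
    return "{:.1f} MB".format(num_bytes / (1024 * 1024))


if __name__ == "__main__":
    # Assumed location; adjust if your NLTK data lives elsewhere.
    data_dir = os.path.expanduser("~/nltk_data")
    if os.path.isdir(data_dir):
        print(data_dir, format_mb(dir_size_bytes(data_dir)))
    else:
        print(data_dir, "does not exist")
```

Running this before and after a download shows how much of the quota a single package consumes.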

In short:

nltk.download('punkt')

would be sufficient.
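On a cluster it may be useful to control where the data ends up and to avoid downloading it twice. Below is a hedged sketch: the function name `ensure_tokenizer_data` and the example directory are hypothetical, while `nltk.download(..., download_dir=...)`, `nltk.data.path` and `nltk.data.find` are existing NLTK calls. The resource name is a parameter because it is an assumption here: the thread uses 'punkt', and other NLTK versions may expect a different resource name.

```python
import os


def ensure_tokenizer_data(download_dir, resource="punkt"):
    """Make sure an NLTK tokenizer resource is available in download_dir.

    Returns True if the resource is already present or was downloaded.
    """
    # Imported here so that defining the function does not require NLTK.
    import nltk

    os.makedirs(download_dir, exist_ok=True)
    # Tell NLTK to search this directory in addition to its defaults.
    if download_dir not in nltk.data.path:
        nltk.data.path.append(download_dir)
    try:
        nltk.data.find("tokenizers/" + resource)
        return True
    except LookupError:
        # Only this one package is fetched, not the full ~2.5 GB collection.
        return nltk.download(resource, download_dir=download_dir, quiet=True)


# Example call (hypothetical path, needs network access on first run):
# ensure_tokenizer_data(os.path.expanduser("~/scratch/nltk_data"))
```

Appending to `nltk.data.path` has to happen in every process that uses the tokenizer, which is why the sketch does it on each call rather than only when downloading.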

In long:

You don't need to download all the models and corpora available in NLTK if you are only going to use NLTK for tokenization.

Actually, if you are only using word_tokenize(), you would not expect to need any of the resources from nltk.download(). Looking at the code, the default word_tokenize() is basically the TreebankWordTokenizer, which shouldn't use any additional resources:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']

But:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

But it looks like that's not the case if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93. It seems that word_tokenize implicitly calls sent_tokenize(), which requires the punkt model.

I am not sure whether this is a bug or a feature, but it seems the old idiom may be outdated given the current code:

>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]

It can simply be:

>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

But we see that word_tokenize() flattens the list of lists of strings into a single list of strings.
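To make the relationship between the two results concrete, here is a small sketch that needs no NLTK at all. It takes the per-sentence output shown above as literal data and flattens it with `itertools.chain`, which reproduces the flat list shown for `word_tokenize(sentences)` in this example. The variable name `flat_tokens` is only illustrative.

```python
from itertools import chain

# Per-sentence tokens, copied from the sent_tokenize/word_tokenize idiom above.
tokenized_sents = [
    ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'],
    ['This', 'is', 'another', 'sentence', '.'],
]

# Flatten one level: list of lists of strings -> list of strings.
flat_tokens = list(chain.from_iterable(tokenized_sents))
print(flat_tokens)
```

If sentence boundaries matter downstream, keeping the nested form from the older idiom preserves them; the flat form discards that information.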

Alternatively, you can try a new tokenizer that was added to NLTK, toktok.py, based on https://github.com/jonsafari/tok-tok, which requires no pre-trained models.
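A minimal sketch of using that tokenizer follows. `ToktokTokenizer` and its `tokenize` method come from `nltk.tokenize.toktok`; the wrapper name `toktok_words` is made up for this example. Note that it is a different tokenizer from word_tokenize(), so its output is not guaranteed to match token for token.

```python
def toktok_words(text):
    """Tokenize text with NLTK's ToktokTokenizer, which needs no downloaded data."""
    # Imported here so that defining the function does not require NLTK.
    from nltk.tokenize.toktok import ToktokTokenizer

    tokenizer = ToktokTokenizer()
    return tokenizer.tokenize(text)


# Example call (requires NLTK to be installed, but no nltk.download()):
# toktok_words('This is a sentence.')
```

Because no resource lookup is involved, this should work even with an empty or missing nltk_data directory.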

Source author: alvas
