ValueError: empty vocabulary; perhaps the documents only contain stop words

I am using the scikit-learn library for the first time, and I get this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words
File "C:\Users\A605563\Desktop\velibProjetPreso\TraitementTwitterDico.py", line 33, in <module>
X_train_counts = count_vect.fit_transform(FileTweets)
File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only contain stop words

But I don't understand why this happens.

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy
import unicodedata
import nltk


TweetsFile = open('tweets2015-08-13.csv', 'r+')
f2 = open('analyzer.txt', 'a')
print TweetsFile.readline()
count_vect = CountVectorizer(strip_accents='ascii')
FileTweets =  TweetsFile.read()
FileTweets = FileTweets.decode('latin1')
FileTweets = unicodedata.normalize('NFKD', FileTweets).encode('ascii','ignore')
print FileTweets
for line in TweetsFile:
    f2.write(line.replace('\n', ' '))
TweetsFile = f2
print type(FileTweets)
X_train_counts = count_vect.fit_transform(FileTweets)
print X_train_counts.shape
TweetsFile.close()

My data are raw tweets:

11/8/2015 @ Paris Marriott Champs Elysees Hotel "
2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"
2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France <https://t.co/dtPALvziWy>"
2015-08-11 21:22:11,    . @ Avenue des Champs-Elysees <https://t.co/8b7U05OAxG>
2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>
2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees <https://t.co/R68JV3NT1z>"

Does anyone know what is going on here?

I'm not familiar with the library, but shouldn't you be passing a file or some other parameter to CountVectorizer(strip_accents='ascii')?
That line only initializes the CountVectorizer. I think the problem comes from my data structure, but I'm not sure. When I use a short list of tweets (written directly in my code), the program works...
I suspect that FileTweets is empty when you run count_vect.fit_transform(FileTweets). Could you show us what FileTweets looks like?
When I print FileTweets I get: 11/8/2015 @ Paris Marriott Champs Elysees Hotel " 2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <t.co/gAFspVw6FC>" 2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France <t.co/dtPALvziWy>" 2015-08-11 21:22:11, . @ Avenue des Champs-Elysees <t.co/8b7U05OAxG> 2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <t.co/le9l3dtdgM> 2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees <t.co/R68JV3NT1z>" This is a short excerpt.
Hmm, then punctuation may be the problem. Try removing all ' and ". I just ran it on your output and it worked fine for me, although I had to remove all the quotation marks.
That is not all of the data; see the link.
Here is the data, sorry: <docs.google.com/document/d/...>
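One possible explanation for the error, sketched with the standard library only: FileTweets is a single string, and iterating over a string yields one character at a time. If the installed scikit-learn version treats each of those characters as a document (an assumption about the old release in the traceback; newer releases reportedly reject a bare string with a different error), then no "document" can match CountVectorizer's default token pattern, which requires at least two word characters, and the vocabulary ends up empty. The helper build_vocabulary below is hypothetical and only imitates the tokenizing step; it is not the library's implementation.

```python
import re

# Default token_pattern of CountVectorizer: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(documents):
    """Hypothetical helper: collect lowercase tokens from an iterable of documents."""
    vocab = set()
    for doc in documents:
        vocab.update(TOKEN_PATTERN.findall(doc.lower()))
    return vocab


# Two tweets taken from the question, joined into one string as read() would return them.
raw = (
    "2015-08-11 21:22:11,    . @ Avenue des Champs-Elysees\n"
    "2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official)\n"
)

# Iterating over a str yields single characters, so every "document" is one character long.
vocab_from_string = build_vocabulary(raw)

# Iterating over the lines yields whole tweets, so tokens are found.
vocab_from_lines = build_vocabulary(raw.splitlines())

print(len(vocab_from_string))
print(sorted(vocab_from_lines))
```

If this is indeed the cause, passing a list of lines (one tweet per element) instead of the whole file content should avoid the error, which would also match the observation that a short list of tweets works.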

Source author: Honolulu | 2015-08-26

I found a solution; here is the code:

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import unicodedata
import nltk 
import StringIO


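TweetsFile = open('tweets2015-08-13.csv','r+')
# Split each CSV line on commas: one list of fields per tweet
yourResult = [line.split(',') for line in TweetsFile.readlines()]
# input="file" makes the vectorizer call read() on every document
count_vect = CountVectorizer(input="file")
# Wrap each entry in a file-like object, so one entry is one document
docs_new = [ StringIO.StringIO(x) for x in yourResult ]
X_train_counts = count_vect.fit_transform(docs_new)
vocab = count_vect.get_feature_names()
print X_train_counts.shape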
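
For readers on Python 3, the same idea (one document per tweet) could be sketched with the csv module instead of a manual split(','), so that commas inside quoted tweets are not treated as separators. This is only a sketch: the two-column layout (timestamp, text) is an assumption inferred from the sample rows in the question, and read_tweet_texts is a hypothetical helper name.

```python
import csv
import io

# Sample rows copied from the question; a real script would open the CSV file instead.
sample = io.StringIO(
    '2015-08-11 21:27:15,"I\'m at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"\n'
    '2015-08-11 21:22:11,    . @ Avenue des Champs-Elysees <https://t.co/8b7U05OAxG>\n'
    '2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>\n'
)


def read_tweet_texts(handle):
    """Hypothetical helper: return the text part of each row, skipping rows without one."""
    texts = []
    for row in csv.reader(handle):
        if len(row) >= 2:
            # Rejoin the remaining fields in case an unquoted tweet contained commas.
            texts.append(",".join(row[1:]).strip())
    return texts


texts = read_tweet_texts(sample)
print(len(texts))
print(texts[0])
```

The resulting list of strings could then be passed to count_vect.fit_transform(texts) with a default CountVectorizer, since fit_transform accepts any iterable of text documents.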

Source author: Honolulu

This is a much simpler solution:

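from sklearn.feature_extraction.text import CountVectorizer

# Iterating over an open file yields one line at a time, so each line
# is treated as a separate document (the default input='content').
x = open('bad_words_train.txt', 'r+')
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(x)
print(X_train)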

Source author: LoveMeow
