Spark - word count test

I just want to count the words in Spark (pyspark), but I can only map either the individual letters or the whole string.

I tried:
(whole string)

v1='Hi hi hi bye bye bye word count' 
v1_temp=sc.parallelize([v1]) 
v1_map = v1_temp.flatMap(lambda x: x.split('\t'))
v1_counts = v1_map.map(lambda x: (x, 1))
v1_counts.collect()

or (letters only)

v1='Hi hi hi bye bye bye word count'
v1_temp=sc.parallelize(v1)
v1_map = v1_temp.flatMap(lambda x: x.split('\t'))
v1_counts = v1_map.map(lambda x: (x, 1))
v1_counts.collect()

Well, the problem here is not with Spark. You are trying to split on a tab with split('\t'), whereas what you need is simply to call split().
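To see the difference outside Spark, here is a minimal plain-Python sketch (no SparkContext needed) that compares the two calls on the string from the question. The variable names by_tab and by_whitespace are only illustrative.

```python
v1 = 'Hi hi hi bye bye bye word count'

# The string contains no tab character, so splitting on '\t'
# returns a single-element list holding the whole sentence.
by_tab = v1.split('\t')

# split() with no argument splits on any run of whitespace.
by_whitespace = v1.split()

print(by_tab)         # ['Hi hi hi bye bye bye word count']
print(by_whitespace)  # ['Hi', 'hi', 'hi', 'bye', 'bye', 'bye', 'word', 'count']
```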

Author: Vinicius | 2015-01-16


When you call sc.parallelize(sequence), you create an RDD whose elements will be operated on in parallel. In the first case your sequence is a list with a single element (the whole sentence). In the second case your sequence is a string, which in Python behaves much like a list of characters.
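As a rough illustration in plain Python, assuming parallelize simply iterates over whatever collection it is given, these are the elements each of the two calls from the question would end up distributing. The names as_list and as_chars are made up for this sketch.

```python
v1 = 'Hi hi hi bye bye bye word count'

# Elements for sc.parallelize([v1]): one element, the whole sentence.
as_list = list([v1])

# Elements for sc.parallelize(v1): iterating a str yields single characters.
as_chars = list(v1)

print(len(as_list))   # 1
print(as_chars[:5])   # ['H', 'i', ' ', 'h', 'i']
```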

If you want to count the words in parallel, you could do:
```
from operator import add

s = 'Hi hi hi bye bye bye word count' 
seq = s.split()   # ['Hi', 'hi', 'hi', 'bye', 'bye', 'bye', 'word', 'count']
sc.parallelize(seq)\
  .map(lambda word: (word, 1))\
  .reduceByKey(add)\
  .collect()
```
You get:
```
[('count', 1), ('word', 1), ('bye', 3), ('hi', 2), ('Hi', 1)]
```
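The order of the pairs returned by collect() is not guaranteed, so when checking the result it can help to compare against a local count. A minimal sketch using collections.Counter from the standard library (no Spark involved; local_counts is just an illustrative name):

```python
from collections import Counter

s = 'Hi hi hi bye bye bye word count'

# Count the words locally; this should match the Spark result
# once both are compared independent of ordering.
local_counts = Counter(s.split())

print(sorted(local_counts.items()))
# [('Hi', 1), ('bye', 3), ('count', 1), ('hi', 2), ('word', 1)]
```

Note that 'Hi' and 'hi' are counted separately; lowercasing each word before counting would merge them, if that is what you want.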
Author: elyase

If you only want to count alphanumeric words, this may be a solution:

import re
from pyspark import SparkContext, SparkConf

def linesToWordsFunc(line):
    wordsList = line.split()
    wordsList = [re.sub(r'\W+', '', word) for word in wordsList]
    filtered = filter(lambda word: re.match(r'\w+', word), wordsList)
    return filtered

def wordsToPairsFunc(word):
    return (word, 1)

def reduceToCount(a, b):
    return (a + b)

def main():
    conf = SparkConf().setAppName("Words count").setMaster("local")
    sc = SparkContext(conf=conf)
    rdd = sc.textFile("your_file.txt")

    words = rdd.flatMap(linesToWordsFunc)
    pairs = words.map(wordsToPairsFunc)
    counts = pairs.reduceByKey(reduceToCount)

    # Get the 100 most frequent words (sort by descending count)
    output = counts.takeOrdered(100, key=lambda kv: -kv[1])

    for word, count in output:
        print(word + ': ' + str(count))

    sc.stop()

if __name__ == "__main__":
    main()
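To check what the cleaning step does without starting Spark, here is a small standalone sketch of the same logic as linesToWordsFunc, returning a list so it can be printed directly. The names lines_to_words and sample are hypothetical and only used for this illustration.

```python
import re

def lines_to_words(line):
    # Strip every non-word character from each whitespace-separated token.
    words = [re.sub(r'\W+', '', word) for word in line.split()]
    # Drop tokens that became empty (they consisted only of punctuation).
    return [word for word in words if re.match(r'\w+', word)]

sample = "Hello, world! -- it's 2015..."
print(lines_to_words(sample))  # ['Hello', 'world', 'its', '2015']
```

Be aware that punctuation inside a token is removed as well, so "it's" becomes "its".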

Author: f_ficarola

There are many versions of word count online; below is just one of them:

# To count the words in a file: hdfs:///, file:///, or a local file such as "./samplefile.txt"
rdd=sc.textFile(filename)

# Or you can initialize the RDD from your own list
v1='Hi hi hi bye bye bye word count' 
rdd=sc.parallelize([v1])


wordcounts=rdd.flatMap(lambda l: l.split(' ')) \
        .map(lambda w:(w,1)) \
        .reduceByKey(lambda a,b:a+b) \
        .map(lambda kv: (kv[1], kv[0])) \
        .sortByKey(ascending=False)

output = wordcounts.collect()

for (count,word) in output:
    print("%s: %i" % (word,count))

Author: jasminTi
