Error when running Mapreduce in Cloud - RuntimeException: PipeMapRed.waitOutputThreads()

I am trying to find the top-n words using Hadoop MapReduce (via Hadoop streaming), but the job fails with a RuntimeException: PipeMapRed.waitOutputThreads() error.

My Mapper is:

#!/usr/bin/env python
import sys
import re
from collections import Counter

# This regular expression is used to split lines into words, filtering out punctuation and numbers.
WORD_RE = re.compile(r'\b[a-zA-Z]+\b')

# Initialize the counter outside the loop
counter = Counter()

# Define the separator for output
separator = "\t"

# Process each line of input from STDIN
for line in sys.stdin:
    line = line.strip().lower()
    # Find all words in the current line using the regular expression
    words = WORD_RE.findall(line)
    # Update the counter with the list of words found
    counter.update(words)
# Emit this mapper's 15 most common words and their counts
for word, count in counter.most_common(15):
    print(f"{word}{separator}{count}")
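For reference, the tokenization and counting can be checked in isolation, outside Hadoop. This is a minimal sketch with hypothetical sample lines, using the same WORD_RE pattern as the mapper:

```python
import re
from collections import Counter

# Same pattern as the mapper: letters only, punctuation and numbers filtered out
WORD_RE = re.compile(r'\b[a-zA-Z]+\b')

# Hypothetical sample input standing in for sys.stdin
sample = ["Hello, World! Hello again.", "The world turns 24/7."]

counter = Counter()
for line in sample:
    counter.update(WORD_RE.findall(line.strip().lower()))

# "Hello"/"hello" and "World"/"world" fold together after lower()
for word, count in counter.most_common(3):
    print(f"{word}\t{count}")
```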

My reducer is:

#!/usr/bin/env python
import sys
from collections import Counter
from heapq import nlargest

# Define the number of top words to output
TOP_N = 10
global_counter = Counter()

# Read the input from STDIN and process it line by line
for line in sys.stdin:
    # Strip trailing whitespace; skip blank lines, which would make split() raise ValueError
    line = line.rstrip()
    if not line:
        continue
    # Split on the first tab into word and count
    word, count = line.split('\t', 1)
    # Convert the count to an integer and accumulate into the global counter
    global_counter[word] += int(count)

# Find the top N words globally using nlargest from the heapq module
top_n_words = nlargest(TOP_N, global_counter.items(), key=lambda i: i[1])

# Print the top N words and their counts
for word, count in top_n_words:
    print(f"{word}\t{count}")
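The reducer's aggregation can likewise be exercised directly; a minimal sketch, assuming hypothetical tab-separated word/count records like those the mapper emits:

```python
from collections import Counter
from heapq import nlargest

# Hypothetical intermediate records, as if emitted by several mappers
records = ["hello\t2", "world\t1", "hello\t3", "again\t1"]

global_counter = Counter()
for line in records:
    word, count = line.rstrip().split('\t', 1)
    global_counter[word] += int(count)

# Pick the 2 highest global counts, same as the reducer's nlargest call
for word, count in nlargest(2, global_counter.items(), key=lambda i: i[1]):
    print(f"{word}\t{count}")
```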

I am submitting the job to Hadoop as below, and this is what produces the error:

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-file topn_mapper.py \
-mapper 'python topn_mapper.py' \
-file topn_reducer.py \
-reducer 'python topn_reducer.py' \
-input /data/ \
-output /TempOutFolder
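For what it's worth, recent Hadoop streaming releases deprecate the per-file -file option in favor of the generic -files option (which must precede the streaming options); an equivalent submission, assuming the same paths, would look like:

```shell
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -files topn_mapper.py,topn_reducer.py \
  -mapper 'python topn_mapper.py' \
  -reducer 'python topn_reducer.py' \
  -input /data/ \
  -output /TempOutFolder
```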

When I test the pipeline locally with:

cat data/hello.txt | python topn_mapper.py | python topn_reducer.py

it works fine (note that this local run skips the key sort that Hadoop streaming performs between the map and reduce phases).
