Hadoop mapreduce error : PipeMapRed.waitOutputThreads(): subprocess failed with code 1


I'm trying to convert XML files with a MapReduce job and get this error:

2023-04-04 09:41:52,515 INFO mapreduce.Job:  map 0% reduce 0%
2023-04-04 09:42:12,676 INFO mapreduce.Job: Task Id : attempt_1680592009322_0021_m_000000_0, Status : FAILED 
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 

I'm launching it with the command:

yarn jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
        -file "script/mapper_convert_xml.py" -mapper "python3 mapper_convert_xml.py" \
        -file "script/reducer_convert_xml.py" -reducer "python3 reducer_convert_xml.py" \
        -input /input/philharmonie_data/AIC94.xml -output /output/philharmonie_data

My scripts have the #!/usr/bin/env python header and I chmod 744 the mapper, the reducer and the ElementTree.py scripts.

The scripts run fine locally with the command line:

cat /home/philharmonie_data/AIC94.xml | python3 /script/mapper_convert_xml.py | python3 /script/reducer_convert_xml.py

Here is my mapper script:

#!/usr/bin/env python

import sys
import xml.etree.ElementTree as ET

# Parse the whole document from stdin, telling the parser the payload is UTF-8.
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(sys.stdin.read().encode("utf-8", "replace"), parser=parser)

for element in tree:
    child_values = ""
    for child in element:
        if child.tag == "{http://www.loc.gov/MARC21/slim}controlfield":
            # child.text can be None for an empty field; guard the concatenation
            child_values = child_values + child.attrib["tag"] + "\t" + (child.text or "") + "_|_"

        if child.tag == "{http://www.loc.gov/MARC21/slim}datafield":
            for field in child:
                code_uni = child.attrib["tag"] + "$" + field.attrib["code"]
                value = field.text
                # field.text is None for empty subfields; check value, not
                # code_uni, which is always a str here
                if value is not None:
                    child_values = child_values + code_uni + "\t" + value + "_|_"

    print(child_values)
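For debugging, one sketch that could surface the real Python traceback in the YARN task logs instead of the generic "subprocess failed with code 1" (the mapper body here is just a placeholder, not my actual logic):

```python
import sys
import traceback
import xml.etree.ElementTree as ET

def mapper(text):
    """Parse the XML payload and yield one output line per record."""
    tree = ET.fromstring(text)
    for element in tree:
        yield element.tag  # placeholder for the real per-record logic

def main():
    try:
        for line in mapper(sys.stdin.read()):
            print(line)
    except Exception:
        # The traceback lands in the task's stderr log, retrievable with:
        #   yarn logs -applicationId <application id>
        traceback.print_exc(file=sys.stderr)
        sys.exit(1)
```

Calling main() from the streaming mapper this way means any crash leaves a full traceback in the attempt's stderr instead of only the exit code.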

I've tried to chmod +x the scripts and to add the #!/usr/bin/env python3 header to the scripts.

Using -files instead of -file in the command line doesn't change the error.

Following this question, I tried to execute this command, with no change in the error:

yarn jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
        -file "script/mapper_convert_xml.py" -mapper "python3 mapper_convert_xml.py --python-bin /usr/bin/python3" \
        -file "script/reducer_convert_xml.py" -reducer "python3 reducer_convert_xml.py --python-bin /usr/bin/python3" \
        -input /input/philharmonie_data/AIC94.xml -output /output/philharmonie_data

Edit: I also tried setting sys.stdout's encoding explicitly, with no change:

import io

# Set encoding explicitly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
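A related sketch for the input side: under python3, a non-UTF-8 byte in the file can crash sys.stdin.read() itself, so stdin could be reopened tolerantly (the helper name is just illustrative):

```python
import io

def tolerant_reader(binary_stream):
    # Wrap a binary stream so undecodable bytes become U+FFFD instead of
    # raising UnicodeDecodeError when the mapper calls .read().
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors="replace")

# In the mapper, instead of sys.stdin.read():
#   data = tolerant_reader(sys.stdin.buffer).read()
```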

Since the wordcount example runs fine, I suspect the import xml.etree.ElementTree as ET line breaks the code on the cluster. Any idea on how to make the job work? Thanks
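To test that suspicion, a throwaway diagnostic mapper could be shipped the same way as the real one, printing which interpreter the worker nodes actually run and whether ElementTree imports there (this probe is a sketch, not part of my real job):

```python
#!/usr/bin/env python3
import sys

def probe():
    # Report the interpreter version and whether the stdlib XML parser imports.
    lines = ["python\t" + sys.version.split()[0]]
    try:
        import xml.etree.ElementTree  # the import the real mapper depends on
        lines.append("ElementTree\tok")
    except ImportError as exc:
        lines.append("ElementTree\t" + str(exc))
    return lines

for line in probe():
    print(line)
```

If this job succeeds and its output shows "ElementTree\tok", the import is not the problem and the crash is in the mapper logic itself.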
