I have to handle a big JSON file (approx. 47GB) and it seems as if I found the solution in ijson.
However, when I want to go through the objects I get the following error:
byggesag = (o for o in objects if o["h�ndelse"] == 'Byggesag')
^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe6 in position 12: invalid continuation byte
Here is the code I am using so far:
import ijson
with open("C:/Path/To/Json/JSON_20220703180000.json", "r", encoding="cp1252") as json_file:
    objects = ijson.items(json_file, 'SagList.item')
    byggesag = (o for o in objects if o['hændelse'] == 'Byggesag')
How can I deal with the encoding of the input file?
The problem is with the Python script itself, which is saved as cp1252 while Python expects it to be UTF-8. You seem to be handling the input JSON file correctly (though you won't be able to tell until you can actually run your script).

First, note that the error is a SyntaxError, which probably happens while your script/module is being loaded, before any of it runs.

Second, note how in the first bit of code you shared, hændelse appears somewhat scrambled, and Python is complaining that UTF-8 cannot handle byte 0xe6. This is because the character æ (U+00E6, https://www.compart.com/de/unicode/U+00E6) is encoded as the single byte 0xe6 in cp1252. In UTF-8, 0xe6 starts a multi-byte sequence, and the n that follows it is not a valid continuation byte; hence the error.

To solve it, save your Python script with UTF-8 encoding, or declare at the top of the file that it is saved as cp1252 (see https://peps.python.org/pep-0263/ for reference).
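As a rough illustration of both points, the sketch below reproduces the decoding failure on the bytes of the key alone and then shows a PEP 263 declaration letting Python read the same bytes. The helper try_decode, the variable names, and the two-line in-memory script are made up for this example; your real file would carry the declaration as its first line instead.

```python
# The key as it is stored in a script saved as cp1252 (\u00e6 is "æ").
raw = "h\u00e6ndelse".encode("cp1252")  # b'h\xe6ndelse'


def try_decode(data, encoding):
    """Return the decoded text, or the decoder's error reason on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


# Python's default assumption for source files is UTF-8, which fails here.
as_utf8 = try_decode(raw, "utf-8")
# Reading the same bytes as cp1252 recovers the intended text.
as_cp1252 = try_decode(raw, "cp1252")

# A tiny stand-in for a cp1252-saved script with a PEP 263 declaration
# on its first line. compile() honours the declaration for bytes input.
declared = b"# -*- coding: cp1252 -*-\nkey = 'h\xe6ndelse'\n"
namespace = {}
exec(compile(declared, "<cp1252 script>", "exec"), namespace)
key = namespace["key"]

print(as_utf8)
print(as_cp1252 == key)
```

If you would rather not keep a declaration in the file, re-saving the script as UTF-8 from your editor should have the same effect, since that is what Python assumes when no declaration is present.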