I have a JSON file that is about 100 GB in size. Its schema looks like:
json_f = {"main_index":{"0":3,"1":7},"lemmas":{"0":["test0", "test0"],"1":["test1","test1"]}}
*"lemmas" elements contain large lists with words. Len of "lemmas" elements about 2kk.
As a result, I need the whole thing in memory as either:
- A list of the "lemmas" values:
[["test0", "test0"], ["test1","test1"]]
- Or a pd.DataFrame of json_f, which I'll then process further into the list above (a sketch of that step is below).
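This is roughly what I mean by "process further" (a minimal sketch using the small example above; with the real file, df would have to come from however the 100 GB JSON gets loaded, which is exactly the part that fails):

import pandas as pd

# Tiny example mirroring the schema above; the real df would be built
# from the 100 GB file.
json_f = {"main_index": {"0": 3, "1": 7},
          "lemmas": {"0": ["test0", "test0"], "1": ["test1", "test1"]}}
df = pd.DataFrame(json_f)

lemmas_list = df["lemmas"].tolist()  # [["test0", "test0"], ["test1", "test1"]]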
What I have tried:
- pd.read_json gives a memory error (which, as far as I can tell, is not about RAM, since I have 256 GB);
- ijson, loading the file iteratively. But something goes wrong with the real file (on the example data it's OK): the kernel stays busy but the list does not grow.
import ijson

texts = []
with open("json_f.json", "rb") as f:
    # try to pull the lemma lists out one at a time
    for j in ijson.items(f, "lemmas.0"):
        texts.append(j)
One of my thoughts was to split the file into several smaller ones, load those, and then merge them. But I run into the same problem when loading the file in the first place. I would be very grateful for tips on how to deal with this.
Your usage of ijson doesn't populate the list because you are using an inappropriate function.
ijson.items yields multiple objects only if you give it a prefix that matches multiple objects. Such prefixes usually traverse a list of elements, so you'll see the word item somewhere in the prefix. OTOH, what you want to traverse iteratively is the lemmas object, which has many keys and values, and you only want to accumulate the values. Using ijson.kvitems should do the trick:
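Something along these lines (a minimal sketch, assuming the layout shown in the question; the key of each pair is ignored and only the value, i.e. the word list, is kept):

import ijson

texts = []
with open("json_f.json", "rb") as f:
    # kvitems yields (key, value) pairs for every member of the "lemmas" object
    for _key, lemmas in ijson.kvitems(f, "lemmas"):
        texts.append(lemmas)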
This should allow you to traverse the entire file and do something sensible with it. Note that trying to load all these lemmas into memory might still not be possible if there are too many of them. In that case you could, as you were suggesting, use ijson to break the file down into smaller ones that can be processed separately, which might or might not be possible depending on what you're trying to do.