I have a JSON file that is about 100 GB in size. Its schema looks like:
json_f = {"main_index":{"0":3,"1":7},"lemmas":{"0":["test0", "test0"],"1":["test1","test1"]}}
*"lemmas" elements contain large lists with words. Len of "lemmas" elements about 2kk.
As a result, I need the whole thing in memory as either:
- A list of the "lemmas" values:
[["test0", "test0"], ["test1","test1"]]
- Or a pd.DataFrame of json_f, which I'll then process further into the list above (a sketch of that step is below).
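This is roughly what I mean by "process further" (a minimal sketch using the small example above; with the real file, df would have to come from however the 100 GB JSON gets loaded, which is exactly the part that fails):

import pandas as pd

# Tiny example mirroring the schema above; the real df would be built
# from the 100 GB file.
json_f = {"main_index": {"0": 3, "1": 7},
          "lemmas": {"0": ["test0", "test0"], "1": ["test1", "test1"]}}
df = pd.DataFrame(json_f)

lemmas_list = df["lemmas"].tolist()  # [["test0", "test0"], ["test1", "test1"]]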
What I have tried:
- pd.read_json gives a memory error (which, as far as I can tell, is not about RAM, since I have 256 GB);
- ijson, loading the file iteratively. But something goes wrong with the real file (on the example data it's OK): the kernel stays busy but the list does not grow.
import ijson

texts = []
with open("json_f.json", "rb") as f:
    # try to pull the lemma lists out one at a time
    for j in ijson.items(f, "lemmas.0"):
        texts.append(j)
One of my thoughts was to split the file into several smaller ones, load those, and then merge them. But I run into the same problem when loading the file in the first place. I would be very grateful for tips on how to deal with this.
Your usage of ijson doesn't populate the list because you are using an inappropriate function.
ijson.items yields multiple objects only if you give it a prefix that matches multiple objects. Such prefixes usually traverse a list of elements, so you'll see the word item somewhere in the prefix. OTOH, what you want to traverse iteratively is the lemmas object, which has many keys and values, and you only want to accumulate the values. Using ijson.kvitems should do the trick:
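Something along these lines (a minimal sketch, assuming the layout shown in the question; the key of each pair is ignored and only the value, i.e. the word list, is kept):

import ijson

texts = []
with open("json_f.json", "rb") as f:
    # kvitems yields (key, value) pairs for every member of the "lemmas" object
    for _key, lemmas in ijson.kvitems(f, "lemmas"):
        texts.append(lemmas)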
This should allow you to traverse the entire file and do something sensible with it. Note that trying to load all these lemmas into memory might still not be possible if there are too many of them. In that case you could, as you were suggesting, use ijson to break the file down into smaller ones that can be processed separately, which might or might not be possible depending on what you're trying to do.