stream json file from s3 bucket


So, I have uploaded a JSON file to an S3 bucket. However, when I retrieve it in Python using a get() call, I end up with the contents as a str instead of a JSON file. If I download the file I get a proper JSON file, but the actual file is too big to download. I want to process and validate the JSON later using json_stream, which is not possible while the data is a plain str. I know I could convert the str to JSON with json.loads, but I don't want to do that, since I am building a validator that checks a file's JSON syntax using json_stream. Is there any other way?
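For reference, this is roughly what I am doing now (bucket and key names are placeholders); reading the body gives me a str that I would have to parse in one go with json.loads:

import boto3
import json

s3 = boto3.resource("s3")
obj = s3.Object("my-bucket", "data.json")   # placeholder bucket/key

# .read() pulls the whole object down and decoding gives a plain str
raw = obj.get()["Body"].read().decode("utf-8")
parsed = json.loads(raw)   # works, but the entire file has to fit in memory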


1 Answer

Sandesh Jadhav M

I know I am a bit late with a solution, but since this question keeps coming up in searches about large files on S3, I am adding the approach I have used.

This code works for non-gzipped files as well (see the short variant after the block below). Since large JSON files are usually stored gzipped, here is my solution based on gzipped JSON files.

import boto3
import gzip
import ijson

my_bucket = "<<bucket>>"         # your bucket name
file_to_read = "<<bucket_key>>"  # key of the gzipped JSON object
s3 = boto3.resource("s3")
obj = s3.Object(my_bucket, file_to_read)

# obj.get()["Body"] is a streaming, file-like object; wrapping it in
# gzip.GzipFile decompresses on the fly without downloading the whole file.
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    json_parser = ijson.parse(gzipfile)

    # ijson emits (prefix, event, value) tuples as it walks the JSON
    for prefix, type1, value in json_parser:
        print(prefix, type1, value)
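
If the object is not gzipped, the GzipFile wrapper can be dropped and the response body fed to ijson directly, since the StreamingBody is itself file-like. A minimal sketch, reusing the s3 resource and placeholder names defined above:

# non-gzipped variant: parse the S3 response body directly
obj = s3.Object(my_bucket, file_to_read)
for prefix, type1, value in ijson.parse(obj.get()["Body"]):
    print(prefix, type1, value)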

My solution is built around ijson, but I guess this can be used for json_stream as well.
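
For json_stream the wiring should be similar; here is a minimal sketch with the same placeholder bucket/key. I am assuming json_stream.load accepts the decompressed file-like object directly; if your version expects text, wrap it in io.TextIOWrapper first.

import boto3
import gzip
import json_stream

s3 = boto3.resource("s3")
obj = s3.Object("<<bucket>>", "<<bucket_key>>")   # placeholders as above

with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    # json_stream parses lazily, so the full document never sits in memory
    data = json_stream.load(gzipfile)
    for key, value in data.items():   # assumes the top-level JSON value is an object
        print(key, value)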