Question: when reading newline-delimited JSON files as a PyArrow dataset, what is the correct way to specify the JSON block_size?
Context
I have newline-delimited JSON files which I wish to convert to parquet.
This code worked for one set of files:
# ... imports, schema definition
dataset = ds.dataset(
    'data/landing/type_one',
    format='json',
    schema=schema_type_one
)
ds.write_dataset(
    dataset,
    'data/intermediate/type_one',
    format='parquet',
    min_rows_per_group=3*512**2,
    max_rows_per_file=5*10**6
)
For another set of files (with a different data structure), my code was failing with pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?). Some Googling took me to this GitHub issue, where one of the comments hinted at what my problem was:
We're using the JSON loader of pyarrow. It parses the file chunk by chunk to load the dataset. This issue happens when there's no delimiter in one chunk of data. For json line, the delimiter is the end of line. So with a big value for chunk_size this should have worked unless you have one extremely long line in your file.
And in fact yes, I have some extremely long lines in my second set of files, so I need to increase the block size somehow.
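To pick a sensible block size, it helps to know how long the longest line actually is, since every line must fit within a single block. A minimal sketch to measure this (the directory and the *.json glob are my assumptions; adjust them to your layout):

from pathlib import Path

# Find the longest line (in bytes) across the newline-delimited JSON files;
# block_size needs to be larger than this for every line to fit in one block.
longest = 0
for path in Path('data/landing/type_two').glob('*.json'):
    with path.open('rb') as f:
        for line in f:
            longest = max(longest, len(line))
print(f'longest line: {longest} bytes')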
pyarrow.dataset.dataset does not have a way to directly specify block size for JSON, but I was able to get it to work by specifying format as a JsonFileFormat object with specific read_options:
import pyarrow.dataset as ds
from pyarrow._dataset import JsonFileFormat
from pyarrow.json import ReadOptions
# ... schema definitions
fileformat = JsonFileFormat(
    read_options=ReadOptions(block_size=5*2**20)
)
dataset = ds.dataset(
    'data/landing/type_two',
    format=fileformat,
    schema=schema_type_two
)
ds.write_dataset(
    dataset,
    'data/intermediate/type_two',
    format='parquet',
    min_rows_per_group=3*512**2,
    max_rows_per_file=5*10**6
)
However, this import makes me uncomfortable: from pyarrow._dataset import JsonFileFormat. Whereas pyarrow.dataset.ParquetFileFormat is documented in the public API, I had to dig through the code to find the existence of pyarrow._dataset.JsonFileFormat which resides in a separate "private" _dataset package.
Is there a more "correct" way to achieve what I'm trying to accomplish?
Answer: it appears that at the time of this writing JsonFileFormat is simply missing from the documentation for pyarrow.dataset. The documentation has CsvFileFormat, IpcFileFormat, ParquetFileFormat and OrcFileFormat, but not JsonFileFormat. In fact, however, pyarrow.dataset imports JsonFileFormat along with the other file formats from pyarrow._dataset. So instead of doing:

from pyarrow._dataset import JsonFileFormat

one can directly do:

from pyarrow.dataset import JsonFileFormat

And then the block_size can be specified by providing an appropriately configured pyarrow.json.ReadOptions object to JsonFileFormat, as shown in the original code in the question.
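Putting it all together with the public import (a sketch based on the question's own code; the schema fields here are placeholders, since the real schema definitions were elided in the question):

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import JsonFileFormat  # public import; no pyarrow._dataset needed
from pyarrow.json import ReadOptions

# Placeholder schema; substitute the real schema_type_two from the question.
schema_type_two = pa.schema([('id', pa.int64()), ('payload', pa.string())])

# 5 MiB blocks: each block must be large enough to hold the longest JSON line.
fileformat = JsonFileFormat(
    read_options=ReadOptions(block_size=5*2**20)
)
dataset = ds.dataset(
    'data/landing/type_two',
    format=fileformat,
    schema=schema_type_two
)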