I am trying to create a Parquet file from MongoDB records. To do this, I first created a schema like this:
import pyarrow as pa
import pyarrow.parquet as pq
USER = pa.schema([
    pa.field("_id", pa.string(), nullable=True),
    pa.field("appID", pa.string(), nullable=True),
    pa.field("group", pa.string(), nullable=True),
    pa.field("_created", pa.int64(), nullable=True),
    pa.field("_touched", pa.int64(), nullable=True),
    pa.field("_updated", pa.int64(), nullable=True)
])
writer = pq.ParquetWriter('output.parquet', USER)
Then, while looping over the Mongo docs in chunks, I use the following to add the data to the Parquet file:
batch = pa.RecordBatch.from_pylist(chunk)
writer.write_batch(batch)
I got this error: "Table schema does not match schema used to create file". This happens because not all Mongo records contain the group field. How can I solve this?
The "Table schema does not match schema used to create file" error occurs because pa.RecordBatch.from_pylist(chunk) is called without a schema, so pyarrow infers one from the records in the chunk. When the records it inspects lack the group field, the inferred schema has no group column and no longer matches the USER schema the writer was created with.
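To fill in the missing fields, pass the schema explicitly: pa.RecordBatch.from_pylist(chunk, schema=USER). With a schema given, any key that is absent from a record is written as null, which the nullable group field allows.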
After writing all batches, don't forget to close the writer with writer.close() so that the Parquet file is properly finalized.