Pyarrow Schema definition

49 views Asked by MuGh At 12 March 2024 at 15:37

I am trying to create a Parquet file from MongoDB records. To do this, I first created a schema like this:

import pyarrow as pa
import pyarrow.parquet as pq

USER = pa.schema([
    pa.field("_id", pa.string(), nullable=True),
    pa.field("appID", pa.string(), nullable=True),
    pa.field("group", pa.string(), nullable=True),
    pa.field("_created", pa.int64(), nullable=True),
    pa.field("_touched", pa.int64(), nullable=True),
    pa.field("_updated", pa.int64(), nullable=True)
])

writer = pq.ParquetWriter('output.parquet', USER)

Then, while looping over the Mongo docs, I use the following to add each chunk of data to the Parquet file:

batch = pa.RecordBatch.from_pylist(chunk)
    
writer.write_batch(batch)

I got the error "Table schema does not match schema used to create file". This happens because not all Mongo records contain the group field. How can I solve this?

There is 1 answer

**Umut** · Answer 1 · 2024-03-12T15:50:28+00:00

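The "Table schema does not match schema used to create file" error means that the schema of the batch you are writing differs from the schema you passed to ParquetWriter. When from_pylist is called without a schema, it infers one from the data, so documents that lack the group field can produce a batch without that column. To fix this, make sure every record carries all the fields defined in the schema before you write it.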

Here's an example of how you could change your code to fill in the missing fields:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

USER = pa.schema([
    pa.field("_id", pa.string(), nullable=True),
    pa.field("appID", pa.string(), nullable=True),
    pa.field("group", pa.string(), nullable=True),
    pa.field("_created", pa.int64(), nullable=True),
    pa.field("_touched", pa.int64(), nullable=True),
    pa.field("_updated", pa.int64(), nullable=True)
])

writer = pq.ParquetWriter('output.parquet', USER)

# mongo_docs is the iterable of MongoDB documents (dicts) you are looping over
for doc in mongo_docs:
    # Give every schema field a value, using None where the document lacks it
    filled_doc = {field.name: doc.get(field.name, None) for field in USER}

    # Build a one-row batch that conforms to the USER schema
    batch = pa.RecordBatch.from_pandas(pd.DataFrame([filled_doc]), schema=USER)

    writer.write_batch(batch)

writer.close()

After writing all batches, don't forget to close the writer with writer.close() so that the Parquet file is properly finalized.
