Polars PanicException when reading a parquet file


I am trying to read a ~300M .parquet file with Polars. I am getting the following error:

df1_path:  /home/xxxxxxx/data/xxxxxx/60f70ff096f10d001e523ba9/655bf4495515510b763a8a03/655bf44ad172673b0636cfd3/tables/gnrl_ldgr_655bf7fbd172673b0636cfe1.parquet
thread 'polars-0' panicked at crates/polars-arrow/src/compute/cast/utf8_to.rs:79:47:
called `Result::unwrap()` on an `Err` value: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/xxxxxxx/git/moatengine/src/filter_func_v2.py", line 479, in <module>
    r.main()
  File "/home/xxxxxxx/git/moatengine/src/filter_func_v2.py", line 462, in main
    gl_df = self.large_join(df1_path=ldgr_path, df2_path=acct_path, save_path=ldgr_save_path, word_dict=word_dict)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/git/moatengine/src/filter_func_v2.py", line 178, in large_join
    df1 = pl.read_parquet(df1_path, use_pyarrow=True, columns=clean_cols)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/miniconda3/envs/moatengine_2/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/miniconda3/envs/moatengine_2/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/miniconda3/envs/moatengine_2/lib/python3.11/site-packages/polars/io/parquet/functions.py", line 146, in read_parquet
    return from_arrow(  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/miniconda3/envs/moatengine_2/lib/python3.11/site-packages/polars/convert.py", line 592, in from_arrow
    return pl.DataFrame._from_arrow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxxxxxx/miniconda3/envs/moatengine_2/lib/python3.11/site-packages/polars/dataframe/frame.py", line 581, in _from_arrow
    arrow_to_pydf(
  File "/home/xxxxxxx/miniconda3/envs/moatengine_2/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 1076, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(())

A similar file half the size was read successfully, and I am able to read the file metadata (e.g., using fastparquet to get file info), so I'm assuming something in the file is causing the problem. What I want to know:

  1. Am I correct that the problem is in the file?
  2. How do I troubleshoot something like this? Any recommendations on next steps, tools, etc.?
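
One way question 2 might be approached, as a minimal sketch rather than a confirmed fix: convert the file one row group at a time and record which row groups fail. The helper names `find_failing_units` and `check_parquet_row_groups` are hypothetical. It relies on `pyarrow.parquet.ParquetFile.read_row_group` and `pl.from_arrow`, and on my understanding that pyo3's `PanicException` derives from `BaseException` rather than `Exception`, so a plain `except Exception` would not catch it.

```python
def find_failing_units(n_units, convert):
    """Call convert(i) for i in range(n_units); return (index, error) pairs that raised."""
    failures = []
    for i in range(n_units):
        try:
            convert(i)
        except KeyboardInterrupt:
            # Let Ctrl-C stop the scan instead of recording it as a failure.
            raise
        except BaseException as exc:
            # Assumption: PanicException subclasses BaseException, so catch broadly.
            failures.append((i, repr(exc)))
    return failures


def check_parquet_row_groups(path, columns=None):
    """Hypothetical helper: try to convert each row group to Polars on its own."""
    # Imported lazily so the generic helper above works without these packages.
    import polars as pl
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)

    def convert(i):
        # Read one row group with pyarrow, then hand it to Polars.
        pl.from_arrow(parquet_file.read_row_group(i, columns=columns))

    return find_failing_units(parquet_file.num_row_groups, convert)
```

Calling `check_parquet_row_groups(df1_path, columns=clean_cols)` would return a list of failing row-group indices. An empty list would itself be informative: it would suggest the panic only appears when the whole table is converted at once. A segfault cannot be caught this way, so this only helps with the panic path.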

Note: I haven't reported this in the Polars GitHub issues because I don't know how to produce a minimal example, but I did not see any issues relating exactly to this error. I know that the error is happening on line 79 of crates/polars-arrow/src/compute/cast/utf8_to.rs, but I'm not sure exactly what is happening there.

Update:

Using this:

import pyarrow.parquet as pq
tabl = pq.read_table(file_path)

then tabl.schema is:

rowid: int64
txid: int64
debit: int64
credit: int64
effective_date: date32[day] not null
entered_date: date32[day] not null
account: string
user_id: string
transaction: string
memo: string
total_amt: int64
-- schema metadata --
parquet.avro.schema: '{"type":"record","name":"gnrl_ldgr_655bf7fbd172673b' + 699
writer.model.name: 'avro'

Switching to use_pyarrow=False for the read results in a segfault:

Mar 19 15:05:18 xxxx kernel: [1108996.070242] polars-8[281558]: segfault at 7fd0d442d5a0 ip 00007fcb20c4ac1f sp 00007fcb111fc240 error 4 in polars.abi3.so[7fcb1fb4e000+366e000]
Mar 19 15:05:51 xxxx kernel: [1109028.279971] polars-4[281611]: segfault at 7f3034a2d8a0 ip 00007f2ef0c4ac1f sp 00007f2ee547d240 error 4 in polars.abi3.so[7f2eefb4e000+366e000]