Summary
I'm trying to extract text from a PDF with PDFMiner, split it into chunks, and then embed the chunks with a model from Hugging Face. The problem is that the list returned by RecursiveCharacterTextSplitter.create_documents() contains Document objects, which requests.post() cannot serialize to JSON.
The code fails when querying the Hugging Face model with the following error:
TypeError: Object of type Document is not JSON serializable
Note: the full traceback is at the end of this question.
I don't know how to convert the data I get back from RecursiveCharacterTextSplitter() into a JSON-serializable object.
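To rule out PDFMiner as the cause, here is a minimal sketch that reproduces the same error without any PDF involved (assuming the Document class from langchain.schema, which I believe is the type of the items returned by create_documents()):
import json
from langchain.schema import Document  # same type as the items returned by create_documents()

doc = Document(page_content="hello", metadata={})

try:
    json.dumps({"inputs": [doc]})
except TypeError as e:
    print(e)  # Object of type Document is not JSON serializable

# A plain string serializes without problems
print(json.dumps({"inputs": [doc.page_content]}))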
Source code:
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import requests
import pandas as pd
# Extract text from pdf file
output_string = StringIO()
with open('info.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
# Split text into chunks via text_splitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([output_string.getvalue()])
# model_id: embedding model we use on Hugging Face
# hf_token: Hugging Face token used to authenticate against the Inference API
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "hf..."
# Build request header
api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}
# Query the model to embed our text chunks
def query(texts):
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options": {"wait_for_model": True}})
    return response.json()
output = query(texts)
The traceback at runtime is the following:
Traceback (most recent call last):
File "troubleshoot.py", line 59, in <module>
output = query(texts)
File "troubleshoot.py", line 56, in query
response = requests.post(api_url, headers=headers, json={"inputs": texts, "options":{"wait_for_model":True}})
File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/sessions.py", line 575, in request
prep = self.prepare_request(req)
File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/sessions.py", line 486, in prepare_request
p.prepare(
File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/models.py", line 371, in prepare
self.prepare_body(data, files, json)
File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/models.py", line 511, in prepare_body
body = complexjson.dumps(json, allow_nan=False)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 234, in dumps
return cls(
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Document is not JSON serializable
The list I'm trying to JSON-serialize looks like this:
output of print(type(texts)): <class 'list'>
output of print(texts): [Document(page_content='There is a land in the middle of the Pacific Ocean, it’s called AmazingLand.', metadata={}), Document(page_content='The population of it is about 1.4 million and it’s 72% inhabited by Amazings. The other 28%', metadata={}), Document(page_content='consists of hungarians, germans and mongoloids.', metadata={}), Document(page_content='The country is a monarchy, and it is ruled by the Big Amazing King. The Big Amazing King is', metadata={}), Document(page_content='someone who can rap classical music, and it is the best doing it among the Amazing population', metadata={}), Document(page_content='of the AmazingLand.', metadata={})]
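For comparison, as far as I can tell from the Hugging Face docs, the feature-extraction endpoint expects inputs to be a plain string or a list of strings, so a payload shaped like this serializes without problems:
payload = {
    "inputs": [
        "There is a land in the middle of the Pacific Ocean, it’s called AmazingLand.",
        "The population of it is about 1.4 million and it’s 72% inhabited by Amazings. The other 28%",
    ],
    "options": {"wait_for_model": True},
}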
Question
How do I convert the data I receive from RecursiveCharacterTextSplitter() into a JSON-serializable object that I can send to the Hugging Face Inference API?
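The only idea I have so far is to strip each Document down to its text before posting, roughly like the untested sketch below, but I'm not sure whether that is the intended approach or whether the metadata should be kept somehow:
# Untested idea: send only the raw text of each chunk
chunks = [doc.page_content for doc in texts]  # list of plain strings
output = query(chunks)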