I'm using multer in Node.js to handle file uploads. When a PDF file is uploaded, I want to split it into chunks and store those chunks in a vector store (using LangChain.js) for a RAG application.
import { WebPDFLoader } from 'langchain/document_loaders/web/pdf';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
// file is provided by multer
const data = file.buffer;
const mimetype = file.mimetype;
const blob = new Blob([data]);
const loader = new WebPDFLoader(blob, {
  splitPages: false,
});
const docs = await loader.load();
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
When fetching a PDF from a URL instead of from the multer buffer, this method works as expected:
const url = "https://dagrs.berkeley.edu/sites/default/files/2020-01/sample.pdf";
const response = await fetch(url);
const data = await response.blob();
console.log(data);
const loader = new WebPDFLoader(data, {
  splitPages: false,
});
When I console.log(data) in the above code, I get:
Blob { size: 54836, type: 'application/pdf' }
When creating the Blob from the multer upload, do I need to include more data in the Blob than just the buffer from multer's file object? How would I do that?
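You can modify your code to create the Blob with the correct MIME type by passing multer's file.mimetype as the type option of the Blob constructor, for example new Blob([data], { type: mimetype }).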
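Here is a minimal sketch of that change, assuming multer is configured with memoryStorage() so that file.buffer is populated. The helper name multerFileToBlob and the stand-in fakeFile object are hypothetical and only there so the snippet runs on its own. Whether the missing type is the only cause of the failure is not confirmed here, but it is the one visible difference from the Blob that fetch() returns.

```javascript
// Hypothetical helper: wraps a multer file in a Blob that carries its MIME type,
// so it looks like the Blob returned by response.blob() in the fetch() example.
function multerFileToBlob(file) {
  // file.buffer only exists when multer uses memoryStorage()
  if (!file || !Buffer.isBuffer(file.buffer)) {
    throw new TypeError('Expected a multer file with a buffer (memoryStorage)');
  }
  // Fall back to application/pdf if multer did not report a MIME type (assumption)
  return new Blob([file.buffer], { type: file.mimetype || 'application/pdf' });
}

// Stand-in for the object multer would provide (not a real PDF)
const fakeFile = {
  buffer: Buffer.from('%PDF-1.4\n'),
  mimetype: 'application/pdf',
};

const blob = multerFileToBlob(fakeFile);
console.log(blob.size, blob.type);
```

In the original code, the resulting blob would then be passed to new WebPDFLoader(blob, { splitPages: false }) exactly as before, and the loaded docs can be chunked with textSplitter.splitDocuments(docs).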