Some pages are skipped when using pdf-parse to extract text from a PDF


I'm using pdf-parse, a Node.js library, to extract text from a PDF file. However, certain pages are skipped during extraction. I've checked the PDF file, and it doesn't seem to be encrypted or corrupted. Also, when I open it in my Mac's PDF viewer, the text on the missing pages is searchable, so those pages are not scanned images.

Here's the code I'm using:

const pdfParse = require('pdf-parse');
const fs = require('fs');

// Read the PDF file
const pdfPath = 'path/to/pdf/file.pdf';
const pdfBuffer = fs.readFileSync(pdfPath);

// Parse the PDF
pdfParse(pdfBuffer).then((data) => {
  console.log(data.text);
}).catch((error) => {
  console.error('An error occurred:', error);
});

With the code above, the text of certain pages is missing from data.text. What could be causing this, and how can I make sure pdf-parse parses every page? Any insights or suggestions would be greatly appreciated. Thank you!

What I tried: extracting the text of every page in the PDF with the code above. What I expected: the text of all pages, in order. What I got: output in which some pages are skipped.
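
One way to narrow this down is to record what happens on each page instead of looking only at the combined text. The following is a minimal sketch, not a confirmed fix. It assumes pdf-parse's `pagerender` option and the `numpages` and `numrender` fields of the result. The helper names `makePageCollector`, `emptyPages`, and `diagnose` are hypothetical, and the page numbers are inferred from call order, which assumes pages are rendered one at a time, in order.

```javascript
// Hypothetical helper: builds a pagerender callback that records, for each
// page, how many characters were extracted and whether extraction threw.
function makePageCollector() {
  const pages = [];

  async function pagerender(pageData) {
    // Assumption: pages are rendered sequentially, so call order == page order.
    const page = pages.length + 1;
    try {
      const content = await pageData.getTextContent();
      // Simplified joining; the library's default renderer formats lines differently.
      const text = content.items.map((item) => item.str).join(' ');
      pages.push({ page, chars: text.trim().length, error: null });
      return text;
    } catch (err) {
      // Record the failure so it is visible instead of ending up as an empty page.
      pages.push({ page, chars: 0, error: String(err) });
      return '';
    }
  }

  return { pages, pagerender };
}

// Hypothetical helper: 1-based numbers of pages that produced no text.
function emptyPages(pages) {
  return pages.filter((p) => p.chars === 0).map((p) => p.page);
}

// Hypothetical entry point: parse a file and print a per-page summary.
async function diagnose(pdfPath) {
  const fs = require('fs');
  const pdfParse = require('pdf-parse');

  const { pages, pagerender } = makePageCollector();
  const data = await pdfParse(fs.readFileSync(pdfPath), { pagerender });

  console.log('pages in file:', data.numpages, 'pages rendered:', data.numrender);
  console.log('pages with no extracted text:', emptyPages(pages));
  for (const p of pages.filter((entry) => entry.error)) {
    console.log('page', p.page, 'failed:', p.error);
  }
}
```

Calling `diagnose('path/to/pdf/file.pdf')` would show whether the skipped pages return no text items at all or fail with an error, which are two different problems to chase.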
