Is there a way to read the first page of a PDF document from a URL without saving it locally? I need to read a PDF that a website returns in response to a request. Below is the code I tried. It works well with some http URLs but not with others.
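import io
import urllib3
import PyPDF2

urllib3.disable_warnings()
url = "http://www.ain.gouv.fr/IMG/pdf/aprejetdae20210709enligne.pdf"
with urllib3.PoolManager() as http:
    # Download the whole PDF into memory; nothing is written to disk
    r = http.request('GET', url)
    with io.BytesIO(r.data) as f:
        reader = PyPDF2.PdfFileReader(f)
        # Extract the text of the first page, one list item per line
        contents = reader.getPage(0).extractText().split('\n')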
Here is the output when I run this code with the following url: "http://www.ain.gouv.fr/IMG/pdf/aprejetdae20210709enligne.pdf"
['', '', '', '', '˘ˇˆ', '˙˝', '˚', '!˛', '˛ ', '', 'ˆ˙ˆ#$%', '$', "#'˙", '( ', '', '', '', '˘ˇˆˇ˙', '˝˘ˇˆˇ˛˚', '˜', ' !"ˇ#ˆ"!$%!"ˇ#&', "ˇ'", '˜', '(', '!"ˇ#ˆˇ!$%!"ˇ#)&*', '˜', '((ˇˇ%!"!ˇ', '+,+-', '(./', '01(', '!,(2$˙', '""˚345', '6', '7((&(1(8', '1ˆ(1((˛.', '˜', '$!"!ˇ(1*(1', '1', ',1˝/9,', '/1(', '˜', '\'%!"!ˇ(1(1', '1,1˝/9,(6', '˜', ')%:(()', '˜', '+,+-(.$!"!ˇ()', '˜', '(!˙%!"!ˇ()5,5,((', '=( ...
Python version : Python 3.10.0
Short answer: no, not normally. Longer answer: maybe, but only in a controlled setting.
For your question there are three types of PDF, from most to least common: non-linearized, linearized, and custom streamed. Custom streaming requires paid-for libraries at both ends, so let's reject that.
When you download the start of a web-linearized PDF you will see the first page quickly, but you can't interrogate that page easily unless you save the download as Zer0page.pdf
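As a rough illustration of how to tell the two common types apart: a linearized PDF announces itself with a /Linearized key in its first object, near the top of the file. The sketch below checks only the first 1024 bytes of whatever has been downloaded so far. The function name is_linearized is hypothetical, and the check is a heuristic, not a full parser.

```python
def is_linearized(head: bytes) -> bool:
    """Heuristic: True if the start of a PDF contains a /Linearized key.

    A linearized file keeps its linearization parameter dictionary in the
    first object, so inspecting the first 1024 bytes is enough.
    """
    return b"/Linearized" in head[:1024]
```

With the code from the question, this could be called as is_linearized(r.data) once the response has arrived.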
For any viewer to interrogate pages in order, it needs the full object dictionary (the cross-reference table), which is often at the end of the fully downloaded PDF.
Your example link is to the most common type (non-linearized), so the address of "Page 0" is stored at the end of the file, which requires a full download. See here, and note the scrollbar position on the right: this is the PDF as seen by any editor, such as a Python extractor. All the important data for reading and extraction is at the end of the downloaded file (whether it is held in memory or not). As you can see, objects can be in any order: here 12 comes before 10, and 45 (the root of the file) comes after 11. Thus the first page (in your example, highlighted as 1 0 obj) could have any number and could easily be (and sometimes is) the last object to download. Normally you don't see a first page until the progress bar reaches the end.
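To make the "important data is at the end" point concrete: a PDF ends with a startxref keyword followed by the byte offset of the cross-reference data, and a reader has to see that offset before it can locate any object, including the first page. The sketch below pulls that offset out of the tail of a download. The name find_startxref is hypothetical, and the sketch assumes the tail bytes passed in really do cover the end of the file.

```python
import re


def find_startxref(tail: bytes) -> int:
    """Return the byte offset given after the last 'startxref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        # Typical for a partial download: the end of the file is missing
        raise ValueError("no startxref found; incomplete download or not a PDF")
    # Incrementally updated files contain several; the last one is current
    return int(matches[-1])
```

With the code from the question, find_startxref(r.data[-1024:]) would give the offset for the example file. A download that was cut off early has no such marker, which is why a reader cannot do anything useful with only the first part of a non-linearized file.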