PDF documents don't require space characters to be present in page content streams to visually separate words. As a consequence, the font program may lack a glyph for the space character altogether. PDF-compliant viewers appear to use font metrics and the text state to infer an appropriate word-spacing width, then check it against character positions to insert the missing spaces when text is selected or copied. Unfortunately, the PDF specification does not seem to explain how the word-spacing width should be computed in such cases. While pdf.js appears to hard-code a size for detecting word breaks, my empirical tests suggest that Acrobat Reader/Pro uses a different approach. What could that heuristic be?
How does Adobe Acrobat break words in PDF documents when copying text?
174 views. Asked by ceztko
The question is very technical, and answering it requires either insider knowledge of Adobe Acrobat internals or experience implementing PDF text extraction with a robust set of test cases compared against Adobe's results. For anyone interested: assuming a robust word-break algorithm for text extraction can be implemented by inferring a spacing width and comparing it against glyph positions, the heuristic I'm currently testing is the following:

unscaledSpacingWidth = (average of non-zero glyph widths obtained from the /W or /Widths arrays) / 7

where 7 is an arbitrary constant that seems to work well and matches Adobe Acrobat's results closely enough in the limited set of samples I tested. By comparison, pdf.js simply picks a hard-coded value of 0.1 PDF points. The resulting spacing width is then scaled according to the font size and the rest of the text state.
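
The heuristic above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not Acrobat's actual algorithm: the function names are hypothetical, the widths are assumed to be expressed in thousandths of a text space unit (as is the case for non-Type 3 fonts), the scaling step only accounts for font size and horizontal scaling, and the rule "insert a space when the gap is at least the spacing width" is my guess at how the comparison against glyph positions would be done.

```python
from statistics import mean

# Assumption: widths from /W or /Widths are in 1/1000 of a text space unit
# (true for non-Type 3 fonts; Type 3 fonts use their own font matrix).
GLYPH_UNITS_PER_TEXT_UNIT = 1000.0

# The arbitrary constant from the heuristic described above.
SPACING_DIVISOR = 7.0


def unscaled_spacing_width(widths, divisor=SPACING_DIVISOR):
    """Average of the non-zero glyph widths, divided by the constant."""
    non_zero = [w for w in widths if w != 0]
    if not non_zero:
        return 0.0  # no usable metrics: the caller needs a fallback
    return mean(non_zero) / divisor


def scaled_spacing_width(unscaled, font_size, horizontal_scaling=1.0):
    """Scale the width by font size and horizontal scaling (1.0 = 100%).

    Assumption: other text state and matrix effects are ignored here.
    """
    return unscaled / GLYPH_UNITS_PER_TEXT_UNIT * font_size * horizontal_scaling


def needs_space(prev_glyph_end_x, next_glyph_start_x, spacing_width):
    """Hypothetical check: report a word break when the horizontal gap
    between two consecutive glyphs reaches the inferred spacing width."""
    gap = next_glyph_start_x - prev_glyph_end_x
    return spacing_width > 0 and gap >= spacing_width
```

For example, a font whose non-zero widths average 600 units gives an unscaled spacing width of about 85.7 units, which at a font size of 14 becomes 1.2 text space units; any gap between consecutive glyphs at least that wide would then be treated as a word break.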