tesseract: Specifying (German) language reduces accuracy - what am I missing?

Asked by simfinite on 20 December 2023 at 14:54 (66 views)

I am implementing a script to convert PGS (i.e. image-based) subtitles to SRT (i.e. text-based) subtitles and I use tesseract for OCR (btw, what a great tool!).

After extracting the subtitle phrases as images and applying some pre-processing, I get decent results. For German subtitles, I have to specify the language (-l deu) to have umlauts properly detected. Interestingly, with -l deu I get some obviously wrong results that are detected correctly if I specify English as the language or no language at all:

For example, this image with a German phrase (English: "... let's go.") gives the following results:

without language specified: ... lass uns gehen.
with -l eng: ... lass uns gehen
with -l deu: ...|48s uns gehen.
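
To make the comparison above easy to repeat, here is a minimal Python sketch that runs the tesseract command line on one image with several language settings and prints each result. The file name `subtitle.png`, the helper names, the page segmentation mode (`--psm 6`, a single uniform block of text) and the extra `deu+eng` combination are assumptions for illustration. They are not part of my actual script.

```python
import subprocess
from typing import List, Optional


def build_tesseract_cmd(image_path: str, lang: Optional[str] = None, psm: int = 6) -> List[str]:
    """Build a tesseract CLI call that writes the recognized text to stdout.

    lang=None leaves out -l, so tesseract falls back to its default language.
    psm=6 is an assumed page segmentation mode, not taken from the question.
    """
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    if lang is not None:
        cmd += ["-l", lang]
    return cmd


def ocr(image_path: str, lang: Optional[str] = None, psm: int = 6) -> str:
    """Run tesseract on one image and return the stripped text output."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.strip()


def compare_languages(image_path: str, langs=(None, "eng", "deu", "deu+eng")) -> None:
    """Print the OCR result for each language setting side by side."""
    for lang in langs:
        label = "default" if lang is None else f"-l {lang}"
        print(f"{label:>12}: {ocr(image_path, lang)}")
```

Calling `compare_languages("subtitle.png")` needs tesseract and the relevant language data installed. Whether combining models via `deu+eng` helps with this particular phrase is untested.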

Any ideas why this happens and how to improve the results? Could it be that the German model does not cope well with italic fonts?

I have played around with user word files, but could not observe any impact. Using tesseract version 4.1.1 on Ubuntu 20.04.
