tesseract: Specifying (German) language reduces accuracy - what am I missing?

I am implementing a script to convert PGS (i.e. image-based) subtitles to SRT (i.e. text-based) subtitles and I use tesseract for OCR (btw, what a great tool!).

After extracting the subtitle phrases as images and applying some pre-processing, I get decent results. For German subtitles, I have to specify the language (-l deu) to have umlauts properly detected. Interestingly, with -l deu I get some obviously wrong results for phrases that are recognized correctly when I specify English or no language at all:

For example, this image with a German phrase (eng: ... let's go.)

[image: the German subtitle phrase]

gives the following results:

  • without language specified: ... lass uns gehen.
  • with -l eng: ... lass uns gehen
  • with -l deu: ...|48s uns gehen.
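
For reference, the comparison above can be reproduced with a small script along these lines. This is only a minimal sketch, not my exact script: the file name `phrase.png` and the helper names (`build_command`, `run_ocr`, `compare_languages`) are hypothetical, and it assumes the `tesseract` binary is on the PATH and that the `deu` and `eng` traineddata files are installed.

```python
import subprocess
from typing import Dict, List, Optional, Sequence


def build_command(image_path: str, lang: Optional[str] = None) -> List[str]:
    """Build a tesseract CLI call that writes the recognized text to stdout."""
    # "stdout" as the output base makes tesseract print the text instead of
    # writing it to a .txt file.
    cmd = ["tesseract", image_path, "stdout"]
    if lang is not None:
        # Omitting -l entirely lets tesseract fall back to its default language.
        cmd += ["-l", lang]
    return cmd


def run_ocr(image_path: str, lang: Optional[str] = None) -> str:
    """Run tesseract on one image and return the stripped text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.strip()


def compare_languages(
    image_path: str,
    langs: Sequence[Optional[str]] = (None, "eng", "deu"),
) -> Dict[str, str]:
    """Return the OCR text per language setting; None means no -l option."""
    return {lang or "default": run_ocr(image_path, lang) for lang in langs}
```

Calling `compare_languages("phrase.png")` returns one entry per language setting, which makes it easy to diff the outputs for every extracted phrase. The same helper also accepts combined models such as `"deu+eng"`, since `-l` takes several languages joined with `+`; I have not verified whether that helps in this case.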

Any ideas why this happens and how to improve the results? Could it be that the German model does not cope well with italic fonts?

I have played around with user word files, but could not observe any impact. I am using tesseract version 4.1.1 on Ubuntu 20.04.
