I'm writing a PDF-to-text solution using OCR in Go.
The libraries I use are Gosseract and Go-Fitz.
The program works until I try to load an image from memory with Gosseract:
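import (
	"bytes"
	"image/jpeg"
	"log"
	"strings"

	"github.com/gen2brain/go-fitz"
	"github.com/otiai10/gosseract/v2"
)

func ProcessDoc(file []byte) (string, error) {
	var text strings.Builder

	client := gosseract.NewClient()
	defer client.Close() // release the Tesseract handle

	doc, err := fitz.NewFromMemory(file)
	if err != nil {
		log.Println(err)
		return "", err
	}
	defer doc.Close()

	for n := 0; n < doc.NumPage(); n++ {
		// Render page n to an image.
		img, err := doc.Image(n)
		if err != nil {
			log.Println(err)
			return "", err
		}

		// Encode the page as JPEG in memory.
		buf := new(bytes.Buffer)
		if err := jpeg.Encode(buf, img, nil); err != nil {
			log.Println(err)
			return "", err
		}

		// Hand the encoded bytes to Tesseract and run OCR.
		if err := client.SetImageFromBytes(buf.Bytes()); err != nil {
			return "", err
		}
		res, err := client.Text()
		if err != nil {
			return "", err
		}
		text.WriteString(res)
	}
	return text.String(), nil
}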
Then I get this error:
JPEG parameter struct mismatch: library thinks size is 624, caller expects 656
Error in pixReadStreamJpeg: internal jpeg error
Error in pixReadMemJpeg: pix not read
Error in pixReadMem: jpeg: no pix returned
After a lot of searching, I learned that libleptonica and mupdf may have been built against different versions of jpeglib.h, but there is only one copy of that file on the whole system.
I should also note that I compiled libjpeg from source and then built libmupdf and libleptonica against that version of libjpeg to avoid any conflicts, but I still get the struct mismatch error.
Are you compiling mupdf from source?
By default, mupdf bundles its own copy of libjpeg, so it is possible that mupdf was compiled against its bundled copy and libleptonica against the system version.