I scraped some text from a Taiwanese website, stripped the HTML, and kept only what I needed as txt files. The content of the txt files displays correctly in Firefox/Chrome. With Python 3, if I do f = open(text_file).read() I get this error:
'utf-8' codec can't decode byte 0xa1 in position 29: invalid start byte
ETA: I use Ubuntu, so I'm happy with any solution in Python or in the terminal!
And if I do f = codecs.open(os.path.join(path, 'my_text.txt'), 'r', encoding='Big5') and then read() I get this message:
'big5' codec can't decode byte 0xf9 in position 1724: illegal multibyte sequence
I only need the Chinese characters. How can I keep only those encoded as Big5? This would get rid of the error, yes?
The builtin open() function has an errors parameter.
It is possible that your file uses some other character encoding or even (if the code that saves the text is buggy) a mixture of several character encodings.
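A minimal sketch of the errors parameter in use, assuming the file is mostly Big5 with a few stray bytes. The helper name read_text_lenient and its defaults are hypothetical, chosen only for illustration. Note that errors="ignore" silently drops undecodable bytes, so some characters may be lost; errors="replace" would mark them with U+FFFD instead so you can see where they were.

```python
def read_text_lenient(path, encoding="big5", errors="ignore"):
    """Read a text file, skipping bytes that the codec cannot decode.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # errors="ignore" drops undecodable bytes; errors="replace" would
    # insert U+FFFD (the replacement character) in their place instead.
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()
```

Called as read_text_lenient(os.path.join(path, 'my_text.txt')), this should return the decodable text without raising, at the cost of discarding whatever the codec could not map.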
You can see which encoding your browser uses, e.g., in Chrome: "More tools -> Encoding".
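If the browser does not tell you, one option is to try a few candidate codecs in strict mode and see which one decodes the raw bytes without errors. This is only a sketch: the function name guess_encoding and the candidate list are assumptions (cp950 and big5hkscs are Big5 variants that ship with Python), and a clean decode does not prove the guess is right, so the result should still be checked by eye.

```python
def guess_encoding(data, candidates=("utf-8", "big5", "cp950", "big5hkscs")):
    """Return (encoding, text) for the first candidate that decodes
    `data` without errors, or None if every candidate fails.

    Hypothetical helper: the candidate list is an assumption, not a rule.
    """
    for encoding in candidates:
        try:
            # Strict decoding: any undecodable byte raises UnicodeDecodeError.
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None
```

To use it, read the file in binary mode, e.g. open(text_file, 'rb').read(), and pass the bytes in. If it returns None, the file may contain a mixture of encodings, and falling back to the errors parameter is probably the pragmatic choice.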