I have a strange problem with converting special characters from HTML. I have a Django project where text is stored HTML-encoded in a MySQL database. This is necessary, because I don't want to lose any formatting of the text.
In a preliminary step I have to do some processing on the text, such as calculating positions, so I first need to convert the entities and strip all HTML tags. This is done with BeautifulSoup:
convertedText = str(BeautifulSoup(text.text, convertEntities=BeautifulSoup.HTML_ENTITIES))
convertedText = ''.join(BeautifulSoup(convertedText).findAll(text=True))
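For comparison, the same two steps (decode the entities, then keep only the text nodes) can be sketched with the Python 3 standard library alone. This is not the code from the project and does not use BeautifulSoup 3. The names TextExtractor and strip_tags are hypothetical, and the sample input assumes the stored value contains the entity &auml;.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &auml; before the text is passed to handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)


def strip_tags(html_text):
    """Hypothetical helper: return html_text with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Assumed sample value; ascii() makes the actual code points visible.
converted = strip_tags("<p>bassverst&auml;rker</p>")
print(ascii(converted))  # 'bassverst\xe4rker'
```

Printing with ascii() (or repr() on Python 2) on both servers is a quick way to see at which step the ä stops being U+00E4.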
On my test server (Django's default development server) everything works fine, but on my production server special characters are converted incorrectly.
An example:
Test server
MySQL-Query gives me: <p>bassverstärker</p>
is correctly converted to: bassverstärker
Production server
MySQL-Query gives me: <p>bassverstärker</p>
This is wrongly converted to: bassverst\ucc44rker
Somehow the ä is converted into \ucc44 and this results in a wrong character.
My configuration:
Test server:
- Django built-in solution (python manage.py runserver)
- BeautifulSoup 3.2.1
- Python 2.6.5
- Ubuntu 2.6.32-43-generic
Production server:
- Cherokee 1.2.101
- BeautifulSoup 3.2.1
- Python 2.7.3
- Ubuntu 3.2.0-32-generic
Because I don't know at which level the error occurs, I would like to ask if anybody can help me with this. Many thanks in advance.
I found a way to fix this. I didn't know that BeautifulSoup has the built-in method getText(). When I convert the HTML through getText(), everything works fine on both servers. Although this works, it would be interesting to know why the two servers behave differently with the example in the question.
However, thanks to all.