How can I remove non-printable, invisible characters from a string?
Ruby version: 2.4.1
2.4.1 :209 > product.name.gsub(/[^[:print:]]/,'.')
=> "Kanha"
2.4.1 :210 > product.name.gsub(/[^[:print:]]/,'.').length
=> 6
2.4.1 :212 > product.name.gsub(/[\u0080-\u00ff]/, '').length
=> 6
2.4.1 :214 > product.name.chars.reject { |char| char.ascii_only? and (char.ord < 32 or char.ord == 127) }.join.length
=> 6
2.4.1 :216 > product.name.gsub(/[^[:print:]]/i, '').length
=> 6
The word "Kanha" has 5 letters. However, there is a 6th character that is not printable. How can I remove it?
By googling and searching Stack Overflow I have already tried a few approaches, but as you can see, none of them helped.
It is causing problems when I try to integrate our data with other systems.
First, let's figure out what the offending character is:
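A minimal sketch of that inspection step, using String#codepoints. The literal below is an assumed stand-in for product.name, reconstructed from the codepoints discussed in this answer:

```ruby
# Assumption: stand-in for product.name, i.e. "Kanha" followed by the
# invisible character identified later in this answer.
name = "Kanha\u202C"

# String#codepoints returns the Unicode codepoint of every character.
codepoints = name.codepoints
p codepoints  # => [75, 97, 110, 104, 97, 8236]
```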
The first five codepoints are between 0 and 127, meaning they're ASCII characters. It's safe to assume they're the letters K-a-n-h-a, although this is easy to verify if you want:
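One way to check this, sketched here with Array#pack (the variable names are only illustrative):

```ruby
# The first five codepoints from the inspection above.
first_five = [75, 97, 110, 104, 97]

# "U*" packs each integer as a UTF-8 encoded character.
word = first_five.pack("U*")
puts word  # => Kanha
```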
That means the offending character is the last one, codepoint 8236. That's a decimal (base 10) number, though, and Unicode characters are usually listed by their hexadecimal (base 16) number. 8236 in hexadecimal is 202C (8236.to_s(16) # => "202c"), so we just have to google for U+202C.
Google very quickly tells us that the offending character is U+202C POP DIRECTIONAL FORMATTING and that it's a member of the "Other, Format" category of Unicode characters. According to Wikipedia, this category includes the soft hyphen, the joining control characters (ZWNJ and ZWJ), control characters that support bidirectional text, and language tag characters.
It also tells us that the "value" or code for the category is "Cf". If these sound like characters you want to remove from your string along with U+202C, you can use the \p{Cf} property in a Ruby regular expression. You can also use \P{Print} (note the capital P) as an equivalent to [^[:print:]].
See it on repl.it: https://repl.it/@jrunning/DutifulRashTag
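As a rough sketch of the \p{Cf} approach described above (the string literal is again an assumed stand-in for product.name):

```ruby
# Assumption: stand-in for product.name, "Kanha" plus U+202C.
name = "Kanha\u202C"

# Remove every character in the "Other, Format" (Cf) category.
cleaned = name.gsub(/\p{Cf}/, "")
p cleaned         # => "Kanha"
p cleaned.length  # => 5

# Narrower alternative: delete only U+202C and leave other
# format characters untouched.
only_pop_removed = name.delete("\u202C")
p only_pop_removed.length  # => 5
```

The \p{Cf} version is the broader of the two: it also strips characters such as the soft hyphen and zero-width joiners, which may or may not be what you want for your data.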