I am using the following command to fetch the raw HTML of a web page and pass it to html2text, so that I get only the text content without the HTML tags:
wget -O - https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618 |html2text --ignore-links
It works just fine, consistently. Here is a sample of the expected output:
...
vscroll_wide
scrollbar_back_color
scrollbar_back_dark_color
scrollbar_button_color
scrollbar_button_dark_color
scrollbar_button_icon_color
scrollbar_button_icon_dark_color
scrollbar_button_hot_color
scrollbar_button_hot_dark_color
scrollbar_button_hot_icon_color
scrollbar_button_hot_icon_dark_color
scrollbar_button_down_color
scrollbar_button_down_dark_color
scrollbar_button_down_icon_color
scrollbar_button_down_icon_dark_color
...
But as soon as I attempt to save the output to a variable, or even pipe it to Select-Object for further processing, I get an error:
$text = wget -O - https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618 |html2text
#wget -O - https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618 |html2text --ignore-links |Select-Object -f 5
The error I am getting is:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\ralf\AppData\Local\Programs\Python\Python311\Scripts\html2text.exe\__main__.py", line 7, in <module>
File "C:\Users\ralf\AppData\Local\Programs\Python\Python311\Lib\site-packages\html2text\cli.py", line 330, in main
sys.stdout.write(h.handle(html))
File "C:\Users\ralf\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u21b3' in position 202140: character maps to <undefined>
I searched around for this error, but all of the solutions I am coming across deal with working around it inside a Python script. I need a way around this issue when piping or capturing stdout.
Why is PowerShell failing when redirecting the output?
PowerShell 7.4 on Windows 11
The reason is that Python (which underlies html2text) - like many other Windows CLIs (console applications) - modifies its output behavior based on whether the output target is a console (terminal) or is redirected:

- In the former case, such CLIs use the Unicode version of the WinAPI WriteConsole function, meaning that any character from the global Unicode alphabet is accepted.

- In the latter case, CLIs must encode their output, and are expected to respect the legacy Windows OEM code page associated with the current console window, as reflected in the output from chcp.com and - by default - in [Console]::OutputEncoding inside a PowerShell session:

  - E.g., the OEM code page is 437 on US-English systems, and if the text to output contains characters that cannot be represented in that code page - which (for non-CJK locales) is a single-byte encoding limited to 256 characters in total - those characters cannot be output faithfully.

  - Python instead encodes redirected output using the system's active ANSI code page (e.g., 1252 on US-English systems) rather than the OEM code page (both of which are determined by the system's active legacy system locale, aka language for non-Unicode programs). However, like the OEM code page (in non-CJK locales), ANSI code pages too are limited to 256 characters, and trying to encode a character outside that set results in the error you saw.

To avoid this limitation, modern CLIs increasingly encode their output using UTF-8 instead, either by default (e.g., Node.js), or on an opt-in basis (e.g., Python).
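The 256-character limitation can be reproduced from PowerShell itself. The following is only a minimal sketch, assuming PowerShell 7, where code page 1252 should be available via [System.Text.Encoding]::GetEncoding(); the variable names are illustrative. It tries to encode the character from the error message, U+21B3, first with Windows-1252 and then with UTF-8:

```powershell
# The character Python complained about: U+21B3 (DOWNWARDS ARROW WITH TIP RIGHTWARDS)
$char = [string][char]0x21B3

# Windows-1252 with strict fallbacks, so that an unencodable character raises an
# error instead of being silently replaced with '?'.
$cp1252 = [System.Text.Encoding]::GetEncoding(1252, [System.Text.EncoderFallback]::ExceptionFallback, [System.Text.DecoderFallback]::ExceptionFallback)

$cp1252Failed = $false
try {
  $null = $cp1252.GetBytes($char)
}
catch {
  # U+21B3 is not among the characters Windows-1252 can represent.
  $cp1252Failed = $true
}

# UTF-8 can encode any Unicode character; show the resulting bytes as hex.
$utf8Bytes = [System.Text.UTF8Encoding]::new().GetBytes($char)
$utf8Hex = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '

"cp1252 failed: $cp1252Failed"
"UTF-8 bytes:   $utf8Hex"
```

The strict fallback mirrors what Python does by default (raise an error); with .NET's default fallback the character would instead be replaced rather than rejected.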
In the context of PowerShell, an external program's (stdout) output is considered redirected (not targeting the console/terminal) in one of the following cases:

- capturing external-program output in a variable ($text = wget ..., as in your case), or using it as part of an expression (e.g., "foo" + (wget ...))

- relaying external-program output via the pipeline (e.g., wget ... | ...)

- in Windows PowerShell and PowerShell (Core) 7 up to v7.3.x: also with >, the redirection operator; in v7.4+, using > directly on an external-program call now passes the raw bytes through to the target file.

That is, in all those cases PowerShell decodes the external program's output into .NET strings, based on the encoding stored in [Console]::OutputEncoding.

In the case at hand, this stage wasn't even reached, because Python itself wasn't able to encode its output.
The solution in your case is therefore two-pronged, as suggested by zett42:

- Make sure that html2text outputs UTF-8-encoded text. html2text is a Python-based script/executable, so (temporarily) set $env:PYTHONUTF8=1 before invoking it.

- Make sure that PowerShell interprets the output as UTF-8: (temporarily) set [Console]::OutputEncoding to [System.Text.UTF8Encoding]::new()

To put it all together:
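A minimal sketch of the two steps combined, assuming wget and html2text are on the PATH as in the question and that the session runs in a regular console window. The variable names $prevPyUtf8 and $prevEncoding are illustrative only, and quoting the URL is a precaution rather than a requirement:

```powershell
# Save the current settings so they can be restored afterwards.
$prevPyUtf8   = $env:PYTHONUTF8
$prevEncoding = [Console]::OutputEncoding

try {
  # Prong 1: make Python (and therefore html2text) emit UTF-8.
  $env:PYTHONUTF8 = 1

  # Prong 2: make PowerShell decode external-program output as UTF-8.
  [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

  # Same command as in the question, now captured in a variable.
  $text = wget -O - 'https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618' | html2text --ignore-links
}
finally {
  # Restore the previous settings; assigning $null removes the environment variable.
  $env:PYTHONUTF8 = $prevPyUtf8
  [Console]::OutputEncoding = $prevEncoding
}

# Further processing now works on the captured lines.
$text | Select-Object -First 5
```

Restoring the settings in a finally block keeps the change scoped to this one call, so other commands in the session continue to see the original encoding.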
Note:
- When you pipe data from PowerShell to an external program (not the case here), PowerShell uses the $OutputEncoding preference variable to encode it, in which case you may have to (temporarily) change $OutputEncoding too; it defaults to ASCII(!) in Windows PowerShell, and to (BOM-less) UTF-8 in PowerShell (Core) 7 - which is problematic in both cases, as it doesn't match the default value of [Console]::OutputEncoding. For instance, to both send data as UTF-8 and to decode it as such, you can (temporarily) set both $OutputEncoding and [Console]::OutputEncoding to [System.Text.UTF8Encoding]::new().

- It is possible to configure a given system to use UTF-8 system-wide by default, which would make things just work without extra effort in this case (though in Windows PowerShell you may situationally still have to set $OutputEncoding); however, this configuration, which sets the system locale in a way that sets both the OEM and the ANSI code page to 65001 (UTF-8), has far-reaching consequences that may break existing scripts - see this answer.
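For the first note, sending and receiving as UTF-8 might look as follows. This is a sketch only, assuming an interactive console session; someProgram is a hypothetical placeholder for whatever external program receives the piped data, and the $prev... variable names are illustrative:

```powershell
# Remember the current settings so they can be restored.
$prevOutputEncoding  = $OutputEncoding
$prevConsoleEncoding = [Console]::OutputEncoding

try {
  # Encode data SENT to external programs as UTF-8, and decode data
  # RECEIVED from them as UTF-8 as well.
  $OutputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

  # Hypothetical call; replace someProgram with the actual external program:
  # 'text with non-ASCII characters' | someProgram
}
finally {
  # Restore both settings.
  $OutputEncoding = $prevOutputEncoding
  [Console]::OutputEncoding = $prevConsoleEncoding
}
```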