xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence - sfeed.git


author	Hiltjo Posthuma <hiltjo@codemadness.org>	2021-01-22 01:11:19 +0100
committer	Hiltjo Posthuma <hiltjo@codemadness.org>	2021-01-22 01:11:19 +0100
commit	12b279581fbbcde2b36eb4b78d70a1c52d4a209a (patch)
tree	e16c1ba9e78cf945c3406451af2ba9c68b0e092a /util.c
parent	57d341d9826ff742b5f69cab8228d0d06c3997a3 (diff)

xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence

Simple way to reproduce:

	printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8

Result:

	iconv: (stdin):1:8: cannot convert

Output result:

	printf '<item><title>&#xdc00;</title></item>' | sfeed

Before:

	00000000  09 ed b0 80 09 09 09 09  09 09 09 0a              |............|
	0000000c

After:

	00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
	00000010  0a                                                |.|
	00000011

The entity is output as a literal string. This makes it easier to see what is
wrong and to debug the feed, and it is consistent with the current behaviour
for invalid named entities (&bla;).

An alternative could be a UTF-8 replacement symbol (codepoint 0xfffd).

Reference: https://unicode.org/faq/utf_bom.html , specifically:

"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?"

"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
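
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from xml.c: the names is_scalar_value() and encode_utf8() are hypothetical, and the sketch assumes the caller falls back to printing the entity text unchanged when encoding is refused. Without the surrogate check, 0xdc00 would be encoded as the 3 bytes ed b0 80 seen in the "Before" dump.

```c
#include <stddef.h>

/* Hypothetical helper (not from sfeed): returns 1 if cp is a Unicode
 * scalar value, i.e. within 0..0x10ffff and not a UTF-16 surrogate. */
static int
is_scalar_value(long cp)
{
	if (cp < 0 || cp > 0x10ffff)
		return 0;
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0; /* surrogate range: never valid in UTF-8 */
	return 1;
}

/* Hypothetical encoder (not from sfeed): writes cp as UTF-8 into buf,
 * which must hold at least 4 bytes. Returns the number of bytes written,
 * or 0 if cp must not be encoded. */
static size_t
encode_utf8(long cp, unsigned char *buf)
{
	if (!is_scalar_value(cp))
		return 0;

	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (unsigned char)(0xf0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
	return 4;
}
```

A caller that gets 0 back can copy the original entity text (for example "&#xdc00;") to the output as-is, which matches the "After" dump. Emitting the bytes for 0xfffd instead would be the replacement-symbol alternative mentioned above.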

Diffstat (limited to 'util.c')

0 files changed, 0 insertions, 0 deletions

