author | Hiltjo Posthuma <hiltjo@codemadness.org> | 2021-01-22 01:11:19 +0100 |
---|---|---|
committer | Hiltjo Posthuma <hiltjo@codemadness.org> | 2021-01-22 01:11:19 +0100 |
commit | 12b279581fbbcde2b36eb4b78d70a1c52d4a209a (patch) | |
tree | e16c1ba9e78cf945c3406451af2ba9c68b0e092a | |
parent | 57d341d9826ff742b5f69cab8228d0d06c3997a3 (diff) |
xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence
Simple way to reproduce:
printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8
Result:
iconv: (stdin):1:8: cannot convert
Output result:
printf '<item><title>&#xdc00;</title></item>' | sfeed
Before:
00000000 09 ed b0 80 09 09 09 09 09 09 09 0a |............|
0000000c
After:
00000000 09 26 23 78 64 63 30 30 3b 09 09 09 09 09 09 09 |.&#xdc00;.......|
00000010 0a |.|
00000011
The entity is output as a literal string. This makes it easier to see what's
wrong and to debug the feed, and it is consistent with the current behaviour
for invalid named entities (&bla;). An alternative could be a UTF-8 replacement
symbol (codepoint 0xfffd).
Reference: https://unicode.org/faq/utf_bom.html , specifically:
"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
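To illustrate the FAQ answer, here is a minimal sketch of a UTF-8 encoder that treats surrogate code points as an error instead of emitting a 3-byte sequence. The names is_scalar_value and encode_utf8 are hypothetical and are not taken from sfeed. Without the surrogate check, 0xdc00 would be encoded as ed b0 80, which are the bytes shown in the "Before" dump.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if cp is in the Unicode range and is
 * not a UTF-16 surrogate (0xd800..0xdfff), otherwise 0. */
int
is_scalar_value(long cp)
{
	if (cp < 0 || cp > 0x10ffff)
		return 0;
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0; /* surrogate: no well-formed UTF-8 encoding */
	return 1;
}

/* Hypothetical encoder: writes 1 to 4 bytes to buf and returns the
 * length, or 0 if cp cannot be encoded as well-formed UTF-8. */
size_t
encode_utf8(long cp, unsigned char *buf)
{
	if (!is_scalar_value(cp))
		return 0;
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (unsigned char)(0xf0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
	return 4;
}
```

A caller that gets 0 back can then decide how to report the error, for example by keeping the original input text or by substituting codepoint 0xfffd.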
-rw-r--r-- | xml.c | 3 |
1 files changed, 2 insertions, 1 deletions
@@ -252,7 +252,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
 	else
 		l = strtol(e, &end, 10);
 	/* invalid value or not a well-formed entity or invalid code point */
-	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
+	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
+	    (l >= 0xd800 && l <= 0xdfff))
 		return -1;
 	len = codepointtoutf8(l, buf);
 	buf[len] = '\0';
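
The validation in this hunk can be tried in isolation. Below is a minimal sketch, assuming the input points just past the "&#" of a numeric entity. The function parse_numeric_entity and its signature are hypothetical and only mirror the checks of the hunk; the surrogate range used is 0xd800 to 0xdfff inclusive.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not sfeed's API. `e` points just past "&#",
 * for example "xdc00;" or "8364;". On success the code point is stored
 * in *cp and 0 is returned. Returns -1 for a malformed entity, an
 * out-of-range value or a UTF-16 surrogate.
 * Note: strtol() also accepts leading whitespace and a sign; a stricter
 * parser would reject those before calling it. */
int
parse_numeric_entity(const char *e, long *cp)
{
	char *end;
	long l;
	int base = 10;

	if (*e == 'x') {
		e++; /* skip the hex marker */
		base = 16;
	}
	errno = 0;
	l = strtol(e, &end, base);

	/* conversion error, no digits, or missing terminating ';' */
	if (errno || e == end || *end != ';')
		return -1;
	/* outside the Unicode range */
	if (l < 0 || l > 0x10ffff)
		return -1;
	/* UTF-16 surrogates are not valid code points to encode */
	if (l >= 0xd800 && l <= 0xdfff)
		return -1;

	*cp = l;
	return 0;
}
```

When -1 is returned, the caller can write the entity text unchanged, which matches the literal "&#xdc00;" seen in the "After" dump.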