Age | Commit message | Author |
|
|
|
surrogate pair
Regression in commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a
0xdffff should be 0xdfff.
printf '<item><title>👈</title></item>' | sfeed
Before (bad):
👈
After:
👈
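A minimal sketch of the range check this fix is about. The helper name is hypothetical and not taken from xml.c. With the upper bound mistyped as 0xdffff, a codepoint such as 0x1f448 (👈) would wrongly fall inside the range.

```c
/*
 * Hypothetical helper (not from xml.c): UTF-16 surrogates occupy
 * U+D800..U+DFFF inclusive and are not valid codepoints on their own.
 */
int
is_surrogate(long cp)
{
	return cp >= 0xd800 && cp <= 0xdfff;
}
```

Codepoints just outside the range (0xd7ff and 0xe000) and anything in the supplementary planes are not surrogates.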
|
|
Simple way to reproduce:
printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8
Result:
iconv: (stdin):1:8: cannot convert
Output result:
printf '<item><title>&#xdc00;</title></item>' | sfeed
Before:
00000000 09 ed b0 80 09 09 09 09 09 09 09 0a |............|
0000000c
After:
00000000 09 26 23 78 64 63 30 30 3b 09 09 09 09 09 09 09 |.&#xdc00;.......|
00000010 0a |.|
00000011
The entity is output as a literal string. This makes it easier to see what's
wrong and debug the feed, and it is consistent with the current behaviour for
invalid named entities (&bla;). An alternative could be the Unicode replacement
character (codepoint 0xfffd).
Reference: https://unicode.org/faq/utf_bom.html , specifically:
"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
|
|
Forgot it in the cleanup commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d
|
|
|
|
|
|
|
|
|
|
- return -1 for invalid XML entities.
- separate between NUL (&#0;) and invalid entities, although both are
unwanted in sfeed.
- validate the number range more strictly and don't wrap to unsigned.
Entities like "&#-1;" are handled as invalid now. "&#;" is also invalid
instead of being treated the same as "&#0;".
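The rules listed above can be sketched as follows. This is an illustration under assumptions, not the code from xml.c: the function name and the convention of passing the text between "&" and ";" are hypothetical.

```c
/*
 * Hypothetical sketch: e is the text between '&' and ';' of a numeric
 * entity, for example "#65" or "#x41".
 * Returns the codepoint, 0 for NUL ("#0"), or -1 when invalid.
 */
long
numeric_entity_value(const char *e)
{
	long cp = 0;
	int base = 10, d, ndigits = 0;

	if (*e++ != '#')
		return -1;
	if (*e == 'x') {
		base = 16;
		e++;
	}
	for (; *e; e++, ndigits++) {
		if (*e >= '0' && *e <= '9')
			d = *e - '0';
		else if (base == 16 && *e >= 'a' && *e <= 'f')
			d = *e - 'a' + 10;
		else if (base == 16 && *e >= 'A' && *e <= 'F')
			d = *e - 'A' + 10;
		else
			return -1; /* signs such as '-' and any other byte */
		cp = cp * base + d;
		if (cp > 0x10ffff)
			return -1; /* out of range, checked before it can overflow */
	}
	/* no digits at all ("#" or "#x") is invalid, not the same as "#0" */
	return ndigits ? cp : -1;
}
```

Parsing the digits by hand avoids the leniency of strtol(), which accepts a leading sign and leading whitespace.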
|
|
Named entities are case-sensitive, and the predefined XML ones are lower-case.
(In HTML some upper-case forms are valid, although &APOS; is invalid there too.)
References:
4.6 Predefined entities: https://www.w3.org/TR/xml/#sec-predefined-ent
In the definition of "match": https://www.w3.org/TR/xml/#dt-match
"No case folding is performed."
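A minimal sketch of a case-sensitive lookup for the five predefined XML entities. The table and function name are hypothetical and only illustrate the "no case folding" rule. They are not the implementation in xml.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table of the five predefined XML entities. */
static const struct {
	const char *name;
	char c;
} predefined[] = {
	{ "amp",  '&'  },
	{ "lt",   '<'  },
	{ "gt",   '>'  },
	{ "apos", '\'' },
	{ "quot", '"'  },
};

/* Returns the character for a predefined entity name, or -1 if unknown. */
int
named_entity_value(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(predefined) / sizeof(predefined[0]); i++) {
		/* strcmp() and not strcasecmp(): no case folding is performed */
		if (!strcmp(name, predefined[i].name))
			return predefined[i].c;
	}
	return -1;
}
```

With this lookup "amp" resolves to '&', while "AMP" and "APOS" are reported as unknown.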
|
|
it used to call both handlers twice at the end for "-->" (comment) or "]]>"
(CDATA) with the data "" and length 0.
Now they are only called when the data is non-empty. The start and end
handlers can still be used.
|
|
this allows overriding x->getnext to expand to global context parsing and
allows the compiler to inline it.
also remove the check whether the x->getnext function exists (just crash hard).
|
|
- reduce amount of data to check.
- remove unnecessary checks from (now) internal functions.
|
|
- Stricter parsing of tags, no whitespace stripping after <.
- For end tags the "internal" context x->tag would be "/sometag". Make sure
this matches exactly with the parameter tag.
- Reset tagname after parsing an end tag.
- Make end tag handling more consistent.
- Remove temporary variable taglen.
|
|
long is at least 32 bits, codepointtoutf8() works with >= 32-bit types. Valid
codepoint ranges are not larger than this. unsigned char is not needed because
converted unicode bytes don't use this range.
tested all valid codepoints and output on amd64, i386 and SPARC64.
|
|
|
|
It is invalid XML, but this allows parsing old HTML pages as well.
For example:
<input id=cb checked type="checkbox" title='checkbox' />
or
<FONT FACE=wingdings SIZE=12><BLINK>oh hai</BLINK></FONT>
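A rough sketch of how quoted and unquoted attribute values like the ones above could both be accepted. The function name and calling convention are assumptions made for illustration. This is not the parser code from xml.c.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical sketch: s points at the first character after '='.
 * Stores the start of the value in *val and returns its length.
 */
size_t
attr_value(const char *s, const char **val)
{
	size_t n = 0;

	if (*s == '"' || *s == '\'') {
		char q = *s++; /* remember which quote opened the value */

		*val = s;
		while (s[n] && s[n] != q)
			n++;
		return n;
	}
	/* unquoted (invalid XML, seen in old HTML): ends at whitespace or '>' */
	*val = s;
	while (s[n] && s[n] != '>' && !isspace((unsigned char)s[n]))
		n++;
	return n;
}
```

For the input `cb checked type="checkbox"` this yields the two-byte value `cb`. For `'checkbox' />` it yields `checkbox` without the quotes.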
|
|
this also fixes an issue with truncating and missing data on invalid input.
|
|
No more converting to a uint32_t type. Just convert to a byte buffer.
Tested on little- and big-endian.
Hopefully the code is clearer too.
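As an illustration of writing straight into a byte buffer, which does not depend on endianness, here is a minimal sketch. The name cp_to_utf8 and its return convention are assumptions. It is not the codepointtoutf8() from xml.c.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: encode codepoint cp as UTF-8 into buf, which must
 * hold at least 4 bytes. Returns the number of bytes written, or 0 when cp
 * is out of range or a UTF-16 surrogate (U+D800..U+DFFF).
 */
size_t
cp_to_utf8(long cp, char *buf)
{
	if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
		return 0;
	if (cp < 0x80) {
		buf[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (char)(0xc0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (char)(0xe0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (char)(0xf0 | (cp >> 18));
	buf[1] = (char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (char)(0x80 | (cp & 0x3f));
	return 4;
}
```

Because each byte is computed with shifts and masks on the value, the result is the same on little- and big-endian machines.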
|
|
... this affects "tags" starting with < such as CDATA and processing
instructions.
|
|
... the entity had to be invalid (start with &) and longer than the buffer
size.
+ tiny style fix.
|
|
... this does not expose the uint* types either.
|
|
|
|
|
|
This makes sure xml.c in particular can be compiled without further
feature macros.
|
|
note that ---> is officially invalid XML, but we allow it anyway.
|
|
|
|
thanks Svyatoslav Mishyn for the feedback!
|
|
|
|
... zero output buffer if codepoint length is 0
|
|
|
|
|
|
In the parser itself allow reading '\0' in the XML itself. Add a length
parameter to specify the buffer size.
|
|
|
|
|
|
also:
- rename xmlparser_ prefix to xml_.
- make xml_parse public, this allows a custom reader like a direct mmap,
see: XMLParser.getnext and (optionally) XMLParser.getnext_data.
- improve the README text.
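A custom reader over a memory region (for example an mmap'd file) could look roughly like the sketch below. The struct and function names are hypothetical and do not reproduce the actual XMLParser interface. The point is only that tracking an explicit length lets the reader return '\0' bytes as data.

```c
#include <stddef.h>
#include <stdio.h> /* EOF */

/* Hypothetical in-memory reader state; names are illustrative only. */
struct membuf {
	const unsigned char *data;
	size_t len; /* total size of the buffer in bytes */
	size_t off; /* current read offset */
};

/* Return the next byte (0..255, including '\0'), or EOF at the end. */
int
membuf_getnext(struct membuf *m)
{
	if (m->off >= m->len)
		return EOF;
	return m->data[m->off++];
}
```

The end of input is signalled by the length and not by a terminating NUL byte, so a '\0' inside the data is passed through like any other byte.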