path: root/xml.c
Age | Commit message | Author
2021-01-27 | typofixes | Hiltjo Posthuma
2021-01-22 | xml.c: fix typo / regression in checking codepoint range for utf-16 surrogate pair | Hiltjo Posthuma
Regression in commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a.
0xdffff should be 0xdfff.

	printf '<item><title>&#x1f448;</title></item>' | sfeed

Before (bad): &#x1f448;
After: 👈
2021-01-22 | xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence | Hiltjo Posthuma
Simple way to reproduce:

	printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8

Result:

	iconv: (stdin):1:8: cannot convert

Output result:

	printf '<item><title>&#xdc00;</title></item>' | sfeed

Before:

	00000000  09 ed b0 80 09 09 09 09  09 09 09 0a              |............|
	0000000c

After:

	00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
	00000010  0a                                                |.|
	00000011

The entity is output as a literal string. This makes it easier to see what's wrong and to debug the feed, and it is consistent with the current behaviour for invalid named entities (&bla;).

An alternative could be a UTF-8 replacement symbol (codepoint 0xfffd).

Reference: https://unicode.org/faq/utf_bom.html , specifically:

"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?"

"A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]"
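The range check described in the two commits above can be sketched roughly as follows. This is not the code from xml.c: the name cp_to_utf8 and its signature are hypothetical, and the buffer is unsigned char only to keep the byte comparisons in the sketch simple. It returns 0 for anything that is not encodable, including the surrogate range 0xd800-0xdfff (note the upper bound 0xdfff, not 0xdffff).

```c
#include <stddef.h>

/* Hypothetical sketch: encode codepoint cp as UTF-8 into buf, which must
 * have room for at least 4 bytes. Returns the number of bytes written, or
 * 0 if cp is negative, above 0x10ffff or a UTF-16 surrogate. */
static size_t
cp_to_utf8(long cp, unsigned char *buf)
{
	if (cp < 0 || cp > 0x10ffff)
		return 0;
	/* surrogates have no valid UTF-8 form: upper bound is 0xdfff */
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0;

	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (unsigned char)(0xf0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
	return 4;
}
```

A caller that gets 0 back would leave the entity text (for example &#xdc00;) in the output unchanged, which matches the "After" dump above.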
2020-10-18 | xml.c: initialize i = 0 | Hiltjo Posthuma
Forgot it in the cleanup commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d
2020-10-09 | xml: remove unused code for sfeed | Hiltjo Posthuma
2020-10-09 | xml.c: remove buffering of comment data, which is unused anyway | Hiltjo Posthuma
2020-06-01 | fix typo | Hiltjo Posthuma
2020-01-24 | cleanup some includes | Hiltjo Posthuma
2020-01-18 | improve XML entity conversion | Hiltjo Posthuma
- return -1 for invalid XML entities.
- distinguish between NUL (&#0;) and invalid entities, although both are unwanted in sfeed.
- validate the number range more strictly and don't wrap to unsigned. Entities like "&#-1;" are handled as invalid now. "&#;" is also invalid instead of being treated the same as "&#0;".
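The rules listed above can be illustrated with a small parser for numeric entities. This is a sketch under assumptions, not the function from xml.c: the name parse_numeric_entity is hypothetical, it expects the complete entity string including '&' and ';', and it parses the digits by hand so that a sign, whitespace or an empty number is rejected.

```c
#include <stddef.h>

/* Hypothetical sketch: parse "&#65;" or "&#x1f448;".
 * Returns the codepoint (0 for "&#0;", so NUL stays distinguishable),
 * or -1 if the entity is invalid. */
static long
parse_numeric_entity(const char *e)
{
	long cp = 0;
	int base = 10, d;
	size_t ndigits = 0;

	if (e[0] != '&' || e[1] != '#')
		return -1;
	e += 2;
	if (*e == 'x') {
		base = 16;
		e++;
	}
	for (; *e != ';'; e++) {
		if (*e >= '0' && *e <= '9')
			d = *e - '0';
		else if (base == 16 && *e >= 'a' && *e <= 'f')
			d = *e - 'a' + 10;
		else if (base == 16 && *e >= 'A' && *e <= 'F')
			d = *e - 'A' + 10;
		else
			return -1; /* '-', whitespace, '\0', other junk */
		cp = cp * base + d;
		if (cp > 0x10ffff)
			return -1; /* out of range: never wraps around */
		ndigits++;
	}
	/* "&#;" has no digits; nothing may follow the ';' */
	if (!ndigits || e[1] != '\0')
		return -1;
	return cp;
}
```

Checking the range inside the loop keeps the value well within a 32-bit long, so no overflow handling is needed in this sketch.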
2019-11-22 | xml.c: upper-case named-entities are invalid in XML | Hiltjo Posthuma
Named entities are case-sensitive and in XML lower-case. (In HTML some of these are valid, although &APOS; is invalid there too.)

References:
- 4.6 Predefined entities: https://www.w3.org/TR/xml/#sec-predefined-ent
- In the definition of "match": https://www.w3.org/TR/xml/#dt-match
  "No case folding is performed."
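A case-sensitive lookup of the five predefined XML entities could look like the sketch below. The table layout and the name named_entity are hypothetical and not taken from xml.c; the point is only that strcmp() is used, so no case folding happens.

```c
#include <stddef.h>
#include <string.h>

/* The five predefined entities from the XML specification (section 4.6). */
static const struct {
	const char *name;
	int c;
} entities[] = {
	{ "&amp;",  '&'  },
	{ "&lt;",   '<'  },
	{ "&gt;",   '>'  },
	{ "&apos;", '\'' },
	{ "&quot;", '"'  },
};

/* Hypothetical sketch: return the character for a named entity, or -1 if
 * it is unknown. The comparison is exact, so "&AMP;" does not match. */
static int
named_entity(const char *e)
{
	size_t i;

	for (i = 0; i < sizeof(entities) / sizeof(entities[0]); i++) {
		if (!strcmp(e, entities[i].name))
			return entities[i].c;
	}
	return -1;
}
```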
2019-06-11 | xml: improve cdata and comment callback logic | Hiltjo Posthuma
It used to call both handlers twice at the end for "-->" (comment) or "]]>" (CDATA) with the data "" and length 0. Now the handler is only called when the data is non-empty. The start and end handlers can still be used.
2019-03-16 | xml: write x->getnext to a default GETNEXT macro | Hiltjo Posthuma
This makes it possible to override x->getnext so it expands to global context parsing, and allows the compiler to inline it. Also remove the check for whether the x->getnext function exists (just crash hard).
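The idea of a default macro that can be overridden at build time can be sketched as follows. The exact macro definition in xml.c is not shown in this log, so the form GETNEXT(x), the struct parser type and the helper names here are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical parser context with a reader callback. */
struct parser {
	int (*getnext)(void);
};

/* Default: go through the function pointer. A build could define GETNEXT
 * beforehand (for example to expand to getchar()) so that the compiler
 * can inline the read. There is no NULL check on purpose. */
#ifndef GETNEXT
#define GETNEXT(x) ((x)->getnext())
#endif

/* Example reader over a global string, for demonstration only. */
static const char *sketch_src = "";

static int
str_getnext(void)
{
	return *sketch_src ? (unsigned char)*sketch_src++ : EOF;
}

/* Count '<' characters, reading every byte through GETNEXT. */
static size_t
count_lt(struct parser *x)
{
	size_t n = 0;
	int c;

	while ((c = GETNEXT(x)) != EOF) {
		if (c == '<')
			n++;
	}
	return n;
}
```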
2019-01-08 | xml: remove unnecessary checks | Hiltjo Posthuma
- reduce amount of data to check.
- remove unnecessary checks from (now) internal functions.
2018-12-02 | XML tag parse improvements for PI and end tags | Hiltjo Posthuma
- Stricter parsing of tags, no whitespace stripping after <.
- For end tags the "internal" context x->tag would be "/sometag". Make sure this matches exactly with the parameter tag.
- Reset tagname after parsing an end tag.
- Make end tag handling more consistent.
- Remove temporary variable taglen.
2018-08-26 | xml: use ANSI types and struct initialization | Hiltjo Posthuma
long is at least 32 bits, codepointtoutf8() works with >= 32-bit types. Valid codepoint ranges are not larger than this. unsigned char is not needed because converted unicode bytes don't use this range. Tested all valid codepoints and output on amd64, i386 and SPARC64.
2018-08-23 | xml: remove TODO comments and add a note | Hiltjo Posthuma
2018-08-22 | xml: improve parsing of invalid attribute values separated by whitespace | Hiltjo Posthuma
It is invalid XML, but this allows parsing old HTML pages as well. For example:

	<input id=cb checked type="checkbox" title='checkbox' />

or

	<FONT FACE=wingdings SIZE=12><BLINK>oh hai</BLINK></FONT>

2018-08-22 | xml: improve handling of invalid long data entities | Hiltjo Posthuma
This also fixes an issue with truncated and missing data on invalid input.
2018-08-21 | xml: rewrite codepointtoutf8 function | Hiltjo Posthuma
No more converting to a uint32_t type. Just convert to a byte buffer. Tested on little- and big-endian. The code should hopefully be clearer too.
2018-08-21 | xml: don't reset internal tagname when parsing non-tag types like CDATA | Hiltjo Posthuma
... this affects "tags" starting with < such as CDATA and processing instructions.
2018-08-21 | xml: fix missing first byte when parsing a long incorrect attribute entity | Hiltjo Posthuma
... the entity had to be invalid (start with &) and longer than the buffer size. + tiny style fix.
2018-08-21 | xml: interface change: make some functions private | Hiltjo Posthuma
... this does not expose the uint* types either.
2018-08-21 | xml: increase allowed size of attribute names | Hiltjo Posthuma
2018-08-16 | XML parser: numeric entity: check unicode codepoint range | Hiltjo Posthuma
2018-03-11 | include <sys/types.h> for types size_t, ssize_t etc | Hiltjo Posthuma
This makes sure xml.c in particular can be compiled without further feature macros.
2018-03-11 | xml: improve comment parsing | Hiltjo Posthuma
Note that ---> is officially invalid XML, but we allow it anyway.
2018-03-11 | xml: fix parsing of cdata when a handler is unset | Hiltjo Posthuma
2018-03-11 | xml: improve CDATA parsing | Hiltjo Posthuma
Thanks Svyatoslav Mishyn for the feedback!
2017-12-24 | xml: make name entities static, minor clarifications | Hiltjo Posthuma
2016-04-10 | xml: stricter check of entity: must end with ';', ... | Hiltjo Posthuma
... zero output buffer if codepoint length is 0
2015-08-22 | xml: fix includes | Hiltjo Posthuma
2015-08-22 | xml: simplify XML reader | Hiltjo Posthuma
2015-08-16 | xml: change xml_parse_string to xml_parse_buf | Hiltjo Posthuma
Allow the parser to read '\0' bytes in the XML itself. Add a length parameter to specify the buffer size.
2015-08-14 | minor code-style improvements | Hiltjo Posthuma
2015-08-14 | xml: whoops, remove leftover xml_getnext_stdin | Hiltjo Posthuma
2015-08-14 | xml: separate reader context from parser | Hiltjo Posthuma
also:
- rename xmlparser_ prefix to xml_.
- make xml_parse public, this allows a custom reader like a direct mmap, see: XMLParser.getnext and (optionally) XMLParser.getnext_data.
- improve the README text.
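A reader that is separate from the parser, as described above, might be wired up as in the sketch below. Only the member names getnext and getnext_data come from the commit message; the callback signature, the struct layouts and the helper names are assumptions. The reader walks a buffer with an explicit length, which is also why '\0' bytes in the input are not a problem (compare the later xml_parse_buf change).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reader interface: a callback plus opaque context data. */
struct reader {
	int (*getnext)(void *);
	void *getnext_data;
};

/* Context for reading from a memory buffer (for example an mmap'd file). */
struct bufctx {
	const unsigned char *p;
	size_t len;
};

/* Return the next byte, or EOF when the buffer is exhausted. The length
 * decides where the data ends, not a '\0' terminator. */
static int
buf_getnext(void *data)
{
	struct bufctx *b = data;

	if (!b->len)
		return EOF;
	b->len--;
	return *b->p++;
}

/* Stand-in for a parser loop: consume everything and count the bytes. */
static size_t
drain(struct reader *r)
{
	size_t n = 0;

	while (r->getnext(r->getnext_data) != EOF)
		n++;
	return n;
}
```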
2015-08-08 | xml: move entity to namedentitystr() | Hiltjo Posthuma
2015-08-06 | xml: remove forced __inline__ attribute | Hiltjo Posthuma
2015-08-06 | general cleanups | Hiltjo Posthuma
2015-08-01 | xml: only allow full uppercase or full lowercase for entities | Hiltjo Posthuma
2015-07-31 | xml: fix xml_namedentitytostr loop | Hiltjo Posthuma
2015-07-31 | xml: fix missing include strings.h, for strncasecmp | Hiltjo Posthuma
2015-07-29 | improve includes (dont include headers in .h), fix build on Linux | Hiltjo Posthuma
2015-07-28 | improve code-style and consistency | Hiltjo Posthuma
2015-06-23 | xml: fix comment issue, improve cdata and comment while encountering separator | Hiltjo Posthuma
2015-06-22 | xml: fix cdata issue | Hiltjo Posthuma
2015-06-21 | separate xml specific code into xml.c | Hiltjo Posthuma
2015-06-21 | xml.c: fix empty cdata callback | Hiltjo Posthuma
2015-05-16 | xml: only call data handler if set | Hiltjo Posthuma
2015-05-16 | xml: call parse | Hiltjo Posthuma