path: root/sfeed.c
Age | Commit message | Author
2022-02-05 | sfeed.c: code-style consistency: static functions | Hiltjo Posthuma
2022-02-05 | sfeed: small optimization | Hiltjo Posthuma
For feeds with lots of content data: Small performance improvement (~2%) on systems that implement putchar as a macro. On some systems using a function call for putchar it can be easier to replace with putchar_unlocked. (On an older MIPS32 VM changing putchar to putchar_unlocked makes writing 5x faster).
2022-02-04 | improve some code comments | Hiltjo Posthuma
2022-02-01 | parsetime: no need to check `tp`. it must be set | Hiltjo Posthuma
2022-01-19 | sfeed: extend the time range, use long long instead of time_t | Hiltjo Posthuma
This allows parsing the time as a number in the 64-bit range, even on 32-bit platforms. Note that the sfeed formatting tools can still truncate/wrap the value to time_t, which can be 32-bit.
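A minimal sketch of the idea, not sfeed's actual code: the helper names days_from_civil and to_unix_seconds are hypothetical, and the day count uses a well-known civil-calendar formula. All arithmetic is done in long long, so the result does not depend on the width of time_t.

```c
/* Hypothetical helper (not from sfeed): days since 1970-01-01 for a
 * proleptic Gregorian date, computed entirely in long long. */
long long
days_from_civil(long long y, int m, int d)
{
	long long era, yoe, doy, doe;

	y -= m <= 2; /* count January and February as part of the previous year */
	era = (y >= 0 ? y : y - 399) / 400;
	yoe = y - era * 400;                                  /* [0, 399] */
	doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; /* [0, 365] */
	doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          /* [0, 146096] */
	return era * 146097 + doe - 719468;
}

/* Hypothetical helper: seconds since the Unix epoch (UTC) as long long.
 * The LL constants force 64-bit arithmetic even where int is 16 or 32 bits. */
long long
to_unix_seconds(long long year, int mon, int day, int hour, int min, int sec)
{
	return days_from_civil(year, mon, day) * 86400LL +
	       hour * 3600LL + min * 60LL + sec;
}
```

For example, 2038-01-19T03:14:08Z is one second past what a signed 32-bit time_t can hold, but it fits without trouble in the long long result.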
2022-01-19 | sfeed: parsetime: allow leap second like 23:59:60 | Hiltjo Posthuma
Specified in RFC2822 Section 3.3. Date and Time Specification:
"[...] the time-of-day MUST be in the range 00:00:00 through 23:59:60 (the number of seconds allowing for a leap second; see [STD12]) [...]"
To test:
<entry><updated>2016-12-31T23:59:60Z</updated></entry>
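The range check quoted above can be sketched as follows. The function name valid_time_of_day is an assumption made for illustration and is not taken from sfeed.

```c
/* Hypothetical range check following the RFC2822 wording quoted above:
 * seconds may be 60 to allow for a leap second. Returns 1 if valid. */
int
valid_time_of_day(int hour, int min, int sec)
{
	return hour >= 0 && hour <= 23 &&
	       min >= 0 && min <= 59 &&
	       sec >= 0 && sec <= 60;
}
```

With this check 23:59:60 is accepted, while 23:59:61 and 24:00:00 are still rejected.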
2021-11-23 | sfeed.1: improve a comment for string_append | Hiltjo Posthuma
2021-11-23 | code-style: define fieldmap in the same order as the enum declaration | Hiltjo Posthuma
2021-07-11 | sfeed.c: parsetime: support short digit years for RSS pubDate fields (RFC822) | Hiltjo Posthuma
RSS (pubDate) uses RFC822 dates. This standard is obsoleted by RFC2822.
The RSS 2.0 spec says for the pubDate field:
"[...] All date-times in RSS conform to the Date and Time Specification of RFC 822, with the exception that the year may be expressed with two characters or four characters (four preferred)."
RFC822 section 5.1 describes the syntax with 2 digit years:
https://datatracker.ietf.org/doc/html/rfc822#section-5.1
It was obsoleted/fixed in RFC2822 section 4.3:
https://datatracker.ietf.org/doc/html/rfc2822#section-4.3
"Where a two or three digit year occurs in a date, the year is to be interpreted as follows: If a two digit year is encountered whose value is between 00 and 49, the year is interpreted by adding 2000, ending up with a value between 2000 and 2049. If a two digit year is encountered with a value between 50 and 99, or any three digit year is encountered, the year is interpreted by adding 1900."
In the real world I've seen all sites using RSS use the 4-digit format.
For historic context of changes and what feeds it might affect:
- RFC822 was published on 13 August 1982, obsoleted by RFC2822.
- RFC2822 was published in April 2001, obsoleted by RFC5322.
- RFC5322 was published in October 2008.
- RDF was started around 1996. It was published around 2004.
- March 15, 1999: RSS 0.90 (Netscape), published by Netscape and authored by Ramanathan Guha.
- July 10, 1999: RSS 0.91 (Netscape), published by Netscape and authored by Dan Libby.
- June 9, 2000: RSS 0.91 (UserLand), published by UserLand Software and authored by Dave Winer.
- Dec. 25, 2000: RSS 0.92, UserLand.
- Aug. 19, 2002: RSS 2.0, UserLand.
- July 15, 2003: RSS 2.0 (version 2.0.1), published by the Berkman Center for Internet & Society at Harvard Law School and authored by Dave Winer.
- July 15, 2003: RSS 2.0 (version 2.0.1-rv-1), published by the RSS Advisory Board.
- July 17, 2003: RSS 2.0 (version 2.0.1-rv-2), RSS Advisory Board.
- April 6, 2004: RSS 2.0 (version 2.0.1-rv-3), RSS Advisory Board.
- May 31, 2004: RSS 2.0 (version 2.0.1-rv-4), RSS Advisory Board.
- June 19, 2004: RSS 2.0 (version 2.0.1-rv-5), RSS Advisory Board.
- January 25, 2005: RSS 2.0 (version 2.0.1-rv-6), RSS Advisory Board.
- Aug. 12, 2006: RSS 2.0 (version 2.0.8), RSS Advisory Board.
- June 5, 2007: RSS 2.0 (version 2.0.9), RSS Advisory Board.
- Oct. 15, 2007: RSS 2.0 (version 2.0.10), RSS Advisory Board.
- March 30, 2009 (current): RSS 2.0 (version 2.0.11), RSS Advisory Board.
RSS history source: https://www.rssboard.org/rss-history
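The RFC2822 section 4.3 rule quoted in this commit message can be sketched as below. The function name expand_year and its ndigits parameter are hypothetical and only illustrate the rule; this is not sfeed's parsetime code.

```c
/* Hypothetical helper: interpret a year that was written with `ndigits`
 * digits, following the RFC2822 section 4.3 rule quoted above. */
long
expand_year(long year, int ndigits)
{
	if (ndigits == 2 && year >= 0 && year <= 49)
		return year + 2000; /* 00..49 -> 2000..2049 */
	if (ndigits == 2 || ndigits == 3)
		return year + 1900; /* 50..99 and any three digit year */
	return year;                /* four or more digits: use as is */
}
```

Under this rule a pubDate year of "21" becomes 2021, "99" becomes 1999, and a four-digit year is left unchanged.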
2021-07-06 | sfeed: change comment which reflects printing relative URLs behaviour | Hiltjo Posthuma
This URL printing behaviour was changed recently in commit f305b032bc19b4e81c0dd6c0398370028ea910ca
2021-07-06 | sfeed: printtrimmed function does not change or modify the buffer | Hiltjo Posthuma
Make it const char *.
2021-06-01 | portability and standards: add BSD-like err() and errx() functions | Hiltjo Posthuma
These are BSD functions.
- HaikuOS now compiles without having to use libbsd.
- Tested on SerenityOS (for fun), which doesn't have these functions (yet). With a small change to support wcwidth() sfeed works on SerenityOS.
2021-04-28 | fixup: a regression with RSS guid, by default ispermalink="true" | Hiltjo Posthuma
2021-04-28 | use the last href attribute value if there are multiple set | Hiltjo Posthuma
Input to reproduce:
<entry> <link href="https://codemadness.org/a" href="https://codemadness.org/b"/> </entry>
Old value: "https://codemadness.org/ahttps://codemadness.org/b"
New value: "https://codemadness.org/b"
The same applies to RSS <enclosure url="" />.
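One way to get the "last value wins" behaviour is to clear the attribute buffer whenever a new attribute starts, so that a repeated attribute overwrites instead of concatenating. The sketch below assumes a callback-style parser; struct attrbuf, attr_begin and attr_append are hypothetical names, not sfeed's.

```c
#include <string.h>

/* Hypothetical fixed-size buffer for one attribute value. */
struct attrbuf {
	char data[256];
	size_t len;
};

/* Called when an attribute starts: drop any value buffered earlier. */
void
attr_begin(struct attrbuf *b)
{
	b->len = 0;
	b->data[0] = '\0';
}

/* Called for each chunk of the attribute value: append, truncating
 * silently if the buffer is full. */
void
attr_append(struct attrbuf *b, const char *s, size_t n)
{
	size_t room = sizeof(b->data) - b->len - 1;

	if (n > room)
		n = room;
	memcpy(b->data + b->len, s, n);
	b->len += n;
	b->data[b->len] = '\0';
}
```

Feeding the two href values from the input above through attr_begin and attr_append in order leaves only "https://codemadness.org/b" in the buffer.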
2021-04-28 | add support for old/legacy Atom 0.3 feeds | Hiltjo Posthuma
This standard was a draft used around 2005-2006. Instead of the fields "published" and "updated" it used "issued" (mandatory field) and "modified" (optional). Add support for them, giving preference to the Atom 1.0 fields and to creation dates first. I don't know any real-life examples that still use this though.
Some references:
- http://rakaz.nl/2005/07/moving-from-atom-03-to-10.html
- https://www.dokuwiki.org/syndication (rss_type "atom" parameter value).
- https://support.google.com/merchants/answer/160598?hl=en
2021-04-28 | improve "ispermalink", "rel" and "type" attribute handling/buffering | Hiltjo Posthuma
2021-04-28 | improve content-type "type" attribute handling/buffering | Hiltjo Posthuma
2021-04-27 | sfeed.c: detect the proper mime-type for XHTML | Hiltjo Posthuma
Reference: https://www.w3.org/2003/01/xhtml-mimetype/
2021-04-24 | fix a comment code-style | Hiltjo Posthuma
This fix is very important *ahem*.
2021-03-01 | util: improve/refactor URI parsing and formatting | Hiltjo Posthuma
Removed/rewritten the functions: absuri, parseuri, and encodeuri() for percent-encoding.
The functions are now split separately with the following purpose:
- uri_format: format struct uri into a string.
- uri_hasscheme: quick check if a string is absolute or not.
- uri_makeabs: make a URI absolute using a base uri and the original URI.
- uri_parse: parse a string into a struct uri.
The following URLs are better parsed:
- URLs with extra "/"'s in the path prepended are kept as is, no "/" is added either for empty paths.
- URLs like "http://codemadness.org" are not changed to "http://codemadness.org/" anymore (paths are kept as is, unless they are non-empty and do not start with "/").
- Paths are not percent-encoded anymore.
- URLs with userinfo field (username, password) are parsed, like: ftp://user:password@[2001:db8::7]:2121/rfc/rfc1808.txt
- Non-authoritative URLs like mailto:some@email.org, magnet URIs, ISBN URIs/urn, like: urn:isbn:0-395-36341-1 are allowed and parsed correctly.
- Both local (file:///) and non-local (file://) are supported.
- Specifying a base URL with a port will now only use it when the relative URL has no host and port set and follows RFC3986 5.2.2 more closely.
- Parsing numeric port: parse as signed long and check <= 0, empty port is allowed.
- Parsing URIs containing query, fragment, but no path separator (/) will now parse the component properly.
For sfeed:
- Parse the baseURI only once (no need to do it every time for making absolute URIs).
- If a link/enclosure is absolute already or if there is no base URL specified then just print the link directly.
There have also been other small performance improvements related to handling URIs.
References:
- https://tools.ietf.org/html/rfc3986
- Section "5.2.2. Transform References" has also been helpful.
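A quick "is this string absolute" check like the one described for uri_hasscheme can be sketched from the RFC3986 scheme grammar (a letter, followed by letters, digits, "+", "-" or ".", then ":"). The function below is named has_scheme to make clear it is an illustration and not the project's uri_hasscheme. It assumes the "C" locale for isalpha() and isalnum().

```c
#include <ctype.h>

/* Hypothetical check: returns 1 if `s` starts with an RFC3986 scheme
 * followed by ':', which makes it an absolute URI, else 0. */
int
has_scheme(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	if (!isalpha(*p))
		return 0; /* a scheme must start with a letter */
	for (p++; isalnum(*p) || *p == '+' || *p == '-' || *p == '.'; p++)
		;
	return *p == ':';
}
```

With this check "http://codemadness.org" and "mailto:some@email.org" count as absolute, while "/rfc/rfc1808.txt" and "//host/path" are treated as relative references that still need a base URI.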
2021-02-04 | sfeed.c: fix time parsing regression with non-standard date format | Hiltjo Posthuma
The commit that introduced the regression was:
commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:55:35 2020 +0200
simplify time parsing
Noticed on a RSS feed with the following date:
<pubDate>2021-02-03 05:13:03</pubDate>
This format is non-standard, but sfeed should support this.
A standard format would be (for Atom): 2021-02-03T05:13:03Z
Partially revert it.
2021-01-22 | sfeed: fix regression with parsing content fields | Hiltjo Posthuma
This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200.
After a content tag was parsed the "iscontenttag" variable was not reset.
This caused 2 regressions:
- It ignored other tags such as links after it.
- It incorrectly set the content-type of a lesser priority field.
Thanks to pazz0 for reporting it!
2020-10-22 | Do not change the referenced matched tag data (from gettag()). | Hiltjo Posthuma
Fixes a regression introduced in the refactor in commit e43b7a48b08a6bbcb4e730e80395b3257681b33e
Now copy the data by value. This structure is small and no performance regression has been seen.
The regression occurred because the tag ID was modified, which made subsequently parsed tags of this type behave strangely:
ctx.tag->id = RSSTagGuidPermalinkTrue;
Input data to reproduce:
<rss> <channel> <item> <guid isPermaLink="false">https://def/</guid> </item> <item> <guid>https://abc/</guid> </item> </channel> </rss>
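The difference between keeping a pointer into the lookup table and copying the matched entry can be sketched as follows. All names here (struct tag, tag_table, struct parse_ctx, select_tag, the enum values) are hypothetical stand-ins and do not reproduce sfeed's definitions.

```c
/* Hypothetical tag IDs and lookup table. */
enum { TagGuid = 1, TagGuidPermalinkTrue = 2 };

struct tag {
	const char *name;
	int id;
};

static const struct tag tag_table[] = {
	{ "guid", TagGuid },
};

/* The parser context holds its own copy of the matched tag, not a
 * pointer into tag_table. */
struct parse_ctx {
	struct tag tag;
};

void
select_tag(struct parse_ctx *ctx, const struct tag *found, int permalink)
{
	ctx->tag = *found; /* copy by value: the table entry stays untouched */
	if (permalink)
		ctx->tag.id = TagGuidPermalinkTrue; /* only the copy changes */
}
```

Because only the copy is changed, a later lookup of "guid" still yields the original ID, which is the behaviour this commit restores.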
2020-10-12 | add a comment about the intended date priority | Hiltjo Posthuma
2020-10-12 | Revert "RSS: give Dublin Core <dc:date> higher priority over <pubDate>" | Hiltjo Posthuma
This reverts commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2.
2020-10-12 | simplify time parsing | Hiltjo Posthuma
2020-10-12 | remove unneeded check for NUL terminator | Hiltjo Posthuma
2020-10-12 | RSS: give Dublin Core <dc:date> higher priority over <pubDate> | Hiltjo Posthuma
This way dc:date could be the updated time of the item. For Atom there is <published> and <updated> with the same logic.
2020-10-12 | parse categories, add multiple field values support (for categories) | Hiltjo Posthuma
Fields with multiple values are separated by '|'. In the future multiple enclosure support might be added. The categories tags are now parsed. This feature is useful for filtering and categorizing. Parsing of nested tags such as <author><name> has been improved. This code has been refactored. RSS <guid> isPermaLink is now handled differently also and will now prefer a permalink with "true" (link) over the ID. In practise multiple <guid> in an item does not happen.
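Joining multiple values into one '|'-separated field can be sketched like this. The helper append_value is hypothetical, uses a fixed caller-supplied buffer for brevity, and is not how sfeed stores its fields.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append `value` to the NUL-terminated string in
 * `buf` (capacity `bufsiz`), inserting '|' between values.
 * Returns 0 on success, -1 if the result would not fit. */
int
append_value(char *buf, size_t bufsiz, const char *value)
{
	size_t len = strlen(buf);
	int n;

	n = snprintf(buf + len, bufsiz - len, "%s%s", len ? "|" : "", value);
	if (n < 0 || (size_t)n >= bufsiz - len) {
		buf[len] = '\0'; /* undo the partial write */
		return -1;
	}
	return 0;
}
```

Appending "news" and then "tech" to an empty buffer yields "news|tech", which a consumer can split on '|' again for filtering.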
2020-10-09 | sfeed: parse day with max 2 digits (instead of 4) | Hiltjo Posthuma
2020-10-09 | sfeed: support the ISO8601 time format without separators | Hiltjo Posthuma
For example "19720229T132245Z" is now supported.
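A minimal sketch of parsing this compact form with fixed field widths. It uses sscanf() for brevity, which is an illustrative choice and not a claim about how sfeed parses dates; struct datetime and parse_compact_iso8601 are hypothetical names. Note that %d also accepts leading whitespace and a sign, so a stricter parser would check each digit itself.

```c
#include <stdio.h>

/* Hypothetical container for the parsed fields. */
struct datetime {
	int year, mon, day, hour, min, sec;
};

/* Parse "YYYYMMDDTHHMMSS" (a trailing "Z" or offset is ignored here).
 * Returns 1 if all six fields were read, else 0. */
int
parse_compact_iso8601(const char *s, struct datetime *dt)
{
	return sscanf(s, "%4d%2d%2dT%2d%2d%2d",
	    &dt->year, &dt->mon, &dt->day,
	    &dt->hour, &dt->min, &dt->sec) == 6;
}
```

For "19720229T132245Z" this fills in 1972, 2, 29, 13, 22 and 45. Range checks on the fields would still be needed afterwards.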
2020-10-09 | XML cdata callback: handle CDATA as data | Hiltjo Posthuma
This improves handling CDATA for example in Atom feeds with: <author><email><![CDATA[abc]]><name><![CDATA[[person]]></name></author>
2020-05-28 | sfeed: simplify/optimize checking end tags while inside a RSS/Atom tag | Hiltjo Posthuma
Instead of a binary search, set a pointer to the expected end tag. This makes more sense and is also a minor optimization. No behavioural change intended.
2020-01-24 | cleanup some includes | Hiltjo Posthuma
2020-01-18 | minor style: use plain int for xml_entitytostr() | Hiltjo Posthuma
2019-10-12 | string_append: check for addition and multiplication overflow | Hiltjo Posthuma
This could overflow / wrap the buffer. Note: SIZE_MAX is defined in POSIX to be at least 65535. On most 64-bit platforms this is 0xffffffffffffffffUL bytes.
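The kind of check described here can be sketched as below. struct string and string_append_checked are hypothetical names, and the growth policy (doubling from 64 bytes) is an assumption for illustration rather than sfeed's actual strategy.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string. */
struct string {
	char *data;
	size_t len;    /* bytes used, excluding the NUL terminator */
	size_t bufsiz; /* bytes allocated */
};

/* Append `len` bytes of `data`. Returns 0 on success, -1 if a size
 * computation would wrap around or the allocation fails. */
int
string_append_checked(struct string *s, const char *data, size_t len)
{
	size_t need, newsiz;
	char *p;

	if (!len)
		return 0;
	/* addition overflow: s->len + len + 1 must not wrap */
	if (s->len >= SIZE_MAX - 1 || len > SIZE_MAX - 1 - s->len)
		return -1;
	need = s->len + len + 1;
	if (need > s->bufsiz) {
		newsiz = s->bufsiz ? s->bufsiz : 64;
		while (newsiz < need) {
			/* multiplication overflow: doubling would wrap */
			if (newsiz > SIZE_MAX / 2) {
				newsiz = need;
				break;
			}
			newsiz *= 2;
		}
		if (!(p = realloc(s->data, newsiz)))
			return -1;
		s->data = p;
		s->bufsiz = newsiz;
	}
	memcpy(s->data + s->len, data, len);
	s->len += len;
	s->data[s->len] = '\0';
	return 0;
}
```

The checks run before any size is computed, so an absurd length such as SIZE_MAX is rejected without touching the buffer.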
2019-09-05 | sfeed.c: fix typo in comment | Hiltjo Posthuma
2019-06-17 | sfeed: optimization: xmlattr: when not in some RSS/Atom tag skip further checks | Hiltjo Posthuma
2019-06-11 | fix typo in comment | Hiltjo Posthuma
2019-06-11 | optimization: only convert entities when we are inside a RSS/Atom tag | Hiltjo Posthuma
2019-06-11 | reorder function | Hiltjo Posthuma
2019-06-11 | Handle entities in attribute values. | Julian Schweinsberg
2019-05-25 | gettzoffset: fix possible arithmetic overflow if int is 16-bit | Hiltjo Posthuma
also reduce size of return type (32-bit+ should be enough).
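To illustrate the overflow concern: with a 16-bit int, an expression like 14 * 3600 (50400) exceeds INT_MAX (32767), so the arithmetic has to be done in a wider type. The sketch below is a hypothetical tz_offset_seconds, not sfeed's gettzoffset. It does the arithmetic in long and accepts only "+HHMM", "-HHMM" and the same with a colon.

```c
#include <ctype.h>

/* Hypothetical parser: returns the UTC offset in seconds, or 0 for
 * anything that is not a numeric offset. */
long
tz_offset_seconds(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;
	long sign, hours, mins;

	if (*p == '+')
		sign = 1;
	else if (*p == '-')
		sign = -1;
	else
		return 0;
	p++;
	if (!isdigit(p[0]) || !isdigit(p[1]))
		return 0;
	hours = (p[0] - '0') * 10 + (p[1] - '0');
	p += 2;
	if (*p == ':')
		p++;
	if (!isdigit(p[0]) || !isdigit(p[1]))
		return 0;
	mins = (p[0] - '0') * 10 + (p[1] - '0');
	/* long arithmetic: hours * 3600 would not fit in a 16-bit int */
	return sign * (hours * 3600L + mins * 60L);
}
```

A long is guaranteed to be at least 32 bits, which is wide enough for any offset this parser can produce.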
2019-05-10 | remove unused variables | Hiltjo Posthuma
2019-05-10 | sfeed: remove support for military zones and simplify | Hiltjo Posthuma
See RFC2822 4.3 page 32:
"[...] However, because of the error in [RFC822], they SHOULD all be considered equivalent to "-0000" unless there is out-of-band information confirming their meaning."
2019-05-02 | sfeed: improve content type (attribute) handling | Hiltjo Posthuma
- handle type attribute for MRSS media:description, media:description type="plain" is now parsed properly.
- handle default content-types per tag now.
- when multiple content-like fields are specified use the proper content-type.
- be flexible about type attribute handling.
- minor code tweaks.
2019-04-14 | sfeed: add support for the first enclosure of an item | Hiltjo Posthuma
This is useful for example for podcasts (audio attachment), newsposts (usually some image) or comic strips (link to page, image as enclosure).
Thanks leot for the feedback!
2019-04-06 | optimization: define GETNEXT as an inline macro | Hiltjo Posthuma
This reduces much function call overhead. getnext is defined in xml.h for inline optimization. sfeed only uses one XML parser context per program, this allows further optimizations of the compiler also.
On OpenBSD it was noticeable because of retpoline etc function call overhead. Using clang and a 500MB test XML file reduces processing time from about 12s to 5s.
Tested using some crazy optimization flags:
SFEED_CFLAGS = -O3 -std=c99 -DGETNEXT=getchar_unlocked -fno-ret-protector \
-mno-retpoline -static
A GETNEXT macro is also nice for programs which mmap(2) some big XML file. Then you can simply define:
#define GETNEXT() (off >= len ? EOF : reg[off++])
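A small self-contained sketch of the in-memory GETNEXT idea. The macro is the one given in the commit message. The variables reg, off and len follow its naming, while count_open_brackets is a hypothetical consumer standing in for a real parser loop, and a plain string stands in for an mmap(2)ed region.

```c
#include <stddef.h>
#include <stdio.h>

/* Region being read: in a real program this could point at an
 * mmap(2)ed file. */
static const unsigned char *reg;
static size_t off, len;

/* Macro from the commit message: next byte of the region, or EOF. */
#define GETNEXT() (off >= len ? EOF : reg[off++])

/* Hypothetical consumer: count '<' characters, reading only through
 * GETNEXT() the way a parser loop would. */
size_t
count_open_brackets(const char *buf, size_t n)
{
	size_t count = 0;
	int c;

	reg = (const unsigned char *)buf;
	off = 0;
	len = n;
	while ((c = GETNEXT()) != EOF)
		if (c == '<')
			count++;
	return count;
}
```

Because reg is an unsigned char pointer, every byte is returned as a non-negative int and cannot be confused with EOF.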
2019-04-06 | sfeed: gettag: simplify and use ANSI bsearch() | Hiltjo Posthuma
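A tag lookup built on the standard bsearch() can be sketched as follows. The table contents, struct tag, tagcmp and find_tag are hypothetical, and the sketch compares names with strcmp(), which may differ from the comparison sfeed uses. The table must be sorted in the same order the comparison function defines.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical tag table, sorted by name in strcmp() order. */
struct tag {
	const char *name;
	int id;
};

static const struct tag tags[] = {
	{ "guid",    1 },
	{ "link",    2 },
	{ "pubDate", 3 },
	{ "title",   4 },
};

/* bsearch() passes the key first and the table element second. */
static int
tagcmp(const void *key, const void *elem)
{
	return strcmp((const char *)key, ((const struct tag *)elem)->name);
}

/* Returns the matching table entry, or NULL if the name is unknown. */
const struct tag *
find_tag(const char *name)
{
	return bsearch(name, tags, sizeof(tags) / sizeof(tags[0]),
	    sizeof(tags[0]), tagcmp);
}
```

A lookup such as find_tag("link") returns the entry with id 2, and an unknown name returns NULL.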
2019-03-03 | gettzoffset: bit more strict UTC offset parsing | Hiltjo Posthuma