Age | Commit message | Author |
|
A numeric entity could be 5 bytes, so use a round number of 8 bytes.
No other change intended and no performance difference noticed.
|
|
https://datatracker.ietf.org/doc/html/rfc2822#section-4.3
"Where a two or three digit year occurs in a date, the year is to be
interpreted as follows: If a two digit year is encountered whose
value is between 00 and 49, the year is interpreted by adding 2000,
ending up with a value between 2000 and 2049. If a two digit year is
encountered with a value between 50 and 99, or any three digit year
is encountered, the year is interpreted by adding 1900."
Improvement on commit 7086670e4335714e1df982bf1058082b7400b653
For example (output from TZ=UTC sfeed_plain):
input: Sun, 26 Jul 049 19:26:34
was: 2049-07-26 19:26
now: 1949-07-26 19:26 (because this is a 3-digit year)
input: Sun, 26 Jul 1 19:26:34
was: 2001-07-26 19:26
now: 0001-07-26 19:26 (because this is a 1-digit year and doesn't match the short year rule)
input: Sun, 26 Jul 001 19:26:34
was: 2001-07-26 19:26
now: 1901-07-26 19:26 (because this is a 3-digit year)
These cases are all added to the tests in the sfeed_tests repo (times.xml file).
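The quoted rule and the three examples above can be summarised in a small sketch. The function name and the idea of passing the digit count separately are assumptions made for illustration; this is not the code from the commit.

```c
/* Hypothetical helper (not taken from sfeed): apply the RFC 2822
 * section 4.3 rule to a year that was written with `ndigits` digits. */
static long
adjust_year(long year, int ndigits)
{
	if (ndigits == 2)
		return year + (year <= 49 ? 2000 : 1900);
	if (ndigits == 3)
		return year + 1900;
	return year; /* 1 digit or 4 and more digits: kept as is */
}
```

With this sketch "049" (3 digits) maps to 1949, "1" (1 digit) stays 1 and "001" (3 digits) maps to 1901, matching the "now:" lines above.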
|
|
This also fixes a calculation (possibly a compiler bug) with Open Watcom 1.9.
|
|
Found while testing sfeed on MS-DOS with Open Watcom (for fun :)).
There, an int is 16 bits wide and sfeed incorrectly wrapped the value, which
produced incorrectly parsed UNIX timestamps as output.
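The expression that wrapped is not shown here. As a general sketch of this kind of fix, widening every operand to long before multiplying avoids the wrap when int is 16 bits. The helper name is hypothetical.

```c
/* Hypothetical helper: seconds since midnight.
 * With a 16-bit int, 23 * 3600 = 82800 does not fit (INT_MAX is 32767),
 * so every operand is widened to long before the arithmetic is done. */
static long
seconds_of_day(int hour, int min, int sec)
{
	return (long)hour * 3600L + (long)min * 60L + (long)sec;
}
```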
|
|
This is not clearly defined by the C99 standard.
Define ctype-like macros to force the classification to be ASCII / UTF-8
compatible (not extended ASCII or similar, as noticed on OpenBSD 3.8).
(In practice modern libc implementations are all ASCII and UTF-8-compatible.
Otherwise this would break many programs.)
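A minimal sketch of what such ctype-like macros could look like. The macro names and exact definitions are assumptions for illustration and may differ from the ones in the commit.

```c
/* ASCII-only replacements for some <ctype.h> functions.
 * ISSPACE evaluates its argument twice, so avoid side effects. */
#define ISALPHA(c) ((((unsigned)(c)) | 32) - 'a' < 26)
#define ISDIGIT(c) (((unsigned)(c)) - '0' < 10)
#define ISSPACE(c) ((c) == ' ' || ((((unsigned)(c)) - '\t') < 5))
```

Because the comparisons are done on unsigned values, bytes outside the ASCII range (such as 0xa0 or 0xe9) never match, regardless of the locale.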
|
|
It is unspecified if the C locale iscntrl is compatible with ASCII or not.
Noticed when testing on OpenBSD 3.8 which uses extended ASCII and also uses the
C1 range for control-characters. This breaks support with UTF-8.
Reference:
https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_control_codes_for_general_use
C1 table.
Force our own definition of an ASCII-compatible control-character range, since
sfeed expects input to be UTF-8 (or converted by iconv) and so the output to be
UTF-8 as well.
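One possible definition of such a range, as a sketch. The macro name and the exact expression are assumptions, not necessarily what the commit uses.

```c
/* Only the ASCII C0 control characters and DEL count as control
 * characters. Bytes 0x80-0x9f (the C1 range) are deliberately excluded,
 * because they occur as continuation bytes in UTF-8 sequences. */
#define ISCNTRL(c) ((unsigned char)(c) < 0x20 || (unsigned char)(c) == 0x7f)
```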
|
|
This also makes the programs exit with a non-zero status when a read or write
error occurs.
This makes checking the exit status more reliable in scripts.
A simple example to simulate a disk with no space left:
curl -s 'https://codemadness.org/atom.xml' | sfeed > f
/mnt/test: write failed, file system is full
echo $?
0
Which now produces:
curl -s 'https://codemadness.org/atom.xml' | sfeed > f
/mnt/test: write failed, file system is full
write error: <stdout>
echo $?
1
Tested with a small mfs on OpenBSD, fstab entry:
swap /mnt/test mfs rw,nodev,nosuid,-s=1M 0 0
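A sketch of how an output stream can be checked before exiting. The helper name is hypothetical; only the "write error: <stdout>" message format is taken from the example output above.

```c
#include <stdio.h>

/* Hypothetical helper: flush the output stream `fp` and report whether
 * any earlier write on it failed. Returns 0 on success, 1 on error. */
static int
stream_failed(FILE *fp, const char *name)
{
	int failed = ferror(fp); /* error indicator from earlier writes */

	if (fflush(fp) == EOF) /* buffered data may only fail now */
		failed = 1;
	if (failed)
		fprintf(stderr, "write error: %s\n", name);
	return failed ? 1 : 0;
}
```

A program could end its main function with `return stream_failed(stdout, "<stdout>");` so that scripts see a non-zero exit status when the disk is full.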
|
|
For feeds with lots of content data:
Small performance improvement (~2%) on systems that implement putchar as a
macro. On some systems using a function call for putchar it can be easier to
replace with putchar_unlocked.
(On an older MIPS32 VM changing putchar to putchar_unlocked makes writing 5x
faster).
|
|
This allows parsing the time as a number in the 64-bit range, even on 32-bit
platforms. Note that the sfeed formatting tools can still truncate/wrap the
value to time_t, which can be 32-bit.
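As an illustration of keeping the value in a 64-bit type instead of time_t, here is a sketch that reads a decimal timestamp into a long long. The helper name and the error convention are assumptions.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal UNIX timestamp into a long long
 * (at least 64 bits wide) instead of time_t, which may be 32 bits.
 * Returns 0 on success, -1 on a malformed or out-of-range value. */
static int
parse_timestamp(const char *s, long long *out)
{
	char *end;
	long long v;

	errno = 0;
	v = strtoll(s, &end, 10);
	if (errno || end == s || *end != '\0')
		return -1;
	*out = v;
	return 0;
}
```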
|
|
Specified in RFC2822 Section 3.3. Date and Time Specification
"[...] the time-of-day MUST be in the range 00:00:00 through 23:59:60 (the
number of seconds allowing for a leap second; see [STD12]) [...]"
To test:
<entry><updated>2016-12-31T23:59:60Z</updated></entry>
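The quoted range can be checked with a small function. This is a sketch with a hypothetical name, not the validation code from the commit.

```c
/* Hypothetical check: accept 00:00:00 through 23:59:60, where second 60
 * is only allowed to make room for a leap second.
 * Returns 1 if the time of day is in range, 0 otherwise. */
static int
valid_time_of_day(int hour, int min, int sec)
{
	return hour >= 0 && hour <= 23 &&
	       min >= 0 && min <= 59 &&
	       sec >= 0 && sec <= 60;
}
```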
|
|
RSS (pubDate) uses RFC822 dates. This standard is obsoleted by RFC2822.
The RSS 2.0 spec says for the pubDate field:
"[...] All date-times in RSS conform to the Date and Time Specification of RFC
822, with the exception that the year may be expressed with two characters or
four characters (four preferred)."
RFC822 section 5.1 describes the syntax with 2 digit years:
https://datatracker.ietf.org/doc/html/rfc822#section-5.1
It was obsoleted/fixed in RFC2822 section 4.3:
https://datatracker.ietf.org/doc/html/rfc2822#section-4.3
" Where a two or three digit year occurs in a date, the year is to be
interpreted as follows: If a two digit year is encountered whose
value is between 00 and 49, the year is interpreted by adding 2000,
ending up with a value between 2000 and 2049. If a two digit year is
encountered with a value between 50 and 99, or any three digit year
is encountered, the year is interpreted by adding 1900."
In the real world I've seen all sites using RSS use the 4-digit format.
For historic context of changes and what feeds it might affect:
- RFC822 was published on 13 August 1982, obsoleted by RFC2822.
- RFC2822 was published in April 2001, obsoleted by RFC5322.
- RFC5322 was published in October 2008.
- RDF was started around 1996. It was published around 2004.
- March 15, 1999: RSS 0.90 (Netscape), published by Netscape and authored by
Ramanathan Guha.
- July 10, 1999: RSS 0.91 (Netscape), published by Netscape and authored by Dan
Libby.
- June 9, 2000: RSS 0.91 (UserLand), published by UserLand Software and
authored by Dave Winer.
- Dec. 25, 2000: RSS 0.92, UserLand.
- Aug. 19, 2002: RSS 2.0, UserLand.
- July 15, 2003: RSS 2.0 (version 2.0.1), published by the Berkman Center for
Internet & Society at Harvard Law School and authored by Dave Winer.
- July 15, 2003: RSS 2.0 (version 2.0.1-rv-1), published by the RSS Advisory
Board.
- July 17, 2003: RSS 2.0 (version 2.0.1-rv-2), RSS Advisory Board.
- April 6, 2004: RSS 2.0 (version 2.0.1-rv-3), RSS Advisory Board.
- May 31, 2004: RSS 2.0 (version 2.0.1-rv-4), RSS Advisory Board.
- June 19, 2004: RSS 2.0 (version 2.0.1-rv-5), RSS Advisory Board.
- January 25, 2005: RSS 2.0 (version 2.0.1-rv-6), RSS Advisory Board.
- Aug. 12, 2006: RSS 2.0 (version 2.0.8), RSS Advisory Board.
- June 5, 2007: RSS 2.0 (version 2.0.9), RSS Advisory Board.
- Oct. 15, 2007: RSS 2.0 (version 2.0.10), RSS Advisory Board.
- March 30, 2009 (current): RSS 2.0 (version 2.0.11), RSS Advisory Board.
RSS history source: https://www.rssboard.org/rss-history
|
|
This URL printing behaviour was changed recently in commit
f305b032bc19b4e81c0dd6c0398370028ea910ca
|
|
Make it const char *.
|
|
These are BSD functions.
- HaikuOS now compiles without having to use libbsd.
- Tested on SerenityOS (for fun), which doesn't have these functions (yet).
With a small change to support wcwidth() sfeed works on SerenityOS.
|
|
Input to reproduce:
<entry>
<link href="https://codemadness.org/a" href="https://codemadness.org/b"/>
</entry>
Old value:
"https://codemadness.org/ahttps://codemadness.org/b"
New value:
"https://codemadness.org/b"
The same applies to RSS <enclosure url="" />
|
|
This standard was a draft used around 2005-2006.
Instead of the fields "published" and "updated" it used "issued" (mandatory
field) and "modified" (optional). Add support for them, while still preferring
the Atom 1.0 fields and creation dates first.
I don't know any real-life examples that still use this though.
Some references:
- http://rakaz.nl/2005/07/moving-from-atom-03-to-10.html
- https://www.dokuwiki.org/syndication (rss_type "atom" parameter value).
- https://support.google.com/merchants/answer/160598?hl=en
|
|
Reference:
https://www.w3.org/2003/01/xhtml-mimetype/
|
|
This fix is very important *ahem*.
|
|
Removed/rewritten the functions:
absuri, parseuri, and encodeuri() for percent-encoding.
The functions are now split up, each with its own purpose:
- uri_format: format struct uri into a string.
- uri_hasscheme: quick check if a string is absolute or not.
- uri_makeabs: make a URI absolute using a base uri and the original URI.
- uri_parse: parse a string into a struct uri.
The following URLs are better parsed:
- URLs with extra leading "/"s in the path are kept as is; no "/" is added
for empty paths either.
- URLs like "http://codemadness.org" are not changed to
"http://codemadness.org/" anymore (paths are kept as is, unless they are
non-empty and do not start with "/").
- Paths are not percent-encoded anymore.
- URLs with userinfo field (username, password) are parsed.
like: ftp://user:password@[2001:db8::7]:2121/rfc/rfc1808.txt
- Non-authoritative URLs like mailto:some@email.org, magnet URIs, ISBN URIs/urn,
like: urn:isbn:0-395-36341-1 are allowed and parsed correctly.
- Both local (file:///) and non-local (file://) are supported.
- Specifying a base URL with a port will now only use it when the relative URL
has no host and port set and follows RFC3986 5.2.2 more closely.
- Parsing numeric port: parse as signed long and check <= 0, empty port is
allowed.
- Parsing URIs containing query, fragment, but no path separator (/) will now
parse the component properly.
For sfeed:
- Parse the baseURI only once (no need to do it every time for making absolute
URIs).
- If a link/enclosure is absolute already or if there is no base URL specified
then just print the link directly. There have also been other small performance
improvements related to handling URIs.
References:
- https://tools.ietf.org/html/rfc3986
- Section "5.2.2. Transform References" has also been helpful.
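To illustrate the kind of quick check that uri_hasscheme is described as doing, here is a sketch based on the scheme grammar in RFC 3986 section 3.1. The function below is named has_scheme to make clear it is an illustration and not sfeed's implementation.

```c
/* Sketch of an RFC 3986 section 3.1 scheme check:
 * scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":".
 * Returns 1 if `s` starts with a scheme, 0 otherwise. */
static int
has_scheme(const char *s)
{
	const char *p = s;

	/* the first character must be a letter */
	if (!((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z')))
		return 0;
	/* skip the remaining scheme characters */
	for (p++; (*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') ||
	    (*p >= '0' && *p <= '9') || *p == '+' || *p == '-' || *p == '.'; p++)
		;
	return *p == ':';
}
```

Strings such as "urn:isbn:0-395-36341-1" and "mailto:some@email.org" pass this check, while relative references like "/rfc/rfc1808.txt" do not.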
|
|
The commit that introduced the regression was:
commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:55:35 2020 +0200
simplify time parsing
Noticed on an RSS feed with the following date:
<pubDate>2021-02-03 05:13:03</pubDate>
This format is non-standard, but sfeed should support this.
A standard format would be (for Atom): 2021-02-03T05:13:03Z
Partially revert it.
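A sketch of accepting both the standard "T" separator and the non-standard space. It uses sscanf for brevity, which is an assumption made for illustration and is not how sfeed parses dates; the helper name is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: parse "YYYY-MM-DD", then either 'T' or a space,
 * then "HH:MM:SS". Anything after the seconds (such as "Z") is ignored.
 * Returns 1 when all six fields were read, 0 otherwise. */
static int
parse_datetime(const char *s, int f[6])
{
	/* %*1[T ] consumes exactly one separator character: 'T' or ' ' */
	return sscanf(s, "%4d-%2d-%2d%*1[T ]%2d:%2d:%2d",
	    &f[0], &f[1], &f[2], &f[3], &f[4], &f[5]) == 6;
}
```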
|
|
This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200.
After a content tag was parsed the "iscontenttag" variable was not reset.
This caused 2 regressions:
- It ignored other tags such as links after it.
- It incorrectly set the content-type of a lesser priority field.
Thanks to pazz0 for reporting it!
|
|
Fixes a regression introduced in the refactor in commit
e43b7a48b08a6bbcb4e730e80395b3257681b33e
Now copy the data by value. This structure is small and no performance
regression has been seen.
This was because the tag ID was modified which made subsequent parsed tags of
this type behave strangely:
ctx.tag->id = RSSTagGuidPermalinkTrue;
Input data to reproduce:
<rss>
<channel>
<item>
<guid isPermaLink="false">https://def/</guid>
</item>
<item>
<guid>https://abc/</guid>
</item>
</channel>
</rss>
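The difference between modifying a shared table entry through a pointer and modifying a private copy can be sketched as follows. All names except RSSTagGuidPermalinkTrue are made up for this illustration and do not reflect sfeed's actual structures.

```c
/* Illustrative types; only RSSTagGuidPermalinkTrue appears in the text. */
enum tagid { RSSTagGuid, RSSTagGuidPermalinkTrue };

struct tag {
	const char *name;
	enum tagid id;
};

/* Shared lookup table: entries must stay unchanged between items. */
static const struct tag tags[] = {
	{ "guid", RSSTagGuid },
};

struct ctx {
	struct tag tag; /* a private copy, not a pointer into tags[] */
};

static void
start_guid(struct ctx *ctx, int ispermalink)
{
	ctx->tag = tags[0]; /* copy by value */
	if (ispermalink)
		ctx->tag.id = RSSTagGuidPermalinkTrue; /* changes the copy only */
}
```

Since every tag starts from a fresh copy, a change made while parsing one <guid> cannot leak into the next one.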
|
|
This reverts commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2.
|
|
This way dc:date could be the updated time of the item. For Atom there is
<published> and <updated> with the same logic.
|
|
Fields with multiple values are separated by '|'. In the future multiple
enclosure support might be added.
The categories tags are now parsed. This feature is useful for filtering and
categorizing.
Parsing of nested tags such as <author><name> has been improved. This code has
been refactored.
RSS <guid> isPermaLink is now also handled differently and will prefer a
permalink with "true" (link) over the ID. In practice multiple <guid> in an
item does not happen.
|
|
For example "19720229T132245Z" is now supported.
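A sketch of reading this compact form, where the date and time fields have no separators. It uses sscanf with field widths for brevity, which is an assumption for illustration and not sfeed's parser; the helper name is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: parse "YYYYMMDDTHHMMSS" (a trailing "Z" is
 * ignored). Returns 1 when all six fields were read, 0 otherwise. */
static int
parse_compact_datetime(const char *s, int f[6])
{
	/* the field widths split the digit runs into separate numbers */
	return sscanf(s, "%4d%2d%2dT%2d%2d%2d",
	    &f[0], &f[1], &f[2], &f[3], &f[4], &f[5]) == 6;
}
```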
|
|
This improves handling CDATA for example in Atom feeds with:
<author><email><![CDATA[abc]]><name><![CDATA[[person]]></name></author>
|
|
Instead of doing a binary search, set a pointer to the expected end tag.
This makes more sense and is also a minor optimization.
No behavioural change intended.
|
|
This could overflow / wrap the buffer.
Note: SIZE_MAX is defined in POSIX to be at least 65535.
On most 64-bit platforms this is 0xffffffffffffffffUL bytes.
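A common way to guard against this kind of wrap is to compare against SIZE_MAX before adding. The helper below is a hypothetical sketch and not the code from the commit.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: grow a buffer from `len` to `len + add` bytes,
 * refusing when the sum would wrap around SIZE_MAX. Returns the new
 * buffer, or NULL (leaving `buf` untouched) on overflow or when the
 * allocation fails. */
static void *
grow_buffer(void *buf, size_t len, size_t add)
{
	if (add > SIZE_MAX - len)
		return NULL; /* len + add would overflow size_t */
	return realloc(buf, len + add);
}
```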
|
|
|