Age | Commit message (Collapse) | Author |
|
Reference:
https://www.w3.org/2003/01/xhtml-mimetype/
|
|
This fix is very important *ahem*.
|
|
Removed/rewritten the functions:
absuri, parseuri, and encodeuri() for percent-encoding.
The functions are now split separately with the following purpose:
- uri_format: format struct uri into a string.
- uri_hasscheme: quick check if a string is absolute or not.
- uri_makeabs: make a URI absolute using a base uri and the original URI.
- uri_parse: parse a string into a struct uri.
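A minimal sketch of such a scheme check, assuming the RFC 3986 section 3.1
grammar (a letter followed by letters, digits, "+", "-" or "."). The name
example_hasscheme is hypothetical, this is not the code from the commit:

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s starts with "scheme:", else 0. */
int
example_hasscheme(const char *s)
{
	const char *p = s;

	/* RFC 3986 3.1: a scheme starts with a letter */
	if (!isalpha((unsigned char)*p))
		return 0;
	/* ... followed by any number of letters, digits, '+', '-' or '.' */
	for (p++; isalnum((unsigned char)*p) ||
	    *p == '+' || *p == '-' || *p == '.'; p++)
		;
	/* the scheme must be terminated by ':' */
	return *p == ':';
}
```

With this check a string like "urn:isbn:0-395-36341-1" counts as absolute,
while "/rfc/rfc1808.txt" does not.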
The following URLs are better parsed:
- URLs with extra "/"'s in the path prepended are kept as is, no "/" is added
either for empty paths.
- URLs like "http://codemadness.org" are not changed to
"http://codemadness.org/" anymore (paths are kept as is, unless they are
non-empty and do not start with "/").
- Paths are not percent-encoded anymore.
- URLs with userinfo field (username, password) are parsed.
like: ftp://user:password@[2001:db8::7]:2121/rfc/rfc1808.txt
- Non-authoritative URLs like mailto:some@email.org, magnet URIs, ISBN URIs/urn,
like: urn:isbn:0-395-36341-1 are allowed and parsed correctly.
- Both local (file:///) and non-local (file://) are supported.
- Specifying a base URL with a port will now only use it when the relative URL
has no host and port set and follows RFC3986 5.2.2 more closely.
- Parsing numeric port: parse as a signed long and reject values <= 0; an empty
port is allowed.
- Parsing URIs containing query, fragment, but no path separator (/) will now
parse the component properly.
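The port rule above could be sketched like this with strtol(3). The name
example_parseport and the upper bound of 65535 are assumptions for this
example, not taken from the commit:

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns the port number, 0 for an empty port
 * (allowed) or -1 for an invalid port.
 */
long
example_parseport(const char *s)
{
	char *end;
	long l;

	if (s[0] == '\0')
		return 0; /* empty port is allowed */

	errno = 0;
	l = strtol(s, &end, 10);
	/* reject range errors, trailing garbage and values <= 0 */
	if (errno || *end != '\0' || l <= 0 || l > 65535)
		return -1;
	return l;
}
```

Note that strtol(3) itself accepts leading whitespace and a sign, a stricter
parser would check the first character separately.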
For sfeed:
- Parse the baseURI only once (no need to do it every time for making absolute
URIs).
- If a link/enclosure is absolute already or if there is no base URL specified
then just print the link directly. There have also been other small performance
improvements related to handling URIs.
References:
- https://tools.ietf.org/html/rfc3986
- Section "5.2.2. Transform References" has also been helpful.
|
|
The commit that introduced the regression was:
commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:55:35 2020 +0200
simplify time parsing
Noticed on a RSS feed with the following date:
<pubDate>2021-02-03 05:13:03</pubDate>
This format is non-standard, but sfeed should support this.
A standard format would be (for Atom): 2021-02-03T05:13:03Z
Partially revert it.
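A rough sketch of accepting both separators (' ' and 'T') between the date and
the time, using sscanf(3). The names example_date and example_parsedate are
hypothetical and field ranges are not validated here, this is not the sfeed
parser:

```c
#include <stdio.h>

struct example_date {
	int year, mon, day, hour, min, sec;
};

/* Hypothetical helper: returns 0 on success, -1 on a parse error. */
int
example_parsedate(const char *s, struct example_date *d)
{
	char sep;

	/* date, one separator character, then the time */
	if (sscanf(s, "%4d-%2d-%2d%c%2d:%2d:%2d",
	    &d->year, &d->mon, &d->day, &sep,
	    &d->hour, &d->min, &d->sec) != 7)
		return -1;
	/* standard 'T' or the non-standard space */
	if (sep != 'T' && sep != ' ')
		return -1;
	return 0;
}
```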
|
|
This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200.
After a content tag was parsed the "iscontenttag" variable was not reset.
This caused 2 regressions:
- It ignored other tags such as links after it.
- It incorrectly set the content-type of a lesser priority field.
Thanks to pazz0 for reporting it!
|
|
Fixes a regression introduced in the refactor in commit
e43b7a48b08a6bbcb4e730e80395b3257681b33e
Now copy the data by value. This structure is small and no performance
regression has been seen.
This was because the tag ID was modified which made subsequent parsed tags of
this type behave strangely:
ctx.tag->id = RSSTagGuidPermalinkTrue;
Input data to reproduce:
<rss>
<channel>
<item>
<guid isPermaLink="false">https://def/</guid>
</item>
<item>
<guid>https://abc/</guid>
</item>
</channel>
</rss>
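To illustrate why copying by value helps, here is a small sketch. All names
except RSSTagGuidPermalinkTrue are assumptions made for this example and do not
match the sfeed source exactly:

```c
enum example_tagid {
	TagUnknown,
	RSSTagGuid,
	RSSTagGuidPermalinkTrue
};

struct example_tag {
	const char *name;
	enum example_tagid id;
};

/* shared lookup table, must stay unchanged between items */
struct example_tag example_tags[] = {
	{ "guid", RSSTagGuid },
};

struct example_ctx {
	struct example_tag tag; /* a copy, not a pointer into the table */
};

/* Copy the small structure by value into the parser context. */
void
example_settag(struct example_ctx *c, const struct example_tag *t)
{
	c->tag = *t;
}
```

Changing c->tag.id afterwards only changes the copy in the context, so the
next <guid> that is looked up in the table still gets the original ID.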
|
|
|
|
This reverts commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2.
|
|
|
|
|
|
This way dc:date could be the updated time of the item. For Atom there is
<published> and <updated> with the same logic.
|
|
Fields with multiple values are separated by '|'. In the future multiple
enclosure support might be added.
The categories tags are now parsed. This feature is useful for filtering and
categorizing.
Parsing of nested tags such as <author><name> has been improved. This code has
been refactored.
RSS <guid> isPermaLink is now handled differently also and will now prefer a
permalink with "true" (link) over the ID. In practice multiple <guid> in an
item does not happen.
|
|
|
|
For example "19720229T132245Z" is now supported.
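A sketch of parsing this compact form with fixed-width digit fields. The names
example_tm, example_digits and example_parsecompact are hypothetical, field
ranges are not validated and the trailing timezone is ignored:

```c
#include <ctype.h>

struct example_tm {
	int year, mon, day, hour, min, sec;
};

/* Parse exactly n decimal digits at s into *v, returns 0 on success. */
static int
example_digits(const char *s, int n, int *v)
{
	int i;

	for (*v = 0, i = 0; i < n; i++) {
		/* also stops at the terminating NUL byte */
		if (!isdigit((unsigned char)s[i]))
			return -1;
		*v = *v * 10 + (s[i] - '0');
	}
	return 0;
}

/* Parse "YYYYMMDDTHHMMSS", returns 0 on success, -1 on error. */
int
example_parsecompact(const char *s, struct example_tm *t)
{
	/* short-circuit evaluation prevents reading past the end of s */
	if (example_digits(s, 4, &t->year) ||
	    example_digits(s + 4, 2, &t->mon) ||
	    example_digits(s + 6, 2, &t->day) ||
	    s[8] != 'T' ||
	    example_digits(s + 9, 2, &t->hour) ||
	    example_digits(s + 11, 2, &t->min) ||
	    example_digits(s + 13, 2, &t->sec))
		return -1;
	return 0;
}
```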
|
|
This improves handling CDATA for example in Atom feeds with:
<author><email><![CDATA[abc]]><name><![CDATA[[person]]></name></author>
|
|
Instead of doing a binary search, set a pointer to the expected end tag.
This makes more sense and is also a minor optimization.
No behavioural change intended.
|
|
|
|
|
|
This could overflow / wrap the buffer.
Note: SIZE_MAX is defined in POSIX to be at least 65535.
On most 64-bit platforms this is 0xffffffffffffffffUL bytes.
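A minimal sketch of the kind of check that avoids such wrapping, assuming the
problem is an addition of two size_t values. example_addsize is a hypothetical
helper and not the code from this commit:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: stores a + b in *res and returns 0, or returns -1
 * if the sum would wrap around SIZE_MAX.
 */
int
example_addsize(size_t a, size_t b, size_t *res)
{
	/* check before adding: unsigned overflow wraps silently */
	if (a > SIZE_MAX - b)
		return -1;
	*res = a + b;
	return 0;
}
```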
|
|
|
|
|
|
|
|
|
|
|
|
|
|
also reduce size of return type (32-bit+ should be enough).
|
|
|
|
see RFC2822 4.3 page 32:
"
[...]
However, because of the error in
[RFC822], they SHOULD all be considered equivalent to "-0000" unless
there is out-of-band information confirming their meaning.
"
|
|
- handle type attribute for MRSS media:description,
media:description type="plain" is now parsed properly.
- handle default content-types per tag now.
- when multiple content-like fields are specified use the proper content-type.
- be flexible about type attribute handling.
- minor code tweaks.
|
|
This is useful for example for podcasts (audio attachment), newsposts (usually
some image) or comic strips (link to page, image as enclosure).
thanks leot for the feedback!
|
|
This greatly reduces function call overhead. getnext is defined in xml.h for
inline optimization. sfeed only uses one XML parser context per program, which
also allows further compiler optimizations.
On OpenBSD it was noticeable because of retpoline etc. function call overhead.
Using clang and a 500MB test XML file reduces processing time from +- 12s to
5s.
Tested using some crazy optimization flags:
SFEED_CFLAGS = -O3 -std=c99 -DGETNEXT=getchar_unlocked -fno-ret-protector \
-mno-retpoline -static
A GETNEXT macro is also nice for programs which mmap(2) some big XML file. Then
you can simply define:
#define GETNEXT() (off >= len ? EOF : reg[off++])
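A small sketch of how such a macro could be used over an in-memory region. For
brevity a plain buffer is used here instead of mmap(2). The variable names reg,
off and len follow the macro above, example_counttags is a hypothetical
function:

```c
#include <stddef.h>
#include <stdio.h>

/* unsigned char so that bytes >= 0x80 are never confused with EOF */
static const unsigned char *reg;
static size_t off, len;

#define GETNEXT() (off >= len ? EOF : reg[off++])

/* Count the number of '<' characters in the buffer using GETNEXT(). */
size_t
example_counttags(const unsigned char *buf, size_t buflen)
{
	size_t n = 0;
	int c;

	reg = buf;
	off = 0;
	len = buflen;

	while ((c = GETNEXT()) != EOF) {
		if (c == '<')
			n++;
	}
	return n;
}
```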
|
|
|
|
|
|
|
|
this style change is useful for my local coverage profile.
|
|
|
|
In RSS2 (but not RSS0.9), a <link> is optional and it can also be specified by
<guid isPermaLink="true"> (permalink is "true" by default).
When a <link> is also present this will be used instead of the GUID permalink.
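The priority described above can be sketched as follows. example_picklink is a
hypothetical helper that takes already parsed strings (NULL when absent), it is
not how sfeed stores these fields:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns the URL to use for the item or NULL.
 * A <link> is preferred, otherwise a <guid> that is a permalink.
 */
const char *
example_picklink(const char *link, const char *guid, const char *ispermalink)
{
	if (link && link[0])
		return link;
	/* isPermaLink is "true" by default when the attribute is absent */
	if (guid && guid[0] &&
	    (!ispermalink || !strcmp(ispermalink, "true")))
		return guid;
	return NULL;
}
```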
|
|
|
|
the Atom link parsing is more strict now and checks the rel attribute. When the
rel attribute is empty it is handled as a normal link ("alternate").
This makes sure that when a link with another type is specified (such as
"enclosure", "related", "self" or "via") before a link, it is not used.
sfeed does not handle enclosures, but the code is reworked so it is very simple
to add this. Enclosures are often used for example to attach some image to a
newspost or an audio file to a podcast.
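The rel check could look roughly like this. example_isalternate is a
hypothetical name, the real code may be structured differently:

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the rel attribute value describes a
 * normal link: missing, empty or "alternate". Returns 0 otherwise.
 */
int
example_isalternate(const char *rel)
{
	return !rel || rel[0] == '\0' || !strcmp(rel, "alternate");
}
```

A link with rel="enclosure" is then skipped as the item link, and adding
enclosure support would be a matter of handling that value separately.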
|
|
|
|
Noticed in the webcomic "amphibian":
http://amphibian.com/feeds/atom
|
|
... and abstract printing timestamp and uri to string_print_{timestamp,uri}
similar to string_print_trimmed (normal string) and string_print_encoded
(content).
Noticed with whitespace around the field in the webcomic "amphibian":
http://amphibian.com/feeds/atom
|
|
|
|
|
|
|
|
- reorder and remove a goto.
- no need for a separate variable "end".
- don't use s[0] style because the pointer was changed.
|
|
noticed in "RMS notes" RSS.
|
|
- cast all ctype(3) function arguments to (unsigned char) to avoid UB
POSIX says:
"The c argument is an int, the value of which the application shall ensure is a
character representable as an unsigned char or equal to the value of the macro
EOF. If the argument has any other value, the behavior is undefined."
Many libc implementations implicitly cast the value, but NetBSD does not, which
is probably the correct way to interpret it.
- no need to cast for putchar + rename some fputc(..., stdout) to putchar
POSIX says:
"The fputc() function shall write the byte specified by c (converted to an
unsigned char) to the output stream pointed to by stream [...]"
Major thanks to Leonardo Taccari <iamleot@gmail.com> for reporting and testing
it on NetBSD!
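A short example of the cast, counting whitespace in a string that may contain
bytes >= 0x80 (for example UTF-8). example_countspace is a hypothetical
function written for this example:

```c
#include <ctype.h>
#include <stddef.h>

/* Count whitespace characters in s. */
size_t
example_countspace(const char *s)
{
	size_t n = 0;

	for (; *s; s++) {
		/*
		 * char may be signed: without the cast a byte >= 0x80 can
		 * become a negative int, which is undefined for isspace().
		 */
		if (isspace((unsigned char)*s))
			n++;
	}
	return n;
}
```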
|
|
the uint* types in XML are not exposed anymore.
|
|
This makes sure xml.c in particular can be compiled without further
feature macros.
|