Age | Commit message | Author |
|
Removed or rewrote the functions:
absuri(), parseuri() and encodeuri() (for percent-encoding).
The functionality is now split into separate functions with the following purpose:
- uri_format: format struct uri into a string.
- uri_hasscheme: quick check if a string is absolute or not.
- uri_makeabs: make a URI absolute using a base uri and the original URI.
- uri_parse: parse a string into a struct uri.
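A quick scheme check in the spirit of uri_hasscheme could look roughly like the sketch below. It follows the RFC 3986 scheme grammar (a letter, then letters, digits, "+", "-" or ".", terminated by ":"). The name has_scheme_sketch is hypothetical and this is not necessarily how sfeed implements it.

```c
#include <ctype.h>

/* Hypothetical helper (not the sfeed function): returns 1 if s starts with
 * an RFC 3986 scheme followed by ':', else 0. */
int
has_scheme_sketch(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	/* a scheme must start with a letter */
	if (!isalpha(*p))
		return 0;
	/* followed by any number of letters, digits, '+', '-' or '.' */
	for (p++; isalnum(*p) || *p == '+' || *p == '-' || *p == '.'; p++)
		;
	/* the scheme ends at the first ':' */
	return *p == ':';
}
```

With this sketch "mailto:some@email.org" counts as having a scheme, while "/rfc/rfc1808.txt" and ":path" do not.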
The following URLs are better parsed:
- URLs with extra leading "/"s in the path are kept as is, and no "/" is added
for empty paths either.
- URLs like "http://codemadness.org" are not changed to
"http://codemadness.org/" anymore (paths are kept as is, unless they are
non-empty and do not start with "/").
- Paths are not percent-encoded anymore.
- URLs with a userinfo field (username, password) are parsed,
like: ftp://user:password@[2001:db8::7]:2121/rfc/rfc1808.txt
- Non-authoritative URLs like mailto:some@email.org, magnet URIs, ISBN URIs/urn,
like: urn:isbn:0-395-36341-1 are allowed and parsed correctly.
- Both local (file:///) and non-local (file://) are supported.
- Specifying a base URL with a port will now only use it when the relative URL
has no host and port set and follows RFC3986 5.2.2 more closely.
- Parsing numeric port: parse as a signed long and reject values <= 0; an empty
port is allowed.
- Parsing URIs containing a query or fragment, but no path separator (/), will
now parse the components properly.
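The port rule from the list above (signed long, reject <= 0, allow empty) can be sketched as follows. The helper name check_port_sketch and the upper bound of 65535 are assumptions for illustration; the upper bound is not stated in this commit message.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: validate a port string.
 * Returns 0 when the port is empty or a number in 1..65535, else -1.
 * The upper bound 65535 is an assumption of this sketch. */
int
check_port_sketch(const char *s)
{
	char *end;
	long l;

	if (*s == '\0')
		return 0; /* an empty port is allowed */

	errno = 0;
	l = strtol(s, &end, 10);
	/* reject overflow, no digits, trailing garbage and out-of-range */
	if (errno || end == s || *end != '\0' || l <= 0 || l > 65535)
		return -1;
	return 0;
}
```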
For sfeed:
- Parse the baseURI only once (no need to do it every time for making absolute
URIs).
- If a link/enclosure is absolute already or if there is no base URL specified
then just print the link directly. There have also been other small performance
improvements related to handling URIs.
References:
- https://tools.ietf.org/html/rfc3986
- Section "5.2.2. Transform References" has also been helpful.
|
|
Combine E-Tags, If-Modified-Since in one section. Also mention the curl
--compressed option, typically used for GZIP decompression.
Note that E-Tags were broken in curl <7.73 due to a bug with "weak" e-tags.
https://github.com/curl/curl/issues/5610
From a question/feedback by e-mail from Hadrien Lacour, thanks.
|
|
|
|
The commit that introduced the regression was:
commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:55:35 2020 +0200
simplify time parsing
Noticed on a RSS feed with the following date:
<pubDate>2021-02-03 05:13:03</pubDate>
This format is non-standard, but sfeed should support this.
A standard format would be (for Atom): 2021-02-03T05:13:03Z
Partially revert it.
|
|
Kind of a non-issue, but if there is a sfeedrc with no feeds then xargs will
still be executed and give an error. The xargs -r option (GNU extension) fixes this:
From the OpenBSD xargs(1) man page:
"-r Do not run the command if there are no arguments. Normally the
command is executed at least once even if there are no arguments."
Reproducible with the sfeedrc:
feeds() {
true
}
|
|
|
|
|
|
This code uses the non-portable xargs -P option to more efficiently process
feeds in parallel.
|
|
This adds a main() function. When the environment variable
$SFEED_UPDATE_INCLUDE is set then it will not execute the main handler. The
other functions are included and can be reused. This is also useful for
unit-testing.
|
|
handler
This is useful to be able to reuse the code (together with using sfeed_update
as an included script, coming in the next commit).
|
|
basesiteurl
Move it closer before it is used.
|
|
"(FAIL CONVERT)" -> "(FAIL PARSE)". Convert may be too similar to text encoding
conversion.
|
|
This can be useful to write connector scripts more cleanly.
This does not necessarily even have to be in the sfeed(5) format.
|
|
|
|
... and do not show stderr of readlink.
|
|
|
|
surrogate pair
Regression in commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a
0xdffff should be 0xdfff.
printf '<item><title>👈</title></item>' | sfeed
Before (bad):
👈
After:
👈
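To illustrate why the mistyped upper bound matters, here is a minimal sketch of a surrogate range check. The function name is hypothetical and the real code in sfeed may be structured differently. With the bound written as 0xdffff, valid codepoints above the surrogate range (for example 0x1f448) would also match.

```c
/* Hypothetical helper: returns 1 if cp lies in the UTF-16 surrogate range
 * 0xd800..0xdfff, which is not valid as a standalone codepoint. */
int
is_surrogate_sketch(long cp)
{
	return cp >= 0xd800 && cp <= 0xdfff;
}
```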
|
|
This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200.
After a content tag was parsed the "iscontenttag" variable was not reset.
This caused 2 regressions:
- It ignored other tags such as links after it.
- It incorrectly set the content-type of a lesser priority field.
Thanks to pazz0 for reporting it!
|
|
Interesting C compiler project:
lacc: A simple, self-hosting C compiler:
https://github.com/larmel/lacc
|
|
Simple way to reproduce:
printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8
Result:
iconv: (stdin):1:8: cannot convert
Output result:
printf '<item><title>&#xdc00;</title></item>' | sfeed
Before:
00000000 09 ed b0 80 09 09 09 09 09 09 09 0a |............|
0000000c
After:
00000000 09 26 23 78 64 63 30 30 3b 09 09 09 09 09 09 09 |.&#xdc00;.......|
00000010 0a |.|
00000011
The entity is output as a literal string. This makes it easier to see what's
wrong and to debug the feed, and it is consistent with the current behaviour of
invalid named entities (&bla;). An alternative could be a UTF-8 replacement
symbol (codepoint 0xfffd).
Reference: https://unicode.org/faq/utf_bom.html , specifically:
"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
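A minimal sketch of a codepoint to UTF-8 encoder that treats surrogates as an error, so a caller can fall back to printing the entity text as is. The function name and return convention are hypothetical and not taken from the sfeed source.

```c
#include <stddef.h>

/* Hypothetical helper: encode codepoint cp as UTF-8 into buf (which must
 * hold at least 4 bytes). Returns the number of bytes written, or 0 when cp
 * is a surrogate (0xd800..0xdfff) or outside 0..0x10ffff. */
size_t
encode_utf8_sketch(long cp, char *buf)
{
	if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
		return 0;
	if (cp < 0x80) {
		buf[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (char)(0xc0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (char)(0xe0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (char)(0xf0 | (cp >> 18));
	buf[1] = (char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (char)(0x80 | (cp & 0x3f));
	return 4;
}
```

When the function returns 0 for something like 0xdc00, the caller would write the original entity bytes instead of an ill-formed 3-byte sequence.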
|
|
|
|
- Improve feed creation with empty results and new feed files.
Always make sure the file is created even when it is new and there are also no
items (after filtering).
- Consistency: always use the same feed file for merging.
Do not use "/dev/null" when it is a new file. This works using sort, but is
ugly when the merge() function is overridden and does something else. It should
be the feed file always.
|
|
This adds the name as the first parameter for the convertencoding() function,
like filter, merge, order, etc.
This can be useful to make an exception rule for text decoding in a more clean
way.
|
|
|
|
OPML is a more generic format; this tool is specifically for "rss" types and
subscription lists.
|
|
- Export read/unread state to a separate plain-text "urls" file, line by line.
- Handle white-space control-chars better.
From the sfeed(1) man page:
" The fields: title, id, author are not allowed to have newlines and TABs,
all whitespace characters are replaced by a single space character.
Control characters are removed."
So do the reverse for newsboat as well: change white-space characters which are
also control-characters (such as TABs and newlines) to a single space
character.
|
|
Makes a huge difference (cuts the time in half to process the same amount of
lines) on at least glibc 2.30 on Void Linux. Seems to make no difference on
OpenBSD.
- This removes at least one heap allocation per line (checked with valgrind).
This is because glibc will strdup() the environment variable $TZ and free it
each time, which is pointless here and wasteful.
- localtime_r is not required to set variables like tzname.
In glibc-2.30/time/tzset.c in __tz_convert is the following code and comment:
/* Update internal database according to current TZ setting.
POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
This is a good idea since this allows at least a bit more parallelism. */
tzset_internal (tp == &_tmbuf && use_localtime);
This makes localtime() always call tzset() and inspect the environment $TZ etc.,
while with localtime_r it will only initialize it once:
static void tzset_internal (int always) {
[...]
if (is_initialized && !always)
return;
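A minimal sketch of formatting a timestamp with localtime_r instead of localtime. The helper name and the time format are assumptions for illustration. Since POSIX does not require localtime_r to call tzset(), a portable program would call tzset() once at startup.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: format t as local time "YYYY-MM-DD HH:MM" into buf.
 * Returns 0 on success, -1 on failure. */
int
format_time_sketch(time_t t, char *buf, size_t bufsiz)
{
	struct tm tm;

	/* localtime_r() writes into the caller-supplied struct instead of
	 * a shared static buffer */
	if (!localtime_r(&t, &tm))
		return -1;
	if (!strftime(buf, bufsiz, "%Y-%m-%d %H:%M", &tm))
		return -1;
	return 0;
}
```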
|
|
range >= 127
For example: "\xef\xbf\xb7" (codepoint 0xfff7), returns wcwidth(wc) == -1.
The next byte was incorrectly seeked, but the codepoint itself was valid
(mbtowc).
|
|
|
|
|
|
characters
This affects sfeed_plain.
- Use unicode replacement character (codepoint 0xfffd) when a codepoint is
invalid and proceed printing the rest of the characters.
- When a codepoint is invalid reset the internal state of mbtowc(3), from the
OpenBSD man page:
" If a call to mbtowc() resulted in an undefined internal state, mbtowc()
must be called with s set to NULL to reset the internal state before it
can safely be used again."
- Optimize for the common ASCII case and use a macro to print the character
instead of a wasteful fwrite() function call. With 250k lines (+- 350MB) this
improves printing performance from 1.7s to 1.0s on my laptop. On another
system it improved by +- 25%. Tested with clang and gcc and also tested the
worst-case (non-ASCII) with no penalty.
To test:
printf '0\tabc\xc3 def' | sfeed_plain
Before:
1970-01-01 01:00 abc
After:
1970-01-01 01:00 abc� def
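The approach described above can be sketched roughly as follows, writing into a buffer rather than to stdout so it is easy to test. The function name, the buffer interface and the assumption that the locale encoding is ASCII-compatible (such as UTF-8) are all choices of this sketch, not taken from sfeed_plain.

```c
#include <stdlib.h>
#include <string.h>

#define REPLACEMENT "\xef\xbf\xbd" /* U+FFFD encoded as UTF-8 */

/* Hypothetical helper: copy s to out (NUL-terminated, at most outsiz bytes),
 * replacing each byte that starts an invalid multibyte sequence with U+FFFD.
 * Returns the number of bytes written, excluding the NUL. */
size_t
sanitize_sketch(const char *s, char *out, size_t outsiz)
{
	size_t i = 0, o = 0, len = strlen(s);
	wchar_t wc;
	int r;

	if (outsiz == 0)
		return 0;
	while (i < len) {
		if ((unsigned char)s[i] < 0x80) {
			/* common case: ASCII byte, copy it directly */
			if (o + 1 >= outsiz)
				break;
			out[o++] = s[i++];
			continue;
		}
		r = mbtowc(&wc, s + i, len - i);
		if (r <= 0) {
			/* invalid sequence: reset the internal state of
			 * mbtowc(), emit U+FFFD and skip one byte */
			mbtowc(NULL, NULL, 0);
			if (o + 3 >= outsiz)
				break;
			memcpy(out + o, REPLACEMENT, 3);
			o += 3;
			i++;
			continue;
		}
		/* valid multibyte character: copy its bytes unchanged */
		if (o + (size_t)r >= outsiz)
			break;
		memcpy(out + o, s + i, (size_t)r);
		o += (size_t)r;
		i += (size_t)r;
	}
	out[o] = '\0';
	return o;
}
```

How non-ASCII bytes are treated depends on the locale set with setlocale(3); pure ASCII input is copied unchanged in any locale.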
|
|
Same reason as the previous commit (allow to expand to macros).
|
|
Use putc instead of fputc, it can be optimized to macros.
From the OpenBSD man page:
" putc() acts essentially identically to fputc(), but is a macro that
expands in-line. It may evaluate stream more than once, so arguments
given to putc() should not be expressions with potential side effects."
sfeed_atom, sfeed_frames and sfeed_html are using this function.
Mini-benchmarked sfeed_html and it went from 1.45s to 1.0s with feed files in
total 250k lines (+- 350MB). Tested with clang and gcc on OpenBSD on an older
laptop.
|
|
|
|
|
|
sfeed_gopher must be able to write in the current directory, but does not need
write permissions outside it. It could read from any place in the filesystem
(to read feed files).
Prompted by a suggestion from vejetaryenvampir, thanks!
|
|
... move sections around in a more logical order and tweak some words.
Prompted by a question and feedback from Aleksei, thanks!
|
|
|
|
|
|
Feeds should contain absolute URLs, but if they do not then this makes
it more convenient to configure such feeds.
|
|
|
|
sfeed_xmlenc is used automatically in sfeed_update for detecting the encoding.
In particular do not allow slashes anymore either. For example "//IGNORE" and
"//TRANSLIT" which are normally allowed.
Some iconv implementations might allow other funky names or even pathnames too,
so disallow that.
See also the notes about the "frommap" for the "-f" option.
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
+ some minor parsing handling improvements.
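A minimal sketch of a strict check for encoding names before they are passed on to iconv. The function name and the exact set of allowed characters (letters, digits, "-", "_", "." and ":") are assumptions of this sketch; the commit only states that slashes are no longer allowed.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s is a non-empty name consisting only
 * of characters this sketch allows, else 0. Slashes are rejected, so
 * suffixes such as "//IGNORE" and pathnames do not pass. */
int
valid_encoding_name_sketch(const char *s)
{
	if (*s == '\0')
		return 0;
	for (; *s; s++) {
		if (!isalnum((unsigned char)*s) &&
		    *s != '-' && *s != '_' && *s != '.' && *s != ':')
			return 0;
	}
	return 1;
}
```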
|
|
This happens because the previous link type is not reset when a <link> tag
starts again, but it is reset when a type attribute starts.
Found on the Spanish newspaper site: elpais.com
Input:
<link rel="alternate" href="https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada" type="application/rss+xml" title="RSS de la portada de El País"/>
<link rel="canonical" href="https://elpais.com"/>
Would print (second line is incorrect).
https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml
https://elpais.com/ application/rss+xml
Now prints:
https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml
Fix: reset it also at the start of a <link> tag in this case (for <base href />
it is still not wanted).
|
|
|
|
(ascii.jp)
|
|
Fix attribute parsing and now decode entities. The following now works (from
helsinkitimes.fi):
<base href="https://www.helsinkitimes.fi/" />
<link href="/?format=feed&type=rss" rel="alternate" type="application/rss+xml" title="RSS 2.0" />
<link href="/?format=feed&type=atom" rel="alternate" type="application/atom+xml" title="Atom 1.0" />
Properly associate attributes with the actual tag, this now parses properly
(from ascii.jp).
<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
<link rel="alternate" type="application/rss+xml" />
|
|
Fixes a regression introduced in the refactor in commit
e43b7a48b08a6bbcb4e730e80395b3257681b33e
Now copy the data by value. This structure is small and no performance
regression has been seen.
This was because the tag ID was modified which made subsequent parsed tags of
this type behave strangely:
ctx.tag->id = RSSTagGuidPermalinkTrue;
Input data to reproduce:
<rss>
<channel>
<item>
<guid isPermaLink="false">https://def/</guid>
</item>
<item>
<guid>https://abc/</guid>
</item>
</channel>
</rss>
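The general bug pattern (modifying an entry of a shared lookup table through a pointer) and the fix (copying the entry by value) can be sketched as below. All type, enum and function names here are hypothetical and simplified; they are not the actual sfeed structures.

```c
/* Hypothetical, simplified tag IDs and tag structure. */
enum tag_id_sketch {
	TAG_GUID_SKETCH,
	TAG_GUID_PERMALINK_TRUE_SKETCH
};

struct tag_sketch {
	const char *name;
	enum tag_id_sketch id;
};

/* shared lookup table, reused for every parsed tag */
static struct tag_sketch tags_sketch[] = {
	{ "guid", TAG_GUID_SKETCH },
};

/* Returns a pointer into the shared table: a change made through it
 * persists and affects all tags of this type parsed later. */
struct tag_sketch *
lookup_tag_ptr_sketch(void)
{
	return &tags_sketch[0];
}

/* Returns a copy: the caller can change the ID of its own copy without
 * touching the shared table. */
struct tag_sketch
lookup_tag_copy_sketch(void)
{
	return tags_sketch[0];
}
```

Copying a small structure like this costs little, which matches the observation above that no performance regression was seen.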
|
|
https://support.google.com/analytics/answer/1033867?hl=nl
|
|
Noticed strange output on the site ascii.jp:
The site HTML contained:
<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
<link rel="alternate" type="application/rss+xml" />
This would print:
"/img/apple-touch-icon.png application/rss+xml"
Now it prints:
" application/rss+xml"
|
|
Use the same base filename as the feed file, because sfeed_update replaces '/'
in names with '_':
filename="$(printf '%s' "$1" | tr '/' '_')"
This fixes the example for fetching feeds with names containing '/'.
Reported by __20h__, thanks!
|