sfeed.git - Suckless rss Feed reader with my configs

Age	Commit message	Author
2021-01-22	xml.c: fix typo / regression in checking codepoint range for utf-16 surrogate pair	Hiltjo Posthuma
	Regression in commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a
	0xdffff should be 0xdfff.
	printf '<item><title>👈</title></item>' | sfeed
	Before (bad): 👈
	After: 👈
2021-01-22	sfeed: fix regression with parsing content fields	Hiltjo Posthuma
	This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200.
	After a content tag was parsed the "iscontenttag" variable was not reset. This caused 2 regressions:
	- It ignored other tags such as links after it.
	- It incorrectly set the content-type of a lesser priority field.
	Thanks to pazz0 for reporting it!
2021-01-22	README: tested with lacc	Hiltjo Posthuma
	Interesting C compiler project: lacc: A simple, self-hosting C compiler: https://github.com/larmel/lacc
2021-01-22	xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence	Hiltjo Posthuma
	Simple way to reproduce:
	printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8
	Result:
	iconv: (stdin):1:8: cannot convert
	Output result:
	printf '<item><title>&#xdc00;</title></item>' | sfeed
	Before:
	00000000  09 ed b0 80 09 09 09 09  09 09 09 0a  |............|
	0000000c
	After:
	00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
	00000010  0a  |.|
	00000011
	The entity is output as a literal string. This makes it easier to see what's wrong and debug the feed, and it is consistent with the current behaviour of invalid named entities (&bla;). An alternative could be a UTF-8 replacement symbol (codepoint 0xfffd).
	Reference: https://unicode.org/faq/utf_bom.html , specifically:
	"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
	"A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]"
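The range check described in the two xml.c commits above can be sketched as follows. This is a minimal illustration, not the actual xml.c code: the function name cp_to_utf8 and its return convention (0 means "cannot be encoded") are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper (not the real xml.c function): encode a codepoint
 * as UTF-8 into buf, which must hold at least 4 bytes. Returns the number
 * of bytes written, or 0 if the codepoint has no well-formed UTF-8 form. */
size_t
cp_to_utf8(long cp, unsigned char *buf)
{
	if (cp < 0 || cp > 0x10ffff)
		return 0; /* outside the Unicode range */
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0; /* UTF-16 surrogate; the upper bound is 0xdfff, not 0xdffff */
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (unsigned char)(0xf0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
	return 4;
}
```

A caller that gets 0 back can fall back to printing the entity text literally, which is the behaviour the commit message describes. With the mistaken bound 0xdffff, every codepoint above the surrogate range up to 0xdffff (including 0x1f448, the emoji in the test case) would have been rejected as well.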
2021-01-16	sfeed_update: typo in comment	Hiltjo Posthuma

2021-01-16	sfeed_update: improve consistency of feed creation and merging	Hiltjo Posthuma
	- Improve feed creation with empty results and new feed files. Always make sure the file is created even when it is new and there are also no items (after filtering).
	- Consistency: always use the same feed file for merging. Do not use "/dev/null" when it is a new file. This works using sort, but is ugly when the merge() function is overridden and does something else. It should be the feed file always.
2021-01-16	sfeed_update: make convertencoding() consistent with other overridable functions	Hiltjo Posthuma
	This adds the name as the first parameter for the convertencoding() function, like filter, merge, order, etc. This can be useful to make an exception rule for text decoding in a cleaner way.
2021-01-16	sfeed_opml_import: minor code-style improvements	Hiltjo Posthuma

2021-01-16	sfeed_opml_import.1: clarify it handles OPML _subscription_ lists specifically	Hiltjo Posthuma
	OPML is a more generic format; this tool is specifically for "rss" types and subscription lists.
2021-01-16	README: newsboat sqlite3 export script: improvements	Hiltjo Posthuma
	- Export read/unread state to a separate plain-text "urls" file, line by line.
	- Handle white-space control-chars better.
	From the sfeed(1) man page:
	" The fields: title, id, author are not allowed to have newlines and TABs, all whitespace characters are replaced by a single space character. Control characters are removed."
	So do the reverse for newsboat as well: change white-space characters which are also control-characters (such as TABs and newlines) to a single space character.
2021-01-10	optimize converting UNIX timestamp to localtime	Hiltjo Posthuma
	Makes a huge difference (cuts the time in half to process the same amount of lines) on at least glibc 2.30 on Void Linux. Seems to make no difference on OpenBSD.
	- This removes at least one heap allocation per line (checked with valgrind). This is because glibc will strdup() the environment variable $TZ and free it each time, which is pointless here and wasteful.
	- localtime_r is not required to set variables like tzname.
	In glibc-2.30/time/tzset.c in __tz_convert is the following code and comment:
	/* Update internal database according to current TZ setting.
	   POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
	   This is a good idea since this allows at least a bit more parallelism. */
	tzset_internal (tp == &_tmbuf && use_localtime);
	This makes localtime() always call tzset() and inspect the environment ($TZ etc.), while with localtime_r it is only initialized once:
	static void tzset_internal (int always) { [...] if (is_initialized && !always) return;
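A minimal sketch of the localtime_r pattern this commit describes. The helper name format_localtime and the "%Y-%m-%d %H:%M" format are assumptions for illustration and are not taken from the sfeed source.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format a UNIX timestamp as local time into buf.
 * Uses the reentrant localtime_r(), which writes into a caller-supplied
 * struct tm instead of a shared static buffer.
 * Returns 0 on success, -1 on failure. */
int
format_localtime(time_t t, char *buf, size_t bufsiz)
{
	struct tm tm;

	if (!localtime_r(&t, &tm))
		return -1;
	if (!strftime(buf, bufsiz, "%Y-%m-%d %H:%M", &tm))
		return -1; /* result did not fit in buf */
	return 0;
}
```

Because POSIX does not require localtime_r() to behave as if tzset() was called, a portable program would call tzset() once at startup before formatting any lines.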
2021-01-09	printutf8pad: fix byte-seek issue with negative width codepoints in the range >= 127	Hiltjo Posthuma
	For example: "\xef\xbf\xb7" (codepoint 0xfff7), returns wcwidth(wc) == -1. The next byte was incorrectly seeked, but the codepoint itself was valid (mbtowc).
2021-01-09	printutf8pad: small code-style/clarify changes	Hiltjo Posthuma

2021-01-08	sfeed_atom: code-style: use conditional with pledge, like the other tools	Hiltjo Posthuma

2021-01-08	util.c: printutf8pad(): improve padded printing and printing invalid unicode characters	Hiltjo Posthuma
	This affects sfeed_plain.
	- Use unicode replacement character (codepoint 0xfffd) when a codepoint is invalid and proceed printing the rest of the characters.
	- When a codepoint is invalid reset the internal state of mbtowc(3), from the OpenBSD man page: " If a call to mbtowc() resulted in an undefined internal state, mbtowc() must be called with s set to NULL to reset the internal state before it can safely be used again."
	- Optimize for the common ASCII case and use a macro to print the character instead of a wasteful fwrite() function call. With 250k lines (+- 350MB) this improves printing performance from 1.7s to 1.0s on my laptop. On another system it improved by +- 25%. Tested with clang and gcc and also tested the worst-case (non-ASCII) with no penalty.
	To test: printf '0\tabc\xc3 def' | sfeed_plain
	Before: 1970-01-01 01:00 abc
	After: 1970-01-01 01:00 abc� def
2021-01-08	sfeed_gopher: optimize common output character function	Hiltjo Posthuma
	Same reason as the previous commit (allows the call to expand to a macro).
2021-01-08	xmlencode: optimize common character output function	Hiltjo Posthuma
	Use putc instead of fputc, it can be optimized to macros. From the OpenBSD man page: " putc() acts essentially identically to fputc(), but is a macro that expands in-line. It may evaluate stream more than once, so arguments given to putc() should not be expressions with potential side effects." sfeed_atom, sfeed_frames and sfeed_html are using this function. Mini-benchmarked sfeed_html and it went from 1.45s to 1.0s with feed files in total 250k lines (+- 350MB). Tested with clang and gcc on OpenBSD on an older laptop.
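A minimal sketch of an XML-escaping output loop that uses putc() for the common case, in the spirit of this commit. The exact set of escaped characters and the entity spellings here are assumptions for illustration and may differ from the real xmlencode().

```c
#include <stdio.h>

/* Hypothetical sketch: write s to fp with XML special characters escaped.
 * putc() may be implemented as a macro and may evaluate its stream argument
 * more than once, so fp is a plain variable without side effects. */
void
xmlencode(const char *s, FILE *fp)
{
	for (; *s; s++) {
		switch (*s) {
		case '<':  fputs("&lt;", fp);   break;
		case '>':  fputs("&gt;", fp);   break;
		case '\'': fputs("&#39;", fp);  break;
		case '&':  fputs("&amp;", fp);  break;
		case '"':  fputs("&quot;", fp); break;
		default:   putc(*s, fp);        break;
		}
	}
}
```

Most bytes in a feed take the default branch, so that is where an inlined putc() can save a function call per character.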
2021-01-03	man pages: add more real world examples to the man pages	Hiltjo Posthuma

2021-01-02	sfeed.1/sfeed_plain.1: add example, improve quoting the url for sfeed_web.1	Hiltjo Posthuma

2021-01-01	sfeed_gopher: tighten filesystem permissions on OpenBSD using unveil(2)	Hiltjo Posthuma
	sfeed_gopher must be able to write in the current directory, but does not need write permissions outside it. It could read from any place in the filesystem (to read feed files). Prompted by a suggestion from vejetaryenvampir, thanks!
2021-01-01	README: add text about page redirects + tweak some words.	Hiltjo Posthuma
	... move sections around in a more logical order and tweak some words. Prompted by a question and feedback from Aleksei, thanks!
2021-01-01	README: tested on MIPS32 (big-endian)	Hiltjo Posthuma

2021-01-01	LICENSE: bump year	Hiltjo Posthuma

2021-01-01	sfeed_update: if baseurl is empty then use the path from the feed by default	Hiltjo Posthuma
	Feeds should contain absolute urls, but when a feed does not, this makes it more convenient to configure such feeds.
2020-11-09	bump version to 0.9.20	Hiltjo Posthuma

2020-11-01	sfeed_xmlenc: be more paranoid in printing encoding names	Hiltjo Posthuma
	sfeed_xmlenc is used automatically in sfeed_update for detecting the encoding. In particular do not allow slashes anymore either. For example "//IGNORE" and "//TRANSLIT" which are normally allowed. Some iconv implementations might allow other funky names or even pathnames too, so disallow that. See also the notes about the "frommap" for the "-f" option. https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
	+ some minor parsing handling improvements.
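One way to express this kind of whitelist is sketched below. The function name is_safe_encname and the exact set of allowed characters are assumptions for this example; the commit message only states that slashes are no longer allowed.

```c
#include <ctype.h>

/* Hypothetical sketch: return 1 if s looks like a plain encoding name,
 * 0 otherwise. Only alphanumerics and a few punctuation characters are
 * accepted (an assumed whitelist), so names containing '/' such as
 * "utf-8//IGNORE" or pathnames are rejected. */
int
is_safe_encname(const char *s)
{
	unsigned char c;

	if (!*s)
		return 0; /* an empty name is not useful */
	for (; *s; s++) {
		c = (unsigned char)*s;
		if (!isalnum(c) && c != '-' && c != '_' && c != '.' && c != ':')
			return 0;
	}
	return 1;
}
```

A whitelist is the more defensive choice here than a blacklist, since the detected name ends up as an argument to iconv(1) and different iconv implementations accept different extensions.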
2020-10-31	sfeed_web: improve parsing a <link> if it has no type attribute	Hiltjo Posthuma
	This happens because the previous link type is not reset when a <link> tag starts again, but it is reset when a type attribute starts.
	Found on the Spanish newspaper site: elpais.com
	Input:
	<link rel="alternate" href="https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada" type="application/rss+xml" title="RSS de la portada de El País"/>
	<link rel="canonical" href="https://elpais.com"/>
	Would print (second line is incorrect):
	https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml
	https://elpais.com/ application/rss+xml
	Now prints:
	https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml
	Fix: reset it also at the start of a <link> tag in this case (for <base href /> it is still not wanted).
2020-10-24	bump version to 0.9.19	Hiltjo Posthuma

2020-10-22	sfeed_web: whoops, fix bug mentioned in the previous commit	Hiltjo Posthuma
	(ascii.jp)
2020-10-22	sfeed_web: attribute parsing improvements, improve man page	Hiltjo Posthuma
	Fix attribute parsing and now decode entities. The following now works (from helsinkitimes.fi):
	<base href="https://www.helsinkitimes.fi/" />
	<link href="/?format=feed&type=rss" rel="alternate" type="application/rss+xml" title="RSS 2.0" />
	<link href="/?format=feed&type=atom" rel="alternate" type="application/atom+xml" title="Atom 1.0" />
	Properly associate attributes with the actual tag, this now parses properly (from ascii.jp):
	<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
	<link rel="alternate" type="application/rss+xml" />
2020-10-22	Do not change the referenced matched tag data (from gettag()).	Hiltjo Posthuma
	Fixes a regression introduced in the refactor in commit e43b7a48b08a6bbcb4e730e80395b3257681b33e
	Now copy the data by value. This structure is small and no performance regression has been seen.
	This was because the tag ID was modified which made subsequent parsed tags of this type behave strangely:
	ctx.tag->id = RSSTagGuidPermalinkTrue;
	Input data to reproduce:
	<rss>
	<channel>
	<item>
	<guid isPermaLink="false">https://def/</guid>
	</item>
	<item>
	<guid>https://abc/</guid>
	</item>
	</channel>
	</rss>
2020-10-21	README: filter example, filter Google Analytics utm_* parameters	Hiltjo Posthuma
	https://support.google.com/analytics/answer/1033867?hl=nl
2020-10-21	sfeed_web: reset feedlink buffer	Hiltjo Posthuma
	Noticed strange output on the site ascii.jp:
	The site HTML contained:
	<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
	<link rel="alternate" type="application/rss+xml" />
	This would print:
	"/img/apple-touch-icon.png application/rss+xml"
	Now it prints:
	" application/rss+xml"
2020-10-18	README: improve etag example with escaping of the filename	Hiltjo Posthuma
	Use the same base filename as the feed file, because sfeed_update replaces '/' in names with '_':
	filename="$(printf '%s' "$1" | tr '/' '_')"
	This fixes the example for fetching feeds with names containing '/'.
	Reported by __20h__, thanks!
2020-10-18	README: add example to support ETag caching	Hiltjo Posthuma

2020-10-18	xml.c: initialize i = 0	Hiltjo Posthuma
	Forgot it in the cleanup commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d
2020-10-16	README.xml: reference examples, ANSI compatible, mention original parser	Hiltjo Posthuma

2020-10-16	README: fix unescaped character in regex in awk in filter example	Hiltjo Posthuma
	Found by testing using mawk.
2020-10-12	add a comment about the intended date priority	Hiltjo Posthuma

2020-10-12	Revert "RSS: give Dublin Core <dc:date> higher priority over <pubDate>"	Hiltjo Posthuma
	This reverts commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2.
2020-10-12	README: filter example: strip Facebook fbclid parameter	Hiltjo Posthuma

2020-10-12	simplify time parsing	Hiltjo Posthuma

2020-10-12	remove unneeded check for NUL terminator	Hiltjo Posthuma

2020-10-12	RSS: give Dublin Core <dc:date> higher priority over <pubDate>	Hiltjo Posthuma
	This way dc:date could be the updated time of the item. For Atom there is <published> and <updated> with the same logic.
2020-10-12	parse categories, add multiple field values support (for categories)	Hiltjo Posthuma
	Fields with multiple values are separated by '|'. In the future multiple enclosure support might be added.
	The categories tags are now parsed. This feature is useful for filtering and categorizing.
	Parsing of nested tags such as <author><name> has been improved. This code has been refactored.
	RSS <guid> isPermaLink is now handled differently also and will now prefer a permalink with "true" (link) over the ID. In practice multiple <guid> in an item does not happen.
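For filtering on such a multi-value field, a consumer has to match whole values rather than substrings. A minimal sketch, assuming a hypothetical helper named has_value (not part of sfeed):

```c
#include <string.h>

/* Hypothetical helper: return 1 if the '|'-separated field contains
 * value as one complete entry, 0 otherwise. */
int
has_value(const char *field, const char *value)
{
	size_t vlen = strlen(value), n;

	for (;;) {
		n = strcspn(field, "|"); /* length of the current entry */
		if (n == vlen && !strncmp(field, value, n))
			return 1;
		if (field[n] == '\0')
			return 0; /* that was the last entry */
		field += n + 1; /* skip the entry and its separator */
	}
}
```

Comparing the entry length first means that "te" does not match an entry "tech", which a plain strstr() search would get wrong.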
2020-10-09	xml: remove unused code for sfeed	Hiltjo Posthuma

2020-10-09	fix counting due to uninitialized variable when the time could not be parsed	Hiltjo Posthuma
	Since commit 276d5789fd91d1cbe84b7baee736dea28b1e04c0 if the time is empty or could not be parsed then it is shown/aligned as a blank space instead of being skipped. An oversight in this change was that items should be counted and set in `isnew`. This commit fixes the uninitialized variable and possible miscounting.
2020-10-09	xml.h: minor comment rewording	Hiltjo Posthuma

2020-10-09	sfeed: parse day with max 2 digits (instead of 4)	Hiltjo Posthuma

2020-10-09	sfeed: support the ISO8601 time format without separators	Hiltjo Posthuma
	For example "19720229T132245Z" is now supported.
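A minimal sketch of parsing this compact form with sscanf(3). It does not reflect how sfeed parses time internally; the function name parse_compact_iso8601 is an assumption, and the trailing timezone designator ("Z") is ignored here.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical sketch: parse "YYYYMMDDThhmmss" (ISO8601 without
 * separators) into *tm. Anything after the seconds, such as "Z", is
 * ignored. Returns 0 on success, -1 on a malformed or out-of-range value. */
int
parse_compact_iso8601(const char *s, struct tm *tm)
{
	int y, mon, d, h, min, sec;

	if (sscanf(s, "%4d%2d%2dT%2d%2d%2d", &y, &mon, &d, &h, &min, &sec) != 6)
		return -1;
	if (mon < 1 || mon > 12 || d < 1 || d > 31 ||
	    h < 0 || h > 23 || min < 0 || min > 59 || sec < 0 || sec > 60)
		return -1;

	memset(tm, 0, sizeof(*tm));
	tm->tm_year = y - 1900; /* struct tm counts years from 1900 */
	tm->tm_mon = mon - 1;   /* and months from 0 */
	tm->tm_mday = d;
	tm->tm_hour = h;
	tm->tm_min = min;
	tm->tm_sec = sec;
	return 0;
}
```

The field widths in the format string do the splitting that separators would normally do. Note that sscanf() tolerates leading whitespace and signs, so a stricter parser would check each digit by hand.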