Age  Commit message  Author
2021-04-28  fixup: a regression with RSS guid, by default ispermalink="true"  Hiltjo Posthuma
2021-04-28  use the last href attribute value if there are multiple set  Hiltjo Posthuma
Input to reproduce:

<entry>
<link href="https://codemadness.org/a" href="https://codemadness.org/b"/>
</entry>

Old value: "https://codemadness.org/ahttps://codemadness.org/b"
New value: "https://codemadness.org/b"

The same applies to RSS <enclosure url="" />.
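The old value above is what you get when attribute values are only ever appended to a buffer. A minimal sketch of the "last value wins" behaviour, assuming a hypothetical fixed-size buffer and two hypothetical parser callbacks (attr_start, attr_append); this is not the actual sfeed.c code:

```c
#include <string.h>

/* Hypothetical buffer holding the value of one attribute. */
struct attrbuf {
	char data[256];
	size_t len;
};

/* Called when an attribute begins: discard any value seen earlier,
 * so a repeated attribute replaces the previous one. */
void
attr_start(struct attrbuf *b)
{
	b->len = 0;
	b->data[0] = '\0';
}

/* Called for each chunk of the attribute value: append, truncating
 * silently when the buffer is full. */
void
attr_append(struct attrbuf *b, const char *s)
{
	size_t n = strlen(s);
	size_t room = sizeof(b->data) - 1 - b->len;

	if (n > room)
		n = room;
	memcpy(b->data + b->len, s, n);
	b->len += n;
	b->data[b->len] = '\0';
}
```

Without the reset in attr_start() the two href values from the input above would be concatenated.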
2021-04-28  add support for old/legacy Atom 0.3 feeds  Hiltjo Posthuma
This standard was a draft used around 2005-2006. Instead of the fields "published" and "updated" it used "issued" (mandatory field) and "modified" (optional). Add support for them, while still preferring the Atom 1.0 fields and creation dates first. I don't know of any real-life examples that still use this though.

Some references:
- http://rakaz.nl/2005/07/moving-from-atom-03-to-10.html
- https://www.dokuwiki.org/syndication (rss_type "atom" parameter value).
- https://support.google.com/merchants/answer/160598?hl=en
2021-04-28  sfeed.{1,5}: improve documentation, the content-type field can be empty...  Hiltjo Posthuma
... if there is no content.
2021-04-28  enable unlocked I/O by default  Hiltjo Posthuma
getchar_unlocked is part of POSIX and should be supported by most platforms. On all tested platforms it has a performance benefit, sometimes smallish (<12%), sometimes large (~40%).
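A minimal sketch of unlocked reading, assuming a hypothetical helper count_lines(). It uses getc_unlocked(3) so it works on any stream; getchar_unlocked() is the same call on stdin. The stream lock is taken once around the loop instead of once per character. This is an illustration only, not the code sfeed uses:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Count newline characters in a stream using unlocked reads. */
long
count_lines(FILE *fp)
{
	long n = 0;
	int c;

	flockfile(fp); /* lock the stream once for the whole loop */
	while ((c = getc_unlocked(fp)) != EOF) {
		if (c == '\n')
			n++;
	}
	funlockfile(fp);

	return n;
}
```

In a single-threaded program the explicit flockfile()/funlockfile() pair is not strictly needed, it is kept here to show where the per-call locking cost went.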
2021-04-28  README: update newsboat export script  Hiltjo Posthuma
Since newsboat version 2.22 (2020-12-21) it stores the content mime-type of a field, so allow exporting it. The older entries are empty and will be exported as "html" (even though they might have been plain-text). Also add the (empty) category field.
2021-04-28  improve "ispermalink", "rel" and "type" attribute handling/buffering  Hiltjo Posthuma
2021-04-28  improve content-type "type" attribute handling/buffering  Hiltjo Posthuma
2021-04-27  sfeed.c: detect the proper mime-type for XHTML  Hiltjo Posthuma
Reference: https://www.w3.org/2003/01/xhtml-mimetype/
2021-04-24  fix a comment code-style  Hiltjo Posthuma
This fix is very important *ahem*.
2021-03-13  bump version to 0.9.22  Hiltjo Posthuma
2021-03-12  sfeed_web.1, sfeed_xmlenc.1: remove unneeded mdoc escape sequence  Hiltjo Posthuma
2021-03-03  sfeed_update: return instead of exit in main() on success  Hiltjo Posthuma
This is useful so the script can be included, call main and then have additional post-main functionality.
2021-03-02  README: workaround empty fields with *BSD xargs -0  Hiltjo Posthuma
Work around it by setting the empty "middle" fields to some value. The last field can be empty.

Some feeds were incorrectly using the wrong base URL if the `baseurl` field was empty but the encoding field was set, so the encoding field was incorrectly used instead. I only now noticed that some feeds were failing, because the baseURL is validated since commit f305b032bc19b4e81c0dd6c0398370028ea910ca, returning a non-zero exit status on failure.

This doesn't happen with GNU xargs, busybox or toybox xargs. Affected (at least): OpenBSD, NetBSD, FreeBSD and DragonFlyBSD xargs, which share similar code.

Simple way to reproduce the difference:

printf 'a\0\0c\0' | xargs -0 echo

Prints "a c" on *BSD (the empty field is dropped).
Prints "a  c" (two spaces, the empty field is kept) on GNU xargs (and some other implementations).
2021-03-01  sfeed_update: fix baseurl substitution  Hiltjo Posthuma
Follow-up from a rushed commit:

commit 58555779d123be68c0acf9ea898931d656ec6d63
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Feb 28 13:33:21 2021 +0100

    sfeed_update: simplify, use feedurl directly

    This also makes it possible to use non-authoritative URLs as a baseurl, like "magnet:" URLs.
2021-03-01  util.c: uri_makeabs: check initial base URI field, not dest `a` (style)  Hiltjo Posthuma
No functional difference because the base URI host is copied beforehand.
2021-03-01  sfeed.1: reference sfeed_update and sfeedrc  Hiltjo Posthuma
The shellscript is optional, but reference it in the documentation.
2021-03-01  sfeed_update: simplify, use feedurl directly  Hiltjo Posthuma
This also makes it possible to use non-authoritative URLs as a baseurl, like "magnet:" URLs.
2021-03-01  util: improve/refactor URI parsing and formatting  Hiltjo Posthuma
Removed or rewrote the functions absuri, parseuri and encodeuri() (for percent-encoding). The functions are now split up, each with its own purpose:

- uri_format: format struct uri into a string.
- uri_hasscheme: quick check if a string is absolute or not.
- uri_makeabs: make a URI absolute using a base uri and the original URI.
- uri_parse: parse a string into a struct uri.

The following URLs are better parsed:

- URLs with extra "/"'s prepended in the path are kept as is; no "/" is added for empty paths either.
- URLs like "http://codemadness.org" are not changed to "http://codemadness.org/" anymore (paths are kept as is, unless they are non-empty and do not start with "/").
- Paths are not percent-encoded anymore.
- URLs with a userinfo field (username, password) are parsed, like: ftp://user:password@[2001:db8::7]:2121/rfc/rfc1808.txt
- Non-authoritative URLs like mailto:some@email.org, magnet URIs, ISBN URIs/urn, like: urn:isbn:0-395-36341-1 are allowed and parsed correctly.
- Both local (file:///) and non-local (file://) are supported.
- Specifying a base URL with a port will now only use it when the relative URL has no host and port set, and follows RFC3986 5.2.2 more closely.
- Parsing numeric port: parse as signed long and check <= 0, empty port is allowed.
- Parsing URIs containing query, fragment, but no path separator (/) will now parse the component properly.

For sfeed:

- Parse the baseURI only once (no need to do it every time for making absolute URIs).
- If a link/enclosure is absolute already or if there is no base URL specified then just print the link directly.

There have also been other small performance improvements related to handling URIs.

References:
- https://tools.ietf.org/html/rfc3986
- Section "5.2.2. Transform References" has also been helpful.
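To illustrate the "quick check if a string is absolute or not", here is a minimal sketch based on the scheme grammar of RFC 3986 section 3.1 (a letter, followed by letters, digits, "+", "-" or ".", then ":"). The name has_scheme() and its helpers are hypothetical; this is not the uri_hasscheme() from util.c and the real function may differ:

```c
/* ASCII-only character classes, independent of the current locale. */
static int
is_ascii_alpha(int c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int
is_ascii_digit(int c)
{
	return c >= '0' && c <= '9';
}

/* Return 1 if s starts with "scheme:" as defined by RFC 3986, else 0. */
int
has_scheme(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	if (!is_ascii_alpha(*p))
		return 0; /* a scheme must start with a letter */
	for (p++; is_ascii_alpha(*p) || is_ascii_digit(*p) ||
	    *p == '+' || *p == '-' || *p == '.'; p++)
		;
	return *p == ':';
}
```

With this check both "http://codemadness.org" and non-authoritative forms such as "urn:isbn:0-395-36341-1" count as absolute, while "/rfc/rfc1808.txt" does not.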
2021-03-01  README: combine bandwidth saving options into one section  Hiltjo Posthuma
Combine E-Tags and If-Modified-Since in one section. Also mention the curl --compressed option, typically used for GZIP decompression. Note that E-Tags were broken in curl <7.73 due to a bug with "weak" e-tags: https://github.com/curl/curl/issues/5610

From a question/feedback by e-mail from Hadrien Lacour, thanks.
2021-02-05  sfeed_update: $SFEED_UPDATE_INCLUDE: be a bit more precise/pedantic  Hiltjo Posthuma
2021-02-04  sfeed.c: fix time parsing regression with non-standard date format  Hiltjo Posthuma
The commit that introduced the regression was:

commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:55:35 2020 +0200

    simplify time parsing

Noticed on an RSS feed with the following date:

<pubDate>2021-02-03 05:13:03</pubDate>

This format is non-standard, but sfeed should support this. A standard format would be (for Atom): 2021-02-03T05:13:03Z

Partially revert it.
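A minimal sketch of a lenient parser that accepts both forms above, with either "T" or a space between date and time. The struct datetime and parse_datetime() names are hypothetical and this is not how sfeed.c parses time; anything after the seconds (such as "Z" or a time zone offset) is ignored here:

```c
#include <stdio.h>

/* Hypothetical broken-down date and time, no time zone handling. */
struct datetime {
	int year, mon, day, hour, min, sec;
};

/* Parse "YYYY-MM-DD hh:mm:ss" or "YYYY-MM-DDThh:mm:ss".
 * Returns 0 on success, -1 on failure. */
int
parse_datetime(const char *s, struct datetime *dt)
{
	int n;

	/* %*1[T ] consumes exactly one separator: 'T' or a space */
	n = sscanf(s, "%4d-%2d-%2d%*1[T ]%2d:%2d:%2d",
	    &dt->year, &dt->mon, &dt->day,
	    &dt->hour, &dt->min, &dt->sec);
	if (n != 6)
		return -1;
	if (dt->year < 0 || dt->mon < 1 || dt->mon > 12 ||
	    dt->day < 1 || dt->day > 31 ||
	    dt->hour < 0 || dt->hour > 23 ||
	    dt->min < 0 || dt->min > 59 ||
	    dt->sec < 0 || dt->sec > 60)
		return -1;
	return 0;
}
```

The range checks are deliberately coarse (no days-per-month validation); a real parser would also have to handle the time zone part.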
2021-01-28  README: fix xargs -P example when there are no feeds  Hiltjo Posthuma
Kind of a non-issue, but if there is a sfeedrc with no feeds then xargs will still be executed and give an error. The xargs -r option (GNU extension) fixes this.

From the OpenBSD xargs(1) man page:

"-r Do not run the command if there are no arguments. Normally the command is executed at least once even if there are no arguments."

Reproducible with the sfeedrc:

feeds() {
	true
}
2021-01-27  sfeed_update: $SFEED_UPDATE_INCLUDE: be a bit more precise/pedantic  Hiltjo Posthuma
2021-01-27  typofixes  Hiltjo Posthuma
2021-01-27  README: add an example script to reuse the sfeed_update code  Hiltjo Posthuma
This code uses the non-portable xargs -P option to more efficiently process feeds in parallel.
2021-01-27  sfeed_update: allow to reuse the code more easily as an included script  Hiltjo Posthuma
This adds a main() function. When the environment variable $SFEED_UPDATE_INCLUDE is set then it will not execute the main handler. The other functions are included and can be reused. This is also useful for unit-testing.
2021-01-27  sfeed_update: separate code of parallel execution and feed() into a _feed() handler  Hiltjo Posthuma
This is useful to be able to reuse the code (together with using sfeed_update as an included script, coming in the next commit).
2021-01-27  sfeed_update: shuffle code getting the path of the feedurl to make the basesiteurl  Hiltjo Posthuma
Move it closer to where it is used.
2021-01-27  sfeed_update: change parse failure error message  Hiltjo Posthuma
"(FAIL CONVERT)" -> "(FAIL PARSE)". Convert may be too similar to text encoding conversion.
2021-01-27  sfeed_update: add an overridable parse() function, using sfeed(1) by default  Hiltjo Posthuma
This can be useful to write connector scripts more cleanly. This does not necessarily even have to be in the sfeed(5) format.
2021-01-24  sfeed_opml_export: fix typos in comment  Hiltjo Posthuma
2021-01-24  sfeed_update: print the filename again as passed as a parameter  Hiltjo Posthuma
... and do not show stderr of readlink.
2021-01-23  bump version to 0.9.21  Hiltjo Posthuma
2021-01-22  xml.c: fix typo / regression in checking codepoint range for utf-16 surrogate pair  Hiltjo Posthuma
Regression in commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a

0xdffff should be 0xdfff.

printf '<item><title>&#x1f448;</title></item>' | sfeed

Before (bad):
&#x1f448;
After:
👈
2021-01-22  sfeed: fix regression with parsing content fields  Hiltjo Posthuma
This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200. After a content tag was parsed the "iscontenttag" variable was not reset.

This caused 2 regressions:
- It ignored other tags such as links after it.
- It incorrectly set the content-type of a lesser priority field.

Thanks to pazz0 for reporting it!
2021-01-22  README: tested with lacc  Hiltjo Posthuma
Interesting C compiler project: lacc: A simple, self-hosting C compiler: https://github.com/larmel/lacc
2021-01-22  xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence  Hiltjo Posthuma
Simple way to reproduce:

printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8

Result:
iconv: (stdin):1:8: cannot convert

Output result:

printf '<item><title>&#xdc00;</title></item>' | sfeed

Before:
00000000  09 ed b0 80 09 09 09 09  09 09 09 0a              |............|
0000000c
After:
00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
00000010  0a                                                |.|
00000011

The entity is output as a literal string. This makes it easier to see what is wrong and to debug the feed, and it is consistent with the current behaviour of invalid named entities (&bla;). An alternative could be a UTF-8 replacement symbol (codepoint 0xfffd).

Reference: https://unicode.org/faq/utf_bom.html , specifically:

"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
"A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]"
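A minimal sketch of a codepoint to UTF-8 encoder that refuses the UTF-16 surrogate range 0xd800-0xdfff, so the caller can fall back to printing the entity literally. The function name codepoint_to_utf8() is hypothetical and this is not the xml.c implementation:

```c
/* Encode codepoint cp as UTF-8 into buf (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not a valid
 * Unicode scalar value (out of range or a UTF-16 surrogate). */
int
codepoint_to_utf8(long cp, unsigned char *buf)
{
	if (cp < 0 || cp > 0x10ffff)
		return 0;
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0; /* surrogates have no valid UTF-8 form */

	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	buf[0] = (unsigned char)(0xf0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
	return 4;
}
```

For 0x1f448 this yields the bytes f0 9f 91 88. For 0xdc00 it returns 0 instead of producing the ill-formed ed b0 80 sequence shown in the "Before" dump.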
2021-01-16  sfeed_update: typo in comment  Hiltjo Posthuma
2021-01-16  sfeed_update: improve consistency of feed creation and merging  Hiltjo Posthuma
- Improve feed creation with empty results and new feed files. Always make sure the file is created even when it is new and there are also no items (after filtering).
- Consistency: always use the same feed file for merging. Do not use "/dev/null" when it is a new file. This works using sort, but is ugly when the merge() function is overridden and does something else. It should be the feed file always.
2021-01-16  sfeed_update: make convertencoding() consistent with other overridable functions  Hiltjo Posthuma
This adds the name as the first parameter for the convertencoding() function, like filter, merge, order, etc. This can be useful to make an exception rule for text decoding more cleanly.
2021-01-16  sfeed_opml_import: minor code-style improvements  Hiltjo Posthuma
2021-01-16  sfeed_opml_import.1: clarify it handles OPML _subscription_ lists specifically  Hiltjo Posthuma
OPML is a more generic format; this tool is specifically for "rss" types and subscription lists.
2021-01-16  README: newsboat sqlite3 export script: improvements  Hiltjo Posthuma
- Export read/unread state to a separate plain-text "urls" file, line by line.
- Handle white-space control-chars better.

From the sfeed(1) man page:

"The fields: title, id, author are not allowed to have newlines and TABs, all whitespace characters are replaced by a single space character. Control characters are removed."

So do the reverse for newsboat as well: change white-space characters which are also control-characters (such as TABs and newlines) to a single space character.
2021-01-10  optimize converting UNIX timestamp to localtime  Hiltjo Posthuma
Makes a huge difference (cuts the time in half to process the same amount of lines) on at least glibc 2.30 on Void Linux. Seems to make no difference on OpenBSD.

- This removes at least one heap allocation per line (checked with valgrind). This is because glibc will strdup() the environment variable $TZ and free it each time, which is pointless here and wasteful.
- localtime_r does not require to set the variables like tzname.

In glibc-2.30/time/tzset.c in __tz_convert is the following code and comment:

/* Update internal database according to current TZ setting.
   POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
   This is a good idea since this allows at least a bit more parallelism. */
tzset_internal (tp == &_tmbuf && use_localtime);

This makes localtime() always call tzset() and inspect the environment ($TZ etc.), while with localtime_r it will only initialize it once:

static void
tzset_internal (int always)
{
[...]
if (is_initialized && !always)
return;
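A minimal sketch of the localtime_r(3) pattern described above. Since POSIX does not require localtime_r() to behave as if tzset() was called, the sketch calls tzset() once itself. The wrapper name to_localtime() and the static flag are hypothetical (and not thread-safe as written); this is not the code from the commit:

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Convert a UNIX timestamp to local broken-down time.
 * Returns 0 on success, -1 on failure. */
int
to_localtime(time_t t, struct tm *out)
{
	static int initialized;

	/* read $TZ once up front instead of on every conversion */
	if (!initialized) {
		tzset();
		initialized = 1;
	}
	return localtime_r(&t, out) ? 0 : -1;
}
```

The result goes into a caller-provided struct tm, so nothing is shared with the static buffer that localtime() returns.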
2021-01-09  printutf8pad: fix byte-seek issue with negative width codepoints in the range >= 127  Hiltjo Posthuma
For example: "\xef\xbf\xb7" (codepoint 0xfff7), returns wcwidth(wc) == -1. The next byte was incorrectly seeked, but the codepoint itself was valid (mbtowc).
2021-01-09  printutf8pad: small code-style/clarify changes  Hiltjo Posthuma
2021-01-08  sfeed_atom: code-style: use conditional with pledge, like the other tools  Hiltjo Posthuma
2021-01-08  util.c: printutf8pad(): improve padded printing and printing invalid unicode characters  Hiltjo Posthuma
This affects sfeed_plain.

- Use the unicode replacement character (codepoint 0xfffd) when a codepoint is invalid and proceed printing the rest of the characters.
- When a codepoint is invalid reset the internal state of mbtowc(3), from the OpenBSD man page:
  "If a call to mbtowc() resulted in an undefined internal state, mbtowc() must be called with s set to NULL to reset the internal state before it can safely be used again."
- Optimize for the common ASCII case and use a macro to print the character instead of a wasteful fwrite() function call.

With 250k lines (+- 350MB) this improves printing performance from 1.7s to 1.0s on my laptop. On another system it improved by +- 25%. Tested with clang and gcc and also tested the worst-case (non-ASCII) with no penalty.

To test:

printf '0\tabc\xc3 def' | sfeed_plain

Before:
1970-01-01 01:00 abc
After:
1970-01-01 01:00 abc� def
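A minimal sketch of the mbtowc(3) reset described in the man page quote: on an invalid sequence the internal state is reset, one byte is skipped and decoding continues. The function count_chars() is hypothetical and only counts characters; it is not the printutf8pad() code. Which bytes decode successfully depends on the locale selected with setlocale(3):

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in s, treating each undecodable byte as one
 * (replacement) character. The number of such bytes is stored in
 * *ninvalid. */
size_t
count_chars(const char *s, size_t *ninvalid)
{
	size_t n = 0, i, len = strlen(s);
	wchar_t wc;
	int r;

	*ninvalid = 0;
	for (i = 0; i < len; n++) {
		r = mbtowc(&wc, s + i, len - i);
		if (r <= 0) {
			/* invalid: reset the state, skip one byte */
			mbtowc(NULL, NULL, 0);
			(*ninvalid)++;
			i++;
		} else {
			i += (size_t)r;
		}
	}
	return n;
}
```

A printing routine would emit the replacement character (codepoint 0xfffd) at the point where this sketch increments *ninvalid.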
2021-01-08  sfeed_gopher: optimize common output character function  Hiltjo Posthuma
Same reason as the previous commit (allow to expand to macros).