Age | Commit message | Author
2021-01-09 | printutf8pad: small code-style/clarify changes | Hiltjo Posthuma
2021-01-08 | sfeed_atom: code-style: use conditional with pledge, like the other tools | Hiltjo Posthuma
2021-01-08 | util.c: printutf8pad(): improve padded printing and printing invalid unicode characters | Hiltjo Posthuma
    This affects sfeed_plain.

    - Use the unicode replacement character (codepoint 0xfffd) when a codepoint
      is invalid and proceed printing the rest of the characters.
    - When a codepoint is invalid reset the internal state of mbtowc(3). From
      the OpenBSD man page:
      "If a call to mbtowc() resulted in an undefined internal state, mbtowc()
      must be called with s set to NULL to reset the internal state before it
      can safely be used again."
    - Optimize for the common ASCII case and use a macro to print the character
      instead of a wasteful fwrite() function call. With 250k lines (+- 350MB)
      this improves printing performance from 1.7s to 1.0s on my laptop. On
      another system it improved by +- 25%. Tested with clang and gcc and also
      tested the worst-case (non-ASCII) with no penalty.

    To test:
        printf '0\tabc\xc3 def' | sfeed_plain
    Before:
        1970-01-01 01:00 abc
    After:
        1970-01-01 01:00 abc� def
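The three points in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code in util.c: `print_padded` is a hypothetical name, and it assumes every decoded character is one column wide (the real function may well measure display width).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not the util.c code): print at most `len` characters
 * of `s` to `fp`, then pad with `pad` up to `len`.
 * Simplification: every decoded character counts as one column.
 */
void
print_padded(FILE *fp, const char *s, size_t len, int pad)
{
	size_t col = 0, i = 0, slen = strlen(s);
	wchar_t wc;
	int inc;

	while (i < slen && col < len) {
		if ((unsigned char)s[i] < 0x80) {
			/* common ASCII case: putc() may expand to a macro */
			putc(s[i], fp);
			i++;
			col++;
			continue;
		}
		inc = mbtowc(&wc, s + i, slen - i);
		if (inc <= 0) {
			/* invalid sequence: reset the mbtowc() state, print
			 * U+FFFD (UTF-8 bytes EF BF BD) and skip one byte */
			mbtowc(NULL, NULL, 0);
			fputs("\xef\xbf\xbd", fp);
			inc = 1;
		} else {
			/* valid multibyte character: copy its bytes as-is */
			fwrite(s + i, 1, (size_t)inc, fp);
		}
		i += (size_t)inc;
		col++;
	}
	for (; col < len; col++)
		putc(pad, fp);
}
```

A caller would typically run `setlocale(LC_CTYPE, "")` first, since mbtowc(3) decodes according to the current locale. Something like `print_padded(stdout, title, 70, ' ')` would then print one padded column.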
2021-01-08 | sfeed_gopher: optimize common output character function | Hiltjo Posthuma
    Same reason as the previous commit (allow to expand to macros).
2021-01-08 | xmlencode: optimize common character output function | Hiltjo Posthuma
    Use putc instead of fputc, it can be optimized to macros.

    From the OpenBSD man page:
    "putc() acts essentially identically to fputc(), but is a macro that
    expands in-line. It may evaluate stream more than once, so arguments given
    to putc() should not be expressions with potential side effects."

    sfeed_atom, sfeed_frames and sfeed_html are using this function.

    Mini-benchmarked sfeed_html and it went from 1.45s to 1.0s with feed files
    totalling 250k lines (+- 350MB). Tested with clang and gcc on OpenBSD on an
    older laptop.
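For illustration, an XML-escaping output function built on putc() might look like the sketch below. The set of escaped characters and the entity spellings are assumptions, not taken from the sfeed source. Both arguments passed to putc() are plain variables, so evaluating the stream more than once is harmless.

```c
#include <stdio.h>

/* Sketch: escape XML special characters, print everything else with putc(). */
void
xmlencode(const char *s, FILE *fp)
{
	for (; *s; s++) {
		switch (*s) {
		case '<':  fputs("&lt;", fp);   break;
		case '>':  fputs("&gt;", fp);   break;
		case '\'': fputs("&#39;", fp);  break;
		case '&':  fputs("&amp;", fp);  break;
		case '"':  fputs("&quot;", fp); break;
		default:   putc(*s, fp);        break; /* common case */
		}
	}
}
```

Note that something like `putc(*s++, fp)` would be fine, but `putc(c, files[i++])` would not: only the stream argument may be evaluated more than once.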
2021-01-03 | man pages: add more real world examples to the man pages | Hiltjo Posthuma
2021-01-02 | sfeed.1/sfeed_plain.1: add example, improve quoting the url for sfeed_web.1 | Hiltjo Posthuma
2021-01-01 | sfeed_gopher: tighten filesystem permissions on OpenBSD using unveil(2) | Hiltjo Posthuma
    sfeed_gopher must be able to write in the current directory, but does not
    need write permissions outside it. It could read from any place in the
    filesystem (to read feed files).

    Prompted by a suggestion from vejetaryenvampir, thanks!
2021-01-01 | README: add text about page redirects + tweak some words. | Hiltjo Posthuma
    ... move sections around in a more logical order and tweak some words.

    Prompted by a question and feedback from Aleksei, thanks!
2021-01-01 | README: tested on MIPS32 (big-endian) | Hiltjo Posthuma
2021-01-01 | LICENSE: bump year | Hiltjo Posthuma
2021-01-01 | sfeed_update: if baseurl is empty then use the path from the feed by default | Hiltjo Posthuma
    Feeds should contain absolute urls, but when a feed does not, this makes
    it more convenient to configure such feeds.
2020-11-09 | bump version to 0.9.20 | Hiltjo Posthuma
2020-11-01 | sfeed_xmlenc: be more paranoid in printing encoding names | Hiltjo Posthuma
    sfeed_xmlenc is used automatically in sfeed_update for detecting the
    encoding.

    In particular do not allow slashes anymore either. For example "//IGNORE"
    and "//TRANSLIT" which are normally allowed.

    Some iconv implementations might allow other funky names or even pathnames
    too, so disallow that. See also the notes about the "frommap" for the "-f"
    option.
    https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html

    + some minor parsing handling improvements.
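A whitelist check in this spirit could be sketched as below. The function name and the exact set of allowed characters (ASCII letters, digits, '-', '_', '.' and ':') are assumptions for illustration; only the rejection of slashes comes from the commit message.

```c
#include <stddef.h>

/*
 * Hypothetical check: return 1 when `name` is non-empty and consists only of
 * ASCII letters, digits, '-', '_', '.' or ':'; return 0 otherwise.
 * Slashes are rejected, so "UTF-8//IGNORE" and pathnames do not pass.
 */
int
is_safe_encoding_name(const char *name)
{
	const unsigned char *p = (const unsigned char *)name;

	if (*p == '\0')
		return 0;
	for (; *p; p++) {
		if ((*p >= 'a' && *p <= 'z') ||
		    (*p >= 'A' && *p <= 'Z') ||
		    (*p >= '0' && *p <= '9') ||
		    *p == '-' || *p == '_' || *p == '.' || *p == ':')
			continue;
		return 0; /* anything else, including '/', is refused */
	}
	return 1;
}
```

Explicit ASCII ranges are used instead of isalnum(3) so the result does not depend on the current locale.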
2020-10-31 | sfeed_web: improve parsing a <link> if it has no type attribute | Hiltjo Posthuma
    This happens because the previous link type is not reset when a <link> tag
    starts again, but it is reset when a type attribute starts.

    Found on the Spanish newspaper site: elpais.com

    Input:
        <link rel="alternate" href="https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada" type="application/rss+xml" title="RSS de la portada de El País"/>
        <link rel="canonical" href="https://elpais.com"/>

    Would print (second line is incorrect):
        https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml
        https://elpais.com/ application/rss+xml

    Now prints:
        https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml

    Fix: reset it also at the start of a <link> tag in this case (for
    <base href /> it is still not wanted).
2020-10-24 | bump version to 0.9.19 | Hiltjo Posthuma
2020-10-22 | sfeed_web: whoops, fix bug mentioned in the previous commit | Hiltjo Posthuma
    (ascii.jp)
2020-10-22 | sfeed_web: attribute parsing improvements, improve man page | Hiltjo Posthuma
    Fix attribute parsing and now decode entities. The following now works
    (from helsinkitimes.fi):
        <base href="https://www.helsinkitimes.fi/" />
        <link href="/?format=feed&amp;type=rss" rel="alternate" type="application/rss+xml" title="RSS 2.0" />
        <link href="/?format=feed&amp;type=atom" rel="alternate" type="application/atom+xml" title="Atom 1.0" />

    Properly associate attributes with the actual tag. This now parses
    properly (from ascii.jp):
        <link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
        <link rel="alternate" type="application/rss+xml" />
2020-10-22 | Do not change the referenced matched tag data (from gettag()). | Hiltjo Posthuma
    Fixes a regression introduced in the refactor in commit
    e43b7a48b08a6bbcb4e730e80395b3257681b33e

    Now copy the data by value. This structure is small and no performance
    regression has been seen.

    The regression occurred because the tag ID was modified, which made
    subsequent parsed tags of this type behave strangely:
        ctx.tag->id = RSSTagGuidPermalinkTrue;

    Input data to reproduce:
        <rss>
        <channel>
        <item>
        <guid isPermaLink="false">https://def/</guid>
        </item>
        <item>
        <guid>https://abc/</guid>
        </item>
        </channel>
        </rss>
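The difference between keeping a pointer into the shared tag table and keeping a copy can be sketched as follows. The types, the table contents, `start_guid()` and the name `RSSTagGuidPermalinkFalse` are hypothetical and only loosely modelled on the names in this commit message. In this sketch a `<guid>` without the attribute simply keeps the table's default ID.

```c
#include <stddef.h>
#include <string.h>

enum TagId {
	TagUnknown,
	RSSTagGuid,
	RSSTagGuidPermalinkFalse, /* hypothetical name */
	RSSTagGuidPermalinkTrue
};

struct Tag {
	const char *name;
	enum TagId id;
};

/* shared lookup table: must stay unmodified between items */
static const struct Tag tags[] = {
	{ "guid", RSSTagGuid },
};

static const struct Tag unknown = { "", TagUnknown };

static const struct Tag *
gettag(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(tags) / sizeof(tags[0]); i++)
		if (strcmp(tags[i].name, name) == 0)
			return &tags[i];
	return &unknown;
}

struct Ctx {
	struct Tag tag; /* a copy, not a pointer into tags[] */
};

/* Simulate the start of a <guid> tag with an optional isPermaLink value. */
enum TagId
start_guid(struct Ctx *ctx, const char *ispermalink)
{
	ctx->tag = *gettag("guid"); /* copy by value */
	if (ispermalink != NULL && strcmp(ispermalink, "true") == 0)
		ctx->tag.id = RSSTagGuidPermalinkTrue; /* changes the copy only */
	else if (ispermalink != NULL && strcmp(ispermalink, "false") == 0)
		ctx->tag.id = RSSTagGuidPermalinkFalse;
	return ctx->tag.id;
}
```

Declaring the table `const` and returning a pointer to `const` is one way to let the compiler reject an accidental write through the lookup result.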
2020-10-21 | README: filter example, filter Google Analytics utm_* parameters | Hiltjo Posthuma
    https://support.google.com/analytics/answer/1033867?hl=nl
2020-10-21 | sfeed_web: reset feedlink buffer | Hiltjo Posthuma
    Noticed strange output on the site ascii.jp.

    The site HTML contained:
        <link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
        <link rel="alternate" type="application/rss+xml" />

    This would print:
        "/img/apple-touch-icon.png application/rss+xml"

    Now it prints:
        " application/rss+xml"
2020-10-18 | README: improve etag example with escaping of the filename | Hiltjo Posthuma
    Use the same base filename as the feed file, because sfeed_update replaces
    '/' in names with '_':
        filename="$(printf '%s' "$1" | tr '/' '_')"

    This fixes the example for fetching feeds with names containing '/'.

    Reported by __20h__, thanks!
2020-10-18 | README: add example to support ETag caching | Hiltjo Posthuma
2020-10-18 | xml.c: initialize i = 0 | Hiltjo Posthuma
    Forgot it in the cleanup commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d
2020-10-16 | README.xml: reference examples, ANSI compatible, mention original parser | Hiltjo Posthuma
2020-10-16 | README: fix unescaped character in regex in awk in filter example | Hiltjo Posthuma
    Found by testing using mawk.
2020-10-12 | add a comment about the intended date priority | Hiltjo Posthuma
2020-10-12 | Revert "RSS: give Dublin Core <dc:date> higher priority over <pubDate>" | Hiltjo Posthuma
    This reverts commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2.
2020-10-12 | README: filter example: strip Facebook fbclid parameter | Hiltjo Posthuma
2020-10-12 | simplify time parsing | Hiltjo Posthuma
2020-10-12 | remove unneeded check for NUL terminator | Hiltjo Posthuma
2020-10-12 | RSS: give Dublin Core <dc:date> higher priority over <pubDate> | Hiltjo Posthuma
    This way dc:date could be the updated time of the item. For Atom there is
    <published> and <updated> with the same logic.
2020-10-12 | parse categories, add multiple field values support (for categories) | Hiltjo Posthuma
    Fields with multiple values are separated by '|'. In the future multiple
    enclosure support might be added.

    The categories tags are now parsed. This feature is useful for filtering
    and categorizing.

    Parsing of nested tags such as <author><name> has been improved. This code
    has been refactored.

    RSS <guid> isPermaLink is now handled differently also and will now prefer
    a permalink with "true" (link) over the ID.

    In practice multiple <guid> in an item does not happen.
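A consumer of such a field could split it on '|' along these lines. This is only a sketch: `field_values` is a hypothetical helper, and it assumes a literal '|' never appears inside a single value.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split `field` on '|' without copying.
 * Stores up to `max` start pointers and lengths and returns how many values
 * were stored. An empty field yields 0 values.
 */
size_t
field_values(const char *field, const char *vals[], size_t lens[], size_t max)
{
	const char *p = field;
	size_t n = 0, len;

	if (*p == '\0')
		return 0;
	while (n < max) {
		len = strcspn(p, "|"); /* length up to the next '|' or the end */
		vals[n] = p;
		lens[n] = len;
		n++;
		if (p[len] != '|')
			break;
		p += len + 1; /* skip the separator */
	}
	return n;
}
```

Because the values are not NUL-terminated, a caller would print them with something like `printf("%.*s\n", (int)lens[i], vals[i])`.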
2020-10-09 | xml: remove unused code for sfeed | Hiltjo Posthuma
2020-10-09 | fix counting due to uninitialized variable when the time could not be parsed | Hiltjo Posthuma
    Since commit 276d5789fd91d1cbe84b7baee736dea28b1e04c0 if the time is empty
    or could not be parsed then it is shown/aligned as a blank space instead of
    being skipped.

    An oversight in this change was that items should be counted and set in
    `isnew`.

    This commit fixes the uninitialized variable and possible miscounting.
2020-10-09 | xml.h: minor comment rewording | Hiltjo Posthuma
2020-10-09 | sfeed: parse day with max 2 digits (instead of 4) | Hiltjo Posthuma
2020-10-09 | sfeed: support the ISO8601 time format without separators | Hiltjo Posthuma
    For example "19720229T132245Z" is now supported.
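As an illustration of the format, the separator-less form can be parsed with fixed field widths. The sketch below uses sscanf(3) for brevity, which is an assumption and is not how sfeed parses time. `struct datetime` and `parse_iso8601_basic` are hypothetical names, and only a trailing "Z" or nothing is accepted after the seconds.

```c
#include <stdio.h>

struct datetime {
	int year, mon, day, hour, min, sec;
};

/*
 * Hypothetical parser for "YYYYMMDDThhmmss" with an optional trailing 'Z'.
 * Returns 0 on success, -1 on malformed input.
 * It does not check the number of days per month.
 */
int
parse_iso8601_basic(const char *s, struct datetime *dt)
{
	int n = 0;

	if (sscanf(s, "%4d%2d%2dT%2d%2d%2d%n", &dt->year, &dt->mon, &dt->day,
	    &dt->hour, &dt->min, &dt->sec, &n) != 6)
		return -1;
	/* only "Z" or the end of the string may follow the seconds */
	if (!(s[n] == '\0' || (s[n] == 'Z' && s[n + 1] == '\0')))
		return -1;
	if (dt->mon < 1 || dt->mon > 12 || dt->day < 1 || dt->day > 31 ||
	    dt->hour < 0 || dt->hour > 23 || dt->min < 0 || dt->min > 59 ||
	    dt->sec < 0 || dt->sec > 60)
		return -1;
	return 0;
}
```

The day is read with a width of two digits, matching the "max 2 digits" change in the neighbouring commit.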
2020-10-09 | README: tested with cproc and sdcc on Z80 emulator, for fun | Hiltjo Posthuma
    cproc:
        cproc: https://github.com/michaelforney/cproc
        qbe: https://c9x.me/compile/

    z80 (sfeed base program):
        fuzix: http://www.fuzix.org/
        RC2014 emulator: https://github.com/EtchedPixels/RC2014
        sdcc: http://sdcc.sourceforge.net/
2020-10-09 | man pages: tweak alignment of lists | Hiltjo Posthuma
2020-10-09 | xml.c: remove buffering of comment data, which is unused anyway | Hiltjo Posthuma
2020-10-09 | xml.h: add underscore for #ifdef guard | Hiltjo Posthuma
    This is the common style.
2020-10-09 | XML cdata callback: handle CDATA as data | Hiltjo Posthuma
    This improves handling CDATA for example in Atom feeds with:
        <author><email><![CDATA[abc]]><name><![CDATA[[person]]></name></author>
2020-07-06 | bump version to 0.9.18 | Hiltjo Posthuma
2020-07-05 | sfeed_atom: minor simplification, gmtime_r is not needed here | Hiltjo Posthuma
2020-07-05 | README: reference sfeed_curses | Hiltjo Posthuma
2020-07-05 | README: improvements | Hiltjo Posthuma
    - Add an example to optimize bandwidth use with the curl -z option.
    - Add a note about CDNs blocking based on the User-Agent (based on a
      question mailed to me).
    - Add a script to convert existing newsboat items to the sfeed(5) TSV
      format.
2020-07-05 | format tools: don't skip items with a missing/invalid timestamp field | Hiltjo Posthuma
    Handle it appropriately in the context of each format tool. Output the
    item but keep it blanked.

    NOTE: maybe in sfeed_twtxt it should use the current time instead?
2020-07-05 | sfeed_mbox: don't ignore items with a missing/invalid timestamp | Hiltjo Posthuma
    The Date header is mandatory. Use the current time if it is
    missing/invalid.
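The fallback described here can be sketched as a small helper. It assumes the timestamp field is a decimal UNIX timestamp. `timestamp_or_now` is a hypothetical name, and the current time is passed in by the caller (for example `time(NULL)`) so the function is easy to test.

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper: parse a decimal UNIX timestamp from `field`.
 * Returns `now` when the field is empty, not a number, has trailing
 * garbage or is out of range.
 */
time_t
timestamp_or_now(const char *field, time_t now)
{
	char *end;
	long long v;

	errno = 0;
	v = strtoll(field, &end, 10);
	if (errno != 0 || end == field || *end != '\0')
		return now;
	return (time_t)v;
}
```

A format tool that needs a mandatory date could call `timestamp_or_now(field, time(NULL))` and format the result for its header.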
2020-07-05 | sfeed_atom: the updated field is mandatory: use the current time... | Hiltjo Posthuma
    ... if it is missing/invalid.