If you have no mouth, you cannot scream - please kill RSS!

A while ago I started using Liferea (a feed reader) after I realised that I was spending way too much time just checking websites for new content. It was time to declare web bankruptcy and switch over to feeds.

Things were good for a while. Time passed, and more and more feeds were added.

Eventually the Ironman challenge launched and I subscribed to its Atom feed. All seemed fine until some posts appeared as HTML tag soup, with tags displayed as text instead of being rendered.

This is Perl, I should be okay with that, so let's fix that bug then - I thought, and that's when the fun started.

Ironman uses Plagger to manage feed aggregation, and after a bit of looking around I found out that it uses XML::Atom internally to publish the Atom feed.

Atom stores the meat in the body of the content element, and the WTFing on my side started when I realised what the relevant part of the code does.

Simplified, it's something like:

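# $copy, $data, $elem and $content come from the surrounding method (not shown).
my $node;
eval {
    # Try to parse the content as XML.
    my $parser = XML::LibXML->new;
    my $tree = $parser->parse_string($copy);
    $node = $tree->getDocumentElement;
};
if (!$@ && $node) {
    # Parsing succeeded: attach the parsed tree and call it xhtml.
    $elem->appendChild($node);
    $content->type('xhtml');
} else {
    # Parsing failed: store the raw data as a text node and guess the
    # type from whether the data starts with a tag.
    $elem->appendChild(XML::LibXML::Text->new($data));
    $content->type($data =~ /^\s*</ ? 'html' : 'text');
}

In effect, the code tries to parse the data as XML(!) and, if that succeeds, adds the content node with type xhtml. If parsing fails and the data begins with a tag, it's treated as html, otherwise as text.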
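
To make the failure mode concrete, here is a minimal sketch of the fallback branch on its own. The function name `fallback_type` and the sample strings are made up for illustration; only the regular expression is taken from the excerpt. The samples are chosen so that none of them is well-formed XML (each has an unclosed `<br>`), which is the assumption under which the excerpt would reach its else branch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the else branch of the excerpt: the same
# regular expression, applied once XML parsing has already failed.
sub fallback_type {
    my ($data) = @_;
    return $data =~ /^\s*</ ? 'html' : 'text';
}

# Illustrative inputs; each contains an unclosed <br>, so none of them
# is well-formed XML.
my @samples = (
    '<p>First line<br>second line</p>',
    '  <p>indented<br>markup</p>',
    'First line<br>second line',
);

# Print each sample next to the type the heuristic picks for it.
printf "%-36s => %s\n", $_, fallback_type($_) for @samples;
```

The first two samples come out as html, but the third is labelled text only because it does not happen to start with a tag. A feed reader that trusts that label would show the `<br>` literally, which matches the tag soup described above.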

This is where I decided that XML::Atom is broken on a design level. Attempting to guess format/encoding is a serious antipattern as far as I'm concerned. The format should be specified, not guessed. What's worse about XML::Atom is that in the current code there is no way to override that guess and specify the content's type directly.
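
For contrast, here is a rough sketch of what "specified, not guessed" could look like. This is not XML::Atom's API and not the hack committed for Ironman; `atom_content` and `xml_escape` are hypothetical helpers that build an Atom 1.0 content element from a body plus a type the caller states explicitly.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML character data.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: serialize an Atom <content> element from a body
# and a type supplied by the caller. Nothing is guessed.
sub atom_content {
    my ($body, $type) = @_;
    die "type is required\n" unless defined $type;

    if ($type eq 'text' || $type eq 'html') {
        # Both are carried as escaped character data; only the type
        # attribute tells the reader whether to treat it as markup.
        return sprintf '<content type="%s">%s</content>',
            $type, xml_escape($body);
    }
    if ($type eq 'xhtml') {
        # The caller vouches that $body is well-formed XHTML; this
        # sketch does not validate it.
        return '<content type="xhtml">'
             . '<div xmlns="http://www.w3.org/1999/xhtml">'
             . $body
             . '</div></content>';
    }
    die "unsupported type: $type\n";
}

print atom_content('a < b', 'text'), "\n";
print atom_content('<p>a &lt; b</p>', 'html'), "\n";
print atom_content('<p>a &lt; b</p>', 'xhtml'), "\n";
```

The same body can be published as html or xhtml, and the serialized output differs accordingly; the point is that the decision belongs to the caller, who knows what the data is, and a missing type is an error instead of an invitation to guess.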

I've committed some hacks to work around this problem for Ironman; however, the proper fix would be to replace XML::Atom and XML::Feed in Plagger with a proper solution. A worthy task for the lazyweb, eligible* for a vertical meter of beer if the person is present at YAPC::EU 2009 or possibly some later event.

Returning to the ironfeed, I thought the nasty hacks I had made were the last of the tag soup problem and that I could enjoy a well-formatted feed from then on. It turned out not to be so.

If you have no mouth, you cannot scream

To understand my shock it has to be said that I learn as I go along, so before I started out fixing the content type of the feed entries I didn't know too much about RSS or Atom or the modules used to create the feed.

Imagine my shock when I found out that extracting type information from the most commonly used feed format, RSS 2.0, is impossible by design, because that piece of sh... not well designed format does not carry any.

Let me reiterate once more: guessing encoding or formatting doesn't work. It belongs in the hall of antipatterns where people parse HTML with regular expressions and embed SQL in JavaScript.

Because of this, I'd like to make a humble request:

Please, please, stop using RSS and publish your feeds using Atom 1.0, so that we know how to interpret the content you're publishing!

Thank you.

* Eligibility is determined by my subjective judgement. A necessary but not sufficient condition is that the solution can generate a valid Atom Ironman feed with no detectable formatting problems when the data comes from Atom sources, and that it has tests.

1 Comment

I've been bothered by the look of some feeds myself (including the ones from my blog). It is nice to see someone investigating it. Unfortunately, I do not know how to change the Blogger settings so that the feed will look nice. And when I look at the XML source of the feed, I see that I am already using the Atom format.

About this Entry

This page contains a single entry by szbalint published on June 2, 2009 8:00 PM.
