Sunday, June 10, 2012

OPML Update

Here are about two thousand feeds you can import into the RSS reader of your choice. I think I have all the duplicates eliminated, but I did add some reprehensible items via a little browse through Google's feeds and bundles so that I may find out what Tom Friedman and Michelle Malkin think is important. The amount of programming content is somewhat balanced out now, but there's still a little too much marketing/business/motivational happy-talk for my taste. I find Google Reader sucks for this much stuff, but then I think Google Reader sucks anyway. There are plenty of alternatives out there that handle information overload better.

Even geekier portion of this post:

When I gathered all these feeds together there were a lot of duplicate feeds, and figuring those out was a pain in the ass because a lot of people had customized the names and descriptions and so forth. What to do?

It helps to know that an OPML file is just text, and that NetNewsWire will fetch descriptions for you. So here's a line of text representing the feed for Super Colossal (which is great but sadly not recently updated):

<outline text="Super Colossal" description="" title="Super Colossal" type="rss" version="RSS" htmlUrl="http://supercolossal.ch" xmlUrl="http://supercolossal.ch/feed/"/>

Pretty readable: there's text, an empty description, a title, and then some things that look less irrelevant. So, if we have a second line in the file that says

<outline text="Also Super Colossal" description="It's got stuff about building things" title="Also Super Colossal" type="rss" version="RSS" htmlUrl="http://supercolossal.ch" xmlUrl="http://supercolossal.ch/feed/"/>

it's going to appear out of order on the list and eliminating a duplicate is tough (time-consuming really). TextWrangler to the rescue, one of the ultimate Mac freebies for geeks. TextWrangler does grep, meaning search-and-replace with regular expressions, which is very powerful and also dangerous if you don't know what you're doing. I frequently don't know what I'm doing, but I just don't care, so on we went. Since we know NetNewsWire will fetch text and title and description, all we have to do is wipe those for the 2800 feeds in the file. Easy! So here's a search that more or less means look for anything that says 'title="anything"' and replace it with an emptied version:

[Image: Nitwit Grep]

And there's the result:

[Image: Grep Result]
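For anyone without TextWrangler handy, the same wipe can be roughed out in Python. This is only a sketch, not what I actually ran: the function name blank_fields is made up, and it assumes each feed sits on its own line with plain double-quoted attributes, as in the two Super Colossal lines above.

```python
import re

# Matches text="...", description="..." or title="..." with a non-greedy value.
# The \b stops it from matching the tail end of some longer attribute name.
FIELD = re.compile(r'\b(text|description|title)=".*?"')

def blank_fields(line: str) -> str:
    """Empty the name-ish attributes and leave everything else alone."""
    return FIELD.sub(r'\1=""', line)

first = ('<outline text="Super Colossal" description="" title="Super Colossal" '
         'type="rss" version="RSS" htmlUrl="http://supercolossal.ch" '
         'xmlUrl="http://supercolossal.ch/feed/"/>')
second = ('<outline text="Also Super Colossal" '
          'description="It\'s got stuff about building things" '
          'title="Also Super Colossal" type="rss" version="RSS" '
          'htmlUrl="http://supercolossal.ch" xmlUrl="http://supercolossal.ch/feed/"/>')

# Once the customized fields are wiped, the two lines are character-for-character the same.
print(blank_fields(first) == blank_fields(second))
```

The non-greedy .*? matters: a plain .* would run to the last quote on the line and eat the URLs too.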

So you just do that a couple more times for the other fields - I guess you could do text=".*?" description=".*?" title=".*?" and get it all done at once but I wasn't smart enough to think of it at the time and grep and ambition don't match well with n00bs - and then the fields are wiped. TextWrangler can then put selected lines in order and then process them in various ways, like removing duplicate lines:

[Image: Processing Lines]
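The sort-then-remove-duplicates step is about two lines of Python if you'd rather script it. A minimal sketch, assuming you hand it only the outline lines (not the OPML header and footer, which shouldn't be sorted); sort_and_dedupe is a made-up name, and stripping whitespace is my own addition so that differently indented copies still count as the same line.

```python
def sort_and_dedupe(lines):
    """Return the distinct non-empty lines, whitespace-stripped, in sorted order."""
    unique = {line.strip() for line in lines if line.strip()}
    return sorted(unique)

wiped = [
    '    <outline text="" description="" title="" xmlUrl="http://supercolossal.ch/feed/"/>',
    '<outline text="" description="" title="" xmlUrl="http://example.com/feed/"/>',
    '<outline text="" description="" title="" xmlUrl="http://supercolossal.ch/feed/"/>',
]

for line in sort_and_dedupe(wiped):
    print(line)
```

Here example.com is a placeholder feed, not one from the actual list.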

And there we go, a whole bunch of feeds eliminated. DKW would do it in Excel, but I swear TextWrangler is faster. Down to around 2100. Empty out NetNewsWire, reimport the cleaned-up OPML and it'll acquire the right text and description and title on its own. And if the feeds vary slightly (one's RSS and one's Atom), it's pretty likely that when NetNewsWire gathers information about the feed it's gonna show up next to its not-quite-duplicated partner anyway, and you can get rid of that pretty easily.
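If you'd rather spot the RSS-and-Atom pairs before reimporting, one option (not what I did, which was to let NetNewsWire line them up) is to group the lines by htmlUrl and list any site that has more than one feed URL. A rough sketch with a made-up function name, again assuming one outline per line with double-quoted attributes; the /atom/ address below is invented for the example.

```python
import re
from collections import defaultdict

HTML_URL = re.compile(r'\bhtmlUrl="(.*?)"')
XML_URL = re.compile(r'\bxmlUrl="(.*?)"')

def near_duplicates(lines):
    """Map each site URL to its feed URLs, keeping only sites with more than one feed."""
    feeds_by_site = defaultdict(set)
    for line in lines:
        site = HTML_URL.search(line)
        feed = XML_URL.search(line)
        if site and feed:
            feeds_by_site[site.group(1)].add(feed.group(1))
    return {site: sorted(feeds) for site, feeds in feeds_by_site.items() if len(feeds) > 1}

sample = [
    '<outline text="" htmlUrl="http://supercolossal.ch" xmlUrl="http://supercolossal.ch/feed/"/>',
    '<outline text="" htmlUrl="http://supercolossal.ch" xmlUrl="http://supercolossal.ch/feed/atom/"/>',
    '<outline text="" htmlUrl="http://example.com" xmlUrl="http://example.com/feed/"/>',
]

for site, feeds in near_duplicates(sample).items():
    print(site, feeds)
```

It only flags candidates. Some sites really do publish several distinct feeds, so the deleting still has to be done by eye.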

10 comments:

ifthethunderdontgetya™³²®© said...

I don't use it, either.

And if the mustache of understanding wants my opinion on something, he can leave a comment on my blog just like anybody else.
~

Substance McGravitas said...

Busy. Only 47000 unread items to go.

mikey said...

I use Google Reader pretty much exclusively.

I've got a thousand or so feeds in there, but I spend the majority of my time in the "everyday" folder - less than 100. I go in and clean it up every now and then, but that really just means creating a new folder and starting with a core set of feeds. Eventually it grows into something unwieldy and annoying so I do it again. But yeah, I bet there are some feeds that are duplicated in a number of folders. I should just delete most of them and start over, but I'm not getting any performance problems, so I reckon there's no compelling reason to do so...

ifthethunderdontgetya™³²®© said...

S McG, is there a way to convert those items into digital music?
~

Substance McGravitas said...

Yes, in a bunch of different ways. One of my favourites was for the old Mac OS, and it would take pictures and interpret position, colour and luminance of pixels and make sound out of them. Simple geometrical pictures worked best for organized noise, but the mush a landscape would produce was quite listenable.

Screenshots of the text of this, particularly from TextWrangler using a monospace font, would make something I'd listen to.

Substance McGravitas said...

There's also a current Mac screensaver that uses RSS feeds: not hard to make those trigger different noises in Quartz Composer. Maybe I should play with that.

Substance McGravitas said...

I spend the majority of my time in the "everyday" folder

My everyday folder is pretty much the blogroll. The RSS feeds are the world outside that.

Blogroll needs an update.

M. Bouffant said...

Thanks, but I don't think it does the blog-o-sphere any good if we're both getting the same feeds.

Only 322 unread.

Substance McGravitas said...

I know you're not getting the same feeds as me. I don't think you'd put up with all the programming bullshit for instance.

Substance McGravitas said...

Speaking of which, this collection fills me full of love.