Sacrificing web history on the altar of instant

As I said in my last post about Twitter’s lack of a business model, I’ve been doing some research lately for a think tank. My research has basically consisted of three things:

  • Looking back on the media coverage of an event that happened in early 2010
  • Looking back at the way bloggers reacted to said event
  • And having a quick look at Twitter for reactions there too

Pretty simple stuff, I think you’ll agree. My assumption was that I would be able to tap into Google News; Google Blogs, Icerocket and maybe Technorati; and Twitter’s archives. Then I’d be able to scrape the data using something like Outwit Hub, chuck it in Excel and Bob’s your uncle.
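If you’re curious what that plan looks like as code, here’s a purely hypothetical sketch: the result markup, URLs and field names are all invented for illustration (real search pages are far messier), but the shape of the job – parse results, pull out URL, title, date and source, dump to CSV for Excel – is the same.

```python
import csv
import io
from html.parser import HTMLParser

# A hypothetical fragment of a search-results page; real result
# markup varies by engine and changes without notice.
RESULTS_HTML = """
<div class="result">
  <a href="http://example.com/ash-cloud">Ash cloud grounds flights</a>
  <span class="date">16 Apr 2010</span>
  <span class="source">Example News</span>
</div>
<div class="result">
  <a href="http://example.com/airports-shut">Airports shut across Europe</a>
  <span class="date">16 Apr 2010</span>
  <span class="source">Example Gazette</span>
</div>
"""

class ResultParser(HTMLParser):
    """Collect url/title/date/source records from result markup."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # which field we're currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # Each link starts a new result row.
            self.rows.append({"url": attrs["href"], "title": "",
                              "date": "", "source": ""})
            self._field = "title"
        elif tag == "span" and attrs.get("class") in ("date", "source"):
            self._field = attrs["class"]

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        self._field = None

parser = ResultParser()
parser.feed(RESULTS_HTML)

# Dump to CSV, ready to open in Excel.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["url", "title", "date", "source"])
writer.writeheader()
writer.writerows(parser.rows)
print(out.getvalue())
```

In practice a tool like Outwit Hub does this point-and-click, following ‘next’ links through each page of results – which matters later in this story.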

Oh, how sadly misguided, how spectacularly wrong.

Now before you talk about how ephemeral the web is and how no one should rely on it for anything, that’s only partly true. A lot of stuff on the web stays on the web, and given how much of our digital selves we are putting online, we do need to think about archiving and how we preserve our stuff for the future. This post, though, is not about archiving but about accessing what’s already out there.

The first thing I did when I started my research was try to go back in time in Google News to early 2010 and search for news articles about the particular story – the eruption of Eyjafjallajökull – I was interested in.

But Google’s News search results are fuzzy. I wanted to search for the news on particular days, e.g. all the news about Eyjafjallajökull on 16 April 2010. Do that search, and you’ll be presented with lots of results, many of them not from 16 April 2010 at all, but from 17 April, or even 18 or 15 April.

I wanted to refine the search by location, so I restricted it to Pages from the UK. Fascinatingly, this included Der Spiegel, Business Daily Africa, Manila Bulletin, FOX News, Le Post and a whole bunch of media sources that, when I last looked, weren’t based in the UK.

So I now have search results which are limited to neither the date nor the place that I want. But even worse, results are clustered by story, which might seem like a good idea but which in reality falls short. Firstly, these clusters are often not groups of similar stories at all, but stories that share a few keywords while actually being about slightly different things. I can see the sense in clustering stories together to cut down on duplication for the reader, but equally, sometimes I just want a damn list.

Whilst doing my research, I also found that Google News is not, as I had thought, a subset of Google Web results. If you do the same searches on Google Web you get a slightly different set of data, obviously including non-news sites, but actually also including some news sites that aren’t in Google News, and not including many that are.

So far, so annoying, but Google isn’t the only search engine in the world… Except, Google pwned search years ago and innovation in search appears to be almost entirely absent. Bing does news, but a search on Eyjafjallajökull tosses up just three pages of results, and you can’t sort by date. Yahoo News finds nothing. A friend suggested that my local library might have a searchable news archive, but the one I looked at was unworkable for what I wanted.

I’m sure there are paid archives of digital news, but that wasn’t within my budget and, to be honest, given how much news is out there in the wild, there should be a good way to search it. I even tried the Google News API, but that has exactly the same unwanted behaviours as the website.

But hey, things will be better in the blog search, right?

Years ago, in the golden era of blogging, Technorati worked. Their site used to be really great, and I loved it so much I did some work with them. These days, I’m not quite sure what it’s for. It’s certainly not for search, given that it finds nothing for Eyjafjallajökull. Icerocket is a better search engine, and you can refine by date, but it finds nothing on our target date, which is surprising as it’s a day or so after Eyjaf popped her top and the flight ban was well underway – you’d think someone on the internet might have had something to say about it.

So, we’re back to Google Blogs. It lets me restrict by date! And specify UK-only! And it coughs up one page of results. Really? We have 4.57m bloggers in the UK and only 35 of them wrote something? I’ve always had my suspicions that the Google Blog index was poorly formed, but Google Blogs is the only choice I have, so I just have to put up with it. At least the results are in a neat list and all on the target date, even if some of them are clearly not from the UK, or even actually blogs, for that matter.

Now then, Twitter. We all know that Twitter’s archives have been on the endangered list for some time, but although they aren’t deleting old Tweets, accessing them is very difficult. Although the advanced search page offers dates going back to 2008, you get an error if you try to search for April 2010: “since date or since_id is too old”.

SocialMention is a new search site that I’ve started to find really useful. It searches across the majority of the social web and lets you break the results down by type. So I can search for ‘Eyjafjallajökull’ in ‘microblogs’ and get realtime results, but I can’t go back in time further than ‘last month’.

So, we’re back to Google again, this time Google Realtime. It only goes back to early 2010, so lucky for me that my target date is within that period. But the only way I can access that date is by a really clunky timeline interface – I can’t specify a date as I can in Google’s other searches.

Furthermore, there’s no pagination. I can’t hit ‘next page’ at the bottom and fish through a bunch of search pages to find something interesting – my navigation through the results is entirely dependent on the timeline interface. Such an interface will and does entirely outwit Outwit, which can normally follow ‘next’ links to scrape data from an entire search. I doubt it knows how to deal with the stupid timeline interface.

After all this searching and frustration, I’m left with this question:

What has happened to our web history?
The web is mutable, yes, but there’s an awful lot of fire-and-forget content that, generally speaking, hangs around for years. Individual blogs may come and go, but overall there’s a huge pool of blog content out there. Same for news. Twitter is a slightly weird case because it’s a single service with a huge archive of historically interesting data which it isn’t letting just anyone get at. Not even scholars. Twitter may have given its archive to the Library of Congress, but even that’s going to be limited access if their blog post is anything to go by:

In addition to looking at preservation issues, the Library will be working with academic research communities to explore issues related to researcher access.  The Twitter collection will serve as a helpful case study as we develop policies for research use of our digital archives. Tools and processes for researcher access will be developed from interaction with researchers as well as from the Library’s ongoing experience with serving collections and protecting privacy and rights.

The Library is not Twitter and will not try to reproduce its functionality.  We are interested in offering collections of tweets that are complementary to some of the Library’s digital collections: for example, the National Elections Web Archive or the Supreme Court Nominations Web Archive. We will make an announcement when the collection is available for research use.

I’m not an academic researcher, so whether I’d even get access to the archive for research is up in the air. (I can’t find any updates as to the availability of Twitter’s archive via the Library of Congress, so if anyone has info, please leave a comment.)

I think we have two problems here, one already briefly mentioned above.

1. Google has pwned search
For years, Google has been the dominant search engine, and in some ways they’ve paid a price for this as publishers of all stripes have climbed on the Google haterz bandwagon. My suspicion is that Google’s fuzzy news search results are a sop to the news industry, because Google is surely capable of producing a rich and valuable search tool that allows the user to see whatever data they want to see, in whatever layout they want. Maybe, after all the stupid shit the news publishers have thrown their way, Google thinks that building failure into their news search product will insulate them from criticism from the industry.

But I don’t think that this absolves Google of responsibility for the lack of finesse in historical search. After all, which bloggers are gathering together to demand Google not index them? And Twitter users who don’t want to be indexed by Google can go private with their account.

But Google dominance does seem to have caused other search engines to wither on the vine. It’s almost like no one is bothering to innovate in search anymore. Bloglines used to be a pretty good blog search engine, but it has gone the way of the dodo. Technorati is now useless as a search engine. Bing is a starting point, but needs an awful lot of work if it’s going to compete. News search is completely underserved, and Twitter… really, Twitter archival search is non-existent.

Are Google really so far ahead that they can’t be touched? Are they really so great that no one is going to bother challenging them? The answer to the first question is clearly no: they aren’t so brilliant that their work can’t be improved upon, not just in terms of the search algorithm, which has come in for a lot of criticism lately, but also in terms of their interface and the granularity of advanced searches. And I’d be deeply disturbed if people thought that the answer to the second question was yes. Google are the incumbent, but that makes them vulnerable to smaller, more nimble, more innovative competitors.

2. Historic search has been sidelined to serve instant
What’s going on right now? That’s the question that most search engines seem to be asking these days. Most have limited or zero capacity to look back on our web history, focusing instead on instant search. The immediacy of tools like Twitter and Facebook is alluring, especially for brands and companies who want to know what’s being said about them so that they can respond in a timely fashion.

But focusing on now and abandoning deeper, more nuanced historic searches is a disturbing trend. Searching the web’s past for research purposes might be a minority sport, but can we as a society really afford to disenfranchise our own past? Can we afford to alienate the researchers and ethnographers and anthropologists who want to learn about how our digital world has changed? About our reactions to events, as they happened rather than remembered years later? There is value in archives, but not if they are locked up, and the key thrown away by the search engines.

We cannot afford to sacrifice our history on the altar of instant. We can’t just say goodbye to the idea of being able to find out about our past, because it’s ok, we can see just how pretty the present looks. The obsession with instant risks not just our past, it also risks our future.

12 thoughts on “Sacrificing web history on the altar of instant”

  1. Thank you for highlighting the abysmal state of search! And FYI, if you have a public library card, you might have free access to several excellent news search databases.

  2. I do agree with you on the topic of making it clearer how to search the past, yet I believe this is more of a presentational problem than a technical one. Google News uses clustering technology most commonly known as Topic Detection & Tracking. The most “relevant” result is shown on top, yet the “topic” can have a creation date that is different from what they show on the page.

    I am by no means a fan of Google News, as the clustering technique they use has obvious drawbacks for usability and is also heavily skewed towards highly reputable sources. I can elaborate on this more if you want, yet the date constraint comment is probably more a presentation than a technology problem.

  3. Mike, I do have a library card, and I did have a look at my nearest library’s online resources, but they weren’t suitable for what I needed, which was to be able to scrape URL, publisher, title and date to an Excel spreadsheet. Had I had more time I might have been able to delve further into the world of library databases, but instead I went with what I knew I could get out of Google, tedious as it was.

    Erik, I’m sure that Google News has the expertise to fix this without having to try too hard. As you say, it’s mainly a presentation issue. If they wanted to make the search non-fuzzy, to return only those articles with the specified date, they could. They could even be much more accurate with the location data too, as it’s pretty damn clear that the Times of India isn’t a UK source. I can only infer that they choose not to provide accurate search results.

    As for Google Blogs, their index has always been a bit ropey, so that problem I suspect isn’t just about presentation, it’s about how they are indexing blogs. Location on blogs is much, much harder to do because unless someone says where they are from, it’s almost impossible to know, so we’ll put that issue aside. But do we really believe that only 35 bloggers had something to say about a massive international event on 16 April? To me, that indicates an incomplete index and that is something that again Google could fix if they could be bothered to, but it’s a bit more than presentation.

    Unfortunately we’re in the position where we’re in hock to Google, because there aren’t any credible competitors. That all by itself worries me, not because I think Google are some sort of evil monopolistic corporation — I don’t think in quite such childish terms! — but because once a near-monopoly has been established the incumbent can easily become lazy and stop innovating, and the market has no way to punish them. I would really like it if Google stayed at the top of its game with all the different search types it’s doing, but I don’t think it is.

  4. Very good post. I had exactly the same experience about a year ago (right around the time of Eyjafjallajökull). Unfortunately, I didn’t write a blog post about it, but instead took the short way out and complained bitterly on Twitter, thus falling into the same trap and ensuring that my comments would never be found by anyone farther than a week or so into the future. Of course, as you mention, had I blogged about it, that would be just as invisible now, so oblivion is truly inevitable.

    I hope I’ll have time to post a longer response to this later. It really deserves one, because the trend you mention is worrisome and has broad implications. I’ve given up completely on finding reliable information on the Internet concerning past events and chronologies, having wasted hours of my time on such searches for naught.

  5. Hi Laura! If you do write a blog post, please do post a link here! I would be interested to read another take on it!

  6. Suw,

    Great post, and couldn’t agree more. We’ve been so fixated on the instant nature of the digital world that we’ve lost sight of the other key characteristic of it, which is persistence. The way we organize news coverage – for speed and instant delivery – is one example, but you’ve really highlighted another, which is the limits of search. Probably explains also why Wikipedia is so popular; few other people do as good a job of updating information while providing context.


  7. Suw great work. I may be able to help you compile that spreadsheet. Drop me a line. R

  8. Suw

    You’ll have to excuse me if this comment comes across as ignorant or out of touch – I’m a journalist first, my technical skills come a long way down the list.

    Isn’t this a failure of metadata? By this I mean shouldn’t it be possible to place an unambiguous tag (if that’s the right technical term) somewhere in a news story that says: “This piece was published on April 16, 2010”. And shouldn’t a search engine then be capable of returning only entries tagged with that precise date?

    If things don’t work that way, they should.

  9. Reg, thanks for your comment, and thanks for your blog post too!

    For everyone else, Reg followed up here:

    Steph, thanks for your link too! Good stuff 😀

    Hey Rebecca, Thanks for your offer, but the spreadsheet is done and dusted now!

    Bill, you’re spot on about the metadata, in that news content should all be marked up with metadata that at least gives the correct date, publication name, byline, location, and some keywords. There’s no good reason for that not to have already happened, but I get the feeling that a lot of CMSes don’t provide an easy way to do it, and many news outlets just don’t see it as a priority. (I suspect the reasoning is something along the lines of: ‘Why would we want to spend money making our content more easily found by Google?’, without ever thinking about how having a fully searchable archive might be a good business move…)

    But that said, Google is quite good at pulling out dates etc. from the pages it catalogues, so in many ways the problem I’m describing is one they could fix relatively easily. They have the data; it’s just about how they present it. They could provide a ‘flat’ view, a ‘chronological’ view, a ‘clustered’ view, a ‘relevance’ view… It’s their choice to display the data the way that they have – as far as I can see there’s no technical reason for them to display the search results in this fuzzy way.

  10. Regarding dates in CMSes – HTML5 gives us a very easy way to indicate both the date an article was written and the date it was published (the pubdate attribute on the <time> element).
    A good tutorial is at
    Useful for journalists, archivists, and search engines.

    One issue with searching for dates is that of timezones. Take the death of John Lennon. American bloggers – had they existed – would have written about 1980-12-08, whereas their British counterparts would have been writing about 1980-12-09.
    So it’s always best to search a day forward and back when looking for date-specific content.

  11. Terence, agreed about the timezone fuzziness. In my case, I was only looking for stuff from the UK so it wasn’t so much of an issue, but when tracking a story like the Christchurch Earthquake or the Great Tohoku Earthquake, timezones become a big issue. A fix for this would be to timestamp everything both in local time, with the timezone/summer-time offset, and in UTC, so that you have a canonical time that allows things like timelines to be more easily put together.

    Again, it’s a solvable problem, but I can’t see any evidence that the industry really cares. I mean, there’s the hNews microformat and NewsML, but it’s not clear that either is being widely used. Caring about metadata is a bit of a geek thing, imho, and there aren’t that many geeks actually steering online news strategy, sadly.
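    To make the dual-timestamp idea concrete, here’s a minimal sketch in Python. The field names and the exact publication time are made up for illustration – no CMS I know of actually stores records this way – but the date shift matches Terence’s Lennon example: the same moment falls on 8 December locally and 9 December in UTC.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical publication moment: late evening of 8 Dec 1980,
# US Eastern Standard Time (UTC-5).
eastern = timezone(timedelta(hours=-5))
local_ts = datetime(1980, 12, 8, 23, 7, tzinfo=eastern)

# Store both: the local timestamp with its offset, and the
# canonical UTC timestamp for building timelines.
record = {
    "published_local": local_ts.isoformat(),
    "published_utc": local_ts.astimezone(timezone.utc).isoformat(),
}
print(record["published_local"])  # 1980-12-08T23:07:00-05:00
print(record["published_utc"])    # 1980-12-09T04:07:00+00:00
```

    With both values stored, a date-restricted search for either 1980-12-08 or 1980-12-09 could match the same piece, whichever side of the Atlantic it was written on.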

Comments are closed.