As I said in my last post about Twitter’s lack of a business model, I’ve been doing some research lately for a think tank. My research has basically consisted of three things:
- Looking back on the media coverage of an event that happened in early 2010
- Looking back at the way bloggers reacted to said event
- And having a quick look at Twitter for reactions there too
Pretty simple stuff, I think you’ll agree. My assumption was that I would be able to tap into Google News; Google Blogs, Icerocket and maybe Technorati; and Twitter’s archives. Then I’d be able to scrape the data using something like Outwit Hub, chuck it in Excel and Bob’s your uncle.
Oh, how sadly misguided, how spectacularly wrong.
Now before you talk about how ephemeral the web is and how no one should rely on it for anything, that’s only partly true. A lot of stuff on the web stays on the web, and given how much of our digital selves we are putting on the web, we do need to think about archiving and how we preserve our stuff for the future. This post, though, is not about archiving but about accessing what’s already out there.
The first thing I did when I started my research was try to go back in time in Google News to early 2010 and search for news articles about the particular story – the eruption of Eyjafjallajökull – I was interested in.
But Google’s News search results are fuzzy. I wanted to search for the news on particular days, e.g. all the news about Eyjafjallajökull on 16 April 2010. Do that search, and you’ll be presented with lots of results, many of them not from 16 April 2010 at all, but from 17 April or even 18 or 15 April.
I wanted to refine the search by location, so restricted it to Pages from the UK. Fascinatingly, this included Der Spiegel, Business Daily Africa, Manila Bulletin, FOX News, Le Post and a whole bunch of media sources that, when I last looked, weren’t based in the UK.
So I now have search results which are not limited to either the date or the place that I want. But even worse, results are clustered by story, which might seem like a good idea, but which in reality is lacking. Firstly, these clusters of similar stories are often not clusters of similar stories at all, but clusters of stories that appear to have some keywords in common but which are often actually about slightly different things. I can see the sense in attempting to cluster stories together for the sake of cutting down on duplication for the reader but equally, sometimes I just want a damn list.
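For what it’s worth, the plain list I was after is not hard to produce once the raw results are in hand. Here is a minimal sketch in Python, assuming each scraped result has already been boiled down to a title, a URL and a publication date. The field names and sample rows are made up for illustration, and treating a .uk domain as “from the UK” is a crude assumption of mine, not how any search engine decides location.

```python
import csv
import io
from datetime import date
from urllib.parse import urlparse

# Hypothetical scraped rows: field names and values are invented for illustration.
SAMPLE_RESULTS = [
    {"title": "Ash cloud grounds UK flights",
     "url": "http://news.example.co.uk/ash", "published": "2010-04-16"},
    {"title": "Vulkanasche legt Flugverkehr lahm",
     "url": "http://www.example.de/asche", "published": "2010-04-16"},
    {"title": "Flight ban extended",
     "url": "http://paper.example.co.uk/ban", "published": "2010-04-17"},
]


def is_uk_site(url):
    """Crude assumption: a hostname ending in .uk counts as a UK source."""
    host = urlparse(url).hostname or ""
    return host.endswith(".uk")


def filter_results(results, target_day, uk_only=True):
    """Keep only rows published exactly on target_day (and, optionally, on .uk hosts)."""
    kept = []
    for row in results:
        if date.fromisoformat(row["published"]) != target_day:
            continue
        if uk_only and not is_uk_site(row["url"]):
            continue
        kept.append(row)
    return kept


def to_csv(rows):
    """Render the rows as CSV text that a spreadsheet such as Excel can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["published", "title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    flat_list = filter_results(SAMPLE_RESULTS, date(2010, 4, 16))
    print(to_csv(flat_list))
```

The filtering is the easy part; the hard part, as the rest of this post shows, is getting complete and correctly dated results out of the search engines in the first place.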
Whilst doing my research, I also found that Google News is not, as I had thought, a subset of Google Web results. If you do the same searches on Google Web you get a slightly different set of data, obviously including non-news sites, but actually also including some news sites that aren’t in Google News, and not including many that are.
So far, so annoying, but Google isn’t the only search engine in the world… Except, Google pwned search years ago and innovation in search appears to be almost entirely absent. Bing does news, but a search on Eyjafjallajökull tosses up just three pages of results, and you can’t sort by date. Yahoo News finds nothing. A friend suggested that my local library might have a searchable news archive, but the one I looked at was unworkable for what I wanted.
I’m sure there are paid archives of digital news, but that wasn’t within my budget and, to be honest, given how much news is out there in the wild, there should be a good way to search it. I even tried the Google News API, but that has exactly the same unwanted behaviours as the website.
But hey, things will be better in the blog search, right?
Years ago, in the golden era of blogging, Technorati worked. Their site used to be really great, and I loved it so much I did some work with them. These days, I’m not quite sure what it’s for. It’s certainly not for search, given it finds nothing for Eyjafjallajökull. Icerocket is a better search engine, and you can refine by date, but it finds nothing on our target date, which is surprising as it’s a day or so after Eyjaf popped her top and the flight ban was well underway and, well, you’d think someone on the internet might have had something to say about it.
So, we’re back to Google Blogs. It lets me restrict by date! And specify UK-only! And it coughs up one page of results. Really? We have 4.57m bloggers in the UK and only 35 of them wrote something? I’ve always had my suspicions that the Google Blog index was poorly formed, but Google Blogs is the only choice I have, so I just have to put up with it. At least the results are in a neat list and all on the target date, even if some of them are clearly not from the UK, or even actually blogs, for that matter.
Now then, Twitter. We all know that Twitter’s archives have been on the endangered list for some time. Although they aren’t deleting old Tweets, accessing them is very difficult. Despite offering dates going back to 2008 on their advanced search page, Twitter gives you an error if you try to search for April 2010: “since date or since_id is too old”.
SocialMention is a new search site that I’ve started to find really useful. They search across the majority of the social web and allow you to split that down by type. So I can search for ‘Eyjafjallajökull’ in ‘microblogs’ and get realtime results, but I can’t go back in time further than ‘last month’.
So, we’re back to Google again, this time Google Realtime. It only goes back to early 2010, so lucky for me that my target date is within that period. But the only way I can access that date is by a really clunky timeline interface – I can’t specify a date as I can in Google’s other searches.
Furthermore, there’s no pagination. I can’t hit ‘next page’ at the bottom and fish through a bunch of search pages to find something interesting – my navigation through the results is entirely dependent on the timeline interface. Such an interface will and does entirely outwit Outwit, which can normally follow ‘next’ links to scrape data from an entire search. I doubt it knows how to deal with the stupid timeline interface.
After all this searching and frustration, I’m left with this question:
What has happened to our web history?
The web is mutable, yes, but there’s an awful lot of fire-and-forget content that, generally speaking, hangs around for years. Individual blogs may come and go, but overall there’s a huge pool of blog content out there. Same for news. Twitter is a slightly weird case because it’s a single service with a huge archive of historically interesting data which it isn’t letting just anyone get at. Not even scholars. Twitter may have given its archive to the Library of Congress, but even that’s going to be limited access if their blog post is anything to go by:
In addition to looking at preservation issues, the Library will be working with academic research communities to explore issues related to researcher access. The Twitter collection will serve as a helpful case study as we develop policies for research use of our digital archives. Tools and processes for researcher access will be developed from interaction with researchers as well as from the Library’s ongoing experience with serving collections and protecting privacy and rights.
The Library is not Twitter and will not try to reproduce its functionality. We are interested in offering collections of tweets that are complementary to some of the Library’s digital collections: for example, the National Elections Web Archive or the Supreme Court Nominations Web Archive. We will make an announcement when the collection is available for research use.
I’m not an academic researcher, so whether I’d even get access to the archive for research is up in the air. (I can’t find any updates as to the availability of Twitter’s archive via the Library of Congress, so if anyone has info, please leave a comment.)
I think we have two problems here, one already briefly mentioned above.
1. Google has pwned search
For years, Google has been the dominant search engine, and in some ways they’ve paid a price for this as publishers of all stripes have climbed on the Google haterz bandwagon. My suspicion is that Google’s fuzzy search results are a sop to the news industry, because Google should be capable of producing a rich and valuable search tool that allows the user to see whatever data they want to see, in whatever layout they want. Maybe, after all the stupid shit the news publishers have thrown their way, Google thinks that building failure into their news search product will insulate them from criticism from the industry.
But I don’t think that this absolves Google of responsibility for the lack of finesse in historical search. After all, which bloggers are gathering together to demand Google not index them? And Twitter users who don’t want to be indexed by Google can go private with their account.
But Google’s dominance does seem to have caused other search engines to wither on the vine. It’s almost like no one is bothering to innovate in search anymore. Bloglines used to be a pretty good blog search engine, but it has gone the way of the dodo. Technorati is now useless as a search engine. Bing is a starting point, but needs an awful lot of work if it’s going to compete. News search is completely underserved, and Twitter… really, Twitter archival search is non-existent.
Are Google really so far ahead that they can’t be touched? Are they really so great that no one is going to bother challenging them? The answer to the first question is clearly no: they aren’t so brilliant that their work can’t be improved upon, not just in terms of the search algorithm, which has come in for a lot of criticism lately, but also in terms of their interface and the granularity of advanced searches. And I’d be deeply disturbed if people thought that the answer to the second question was yes. Google are the incumbent, but that makes them vulnerable to smaller, more nimble, more innovative competitors.
2. Historic search has been sidelined to serve instant
What’s going on right now? That’s the question that most search engines seem to be asking these days. Most have limited or zero capacity to look back on our web history, focusing instead on instant search. The immediacy of tools like Twitter and Facebook is alluring, especially for brands and companies who want to know what’s being said about them so that they can respond in a timely fashion.
But focusing on now and abandoning deeper, more nuanced historic searches is a disturbing trend. Searching the web’s past for research purposes might be a minority sport, but can we as a society really afford to disenfranchise our own past? Can we afford to alienate the researchers and ethnographers and anthropologists who want to learn about how our digital world has changed? About our reactions to events as they happened, rather than as remembered years later? There is value in archives, but not if they are locked up and the key thrown away by the search engines.
We cannot afford to sacrifice our history on the altar of instant. We can’t just say goodbye to the idea of being able to find out about our past, because it’s ok, we can see just how pretty the present looks. The obsession with instant risks not just our past, it also risks our future.