Xtech 2006: David Beckett – Semantics Through the Tag

Common way to think of tags is as a list of resources. You tag something, then you get a list of stuff you’ve tagged, etc.

Another way is tag clouds. Size of the tag represents how popular it is.

Suggested tags. Discovery process for other tags that people are using that are similar to your tags.

Flickr – clustering of photos with related tags, also interestingness, which is partly tag-based but also takes people’s interactions with the site into account.

Mash-ups – assume the tag is a primary key. Works well for events such as Xtech: Technorati, Planet Xtech. Photo sites are more place/time centric; del.icio.us ones are more topic related. So the further away you get from place/time and the more generic the tag gets, the weaker its usefulness for mash-ups. Generic tags don’t work so well – they don’t work as a connection across tag space, and won’t tell you anything.

Emergent tag structures

There’s no documentation because it’s so lightweight so people use it however they like. Pave the paths that people follow.

– Geo tagging, lat/long, places

– Cell tagging, from mobile phone cell towers, associated with cameraphone pictures

– Blue tagging, bluetooth devices that were in context at the time

Hierarchy

Not about creating a taxonomy, but looking at emergent hierarchy, e.g.

– programming

– programming:html or programming/html

These appeared on their own; no one is thinking of consistency. The tag system may not understand the hierarchy, but it helps people find things.
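
A minimal sketch (my own illustration, not from the talk) of how a service could recover that emergent hierarchy just by splitting on the separators people already use:

```python
# Illustrative sketch: derive an emergent hierarchy from tags that happen
# to use ':' or '/' as separators, without imposing a taxonomy up front.
def tag_path(tag: str) -> list:
    """Split a tag like 'programming:html' or 'programming/html' into levels."""
    for sep in (":", "/"):
        if sep in tag:
            return [part for part in tag.split(sep) if part]
    return [tag]

tags = ["programming", "programming:html", "programming/css"]
tree = {}
for tag in tags:
    node = tree
    for level in tag_path(tag):
        node = node.setdefault(level, {})

print(tree)  # {'programming': {'html': {}, 'css': {}}}
```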

Grouping

Bundles in delicious. Similar to Flickr sets.

So whose stuff is tagged?

– Yours on the tagging site

– Other people’s on the tagging site

– Anybody’s

Flickr is a closed system, and you can only tag certain things like your photos and your contacts; whereas Del.icio.us is more open.

So whose tag is it anyway?

If it was a domain name it is clear, but really, who cares. This is lightweight, don’t need to think about who owns it, just use it.

Tags are vocabularies per service – tagonomies. Each service uses different words as tags, which may or may not be the same across services. So they can use terms differently.

What’s the point of tagging semantics?

1. for people to understand what some use of a tag means: there’s no way of finding out what a tag means without looking at it in context and figuring it out.

2. for computers to gather information about a tag, supporting #1.

What does a tag mean to someone?

– ask them? not scalable

– look it up in a canon, i.e. a dictionary or encyclopaedia; but that isn’t distributed and it’s too much like hard work. Need a mechanism that you can just use; don’t want anything heavyweight.

Good things

– low barrier to use, just start typing

– few restrictions on syntax

– unrestricted word space, if you were looking at it from a librarian’s point of view, Dewey Decimal is restricted to what’s defined by the system

– social description, folksonomy: can see what friends are doing, look at groups and sets, and make up your own tags

– if you have lots of tags, then over time as the number of tags increases the descriptions merge towards an average, because any one individual’s version of a description becomes less important, so meanings converge.

– easy to experiment, because there is no authority that says it’s not allowed.

Problems

– formalism problems: mixing types of things, names of things, genres, made up things, ambiguity, synonyms

– meaning is implicit

– power curves: nobody explains the long-tail tags, so individuals’ meanings get lost and subdued by the mass of people tagging (this is a plus – see above – and a minus)

– naive tag mashups mix up meanings

– syntax problems – stemming, plurals. Some services try to join things up by ignoring spaces, plurals, caps, lower case, etc., using natural-language techniques (see the sketch after this list).

– tricky to make a short, unique tag. computer wants something unique, humans want something short and easy.

All these are the usual human-entered metadata problems.
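
As a rough idea of the kind of normalisation mentioned in the syntax point above – a hedged sketch, not any particular service’s algorithm, with deliberately naive plural handling:

```python
# Hedged sketch of tag normalisation: ignore case and spaces, and apply a
# crude plural stem. Real services vary; this is purely illustrative.
def normalise(tag: str) -> str:
    tag = tag.strip().lower().replace(" ", "")   # ignore case and spaces
    if tag.endswith("s") and not tag.endswith("ss"):
        tag = tag[:-1]                           # very naive plural stemming
    return tag

print(normalise("Web Standards") == normalise("webstandard"))  # True
print(normalise("CSS"))  # 'css' – the double 's' is left alone
```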

Possible solutions

– microformats: no good hook for software, and are read-only

– web APIs: read/write but are for programmers only, not much use to 99% of tag users

– RSS: but it’s read-only, so more about me giving you stuff than getting stuff back.

Separate from service

Need to then understand the words out of context, with no service behind it.

Want

– a description

– a community

– a historical record

Answer: a wiki

– a description page

– a community of people to discuss and/or edit

– a historical record

Example, raptor tag

Raptor is a bird of prey, a hard drive, a plane, a dinosaur. So what does it mean if you tag something ‘raptor’?

So there is ambiguity. Wikipedia uses disambiguation pages to help clean up the meanings of words. People can read these, but so can machines. This stuff is recorded semantically, so a machine can tell the term is ambiguous. Can also look it up in Wiktionary, and can then leap across languages too.

Wikitag

– Easy to create

– record the ambiguity, and synonyms/preferred names

– microformat compatible: metadata is wiki markup, so is visible; reuse of existing format.

http://en.wikitag.org/wiki/Raptor

This defines the meaning of a tag.

– can discuss the term

– and can add disambiguation if need be

This isn’t perfect.

– discussion, needs easy-to-use threaded discussion

– Wikipedia rules, e.g. NPOV and encyclopedic style, are not appropriate for something as lightweight as this. Needs fewer rules, maybe just ‘keep it legal’.

– Centralised. 🙁 don’t want a ‘one true way’ of doing this.

Can also add in semantic wiki mark-up.

Tagging is a social process with a gap: the place for a community to build the meaning. A wiki can fill the gap.

Xtech 2006: Mikel Maron – GeoRSS

90% of information has a spatial component. Needed to agree a format.

Definitive history of GeoRSS

– Syndicus Geographum – ancient Greek treaty for sharing of maps between city states

– blogmapper/rdfmapper – 2002, specifying locations in weblog posts, little map with red dots

– w3c rdfig geo vocabulary – 2003, came up with simple vocab on irc and published a doc, and this is the basis of geocoding RDF and RSS

– geowanking – May 2003, on this discussion list GeoRSS first uttered

– World as a Blog/WorldKit – realtime weblog geo info nabbing tools, World as a Blog looks at geotags in real time then plots them on a map so you can see who’s up too late.

– USGS 2004, started their Earthquake Alerts Feed.

– Yahoo! maps supports georss 2005

Lots has happened. Google released GoogleMaps, and shook everyone up with an amazing resource of map data, and released an API. Lots of map-based mash-ups.

OSGeo Foundation, Where 2.0, OpenStreetMap.

The format then only specified points, not lines or polygons. GeoRSS.org. Alternatives were:

– KML, used by Google Earth: rich and similar to GML, but too complicated and too tied to Google Earth, so some of it is more about 3D.

– GPX, the XML interchange format for GPS data: extensible, but tied to its application and not so useful.

– GML, from the Open Geospatial Consortium: useful for defining spatial geometries, essentially an XML version of a shapefile, but a complicated spec at over 500 pages and a bit confusing to use, because it’s not a schema – it’s more like RDF, providing geometric objects for your own schema.

OGC got involved in GeoRSS because they wanted to help promote GML. So some of GeoRSS is drawn from GML. Two types of GeoRSS: Simple and GML. Simple is a compressed version of GML. Neutral regarding feed type, e.g. RSS 1.0/RDF, RSS 2.0, Atom.
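
For illustration, this is roughly what the two encodings look like inside an Atom entry, read back with Python’s ElementTree; the coordinates are just example values:

```python
# A sketch of the two GeoRSS encodings described (Simple and GML),
# shown inside a minimal Atom entry.
import xml.etree.ElementTree as ET

ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
                  xmlns:georss="http://www.georss.org/georss"
                  xmlns:gml="http://www.opengis.net/gml">
  <title>Example located item</title>
  <georss:point>52.37 4.89</georss:point>        <!-- Simple encoding -->
  <georss:where>                                  <!-- GML encoding -->
    <gml:Point><gml:pos>52.37 4.89</gml:pos></gml:Point>
  </georss:where>
</entry>"""

ns = {"georss": "http://www.georss.org/georss", "gml": "http://www.opengis.net/gml"}
entry = ET.fromstring(ENTRY)
lat, lon = map(float, entry.findtext("georss:point", namespaces=ns).split())
print(lat, lon)  # 52.37 4.89
```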

Looking for potential to create a Microformat.

[Now goes into some detail re the spec which I’m not going to try to reproduce].

EC JRC Tsunami Simulator. Subscribed to the USGS earthquake feed, ran a tsunami model and, depending on the outcome, would send out an alert. Also had an RSS feed. Produced maps of the possible tsunami.

Supported by, or about to be supported by:

– Platial

– Tagzania

– Ning

– Wayfaring, Plazes

Commercial support

– MSFT announced intention

– Yahoo! (Upcoming, Weather, Traffic and the Maps API; Flickr may potentially use it)

– Ning

– CadCorp

Google

– OGC member

– MGeoRSS

– Acme GeoRSS

– GeoRSS2KML

And other stuff

– Feed validator

– WordPress Plugin in the works

– Weblog

– A press release

– Feed icon

Aggregation

http://placedb.org

http://www.jeffpalm.com/geo/

http://fofredux.sourceforge.net/

http://mapufacture.com/georss/

Mapufacture, create and position a map, select georss feeds and put them together in a map. Then can do keyword searches and location searches. Being able to aggregate them together is very useful. Rails app. E.g. several weather feeds, added to a map and then when you click on the pointer on the map, the content shows up.

Social maps, e.g. places tagged as restaurants in Platial and Tagzania on one simple map.

Can search, and navigate the map to show the area you’re interested in, then it searches the feeds and grabs everything in that location. All search results produce a GeoRSS feed which you can then reuse.

Odds and ends

– mobile device potential, sharing info about where you are

– sensors, could be used for publishing sensor data

– GIS Time Navigation, where you navigate through space and see things happening in time, e.g. a feed of events in Amsterdam which provided you with a calendar and location.

– RSS to GeoRSS converter: taking RSS, geocoding place names and producing GeoRSS

Xtech 2006: Ben Lund – Social Bookmarking For Scientists

Connotea, social bookmarking for scientists.

Why for scientists? Obviously, scientists and clinicians are a core market. It doesn’t exclude others, but by concentrating on users with a common interest they could increase the discovery benefits. It also hooks into academic publishing technologies.

Connotea is an open tool, is social so connects to other users, and has tags. But what it does is identify articles solely from the bookmark URLs. So it can pull up the citation from the URL – title, author, journal, issue no. page, publication date. This is important for scientists.

Way it does it is by ‘URL scanning’. So user is on a page, e.g. PubMed which is a huge database of abstracts from biomed publications. When the user clicks ‘Add to Connotea’, this opens a window, it recognises that this is a scholarly article, and imports the data.

Uses ‘citation source plug-ins’ – Perl modules, one per source API. It asks each plug-in whether it recognises the URL, and when one does, it goes and gets the information, which is then associated with the bookmark in the database.
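
A rough Python analogue of that dispatch (the real plug-ins are Perl modules; the class, pattern and field names below are invented for illustration):

```python
# Illustrative sketch of the 'citation source plug-in' idea: each plug-in
# recognises URLs from one source and returns citation metadata for them.
import re
from typing import Optional

class PubMedPlugin:
    """Recognises PubMed-style abstract URLs and returns citation metadata."""
    URL_PATTERN = re.compile(r"pubmed|ncbi\.nlm\.nih\.gov")

    def understands(self, url: str) -> bool:
        return bool(self.URL_PATTERN.search(url))

    def citation_for(self, url: str) -> dict:
        # In reality this would call the source's API and parse the response.
        return {"title": "...", "journal": "...", "authors": [], "source_url": url}

PLUGINS = [PubMedPlugin()]

def scan(url: str) -> Optional[dict]:
    """Ask each plug-in if it recognises the URL; first match wins."""
    for plugin in PLUGINS:
        if plugin.understands(url):
            return plugin.citation_for(url)
    return None  # plain bookmark, no citation attached
```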

[Now runs through some programming stuff.]

Bookmarks on a lot of these scientific resources are far from clean or permanent and have a lot of session data in them, so this needs cleaning off.
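
A hedged sketch of that clean-up step; which query parameters count as session junk here is an assumption made for the example:

```python
# Strip query parameters that look like session state so the same article
# always maps to the same bookmark URL. The parameter list is illustrative.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sid", "sessionid", "jsessionid", "phpsessid"}

def clean(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(clean("http://journal.example.org/article?id=123&sid=ABC123"))
# http://journal.example.org/article?id=123
```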

So what’s important? Retrieval and discovery. Already has tagging for navigation. Also has search in case there are some articles that haven’t been accurately tagged.

Provides extra link options for bookmarks. Main title links to the article, say in PubMed; but there are links to other sources for this article, e.g. to the original Nature article; plus other databases, and cross-referencing services.

The system also produces a long OpenURL with all the bibliographic information in it.
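
For illustration only, this is roughly the kind of link being described: bibliographic fields packed into a query string for a link resolver. The resolver address is invented and the key names follow common OpenURL 1.0/COinS conventions, which may not match exactly what Connotea emits:

```python
# Sketch of an OpenURL-style link carrying a citation's bibliographic fields.
# All values below are illustrative.
from urllib.parse import urlencode

citation = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.atitle": "An example article title",
    "rft.jtitle": "Nature",
    "rft.volume": "441",
    "rft.spage": "123",
    "rft.date": "2006",
}
print("http://resolver.example.org/openurl?" + urlencode(citation))
```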

Now … the hate.

First hate:

– poorly documented and poorly implemented data formats. Variety of different XML schema. Liberal interpretations of standards.

Second hate:

– have to do lots of unnecessary hoop-jumping to get this data. Lots of pinging different URLs to get cookies, POSTs, etc.

Third hate:

– have to do everything on a case-by-case basis. Have to reverse engineer each publisher’s site. Have to write ad hoc rules and custom procedures for each case.

A wish

Nature released a proposal called OTMI, the Open Text Mining Interface – it wants to make Nature’s text open for data mining, but not the articles themselves. Researchers are looking for raw XML for doing data mining research, but every time someone asks they have to make ad hoc arrangements for each case. So OTMI does some pre-processing to make the data more usable.

Publishers could choose to be supported by Connotea and remove the need for reverse engineering. The publisher just puts in a link through to an Atom doc with the relevant data so that the citation can be easily retrieved.

Blogs already do autodiscovery of Atom feeds, so the idea can be tested using a citation source plug-in for a blog. It works, so any source can be treated as a citation, but only while the post is still in the feed.
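
For illustration, a minimal sketch of the autodiscovery step this relies on – finding the Atom feed a page advertises; the example page and URL are invented:

```python
# Find <link rel="alternate" type="application/atom+xml"> elements in a page,
# the same mechanism blog feed autodiscovery uses.
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" \
                and a.get("type") == "application/atom+xml":
            self.feeds.append(a.get("href"))

page = '<html><head><link rel="alternate" type="application/atom+xml" ' \
       'href="http://blog.example.org/atom.xml"/></head><body></body></html>'
finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)  # ['http://blog.example.org/atom.xml']
```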

Another wish

Citation microformat. Connotea would work really well with a citation microformat, so is going to look into that.

Summary

How to do URL to metadata

– manual entry

– scraping the page

– recognise and extract some ID, Connotea does that, but it doesn’t scale to the whole web.

– follow a metadata link from page, this is the blog plug-in

– parse the page directly, not possible yet.

Useful not just for Nature as publishers of data, but also anyone else who wants to be discoverable and bookmarkable.

Nature blog about this, Nascent.

Xtech 2006: Tom Coates – Native to a Web of Data: Designing a part of the Aggregate Web

This is a developed version of the talk Tom gave first at the Carson Summit on the Future of Web Apps. So now you can compare and contrast, and maybe draw the conclusion that either I am typing more slowly these days, or he just talked faster.

Was working at the BBC, now at Yahoo! Only been there a few months so what he’s saying is not corporate policy. [Does everyone who leaves the BBC go to work for Yahoo?]

Paul’s presentation was a little bit ‘sad puppy’ but mine is going to be more chichi. Go to bingo.scrumjax.com for buzzword bingo.

I’m going to be talking about design of W2.0. When people think about design they think rounded corners, gradient fills – like Rollyo, Chatsum. Now you have rounded corners and aqua effects, like FeedRinse. It all started with Blogger and the Adaptive Path group.

Could talk for hours about this, the new tools at our disposal, about how the Mac or OmniGraffle change the way people design. But going to talk about products and how they fit into the web. Web is gestalt.

What is the web changing into?

What can you or should you build on top of it?

Architectural stuff

Web of data, W2.0 buzzwords: lots of different things going on at the design, interface and server levels, and in social dynamics. Too much going on underneath it to stand as a term, but Web 2.0 is condensing as a term. These buzzwords are an attempt to make sense of things; there are a lot of changes and innovations, and I’m going to concentrate on one element: the move into a web of data, reuse, etc.

The web is becoming an aggregate web of connected data sources and services. “A web of data sources, services for exploring and manipulating data, and ways that users can connect them together.”

Mashups are pilot fish for the web. By themselves, not that interesting. But they are a step on the way to what’s coming.

E.g. Astronewsology. Take Yahoo! News and star signs, so you can see what news happens to Capricorns. Then compare to predictions. Fact-check the news against the deep, important spiritual nature of the universe.

Makes two sets of data explorable by each other, put together by an axis of time.

Network effect of services.

– every new service can build on top of every other existing service. the web becomes a true platform.

– every service and piece of data that’s added to the web makes every other service potentially more powerful.

These things hook together and work together so powerfully that it all just accelerates.

Consequences

– massive creative possibilities

– accelerating innovation

– increasingly competitive services

– increasing specialisation

API-ish thing is a hippy dream… but there is money to be made. Why would a company do this?

– Use APIs to drive people to your stuff. Amazon, eBay. Make it easier for people to find and discover your stuff.

– Save yourself money, make service more attractive and useful with less central dev’t

– Use syndicated content as a platform, e.g. stick ads on maps, or target banner ads more precisely

– turn your API into a pay-for service

Allows the hippies and the money men to work together, and the presence of the money is good. The fact that they are part of this ecosystem is good.

If you are part of this ecosystem, you benefit from this acceleration. If you’re not, you’re part of a backwater.

What can I build that will make the whole web better? (A web of data, not of pages.) How can I add value to the aggregate web?

Data sources should be pretty much self-explanatory. Should be able to commercialise it, open it out, make money, benefit from the ecosystem around you. How can you help people use it?

If you’re in social software, how can you help people create, collect or annotate data?

There is a land grab going on for certain types of data sources. People want to be the definitive source. In some areas, there is opportunity to be the single source. In others, it’s about user aggregation, reaching critical mass, and turning that aggregated data into a service.

Services for exploring/manipulating data. You don’t need to own the data source to add value; you can provide people with tools to manipulate it.

Users, whether developers or whomever. Feedburner good at this. Slicing information together.

Now will look at the ways to build these things. Architectural principles.

Much of this stuff from Matt Biddulph’s Application of Weblike Design to Data: Designing Data for Reuse, which Tom worked on with Matt.

The web of data comprises these components.

– Data sources

– Standard ways of representing data

– Identifiers and URLs

– Mechanisms for distributing data

– Ways to interact with/enhance data

– Rights frameworks and financial

These are the core components that we have to get right for this web of data to emerge properly.

Want people to interrogate this a bit more, and think about what’s missing.

Ten principles.

1. Look to add value to the aggregate web of data.

2. Build for normal users, developers and machines. Users need something beautiful. Developers need something useful that they can build upon; show them the hooks, like consistent URLs. Machines need predictability. How can you automate stuff? E.g. tag spaces on Flickr can be used automatically to get the photos thus tagged.

3. Start with explorable data, not pages. How are you going to represent that data? Designers think you need to start with user needs, but most user-needs work is based on knowing what the data is for to start with. Need to work out the best way to explore the data.

4. Identify your first order objects and make them addressable. What are the core concepts you are dealing with? First order objects are things like people, addresses, events, TV shows, whatever.

5. Correlate with external identifier schemes (or coin a new standard).

6. Use readable, reliable and hackable URLs.

– Should have a 1-1 correlation with the concept.

– Be a permanent reference to resources, use directories to represent hierarchy

– not reflect the underlying tech.

– reflect the structure of the data – e.g. tv schedules don’t reflect the tv show but the broadcast, so if you use the time/date when a show is broadcast, that doesn’t correlate to the show itself, it’s too breakable.

– be predictable, guessable, hackable.

– be as human readable as possible, but no more.

– be – or expose – identifiers. E.g. if you have an identifier for every item – say IMDb film identifiers – it could be used by other services to relate to that film.

Good URLs are beautiful and a mark of design quality.
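
As an illustration of these principles (my own example, not one from the talk), a URL scheme keyed on first-order objects might look something like this:

```python
# Illustrative URL scheme: one readable, hackable URL per first-order object,
# nothing tied to the underlying technology or to a breakable broadcast time.
ROUTES = {
    "programme": "/programmes/{programme_id}",        # the show itself
    "episode":   "/programmes/{programme_id}/episodes/{episode_id}",
    "person":    "/people/{person_id}",
    "tag":       "/tags/{tag}",                        # hackable: trim back to /tags
}
# versus something like /cgi-bin/show.php?id=4421&session=af31, which exposes
# the technology, isn't guessable, and breaks when the implementation changes.

def url_for(kind: str, **ids: str) -> str:
    return ROUTES[kind].format(**ids)

print(url_for("episode", programme_id="example-show", episode_id="s01e01"))
```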

7. Build list views and batch manipulation interfaces

Three core types of page

– destination

– list-view

– manipulation interface, data handled in pages that are still addressable and linkable.

8. Create parallel data service using understood standards

9. Make your data as explorable as possible

10. Give everything an appropriate licence

– so people know how they can and can’t use it.

Are you talking about the semantic web? Yes and no. But it’s a web of dirty semantics – getting data marked up, describable by any means possible. The nice semantic stuff is cool, but use any way you can to get it done.

Xtech 2006: Paul Hammond – An open (data) can of worms

Used to work for the BBC, but left three weeks ago, so can’t talk too much about them. Started working for Yahoo! two weeks ago, lots of APIs at the Developer Network. But can’t really talk about that because he’s only been there two weeks.

Ideas he wants to talk about are his personal experience, and experiences of his friends which they’ve told him in confidence, so can’t talk about that either. So this talk will be not as detailed as would have liked.

Open data. BBC and Yahoo! both understand the benefits of open data. Both have made statements about the importance of open data. Both aim to make as much data available as possible. And there are restrictions on the use of those data.

People know that BBC and Yahoo! are opening up their data, because it’s still relatively rare. So when a new company does it, everyone gets excited. So wanted to see how much data there really is.

List of open APIs at www.programmableweb.com/apilist and it’s a fairly good list, but missing a few bits and pieces. It had 201 APIs listed, and they are all on one page. One quarter of APIs listed are from 7 companies

Yahoo!

Google

Amazon

MS

Ebay

AOL

Plus one I missed.

Most of the companies are new; only 14 APIs are from companies more than 20 years old. The big old companies are big, and they’ve collected a lot of useful data that we could do interesting things with, but it’s not available.

So everyone in our tech bubble thinks open data is a good idea, but hardly anyone is doing it. So if open data is such a good idea, why isn’t there more of it? Don’t care about the format of the data.

Haven’t mentioned RSS/Atom. There are millions of RSS feeds, but these highlight the problems even more. You can now get RSS feeds for almost anything you want, but try getting in depth sports statistics, or updated stock market data, or flight times. You can’t get it. RSS is intended to be read in an aggregator, and most of it can’t be reused or republished.

So you can get any data you want from the net, so long as it’s the last 10 items on an RSS feed, and you don’t want to do anything with it.

Why are people happy to put some data out, but not others. Do the tech and standards need to be better? Yes, they are not perfect but they never are. Simple things like character encoding are very easy to get wrong. Definitions are difficult.

But they are good enough. Standards have been developed because there’s a real need to use this stuff behind the firewall. RSS is popular, and most of it is not perfect, but it’s good enough.

So if it’s not the tech, it must be something else. But there’s a simple reason. Organisations don’t do anything unless they think it is in their best interests. A company won’t do anything unless it makes money, so maybe companies don’t think it’s worthwhile. That means either:

They’re right.

or

They’re wrong.

Either could be correct. But more important is to understand their reasons.

Most companies don’t know what an API is. If they don’t understand the concept of releasing their data online, then standards won’t matter. Explaining the concept of an API is hard when you are talking to people who don’t know how computers work.

People are starting to learn about RSS. They understand that if they use RSS they don’t need to visit the site. But to use it you do need to know a little bit about it. However, it fits into an existing business model – it drives interest and visitors to their site. It’s in a positive feedback loop: the more RSS there is, the more you see it, and the more likely people are to use it.

So assuming the company knows what an API is…

Most companies make money from their data. So they will say ‘why give it away?’. For some you can explain why it’s good – for a public broadcaster you can say ‘we’ve paid for it already’. For some companies there are reasons – improves branding, etc. – but it’s a risk.

For most companies, they want competitive advantage. So if a competitor has opened up then you have to open up to keep up.

If you sell data and then you start giving it away, it reduces the perceived value. If you sell it for tens of thousands of pounds, then why are you giving it away? It gets into a downward spiral as to what that data is worth.

Opening up data is risky – risk losing money that you’re making. Could argue that they are wrong, but not sure that they are.

Many companies are not allowed to open up, even if they want to.

Lawyers say no. Most companies don’t have complete rights over the data they use. So stock prices on the evening news don’t come from the broadcaster; they’re bought in. Google don’t create their own map data, they buy it from someone like Navteq. It’s cheaper that way. The data provider has economies of scale. It’s also a waste of time to do it yourself. Some companies also act as middlemen between groups, e.g. travel agents’ ticket bookings and Sabre and the airlines. Companies outsource things. Then there are exclusivity issues.

So even if they wanted to, some companies are contractually prohibited to share their data.

Look at Google Map mash-ups. Google get their map data from NavTeq, but the data used in the Google API is from Tele Atlas. Have to be determined to do this. Might also cost you more money.

Finally, the general public wouldn’t always like it. Personal data, for example.

It’s nice to have. But the benefits are second order. So people label it as low priority.

Once you have an API it will be missing features.

So what should we do?

Not sending emails demanding an API. That just makes you look like a moron.

But… what you can do

1. Be aware of the problems

2. Demonstrate usefulness, screen scrape if you need to, but don’t get yourself cease-and-desisted

3. Don’t assume it’s a technology problem

4. Target the right people, find someone on the inside who can help you

5. Talk about benefits to the provider, not the consumer. If you talk about the benefits to you, they’ll see you just as someone who wants something for free.

6. Have patience. It is getting better every day, and it takes time for business to come round.

Xtech 2006: Steven Pemberton – The power of declarative thinking

Sapir-Whorf Hypothesis. Connection between thought and language: if you haven’t got a word for it you can’t think it. If you don’t perceive it as a concept, you won’t invent a word for it. For example: Dutch ‘gezellig’ [or Welsh ‘hiraeth’].

The Deeper Meaning of Liff: A dictionary of things there aren’t any words for yet but there ought to be.

Example, Peoria (n.): the fear of peeling too few potatoes.

Web examples, AJAX, blog, microformats, Web 2.0. These are words that let us talk about things, they create the concept for us so we can talk about them, even though the thing existed before. They also signal the success of work that has gone on in the past.

There’s little in AJAX that wasn’t there from the start. Blogs have really been around since 95.

What needs a name? Think about concepts that need names (which the Sapir-Whorf Hypothesis doesn’t allow us to do).

E.g. the sort of website that is like CSS Zen Garden, wherein the HTML has been sliced cleanly away from the CSS. Another example is using SVG to render data.

Other things that need to be Whorfed in the future:

– layering semantics over viewable content, like microformats and RDF/A, making the semantic web more palatable for the web author.

– webapps using declarative markup.

Moore’s Law and an exponential world. Computers are very powerful now. His new computer is a dual-core, which means his computer is twice as idle as it was before. Why aren’t we making the best use of this power?

A declarative approach puts the work in the computer, not on the human’s shoulders.

Software versions are not so much of an issue these days, but devices are. Lots and lots of devices. Also diversity of users. We are all visually impaired at some point or another, especially with tiny fonts on PowerPoint slides, so designing for accessibility is designing for our future selves. It’s essential.

Google is a blind user; it sees what a blind user sees. If your site is accessible, Google will see more too.

Want ease of use, device independence, accessibility.

Bugs increase with complexity. A program that is 10 times longer has 32 times the bugs. But most code in most programs has nothing to do with what the program should achieve.

However, declarative programming cuts the crap. Javascript, for example, falls over if it gets too long, and declarative programming could replace it and make the computer do the hard stuff without it cluttering up the code. It makes it easier by removing the administrative details that you don’t want to mess about with anyway, so if you let the computer do it then you can remove a lot of this code. So the declarative mark-up is the only bit produced by the human.
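
The talk’s examples are declarative markup rather than code, but the same point can be sketched in Python – this is my own hedged illustration, not anything from the talk: the human writes only the declaration, and generic machinery handles the administrative detail.

```python
# Hedged sketch: a declarative spec interpreted by generic code, instead of
# hand-written imperative checks for every field.
import re

FORM = {
    "email": {"required": True, "pattern": r".+@.+\..+"},
    "age":   {"required": False, "type": int, "min": 0},
}

def validate(data: dict, spec: dict) -> list:
    errors = []
    for field, rules in spec.items():
        value = data.get(field)
        if value in (None, ""):
            if rules.get("required"):
                errors.append(f"{field} is required")
            continue
        if "type" in rules:
            try:
                value = rules["type"](value)
            except ValueError:
                errors.append(f"{field} must be a {rules['type'].__name__}")
                continue
        if "pattern" in rules and not re.fullmatch(rules["pattern"], str(value)):
            errors.append(f"{field} looks wrong")
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field} is too small")
    return errors

print(validate({"email": "not-an-address", "age": "7"}, FORM))
# ['email looks wrong']
```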

Xtech 2006: Tristan Ferne – Chopping Up Radio

Finding things when your content is audio is hard, and BBC has a lot of audio content. So need to use metadata, so have info about whole programmes. Don’t have data about how these programmes can be chopped up, e.g.

– news stories

– magazine programmes

– interviews

– music tracks

Acquiring metadata about programmes:

– in production process, either people or software, pre-broadcast

– media analysis of what is broadcast

– user annotation

Focusing on user annotation, which is the Annotatable Audio project. Aim is to get listeners to divide programmes into segments and to annotate and tag each bit. Demonstrated a pilot internally, and preparing for a live deployment.

Can annotate the audio by selecting segments (like ‘notes’ in Flickr) and add factual notes. Are thinking about adding comments about whether or not people like stuff. Wiki-like.
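
As a rough idea of what one of those annotations might look like as data – this is a guess for illustration, not the BBC’s actual model:

```python
# Illustrative data structure for a wiki-style audio annotation:
# a labelled, taggable span of a programme.
from dataclasses import dataclass, field

@dataclass
class Segment:
    programme_id: str
    start_seconds: float
    end_seconds: float
    note: str                      # factual description of the segment
    tags: list = field(default_factory=list)
    author: str = ""               # logged-in annotator; edits are versioned

interview = Segment("example-episode", 312.0, 498.5,
                    "Interview with the author", tags=["interview", "books"])
print(interview.end_seconds - interview.start_seconds, "seconds")
```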

Intending to launch around a low-profile programme, probably factual so they promote the annotation angle, not the discussion angle. Users will need to log in to annotate, but any user can see the canonical version.

Will be able to then search within the programme, to generate chapterised podcasts, and also want to support chapterised MP3s.

Looking at using it as an internal tool for production staff, e.g. tracklisting for specialist music shows or live sessions where the tracklisting can’t be pulled off of a CD.

Can add in tags and then pull out related Flickr photos, which can work nicely but sometimes doesn’t.

Could be used for syndication, so people could more easily use a section or segment of a programme using a ‘blog this’ button on the interface which creates a Flash interface you can put on your site. Problems with editorial policy on that, but it’s an aspiration for their department.

Regarding licensing, will initially be doing it with audio that there are not licensing issues for, which is either rights-free or for which the BBC has the rights.

Xtech 2006: Tom Loosemore – Treating Digital Broadcast As Just Another API, and other such ruminations

Going to tell us a story about Mr Baird, Mr Moore, and Mr Berners-Lee.

10-15 years ago, Mr Baird ruled the roost, but we know about TV, and what makes great TV is great programmes, fabulous stories fabulously told. Mr Moore then came along and said our chips will get faster, our kit will get smaller, and his corollary, that disks will just keep getting bigger. That was 30 years ago. 15 years ago Mr Berners-Lee populated the net, and said the ‘internet is made of people’.

10 years ago, Mr Loosemore started working for Wired in the UK as a journalist before they went bust. One of his jobs was to keep abreast of Moore’s Law, as the editor wanted to do a monthly feature on costs and size of computing equipment. Recently found a spreadsheet from 95 charting ISP costs and it was really expensive. In 95 everything was analogue – TV, satellite, cable.

Then in 98 Mr Murdoch gave away digital set-top boxes. It cost £2 billion and nearly cost News International the business, and the market thought he was nuts. But he saw that it was an essential move, because it gave him more bandwidth. In the UK in 95 you had 4, maybe 5, channels, but when Murdoch went with his set-top box, you had hundreds.

Then digital terrestrial started, which was rubbish, but it was then taken over by the BBC and now you can get about 30 free channels.

Doesn’t look at digital broadcasting the same way that everyone else does. Sees it as a way of distributing 1s and 0s. Doesn’t see it as programmes, but as data.

Lots of different standards and formats.

Also live P2P being used to stream live TV.

Focus on Freeview, and view it as an API.

Expect from an API:

– rich, i.e. interesting. 30 TV channels and a bunch of radio is rich.

– open. Freeview is unencrypted

– well structured. in theory Freeview is

– scalable

– very high availability, it doesn’t fall over

– accessible

– proven

Doesn’t do so well:

– licence? licence is domestic and personal, so do what you will so long as it is domestic and personal.

– documented? Theoretically, at dtg.org.uk. But the documentation is copyrighted and managed by the Digital Television Group, so you have to be a member before you can get it.

But it’s not hard to reverse engineer, so you can see where the broadcasters are adhering to the standards and where they are being a bit naughty.

Five years ago, Freeview was just taking off in the UK, but other stuff was also going on.

There’s a lot of data: 2 Mbps MPEG-2, about 2 GB of storage per hour, 50 GB per channel per day, so a terabyte will store about 4 channels for a week. But linear TV is a bad way to distribute stuff – most of the time you miss most of it. So what if we just record everything?
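
A quick back-of-envelope check of those figures (my own arithmetic, using the round numbers above):

```python
# Sanity check of the storage figures quoted in the talk notes.
gb_per_channel_per_day = 50
terabyte_in_gb = 1000

channel_days_per_tb = terabyte_in_gb / gb_per_channel_per_day  # 20 channel-days
channels_for_a_week = channel_days_per_tb / 7                  # ~2.9
print(round(channels_for_a_week, 1))  # roughly 3-4 channels for a week per terabyte
```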

So, colleagues built a box to store the entire broadcast output of the BBC for a week. 2.3 terabytes of storage. About 1000 programmes. Had it for about three weeks. When you’ve got that much choice, existing TV interfaces like the grid layout don’t work. Too much data.

Metadata is broadcast alongside the programmes, and the BBC have created an API for that metadata. Got 18 months’ worth of programme metadata, and got Phil Gifford to turn it into a website. Got genre data, but that’s pretty useless when you have 100,000 programmes – it’s no help finding stuff you like.

But if you show people stuff based on who’s in it – say programmes with Caroline Quentin – that’s helpful. Mood data was about as useful as genre, but associated with people it becomes interesting.

Then discovered the BBC Programme Catalogue. Wonderfully well structured data model, and amazing how disciplined they had been in keeping their vocabularies consistent. So Matt Biddulph put it online, and the crucial thing is that everything is a feed – RDF, FOAF etc.

But that’s only the metadata. Where are the programmes?

So, 12 TB stores all BBC TV for 6 months, and that’s a lot of programmes. But what happens when you give people that amount of content? Can’t make it public, but can make it available to BBC staff, who have to watch a lot of TV in order to do their job. Built an internal pilot, the Archive Testbed, which is no longer live. Took the learning from the metadata only prototype and found a few things.

Keep the channel data. Channels are useful and throwing them away too soon cost them. Channel brands are more than just a navigational paradigm, they are a kite mark of different types of programme. So some programmes scream ‘BBC 2’, for example.

Give people all the metadata, all of which came from external broadcast sources, not internal databases.

Added ratings and comments, links to blog posts, bit of social scheduling – what are my friends watching? What do people recommend? If I don’t know what I want, I want other people to tell me.

Was fantastic, but had to limit it to a couple of hundred people within the BBC. Was a bit too popular for their own good.

In the R&D department, a couple of them worked on a project called Kamaelia to create a framework for plugging together components for network applications, and about six months ago persuaded them they needed a project for that framework, and so applied it to this.

Hopefully this will make the project very successful. Now BBC Macro has been released as a pilot. It will eventually be everywhere.

Xtech 2006: Roland Alton-Scheidl – StreamOnTheFly network

Reusing broadcast radio/video content. For small broadcasters who have little budget and who need to swap content. System is for the journalists not the listeners. Intelligent audio search and retrieval. Simple DRM mechanisms.

Small community radio stations often have a stream online. StreamOnTheFly created a structure of nodes which exchange metadata but not the audio files. Each node carries content, metadata, classifications, stats and feedback. A portal provides a way for people to follow what content is relevant to them.

[Note: This sounded like a really good idea, but I kinda lost my focus a bit, hence the short notes.]

Xtech 2006: Di-Ann Eisnor – Collaborative Atlas: Post geopolitical boundaries

Platial, trying to help link people to people.

People have been mapping their lives, autobiogeography: where you were born, went to school, etc. They are mapping things of historical importance, e.g. ‘Women who changed the world’. Maps for hobbies and interests, e.g. bird-watchers and cat-lovers.

Over 4000 maps. Everything from food to activists to romantic encounters.

Has tags and comments. Can embed video.

When people get to ‘own’ places, geopolitical boundaries start to melt. Initial analysis. They looked at tags and found a social topography irrelevant to proximity or national borders. Correlated cities based on users, and some cities are gateways to other cities.

Some themes within the tags are universal from city to city, e.g. city names, coffee, restaurants, food, art and home.

Aggregating geodata in Placedb. Taking location point data such as GeoRSS or geotagged data, or data that includes city or street names, and then applying comparative analysis algorithms to find the location of documents with no obvious location. So they can collect Flickr pictures, Reuters stories, etc. for a specific area, e.g. your home town, and this is fed into Platial, e.g. the London page.

Need more geofeeds into Placedb.