The social side of citizen science

I spent last Thursday and Friday at the Citizen Cyberscience Summit, listening to a series of presentations about how the public are collaborating with scientists to achieve together what neither group can do alone. It was a fascinating couple of days which illustrated the vast variety of projects either running currently or in the pipeline. We’ve all heard of SETI@home, but there are projects now across a diverse set of disciplines, from botany to history, astronomy, meteorology, particle physics, seismology and beyond.

What was notable, however, was that the majority of the projects were about volunteers donating CPU cycles rather than brain cycles. Where communities were mentioned it was generally in passing, and when community tools were mentioned they were almost invariably forums/bulletin boards.

I had hoped to hear more from the different projects about community churn, retention tactics, development tactics, social tools, and other such things, but was not totally surprised to see that most presentations focused on the science instead. There was a discussion session scheduled for Friday evening to talk some of these issues through, but I sadly couldn’t stay for it. Nevertheless, I think that the social and community aspects should have been discussed throughout the two days.

It is obvious that there is tremendous overlap of interests between the citizen science community and the social collaboration community, and there are lessons both parties could learn from each other. I’d love to see some sort of round-table organised that brought the two communities together to discuss some of the issues that citizen science faces. In lieu of that, here are a few ideas to hopefully get an online discussion going.

The forum is not the only tool
I don’t think it’s a surprise that those projects which do have a community component tend towards having a forum of some sort. They’ve been around for ages and for many people they are the default discussion tool. However, we’ve come a long way since the forum was invented and there are many social tools that are more suited to certain types of tasks.

Wikis, for example, are much better for collecting static (or slowly evolving) information such as help pages. Blogs are good for ongoing updates and discussion around them. UserVoice is great for gathering feedback on your website or software. A community is a multi-faceted thing so often needs more than just one tool.

Facebook is not a panacea 
During lunch on Friday I did get to talk to some of the other attendees about social media. Facebook, of course, came up. Whilst Facebook is a massive social network, one has to be very careful how one uses it, otherwise it can be a massive waste of time. Facebook Causes, for example, was said by the Washington Post to have raised money for only a tiny percentage of the nonprofits that used it. I myself have seen how Facebook encourages ‘clicktivism’ – the aimless joining of a group or cause that isn’t followed up by any meaningful action.

Facebook as a platform, however, is a more interesting proposition. Facebook Connect allows users to log in to your site using Facebook and lets your site post updates for the user to their wall. And Facebook apps may allow citizen science to be done actually on Facebook rather than requiring users to go to another site. In this way, Facebook shows promise, but starting a group or a page and hoping that people will just go off and recruit users to your project is unlikely to be successful.

Twitter is a network of networks
Where Facebook is sitting in the kitchen being introspective over a can of cider, Twitter is the extrovert at the party. Although Facebook has more users (~500m), Twitter is now at ~150m users and growing at 300k per day. More to the point, however, Twitter is easy to use, more open, and Tweets that go viral really do go viral because it’s not just your network you’re reaching, but a network of networks. The potential value for recruitment and retention is huge, if you do it right.

Design apps to be social from the beginning
If you’re creating software for users to download and run, think about how you could make that social. The social aspects to your project don’t need to be managed exclusively on a separate website or third party software. If it makes sense for what you are doing, build in sociability.

Most of these tools are free
I’m guessing that most citizen science projects have little funding. Where social media is concerned, the good news is that the vast majority of key tools are free. The not-so-good news is that you do need to understand how to use them, which could take some investment in terms of training and consulting, and you need time to maintain your online presence. A good consultant will help you understand how to work social media into your work life so that it doesn’t become a drain on resources, but you must have some time to commit to it.

This is where JISC and other funding bodies could really help: by allocating specific funds to raising awareness of social tools in the science community, providing training, ensuring that projects can afford to work with outside social media consultants, and even by helping project leaders understand how to find a good social media consultant (sadly, there are lots of carpetbaggers).

The opportunity afforded to citizen science by social media is enormous, regardless of whether a project is focused on CPU time or more human-scale tasks. Now let’s start talking about how to realise that potential!

Real-time search: The web at the speed of life

This is the presentation that I gave this week at the Nordic Supersearch 2010 conference in Oslo organised by the Norwegian Institute of Journalism. To help explain the presentation, I was looking at the crush of information that people are dealing with, the 5 exabytes of information that Eric Schmidt of Google says that we’re creating every two days.

I think search-based filters such as Google Realtime are only part of the answer. Many of the first generation real-time search engines help filter the firehose of updates being pumped into Facebook and Twitter, but it’s often difficult to understand the provenance of the information that you’re looking at. More interestingly, I think we are now seeing new and better ways to filter for relevant information beyond the search box. Search has been the way for people to find information that is interesting and relevant, but I think real-time activity is providing new ways to deliver richer relevance.

I also agree with Mahendra Palsule that we’re moving from a numbers game to the challenge of delivering relevant information to audiences. In a lot of ways, simply driving traffic to a news site is not working. Often, as traffic increases, loyalty metrics decrease. Bounce rates go up. (Bounce rates are the percentage of visitors who spend less than 5 seconds on your site.) Time on site goes down. The number of single-page visits increases. It doesn’t have to be that way, but it is too often the case. For news organisations and other content producers, we need to find ways to increase loyalty and real engagement with our content and our journalists. I believe more social media can increase engagement, and I also believe that finding better ways to deliver relevant content to audiences is key.

Google’s method of delivering relevance in the past was to determine the authority of content on the web by looking at the links to that content, but now we’re seeing other ways to filter for relevance. When you look at how services such as filter content, we’re actually tapping into the collective attention of either our social networks or networks of influence in the case of lists of influential Twitter users. In addition to attention, we’re also starting to see location-based networks filter based on not only what is happening in real-time but also what we’re doing in real-space. We can deliver targeted advertising based on location, and for news organisations, there are huge opportunities to deliver highly targeted content.

Lastly, I think we’re finding new ways to capture mass activity by means of visualisation. Never before have we been able to tell a story in real-time as we can now. I gave the examples of the New York Times Twitter visualisation during the Super Bowl and also the UK Snow map.

I really do believe that with more content choices than the human brain can possibly cope with, intelligent filters delivering relevant information and services to people will be a huge opportunity. I think it’s one of the biggest challenges in terms of news organisations that in the battle for attention, we have to constantly be focused on relevance or become irrelevant. Certainly, any editor worth his or her salt knows (or thinks he or she knows) what his or her audience wants, but there are technology companies that are developing services that can help deliver a highly specialised stream of relevant information to people. As with so many issues in the 21st Century, it won’t be technology or editorial strategies alone that will deliver relevance or sustainable businesses for news organisations, it will be the effective use of both.


Janos Barbero, The challenge of scientific discovery games

FoldIt is a protein folding video game. Proteins are chains of amino acids, and they form a unique 3D structure which is key to their function.

Distributed computing isn’t enough to understand protein structures. Game where you try to fold the protein yourself. Game design is difficult, but even more difficult when constrained by the scientific problem you are trying to solve. You can’t take out the fiddly bits. But players have to stay engaged.

Approach the game development as science. Collect data on how people progress through the game so that they could change the training so that they found it easier to do the difficult bits. Also use that info to improve the tools. Had a lot of interaction with and feedback from the players.

Also analyse how people use the different game-tools to do the folding, and see two in particular were used consistently by successful players.

Emergence of game community. Seeing people getting engaged. Had a fairly broad appeal, demographics similar to World of Warcraft.

Second milestone was when players started beating the biochemists, emergence of ‘protein savants’, had great intuition about the proteins, but couldn’t always explain it.

Have a game wiki so people can share their game playing strategy. Each player has a different approach, can use different game-tools. People develop different strategies for different stages of the game.

Humans are comparable to or better than computers at this task.

Multiplayer game, they form groups or clans which self-organise, many groups have people who focus on the first phase, others focus on the endgame.

Effect of competition, as one person makes progress, others try to keep up.

Users can share solutions, network amplification.

Humans have completely different strategy to computers, can make huge leaps computers can’t, often looking at bad structures that lead to good, which computers can’t.

FoldIt is just the first step in getting games to do scientific work. Problem solving and learning through game play. Looking to find ways to train people into experts, generalise to all spatial problems, streamline game synthesis for all problems, and create policies, algorithms or protocols from player strategies.

Expand from problem solving to creativity. Potential for drug design, nano machines and nano design, molecular design. Aim is to create novel protein/enzyme/drug/vaccine that wouldn’t be seen in nature.

Also want to integrate games into the scientific process. Design cycle: pose problem, get public interest, run puzzle, evaluate-analyse-modify-run-repeat, publish.

Elizabeth Cochran, Distributed Sensing: using volunteer computing to monitor earthquakes around the world

Quake-Catcher Network: Using distributed sensors to record earthquakes, to put that data into existing regional seismic networks.

Aim: To better understand earthquakes and mitigate seismic risk by increasing density of seismic observations.

Uses new low-cost sensors that measure acceleration, so can see how much ground shakes during earthquakes. Using BOINC platform. Need volunteers to run sensors, or laptop with sensors.

Why do we need this extra seismic data? Need an idea of what the seismic risk is in an area, look at the major fault systems, population density, and type of buildings.

Where are the faults? Want the sensors in places where earthquakes occur. GSHAP map, shows areas of high seismic risk near plate boundaries. Most concerned with population centres, want sensors where people are, so can get community involved. Looking at cities of over 1m people in areas of high seismic risk.

Construction standards in some areas mean buildings can withstand shaking. But two very large earthquakes took place this year, e.g. Haiti was a big problem because they have infrequent earthquakes and very low building standards. Chile had relatively few deaths, and even though there was some damage, the buildings remained standing.

Seismic risk, look at what happens in the earthquake fault. Simulation of San Andreas fault, shows how much slip, a lot of complexity in a rupture. Very high amplitude in LA basin because it’s very soft sediment which shakes a lot.

Need to figure out how buildings respond. Built 7 storey building on a shake table and shook it, with sensors in and recorded what happened to it. Shake table can’t replicate real earthquakes perfectly. Also have many different types of structure so hard to get the data for them all.

Instead, use sophisticated modelling to understand what happens along the fault, propagation, and building reaction.

Simulations now much more detailed than observed, so no way to check them.

Need to add additional sensors. Seismic stations run upwards of $100k each. Can’t get millions of dollars to put up a sensor net.

Instead use accelerometers that are in laptops, e.g. Apple, ThinkPad, which are used to park hard drive when you drop them. Can tap into that with software in the background to monitor acceleration. Can record if laptop falls off desk or if there’s an earthquake.

External sensors can be plugged into any computer, cost $30 – $100 each, so inexpensive to put into schools, homes etc. Attached by USB.


Location, if you have a laptop you move about, so need to locate laptop by IP, but user can also input their location which is more exact than IP. And user can enter multiple locations, e.g. work, home.

Timing, there’s no GPS clock in most computers, and want to know exactly when a particular seismic wave arrives, so do network time protocol and pings to find the right time.
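The talk didn't go into how the clock correction is actually computed, so this is only a sketch of the standard NTP-style estimate from one request/response exchange. The function name and the sample timestamps are made up for illustration.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one exchange.

    t0: client sends request (client clock)
    t1: server receives request (server clock)
    t2: server sends reply (server clock)
    t3: client receives reply (client clock)
    """
    # Offset: how far the client clock is behind (+) or ahead (-) of the server,
    # assuming the outbound and return network paths take equal time.
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    # Delay: time on the wire, excluding the server's processing time.
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Hypothetical timestamps in seconds: the laptop clock is about 0.5 s slow.
offset, delay = ntp_offset_and_delay(100.000, 100.530, 100.531, 100.061)
```

Adding the offset to the laptop's local time stamps would put a trigger on the server's clock; repeating the exchange and keeping the lowest-delay sample is a common refinement.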

Noise, get much more noise in the data than a traditional sensor, e.g. laptop bouncing on a lap. Look at clusters. If one laptop falls on a floor, they can ignore it, but if waves of laptops shake and the waves move at the right speed, they have an event.
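A rough sketch of the clustering idea, not the network's real algorithm: treat the first laptop to trigger as a stand-in for the source, and count how many later triggers are consistent with a wave travelling outwards at a plausible speed. The speed bounds (2 to 8 km/s), the four-station threshold and the flat x/y coordinates are all assumptions chosen for illustration.

```python
import math


def is_probable_event(triggers, min_stations=4, min_speed=2.0, max_speed=8.0):
    """Return True if enough triggers look like one wave moving outwards.

    triggers: list of (time_s, x_km, y_km) tuples, one per triggered sensor.
    Speeds are in km/s; all thresholds are illustrative assumptions.
    """
    if len(triggers) < min_stations:
        return False  # a lone laptop falling off a desk is ignored
    ordered = sorted(triggers)
    t0, x0, y0 = ordered[0]
    consistent = 1  # the first trigger is the reference point
    for t, x, y in ordered[1:]:
        dt = t - t0
        if dt <= 0:
            continue
        apparent_speed = math.hypot(x - x0, y - y0) / dt
        if min_speed <= apparent_speed <= max_speed:
            consistent += 1
    return consistent >= min_stations


dropped_laptop = [(0.0, 0.0, 0.0)]
moving_wave = [(0.0, 0, 0), (2.0, 10, 0), (4.0, 20, 0), (6.0, 30, 0)]
unrelated_bumps = [(0.0, 0, 0), (100.0, 1, 0), (200.0, 2, 0), (300.0, 0, 1)]
```

Here `moving_wave` has an apparent speed of 5 km/s between stations and would be flagged, while the other two would not.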

Have 1400 participants globally, now trying to intensify network in certain places, e.g. Los Angeles.

Use information for detection of earthquakes, then look at some higher order problems, e.g. earthquake source, wave propagation.

Had one single user in Chile at the time of the earthquake. Software looks at current sensor record and sees if it’s different to previous. Info sent to server after 7 seconds. Soon after earthquake started, internet and power went out, but they did get the data later.

Took new sensors to Chile and distributed them around the area. Put up a webpage asking for volunteers in Chile and got 700 in a week. Had more volunteers than sensors. Had 100 stations installed.

There were many aftershocks, up to M6.7. Don’t often have access to a place with lots of earthquakes happening all at once, so could test data. Looked for aftershock locations, could get them very quickly. Useful for emergency response.

Had stations in the region and some had twice as much shaking as others, gives idea of ground shaking.

Want to have instruments in downtown LA. Have a high-res network in LA already but station density not high enough to look at wave propagation. If put stations in schools, then can get a good network that will show structure of LA basin.

Will also improve understanding of building responses. You can look at dominant frequency that a building shakes at, if that changes then the building has been damaged.
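To make the dominant-frequency idea concrete, here is a minimal sketch using a plain discrete Fourier transform from the standard library. The sample rate, the synthetic sine-wave "building" and the function name are assumptions for illustration; real accelerometer records would need windowing and noise handling.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) with the most power in an acceleration record."""
    n = len(samples)
    mean = sum(samples) / n
    best_k, best_power = 0, 0.0
    # Naive DFT over positive frequencies; fine for short illustrative records.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            (s - mean) * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * sample_rate_hz / n


RATE = 50  # samples per second (assumed)
before = [math.sin(2 * math.pi * 2.0 * i / RATE) for i in range(100)]
after = [math.sin(2 * math.pi * 1.5 * i / RATE) for i in range(100)]
freq_before = dominant_frequency(before, RATE)
freq_after = dominant_frequency(after, RATE)
```

In this toy case the peak drops from 2.0 Hz to 1.5 Hz, the kind of shift the talk suggested would indicate a damaged structure.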

Want to make an earthquake early warning system. An earthquake starts at a given location and the waves propagate out. If you have a station that quickly records the first shaking, and you can get a location and magnitude from that, then because seismic waves travel slower than internet traffic you can get a warning to places further away. More sensors you have, the quicker you can get the warning out.
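A back-of-the-envelope sketch of how much warning that could buy. The shaking-wave speed (3.5 km/s), the detection delay and the network delay are assumed round numbers, not figures from the talk.

```python
def warning_time_s(distance_km, detect_delay_s=7.0, network_delay_s=1.0,
                   wave_speed_km_s=3.5):
    """Seconds of warning at a site distance_km from the earthquake source.

    A negative result means the shaking arrives before the alert does.
    All default values are illustrative assumptions.
    """
    wave_arrival_s = distance_km / wave_speed_km_s
    return wave_arrival_s - detect_delay_s - network_delay_s


far_site = warning_time_s(70)   # 70 km away
near_site = warning_time_s(14)  # 14 km away
```

With these assumptions a site 70 km away gets about 12 seconds of warning, while a site 14 km away gets none, which is why shaving the detection delay with more sensors matters.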

Working with Southern California quake network to see if they can integrate two sensor networks. Also working with Mexico City to install stations, as currently only have a few stations. If any one of them goes down, it affects their ability to respond.

Matt Blumberg, Society of Minds – a framework for distributed thinking

GridRepublic, trying to raise awareness of volunteer computing. Provide people with a list of BOINC projects, can manage all your projects in one website.

Progress Thru Processors, trying to reach people in Facebook. Join up, one click process, projects post updates to hopefully reach volunteers’ friends.

Distributed thinking – what can be done if you draw on the intellectual resources of your network instead of just CPUs. How would you have to organise to make use of available cognition.

What is thinking? Marvin Minsky, The Society of Mind, “minds are built from mindless stuff”. Thinking is made up of small processes called agents, intelligence is an emergent quality. Put those agents into a structure in order to get something useful out of them.

Set of primitives

  • Pattern matching/difference identification
  • Categorising/Tagging/Naming
  • Sorting
  • Remembering
  • Observing
  • Questioning
  • Simulating/Predicting
  • Optimising
  • Making analogies
  • Acquiring new processes

Another way of thinking about it, linked stochastic processes, try stuff randomly, then explore those approaches that seem to be giving better results.
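One way to picture "try stuff randomly, then explore what seems to work" is a simple stochastic hill climb. This is my own toy sketch, not anything shown in the talk; the scoring function and neighbourhood are invented for the example.

```python
import random


def stochastic_search(score, start, neighbours, steps=1000, seed=0):
    """Randomly try nearby options and keep any that score at least as well."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(steps):
        candidate = rng.choice(neighbours(best))
        candidate_score = score(candidate)
        if candidate_score >= best_score:
            # Keep exploring from the more promising position.
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy problem: find the integer closest to 7 by stepping one unit at a time.
best, best_score = stochastic_search(
    score=lambda x: -(x - 7) ** 2,
    start=0,
    neighbours=lambda x: [x - 1, x + 1],
)
```

In a distributed-thinking setting each random try could be one volunteer's attempt, with the shared "best so far" playing the role of the link between processes.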




Philip Brohan, Volunteer online transcription of historical climate records

Interested in observation, and particularly extreme weather such as torrential rain, storms.

Morning of 16 Oct 1987, Great Storm in SE England, have weather records for that day, coloured by pressure. Low pressure – storminess. Can we understand its dynamics, can we predict it? Take observations and model them.

Previous big storm was 1703, so if we’re interested in climatology of storms, we need 100s years of records, and need them for everywhere in the world. Europe is well represented, but, say, Antarctica is not. Even in 1987, we didn’t have good records for there.

1918, rather badly observed period of time. People were distracted from weather observations (!).

This is the problem we’re trying to solve. We need more weather observations from 1918. Easy part of the problem: Public Records Office has a tremendous amount of info in their archive. Weather data potentially available if we can extract data.

Ship’s log of HMS Invincible, covers 1914 – 1915. Records actions each hour, and takes weather observations every 4 hours, six per day. Full weather obs. World’s collective archives have millions of these observations, and they are tremendously useful.

Started photographing the logbooks, 250k images. Tried OCR, doesn’t work. Using citizen science project to solve this problem.

Working with the people at Zooniverse, collaborating with them for 5 months. Funded by JISC.

At the moment developing the systems, Old Weather won’t be live for another month. You can pick a ship, join the crew of that ship and start to extract that information from its logbook: date, location, weather information. Doing some beta tests at the moment, hope that in a few weeks time it’ll be launched as a real project.

There is other information in these log books that might be of more interest to others. Expecting to find a lot of this sort of data, e.g. Invincible on 8 Dec 1914, at 5am it was engaged with Battle of the Falkland Islands.

Mostly, don’t know what is in these log books, so need to find out.

[I’m personally very excited about this project as I’m working with the Zooniverse chaps on a small part of it, so very pleased to see it’s close to launch!]

Mark Hedges, Sustaining archives

Archives, physical or digital. All sorts of documents, but many are important to historians, e.g. scraps of paper from early days of computing can be very important later on.

Time consuming to find things. Dangers to sustainability – stuff gets lost, thrown away, destroyed by accident or fire.

Digital archives, easier to access, but often funding runs out and we need them to last.

NOF-Digitise programme, ran for 5 years, ended 6 years ago, awarded £50m to 155 projects. What happened to them?

  • 30 websites still exist and have been enhanced since
  • 10 absorbed into larger archives
  • 83 websites exist but haven’t changed in 6 years since project ceased
  • 31 no URL available or doesn’t work.

Archives can die

  • Server fails/vanishes
  • Available but unchanged, becomes obsolescent
  • Content obsolete, new material not included
  • Inadequate metadata
  • Hidden archives, stuff’s there but no one can find it
  • Isolated (from the web of data)

Can we involve the community? Most archives have a focus, so there may be a community interested in it.

Can exploit the interest of specific groups for specific archives, e.g. Flickr tagging of photos. But this can be too libertarian, open to misuse. Not appropriate for more formal archives, e.g. tagging often too loose.

Middle way between professional cataloguers on one hand, free tagging on the other.

Split work up into self-contained tasks that can be sent to volunteers to be performed over internet. Problem with free tagging is that it’s insufficiently accurate. Use task replication to get consensus, calibration of performance, etc.
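A minimal sketch of what task replication with consensus might look like: send the same task to several volunteers and only accept an answer when enough of them agree. The 60% threshold and the function name are assumptions, not details from the talk.

```python
from collections import Counter


def consensus(answers, min_agreement=0.6):
    """Return the majority answer, or None if agreement is too weak.

    answers: the responses from different volunteers to the same task.
    """
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return value
    return None  # no consensus: replicate the task to more volunteers


agreed = consensus(["Hamlet", "Hamlet", "Hamlett"])
disputed = consensus(["Hamlet", "Macbeth", "Othello"])
```

Tasks that come back as `None` could be re-issued or escalated to a professional cataloguer, which fits the "middle way" described above.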

Apply this methodology to digital archives and cultural heritage collections. Want to sustain and enhance the archives. Want specific communities to adopt archives to ensure longer term prospects.

Very early stage project, TELDAP, rich archive of material relating to Chinese and Taiwanese cultural material. But doesn’t have high visibility worldwide. Needs metadata enhancement, etc.

Great Ormond St Hosp historic case notes, e.g. Dr Garrad, chronological view of his case notes. Transcription, mark up key ideas, cross referencing. Specialised knowledge required, so community is retired nurses, doctors, etc.

East London Theatre Archive Project, contains digitised material from playbills, photos, posters. Images have metadata, but there’s a lot of textual information which hasn’t been extracted and isn’t therefore accessible.

Experimenting with variety of tasks: transcription; identification of ‘special’ text, e.g. cast lists which could be linked to list of actors, or play type.

Some images have text but it’s quite complexly arranged in columns, sections, with embedded pictures. So not entirely easy. Would be useful to divide images into their different sections and classify them according to their nature.

Hybrid approach, OCR them first to produce rough draft, then get volunteer contributions rather than starting with original image.

Ephemeral materials produce very important information.

Communities. Different communities: people with intrinsic interest in topic, e.g. academic, professional; local social communities, e.g. schools; history groups, genealogists; international pool of potential volunteers with E London ancestors.

Size of community less important than having an interest in a particular topic. Important to identify people who have an interest in the fate of the archive. Small groups.

Issues to address. Open-endedness of the tasks makes it hard to assess how well it’s going. Can also attract people with malicious intent.

Want to develop guidelines for this sort of community building.

How are volunteer outputs integrated with professional outputs? Resistance from professionals to anyone else doing stuff.

Having volunteer thinkers as a stage in the project, one could have more complex processes, after the volunteers have done stuff, can get pros in to do more specialised XML mark-up, so have a ‘production line’ to make best use of everyone’s skills.

Getting communities to participate in related archives might help people preserve their cultural identity in an increasingly globalised world.

David Aanensen, EpiCollect – a generic framework for open data collection using smartphones

Looks at a number of projects, including which tracks MRSA spread, and Bd-Maps which looks at amphibian health.

Have been developing a smartphone app so that people in the field can add data. Use GPS so location aware, can take in stills/video.

EpiCollect, can submit info and access data others have submitted, and do data filtering. Android and iPhone versions. Very generic method, any questionnaires could be used for any subject.

Fully generic version at Anyone can create a project, design a form for data collection, load the project up, go out and collect data, and then have a mapping interface on website that you can filter by variable. Free and open source, code on Google Code. Use Gmail authentication.

Drag and drop interface to create form from text input, long text, select single option, select multiple.

iPhone app is free. Can host multiple projects if you want. Once you load the project it transfers the form. Can add multiple entries on your phone. Can attach video, stills, sound. Then data sent to central server. Can actually use it without a SIM card, will save it and then upload over wifi.

Can also edit entries and add new entries via the web interface too. Have also included Google Chat, so that you can contact people directly through the web interface.

Data is mapped on Google Maps, which gives you a chance to see distribution, and can click through for further details. Also produces bar graphs and pie charts.

One project was animal surveillance in Kenya and Tanzania. There’s also health facility mapping in Tanzania. Archeologists dig sites in Europe. Plant distribution in Yellowstone National Park, encouraging visitors to collect data. Street art collection, photographing favourite tags.

Very simple to use, so people can develop their own projects.

Open source so you can host it on your own server, just a simple XML definition.

Yuting Chen, Puzzle@home and the minimum information sudoku challenge

Sudoku comes from the Latin Square, invented in the Middle Ages, Leonhard Euler. But Sudoku related to the Colouring Problem, how do you colour each node in a pentagram/star so none have a neighbour the same colour. Think of Sudoku numbers as colours, each square must be different to its neighbour.

Solving sudoku for all sizes – it’s not just 9 x 9 – is an NP-complete problem, i.e. “damn hard”!

How many solutions does Sudoku have? For 4 x 4 Latin Square, 576 versions, and for 9 x 9… there are lots and lots, i.e. 6 x 10 ^ 21. Russell & Jarvis found 5.4bn solutions if you take out the symmetries.

Sudoku puzzles require clues to define a unique solution. With 4 clues, it might not have a unique solution. So what is the minimum number of clues that will provide a unique solution. Minimum found now is 17. But is there a 16 clue puzzle? Need a sudoku-checker programme to see if any 16 clue puzzles have unique solutions.
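For a flavour of what a sudoku-checker does, here is a small backtracking sketch that counts solutions and stops as soon as it finds a second one. It is a plain brute-force illustration, not the project's optimised checker, and the 4 x 4 example grids are my own.

```python
def count_solutions(grid, box, limit=2):
    """Count completions of a sudoku grid, stopping once `limit` is reached.

    grid: list of rows, 0 marks an empty cell; box: box width (3 for 9 x 9).
    The grid is modified during the search and restored before returning.
    """
    n = box * box

    def allowed(r, c, v):
        if any(grid[r][j] == v for j in range(n)):
            return False
        if any(grid[i][c] == v for i in range(n)):
            return False
        br, bc = r - r % box, c - c % box
        return all(grid[i][j] != v
                   for i in range(br, br + box) for j in range(bc, bc + box))

    for r in range(n):
        for c in range(n):
            if grid[r][c] == 0:
                total = 0
                for v in range(1, n + 1):
                    if allowed(r, c, v):
                        grid[r][c] = v
                        total += count_solutions(grid, box, limit - total)
                        grid[r][c] = 0
                        if total >= limit:
                            break
                return total
    return 1  # no empty cells left: this is one complete solution


def has_unique_solution(grid, box):
    """True if the clues pin down exactly one solution."""
    return count_solutions([row[:] for row in grid], box, limit=2) == 1


twelve_clues = [[0, 0, 0, 0], [3, 4, 1, 2], [2, 1, 4, 3], [4, 3, 2, 1]]
no_clues = [[0] * 4 for _ in range(4)]
```

The 12-clue grid has exactly one completion, while the empty grid does not; the open question in the talk is the same test applied to every possible 16-clue 9 x 9 puzzle.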

If can check for each solution in 1 second, need to spend 173 years to check all the options, but 1 second to search is not feasible.

Fastest checker will still take 2417 CPU years. Volunteer computing can help. Each solution can be checked independently.
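The numbers above can be sanity-checked with a few lines of arithmetic. The exact grid count used here (5,472,730,538, which the notes round to 5.4bn) is the commonly cited figure for grids with symmetries removed; the 10,000-volunteer scenario is purely hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Essentially different 9 x 9 grids (symmetries removed); notes say ~5.4bn.
GRIDS = 5_472_730_538

# At one second per grid, how long would a single CPU need?
years_at_one_second = GRIDS / SECONDS_PER_YEAR

# The quoted 2417 CPU years implies this many seconds per grid on average.
CPU_YEARS = 2417
implied_seconds_per_grid = CPU_YEARS * SECONDS_PER_YEAR / GRIDS

# Grids are independent, so the work divides across volunteers.
volunteers = 10_000  # hypothetical number of always-on volunteer CPUs
days_with_volunteers = CPU_YEARS / volunteers * 365.25
```

So the "173 years" figure corresponds to one second per grid, the 2417 CPU years to roughly 14 seconds per grid, and with ten thousand volunteer CPUs the whole search would take about three months.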

Asia@home is promoting volunteer computing in SE Asia.

Future plans include earthquake hazard maps and medicine design simulations.

Wenjing Wu, Citizen Cyberscience in China: CAS@home

CAS researcher focuses on where volunteer computing and thinking can help. Well known in China, and well trusted.

Chinese volunteer demographics, 42k BOINC users, 420m total internet users, 1.33bn total population. Most volunteers come from eastern developed part of China. Ave age around 27, 90% male, most are students, IT pros, mid-income workers., project started in 2003 to translate and provide information on other volunteer computing projects.

Concerns about volunteer computing:

  • Barriers
    • Language barriers
    • Complication of registration and participation
    • Lack of consciousness of science and contribution
  • Security
    • Internet environment unsafe
    • Piracy
    • Usage of public computers
  • Energy
    • Based on coal – worthwhile?
    • Extra air conditioning in hot season
    • High bills
  • China
    • When will China have their own project
    • Now have CAS@home

CAS@home is first volunteer project in China, launched Jan 2010, based at Inst of High Energy Physics, and Chinese Academy of Sciences. Uses BOINC.

First application is to predict protein structure. Comparing structure of proteins with existing templates to predict structure. Templates are independent so data can be analysed in parallel.

Future project will be to study physics theories in tau-charm energy region, like strong interaction and weak interaction.

Computing for water cleaning, run on IBM World Community Grid, simulating new low-cost low-pressure water filters. Filters use nanotubes, as flow resistance in carbon nanotubes is 1000x lower than predicted, so work at low pressure. Physical mechanism not fully understood. Want to simulate this in more detail using molecular dynamics.