Janos Barbero, The challenge of scientific discovery games

FoldIt is a protein-folding video game. Proteins are chains of amino acids that fold into a unique 3D structure, which is key to their function.

Distributed computing isn’t enough to understand protein structures. FoldIt is a game where you try to fold the protein yourself. Game design is difficult, but even more so when constrained by the scientific problem you are trying to solve: you can’t take out the fiddly bits, but players have to stay engaged.

Approached the game development as science: collected data on how people progress through the game, then changed the training so that players found it easier to do the difficult bits. Also used that info to improve the tools. Had a lot of interaction with and feedback from the players.

Also analysed how people use the different game-tools to do the folding, and saw that two in particular were used consistently by successful players.

Emergence of game community. Seeing people getting engaged. Had a fairly broad appeal, demographics similar to World of Warcraft.

Second milestone was when players started beating the biochemists: the emergence of ‘protein savants’ who had great intuition about the proteins but couldn’t always explain it.

Have a game wiki so people can share their game playing strategy. Each player has a different approach, can use different game-tools. People develop different strategies for different stages of the game.

Humans are comparable or better than computers at this task.

Multiplayer game, they form groups or clans which self-organise, many groups have people who focus on the first phase, others focus on the endgame.

Effect of competition, as one person makes progress, others try to keep up.

Users can share solutions, network amplification.

Humans have a completely different strategy to computers and can make huge leaps computers can’t, often moving through bad structures that lead to good ones, which computers won’t do.

FoldIt is just the first step in getting games to do scientific work. Problem solving and learning through game play. Looking to find ways to train people into experts, generalise to all spatial problems, streamline game synthesis for all problems, and create policies, algorithms or protocols from player strategies.

Expand from problem solving to creativity. Potential for drug design, nano machines and nano design, molecular design. Aim is to create novel protein/enzyme/drug/vaccine that wouldn’t be seen in nature.

Also want to integrate games into the scientific process. Design cycle: pose problem, get public interest, run puzzle, evaluate-analyse-modify-run-repeat, publish.

Elizabeth Cochran, Distributed Sensing: using volunteer computing to monitor earthquakes around the world

Quake-Catcher Network: Using distributed sensors to record earthquakes, to put that data into existing regional seismic networks.

Aim: To better understand earthquakes and mitigate seismic risk by increasing density of seismic observations.

Uses new low-cost sensors that measure acceleration, so can see how much ground shakes during earthquakes. Using BOINC platform. Need volunteers to run sensors, or laptop with sensors.

Why do we need this extra seismic data? To get an idea of what the seismic risk is in an area, you look at the major fault systems, population density, and types of buildings.

Where are the faults? Want the sensors in places where earthquakes occur. GSHAP map, shows areas of high seismic risk near plate boundaries. Most concerned with population centres, want sensors where people are, so can get community involved. Looking at cities of over 1m people in areas of high seismic risk.

Construction standards in some areas mean buildings can withstand shaking. Two very large earthquakes took place this year: Haiti was a big problem because they have infrequent earthquakes and very low building standards; Chile had relatively few deaths, and even though there was some damage, the buildings remained standing.

Seismic risk, look at what happens in the earthquake fault. Simulation of San Andreas fault, shows how much slip, a lot of complexity in a rupture. Very high amplitude in LA basin because it’s very soft sediment which shakes a lot.

Need to figure out how buildings respond. Built 7 storey building on a shake table and shook it, with sensors in and recorded what happened to it. Shake table can’t replicate real earthquakes perfectly. Also have many different types of structure so hard to get the data for them all.

Instead, use sophisticated modelling to understand what happens along the fault, propagation, and building reaction.

Simulations are now much more detailed than the observations, so there’s no way to check them.

Need to add additional sensors. Traditional seismic stations run upwards of $100k each, and you can’t get millions of dollars to put up a sensor net.

Instead, use the accelerometers that are in laptops, e.g. Apple, ThinkPad, which are used to park the hard drive when you drop them. Can tap into that with software in the background to monitor acceleration. Can record if the laptop falls off a desk or if there’s an earthquake.

External sensors can be plugged into any computer, cost $30 – $100 each, so inexpensive to put into schools, homes etc. Attached by USB.

Challenges:

Location: if you have a laptop you move about, so they locate the laptop by IP, but the user can also input their location, which is more exact than IP. Users can enter multiple locations, e.g. work, home.

Timing: there’s no GPS clock in most computers, and they want to know exactly when a particular seismic wave arrives, so they use Network Time Protocol and pings to find the right time.

Noise: they get much more noise in the data than a traditional sensor, e.g. a laptop bouncing on a lap. They look at clusters: if one laptop falls on the floor they can ignore it, but if waves of laptops shake and the waves move at the right speed, they have an event.
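
A minimal sketch of that kind of plausibility check (the trigger list, coordinates and the 2–8 km/s speed window are illustrative assumptions, not QCN’s actual algorithm): a cluster of triggers is accepted only if the apparent speed between every pair of stations looks seismic.

    import math

    # Hypothetical triggers: (latitude, longitude, trigger time in seconds).
    triggers = [
        (34.05, -118.25, 0.0),
        (34.15, -118.10, 4.1),
        (34.30, -117.90, 9.8),
    ]

    def distance_km(a, b):
        """Rough great-circle distance between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 6371 * 2 * math.asin(math.sqrt(h))

    def looks_like_earthquake(triggers, v_min=2.0, v_max=8.0):
        """Plausible event if every pairwise apparent speed is in the seismic range (km/s)."""
        if len(triggers) < 3:
            return False  # one laptop falling on the floor is just noise
        for i in range(len(triggers)):
            for j in range(i + 1, len(triggers)):
                (la1, lo1, t1), (la2, lo2, t2) = triggers[i], triggers[j]
                dt = abs(t2 - t1)
                if dt == 0:
                    continue
                speed = distance_km((la1, lo1), (la2, lo2)) / dt
                if not v_min <= speed <= v_max:
                    return False
        return True

    print(looks_like_earthquake(triggers))  # True for this cluster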

Have 1400 participants globally, now trying to intensify network in certain places, e.g. Los Angeles.

Use information for detection of earthquakes, then look at some higher order problems, e.g. earthquake source, wave propagation.

Had one single user in Chile at the time of the earthquake. The software looks at the current sensor record and sees if it’s different to the previous one; info is sent to the server after 7 seconds. Soon after the earthquake started, internet and power went out, but they did get the data later.

Took new sensors to Chile and distributed them around the area. Put up a webpage asking for volunteers in Chile and got 700 in a week. Had more volunteers than sensors. Had 100 stations installed.

There were many aftershocks, up to M6.7. Don’t often have access to a place with lots of earthquakes happening all at once, so could test data. Looked for aftershock locations, could get them very quickly. Useful for emergency response.

Had stations in the region and some had twice as much shaking as others, gives idea of ground shaking.

Want to have instruments in downtown LA. Have a high-res network in LA already but station density not high enough to look at wave propagation. If put stations in schools, then can get a good network that will show structure of LA basin.

Will also improve understanding of building responses. You can look at the dominant frequency that a building shakes at; if that changes, then the building has been damaged.
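
A small sketch of that idea, assuming you have acceleration time series for the building before and after an event (the sample rate and signals below are synthetic; NumPy is only used for the FFT):

    import numpy as np

    FS = 50.0  # assumed sample rate in Hz

    def dominant_frequency(acceleration, fs=FS):
        """Return the frequency (Hz) with the largest spectral amplitude."""
        spectrum = np.abs(np.fft.rfft(acceleration - np.mean(acceleration)))
        freqs = np.fft.rfftfreq(len(acceleration), d=1.0 / fs)
        return freqs[np.argmax(spectrum)]

    t = np.arange(0, 60, 1.0 / FS)
    before = np.sin(2 * np.pi * 2.0 * t)  # building swaying at ~2.0 Hz
    after = np.sin(2 * np.pi * 1.6 * t)   # slower sway, e.g. after hypothetical damage

    print(dominant_frequency(before), dominant_frequency(after))  # ~2.0 Hz vs ~1.6 Hz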

Want to make an earthquake early warning system. An earthquake starts at a given location and the waves propagate out. If you have a station that quickly records the first shaking, and you can get a location and magnitude from that, then because seismic waves travel slower than internet traffic you can get a warning to places further away. The more sensors you have, the quicker you can get the warning out.
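
As a back-of-envelope illustration of why this works (the wave speed, processing delay and distances below are assumed values, not figures from the talk):

    # Damaging S-waves travel at roughly 3.5 km/s through the crust, while the alert
    # travels at network speed, so warning time is roughly the wave travel time minus
    # the time needed to detect the quake and issue the alert.
    S_WAVE_SPEED_KM_S = 3.5    # assumed typical crustal S-wave speed
    PROCESSING_DELAY_S = 10.0  # assumed detection + alert latency

    for distance_km in (50, 100, 200):
        warning = distance_km / S_WAVE_SPEED_KM_S - PROCESSING_DELAY_S
        print(f"{distance_km} km from the epicentre: ~{warning:.0f} s of warning")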

Working with the Southern California quake network to see if they can integrate the two sensor networks. Also working with Mexico City to install stations, as it currently only has a few; if any one of them goes down, it affects their ability to respond.

Matt Blumberg, Society of Minds – a framework for distributed thinking

GridRepublic, trying to raise awareness of volunteer computing. Provide people with a list of BOINC projects, can manage all your projects in one website.

Progress Thru Processors, trying to reach people in Facebook. Join up, one click process, projects post updates to hopefully reach volunteers’ friends.

Distributed thinking – what can be done if you draw on the intellectual resources of your network instead of just CPUs. How would you have to organise to make use of available cognition.

What is thinking? Marvin Minsky, The Society of Mind: “minds are built from mindless stuff”. Thinking is made up of small processes called agents; intelligence is an emergent quality. Put those agents into a structure in order to get something useful out of them.

Set of primitives

  • Pattern matching/difference identification
  • Categorising/Tagging/Naming
  • Sorting
  • Remembering
  • Observing
  • Questioning
  • Simulating/Predicting
  • Optimising
  • Making analogies
  • Acquiring new processes

Another way of thinking about it: linked stochastic processes. Try stuff randomly, then explore the approaches that seem to be giving better results.
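
A toy sketch of that idea in code: generate a candidate at random, then keep exploring around whichever candidate scores best (the objective function here is made up purely for illustration).

    import random

    def score(x):
        """Made-up objective: higher is better, with a peak near x = 3."""
        return -(x - 3) ** 2

    def stochastic_search(steps=1000, step_size=0.5):
        best = random.uniform(-10, 10)  # try something random to start
        for _ in range(steps):
            candidate = best + random.uniform(-step_size, step_size)  # explore nearby
            if score(candidate) > score(best):  # keep whatever works better
                best = candidate
        return best

    print(round(stochastic_search(), 2))  # converges close to 3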


Philip Brohan, Volunteer online transcription of historical climate records

Interested in observation, and particularly extreme weather such as torrential rain, storms.

Morning of 16 Oct 1987, Great Storm in SE England, have weather records for that day, coloured by pressure. Low pressure – storminess. Can we understand its dynamics, can we predict it? Take observations and model them.

The previous big storm was in 1703, so if we’re interested in the climatology of storms, we need hundreds of years of records, and we need them for everywhere in the world. Europe is well represented, but, say, Antarctica is not; even in 1987 we didn’t have good records there.

1918, rather badly observed period of time. People were distracted from weather observations (!).

This is the problem we’re trying to solve: we need more weather observations from 1918. The easy part of the problem: the Public Records Office has a tremendous amount of info in its archive, and the weather data is potentially available if we can extract it.

Ship’s log of HMS Invincible, covers 1914 – 1915. Records actions each hour, and takes weather observations every 4 hours, six per day. Full weather obs. World’s collective archives have millions of these observations, and they are tremendously useful.

Started photographing the logbooks, 250k images. Tried OCR, doesn’t work. Using citizen science project to solve this problem.

Working with the people at Zooniverse, collaborating with them for 5 months. Funded by JISC.

At the moment developing the systems, Old Weather won’t be live for another month. You can pick a ship, join the crew of that ship and start to extract that information from its logbook: date, location, weather information. Doing some beta tests at the moment, hope that in a few weeks time it’ll be launched as a real project.

There is other information in these log books that might be of more interest to others. Expecting to find a lot of this sort of data, e.g. the Invincible on 8 Dec 1914: at 5am it was engaged in the Battle of the Falkland Islands.

Mostly, don’t know what is in these log books, so need to find out.

[I’m personally very excited about this project as I’m working with the Zooniverse chaps on a small part of it, so very pleased to see it’s close to launch!]

Mark Hedges, Sustaining archives

Archives, physical or digital. All sorts of documents, but many are important to historians, e.g. scraps of paper from early days of computing can be very important later on.

Time consuming to find things. Dangers to sustainability – stuff gets lost, thrown away, destroyed by accident or fire.

Digital archives, easier to access, but often funding runs out and we need them to last.

NOF-Digitise programme, ran for 5 years, ended 6 years ago, awarded £50m to 155 projects. What happened to them?

  • 30 websites still exist and have been enhanced since
  • 10 absorbed into larger archives
  • 83 websites exist but haven’t changed in 6 years since project ceased
  • 31 no URL available or doesn’t work.

Archives can die

  • Server fails/vanishes
  • Available but unchanged, becomes obsolescent
  • Content obsolete, new material not included
  • Inadequate metadata
  • Hidden archives, stuff’s there but no one can find it
  • Isolated (from the web of data)

Can we involve the community? Most archives have a focus, so there may be a community interested in it.

Can exploit the interest of specific groups for specific archives, e.g. Flickr tagging of photos. But this can be too libertarian, open to misuse. Not appropriate for more formal archives, e.g. tagging often too loose.

Middle way between professional cataloguers on one hand, free tagging on the other.

Split the work up into self-contained tasks that can be sent to volunteers and performed over the internet. The problem with free tagging is that it’s insufficiently accurate, so use task replication to reach consensus, calibrate performance, etc.
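
A minimal sketch of task replication with consensus, assuming each task goes to several volunteers and an answer is only accepted when enough of them agree (the task name, answers and threshold are illustrative):

    from collections import Counter

    # Hypothetical: the same transcription task replicated to three volunteers.
    responses = {
        "playbill-042-title": ["Theatre Royal", "Theatre Royal", "Theatre Royale"],
    }

    def consensus(answers, min_agreement=2):
        """Accept the most common answer if enough volunteers agree, else flag for review."""
        answer, votes = Counter(answers).most_common(1)[0]
        return answer if votes >= min_agreement else None

    for task, answers in responses.items():
        result = consensus(answers)
        print(task, "->", result or "needs expert review")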

Apply this methodology to digital archives and cultural heritage collections. Want to sustain and enhance the archives. Want specific communities to adopt archives to ensure longer term prospects.

Very early stage project, TELDAP, a rich archive of Chinese and Taiwanese cultural material. But it doesn’t have high visibility worldwide. Needs metadata enhancement, etc.

Great Ormond St Hosp historic case notes, e.g. Dr Garrad, chronological view of his case notes. Transcription, mark up key ideas, cross referencing. Specialised knowledge required, so community is retired nurses, doctors, etc.

East London Theatre Archive Project, contains digitised material from playbills, photos, posters. Images have metadata, but there’s a lot of textual information which hasn’t been extracted and isn’t therefore accessible.

Experimenting with a variety of tasks: transcription; identification of ‘special’ text, e.g. cast lists which could be linked to a list of actors, or the play type.

Some images have text but it’s quite complexly arranged in columns and sections, with embedded pictures, so it’s not entirely easy. It would be useful to divide images into their different sections and classify them according to their nature.

Hybrid approach, OCR them first to produce rough draft, then get volunteer contributions rather than starting with original image.

Ephemeral materials produce very important information.

Communities. Different communities: people with intrinsic interest in topic, e.g. academic, professional; local social communities, e.g. schools; history groups, genealogists; international pool of potential volunteers with E London ancestors.

Size of community less important than having an interest in a particular topic. Important to identify people who have an interest in the fate of the archive. Small groups.

Issues to address: the open-endedness of the tasks makes it hard to assess how well it’s going. It can also attract people with malicious intent.

Want to develop guidelines for this sort of community building.

How are volunteer outputs integrated with professional outputs? Resistance from professionals to anyone else doing stuff.

With volunteer thinkers as one stage in the project, one could have more complex processes: after the volunteers have done their work, professionals can come in to do more specialised XML mark-up, making a ‘production line’ that gets the best use of everyone’s skills.

Getting communities to participate in related archives might help people preserve their cultural identity in an increasingly globalised world.

David Aanensen, EpiCollect – a generic framework for open data collection using smartphones

Looks at a number of projects, including spatialepidemiology.net which tracks MRSA spread, and Bd-Maps which looks at amphibian health.

Have been developing a smartphone app so that people in the field can add data. Use GPS so location aware, can take in stills/video.

EpiCollect, can submit info and access data others have submitted, and do data filtering. Android and iPhone versions. Very generic method, any questionnaires could be used for any subject.

Fully generic version at EpiCollect.net. Anyone can create a project, design a form for data collection, load the project up, go out and collect data, and then have a mapping interface on website that you can filter by variable. Free and open source, code on Google Code. Use Gmail authentication.

Drag and drop interface to create form from text input, long text, select single option, select multiple.

iPhone app is free. Can host multiple projects if you want. Once you load the project it transfers the form. Can add multiple entries on your phone. Can attach video, stills, sound. Then data is sent to a central server. Can actually use it without a SIM card; it will save the data and then upload over wifi.

Can also edit entries and add new entries via the web interface. Have also included Google Chat, so that you can contact people directly through the web interface.

Data is mapped on Google Maps, which gives you a chance to see distribution, and you can click through for further details. Also produces bar graphs and pie charts.

One project was animal surveillance in Kenya and Tanzania. There’s also health facility mapping in Tanzania. Archaeological dig sites in Europe. Plant distribution in Yellowstone National Park, encouraging visitors to collect data. Street art collection, photographing favourite tags.

Very simple to use, so people can develop their own projects.

Open source so you can host it on your own server, just a simple XML definition.

Yuting Chen, Puzzle@home and the minimum information sudoku challenge

Sudoku derives from the Latin square, which dates back to the Middle Ages and was studied by Leonhard Euler. Sudoku is also related to the colouring problem: how do you colour each node in a pentagram/star so that no node has a neighbour of the same colour? Think of the Sudoku numbers as colours: each square must be different from its neighbours.

Solving sudoku for all sizes – it’s not just 9 x 9 – is an NP-complete problem, i.e. “damn hard”!

How many solutions does Sudoku have? For the 4 x 4 Latin square there are 576, and for 9 x 9 Sudoku there are lots and lots: about 6 x 10^21. Taking out the symmetries, Russell and Frazer Jarvis found about 5.4bn essentially different solutions.

Sudoku puzzles require clues to define a unique solution; with only 4 clues there cannot be a unique solution. So what is the minimum number of clues that will provide a unique solution? The minimum found so far is 17. But is there a 16-clue puzzle? You need a sudoku-checker program to see whether any 16-clue puzzle has a unique solution.

If each solution grid could be checked in 1 second, it would take about 173 years to check them all, and a 1-second check per grid isn’t actually feasible.

The fastest checker would still take about 2417 CPU years. Volunteer computing can help: each solution grid can be checked independently.
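
A quick back-of-envelope check of those figures from the numbers above (the ~14 seconds per grid implied by the 2417 CPU-year estimate is derived here, not quoted in the talk):

    GRIDS = 5_472_730_538         # the ~5.4bn essentially different solution grids
    SECONDS_PER_YEAR = 365 * 24 * 3600

    # At one second per grid:
    print(GRIDS / SECONDS_PER_YEAR)         # ~173 years

    # If the total is ~2417 CPU years, the fastest checker effectively spends:
    print(2417 * SECONDS_PER_YEAR / GRIDS)  # ~14 seconds per grid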

Asia@home is promoting volunteer computing in SE Asia.

Future plans include earthquake hazard maps and medicine design simulations.

Wenjing Wu, Citizen Cyberscience in China: CAS@home

CAS researchers focus on where volunteer computing and thinking can help. CAS is well known in China, and well trusted.

Chinese volunteer demographics: 42k BOINC users, out of 420m total internet users and a 1.33bn total population. Most volunteers come from the developed eastern part of China. Average age around 27, 90% male; most are students, IT pros, or mid-income workers.

EQUN.com, project started in 2003 to translate and provide information on other volunteer computing projects.

Concerns about volunteer computing:

  • Barriers
    • Language barriers
    • Complication of registration and participation
    • Lack of consciousness of science and contribution
  • Security
    • Internet environment unsafe
    • Piracy
    • Usage of public computers
  • Energy
    • Based on coal – worthwhile?
    • Extra air conditioning in hot season
    • High bills
  • China
    • When will China have their own project
    • Now have CAS@home

CAS@home is the first volunteer computing project in China, launched Jan 2010, based at the Institute of High Energy Physics, Chinese Academy of Sciences. Uses BOINC.

The first application is protein structure prediction, comparing proteins against existing structural templates. Templates are independent, so the data can be analysed in parallel.

A future project will study physics in the tau-charm energy region, such as the strong and weak interactions.

Computing for water cleaning, run on IBM’s World Community Grid, simulating new low-cost, low-pressure water filters. The filters use nanotubes, as flow resistance in carbon nanotubes is 1000x lower than predicted, so they work at low pressure. The physical mechanism is not fully understood; they want to simulate this in more detail using molecular dynamics.

Ben Segal, LHC@home starts to tackle real LHC physics

The LHC accelerates and collides protons and other particles; analysis of the resulting events helps us understand the nature of matter, the origin of the universe, etc.

LHC@home is based on BOINC, allowing volunteers to lend their computers; it started five years ago as a tool to help design and tune the accelerator itself (beams circulate and collide many times per second). The objectives were to raise awareness of CERN and the LHC, as well as providing extra CPU power.

Project has run intermittently, which the volunteers don’t really like. Hoping to start giving them a steady flow of jobs in the next few months.

Now want to do “real physics”.

Serious challenges:

  • Most volunteers run Windows, but the experiments’ code runs on Linux and porting it to Windows is impractical. Didn’t have that problem with the design project.
  • Code changes often, so all PCs must update
  • Code size very big

Solution:

Can we use virtualisation? The entire application environment is packaged, sent out as a ‘virtual image’ and executed in a virtual machine.

Result:

Porting the code to Windows happens automatically. But the problem is that the virtual image is still very big (10 GB), and the whole virtual image must be rewritten for each update.

So there’s a solution but it’s not very practical.

CernVM – when mainframes ruled the roost, CERN’s IBM mainframe was called CERNVM. It was replaced by PCs and Unix/Linux.

The new CernVM is a ‘thin’ virtual appliance for the LHC experiments: racks of virtual machines. It provides a complete, portable and easy-to-configure user environment for running locally, on a grid or in the cloud.

The new LHC@home project will use CernVM. The user is sent 0.1 GB; the machine logs into the system and downloads the rest on demand, up to 1 GB, and never has to load the full 10 GB. Can run real physics with just 1 GB.

However, that’s not the end of it. The system runs on BOINC, which needs jobs to be sent from the physicists. But they won’t change their current set-up to produce BOINC jobs: they want to know where their jobs are and be able to manage them, and BOINC doesn’t allow that, whereas their own job submission system does.

CernVM has software image control, also added an interface for job management.

All done by students and volunteers, which makes it a bit intermittent.

Peter Amoako-Yirenkyi, AfricaMap – volunteer cartography for Africa

A lot has been said about maps, but old maps no longer relevant. Need maps that reflect people’s real-world needs.

There’s a lack of data and geo-information. UN geographic data in different formats on different platforms. Details that are required to make the map useful are just missing. Maps look good at scale of 2000 ft, but at 200 ft are woefully inadequate.

AfricaMap uses volunteers to do tasks that require no specific scientific training: annotating satellite imagery and providing geospatial data.

The tasks are specific because they are working with an agency, UNOSAT, that is well defined in the way it works. UNOSAT provides satellite-based solutions for the UN, local governments, NGOs etc., and has made over 1000 maps/analyses in over 200 emergencies and conflict zones, supporting early warning, crisis response, etc.

Objective of AfricaMap is to help UNOSAT solve this problem.

  • Produce maps for humanitarian causes
  • Generate early warning, crisis response, human rights, and sustainable recovery information
  • Capacity for specific requests, e.g. sudden disaster situations
  • Provide historic data

Volunteers look at sat images, on laptop or mobile, and annotate.

This activity has a direct impact on the volunteers themselves, there is an immediate need for this information.

Designed a learning/training system from the beginning, which helps calibrate the volunteers so that they can be given specific jobs that fit their skills.

Create work units from tiled imagery, define and send jobs to volunteers, receive completed jobs, validate them, eliminate work already done, score volunteers’ work, and place volunteers in levels and teams.
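
A minimal sketch of the first step in that pipeline, cutting a bounding box of imagery into tile-sized work units (the region, tile size and job format are hypothetical, not AfricaMap’s actual scheme):

    # Hypothetical bounding box (degrees) split into tiles; each tile becomes one
    # annotation job that can be sent to a volunteer and tracked through its states.
    LAT_MIN, LAT_MAX = 5.0, 6.0
    LON_MIN, LON_MAX = -1.0, 0.0
    TILE_DEG = 0.1  # assumed tile size

    def make_work_units():
        n_lat = int(round((LAT_MAX - LAT_MIN) / TILE_DEG))
        n_lon = int(round((LON_MAX - LON_MIN) / TILE_DEG))
        jobs = []
        for i in range(n_lat):
            for j in range(n_lon):
                lat = LAT_MIN + i * TILE_DEG
                lon = LON_MIN + j * TILE_DEG
                jobs.append({
                    "bbox": (lat, lon, lat + TILE_DEG, lon + TILE_DEG),
                    "status": "unassigned",  # later: sent, completed, validated, scored
                })
        return jobs

    print(len(make_work_units()))  # 100 tiles for this 1 x 1 degree region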

Africa Map is still in active development. Not starting from scratch – building on existing technologies.