A few weeks ago, I was at the IBM Systems and Technology Group Analyst Insights conference in Old Greenwich, Connecticut. This is IBM’s annual opportunity to lay out where it sees the future for STG. As part of the visit, analysts were given an opportunity to meet with a number of the Watson engineers at their research centre. One of the subjects under discussion was the impact of IBM Watson on medicine.

Medical research: a battleground for big data?

Ask most people where IBM Watson is used and they will point to the large number of stories about its use in helping to diagnose cancer patients. The more news-aware might also have picked up on the work being done in medical research and genomics to accelerate drug research.

To some, there is a feeling that this sort of work is nothing special; nothing that couldn’t be done using traditional big data tools, such as IBM DB2 with BLU Acceleration or IBM Cognos, and some clever analytics. Indeed, when you look at press releases and reports from large medical institutions, IBM is not the only company involved. All of its competitors have case studies about what they are doing to help medical science.

It’s about the challenge of unstructured data and retaining context

However, that misses the big picture of what IBM is actually doing with Watson and why the cognitive computing engine inside Watson is ideally suited to this type of work. The type of data involved is not highly curated, heavily structured information held inside a database. Instead, Watson deals with the data in two passes.

The first is to ingest very large volumes of data from medical journals, books and research papers. According to a report by CiteSeer, a new scientific paper is published nearly every 30 seconds, while even the best researchers struggle to read more than 23 papers per month.

This data carries a lot of context within the text, which has to be preserved during the data acquisition phase. Preserving context requires complex algorithms and is something that cannot be done through simple word indexing. This is why so many attempts to search unstructured data inside companies return large numbers of hits based purely on the words or phrases being searched for.
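The limitation is easy to demonstrate. The minimal sketch below (purely illustrative; it bears no relation to Watson's actual pipeline, and the documents are invented) shows why a plain word index produces so many hits: it matches tokens, not meaning, so a paper that merely mentions a term in passing ranks alongside one that is actually about it.

```python
# A naive inverted index: word -> set of document ids. It has no
# notion of context, so every document containing the token matches.
from collections import defaultdict

documents = {
    1: "p53 mutations are common in many cancers",
    2: "the p53 pathway can be switched off by kinases",
    3: "this paper does not discuss p53 in any depth",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def keyword_search(term):
    """Return every document containing the term, relevant or not."""
    return sorted(index[term.lower()])

# All three documents come back, including the one that only
# mentions p53 in passing.
print(keyword_search("p53"))
```

Distinguishing document 2 (which describes a mechanism) from document 3 (which explicitly says nothing) is exactly the context-preservation problem that simple indexing cannot solve.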

The second phase is how Watson takes that data and builds a self-learning model that extrapolates likely outcomes. On the face of it, this is what supercomputers, and more recently high-performance computing systems, have been doing for years. Where Watson differs is that it can state how confident it is that an answer is correct, based on that vast bank of data and context.
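To make the idea of confidence-scored answers concrete, here is a toy sketch: each candidate hypothesis is scored by the evidence behind it, and the system returns a ranked list rather than a single "right" answer. The hypotheses, weights and the mean-of-weights scoring are all invented for illustration; Watson's real evidence-scoring models are far more sophisticated.

```python
# Rank candidate hypotheses by a naive confidence score:
# the mean of the evidence weights (each in 0..1) supporting them.

def rank_hypotheses(evidence):
    """evidence: dict mapping hypothesis -> list of evidence weights.
    Returns (hypothesis, confidence) pairs, strongest first."""
    ranked = [
        (hyp, sum(weights) / len(weights))
        for hyp, weights in evidence.items() if weights
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical hypotheses with made-up evidence weights.
evidence = {
    "kinase A activates p53": [0.9, 0.8, 0.7],
    "kinase B activates p53": [0.4, 0.5],
    "kinase C activates p53": [0.2],
}

for hypothesis, confidence in rank_hypotheses(evidence):
    print(f"{hypothesis}: confidence {confidence:.2f}")
```

The point of the ranked output is that a researcher sees not just an answer but how strongly the ingested literature supports it, which is what separates this from a conventional batch computation.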

Multiple imports raise questions of copyright and intellectual property

To ensure that the most accurate results are reached, Watson re-ingests the source material for every study. This means it can exclude material that is later found to be faulty or rejected by peer review. Previous results can be saved and compared with new runs if required, in order to see how much the now-rejected data affected earlier results.
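The workflow described above, re-running an analysis with retracted papers excluded and diffing against a saved run, can be sketched as follows. Everything here is hypothetical (the corpus, the "findings" and the function are stand-ins), but it shows the shape of the comparison.

```python
# Re-run an analysis with a retraction list, then diff against a
# saved baseline to see which findings depended on rejected papers.

def run_analysis(corpus, retracted=frozenset()):
    """Pretend 'analysis': collect findings from non-retracted papers."""
    return {
        finding
        for paper, findings in corpus.items()
        if paper not in retracted
        for finding in findings
    }

corpus = {
    "paper-1": {"target-X"},
    "paper-2": {"target-Y", "target-Z"},
    "paper-3": {"target-X", "target-W"},
}

baseline = run_analysis(corpus)                      # saved earlier run
rerun = run_analysis(corpus, retracted={"paper-2"})  # after peer review rejects paper-2

# Findings that rested solely on the now-rejected material:
print(sorted(baseline - rerun))
```

Note that "target-X" survives the retraction because a second, unretracted paper also supports it; only the findings unique to the rejected paper drop out of the new run.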

IBM has already begun resolving the legal and copyright issues with the various publishers and scientific bodies to allow it to access the data required to make this work. As each study will require all the material to be read from scratch, there is a need for proper licensing so that IBM is not accused of making it impossible for specialist publishers to exist. The fact that IBM is already talking about the copyright, legal and even ethical challenges of accessing the data suggests that we are closer than we might think to this being a commercial solution.

The larger the data pool the easier it is to explore hypotheses

One of the advantages of doing it this way is that Watson can then use the data to create new paths for investigation. This capability was first developed in the work with oncologists, but is now also being used in drug and medical research facilities. According to one recent study, Watson's ability to fully explore the potential of a hypothesis at high speed, thanks to the sheer volume of data it can analyse, is speeding up research by a factor of at least six.

Baylor College of Medicine looked at how Watson was able to examine 70,000 articles on the protein p53 and identify six potential proteins that can turn p53 on and off. Watson discovered these six proteins in a very short space of time, whereas the researchers working alongside it would expect to find, at best, one possible target per year.

Commoditising the solution for a wider research community

What makes all of this exciting is that IBM is now working to box up this capability, and not just sell it with a chunk of hardware: by the middle of 2015, it will be available as a cloud solution. This means that small charities working on specific diseases, which struggle for funding and would be unable to attract the level of staff required to run such a complex data-processing task, will be able to access it through the cloud.

Of course, once everything is in the cloud, charities in different countries that are working on separate solutions can begin to share access to the underlying data. This will help bring research together and speed up the process of developing promising lines of future work for both treatments and drugs.

The value and benefits go beyond medicine

This is not just about medical research. There are many other areas of scientific research where this process can be easily applied. Biochemistry, chemistry, physics, earth sciences, astronomy and even engineering, which is always looking for new materials, are open targets.

For governments, Watson offers the ability to speed up the evaluation of drug trials and bring complex fraud cases to court faster. This will appeal to bodies such as the UK Government, which is all in on cloud computing.

But it is in the commercial sector where IBM will expect to make its money. The legal profession is one area that IBM has also begun to investigate; it will be interesting to see when a legal firm announces that it has used Watson to help discover a defence for a client.

Another area will be general business, where companies are struggling to understand their structured data, let alone the unstructured data inside their businesses. Despite all the talk of mixing internal and external data, few companies have the processing power or tools to do this. IBM will soon announce that it has begun to provide access to public data sources and that Watson will be able to mix that data with internal corporate data. For the sales, marketing and advertising industries, the potential here is boundless.

The solution to everything?

Is Watson the answer to everything? No. It will not easily replace the complex processing being done by the world’s supercomputers. However, if you suffer from an illness for which there is little help today and where charities are struggling to acquire the cash for research, Watson could soon be working on possible solutions for you.