We live in the Information Age: we have access to more information and more data than people have ever had before. Most people would agree that this has brought huge benefits. In the past, if we wanted to find something out or read up on a topic (such as a famous person or battle), we would have had to search an encyclopaedia for an answer. If the encyclopaedia had no answer for us, we might have had to go down to our library to research it. That is no longer the case. With the creation and expansion of the modern Web, we can find information after typing just a few words into a search engine.
However, with the expansion of the Web and of information storage, we are facing a new set of problems. We now have vast quantities of data, more than we can really imagine or comprehend, and we are creating more each day. A good example is that most modern smartphones have data storage capacities that dwarf those of the average home PC from just five years ago. At a recent conference, Eric Schmidt, executive chairman of Google, commented that we now create as much information in 2 days as we did in total up to 2003: around 5 exabytes every 2 days! With the proliferation of social websites, the amount of information and data being created has expanded even more rapidly.
It’s perhaps no surprise that Google’s Chief Economist, Hal Varian, has said that dealing with all this data will be “the sexy job” of the next 10 years. He has argued that all companies will need data scientists to keep up with the ever-growing requirements of the information that we are generating.
In this article, I will explore the various algorithms that are being used to deal with the massive increases in data that we are creating, covering distributed processing, Semantic Web and natural language processing.
Distributed processing – Multicore processing and Cloud computing
What is it?
A standard algorithm would operate in serial – in other words, the algorithm would run through the data available, piece by piece, until it was finished. However, this can be slow with large datasets, so one solution to dealing with massive datasets is to split up the task of dealing with those datasets into smaller pieces. This can be achieved with distributed processing, which involves running distributed algorithms on a number of different processors, either using a single machine (called parallel processing), or across multiple machines (cloud computing). The algorithm splits up the task and runs on the different processors, vastly speeding up the computation process.
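The split-run-combine idea can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the sum-of-squares task are invented for the example, and a thread pool is used purely to keep the sketch self-contained (for CPU-heavy work in Python, a pool of processes or a cluster of machines would be the more usual choice).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """The sub-task: each worker handles one chunk on its own."""
    return sum(x * x for x in chunk)


def serial_sum_of_squares(data):
    """Serial version: walk through all of the data, piece by piece."""
    return sum_of_squares(data)


def distributed_sum_of_squares(data, n_workers=4):
    """Split the task, run the pieces on a pool of workers, combine the results."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)
```

Both versions return the same answer; the distributed one simply arrives at it by adding up the partial results from each chunk.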
How will it help?
Using distributed algorithms can save a great deal of time, and therefore money, compared to serial algorithms. Distributed algorithms were recently used to calculate the 2,000,000,000,000,000th digit of pi (that’s the two quadrillionth digit). “It took 23 days on 1,000 of Yahoo’s computers – on a standard PC, the calculation would have taken 500 years.” In case you were wondering, the digit was ‘0’.
What are the Limitations?
There are challenges in creating distributed processing algorithms because they are much more complex than serial algorithms. Given that parallel algorithms work by cutting up a single large task into many smaller tasks, it’s vital that the algorithm still works on the data once it has been broken down into smaller pieces. Whenever one part of the algorithm is unable to finish because it is waiting for information from another part, there is lag, which can be costly, so co-ordinating the distribution process is vital. Problems can also arise when different processors working on different sub-tasks all try to modify the same data at the same time. This means that the most efficient way to run a distributed algorithm is to keep the parts independent of each other, so that they can be completed separately and their results combined once all sub-tasks have finished.
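As a rough sketch of what independent sub-tasks look like, the hypothetical word count below gives every worker its own private counter, so no two workers ever modify the same data. The results are merged in a single step only after every sub-task has finished. The function names are invented for the example, and threads stand in for separate processors.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(document):
    """Sub-task: count the words in one document using only local state."""
    return Counter(document.lower().split())


def merge_counts(partial_counts):
    """Combine step: runs once, after every sub-task has finished."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def distributed_word_count(documents, n_workers=4):
    """Count words across documents without any shared, mutable data."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, documents))
    return merge_counts(partial_counts)
```

Because no worker waits on another and nothing is shared, there is no lag from co-ordination and no risk of two workers overwriting each other's updates.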
Semantic Web
What is it?
Distributed algorithms are used widely for various data-crunching tasks. But let’s turn now to searching for information within the masses of data that we are generating. When we use a search engine, we enter a string of words, and the search engine dutifully goes off and tries to find the information we are looking for. But there’s a problem. The search engine doesn’t ‘understand’ the pages it is examining, so it can often suggest pages that we don’t want.
One solution to this that has been proposed is the Semantic Web. The Semantic Web approach promotes detailed formatting of data and Web pages to give rise to an ‘intelligent’ version of the Web. When Web pages are formatted using Semantic content, a Semantic search engine will be able to understand the content on each page and then be more able to locate what users are searching for. Even better, it will be able to hunt down words and phrases related to what is being searched for.
Within the Semantic Web, an ontology (a structure of knowledge) defines the entities, classes, relationships and rules within a specific domain of knowledge. An ontology is created using RDF (Resource Description Framework), a framework for describing data such as title, type of content, etc., and OWL (Web Ontology Language), a language for defining classes and the relationships between them, designed to be read by computer applications to help them ‘understand’ the information. Together they create a hierarchical description of structured data.
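To give a feel for what such a hierarchical description looks like, here is a toy sketch in plain Python. It is not real RDF or OWL syntax: the triples, the predicate names (`subClassOf`, `type`, `title`) and the helper functions are all simplified assumptions made for illustration. The idea it borrows is that facts are stored as subject, predicate, object statements, and that class membership can be inferred by walking up the hierarchy.

```python
# Hypothetical mini-ontology stored as (subject, predicate, object) triples.
TRIPLES = {
    ("Battle", "subClassOf", "Event"),
    ("Event", "subClassOf", "Thing"),
    ("Person", "subClassOf", "Thing"),
    ("BattleOfHastings", "type", "Battle"),
    ("BattleOfHastings", "title", "Battle of Hastings"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def superclasses(triples, cls):
    """Follow subClassOf links upwards to collect every ancestor class."""
    found = []
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(triples, subject=current, predicate="subClassOf"):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


def is_instance_of(triples, entity, cls):
    """An entity belongs to its declared class and to all of that class's ancestors."""
    for _, _, declared in match(triples, subject=entity, predicate="type"):
        if declared == cls or cls in superclasses(triples, declared):
            return True
    return False
```

Nothing in the data says directly that the Battle of Hastings is an event, yet `is_instance_of(TRIPLES, "BattleOfHastings", "Event")` can work it out from the class hierarchy.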
How will it help?
A useful analogy is to think of the Semantic Web as rather like organising a library into meaningful sub-sections, making it possible to browse through related content and books without having to exhaustively search every book in the building. In other words, by including additional, meaningful data in Web content, searches will be able to pinpoint more accurately what the user is trying to find. When search engines incorporate semantic information, they should be able to suggest answers from other, related words or phrases, helping you to find what you are looking for much more easily.
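The difference between a plain keyword search and one that also follows related words can be sketched as follows. The pages and the table of related terms are made up for the example; in a real system the relatedness would come from the semantic data attached to the pages rather than from a hand-written dictionary.

```python
# Hypothetical relatedness data; a real system would derive it from an ontology.
RELATED_TERMS = {
    "car": {"automobile", "vehicle"},
    "film": {"movie", "cinema"},
}

# Hypothetical page collection: page name -> page text.
PAGES = {
    "page1": "used automobile prices this year",
    "page2": "classic movie reviews",
    "page3": "car maintenance tips",
}


def keyword_search(pages, term):
    """Plain search: only pages containing the exact term match."""
    return sorted(name for name, text in pages.items() if term in text.split())


def semantic_search(pages, term, related=RELATED_TERMS):
    """Expanded search: also match pages that use a related term."""
    terms = {term} | related.get(term, set())
    return sorted(
        name for name, text in pages.items()
        if terms & set(text.split())
    )
```

Searching for "car" with `keyword_search` finds only the page that uses that exact word, while `semantic_search` also returns the page about automobiles.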
What are the Limitations?
The main limitation is that the semantic data needs to be set up, maintained and updated. At the moment semantic data is not on every Web page, and it would take time to add the information. This is a classic problem with adopting new forms of technology.
Natural Language Processing
What is it?
Another approach to using more ‘intelligent’ algorithms comes in the form of Natural Language Processing (NLP). This involves mining facts from unstructured data, which is useful because naturally-occurring language data is very common on open-ended information sources such as the Web. NLP uses machine learning algorithms to learn, piece by piece, a model of human language, and derives information from the models that are generated. It is a branch of artificial intelligence which utilises algorithms that can learn over time based upon the data that they receive.
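A very small example of learning a model of language piece by piece is a bigram model, which simply counts which word tends to follow which. It is far cruder than the algorithms used in real NLP systems, and the class below is a hypothetical sketch, but it shows how predictions can change as more text is seen.

```python
from collections import Counter, defaultdict


class BigramModel:
    """A tiny language model that counts which word follows which."""

    def __init__(self):
        self.following = defaultdict(Counter)

    def learn(self, sentence):
        """Update the model with one more sentence of training text."""
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            self.following[current][nxt] += 1

    def predict_next(self, word):
        """Return the most frequently seen follower of `word`, or None."""
        followers = self.following.get(word.lower())
        if not followers:
            return None
        return followers.most_common(1)[0][0]
```

Each call to `learn` refines the counts, so the answer given by `predict_next` can change over time as the model receives more data.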
NLP algorithms are capable of many tasks, such as:
- Relationship extraction – given a chunk of text, it can work out the relationships described in it. If you then ask “Who is the wife of David Cameron?” after giving it a news report about David Cameron and his family, it will be able to work out the answer from the text.
- Question answering – given natural language questions it should be able to automatically answer them.
NLP algorithms are capable of many more tasks than the two listed here.
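The two tasks above can be illustrated with a deliberately simplified sketch. It uses fixed text patterns rather than machine learning, so it only copes with sentences of the assumed form "X is married to Y" and questions of the form "Who is the wife of X?". The names in the usage example are placeholders.

```python
import re

# Assumed sentence pattern: "<Name> is married to <Name>".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
MARRIED_PATTERN = re.compile(rf"({NAME}) is married to ({NAME})")
QUESTION_PATTERN = re.compile(r"Who is the (?:wife|husband|spouse) of (.+?)\?")


def extract_relationships(text):
    """Relationship extraction: build a spouse lookup from raw text."""
    spouses = {}
    for first, second in MARRIED_PATTERN.findall(text):
        spouses[first] = second
        spouses[second] = first
    return spouses


def answer(question, relationships):
    """Question answering: resolve a 'Who is the wife of X?' style question."""
    found = QUESTION_PATTERN.fullmatch(question.strip())
    if not found:
        return None
    return relationships.get(found.group(1))
```

For example, after `facts = extract_relationships("Jane Doe is married to John Doe.")`, calling `answer("Who is the wife of John Doe?", facts)` returns `"Jane Doe"`. A real NLP system would learn such patterns from data instead of having them written by hand.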
Perhaps the most recent example of NLP algorithms in use is the Apple application Siri, which can understand and complete tasks spoken in natural language. It can also learn as you use it, e.g. by remembering people. If you say ‘call my wife’, it can remember this and link it to her name, so that in future saying ‘call my wife’ will call her without you having to say her full name.
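The ‘call my wife’ behaviour boils down to remembering a link between a spoken phrase and a contact. The sketch below is purely hypothetical and says nothing about how Siri is actually built; the class, its methods and the contact details are invented to show the idea.

```python
class ContactAssistant:
    """Remembers which contact a spoken alias such as 'my wife' refers to."""

    def __init__(self, contacts):
        self.contacts = contacts  # full name -> phone number
        self.aliases = {}         # learned alias -> full name

    def remember(self, alias, full_name):
        """Link an alias to a known contact so it can be reused later."""
        if full_name not in self.contacts:
            raise KeyError(f"unknown contact: {full_name}")
        self.aliases[alias.lower()] = full_name

    def call(self, spoken_name):
        """Resolve an alias or full name; return None if clarification is needed."""
        full_name = self.aliases.get(spoken_name.lower(), spoken_name)
        number = self.contacts.get(full_name)
        if number is None:
            return None
        return f"Calling {full_name} on {number}"
```

The first time `call("my wife")` is used it returns `None`, meaning the assistant would have to ask who that is. Once `remember("my wife", ...)` has stored the link, the same phrase resolves straight to the right contact.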
How will it help?
NLP can be used to answer complex questions whose answers are embedded within the open-ended, unstructured language and information that is predominant on the modern Web. It can therefore rapidly answer complex questions that a simple Web search may not be able to address very easily. Unlike semantic content, we don’t need to add the data formatting ourselves: NLP algorithms can work it out for themselves.
What are the Limitations?
As with distributed processing algorithms, NLP algorithms are complex and difficult to create. They require a massive corpus of data to be trained effectively, and take time to ‘understand’ the data they have available to them. Furthermore, it has been reported that these algorithms also have heavy data loads when used with Web applications. For example, iPhones using Siri gobble up twice as much data as the previous iPhone model. This means that mobile service providers may soon have to expand the data transfer speed and bandwidth of their networks to keep up with the data requirements of these NLP algorithms.
The future of algorithms lies in tackling the massive task of dealing with the substantial amount of data being generated. We need to increase processing power through distributed processing and cloud computing so that algorithms can be faster and more efficient. Improvements in hardware are not enough, though, so we also need to focus on how the data is organised, to make it easier to sort and search. The Semantic Web will help with this, and future algorithms can take full advantage of it. Finally, natural language processing is a step forward in complex and advanced algorithms, offering a way of searching and sorting varied data that makes interacting with technology more natural and intuitive.