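Zhang Lei: Both played a major role. Many techniques from the field of artificial intelligence are applied in the system, which is obvious. On the other hand, without advances in computing power, we might also have run into obstacles in improving computation speed. A few years ago, Watson needed more than 2 hours to answer a single question on a decent server. Only the powerful parallel computing capability of IBM Power 7 brought that down to under 3 seconds. Powerful computing capability also greatly accelerated the development process. We made heavy use of the Java language and machine learning, and both need substantial computing power behind them.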
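InfoQ: Watson reportedly uses Semantic Web technologies, including RDF/Linked Data. Why did Watson choose this technology? How do the ideas behind RDF and Linking Open Data come into play in the Watson system?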
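Zhang Lei: Linked Data is a very important source of structured knowledge. In the initial stage of developing Watson we looked into how to make use of this knowledge source. We tried using Linked Data, especially DBpedia, IMDb, and so on, to answer competition questions directly. We also used Linked Data to help Watson judge the type of an answer. Beyond that, Watson borrowed ideas from RDF and Linked Data in many other places. For example, some knowledge mined from text is represented as triples; when the object denoted by a string is ambiguous, URIs are used to represent the different objects; and the predicates in RDF triples are used as semantic hints.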
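To make the triple and URI ideas above concrete, here is a minimal Python sketch. It is not Watson's code: the `example.org` URIs, the predicate names, and the helper functions are all hypothetical and chosen only for illustration. It shows one ambiguous string ("Paris") resolving to two distinct URIs, each carrying its own facts as (subject, predicate, object) triples.

```python
# Hypothetical URIs for illustration only; they are not real DBpedia identifiers.
PARIS_CITY = "http://example.org/resource/Paris_(city)"
PARIS_MYTH = "http://example.org/resource/Paris_(mythology)"

# Each fact is a (subject, predicate, object) triple.
TRIPLES = [
    (PARIS_CITY, "label", "Paris"),
    (PARIS_CITY, "type", "City"),
    (PARIS_CITY, "capitalOf", "France"),
    (PARIS_MYTH, "label", "Paris"),
    (PARIS_MYTH, "type", "MythologicalFigure"),
]


def subjects_with_label(triples, label):
    """Return every URI whose label equals the (possibly ambiguous) string."""
    return sorted(s for s, p, o in triples if p == "label" and o == label)


def types_of(triples, uri):
    """Return the types asserted for one URI; useful for answer-type checks."""
    return sorted(o for s, p, o in triples if s == uri and p == "type")


for uri in subjects_with_label(TRIPLES, "Paris"):
    print(uri, types_of(TRIPLES, uri))
```

The string alone cannot say which Paris is meant, but each URI can be checked against an expected answer type, which is the kind of use the interview describes.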
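Zhang Lei: For the practice of artificial intelligence, Watson's experience shows that relying on a single algorithm, or a handful of them, is unlikely to succeed, while integrating a large number of diverse small algorithms makes progress more likely. This seems to resemble the diversity found in the biological world. Watson also shows that AI technology has made considerable progress: by integrating these techniques on a large scale, many problems that looked very hard have moved from "impossible to solve" to "possibly solvable." For example, Watson shows that the old knowledge acquisition bottleneck in AI now looks like a problem that may be solvable.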
Stanford professor Andrew Ng, the man at the center of the Deep Learning movement. Photo: Ariel Zambelich/Wired
There’s a theory that human intelligence stems from a single algorithm.
The idea arises from experiments suggesting that the portion of your brain dedicated to processing sound from your ears could also handle sight for your eyes. This is possible only while your brain is in the earliest stages of development, but it implies that the brain is — at its core — a general-purpose machine that can be tuned to specific tasks.
About seven years ago, Stanford computer science professor Andrew Ng stumbled across this theory, and it changed the course of his career, reigniting a passion for artificial intelligence, or AI. “For the first time in my life,” Ng says, “it made me feel like it might be possible to make some progress on a small part of the AI dream within our lifetime.”
In the early days of artificial intelligence, Ng says, the prevailing opinion was that human intelligence derived from thousands of simple agents working in concert, what MIT’s Marvin Minsky called “The Society of Mind.” To achieve AI, engineers believed, they would have to build and combine thousands of individual computing modules. One agent, or algorithm, would mimic language. Another would handle speech. And so on. It seemed an insurmountable feat.
When he was a kid, Andrew Ng dreamed of building machines that could think like people, but when he got to college and came face-to-face with the AI research of the day, he gave up. Later, as a professor, he would actively discourage his students from pursuing the same dream. But then he ran into the “one algorithm” hypothesis, popularized by Jeff Hawkins, an AI entrepreneur who’d dabbled in neuroscience research. And the dream returned.
It was a shift that would change much more than Ng’s career. Ng now leads a new field of computer science research known as Deep Learning, which seeks to build machines that can process data in much the same way the brain does, and this movement has extended well beyond academia, into big-name corporations like Google and Apple. In tandem with other researchers at Google, Ng is building one of the most ambitious artificial-intelligence systems to date, the so-called Google Brain.
This movement seeks to meld computer science with neuroscience — something that never quite happened in the world of artificial intelligence. “I’ve seen a surprisingly large gulf between the engineers and the scientists,” Ng says. Engineers wanted to build AI systems that just worked, he says, but scientists were still struggling to understand the intricacies of the brain. For a long time, neuroscience just didn’t have the information needed to help improve the intelligent machines engineers wanted to build.
What’s more, scientists often felt they “owned” the brain, so there was little collaboration with researchers in other fields, says Bruno Olshausen, a computational neuroscientist and the director of the Redwood Center for Theoretical Neuroscience at the University of California, Berkeley.
The end result is that engineers started building AI systems that didn’t necessarily mimic the way the brain operated. They focused on building pseudo-smart systems that turned out to be more like a Roomba vacuum cleaner than Rosie the robot maid from the Jetsons.
But, now, thanks to Ng and others, this is starting to change. “There is a sense from many places that whoever figures out how the brain computes will come up with the next generation of computers,” says Dr. Thomas Insel, the director of the National Institute of Mental Health.
What Is Deep Learning?
Deep Learning is a first step in this new direction. Basically, it involves building neural networks — networks that mimic the behavior of the human brain. Much like the brain, these multi-layered computer networks can gather information and react to it. They can build up an understanding of what objects look or sound like.
In an effort to recreate human vision, for example, you might build a basic layer of artificial neurons that can detect simple things like the edges of a particular shape. The next layer could then piece together these edges to identify the larger shape, and then the shapes could be strung together to understand an object. The key here is that the software does all this on its own — a big advantage over older AI models, which required engineers to massage the visual or auditory data so that it could be digested by the machine-learning algorithm.
With Deep Learning, Ng says, you just give the system a lot of data “so it can discover by itself what some of the concepts in the world are.” Last year, one of his algorithms taught itself to recognize cats after scanning millions of images on the internet. The algorithm didn’t know the word “cat” (Ng had to supply that), but over time, it learned to identify the furry creatures we know as cats, all on its own.
This approach is inspired by how scientists believe that humans learn. As babies, we watch our environments and start to understand the structure of objects we encounter, but until a parent tells us what it is, we can’t put a name to it.
No, Ng’s deep learning algorithms aren’t yet as accurate — or as versatile — as the human brain. But he says this will come.
Andrew Ng’s laptop explains Deep Learning. Photo: Ariel Zambelich/Wired
From Google to China to Obama
Andrew Ng is just part of a larger movement. In 2011, he launched the Deep Learning project at Google, and in recent months, the search giant has significantly expanded this effort, acquiring the artificial intelligence outfit founded by University of Toronto professor Geoffrey Hinton, widely known as the godfather of neural networks. Chinese search giant Baidu has opened its own research lab dedicated to deep learning, vowing to invest heavy resources in this area. And according to Ng, big tech companies like Microsoft and Qualcomm are looking to hire more computer scientists with expertise in neuroscience-inspired algorithms.
Meanwhile, engineers in Japan are building artificial neural nets to control robots. And together with scientists from the European Union and Israel, neuroscientist Henry Markram is hoping to recreate a human brain inside a supercomputer, using data from thousands of real experiments.
The rub is that we still don’t completely understand how the brain works, but scientists are pushing forward in this as well. The Chinese are working on what they call the Brainnetdome, described as a new atlas of the brain, and in the U.S., the Era of Big Neuroscience is unfolding with ambitious, multidisciplinary projects like President Obama’s newly announced (and much criticized) Brain Research Through Advancing Innovative Neurotechnologies Initiative — BRAIN for short.
The BRAIN planning committee had its first meeting this past Sunday, with more meetings scheduled for this week. One of its goals is the development of novel technologies that can map the brain’s myriad circuits, and there are hints that the project will also focus on artificial intelligence. Half of the $100 million in federal funding allotted to this program will come from Darpa (more than the amount coming from the National Institutes of Health), and the Defense Department’s research arm hopes the project will “inspire new information processing architectures or new computing approaches.”
If we map out how thousands of neurons are interconnected and “how information is stored and processed in neural networks,” engineers like Ng and Olshausen will have a better idea of what their artificial brains should look like. The data could ultimately feed and improve Deep Learning algorithms underlying technologies like computer vision, language analysis, and the voice recognition tools offered on smartphones from the likes of Apple and Google.
“That’s where we’re going to start to learn about the tricks that biology uses. I think the key is that biology is hiding secrets well,” says Berkeley computational neuroscientist Olshausen. “We just don’t have the right tools to grasp the complexity of what’s going on.”
What the World Wants
With the rise of mobile devices, cracking the neural code is more important than ever. As gadgets get smaller and smaller, we’ll need new ways of making them faster and more accurate. The more you shrink transistors (the fundamental building blocks of our machines), the more difficult it becomes to make them accurate and efficient. If you make them faster, for instance, they need more current, and more current makes the system noisier, i.e., less precise.
Right now, engineers design around these issues, says Olshausen, so they skimp on speed, size, or energy efficiency to make their systems work. But AI may provide a better answer. “Instead of dodging the problem, what I think biology could tell us is just how to deal with it….The switches that biology is using are also inherently noisy, but biology has found a good way to adapt and live with that noise and exploit it,” Olshausen says. “If we could figure out how biology naturally deals with noisy computing elements, it would lead to a completely different model of computation.”
But scientists aren’t just aiming for smaller. They’re trying to build machines that do things computers have never done before. No matter how sophisticated algorithms are, today’s machines can’t fetch your groceries or pick out a purse or a dress you might like. That requires a more advanced breed of image intelligence and an ability to store and recall pertinent information in a way that’s reminiscent of human attention and memory. If you can do that, the possibilities are almost endless.
“Everybody recognizes that if you could solve these problems, it’s going to open up a vast, vast potential of commercial value,” Olshausen predicts.
That financial promise is why tech giants like Google, IBM, Microsoft, Apple, Chinese search giant Baidu and others are in an arms race to develop the best machine learning technologies. NYU’s Yann LeCun, an expert in the field, expects that in the next two years, we’ll see a surge in Deep Learning startups, and many will be snatched up by larger outfits.
But even the best engineers aren’t brain experts, so having more neuro-knowledge handy is important. “We need to really work more closely with neuroscientists,” says Baidu’s Yu, who is toying with the idea of hiring one. “We are already doing that, but we need to do more.”
Ng’s dream is on the way to reality. “It gives me hope – no, more than hope – that we might be able to do this,” he says. “We clearly don’t have the right algorithms yet. It’s going to take decades. This is not going to be an easy one, but I think there’s hope.”
Top Charts is a new feature for Google Trends that identifies the popular searches within a category, e.g., books or actors. What’s interesting about it, from a technology standpoint, is that it uses Google’s Knowledge Graph to provide a universe of things and the categories into which they belong. This is a great example of “Things, not strings”, Google’s clever slogan to explain the importance of the Knowledge Graph.
Here’s how it’s explained in the Trends Top Charts FAQ.
“Top Charts relies on technology from the Knowledge Graph to identify when search queries seem to be about particular real-world people, places and things. The Knowledge Graph enables our technology to connect searches with real-world entities and their attributes. For example, if you search for ice ice baby, you’re probably searching for information about the musician Vanilla Ice or his music. Whereas if you search for vanilla ice cream recipe, you’re probably looking for information about the tasty dessert. Top Charts builds on work we’ve done so our systems do a better job finding you the information you’re actually looking for, whether tasty desserts or musicians.”
One thing to note is that the Knowledge Graph, which is said to have more than 18 billion facts about 570 million objects, includes more than the traditional named entities (e.g., people, places, things) among its objects. For example, there is a top chart for Animals that shows that dogs are the most popular animal in Google searches followed by cats (no surprises here) with chickens at number three on the list (could their high rank be due to recipe searches?). The dog object, in most knowledge representation schemes, would be modeled as a concept or class as opposed to an object or instance. In some representation systems, the same term (e.g., dog) can be used to refer to both a class of instances (a class that includes Lassie) and also to an instance (e.g., an instance of the class animal types). Which sense of the term dog is meant (class vs. instance) is determined by the context. In the semantic web representation language OWL 2, the ability to use the same term to refer to a class or a related instance is called punning.
Of course, when doing this kind of mapping of terms to objects, we only want to consider concepts that commonly have words or short phrases used to denote them. Not all concepts do, such as animals that from a long way off look like flies.
A second observation is that once you have a nice knowledge base like the Knowledge Graph, you have a new problem: how can you recognize mentions of its instances in text? In the DBpedia knowledge base (derived from Wikipedia) there are nine individuals named Michael Jordan and two of them were professional basketball players in the NBA. So, when you enter a search query like “When did Michael Jordan play for Penn”, we have to use information in the query, its context and what we know about the possible referents (e.g., those nine Michael Jordans) to decide (1) if this is likely to be a reference to any of the objects in our knowledge base, and (2) if so, to which one. This task, which is a fundamental one in language processing, is not trivial, but luckily, in applications like Top Charts, we don’t have to do it with perfect accuracy.
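The two decisions described above can be sketched in a few lines of Python. This is a toy illustration, not how Google or DBpedia does it: the candidate identifiers and their context words are invented, and simple word overlap stands in for whatever evidence a real system would use. Returning `None` models decision (1), that the query may not refer to any known entity; picking the best-scoring candidate models decision (2).

```python
# Invented candidates and context words, for illustration only (not DBpedia data).
CANDIDATES = {
    "jordan_a": {"bulls", "chicago", "wizards"},
    "jordan_b": {"penn", "guard", "ivy"},
    "jordan_c": {"berkeley", "professor", "statistics"},
}


def link_entity(query, candidates, min_overlap=1):
    """Return the candidate whose context words best overlap the query.

    Returns None when no candidate reaches min_overlap, meaning the query
    gives no evidence for any specific entity in the knowledge base.
    """
    query_words = set(query.lower().replace("?", "").split())
    best_id, best_score = None, 0
    for entity_id, context_words in candidates.items():
        score = len(query_words & context_words)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= min_overlap else None


print(link_entity("When did Michael Jordan play for Penn", CANDIDATES))
print(link_entity("Michael Jordan net worth", CANDIDATES))
```

Because an application like Top Charts only aggregates counts, an occasional wrong or missing link from a scorer this crude would wash out, which is the point made above about not needing perfect accuracy.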
Google’s Top Charts is a simple, but effective, example that demonstrates the potential usefulness of semantic technology to make our information systems better in the near future.
The 2010 Fall Issue of AI Magazine includes an article on “Building Watson: An Overview of the DeepQA Project,” written by the IBM Watson Research Team, led by David Ferrucci. Read about this exciting project in the most detailed technical article available. We hope you will also take a moment to read through the archives of AI Magazine, and consider joining us at AAAI. To join, please read more at http://www.aaai.org/Membership/membership.php. The most recent online volume of AI Magazine is usually only available to members of the association. However, we have made an exception for this special article on Watson to share the excitement. Congratulations to the IBM Watson Team!
Building Watson: An Overview of the DeepQA Project
Published in AI Magazine Fall, 2010. Copyright ©2010 AAAI. All rights reserved.
Written by David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty
IBM Research undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise. The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that can be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of question answering (QA).
The goals of IBM Research are to advance computer science by exploring new ways for computer technology to affect science, business, and society. Roughly three years ago, IBM Research was looking for a major research challenge to rival the scientific and popular interest of Deep Blue, the computer chess-playing champion (Hsu 2002), that also would have clear relevance to IBM business interests.
With a wealth of enterprise-critical information being captured in natural language documentation of all forms, the problems with perusing only the top 10 or 20 most popular documents containing the user’s two or three key words are becoming increasingly apparent. This is especially the case in the enterprise where popularity is not as important an indicator of relevance and where recall can be as critical as precision. There is growing interest to have enterprise computer systems deeply analyze the breadth of relevant content to more precisely answer and justify answers to user’s natural language questions. We believe advances in question-answering (QA) technology can help support professionals in critical and timely decision making in areas like compliance, health care, business integrity, business intelligence, knowledge discovery, enterprise knowledge management, security, and customer support. For researchers, the open-domain QA problem is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. It has had a long history (Simmons 1970) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury 2004, Strzalkowski and Harabagiu 2006).
With QA in mind, we settled on a challenge to build a computer system, called Watson, which could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise.
Jeopardy! is a well-known TV quiz show that has been airing on television in the United States for more than 25 years (see the Jeopardy! Quiz Show sidebar for more information on the show). It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question. A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.
Finally, the Jeopardy Challenge represents a unique and compelling AI question similar to the one underlying Deep Blue (Hsu 2002): can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.
The Jeopardy Challenge
Meeting the Jeopardy Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.
Winning at Jeopardy requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy and none of the individual algorithms are perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy parlance, this confidence is used to determine whether the computer will “ring in” or “buzz in” for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds with an average around 3 seconds.
Confidence estimation was very critical to shaping our overall approach in DeepQA. There is no expectation that any component in the system does a perfect job — all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong.
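As a rough illustration of combining component confidences into one buzz-in decision, the Python sketch below uses a single logistic combination. This is a deliberately simplified stand-in for the hierarchical machine-learning method mentioned above, not a description of it; the feature names, weights, bias, and threshold are all hypothetical.

```python
import math

# Hypothetical weights and bias; a real system would learn these from data.
WEIGHTS = {"type_match": 2.0, "passage_support": 3.0}
BIAS = -3.0


def combine_confidence(features, weights=WEIGHTS, bias=BIAS):
    """Combine per-component scores into one confidence in (0, 1)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def should_buzz(confidence, threshold=0.5):
    """Buzz in only when the final confidence clears the threshold."""
    return confidence >= threshold


strong = combine_confidence({"type_match": 1.0, "passage_support": 0.9})
weak = combine_confidence({"type_match": 0.0, "passage_support": 0.2})
print(round(strong, 3), should_buzz(strong))
print(round(weak, 3), should_buzz(weak))
```

No single feature has to be reliable on its own here; the decision depends only on the combined score, which mirrors the expectation that no individual component does a perfect job.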
In this section we elaborate on the various aspects of the Jeopardy Challenge.
A 30-clue Jeopardy board is organized into six columns. Each column contains five clues and is associated with a category. Categories range from broad subject headings like “history,” “science,” or “politics” to less informative puns like “tutu much,” in which the clues are about ballet, to actual parts of the clue, like “who appointed me to the Supreme Court?” where the clue is the name of a judge, to “anything goes” categories like “potpourri.” Clearly some categories are essential to understanding the clue, some are helpful but not necessary, and some may be useless, if not misleading, for a computer.
A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.
There are a wide variety of ways one can attempt to characterize the Jeopardy clues: for example, by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue. The bulk of Jeopardy clues represent what we would consider factoid questions — questions whose answers are based on factual information about one or more individual entities. The questions themselves present challenges in determining what exactly is being asked for and which elements of the clue are relevant in determining the answer. Here are just a few examples (note that while the Jeopardy! game requires that answers are delivered in the form of a question (see the Jeopardy! Quiz Show sidebar), this transformation is trivial and for purposes of this paper we will just show the answers themselves):
Category: General Science
Clue: When hit by electrons, a phosphor gives off electromagnetic energy in this form.
Answer: Light (or Photons)
Category: Lincoln Blogs
Clue: Secretary Chase just submitted this to me for the third time; guess what, pal. This time I’m accepting it.
Answer: his resignation
Category: Head North
Clue: They’re the two states you could be reentering if you’re crossing Florida’s northern border.
Answer: Georgia and Alabama
Some more complex clues contain multiple facts about the answer, all of which are required to arrive at the correct response but are unlikely to occur together in one place. For example:
Category: “Rap” Sheet
Clue: This archaic term for a mischievous or annoying child can also mean a rogue or scamp.
Subclue 1: This archaic term for a mischievous or annoying child.
Subclue 2: This term can also mean a rogue or scamp.
In this case, we would not expect to find both “subclues” in one sentence in our sources; rather, if we decompose the question into these two parts and ask for answers to each one, we may find that the answer common to both questions is the answer to the original clue.
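The idea of answering each subclue separately and keeping the answer common to both can be sketched as a set intersection. The Python below is a minimal illustration under stated assumptions: the candidate lists and their scores are invented, and the function name is hypothetical rather than part of DeepQA.

```python
def answer_by_intersection(candidates_per_subclue):
    """Keep candidates proposed for every subclue, best summed score first.

    candidates_per_subclue is a list of dicts mapping answer -> score,
    one dict per subclue.
    """
    if not candidates_per_subclue:
        return []
    common = set(candidates_per_subclue[0])
    for candidates in candidates_per_subclue[1:]:
        common &= set(candidates)
    return sorted(
        common,
        key=lambda answer: -sum(c[answer] for c in candidates_per_subclue),
    )


# Invented candidates and scores for the two subclues of the "Rap" Sheet clue.
subclue_1 = {"imp": 0.7, "rapscallion": 0.6, "urchin": 0.5}
subclue_2 = {"knave": 0.6, "rapscallion": 0.7, "scoundrel": 0.8}
print(answer_by_intersection([subclue_1, subclue_2]))
```

Neither subclue ranks the shared candidate first on its own, yet it is the only one that survives the intersection, which is why decomposition helps on clues like this.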
Another class of decomposable questions is one in which a subclue is nested in the outer clue, and the subclue can be replaced with its answer to form a new question that can more easily be answered. For example:
Category: Diplomatic Relations
Clue: Of the four countries in the world that the United States does not have diplomatic relations with, the one that’s farthest north.
Inner subclue: The four countries in the world that the United States does not have diplomatic relations with (Bhutan, Cuba, Iran, North Korea).
Outer subclue: Of Bhutan, Cuba, Iran, and North Korea, the one that’s farthest north.
Answer: North Korea
Decomposable Jeopardy clues generated requirements that drove the design of DeepQA to generate zero or more decomposition hypotheses for each question as possible interpretations.
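For the nested case, the flow is: answer the inner subclue, substitute its answer into the outer clue, then answer the rewritten clue. The Python sketch below hard-codes the inner answer from the example above and uses approximate capital-city latitudes as a stand-in for "farthest north"; both the lookup table and the function names are assumptions made for illustration.

```python
# Approximate latitudes (degrees north) of each capital, used as a rough proxy.
APPROX_CAPITAL_LATITUDE = {
    "Bhutan": 27.5,
    "Cuba": 23.1,
    "Iran": 35.7,
    "North Korea": 39.0,
}


def solve_inner_subclue():
    """Answer to the inner subclue, taken directly from the example in the text."""
    return ["Bhutan", "Cuba", "Iran", "North Korea"]


def solve_outer_subclue(countries, latitude=APPROX_CAPITAL_LATITUDE):
    """Outer subclue after substitution: of these countries, the farthest north."""
    return max(countries, key=lambda country: latitude[country])


inner_answer = solve_inner_subclue()
print(solve_outer_subclue(inner_answer))
```

In a real system each step would yield several hypotheses with confidences rather than one fixed answer, which is why the text speaks of generating zero or more decomposition hypotheses per question.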
Jeopardy also has categories of questions that require special processing defined by the category itself. Some of them recur often enough that contestants know what they mean without instruction; for others, part of the task is to figure out what the puzzle is as the clues and answers are revealed (categories requiring explanation by the host are not part of the challenge). Examples of well-known puzzle categories are the Before and After category, where two subclues have answers that overlap by (typically) one word, and the Rhyme Time category, where the two subclue answers must rhyme with one another. Clearly these cases also require question decomposition. For example:
Category: Before and After Goes to the Movies
Clue: Film of a typical day in the life of the Beatles, which includes running from bloodthirsty zombie fans in a Romero classic.
Subclue 1: Film of a typical day in the life of the Beatles.
Answer 1: (A Hard Day’s Night)
Subclue 2: Running from bloodthirsty zombie fans in a Romero classic.
Answer 2: (Night of the Living Dead)
Answer: A Hard Day’s Night of the Living Dead
Category: Rhyme Time
Clue: It’s where Pele stores his ball.
Subclue 1: Pele ball (soccer)
Subclue 2: where store (cabinet, drawer, locker, and so on)
Answer: soccer locker
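The final step of a Before and After clue, joining two subclue answers that overlap by a word, can be sketched as follows. This is an illustrative Python helper with a hypothetical name; it covers only the merge, not finding the two answers in the first place.

```python
def merge_before_and_after(first, second):
    """Join two answers where the end of `first` repeats the start of `second`.

    Tries the longest word overlap first and returns None if the answers
    do not overlap at all.
    """
    a, b = first.split(), second.split()
    for k in range(min(len(a), len(b)), 0, -1):
        tail = [word.lower() for word in a[-k:]]
        head = [word.lower() for word in b[:k]]
        if tail == head:
            return " ".join(a + b[k:])
    return None


print(merge_before_and_after("A Hard Day's Night", "Night of the Living Dead"))
```

The same check can also work as a filter: candidate pairs for the two subclues that return `None` cannot form a valid Before and After answer.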
There are many infrequent types of puzzle categories including things like converting roman numerals, solving math word problems, sounds like, finding which word in a set has the highest Scrabble score, homonyms and heteronyms, and so on. Puzzles constitute only about 2–3 percent of all clues, but since they typically occur as entire categories (five at a time) they cannot be ignored for success in the Challenge as getting them all wrong often means losing a game.
Excluded Question Types.
The Jeopardy quiz show ordinarily admits two kinds of questions that IBM and Jeopardy Productions, Inc., agreed to exclude from the computer contest: audiovisual (A/V) questions and Special Instructions questions. A/V questions require listening to or watching some sort of audio, image, or video segment to determine a correct answer. For example:
Category: Picture This
(Contestants are shown a picture of a B-52 bomber)
Clue: Alphanumeric name of the fearsome machine seen here.
Special instruction questions are those that are not “self-explanatory” but rather require a verbal explanation describing how the question should be interpreted and solved. For example:
Category: Decode the Postal Codes
Verbal instruction from host: We’re going to give you a word comprising two postal abbreviations; you have to identify the states.
Clue: Vain
Answer: Virginia and Indiana
Both present very interesting challenges from an AI perspective but were put out of scope for this contest and evaluation.
As a measure of the Jeopardy Challenge’s breadth of domain, we analyzed a random sample of 20,000 questions extracting the lexical answer type (LAT) when present. We define a LAT to be a word in the clue that indicates the type of the answer, independent of assigning semantics to that word. For example in the following clue, the LAT is the string “maneuver.”
Clue: Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.
Answer: Castling
About 12 percent of the clues do not indicate an explicit lexical answer type but may refer to the answer with pronouns like “it,” “these,” or “this” or not refer to it at all. In these cases the type of answer must be inferred by the context. Here’s an example:
Clue: Though it sounds “harsh,” it’s just embroidery, often in a floral pattern, done with yarn on cotton cloth.
The distribution of LATs has a very long tail, as shown in figure 1. We found 2500 distinct and explicit LATs in the 20,000 question sample. The most frequent 200 explicit LATs cover less than 50 percent of the data. Figure 1 shows the relative frequency of the LATs. It labels all the clues with no explicit type with the label “NA.” This aspect of the challenge implies that while task-specific type systems or manually curated data would have some impact if focused on the head of the LAT curve, it still leaves more than half the problems unaccounted for. Our clear technical bias for both business and scientific motivations is to create general-purpose, reusable natural language processing (NLP) and knowledge representation and reasoning (KRR) technology that can exploit as-is natural language resources and as-is structured knowledge rather than to curate task-specific knowledge resources.
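The head-versus-tail argument above comes down to one ratio: what share of clues the k most frequent LATs account for. The Python sketch below computes it on toy counts; the numbers are invented for illustration and are not the figures from the 20,000-question sample.

```python
from collections import Counter


def head_coverage(lat_counts, k):
    """Fraction of all clues covered by the k most frequent answer types."""
    total = sum(lat_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in lat_counts.most_common(k))
    return covered / total


# Toy counts, invented for illustration; "NA" marks clues with no explicit LAT.
toy_counts = Counter({"he": 4, "country": 3, "NA": 1, "film": 1, "maneuver": 1})
print(head_coverage(toy_counts, 2))
```

On a long-tailed distribution this ratio rises slowly as k grows, so curating resources for only the top types leaves a large share of clues uncovered.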
Figure 1. Lexical Answer Type Frequency.
In addition to question-answering precision, the system’s game-winning performance will depend on speed, confidence estimation, clue selection, and betting strategy. Ultimately the outcome of the public contest will be decided based on whether or not Watson can win one or two games against top-ranked humans in real time. The highest amount of money earned by the end of a one- or two-game match determines the winner. A player’s final earnings, however, often will not reflect how well the player did during the game at the QA task. This is because a player may decide to bet big on Daily Double or Final Jeopardy questions. There are three hidden Daily Double questions in a game that can affect only the player lucky enough to find them, and one Final Jeopardy question at the end that all players must gamble on. Daily Double and Final Jeopardy questions represent significant events where players may risk all their current earnings. While potentially compelling for a public contest, a small number of games does not represent statistically meaningful results for the system’s raw QA performance.
While Watson is equipped with betting strategies necessary for playing full Jeopardy, from a core QA perspective we want to measure correctness, confidence, and speed, without considering clue selection, luck of the draw, and betting strategies. We measure correctness and confidence using precision and percent answered. Precision measures the percentage of questions the system gets right out of those it chooses to answer. Percent answered is the percentage of questions it chooses to answer (correctly or incorrectly). The system chooses which questions to answer based on an estimated confidence score: for a given threshold, the system will answer all questions with confidence scores above that threshold. The threshold controls the trade-off between precision and percent answered, assuming reasonable confidence estimation. For higher thresholds the system will be more conservative, answering fewer questions with higher precision. For lower thresholds, it will be more aggressive, answering more questions with lower precision. Accuracy refers to the precision if all questions are answered.
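As a minimal sketch of these two metrics, the snippet below computes precision and percent answered at a given threshold from per-question confidence scores. The function name and the sample data are illustrative assumptions, not part of Watson.

```python
from typing import List, Tuple


def precision_and_percent_answered(
    results: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Evaluate one confidence threshold.

    `results` holds one (confidence, is_correct) pair per question.
    The system answers every question whose confidence exceeds the threshold.
    """
    answered = [correct for confidence, correct in results if confidence > threshold]
    if not answered:
        # Nothing attempted: precision is undefined, reported here as 0.0.
        return 0.0, 0.0
    precision = sum(answered) / len(answered)
    percent_answered = len(answered) / len(results)
    return precision, percent_answered


# Hypothetical results for five questions.
results = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]

conservative = precision_and_percent_answered(results, threshold=0.5)
answer_all = precision_and_percent_answered(results, threshold=0.0)
```

With the threshold at 0.0 every question is answered, so the reported precision is the accuracy as defined above.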
Figure 2 shows a plot of precision versus percent attempted curves for two theoretical systems. It is obtained by evaluating the two systems over a range of confidence thresholds. Both systems have 40 percent accuracy, meaning they get 40 percent of all questions correct. They differ only in their confidence estimation. The upper line represents an ideal system with perfect confidence estimation. Such a system would identify exactly which questions it gets right and wrong and give higher confidence to those it got right. As can be seen in the graph, if such a system were to answer the 50 percent of questions it had highest confidence for, it would get 80 percent of those correct. We refer to this level of performance as 80 percent precision at 50 percent answered. The lower line represents a system without meaningful confidence estimation. Since it cannot distinguish between which questions it is more or less likely to get correct, its precision is constant for all percent attempted. Developing more accurate confidence estimation means a system can deliver far higher precision even with the same overall accuracy.
Figure 2. Precision Versus Percentage Attempted.
Perfect confidence estimation (upper line) and no confidence estimation (lower line).
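The two theoretical curves follow from simple arithmetic. The sketch below, with function names that are assumptions made for illustration, reproduces the 80 percent precision at 50 percent answered figure for a system with 40 percent accuracy.

```python
def ideal_precision(accuracy: float, fraction_answered: float) -> float:
    """Precision under perfect confidence estimation.

    Every correct answer is ranked above every wrong one, so the system
    exhausts its correct answers before attempting any incorrect ones.
    """
    if not 0.0 < fraction_answered <= 1.0:
        raise ValueError("fraction_answered must be in (0, 1]")
    return min(1.0, accuracy / fraction_answered)


def uninformed_precision(accuracy: float, fraction_answered: float) -> float:
    """Precision when confidence carries no information: a flat line."""
    if not 0.0 < fraction_answered <= 1.0:
        raise ValueError("fraction_answered must be in (0, 1]")
    return accuracy


upper = ideal_precision(0.40, 0.50)       # 0.4 / 0.5
lower = uninformed_precision(0.40, 0.50)  # stays at the accuracy
```

Both functions agree at 100 percent answered, where precision equals accuracy regardless of how good the confidence estimation is.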
The Competition: Human Champion Performance
A compelling and scientifically appealing aspect of the Jeopardy Challenge is the human reference point. Figure 3 contains a graph that illustrates expert human performance on Jeopardy. It is based on our analysis of nearly 2000 historical Jeopardy games. Each point on the graph represents the performance of the winner in one Jeopardy game.2 As in figure 2, the x-axis of the graph, labeled “% Answered,” represents the percentage of questions the winner answered, and the y-axis of the graph, labeled “Precision,” represents the percentage of those questions the winner answered correctly.
Figure 3. Champion Human Performance at Jeopardy.
In contrast to the system evaluation shown in figure 2, which can display a curve over a range of confidence thresholds, the human performance shows only a single point per game based on the observed precision and percent answered the winner demonstrated in the game. A further distinction is that in these historical games the human contestants did not have the liberty to answer all questions they wished. Rather the percent answered consists of those questions for which the winner was confident and fast enough to beat the competition to the buzz. The system performance graphs shown in this paper are focused on evaluating QA performance, and so do not take into account competition for the buzz. Human performance helps to position our system’s performance, but obviously, in a Jeopardy game, performance will be affected by competition for the buzz and this will depend in large part on how quickly a player can compute an accurate confidence and how the player manages risk.
The center of what we call the “Winners Cloud” (the set of light gray dots in the graph in figures 3 and 4) reveals that Jeopardy champions are confident and fast enough to acquire on average between 40 percent and 50 percent of all the questions from their competitors and to perform with between 85 percent and 95 percent precision.
Figure 4. Baseline Performance.
The darker dots on the graph represent Ken Jennings’s games. Ken Jennings had an unequaled winning streak in 2004, in which he won 74 games in a row. Based on our analysis of those games, he acquired on average 62 percent of the questions and answered with 92 percent precision. Human performance at this task sets a very high bar for precision, confidence, speed, and breadth.
Our metrics and baselines are intended to give us confidence that new methods and algorithms are improving the system or to inform us when they are not so that we can adjust research priorities.
Our most obvious baseline is the QA system called Practical Intelligent Question Answering Technology (PIQUANT) (Prager, Chu-Carroll, and Czuba 2004), which had been under development at IBM Research by a four-person team for 6 years prior to taking on the Jeopardy Challenge. At the time it was among the top three to five Text Retrieval Conference (TREC) QA systems. Developed in part under the U.S. government AQUAINT program3 and in collaboration with external teams and universities, PIQUANT was a classic QA pipeline with state-of-the-art techniques aimed largely at the TREC QA evaluation (Voorhees and Dang 2005). PIQUANT performed in the 33 percent accuracy range in TREC evaluations. While the TREC QA evaluation allowed the use of the web, PIQUANT focused on question answering using local resources. A requirement of the Jeopardy Challenge is that the system be self-contained and does not link to live web search.
The requirements of the TREC QA evaluation differed from those of the Jeopardy Challenge. Most notably, TREC participants were given a relatively small corpus (1M documents) from which answers to questions had to be justified; TREC questions were in a much simpler form than Jeopardy questions; and the confidences associated with answers were not a primary metric. Furthermore, the systems were allowed to access the web and had a week to produce results for 500 questions. The reader can find details in the TREC proceedings4 and numerous follow-on publications.
An initial 4-week effort was made to adapt PIQUANT to the Jeopardy Challenge. The experiment focused on precision and confidence. It ignored issues of answering speed and aspects of the game like betting and clue values.
The questions used were 500 randomly sampled Jeopardy clues from episodes in the past 15 years. The corpus that was used contained, but did not necessarily justify, answers to more than 90 percent of the questions. The result of the PIQUANT baseline experiment is illustrated in figure 4. As shown, on the 5 percent of the clues that PIQUANT was most confident in (left end of the curve), it delivered 47 percent precision, and over all the clues in the set (right end of the curve), its precision was 13 percent. Clearly the precision and confidence estimation are far below the requirements of the Jeopardy Challenge.
A similar baseline experiment was performed in collaboration with Carnegie Mellon University (CMU) using OpenEphyra,5 an open-source QA framework developed primarily at CMU. The framework is based on the Ephyra system, which was designed for answering TREC questions. In our experiments on TREC 2002 data, OpenEphyra answered 45 percent of the questions correctly using a live web search.
We spent minimal effort adapting OpenEphyra, but like PIQUANT, its performance on Jeopardy clues was below 15 percent accuracy. OpenEphyra did not produce reliable confidence estimates and thus could not effectively choose to answer questions with higher confidence. Clearly a larger investment in tuning and adapting these baseline systems to Jeopardy would improve their performance; however, we limited this investment since we did not want the baseline systems to become significant efforts.
The PIQUANT and OpenEphyra baselines demonstrate the performance of state-of-the-art QA systems on the Jeopardy task. In figure 5 we show two other baselines that demonstrate the performance of two complementary approaches on this task. The light gray line shows the performance of a system based purely on text search, using terms in the question as queries and search engine scores as confidences for candidate answers generated from retrieved document titles. The black line shows the performance of a system based on structured data, which attempts to look the answer up in a database by simply finding the named entities in the database related to the named entities in the clue. These two approaches were adapted to the Jeopardy task, including identifying and integrating relevant content.
Figure 5. Text Search Versus Knowledge Base Search.
The results form an interesting comparison. The search-based system has better performance at 100 percent answered, suggesting that the natural language content and the shallow text search techniques delivered better coverage. However, the flatness of the curve indicates the lack of accurate confidence estimation.6 The structured approach had better informed confidence when it was able to decipher the entities in the question and found the right matches in its structured knowledge bases, but its coverage quickly drops off when asked to answer more questions. To be a high-performing question-answering system, DeepQA must demonstrate both these properties to achieve high precision, high recall, and an accurate confidence estimation.
The DeepQA Approach
Early on in the project, attempts to adapt PIQUANT (Chu-Carroll et al. 2003) failed to produce promising results. We devoted many months of effort to encoding algorithms from the literature. Our investigations ran the gamut from deep logical form analysis to shallow machine-translation-based approaches. We integrated them into the standard QA pipeline that went from question analysis and answer type determination to search and then answer selection. It was difficult, however, to find examples of how published research results could be taken out of their original context and effectively replicated and integrated into different end-to-end systems to produce comparable results. Our efforts failed to have significant impact on Jeopardy or even on prior baseline studies using TREC data.
We ended up overhauling nearly everything we did, including our basic technical approach, the underlying architecture, metrics, evaluation protocols, engineering practices, and even how we worked together as a team. We also, in cooperation with CMU, began the Open Advancement of Question Answering (OAQA) initiative. OAQA is intended to directly engage researchers in the community to help replicate and reuse research results and to identify how to more rapidly advance the state of the art in QA (Ferrucci et al. 2009).
As our results dramatically improved, we observed that system-level advances allowing rapid integration and evaluation of new ideas and new components against end-to-end metrics were essential to our progress. This was echoed at the OAQA workshop for experts with decades of investment in QA, hosted by IBM in early 2008. Among the workshop conclusions was that QA would benefit from the collaborative evolution of a single extensible architecture that would allow component results to be consistently evaluated in a common technical context against a growing variety of what were called “Challenge Problems.” Different challenge problems were identified to address various dimensions of the general QA problem. Jeopardy was described as one addressing dimensions including high precision, accurate confidence determination, complex language, breadth of domain, and speed.
The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. What is far more important than any particular technique we use is how we combine them in DeepQA such that overlapping approaches can bring their strengths to bear and contribute to improvements in accuracy, confidence, or speed.
DeepQA is an architecture with an accompanying methodology, but it is not specific to the Jeopardy Challenge. We have successfully applied DeepQA to both the Jeopardy and TREC QA tasks. We have begun adapting it to different business applications and additional exploratory challenge problems including medicine, enterprise search, and gaming.
The overarching principles in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and integration of shallow and deep knowledge.
Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.
Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.
Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores.
Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.
Figure 6 illustrates the DeepQA architecture at a very high level. The remaining parts of this section provide a bit more detail about the various architectural roles.
Figure 6. DeepQA High-Level Architecture.
The first step in any application of DeepQA to solve a QA problem is content acquisition, or identifying and gathering the content to use for the answer and evidence sources shown in figure 6.
Content acquisition is a combination of manual and automatic steps. The first step is to analyze example questions from the problem space to produce a description of the kinds of questions that must be answered and a characterization of the application domain. Analyzing example questions is primarily a manual task, while domain analysis may be informed by automatic or statistical analyses, such as the LAT analysis shown in figure 1. Given the kinds of questions and broad domain of the Jeopardy Challenge, the sources for Watson include a wide range of encyclopedias, dictionaries, thesauri, newswire articles, literary works, and so on.
Given a reasonable baseline corpus, DeepQA then applies an automatic corpus expansion process. The process involves four high-level steps: (1) identify seed documents and retrieve related documents from the web; (2) extract self-contained text nuggets from the related web documents; (3) score the nuggets based on whether they are informative with respect to the original seed document; and (4) merge the most informative nuggets into the expanded corpus. The live system itself uses this expanded corpus and does not have access to the web during play.
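The four steps can be sketched as a small pipeline. This is a toy illustration only: the paragraph-based nugget splitting, the term-overlap informativeness score, the thresholds, and the `retrieve_related` callback are all assumptions made here, not the methods used in DeepQA.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_nuggets(document: str) -> List[str]:
    # Step 2 (simplified): treat each paragraph as a self-contained nugget.
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def informativeness(nugget: str, seed: str) -> float:
    # Step 3 (toy score): share of the nugget's terms that also occur in the seed.
    nugget_terms = tokenize(nugget)
    if not nugget_terms:
        return 0.0
    return len(nugget_terms & tokenize(seed)) / len(nugget_terms)


def expand_corpus(
    seed: str,
    retrieve_related: Callable[[str], List[str]],
    corpus: List[str],
    min_score: float = 0.3,
    top_n: int = 2,
) -> List[str]:
    related_docs = retrieve_related(seed)                          # step 1
    nuggets = [n for d in related_docs for n in split_nuggets(d)]  # step 2
    scored = sorted(((informativeness(n, seed), n) for n in nuggets), reverse=True)
    selected = [n for score, n in scored[:top_n]
                if score >= min_score and n not in corpus]         # step 3
    return corpus + selected                                       # step 4


# Hypothetical seed document and a stubbed "web" retrieval function.
seed_doc = "Cervantes wrote Don Quixote in Spain"


def fake_retrieve(seed: str) -> List[str]:
    return ["Cervantes worked as a tax collector in Spain.\n\n"
            "Buy cheap tickets online today."]


expanded = expand_corpus(seed_doc, fake_retrieve, [seed_doc])
```

Because expansion happens offline, the resulting list can be indexed once and served without any web access at run time.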
In addition to the content for the answer and evidence sources, DeepQA leverages other kinds of semistructured and structured content. Another step in the content-acquisition process is to identify and collect these resources, which include databases, taxonomies, and ontologies, such as dbPedia,7 WordNet (Miller 1995), and the Yago8 ontology.
The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system. The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering. Most of these technologies are well understood and are not discussed here, but a few require some elaboration.
Question classification is the task of identifying question types or parts of questions that require special processing. This can include anything from single words with potentially double meanings to entire clauses that have certain syntactic, semantic, or rhetorical functionality that may inform downstream components with their analysis. Question classification may identify a question as a puzzle question, a math question, a definition question, and so on. It will identify puns, constraints, definition components, or entire subclues within questions.
Focus and LAT Detection.
As discussed earlier, a lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics. Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors. An advantage of the DeepQA approach is that it can exploit many independently developed answer-typing algorithms. However, many of these algorithms are dependent on their own type systems. We found the best way to integrate preexisting components is not to force them into a single, common type system, but to have them map from the LAT to their own internal types.
The focus of the question is the part of the question that, if replaced by the answer, makes the question a stand-alone statement. Looking back at some of the examples shown previously, the focus of “When hit by electrons, a phosphor gives off electromagnetic energy in this form” is “this form”; the focus of “Secretary Chase just submitted this to me for the third time; guess what, pal. This time I’m accepting it” is the first “this”; and the focus of “This title character was the crusty and tough city editor of the Los Angeles Tribune” is “This title character.” The focus often (but not always) contains useful information about the answer, is often the subject or object of a relation in the clue, and can turn a question into a factual statement when replaced with a candidate, which is a useful way to gather evidence about a candidate.
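Replacing the focus with a candidate can be illustrated with a plain string substitution. This is a sketch only: the function name is an assumption, and a real system would locate the focus through parsing rather than exact string matching.

```python
def substitute_focus(clue: str, focus: str, candidate: str) -> str:
    """Turn a clue into a stand-alone statement about one candidate answer.

    Only the first occurrence is replaced, which matches clues where the
    focus is the first "this" and later occurrences refer back to it.
    """
    if focus not in clue:
        raise ValueError("focus does not occur in the clue")
    return clue.replace(focus, candidate, 1)


clue = ("This title character was the crusty and tough city editor "
        "of the Los Angeles Tribune")
statement = substitute_focus(clue, "This title character", "Lou Grant")
```

The resulting statement can then be used as a query or compared against retrieved passages to gather evidence for or against the candidate.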
Most questions contain relations, whether they are syntactic subject-verb-object predicates or semantic relationships between entities. For example, in the question, “They’re the two states you could be reentering if you’re crossing Florida’s northern border,” we can detect the relation borders(Florida,?x,north).
Watson uses relation detection throughout the QA process, from focus and LAT determination, to passage and answer scoring. Watson can also use detected relations to query a triple store and directly generate candidate answers. Due to the breadth of relations in the Jeopardy domain and the variety of ways in which they are expressed, however, Watson’s current ability to effectively use curated databases to simply “look up” the answers is limited to fewer than 2 percent of the clues.
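To make the lookup idea concrete, the sketch below matches a detected relation such as borders(Florida,?x,north) against a tiny in-memory fact list. The tuple layout, the fact list, and the `match` helper are assumptions made for illustration; a production system would typically issue a query against a real triple store instead.

```python
from typing import List, Optional, Tuple

# (relation, subject, object, direction): a hypothetical layout for this example.
Fact = Tuple[str, str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]

FACTS: List[Fact] = [
    ("borders", "Florida", "Georgia", "north"),
    ("borders", "Florida", "Alabama", "north"),
    ("borders", "Georgia", "Tennessee", "north"),
]


def match(pattern: Pattern, facts: List[Fact]) -> List[Fact]:
    """Return the facts that agree with the pattern; None acts as a variable."""
    return [
        fact for fact in facts
        if all(p is None or p == value for p, value in zip(pattern, fact))
    ]


# borders(Florida, ?x, north): the object slot is the unknown ?x.
candidates = [fact[2] for fact in match(("borders", "Florida", None, "north"), FACTS)]
```

The lookup itself is trivial; as the text notes, the hard part is detecting the relation in the clue in the first place.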
Watson’s use of existing databases depends on the ability to analyze the question and detect the relations covered by the databases. In Jeopardy the broad domain makes it difficult to identify the most lucrative relations to detect. In 20,000 Jeopardy questions, for example, we found the distribution of Freebase9 relations to be extremely flat (figure 7). Roughly speaking, even achieving high recall on detecting the most frequent relations in the domain can at best help in about 25 percent of the questions, and the benefit of relation detection drops off fast with the less frequent relations. Broad-domain relation detection remains a major open area of research.
Figure 7. Approximate Distribution of the 50 Most Frequently Occurring Freebase Relations in 20,000 Randomly Selected Jeopardy Clues.
As discussed above, an important requirement driven by analysis of Jeopardy clues was the ability to handle questions that are better answered through decomposition. DeepQA uses rule-based deep parsing and statistical classification methods both to recognize whether questions should be decomposed and to determine how best to break them up into subquestions. The operating hypothesis is that the correct question interpretation and derived answer(s) will score higher after all the collected evidence and all the relevant algorithms have been considered. Even if the question did not need to be decomposed to determine an answer, this method can help improve the system’s overall answer confidence.
DeepQA solves parallel decomposable questions through application of the end-to-end QA system on each subclue and synthesizes the final answers by a customizable answer combination component. These processing paths are shown in medium gray in figure 6. DeepQA also supports nested decomposable questions through recursive application of the end-to-end QA system to the inner subclue and then to the outer subclue. The customizable synthesis components allow specialized synthesis algorithms to be easily plugged into a common framework.
Hypothesis generation takes the results of question analysis and produces candidate answers by searching the system’s sources and extracting answer-sized snippets from the search results. Each candidate answer plugged back into the question is considered a hypothesis, which the system has to prove correct with some degree of confidence.
We refer to search performed in hypothesis generation as “primary search” to distinguish it from search performed during evidence gathering (described below). As with all aspects of DeepQA, we use a mixture of different approaches for primary search and candidate generation in the Watson system.
In primary search the goal is to find as much potentially answer-bearing content as possible based on the results of question analysis — the focus is squarely on recall with the expectation that the host of deeper content analytics will extract answer candidates and score this content plus whatever evidence can be found in support or refutation of candidates to drive up the precision. Over the course of the project we continued to conduct empirical studies designed to balance speed, recall, and precision. These studies allowed us to regularly tune the system to find the number of search results and candidates that produced the best balance of accuracy and computational resources. The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top 250 candidates; that is, the system generates the correct answer as a candidate answer for 85 percent of the questions somewhere within the top 250 ranked candidates.
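Binary recall at a cutoff can be computed as in the following sketch. The function name and the toy data are assumptions, and exact string equality stands in for whatever answer-matching a real evaluation would use.

```python
from typing import List


def binary_recall_at_k(
    ranked_candidates: List[List[str]], gold_answers: List[str], k: int = 250
) -> float:
    """Fraction of questions whose correct answer appears in the top k candidates."""
    if len(ranked_candidates) != len(gold_answers):
        raise ValueError("need one candidate list per question")
    hits = sum(
        1 for candidates, gold in zip(ranked_candidates, gold_answers)
        if gold in candidates[:k]
    )
    return hits / len(gold_answers)


# Hypothetical candidate lists for three questions.
ranked = [
    ["Nixon", "Ford", "Agnew"],
    ["Bolivia", "Peru", "Argentina"],
    ["Thoreau", "Emerson"],
]
gold = ["Nixon", "Argentina", "Cervantes"]

recall_top2 = binary_recall_at_k(ranked, gold, k=2)
recall_top3 = binary_recall_at_k(ranked, gold, k=3)
```

The metric is "binary" because each question counts as either a hit or a miss, regardless of where in the top k the correct answer appears.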
A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.
Triple store queries in primary search are based either on named entities in the clue (for example, find all database entities related to the clue entities) or, in cases where a semantic relation was detected, on more focused queries. For a small number of LATs, which we identified as “closed LATs,” the candidate answer can be generated from a fixed list in some store of known instances of the LAT, such as “U.S. President” or “Country.”
Candidate Answer Generation.
The search results feed into candidate generation, where techniques appropriate to the kind of search results are applied to generate candidate answers. For document search results from “title-oriented” resources, the title is extracted as a candidate answer. The system may generate a number of candidate answer variants from the same title based on substring analysis or link analysis (if the underlying source contains hyperlinks). Passage search results require more detailed analysis of the passage text to identify candidate answers. For example, named entity detection may be used to extract candidate answers from the passage. Some sources, such as a triple store and reverse dictionary lookup, produce candidate answers directly as their search result.
If the correct answer(s) are not generated at this stage as a candidate, the system has no hope of answering the question. This step therefore significantly favors recall over precision, with the expectation that the rest of the processing pipeline will tease out the correct answer, even if the set of candidates is quite large. One of the goals of the system design, therefore, is to tolerate noise in the early stages of the pipeline and drive up precision downstream.
Watson generates several hundred candidate answers at this stage.
A key step in managing the resource versus precision trade-off is the application of lightweight (less resource intensive) scoring algorithms to a larger set of initial candidates to prune them down to a smaller set of candidates before the more intensive scoring components see them. For example, a lightweight scorer may compute the likelihood of a candidate answer being an instance of the LAT. We call this step soft filtering.
The system combines these lightweight analysis scores into a soft filtering score. Candidate answers that pass the soft filtering threshold proceed to hypothesis and evidence scoring, while those candidates that do not pass the filtering threshold are routed directly to the final merging stage. The soft filtering scoring model and filtering threshold are determined based on machine learning over training data.
Watson currently lets roughly 100 candidates pass the soft filter, but this is a parameterizable function.
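A minimal sketch of soft filtering follows. The feature names, weights, bias, and threshold are invented for illustration; in the system described here both the scoring model and the threshold are learned from training data.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical learned parameters of a logistic scoring model.
WEIGHTS = {"lat_match": 2.0, "search_rank": 1.0}
BIAS = -1.5


def soft_filter_score(features: Dict[str, float]) -> float:
    """Combine lightweight feature scores into one score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def soft_filter(
    candidates: Dict[str, Dict[str, float]], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split candidates into those sent to deep scoring and those that bypass it."""
    to_deep_scoring, to_final_merging = [], []
    for answer, features in candidates.items():
        if soft_filter_score(features) >= threshold:
            to_deep_scoring.append(answer)
        else:
            # Not discarded: routed directly to the final merging stage.
            to_final_merging.append(answer)
    return to_deep_scoring, to_final_merging


candidates = {
    "Nixon": {"lat_match": 0.9, "search_rank": 0.8},
    "1974": {"lat_match": 0.05, "search_rank": 0.6},
}
deep, bypass = soft_filter(candidates, threshold=0.5)
```

The filter is "soft" in that rejected candidates still reach final merging; they simply skip the expensive evidence-scoring stage.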
Hypothesis and Evidence Scoring
Candidate answers that pass the soft filtering threshold undergo a rigorous evaluation process that involves gathering additional supporting evidence for each candidate answer, or hypothesis, and applying a wide variety of deep scoring analytics to evaluate the supporting evidence.
To better evaluate each candidate answer that passes the soft filter, the system gathers additional supporting evidence. The architecture supports the integration of a variety of evidence-gathering techniques. One particularly effective technique is passage search where the candidate answer is added as a required term to the primary search query derived from the question. This will retrieve passages that contain the candidate answer used in the context of the original question terms. Supporting evidence may also come from other sources like triple stores. The retrieved supporting evidence is routed to the deep evidence scoring components, which evaluate the candidate answer in the context of the supporting evidence.
The scoring step is where the bulk of the deep content analysis is performed. Scoring algorithms determine the degree of certainty that retrieved evidence supports the candidate answers. The DeepQA framework supports and encourages the inclusion of many different components, or scorers, that consider different dimensions of the evidence and produce a score that corresponds to how well evidence supports a candidate answer for a given question.
DeepQA provides a common format for the scorers to register hypotheses (for example candidate answers) and confidence scores, while imposing few restrictions on the semantics of the scores themselves; this enables DeepQA developers to rapidly deploy, mix, and tune components to support each other. For example, Watson employs more than 50 scoring components that produce scores ranging from formal probabilities to counts to categorical features, based on evidence from different types of sources including unstructured text, semistructured text, and triple stores. These scorers consider things like the degree of match between a passage’s predicate-argument structure and the question, passage source reliability, geospatial location, temporal relationships, taxonomic classification, the lexical and semantic relations the candidate is known to participate in, the candidate’s correlation with question terms, its popularity (or obscurity), its aliases, and so on.
Consider the question, “He was presidentially pardoned on September 8, 1974”; the correct answer, “Nixon,” is one of the generated candidates. One of the retrieved passages is “Ford pardoned Nixon on Sept. 8, 1974.” One passage scorer counts the number of IDF-weighted terms in common between the question and the passage. Another passage scorer, based on the Smith-Waterman sequence-matching algorithm (Smith and Waterman 1981), measures the lengths of the longest similar subsequences between the question and passage (for example “on Sept. 8, 1974”). A third type of passage scoring measures the alignment of the logical forms of the question and passage. A logical form is a graphical abstraction of text in which nodes are terms in the text and edges represent either grammatical relationships (for example, Hermjakob, Hovy, and Lin; Moldovan et al.), deep semantic relationships (for example, Lenat; Paritosh and Forbus), or both. The logical form alignment identifies Nixon as the object of the pardoning in the passage, and that the question is asking for the object of a pardoning. Logical form alignment gives “Nixon” a good score given this evidence. In contrast, a candidate answer like “Ford” would receive near identical scores to “Nixon” for term matching and passage alignment with this passage, but would receive a lower logical form alignment score.
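The first two passage scorers can be sketched as follows. This is a simplified illustration under stated assumptions: the tokenizer, the smoothed IDF formula, the tiny background corpus, and the match, mismatch, and gap scores are choices made here, and the alignment works on whole tokens rather than characters.

```python
import math
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(documents: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency over a background corpus."""
    n = len(documents)
    doc_freq: Dict[str, int] = {}
    for doc in documents:
        for term in set(tokenize(doc)):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((n + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}


def idf_overlap(question: str, passage: str, idf: Dict[str, float],
                default_idf: float = 1.0) -> float:
    """Sum of IDF weights of the terms shared by question and passage."""
    shared = set(tokenize(question)) & set(tokenize(passage))
    return sum(idf.get(term, default_idf) for term in shared)


def smith_waterman(a: List[str], b: List[str],
                   match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between two token sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


question = "He was presidentially pardoned on September 8, 1974"
passage = "Ford pardoned Nixon on Sept. 8, 1974."
# Hypothetical background corpus used only to estimate IDF weights.
corpus = [passage, "Ford was on a plane on Tuesday.", "Nixon resigned in 1974."]

idf = build_idf(corpus)
term_score = idf_overlap(question, passage, idf)
alignment_score = smith_waterman(tokenize(question), tokenize(passage))
```

Neither score depends on which candidate is being evaluated, so this passage supports “Ford” and “Nixon” equally under both; separating them requires something like the logical form alignment described above.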
Another type of scorer uses knowledge in triple stores, simple reasoning such as subsumption and disjointness in type taxonomies, and geospatial and temporal reasoning. Geospatial reasoning is used in Watson to detect the presence or absence of spatial relations such as directionality, borders, and containment between geoentities. For example, if a question asks for an Asian city, then spatial containment provides evidence that Beijing is a suitable candidate, whereas Sydney is not. Similarly, geocoordinate information associated with entities is used to compute relative directionality (for example, California is SW of Montana; GW Bridge is N of Lincoln Tunnel, and so on).
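The two geospatial checks can be sketched with a lookup table and a bearing calculation. The coordinates below are rough state centroids chosen for illustration, the containment table is hypothetical, and the flat-earth approximation is an assumption that is only reasonable over moderate distances.

```python
import math
from typing import Tuple

COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

# Hypothetical containment facts; a real system would consult a gazetteer.
CONTINENT_OF = {"Beijing": "Asia", "Sydney": "Australia"}


def contained_in(city: str, region: str) -> bool:
    return CONTINENT_OF.get(city) == region


def relative_direction(a: Tuple[float, float], b: Tuple[float, float]) -> str:
    """Compass direction of point a as seen from point b.

    Points are (latitude, longitude) in degrees. East-west offsets are
    scaled by the cosine of the mean latitude before taking the bearing.
    """
    (lat_a, lon_a), (lat_b, lon_b) = a, b
    north = lat_a - lat_b
    east = (lon_a - lon_b) * math.cos(math.radians((lat_a + lat_b) / 2))
    bearing = math.degrees(math.atan2(east, north)) % 360
    return COMPASS[int((bearing + 22.5) // 45) % 8]


# Approximate centroids (assumed values for this example).
california = (36.8, -119.4)
montana = (46.9, -110.4)

direction = relative_direction(california, montana)
```

A scorer built on these checks would emit a feature such as "spatial constraint satisfied" or "spatial constraint violated" rather than a final answer.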
Temporal reasoning is used in Watson to detect inconsistencies between dates in the clue and those associated with a candidate answer. For example, the two most likely candidate answers generated by the system for the clue, “In 1594 he took a job as a tax collector in Andalusia,” are “Thoreau” and “Cervantes.” In this case, temporal reasoning is used to rule out Thoreau as he was not alive in 1594, having been born in 1817, whereas Cervantes, the correct answer, was born in 1547 and died in 1616.
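The temporal check in this example reduces to an interval test. In the sketch below the lifespan table is hard-coded for the two candidates, which is an assumption made for illustration; a real scorer would draw the dates from structured sources.

```python
from typing import Dict, Tuple

# (birth year, death year) for the two candidates discussed above.
LIFESPANS: Dict[str, Tuple[int, int]] = {
    "Thoreau": (1817, 1862),
    "Cervantes": (1547, 1616),
}


def temporally_consistent(candidate: str, clue_year: int) -> bool:
    """True if the candidate was alive in the year mentioned by the clue."""
    born, died = LIFESPANS[candidate]
    return born <= clue_year <= died


surviving = [name for name in LIFESPANS if temporally_consistent(name, 1594)]
```

In keeping with pervasive confidence estimation, such a check would normally contribute a feature score to the candidate rather than eliminate it outright.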
Each of the scorers implemented in Watson, how they work, how they interact, and their independent impact on Watson’s performance deserves its own research paper. We cannot do this work justice here. It is important to note, however, that at this point no one algorithm dominates. In fact we believe DeepQA’s facility for absorbing these algorithms, and the tools we have created for exploring their interactions and effects, will represent an important and lasting contribution of this work.
To help developers and users get a sense of how Watson uses evidence to decide between competing candidate answers, scores are combined into an overall evidence profile. The evidence profile groups individual features into aggregate evidence dimensions that provide a more intuitive view of the feature group. Aggregate evidence dimensions might include, for example, Taxonomic, Geospatial (location), Temporal, Source Reliability, Gender, Name Consistency, Relational, Passage Support, Theory Consistency, and so on. Each aggregate dimension is a combination of related feature scores produced by the specific algorithms that fired on the gathered evidence.
Consider the following question: Chile shares its longest land border with this country. In figure 8 we see a comparison of the evidence profiles for two candidate answers produced by the system for this question: Argentina and Bolivia. Simple search engine scores favor Bolivia as an answer, due to a popular border dispute that was frequently reported in the news. Watson prefers Argentina (the correct answer) over Bolivia, and the evidence profile shows why. Although Bolivia does have strong popularity scores, Argentina has strong support in the geospatial, passage support (for example, alignment and logical form graph matching of various text passages), and source reliability dimensions.
Figure 8. Evidence Profiles for Two Candidate Answers.
Dimensions are on the x-axis and relative strength is on the y-axis.
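The grouping of feature scores into aggregate dimensions can be sketched as follows. This is a minimal Python illustration: the feature names, the feature-to-dimension mapping, the use of a plain average as the combination rule, and all numeric scores are assumptions invented for the example, not values from Watson.

```python
from collections import defaultdict

# Hypothetical mapping from individual feature names to aggregate dimensions.
FEATURE_TO_DIMENSION = {
    "border_length_match": "Geospatial",
    "spatial_containment": "Geospatial",
    "passage_alignment": "Passage Support",
    "logical_form_match": "Passage Support",
    "search_rank": "Popularity",
    "source_trust": "Source Reliability",
}


def evidence_profile(feature_scores):
    """Average the feature scores that fall under each aggregate dimension."""
    grouped = defaultdict(list)
    for feature, score in feature_scores.items():
        grouped[FEATURE_TO_DIMENSION[feature]].append(score)
    return {dim: sum(vals) / len(vals) for dim, vals in grouped.items()}


# Made-up scores for the two candidates in the Chile border question.
ARGENTINA = {
    "border_length_match": 1.0, "spatial_containment": 0.5,
    "passage_alignment": 0.75, "logical_form_match": 0.25,
    "search_rank": 0.25, "source_trust": 0.75,
}
BOLIVIA = {
    "border_length_match": 0.5, "spatial_containment": 0.5,
    "passage_alignment": 0.25, "logical_form_match": 0.25,
    "search_rank": 1.0, "source_trust": 0.5,
}

arg, bol = evidence_profile(ARGENTINA), evidence_profile(BOLIVIA)
# Dimensions on which Argentina has the stronger evidence.
stronger = sorted(dim for dim in arg if arg[dim] > bol[dim])
print(stronger)
```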
Final Merging and Ranking
It is one thing to return documents that contain key words from the question. It is quite another, however, to analyze the question and the content enough to identify the precise answer and yet another to determine an accurate enough confidence in its correctness to bet on it. Winning at Jeopardy requires exactly that ability.
The goal of final ranking and merging is to evaluate the hundreds of hypotheses based on potentially hundreds of thousands of scores to identify the single best-supported hypothesis given the evidence and to estimate its confidence — the likelihood it is correct.
Multiple candidate answers for a question may be equivalent despite very different surface forms. This is particularly confusing to ranking techniques that make use of relative differences between candidates. Without merging, ranking algorithms would be comparing multiple surface forms that represent the same answer and trying to discriminate among them. While one line of research has been proposed based on boosting confidence in similar candidates (Ko, Nyberg, and Luo 2007), our approach is inspired by the observation that different surface forms are often disparately supported in the evidence and result in radically different, though potentially complementary, scores. This motivates an approach that merges answer scores before ranking and confidence estimation. Using an ensemble of matching, normalization, and coreference resolution algorithms, Watson identifies equivalent and related hypotheses (for example, Abraham Lincoln and Honest Abe) and then enables custom merging per feature to combine scores.
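A minimal Python sketch of merging before ranking is shown below. It assumes a hand-written alias table in place of the ensemble of matching, normalization, and coreference algorithms, and the feature names and per-feature merge functions are hypothetical choices made for the example.

```python
# Hypothetical alias table standing in for the matching, normalization,
# and coreference ensemble described in the text.
CANONICAL = {
    "abraham lincoln": "Abraham Lincoln",
    "honest abe": "Abraham Lincoln",
}

# Assumed per-feature merge functions; unlisted features fall back to max.
MERGERS = {"passage_support": max, "popularity": sum}


def merge_candidates(candidates):
    """Group surface forms by canonical answer and merge scores per feature."""
    collected = {}
    for surface, scores in candidates:
        key = CANONICAL.get(surface.lower(), surface)
        for feature, value in scores.items():
            collected.setdefault(key, {}).setdefault(feature, []).append(value)
    return {
        answer: {
            feature: MERGERS.get(feature, max)(values)
            for feature, values in features.items()
        }
        for answer, features in collected.items()
    }


candidates = [
    ("Abraham Lincoln", {"passage_support": 0.5, "popularity": 0.25}),
    ("Honest Abe", {"passage_support": 0.75, "popularity": 0.5}),
    ("Ulysses S. Grant", {"passage_support": 0.25, "popularity": 0.25}),
]
merged = merge_candidates(candidates)
print(merged)
```

After merging, the ranker compares two hypotheses instead of three, and the Lincoln hypothesis carries the complementary evidence gathered under both surface forms.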
Ranking and Confidence Estimation
After merging, the system must rank the hypotheses and estimate confidence based on their merged scores. We adopted a machine-learning approach that requires running the system over a set of training questions with known answers and training a model based on the scores. One could assume a very flat model and apply existing ranking algorithms (for example, Herbrich, Graepel, and Obermayer 2000; Joachims 2002) directly to these score profiles and use the ranking score for confidence. For more intelligent ranking, however, ranking and confidence estimation may be separated into two phases. In both phases sets of scores may be grouped according to their domain (for example, type matching, passage scoring, and so on) and intermediate models trained using ground truths and methods specific for that task. Using these intermediate models, the system produces an ensemble of intermediate scores. Motivated by hierarchical techniques such as mixture of experts (Jacobs et al. 1991) and stacked generalization (Wolpert 1992), a metalearner is trained over this ensemble. This approach allows for iteratively enhancing the system with more sophisticated and deeper hierarchical models while retaining flexibility for robustness and experimentation as scorers are modified and added to the system.
Watson’s metalearner uses multiple trained models to handle different question classes as, for instance, certain scores that may be crucial to identifying the correct answer for a factoid question may not be as useful on puzzle questions.
Finally, an important consideration in dealing with NLP-based scorers is that the features they produce may be quite sparse, and so accurate confidence estimation requires the application of confidence-weighted learning techniques (Dredze, Crammer, and Pereira 2008).
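The two-level arrangement described in this section (intermediate models per score group, with a metalearner on top) can be sketched in plain Python. Everything below is an assumption made for illustration: the group names and feature indices, the use of logistic regression at both levels, and the toy training data. For brevity the metalearner is trained on the same rows as the intermediate models, whereas stacked generalization proper would use held-out predictions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Logistic-regression output; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xj for w, xj in zip(weights[1:], x)))


def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical score groups: column indices of each domain's features.
GROUPS = {"type_matching": [0, 1], "passage_scoring": [2, 3]}


def intermediate_scores(intermediate, row):
    """One intermediate score per group, in GROUPS order."""
    return [predict(intermediate[name], [row[i] for i in idx])
            for name, idx in GROUPS.items()]


def train_stack(rows, labels):
    """Train one model per score group, then a metalearner over their outputs."""
    intermediate = {
        name: train_logistic([[r[i] for i in idx] for r in rows], labels)
        for name, idx in GROUPS.items()
    }
    # Simplification: reuses the training rows instead of held-out predictions.
    meta_rows = [intermediate_scores(intermediate, r) for r in rows]
    return intermediate, train_logistic(meta_rows, labels)


def confidence(model, row):
    intermediate, meta = model
    return predict(meta, intermediate_scores(intermediate, row))


# Toy training set: merged score vectors labeled 1 (correct) or 0 (incorrect).
ROWS = [[0.9, 0.8, 0.7, 0.9], [0.8, 0.9, 0.8, 0.6], [0.7, 0.6, 0.9, 0.8],
        [0.2, 0.1, 0.3, 0.2], [0.1, 0.3, 0.2, 0.1], [0.3, 0.2, 0.1, 0.3]]
LABELS = [1, 1, 1, 0, 0, 0]

model = train_stack(ROWS, LABELS)
strong = confidence(model, [0.85, 0.8, 0.8, 0.8])
weak = confidence(model, [0.15, 0.2, 0.2, 0.2])
print(strong > weak)
```

Because each group has its own intermediate model, a scorer can be added to or removed from one group without retraining the models of the other groups, which is one way to read the flexibility the text describes.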
Speed and Scaleout
DeepQA is developed using Apache UIMA, a framework implementation of the Unstructured Information Management Architecture (Ferrucci and Lally 2004). UIMA was designed to support interoperability and scaleout of text and multimodal analysis applications. All of the components in DeepQA are implemented as UIMA annotators. These are software components that analyze text and produce annotations or assertions about the text. Watson has evolved over time and the number of components in the system has reached into the hundreds. UIMA facilitated rapid component integration, testing, and evaluation.
Early implementations of Watson ran on a single processor where it took 2 hours to answer a single question. The DeepQA computation is embarrassingly parallel, however. UIMA-AS, part of Apache UIMA, enables the scaleout of UIMA applications using asynchronous messaging. We used UIMA-AS to scale Watson out over 2500 compute cores. UIMA-AS handles all of the communication, messaging, and queue management necessary using the open JMS standard. The UIMA-AS deployment of Watson enabled competitive run-time latencies in the 3–5 second range.
To preprocess the corpus and create fast run-time indices we used Hadoop. UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process.
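To make "embarrassingly parallel" concrete: each candidate can be scored without reference to any other, so the work splits cleanly across workers. The Python sketch below uses a local thread pool purely as a stand-in. It is not UIMA-AS, which distributes work across machines through asynchronous messaging, and the scorer is a hypothetical placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidate(candidate):
    """Placeholder scorer (hypothetical); a real one would analyze evidence."""
    return candidate, len(set(candidate.lower())) / len(candidate)


def score_all(candidates, workers=4):
    """Score every candidate independently; the work items share no state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score_candidate, candidates))


candidates = ["Argentina", "Bolivia", "Peru"]
parallel = score_all(candidates)
sequential = dict(map(score_candidate, candidates))
print(parallel == sequential)
```

Since no scorer depends on another's result, the parallel and sequential runs produce identical scores; only the elapsed time differs.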
Jeopardy demands strategic game play to match wits against the best human players. In a typical Jeopardy game, Watson faces the following strategic decisions: deciding whether to buzz in and attempt to answer a question, selecting squares from the board, and wagering on Daily Doubles and Final Jeopardy.
The workhorse of strategic decisions is the buzz-in decision, which is required for every non–Daily Double clue on the board. This is where DeepQA’s ability to accurately estimate its confidence in its answer is critical, and Watson considers this confidence along with other game-state factors in making the final determination whether to buzz. Another strategic decision, Final Jeopardy wagering, generally receives the most attention and analysis from those interested in game strategy, and there exists a growing catalogue of heuristics such as “Clavin’s Rule” or the “Two-Thirds Rule” (Dupee 1998) as well as identification of those critical score boundaries at which particular strategies may be used (by no means does this make it easy or rote; despite this attention, we have found evidence that contestants still occasionally make irrational Final Jeopardy bets). Daily Double betting turns out to be less studied but just as challenging since the player must consider opponents’ scores and predict the likelihood of getting the question correct just as in Final Jeopardy. After a Daily Double, however, the game is not over, so evaluation of a wager requires forecasting the effect it will have on the distant, final outcome of the game.
These challenges drove the construction of statistical models of players and games, game-theoretic analyses of particular game scenarios and strategies, and the development and application of reinforcement-learning techniques for Watson to learn its strategy for playing Jeopardy. Fortunately, moderate amounts of historical data are available to serve as training data for learning techniques. Even so, it requires extremely careful modeling and game-theoretic evaluation as the game of Jeopardy has incomplete information and uncertainty to model, critical score boundaries to recognize, and savvy, competitive players to account for. It is a game where one faulty strategic choice can lose the entire match.
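The role of confidence in the buzz-in decision can be illustrated with a deliberately simplified expected-value rule. This Python sketch is an assumption made for illustration, not Watson's strategy: it considers only the answer confidence and the clue value, and ignores opponents, score boundaries, and the other game-state factors mentioned above. The `min_gain` parameter is a hypothetical knob for making the rule more or less cautious.

```python
def expected_gain(confidence, clue_value):
    """Expected score change from buzzing: gain the value if right, lose it if wrong."""
    return confidence * clue_value - (1.0 - confidence) * clue_value


def should_buzz(confidence, clue_value, min_gain=0.0):
    """Buzz only when the expected gain exceeds `min_gain`."""
    return expected_gain(confidence, clue_value) > min_gain


print(expected_gain(0.75, 800))
print(should_buzz(0.75, 800))
print(should_buzz(0.25, 800))
```

With `min_gain` at zero the rule reduces to buzzing whenever confidence exceeds 50 percent, which is why an accurate confidence estimate matters more than a merely high-ranking answer.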
Status and Results
After approximately 3 years of effort by a core algorithmic team composed of 20 researchers and software engineers with a range of backgrounds in natural language processing, information retrieval, machine learning, computational linguistics, and knowledge representation and reasoning, we have driven the performance of DeepQA to operate within the winner’s cloud on the Jeopardy task, as shown in figure 9. Watson’s results illustrated in this figure were measured over blind test sets containing more than 2000 Jeopardy questions.
Figure 9. Watson’s Precision and Confidence Progress as of the Fourth Quarter 2009.
After many nonstarters, by the fourth quarter of 2007 we finally adopted the DeepQA architecture. At that point we had all moved out of our private offices and into a “war room” setting to dramatically facilitate team communication and tight collaboration. We instituted a host of disciplined engineering and experimental methodologies supported by metrics and tools to ensure we were investing in techniques that promised significant impact on end-to-end metrics. Since then, modulo some early jumps in performance, the progress has been incremental but steady. It is slowing in recent months as the remaining challenges prove either very difficult or highly specialized and covering small phenomena in the data.
By the end of 2008 we were performing reasonably well — about 70 percent precision at 70 percent attempted over the 12,000 question blind data, but it was taking 2 hours to answer a single question on a single CPU. We brought on a team specializing in UIMA and UIMA-AS to scale up DeepQA on a massively parallel high-performance computing platform. We are currently answering more than 85 percent of the questions in 5 seconds or less — fast enough to provide competitive performance, and with continued algorithmic development are performing with about 85 percent precision at 70 percent attempted.
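The metric "precision at N percent attempted" used here can be computed by ranking questions by confidence, keeping the most confident N percent, and measuring the fraction of those answered correctly. A minimal Python sketch follows; the function name and the ten sample results are invented for illustration.

```python
def precision_at_attempted(results, fraction):
    """Precision over the most confident `fraction` of questions.

    `results` is a list of (confidence, is_correct) pairs, one per question.
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    attempted = ranked[: max(1, round(fraction * len(ranked)))]
    return sum(1 for _, ok in attempted if ok) / len(attempted)


# Ten made-up questions: (confidence in the top answer, was it correct?).
RESULTS = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False), (0.75, True),
    (0.70, True), (0.60, True), (0.50, False), (0.40, True), (0.30, False),
]

print(precision_at_attempted(RESULTS, 0.7))  # 6 of the 7 most confident are correct
print(precision_at_attempted(RESULTS, 1.0))  # 7 of all 10 are correct
```

Attempting fewer questions raises precision only if the confidence estimates are well ordered, which is why the paper treats confidence estimation as being as important as answer accuracy.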
We have more to do in order to improve precision, confidence, and speed enough to compete with grand champions. We are finding great results in leveraging the DeepQA architecture capability to quickly admit and evaluate the impact of new algorithms as we engage more university partnerships to help meet the challenge.
An Early Adaptation Experiment
Another challenge for DeepQA has been to demonstrate if and how it can adapt to other QA tasks. In mid-2008, after we had populated the basic architecture with a host of components for searching, evidence retrieval, scoring, final merging, and ranking for the Jeopardy task, IBM collaborated with CMU to try to adapt DeepQA to the TREC QA problem by plugging in only select domain-specific components previously tuned to the TREC task. In particular, we added question-analysis components from PIQUANT and OpenEphyra that identify answer types for a question, and candidate answer-generation components that identify instances of those answer types in the text. The DeepQA framework utilized both sets of components despite their different type systems — no ontology integration was performed. The identification and integration of these domain specific components into DeepQA took just a few weeks.
The extended DeepQA system was applied to TREC questions. Some of DeepQA’s answer and evidence scorers are more relevant in the TREC domain than in the Jeopardy domain and others are less relevant. We addressed this aspect of adaptation for DeepQA’s final merging and ranking by training an answer-ranking model using TREC questions; thus the extent to which each score affected the answer ranking and confidence was automatically customized for TREC.
Figure 10 shows the results of the adaptation experiment. Both the 2005 PIQUANT and 2007 OpenEphyra systems had less than 50 percent accuracy on the TREC questions and less than 15 percent accuracy on the Jeopardy clues. The DeepQA system at the time had accuracy above 50 percent on Jeopardy. Without adaptation DeepQA’s accuracy on TREC questions was about 35 percent. After adaptation, DeepQA’s accuracy on TREC exceeded 60 percent. We repeated the adaptation experiment in 2010, and in addition to the improvements to DeepQA since 2008, the adaptation included a transfer learning step for TREC questions from a model trained on Jeopardy questions. DeepQA’s performance on TREC data was 51 percent accuracy prior to adaptation and 67 percent after adaptation, nearly level with its performance on blind Jeopardy data.
Figure 10. Accuracy on Jeopardy! and TREC.
The result performed significantly better than the original complete systems on the task for which they were designed. While just one adaptation experiment, this is exactly the sort of behavior we think an extensible QA system should exhibit. It should quickly absorb domain- or task-specific components and get better on that target task without degradation in performance in the general case or on prior tasks.
The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After 3 years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show.
Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.
The architecture and methodology developed as part of this project have highlighted the need to take a systems-level approach to research in QA, and we believe this applies to research in the broader field of AI. We have developed many different algorithms for addressing different kinds of problems in QA and plan to publish many of them in more detail in the future. However, no one algorithm solves challenge problems like this. End-to-end systems tend to involve many complex and often overlapping interactions. A system design and methodology that facilitated the efficient integration and ablation studies of many probabilistic components was essential for our success to date. The impact of any one algorithm on end-to-end performance changed over time as other techniques were added and had overlapping effects. Our commitment to regularly evaluate the effects of specific techniques on end-to-end performance, and to let that shape our research investment, was necessary for our rapid progress.
Rapid experimentation was another critical ingredient to our success. The team conducted more than 5500 independent experiments in 3 years — each averaging about 2000 CPU hours and generating more than 10 GB of error-analysis data. Without DeepQA’s massively parallel architecture and a dedicated high-performance computing infrastructure, we would not have been able to perform these experiments, and likely would not have even conceived of many of them.
Tuned for the Jeopardy Challenge, Watson has begun to compete against former Jeopardy players in a series of “sparring” games. It is holding its own, winning 64 percent of the games, but has to be improved and sped up to compete favorably against the very best.
We have leveraged our collaboration with CMU and with our other university partnerships in getting this far and hope to continue our collaborative work to drive Watson to its final goal, and help openly advance QA research.
We would like to acknowledge the talented team of research scientists and engineers at IBM and at partner universities, listed below, for the incredible work they are doing to influence and develop all aspects of Watson and the DeepQA architecture. It is this team who are responsible for the work described in this paper. From IBM, Andy Aaron, Einat Amitay, Branimir Boguraev, David Carmel, Arthur Ciccolo, Jaroslaw Cwiklik, Pablo Duboue, Edward Epstein, Raul Fernandez, Radu Florian, Dan Gruhl, Tong-Haing Fin, Achille Fokoue, Karen Ingraffea, Bhavani Iyer, Hiroshi Kanayama, Jon Lenchner, Anthony Levas, Burn Lewis, Michael McCord, Paul Morarescu, Matthew Mulholland, Yuan Ni, Miroslav Novak, Yue Pan, Siddharth Patwardhan, Zhao Ming Qiu, Salim Roukos, Marshall Schor, Dafna Sheinwald, Roberto Sicconi, Kohichi Takeda, Gerry Tesauro, Chen Wang, Wlodek Zadrozny, and Lei Zhang. From our academic partners, Manas Pathak (CMU), Chang Wang (University of Massachusetts [UMass]), Hideki Shima (CMU), James Allen (UMass), Ed Hovy (University of Southern California/Information Sciences Institute), Bruce Porter (University of Texas), Pallika Kanani (UMass), Boris Katz (Massachusetts Institute of Technology), Alessandro Moschitti and Giuseppe Riccardi (University of Trento), Barbara Cutler, Jim Hendler, and Selmer Bringsjord (Rensselaer Polytechnic Institute).
1. Watson is named after IBM’s founder, Thomas J. Watson.
2. Random jitter has been added to help visualize the distribution of points.
6. The dip at the left end of the light gray curve is due to the disproportionately high score the search engine assigns to short queries, which typically are not sufficiently discriminative to retrieve the correct answer in top position.
Chu-Carroll, J.; Czuba, K.; Prager, J. M.; and Ittycheriah, A. 2003. Two Heads Are Better Than One in Question-Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.
Dredze, M.; Crammer, K.; and Pereira, F. 2008. Confidence-Weighted Linear Classification. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML). Princeton, NJ: International Machine Learning Society.
Dupee, M. 1998. How to Get on Jeopardy! … and Win: Valuable Information from a Champion. Secaucus, NJ: Citadel Press.
Ferrucci, D., and Lally, A. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering 10(3–4): 327–348.
Ferrucci, D.; Nyberg, E.; Allan, J.; Barker, K.; Brown, E.; Chu-Carroll, J.; Ciccolo, A.; Duboue, P.; Fan, J.; Gondek, D.; Hovy, E.; Katz, B.; Lally, A.; McCord, M.; Morarescu, P.; Murdock, W.; Porter, B.; Prager, J.; Strzalkowski, T.; Welty, W.; and Zadrozny, W. 2009. Towards the Open Advancement of Question Answer Systems. IBM Technical Report RC24789, Yorktown Heights, NY.
Herbrich, R.; Graepel, T.; and Obermayer, K. 2000. Large Margin Rank Boundaries for Ordinal Regression. In Advances in Large Margin Classifiers, 115–132. Linköping, Sweden: Liu E-Press.
Hermjakob, U.; Hovy, E. H.; and Lin, C. 2000. Knowledge-Based Question Answering. In Proceedings of the Sixth World Multiconference on Systems, Cybernetics, and Informatics (SCI-2002). Winter Garden, FL: International Institute of Informatics and Systemics.
Hsu, F.-H. 2002. Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton, NJ: Princeton University Press.
Jacobs, R.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3(1): 79–87.
Joachims, T. 2002. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the Thirteenth ACM Conference on Knowledge Discovery and Data Mining (KDD). New York: Association for Computing Machinery.
Ko, J.; Nyberg, E.; and Luo Si, L. 2007. A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering. In Proceedings of the 30th Annual International ACM SIGIR Conference, 343–350. New York: Association for Computing Machinery.
Lenat, D. B. 1995. Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38(11): 33–38.
Maybury, Mark, ed. 2004. New Directions in Question-Answering. Menlo Park, CA: AAAI Press.
McCord, M. C. 1990. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In Natural Language and Logic: International Scientific Symposium. Lecture Notes in Computer Science 459. Berlin: Springer Verlag.
Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39–41.
Moldovan, D.; Clark, C.; Harabagiu, S.; and Maiorano, S. 2003. COGEX: A Logic Prover for Question Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.
Paritosh, P., and Forbus, K. 2005. Analysis of Strategic Knowledge in Back of the Envelope Reasoning. In Proceedings of the 20th AAAI Conference on Artificial Intelligence (AAAI-05). Menlo Park, CA: AAAI Press.
Prager, J. M.; Chu-Carroll, J.; and Czuba, K. 2004. A Multi-Strategy, Multi-Question Approach to Question Answering. In New Directions in Question-Answering, ed. M. Maybury. Menlo Park, CA: AAAI Press.
Simmons, R. F. 1970. Natural Language Question-Answering Systems: 1969. Communications of the ACM 13(1): 15–30.
Smith, T. F., and Waterman, M. S. 1981. Identification of Common Molecular Subsequences. Journal of Molecular Biology 147(1): 195–197.
Strzalkowski, T., and Harabagiu, S., eds. 2006. Advances in Open-Domain Question-Answering. Berlin: Springer.
Voorhees, E. M., and Dang, H. T. 2005. Overview of the TREC 2005 Question Answering Track. In Proceedings of the Fourteenth Text Retrieval Conference. Gaithersburg, MD: National Institute of Standards and Technology.
Wolpert, D. H. 1992. Stacked Generalization. Neural Networks 5(2): 241–259.
David Ferrucci is a research staff member and leads the Semantic Analysis and Integration department at the IBM T. J. Watson Research Center, Hawthorne, New York. Ferrucci is the principal investigator for the DeepQA/Watson project and the chief architect for UIMA, now an OASIS standard and Apache open-source project. Ferrucci’s background is in artificial intelligence and software engineering.
Eric Brown is a research staff member at the IBM T. J. Watson Research Center. His background is in information retrieval. Brown’s current research interests include question answering, unstructured information management architectures, and applications of advanced text analysis and question answering to information retrieval systems.
Jennifer Chu-Carroll is a research staff member at the IBM T. J. Watson Research Center. Chu-Carroll is on the editorial board of the Journal of Dialogue Systems, and previously served on the executive board of the North American Chapter of the Association for Computational Linguistics and as program cochair of HLT-NAACL 2006. Her research interests include question answering, semantic search, and natural language discourse and dialogue.
James Fan is a research staff member at IBM T. J. Watson Research Center. His research interests include natural language processing, question answering, and knowledge representation and reasoning. He has served as a program committee member for several top ranked AI conferences and journals, such as IJCAI and AAAI. He received his Ph.D. from the University of Texas at Austin in 2006.
David Gondek is a research staff member at the IBM T. J. Watson Research Center. His research interests include applications of machine learning, statistical modeling, and game theory to question answering and natural language processing. Gondek has contributed to journals and conferences in machine learning and data mining. He earned his Ph.D. in computer science from Brown University.
Aditya A. Kalyanpur is a research staff member at the IBM T. J. Watson Research Center. His primary research interests include knowledge representation and reasoning, natural language programming, and question answering. He has served on W3 working groups, as program cochair of an international semantic web workshop, and as a reviewer and program committee member for several AI journals and conferences. Kalyanpur completed his doctorate in AI and semantic web related research from the University of Maryland, College Park.
Adam Lally is a senior software engineer at IBM’s T. J. Watson Research Center. He develops natural language processing and reasoning algorithms for a variety of applications and is focused on developing scalable frameworks of NLP and reasoning systems. He is a lead developer and designer for the UIMA framework and architecture specification.
J. William Murdock is a research staff member at the IBM T. J. Watson Research Center. Before joining IBM, he worked at the United States Naval Research Laboratory. His research interests include natural-language semantics, analogical reasoning, knowledge-based planning, machine learning, and computational reflection. In 2001, he earned his Ph.D. in computer science from the Georgia Institute of Technology.
Eric Nyberg is a professor at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Nyberg’s research spans a broad range of text analysis and information retrieval areas, including question answering, search, reasoning, and natural language processing architectures, systems, and software engineering principles.
John Prager is a research staff member at the IBM T. J. Watson Research Center in Yorktown Heights, New York. His background includes natural-language based interfaces and semantic search, and his current interest is on incorporating user and domain models to inform question-answering. He is a member of the TREC program committee.
Nico Schlaefer is a Ph.D. student at the Language Technologies Institute in the School of Computer Science, Carnegie Mellon University and an IBM Ph.D. Fellow. His research focus is the application of machine learning techniques to natural language processing tasks. Schlaefer is the primary author of the OpenEphyra question answering system.
Chris Welty is a research staff member at the IBM Thomas J. Watson Research Center. His background is primarily in knowledge representation and reasoning. Welty’s current research focus is on hybridization of machine learning, natural language processing, and knowledge representation and reasoning in building AI systems.
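With the consent of @老师木, this special topic collects some of the “teacher’s” distinctive insights on AI/ML. If you insist on asking why I am specifically collecting these articles, my answer is this: in my personal opinion, most of his insights already match or even surpass those of the average professor in the field. And if you are curious enough to ask why this topic is called “Shedding the Fancy Clothes, Looking at Learning Naked” (褪去华衣 裸视学习), the answer is that these insights strip away some of the mystique surrounding AI/ML and let us view the discipline more objectively.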
Artificial intelligence (AI) is the intelligence of machines and robots and the branch of computer science that aims to create it. AI textbooks define the field as “the study and design of intelligent agents” where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success. John McCarthy, who coined the term in 1955, defines it as “the science and engineering of making intelligent machines.”
AI research is highly technical and specialized, deeply divided into subfields that often fail to communicate with each other. Some of the division is due to social and cultural factors: subfields have grown up around particular institutions and the work of individual researchers. AI research is also divided by several technical issues. There are subfields which are focused on the solution of specific problems, on one of several possible approaches, on the use of widely differing tools and towards the accomplishment of particular applications. The central problems of AI include such traits as reasoning, knowledge, planning, learning, communication, perception and the ability to move and manipulate objects. General intelligence (or “strong AI”) is still among the field’s long-term goals. Currently popular approaches include statistical methods, computational intelligence and traditional symbolic AI. There are an enormous number of tools used in AI, including versions of search and mathematical optimization, logic, methods based on probability and economics, and many others.
The field was founded on the claim that a central property of humans, intelligence—the sapience of Homo sapiens—can be so precisely described that it can be simulated by a machine. This raises philosophical issues about the nature of the mind and the ethics of creating artificial beings, issues which have been addressed by myth, fiction and philosophy since antiquity. Artificial intelligence has been the subject of optimism, but has also suffered setbacks and, today, has become an essential part of the technology industry, providing the heavy lifting for many of the most difficult problems in computer science.
Thinking machines and artificial beings appear in Greek myths, such as Talos of Crete, the bronze robot of Hephaestus, and Pygmalion’s Galatea. Human likenesses believed to have intelligence were built in every major civilization: animated cult images were worshipped in Egypt and Greece and humanoid automatons were built by Yan Shi, Hero of Alexandria and Al-Jazari. It was also widely believed that artificial beings had been created by Jābir ibn Hayyān, Judah Loew and Paracelsus. By the 19th and 20th centuries, artificial beings had become a common feature in fiction, as in Mary Shelley’s Frankenstein or Karel Čapek’s R.U.R. (Rossum’s Universal Robots). Pamela McCorduck argues that all of these are examples of an ancient urge, as she describes it, “to forge the gods”. Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence.
Mechanical or “formal” reasoning has been developed by philosophers and mathematicians since antiquity. The study of logic led directly to the invention of the programmable digital electronic computer, based on the work of mathematician Alan Turing and others. Turing’s theory of computation suggested that a machine, by shuffling symbols as simple as “0” and “1”, could simulate any conceivable (imaginable) act of mathematical deduction. This, along with concurrent discoveries in neurology, information theory and cybernetics, inspired a small group of researchers to begin to seriously consider the possibility of building an electronic brain.
The field of AI research was founded at a conference on the campus of Dartmouth College in the summer of 1956. The attendees, including John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon, became the leaders of AI research for many decades. They and their students wrote programs that were, to most people, simply astonishing: Computers were solving word problems in algebra, proving logical theorems and speaking English. By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense and laboratories had been established around the world. AI’s founders were profoundly optimistic about the future of the new field: Herbert Simon predicted that “machines will be capable, within twenty years, of doing any work a man can do” and Marvin Minsky agreed, writing that “within a generation … the problem of creating ‘artificial intelligence’ will substantially be solved”.
They had failed to recognize the difficulty of some of the problems they faced. In 1974, in response to the criticism of Sir James Lighthill and ongoing pressure from the US Congress to fund more productive projects, both the U.S. and British governments cut off all undirected exploratory research in AI. The next few years, when funding for projects was hard to find, would later be called the “AI winter”.
In the early 1980s, AI research was revived by the commercial success of expert systems, a form of AI program that simulated the knowledge and analytical skills of one or more human experts. By 1985 the market for AI had reached over a billion dollars. At the same time, Japan’s fifth generation computer project inspired the U.S. and British governments to restore funding for academic research in the field. However, beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, longer-lasting AI winter began.
In the 1990s and early 21st century, AI achieved its greatest successes, albeit somewhat behind the scenes. Artificial intelligence is used for logistics, data mining, medical diagnosis and many other areas throughout the technology industry. The success was due to several factors: the increasing computational power of computers (see Moore’s law), a greater emphasis on solving specific subproblems, the creation of new ties between AI and other fields working on similar problems, and a new commitment by researchers to solid mathematical methods and rigorous scientific standards.
On 11 May 1997, Deep Blue became the first computer chess-playing system to beat a reigning world chess champion, Garry Kasparov. In 2005, a Stanford robot won the DARPA Grand Challenge by driving autonomously for 131 miles along an unrehearsed desert trail. Two years later, a team from CMU won the DARPA Urban Challenge when their vehicle autonomously navigated 55 miles in an urban environment while responding to traffic hazards and obeying all traffic laws. In February 2011, in a Jeopardy! quiz show exhibition match, IBM’s question answering system, Watson, defeated the two greatest Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin.
The leading-edge definition of artificial intelligence research is changing over time. One pragmatic definition is: “AI research is that which computing scientists do not know how to do cost-effectively today.” For example, in 1956 optical character recognition (OCR) was considered AI, but today, sophisticated OCR software with a context-sensitive spell checker and grammar checker comes for free with most image scanners. No one would any longer consider already-solved computing science problems like OCR to be “artificial intelligence”.
Low-cost entertaining chess-playing software is commonly available for tablet computers. DARPA no longer provides significant funding for chess-playing computing system development. The Kinect, which provides a 3D body-motion interface for the Xbox 360, uses algorithms that emerged from lengthy AI research, but few consumers realize where the technology came from.
AI applications are no longer the exclusive domain of U.S. Department of Defense R&D, but are now commonplace consumer items and inexpensive intelligent toys.
In common usage, the term “AI” no longer seems to apply to off-the-shelf solved computing-science problems, which may have originally emerged out of years of AI research.
The general problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems. These consist of particular traits or capabilities that researchers would like an intelligent system to display. The traits described below have received the most attention.
Deduction, reasoning, problem solving
Early AI researchers developed algorithms that imitated the step-by-step reasoning that humans use when they solve puzzles or make logical deductions. By the late 1980s and ’90s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.
For difficult problems, most of these algorithms can require enormous computational resources – most experience a “combinatorial explosion”: the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size. The search for more efficient problem-solving algorithms is a high priority for AI research.
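To get a feel for how quickly a combinatorial explosion sets in, the short Python sketch below counts the candidate action sequences in a search where every state offers the same number of choices. The function name and the branching factor of 10 are illustrative assumptions, not figures from any particular AI system.

```python
def count_sequences(branching_factor: int, depth: int) -> int:
    """Number of distinct action sequences of exactly `depth` steps
    when every state offers `branching_factor` choices."""
    return branching_factor ** depth


# Assumed branching factor of 10: each extra step multiplies the work by 10.
for depth in (5, 10, 20):
    print(f"depth {depth:2d}: {count_sequences(10, depth):,} sequences")
```

Doubling the depth squares the number of sequences, which is why brute-force enumeration stops being practical beyond small problem sizes.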
Human beings solve most of their problems using fast, intuitive judgements rather than the conscious, step-by-step deduction that early AI research was able to model. AI has made some progress at imitating this kind of “sub-symbolic” problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning; neural net research attempts to simulate the structures inside the brain that give rise to this skill; statistical approaches to AI mimic the probabilistic nature of the human ability to guess.
Knowledge representation and knowledge engineering are central to AI research. Many of the problems machines are expected to solve will require extensive knowledge about the world. Among the things that AI needs to represent are: objects, properties, categories and relations between objects; situations, events, states and time; causes and effects; knowledge about knowledge (what we know about what other people know); and many other, less well researched domains. A representation of “what exists” is an ontology (borrowing a word from traditional philosophy), of which the most general are called upper ontologies.
Among the most difficult problems in knowledge representation are:
- Default reasoning and the qualification problem
- Many of the things people know take the form of “working assumptions.” For example, if a bird comes up in conversation, people typically picture an animal that is fist-sized, sings, and flies. None of these things is true of all birds. John McCarthy identified this problem in 1969 as the qualification problem: for any commonsense rule that AI researchers care to represent, there tend to be a huge number of exceptions. Almost nothing is simply true or false in the way that abstract logic requires. AI research has explored a number of solutions to this problem.
- The breadth of commonsense knowledge
- The number of atomic facts that the average person knows is astronomical. Research projects that attempt to build a complete knowledge base of commonsense knowledge (e.g., Cyc) require enormous amounts of laborious ontological engineering — they must be built, by hand, one complicated concept at a time. A major goal is to have the computer understand enough concepts to be able to learn by reading from sources like the internet, and thus be able to add to its own ontology.
- The subsymbolic form of some commonsense knowledge
- Much of what people know is not represented as “facts” or “statements” that they could express verbally. For example, a chess master will avoid a particular chess position because it “feels too exposed” or an art critic can take one look at a statue and instantly realize that it is a fake. These are intuitions or tendencies that are represented in the brain non-consciously and sub-symbolically. Knowledge like this informs, supports and provides a context for symbolic, conscious knowledge. As with the related problem of sub-symbolic reasoning, it is hoped that situated AI, computational intelligence, or statistical AI will provide ways to represent this kind of knowledge.
Intelligent agents must be able to set goals and achieve them. They need a way to visualize the future (they must have a representation of the state of the world and be able to make predictions about how their actions will change it) and be able to make choices that maximize the utility (or “value”) of the available choices.
In classical planning problems, the agent can assume that it is the only thing acting on the world and it can be certain what the consequences of its actions may be. However, if the agent is not the only actor, it must periodically ascertain whether the world matches its predictions and it must change its plan as this becomes necessary, requiring the agent to reason under uncertainty.
Unsupervised learning is the ability to find patterns in a stream of input. Supervised learning includes both classification and numerical regression. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression is the attempt to produce a function that describes the relationship between inputs and outputs and predicts how the outputs should change as the inputs change. In reinforcement learning the agent is rewarded for good responses and punished for bad ones. These can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.
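As a concrete illustration of regression, the sketch below fits a straight line to example input/output pairs using ordinary least squares. Least squares is only one of many regression techniques and is assumed here for simplicity; the function name and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the inputs, and how inputs and outputs vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training examples that happen to lie on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

The fitted function can then predict the output for an input it has not seen, which is the sense in which regression "predicts how the outputs should change as the inputs change."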
Natural language processing
Natural language processing gives machines the ability to read and understand the languages that humans speak. A sufficiently powerful natural language processing system would enable natural language user interfaces and the acquisition of knowledge directly from human-written sources, such as Internet texts. Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation.
A common method of processing and extracting meaning from natural language is through semantic indexing. Increases in processing speeds and the drop in the cost of data storage make indexing large volumes of abstractions of the user’s input much more efficient.
Motion and manipulation
The field of robotics is closely related to AI. Intelligence is required for robots to be able to handle such tasks as object manipulation and navigation, with sub-problems of localization (knowing where you are, or finding out where other things are), mapping (learning what is around you, building a map of the environment), and motion planning (figuring out how to get there) or path planning (going from one point in space to another point, which may involve compliant motion – where the robot moves while maintaining physical contact with an object).
Machine perception is the ability to use input from sensors (such as cameras, microphones, sonar and others more exotic) to deduce aspects of the world. Computer vision is the ability to analyze visual input. A few selected subproblems are speech recognition, facial recognition and object recognition.
Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer sciences, psychology, and cognitive science. While the origins of the field may be traced as far back as to early philosophical enquiries into emotion, the more modern branch of computer science originated with Rosalind Picard’s 1995 paper on affective computing. A motivation for the research is the ability to simulate empathy. The machine should interpret the emotional state of humans and adapt its behaviour to them, giving an appropriate response for those emotions.
Emotion and social skills play two roles for an intelligent agent. First, it must be able to predict the actions of others, by understanding their motives and emotional states. (This involves elements of game theory, decision theory, as well as the ability to model human emotions and the perceptual skills to detect emotions.) Also, in an effort to facilitate human-computer interaction, an intelligent machine might want to be able to display emotions—even if it does not actually experience them itself—in order to appear sensitive to the emotional dynamics of human interaction.
A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative, or systems that identify and assess creativity). Related areas of computational research are Artificial intuition and Artificial imagination.
Most researchers think that their work will eventually be incorporated into a machine with general intelligence (known as strong AI), combining all the skills above and exceeding human abilities at most or all of them. A few believe that anthropomorphic features like artificial consciousness or an artificial brain may be required for such a project.
Many of the problems above are considered AI-complete: to solve one problem, you must solve them all. For example, even a straightforward, specific task like machine translation requires that the machine follow the author’s argument (reason), know what is being talked about (knowledge), and faithfully reproduce the author’s intention (social intelligence). Machine translation, therefore, is believed to be AI-complete: it may require strong AI to be done as well as humans can do it.
There is no established unifying theory or paradigm that guides AI research. Researchers disagree about many issues. A few of the most long-standing questions that have remained unanswered are these: Should artificial intelligence simulate natural intelligence by studying psychology or neurology? Or is human biology as irrelevant to AI research as bird biology is to aeronautical engineering? Can intelligent behavior be described using simple, elegant principles (such as logic or optimization)? Or does it necessarily require solving a large number of completely unrelated problems? Can intelligence be reproduced using high-level symbols, similar to words and ideas? Or does it require “sub-symbolic” processing? John Haugeland, who coined the term GOFAI (Good Old-Fashioned Artificial Intelligence), also proposed that AI should more properly be referred to as synthetic intelligence, a term which has since been adopted by some non-GOFAI researchers.
Cybernetics and brain simulation
In the 1940s and 1950s, a number of researchers explored the connection between neurology, information theory, and cybernetics. Some of them built machines that used electronic networks to exhibit rudimentary intelligence, such as W. Grey Walter’s turtles and the Johns Hopkins Beast. Many of these researchers gathered for meetings of the Teleological Society at Princeton University and the Ratio Club in England. By 1960, this approach was largely abandoned, although elements of it would be revived in the 1980s.
When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation. The research was centered in three institutions: Carnegie Mellon University, Stanford and MIT, and each one developed its own style of research. John Haugeland named these approaches to AI “good old fashioned AI” or “GOFAI”. During the 1960s, symbolic approaches achieved great success at simulating high-level thinking in small demonstration programs. Approaches based on cybernetics or neural networks were abandoned or pushed into the background. Researchers in the 1960s and the 1970s were convinced that symbolic approaches would eventually succeed in creating a machine with artificial general intelligence and considered this the goal of their field.
- Cognitive simulation
- Economist Herbert Simon and Allen Newell studied human problem-solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science. Their research team used the results of psychological experiments to develop programs that simulated the techniques that people used to solve problems. This tradition, centered at Carnegie Mellon University, would eventually culminate in the development of the Soar architecture in the mid-1980s.
- Unlike Newell and Simon, John McCarthy felt that machines did not need to simulate human thought, but should instead try to find the essence of abstract reasoning and problem solving, regardless of whether people used the same algorithms. His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning. Logic was also the focus of the work at the University of Edinburgh and elsewhere in Europe, which led to the development of the programming language Prolog and the science of logic programming.
- “Anti-logic” or “scruffy”
- Researchers at MIT (such as Marvin Minsky and Seymour Papert) found that solving difficult problems in vision and natural language processing required ad-hoc solutions – they argued that there was no simple and general principle (like logic) that would capture all the aspects of intelligent behavior. Roger Schank described their “anti-logic” approaches as “scruffy” (as opposed to the “neat” paradigms at CMU and Stanford). Commonsense knowledge bases (such as Doug Lenat’s Cyc) are an example of “scruffy” AI, since they must be built by hand, one complicated concept at a time.
- When computers with large memories became available around 1970, researchers from all three traditions began to build knowledge into AI applications. This “knowledge revolution” led to the development and deployment of expert systems (introduced by Edward Feigenbaum), the first truly successful form of AI software. The knowledge revolution was also driven by the realization that enormous amounts of knowledge would be required by many simple AI applications.
By the 1980s progress in symbolic AI seemed to stall and many believed that symbolic systems would never be able to imitate all the processes of human cognition, especially perception, robotics, learning and pattern recognition. A number of researchers began to look into “sub-symbolic” approaches to specific AI problems.
- Bottom-up, embodied, situated, behavior-based or nouvelle AI
- Researchers from the related field of robotics, such as Rodney Brooks, rejected symbolic AI and focused on the basic engineering problems that would allow robots to move and survive. Their work revived the non-symbolic viewpoint of the early cybernetics researchers of the 50s and reintroduced the use of control theory in AI. This coincided with the development of the embodied mind thesis in the related field of cognitive science: the idea that aspects of the body (such as movement, perception and visualization) are required for higher intelligence.
- Computational Intelligence
- Interest in neural networks and “connectionism” was revived by David Rumelhart and others in the middle 1980s. These and other sub-symbolic approaches, such as fuzzy systems and evolutionary computation, are now studied collectively by the emerging discipline of computational intelligence.
In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific subproblems. These tools are truly scientific, in the sense that their results are both measurable and verifiable, and they have been responsible for many of AI’s recent successes. The shared mathematical language has also permitted a high level of collaboration with more established fields (like mathematics, economics or operations research). Stuart Russell and Peter Norvig describe this movement as nothing less than a “revolution” and “the victory of the neats.” Critics argue that these techniques are too focused on particular problems and have failed to address the long-term goal of general intelligence. There is an ongoing debate about the relevance and validity of statistical approaches in AI, exemplified in part by exchanges between Peter Norvig and Noam Chomsky.
Integrating the approaches
- Intelligent agent paradigm
- An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success. The simplest intelligent agents are programs that solve specific problems. More complicated agents include human beings and organizations of human beings (such as firms). The paradigm gives researchers license to study isolated problems and find solutions that are both verifiable and useful, without agreeing on one single approach. An agent that solves a specific problem can use any approach that works – some agents are symbolic and logical, some are sub-symbolic neural networks and others may use new approaches. The paradigm also gives researchers a common language to communicate with other fields—such as decision theory and economics—that also use concepts of abstract agents. The intelligent agent paradigm became widely accepted during the 1990s.
- Agent architectures and cognitive architectures
- Researchers have designed systems to build intelligent systems out of interacting intelligent agents in a multi-agent system. A system with both symbolic and sub-symbolic components is a hybrid intelligent system, and the study of such systems is artificial intelligence systems integration. A hierarchical control system provides a bridge between sub-symbolic AI at its lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time constraints permit planning and world modelling. Rodney Brooks’ subsumption architecture was an early proposal for such a hierarchical system.
In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in computer science. A few of the most general of these methods are discussed below.
Search and optimization
Many problems in AI can be solved in theory by intelligently searching through many possible solutions: Reasoning can be reduced to performing a search. For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule. Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal, a process called means-ends analysis. Robotics algorithms for moving limbs and grasping objects use local searches in configuration space. Many learning algorithms use search algorithms based on optimization.
Simple exhaustive searches are rarely sufficient for most real world problems: the search space (the number of places to search) quickly grows to astronomical numbers. The result is a search that is too slow or never completes. The solution, for many problems, is to use “heuristics” or “rules of thumb” that eliminate choices that are unlikely to lead to the goal (called “pruning the search tree”). Heuristics supply the program with a “best guess” for the path on which the solution lies.
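One widely used way to put a heuristic to work is A* search, sketched below on a small grid. This is an illustrative example rather than a method named in the text: the grid encoding (`#` for walls), the function name, and the choice of Manhattan distance as the "best guess" are all assumptions. Note that A* uses the heuristic to decide which paths to explore first rather than discarding choices outright.

```python
import heapq


def a_star(grid, start, goal):
    """Return the length of a shortest 4-connected path, or None if blocked.

    grid is a list of equal-length strings; '#' marks a wall.
    """
    def heuristic(cell):
        # Manhattan distance: an optimistic guess of the remaining steps.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    frontier = [(heuristic(start), 0, start)]  # (estimate, cost so far, cell)
    best_cost = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale queue entry; a cheaper route was found already
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    estimate = new_cost + heuristic((nr, nc))
                    heapq.heappush(frontier, (estimate, new_cost, (nr, nc)))
    return None


maze = ["....",
        ".##.",
        "...."]
print(a_star(maze, (0, 0), (2, 3)))
```

Because the Manhattan distance never overestimates the true remaining cost on such a grid, the first time the goal is popped from the queue its cost is optimal.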
A very different kind of search came to prominence in the 1990s, based on the mathematical theory of optimization. For many problems, it is possible to begin the search with some form of a guess and then refine the guess incrementally until no more refinements can be made. These algorithms can be visualized as blind hill climbing: we begin the search at a random point on the landscape, and then, by jumps or steps, we keep moving our guess uphill, until we reach the top. Other optimization algorithms are simulated annealing, beam search and random optimization.
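The hill-climbing idea can be sketched in a few lines of Python. The function and parameter names below are hypothetical, and the example uses a fixed starting point instead of a random one so that the result is reproducible.

```python
def hill_climb(score, start, neighbors, max_steps=1000):
    """Move to the best-scoring neighbor until no neighbor improves."""
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current  # no uphill step left: a (possibly local) peak
        current = best
    return current


# Toy landscape with a single peak at x = 3 (an assumption for the demo).
peak = hill_climb(
    score=lambda x: -(x - 3) ** 2,
    start=-10,
    neighbors=lambda x: [x - 1, x + 1],
)
print(peak)
```

On a landscape with several peaks this procedure can stop at a local maximum, which is one motivation for variants such as simulated annealing that occasionally accept downhill moves.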
Evolutionary computation uses a form of optimization search. For example, it may begin with a population of organisms (the guesses) and then allow them to mutate and recombine, selecting only the fittest to survive each generation (refining the guesses). Forms of evolutionary computation include swarm intelligence algorithms (such as ant colony or particle swarm optimization) and evolutionary algorithms (such as genetic algorithms, gene expression programming, and genetic programming).
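A minimal genetic algorithm along these lines might look as follows. It is a sketch under stated assumptions: genomes are lists of bits, the fitter half of the population survives unchanged each generation, and all parameter names and default values are illustrative rather than taken from any specific system.

```python
import random


def evolve(fitness, genome_length, population_size=20, generations=50,
           mutation_rate=0.05, seed=0):
    """Evolve bit-list genomes and return the fittest one found."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as-is.
        population.sort(key=fitness, reverse=True)
        survivors = population[:population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


# Toy objective: maximize the number of 1 bits in the genome.
best = evolve(fitness=sum, genome_length=16)
print(best, sum(best))
```

Keeping the survivors unchanged means the best fitness in the population can never decrease from one generation to the next.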
Logic is used for knowledge representation and problem solving, but it can be applied to other problems as well. For example, the satplan algorithm uses logic for planning and inductive logic programming is a method for learning.
Several different forms of logic are used in AI research. Propositional or sentential logic is the logic of statements which can be true or false. First-order logic also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other. Fuzzy logic is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0). Fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer product control systems. Subjective logic models uncertainty in a different and more explicit manner than fuzzy logic: a given binomial opinion satisfies belief + disbelief + uncertainty = 1 within a Beta distribution. By this method, ignorance can be distinguished from probabilistic statements that an agent makes with high confidence.
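To make the idea of graded truth concrete, the sketch below implements fuzzy AND, OR and NOT using the minimum, maximum and complement operators. These are one common choice of fuzzy connectives, not the only one, and the truth values in the example are invented for illustration.

```python
def fuzzy_and(a: float, b: float) -> float:
    """Degree to which both statements hold (minimum operator)."""
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    """Degree to which at least one statement holds (maximum operator)."""
    return max(a, b)


def fuzzy_not(a: float) -> float:
    """Degree to which the statement does not hold (complement)."""
    return 1.0 - a


# Hypothetical truth degrees for "the room is warm" and "the room is humid".
warm, humid = 0.7, 0.4
print(fuzzy_and(warm, humid), fuzzy_or(warm, humid), fuzzy_not(warm))
```

When the inputs are restricted to exactly 0 and 1, these operators reduce to the ordinary Boolean connectives.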
Default logics, non-monotonic logics and circumscription are forms of logic designed to help with default reasoning and the qualification problem. Several extensions of logic have been designed to handle specific domains of knowledge, such as: description logics; situation calculus, event calculus and fluent calculus (for representing events and time); causal calculus; belief calculus; and modal logics.
Probabilistic methods for uncertain reasoning
Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information. AI researchers have devised a number of powerful tools to solve these problems using methods from probability theory and economics.
Bayesian networks are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm), learning (using the expectation-maximization algorithm), planning (using decision networks) and perception (using dynamic Bayesian networks). Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g., hidden Markov models or Kalman filters).
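The basic step underlying Bayesian inference is Bayes' rule, which updates the probability of a hypothesis after seeing a piece of evidence. The sketch below applies it to a single hypothesis; the function name, parameter names and the numbers in the example are assumptions chosen for illustration, and a full Bayesian network would chain many such updates over a graph of variables.

```python
def posterior(prior: float, likelihood: float,
              false_positive_rate: float) -> float:
    """P(hypothesis | evidence) by Bayes' rule.

    prior               = P(hypothesis)
    likelihood          = P(evidence | hypothesis)
    false_positive_rate = P(evidence | not hypothesis)
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Invented numbers: a rare condition (1%) and a fairly reliable sensor.
print(posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05))
```

With these inputs the posterior stays well below one half, because the hypothesis was rare to begin with and false alarms are comparatively common.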
A key concept from the science of economics is “utility”: a measure of how valuable something is to an intelligent agent. Precise mathematical tools have been developed that analyze how an agent can make choices and plan, using decision theory, decision analysis, and information value theory. These tools include models such as Markov decision processes, dynamic decision networks, game theory and mechanism design.
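In its simplest form, decision theory has the agent pick the action with the highest expected utility. The following sketch assumes each action is described by a list of (probability, utility) outcomes; the action names and numbers are made up for the example.

```python
def expected_utility(outcomes):
    """Probability-weighted average utility of one action's outcomes."""
    return sum(probability * utility for probability, utility in outcomes)


def best_action(actions):
    """Name of the action whose expected utility is highest."""
    return max(actions, key=lambda name: expected_utility(actions[name]))


# Hypothetical choice with a 30% chance of rain: (probability, utility).
actions = {
    "take umbrella": [(0.3, 8), (0.7, 6)],
    "leave umbrella": [(0.3, 0), (0.7, 10)],
}
print(best_action(actions))
```

Models such as Markov decision processes extend this one-shot choice to sequences of decisions whose outcomes affect future states.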
Classifiers and statistical learning methods
The simplest AI applications can be divided into two types: classifiers (“if shiny then diamond”) and controllers (“if shiny then pick up”). Controllers do however also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples, making them very attractive for use in AI. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.
A classifier can be trained in various ways; there are many statistical and machine learning approaches. The most widely used classifiers are the neural network, kernel methods such as the support vector machine, k-nearest neighbor algorithm, Gaussian mixture model, naive Bayes classifier, and decision tree. The performance of these classifiers has been compared over a wide range of tasks. Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems; this is also referred to as the “no free lunch” theorem. Determining a suitable classifier for a given problem is still more an art than science.
The study of artificial neural networks began in the decade before the field of AI research was founded, in the work of Walter Pitts and Warren McCulloch. Other important early researchers were Frank Rosenblatt, who invented the perceptron, and Paul Werbos, who developed the backpropagation algorithm.
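Of the classifiers listed, the k-nearest neighbor algorithm is probably the easiest to write down. The sketch below labels a new observation by majority vote among the k closest training examples; the data set, the labels and the default k = 3 are illustrative assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest examples.

    training is a list of (point, label) pairs; points are number tuples.
    """
    nearest = sorted(training,
                     key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up data set: two well-separated clusters of labelled observations.
data = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
        ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
print(knn_classify(data, (2, 2)), knn_classify(data, (7, 9)))
```

There is no separate training phase here: the "previous experience" is simply the stored data set, consulted each time a new observation arrives.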
The main categories of networks are acyclic or feedforward neural networks (where the signal passes in only one direction) and recurrent neural networks (which allow feedback). Among the most popular feedforward networks are perceptrons, multi-layer perceptrons and radial basis networks. Among recurrent networks, the most famous is the Hopfield net, a form of attractor network, which was first described by John Hopfield in 1982. Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning and competitive learning.
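The simplest feedforward network, a single perceptron, can be trained with a few lines of code. The sketch below uses the classic perceptron learning rule to learn the logical AND function; the function names, the learning rate and the number of epochs are assumptions picked for the example.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of inputs exceeds the threshold."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Learn weights and bias from (inputs, target) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Nudge the weights toward the target only when a mistake is made.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table for logical AND, which is linearly separable.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print([predict(weights, bias, inputs) for inputs, _ in and_samples])
```

A single perceptron can only separate classes with a straight line (or hyperplane), so a function such as XOR needs a multi-layer network.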
In 1950, Alan Turing proposed a general procedure to test the intelligence of an agent now known as the Turing test. This procedure allows almost all the major problems of artificial intelligence to be tested. However, it is a very difficult challenge and at present all agents fail.
Artificial intelligence can also be evaluated on specific problems such as small problems in chemistry, hand-writing recognition and game-playing. Such tests have been termed subject matter expert Turing tests. Smaller problems provide more achievable goals and there are an ever-increasing number of positive results.
One classification for outcomes of an AI test is:
- Optimal: it is not possible to perform better.
- Strong super-human: performs better than all humans.
- Super-human: performs better than most humans.
- Sub-human: performs worse than most humans.
For example, performance at draughts is optimal, performance at chess is super-human and nearing strong super-human (see computer chess: computers versus human) and performance at many everyday tasks (such as recognizing a face or crossing a room without bumping into something) is sub-human.
A quite different approach measures machine intelligence through tests which are developed from mathematical definitions of intelligence. Examples of such tests began to appear in the late 1990s, using notions from Kolmogorov complexity and data compression. Two major advantages of mathematical definitions are their applicability to nonhuman intelligences and their absence of a requirement for human testers.
Artificial intelligence techniques are pervasive and are too numerous to list. Frequently, when a technique reaches mainstream use, it is no longer considered artificial intelligence; this phenomenon is described as the AI effect.
Competitions and prizes
There are a number of competitions and prizes to promote research in artificial intelligence. The main areas promoted are: general machine intelligence, conversational behavior, data-mining, driverless cars, robot soccer and games.
A platform (or “computing platform”) is defined as “some sort of hardware architecture or software framework (including application frameworks), that allows software to run.” As Rodney Brooks pointed out many years ago, it is not just the artificial intelligence software that defines the AI features of the platform, but rather the actual platform itself that affects the AI that results, i.e., there needs to be work in AI problems on real-world platforms rather than in isolation.
A wide variety of platforms has allowed different aspects of AI to develop, ranging from expert systems (PC-based, but still entire real-world systems) to various robot platforms such as the widely available Roomba with its open interface.
Artificial intelligence, by claiming to be able to recreate the capabilities of the human mind, is both a challenge and an inspiration for philosophy. Are there limits to how intelligent machines can be? Is there an essential difference between human intelligence and artificial intelligence? Can a machine have a mind and consciousness? A few of the most influential answers to these questions are given below.
Turing’s “polite convention”: We need not decide if a machine can “think”; we need only decide if a machine can act as intelligently as a human being. This approach to the philosophical problems associated with artificial intelligence forms the basis of the Turing test.
The Dartmouth proposal: “Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it.” This conjecture was printed in the proposal for the Dartmouth Conference of 1956, and represents the position of most working AI researchers.
Newell and Simon’s physical symbol system hypothesis: “A physical symbol system has the necessary and sufficient means of general intelligent action.” Newell and Simon argue that intelligences consist of formal operations on symbols. Hubert Dreyfus argued that, on the contrary, human expertise depends on unconscious instinct rather than conscious symbol manipulation and on having a “feel” for the situation rather than explicit symbolic knowledge. (See Dreyfus’ critique of AI.)
Gödel’s incompleteness theorem: A formal system (such as a computer program) cannot prove all true statements. Roger Penrose is among those who claim that Gödel’s theorem limits what machines can do. (See The Emperor’s New Mind.)
Searle’s strong AI hypothesis: “The appropriately programmed computer with the right inputs and outputs would thereby have a mind in exactly the same sense human beings have minds.” John Searle counters this assertion with his Chinese room argument, which asks us to look inside the computer and try to find where the “mind” might be.
The artificial brain argument: The brain can be simulated. Hans Moravec, Ray Kurzweil and others have argued that it is technologically feasible to copy the brain directly into hardware and software, and that such a simulation will be essentially identical to the original.
Predictions and ethics
Artificial Intelligence is a common topic in both science fiction and projections about the future of technology and society. The existence of an artificial intelligence that rivals human intelligence raises difficult ethical issues, and the potential power of the technology inspires both hopes and fears.
In fiction, artificial intelligence has appeared in many roles, including a servant (R2-D2 in Star Wars), a law enforcer (K.I.T.T. in Knight Rider), a comrade (Lt. Commander Data in Star Trek: The Next Generation), a conqueror/overlord (The Matrix, Omnius), a dictator (With Folded Hands), a benevolent provider/de facto ruler (The Culture), an assassin (Terminator), a sentient race (Battlestar Galactica/Transformers/Mass Effect), an extension to human abilities (Ghost in the Shell) and the savior of the human race (R. Daneel Olivaw in Isaac Asimov’s Robot series).
Mary Shelley’s Frankenstein considers a key issue in the ethics of artificial intelligence: if a machine can be created that has intelligence, could it also feel? If it can feel, does it have the same rights as a human? The idea also appears in modern science fiction, including the films I, Robot, Blade Runner and A.I.: Artificial Intelligence, in which humanoid machines have the ability to feel human emotions. This issue, now known as “robot rights”, is currently being considered by, for example, California’s Institute for the Future, although many critics believe that the discussion is premature. The subject is discussed at length in the 2010 documentary film Plug & Pray.
Martin Ford, author of The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future, and others argue that specialized artificial intelligence applications, robotics and other forms of automation will ultimately result in significant unemployment as machines begin to match and exceed the capability of workers to perform most routine and repetitive jobs. Ford predicts that many knowledge-based occupations—and in particular entry level jobs—will be increasingly susceptible to automation via expert systems, machine learning and other AI-enhanced applications. AI-based applications may also be used to amplify the capabilities of low-wage offshore workers, making it more feasible to outsource knowledge work.
Joseph Weizenbaum wrote that AI applications cannot, by definition, successfully simulate genuine human empathy and that the use of AI technology in fields such as customer service or psychotherapy was deeply misguided. Weizenbaum was also bothered that AI researchers (and some philosophers) were willing to view the human mind as nothing more than a computer program (a position now known as computationalism). To Weizenbaum these points suggest that AI research devalues human life.
Many futurists believe that artificial intelligence will ultimately transcend the limits of progress. Ray Kurzweil has used Moore’s law (which describes the relentless exponential improvement in digital technology) to calculate that desktop computers will have the same processing power as human brains by the year 2029. He also predicts that by 2045 artificial intelligence will reach a point where it is able to improve itself at a rate that far exceeds anything conceivable in the past, a scenario that science fiction writer Vernor Vinge named the “singularity”.
Robot designer Hans Moravec, cyberneticist Kevin Warwick and inventor Ray Kurzweil have predicted that humans and machines will merge in the future into cyborgs that are more capable and powerful than either. This idea, called transhumanism, which has roots in Aldous Huxley and Robert Ettinger, has been illustrated in fiction as well, for example in the manga Ghost in the Shell and the science-fiction series Dune.
In the 1980s, artist Hajime Sorayama’s Sexy Robots series was painted and published in Japan, depicting the organic human form with life-like, muscular metallic skins. The later book “the Gynoids” followed and was used by, or influenced, movie makers including George Lucas and other creatives. Sorayama never considered these organic robots to be a real part of nature but always an unnatural product of the human mind, a fantasy existing in the mind even when realized in actual form. Almost 20 years later, the first AI robotic pet (AIBO) became available as a companion to people. AIBO grew out of Sony’s Computer Science Laboratory (CSL). Famed engineer Dr. Toshitada Doi is credited as AIBO’s original progenitor: in 1994 he had started work on robots with artificial intelligence expert Masahiro Fujita within Sony’s CSL. Doi’s friend, the artist Hajime Sorayama, was enlisted to create the initial designs for the AIBO’s body. Those designs are now part of the permanent collections of the Museum of Modern Art and the Smithsonian Institution, with later versions of AIBO being used in studies at Carnegie Mellon University. In 2006, AIBO was added to Carnegie Mellon University’s “Robot Hall of Fame”.
Political scientist Charles T. Rubin believes that AI can be neither designed nor guaranteed to be friendly. He argues that “any sufficiently advanced benevolence may be indistinguishable from malevolence.” Humans should not assume machines or robots would treat us favorably, because there is no a priori reason to believe that they would be sympathetic to our system of morality, which has evolved along with our particular biology (which AIs would not share).
Edward Fredkin argues that “artificial intelligence is the next stage in evolution”, an idea first proposed in Samuel Butler’s “Darwin among the Machines” (1863) and expanded upon by George Dyson in his book of the same name in 1998.
- Artificial intelligence in fiction
- Artificial Intelligence (journal)
- Artificial intelligence (video games)
- Synthetic intelligence
- Cognitive sciences
- Computer Go
- Human Cognome Project
- Friendly artificial intelligence
- List of basic artificial intelligence topics
- List of AI researchers
- List of important AI publications
- List of AI projects
- List of machine learning algorithms
- List of emerging technologies
- List of scientific journals
- Philosophy of mind
- Technological singularity
- Never-Ending Language Learning
- ^ Definition of AI as the study of intelligent agents:
- Poole, Mackworth & Goebel 1998, p. 1, which provides the version that is used in this article. Note that they use the term “computational intelligence” as a synonym for artificial intelligence.
- Russell & Norvig (2003) (who prefer the term “rational agent”) and write “The whole-agent view is now widely accepted in the field” (Russell & Norvig 2003, p. 55).
- Nilsson 1998
- ^ a b The intelligent agent paradigm:
- Russell & Norvig 2003, pp. 27, 32–58, 968–972
- Poole, Mackworth & Goebel 1998, pp. 7–21
- Luger & Stubblefield 2004, pp. 235–240
The definition used in this article, in terms of goals, actions, perception and environment, is due to Russell & Norvig (2003). Other definitions also include knowledge and learning as additional criteria.
- ^ Although there is some controversy on this point (see Crevier (1993, p. 50)), McCarthy states unequivocally “I came up with the term” in a c|net interview. (Skillings 2006) McCarthy first used the term in the proposal for the Dartmouth conference, which appeared in 1955. (McCarthy et al. 1955)
- ^ McCarthy’s definition of AI:
- ^ Pamela McCorduck (2004, p. 424) writes of “the rough shattering of AI in subfields—vision, natural language, decision theory, genetic algorithms, robotics … and these with own sub-subfield—that would hardly have anything to say to each other.”
- ^ a b This list of intelligent traits is based on the topics covered by the major AI textbooks, including:
- ^ a b General intelligence (strong AI) is discussed in popular introductions to AI:
- ^ See the Dartmouth proposal, under Philosophy, below.
- ^ a b This is a central idea of Pamela McCorduck’s Machines Who Think. She writes: “I like to think of artificial intelligence as the scientific apotheosis of a venerable cultural tradition.” (McCorduck 2004, p. 34) “Artificial intelligence in one form or another is an idea that has pervaded Western intellectual history, a dream in urgent need of being realized.” (McCorduck 2004, p. xviii) “Our history is full of attempts—nutty, eerie, comical, earnest, legendary and real—to make artificial intelligences, to reproduce what is the essential us—bypassing the ordinary means. Back and forth between myth and reality, our imaginations supplying what our workshops couldn’t, we have engaged for a long time in this odd form of self-reproduction.” (McCorduck 2004, p. 3) She traces the desire back to its Hellenistic roots and calls it the urge to “forge the Gods.” (McCorduck 2004, pp. 340–400)
- ^ The optimism referred to includes the predictions of early AI researchers (see optimism in the history of AI) as well as the ideas of modern transhumanists such as Ray Kurzweil.
- ^ The “setbacks” referred to include the ALPAC report of 1966, the abandonment of perceptrons in 1970, the Lighthill Report of 1973 and the collapse of the lisp machine market in 1987.
- ^ a b AI applications widely used behind the scenes:
- ^ AI in myth:
- ^ Cult images as artificial intelligence:
These were the first machines to be believed to have true intelligence and consciousness. Hermes Trismegistus expressed the common belief that with these statues, craftsmen had reproduced “the true nature of the gods”, their sensus and spiritus. McCorduck makes the connection between sacred automatons and Mosaic law (developed around the same time), which expressly forbids the worship of robots (McCorduck 2004, pp. 6–9).
- ^ Humanoid automata:
- Needham 1986, p. 53
- McCorduck 2004, p. 6
- “A Thirteenth Century Programmable Robot”. Shef.ac.uk. Retrieved 25 April 2009.
- McCorduck 2004, p. 17
- ^ Artificial beings:
Jābir ibn Hayyān’s Takwin:
- O’Connor, Kathleen Malone (1994). The alchemical creation of life (takwin) and other concepts of Genesis in medieval Islam. University of Pennsylvania. Retrieved 10 January 2007.
- McCorduck 2004, pp. 13–14
- ^ AI in early science fiction.
- McCorduck 2004, pp. 17–25
- ^ This insight, that digital computers can simulate any process of formal reasoning, is known as the Church–Turing thesis.
- ^ Formal reasoning:
- ^ a b AI’s immediate precursors:
- McCorduck 2004, pp. 51–107
- Crevier 1993, pp. 27–32
- Russell & Norvig 2003, pp. 15, 940
- Moravec 1988, p. 3
See also Cybernetics and early neural networks (in History of artificial intelligence). Among the researchers who laid the foundations of AI were Alan Turing, John von Neumann, Norbert Wiener, Claude Shannon, Warren McCulloch, Walter Pitts and Donald Hebb.
- ^ Dartmouth conference:
- ^ Hegemony of the Dartmouth conference attendees:
- ^ Russell and Norvig write “it was astonishing whenever a computer did anything kind of smartish.” Russell & Norvig 2003, p. 18
- ^ “Golden years” of AI (successful symbolic reasoning programs 1956–1973):
- McCorduck 2004, pp. 243–252
- Crevier 1993, pp. 52–107
- Moravec 1988, p. 9
- Russell & Norvig 2003, pp. 18–21
- ^ DARPA pours money into undirected pure research into AI during the 1960s:
- ^ AI in England:
- ^ Optimism of early AI:
- ^ See The problems (in History of artificial intelligence)
- ^ First AI Winter, Mansfield Amendment, Lighthill report
- ^ a b Expert systems:
- ^ Boom of the 1980s: rise of expert systems, Fifth Generation Project, Alvey, MCC, SCI:
- ^ Second AI winter:
- ^ a b Formal methods are now preferred (“Victory of the neats”):
- ^ McCorduck 2004, pp. 480–483
- ^ DARPA Grand Challenge – home page
- ^ “Welcome”. Archive.darpa.mil. Retrieved 31 October 2011.
- ^ Markoff, John (16 February 2011). “On ‘Jeopardy!’ Watson Win Is All but Trivial”. The New York Times.
- ^ Kinect’s AI breakthrough explained
- ^ Problem solving, puzzle solving, game playing and deduction:
- ^ Uncertain reasoning:
- ^ Intractability and efficiency and the combinatorial explosion:
- Russell & Norvig 2003, pp. 9, 21–22
- ^ Psychological evidence of sub-symbolic reasoning:
- Wason & Shapiro (1966) showed that people do poorly on completely abstract problems, but if the problem is restated to allow the use of intuitive social intelligence, performance dramatically improves. (See Wason selection task)
- Kahneman, Slovic & Tversky (1982) have shown that people are terrible at elementary problems that involve uncertain reasoning. (See list of cognitive biases for several examples).
- Lakoff & Núñez (2000) have controversially argued that even our skills at mathematics depend on knowledge and skills that come from “the body”, i.e. sensorimotor and perceptual skills. (See Where Mathematics Comes From)
- ^ Knowledge representation:
- ^ Knowledge engineering:
- ^ a b Representing categories and relations: Semantic networks, description logics, inheritance (including frames and scripts):
- ^ a b Representing events and time: Situation calculus, event calculus, fluent calculus (including solving the frame problem):
- ^ a b Causal calculus:
- Poole, Mackworth & Goebel 1998, pp. 335–337
- ^ a b Representing knowledge about knowledge: Belief calculus, modal logics:
- ^ Ontology:
- Russell & Norvig 2003, pp. 320–328
- ^ Qualification problem:
While McCarthy was primarily concerned with issues in the logical representation of actions, Russell & Norvig 2003 apply the term to the more general issue of default reasoning in the vast network of assumptions underlying all our commonsense knowledge.
- ^ a b Default reasoning and default logic, non-monotonic logics, circumscription, closed world assumption, abduction (Poole et al. place abduction under “default reasoning”. Luger et al. place this under “uncertain reasoning”):
- ^ Breadth of commonsense knowledge:
- ^ Dreyfus & Dreyfus 1986
- ^ Gladwell 2005
- ^ a b Expert knowledge as embodied intuition:
- Dreyfus & Dreyfus 1986 (Hubert Dreyfus is a philosopher and critic of AI who was among the first to argue that most useful human knowledge was encoded sub-symbolically. See Dreyfus’ critique of AI)
- Gladwell 2005 (Gladwell’s Blink is a popular introduction to sub-symbolic reasoning and knowledge.)
- Hawkins & Blakeslee 2005 (Hawkins argues that sub-symbolic knowledge should be the primary focus of AI research.)
- ^ Planning:
- ^ a b Information value theory:
- Russell & Norvig 2003, pp. 600–604
- ^ Classical planning:
- ^ Planning and acting in non-deterministic domains: conditional planning, execution monitoring, replanning and continuous planning:
- Russell & Norvig 2003, pp. 430–449
- ^ Multi-agent planning and emergent behavior:
- Russell & Norvig 2003, pp. 449–455
- ^ This is a form of Tom Mitchell’s widely quoted definition of machine learning: “A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T as measured by P improves with experience E.”
- ^ Learning:
- ^ Alan Turing discussed the centrality of learning as early as 1950, in his classic paper Computing Machinery and Intelligence (Turing 1950). In 1956, at the original Dartmouth AI summer conference, Ray Solomonoff wrote a report on unsupervised probabilistic machine learning: “An Inductive Inference Machine” (PDF scanned copy of the original; version published in 1957: “An Inductive Inference Machine,” IRE Convention Record, Section on Information Theory, Part 2, pp. 56–62).
- ^ Reinforcement learning:
- ^ Computational learning theory:
- ^ Natural language processing:
- ^ Applications of natural language processing, including information retrieval (i.e. text mining) and machine translation:
- ^ Robotics:
- ^ a b Moving and configuration space:
- Russell & Norvig 2003, pp. 916–932
- ^ Tecuci, G. (2012), Artificial intelligence. WIREs Comp Stat, 4: 168–180. doi: 10.1002/wics.200
- ^ Robotic mapping (localization, etc):
- Russell & Norvig 2003, pp. 908–915
- ^ Machine perception:
- ^ Computer vision:
- ^ Speech recognition:
- ^ Object recognition:
- Russell & Norvig 2003, pp. 885–892
- ^ “Kismet”. MIT Artificial Intelligence Laboratory, Humanoid Robotics Group.
- ^ Thro, Ellen (1993). Robotics. New York.
- ^ Edelson, Edward (1991). The Nervous System. New York: Remmel Nunn.
- ^ Tao, Jianhua; Tieniu Tan (2005). “Affective Computing: A Review”. Affective Computing and Intelligent Interaction. LNCS 3784. Springer. pp. 981–995. doi:10.1007/11573548.
- ^ James, William (1884). “What is Emotion”. Mind 9: 188–205. doi:10.1093/mind/os-IX.34.188. Cited by Tao and Tan.
- ^ “Affective Computing”, MIT Technical Report #321 (Abstract), 1995
- ^ Kleine-Cosack, Christian (October 2006). “Recognition and Simulation of Emotions” (PDF). Archived from the original on 28 May 2008. Retrieved 13 May 2008. “The introduction of emotion to computer science was done by Pickard (sic) who created the field of affective computing.”
- ^ Diamond, David (December 2003). “The Love Machine; Building computers that care”. Wired. Archived from the original on 18 May 2008. Retrieved 13 May 2008. “Rosalind Picard, a genial MIT professor, is the field’s godmother; her 1997 book, Affective Computing, triggered an explosion of interest in the emotional side of computers and their users.”
- ^ Emotion and affective computing:
- ^ Gerald Edelman, Igor Aleksander and others have argued that artificial consciousness is required for strong AI. (Aleksander 1995; Edelman 2007)
- ^ a b Artificial brain arguments: AI requires a simulation of the operation of the human brain
A few of the people who make some form of the argument:
- ^ AI complete: Shapiro 1992, p. 9
- ^ Nils Nilsson writes: “Simply put, there is wide disagreement in the field about what AI is all about” (Nilsson 1983, p. 10).
- ^ a b Biological intelligence vs. intelligence in general:
- Russell & Norvig 2003, pp. 2–3, who make the analogy with aeronautical engineering.
- McCorduck 2004, pp. 100–101, who writes that there are “two major branches of artificial intelligence: one aimed at producing intelligent behavior regardless of how it was accomplished, and the other aimed at modeling intelligent processes found in nature, particularly human ones.”
- Kolata 1982, a paper in Science, which describes McCarthy’s indifference to biological models. Kolata quotes McCarthy as writing: “This is AI, so we don’t care if it’s psychologically real”. McCarthy recently reiterated his position at the AI@50 conference where he said “Artificial intelligence is not, by definition, simulation of human intelligence” (Maker 2006).
- ^ a b Neats vs. scruffies:
- ^ a b Symbolic vs. sub-symbolic AI:
- Nilsson (1998, p. 7), who uses the term “sub-symbolic”.
- ^ Haugeland 1985, p. 255.
- ^ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.8384&rep=rep1&type=pdf
- ^ Pei Wang (2008). Artificial general intelligence, 2008: proceedings of the First AGI Conference. IOS Press. p. 63. ISBN 978-1-58603-833-5. Retrieved 31 October 2011.
- ^ Haugeland 1985, pp. 112–117
- ^ The most dramatic case of sub-symbolic AI being pushed into the background was the devastating critique of perceptrons by Marvin Minsky and Seymour Papert in 1969. See History of AI, AI winter, or Frank Rosenblatt.
- ^ Cognitive simulation, Newell and Simon, AI at CMU (then called Carnegie Tech):
- ^ Soar (history):
- ^ McCarthy and AI research at SAIL and SRI International:
- ^ AI research at Edinburgh and in France, birth of Prolog:
- ^ AI at MIT under Marvin Minsky in the 1960s:
- ^ Cyc:
- ^ Knowledge revolution:
- ^ Embodied approaches to AI:
- ^ Revival of connectionism:
- ^ Computational intelligence
- ^ Pat Langley, “The changing science of machine learning”, Machine Learning, Volume 82, Number 3, 275–279, doi:10.1007/s10994-011-5242-y
- ^ Yarden Katz, “Noam Chomsky on Where Artificial Intelligence Went Wrong”, The Atlantic, November 1, 2012
- ^ Peter Norvig, “On Chomsky and the Two Cultures of Statistical Learning”
- ^ Agent architectures, hybrid intelligent systems:
- ^ Hierarchical control system:
- Albus, J. S. 4-D/RCS reference model architecture for unmanned ground vehicles. In G Gerhart, R Gunderson, and C Shoemaker, editors, Proceedings of the SPIE AeroSense Session on Unmanned Ground Vehicle Technology, volume 3693, pages 11–20
- ^ Subsumption architecture:
- ^ Search algorithms:
- ^ Forward chaining, backward chaining, Horn clauses, and logical deduction as search:
- ^ State space search and planning:
- ^ Uninformed searches (breadth first search, depth first search and general state space search):
- ^ Heuristic or informed searches (e.g., greedy best first and A*):
- ^ Optimization searches:
- ^ Artificial life and society based learning:
- Luger & Stubblefield 2004, pp. 530–541
- ^ Genetic programming and genetic algorithms:
- Luger & Stubblefield 2004, pp. 509–530,
- Nilsson 1998, chpt. 4.2.
- Holland, John H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press. ISBN 0-262-58111-6.
- Koza, John R. (1992). Genetic Programming. MIT Press. ISBN 0-262-11170-5.
- Poli, R., Langdon, W. B., McPhee, N. F. (2008). A Field Guide to Genetic Programming. Lulu.com, freely available from http://www.gp-field-guide.org.uk/. ISBN 978-1-4092-0073-4.
- ^ Logic:
- ^ Satplan:
- ^ Explanation based learning, relevance based learning, inductive logic programming, case based reasoning:
- ^ Propositional logic:
- ^ First-order logic and features such as equality:
- ^ Fuzzy logic:
- Russell & Norvig 2003, pp. 526–527
- ^ Subjective logic:
- ^ Stochastic methods for uncertain reasoning:
- ^ Bayesian networks:
- ^ Bayesian inference algorithm:
- ^ Bayesian learning and the expectation-maximization algorithm:
- ^ Bayesian decision theory and Bayesian decision networks:
- Russell & Norvig 2003, pp. 597–600
- ^ a b c Stochastic temporal models:
- Russell & Norvig 2003, pp. 537–581
- Russell & Norvig 2003, pp. 551–557
- (Russell & Norvig 2003, pp. 549–551)
- Russell & Norvig 2003, pp. 551–557
- ^ Decision theory and decision analysis:
- ^ Markov decision processes and dynamic decision networks:
- Russell & Norvig 2003, pp. 613–631
- ^ Game theory and mechanism design:
- Russell & Norvig 2003, pp. 631–643
- ^ Statistical learning methods and classifiers:
- ^ a b Neural networks and connectionism:
- ^ Kernel methods such as the support vector machine:
- Russell & Norvig 2003, pp. 749–752
- ^ K-nearest neighbor algorithm:
- Russell & Norvig 2003, pp. 733–736
- ^ Gaussian mixture model:
- Russell & Norvig 2003, pp. 725–727
- ^ Naive Bayes classifier:
- Russell & Norvig 2003, p. 718
- ^ Decision tree:
- ^ Classifier performance:
- ^ Backpropagation:
- ^ Feedforward neural networks, perceptrons and radial basis networks:
- ^ Recurrent neural networks, Hopfield nets:
- ^ Competitive learning, Hebbian coincidence learning, Hopfield networks and attractor networks:
- Luger & Stubblefield 2004, pp. 474–505
- ^ Hierarchical temporal memory:
- ^ Control theory:
- ^ Lisp:
- ^ Prolog:
- ^ a b The Turing test:
Turing’s original publication:
Historical influence and philosophical implications:
- ^ Subject matter expert Turing test:
- ^ Rajani, Sandeep (2011). “Artificial Intelligence – Man or Machine”. International Journal of Information Technology and Knowledge Management 4 (1): 173–176. Retrieved 24 September 2012.
- ^ Game AI:
- ^ Mathematical definitions of intelligence:
- Jose Hernandez-Orallo (2000). “Beyond the Turing Test”. Journal of Logic, Language and Information 9 (4): 447–466. doi:10.1023/A:1008367325700. CiteSeerX:10.1.1.44.8943.
- D L Dowe and A R Hajek (1997). “A computational extension to the Turing Test”. Proceedings of the 4th Conference of the Australasian Cognitive Science Society. Retrieved 21 July 2009.
- J Hernandez-Orallo and D L Dowe (2010). “Measuring Universal Intelligence: Towards an Anytime Intelligence Test”. Artificial Intelligence Journal 174 (18): 1508–1539. doi:10.1016/j.artint.2010.09.006.
- ^ “AI set to exceed human brain power” (web article). CNN. 26 July 2006. Archived from the original on 19 February 2008. Retrieved 26 February 2008.
- ^ Brooks, R.A., “How to build complete creatures rather than isolated cognitive simulators,” in K. VanLehn (ed.), Architectures for Intelligence, pp. 225–239, Lawrence Erlbaum Associates, Hillsdale, NJ, 1991.
- ^ Hacking Roomba (search results for “atmel”)
- ^ Philosophy of AI. All of these positions in this section are mentioned in standard discussions of the subject, such as:
- ^ Dartmouth proposal:
- ^ The physical symbol systems hypothesis:
- ^ Dreyfus criticized the necessary condition of the physical symbol system hypothesis, which he called the “psychological assumption”: “The mind can be viewed as a device operating on bits of information according to formal rules”. (Dreyfus 1992, p. 156)
- ^ Dreyfus’ critique of artificial intelligence:
- ^ This is a paraphrase of the relevant implication of Gödel’s theorems.
- ^ The Mathematical Objection:
Making the Mathematical Objection:
Refuting Mathematical Objection:
- Gödel 1931, Church 1936, Kleene 1935, Turing 1937
- ^ This version is from Searle (1999), and is also quoted in Dennett 1991, p. 435. Searle’s original formulation was “The appropriately programmed computer really is a mind, in the sense that computers given the right programs can be literally said to understand and have other cognitive states.” (Searle 1980, p. 1). Strong AI is defined similarly by Russell & Norvig (2003, p. 947): “The assertion that machines could possibly act intelligently (or, perhaps better, act as if they were intelligent) is called the ‘weak AI’ hypothesis by philosophers, and the assertion that machines that do so are actually thinking (as opposed to simulating thinking) is called the ‘strong AI’ hypothesis.”
- ^ Searle’s Chinese room argument:
- ^ Robot rights:
- Russell & Norvig 2003, p. 964
- “Robots could demand legal rights”. BBC News. 21 December 2006. Retrieved 3 February 2011.
- Henderson, Mark (24 April 2007). “Human rights for robots? We’re getting carried away”. The Times Online (London).
- ^ Independent documentary Plug & Pray, featuring Joseph Weizenbaum and Raymond Kurzweil
- ^ Ford, Martin R. (2009), The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future, Acculant Publishing, ISBN 978-1448659814. (e-book available free online.)
- ^ “Machine Learning: A Job Killer?”
- ^ AI could decrease the demand for human labor:
- Russell & Norvig 2003, pp. 960–961
- Ford, Martin (2009). The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future. Acculant Publishing. ISBN 978-1-4486-5981-4.
- ^ In the early 70s, Kenneth Colby presented a version of Weizenbaum’s ELIZA known as DOCTOR which he promoted as a serious therapeutic tool. (Crevier 1993, pp. 132–144)
- ^ Joseph Weizenbaum’s critique of AI:
- ^ Technological singularity:
- ^ Transhumanism:
- ^ Rubin, Charles (Spring 2003). “Artificial Intelligence and Human Nature”. The New Atlantis 1: 88–100.
- ^ AI as evolution:
- Edward Fredkin is quoted in McCorduck (2004, p. 401).
- Butler, Samuel (13 June 1863). “Darwin among the Machines”. The Press (Christchurch, New Zealand). http://www.nzetc.org/tm/scholarly/tei-ButFir-t1-g1-t1-g1-t4-body.html, Letter to the Editor.
- Dyson, George (1998). Darwin among the Machines. Allen Lane Science. ISBN 0-7382-0030-1.
- Luger, George; Stubblefield, William (2004). Artificial Intelligence: Structures and Strategies for Complex Problem Solving (5th ed.). The Benjamin/Cummings Publishing Company, Inc. ISBN 0-8053-4780-1.
- Neapolitan, Richard; Jiang, Xia (2012). Contemporary Artificial Intelligence. Chapman & Hall/CRC. ISBN 978-143984-469-4.
- Nilsson, Nils (1998). Artificial Intelligence: A New Synthesis. Morgan Kaufmann Publishers. ISBN 978-1-55860-467-4.
- Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, New Jersey: Prentice Hall, ISBN 0-13-790395-2
- Poole, David; Mackworth, Alan; Goebel, Randy (1998). Computational Intelligence: A Logical Approach. New York: Oxford University Press. ISBN 0-19-510270-3.
- Winston, Patrick Henry (1984). Artificial Intelligence. Reading, Massachusetts: Addison-Wesley. ISBN 0-201-08259-4.
History of AI
- Crevier, Daniel (1993), AI: The Tumultuous Search for Artificial Intelligence, New York, NY: BasicBooks, ISBN 0-465-02997-3
- McCorduck, Pamela (2004), Machines Who Think (2nd ed.), Natick, MA: A. K. Peters, Ltd., ISBN 1-56881-205-1
- Nilsson, Nils (2010), The Quest for Artificial Intelligence: A History of Ideas and Achievements, New York, ISBN 978-0-521-12293-1
- “ACM Computing Classification System: Artificial intelligence”. ACM. 1998. Retrieved 30 August 2007.
- Aleksander, Igor (1995). Artificial Neuroconsciousness: An Update. IWANN. Archived from the original on 2 March 1997.
- Brooks, Rodney (1990). “Elephants Don’t Play Chess” (PDF). Robotics and Autonomous Systems 6: 3–15. doi:10.1016/S0921-8890(05)80025-9. Archived from the original on 9 August 2007. Retrieved 30 August 2007.
- Buchanan, Bruce G. (2005). “A (Very) Brief History of Artificial Intelligence” (PDF). AI Magazine: 53–60. Archived from the original on 26 September 2007. Retrieved 30 August 2007.
- Dennett, Daniel (1991). Consciousness Explained. The Penguin Press. ISBN 0-7139-9037-6.
- Dreyfus, Hubert (1972). What Computers Can’t Do. New York: MIT Press. ISBN 0-06-011082-1.
- Dreyfus, Hubert (1979). What Computers Still Can’t Do. New York: MIT Press. ISBN 0-262-04134-0.
- Dreyfus, Hubert; Dreyfus, Stuart (1986). Mind over Machine: The Power of Human Intuition and Expertise in the Era of the Computer. Oxford, UK: Blackwell. ISBN 0-02-908060-6.
- Dreyfus, Hubert (1992). What Computers Still Can’t Do. New York: MIT Press. ISBN 0-262-54067-3.
- Edelman, Gerald (23 November 2007). “Gerald Edelman – Neural Darwinism and Brain-based Devices”. Talking Robots.
- Fearn, Nicholas (2007). The Latest Answers to the Oldest Questions: A Philosophical Adventure with the World’s Greatest Thinkers. New York: Grove Press. ISBN 0-8021-1839-9.
- Forster, Dion (2006). “Self validating consciousness in strong artificial intelligence: An African theological contribution”. Pretoria: University of South Africa.
- Gladwell, Malcolm (2005). Blink. New York: Little, Brown and Co. ISBN 0-316-17232-4.
- Haugeland, John (1985). Artificial Intelligence: The Very Idea. Cambridge, Mass.: MIT Press. ISBN 0-262-08153-9.
- Hawkins, Jeff; Blakeslee, Sandra (2005). On Intelligence. New York, NY: Owl Books. ISBN 0-8050-7853-3.
- Hofstadter, Douglas (1979). Gödel, Escher, Bach: an Eternal Golden Braid. New York, NY: Vintage Books. ISBN 0-394-74502-7.
- Howe, J. (November 1994). “Artificial Intelligence at Edinburgh University: a Perspective”. Retrieved 30 August 2007.
- Kahneman, Daniel; Slovic, D.; Tversky, Amos (1982). Judgment under uncertainty: Heuristics and biases. New York: Cambridge University Press. ISBN 0-521-28414-7.
- Kolata, G. (1982). “How can computers get common sense?”. Science 217 (4566): 1237–1238. doi:10.1126/science.217.4566.1237. PMID 17837639.
- Kurzweil, Ray (1999). The Age of Spiritual Machines. Penguin Books. ISBN 0-670-88217-8.
- Kurzweil, Ray (2005). The Singularity is Near. Penguin Books. ISBN 0-670-03384-7.
- Lakoff, George (1987). Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. University of Chicago Press. ISBN 0-226-46804-6.
- Lakoff, George; Núñez, Rafael E. (2000). Where Mathematics Comes From: How the Embodied Mind Brings Mathematics into Being. Basic Books. ISBN 0-465-03771-2.
- Lenat, Douglas; Guha, R. V. (1989). Building Large Knowledge-Based Systems. Addison-Wesley. ISBN 0-201-51752-3.
- Lighthill, Professor Sir James (1973). “Artificial Intelligence: A General Survey”. Artificial Intelligence: a paper symposium. Science Research Council.
- Lucas, John (1961). “Minds, Machines and Gödel”. In Anderson, A.R. Minds and Machines. Archived from the original on 19 August 2007. Retrieved 30 August 2007.
- Maker, Meg Houston (2006). “AI@50: AI Past, Present, Future”. Dartmouth College. Archived from the original on 8 October 2008. Retrieved 16 October 2008.
- McCarthy, John; Minsky, Marvin; Rochester, Nathan; Shannon, Claude (1955). “A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence”. Archived from the original on 26 August 2007. Retrieved 30 August 2007.
- McCarthy, John; Hayes, P. J. (1969). “Some philosophical problems from the standpoint of artificial intelligence”. Machine Intelligence 4: 463–502. Archived from the original on 10 August 2007. Retrieved 30 August 2007.
- McCarthy, John (12 November 2007). “What Is Artificial Intelligence?”.
- Minsky, Marvin (1967). Computation: Finite and Infinite Machines. Englewood Cliffs, N.J.: Prentice-Hall. ISBN 0-13-165449-7.
- Minsky, Marvin (2006). The Emotion Machine. New York, NY: Simon & Schuster. ISBN 0-7432-7663-9.
- Moravec, Hans (1976). “The Role of Raw Power in Intelligence”. Retrieved 30 August 2007.
- Moravec, Hans (1988). Mind Children. Harvard University Press. ISBN 0-674-57616-0.
- NRC, (United States National Research Council) (1999). “Developments in Artificial Intelligence”. Funding a Revolution: Government Support for Computing Research. National Academy Press.
- Needham, Joseph (1986). Science and Civilization in China: Volume 2. Caves Books Ltd.
- Newell, Allen; Simon, H. A. (1963). “GPS: A Program that Simulates Human Thought”. In Feigenbaum, E.A.; Feldman, J. Computers and Thought. New York: McGraw-Hill.
- Newell, Allen; Simon, H. A. (1976). “Computer Science as Empirical Inquiry: Symbols and Search”. Communications of the ACM 19.
- Nilsson, Nils (1983), “Artificial Intelligence Prepares for 2001”, AI Magazine 1 (1), Presidential Address to the Association for the Advancement of Artificial Intelligence.
- Penrose, Roger (1989). The Emperor’s New Mind: Concerning Computers, Minds and The Laws of Physics. Oxford University Press. ISBN 0-19-851973-7.
- Searle, John (1980). “Minds, Brains and Programs”. Behavioral and Brain Sciences 3 (3): 417–457. doi:10.1017/S0140525X00005756.
- Searle, John (1999). Mind, language and society. New York, NY: Basic Books. ISBN 0-465-04521-9. OCLC 43689264, 231867665.
- Serenko, Alexander; Detlor, Brian (2004). “Intelligent agents as innovations”. AI and Society 18 (4): 364–381. doi:10.1007/s00146-004-0310-5.
- Serenko, Alexander; Ruhi, Umar; Cocosila, Mihail (2007). “Unplanned effects of intelligent agents on Internet use: Social Informatics approach”. AI and Society 21 (1–2): 141–166. doi:10.1007/s00146-006-0051-8.
- Shapiro, Stuart C. (1992). “Artificial Intelligence”. In Shapiro, Stuart C. Encyclopedia of Artificial Intelligence (2nd ed.). New York: John Wiley. pp. 54–57. ISBN 0-471-50306-1.
- Simon, H. A. (1965). The Shape of Automation for Men and Management. New York: Harper & Row.
- Skillings, Jonathan (3 July 2006). “Getting Machines to Think Like Us”. cnet. Retrieved 3 February 2011.
- Tecuci, Gheorghe (March/April 2012). “Artificial Intelligence”. Wiley Interdisciplinary Reviews: Computational Statistics (Wiley) 4 (2): 168–180. doi:10.1002/wics.200.
- Turing, Alan (October 1950), “Computing Machinery and Intelligence”, Mind LIX (236): 433–460, doi:10.1093/mind/LIX.236.433, ISSN 0026-4423, retrieved 2008-08-18.
- van der Walt, Christiaan; Bernard, Etienne (2006). “Data characteristics that determine classifier performance” (PDF). Retrieved 5 August 2009.
- Vinge, Vernor (1993). “The Coming Technological Singularity: How to Survive in the Post-Human Era”.
- Wason, P. C.; Shapiro, D. (1966). “Reasoning”. In Foss, B. M. New horizons in psychology. Harmondsworth: Penguin.
- Weizenbaum, Joseph (1976). Computer Power and Human Reason. San Francisco: W.H. Freeman & Company. ISBN 0-7167-0464-1.
- TechCast Article Series, John Sagi, Framing Consciousness
- Boden, Margaret, Mind As Machine, Oxford University Press, 2006
- Johnston, John (2008) “The Allure of Machinic Life: Cybernetics, Artificial Life, and the New AI”, MIT Press
- Myers, Courtney Boyd ed. (2009). The AI Report. Forbes June 2009
- Serenko, Alexander (2010). “The development of an AI journal ranking based on the revealed preference approach” (PDF). Journal of Informetrics 4 (4): 447–459. doi:10.1016/j.joi.2010.04.001.
- Sun, R. & Bookman, L. (eds.), Computational Architectures: Integrating Neural and Symbolic Processes. Kluwer Academic Publishers, Needham, MA. 1994.
- What Is AI?: An introduction to artificial intelligence by AI founder John McCarthy.
- Logic and Artificial Intelligence: entry by Richmond Thomason in the Stanford Encyclopedia of Philosophy.
- AI at the Open Directory Project.
- AITopics: A large directory of links and other resources maintained by the Association for the Advancement of Artificial Intelligence, the leading organization of academic AI researchers.
- Artificial Intelligence Discussion group