Tag Archives: AI

[repost ]Discover and Discuss the World’s Open Source AI Software & Data

original:http://openair.allenai.org/

[repost ]Dr. Lei Zhang on the AI Technology Behind IBM Watson

original:http://www.infoq.com/cn/articles/ibm-watson-ai

Not long ago, the IBM supercomputer Watson took on two human champions in a man-versus-machine contest on the American TV quiz show Jeopardy! and ultimately defeated them. Watson was developed over four years by several IBM research labs together with partner universities, and IBM's China Research Lab also took part in the project. InfoQ China had the opportunity to interview Dr. Lei Zhang of the IBM China Research Lab, who worked directly on the Watson project. Dr. Zhang is a researcher in the Information and Knowledge Management department at the IBM China Research Lab. Over the past three years, he and his team, together with the global research team, worked on the DeepQA project, researching and developing the Watson system. During his time at IBM he has filed a number of patents and received an IBM Outstanding Technical Achievement Award. Academically, his research interests are broad, covering the Semantic Web, knowledge representation and reasoning, information extraction and retrieval, question answering systems, and machine learning, and he has published more than 20 papers. He has served on the program committees of major international conferences such as WWW, IJCAI, and ISWC, was one of the local organizers of the 9th International Semantic Web Conference (ISWC2010), and is one of the founders of the first Chinese Semantic Web Symposium (CSWS2007). Below, Dr. Zhang reveals the technology behind Watson.

A combination of many AI techniques and powerful computing together created Watson

InfoQ: Hello, Dr. Zhang! I think everyone has already seen what Watson can do on Jeopardy!. From taking a question posed in human language to giving an answer in human language, having a machine carry out that whole process, and with a respectable accuracy rate, sounds a little unbelievable. Could you walk us through an example of what happens behind the scenes when Watson answers a question? What are the technical principles behind it?

Lei Zhang: Hello, Han Xu! I am very glad to take this interview. After Watson receives a question, it performs a series of computations, including syntactic and semantic analysis, searches over its various knowledge sources, extraction of candidate answers, searches for evidence supporting each candidate, and the computation and combination of evidence strength. It combines techniques from natural language processing, knowledge representation and reasoning, machine learning, and more. We know that, starting from a single knowledge source or a small number of algorithms, it is very hard for a question answering system to approach human-level performance. So Watson's core principle is to search many knowledge sources and to apply a very large number of small algorithms from many different angles, judging and learning over all the possible answers. This greatly reduces the fragility that comes from depending on a few knowledge sources or a few algorithms, and so greatly improves performance.
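
To make the stages Dr. Zhang describes concrete, here is a minimal illustrative sketch in Python of that kind of pipeline: several knowledge sources propose candidate answers, many small scorers judge the evidence for each candidate from different angles, and the scores are merged into one confidence. The sources, scorers, and data below are invented placeholders, not Watson's actual components.

```python
# Illustrative sketch only (not IBM code): many sources propose candidates,
# many small scorers judge each candidate's evidence, scores are merged.

def analyze(question):
    # Toy "analysis": a keyword set stands in for parsing and semantics.
    return set(question.lower().split())

def generate_candidates(sources, keywords):
    # Each source maps a candidate answer -> a snippet of supporting text.
    for source in sources:
        for candidate, evidence in source.items():
            if keywords & set(evidence.lower().split()):
                yield candidate, evidence

def keyword_score(keywords, evidence):
    ev = set(evidence.lower().split())
    return len(keywords & ev) / max(len(keywords), 1)

def length_prior(_keywords, evidence):
    # Stand-in for any other weak signal (type match, dates, geography, ...).
    return min(len(evidence.split()), 20) / 20.0

def answer(question, sources, scorers):
    keywords = analyze(question)
    best = (0.0, None)
    for candidate, evidence in generate_candidates(sources, keywords):
        scores = [s(keywords, evidence) for s in scorers]
        confidence = sum(scores) / len(scores)   # naive merge of many scorers
        best = max(best, (confidence, candidate))
    return best

encyclopedia = {"Toronto": "Toronto is the largest city in Ontario, Canada",
                "Chicago": "Chicago is a city on Lake Michigan"}
print(answer("What is the largest city in Ontario?",
             [encyclopedia], [keyword_score, length_prior]))
```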

InfoQ: Fourteen years ago, Deep Blue defeated world chess champion Kasparov on the strength of large-scale computation and enumeration. How much of Watson's success today rests on its powerful computing capability, and how much on the development of artificial intelligence theory itself?

Lei Zhang: Both played a major role. Many techniques from the field of artificial intelligence are applied in the system; that much is obvious. On the other hand, without advances in computing power, we would have run into obstacles in raising the answering speed. A few years ago, Watson needed more than two hours to answer a single question on a decent server. It was the massive parallel computing power of IBM Power 7 that brought that down to within three seconds. Powerful computing also greatly accelerated the development process itself: we made heavy use of Java and machine learning, both of which need substantial computing resources behind them.

InfoQ: What does Watson have in common with Wolfram|Alpha, the computational knowledge engine that also targets question answering, and with the well-known AI project CYC, and how do they differ?

Lei Zhang: As far as I know, Wolfram|Alpha relies mainly on structured data curated by hand in advance as its knowledge source, whereas Watson works primarily from existing unstructured data, supplemented where appropriate with some structured data. The two systems' computational methods are also completely different. My guess is that Wolfram|Alpha is driven mainly by rule-based matching and computation, while Watson is driven mainly by statistical inference. Compared with CYC, Watson does not build a knowledge base grounded in formal logic; it directly uses existing knowledge written in human language, such as encyclopedias. CYC takes formal logical inference as its basic computational method, whereas Watson relies primarily on statistical inference.

InfoQ: Watson looks like a kind of decision system. As a decision system, it must not only give answers but also provide the supporting rationale. How is that achieved in the Watson system? Also, if it is told that an answer is wrong, does Watson have the ability to learn and improve itself?

Lei Zhang: A key step in the Watson system is assessing the reliability of candidate answers. That reliability is judged by hundreds of algorithms, each evaluating from a different angle: the degree of keyword match, of temporal match, of geographic match, of type match, and so on. For each of these dimensions Watson obtains a quantified reliability score, and the knowledge sources these scoring algorithms rely on are traceable. So, when needed, Watson can present the basis for an answer to the user.

Before competing, Watson learns from historical data. For example, if it answers a question from a past episode incorrectly, it learns something from that. During the game it relies mainly on what it has already learned, but it also does some simple online learning. For instance, it can generalize patterns from questions of the same type that other contestants have already answered and use them to guide its answers to that kind of question. Getting a question wrong also causes Watson to adjust its game strategy. So it is fair to say that Watson has a rudimentary capacity for self-learning and self-improvement.

The use of Semantic Web technology in Watson

InfoQ: How is the massive amount of knowledge collected from many different places represented and managed in the Watson system? How is inconsistent knowledge handled?

Lei Zhang: Unstructured knowledge is mostly represented as its original text, while structured knowledge is represented and managed with approaches such as RDF. When pieces of knowledge are inconsistent, Watson learns from a large number of past questions which knowledge is more trustworthy in this game and which is unreliable in this particular setting.

InfoQ: It has been reported that Watson uses Semantic Web technologies, including RDF and Linked Data. Why did Watson choose these technologies? How do RDF and the ideas behind Linking Open Data come into play in the Watson system?

Lei Zhang: Linked Data is a very important structured knowledge source. From the earliest stage of developing Watson we looked at how to exploit it. We tried using Linked Data, in particular DBpedia and IMDb, to answer competition questions directly. We also use Linked Data to help Watson judge the type of an answer. Beyond that, Watson borrows ideas from RDF and Linked Data in many other places. For example, some knowledge mined from text is represented in triple form; when the object denoted by a string is ambiguous, URIs are used to stand for the different objects; and the predicates of RDF triples are used as semantic hints, among other things.

InfoQ: Does Watson also apply other Semantic Web technologies, such as ontologies and logical reasoning? Is Watson's computing power sufficient to handle reasoning with very high time complexity?

Lei Zhang: Watson uses ontologies to help judge answer types. For example, to decide whether Harry Potter is a work of literature: DBpedia may tell us that Harry Potter is a novel, and the ontology tells us that a novel is a kind of literary work. In the Watson system we apply simple ontology-based logical reasoning, such as subsumption (hyponym/hypernym) relations and disjointness. These simple forms of logical reasoning can be implemented with simple, fast methods.
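
The kind of lightweight type reasoning described here can be illustrated in a few lines of Python: a transitive subclass lookup plus a disjointness check over a toy, hand-made hierarchy. The class names and the disjointness axiom are invented for illustration, not taken from DBpedia or YAGO.

```python
# Toy subsumption and disjointness checks over a small hand-made hierarchy.

SUBCLASS_OF = {              # child -> parent
    "novel": "literary work",
    "literary work": "creative work",
    "film": "creative work",
}
DISJOINT = {("person", "creative work")}   # declared disjoint class pairs

def ancestors(cls):
    """Transitive closure of subclass links, including the class itself."""
    seen = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        seen.add(cls)
    return seen

def is_a(instance_type, target_type):
    return target_type in ancestors(instance_type)

def incompatible(type_a, type_b):
    return any((x, y) in DISJOINT or (y, x) in DISJOINT
               for x in ancestors(type_a) for y in ancestors(type_b))

# "Harry Potter is a novel" + "a novel is a literary work" => literary work.
print(is_a("novel", "literary work"))     # True
print(incompatible("novel", "person"))    # True: creative work vs. person
```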

InfoQ: In terms of ontology engineering practice, how does Watson combine different ontologies such as DBpedia, YAGO, and WordNet? Does that involve ontology mapping and transformation?

Lei Zhang: Many of the ontologies are used separately, because we need Watson to learn how reliable each of them is in this question answering setting. YAGO and WordNet have a natural correspondence between them. Beyond that, we made almost no use of other ontology mappings or transformations.

InfoQ: Could you briefly explain strict semantics and shallow semantics, and how Watson balances the use of these two techniques?

Lei Zhang: I am not sure whether "strict semantics" and "shallow semantics" have precise definitions. My understanding is that "strict semantics" is tied to symbolic, formal logical systems and is usually precise and unambiguous: the meaning of every symbol is exactly interpreted and defined by other symbols within the logical system. "Shallow semantics" is tied to natural language or common sense and carries a certain amount of vagueness. The two are suited to different situations. Watson uses machine learning to learn which method should be applied in which setting.

Watson represents a breakthrough in natural language processing and artificial intelligence

InfoQ: With Watson defeating human contestants on Jeopardy!, artificial intelligence has once again become a hot topic: some call Watson a milestone in the development of AI; some believe the future of AI is bright; others are worried about machines getting smarter and smarter. Against the backdrop of AI's repeated failures in history, Watson is undeniably a successful exercise in applied AI. As one of the participants in the Watson project, could you talk about the lessons of Watson's success for AI practice? What is the outlook for AI? And do you think the worries about AI are warranted?

Lei Zhang: For AI practice, Watson's experience shows that it is very hard to succeed by relying on a single algorithm or a small number of them; integrating a large number of diverse small algorithms is more likely to make progress. That seems to mirror the diversity we see in biology. Watson also shows that AI technology has already advanced considerably: by integrating these techniques at large scale, many problems that looked very hard have moved from "impossible to solve" to "possibly solvable." For example, Watson suggests that the old knowledge acquisition bottleneck in AI may now be a solvable problem.

Worrying about AI is unnecessary at this stage. We have not yet seen machines with self-awareness; every capability is controlled and provided by people. At this stage, AI technology, Watson included, exists to help people, not to replace them.

InfoQ: Besides Jeopardy!, in what other fields can Watson be used? What additional work is needed for those fields? Is there anything Watson cannot do?

Lei Zhang: Watson represents a breakthrough in natural language processing and artificial intelligence and can be applied in many fields, such as healthcare, finance, telecommunications, and government services. In healthcare, for example, medical records, documents, journals, and research materials are all written in natural language, a language that traditional computers struggle to understand. A system that can immediately pull precise answers out of those documents could bring enormous change to the industry. IBM recently announced an agreement with Nuance Communications to explore, develop, and commercialize the Watson computing system's advanced analytics capabilities in the healthcare industry. Of course, to make Watson genuinely useful in these fields, additional work will likely be needed, such as preparing the corresponding domain knowledge bases. Watson is not all-powerful: it is currently not good at answering questions that are highly subjective or that depend on personal life experience.

Watson's future

InfoQ: Will there be a Watson 2.0? If so, what is the next step?

Lei Zhang: The next step in Watson's development is mainly to apply it to real-world domains, such as the healthcare, finance, telecommunications, and government services mentioned earlier.

InfoQ: A giant system like Watson is not something an ordinary company can own, yet the services it provides are widely needed. Given that, will Watson be offered as a cloud service in the future?

Lei Zhang: Watson is indeed a large system, but concretely it runs on fewer than 100 IBM Power7 servers, so it is not out of reach: many companies and institutions already own far more than 100 servers. Of course, to let Watson serve an ever larger volume of question answering requests, the number of machines would have to grow. So we do not rule out offering Watson as a cloud service.

InfoQ: Finally, could you tell us about the main work and contributions of IBM's China Research Lab in the Watson project?

Lei Zhang: IBM's China Research Lab played an important role in developing the Watson system. We collected, analyzed, and used various kinds of structured knowledge for Watson, used structured and highly reliable knowledge to answer questions, filtered out answers that would make the system look "dumb," and helped improve Watson's learning capability. Many technical results from the China Research Lab have been incorporated into the Watson system, and other research results served as references for the whole research team.

[repost ]The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI

original:http://www.wired.com/wiredenterprise/2013/05/neuro-artificial-intelligence/all/

Stanford professor Andrew Ng, the man at the center of the Deep Learning movement. Photo: Ariel Zambelich/Wired

 

There’s a theory that human intelligence stems from a single algorithm.

The idea arises from experiments suggesting that the portion of your brain dedicated to processing sound from your ears could also handle sight for your eyes. This is possible only while your brain is in the earliest stages of development, but it implies that the brain is — at its core — a general-purpose machine that can be tuned to specific tasks.

About seven years ago, Stanford computer science professor Andrew Ng stumbled across this theory, and it changed the course of his career, reigniting a passion for artificial intelligence, or AI. “For the first time in my life,” Ng says, “it made me feel like it might be possible to make some progress on a small part of the AI dream within our lifetime.”

‘For the first time in my life, it made me feel like it might be possible to make some progress on a small part of the AI dream within our lifetime.’

— Andrew Ng

In the early days of artificial intelligence, Ng says, the prevailing opinion was that human intelligence derived from thousands of simple agents working in concert, what MIT’s Marvin Minsky called “The Society of Mind.” To achieve AI, engineers believed, they would have to build and combine thousands of individual computing modules. One agent, or algorithm, would mimic language. Another would handle speech. And so on. It seemed an insurmountable feat.

When he was a kid, Andrew Ng dreamed of building machines that could think like people, but when he got to college and came face-to-face with the AI research of the day, he gave up. Later, as a professor, he would actively discourage his students from pursuing the same dream. But then he ran into the “one algorithm” hypothesis, popularized by Jeff Hawkins, an AI entrepreneur who’d dabbled in neuroscience research. And the dream returned.

It was a shift that would change much more than Ng’s career. Ng now leads a new field of computer science research known as Deep Learning, which seeks to build machines that can process data in much the same way the brain does, and this movement has extended well beyond academia, into big-name corporations like Google and Apple. In tandem with other researchers at Google, Ng is building one of the most ambitious artificial-intelligence systems to date, the so-called Google Brain.

This movement seeks to meld computer science with neuroscience — something that never quite happened in the world of artificial intelligence. “I’ve seen a surprisingly large gulf between the engineers and the scientists,” Ng says. Engineers wanted to build AI systems that just worked, he says, but scientists were still struggling to understand the intricacies of the brain. For a long time, neuroscience just didn’t have the information needed to help improve the intelligent machines engineers wanted to build.

What’s more, scientists often felt they “owned” the brain, so there was little collaboration with researchers in other fields, says Bruno Olshausen, a computational neuroscientist and the director of the Redwood Center for Theoretical Neuroscience at the University of California, Berkeley.

The end result is that engineers started building AI systems that didn’t necessarily mimic the way the brain operated. They focused on building pseudo-smart systems that turned out to be more like a Roomba vacuum cleaner than Rosie the robot maid from the Jetsons.

But, now, thanks to Ng and others, this is starting to change. “There is a sense from many places that whoever figures out how the brain computes will come up with the next generation of computers,” says Dr. Thomas Insel, the director of the National Institute of Mental Health.

What Is Deep Learning?

Deep Learning is a first step in this new direction. Basically, it involves building neural networks — networks that mimic the behavior of the human brain. Much like the brain, these multi-layered computer networks can gather information and react to it. They can build up an understanding of what objects look or sound like.

With Deep Learning, Ng says, you just give the system a lot of data ‘so it can discover by itself what some of the concepts in the world are.’

In an effort to recreate human vision, for example, you might build a basic layer of artificial neurons that can detect simple things like the edges of a particular shape. The next layer could then piece together these edges to identify the larger shape, and then the shapes could be strung together to understand an object. The key here is that the software does all this on its own — a big advantage over older AI models, which required engineers to massage the visual or auditory data so that it could be digested by the machine-learning algorithm.
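
A minimal sketch of that layered idea, assuming nothing beyond NumPy: each layer re-represents the previous layer's output, going from raw pixels toward progressively more abstract features. The weights here are random rather than learned, so this only shows the structure of a stacked network, not trained behavior.

```python
import numpy as np

# Minimal illustration of a stacked ("deep") network: each layer transforms
# the previous layer's features into a more abstract representation.

rng = np.random.default_rng(0)

def layer(inputs, n_out):
    weights = rng.standard_normal((inputs.shape[-1], n_out)) * 0.1
    bias = np.zeros(n_out)
    return np.maximum(0.0, inputs @ weights + bias)   # ReLU activation

image = rng.random(784)        # a fake 28x28 image, flattened
edges = layer(image, 128)      # low-level features ("edges")
shapes = layer(edges, 64)      # mid-level combinations ("shapes")
objects = layer(shapes, 10)    # high-level scores ("object categories")
print(objects.round(3))
```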

With Deep Learning, Ng says, you just give the system a lot of data “so it can discover by itself what some of the concepts in the world are.” Last year, one of his algorithms taught itself to recognize cats after scanning millions of images on the internet. The algorithm didn’t know the word “cat” — Ng had to supply that — but over time, it learned to identify the furry creatures we know as cats, all on its own.

This approach is inspired by how scientists believe that humans learn. As babies, we watch our environments and start to understand the structure of objects we encounter, but until a parent tells us what it is, we can’t put a name to it.

No, Ng’s deep learning algorithms aren’t yet as accurate — or as versatile — as the human brain. But he says this will come.

 

Andrew Ng’s laptop explains Deep Learning. Photo: Ariel Zambelich/Wired

From Google to China to Obama

Andrew Ng is just part of a larger movement. In 2011, he launched the Deep Learning project at Google, and in recent months, the search giant has significantly expanded this effort, acquiring the artificial intelligence outfit founded by University of Toronto professor Geoffrey Hinton, widely known as the godfather of neural networks. Chinese search giant Baidu has opened its own research lab dedicated to deep learning, vowing to invest heavy resources in this area. And according to Ng, big tech companies like Microsoft and Qualcomm are looking to hire more computer scientists with expertise in neuroscience-inspired algorithms.

Meanwhile, engineers in Japan are building artificial neural nets to control robots. And together with scientists from the European Union and Israel, neuroscientist Henry Markram is hoping to recreate a human brain inside a supercomputer, using data from thousands of real experiments.

‘Biology is hiding secrets well. We just don’t have the right tools to grasp the complexity of what’s going on.’

— Bruno Olshausen

The rub is that we still don’t completely understand how the brain works, but scientists are pushing forward in this as well. The Chinese are working on what they call the Brainnetdome, described as a new atlas of the brain, and in the U.S., the Era of Big Neuroscience is unfolding with ambitious, multidisciplinary projects like President Obama’s newly announced (and much criticized) Brain Research Through Advancing Innovative Neurotechnologies Initiative — BRAIN for short.

The BRAIN planning committee had its first meeting this past Sunday, with more meetings scheduled for this week. One of its goals is the development of novel technologies that can map the brain’s myriad circuits, and there are hints that the project will also focus on artificial intelligence. Half of the $100 million in federal funding allotted to this program will come from Darpa — more than the amount coming from the National Institutes of Health — and the Defense Department’s research arm hopes the project will “inspire new information processing architectures or new computing approaches.”

If we map out how thousands of neurons are interconnected and “how information is stored and processed in neural networks,” engineers like Ng and Olshausen will have a better idea of what their artificial brains should look like. The data could ultimately feed and improve Deep Learning algorithms underlying technologies like computer vision, language analysis, and the voice recognition tools offered on smartphones from the likes of Apple and Google.

“That’s where we’re going to start to learn about the tricks that biology uses. I think the key is that biology is hiding secrets well,” says Berkeley computational neuroscientist Olshausen. “We just don’t have the right tools to grasp the complexity of what’s going on.”

What the World Wants

With the rise of mobile devices, cracking the neural code is more important than ever. As gadgets get smaller and smaller, we’ll need new ways of making them faster and more accurate. As you shrink transistors — the fundamental building blocks of our machines — it becomes more difficult to make them accurate and efficient. If you make them faster, for instance, they need more current, and more current makes the system more noisy — i.e. less precise.

‘If we could figure out how biology naturally deals with noisy computing elements, it would lead to a completely different model of computation.’

— Bruno Olshausen

Right now, engineers design around these issues, says Olshausen, so they skimp on speed, size, or energy efficiency to make their systems work. But AI may provide a better answer. “Instead of dodging the problem, what I think biology could tell us is just how to deal with it….The switches that biology is using are also inherently noisy, but biology has found a good way to adapt and live with that noise and exploit it,” Olshausen says. “If we could figure out how biology naturally deals with noisy computing elements, it would lead to a completely different model of computation.”

But scientists aren’t just aiming for smaller. They’re trying to build machines that do things computers have never done before. No matter how sophisticated algorithms are, today’s machines can’t fetch your groceries or pick out a purse or a dress you might like. That requires a more advanced breed of image intelligence and an ability to store and recall pertinent information in a way that’s reminiscent of human attention and memory. If you can do that, the possibilities are almost endless.

“Everybody recognizes that if you could solve these problems, it’s going to open up a vast, vast potential of commercial value,” Olshausen predicts.

That financial promise is why tech giants like Google, IBM, Microsoft, Apple, Chinese search giant Baidu and others are in an arms race to develop the best machine learning technologies. NYU’s Yann LeCun, an expert in the field, expects that in the next two years, we’ll see a surge in Deep Learning startups, and many will be snatched up by larger outfits.

But even the best engineers aren’t brain experts, so having more neuro-knowledge handy is important. “We need to really work more closely with neuroscientists,” says Baidu’s Yu, who is toying with the idea of hiring one. “We are already doing that, but we need to do more.”

Ng’s dream is on the way to reality. “It gives me hope, no, more than hope, that we might be able to do this,” he says. “We clearly don’t have the right algorithms yet. It’s going to take decades. This is not going to be an easy one, but I think there’s hope.”

[repost ]Google Top Charts uses the Knowledge Graph for entity recognition and disambiguation

original:http://ebiquity.umbc.edu/blogger/2013/05/23/googles-top-charts-uses-the-knowledge-graph-for-entity-recognition/

Top Charts is a new feature for Google Trends that identifies the popular searches within a category, e.g., books or actors. What’s interesting about it, from a technology standpoint, is that it uses Google’s Knowledge Graph to provide a universe of things and the categories into which they belong. This is a great example of “Things, not strings”, Google’s clever slogan to explain the importance of the Knowledge Graph.

Here’s how it’s explained in the Trends Top Charts FAQ.

“Top Charts relies on technology from the Knowledge Graph to identify when search queries seem to be about particular real-world people, places and things. The Knowledge Graph enables our technology to connect searches with real-world entities and their attributes. For example, if you search for ice ice baby, you’re probably searching for information about the musician Vanilla Ice or his music. Whereas if you search for vanilla ice cream recipe, you’re probably looking for information about the tasty dessert. Top Charts builds on work we’ve done so our systems do a better job finding you the information you’re actually looking for, whether tasty desserts or musicians.”

One thing to note about the Knowledge Graph, which is said to have more than 18 billion facts about 570 million objects, is that its objects include more than the traditional named entities (e.g., people, places, things). For example, there is a top chart for Animals that shows that dogs are the most popular animal in Google searches followed by cats (no surprises here) with chickens at number three on the list (could their high rank be due to recipe searches?). The dog object, in most knowledge representation schemes, would be modeled as a concept or class as opposed to an object or instance. In some representation systems, the same term (e.g., dog) can be used to refer to both a class of instances (a class that includes Lassie) and also to an instance (e.g., an instance of the class animal types). Which sense of the term dog is meant (class vs. instance) is determined by the context. In the semantic web representation language OWL 2, the ability to use the same term to refer to a class or a related instance is called punning.

Of course, when doing this kind of mapping of terms to objects, we only want to consider concepts that commonly have words or short phrases used to denote them. Not all concepts do, such as animals that from a long way off look like flies.

A second observation is that once you have a nice knowledge base like the Knowledge Graph, you have a new problem: how can you recognize mentions of its instances in text? In the DBpedia knowledge base (derived from Wikipedia) there are nine individuals named Michael Jordan and two of them were professional basketball players in the NBA. So, when you enter a search query like “When did Michael Jordan play for Penn”, we have to use information in the query, its context and what we know about the possible referents (e.g., those nine Michael Jordans) to decide (1) if this is likely to be a reference to any of the objects in our knowledge base, and (2) if so, to which one. This task, which is a fundamental one in language processing, is not trivial, but luckily, in applications like Top Charts, we don’t have to do it with perfect accuracy.
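
A toy sketch of that disambiguation step (invented candidate entries and a crude overlap score, not Google's method): rank the possible referents of an ambiguous name by how well their known attribute words match the rest of the query, and return nothing when no referent matches at all.

```python
# Toy entity disambiguation: rank candidate referents of an ambiguous name
# by overlap between the query and each candidate's attribute words.
# The candidate entries below are invented for illustration.

CANDIDATES = {
    "Michael Jordan": [
        {"id": "mj_nba", "words": {"basketball", "nba", "bulls", "chicago"}},
        {"id": "mj_ml",  "words": {"machine", "learning", "berkeley",
                                   "statistics", "professor"}},
    ]
}

def disambiguate(mention, query):
    query_words = set(query.lower().split())
    scored = [(len(c["words"] & query_words), c["id"])
              for c in CANDIDATES.get(mention, [])]
    score, entity = max(scored, default=(0, None))
    return entity if score > 0 else None   # None: probably not in the KB

print(disambiguate("Michael Jordan",
                   "when did michael jordan teach machine learning at berkeley"))
```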

Google’s Top Charts is a simple, but effective, example that demonstrates the potential usefulness of semantic technology to make our information systems better in the near future.

[repost ]THE AI BEHIND WATSON — THE TECHNICAL ARTICLE

original:http://www.aaai.org/Magazine/Watson/watson.php

The 2010 Fall Issue of AI Magazine includes an article on “Building Watson: An Overview of the DeepQA Project,” written by the IBM Watson Research Team, led by David Ferrucci. Read about this exciting project in the most detailed technical article available. We hope you will also take a moment to read through the archives of AI Magazine, and consider joining us at AAAI. To join, please read more at http://www.aaai.org/Membership/membership.php. The most recent online volume of AI Magazine is usually only available to members of the association. However, we have made an exception for this special article on Watson to share the excitement. Congratulations to the IBM Watson Team!

Building Watson: An Overview of the DeepQA Project

Published in AI Magazine Fall, 2010. Copyright ©2010 AAAI. All rights reserved.

Written by David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty

Abstract

IBM Research undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise. The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that can be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of question answering (QA).

The goals of IBM Research are to advance computer science by exploring new ways for computer technology to affect science, business, and society. Roughly three years ago, IBM Research was looking for a major research challenge to rival the scientific and popular interest of Deep Blue, the computer chess-playing champion (Hsu 2002), that also would have clear relevance to IBM business interests.

With a wealth of enterprise-critical information being captured in natural language documentation of all forms, the problems with perusing only the top 10 or 20 most popular documents containing the user’s two or three key words are becoming increasingly apparent. This is especially the case in the enterprise where popularity is not as important an indicator of relevance and where recall can be as critical as precision. There is growing interest to have enterprise computer systems deeply analyze the breadth of relevant content to more precisely answer and justify answers to user’s natural language questions. We believe advances in question-answering (QA) technology can help support professionals in critical and timely decision making in areas like compliance, health care, business integrity, business intelligence, knowledge discovery, enterprise knowledge management, security, and customer support. For researchers, the open-domain QA problem is attractive as it is one of the most challenging in the realm of computer science and artificial intelligence, requiring a synthesis of information retrieval, natural language processing, knowledge representation and reasoning, machine learning, and computer-human interfaces. It has had a long history (Simmons 1970) and saw rapid advancement spurred by system building, experimentation, and government funding in the past decade (Maybury 2004, Strzalkowski and Harabagiu 2006).

With QA in mind, we settled on a challenge to build a computer system, called Watson,1 which could compete at the human champion level in real time on the American TV quiz show, Jeopardy. The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise.

Jeopardy! is a well-known TV quiz show that has been airing on television in the United States for more than 25 years (see the Jeopardy! Quiz Show sidebar for more information on the show). It pits three human contestants against one another in a competition that requires answering rich natural language questions over a very broad domain of topics, with penalties for wrong answers. The nature of the three-person competition is such that confidence, precision, and answering speed are of critical importance, with roughly 3 seconds to answer each question. A computer system that could compete at human champion levels at this game would need to produce exact answers to often complex natural language questions with high precision and speed and have a reliable confidence in its answers, such that it could answer roughly 70 percent of the questions asked with greater than 80 percent precision in 3 seconds or less.

Finally, the Jeopardy Challenge represents a unique and compelling AI question similar to the one underlying DeepBlue (Hsu 2002) — can a computer system be designed to compete against the best humans at a task thought to require high levels of human intelligence, and if so, what kind of technology, algorithms, and engineering is required? While we believe the Jeopardy Challenge is an extraordinarily demanding task that will greatly advance the field, we appreciate that this challenge alone does not address all aspects of QA and does not by any means close the book on the QA challenge the way that Deep Blue may have for playing chess.

The Jeopardy Challenge

Meeting the Jeopardy Challenge requires advancing and incorporating a variety of QA technologies including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.

Winning at Jeopardy requires accurately computing confidence in your answers. The questions and content are ambiguous and noisy and none of the individual algorithms are perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy parlance, this confidence is used to determine whether the computer will “ring in” or “buzz in” for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds with an average around 3 seconds.

Confidence estimation was very critical to shaping our overall approach in DeepQA. There is no expectation that any component in the system does a perfect job — all components post features of the computation and associated confidences, and we use a hierarchical machine-learning method to combine all these features and decide whether or not there is enough confidence in the final answer to attempt to buzz in and risk getting the question wrong.
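
A drastically simplified stand-in for that final step (not DeepQA's actual learner): treat each component's output as a feature, combine the features with a logistic function whose weights would in practice be learned from past games (here they are hand-set), and compare the resulting confidence against a buzz-in threshold.

```python
import math

# Simplified sketch of a learned confidence merger: components post feature
# scores, a weighted logistic combination turns them into one confidence,
# and a threshold decides whether to buzz in. Weights here are hand-set;
# in a real system they would be learned from historical questions.

WEIGHTS = {"keyword": 1.5, "type_match": 2.0, "temporal": 0.8, "bias": -2.0}

def confidence(features):
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value
                              for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def should_buzz(features, threshold=0.5):
    return confidence(features) >= threshold

features = {"keyword": 0.9, "type_match": 1.0, "temporal": 0.4}
print(round(confidence(features), 3), should_buzz(features))
```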

In this section we elaborate on the various aspects of the Jeopardy Challenge.

The Categories

A 30-clue Jeopardy board is organized into six columns. Each column contains five clues and is associated with a category. Categories range from broad subject headings like “history,” “science,” or “politics” to less informative puns like “tutu much,” in which the clues are about ballet, to actual parts of the clue, like “who appointed me to the Supreme Court?” where the clue is the name of a judge, to “anything goes” categories like “potpourri.” Clearly some categories are essential to understanding the clue, some are helpful but not necessary, and some may be useless, if not misleading, for a computer.

A recurring theme in our approach is the requirement to try many alternate hypotheses in varying contexts to see which produces the most confident answers given a broad range of loosely coupled scoring algorithms. Leveraging category information is another clear area requiring this approach.

The Questions

There are a wide variety of ways one can attempt to characterize the Jeopardy clues. For example, by topic, by difficulty, by grammatical construction, by answer type, and so on. A type of classification that turned out to be useful for us was based on the primary method deployed to solve the clue. The bulk of Jeopardy clues represent what we would consider factoid questions — questions whose answers are based on factual information about one or more individual entities. The questions themselves present challenges in determining what exactly is being asked for and which elements of the clue are relevant in determining the answer. Here are just a few examples (note that while the Jeopardy! game requires that answers are delivered in the form of a question (see the Jeopardy! Quiz Show sidebar), this transformation is trivial and for purposes of this paper we will just show the answers themselves):

Category: General Science
Clue: When hit by electrons, a phosphor gives off electromagnetic energy in this form.
Answer: Light (or Photons)

Category: Lincoln Blogs
Clue: Secretary Chase just submitted this to me for the third time; guess what, pal. This time I’m accepting it.
Answer: his resignation

Category: Head North
Clue: They’re the two states you could be reentering if you’re crossing Florida’s northern border.
Answer: Georgia and Alabama

Decomposition.

Some more complex clues contain multiple facts about the answer, all of which are required to arrive at the correct response but are unlikely to occur together in one place. For example:

Category: “Rap” Sheet
Clue: This archaic term for a mischievous or annoying child can also mean a rogue or scamp.
Subclue 1: This archaic term for a mischievous or annoying child.
Subclue 2: This term can also mean a rogue or scamp.
Answer: Rapscallion

In this case, we would not expect to find both “subclues” in one sentence in our sources; rather, if we decompose the question into these two parts and ask for answers to each one, we may find that the answer common to both questions is the answer to the original clue.

Another class of decomposable questions is one in which a subclue is nested in the outer clue, and the subclue can be replaced with its answer to form a new question that can more easily be answered. For example:

Category: Diplomatic Relations
Clue: Of the four countries in the world that the United States does not have diplomatic relations with, the one that’s farthest north.
Inner subclue: The four countries in the world that the United States does not have diplomatic relations with (Bhutan, Cuba, Iran, North Korea).
Outer subclue: Of Bhutan, Cuba, Iran, and North Korea, the one that’s farthest north.
Answer: North Korea

Decomposable Jeopardy clues generated requirements that drove the design of DeepQA to generate zero or more decomposition hypotheses for each question as possible interpretations.

Puzzles.

Jeopardy also has categories of questions that require special processing defined by the category itself. Some of them recur often enough that contestants know what they mean without instruction; for others, part of the task is to figure out what the puzzle is as the clues and answers are revealed (categories requiring explanation by the host are not part of the challenge). Examples of well-known puzzle categories are the Before and After category, where two subclues have answers that overlap by (typically) one word, and the Rhyme Time category, where the two subclue answers must rhyme with one another. Clearly these cases also require question decomposition. For example:

Category: Before and After Goes to the Movies
Clue: Film of a typical day in the life of the Beatles, which includes running from bloodthirsty zombie fans in a Romero classic.
Subclue 1: Film of a typical day in the life of the Beatles.
Answer 1: (A Hard Day’s Night)
Subclue 2: Running from bloodthirsty zombie fans in a Romero classic.
Answer 2: (Night of the Living Dead)
Answer: A Hard Day’s Night of the Living Dead

Category: Rhyme Time
Clue: It’s where Pele stores his ball.
Subclue 1: Pele ball (soccer)
Subclue 2: where store (cabinet, drawer, locker, and so on)
Answer: soccer locker

There are many infrequent types of puzzle categories including things like converting roman numerals, solving math word problems, sounds like, finding which word in a set has the highest Scrabble score, homonyms and heteronyms, and so on. Puzzles constitute only about 2–3 percent of all clues, but since they typically occur as entire categories (five at a time) they cannot be ignored for success in the Challenge as getting them all wrong often means losing a game.

Excluded Question Types.

The Jeopardy quiz show ordinarily admits two kinds of questions that IBM and Jeopardy Productions, Inc., agreed to exclude from the computer contest: audiovisual (A/V) questions and Special Instructions questions. A/V questions require listening to or watching some sort of audio, image, or video segment to determine a correct answer. For example:

Category: Picture This
(Contestants are shown a picture of a B-52 bomber)
Clue: Alphanumeric name of the fearsome machine seen here.
Answer: B-52

Special instruction questions are those that are not “self-explanatory” but rather require a verbal explanation describing how the question should be interpreted and solved. For example:

Category: Decode the Postal Codes
Verbal instruction from host: We’re going to give you a word comprising two postal abbreviations; you have to identify the states.
Clue: Vain
Answer: Virginia and Indiana

Both present very interesting challenges from an AI perspective but were put out of scope for this contest and evaluation.

The Domain

As a measure of the Jeopardy Challenge’s breadth of domain, we analyzed a random sample of 20,000 questions extracting the lexical answer type (LAT) when present. We define a LAT to be a word in the clue that indicates the type of the answer, independent of assigning semantics to that word. For example in the following clue, the LAT is the string “maneuver.”

Category: Oooh….Chess
Clue: Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.
Answer: Castling

About 12 percent of the clues do not indicate an explicit lexical answer type but may refer to the answer with pronouns like “it,” “these,” or “this” or not refer to it at all. In these cases the type of answer must be inferred by the context. Here’s an example:

Category: Decorating
Clue: Though it sounds “harsh,” it’s just embroidery, often in a floral pattern, done with yarn on cotton cloth.
Answer: crewel

The distribution of LATs has a very long tail, as shown in figure 1. We found 2500 distinct and explicit LATs in the 20,000 question sample. The most frequent 200 explicit LATs cover less than 50 percent of the data. Figure 1 shows the relative frequency of the LATs. It labels all the clues with no explicit type with the label “NA.” This aspect of the challenge implies that while task-specific type systems or manually curated data would have some impact if focused on the head of the LAT curve, it still leaves more than half the problems unaccounted for. Our clear technical bias for both business and scientific motivations is to create general-purpose, reusable natural language processing (NLP) and knowledge representation and reasoning (KRR) technology that can exploit as-is natural language resources and as-is structured knowledge rather than to curate task-specific knowledge resources.

Figure 1

Figure 1. Lexical Answer Type Frequency.
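
The cumulative-coverage observation is easy to reproduce from LAT counts; here is a small sketch with made-up counts (the real analysis used the 20,000-clue sample).

```python
from collections import Counter

# Cumulative coverage of the most frequent lexical answer types (LATs).
# The counts below are invented; they only illustrate the computation.

lat_counts = Counter({"he": 1200, "country": 900, "city": 700, "state": 600,
                      "film": 400, "author": 300, "maneuver": 5})

def coverage(counter, top_n):
    total = sum(counter.values())
    top = sum(count for _, count in counter.most_common(top_n))
    return top / total

print(f"top 3 LATs cover {coverage(lat_counts, 3):.0%} of these toy clues")
```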

The Metrics

In addition to question-answering precision, the system’s game-winning performance will depend on speed, confidence estimation, clue selection, and betting strategy. Ultimately the outcome of the public contest will be decided based on whether or not Watson can win one or two games against top-ranked humans in real time. The highest amount of money earned by the end of a one- or two-game match determines the winner. A player’s final earnings, however, often will not reflect how well the player did during the game at the QA task. This is because a player may decide to bet big on Daily Double or Final Jeopardy questions. There are three hidden Daily Double questions in a game that can affect only the player lucky enough to find them, and one Final Jeopardy question at the end that all players must gamble on. Daily Double and Final Jeopardy questions represent significant events where players may risk all their current earnings. While potentially compelling for a public contest, a small number of games does not represent statistically meaningful results for the system’s raw QA performance.

While Watson is equipped with betting strategies necessary for playing full Jeopardy, from a core QA perspective we want to measure correctness, confidence, and speed, without considering clue selection, luck of the draw, and betting strategies. We measure correctness and confidence using precision and percent answered. Precision measures the percentage of questions the system gets right out of those it chooses to answer. Percent answered is the percentage of questions it chooses to answer (correctly or incorrectly). The system chooses which questions to answer based on an estimated confidence score: for a given threshold, the system will answer all questions with confidence scores above that threshold. The threshold controls the trade-off between precision and percent answered, assuming reasonable confidence estimation. For higher thresholds the system will be more conservative, answering fewer questions with higher precision. For lower thresholds, it will be more aggressive, answering more questions with lower precision. Accuracy refers to the precision if all questions are answered.
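
Both metrics follow directly from these definitions. Here is a small sketch that sweeps a confidence threshold over a toy list of (confidence, correct) results and reports the resulting precision and percent answered.

```python
# Precision vs. percent answered for a toy system: at each confidence
# threshold, answer only the questions whose confidence clears it.

results = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
           (0.60, False), (0.40, True), (0.30, False), (0.10, False)]

def curve(results, thresholds):
    points = []
    for t in thresholds:
        attempted = [correct for conf, correct in results if conf >= t]
        if attempted:
            precision = sum(attempted) / len(attempted)
            percent_answered = len(attempted) / len(results)
            points.append((t, percent_answered, precision))
    return points

for t, pa, p in curve(results, [0.9, 0.7, 0.5, 0.0]):
    print(f"threshold {t:.1f}: answered {pa:.0%}, precision {p:.0%}")
```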

Figure 2 shows a plot of precision versus percent attempted curves for two theoretical systems. It is obtained by evaluating the two systems over a range of confidence thresholds. Both systems have 40 percent accuracy, meaning they get 40 percent of all questions correct. They differ only in their confidence estimation. The upper line represents an ideal system with perfect confidence estimation. Such a system would identify exactly which questions it gets right and wrong and give higher confidence to those it got right. As can be seen in the graph, if such a system were to answer the 50 percent of questions it had highest confidence for, it would get 80 percent of those correct. We refer to this level of performance as 80 percent precision at 50 percent answered. The lower line represents a system without meaningful confidence estimation. Since it cannot distinguish between which questions it is more or less likely to get correct, its precision is constant for all percent attempted. Developing more accurate confidence estimation means a system can deliver far higher precision even with the same overall accuracy.

Figure 2

Figure 2. Precision Versus Percentage Attempted.
Perfect confidence estimation (upper line) and no confidence estimation (lower line).

The Competition: Human Champion Performance

A compelling and scientifically appealing aspect of the Jeopardy Challenge is the human reference point. Figure 3 contains a graph that illustrates expert human performance on Jeopardy. It is based on our analysis of nearly 2000 historical Jeopardy games. Each point on the graph represents the performance of the winner in one Jeopardy game. As in figure 2, the x-axis of the graph, labeled “% Answered,” represents the percentage of questions the winner answered, and the y-axis of the graph, labeled “Precision,” represents the percentage of those questions the winner answered correctly.

Figure 3

Figure 3. Champion Human Performance at Jeopardy.

In contrast to the system evaluation shown in figure 2, which can display a curve over a range of confidence thresholds, the human performance shows only a single point per game based on the observed precision and percent answered the winner demonstrated in the game. A further distinction is that in these historical games the human contestants did not have the liberty to answer all questions they wished. Rather the percent answered consists of those questions for which the winner was confident and fast enough to beat the competition to the buzz. The system performance graphs shown in this paper are focused on evaluating QA performance, and so do not take into account competition for the buzz. Human performance helps to position our system’s performance, but obviously, in a Jeopardy game, performance will be affected by competition for the buzz and this will depend in large part on how quickly a player can compute an accurate confidence and how the player manages risk.

The center of what we call the “Winners Cloud” (the set of light gray dots in the graph in figures 3 and 4) reveals that Jeopardy champions are confident and fast enough to acquire on average between 40 percent and 50 percent of all the questions from their competitors and to perform with between 85 percent and 95 percent precision.

Figure 4

Figure 4. Baseline Performance.

The darker dots on the graph represent Ken Jennings’s games. Ken Jennings had an unequaled winning streak in 2004, in which he won 74 games in a row. Based on our analysis of those games, he acquired on average 62 percent of the questions and answered with 92 percent precision. Human performance at this task sets a very high bar for precision, confidence, speed, and breadth.

Baseline Performance

Our metrics and baselines are intended to give us confidence that new methods and algorithms are improving the system or to inform us when they are not so that we can adjust research priorities.

Our most obvious baseline is the QA system called Practical Intelligent Question Answering Technology (PIQUANT) (Prager, Chu-Carroll, and Czuba 2004), which had been under development at IBM Research by a four-person team for 6 years prior to taking on the Jeopardy Challenge. At the time it was among the top three to five Text Retrieval Conference (TREC) QA systems. Developed in part under the U.S. government AQUAINT program3 and in collaboration with external teams and universities, PIQUANT was a classic QA pipeline with state-of-the-art techniques aimed largely at the TREC QA evaluation (Voorhees and Dang 2005). PIQUANT performed in the 33 percent accuracy range in TREC evaluations. While the TREC QA evaluation allowed the use of the web, PIQUANT focused on question answering using local resources. A requirement of the Jeopardy Challenge is that the system be self-contained and does not link to live web search.

The requirements of the TREC QA evaluation were different from those of the Jeopardy challenge. Most notably, TREC participants were given a relatively small corpus (1M documents) from which answers to questions must be justified; TREC questions were in a much simpler form compared to Jeopardy questions, and the confidences associated with answers were not a primary metric. Furthermore, the systems were allowed to access the web and had a week to produce results for 500 questions. The reader can find details in the TREC proceedings and numerous follow-on publications.

An initial 4-week effort was made to adapt PIQUANT to the Jeopardy Challenge. The experiment focused on precision and confidence. It ignored issues of answering speed and aspects of the game like betting and clue values.

The questions used were 500 randomly sampled Jeopardy clues from episodes in the past 15 years. The corpus that was used contained, but did not necessarily justify, answers to more than 90 percent of the questions. The result of the PIQUANT baseline experiment is illustrated in figure 4. As shown, on the 5 percent of the clues that PIQUANT was most confident in (left end of the curve), it delivered 47 percent precision, and over all the clues in the set (right end of the curve), its precision was 13 percent. Clearly the precision and confidence estimation are far below the requirements of the Jeopardy Challenge.

A similar baseline experiment was performed in collaboration with Carnegie Mellon University (CMU) using OpenEphyra,5 an open-source QA framework developed primarily at CMU. The framework is based on the Ephyra system, which was designed for answering TREC questions. In our experiments on TREC 2002 data, OpenEphyra answered 45 percent of the questions correctly using a live web search.

We spent minimal effort adapting OpenEphyra, but like PIQUANT, its performance on Jeopardy clues was below 15 percent accuracy. OpenEphyra did not produce reliable confidence estimates and thus could not effectively choose to answer questions with higher confidence. Clearly a larger investment in tuning and adapting these baseline systems to Jeopardy would improve their performance; however, we limited this investment since we did not want the baseline systems to become significant efforts.

The PIQUANT and OpenEphyra baselines demonstrate the performance of state-of-the-art QA systems on the Jeopardy task. In figure 5 we show two other baselines that demonstrate the performance of two complementary approaches on this task. The light gray line shows the performance of a system based purely on text search, using terms in the question as queries and search engine scores as confidences for candidate answers generated from retrieved document titles. The black line shows the performance of a system based on structured data, which attempts to look the answer up in a database by simply finding the named entities in the database related to the named entities in the clue. These two approaches were adapted to the Jeopardy task, including identifying and integrating relevant content.

Figure 5

Figure 5. Text Search Versus Knowledge Base Search.

The results form an interesting comparison. The search-based system has better performance at 100 percent answered, suggesting that the natural language content and the shallow text search techniques delivered better coverage. However, the flatness of the curve indicates the lack of accurate confidence estimation.6 The structured approach had better informed confidence when it was able to decipher the entities in the question and found the right matches in its structured knowledge bases, but its coverage quickly drops off when asked to answer more questions. To be a high-performing question-answering system, DeepQA must demonstrate both these properties to achieve high precision, high recall, and an accurate confidence estimation.

The DeepQA Approach

Early on in the project, attempts to adapt PIQUANT (Chu-Carroll et al. 2003) failed to produce promising results. We devoted many months of effort to encoding algorithms from the literature. Our investigations ran the gamut from deep logical form analysis to shallow machine-translation-based approaches. We integrated them into the standard QA pipeline that went from question analysis and answer type determination to search and then answer selection. It was difficult, however, to find examples of how published research results could be taken out of their original context and effectively replicated and integrated into different end-to-end systems to produce comparable results. Our efforts failed to have significant impact on Jeopardy or even on prior baseline studies using TREC data.

We ended up overhauling nearly everything we did, including our basic technical approach, the underlying architecture, metrics, evaluation protocols, engineering practices, and even how we worked together as a team. We also, in cooperation with CMU, began the Open Advancement of Question Answering (OAQA) initiative. OAQA is intended to directly engage researchers in the community to help replicate and reuse research results and to identify how to more rapidly advance the state of the art in QA (Ferrucci et al 2009).

As our results dramatically improved, we observed that system-level advances allowing rapid integration and evaluation of new ideas and new components against end-to-end metrics were essential to our progress. This was echoed at the OAQA workshop for experts with decades of investment in QA, hosted by IBM in early 2008. Among the workshop conclusions was that QA would benefit from the collaborative evolution of a single extensible architecture that would allow component results to be consistently evaluated in a common technical context against a growing variety of what were called “Challenge Problems.” Different challenge problems were identified to address various dimensions of the general QA problem. Jeopardy was described as one addressing dimensions including high precision, accurate confidence determination, complex language, breadth of domain, and speed.

The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. What is far more important than any particular technique we use is how we combine them in DeepQA such that overlapping approaches can bring their strengths to bear and contribute to improvements in accuracy, confidence, or speed.

DeepQA is an architecture with an accompanying methodology, but it is not specific to the Jeopardy Challenge. We have successfully applied DeepQA to both the Jeopardy and TREC QA task. We have begun adapting it to different business applications and additional exploratory challenge problems including medicine, enterprise search, and gaming.

The overarching principles in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and integration of shallow and deep knowledge.

Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.

Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.

Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores.

Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.

Figure 6 illustrates the DeepQA architecture at a very high level. The remaining parts of this section provide a bit more detail about the various architectural roles.

Figure 6

Figure 6. DeepQA High-Level Architecture.

Content Acquisition

The first step in any application of DeepQA to solve a QA problem is content acquisition, or identifying and gathering the content to use for the answer and evidence sources shown in figure 6.

Content acquisition is a combination of manual and automatic steps. The first step is to analyze example questions from the problem space to produce a description of the kinds of questions that must be answered and a characterization of the application domain. Analyzing example questions is primarily a manual task, while domain analysis may be informed by automatic or statistical analyses, such as the LAT analysis shown in figure 1. Given the kinds of questions and broad domain of the Jeopardy Challenge, the sources for Watson include a wide range of encyclopedias, dictionaries, thesauri, newswire articles, literary works, and so on.

Given a reasonable baseline corpus, DeepQA then applies an automatic corpus expansion process. The process involves four high-level steps: (1) identify seed documents and retrieve related documents from the web; (2) extract self-contained text nuggets from the related web documents; (3) score the nuggets based on whether they are informative with respect to the original seed document; and (4) merge the most informative nuggets into the expanded corpus. The live system itself uses this expanded corpus and does not have access to the web during play.
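
A compressed sketch of those four steps, with placeholder functions standing in for the web retrieval and nugget-scoring components (the real pipeline is far more elaborate):

```python
# Skeleton of the four corpus-expansion steps described above. The retrieval
# and scoring functions are placeholders, not the actual DeepQA components.

def retrieve_related(seed_doc):
    # Step 1: in the real system, a web search seeded by the document.
    return ["related document text 1", "related document text 2"]

def extract_nuggets(document):
    # Step 2: split into self-contained text nuggets (here: sentences).
    return [s.strip() for s in document.split(".") if s.strip()]

def informativeness(nugget, seed_doc):
    # Step 3: placeholder score -- word overlap with the seed document.
    seed_words = set(seed_doc.lower().split())
    return len(seed_words & set(nugget.lower().split()))

def expand_corpus(seed_docs, keep_per_seed=5):
    expanded = list(seed_docs)
    for seed in seed_docs:
        nuggets = [n for doc in retrieve_related(seed)
                   for n in extract_nuggets(doc)]
        nuggets.sort(key=lambda n: informativeness(n, seed), reverse=True)
        expanded.extend(nuggets[:keep_per_seed])   # Step 4: merge best nuggets
    return expanded

print(len(expand_corpus(["Toronto is the largest city in Ontario."])))
```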

In addition to the content for the answer and evidence sources, DeepQA leverages other kinds of semistructured and structured content. Another step in the content-acquisition process is to identify and collect these resources, which include databases, taxonomies, and ontologies, such as DBpedia, WordNet (Miller 1995), and the YAGO ontology.

Question Analysis

The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system. The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering. Most of these technologies are well understood and are not discussed here, but a few require some elaboration.

Question Classification.

Question classification is the task of identifying question types or parts of questions that require special processing. This can include anything from single words with potentially double meanings to entire clauses that have certain syntactic, semantic, or rhetorical functionality that may inform downstream components with their analysis. Question classification may identify a question as a puzzle question, a math question, a definition question, and so on. It will identify puns, constraints, definition components, or entire subclues within questions.

Focus and LAT Detection.

As discussed earlier, a lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics. Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors. An advantage to the DeepQA approach is to exploit many independently developed answer-typing algorithms. However, many of these algorithms are dependent on their own type systems. We found the best way to integrate preexisting components is not to force them into a single, common type system, but to have them map from the LAT to their own internal types.

The focus of the question is the part of the question that, if replaced by the answer, makes the question a stand-alone statement. Looking back at some of the examples shown previously, the focus of “When hit by electrons, a phosphor gives off electromagnetic energy in this form” is “this form”; the focus of “Secretary Chase just submitted this to me for the third time; guess what, pal. This time I’m accepting it” is the first “this”; and the focus of “This title character was the crusty and tough city editor of the Los Angeles Tribune” is “This title character.” The focus often (but not always) contains useful information about the answer, is often the subject or object of a relation in the clue, and can turn a question into a factual statement when replaced with a candidate, which is a useful way to gather evidence about a candidate.
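
The substitution trick is easy to show in isolation; here is a toy sketch (with the focus given by hand rather than detected by a parser) that turns a clue into a declarative statement and checks it against an evidence passage with a crude word-overlap test.

```python
# Turn a clue into a checkable statement by substituting a candidate answer
# for the focus phrase, then look for the statement's words in an evidence
# passage. The focus is supplied by hand here, not detected automatically.

def substitute_focus(clue, focus, candidate):
    return clue.replace(focus, candidate, 1)

def supported(statement, passage):
    missing = set(statement.lower().split()) - set(passage.lower().split())
    return len(missing) <= 2      # crude overlap test, for illustration only

clue = "When hit by electrons, a phosphor gives off electromagnetic energy in this form"
statement = substitute_focus(clue, "this form", "light")
passage = "When hit by electrons, a phosphor gives off electromagnetic energy as light"
print(statement)
print(supported(statement, passage))
```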

Relation Detection.

Most questions contain relations, whether they are syntactic subject-verb-object predicates or semantic relationships between entities. For example, in the question, “They’re the two states you could be reentering if you’re crossing Florida’s northern border,” we can detect the relation borders(Florida,?x,north).

Watson uses relation detection throughout the QA process, from focus and LAT determination, to passage and answer scoring. Watson can also use detected relations to query a triple store and directly generate candidate answers. Due to the breadth of relations in the Jeopardy domain and the variety of ways in which they are expressed, however, Watson’s current ability to effectively use curated databases to simply “look up” the answers is limited to fewer than 2 percent of the clues.
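
For the small fraction of clues where a relation is detected, the lookup itself is simple to picture; here is a toy sketch using an in-memory list of invented (subject, predicate, object, qualifier) facts in place of a real triple store queried with SPARQL.

```python
# Toy lookup for a detected relation such as borders(Florida, ?x, north):
# scan an in-memory list of (subject, predicate, object, qualifier) facts.
# The facts are invented; a real system would query a triple store.

TRIPLES = [
    ("Florida", "borders", "Georgia", "north"),
    ("Florida", "borders", "Alabama", "north"),
    ("Florida", "borders", "Gulf of Mexico", "west"),
]

def lookup(subject, predicate, qualifier=None):
    return [obj for s, p, obj, q in TRIPLES
            if s == subject and p == predicate
            and (qualifier is None or q == qualifier)]

# Candidate answers for "the two states you could be reentering if you're
# crossing Florida's northern border":
print(lookup("Florida", "borders", qualifier="north"))
```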

Watson’s use of existing databases depends on the ability to analyze the question and detect the relations covered by the databases. In Jeopardy the broad domain makes it difficult to identify the most lucrative relations to detect. In 20,000 Jeopardy questions, for example, we found the distribution of Freebase relations to be extremely flat (figure 7). Roughly speaking, even achieving high recall on detecting the most frequent relations in the domain can at best help in about 25 percent of the questions, and the benefit of relation detection drops off fast with the less frequent relations. Broad-domain relation detection remains a major open area of research.

Figure 7. Approximate Distribution of the 50 Most Frequently Occurring Freebase Relations in 20,000 Randomly Selected Jeopardy Clues.

Decomposition.

As discussed above, an important requirement driven by analysis of Jeopardy clues was the ability to handle questions that are better answered through decomposition. DeepQA uses rule-based deep parsing and statistical classification methods both to recognize whether questions should be decomposed and to determine how best to break them up into subquestions. The operating hypothesis is that the correct question interpretation and derived answer(s) will score higher after all the collected evidence and all the relevant algorithms have been considered. Even if the question did not need to be decomposed to determine an answer, this method can help improve the system’s overall answer confidence.

DeepQA solves parallel decomposable questions through application of the end-to-end QA system on each subclue and synthesizes the final answers by a customizable answer combination component. These processing paths are shown in medium gray in figure 6. DeepQA also supports nested decomposable questions through recursive application of the end-to-end QA system to the inner subclue and then to the outer subclue. The customizable synthesis components allow specialized synthesis algorithms to be easily plugged into a common framework.
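
The control flow (not the actual DeepQA components) can be sketched as follows: parallel subclues are answered independently and synthesized, while nested subclues are answered inner-first; answer_question and the synthesis rule are placeholders.

```python
def answer_question(clue):
    """Placeholder for a full end-to-end QA run returning (answer, confidence)."""
    canned = {"subclue A": ("Georgia", 0.8), "subclue B": ("Alabama", 0.7)}
    return canned.get(clue, ("unknown", 0.1))

def answer_parallel(subclues, synthesize):
    """Answer each subclue independently, then combine the results."""
    results = [answer_question(c) for c in subclues]
    return synthesize(results)

def answer_nested(inner_clue, build_outer_clue):
    """Answer the inner subclue first, then plug its answer into the outer clue."""
    inner_answer, _ = answer_question(inner_clue)
    return answer_question(build_outer_clue(inner_answer))

# Example synthesis rule for a two-part clue asking for two states:
both_states = answer_parallel(
    ["subclue A", "subclue B"],
    synthesize=lambda rs: (" & ".join(a for a, _ in rs), min(c for _, c in rs)),
)
print(both_states)   # -> ('Georgia & Alabama', 0.7)
```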

Hypothesis Generation

Hypothesis generation takes the results of question analysis and produces candidate answers by searching the system’s sources and extracting answer-sized snippets from the search results. Each candidate answer plugged back into the question is considered a hypothesis, which the system has to prove correct with some degree of confidence.

We refer to search performed in hypothesis generation as “primary search” to distinguish it from search performed during evidence gathering (described below). As with all aspects of DeepQA, we use a mixture of different approaches for primary search and candidate generation in the Watson system.

Primary Search.

In primary search the goal is to find as much potentially answer-bearing content as possible based on the results of question analysis. The focus is squarely on recall, with the expectation that the host of deeper content analytics will extract answer candidates and score this content, plus whatever evidence can be found in support or refutation of the candidates, to drive up precision. Over the course of the project we continued to conduct empirical studies designed to balance speed, recall, and precision. These studies allowed us to regularly tune the system to find the number of search results and candidates that produced the best balance of accuracy and computational resources. The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top 250 candidates; that is, the system generates the correct answer as a candidate answer for 85 percent of the questions somewhere within the top 250 ranked candidates.
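
Binary recall at the top k candidates is straightforward to compute; the sketch below shows the metric on toy data (with k = 2 for readability), under the assumption that each question comes with a single gold answer string.

```python
def binary_recall_at_k(runs, k=250):
    """runs: list of (gold_answer, ranked_candidates). Returns the fraction of
    questions whose gold answer appears anywhere in the top-k candidates."""
    hits = sum(1 for gold, cands in runs if gold in cands[:k])
    return hits / len(runs)

# Toy example with k=2 for readability; Watson's operating point was
# roughly 85 percent binary recall at k=250.
runs = [
    ("Nixon", ["Ford", "Nixon", "Agnew"]),
    ("Chile", ["Bolivia", "Peru", "Chile"]),
]
print(binary_recall_at_k(runs, k=2))   # -> 0.5
```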

A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.

Triple store queries in primary search are based on named entities in the clue (for example, finding all database entities related to the clue entities) or on more focused queries in cases where a semantic relation was detected. For a small number of LATs that we identified as “closed LATs,” the candidate answer can be generated from a fixed list in some store of known instances of the LAT, such as “U.S. President” or “Country.”

Candidate Answer Generation.

The search results feed into candidate generation, where techniques appropriate to the kind of search results are applied to generate candidate answers. For document search results from “title-oriented” resources, the title is extracted as a candidate answer. The system may generate a number of candidate answer variants from the same title based on substring analysis or link analysis (if the underlying source contains hyperlinks). Passage search results require more detailed analysis of the passage text to identify candidate answers. For example, named entity detection may be used to extract candidate answers from the passage. Some sources, such as a triple store and reverse dictionary lookup, produce candidate answers directly as their search result.
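
Two of these tactics can be sketched as follows; the capitalized-span extractor is a crude stand-in for real named entity detection, and the data is invented.

```python
import re

def candidates_from_titles(doc_hits):
    """For title-oriented sources (e.g., encyclopedia articles), the document
    title itself is a candidate answer."""
    return [hit["title"] for hit in doc_hits]

def candidates_from_passage(passage):
    """Crude stand-in for named entity detection: capitalized word sequences.
    A real system would run proper NER over the passage."""
    return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*", passage)

doc_hits = [{"title": "Richard Nixon"}, {"title": "Gerald Ford"}]
passage = "Ford pardoned Nixon on Sept. 8, 1974."
print(candidates_from_titles(doc_hits))   # -> ['Richard Nixon', 'Gerald Ford']
print(candidates_from_passage(passage))   # -> ['Ford', 'Nixon', 'Sept']
```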

If the correct answer(s) are not generated at this stage as a candidate, the system has no hope of answering the question. This step therefore significantly favors recall over precision, with the expectation that the rest of the processing pipeline will tease out the correct answer, even if the set of candidates is quite large. One of the goals of the system design, therefore, is to tolerate noise in the early stages of the pipeline and drive up precision downstream.

Watson generates several hundred candidate answers at this stage.

Soft Filtering

A key step in managing the resource versus precision trade-off is the application of lightweight (less resource intensive) scoring algorithms to a larger set of initial candidates to prune them down to a smaller set of candidates before the more intensive scoring components see them. For example, a lightweight scorer may compute the likelihood of a candidate answer being an instance of the LAT. We call this step soft filtering.

The system combines these lightweight analysis scores into a soft filtering score. Candidate answers that pass the soft filtering threshold proceed to hypothesis and evidence scoring, while those candidates that do not pass the filtering threshold are routed directly to the final merging stage. The soft filtering scoring model and filtering threshold are determined based on machine learning over training data.

Watson currently lets roughly 100 candidates pass the soft filter, but this is a parameterizable function.
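
A minimal sketch of the soft-filtering idea, with hand-set weights and threshold standing in for the values Watson learns from training data: lightweight feature scores are combined into one score, survivors go on to deep evidence scoring, and the rest are routed to final merging.

```python
import math

def soft_filter_score(features, weights, bias=0.0):
    """Logistic combination of lightweight feature scores (illustrative weights)."""
    z = bias + sum(w * features[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))

WEIGHTS = {"lat_match": 2.0, "search_rank": 1.0}   # assumed, not learned here
THRESHOLD = 0.5                                    # learned from data in Watson

candidates = {
    "Nixon": {"lat_match": 1.0, "search_rank": 0.9},
    "1974":  {"lat_match": 0.0, "search_rank": 0.4},
}
passed, deferred = [], []
for cand, feats in candidates.items():
    score = soft_filter_score(feats, WEIGHTS, bias=-1.5)
    (passed if score >= THRESHOLD else deferred).append(cand)

print(passed)     # -> ['Nixon']   (goes on to deep evidence scoring)
print(deferred)   # -> ['1974']    (routed directly to final merging)
```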

Hypothesis and Evidence Scoring

Candidate answers that pass the soft filtering threshold undergo a rigorous evaluation process that involves gathering additional supporting evidence for each candidate answer, or hypothesis, and applying a wide variety of deep scoring analytics to evaluate the supporting evidence.

Evidence Retrieval.

To better evaluate each candidate answer that passes the soft filter, the system gathers additional supporting evidence. The architecture supports the integration of a variety of evidence-gathering techniques. One particularly effective technique is passage search where the candidate answer is added as a required term to the primary search query derived from the question. This will retrieve passages that contain the candidate answer used in the context of the original question terms. Supporting evidence may also come from other sources like triple stores. The retrieved supporting evidence is routed to the deep evidence scoring components, which evaluate the candidate answer in the context of the supporting evidence.
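
A small sketch of that query construction, using a generic required-term syntax rather than any particular search engine's: the candidate is appended as a mandatory phrase alongside the question-derived terms.

```python
def evidence_query(question_terms, candidate):
    """Build a passage-search query that requires the candidate answer to
    co-occur with the question terms (generic '+term' syntax, illustrative)."""
    return " ".join(question_terms) + ' +"' + candidate + '"'

question_terms = ["presidentially", "pardoned", "September", "8", "1974"]
print(evidence_query(question_terms, "Nixon"))
# -> presidentially pardoned September 8 1974 +"Nixon"
```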

Scoring.

The scoring step is where the bulk of the deep content analysis is performed. Scoring algorithms determine the degree of certainty that retrieved evidence supports the candidate answers. The DeepQA framework supports and encourages the inclusion of many different components, or scorers, that consider different dimensions of the evidence and produce a score that corresponds to how well evidence supports a candidate answer for a given question.

DeepQA provides a common format for the scorers to register hypotheses (for example candidate answers) and confidence scores, while imposing few restrictions on the semantics of the scores themselves; this enables DeepQA developers to rapidly deploy, mix, and tune components to support each other. For example, Watson employs more than 50 scoring components that produce scores ranging from formal probabilities to counts to categorical features, based on evidence from different types of sources including unstructured text, semistructured text, and triple stores. These scorers consider things like the degree of match between a passage’s predicate-argument structure and the question, passage source reliability, geospatial location, temporal relationships, taxonomic classification, the lexical and semantic relations the candidate is known to participate in, the candidate’s correlation with question terms, its popularity (or obscurity), its aliases, and so on.

Consider the question, “He was presidentially pardoned on September 8, 1974”; the correct answer, “Nixon,” is one of the generated candidates. One of the retrieved passages is “Ford pardoned Nixon on Sept. 8, 1974.” One passage scorer counts the number of IDF-weighted terms in common between the question and the passage. Another passage scorer, based on the Smith-Waterman sequence-matching algorithm (Smith and Waterman 1981), measures the lengths of the longest similar subsequences between the question and passage (for example, “on Sept. 8, 1974”). A third type of passage scoring measures the alignment of the logical forms of the question and passage. A logical form is a graphical abstraction of text in which nodes are terms in the text and edges represent either grammatical relationships (for example, Hermjakob, Hovy, and Lin [2000]; Moldovan et al. [2003]), deep semantic relationships (for example, Lenat [1995], Paritosh and Forbus [2005]), or both. The logical form alignment identifies that Nixon is the object of the pardoning in the passage and that the question is asking for the object of a pardoning. Logical form alignment gives “Nixon” a good score given this evidence. In contrast, a candidate answer like “Ford” would receive nearly identical scores to “Nixon” for term matching and passage alignment with this passage, but would receive a lower logical form alignment score.
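
The first two scorer types can be sketched as follows; the IDF table, scoring constants, and tokenization are simplified assumptions, not Watson's implementations.

```python
import math

def idf_weighted_overlap(question, passage, doc_freq, n_docs):
    """Sum of IDF weights of terms shared by question and passage."""
    q_terms = set(question.lower().split())
    p_terms = set(passage.lower().split())
    return sum(math.log(n_docs / (1 + doc_freq.get(t, 0)))
               for t in q_terms & p_terms)

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment score over token sequences (simplified constants)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

question = "He was presidentially pardoned on Sept. 8 , 1974"
passage = "Ford pardoned Nixon on Sept. 8 , 1974"
doc_freq = {"pardoned": 50, "sept.": 2000, "1974": 5000,
            "on": 900000, "8": 800000, ",": 990000}     # invented counts
print(round(idf_weighted_overlap(question, passage, doc_freq, 1_000_000), 2))
print(smith_waterman(question.lower().split(), passage.lower().split()))
# The alignment score is highest for the shared run "on sept. 8 , 1974".
```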

Another type of scorer uses knowledge in triple stores, applying simple reasoning such as subsumption and disjointness in type taxonomies as well as geospatial and temporal reasoning. Geospatial reasoning is used in Watson to detect the presence or absence of spatial relations such as directionality, borders, and containment between geoentities. For example, if a question asks for an Asian city, then spatial containment provides evidence that Beijing is a suitable candidate, whereas Sydney is not. Similarly, geocoordinate information associated with entities is used to compute relative directionality (for example, California is SW of Montana; GW Bridge is N of Lincoln Tunnel, and so on).
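
A toy sketch of the directionality computation from geocoordinates (approximate coordinates, crude compass logic; real geospatial reasoning also covers borders and containment):

```python
# Approximate (latitude, longitude) pairs; toy values for illustration.
COORDS = {
    "California": (36.8, -119.4),
    "Montana":    (46.9, -110.4),
}

def relative_direction(a, b, coords=COORDS):
    """Very crude compass direction of a relative to b from lat/lon deltas."""
    (lat_a, lon_a), (lat_b, lon_b) = coords[a], coords[b]
    ns = "N" if lat_a > lat_b else "S"
    ew = "E" if lon_a > lon_b else "W"
    return ns + ew

print(relative_direction("California", "Montana"))   # -> 'SW'
```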

Temporal reasoning is used in Watson to detect inconsistencies between dates in the clue and those associated with a candidate answer. For example, the two most likely candidate answers generated by the system for the clue, “In 1594 he took a job as a tax collector in Andalusia,” are “Thoreau” and “Cervantes.” In this case, temporal reasoning is used to rule out Thoreau as he was not alive in 1594, having been born in 1817, whereas Cervantes, the correct answer, was born in 1547 and died in 1616.
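
The date-consistency check can be sketched in a few lines, assuming a lookup table of lifespans:

```python
# Birth/death years for the two candidates in the example.
LIFESPANS = {
    "Thoreau":   (1817, 1862),
    "Cervantes": (1547, 1616),
}

def temporally_consistent(candidate, clue_year, lifespans=LIFESPANS):
    """True if the candidate was alive in the year mentioned in the clue."""
    born, died = lifespans[candidate]
    return born <= clue_year <= died

for cand in ("Thoreau", "Cervantes"):
    print(cand, temporally_consistent(cand, 1594))
# Thoreau False
# Cervantes True
```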

Each of the scorers implemented in Watson, how they work, how they interact, and their independent impact on Watson’s performance deserves its own research paper. We cannot do this work justice here. It is important to note, however, that at this point no one algorithm dominates. In fact we believe DeepQA’s facility for absorbing these algorithms, and the tools we have created for exploring their interactions and effects, will represent an important and lasting contribution of this work.

To help developers and users get a sense of how Watson uses evidence to decide between competing candidate answers, scores are combined into an overall evidence profile. The evidence profile groups individual features into aggregate evidence dimensions that provide a more intuitive view of the feature group. Aggregate evidence dimensions might include, for example, Taxonomic, Geospatial (location), Temporal, Source Reliability, Gender, Name Consistency, Relational, Passage Support, Theory Consistency, and so on. Each aggregate dimension is a combination of related feature scores produced by the specific algorithms that fired on the gathered evidence.
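
A minimal sketch of that roll-up, with an invented feature-to-dimension map and scores: individual feature scores are grouped by dimension and averaged into the profile.

```python
from collections import defaultdict

# Invented mapping from individual features to aggregate evidence dimensions.
FEATURE_DIMENSION = {
    "border_length_score": "Geospatial",
    "containment_score":   "Geospatial",
    "lf_alignment_score":  "Passage Support",
    "term_overlap_score":  "Passage Support",
    "source_rank_score":   "Source Reliability",
}

def evidence_profile(feature_scores):
    """Average each dimension's features into one aggregate value."""
    buckets = defaultdict(list)
    for feature, score in feature_scores.items():
        buckets[FEATURE_DIMENSION[feature]].append(score)
    return {dim: round(sum(v) / len(v), 2) for dim, v in buckets.items()}

argentina = {"border_length_score": 0.9, "containment_score": 0.8,
             "lf_alignment_score": 0.7, "term_overlap_score": 0.6,
             "source_rank_score": 0.8}
print(evidence_profile(argentina))
# -> {'Geospatial': 0.85, 'Passage Support': 0.65, 'Source Reliability': 0.8}
```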

Consider the following question: Chile shares its longest land border with this country. In figure 8 we see a comparison of the evidence profiles for two candidate answers produced by the system for this question: Argentina and Bolivia. Simple search engine scores favor Bolivia as an answer, due to a popular border dispute that was frequently reported in the news. Watson prefers Argentina (the correct answer) over Bolivia, and the evidence profile shows why. Although Bolivia does have strong popularity scores, Argentina has strong support in the geospatial, passage support (for example, alignment and logical form graph matching of various text passages), and source reliability dimensions.

Figure 8. Evidence Profiles for Two Candidate Answers. Dimensions are on the x-axis and relative strength is on the y-axis.

Final Merging and Ranking

It is one thing to return documents that contain key words from the question. It is quite another, however, to analyze the question and the content enough to identify the precise answer and yet another to determine an accurate enough confidence in its correctness to bet on it. Winning at Jeopardy requires exactly that ability.

The goal of final ranking and merging is to evaluate the hundreds of hypotheses based on potentially hundreds of thousands of scores to identify the single best-supported hypothesis given the evidence and to estimate its confidence — the likelihood it is correct.

Answer Merging

Multiple candidate answers for a question may be equivalent despite very different surface forms. This is particularly confusing to ranking techniques that make use of relative differences between candidates. Without merging, ranking algorithms would be comparing multiple surface forms that represent the same answer and trying to discriminate among them. While one line of research has been proposed based on boosting confidence in similar candidates (Ko, Nyberg, and Luo 2007), our approach is inspired by the observation that different surface forms are often disparately supported in the evidence and result in radically different, though potentially complementary, scores. This motivates an approach that merges answer scores before ranking and confidence estimation. Using an ensemble of matching, normalization, and coreference resolution algorithms, Watson identifies equivalent and related hypotheses (for example, Abraham Lincoln and Honest Abe) and then enables custom merging per feature to combine scores.
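
A toy sketch of the merging step, with an alias table standing in for the matching, normalization, and coreference machinery, and a per-feature max standing in for Watson's customizable per-feature merging:

```python
# Toy alias table standing in for matching/normalization/coreference output.
ALIASES = {
    "honest abe": "Abraham Lincoln",
    "abe lincoln": "Abraham Lincoln",
    "abraham lincoln": "Abraham Lincoln",
}

def canonical(answer):
    return ALIASES.get(answer.lower(), answer)

def merge_answers(scored_answers):
    """scored_answers: {surface_form: {feature: score}}.
    Merge equivalent surface forms; here each feature keeps its max score,
    standing in for per-feature custom merging."""
    merged = {}
    for surface, feats in scored_answers.items():
        target = merged.setdefault(canonical(surface), {})
        for feat, score in feats.items():
            target[feat] = max(target.get(feat, float("-inf")), score)
    return merged

scored = {
    "Honest Abe":      {"passage_support": 0.4, "type_match": 0.9},
    "Abraham Lincoln": {"passage_support": 0.8, "type_match": 0.7},
}
print(merge_answers(scored))
# -> {'Abraham Lincoln': {'passage_support': 0.8, 'type_match': 0.9}}
```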

Ranking and Confidence Estimation

After merging, the system must rank the hypotheses and estimate confidence based on their merged scores. We adopted a machine-learning approach that requires running the system over a set of training questions with known answers and training a model based on the scores. One could assume a very flat model and apply existing ranking algorithms (for example, Herbrich, Graepel, and Obermayer [2000]; Joachims [2002]) directly to these score profiles and use the ranking score for confidence. For more intelligent ranking, however, ranking and confidence estimation may be separated into two phases. In both phases sets of scores may be grouped according to their domain (for example type matching, passage scoring, and so on.) and intermediate models trained using ground truths and methods specific for that task. Using these intermediate models, the system produces an ensemble of intermediate scores. Motivated by hierarchical techniques such as mixture of experts (Jacobs et al. 1991) and stacked generalization (Wolpert 1992), a metalearner is trained over this ensemble. This approach allows for iteratively enhancing the system with more sophisticated and deeper hierarchical models while retaining flexibility for robustness and experimentation as scorers are modified and added to the system.

Watson’s metalearner uses multiple trained models to handle different question classes: for instance, certain scores that are crucial to identifying the correct answer for a factoid question may not be as useful on puzzle questions.
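
A minimal sketch of the two-phase idea on synthetic data, using scikit-learn as an assumed stand-in (the paper names no library): per-group intermediate models produce intermediate scores, and a metalearner over that ensemble yields the final confidence. Proper stacking would train the metalearner on held-out intermediate scores; that refinement is omitted here for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data: columns 0-2 are "type matching" features,
# columns 3-5 are "passage scoring" features; y marks correct candidates.
X = rng.normal(size=(200, 6))
y = (X[:, :3].sum(axis=1) + 0.5 * X[:, 3:].sum(axis=1) > 0).astype(int)

groups = {"type": slice(0, 3), "passage": slice(3, 6)}

# Phase 1: an intermediate model per feature group.
intermediate = {
    name: LogisticRegression().fit(X[:, cols], y) for name, cols in groups.items()
}

def intermediate_scores(features):
    return np.column_stack(
        [m.predict_proba(features[:, groups[name]])[:, 1]
         for name, m in intermediate.items()]
    )

# Phase 2: a metalearner over the ensemble of intermediate scores.
meta = LogisticRegression().fit(intermediate_scores(X), y)

candidate_features = rng.normal(size=(3, 6))          # three candidate answers
confidences = meta.predict_proba(intermediate_scores(candidate_features))[:, 1]
best = int(np.argmax(confidences))
print("ranked confidences:", np.round(confidences, 3), "best candidate:", best)
```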

Finally, an important consideration in dealing with NLP-based scorers is that the features they produce may be quite sparse, and so accurate confidence estimation requires the application of confidence-weighted learning techniques (Dredze, Crammer, and Pereira 2008).

Speed and Scaleout

DeepQA is developed using Apache UIMA (see note 10), a framework implementation of the Unstructured Information Management Architecture (Ferrucci and Lally 2004). UIMA was designed to support interoperability and scaleout of text and multimodal analysis applications. All of the components in DeepQA are implemented as UIMA annotators. These are software components that analyze text and produce annotations or assertions about the text. Watson has evolved over time and the number of components in the system has reached into the hundreds. UIMA facilitated rapid component integration, testing, and evaluation.

Early implementations of Watson ran on a single processor, where it took 2 hours to answer a single question. The DeepQA computation is embarrassingly parallel, however. UIMA-AS, part of Apache UIMA, enables the scaleout of UIMA applications using asynchronous messaging. We used UIMA-AS to scale Watson out over 2500 compute cores. UIMA-AS handles all of the communication, messaging, and queue management necessary using the open JMS standard. The UIMA-AS deployment of Watson enabled competitive run-time latencies in the 3–5 second range.
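
The embarrassingly parallel structure can be illustrated with Python's standard multiprocessing module rather than UIMA-AS; the per-candidate scoring function is a dummy placeholder.

```python
from multiprocessing import Pool

def score_candidate(candidate):
    """Stand-in for per-candidate evidence retrieval and deep scoring, which is
    independent across candidates and therefore easy to parallelize."""
    return candidate, sum(ord(c) for c in candidate) % 100 / 100.0  # dummy score

if __name__ == "__main__":
    candidates = ["Nixon", "Ford", "Agnew", "Kissinger"]
    with Pool(processes=4) as pool:       # UIMA-AS spread this over ~2,500 cores
        scored = pool.map(score_candidate, candidates)
    print(sorted(scored, key=lambda x: -x[1]))
```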

To preprocess the corpus and create fast run-time indices we used Hadoop (see note 11). UIMA annotators were easily deployed as mappers in the Hadoop map-reduce framework. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process.
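
A sketch of a Hadoop Streaming-style mapper for corpus preprocessing: documents arrive on standard input and tab-separated (document id, annotation) records are emitted on standard output. The annotation logic is a placeholder; in Watson the mappers wrapped UIMA annotators. Such a script would typically be supplied to the Hadoop Streaming jar as the mapper program.

```python
# mapper.py -- a Hadoop Streaming-style mapper: read documents from stdin,
# emit tab-separated (doc_id, annotation) records on stdout. The "annotation"
# here is a placeholder token count.
import sys

def annotate(text):
    return "token_count=%d" % len(text.split())

for line in sys.stdin:
    doc_id, _, text = line.rstrip("\n").partition("\t")
    if text:
        print("%s\t%s" % (doc_id, annotate(text)))
```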

Strategy

Jeopardy demands strategic game play to match wits against the best human players. In a typical Jeopardy game, Watson faces the following strategic decisions: deciding whether to buzz in and attempt to answer a question, selecting squares from the board, and wagering on Daily Doubles and Final Jeopardy.

The workhorse of strategic decisions is the buzz-in decision, which is required for every non–Daily Double clue on the board. This is where DeepQA’s ability to accurately estimate its confidence in its answer is critical, and Watson considers this confidence along with other game-state factors in making the final determination whether to buzz. Another strategic decision, Final Jeopardy wagering, generally receives the most attention and analysis from those interested in game strategy, and there exists a growing catalogue of heuristics such as “Clavin’s Rule” or the “Two-Thirds Rule” (Dupee 1998) as well as identification of those critical score boundaries at which particular strategies may be used (by no means does this make it easy or rote; despite this attention, we have found evidence that contestants still occasionally make irrational Final Jeopardy bets). Daily Double betting turns out to be less studied but just as challenging since the player must consider opponents’ scores and predict the likelihood of getting the question correct just as in Final Jeopardy. After a Daily Double, however, the game is not over, so evaluation of a wager requires forecasting the effect it will have on the distant, final outcome of the game.

These challenges drove the construction of statistical models of players and games, game-theoretic analyses of particular game scenarios and strategies, and the development and application of reinforcement-learning techniques for Watson to learn its strategy for playing Jeopardy. Fortunately, moderate amounts of historical data are available to serve as training data for learning techniques. Even so, it requires extremely careful modeling and game-theoretic evaluation as the game of Jeopardy has incomplete information and uncertainty to model, critical score boundaries to recognize, and savvy, competitive players to account for. It is a game where one faulty strategic choice can lose the entire match.
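
A toy sketch of the buzz-in decision only: buzz when answer confidence clears a game-state-dependent threshold. The threshold function below is invented for illustration; Watson's strategy is learned from simulation and historical data rather than hand-coded rules.

```python
def buzz_threshold(my_score, leader_score, clues_remaining):
    """Invented game-state-dependent threshold: when far behind late in the
    game, accept riskier buzzes; when ahead, demand higher confidence."""
    base = 0.50
    deficit = max(0, leader_score - my_score)
    urgency = deficit / max(1, clues_remaining * 1000)   # rough dollars-per-clue scale
    return max(0.30, base - 0.1 * urgency) if deficit else base + 0.10

def should_buzz(confidence, my_score, leader_score, clues_remaining):
    return confidence >= buzz_threshold(my_score, leader_score, clues_remaining)

print(should_buzz(0.55, my_score=8000, leader_score=20000, clues_remaining=10))   # True
print(should_buzz(0.55, my_score=20000, leader_score=8000, clues_remaining=10))   # False
```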

Status and Results

After approximately 3 years of effort by a core algorithmic team composed of 20 researchers and software engineers with a range of backgrounds in natural language processing, information retrieval, machine learning, computational linguistics, and knowledge representation and reasoning, we have driven the performance of DeepQA to operate within the winner’s cloud on the Jeopardy task, as shown in figure 9. Watson’s results illustrated in this figure were measured over blind test sets containing more than 2000 Jeopardy questions.

Figure 9. Watson’s Precision and Confidence Progress as of the Fourth Quarter 2009.

After many nonstarters, by the fourth quarter of 2007 we finally adopted the DeepQA architecture. At that point we had all moved out of our private offices and into a “war room” setting to dramatically facilitate team communication and tight collaboration. We instituted a host of disciplined engineering and experimental methodologies supported by metrics and tools to ensure we were investing in techniques that promised significant impact on end-to-end metrics. Since then, modulo some early jumps in performance, the progress has been incremental but steady. It is slowing in recent months as the remaining challenges prove either very difficult or highly specialized, covering only small phenomena in the data.

By the end of 2008 we were performing reasonably well — about 70 percent precision at 70 percent attempted over the 12,000 question blind data, but it was taking 2 hours to answer a single question on a single CPU. We brought on a team specializing in UIMA and UIMA-AS to scale up DeepQA on a massively parallel high-performance computing platform. We are currently answering more than 85 percent of the questions in 5 seconds or less — fast enough to provide competitive performance, and with continued algorithmic development are performing with about 85 percent precision at 70 percent attempted.

We have more to do in order to improve precision, confidence, and speed enough to compete with grand champions. We are seeing great results from leveraging the DeepQA architecture’s capability to quickly admit and evaluate the impact of new algorithms as we engage more university partnerships to help meet the challenge.

An Early Adaptation Experiment

Another challenge for DeepQA has been to demonstrate whether and how it can adapt to other QA tasks. In mid-2008, after we had populated the basic architecture with a host of components for searching, evidence retrieval, scoring, final merging, and ranking for the Jeopardy task, IBM collaborated with CMU to try to adapt DeepQA to the TREC QA problem by plugging in only select domain-specific components previously tuned to the TREC task. In particular, we added question-analysis components from PIQUANT and OpenEphyra that identify answer types for a question, and candidate answer-generation components that identify instances of those answer types in the text. The DeepQA framework utilized both sets of components despite their different type systems; no ontology integration was performed. The identification and integration of these domain-specific components into DeepQA took just a few weeks.

The extended DeepQA system was applied to TREC questions. Some of DeepQA’s answer and evidence scorers are more relevant in the TREC domain than in the Jeopardy domain and others are less relevant. We addressed this aspect of adaptation for DeepQA’s final merging and ranking by training an answer-ranking model using TREC questions; thus the extent to which each score affected the answer ranking and confidence was automatically customized for TREC.

Figure 10 shows the results of the adaptation experiment. Both the 2005 PIQUANT and 2007 OpenEphyra systems had less than 50 percent accuracy on the TREC questions and less than 15 percent accuracy on the Jeopardy clues. The DeepQA system at the time had accuracy above 50 percent on Jeopardy. Without adaptation DeepQA’s accuracy on TREC questions was about 35 percent. After adaptation, DeepQA’s accuracy on TREC exceeded 60 percent. We repeated the adaptation experiment in 2010, and in addition to the improvements to DeepQA since 2008, the adaptation included a transfer learning step for TREC questions from a model trained on Jeopardy questions. DeepQA’s performance on TREC data was 51 percent accuracy prior to adaptation and 67 percent after adaptation, nearly level with its performance on blind Jeopardy data.

Figure 10. Accuracy on Jeopardy! and TREC.

The adapted DeepQA system performed significantly better than the original complete systems on the task for which they were designed. While just one adaptation experiment, this is exactly the sort of behavior we think an extensible QA system should exhibit. It should quickly absorb domain- or task-specific components and get better on that target task without degradation in performance in the general case or on prior tasks.

Summary

The Jeopardy Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After 3 years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show.

Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.

The architecture and methodology developed as part of this project has highlighted the need to take a systems-level approach to research in QA, and we believe this applies to research in the broader field of AI. We have developed many different algorithms for addressing different kinds of problems in QA and plan to publish many of them in more detail in the future. However, no one algorithm solves challenge problems like this. End-to-end systems tend to involve many complex and often overlapping interactions. A system design and methodology that facilitated the efficient integration and ablation studies of many probabilistic components was essential for our success to date. The impact of any one algorithm on end-to-end performance changed over time as other techniques were added and had overlapping effects. Our commitment to regularly evaluate the effects of specific techniques on end-to-end performance, and to let that shape our research investment, was necessary for our rapid progress.

Rapid experimentation was another critical ingredient to our success. The team conducted more than 5500 independent experiments in 3 years — each averaging about 2000 CPU hours and generating more than 10 GB of error-analysis data. Without DeepQA’s massively parallel architecture and a dedicated high-performance computing infrastructure, we would not have been able to perform these experiments, and likely would not have even conceived of many of them.

Tuned for the Jeopardy Challenge, Watson has begun to compete against former Jeopardy players in a series of “sparring” games. It is holding its own, winning 64 percent of the games, but has to be improved and sped up to compete favorably against the very best.

We have leveraged our collaboration with CMU and with our other university partnerships in getting this far and hope to continue our collaborative work to drive Watson to its final goal, and help openly advance QA research.

Acknowledgements

We would like to acknowledge the talented team of research scientists and engineers at IBM and at partner universities, listed below, for the incredible work they are doing to influence and develop all aspects of Watson and the DeepQA architecture. It is this team who are responsible for the work described in this paper. From IBM, Andy Aaron, Einat Amitay, Branimir Boguraev, David Carmel, Arthur Ciccolo, Jaroslaw Cwiklik, Pablo Duboue, Edward Epstein, Raul Fernandez, Radu Florian, Dan Gruhl, Tong-Haing Fin, Achille Fokoue, Karen Ingraffea, Bhavani Iyer, Hiroshi Kanayama, Jon Lenchner, Anthony Levas, Burn Lewis, Michael McCord, Paul Morarescu, Matthew Mulholland, Yuan Ni, Miroslav Novak, Yue Pan, Siddharth Patwardhan, Zhao Ming Qiu, Salim Roukos, Marshall Schor, Dafna Sheinwald, Roberto Sicconi, Kohichi Takeda, Gerry Tesauro, Chen Wang, Wlodek Zadrozny, and Lei Zhang. From our academic partners, Manas Pathak (CMU), Chang Wang (University of Massachusetts [UMass]), Hideki Shima (CMU), James Allen (UMass), Ed Hovy (University of Southern California/Information Sciences Institute), Bruce Porter (University of Texas), Pallika Kanani (UMass), Boris Katz (Massachusetts Institute of Technology), Alessandro Moschitti, and Giuseppe Riccardi (University of Trento), Barbara Cutler, Jim Hendler, and Selmer Bringsjord (Rensselaer Polytechnic Institute).

Notes

1. Watson is named after IBM’s founder, Thomas J. Watson.

2. Random jitter has been added to help visualize the distribution of points.

3. www-nlpir.nist.gov/projects/aquaint.

4. trec.nist.gov/proceedings/proceedings.html.

5. sourceforge.net/projects/openephyra/.

6. The dip at the left end of the light gray curve is due to the disproportionately high score the search engine assigns to short queries, which typically are not sufficiently discriminative to retrieve the correct answer in top position.

7. dbpedia.org.

8. www.mpi-inf.mpg.de/yago-naga/yago.

9. freebase.com.

10. incubator.apache.org/uima.

11. hadoop.apache.org.

References

Chu-Carroll, J.; Czuba, K.; Prager, J. M.; and Ittycheriah, A. 2003. Two Heads Are Better Than One in Question-Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.

Dredze, M.; Crammer, K.; and Pereira, F. 2008. Confidence-Weighted Linear Classification. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML). Princeton, NJ: International Machine Learning Society.

Dupee, M. 1998. How to Get on Jeopardy! … and Win: Valuable Information from a Champion. Secaucus, NJ: Citadel Press.

Ferrucci, D., and Lally, A. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering 10(3–4): 327–348.

Ferrucci, D.; Nyberg, E.; Allan, J.; Barker, K.; Brown, E.; Chu-Carroll, J.; Ciccolo, A.; Duboue, P.; Fan, J.; Gondek, D.; Hovy, E.; Katz, B.; Lally, A.; McCord, M.; Morarescu, P.; Murdock, W.; Porter, B.; Prager, J.; Strzalkowski, T.; Welty, W.; and Zadrozny, W. 2009. Towards the Open Advancement of Question Answer Systems. IBM Technical Report RC24789, Yorktown Heights, NY.

Herbrich, R.; Graepel, T.; and Obermayer, K. 2000. Large Margin Rank Boundaries for Ordinal Regression. In Advances in Large Margin Classifiers, 115–132. Linköping, Sweden: Liu E-Press.

Hermjakob, U.; Hovy, E. H.; and Lin, C. 2000. Knowledge-Based Question Answering. In Proceedings of the Sixth World Multiconference on Systems, Cybernetics, and Informatics (SCI-2002). Winter Garden, FL: International Institute of Informatics and Systemics.

Hsu, F.-H. 2002. Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton, NJ: Princeton University Press.

Jacobs, R.; Jordan, M. I.; Nowlan, S. J.; and Hinton, G. E. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3(1): 79–87.

Joachims, T. 2002. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the Thirteenth ACM Conference on Knowledge Discovery and Data Mining (KDD). New York: Association for Computing Machinery.

Ko, J.; Nyberg, E.; and Luo Si, L. 2007. A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering. In Proceedings of the 30th Annual International ACM SIGIR Conference, 343–350. New York: Association for Computing Machinery.

Lenat, D. B. 1995. Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38(11): 33–38.

Maybury, Mark, ed. 2004. New Directions in Question-Answering. Menlo Park, CA: AAAI Press.

McCord, M. C. 1990. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In Natural Language and Logic: International Scientific Symposium. Lecture Notes in Computer Science 459. Berlin: Springer Verlag.

Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39–41.

Moldovan, D.; Clark, C.; Harabagiu, S.; and Maiorano, S. 2003. COGEX: A Logic Prover for Question Answering. Paper presented at the Human Language Technology Conference, Edmonton, Canada, 27 May–1 June.

Paritosh, P., and Forbus, K. 2005. Analysis of Strategic Knowledge in Back of the Envelope Reasoning. In Proceedings of the 20th AAAI Conference on Artificial Intelligence (AAAI-05). Menlo Park, CA: AAAI Press.

Prager, J. M.; Chu-Carroll, J.; and Czuba, K. 2004. A Multi-Strategy, Multi-Question Approach to Question Answering. In New Directions in Question-Answering, ed. M. Maybury. Menlo Park, CA: AAAI Press.

Simmons, R. F. 1970. Natural Language Question-Answering Systems: 1969. Communications of the ACM 13(1): 15–30.

Smith T. F., and Waterman M. S. 1981. Identification of Common Molecular Subsequences. Journal of Molecular Biology 147(1): 195–197.

Strzalkowski, T., and Harabagiu, S., eds. 2006. Advances in Open-Domain Question-Answering. Berlin: Springer.

Voorhees, E. M., and Dang, H. T. 2005. Overview of the TREC 2005 Question Answering Track. In Proceedings of the Fourteenth Text Retrieval Conference. Gaithersburg, MD: National Institute of Standards and Technology.

Wolpert, D. H. 1992. Stacked Generalization. Neural Networks 5(2): 241–259.

David Ferrucci is a research staff member and leads the Semantic Analysis and Integration department at the IBM T. J. Watson Research Center, Hawthorne, New York. Ferrucci is the principal investigator for the DeepQA/Watson project and the chief architect for UIMA, now an OASIS standard and Apache open-source project. Ferrucci’s background is in artificial intelligence and software engineering.

Eric Brown is a research staff member at the IBM T. J. Watson Research Center. His background is in information retrieval. Brown’s current research interests include question answering, unstructured information management architectures, and applications of advanced text analysis and question answering to information retrieval systems.

Jennifer Chu-Carroll is a research staff member at the IBM T. J. Watson Research Center. Chu-Carroll is on the editorial board of the Journal of Dialogue Systems, and previously served on the executive board of the North American Chapter of the Association for Computational Linguistics and as program cochair of HLT-NAACL 2006. Her research interests include question answering, semantic search, and natural language discourse and dialogue.

James Fan is a research staff member at IBM T. J. Watson Research Center. His research interests include natural language processing, question answering, and knowledge representation and reasoning. He has served as a program committee member for several top ranked AI conferences and journals, such as IJCAI and AAAI. He received his Ph.D. from the University of Texas at Austin in 2006.

David Gondek is a research staff member at the IBM T. J. Watson Research Center. His research interests include applications of machine learning, statistical modeling, and game theory to question answering and natural language processing. Gondek has contributed to journals and conferences in machine learning and data mining. He earned his Ph.D. in computer science from Brown University.

Aditya A. Kalyanpur is a research staff member at the IBM T. J. Watson Research Center. His primary research interests include knowledge representation and reasoning, natural language processing, and question answering. He has served on W3C working groups, as program cochair of an international semantic web workshop, and as a reviewer and program committee member for several AI journals and conferences. Kalyanpur completed his doctorate in AI and semantic web related research from the University of Maryland, College Park.

Adam Lally is a senior software engineer at IBM’s T. J. Watson Research Center. He develops natural language processing and reasoning algorithms for a variety of applications and is focused on developing scalable frameworks of NLP and reasoning systems. He is a lead developer and designer for the UIMA framework and architecture specification.

J. William Murdock is a research staff member at the IBM T. J. Watson Research Center. Before joining IBM, he worked at the United States Naval Research Laboratory. His research interests include natural-language semantics, analogical reasoning, knowledge-based planning, machine learning, and computational reflection. In 2001, he earned his Ph.D. in computer science from the Georgia Institute of Technology.

Eric Nyberg is a professor at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Nyberg’s research spans a broad range of text analysis and information retrieval areas, including question answering, search, reasoning, and natural language processing architectures, systems, and software engineering principles.

John Prager is a research staff member at the IBM T. J. Watson Research Center in Yorktown Heights, New York. His background includes natural-language based interfaces and semantic search, and his current interest is on incorporating user and domain models to inform question-answering. He is a member of the TREC program committee.

Nico Schlaefer is a Ph.D. student at the Language Technologies Institute in the School of Computer Science, Carnegie Mellon University and an IBM Ph.D. Fellow. His research focus is the application of machine learning techniques to natural language processing tasks. Schlaefer is the primary author of the OpenEphyra question answering system.

Chris Welty is a research staff member at the IBM Thomas J. Watson Research Center. His background is primarily in knowledge representation and reasoning. Welty’s current research focus is on hybridization of machine learning, natural language processing, and knowledge representation and reasoning in building AI systems.

[repost ] Stripping Away the Finery, Seeing Learning Plainly: A Discussion Series (褪去华衣 裸视学习 探讨系列)

original:http://www.guzili.com/?p=45204

A note up front:

With the permission of @老师木, this feature collects some of his distinctive views on AI/ML. If you insist on asking why these particular articles deserve a special collection, my answer is that, in my personal opinion, most of his insights match or even surpass those of a typical professor in the field. And if you go on to ask why the feature is called “褪去华衣 裸视学习” (“Stripping Away the Finery, Seeing Learning Plainly”), the answer is that these views strip away some of the mystique surrounding AI/ML and let us look at the discipline more objectively.

The feature is divided into six parts:

1) Machine Learning: Fundamentals

Stripping Away the Finery, Seeing Learning Plainly: The Gaussian Distribution

Stripping Away the Finery, Seeing Learning Plainly: The Sigmoid Function

Stripping Away the Finery, Seeing Learning Plainly: On “Bases”

Stripping Away the Finery, Seeing Learning Plainly: Stochastic Mathematics

Stripping Away the Finery, Seeing Learning Plainly: A Small Nod to Deep Learning

2) Machine Learning: Going Deeper

Stripping Away the Finery, Seeing Learning Plainly: Supervised Learning

Stripping Away the Finery, Seeing Learning Plainly: Linear Classifiers

Stripping Away the Finery, Seeing Learning Plainly: The SVM Myth

Stripping Away the Finery, Seeing Learning Plainly: Unsupervised Learning

Stripping Away the Finery, Seeing Learning Plainly: Probabilistic Graphical Models

3) Machine Learning: In Practice

Stripping Away the Finery, Seeing Learning Plainly: Some Thoughts on Human-Machine Dialogue

Stripping Away the Finery, Seeing Learning Plainly: Marr’s Computational Theory of Vision

Stripping Away the Finery, Seeing Learning Plainly: A Brief Note on Unsupervised Word Segmentation

Stripping Away the Finery, Seeing Learning Plainly: Game Theory and Advertising

4) Machine Learning: Reflections

Stripping Away the Finery, Seeing Learning Plainly: A Reading of “A Few Things About Machine Learning” (《机器学习那点事儿》)

Stripping Away the Finery, Seeing Learning Plainly: A Dissenting View on Academician Li Guojie’s Big Data Article

Stripping Away the Finery, Seeing Learning Plainly: The Case That Machine Learning Is Useless

Stripping Away the Finery, Seeing Learning Plainly: Machine Learning Is of Some Use

5) Machine Learning: Methodology

Stripping Away the Finery, Seeing Learning Plainly: Rules versus Statistics

Stripping Away the Finery, Seeing Learning Plainly: High-Quality Data

Stripping Away the Finery, Seeing Learning Plainly: Machine Learning Across Disciplines

Stripping Away the Finery, Seeing Learning Plainly: The Difference Between Probability and Statistics

6) Machine Learning: Extras

Stripping Away the Finery, Seeing Learning Plainly: Product Design

Stripping Away the Finery, Seeing Learning Plainly: Academia

Stripping Away the Finery, Seeing Learning Plainly: How to Learn

Stripping Away the Finery, Seeing Learning Plainly: Machine Learning Textbooks

[wiki ]Artificial intelligence

original:http://en.wikipedia.org/w/index.php?title=Artificial_intelligence&printable=yes

Artificial intelligence

From Wikipedia, the free encyclopedia

Artificial intelligence (AI) is the intelligence of machines and robots and the branch of computer science that aims to create it. AI textbooks define the field as “the study and design of intelligent agents”[1] where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.[2] John McCarthy, who coined the term in 1955,[3] defines it as “the science and engineering of making intelligent machines.”[4]

AI research is highly technical and specialized, deeply divided into subfields that often fail to communicate with each other.[5] Some of the division is due to social and cultural factors: subfields have grown up around particular institutions and the work of individual researchers. AI research is also divided by several technical issues. There are subfields which are focused on the solution of specific problems, on one of several possible approaches, on the use of widely differing tools and towards the accomplishment of particular applications. The central problems of AI include such traits as reasoning, knowledge, planning, learning, communication, perception and the ability to move and manipulate objects.[6] General intelligence (or “strong AI”) is still among the field’s long term goals.[7] Currently popular approaches include statistical methods, computational intelligence and traditional symbolic AI. There are an enormous number of tools used in AI, including versions of search and mathematical optimization, logic, methods based on probability and economics, and many others.

The field was founded on the claim that a central property of humans, intelligence—the sapience of Homo sapiens—can be so precisely described that it can be simulated by a machine.[8] This raises philosophical issues about the nature of the mind and the ethics of creating artificial beings, issues which have been addressed by myth, fiction and philosophy since antiquity.[9] Artificial intelligence has been the subject of optimism,[10] but has also suffered setbacks[11] and, today, has become an essential part of the technology industry, providing the heavy lifting for many of the most difficult problems in computer science.[12]

Contents

History

Thinking machines and artificial beings appear in Greek myths, such as Talos of Crete, the bronze robot of Hephaestus, and Pygmalion’s Galatea.[13] Human likenesses believed to have intelligence were built in every major civilization: animated cult images were worshipped in Egypt and Greece[14] and humanoid automatons were built by Yan Shi, Hero of Alexandria and Al-Jazari.[15] It was also widely believed that artificial beings had been created by Jābir ibn Hayyān, Judah Loew and Paracelsus.[16] By the 19th and 20th centuries, artificial beings had become a common feature in fiction, as in Mary Shelley’s Frankenstein or Karel Čapek’s R.U.R. (Rossum’s Universal Robots).[17] Pamela McCorduck argues that all of these are examples of an ancient urge, as she describes it, “to forge the gods”.[9] Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence.

Mechanical or “formal” reasoning has been developed by philosophers and mathematicians since antiquity. The study of logic led directly to the invention of the programmable digital electronic computer, based on the work of mathematician Alan Turing and others. Turing’s theory of computation suggested that a machine, by shuffling symbols as simple as “0” and “1”, could simulate any conceivable (imaginable) act of mathematical deduction.[18][19] This, along with concurrent discoveries in neurology, information theory and cybernetics, inspired a small group of researchers to begin to seriously consider the possibility of building an electronic brain.[20]

The field of AI research was founded at a conference on the campus of Dartmouth College in the summer of 1956.[21] The attendees, including John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon, became the leaders of AI research for many decades.[22] They and their students wrote programs that were, to most people, simply astonishing:[23] Computers were solving word problems in algebra, proving logical theorems and speaking English.[24] By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense[25] and laboratories had been established around the world.[26] AI’s founders were profoundly optimistic about the future of the new field: Herbert Simon predicted that “machines will be capable, within twenty years, of doing any work a man can do” and Marvin Minsky agreed, writing that “within a generation … the problem of creating ‘artificial intelligence’ will substantially be solved”.[27]

They had failed to recognize the difficulty of some of the problems they faced.[28] In 1974, in response to the criticism of Sir James Lighthill and ongoing pressure from the US Congress to fund more productive projects, both the U.S. and British governments cut off all undirected exploratory research in AI. The next few years, when funding for projects was hard to find, would later be called the “AI winter“.[29]

In the early 1980s, AI research was revived by the commercial success of expert systems,[30] a form of AI program that simulated the knowledge and analytical skills of one or more human experts. By 1985 the market for AI had reached over a billion dollars. At the same time, Japan’s fifth generation computer project inspired the U.S and British governments to restore funding for academic research in the field.[31] However, beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, longer lasting AI winter began.[32]

In the 1990s and early 21st century, AI achieved its greatest successes, albeit somewhat behind the scenes. Artificial intelligence is used for logistics, data mining, medical diagnosis and many other areas throughout the technology industry.[12] The success was due to several factors: the increasing computational power of computers (see Moore’s law), a greater emphasis on solving specific subproblems, the creation of new ties between AI and other fields working on similar problems, and a new commitment by researchers to solid mathematical methods and rigorous scientific standards.[33]

On 11 May 1997, Deep Blue became the first computer chess-playing system to beat a reigning world chess champion, Garry Kasparov.[34] In 2005, a Stanford robot won the DARPA Grand Challenge by driving autonomously for 131 miles along an unrehearsed desert trail.[35] Two years later, a team from CMU won the DARPA Urban Challenge when their vehicle autonomously navigated 55 miles in an urban environment while adhering to traffic hazards and all traffic laws.[36] In February 2011, in a Jeopardy! quiz show exhibition match, IBM’s question answering system, Watson, defeated the two greatest Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin.[37]

The leading-edge definition of artificial intelligence research is changing over time. One pragmatic definition is: “AI research is that which computing scientists do not know how to do cost-effectively today.” For example, in 1956 optical character recognition (OCR) was considered AI, but today, sophisticated OCR software with a context-sensitive spell checker and grammar checker software comes for free with most image scanners. No one would any longer consider already-solved computing science problems like OCR “artificial intelligence” today.

Low-cost entertaining chess-playing software is commonly available for tablet computers. DARPA no longer provides significant funding for chess-playing computing system development. The Kinect, which provides a 3D body–motion interface for the Xbox 360, uses algorithms that emerged from lengthy AI research,[38] but few consumers realize the technology source.

AI applications are no longer the exclusive domain of U.S. Department of Defense R&D, but are now commonplace consumer items and inexpensive intelligent toys.

In common usage, the term “AI” no longer seems to apply to off-the-shelf solved computing-science problems, which may have originally emerged out of years of AI research.

Problems

The general problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems. These consist of particular traits or capabilities that researchers would like an intelligent system to display. The traits described below have received the most attention.[6]

Deduction, reasoning, problem solving

Early AI researchers developed algorithms that imitated the step-by-step reasoning that humans use when they solve puzzles or make logical deductions.[39] By the late 1980s and ’90s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.[40]

For difficult problems, most of these algorithms can require enormous computational resources – most experience a “combinatorial explosion“: the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size. The search for more efficient problem-solving algorithms is a high priority for AI research.[41]

Human beings solve most of their problems using fast, intuitive judgements rather than the conscious, step-by-step deduction that early AI research was able to model.[42] AI has made some progress at imitating this kind of “sub-symbolic” problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning; neural net research attempts to simulate the structures inside the brain that give rise to this skill; statistical approaches to AI mimic the probabilistic nature of the human ability to guess.

Knowledge representation

An ontology represents knowledge as a set of concepts within a domain and the relationships between those concepts.

Knowledge representation[43] and knowledge engineering[44] are central to AI research. Many of the problems machines are expected to solve will require extensive knowledge about the world. Among the things that AI needs to represent are: objects, properties, categories and relations between objects;[45] situations, events, states and time;[46] causes and effects;[47] knowledge about knowledge (what we know about what other people know);[48] and many other, less well researched domains. A representation of “what exists” is an ontology (borrowing a word from traditional philosophy), of which the most general are called upper ontologies.[49]

Among the most difficult problems in knowledge representation are:

Default reasoning and the qualification problem
Many of the things people know take the form of “working assumptions.” For example, if a bird comes up in conversation, people typically picture an animal that is fist sized, sings, and flies. None of these things are true about all birds. John McCarthy identified this problem in 1969[50] as the qualification problem: for any commonsense rule that AI researchers care to represent, there tend to be a huge number of exceptions. Almost nothing is simply true or false in the way that abstract logic requires. AI research has explored a number of solutions to this problem.[51]
The breadth of commonsense knowledge
The number of atomic facts that the average person knows is astronomical. Research projects that attempt to build a complete knowledge base of commonsense knowledge (e.g., Cyc) require enormous amounts of laborious ontological engineering — they must be built, by hand, one complicated concept at a time.[52] A major goal is to have the computer understand enough concepts to be able to learn by reading from sources like the internet, and thus be able to add to its own ontology.[citation needed]
The subsymbolic form of some commonsense knowledge
Much of what people know is not represented as “facts” or “statements” that they could express verbally. For example, a chess master will avoid a particular chess position because it “feels too exposed”[53] or an art critic can take one look at a statue and instantly realize that it is a fake.[54] These are intuitions or tendencies that are represented in the brain non-consciously and sub-symbolically.[55] Knowledge like this informs, supports and provides a context for symbolic, conscious knowledge. As with the related problem of sub-symbolic reasoning, it is hoped that situated AI, computational intelligence, or statistical AI will provide ways to represent this kind of knowledge.[55]

Planning

A hierarchical control system is a form of control system in which a set of devices and governing software is arranged in a hierarchy.

Intelligent agents must be able to set goals and achieve them.[56] They need a way to visualize the future (they must have a representation of the state of the world and be able to make predictions about how their actions will change it) and be able to make choices that maximize the utility (or “value”) of the available choices.[57]

In classical planning problems, the agent can assume that it is the only thing acting on the world and it can be certain what the consequences of its actions may be.[58] However, if the agent is not the only actor, it must periodically ascertain whether the world matches its predictions and it must change its plan as this becomes necessary, requiring the agent to reason under uncertainty.[59]

Multi-agent planning uses the cooperation and competition of many agents to achieve a given goal. Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.[60]

Learning

Main article: Machine learning

Machine learning is the study of computer algorithms that improve automatically through experience.[61][62] It has been central to AI research from the beginning.[63]

Unsupervised learning is the ability to find patterns in a stream of input. Supervised learning includes both classification and numerical regression. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression is the attempt to produce a function that describes the relationship between inputs and outputs and predicts how the outputs should change as the inputs change. In reinforcement learning[64] the agent is rewarded for good responses and punished for bad ones. These can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.[65]
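
As a brief illustration of the distinction between classification and regression, here is a toy example on synthetic data, using scikit-learn as an assumed choice of library:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))

# Regression: predict a numeric output from the input.
y_numeric = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
reg = LinearRegression().fit(X, y_numeric)

# Classification: predict which category the input belongs to.
y_label = (X[:, 0] > 5).astype(int)
clf = LogisticRegression().fit(X, y_label)

print("regression prediction for x=4:", round(reg.predict([[4.0]])[0], 1))
print("classification prediction for x=4:", clf.predict([[4.0]])[0])
```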

Natural language processing

A parse tree represents the syntactic structure of a sentence according to some formal grammar.

Natural language processing[66] gives machines the ability to read and understand the languages that humans speak. A sufficiently powerful natural language processing system would enable natural language user interfaces and the acquisition of knowledge directly from human-written sources, such as Internet texts. Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation.[67]

A common method of processing and extracting meaning from natural language is through semantic indexing. Increases in processing speeds and the drop in the cost of data storage make indexing large volumes of abstractions of the user’s input much more efficient.

Motion and manipulation

Main article: Robotics

The field of robotics[68] is closely related to AI. Intelligence is required for robots to be able to handle such tasks as object manipulation[69] and navigation, with sub-problems of localization (knowing where you are, or finding out where other things are), mapping (learning what is around you, building a map of the environment), and motion planning (figuring out how to get there) or path planning (going from one point in space to another point, which may involve compliant motion – where the robot moves while maintaining physical contact with an object).[70][71]

Perception

Machine perception[72] is the ability to use input from sensors (such as cameras, microphones, sonar and others more exotic) to deduce aspects of the world. Computer vision[73] is the ability to analyze visual input. A few selected subproblems are speech recognition,[74] facial recognition and object recognition.[75]

Social intelligence

Main article: Affective computing

Kismet, a robot with rudimentary social skills[76]

Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects.[77][78] It is an interdisciplinary field spanning computer science, psychology, and cognitive science.[79] While the origins of the field may be traced as far back as to early philosophical enquiries into emotion,[80] the more modern branch of computer science originated with Rosalind Picard’s 1995 paper[81] on affective computing.[82][83] A motivation for the research is the ability to simulate empathy. The machine should interpret the emotional state of humans and adapt its behaviour to them, giving an appropriate response for those emotions.

Emotion and social skills[84] play two roles for an intelligent agent. First, the agent must be able to predict the actions of others, by understanding their motives and emotional states. (This involves elements of game theory and decision theory, as well as the ability to model human emotions and the perceptual skills to detect emotions.) Second, in an effort to facilitate human-computer interaction, an intelligent machine might want to be able to display emotions, even if it does not actually experience them itself, in order to appear sensitive to the emotional dynamics of human interaction.

Creativity

A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative, or systems that identify and assess creativity). Related areas of computational research are Artificial intuition and Artificial imagination.

General intelligence

Main articles: Strong AI and AI-complete

Most researchers think that their work will eventually be incorporated into a machine with general intelligence (known as strong AI), combining all the skills above and exceeding human abilities at most or all of them.[7] A few believe that anthropomorphic features like artificial consciousness or an artificial brain may be required for such a project.[85][86]

Many of the problems above are considered AI-complete: to solve one problem, you must solve them all. For example, even a straightforward, specific task like machine translation requires that the machine follow the author’s argument (reason), know what is being talked about (knowledge), and faithfully reproduce the author’s intention (social intelligence). Machine translation, therefore, is believed to be AI-complete: it may require strong AI to be done as well as humans can do it.[87]

Approaches

There is no established unifying theory or paradigm that guides AI research. Researchers disagree about many issues.[88] A few of the longest-standing questions that remain unanswered are these: should artificial intelligence simulate natural intelligence by studying psychology or neurology? Or is human biology as irrelevant to AI research as bird biology is to aeronautical engineering?[89] Can intelligent behavior be described using simple, elegant principles (such as logic or optimization)? Or does it necessarily require solving a large number of completely unrelated problems?[90] Can intelligence be reproduced using high-level symbols, similar to words and ideas? Or does it require "sub-symbolic" processing?[91] John Haugeland, who coined the term GOFAI (Good Old-Fashioned Artificial Intelligence), also proposed that AI should more properly be referred to as synthetic intelligence,[92] a term which has since been adopted by some non-GOFAI researchers.[93][94]

Cybernetics and brain simulation

In the 1940s and 1950s, a number of researchers explored the connection between neurology, information theory, and cybernetics. Some of them built machines that used electronic networks to exhibit rudimentary intelligence, such as W. Grey Walter's turtles and the Johns Hopkins Beast. Many of these researchers gathered for meetings of the Teleological Society at Princeton University and the Ratio Club in England.[20] By 1960, this approach was largely abandoned, although elements of it would be revived in the 1980s.

Symbolic

Main article: GOFAI

When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation. The research was centered in three institutions: Carnegie Mellon University, Stanford and MIT, and each one developed its own style of research. John Haugeland named these approaches to AI "good old fashioned AI" or "GOFAI".[95] During the 1960s, symbolic approaches had achieved great success at simulating high-level thinking in small demonstration programs. Approaches based on cybernetics or neural networks were abandoned or pushed into the background.[96] Researchers in the 1960s and the 1970s were convinced that symbolic approaches would eventually succeed in creating a machine with artificial general intelligence and considered this the goal of their field.

Cognitive simulation
Economist Herbert Simon and Allen Newell studied human problem-solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science. Their research team used the results of psychological experiments to develop programs that simulated the techniques that people used to solve problems. This tradition, centered at Carnegie Mellon University, would eventually culminate in the development of the Soar architecture in the middle 1980s.[97][98]
Logic-based
Unlike Newell and Simon, John McCarthy felt that machines did not need to simulate human thought, but should instead try to find the essence of abstract reasoning and problem solving, regardless of whether people used the same algorithms.[89] His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning.[99] Logic was also the focus of the work at the University of Edinburgh and elsewhere in Europe, which led to the development of the programming language Prolog and the science of logic programming.[100]
“Anti-logic” or “scruffy”
Researchers at MIT (such as Marvin Minsky and Seymour Papert)[101] found that solving difficult problems in vision and natural language processing required ad-hoc solutions – they argued that there was no simple and general principle (like logic) that would capture all the aspects of intelligent behavior. Roger Schank described their “anti-logic” approaches as “scruffy” (as opposed to the “neat” paradigms at CMU and Stanford).[90] Commonsense knowledge bases (such as Doug Lenat‘s Cyc) are an example of “scruffy” AI, since they must be built by hand, one complicated concept at a time.[102]
Knowledge-based
When computers with large memories became available around 1970, researchers from all three traditions began to build knowledge into AI applications.[103] This “knowledge revolution” led to the development and deployment of expert systems (introduced by Edward Feigenbaum), the first truly successful form of AI software.[30] The knowledge revolution was also driven by the realization that enormous amounts of knowledge would be required by many simple AI applications.

Sub-symbolic

By the 1980s progress in symbolic AI seemed to stall and many believed that symbolic systems would never be able to imitate all the processes of human cognition, especially perception, robotics, learning and pattern recognition. A number of researchers began to look into "sub-symbolic" approaches to specific AI problems.[91]

Bottom-up, embodied, situated, behavior-based or nouvelle AI
Researchers from the related field of robotics, such as Rodney Brooks, rejected symbolic AI and focused on the basic engineering problems that would allow robots to move and survive.[104] Their work revived the non-symbolic viewpoint of the early cybernetics researchers of the 50s and reintroduced the use of control theory in AI. This coincided with the development of the embodied mind thesis in the related field of cognitive science: the idea that aspects of the body (such as movement, perception and visualization) are required for higher intelligence.
Computational Intelligence
Interest in neural networks and "connectionism" was revived by David Rumelhart and others in the middle 1980s.[105] These and other sub-symbolic approaches, such as fuzzy systems and evolutionary computation, are now studied collectively by the emerging discipline of computational intelligence.[106]

Statistical

In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific subproblems. These tools are truly scientific, in the sense that their results are both measurable and verifiable, and they have been responsible for many of AI's recent successes. The shared mathematical language has also permitted a high level of collaboration with more established fields (like mathematics, economics or operations research). Stuart Russell and Peter Norvig describe this movement as nothing less than a "revolution" and "the victory of the neats."[33] Critics argue that these techniques are too focused on particular problems and have failed to address the long-term goal of general intelligence.[107] There is an ongoing debate about the relevance and validity of statistical approaches in AI, exemplified in part by exchanges between Peter Norvig and Noam Chomsky.[108][109]

Integrating the approaches

Intelligent agent paradigm
An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success. The simplest intelligent agents are programs that solve specific problems. More complicated agents include human beings and organizations of human beings (such as firms). The paradigm gives researchers license to study isolated problems and find solutions that are both verifiable and useful, without agreeing on one single approach. An agent that solves a specific problem can use any approach that works – some agents are symbolic and logical, some are sub-symbolic neural networks and others may use new approaches. The paradigm also gives researchers a common language to communicate with other fields—such as decision theory and economics—that also use concepts of abstract agents. The intelligent agent paradigm became widely accepted during the 1990s.[2]
Agent architectures and cognitive architectures
Researchers have designed systems to build intelligent systems out of interacting intelligent agents in a multi-agent system.[110] A system with both symbolic and sub-symbolic components is a hybrid intelligent system, and the study of such systems is artificial intelligence systems integration. A hierarchical control system provides a bridge between sub-symbolic AI at its lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time constraints permit planning and world modelling.[111] Rodney Brooks' subsumption architecture was an early proposal for such a hierarchical system.[112]
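
As a rough sketch of the intelligent agent paradigm described above (not any particular architecture), the following reflex agent repeatedly perceives a simulated environment and picks the action it expects to serve its goal; the thermostat setting, the noise model and the class names are invented.

```python
# Perceive -> decide -> act loop for a toy reflex agent.
import random

class ThermostatAgent:
    """Keep a perceived temperature near a target value."""
    def __init__(self, target):
        self.target = target

    def act(self, percept):
        if percept < self.target - 1:
            return "heat"
        if percept > self.target + 1:
            return "cool"
        return "idle"

temperature = 15.0
agent = ThermostatAgent(target=20.0)
for step in range(10):
    action = agent.act(temperature)                 # perceive and decide
    temperature += {"heat": 1.5, "cool": -1.5, "idle": 0.0}[action]
    temperature += random.uniform(-0.3, 0.3)        # environment noise
    print(step, action, round(temperature, 1))      # act, then perceive again
```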

Tools

In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in computer science. A few of the most general of these methods are discussed below.

Search and optimization

Many problems in AI can be solved in theory by intelligently searching through many possible solutions:[113] Reasoning can be reduced to performing a search. For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule.[114] Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal, a process called means-ends analysis.[115] Robotics algorithms for moving limbs and grasping objects use local searches in configuration space.[69] Many learning algorithms use search algorithms based on optimization.

Simple exhaustive searches[116] are rarely sufficient for most real world problems: the search space (the number of places to search) quickly grows to astronomical numbers. The result is a search that is too slow or never completes. The solution, for many problems, is to use "heuristics" or "rules of thumb" that eliminate choices that are unlikely to lead to the goal (called "pruning the search tree"). Heuristics supply the program with a "best guess" for the path on which the solution lies.[117]

A very different kind of search came to prominence in the 1990s, based on the mathematical theory of optimization. For many problems, it is possible to begin the search with some form of a guess and then refine the guess incrementally until no more refinements can be made. These algorithms can be visualized as blind hill climbing: we begin the search at a random point on the landscape, and then, by jumps or steps, we keep moving our guess uphill, until we reach the top. Other optimization algorithms are simulated annealing, beam search and random optimization.[118]
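
A minimal sketch of hill climbing under simple assumptions: a single smooth, invented objective function and a fixed step size. The search starts from a random guess and keeps taking uphill steps until no neighbouring step improves the objective.

```python
# Blind hill climbing on a toy one-dimensional objective.
import random

def objective(x):
    return -(x - 3.0) ** 2 + 9.0        # a single smooth hill, peak at x = 3

def hill_climb(step=0.1, tries=1000):
    x = random.uniform(-10, 10)         # random starting point
    for _ in range(tries):
        best = max((x + step, x - step), key=objective)
        if objective(best) <= objective(x):
            break                       # no uphill move left: local optimum
        x = best
    return x

print(round(hill_climb(), 1))           # converges near 3.0
```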

Evolutionary computation uses a form of optimization search. For example, an evolutionary algorithm may begin with a population of organisms (the guesses) and then allow them to mutate and recombine, selecting only the fittest to survive each generation (refining the guesses). Forms of evolutionary computation include swarm intelligence algorithms (such as ant colony or particle swarm optimization)[119] and evolutionary algorithms (such as genetic algorithms, gene expression programming, and genetic programming).[120]
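
The sketch below is a bare-bones genetic algorithm on an invented "count the 1 bits" problem: a population of bit strings is recombined and mutated, and only the fittest half survives each generation. The population size, mutation rate and fitness function are arbitrary choices made for the example.

```python
# Toy genetic algorithm: evolve bit strings toward all ones.
import random

LENGTH, POP, GENERATIONS = 20, 30, 60

def fitness(bits):
    return sum(bits)                       # number of 1 bits

def crossover(a, b):
    cut = random.randrange(1, LENGTH)      # one-point recombination
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.02):
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]       # selection of the fittest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(fitness(max(population, key=fitness)))   # approaches 20
```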

Logic

Logic[121] is used for knowledge representation and problem solving, but it can be applied to other problems as well. For example, the satplan algorithm uses logic for planning[122] and inductive logic programming is a method for learning.[123]

Several different forms of logic are used in AI research. Propositional or sentential logic[124] is the logic of statements which can be true or false. First-order logic[125] also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other. Fuzzy logic[126] is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0). Fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer product control systems. Subjective logic[127] models uncertainty in a different and more explicit manner than fuzzy logic: a given binomial opinion satisfies belief + disbelief + uncertainty = 1 within a Beta distribution. By this method, ignorance can be distinguished from probabilistic statements that an agent makes with high confidence.
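
For the fuzzy-logic case, here is a minimal sketch using the common min/max connectives over truth degrees in [0, 1]; the membership values for "warm" and "humid" are invented.

```python
# Fuzzy truth values: degrees between 0 and 1 instead of True/False.
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

warm = 0.7          # degree to which "the room is warm" holds
humid = 0.4         # degree to which "the room is humid" holds

uncomfortable = fuzzy_and(warm, humid)          # 0.4
pleasant = fuzzy_and(warm, fuzzy_not(humid))    # 0.6
print(uncomfortable, pleasant)
```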

Default logics, non-monotonic logics and circumscription[51] are forms of logic designed to help with default reasoning and the qualification problem. Several extensions of logic have been designed to handle specific domains of knowledge, such as: description logics;[45] situation calculus, event calculus and fluent calculus (for representing events and time);[46] causal calculus;[47] belief calculus; and modal logics.[48]

Probabilistic methods for uncertain reasoning

Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information. AI researchers have devised a number of powerful tools to solve these problems using methods from probability theory and economics.[128]

Bayesian networks[129] are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm),[130] learning (using the expectation-maximization algorithm),[131] planning (using decision networks)[132] and perception (using dynamic Bayesian networks).[133] Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g., hidden Markov models or Kalman filters).[133]
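
A minimal sketch of Bayesian updating on a single cause-and-evidence pair with invented probabilities; a full Bayesian network chains many such conditional tables, but the underlying computation is this application of Bayes' rule.

```python
# Two-node toy network: Disease -> TestPositive (numbers are invented).
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Total probability of observing a positive test.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior belief after seeing the evidence (Bayes' rule).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.161
```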

A key concept from the science of economics is "utility": a measure of how valuable something is to an intelligent agent. Precise mathematical tools have been developed that analyze how an agent can make choices and plan, using decision theory, decision analysis,[134] and information value theory.[57] These tools include models such as Markov decision processes,[135] dynamic decision networks,[133] game theory and mechanism design.[136]
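
To make the utility idea concrete, here is a sketch of value iteration, a standard dynamic-programming method for Markov decision processes; the two-state MDP, its rewards and the discount factor are invented for illustration.

```python
# Value iteration on a toy two-state, two-action MDP.
STATES = ["poor", "rich"]
ACTIONS = ["save", "spend"]
GAMMA = 0.9   # discount factor

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "poor": {"save":  [(0.7, "rich", 5), (0.3, "poor", 0)],
             "spend": [(1.0, "poor", 1)]},
    "rich": {"save":  [(1.0, "rich", 3)],
             "spend": [(0.5, "poor", 10), (0.5, "rich", 10)]},
}

values = {s: 0.0 for s in STATES}
for _ in range(100):   # repeated Bellman backups converge to the utilities
    values = {
        s: max(sum(p * (r + GAMMA * values[s2])
                   for p, s2, r in transitions[s][a])
               for a in ACTIONS)
        for s in STATES
    }
print({s: round(v, 1) for s, v in values.items()})
```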

Classifiers and statistical learning methods

The simplest AI applications can be divided into two types: classifiers ("if shiny then diamond") and controllers ("if shiny then pick up"). Controllers do, however, also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples, making them very attractive for use in AI. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.[137]

A classifier can be trained in various ways; there are many statistical and machine learning approaches. The most widely used classifiers are the neural network,[138] kernel methods such as the support vector machine,[139] k-nearest neighbor algorithm,[140] Gaussian mixture model,[141] naive Bayes classifier,[142] and decision tree.[143] The performance of these classifiers has been compared over a wide range of tasks. Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems; this is also referred to as the "no free lunch" theorem. Determining a suitable classifier for a given problem is still more an art than a science.[144]
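
A minimal sketch of one of the classifiers named above, the k-nearest-neighbor algorithm, on an invented two-feature data set: a new observation receives the majority label of its k closest labelled examples.

```python
# k-nearest-neighbor classification on a tiny, invented data set.
from collections import Counter
import math

data = [                      # (features, class label)
    ((1.0, 1.2), "shiny"), ((0.9, 0.8), "shiny"),
    ((4.1, 4.0), "dull"),  ((3.8, 4.3), "dull"),
]

def classify(point, k=3):
    """Label a new point by majority vote of its k nearest examples."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

print(classify((1.1, 1.0)))   # -> "shiny"
```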

Neural networks

Main articles: Neural network and Connectionism

A neural network is an interconnected group of nodes, akin to the vast network of neurons in the human brain.

The study of artificial neural networks[138] began in the decade before the field of AI research was founded, in the work of Walter Pitts and Warren McCulloch. Other important early researchers were Frank Rosenblatt, who invented the perceptron, and Paul Werbos, who developed the backpropagation algorithm.[145]

The main categories of networks are acyclic or feedforward neural networks (where the signal passes in only one direction) and recurrent neural networks (which allow feedback). Among the most popular feedforward networks are perceptrons, multi-layer perceptrons and radial basis networks.[146] Among recurrent networks, the most famous is the Hopfield net, a form of attractor network, which was first described by John Hopfield in 1982.[147] Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning and competitive learning.[148]
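
As a sketch of the simplest feedforward unit, the code below trains a single perceptron with the classic perceptron learning rule on a tiny, invented, linearly separable problem (logical AND); it is not meant to represent any full network library.

```python
# A single perceptron learning the AND function.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, rate = [0.0, 0.0], 0.0, 0.1

def predict(x):
    """Fire (output 1) when the weighted sum exceeds the threshold."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

for _ in range(20):                      # repeat over the training set
    for x, target in examples:
        error = target - predict(x)      # 0 when the output is already correct
        weights = [w + rate * error * xi for w, xi in zip(weights, x)]
        bias += rate * error

print([predict(x) for x, _ in examples])   # -> [0, 0, 0, 1]
```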

Hierarchical temporal memory is an approach that models some of the structural and algorithmic properties of the neocortex.[149]

Control theory

Main article: Intelligent control

Control theory, the grandchild of cybernetics, has many important applications, especially in robotics.[150]

Languages

AI researchers have developed several specialized languages for AI research, including Lisp[151] and Prolog.[152]

Evaluating progress

In 1950, Alan Turing proposed a general procedure to test the intelligence of an agent, now known as the Turing test. This procedure allows almost all the major problems of artificial intelligence to be tested. However, it is a very difficult challenge and at present all agents fail it.[153]

Artificial intelligence can also be evaluated on specific problems such as small problems in chemistry, handwriting recognition and game-playing. Such tests have been termed subject matter expert Turing tests. Smaller problems provide more achievable goals and there are an ever-increasing number of positive results.[154]

One classification for outcomes of an AI test is:[155]

  1. Optimal: it is not possible to perform better.
  2. Strong super-human: performs better than all humans.
  3. Super-human: performs better than most humans.
  4. Sub-human: performs worse than most humans.

For example, performance at draughts is optimal,[156] performance at chess is super-human and nearing strong super-human (see computer chess: computers versus human) and performance at many everyday tasks (such as recognizing a face or crossing a room without bumping into something) is sub-human.

A quite different approach measures machine intelligence through tests which are developed from mathematical definitions of intelligence. Examples of these kinds of tests began in the late 1990s with intelligence tests devised using notions from Kolmogorov complexity and data compression.[157] Two major advantages of mathematical definitions are their applicability to nonhuman intelligences and their absence of a requirement for human testers.

Applications

An automated online assistant providing customer service on a web page, one of many very primitive applications of artificial intelligence.

Artificial intelligence techniques are pervasive and are too numerous to list. Frequently, when a technique reaches mainstream use, it is no longer considered artificial intelligence; this phenomenon is described as the AI effect.[158]

Competitions and prizes

There are a number of competitions and prizes to promote research in artificial intelligence. The main areas promoted are: general machine intelligence, conversational behavior, data-mining, driverless cars, robot soccer and games.

Platforms

A platform (or "computing platform") is defined as "some sort of hardware architecture or software framework (including application frameworks) that allows software to run." As Rodney Brooks[159] pointed out many years ago, it is not just the artificial intelligence software that defines the AI features of the platform, but rather the actual platform itself that affects the AI that results; that is, work on AI problems needs to happen on real-world platforms rather than in isolation.

A wide variety of platforms has allowed different aspects of AI to develop, ranging from expert systems (albeit PC-based, but still entire real-world systems) to various robot platforms such as the widely available Roomba with its open interface.[160]

Philosophy

Artificial intelligence, by claiming to be able to recreate the capabilities of the human mind, is both a challenge and an inspiration for philosophy. Are there limits to how intelligent machines can be? Is there an essential difference between human intelligence and artificial intelligence? Can a machine have a mind and consciousness? A few of the most influential answers to these questions are given below.[161]

Turing’s “polite convention”: We need not decide if a machine can “think”; we need only decide if a machine can act as intelligently as a human being. This approach to the philosophical problems associated with artificial intelligence forms the basis of the Turing test.[153]

The Dartmouth proposal: “Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it.” This conjecture was printed in the proposal for the Dartmouth Conference of 1956, and represents the position of most working AI researchers.[162]

Newell and Simon's physical symbol system hypothesis: "A physical symbol system has the necessary and sufficient means of general intelligent action." Newell and Simon argue that intelligence consists of formal operations on symbols.[163] Hubert Dreyfus argued that, on the contrary, human expertise depends on unconscious instinct rather than conscious symbol manipulation and on having a "feel" for the situation rather than explicit symbolic knowledge. (See Dreyfus' critique of AI.)[164][165]

Gödel’s incompleteness theorem: A formal system (such as a computer program) cannot prove all true statements.[166] Roger Penrose is among those who claim that Gödel’s theorem limits what machines can do. (See The Emperor’s New Mind.)[167]

Searle’s strong AI hypothesis: “The appropriately programmed computer with the right inputs and outputs would thereby have a mind in exactly the same sense human beings have minds.”[168] John Searle counters this assertion with his Chinese room argument, which asks us to look inside the computer and try to find where the “mind” might be.[169]

The artificial brain argument: The brain can be simulated. Hans MoravecRay Kurzweil and others have argued that it is technologically feasible to copy the brain directly into hardware and software, and that such a simulation will be essentially identical to the original.[86]

Predictions and ethics

Artificial Intelligence is a common topic in both science fiction and projections about the future of technology and society. The existence of an artificial intelligence that rivals human intelligence raises difficult ethical issues, and the potential power of the technology inspires both hopes and fears.

In fiction, Artificial Intelligence has appeared fulfilling many roles, including a servant (R2D2 in Star Wars), a law enforcer (K.I.T.T. in Knight Rider), a comrade (Lt. Commander Data in Star Trek: The Next Generation), a conqueror/overlord (The Matrix, Omnius), a dictator (With Folded Hands), a benevolent provider/de facto ruler (The Culture), an assassin (Terminator), a sentient race (Battlestar Galactica/Transformers/Mass Effect), an extension to human abilities (Ghost in the Shell) and the savior of the human race (R. Daneel Olivaw in Isaac Asimov's Robot series).

Mary Shelley's Frankenstein considers a key issue in the ethics of artificial intelligence: if a machine can be created that has intelligence, could it also feel? If it can feel, does it have the same rights as a human? The idea also appears in modern science fiction, including the films I, Robot, Blade Runner and A.I.: Artificial Intelligence, in which humanoid machines have the ability to feel human emotions. This issue, now known as "robot rights", is currently being considered by, for example, California's Institute for the Future, although many critics believe that the discussion is premature.[170] The subject is discussed in depth in the 2010 documentary film Plug & Pray.[171]

Martin Ford, author of The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future,[172] and others argue that specialized artificial intelligence applications, robotics and other forms of automation will ultimately result in significant unemployment as machines begin to match and exceed the capability of workers to perform most routine and repetitive jobs. Ford predicts that many knowledge-based occupations—and in particular entry level jobs—will be increasingly susceptible to automation via expert systems, machine learning[173] and other AI-enhanced applications. AI-based applications may also be used to amplify the capabilities of low-wage offshore workers, making it more feasible to outsource knowledge work.[174]

Joseph Weizenbaum wrote that AI applications cannot, by definition, successfully simulate genuine human empathy and that the use of AI technology in fields such as customer service or psychotherapy[175] was deeply misguided. Weizenbaum was also bothered that AI researchers (and some philosophers) were willing to view the human mind as nothing more than a computer program (a position now known as computationalism). To Weizenbaum these points suggest that AI research devalues human life.[176]

Many futurists believe that artificial intelligence will ultimately transcend the limits of progress. Ray Kurzweil has used Moore’s law (which describes the relentless exponential improvement in digital technology) to calculate that desktop computers will have the same processing power as human brains by the year 2029. He also predicts that by 2045 artificial intelligence will reach a point where it is able to improve itself at a rate that far exceeds anything conceivable in the past, a scenario that science fiction writer Vernor Vinge named the “singularity“.[177]

Robot designer Hans Moravec, cyberneticist Kevin Warwick and inventor Ray Kurzweil have predicted that humans and machines will merge in the future into cyborgs that are more capable and powerful than either.[178] This idea, called transhumanism, which has roots in Aldous Huxley and Robert Ettinger, has been illustrated in fiction as well, for example in the manga Ghost in the Shell and the science-fiction series Dune. In the 1980s the artist Hajime Sorayama painted and published his Sexy Robots series in Japan, depicting the organic human form with lifelike muscular metallic skins; his later book The Gynoids was used by or influenced movie makers including George Lucas and other creatives. Sorayama never considered these organic robots to be a real part of nature but always an unnatural product of the human mind, a fantasy existing in the mind even when realized in actual form. Almost 20 years later, the first AI robotic pet (AIBO) became available as a companion to people. AIBO grew out of Sony's Computer Science Laboratory (CSL). The famed engineer Dr. Toshitada Doi is credited as AIBO's original progenitor: in 1994 he had started work on robots with artificial intelligence expert Masahiro Fujita within CSL. Doi's friend, the artist Hajime Sorayama, was enlisted to create the initial designs for AIBO's body. Those designs are now part of the permanent collections of the Museum of Modern Art and the Smithsonian Institution, and later versions of AIBO have been used in studies at Carnegie Mellon University. In 2006, AIBO was added to Carnegie Mellon University's "Robot Hall of Fame".

Political scientist Charles T. Rubin believes that AI can be neither designed nor guaranteed to be friendly.[179] He argues that “any sufficiently advanced benevolence may be indistinguishable from malevolence.” Humans should not assume machines or robots would treat us favorably, because there is no a priori reason to believe that they would be sympathetic to our system of morality, which has evolved along with our particular biology (which AIs would not share).

Edward Fredkin argues that "artificial intelligence is the next stage in evolution", an idea first proposed by Samuel Butler's "Darwin among the Machines" (1863), and expanded upon by George Dyson in his book of the same name in 1998.[180]

References

Notes

  1. ^ Definition of AI as the study of intelligent agents:
  2. a b The intelligent agent paradigm:

    The definition used in this article, in terms of goals, actions, perception and environment, is due to Russell & Norvig (2003). Other definitions also include knowledge and learning as additional criteria.

  3. ^ Although there is some controversy on this point (see Crevier (1993, p. 50)), McCarthy states unequivocally “I came up with the term” in a c|net interview. (Skillings 2006) McCarthy first used the term in the proposal for the Dartmouth conference, which appeared in 1955. (McCarthy et al. 1955)
  4. ^ McCarthy‘s definition of AI:
  5. ^ Pamela McCorduck (2004, pp. 424) writes of "the rough shattering of AI in subfields—vision, natural language, decision theory, genetic algorithms, robotics … and these with their own sub-subfield—that would hardly have anything to say to each other."
  6. a b This list of intelligent traits is based on the topics covered by the major AI textbooks, including:
  7. a b General intelligence (strong AI) is discussed in popular introductions to AI:
  8. ^ See the Dartmouth proposal, under Philosophy, below.
  9. a b This is a central idea of Pamela McCorduck‘s Machines Who Think. She writes: “I like to think of artificial intelligence as the scientific apotheosis of a venerable cultural tradition.” (McCorduck 2004, p. 34) “Artificial intelligence in one form or another is an idea that has pervaded Western intellectual history, a dream in urgent need of being realized.” (McCorduck 2004, p. xviii) “Our history is full of attempts—nutty, eerie, comical, earnest, legendary and real—to make artificial intelligences, to reproduce what is the essential us—bypassing the ordinary means. Back and forth between myth and reality, our imaginations supplying what our workshops couldn’t, we have engaged for a long time in this odd form of self-reproduction.” (McCorduck 2004, p. 3) She traces the desire back to its Hellenistic roots and calls it the urge to “forge the Gods.” (McCorduck 2004, pp. 340–400)
  10. ^ The optimism referred to includes the predictions of early AI researchers (see optimism in the history of AI) as well as the ideas of modern transhumanists such as Ray Kurzweil.
  11. ^ The “setbacks” referred to include the ALPAC report of 1966, the abandonment of perceptrons in 1970, the Lighthill Report of 1973 and the collapse of the lisp machine market in 1987.
  12. a b AI applications widely used behind the scenes:
  13. ^ AI in myth:
  14. ^ Cult images as artificial intelligence:

    These were the first machines to be believed to have true intelligence and consciousness. Hermes Trismegistus expressed the common belief that with these statues, craftsman had reproduced “the true nature of the gods”, their sensus and spiritus. McCorduck makes the connection between sacred automatons and Mosaic law (developed around the same time), which expressly forbids the worship of robots (McCorduck 2004, pp. 6–9)

  15. ^ Humanoid automata:
    Yan Shi:

    Hero of Alexandria:

    Al-Jazari:

    Wolfgang von Kempelen:

  16. ^ Artificial beings:
    Jābir ibn Hayyān‘s Takwin:

    Judah Loew‘s Golem:

    Paracelsus‘ Homunculus:

  17. ^ AI in early science fiction.
  18. ^ This insight, that digital computers can simulate any process of formal reasoning, is known as the Church–Turing thesis.
  19. ^ Formal reasoning:
  20. a b AI’s immediate precursors:

    See also Cybernetics and early neural networks (in History of artificial intelligence). Among the researchers who laid the foundations of AI were Alan Turing, John von Neumann, Norbert Wiener, Claude Shannon, Warren McCulloch, Walter Pitts and Donald Hebb.

  21. ^ Dartmouth conference:
    • McCorduck 2004, pp. 111–136
    • Crevier 1993, pp. 47–49, who writes “the conference is generally recognized as the official birthdate of the new science.”
    • Russell & Norvig 2003, p. 17, who call the conference “the birth of artificial intelligence.”
    • NRC 1999, pp. 200–201
  22. ^ Hegemony of the Dartmouth conference attendees:
  23. ^ Russell and Norvig write “it was astonishing whenever a computer did anything kind of smartish.” Russell & Norvig 2003, p. 18
  24. ^ “Golden years” of AI (successful symbolic reasoning programs 1956–1973):

    The programs described are Daniel Bobrow's STUDENT, Newell and Simon's Logic Theorist and Terry Winograd's SHRDLU.

  25. ^ DARPA pours money into undirected pure research into AI during the 1960s:
  26. ^ AI in England:
  27. ^ Optimism of early AI:
  28. ^ See The problems (in History of artificial intelligence)
  29. ^ First AI Winter, Mansfield Amendment, Lighthill report
  30. a b Expert systems:
  31. ^ Boom of the 1980s: rise of expert systems, Fifth Generation Project, Alvey, MCC, SCI:
  32. ^ Second AI winter:
  33. a b Formal methods are now preferred (“Victory of the neats“):
  34. ^ McCorduck 2004, pp. 480–483
  35. ^ DARPA Grand Challenge – home page
  36. ^ “Welcome”. Archive.darpa.mil. Retrieved 31 October 2011.
  37. ^ Markoff, John (16 February 2011). “On ‘Jeopardy!’ Watson Win Is All but Trivial”.The New York Times.
  38. ^ Kinect’s AI breakthrough explained
  39. ^ Problem solving, puzzle solving, game playing and deduction:
  40. ^ Uncertain reasoning:
  41. ^ Intractability and efficiency and the combinatorial explosion:
  42. ^ Psychological evidence of sub-symbolic reasoning:
  43. ^ Knowledge representation:
  44. ^ Knowledge engineering:
  45. a b Representing categories and relations: Semantic networks, description logics, inheritance (including frames and scripts):
  46. a b Representing events and time: Situation calculus, event calculus, fluent calculus (including solving the frame problem):
  47. a b Causal calculus:
  48. a b Representing knowledge about knowledge: Belief calculus, modal logics:
  49. ^ Ontology:
  50. ^ Qualification problem:

    While McCarthy was primarily concerned with issues in the logical representation of actions, Russell & Norvig 2003 apply the term to the more general issue of default reasoning in the vast network of assumptions underlying all our commonsense knowledge.

  51. a b Default reasoning and default logic, non-monotonic logics, circumscription, closed world assumption, abduction (Poole et al. places abduction under "default reasoning". Luger et al. places this under "uncertain reasoning"):
  52. ^ Breadth of commonsense knowledge:
  53. ^ Dreyfus & Dreyfus 1986
  54. ^ Gladwell 2005
  55. a b Expert knowledge as embodied intuition:
  56. ^ Planning:
  57. a b Information value theory:
  58. ^ Classical planning:
  59. ^ Planning and acting in non-deterministic domains: conditional planning, execution monitoring, replanning and continuous planning:
  60. ^ Multi-agent planning and emergent behavior:
  61. ^ This is a form of Tom Mitchell's widely quoted definition of machine learning: "A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E."
  62. ^ Learning:
  63. ^ Alan Turing discussed the centrality of learning as early as 1950, in his classic paper Computing Machinery and Intelligence. (Turing 1950) In 1956, at the original Dartmouth AI summer conference, Ray Solomonoff wrote a report on unsupervised probabilistic machine learning: "An Inductive Inference Machine" (PDF scanned copy of the original; version published in 1957 as "An Inductive Inference Machine," IRE Convention Record, Section on Information Theory, Part 2, pp. 56–62).
  64. ^ Reinforcement learning:
  65. ^ Computational learning theory:
  66. ^ Natural language processing:
  67. ^ Applications of natural language processing, including information retrieval (i.e. text mining) and machine translation:
  68. ^ Robotics:
  69. a b Moving and configuration space:
  70. ^ Tecuci, G. (2012), Artificial intelligence. WIREs Comp Stat, 4: 168–180. doi: 10.1002/wics.200
  71. ^ Robotic mapping (localization, etc):
  72. ^ Machine perception:
  73. ^ Computer vision:
  74. ^ Speech recognition:
  75. ^ Object recognition:
  76. ^ “Kismet”. MIT Artificial Intelligence Laboratory, Humanoid Robotics Group.
  77. ^ Thro, Ellen (1993). Robotics. New York.
  78. ^ Edelson, Edward (1991). The Nervous System. New York: Remmel Nunn.
  79. ^ Tao, Jianhua; Tieniu Tan (2005). "Affective Computing: A Review". Affective Computing and Intelligent Interaction. LNCS 3784. Springer. pp. 981–995. doi:10.1007/11573548.
  80. ^ James, William (1884). "What is an Emotion?". Mind 9: 188–205. doi:10.1093/mind/os-IX.34.188. Cited by Tao and Tan.
  81. ^ “Affective Computing”MIT Technical Report #321 (Abstract), 1995
  82. ^ Kleine-Cosack, Christian (October 2006). "Recognition and Simulation of Emotions" (PDF). Archived from the original on 28 May 2008. Retrieved 13 May 2008. "The introduction of emotion to computer science was done by Pickard (sic) who created the field of affective computing."
  83. ^ Diamond, David (December 2003). "The Love Machine; Building computers that care". Wired. Archived from the original on 18 May 2008. Retrieved 13 May 2008. "Rosalind Picard, a genial MIT professor, is the field's godmother; her 1997 book, Affective Computing, triggered an explosion of interest in the emotional side of computers and their users."
  84. ^ Emotion and affective computing:
  85. ^ Gerald Edelman, Igor Aleksander and others have argued that artificial consciousness is required for strong AI. (Aleksander 1995, Edelman 2007)
  86. a b Artificial brain arguments: AI requires a simulation of the operation of the human brain

    A few of the people who make some form of the argument:

    The most extreme form of this argument (the brain replacement scenario) was put forward by Clark Glymour in the mid-70s and was touched on by Zenon Pylyshyn and John Searle in 1980.

  87. ^ AI complete: Shapiro 1992, p. 9
  88. ^ Nils Nilsson writes: “Simply put, there is wide disagreement in the field about what AI is all about” (Nilsson 1983, p. 10).
  89. a b Biological intelligence vs. intelligence in general:
    • Russell & Norvig 2003, pp. 2–3, who make the analogy with aeronautical engineering.
    • McCorduck 2004, pp. 100–101, who writes that there are "two major branches of artificial intelligence: one aimed at producing intelligent behavior regardless of how it was accomplished, and the other aimed at modeling intelligent processes found in nature, particularly human ones."
    • Kolata 1982, a paper in Science, which describes McCarthy's indifference to biological models. Kolata quotes McCarthy as writing: "This is AI, so we don't care if it's psychologically real"[1]. McCarthy recently reiterated his position at the AI@50 conference where he said "Artificial intelligence is not, by definition, simulation of human intelligence" (Maker 2006).
  90. a b Neats vs. scruffies:
  91. a b Symbolic vs. sub-symbolic AI:
  92. ^ Haugeland 1985, p. 255.
  93. ^ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.8384&rep=rep1&type=pdf
  94. ^ Pei Wang (2008). Artificial general intelligence, 2008: proceedings of the First AGI Conference. IOS Press. p. 63. ISBN 978-1-58603-833-5. Retrieved 31 October 2011.
  95. ^ Haugeland 1985, pp. 112–117
  96. ^ The most dramatic case of sub-symbolic AI being pushed into the background was the devastating critique of perceptrons by Marvin Minsky and Seymour Papert in 1969. See History of AI, AI winter, or Frank Rosenblatt.
  97. ^ Cognitive simulation, Newell and Simon, AI at CMU (then called Carnegie Tech):
  98. ^ Soar (history):
  99. ^ McCarthy and AI research at SAIL and SRI International:
  100. ^ AI research at Edinburgh and in France, birth of Prolog:
  101. ^ AI at MIT under Marvin Minsky in the 1960s:
  102. ^ Cyc:
  103. ^ Knowledge revolution:
  104. ^ Embodied approaches to AI:
  105. ^ Revival of connectionism:
  106. ^ Computational intelligence
  107. ^ Pat Langley, “The changing science of machine learning”Machine Learning, Volume 82, Number 3, 275–279, doi:10.1007/s10994-011-5242-y
  108. ^ Yarden Katz, “Noam Chomsky on Where Artificial Intelligence Went Wrong”, The Atlantic, November 1, 2012
  109. ^ Peter Norvig, “On Chomsky and the Two Cultures of Statistical Learning”
  110. ^ Agent architectures, hybrid intelligent systems:
  111. ^ Hierarchical control system:
  112. ^ Subsumption architecture:
  113. ^ Search algorithms:
  114. ^ Forward chaining, backward chaining, Horn clauses, and logical deduction as search:
  115. ^ State space search and planning:
  116. ^ Uninformed searches (breadth first searchdepth first search and general state space search):
  117. ^ Heuristic or informed searches (e.g., greedy best first and A*):
  118. ^ Optimization searches:
  119. ^ Artificial life and society based learning:
  120. ^ Genetic programming and genetic algorithms:
  121. ^ Logic:
  122. ^ Satplan:
  123. ^ Explanation-based learning, relevance-based learning, inductive logic programming, case-based reasoning:
  124. ^ Propositional logic:
  125. ^ First-order logic and features such as equality:
  126. ^ Fuzzy logic:
  127. ^ Subjective logic:
  128. ^ Stochastic methods for uncertain reasoning:
  129. ^ Bayesian networks:
  130. ^ Bayesian inference algorithm:
  131. ^ Bayesian learning and the expectation-maximization algorithm:
  132. ^ Bayesian decision theory and Bayesian decision networks:
  133. a b c Stochastic temporal models:

    Dynamic Bayesian networks:

    Hidden Markov model:

    Kalman filters:

  134. ^ decision theory and decision analysis:
  135. ^ Markov decision processes and dynamic decision networks:
  136. ^ Game theory and mechanism design:
  137. ^ Statistical learning methods and classifiers:
  138. a b Neural networks and connectionism:
  139. ^ Kernel methods such as the support vector machine:
  140. ^ K-nearest neighbor algorithm:
  141. ^ Gaussian mixture model:
  142. ^ Naive Bayes classifier:
  143. ^ Decision tree:
  144. ^ Classifier performance:
  145. ^ Backpropagation:
  146. ^ Feedforward neural networks, perceptrons and radial basis networks:
  147. ^ Recurrent neural networks, Hopfield nets:
  148. ^ Competitive learning, Hebbian coincidence learning, Hopfield networks and attractor networks:
  149. ^ Hierarchical temporal memory:
  150. ^ Control theory:
  151. ^ Lisp:
  152. ^ Prolog:
  153. a b The Turing test:
    Turing’s original publication:

    Historical influence and philosophical implications:

  154. ^ Subject matter expert Turing test:
  155. ^ Rajani, Sandeep (2011). "Artificial Intelligence – Man or Machine". International Journal of Information Technology and Knowledge Management 4 (1): 173–176. Retrieved 24 September 2012.
  156. ^ Game AI:
  157. ^ Mathematical definitions of intelligence:
  158. ^ "AI set to exceed human brain power" (web article). CNN. 26 July 2006. Archived from the original on 19 February 2008. Retrieved 26 February 2008.
  159. ^ Brooks, R.A., “How to build complete creatures rather than isolated cognitive simulators,” in K. VanLehn (ed.), Architectures for Intelligence, pp. 225–239, Lawrence Erlbaum Associates, Hillsdale, NJ, 1991.
  160. ^ Hacking Roomba » Search Results » atmel
  161. ^ Philosophy of AI. All of these positions in this section are mentioned in standard discussions of the subject, such as:
  162. ^ Dartmouth proposal:
  163. ^ The physical symbol systems hypothesis:
  164. ^ Dreyfus criticized the necessary condition of the physical symbol system hypothesis, which he called the “psychological assumption”: “The mind can be viewed as a device operating on bits of information according to formal rules”. (Dreyfus 1992, p. 156)
  165. ^ Dreyfus’ critique of artificial intelligence:
  166. ^ This is a paraphrase of the relevant implication of Gödel’s theorems.
  167. ^ The Mathematical Objection:

    Making the Mathematical Objection:

    Refuting Mathematical Objection:

    Background:

    • Gödel 1931, Church 1936, Kleene 1935, Turing 1937
  168. ^ This version is from Searle (1999), and is also quoted in Dennett 1991, p. 435. Searle’s original formulation was “The appropriately programmed computer really is a mind, in the sense that computers given the right programs can be literally said to understand and have other cognitive states.” (Searle 1980, p. 1). Strong AI is defined similarly by Russell & Norvig (2003, p. 947): “The assertion that machines could possibly act intelligently (or, perhaps better, act as if they were intelligent) is called the ‘weak AI’ hypothesis by philosophers, and the assertion that machines that do so are actually thinking (as opposed to simulating thinking) is called the ‘strong AI’ hypothesis.”
  169. ^ Searle’s Chinese room argument:

    Discussion:

  170. ^ Robot rights:

    Prematurity of:

    In fiction:

  171. ^ Independent documentary Plug & Pray, featuring Joseph Weizenbaum and Raymond Kurzweil
  172. ^ Ford, Martin R. (2009), The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future, Acculant Publishing, ISBN 978-1448659814 (e-book available free online).
  173. ^ “Machine Learning: A Job Killer?”
  174. ^ AI could decrease the demand for human labor:
  175. ^ In the early 70s, Kenneth Colby presented a version of Weizenbaum’s ELIZA known as DOCTOR which he promoted as a serious therapeutic tool. (Crevier 1993, pp. 132–144)
  176. ^ Joseph Weizenbaum‘s critique of AI:

    Weizenbaum (the AI researcher who developed the first chatterbot program, ELIZA) argued in 1976 that the misuse of artificial intelligence has the potential to devalue human life.

  177. ^ Technological singularity:
  178. ^ Transhumanism:
  179. ^ Rubin, Charles (Spring 2003). “Artificial Intelligence and Human Nature”The New Atlantis 1: 88–100.
  180. ^ AI as evolution:
