Contents

Preface v

PART 1 Introduction 1
1 Introduction 2

PART 2 Logic and Search 17
2 Logic 18
3 Search 46
4 Automating Logical Reasoning 72

PART 3 Uncertainty 103
5 Bayesian Networks I 104
6 Bayesian Networks II 133
7 Other Approaches to Uncertainty 154

PART 4 Deciding on Actions 173
8 Decision Networks 174
9 Planning I 189
10 Planning II 208

PART 5 Learning 225
11 Introduction to Learning 226
12 Decision Tree Learning 241
13 Inductive Logic Programming 253
14 Reinforcement Learning 269
15 Neural Networks I 286
16 Neural Networks II 312
17 Genetic Algorithms 330

PART 6 Natural Language Understanding and Perception 347
18 Natural Language Processing I 348
19 Natural Language Processing II 367
20 Speech Processing 396
21 Vision 409

PART 7 Agents, Philosophy and Applications 435
22 Agents 436
23 Philosophy of Artificial Intelligence 441
24 Some Applications of Artificial Intelligence 452

Appendix A A Quick Tour of Prolog 474
Bibliography 492
Index 500
CHAPTER 1
Introduction
This book is an introduction to techniques that are being used to build machines that manipulate information. These machines are being used today in a wide variety of applications, such as monitoring credit card fraud, making autonomous decisions on space missions, watching for attacks from computer network hackers, diagnosing faults in aircraft, enabling human–machine speech interfaces, and making the characters in a video game behave in a more human-like way. All of these machines are the result of technologists working in the field of artificial intelligence (AI). AI has become an important engineering discipline for building machines that can assist us and entertain us. But AI is not just about engineering; it is also a science that aims to further our understanding of our own and other animal intelligence.
You may ask ‘Is not the ultimate aim of AI to build machines that think like us, are mobile, perceive what is going on around them, can communicate with us, and have limbs similar to ours so that they can do the sort of physical things that we do – a humanoid?ʼ In some sense, AI is about striving to build this type of machine. The technologies required to build such machines are being developed. There are robots that look like humans in spacesuits that walk and climb stairs, and robots that express emotion. Aspects of perception such as seeing and listening are being heavily researched, along with all of the other elements that make up intelligence, e.g. reasoning, learning, language communication, planning, decision making, and common sense.
We are still a long way from building a humanoid, and we do not really know whether it will be possible to build artificial machines that think like us and have an understanding about the world in the same way that we understand our world. But we are building machines that today assist us in many aspects of our daily life and we can be sure that AI will infiltrate our lives more and more in the coming years without us even knowing. We can also be sure that these machines will be better than us at many tasks in much the same way that the calculator is better than any human is at arithmetic computations.
The aim of this chapter is to provide a little more of the background of what AI is about.
1.1 AI emerges from the laboratory
In 1965, a team headed by Edward Feigenbaum and Robert Lindsay started work on a computer system to assist chemists to determine the molecular structure of complex organic compounds. Their system was not programmed in a conventional manner. Instead, it used a body of knowledge, the kind of knowledge possessed by chemists, to help solve the problems it was presented with. The system was an expert at analysing molecular structure and it was the start of expert systems, computer programs that have expert knowledge in specialized domains.
The system developed by Feigenbaum and his team was called DENDRAL. Determining the molecular structure of a compound involves finding out which atoms are present and how these atoms are connected together. Chemists can obtain data about a compound by breaking it up into fragments using a device called a mass spectrometer. The results of mass spectrometry provide
clues as to the chemical composition of the fragments and how many types of each fragment there are. The problem for the chemist is that each fragment can correspond to several different types of substructure and yet all of these substructures must fit together into a single global structure. In theory, the chemist could test to see which combinations of substructures fit together to form the global structure. The problem is that there are too many possible combinations to make this approach practical. The problem is a little like being presented with a photograph of someoneʼs face that has been cut into many squares and jumbled up. There are many ways in which the squares can fit together but only one fit reproduces the face. If the photograph has been separated into four equal segments then there are 24 possible combinations of fit (ignoring rotations). The number of combinations increases rapidly with the number of segments. An exhaustive search would entail a systematic rearrangement of the pieces until the fit forms the image of the face. The search time can be significantly reduced if knowledge can be used. For example, in the face reconstruction task there are many constraints suggested by knowledge of real faces. For instance, the ears are to the left and right of the eyes, the eyes are approximately in line and above the nose, which is above the mouth, etc. DENDRAL used its knowledge base to constrain the search and help chemists discover the molecular structure of a compound.
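To get a feel for the scale of the search, the number of ways of arranging n distinct segments (ignoring rotations) is n factorial, which is why four segments give 24 possible fits. The short Python sketch below simply tabulates this growth; it is an illustration of why exhaustive search quickly becomes impractical, not anything DENDRAL itself computed.

import math

# Number of ways of arranging n distinct picture segments (ignoring rotations).
for n in (4, 9, 16, 25):
    print(n, "segments:", math.factorial(n), "arrangements")
# 4 segments give 24 arrangements; 25 segments already give roughly 1.6e25.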
A few years later, work started on another expert system for suggesting which micro-organism was causing a blood infection and how the infection should be treated. The system was named MYCIN. Tests are performed to gather data about the infection a patient has. One such test involves growing bacterial cultures from a sample taken from the location of the infection. These cultures can provide morphological features and staining characteristics of the organism. MYCINʼs knowledge can then be applied to these clues about the infection to diagnose the likely organism and to suggest treatment. Like DENDRAL, MYCINʼs knowledge was encoded as a set of IF…THEN rules. These rules consist of a set of conditions and a conclusion. If all of the conditions hold (are true) then the conclusion holds. A typical rule in MYCIN is:
IF    The stain of the organism is Gram negative
AND   The morphology of the organism is rod
AND   The aerobicity of the organism is aerobic
THEN  There is strong evidence to suggest that the class of organism is Enterobacteriaceae
When reasoning, an expert system will usually apply many rules. The conclusion from one rule can form part of the IF condition of another rule. For example, the following two rules might form part of an expert system for diagnosing a fault with a car engine:
1  IF    The engine will not turn over
   AND   The lights do not come on
   THEN  The battery is dead

2  IF    The battery is dead
   THEN  The car will not start
If rule 1 concludes that the battery is dead, then rule 2 will conclude that the car will not start.
DENDRAL and MYCIN were the first of many expert systems. These systems represented the first truly large-scale commercialization of AI technology. Large corporations started to develop expert systems to mimic the reasoning performed by human domain experts. The boom years for the traditional expert system were the 1980s. The boom was short-lived however. Too many people jumped on the bandwagon and started to suggest expert systems for applications for which they were not suited. Expert systems were good at tackling specific types of problem, but they represented only part of the AI technology under development. Many overestimated what expert systems were capable of, probably because expert systems were offshoots of the field whose aim is to equip machines with human-like intelligence.
Expert systems are in commercial use today, but it is now rare for an application to be called an expert system. To many in industry, the title ‘expert systemʼ still conjures up a perception of what the technology offered in the 1980s and yet there have been some significant advances since. For example, the MYCIN rule presented earlier had its conclusion couched in uncertainty – there is strong evidence to suggest… The car rules, although written here as black and white conclusions, are lacking in diagnostic reasoning. Again, the knowledge base has to have knowledge of how likely a conclusion is when the conditions in the IF part hold. There were techniques for handling uncertain knowledge in early expert systems, but the approaches were somewhat ad hoc. Today, good probabilistic-based techniques exist. The tools of today also offer many advances over early systems because software engineering techniques have made significant advances in the last 10–15 years. The term ‘knowledge-based systemʼ is more likely to be used today rather than ‘expert systemʼ.
1.2 What is an AI application?
Those early expert systems represent the essence of an AI application. An AI application is designed to mimic some aspect of human intelligence, whether that be reasoning, learning, perceiving, communicating, or planning. All of these facets of human intelligence rely on knowledge. AI applications must also be flexible so that they can respond to events that were not foreseen by the developers of the system. An AI program is not like a database program that queries for information from a table, it is not like a spreadsheet program that computes total sales by category of item, and it is not like a program for constructing and sending emails. All of these programs follow clearly defined sequences of instructions determined by what the user requests. In these conventionally programmed systems, it is possible to predict what the user will want to do and program the instructions accordingly. AI programs are somewhat different. Programs are still constructed by programmers, but the program they implement must be able to execute in a way that is highly flexible. Consider the task faced by a medical domain expert, a doctor. Often a patient presents a minor complaint that may easily be treated by prescription. There is usually no single treatment for any one complaint. A patient may not respond well to one form of medication or may suffer side-effects. Sometimes a doctor may not be sure what is causing the symptoms but believes that the cause belongs to a certain class of complaint that normally responds to a particular form of treatment. A doctor may decide that there is value in carrying out tests to assist the diagnosis. He or she may decide that a patient needs to be referred for more specialist consultation.
So what makes the programming of an artificial doctor different from, say, a program for filing and calculating your tax return? In many ways, the two domain tasks are similar. They both require elicitation of data concerning an individual. The doctor needs to know about symptoms and may perform tests to gather further data. The tax return program needs data concerning earnings, personal circumstances, expenses, etc. Both the medical and tax domains require expert knowledge. The doctor has gained a large amount of knowledge over many years of training. There are many rules regarding tax and a tax expert has good knowledge of these rules. Both types of domain expert seek to derive information from data and knowledge. The doctor hopes to identify the cause of the complaint and prescribe a treatment; the tax expert seeks to compute money owed to the tax office or money to be refunded to the individual. The real differences between the two domains lie in the amount of knowledge the expert requires, the type of data and number of exceptional circumstances, the consequences of a mistake, and the complexity of decision making. Earnings when filing a tax return are easily computed. The symptoms a patient suffers cannot always be so precisely described: symptoms may vary in severity, symptoms may come and go, and aspects of the data may be unknown (e.g. a negative test result might reduce the likelihood of some condition being present but the test may not be conclusive). A mistake on a tax return can always be recovered, possibly in the next tax year. A mistake in diagnosing a patient may not be so easily rectified. A doctor may need to consider many decisions, such as the need for additional tests and the type of treatment. The other major difference between the tax domain problem and the medical one is that a doctor will use a wide range of other information such as a patientʼs history and information derived through their own perceptions, such as vision and hearing. In summary, programming an artificial doctor cannot be tackled using conventional techniques.
Back in the 1980s, there were suggestions for more sinister applications of expert systems. Some people, for example, contemplated putting expert systems onboard missiles. The missile would be given a specific target and, should the target not be found, the expert system could decide if an alternative target existed or if the missile should return home to be recovered. At that time this form of application was ambitious. For a start, the system would need to reliably recognize legitimate targets, and this is no easy task. There is though a growing need for complex reasoning on autonomous vehicles. There is currently much funding and development of unmanned aircraft. These air vehicles have been used in recent conflicts for surveillance. Development is also under way on unmanned air combat vehicles. Many believe that the new Joint Strike Fighter currently in development will be the last manned combat aircraft to be produced on a large scale. There will be a drive for autonomous reasoning so that an aircraft can respond quickly and appropriately to events. Eventually, this reasoning will also extend to the aircraftʼs health so that it can take appropriate action should it develop a fault or suffer battle damage.
An application needs AI technology when there is no clear recipe of how to perform a task given all the circumstances that might arise. If you can sit down and put together a list of instructions that can state for a task, when X occurs do Y, then you may not need AI. Most technology applications today get along without AI, but they desire it just like man desired speed and progressed from horseback to mechanized transport. For example, today you do your banking by phone. You can dial in, enter a pass code, find out your balance, and pay a bill. The more involved the transaction is, the more awkward the interface becomes. You have to wait for a question to be asked, respond by entering data through the keypad, and, if all is okay, you go to the next stage. No need for AI here. But would you not prefer to do all this by speaking? It would certainly be quicker. The greater challenge though is when you want to do a transaction or make an enquiry that is not on the menu. You are then referred to a human operator. If you want a bank loan and the computer sees that all the right boxes are ticked then the loan can be approved. The computer always has a default instruction, ‘any exceptional circumstances refer to bankʼ. The computer might have a rule that says, ‘you need to have been in employment for at least 1 yearʼ. If you are a talented young graduate 3 months into a new job then surely the bank wants you as a customer? Of course it does, but you are still an exception requiring human attention!
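A conventional program for the loan decision described above might look like the following sketch. The 12-month threshold and the two rules come from the example in the text, but the function itself is purely illustrative; the point is simply that any case the programmer did not anticipate falls through to the default 'refer to bank' branch.

def assess_loan(months_employed, exceptional_circumstances):
    """A conventionally programmed check: anything unforeseen is referred to a human."""
    if exceptional_circumstances:
        return "refer to bank"        # the default instruction for exceptions
    if months_employed >= 12:         # the 'employed for at least 1 year' rule
        return "loan approved"
    return "refer to bank"            # e.g. a graduate 3 months into a new job

print(assess_loan(months_employed=3, exceptional_circumstances=False))   # refer to bank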
An AI application is in simple terms a system that possesses knowledge about an application domain, takes in data from its environment, and reasons about these data to derive information. This information is then used to make decisions and act. Often a human will make the decision and perform the action but technology is developing so that more of this responsibility resides in the
machine. Expert systems are good examples of what the essence of AI is, but an expert system is just one example of AI being applied to real-world problems. Other applications require vision, speech recognition, learning, mobility, etc. A specific expert system may also use these technologies.
1.3 What is AI?
Hopefully the preceding paragraphs have given some idea as to what AI is all about, but we have not directly tackled the question of what AI is. There is no crisp definition of AI, and a number of people have attempted to express what the aim of AI is. Marvin Minsky, the co-founder of the AI lab at the Massachusetts Institute of Technology (MIT), gives one such view of AI:
AI is the science of making machines do things that would require intelligence if done by men.
The machine is usually a digital computer.
It is generally recognized that AI became established as a scientific discipline at a conference in 1956 when a small group of like-minded scientists gathered to discuss their work and ideas. The fundamental underpinnings of the subject were to develop over the following years. It was John McCarthy, one of the organizers of the conference, who labelled the new discipline ‘artificial intelligenceʼ. The central hypothesis of traditional AI, known as the physical symbol system hypothesis, is that any facet of human intelligence can be understood and described precisely enough for a machine to simulate it. The world is to be represented internally by the machine using symbolic structures. Intelligence then emerges as a process of transforming these structures. Consider, for example, language communication. We use symbols (words) to refer to objects in the world and other aspects such as actions (e.g. running), time, emotion, and so on. A computer simulation of this form of communication can be implemented in a way that is similar to having a computer read the following mathematical expression and understand how to produce an answer:
[(x – 2/y) + 3]
A program would parse this expression to make sure that it can recognize it, and out of this parse it might form a tree structure like the one shown in Figure 1.1. To give an answer, the program needs to recognize a number and needs to know what to do with the mathematical operators /, + and –. These operators have meanings, and their meanings are the mathematical operations divide, add, and subtract. The program can recognize these symbols and call functions that implement these mathematical operators. The tree is scanned in a certain order to produce what is called Polish notation. Once this notation is derived, the calculation is simple: the expression is scanned from right to left and when an operator is encountered it is applied to the two operands to its right. Many language programs operate in a similar way by parsing sentences into tree structures. They then transform the trees into other structures that can represent the meaning. For example, suppose we have a database table with the following entry:
Relation    Argument 1    Argument 2
own         caroline      mia
Figure 1.1 A tree representation of the expression ((x – 2/y) + 3).

We could write a simple program that could answer: What does Caroline own? The program could parse the sentence into the following form:
own(caroline, X)
The program could then look up the table to see where it finds a match for the relation own and the first argument caroline. In this example, the variable X would be assigned the value Mia (because a variable can refer to any object) and the system would respond with ‘miaʼ. The above expression is asking what can match with own(caroline, X). The database contains the relation own(caroline, mia). The variable X can match with any other symbol, which in this case is ‘miaʼ.
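Both toy examples above can be sketched in a few lines of Python. Neither part is the program described in the text; they are illustrative only. The first part evaluates an expression in Polish notation by scanning it from right to left, using the invented values x = 10 and y = 4, and the second part matches the query own(caroline, X) against a stored fact, treating terms that start with an upper-case letter as variables.

import operator

ops = {"+": operator.add, "-": operator.sub, "/": operator.truediv}

def eval_prefix(tokens):
    """Evaluate Polish (prefix) notation by scanning right to left."""
    stack = []
    for tok in reversed(tokens):
        if tok in ops:
            a, b = stack.pop(), stack.pop()   # a is the left operand
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[0]

# ((x - 2/y) + 3) in prefix form, with the illustrative values x = 10 and y = 4:
print(eval_prefix("+ - 10 / 2 4 3".split()))          # prints 12.5

def match(query, fact):
    """Return variable bindings if the query matches the fact, else None.
    Terms starting with an upper-case letter (such as X) are variables."""
    bindings = {}
    for q, f in zip(query, fact):
        if q[0].isupper():
            bindings[q] = f        # a variable can match any symbol
        elif q != f:
            return None            # constants must match exactly
    return bindings

fact = ("own", "caroline", "mia")
print(match(("own", "caroline", "X"), fact))          # prints {'X': 'mia'}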
The physical symbol system hypothesis is still waiting to be proved. Today, machines can perform useful language communication, but they still fall far short of what a young child is capable of. Many would argue that the physical symbol system hypothesis is not sufficient to replicate human intelligence. There is a big question over whether internal representations are adequate to prime a machine with enough understanding so that it can reason intelligently. For example, can we really describe a garden tree to a computer using structures that convey features such as texture, size, and being rooted in the ground? Or does the machine need to have its own perception of the world so that it can learn for itself what a tree is? Consider the challenge of how to represent a chair inside a machine. Four legs to which is attached a small surface upon which a backrest may be present? I would not describe my sofa as having four legs, and what about a more unusual design of chair that has two sloping legs and stabilizing feet? How then do we find it relatively easy to recognize an object as a chair even when we have not seen some unusual design before? Perhaps we see these objects as comfortable spots to park our weary limbs and perhaps on some occasions we use the context of our surroundings, like a restaurant, where we can expect to be seated, or the selection of dining suites in a furniture store.
AI is both an engineering and scientific discipline. Most people in AI want to build machines that can do things that require human-like intelligence. Others within AI want to understand more about human intelligence. The two interests are, of course, complementary. The act of building applications can tell us much about intelligent processes, and discovering more about ourselves through more direct methods such as psychological studies or the biology of the brain gives us insight into how to build intelligent machines.
If your interest in AI is to build a human-like agent to do the household chores then you should be encouraged. This is a noble challenge, and success would surely capture the interest of the world at large. In reality, such machines are still a long way off. Whether these machines await a huge scale of effort similar to the first manned missions to the moon, who knows? It might be that we have the fundamental collection of techniques already. It might be that we need to radically rethink some aspects of AI. Notwithstanding the considerable challenges facing AI researchers, there is great interest in what AI can do for real-world applications. This interest is also growing. The
problem-solving techniques developed within the discipline of AI tackle hard problems and in some areas the performance is better than that of humans. Some examples of the types of application are listed below.
• Systems have been demonstrated that can perform as well as humans in restricted domains of medical diagnostics.
• Machines can now recognize speech to a level competent enough for continuous dictation.
• Unmanned space vehicles have flown that use AI-based planning to allow onboard autonomous reasoning.
• Machine learning is used to detect credit card fraud.
• Large corporations use data mining to hunt out information in huge databases. Such applications assist companies such as supermarkets to devise their sales strategies.
• Machines have been trained to recognize individuals from images of their faces.
• Machines can also track the eye and mouth movement of an individual who is speaking and moving their head.
• Topics of interest in news stories can be filtered and summarized by machines.
• Gaming companies actively employ AI technologists to program human-like traits into the agents of a game.
This is just a small sample of the sort of problems AI is tackling. Some of the technology areas are more advanced than others. Often the domain of application is very restricted and the environment controlled. For example, you might be able to communicate with a machine to enquire of tourist information within a city but you would not be able to widen the conversation and talk about the weather, employment opportunities, or sport.
AI is largely about building machine programs that can simulate some aspect of human intelligence. There are algorithms for reasoning, learning, planning, speech recognition, vision, and language understanding. AI takes from and gives back to many disciplines that include mathematics, psychology, neurobiology, philosophy, computer science, and linguistics. There are many professions that have an interest in AI, such as medicine, engineering, and law.
1.4 Different models of intelligence
AI has sometimes been partitioned by different researchers into various camps that hold a particular view of how human intelligence is best modelled. Models of intelligence have been influenced by different subjects, such as mathematical logic, psychology, biology, statistics and economics. These influences can be seen in the subjects presented in this book. The logical and psychological models have influenced much of what is referred to as traditional AI or classical AI. An example of biologically inspired AI is connectionist AI, in which the computing elements resemble an abstraction of our own neural circuitry. Neural networks are the computing tools of connectionists. Most work on connectionist architectures has been done in the last 20–30 years. Early work on neural networks predates the 1956 conference at which AI was born, but traditional AI is so called because neural networks played a minor role in the formative years of AI. Connectionism has really boomed since the late 1980s.
Neural networks are structures that consist of many simple processing nodes connected via weighted links. The nodes and links form a network. The type of processing performed by a node varies depending on the node type, but the processing is usually very restricted, often doing nothing more than summing signals sent from other nodes and transforming this sum into an output. This output can then be sent on to other nodes or to some interface that communicates with a user. The
weighted links can inhibit or strengthen a signal transmitted between nodes. The fascination with these networks is that they can perform complex tasks and yet they are constructed out of very simple elements. Of course, a computer at a fundamental level is also composed of simple elements such as sequences of bits which are subjected to simple operations.
To get anything useful out of a computer, it has to be programmed, and the more complex the task the more complex the instruction set that has to be entered. A neural network is simulated on a computer using a conventional program, but the network is not explicitly instructed in how to perform a task. Think for a moment about the rules in an expert system. These systems are constructed using a tool known as a shell. The shell is a high-level programming environment that permits the programmer to write a set of rules and test these rules. The shell has an inference engine, which is a program that knows how to use the rules to reason and derive conclusions. The rules represent the programmerʼs knowledge but the inference engine is the same whatever the domain of application. The shell is similar to a spreadsheet program: you can enter data into a spreadsheet and state what you want to be done, but there is a lower level program that interprets your instructions. The rules in an expert system are explicit instructions: if condition a holds and condition b holds then you can conclude c. A neural network, on the other hand, is not usually explicitly programmed. Instead, you feed the network with examples of the sort of data it is to process and you tell it what you expect it to output for each example. The network then attempts to adjust its weighted connections to comply with your wishes. For example, you could program an expert system with rules to credit score a bank loan applicant. The rule base would be fed data about the applicant, such as salary and current debt, and rules would be used to derive a conclusion that indicates whether or not the applicant is successful. In contrast, a neural network would be trained using known case histories. For each example, the network is fed with input data (the same data fed to the rule base) and the network is told for each example whether or not that individual was a successful applicant. Over time the network adjusts its weighted connections so that for each case it can output whether the applicant is successful or unsuccessful. The network is basically told: given this type of data I want you to produce this result. Neural networks are largely self-programming devices – they learn. We should note here that there are also many learning techniques that fall within the traditional or symbolic AI paradigm.
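The contrast can be sketched as follows. The rule, the training cases and the learner below are all invented for illustration (the learner is a simple perceptron, not any particular commercial system), but they show the difference between writing the decision knowledge by hand and letting weighted connections be adjusted from examples.

# Each case: ((salary, current debt) in thousands, 1 = successful applicant).
cases = [((50, 5), 1), ((20, 15), 0), ((40, 2), 1),
         ((15, 20), 0), ((60, 30), 1), ((25, 25), 0)]

def rule_based(salary, debt):
    """The explicitly programmed approach: a hand-written rule."""
    return 1 if salary > 30 and debt < 35 else 0

def train_perceptron(cases, epochs=20, rate=0.01):
    """The learned approach: adjust weights until the known cases are reproduced."""
    w, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (salary, debt), target in cases:
            out = 1 if w[0] * salary + w[1] * debt + bias > 0 else 0
            err = target - out                 # zero when the case is already correct
            w[0] += rate * err * salary
            w[1] += rate * err * debt
            bias += rate * err
    return w, bias

w, bias = train_perceptron(cases)
salary, debt = 45, 10                          # a new applicant
print(rule_based(salary, debt))                                   # 1
print(1 if w[0] * salary + w[1] * debt + bias > 0 else 0)         # also 1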
We may suppose that neural networks ought to be the mainstay of AI in view of their abstract similarity to our own computing hardware. The situation at present is that our engineering skills in traditional AI techniques are more widely applicable. A pragmatist will use neural networks for some tasks and traditional approaches for others. We could get tangled in endless debate over which paradigm will yield the most fruitful results in the future. Some traditionalists argue that neural networks offer nothing new other than a platform for program implementation. They believe that at a functional level our reasoning operates along the same lines as traditional symbol manipulation. We think in a language and this language is similar in form and operation to logic or rules. Some connectionists argue that the traditional approach of symbolizing representations is ultimately restricted. For instance, symbols that are text based fail to capture the contextual information that is necessary for much intelligent reasoning. For example, a watch could be described differently depending on the context of how it is to be utilized. It may be desirable for an everyday watch to have a range of functions and be robust, but for a dress watch look and style may be more important than the range of functions it offers.
Connectionism does computation in a different way to computing with symbols, which is the mainstay of computing as we know it today. While some still look for an all-encompassing theory for building intelligence and will argue for a paradigm like symbolism or connectionism, there is an emerging consensus that intelligent machines will need multiple approaches to modelling aspects of human intelligence. Furthermore, there is no longer a clear divide of techniques between symbolism and connectionism. This can be seen in areas where the role of statistics is gaining more emphasis.
1.5 Representation
We reason, plan, learn, and perceive about objects and events in the external world. Somehow we hold representations of the events and objects in our mind. If you are planning a dinner party you must hold in your thought process some representation concerning objects that will form the menu and other objects that represent your guests. A machine simulation of intelligence will also need representations of the external world. A great facet of human intelligence is that we can reason by simulating actions in the real world. This simulation is important because we can predict things that might happen and do not put ourselves at a disadvantage through trial and error. Suppose you want to rescue a cat from a tree. You can reason about the use of a ladder, whether the ladder is long enough, whether the cat has the temperament to let you pick her up, whether you can balance with the cat, etc. Trial and error means that you do not give much thought to what you are about to do and you go ahead in the hope that somehow everything fits together. To do this reasoning, we need to represent knowledge about the world. We need internal representations, substitutes for things, such as the cat, the ladder, and the tree. We also hold some representation for abstract things such as fear, distress, or love. The thought of losing our beloved cat might cause us to overcome a fear of heights.
The whole science of AI has largely revolved around issues of knowledge representation. That said, there is a school of people within AI building machines without concern for specifics of knowledge representation. These machines are not built with the purpose of reasoning in the same way as humans. These machines use pattern recognition, and the association of meaning to a pattern is achieved through interacting with the environment. The idea is for these machines to exhibit intelligent behaviour, but their processes may be very different to our own. For example, a neural network can be fed sentences with no knowledge of word category such as noun or verb, and yet it develops its own internal representations that exhibit patterns for verbs and inanimate and animate objects. We may suppose that teachers use much expert knowledge when marking studentsʼ essays and that they follow a complex chain of reasoning that uses this knowledge. It is not uncommon for a teacher to explain how a mark was derived. Such explanation suggests access to expert knowledge. It is surprising then to know that a purely pattern-based text system that has no explicitly programmed domain expertise can automatically grade studentsʼ essays and that the grades correlate highly with those given by expert markers. Pattern-based reasoners still possess a representation, but their representations are not necessarily accessible by us. We may not be able to open and inspect their knowledge base in the same way you could with a rule-based expert system. In a rule-based system the rules are syntactic structures that we can read and interpret. The elements that represent the knowledge in a pattern-based reasoner may look totally meaningless to us. It is a bit like looking at the internal representation a computer has of a photograph. In the computer the photograph is a huge array of numbers. Looking at these numbers means nothing to us, but when the computer displays the photograph these numbers are transformed into colour and intensity levels that reveal the image and its meaning.
Applications that exhibit intelligent behaviour are being built without concern for explicitly structuring the representation of domain knowledge. However, the representation of knowledge has played a key role in the development of intelligent machines, and it continues, and will continue, to do so. For any application, you need to understand what the objectives are and then you need to decide which problem-solving strategy offers the best approach – the application may require explicitly programmed knowledge or the system may be able to learn such that its behaviour, once trained, offers the intelligent simulation that the application demands. A whole chapter could have been written about knowledge representation, but instead I have taken the approach of dealing with representation implicitly in the context of different problem-solving strategies. Although representation will be specifically discussed only rarely, it is always there in the background, be it in the form of logical structures or some semantically enriched graph structure. It is, however, important to have an understanding of what is meant by knowledge representation as the term crops up time and again.
A representation, by definition, is a substitute for the real thing. A street map of a town is a representation. The map holds just enough detail for the purpose it is intended – finding a street and navigating between locations. You do not get an image of what the town looks like and you are not supplied with irrelevant information such as the routing of telecommunication cables. Pilots use maps to plan flight paths, but their maps look quite different from ground-based maps. An air map shows key ground features such as major roads and railways, but it also contains information concerning restricted airspace and hazards to flight. The information represented by an air map can be used to work out a heading to be flown. Pilots are trained to navigate. They operate in many ways like a programmed machine. They have access to a functional tool, the flight computer, and have a representation of key data to which they apply their navigational inference. The answer they derive also depends on data received at the time of planning, e.g. wind direction and speed.
Knowledge-based inference machines work in a similar way. They represent knowledge and they have a program that knows how to apply this knowledge to case-specific data to derive answers.
There are many forms of representation; three types will be briefly discussed and then some general points will be considered. Later chapters will include more detail of specific forms of representations such as logic, connectionist networks, natural language grammars, and causal networks.
1.5.1 Production systems
A production system is made up of IF … THEN rules. The IF part of the rule will contain one or more conditions and is called the antecedent. The THEN part is the consequent. A number of these rules exist in long-term memory (are stored when the system is not running), and the working memory holds information that is currently being processed. The basic mechanism of inferencing (reasoning) is to match the information in the working memory with rule antecedents in long-term memory. If a rule matches, then it is triggered, and if a triggered rule is fired its consequent is added to working memory and this new information may cause other rules to fire. Sometimes conflict resolution strategies are defined should more than one rule be triggered. The strategies determine which one of the triggered rules to fire.
An example of a production rule is:
IF    X has skin
AND   X eats food
AND   X breathes
THEN  X is an animal
In this rule the antecedent has three conditions. The rule also contains a variable, X, that can refer to any object.
Figure 1.2 shows a simple example of how matching and rule firing are used to achieve inferencing.
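A minimal version of this match-and-fire cycle can be written in a few lines of Python. The sketch below uses the small laundry rule set of Figure 1.2, shown below; the data structures are illustrative and no conflict resolution strategy is applied, triggered rules are simply fired in the order they are stored.

# Long-term memory: each rule is (set of conditions, conclusion).
rules = [
    ({"clothes are wet", "day is sunny"}, "dry clothes outside"),         # rule 1
    ({"clothes are wet", "day is wet"}, "dry clothes in dryer"),          # rule 2
    ({"clothes are dirty", "more than 15 dirty items"}, "wash clothes"),  # rule 3
    ({"wash clothes"}, "clothes are wet"),                                # rule 4
]

# Working memory: the datum '20 dirty items' is entered here as the condition
# it satisfies, 'more than 15 dirty items'.
working_memory = {"clothes are dirty", "more than 15 dirty items", "day is sunny"}

fired = True
while fired:                        # keep cycling until no rule can fire
    fired = False
    for conditions, conclusion in rules:
        if conditions <= working_memory and conclusion not in working_memory:
            working_memory.add(conclusion)       # firing adds the conclusion
            print("fired:", conclusion)
            fired = True
# Prints 'wash clothes', 'clothes are wet' and then 'dry clothes outside',
# the same sequence as steps 1 to 3 in Figure 1.2.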
Long-term memory (rules):

1. IF clothes are wet AND day is sunny THEN dry clothes outside
2. IF clothes are wet AND day is wet THEN dry clothes in dryer
3. IF clothes are dirty AND more than 15 dirty items THEN wash clothes
4. IF wash clothes THEN clothes are wet

Working memory:

Initial contents: clothes are dirty; 20 dirty items; day is sunny
Step 1: match with rule 3, which fires and adds ‘wash clothesʼ
Step 2: match with rule 4, which fires and adds ‘clothes are wetʼ
Step 3: match with rule 1, which fires and adds ‘dry clothes outsideʼ

Figure 1.2 Example working of a small production system. Data come into the working memory and are matched against rules in the long-term memory. When a rule is fired, its conclusion is added to the working memory and this newly derived information can lead to the firing of other rules.

1.5.2 Semantic networks

A semantic network consists of nodes that denote concepts or objects, links between the nodes that denote object relations (also called associations), and link labels that denote specific relations. For example, Figure 1.3 shows a semantic network for representing the concept arch. An arch consists of three parts: a top and two sides. The sides, which must not be touching, support the top. If the objects for the tops and sides are allowed to be of any type, the semantic network in Figure 1.3 provides a general description that can be used to recognize any arch. For example, the sides could be cylindrical and the top could be a pitched roof.
The labelled links can be used to express various forms of relations. These could be structural, such as is-part-of or x-contains-y, subtyping, as in ako (a kind of), similarity based, in which a numeric weight indicates the strength of association between one object and another, or more complex, e.g. hit, loves, and go. Another example of a semantic network is shown in Figure 1.4.
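A semantic network of this kind is easy to hold in a program as a set of labelled links. The Python sketch below is illustrative (it is not a standard library); it stores the animal network of Figure 1.4 and answers property queries by following ako links, which is how we can infer that a rook can fly and must breathe.

# Each entry: (node, link label, node).
links = [
    ("animal", "eats", "food"), ("animal", "must", "breathe"), ("animal", "has", "skin"),
    ("bird", "ako", "animal"), ("bird", "has", "wings"),
    ("bird", "has", "feathers"), ("bird", "can", "fly"),
    ("fish", "ako", "animal"), ("fish", "has", "fins"), ("fish", "can", "swim"),
    ("rook", "ako", "bird"), ("rook", "colour", "black"),
]

def holds(node, label, value):
    """True if the relation is stored for the node or inherited via ako links."""
    if (node, label, value) in links:
        return True
    parents = [b for (a, lab, b) in links if a == node and lab == "ako"]
    return any(holds(parent, label, value) for parent in parents)

print(holds("rook", "can", "fly"))        # True, inherited from bird
print(holds("rook", "must", "breathe"))   # True, inherited via bird from animal
print(holds("fish", "can", "fly"))        # False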
1.5.3 Frames and object-oriented concepts
Marvin Minsky proposed the concept of frames to represent stereotypical situations. You have, for example, an expectation of what will be found in a kitchen, e.g. a cooker, storage cupboards, a sink, and various forms of utensils. Events can also be described in terms of stereotypical situations.
Figure 1.3 A semantic network for the concept arch (Winston, 1970).
Eating at a restaurant generally entails being seated, selecting from the menu, order taking, eating, and paying the bill. Minsky proposed frames as structures that could be called up from memory and adapted when a new situation is encountered. The frame would be adapted by changing certain features, such as the colour of the walls in a kitchen or whether the floor is tiled or carpeted. A frame could also have properties that must be present for the frame to be applicable to the situation. A kitchen frame may insist on the presence of a sink and a cooking appliance. Frames could also be linked together to organize knowledge and express relations.
The concept of frames can be found in the more general object-oriented programming paradigm that is popular today. The terminology of frames is different from that of the object-oriented school, and the motivation for frames is to perform inferencing, which is a specific type of computational task that is not generally required in the wider field of object-oriented programming. Nonetheless, we shall introduce a number of key object-oriented concepts as they support the structures and mechanisms that are suggested by frames. Indeed, many AI systems are implemented using an object-oriented language. The wider concepts of object-oriented techniques are also being utilized in some advanced frameworks for inferencing, one example being an extension to structuring probabilistic causal inferencing, which is introduced in Chapters 5 and 6.
An object can be used to denote anything we wish. It could be a concrete concept such as a book or car, or something more abstract, e.g. shape. A class provides a framework for describing a type of object. The class is the blueprint, much as a drawing is a blueprint for a house. An object is an instantiation of a class. For instance, a book has features such as title, author, publication date, and publisher. The class ‘bookʼ provides a generic description using features, but an instance is an object that has values assigned to these features. Instead of the term ‘featureʼ, the word ‘propertyʼ is used to denote an attribute of an object. Properties are pieces of information about an object, such as colour, author, publisher, and number of pages. Methods describe behaviour in that they declare what an object can do. For example, a book object might have a list of keywords that describe the bookʼs content. The list of keywords is a property, but a method could be used to see if a particular word exists in the list of keywords. A graphical object that is part of a user interface will have methods such as draw, move, hide, and resize.
[Figure 1.4 shows the following network: Animal eats Food, must Breathe and has Skin; Bird is ako (a kind of) Animal, has Wings, has Feathers and can Fly; Fish is ako Animal, has Fins and can Swim; Rook is ako Bird and has colour Black.]

Figure 1.4 Semantic network showing some animal concepts and properties. This network is an efficient way of memorizing data. For example, we do not need to store explicitly the knowledge that a bird must breathe. We can infer this fact because a bird is a kind of animal and animals must breathe.
There may be some confusion as to the granularity of an object. For example, this text is an instance of the class ‘bookʼ, but there are many instances that are all copies of this text. The important thing is that an object should be identifiable and so every object should have some property that is unique. A library may contain many copies of the same book, but every copy can be uniquely identified in terms of a number.
Objects can be linked (associated). One special type of association is inheritance. Class is synonymous with type and inheritance is used to define type specialization. This text is of type book, but a book is a subtype of a more general class called publication. All publications have a title, publisher, and date. Journals, newspapers, and books are all subtypes of publication, and they differ from each other in the way that they specialize publication. A book has a property that is a list of chapters, whereas a journal will contain a list of papers. The publication class is said to be a parent of book, journal, and newspaper. Also, book and journal can be referred to as child classes, or derived classes, or specializations, or subtypes, or subclasses. Inheritance is a special form of association because it usually has built-in support that allocates by default all properties and methods found in a parent class to a derived class. For example, in Figure 1.4, bird is derived from animal and so all birds have properties must-breathe, eats-food, and has-skin. There are restrictions that can be enforced to determine which properties and methods are inherited, but this is detail that need not concern us here. Any other form of association can be modelled. The links that form associations are usually implemented by having a property in one object hold a reference to the associated object.
Another useful feature of objects is that they can be programmed to respond to events. Probably the most commonly encountered event with computers is the press of a key when typing or the click of a mouse button. An event will invoke a method that describes how the object should respond. Clicking the mouse on part of a window might cause the window to close. When the water in a kettle boils (an event), the kettle should switch itself off after a short delay.
A method can have different behaviour depending on the type of object being processed. The method for drawing an object to the screen is dependent on the shape of the object. For instance, the draw method for a circle-shaped object is different to that for a cube.
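These ideas translate directly into code. In the Python sketch below (the class names, properties and values are illustrative) the book class inherits the properties of its parent publication class by default, and the describe method is specialized in the subclass in the same way that a draw method differs from one shape to another.

class Publication:
    def __init__(self, title, publisher, date):
        self.title = title            # properties shared by all publications
        self.publisher = publisher
        self.date = date

    def describe(self):
        return f"{self.title} ({self.publisher}, {self.date})"

class Book(Publication):              # Book inherits Publication's properties
    def __init__(self, title, publisher, date, chapters):
        super().__init__(title, publisher, date)
        self.chapters = chapters      # a specialization: books list chapters

    def describe(self):               # specialized behaviour
        return super().describe() + f", {len(self.chapters)} chapters"

b = Book("An AI Text", "A Publisher", 2005, ["Introduction", "Logic", "Search"])
print(b.describe())                   # An AI Text (A Publisher, 2005), 3 chapters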
The object-oriented paradigm is excellent for modelling and implementing programs. There is growing demand for software components that can be used in different applications. The component may be a learning algorithm, a component for displaying graph structures and executing graphical operations, or a component that filters emails and routes them to particular storage folders. The object-oriented programming paradigm and component software techniques are of practical interest to AI in that they are tools that assist in building large complex systems out of smaller parts.
There are many real-world examples that can be modelled using object-oriented techniques. Consider simulating a lift system with two lift shafts. The objects are the two lifts, the people using the lifts, the request service panels on each floor, and the control panels inside each lift. The behaviour of each lift can be described using methods (e.g. move up, move down, and wait). The behaviour can also be constrained in terms of control logic rules. For example, ‘if moving down and a request for service to a lower floor is received after a request to move to a higher floor, continue to move downʼ. Passenger requests issued in the lift or on the various service levels are events. A request to go to a particular floor is an event that the lift must handle along with all other events. Responding to events in strict order of receipt is not likely to be efficient, and so some careful control logic is applied. A sophisticated simulation might get both lift objects communicating in an attempt to optimize the service to customers. The simulation allows engineers to see how the lifts respond to different scenarios of user demand.
1.5.4 Key features of a representation
Davis et al. (1993) state that a knowledge representation is best understood in terms of five distinct roles that it plays:
1 A representation is a substitute. The substitute allows us to think rather than act. For instance, we often play out scenarios in our own minds to reason about the consequences of our actions. Planning an evening out requires substitutes for venues, transport, friends, likes and dislikes, etc.
2 A representation suggests in what way you should think about the world. For example, a procedural representation for instructions to change a carʼs wheel would be written as a sequence of instructions: apply handbrake, loosen nuts, jack car, and so on. Representing this type of knowledge using objects does not appear so natural. On the other hand, modelling knowledge concerning the components that make up a car engine might fit naturally with an object description. Some objects are composed of other objects and there exist objects of the same type, such as hoses, pistons, etc.
3 A representation is a fragmentary theory of intelligent reasoning. Representations are based on theories about what reasoning is. Often these theories are incomplete or are generated by some small insight, for example observations of behaviour from psychological studies. Prolog is an AI programming language that has a representation rooted in mathematical logic. Knowledge is represented in Prolog using logical expressions and these expressions support reasoning, as we shall see in later chapters. The rules in expert systems are inspired by psychological studies of how domain experts reason. Connectionist networks are biologically inspired. Causal networks use statistical reasoning. All of these representations present a view on what reasoning is and how it is to be done. A connectionist view is that knowledge can be represented in the weighted connections of nodes and that inference is basically achieved by pattern association.
4 A knowledge representation is a medium for efficient computation. A representation is useful only if we can compute with it. Much research concerns the efficiency of the computation. Often efficiency is judged in terms of space (memory requirements) and speed of inference. The question of efficiency is very important for real-world applications.
5 A knowledge representation is a medium of human expression. A representation is a way for us to impart knowledge to our machines. It is also a way for humans to exchange and share knowledge with one another. Software engineers are familiar with this role for representation. They will often use a modelling language to document the design of a program and to communicate ideas with other members of the team. They will judge a representation on how easy it is to express things, how concise the representation is, and how readable it is to others.
Further reading
For a good overview of work in AI during the first 40 years see Grevier (1993). Grevier also provides a good general and readable introduction to AI, as does Copeland (1993), albeit from a more philosophical perspective. An interesting read on intelligence and a general discussion about its computational modelling is provided in Davis (1998).
A general book on representation is Ringland and Duce (1988). Davis et al. (1993) explore in depth the nature of representation. For specific discussions of various representation formalisms see Collins and Quillian (1969) for semantic networks, Minsky (1975) for frames and Schank and Abelson (1977) for scripts. An early example of learning semantic structures is Winston (1970).
A general read on some of the capabilities of connectionist architectures is Clark (1993). An example of intelligent-type behaviour without explicit programming of knowledge is Brooks (1991). Two examples of text processing without explicit knowledge of language are Elman (1990) and Deerwester et al. (1990).
CHAPTER 15
Neural Networks I
Artificial neural networks are parallel computing devices consisting of many interconnected simple processors. These processors really are simplistic, especially when compared with the type of processor found in a computer. Each processor in a network is aware only of signals it periodically receives and the signal it periodically sends to other processors, and yet such simple local processes are capable of performing complex tasks when placed together in a large network of orchestrated cooperation.
Work on artificial neural networks dates back to the 1940s, but it has only really been since the late 1980s that this form of computation has become popular, following some theoretical leaps in learning and the growth in available computing power. The word ‘artificialʼ is sometimes used to make it clear that discussion is about an artificial device and not about neural circuitry found in animals. The human computer (the brain) performs some tasks that computers have not, as yet, come close to matching in performance, and some would argue that a number of these tasks might always prove elusive to computer scientists. It is not surprising then that our own neural architecture should be the inspiration for computing machines. At this point in time, in comparison with the human brain, artificial neural networks are simplistic abstractions. It is common practice to drop the prefix ‘artificialʼ when it is clear in which context these networks are being discussed. Also, artificial neural networks are often referred to as connectionist networks when computing ability is emphasized rather than biological fidelity. In other words, connectionists aim to make neural networks solve a task rather than attempt to mimic faithfully some part of biological computation.
Although neural networks can be (and are) implemented as fast hardware devices, much research is performed with a conventional computer using software simulation in place of hardware implementation. Not only does software simulation provide a relatively cheap and flexible environment with which to develop ideas, but also real-world applications can be delivered in the form of a conventional piece of software. For example, neural networks are applied to credit score loan applicants, and many other applications exist that do not demand a hardware implementation.
Neural network solutions are growing more sophisticated, and no doubt over the coming years our skills for engineering these computing devices will improve. Already, though, there is a vast array of exciting developments. The application base for neural networks is enormous: credit card fraud detection, stock market forecasting, credit scoring, optical character recognition, machine health monitoring, road vehicle autopilots, learning to land damaged aircraft, etc. Further understanding of the human brain will inspire future artificial neural networks, but there is mutual support in that artificial neural networks are being used as models for brain processes to assist in our understanding of the human brain.
Neural networks are typically programmed through learning, and hence their place within this part of the book. In this chapter, we shall introduce a type of architecture known as the feed-forward network and present an algorithm for learning called backpropagation. Most practitioners would accept that the backpropagation algorithm played a major role in bringing neural networks back to mainstream computing.
15.1 The basic components
A neural network is a collection of nodes that are connected in some pattern to allow communication between the nodes. These nodes, also referred to as neurons or units, are simple processors whose computing ability is typically restricted to a rule for combining input signals and an activation rule that takes the combined input to calculate an output signal. Output signals may be sent to other nodes along connections known as weights. The weights usually excite or inhibit the signal that is being communicated. A neural network node is illustrated in Figure 15.1.
One of the intriguing aspects of neural networks is that, although they have nodes with very limited computing capability, when many of these nodes are connected together the complete network is capable of performing complicated tasks.
The pattern of connectivity refers to a networkʼs wiring detail, i.e. the detail of which nodes connect, their direction of connection and the values of their weighted connections. The task that a network knows (or its program) is coded in the weights. The connection pattern is usually determined by a two-stage process: first, the system designer specifies which nodes are connected and in which direction and, second, the weight values are learned during a training phase.
Sometimes the weights can be determined without training, but the great appeal of neural networks is their ability to learn a task from being exposed to the kind of data that the network will be expected to process when fully operational. Indeed, for many applications, training is the only option for programming a network because we do not have the task-solving knowledge.
There are many different types of neural network but, in general, they will share the following features:
• a set of simple processing nodes;
• a pattern of connectivity;
• a rule for propagating signals through the network;
• a rule for combining input signals;
• a rule for calculating an output signal;
• a learning rule to adapt the weights.
15.1.1 A set of simple processing nodes
Each processing node usually has incoming weights to receive signals from other network nodes and outgoing weights to transmit signals to other network nodes. Some nodes, input nodes, exist to receive input signals. There are also nodes, termed output nodes, for outputting the results of computation. For example, if a network is used as a classifier, the inputs would be attribute values and the outputs would correspond to the different classes.
Figure 15.1 A single network node, with incoming weights receiving signals from other nodes and outgoing weights sending signals on to other nodes.
15.1.2 A pattern of connectivity
The pattern of connectivity refers to the way in which the nodes are connected. In one type of network, each node may connect to every other node; in another network type, nodes may be arranged into an ordered hierarchy of layers in which connections are allowed only between nodes in immediately adjacent layers; yet other types of network allow feedback connections between adjacent layers, or within a layer, or allow nodes to send signals back to themselves. A connection is parameterized by a weight. A weight is specified by three parameters:
1 the node that the weight connects from;
2 the node that the weight connects to;
3 a number (usually real) that denotes the weight value.
A negative weight value will inhibit the activity of the connected-to node whereas a positive weight will serve to excite the connected-to node. The absolute weight value specifies the strength of the connection.
The pattern of connectivity is conveniently described in a matrix, W, where the entry wij represents the weight value from node i to node j (note that, in many texts, the matrix is written so that connections go from node j to node i: the important thing is to be consistent when performing matrix and vector operations). More than one weight matrix may be used to describe the pattern of connectivity where nodes are grouped into layers. Figures 15.2 and 15.3 give examples of the connectivity pattern written as matrices.
The weight matrix (or matrices) is the networkʼs memory, which holds the knowledge of how to perform a task.
Figure 15.2 The matrix describes the networkʼs connections. For the four-node network shown in the figure, the pattern of connectivity is:

$$W = \begin{bmatrix} 0.0 & -2.6 & 4.3 & 0.0 \\ 1.2 & 0.0 & 0.0 & 0.5 \\ -0.8 & 0.0 & -0.4 & 0.9 \\ 0.7 & -1.0 & 2.3 & 0.0 \end{bmatrix}$$

For example, the weight that connects node 3 (row 3) with node 1 (column 1) is denoted by the value –0.8 in the weight matrix above.
Figure 15.3 Matrices describe the networkʼs connections. For a network whose nodes are grouped into layers (layer-1, layer-2 and layer-3), a separate weight matrix is used for each pair of adjacent layers: W1 holds the weights between layer-1 and layer-2, and W2 the weights between layer-2 and layer-3.
15.1.3 A rule for controlling the propagation of signals through the network
Conventional computer programs impose conditions on when certain processes can start and finish. The same is true for neural networks. For a particular network type, some rule will exist to control when nodes can be updated (i.e. combine input signals and calculate an output signal) and when a signal can be sent on to other nodes. With some network types, a node is selected at random for updating, whereas other types insist that one group of nodes must update before another group can be updated.
15.1.4 A rule for combining input signals
It is typical for a nodeʼs incoming signals to be combined by summing their weighted values. This summation method is illustrated in Figure 15.4, in which netj is the resultant combined input to the node j, xi is the output from node i and n is the number of incoming connections. Other methods for combining input signals exist.
15.1.5 A rule for calculating an output signal
Nodes have a rule for calculating an output value that will be transmitted to other nodes or for presenting output results. This rule is known as an activation function, and the output value is referred to as the activation for the node. The activation may be real-valued, sometimes restricted to the interval [0, 1], or discrete such as {0, 1} or {+1, –1}. The value passed to the activation function is the net combined input to a node. Some activation functions are given next.
Figure 15.4 The typical way of summing the signals impinging on a node. A node j receives the signals 0.7, 0.1 and 0.3 along weights of –0.3, 3.1 and 0.5:

$$net_j = \sum_{i=1}^{n} x_i w_{ij}$$

$$net_j = (0.7 \times -0.3) + (0.1 \times 3.1) + (0.3 \times 0.5) = 0.25$$

or, alternatively, in vector notation:

$$[0.7, 0.1, 0.3] \begin{bmatrix} -0.3 \\ 3.1 \\ 0.5 \end{bmatrix} = 0.25$$
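The weighted-sum rule is easy to check mechanically. Below is a minimal sketch (Python is used here purely for illustration; it is not code from the text) that reproduces the 0.25 result of Figure 15.4.

```python
# Minimal sketch of the summation rule: net_j = sum_i x_i * w_ij.
# The numbers reproduce the worked example of Figure 15.4.

def net_input(inputs, weights):
    """Combine incoming signals by summing their weighted values."""
    return sum(x * w for x, w in zip(inputs, weights))

x = [0.7, 0.1, 0.3]     # signals arriving from the sending nodes
w = [-0.3, 3.1, 0.5]    # weights on the incoming connections

print(net_input(x, w))  # 0.25 (to within floating-point rounding)
```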
Identity function
The activation function for input nodes is the identity function, which simply means that the activation (the signal sent on to other nodes) is the same as the net input (Figure 15.5).
Binary threshold function
Most network types rely on a non-linear activation function. A binary threshold function will limit the activation to 1 or 0 depending on the net input relative to some threshold θ (Figure 15.6). Usually, it is more convenient to add a bias term to the net input and change the threshold to its mathematical equivalent form shown in Figure 15.7. The bias w0 is the negative of the threshold and in this case the net input is calculated as:
$$net_j = w_0 + \sum_{i=1}^{n} x_i w_{ij} \quad (15.1)$$
The bias is normally thought of as a weight coming from a node that always has an activation of 1, as shown in Figure 15.8.
The net input can be expressed as:
$$net_j = \sum_{i=0}^{n} x_i w_{ij} \quad (15.2)$$

where x0 always has a value of 1.
Figure 15.5 The activation is the same as the net input, f(net) = net. Note that f(net) refers to the activation.
Figure 15.6 Binary threshold function:

$$f(net) = \begin{cases} 1 & \text{if } net \geq \theta \\ 0 & \text{if } net < \theta \end{cases}$$

Figure 15.7 Binary threshold function with the bias term added in:

$$f(net) = \begin{cases} 1 & \text{if } net \geq 0 \\ 0 & \text{if } net < 0 \end{cases}$$
Figure 15.8 For convenience of implementation, the bias term is often thought of as being a weight w0j that is connected to a node in the previous layer with an activation permanently set to 1; the remaining inputs x1, x2, x3 arrive along weights w1j, w2j, w3j.
Sigmoid function
The sigmoid function is one of the most commonly used activation functions; its output falls in a continuous range from 0 to 1. An example is the logistic function shown in Figure 15.9. The slope and output range of a sigmoid can vary: the bipolar sigmoid, for example, has an output in the range –1 to 1.
$$f(net) = \frac{1}{1 + \exp(-net)}$$

Figure 15.9 The sigmoid (logistic) function.
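The activation functions discussed so far can be sketched in a few lines; the function names below are illustrative choices, not taken from the text, and the threshold value θ is passed in as a parameter.

```python
import math

def identity(net):
    # Input nodes: the activation is the same as the net input (Figure 15.5).
    return net

def binary_threshold(net, theta=0.0):
    # Binary threshold function: 1 once the net input reaches the threshold.
    return 1 if net >= theta else 0

def logistic(net):
    # Sigmoid (logistic) function: output in the continuous range (0, 1).
    return 1.0 / (1.0 + math.exp(-net))

print(binary_threshold(0.25))    # 1
print(round(logistic(0.25), 3))  # 0.562
```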
Example 1 • • • • •
The following example brings together some of the points discussed so far. A network is used to express the XOR relationship. The XOR relationship maps two binary inputs to 0 or 1; the definition is given in Table 15.1.
The network model shown in Figure 15.10 is a layered feed-forward network with two input nodes, two hidden nodes and one output node. The hidden nodes are so called because they do not take direct input from the environment or send information directly out to the environment. In this example, we can think of the environment simply as ourselves feeding values to the input nodes and monitoring the result via the output nodes. The nodes are arranged in layers: the input layer containing the input nodes, the hidden layer containing the hidden nodes and the output layer containing the output nodes. The number of nodes in each layer depends on the problem being solved and will be discussed later in this chapter. For the moment, we simply observe that the number of input nodes matches the number of attribute values describing the training examples and the number of output nodes matches the number of attributes describing the output vector associated with each training example. For this example the net input is calculated according to:
$$net_j = \sum_{i=0}^{n} x_i w_{ij}$$

and the output will be calculated using the threshold function:

$$f(net) = \begin{cases} 1 & \text{if } net \geq 0 \\ 0 & \text{if } net < 0 \end{cases}$$

Table 15.1 XOR relationship

X1  X2    Output
1   1     0
0   0     0
1   0     1
0   1     1

Figure 15.10 Network used to model the XOR relation: the first hidden node has a bias of 1.5 and the second a bias of 0.5, both with weights of –1 from each input node; the output node has a bias of –0.5 and weights of 1 and –1 from the first and second hidden nodes respectively.
The activation is the same as the net input for nodes in the input layer. Signals propagate through the network from the input layer to the output layer.
The first input will be the first sample in Table 15.1, which is < 1, 1 >. For the first hidden node with a bias of 1.5:
net = (x0 × 1.5) + (x1 × –1) + (x2 × –1)

net = (1 × 1.5) + (1 × –1) + (1 × –1) = –0.5

so the output = 0.

For the second hidden node with a bias of 0.5:

net = (x0 × 0.5) + (x1 × –1) + (x2 × –1)

net = (1 × 0.5) + (1 × –1) + (1 × –1) = –1.5

so the output = 0.

For the output node with a bias of –0.5, which receives the two hidden-node outputs (both 0) along weights of 1 and –1:

net = (1 × –0.5) + (0 × 1) + (0 × –1) = –0.5

so the output is 0.
If the procedure is followed for the other three samples, the output of the network will match the
output column in Table 15.1.
•••••
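The XOR network of Example 1 can be verified mechanically. The sketch below (an illustrative Python rendering, not code from the text) runs the forward pass for all four samples using the biases and weights quoted above and the threshold function f(net) = 1 if net ≥ 0, 0 otherwise.

```python
def threshold(net):
    return 1 if net >= 0 else 0

def xor_net(x1, x2):
    # Hidden node 1: bias 1.5, weights -1 and -1 from the inputs.
    h1 = threshold(1.5 + (-1) * x1 + (-1) * x2)
    # Hidden node 2: bias 0.5, weights -1 and -1 from the inputs.
    h2 = threshold(0.5 + (-1) * x1 + (-1) * x2)
    # Output node: bias -0.5, weight 1 from h1 and -1 from h2.
    return threshold(-0.5 + 1 * h1 + (-1) * h2)

for x1, x2 in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))  # 0, 0, 1, 1, matching Table 15.1
```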
15.1.6 A learning rule to adapt the weights
The network illustrated in Example 1 implements the XOR function. The correct operation of this XOR network is dependent on the arrangement of nodes, the choice of activation function and the weights. The arrangement of nodes is usually fixed at the start of learning, and so is the choice of activation function. The task during learning then is to adapt the weights so as to produce the desired response.
It is usual for the weights to be set to small random values at the start of training and so, when an input sample is presented for the first time, it is unlikely that the network will produce the correct output. This discrepancy between what the network actually outputs and what it is required to output constitutes an error, and this error can be used to adapt the weights. An example of an error-correcting rule is the delta rule or Widrow–Hoff rule. For an output node that has a single input weight, an activation (i.e. output) of y and a target output t, the error, δ, is given by:

$$\delta = t - y \quad (15.3)$$

The signal into the output node is x. The delta rule states that the adjustment to be made, ∆w, is:

$$\Delta w = \eta \delta x \quad (15.4)$$

where η is a real number known as the learning rate. The new weight is then the adjustment added to the old weight:

$$w_{new} = w + \Delta w \quad (15.5)$$
At the start of training, the weights are set to small random values, for instance in the range [–0.7, +0.7]. The backpropagation algorithm is a generalized version of the delta rule. During training, each example in turn is presented to the network and the weights are continually adapted until, for any input example, the error drops to an acceptably low value. Upon completion of training, the network is tested on data that were not seen during training to assess generalization.
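The delta rule of eqns (15.3)–(15.5) can be illustrated for the simplest possible case, a single linear node with one input; the sample values and learning rate below are assumptions made for the sketch.

```python
def delta_rule_update(w, x, target, eta=0.1):
    y = w * x                 # linear node: output equals the weighted input
    delta = target - y        # eqn (15.3): error
    dw = eta * delta * x      # eqn (15.4): adjustment
    return w + dw             # eqn (15.5): new weight

w = 0.2                       # a small starting weight
for _ in range(20):           # repeated presentations of the same sample
    w = delta_rule_update(w, x=1.0, target=0.5)
print(round(w, 3))            # moves towards 0.5, the weight giving zero error
```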
15.2 Fundamental concepts
This section will consider basic concepts before moving on to look at the backpropagation model.
15.2.1 Decision function
To introduce the concept of how a neural network operates, a trivial task is presented and the most basic type of network is used to solve that task.
Figure 15.11 illustrates a classification task. The task is to provide rules to classify an instance as either a member of class A or class B. The rules could be given in symbol form:
IF x2 > 0.80 AND x1 < 0.55 THEN class A
IF x2 < 0.90 AND x1 > 0.25 THEN class B
The above rules1 use discrete limit values that effectively partition the space into rectangular regions. An alternative approach to using rules is to derive a classification function by defining a
1 Note that x1 is used in place of x and x2 in place of y. Subscripting on x is used when problems have a high number of dimensions; for example, x3 is used in place of z.

Figure 15.11 The separation of two classes (class A and class B) using idealized data.
line that separates the two classes. The class for a new instance can be judged by plotting its location in the x, y plane and observing on which side of the line the instance is. In the present scenario it is possible to visualize the data as a two-dimensional picture. However, when handling hundreds of samples with a large number of attributes (i.e. a high-dimensional problem) it is no longer possible to visualize the task (not using standard charts).
The solution then is to use a decision function. The equation of the line that separates the two classes is:
$$x_2 = 1.5x_1 + 0.5$$

This equation can be used to create a decision function:

$$f(x_1, x_2) = -x_2 + 1.5x_1 + 0.5$$

$$d = \begin{cases} \text{class B} & \text{if } f(x_1, x_2) \geq 0 \\ \text{class A} & \text{if } f(x_1, x_2) < 0 \end{cases}$$

For example, a sample from class B, < 0.4, 0.5 >, will give:

$$f(0.4, 0.5) = -0.5 + (1.5 \times 0.4) + 0.5 = 0.6$$
The decision function will classify the sample correctly as a member of class B.
The above decision function could be modelled using a neural network, as shown in Figure 15.12.
The node in Figure 15.12 calculates an output value according to the threshold criteria:
$$output = \begin{cases} 1 & \text{if total input} \geq 0 \\ 0 & \text{if total input} < 0 \end{cases}$$
An output value of 1 would indicate a member of class B and an output of 0 would indicate a member of class A.
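A sketch of this decision node, assuming the weights of Figure 15.12 (bias 0.5, and weights 1.5 and –1.0 on x1 and x2):

```python
def classify(x1, x2):
    # Weighted sum followed by the threshold criterion given above.
    f = 0.5 + 1.5 * x1 - 1.0 * x2
    return "class B" if f >= 0 else "class A"

print(classify(0.4, 0.5))  # class B, as in the worked example (f = 0.6)
print(classify(0.2, 1.0))  # class A (f = -0.2)
```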
Example 2 • • • • •
Find the weights for a neural model similar to Figure 15.12 that represents the equation:
2x2 = –4x1 + 8
Figure 15.12 A simple neural network that models a line: inputs x1 and x2 feed a single node along weights of 1.5 and –1.0, and a bias weight of 0.5 comes from a node whose activation is fixed at 1.
For any point that falls on the line, the weights will be defined such that:

$$w_0 + x_1 w_1 + x_2 w_2 = 0$$

Rearranging gives:

$$x_2 = -x_1\frac{w_1}{w_2} - \frac{w_0}{w_2}$$

Equating terms gives:

$$-\frac{w_1}{w_2} = -\frac{4}{2}, \qquad -\frac{w_0}{w_2} = \frac{8}{2}$$

So, w0 = –8, w1 = 4 and w2 = 2.

•••••
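The derived weights can be sanity-checked: every point on the line 2x2 = –4x1 + 8 should satisfy w0 + x1w1 + x2w2 = 0. A quick illustrative check:

```python
w0, w1, w2 = -8, 4, 2               # weights derived in Example 2

for x1 in [0.0, 1.0, 2.0]:
    x2 = (-4 * x1 + 8) / 2          # a point on the line 2*x2 = -4*x1 + 8
    print(w0 + w1 * x1 + w2 * x2)   # 0.0 for every point on the line
```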
15.2.2 Adapting the weights
It should be obvious from Figure 15.11 that many straight lines can be drawn as the separating boundary and therefore many sets of weights exist to provide a solution. If tj denotes the target or desired output from node j and oj the actual output, then the error Ep for an example p can be defined as:
$$E_p = \frac{1}{2}\sum_j (t_j - o_j)^2 \quad (15.6)$$

and the overall error E = ΣEp. The factor 1/2 is included for convenience later on.
The activation for any node is dependent on the net input to that node and therefore is dependent on the weights impinging on that node. Imagine a node as in Figure 15.12 but with no bias. Such a node can model any straight line that passes through the origin. For a linear activation function and
a single input vector, eqn (15.6) can be written as:
$$E = \frac{1}{2}(t - net)^2 \quad (15.7)$$

because for a linear activation function the output is the same as the input. Expanding gives:

$$E = \frac{1}{2}[t^2 - 2t\,net + net^2] = \frac{1}{2}[t^2 - 2t(x_1 w_1 + x_2 w_2) + x_1^2 w_1^2 + 2x_1 w_1 x_2 w_2 + x_2^2 w_2^2] \quad (15.8)$$

with net = x1w1 + x2w2. Differentiating eqn (15.8) with respect to w1:

$$\frac{\partial E}{\partial w_1} = (-t + x_1 w_1 + x_2 w_2)\,x_1 \quad (15.9)$$
Equation (15.8) shows that if the squared error is plotted against w1 the shape is parabolic, as shown in Figure 15.13, and if eqn (15.9) is set to 0 and solved the minimum point on the curve can be found.
Training starts with random weights and the initial state can therefore be anywhere on the error surface but is unlikely to be at its minimum. During training, the network should adapt its weights so that the overall error is reduced. The weights should therefore be adjusted in the direction of steepest descent.
15.2.3 Minimizing the squared error
A science experiment that many children will have performed at school is to plot the deflection of a beam against different loads and then to fit a best straight line by minimizing the sum of the squared differences between each point and the line. The same principle can be used to adapt weights and one such rule for adapting weights is the delta rule introduced earlier. This rule is:
$$\Delta w_{ij} = \eta \delta_j x_i, \qquad \delta_j = (t_j - o_j) \quad (15.10)$$
where tj is the target value at node j, oj is the actual output, xi is the signal coming from node i, η is the learning rate (by how much to adapt the weight) and ∆wij is the amount by which to change the weight connecting node i with j.
The rule is simple to derive for a linear node with the output defined as:
$$o_j = \sum_i x_i w_{ij}$$
The chain rule can be used to express the derivative of the error surface with respect to a weight as a product that reflects how the error changes with a nodeʼs output and how the output changes with an impinging weight:
Figure 15.13 The squared error E plotted against a weight w; the line represents the derivative of the error with respect to the weight at time n (i.e. at the current weight value wn).
Figure 15.14 The error can be plotted for different combinations of weights (w1, w2); the weight vector that gives the minimum error lies at the lowest point of the error surface.
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial w_{ij}} \quad (15.11)$$

$$\frac{\partial E}{\partial o_j} = -\delta_j, \qquad \frac{\partial o_j}{\partial w_{ij}} = x_i$$

and substituting back into eqn (15.11):

$$-\frac{\partial E}{\partial w_{ij}} = \delta_j x_i \quad (15.12)$$
Taking into account the fact that the weights need to change in a direction that is opposite to the direction of the gradient vector and that a learning rate is factored in, eqn (15.10) is derived.
A modification to this method of adapting weights will be given in section 15.4 to train networks with multiple layers of nodes, but before that we shall take a look at what advantage such a network offers over one with just a single layer of weights (i.e. a layer of input nodes and a layer of output nodes).
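Gradient descent on the squared error can be sketched for a single linear node with two weights and no bias: each weight is repeatedly moved against the gradient of eqn (15.9). The training samples below are assumed purely for illustration (they are consistent with the target function t = x1 + x2).

```python
# Steepest descent for a linear node: E = 0.5 * (t - net)^2, net = x1*w1 + x2*w2.
# The derivative of E with respect to each weight is (net - t) * x_i.

samples = [((1.0, 2.0), 3.0), ((2.0, 1.0), 3.0), ((1.0, -1.0), 0.0)]
w1, w2, eta = 0.1, -0.2, 0.05

for epoch in range(200):
    for (x1, x2), t in samples:
        net = x1 * w1 + x2 * w2
        w1 -= eta * (net - t) * x1      # move opposite to the gradient
        w2 -= eta * (net - t) * x2

print(round(w1, 2), round(w2, 2))       # close to 1.0 and 1.0, fitting all samples
```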
15.3 Linear and non-linear problems
The network in Figure 15.12 has two input nodes for two attributes. The number of attributes denotes the dimension of the space from which all input samples are drawn: so a simple network model consisting of three inputs and a single output node will model a plane, and for n inputs the model will be an n-dimensional plane. For a classification task, if a line (for two dimensions) or a plane (for higher dimensions) can separate all samples into their correct classes, then the problem is linear. If the problem requires multiple lines or planes to separate the samples then the problem is non-linear. A famous non-linear example is the XOR problem introduced earlier.
The XOR problem, then, is non-linear, and there are two options for solving it with a neural network: either use a network that will model two or more lines to separate the data or change the input vectors. The latter option can make the problem linear by augmenting the two input attributes with a third to make the samples three-dimensional (the two classes will then reside on two different ends of a cube). For most problems we prefer the network to sort out the non-linearity; after all,
a main attraction of neural networks is their ability to solve problems for which a solution is not immediately obvious. Therefore, we shall concentrate on a solution that uses two lines to separate the data. The network then will require two hidden nodes, both connected to two input nodes to represent the two lines, and a third node to combine the outputs from the two hidden (middle) nodes. The XOR task with separating boundaries is illustrated in Figure 15.15. Figure 15.16 shows a neural network architecture that will model the two separating boundaries.
Figure 15.15 The XOR problem solved with two separating lines: Line 1 and Line 2 separate the points (0, 0), (0, 1), (1, 0) and (1, 1) into their correct classes.
Figure 15.16 Neural net for modelling the XOR problem, arranged as an input layer, a hidden layer and an output layer. The two hidden nodes have biases of 1.5 and 0.5, and each receives a weight of –1 from each input node. The second layer of weights is shown later.
Table 15.2 Response of hidden nodes for the network shown in Figure 15.16

                Net input to hidden layer      Output from hidden layer
x1   x2         Node 1       Node 2            Node 1       Node 2
1    1          –0.5         –1.5              0            0
0    0           1.5          0.5              1            1
1    0           0.5         –0.5              1            0
0    1           0.5         –0.5              1            0
What does the first layer of weights do to each input vector? Nodes on the input layer will be indexed by i, nodes on the hidden layer by j and the output nodes by k. The network in Figure 15.16 has the nodes split into the three layers. Table 15.2 shows the net input and output (using a threshold function) of the nodes in the hidden layer in response to the inputs for the XOR problem.
Each sample is transformed by the action of the first layer of weights and the hidden nodes. The second layer of weights connecting the hidden layer to the output layer will also model a line as there will be two inputs and a bias passing values to the single output node. If the output node is to respond with the correct class for each input sample, then the input to the second layer of weights must be linearly separable. This can be checked by plotting the response of the hidden nodes as shown in Figure 15.17. Figure 15.18 shows the weights that define the separating line given in Figure 15.17.
A network with only linear activation functions is restricted to modelling linear problems. The associative law of matrix multiplication tells us that for linear activation functions a single layer of
Figure 15.17 The transformed samples plotted in the space of hidden-node outputs: the effect of the first layer of weights has been to move the original point (0, 1) to (1, 0), and the two classes can now be separated by a single line.
Figure 15.18 The complete network to solve the XOR problem: the hidden nodes have biases of 1.5 and 0.5 with weights of –1 from each input node, and the output node has a bias of –0.5 with weights of 1 and –1 from the two hidden nodes.
weights can be found that will achieve the same result as a network that has more than one layer of weights. In other words, a multilayered network with linear activation functions will only be able to solve a problem that a single layered network is capable of solving. Therefore, non-linear problems are ruled out. For multilayered networks then, a non-linear activation function is required, and for the backpropagation algorithm the function should be continuous, differentiable and monotonically increasing. A function satisfying these properties is the sigmoid function.
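The point about linear activation functions can be demonstrated directly: applying two weight matrices in succession is the same as applying their product, so a multilayered linear network collapses to a single layer. The matrices below are arbitrary illustrative values.

```python
# With linear activations, W2 @ (W1 @ x) = (W2 @ W1) @ x by associativity,
# so two layers of weights are equivalent to one.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, -2.0], [0.5, 3.0]]      # first layer of weights (assumed values)
W2 = [[2.0, 1.0]]                   # second layer of weights (assumed values)
x = [0.3, -0.7]

print(matvec(W2, matvec(W1, x)))    # two linear layers: [1.45]
print(matvec(matmul(W2, W1), x))    # one collapsed layer: [1.45]
```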
15.4 Backpropagation learning
For many years there was no rule available for updating the weights of a multilayered network undergoing supervised training. In the 1970s, Werbos developed a technique for adapting the weights, but it was the publication by Rumelhart, Hinton and Williams (1986a) that gave a new lease of life to neural networks. The weight adaptation rule is known as backpropagation. For what follows, a fully connected feed-forward network is assumed, which means activation travels in a direction from the input layer to the output layer, and the nodes in one layer are connected to every other node in the next layer up.
The backpropagation algorithm defines two sweeps of the network: first a forward sweep from the input layer to the output layer and then a backward sweep from the output layer to the input layer. The backward sweep is similar to the forward sweep except that error values are propagated back through the network to determine how the weights are to be changed during training. This double sweep is illustrated in Figure 15.19.
During training, each input sample will have an associated target vector. The objective of training is to find a set of network weights that provide a solution to the particular problem at hand. Before training commences, the weights are set to small random values, for example between –0.7 and 0.7. For reasons discussed earlier, a non-linear activation function, the sigmoid, will be used. The logistic sigmoid can only produce values between 0 and 1, and, because the function can never actually attain an exact 0 or 1, sometimes 0.1 and 0.9 are used instead of 0 and 1. The network is usually considered to have learned a task once all outputs fall within a specified tolerance of their target values. For example, if the target output is 1.0 and the tolerance is 0.1 then any actual output
Figure 15.19 The forward and backward sweeps: the shaded hidden node sends activation to each output node, and so during the backward sweep this hidden node will receive error signals from those output nodes.
between 0.9 and 1.0 will be within the specified tolerance. The procedure for backpropagation training is as follows:

while not STOP
    STOP = TRUE
    for each input vector
        perform a forward sweep to find the actual output
        obtain an error vector by comparing the actual and target outputs
        if the actual output is not within tolerance
            set STOP = FALSE
        end if
        perform a backward sweep of the error vector
        use the backward sweep to determine weight changes
        update weights
    end for
end while

In summary, a sample is presented to the network and an error vector is calculated to determine how the weights should change; the process is then repeated for each other sample. An epoch is a complete cycle through all samples (i.e. each sample has been presented to the network). The samples are continually presented to the network, epoch after epoch, until during one epoch all actual outputs for each sample are within tolerance.

15.4.1 Some theory

Backpropagation uses a generalization of the delta rule. For a more in-depth presentation than that given here, the interested reader is referred to Rumelhart et al. (1986a) or Werbos (1990).

The error derivative can be expressed as:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial net_j}\frac{\partial net_j}{\partial w_{ij}} \quad (15.13)$$

δj is defined as:

$$\delta_j = -\frac{\partial E}{\partial net_j} \quad (15.14)$$
The original delta rule in section 15.2.3 had the definition:

$$\delta_j = -\frac{\partial E}{\partial o_j}$$

This definition is consistent with eqn (15.14) as the original delta rule is for linear nodes in which the output is the same as the input. The definition for δj can be expressed as:

$$\delta_j = -\frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial net_j} \quad (15.15)$$

As:

$$E_p = \frac{1}{2}\sum_j (t_j - o_j)^2$$

we have:

$$\frac{\partial E}{\partial o_j} = -(t_j - o_j) \quad (15.16)$$

For an activation function f (typically the logistic function) the output is:

$$o_j = f(net_j) \quad (15.17)$$

and so the derivative is given by:

$$\frac{\partial o_j}{\partial net_j} = f'(net_j) \quad (15.18)$$

So:

$$\delta_j = (t_j - o_j)f'(net_j) \quad (15.19)$$

The standard summation of products is used to find the net total input:

$$net_j = \sum_{i=0}^{n} x_i w_{ij} \quad (15.20)$$

and so:

$$\frac{\partial net_j}{\partial w_{ij}} = x_i \quad (15.21)$$

Taking the product of each derivative and substituting back into eqn (15.13) gives:

$$-\frac{\partial E}{\partial w_{ij}} = (t_j - o_j)f'(net_j)\,x_i$$

Noting that the weight change should be in the direction opposite to the derivative of the error surface, the weight change for a node is:

$$\Delta w_{ij} = \eta \delta_j x_i$$
The error δj given above is applicable to an output node, but the error for a hidden node is not directly related to the target output. However, a hidden node can be adapted in proportion to its assumed contribution to the error in the next layer up (i.e. output layer for a network with a single hidden layer). For a network with a single hidden layer, the error from each output node will make a contribution to the error of each hidden node by propagating error signals back through the network. The contribution to a hidden node will depend on the size of the error for an output node and the strength of the weight that connects both nodes. In other words, an output node with a high error
makes a strong contribution to the error of any hidden node that is connected by a weight with a high strength. For a hidden node the error is given by:

$$\delta_j = f'(net_j)\sum_k \delta_k w_{kj} \quad (15.22)$$

where k indexes the layer sending back the error (the output layer in a network with a single hidden layer). A suitable activation function is the logistic function:

$$f(net_j) = \frac{1}{1 + \exp(-net_j)}$$

The derivative of this activation function is:

$$f'(net_j) = \frac{\exp(-net_j)}{(1 + \exp(-net_j))^2} = \frac{1}{1 + \exp(-net_j)}\left[1 - \frac{1}{1 + \exp(-net_j)}\right] = f(net_j)[1 - f(net_j)] \quad (15.23)$$
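The identity f′(net) = f(net)[1 – f(net)] in eqn (15.23) is easy to confirm numerically; the short sketch below compares it against a central finite-difference estimate at an arbitrary point.

```python
import math

def f(net):
    return 1.0 / (1.0 + math.exp(-net))

net, h = 0.7, 1e-6
analytic = f(net) * (1.0 - f(net))             # eqn (15.23)
numeric = (f(net + h) - f(net - h)) / (2 * h)  # finite-difference estimate

print(round(analytic, 4), round(numeric, 4))   # both approximately 0.2217
```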
15.4.2 The backpropagation algorithm
The first stage is to initialize the weights to small random values. Training continues until the change in the absolute value of the averaged squared error falls within some tolerance between one epoch and the next epoch. For example, a tolerance of 0.01 means that the averaged squared error must not change by more than ± 0.01 between successive epochs. If a network meets the tolerance during training it is said to have converged. An alternative way to judge the end of training is to insist that each target output for each training sample be within some specified tolerance.
To reduce the likelihood of the weight changes oscillating, a momentum term, α, is introduced that adds in a proportion of the previous weight change. So, the weight change for sample m + 1 is dependent on the weight change for sample m:
$$\Delta w_{ij}(m+1) = \eta\,\delta_j o_i + \alpha\,\Delta w_{ij}(m) \quad (15.24)$$

The backpropagation algorithm is summarized in Figure 15.20.
Example 3 • • • • •
Figure 15.21 illustrates a single forward and backward sweep through a 2–2–1 network with input
< 0.1, 0.9 >. The learning rate is 0.8 and the momentum 0. The weights are held in two matrices, one for each layer of weights, with the last row of each matrix holding the bias weights; the output node receives weights of 3 and 1 from the two hidden nodes and has a bias weight of –2.
•••••
Figure 15.20 The backpropagation algorithm. The index k refers to a previous layer in the reverse sense (i.e. when moving back through the net).

step 1. Read the first input sample and associated output sample. CONVERGE ← TRUE.

step 2. For the input layer: assign as net input to each node its corresponding element in the input vector. The output for each node is its net input.

step 3. For the first hidden layer nodes: calculate the net input and output:
    net_j ← w_0 + Σ_{i=1..n} x_i w_{ij},    o_j ← 1 / (1 + exp(–net_j))
    Repeat step 3 for all subsequent hidden layers.

step 4. For the output layer nodes: calculate the net input and output:
    net_j ← w_0 + Σ_{i=1..n} x_i w_{ij},    o_j ← 1 / (1 + exp(–net_j))

step 5. Is the difference between the target and output sample within tolerance? If no, CONVERGE ← FALSE.

step 6. For each output node calculate its error:
    δ_j ← (t_j – o_j) o_j (1 – o_j)

step 7. For the last hidden layer calculate the error for each node:
    δ_j ← o_j (1 – o_j) Σ_k δ_k w_kj
    Repeat step 7 for all other hidden layers.

step 8. For all layers update the weights for each node:
    Δw_ij(m + 1) ← η(δ_j o_i) + α Δw_ij(m)

If this was not the last sample, read the next input sample and associated output sample and return to step 2; if it was the last sample and CONVERGE = TRUE then stop, otherwise begin the next epoch from step 1.
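To make the steps of Figure 15.20 concrete, here is a compact sketch of backpropagation with pattern updating and momentum for a 2–2–1 network learning XOR. It follows the steps above but is an illustrative implementation rather than code from the text; the learning rate, momentum, initial weight range, random seed and use of 0.1/0.9 targets are all assumptions.

```python
import math, random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

random.seed(1)
# Each node's weights are stored as [w_from_input_1, w_from_input_2, bias].
hidden = [[random.uniform(-0.7, 0.7) for _ in range(3)] for _ in range(2)]
output = [random.uniform(-0.7, 0.7) for _ in range(3)]
eta, alpha = 0.5, 0.9                       # assumed learning rate and momentum
prev_h = [[0.0] * 3 for _ in range(2)]      # previous weight changes (momentum terms)
prev_o = [0.0] * 3

samples = [((0, 0), 0.1), ((0, 1), 0.9), ((1, 0), 0.9), ((1, 1), 0.1)]

for epoch in range(5000):
    for (x1, x2), t in samples:
        x = [x1, x2, 1.0]                   # the bias input is fixed at 1
        # Forward sweep (steps 2-4).
        h = [logistic(sum(w * xi for w, xi in zip(hw, x))) for hw in hidden]
        o = logistic(sum(w * hi for w, hi in zip(output, h + [1.0])))
        # Backward sweep: output error then hidden errors (steps 6 and 7).
        d_o = (t - o) * o * (1 - o)
        d_h = [h[j] * (1 - h[j]) * d_o * output[j] for j in range(2)]
        # Weight updates with momentum (step 8).
        for j, hi in enumerate(h + [1.0]):
            prev_o[j] = eta * d_o * hi + alpha * prev_o[j]
            output[j] += prev_o[j]
        for j in range(2):
            for i, xi in enumerate(x):
                prev_h[j][i] = eta * d_h[j] * xi + alpha * prev_h[j][i]
                hidden[j][i] += prev_h[j][i]

for (x1, x2), t in samples:
    h = [logistic(sum(w * xi for w, xi in zip(hw, [x1, x2, 1.0]))) for hw in hidden]
    o = logistic(sum(w * hi for w, hi in zip(output, h + [1.0])))
    print((x1, x2), round(o, 2))   # should approach the 0.1/0.9 targets if training converges
```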
Figure 15.21 Illustrating a forward and backward sweep of the backpropagation algorithm for the 2–2–1 network with input < 0.1, 0.9 >. In the forward pass the two hidden nodes output 0.122 and 0.728; the output node, with weights of 3 and 1 from the hidden nodes and a bias of –2, therefore receives a net input of –0.906 and outputs 0.288. The target value is 0.9, so the error for the output node is (0.900 – 0.288) × (0.288) × (1 – 0.288) = 0.125. In the backward sweep this error is multiplied by the second-layer weights, giving error signals of 3 × 0.125 = 0.375 for the first hidden node and 1 × 0.125 = 0.125 for the second; the actual errors for the hidden nodes are then 0.122 × (1 – 0.122) × 0.375 ≈ 0.040 and 0.728 × (1 – 0.728) × 0.125 ≈ 0.025, and the weights are updated accordingly.
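The backward-sweep arithmetic of Figure 15.21 can be reproduced from the values the figure gives: hidden activations of 0.122 and 0.728, output-node weights of 3 and 1 with a bias of –2, and a target of 0.9. An illustrative check:

```python
import math

h1, h2 = 0.122, 0.728              # hidden-node activations from the forward pass
w1, w2, bias = 3.0, 1.0, -2.0      # hidden-to-output weights and the bias weight
target = 0.9

net = bias + w1 * h1 + w2 * h2                 # -0.906
out = 1.0 / (1.0 + math.exp(-net))             # about 0.288

d_out = (target - out) * out * (1 - out)       # about 0.125
d_h1 = h1 * (1 - h1) * d_out * w1              # about 0.040
d_h2 = h2 * (1 - h2) * d_out * w2              # about 0.025

print(round(net, 3), round(out, 3))
print(round(d_out, 3), round(d_h1, 3), round(d_h2, 3))
```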
possible to train a network on data with complete success only to find that it fails when processing new data. For any new application, there is no prescribed way of finding a solution. Sometimes the problem appears intractable, but failure to find a solution does not mean that the application is not amenable to a neural solution. Although applying neural networks involves a ‘try it and seeʼ approach, the requirement for knowledge of the application domain and knowledge of neural networks should not be underestimated.
Theory tells us that a network with a single hidden layer can represent any continuous mapping of the form:
y=f(x)
Since this form of mapping is the basis for many real-world problems, is there a need for more than one hidden layer? Well, a network with two hidden layers can sometimes be easier to train, and a variant of the feed-forward net known as an auto-associator may use multiple hidden layers to perform data compression.
The learning algorithm as it has been presented uses pattern updating of the weights, which simply means that the weights are updated after each sample presentation. It can sometimes prove quicker to train a network using batch updating, in which the error for a node is accumulated for one epoch before adapting the nodeʼs weights. However, it can be beneficial to randomize the order of presentation for each training sample between epochs, in which case pattern updating is required. Usually, it is best to experiment with different approaches.
Generalization
If a network can produce the correct output for the majority of input samples in the test data set, the network is said to generalize well. It is assumed that the test data set was not used during training.
If a network is trained well to produce a smooth non-linear mapping, it should be able to interpolate to new samples that are similar but not exactly the same as those samples used for training. A non-smooth mapping results when the network is overtrained. In this situation, the network will work more like a memory, looking up an output for a certain input from the training set.

Good generalization is dependent on the training set and the network architecture. The training data set must be representative of the problem being tackled, but the number of hidden nodes is also an important factor. If there are more hidden nodes than are required to learn the input–output relationship, there will be more weights than necessary and overfitting of the data can result if training continues for too long. Sometimes a subset of the training data is used to act as an interim test for detecting overtraining.
The application of neural networks is an experimental approach to engineering. General guidelines sometimes exist for network design. Haykin (1998) uses a result derived by Baum and Haussler (1989) to give a guideline as to the size of the training data set. The guideline states that:
$$N > \frac{W}{\varepsilon}$$
where N is the number of training examples, W is the number of network weights and ε is the percentage of allowed errors in testing. So, for a 10% error the number of training examples should be 10 times the number of weights.
15.5 An example of character classification
The task is to classify the digits 0 through to 9. There are 10 classes and so each target vector could be a 1-in-10 vector. For example, the target for the character 2 could be < 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 >, which indicates that the third output node should be on and all others off.
In the case study presented here, the target values are represented by 1-in-9 vectors for the numerals 1 to 9, with zero being represented by having all output nodes set to 0. The numerals to be learned are shown in Figure 15.22. Each numeral is represented by a 9 × 7 grid, with a grey pixel being 0 and a black pixel being 1.
A network architecture of 63–6–9 is chosen: (9 × 7) input nodes, one for each pixel, six hidden nodes and nine output nodes for the target vectors. The pixels are mapped into the input layer as shown in Figure 15.23.
The network was trained for 600 epochs using a learning rate of 0.3 and a momentum of 0.7. An output node was considered to be on if its activation was greater than 0.9 and off if its activation was less than 0.1.
The network trained successfully. It was then tested on the data in Figure 15.24. Each of the
Figure 15.22 Training data.
Figure 15.23 The grid is mapped to the input layer by treating it as a long string of bits that are set to 0 or 1. The bits are mapped by starting at the top left-hand corner, working down the grid for the first column and then repeating for each other column.
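The column-by-column mapping described in Figure 15.23 amounts to flattening the grid in column-major order; a small sketch under the assumption that the grid is stored as a list of rows:

```python
def grid_to_input_vector(grid):
    """Flatten a pixel grid column by column, starting at the top left-hand corner."""
    rows, cols = len(grid), len(grid[0])
    return [grid[r][c] for c in range(cols) for r in range(rows)]

# A toy 3 x 2 grid to show the ordering (the digits in the text use a 9 x 7 grid,
# giving 63 input values).
grid = [[1, 0],
        [0, 1],
        [1, 1]]
print(grid_to_input_vector(grid))  # [1, 0, 1, 0, 1, 1]: first column, then the second
```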
Figure 15.24 Noisy test data.
numbers has one or more bits missing. All the test numbers were correctly identified apart from 8: the sixth output node had an activation of 0.53 and the eighth node an activation of 0.41, so the network was torn between classifying this sample as the numeral 6 or the numeral 8 and was not confident in either.
Summary
This chapter has looked at supervised learning, in which the object of learning is to map an input vector to a target vector. For supervised learning it is necessary to have:
• data that have a known classification;
• sufficient data that represent all aspects of the problem being solved;
• sufficient data to allow testing (the test data set and the training data set should be disjoint, i.e. test data are not used during training).

The backpropagation algorithm is in wide use:
• It learns by adapting its weights using the generalized delta rule, which attempts to minimize the squared error between the desired network output and the actual network output.
• During learning, it continually cycles through the data until the error is at a low enough value for the task to be considered solved. Even when the squared error is low, it is important to check that individual training samples are all correctly classified.
• Following training, the networkʼs weights are fixed and the network can then be used to classify unseen data.
• The knowledge of the solved task resides in the networkʼs weights.
• A single hidden layer of nodes is theoretically sufficient for a task to be learned, but in
practice more than one hidden layer could lead to better performance.
• Backpropagation networks can take a long time to train.
• Generalization is a measure of how well the network performs on data not seen during
training. The number of hidden nodes and the length of training can have a significant effect on generalization.
Further reading
The original exposition of the backpropagation algorithm by Rumelhart et al. (1986a) is still worth studying. The later paper by Werbos (1990) also describes the theory behind backpropagation and a variation with recurrent connections. There are many variations of the backpropagation algorithm that speed up training and help the network avoid the problem of local minima, which can result in the network failing to learn. The keen reader is referred to Haykin (1998) and to Masters (1995), who provides practical insight into solving these issues using the C++ language. For a statistical treatment of neural networks, Bishop (1995) provides good coverage.
Exercises
1 Sketch a neural network along with weight values that will model the line:

$$y = -0.3x + 0.7$$

2 The figure below shows a backpropagation network that is currently processing the training vector [1.0 0.9 0.9]; the associated target vector is [0.1 0.9 0.1]. (The figure shows the networkʼs units, labelled a, b and c, together with its weight values.)

Given that the output from unit b is 0.6 and from c is 0.8, and assuming that the logistic function is the activation function:

a Calculate the actual output vector.
b Calculate the error for each output unit.
c Calculate the error for each hidden unit.
d Calculate the weight changes for the weights connecting from unit a. Use a learning rate of 0.25 and a momentum of zero.

3 Given that the points {(–1, 1), (–1, –1), (1, –1)} belong to class A and that {(–2, –2), (1, 1), (2, 2), (4, 1)} belong to class B:

a Show that the classes are not linearly separable.
b Assuming a network with units whose outputs are given by

$$output = \begin{cases} 1 & \text{if total input} \geq 0 \\ 0 & \text{if total input} < 0 \end{cases}$$

show that the first layer of weights W1 in a three-layer network will transform the problem into a linear one. The first row in W1 defines the bias weights.

$$W_1 = \begin{bmatrix} 1 & -6 \\ -2 & -2 \\ -1 & -3 \end{bmatrix}$$

c Derive a second layer of weights so that the network will correctly classify all the patterns. Assume a single output unit.

4 The points {(4, –1), (8, –2), (1, 1), (3, 6)} belong to class A, and the points {(–8, 4), (–2, –3), (–1, –1), (2, –9)} belong to class B. Derive a minimal network to classify these points correctly.

5 Derive a suitable feed-forward network that models the logical AND.

6 The bipolar sigmoid is another activation function that is commonly used with a backpropagation network. The range of the function is (–1, 1) and it is defined as:

$$f(x) = \frac{2}{1 + \exp(-x)} - 1$$

The derivative can be expressed as:

$$f'(x) = \frac{1}{2}[1 + f(x)][1 - f(x)]$$

Give the error-correcting rule for both an output and a hidden unit.
Symbols
∀ universal quantifier, 35 ∃ existential quantifier, 36 → implication, 21
∧ logical and, 20
∨ logical or, 20
↔ if and only if, 21
¬ logical negation, 21
← backward rule notation
and used as assignment
for algorithms ⊆ subset
∈ set membership
= test of equality used in
algorithms
f entailment, 33
g turnstile, 28
≡ logical equivalence, 24 != not equal to (used in
algorithms)
A* search, 57–60
abductive reasoning, 105, 167 Abelson, R., 16, 394 absorption, in logic, 26 absorption in a junction tree,
144
ACL see agent communication
language action decision, 174
action description, in planning, 190
action nodes, 194
activation function, 289 adaptive resonance theory, 329 add edge, in planning, 194 add-list, in planning, 190 adjective, 351
ADL, 224
admissible cost function, 60 adverb, 351
agenda, in chart parsing, 370 agent, 189
autonomy, 437
belief desire intention
architecture, 439–40 definition of, 436–40 flexibility, 437
linguistic role, 382 multiagent system, 436 reactive, 440 situatedness, 437
agent communication language, 438
Aho, A., 394 AI
applications, 452–73 connectionist, 8 definition of, 6 traditional, classical, 8
AIBO, 473
Allen, J., 394
allophone, 397
Alshawi, H., 394
ambiguity, of sentences, 386–7 amplitude, 397
anaphor phrase, 388
and, fuzzy variables, 158
and, logic connective see
conjunction Anderson, C., 210, 211, 224 annealing see simulated
annealing antecedent, in logic, 21
antecedent, of a rule, 11 antecedent phrase, 388 argument, logic, 25 arguments, of a predicate in
logic, 34
Index
arithmetic in Prolog, 480–2 arity, of a predicate, 34 arrow junction, 418
ART see adaptive resonance
theory
Ascent Technology, 458
ASIMO, 473
assert, statement in Prolog,
475
associative learning, 226
assumptions, in logic, 28 atom, 18, 72
atomic formula, 72
atomic proposition, 18 atomic sentence, in predicate
logic, 35 autonomy, 467
auxiliary verb, 351 Aydede, M., 451
background knowledge, 253–4 background theory, 254 backpropagation learning,
301–9 backtracking, 97–100
backward chaining, 100–1 backward search, 192 Baker, C., 394
Ballard, D., 432
Barrett, A., 224
Barto, A., 284
batch updating, neural
networks, 307 Baum, E., 307
Bayes’ classifier, 322–5 Bayes’ rule, 112–13 Bayesian network definition,
119–21
Bayesian networks, 104–52
cluster tree, 121, 133–44
Index 501
converging connection, 115–16
decision network, 174–86 definition of, 119–21 discrete state, 106 diverging connection, 116 d-separation, 115–17 handling evidence, 107,
145–9
inexact inferencing, 150–2 marginalization, 122 sepset, 121
serial connection, 117 stochastic sampling, 150–2 variable, 106
BDI see belief desire intention architecture
belief, in Dempster–Shafer theory, 164–5
belief desire intention architecture, 439–40
belief network see Bayesian networks
Bellman, R., 284
Bernard, D., 473
best-first search, 57
bigram model, 391
Bishop, C., 310, 325 blending method for genetic
operator, 343 blind search, 50–5
blind sight, 445
Blum, A., 194, 224 Blythe, J., 224
Boden, M., 451 bottom-up hypothesis
derivation, 256 bottom-up parser, 355–6
bound variable
in FOPC, 40–1 unification, 82
boundary extraction, 424–8 branching factor, 101 Bratko, I., 101, 491 breadth-first search, 50–2 Breiman, L., 251
Brooks, R., 16, 440, 473 Brown, C., 432
Buntine, W., 152
camera calibration, 413 Cameron-Jones, R., 267
candidate elimination algorithm, 234–7
Carpenter, G., 329
case frame, for representing
sentences, 382 Cassandra, A., 224
centre of curvature, 418 chain rule, 113
Chalmers, D., 445, 450, 451 Charniak, E., 394
chart parser, 367–70
Chinchor, N., 458
Chinese room, 442–4
chord, in a graph, 135 chromosome, 331
Clark, A., 16, 451
class, object-oriented, 13 classification learning, concept
of, 231–2
clause, in natural language,
349
Clocksin, W., 101, 491
closed-world assumption, 90, 195
of Prolog, 475
Clowes, M., 432
cluster tree, construction of,
133–44
cluster tree of a Bayesian
network, 121 clustering with neural
networks, 312–25 CNF see conjunctive normal
form Cohen, W., 251
Cole, R., 394
Collins, A., 16
common sense, 350, 467–9 commutativity, in logic, 26 competing needs, mutex
relations, 196 complementary pairs, in
resolution, 75 complete, in logic, 32, 33 complete rule in chart parser,
368 completeness
in search, 47 in ILP, 254
compound proposition, 19 concept learning, 233–7 conclusion, in logic, 18
conditional dependence, 111–13
conditional effects, in planning, 208–11
conflict resolution, in production system, 11
conjunction, in logic, 20 conjunctive normal form, 73–4,
77–9 connectionism, 9
connectionist, 8 connectionist networks, 286 connective, in natural
language, 351 connectives, in logic, 19–22 consciousness, 444–6 consequent
in logic, 21
of a rule, 11 consistency, in ILP, 254 consistent,
in logic, 33
in concept learning, 233 consistent junction tree, 141 consistent link, in junction tree,
133
constant, in predicate logic, 34 constraint satisfaction, 66–8 context analysis, in natural
language processing,
363–4, 387–90 context-dependent planning,
208
context knowledge in natural
language, 350 context-free grammar, 353 contradiction, 23 converged, backpropagation
training, 304 converging connection, in
a Bayesian network,
115–16 convolution, 416
Copeland, J., 16, 442, 444, 451 corpora, in text processing, 391 correspondence, 414
cost, in search, 49
cost function, 57–60
cost, of a sepset, 138
cover, in concept learning, 233 critics, in planning, 222 cross-validation, 466
502 Artificial Intelligence
Crystal, D., 394
cueing, a script, 389
Currie, K., 224, 473
curse of dimensionality, 229 curvature, 418
cut, in Prolog, 485–6 CVOnline, 432
Cyc, 468–9
Cytyc, 463
D–S theory see Dempster– Shafer theory
DAG see directed acyclic graph Darwiche, A., 152
data mining, 463–7 data-driven search, 100
Davis, R., 15, 16, 451
de Kleer, J., 170
De Morgan’s laws, equivalence
of quantifiers, 24, 38 De Raedt, L., 267
decision networks, 174–86 decision tree learning, 241–51 declarative language, 88 declarative mood, 361 deduction, in logic, 28–32 Deep Space 1, 452–4 Deerwester, S., 16 defuzzification, 162
DeJong, K., 338
delete edge, in planning, 194 delete list, in planning, 190 delta rule, 294
Dempster, A., 170, 324 Dempster–Shafer theory, 163–6 DENDRAL, 2–3
Dennett, D., 451 dependency-directed
backtracking, 168 depth-first search, 50, 52
derived class, object-oriented, 14
Derthick, M., 451
designation of symbols, 449 desJardins, M., 224 determiner, 351
deterministic outcome, 105 deterministic planner, 212, 214 dictionary, 349
directed acyclic graph, 93 Discern Communication, 461 discounting, 272
discourse anaphora, 388 discrete state, in Bayesian
networks, 106 disjunction, logic connective,
20
disjunctive normal form, 73 distributivity, in logic, 26 diverging connection, in a
Bayesian network, 116 DNF see disjunctive normal
form
domain, in predicate logic, 35 double implication see if and
only if Doyle, J., 170
Dreyfus, H., 451 d-separation, 115–17 Duce, D., 16
Duda, R., 322, 432
Earley, J., 394
edge, 418
edge detection, 418–21
effect, in planning, 190 eigenfaces, 462
eight-tile puzzle, 47–8 eliminating, a node in a graph,
135–6 ELIZA, 446
Elkan, C., 170 Elman, J., 16
EM see expectation
maximization entailment, in logic, 33
entropy, 244
epipole, 414
episode, 272 epistemology, 442 epoch, in reinforcement
learning, 272 epoch, in backpropagation learning, 302
equivalence, in logic, 24–5, 26 Erol, K., 222, 224
error, in learning, 228
Essa, I., 432
Euclidean squared distance, 312–3
evidence
entering in Bayesian
network, 145–9
in Bayesian networks, 107
exclusive or, logical, 20 existential quantifier, 36 expectation maximization,
324–5
expert systems, 2–4, 9
explicit representation, 231 extended rule, in chart parser,
368–9
extensional meaning, in FOPC,
43 Eysenck, M., 451
factored expansion, in planning, 210–11
fail, in Prolog, 486–7 false-positive diagnosis, 181 falsum, in logic, 75
FASTUS, 459–61
Fausett, L., 329
features, used in grammars,
370–7
feed-forward network, 286 Feigenbaum, E., 2
Fikes, R., 189, 206
fill-ins, in moral graph, 134 filter, in vision, 416
finding, in Bayesian networks,
107
first-order predicate calculus,
34 see also predicate
calculus Flach, P., 267
focal element, in Dempster– Shafer theory, 164
focal length, 411
Fodor, J., 450, 451
FOIL, 257–61
FOPC see first-order predicate
calculus fork junction, 418
FORMS, 432
formula, in logic, 32 Forsyth, D., 432
forward chaining, 100–1 Fourier transform, 398 frame, in speech processing,
398
frame representation, 12–5 frame of discernment, in
Dempster–Shafer theory,
163
frame problem, 203–6
Index 503
Fredman, M., 69
free variable, in FOPC, 40–1 frequency, 397
Friedman, N., 152
function optimization, in
search, 60–6 functions, in FOPC, 38–9 fundamental rule, 111–2
Furst, M., 194 fuzzification, 157
fuzzy inferencing, 157–63 fuzzy logic, 154–63
and connective, 158 fuzzification, 157 inferencing, 157–63 linguistic value, 154 membership function,
155–6
or connective, 159
fuzzy membership, 155 fuzzy membership function,
155–6 fuzzy set, 154
fuzzy value, 154 fuzzy variable, 154
Gabbay, D., 170
GABIL, 338
gain, as used in FOIL, 260 gain, information used in ID3,
244–5
Gaussian function, 228, 229 Gazdar, G., 394
Geman, D., 152
Geman, S., 152
gene, 331
general boundary set, 234 generalization
in learning, 248
in neural networks, 307 generalization operators, in
learning, 233
genetic algorithms, 330–44 genetic operator, 330, 332 genetic programming, 343 Gent, I., 69
Georgeff, M., 438, 439 Ghallab, M., 224
Gibbs sampling, 151–2
global minimum, in search, 64 goal, in search, 46
goal description, in planning, 190
goal state, in search, 49 goal-driven search, 100 Goldberg, D., 344 Golumbic, N., 152
gradient descent, 63 gradient, in edge detection,
420 grammar, 349, 352
GraphPlan, 193–206
Gray codes, 341
greatest lower bound, 264 Grevier, D., 16, 451 Grossberg, S., 329
ground literal, 190 GSAT, 68
Hart, P., 322, 432
Haupt, R., 344
Haupt, S., 344
Haussler, D., 307
Haykin, S., 307, 310
head of phrase, 375 Hecht-Nielsen, R., 329, 468 Heckerman, D., 152 Henrion, M., 152
heuristic function, in search, 57 heuristic search, 55–60
hidden Markov model, 402–8 hidden node, 292
hierarchical task network, 219–23
hill-climbing search, 68 Hinton, G., 301, 451 history list, in NLP, 388 HNC, 468
Hobbs, J., 473
Holland, J., 341
Honda, 473
Horn, B., 432
Horn clause, 97
Hough transform, 425–6 Howard, R., 186
HTN see hierarchical task network
Huang, C., 152
Huffman, D., 432
HUGIN, 473
Hugin propagation, 141–5 Hwee Tou Ng, 394 hypothesis learning, 227
IBM, 464
ID3 algorithm, 242–50 identity function, 290 if and only if, 19, 21–2 if ... then, 19, 21
ILP see inductive logic
programming image plane, 411
implication see if ... then implicit parallelism, 342 inclusive or, in logic, 20 independent event, 113 induction, as inverse of
deduction, 261 inductive bias, 239
inductive inference, 227 see also inductive logic
programming inductive logic programming,
253–67
inexact inferencing in Bayesian
networks, 149–502 inference engine, 9
information gain, 244–5 inheritance, object-oriented, 14 initial state, in search, 49 instantiation, object-oriented,
13
instrument, linguistic role in
NLP, 382
intensional meaning, in FOPC,
43
interference, mutex relations,
196 interfering goal, 191
intermediate-logical form, 383–6
interpretation, in FOPC, 41–4 interpretation of symbols, 449 interrogative mood, 361 intervening action, 174,
177–81
intonation planning, 397
intransitive verb, 351 intrasentential anaphora, 388 introduced rule in chart parser,
368–9
inverting resolution, 261–2 iRobot, 473
irregular verb, 351 iterative-deepening search,
54–5
504 Artificial Intelligence
Jennings, N., 437, 440 Jensen, F., 141, 152, 473 Johnson, D., 69
joint probability distribution,
111 Juang, B., 408
junction tree, 133 see also cluster tree
Jurafsky, D., 392, 394, 408
Kambhampati, S., 224 Kautz, H., 206
Kay, M., 394
Keane, M., 451
Kelly, J., 101
kernel, of a filter, 416 Kinny, D., 438 Kirkpatrick, S., 66 Knight, K., 101 knowledge query and
manipulation language,
438 knowledge-based
planning, 219–23
system, 4
Koehler, J., 224
Kohonen, T., 314, 320, 329 Koller, D., 152
Kosko, B., 170
Kowalski, R., 101
Koza, J., 343
KQML see knowledge query
and manipulation language
L junction, in vision, 418 lambda
calculus, 362
reduction, 362 Langford, J., 224
language bias, in ILP, 265–6 language of thought
hypothesis, 449–50 Lauritzen, S., 152
law of contradiction, 26 learning, 225–344
as search, 233–7 associative, concept of, 226 backpropagation, 301–9 bias, 238–9
candidate elimination,
234–7
classification, concept of, 231–2
clustering, 312–25 concept, 233–7
decision tree, 241–51 expectation maximization,
324–5
FOIL, 257–61 generalization, 248
neural nets, 307
operators, 233
genetic algorithms, 330–44 ID3, 242–51
inductive logic
programming, 253–67 introduction to, 226–40 neural networks, 286–329 overfitting, 248 Q-learning, 281–4 radial-basis functions,
325–8
reinforcement, 269–84 self-organizing feature map,
314–21 soundness and
completeness, 32–3 specialization operator, 233 supervised, 229, 231
target function, 227,
229–31
temporal difference, 282 types of, 231–2 unsupervised, 229, 231 validation data, 248 version space, 234
learning as search, 233–7 learning bias, 238–9
least general generalization,
264
least upper bound, 264
Lecun, Y., 432
Lee, K., 408
Lenat, D., 467, 468 levelled off graph, in
GraphPlan, 203 lexicon, 349
Lindsay, R., 2
linear predictive coding, 399 linear resolution, 98
linguistic value, in fuzzy logic,
154 linguistics, 352
lists, in Prolog, 482–4
local minimum, in search, 64 location, linguistic role, 382 Loebner, H., 447
logic, 18–44
and connective, 20 antecedent, 21 arguments and validity,
25–8
assumptions, 28
atomic formula, 72 atomic proposition, 18 atomic sentence, 35 commutativity, 26 conjunctive normal form,
73–4, 77–9
connectives, 19–22 consequent, 21
deduction, 28–32 disjunction, 20
disjunctive normal form, 73 distributivity, 26 entailment, 33
equivalence, 24–5, 26 FOPC, 33–44
formula, 32
fuzzy see fuzzy logic
if and only if, 19, 21–2
if ... then, 19, 21
inductive programming
see inductive logic
programming premise, 18
Prolog, see Prolog propositional, 18–33 reasoning, automation of,
72–101
refutation proof, 27 resolution, FOPC, 77–88 resolution, propositional,
72–7
semantics, 19 skolemization, 79–82 soundness and
completeness, 32–3 syntax, 19
truth tables, 19, 22–4 unification, 82–5
valid argument, 25–8 well-formed formulas,
39–41
logic connectives, 19–22
Index 505
logic operators see logic connectives
logical and, 19, 20
logical if and only if, 19, 21–2
logical if ... then, 19, 21
logical not, 19, 21
logical or, 19, 20
logistic function, 291
LOTH see language of thought hypothesis
LPC see linear predictive coding
LUNAR, 394
Lynch, M., 467
McAllester, D., 170
McCarthy, J., 6, 451
McDermott, D., 170
McGeoch, L., 69
Manning, D., 394
marginalization, 122
Markov assumption, 214
Markov decision process, 214
Markov model, 214
Markov property, in reinforcement learning, 273
Marr, D., 432
Martin, J., 392, 394, 408
mass, of a sepset, 138
Masters, T., 310
Matheson, J., 186
MDP see Markov decision process
Mellish, C., 101, 394, 491
memoization, in GraphPlan, 203
message passing, in junction tree, 141, 144
method, object-oriented, 13
method, in planning, 222
Metropolis, N., 64, 69
Michell, M., 344
Microsoft, 464, 473
Miikkulainen, R., 394
Minker, J., 170
Minsky, M., 6, 12, 13, 16, 451
missionaries and cannibals, 50
Mitchell, T., 233, 234, 239, 240
Mlnet, 240
modal operator, in non-monotonic inferencing, 167
modal operator, in NLP, 384
model, in logic, 32
momentum, in backpropagation learning, 304
monotonic reasoning, 166
moral graph, 134
morpheme, 350
morphological knowledge, 350
most general unifier, 83
MUC-5, 458
Muggleton, S., 255, 266, 267
multiagent system, 437
mutate, 330, 332
mutex relations, 193, 196
MYCIN, 3
NASA, 473
natural language processing, 348–93
  bottom-up parser, 355–6
  chart parser, 367–70
  context analysis, 363–4, 387–90
  features, in grammars, 370–7
  parts of speech, 351–2
  in Prolog, 487–91
  phrase, 349
  semantic analysis, 357–63, 377–83
  speech, 396–408
  stages of, 349–50
  statistical approaches, 390–3
  structure analysis, 352–6
  top-down parser, 355, 356
  with Prolog, 487–91
Ndumu, D., 437, 438
negation, logical, 21
negative literal, 72
neural networks, 8, 286–329
  activation functions, versions of, 289–91
  backpropagation, 301–9
  basic description, 287–94
  clustering, 312–25
  linear and non-linear problems, 298–301
  radial-basis functions, 325–8
  self-organizing feature map, 314–21
Neural Technologies, 468
Neurodynamics, 467
neuron, 287
Newell, A., 206
N-gram model, 391
Nilsson, N., 44, 69, 189, 192, 206
NLP see natural language processing
node, neural, 287
noise, in an image, 419
non-intervening action, 174, 175–6
non-monotonic reasoning, 166–9
non-terminal node, in a tree, 353
no-op operator, 194
Norsys, 473
noun, 351
noun phrase, 351
Nwana, H., 437
object, linguistic role, 382
object, object-oriented, 13
object, in natural language, 349, 351
objective function for genetic algorithm, 330, 332
object-oriented concepts, 13
OLAP, 466
ontology, 442
operator, in search, 49
operator, in planning, 190
O-plan, 224, 454
optical axis, 411
optimality, in search, 47
or, fuzzy variables, 159
Oracle, 464
orthographic projection, 412
other minds response, 444
overfitting, in learning, 248
overtrained neural network, 307
parallel processing, with genetic algorithms, 343
parent, object-oriented, 14
parse tree, 355
parser, 353
parsing, 353–6, 367–70
  bottom-up, 355–6
  chart, 367–70
  shift-reduce, 355–6
  top-down, 355, 356
partially observable Markov decision process, 216–17, 224
partially ordered plan, 222
part-of-speech tagging, 392
Passino, K., 170
pattern updating, 307
PCA see principal component analysis
PDDL see planning domain definition language
Pearl, J., 152
Pednault, E., 224
PEGASUS, 462
Penn treebank, 392
Pentland, A., 432, 462
Peot, M., 186
Pereira, F., 394
perfect tense, 352
perfect progressive tenses, 352
perspective projection, 411–12
philosophy, 441–51
  Chinese room, 442–4
  strong and weak AI, 442–4
  thinking machines, 444–6
  thought as language, 448–51
  Turing test, 446–8
  what is it?, 441–2
phone, 397
phoneme, 397
phonetic knowledge, 350
phrase, in natural language, 349
physical symbol system hypothesis, 6
pinhole camera, 411
pixel, 409, 410
plan recognition, 390
planning, 189–224
  action description, 190
  as search, 191–3
  conditional effects, 208–11
  context dependent, 208
  critics, 222
  effect, 190
  goal description, 190
  GraphPlan, 193–206
  HTN, 219–23
  knowledge-based methods, 219–23
  method, 222
  operator, 190
  O-plan, 224, 454
  possible worlds, 218
  precondition, 190
  primitive task, 219
  regression, 192
  replanning, 223
  state description, 190
  STRIPS, 189–91
  task, 219
  with uncertainty, 211–19
planning domain definition language, 224
plausibility, in Dempster–Shafer theory, 164–5
Plotkin, G., 262
Poggio, T., 432
policy, 271
policy, in Markov models, 215
policy iteration, 280–1
Pollack, J., 451
POMDP see partially observable Markov decision process
Ponce, J., 432
population of hypotheses, 331
positive literal, 72
possible worlds, in planning, 218
posterior probability, 111
power set, 163, 238
precedence, of logic symbols, 26
precondition edge, 194
precondition, in planning, 190
predicate, 34
predicate calculus, 33–44
preference bias, in ILP, 266
preferences between quantifiers, 387
premise, in logic, 18
prenex normal form, 78
preposition, 351
primitive task, in planning, 219
principal component analysis, 429–31
prior probability, 111
probabilistic network see Bayesian networks
probability distribution, 110
probability mass, in Dempster–Shafer theory, 163
probability theory, 109–13
ProDAPS, 470–2
production system, 11
productivity, 450
Progol, 255, 266
progressive tenses, 351
Prolog, 88–101, 474–91
  arithmetic, 480–2
  assert statement, 475
  cut, 485–6
  fail, 486–7
  lists, 482–4
  NLP, 487–91
  recursion, 480
pronoun, 351
proposition, 18
proposition nodes, 194
propositional logic, 18–33
propositional satisfiability, 66
pruning decision tree, 248–9
Pylyshyn, Z., 450, 451
Q-learning, 281–4
quantification, in natural language, 383–6
quantifier, in natural language, 351
quantization, 398
Quillian, R., 16
Quinlan, J., 242, 251, 257, 267
Rabiner, L., 408
Radcliff, N., 343
radial basis function network, 325–8
random variable, 110
Rao, A., 438, 439
reactive architecture, 440
reasoning
  abductive, 105, 167
  automation of, 72–101
  monotonic, 166
  non-monotonic, 166–9
  see also Bayesian networks; logic, deduction and resolution
recursion, in Prolog, 480
refutation proof, in logic, 27
regression planning, 192
regular verb, 351
reinforcement learning, 269–84
remote agent experiment, 452
replanning, 223
representation, 10–16
resolution, in logic inferencing, 74–88
resolution, of an image, 411
resolution deduction, 76
resolution in FOPC, 77–88
resolution in propositional logic, 74–7
resolution refutation, 75
return, 272
reward, 269, 271
rewrite rule, 352
Ringland, G., 16
Roberts, L., 432
Robinson, J., 101
robot response, 444
robotics, 473
roles, linguistic, 382
Rosenberg, C., 408
roulette wheel selection, 336
Rowley, H., 432
rules of deduction, 28–32
Rumelhart, D., 301, 302, 310, 394
St. John, 451
sample space, 109
sampling, 397
SAT, 66
satisfaction, in FOPC, 43–4
satisfy
  in logic, 33
  in concept learning, 233
Schank, R., 16, 394
schema theorem, 341–2
schemata, 341
Schutz, H., 394
scoping, in FOPC, 40–1
scorecard, 468
script, 389
SDR-4X, 473
search, 46–69
  A*, 57–60
  as function optimization, 60–6
  backward, 192
  backward chaining, 100–1
  best-first, 57
  blind, 50–5
  breadth-first, 50–2
  completeness, 47
  constraint satisfaction, 66–8
  cost, 49, 57–60
  data-driven, 100
  depth-first, 50, 52
  forward chaining, 100–1
  function optimization, 60–6
  global minimum, 64
  goal, 46, 49
  goal-driven, 100
  heuristic, 55–60
  hill-climbing, 68
  iterative deepening, 54–5
  local minimum, 64
  operator, 49
  optimality, 47
  simulated annealing, 63–6
  state, 49
Searle, J., 438, 442
segmentation, in speech, 407
segmentation, in vision, 422–4
segmentation, through thresholding, 423–4
Sejnowski, T., 408
selectional restriction, 388
self-organizing feature map, 314–21
Selman, B., 68, 69, 206
semantic
  analysis, 357–63
  features, 361
  knowledge, 349–50
  network, 11–12
semantics, in logic, 19
semantics, in NLP, 357–62, 377–83
Sensory GraphPlan, 217–19
sepset, in Bayesian networks, 121
serial connection, in Bayesian network, 117
SGP see Sensory GraphPlan
Shachter, R., 186
Shafer, G., 170
shape description, 418
Shenoy, P., 186
Shieber, S., 394
shift-reduce parser, 355–6
SHRDLU, 394
sigmoid function, 291
Simon, H., 206
simple tenses, 351
simulated annealing, 63–6
single-point crossover, 332, 337
singly connected DAG, 134
singular value decomposition, 326
SIPE, 224
skolem function, 80
skolemization, 79–82
Sloman, A., 441, 445, 451
SmartAirport, 458
Smets, P., 170
Smiths Aerospace, 469
Smolensky, P., 451
Sobel operator, 421
SOFM see self-organizing feature map
Sony, 473
sound logic, 32
specialization, object-oriented, 14
specialization operator, in learning, 233
specific boundary set, 234
spectrum, 398
speech processing, 396–408
  frame, 398
  hidden Markov model, 402–8
  recognition, 401–8
  segmentation, 407
  signal processing, 397–401
spherical projection, 412
Spiegelhalter, D., 152
Spirtes, P., 152
Spivey, M., 101
state, in search, 49
state description, in planning, 190
state space, in search, 49
statistical approaches, to natural language processing, 390–3
stereo imaging, 414
stochastic outcome, 105
stochastic sampling, inexact inferencing in Bayesian networks, 150–2
Stolke, A., 408
strength of a weight, 288
STRIPS, 189–91
strong AI, 442–4
subclass, object-oriented, 14
subject, in natural language, 349, 351
θ-subsumption, 261–3
subsumption architecture, 440
subtype, object-oriented, 14
Sugihara, K., 432
Sundheim, B., 458
supervised learning, 229, 231
surface normal, 418
Sussman anomaly, 191
Sutton, R., 284
SVD see singular value decomposition
symbol system hypothesis, 449
symbolism, 9
syntactic knowledge, 349
syntax, in logic, 19
system response, 444
systematicity, 450
T junction, in vision, 418
tagset, in NLP, 392
target function, in learning, 227, 229–31
task, planning, 219
Tate, A., 224, 473
tautology, 23
temporal difference learning, 282
tense, 351
term, in FOPC, 39
terminal node, in a tree, 353
test decision, 174, 181–3
texture, 416
threshold function, 290
TMS see truth maintenance system
top-down hypothesis derivation, 256
top-down parser, 355, 356
Touretzky, D., 451
transitive verb, 351
travelling salesman, 341
trellis, 404
triangulated graph, 135–8
triangulation, in image processing, 414
triangulation, of a graph, 136–8
trigram model, 391
Tripath Imaging, 463
truth maintenance system, 168–9
truth table, 19, 22–4
Turing test, 446
Turk, M., 432, 462
turnstile, logic symbol, 28
two-point crossover, 337
type, object-oriented, 14
types of learning, 231–2
UCPOP, 224
Ullman, J., 394
uncertainty, 103–70
  Bayesian networks, 104–52
  Dempster–Shafer theory, 163–6
  explanation of, 104–5
  fuzzy logic, 154–63
  in planning, 211–19
unification, 82–5
unifier, 83
uniform crossover, 337
unit clause, 77
universal quantifier, 35
unsupervised learning, 229, 231
utility, 174
utility value, 272
valid argument, in logic, 25–8
valid deduction, in logic, 18
validation data, in learning, 248
valuation function, in logic, 32
value iteration, 279–80
value of information, 183–5
variable, in Bayesian network, 106
verb, 349, 351
verb phrase, 351
version space, 234
vision, 409–32
  classifying objects, 428–31
  edge detection, 418–21
  filter, 416
  Hough transform, 425–6
  image, basic concepts, 410–13
  segmentation, 422–4
  stereo imaging, 414
  visual cues, 414–17
Visionic, 463
visual cues, 414–17
Viterbi algorithm, 407
Waibel, A., 408
Waldinger, R., 206
Walsh, T., 69
weak AI, 442–4
weakest precondition, 192
weight, in neural networks, 287–8
weight, of a cluster, 138
Weizenbaum, J., 446
Weld, D., 193, 195, 206, 217
well-formed formula, in FOPC, 39–41
Werbos, P., 301, 302, 310
Whitely, D., 344
Widrow–Hoff rule, 294
Wilkins, D., 224
Williams, R., 301
Williamson, J., 152
Winograd, T., 393
Winston, P., 16, 458
Wittgenstein, L., 445
Woods, W., 394
Wooldridge, M., 440
working memory, production system, 11
world knowledge, 350
Wos, L., 101
XOR definition, 292
Yuille, A., 432
Yurkovich, S., 170
Zadeh, L., 154, 170
Zelle, J., 394
Zhu, S., 432