Contents

Preface

PART 1 Introduction
1 Introduction

PART 2 Logic and Search
2 Logic
3 Search
4 Automating Logical Reasoning

PART 3 Uncertainty
5 Bayesian Networks I
6 Bayesian Networks II
7 Other Approaches to Uncertainty

PART 4 Deciding on Actions
8 Decision Networks
9 Planning I
10 Planning II

PART 5 Learning
11 Introduction to Learning
12 Decision Tree Learning
13 Inductive Logic Programming
14 Reinforcement Learning
15 Neural Networks I
16 Neural Networks II
17 Genetic Algorithms

PART 6 Natural Language Understanding and Perception
18 Natural Language Processing I
19 Natural Language Processing II
20 Speech Processing
21 Vision

PART 7 Agents, Philosophy and Applications
22 Agents
23 Philosophy of Artificial Intelligence
24 Some Applications of Artificial Intelligence

Appendix: A Quick Tour of Prolog
Bibliography
Index

CHAPTER 1
Introduction
This book is an introduction to techniques that are being used to build machines that manipulate information. These machines are being used today in a wide variety of applications, such as monitoring credit card fraud, making autonomous decisions on space missions, watching for attacks from computer network hackers, diagnosing faults in aircraft, enabling human–machine speech interfaces, and making the characters in a video game behave in a more human-like way. All of these machines are the result of technologists working in the field of artificial intelligence (AI). AI has become an important engineering discipline for building machines that can assist us and entertain us. But AI is not just about engineering; it is also a science that aims to further our understanding of our own and other animal intelligence. You may ask ‘Is not the ultimate aim of AI to build machines that think like us, are mobile, perceive what is going on around them, can communicate with us, and have limbs similar to ours so that they can do the sort of physical things that we do – a humanoid?ʼ In some sense, AI is about striving to build this type of machine. The technologies required to build such machines are being developed. There are robots that look like humans in spacesuits that walk and climb stairs, and robots that express emotion. Aspects of perception such as seeing and listening are being heavily researched, along with all of the other elements that make up intelligence, e.g. reasoning, learning, language communication, planning, decision making, and common-sense. We are still a long way from building a humanoid, and we do not really know whether it will be possible to build artificial machines that think like us and have an understanding about the world in the same way that we understand our world. But we are building machines that today assist us in many aspects of our daily life and we can be sure that AI will infiltrate our lives more and more in the coming years without us even knowing. We can also be sure that these machines will be better than us at many tasks in much the same way that the calculator is better than any human is at arithmetic computations.
The aim of this chapter is to provide a little more background on what AI is about.
1.1 AI emerges from the laboratory
In 1965, a team headed by Edward Feigenbaum and Robert Lindsay started work on a computer system to assist chemists to determine the molecular structure of complex organic compounds. Their system was not programmed in a conventional manner. Instead, it used a body of knowledge, the kind of knowledge possessed by chemists, to help solve the problems it was presented with. The system was an expert at analysing molecular structure and it was the start of expert systems, computer programs that have expert knowledge in specialized domains.
The system developed by Feigenbaum and his team was called DENDRAL. Determining the molecular structure of a compound involves finding out which atoms are present and how these atoms are connected together. Chemists can obtain data about a compound by breaking it up into fragments using a device called a mass spectrometer. The results of mass spectrometry provide

clues as to the chemical composition of the fragments and how many types of each fragment there are. The problem for the chemist is that each fragment can correspond to several different types of substructure and yet all of these substructures must fit together into a single global structure. In theory, the chemist could test to see which combinations of substructures fit together to form the global structure. The problem is that there are too many possible combinations to make this approach practical. The problem is a little like being presented with a photograph of someoneʼs face that has been cut into many squares and jumbled up. There are many ways in which the squares can fit together but only one fit reproduces the face. If the photograph has been separated into four equal segments then there are 24 possible combinations of fit (ignoring rotations). The number of combinations increases rapidly with the number of segments. An exhaustive search would entail a systematic rearrangement of the pieces until the fit forms the image of the face. The search time can be significantly reduced if knowledge can be used. For example, in the face reconstruction task there are many constraints suggested by knowledge of real faces. For instance, the ears are to the left and right of the eyes, the eyes are approximately in line and above the nose, which is above the mouth, etc. DENDRAL used its knowledge base to constrain the search and help chemists discover the molecular structure of a compound.
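To get a feel for how quickly exhaustive search becomes impractical, the short calculation below (a Python sketch; the segment counts are chosen arbitrarily) prints the number of possible arrangements as the photograph is cut into more pieces:

    # With n segments there are n! (n factorial) arrangements, ignoring rotations.
    import math

    for n in (4, 9, 16, 25):
        print(n, math.factorial(n))   # 4 segments give 24 arrangements, 9 give 362880, ...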
A few years later, work started on another expert system for suggesting which micro-organism was causing a blood infection and how the infection should be treated. The system was named MYCIN. Tests are performed to gather data about the infection a patient has. One such test involves growing bacterial cultures from a sample taken from the location of the infection. These cultures can provide morphological features and staining characteristics of the organism. MYCINʼs knowledge can then be applied to these clues about the infection to diagnose the likely organism and to suggest treatment. Like DENDRAL, MYCINʼs knowledge was encoded as a set of IF…THEN rules. These rules consist of a set of conditions and a conclusion. If all of the conditions hold (are true) then the conclusion holds. A typical rule in MYCIN is:
IF   The stain of the organism is Gram negative AND
     The morphology of the organism is rod AND
     The aerobicity of the organism is aerobic
THEN There is strong evidence to suggest that the class of organism is Enterobacteriaceae
When reasoning, an expert system will usually apply many rules. The conclusion from one rule can form part of the IF condition of another rule. For example, the following two rules might form part of an expert system for diagnosing a fault with a car engine:
1 IF   The engine will not turn over AND
       The lights do not come on
  THEN The battery is dead

2 IF   The battery is dead
  THEN The car will not start
If rule 1 concludes that the battery is dead, then rule 2 will conclude that the car will not start.

DENDRAL and MYCIN were the first of many expert systems. These systems represented the first truly large-scale commercialization of AI technology. Large corporations started to develop expert systems to mimic the reasoning performed by human domain experts. The boom years for the traditional expert system were the 1980s. The boom was short-lived, however. Too many people jumped on the bandwagon and started to suggest expert systems for applications for which they were not suited. Expert systems were good at tackling specific types of problem, but they represented only part of the AI technology under development. Many overestimated what expert systems were capable of, probably because expert systems were offshoots of the field whose aim is to equip machines with human-like intelligence.
Expert systems are in commercial use today, but it is now rare for an application to be called an expert system. To many in industry, the title ‘expert systemʼ still conjures up a perception of what the technology offered in the 1980s and yet there have been some significant advances since. For example, the MYCIN rule presented earlier had its conclusion couched in uncertainty – there is strong evidence to suggest… The car rules, although written here as black and white conclusions, are lacking in diagnostic reasoning. Again, the knowledge base has to have knowledge of how likely a conclusion is when the conditions in the IF part hold. There were techniques for handling uncertain knowledge in early expert systems, but the approaches were somewhat ad hoc. Today, good probabilistic-based techniques exist. The tools of today also offer many advances over early systems because software engineering techniques have made significant advances in the last 10–15 years. The term ‘knowledge-based systemʼ is more likely to be used today rather than ‘expert systemʼ.
1.2 What is an AI application?
Those early expert systems represent the essence of an AI application. An AI application is designed to mimic some aspect of human intelligence, whether that be reasoning, learning, perceiving, communicating, or planning. All of these facets of human intelligence rely on knowledge. They must also be flexible so that they can respond to events that were not foreseen by the developers of the system. An AI program is not like a database program that queries for information from a table, it is not like a spreadsheet program that computes total sales by category of item, and it is not like a program for constructing and sending emails. All of these programs follow clearly defined sequences of instructions determined by what the user requests. In these conventionally programmed systems, it is possible to predict what the user will want to do and program the instructions accordingly. AI programs are somewhat different. Programs are still constructed by programmers, but the program they implement must be able to execute in a way that is highly flexible. Consider the task faced by a medical domain expert, a doctor. Often a patient presents a minor complaint that may easily be treated by prescription. There is usually no single treatment for any one complaint. A patient may not respond well to one form of medication or may suffer side-effects. Sometimes a doctor may not be sure what is causing the symptoms but believes that the cause belongs to a certain class of complaint that normally responds to a particular form of treatment. A doctor may decide that there is value in carrying out tests to assist the diagnosis. He or she may decide that a patient needs to be referred for more specialist consultation.
So what makes the programming of an artificial doctor different from, say, a program for filing and calculating your tax return? In many ways, the two domain tasks are similar. They both require elicitation of data concerning an individual. The doctor needs to know about symptoms and may perform tests to gather further data. The tax return program needs data concerning earnings, personal circumstances, expenses, etc. Both the medical and tax domains require expert knowledge. The doctor has gained a large amount of knowledge over many years of training. There are many rules regarding tax and a tax expert has good knowledge of these rules. Both types of domain expert

seek to derive information from data and knowledge. The doctor hopes to identify the cause of the complaint and prescribe a treatment; the tax expert seeks to compute money owed to the tax office or money to be refunded to the individual. The real differences between the two domains lie in the amount of knowledge the expert requires, the type of data and number of exceptional circumstances, the consequences of a mistake, and the complexity of decision making. Earnings when filing a tax return are easily computed. The symptoms a patient suffers cannot always be so precisely described: symptoms may vary in severity, symptoms may come and go, and aspects of the data may be unknown (e.g. a negative test result might reduce the likelihood of some condition being present but the test may not be conclusive). A mistake on a tax return can always be recovered, possibly in the next tax year. A mistake in diagnosing a patient may not be so easily rectified. A doctor may need to consider many decisions, such as the need for additional tests and the type of treatment. The other major difference between the tax domain problem and the medical one is that a doctor will use a wide range of other information, such as a patientʼs history and information derived through their own perceptions from vision and listening. In summary, programming an artificial doctor cannot be tackled using conventional techniques.
Back in the 1980s, there were suggestions for more sinister applications of expert systems. Some people, for example, contemplated putting expert systems onboard missiles. The missile would be given a specific target and, should the target not be found, the expert system could decide if an alternative target existed or if the missile should return home to be recovered. At that time this form of application was ambitious. For a start, the system would need to reliably recognize legitimate targets, and this is no easy task. There is though a growing need for complex reasoning on autonomous vehicles. There is currently much funding and development of unmanned aircraft. These air vehicles have been used in recent conflicts for surveillance. Development is also under way on unmanned air combat vehicles. Many believe that the new Joint Strike Fighter currently in development will be the last manned combat aircraft to be produced on a large scale. There will be a drive for autonomous reasoning so that an aircraft can respond quickly and appropriately to events. Eventually, this reasoning will also extend to the aircraftʼs health so that it can take appropriate action should it develop a fault or suffer battle damage.
An application needs AI technology when there is no clear recipe of how to perform a task given all the circumstances that might arise. If you can sit down and put together a list of instructions that can state for a task, when X occurs do Y, then you may not need AI. Most technology applications today get along without AI, but they desire it just like man desired speed and progressed from horseback to mechanized transport. For example, today you do your banking by phone. You can dial in, enter a pass code, find out your balance, and pay a bill. The more involved the transaction is, the more awkward the interface becomes. You have to wait for a question to be asked, respond by entering data through the keypad, and, if all is okay, you go to the next stage. No need for AI here. But would you not prefer to do all this by speaking? It would certainly be quicker. The greater challenge though is when you want to do a transaction or make an enquiry that is not on the menu. You are then referred to a human operator. If you want a bank loan and the computer sees that all the right boxes are ticked then the loan can be approved. The computer always has a default instruction, ‘any exceptional circumstances refer to bankʼ. The computer might have a rule that says, ‘you need to have been in employment for at least 1 yearʼ. If you are a talented young graduate 3 months into a new job then surely the bank wants you as a customer? Of course it does, but you are still an exception requiring human attention!
An AI application is in simple terms a system that possesses knowledge about an application domain, takes in data from its environment, and reasons about these data to derive information. This information is then used to make decisions and act. Often a human will make the decision and perform the action but technology is developing so that more of this responsibility resides in the

machine. Expert systems are good examples of what the essence of AI is, but an expert system is just one example of AI being applied to real-world problems. Other applications require vision, speech recognition, learning, mobility, etc. A specific expert system may also use these technologies.
1.3 What is AI?
Hopefully the preceding paragraphs have given some idea as to what AI is all about, but we have not directly tackled the question of what AI is. There is no crisp definition of AI, and a number of people have attempted to express what the aim of AI is. Marvin Minsky, the co-founder of the AI lab at the Massachusetts Institute of Technology (MIT), gives one such view of AI:
AI is the science of making machines do things that would require intelligence if done by men.
The machine is usually a digital computer.
It is generally recognized that AI became established as a scientific discipline at a conference in
1956 when a small group of like-minded scientists gathered to discuss their work and ideas. The fundamental underpinnings of the subject were to develop over the following years. It was John McCarthy, one of the organizers of the conference, who labelled the new discipline ‘artificial intelligenceʼ. The central hypothesis of traditional AI, known as the physical symbol system hypothesis, is that any facet of human intelligence can be understood and described precisely enough for a machine to simulate it. The world is to be represented internally by the machine using symbolic structures. Intelligence then emerges as a process of transforming these structures. Consider, for example, language communication. We use symbols (words) to refer to objects in the world and other aspects such as actions (e.g. running), time, emotion, and so on. A computer simulation of this form of communication can be implemented in a way that is similar to having a computer read the following mathematical expression and understand how to produce an answer:
[(x – 2/y) + 3]
A program would parse this expression to make sure that it can recognize it, and out of this parse it might form a tree structure like the one shown in Figure 1.1. To give an answer, the program needs to recognize a number and needs to know what to do with the mathematical operators /, + and –. These operators have meanings, and their meanings are the mathematical operations divide, add, and subtract. The program can recognize these symbols and call functions that implement these mathematical operators. The tree is scanned in a certain order to produce what is called Polish notation. Once this notation is derived, the calculation is simple: the expression is scanned from right to left and when an operator is encountered it is applied to the two operands to its right. Many language programs operate in a similar way by parsing sentences into tree structures. They then transform the trees into other structures that can represent the meaning. For example, suppose we have a database table with the following entry:
Relation   Argument 1   Argument 2
own        caroline     mia
We could write a simple program that could answer: What does Caroline own?
The program could parse the sentence into the following form:

Figure 1.1 A tree representation of the expression ((x – 2/y) + 3). The root + node has two children, the – node and 3; the – node has children x and /; the / node has children 2 and y.
own(caroline, X)
The program could then look up the table to see where it finds a match for the relation own and the first argument caroline. In this example, the variable X would be assigned the value mia (because a variable can refer to any object) and the system would respond with ‘miaʼ. The above expression is asking what can match with own(caroline, X). The database contains the relation own(caroline, mia). The variable X can match with any other symbol, which in this case is ‘miaʼ.
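The matching process can be sketched in a few lines of a conventional programming language. In the fragment below (Python; the table contents and function names are invented for illustration, and a language such as Prolog provides this style of matching directly), upper-case names play the role of variables:

    database = [
        ("own", "caroline", "mia"),
        ("own", "fred", "rover"),
    ]

    def match(query, fact):
        """Return variable bindings if the query matches the fact, else None."""
        if len(query) != len(fact):
            return None
        bindings = {}
        for q, f in zip(query, fact):
            if q[0].isupper():        # a variable can refer to any object
                bindings[q] = f
            elif q != f:              # constants must match exactly
                return None
        return bindings

    def ask(query):
        answers = []
        for fact in database:
            bindings = match(query, fact)
            if bindings is not None:
                answers.append(bindings)
        return answers

    print(ask(("own", "caroline", "X")))   # [{'X': 'mia'}]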
The physical symbol system hypothesis is still waiting to be proved. Today, machines can perform useful language communication, but they still fall far short of what a young child is capable of. Many would argue that the physical symbol system hypothesis is not sufficient to replicate human intelligence. There is a big question over whether internal representations are adequate to prime a machine with enough understanding so that it can reason intelligently. For example, can we really describe a garden tree to a computer using structures that convey features such as texture, size, and being rooted in the ground? Or does the machine need to have its own perception of the world so that it can learn for itself what a tree is? Consider the challenge of how to represent a chair inside a machine. Four legs to which is attached a small surface upon which a backrest may be present? I would not describe my sofa as having four legs, and what about a more unusual design of chair that has two sloping legs and stabilizing feet? How then do we find it relatively easy to recognize an object as a chair even when we have not seen some unusual design before? Perhaps we see these objects as comfortable spots to park our weary limbs and perhaps on some occasions we use the context of our surroundings, like a restaurant, where we can expect to be seated, or the selection of dining suites in a furniture store.
AI is both an engineering and scientific discipline. Most people in AI want to build machines that perform tasks requiring human-like intelligence. Others within AI want to understand more about human intelligence. The two interests are, of course, complementary. The act of building applications can tell us much about intelligent processes, and discovering more about ourselves through more direct methods such as psychological studies or the biology of the brain gives us insight into how to build intelligent machines.
If your interest in AI is to build a human-like agent to do the household chores then you should be encouraged. This is a noble challenge, and success would surely capture the interest of the world at large. In reality, such machines are still a long way off. Whether these machines await a huge scale of effort similar to the first manned missions to the moon, who knows? It might be that we have the fundamental collection of techniques already. It might be that we need to radically rethink some aspects of AI. Notwithstanding the considerable challenges facing AI researchers, there is great interest in what AI can do for real-world applications. This interest is also growing. The

problem-solving techniques developed within the discipline of AI tackle hard problems and in some areas the performance is better than that of humans. Some examples of the types of application are listed below.
• Systems have been demonstrated that can perform as well as humans in restricted domains of medical diagnostics.
• Machines can now recognize speech to a level competent enough for continuous dictation.
• Unmanned space vehicles have flown that use AI-based planning to allow onboard autonomous reasoning.
• Machine learning is used to detect credit card fraud.
• Large corporations use data mining to hunt out information in huge databases. Such applications assist companies such as supermarkets to devise their sales strategies.
• Machines have been trained to recognize individuals from images of their faces.
• Machines can also track the eye and mouth movement of an individual who is speaking and moving their head.
• Topics of interest in news stories can be filtered and summarized by machines.
• Gaming companies actively employ AI technologists to program human-like traits into the agents of a game.
This is just a small sample of the sort of problems AI is tackling. Some of the technology areas are more advanced than others. Often the domain of application is very restricted and the environment controlled. For example, you might be able to communicate with a machine to enquire about tourist information within a city but you would not be able to widen the conversation and talk about the weather, employment opportunities, or sport.
AI is largely about building machine programs that can simulate some aspect of human intelligence. There are algorithms for reasoning, learning, planning, speech recognition, vision, and language understanding. AI takes from and gives back to many disciplines that include mathematics, psychology, neurobiology, philosophy, computer science, and linguistics. There are many professions that have an interest in AI, such as medicine, engineering, and law.
1.4 Different models of intelligence
AI has sometimes been partitioned by different researchers into various camps that hold a particular view of how human intelligence is best modelled. Models of intelligence have been influenced by different subjects, such as mathematical logic, psychology, biology, statistics and economics. These influences can be seen in the subjects presented in this book. The logical and psychological models have influenced much of what is referred to as traditional AI or classical AI. An example of biologically inspired AI is connectionist AI, in which the computing elements resemble an abstraction of our own neural circuitry. Neural networks are the computing tools of connectionists. Most work on connectionist architectures has been done in the last 20–30 years. Early work on neural networks predates the 1956 conference at which AI was born, but traditional AI is so called because neural networks played a minor role in the formative years of AI. Connectionism has really boomed since the late 1980s.
Neural networks are structures that consist of many simple processing nodes connected via weighted links. The nodes and links form a network. The type of processing performed by a node varies depending on the node type, but the processing is usually very restricted, often doing nothing more than summing signals sent from other nodes and transforming this sum into an output. This output can then be sent on to other nodes or to some interface that communicates with a user. The

weighted links can inhibit or strengthen a signal transmitted between nodes. The fascination with these networks is that they can perform complex tasks and yet they are constructed out of very simple elements. Of course, a computer at a fundamental level is also composed of simple elements such as sequences of bits which are subjected to simple operations.
To get anything useful out of a computer, it has to be programmed, and the more complex the task the more complex the instruction set that has to be entered. A neural network is simulated on a computer using a conventional program, but the network is not explicitly instructed in how to perform a task. Think for a moment about the rules in an expert system. These systems are constructed using a tool known as a shell. The shell is a high-level programming environment that permits the programmer to write a set of rules and test these rules. The shell has an inference engine, which is a program that knows how to use the rules to reason and derive conclusions. The rules represent the programmerʼs knowledge but the inference engine is the same whatever the domain of application. The shell is similar to a spreadsheet program: you can enter data into a spreadsheet and state what you want to be done, but there is a lower level program that interprets your instructions. The rules in an expert system are explicit instructions: if condition a holds and condition b holds then you can conclude c. A neural network, on the other hand, is not usually explicitly programmed. Instead, you feed the network with examples of the sort of data it is to process and you tell it what you expect it to output for each example. The network then attempts to adjust its weighted connections to comply with your wishes. For example, you could program an expert system with rules to credit score a bank loan applicant. The rule base would be fed data about the applicant, such as salary and current debt, and rules would be used to derive a conclusion that indicates whether or not the applicant is successful. In contrast, a neural network would be trained using known case histories. For each example, the network is fed with input data (the same data fed to the rule base) and the network is told for each example whether or not that individual was a successful applicant. Over time the network adjusts its weighted connections so that for each case it can output a successful or unsuccessful applicant. The network is basically told: given this type of data I want you to produce this result. Neural networks are largely self-programming devices – they learn. We should note here that there are also many learning techniques that fall within the traditional or symbolic AI paradigm.
We may suppose that neural networks ought to be the mainstay of AI in view of their abstract similarity to our own computing hardware. The situation at present is that our engineering skills in traditional AI techniques are more widely applicable. A pragmatist will use neural networks for some tasks and traditional approaches for others. We could get tangled in endless debate over which paradigm will yield the most fruitful results in the future. Some traditionalists argue that neural networks offer nothing new other than a platform for program implementation. They believe that at a functional level our reasoning operates along the same lines as traditional symbol manipulation. We think in a language and this language is similar in form and operation to logic or rules. Some connectionists argue that the traditional approach of symbolizing representations is ultimately restricted. For instance, symbols that are text based fail to capture the contextual information that is necessary for much intelligent reasoning. For example, a watch could be described differently depending on the context of how it is to be utilized. It may be desirable for an everyday watch to have a range of functions and be robust, but for a dress watch look and style may be more important than the range of functions it offers.
Connectionism does computation in a different way to computing with symbols, which is the mainstay of computing as we know it today. While some still look for an all-encompassing theory for building intelligence and will argue for a paradigm like symbolism or connectionism, there is an emerging consensus that intelligent machines will need multiple approaches to modelling aspects of human intelligence. Furthermore, there is no longer a clear divide of techniques between symbolism and connectionism. This can be seen in areas where the role of statistics is gaining more emphasis.

1.5 Representation
We reason, plan, learn, and perceive about objects and events in the external world. Somehow we hold representations of the events and objects in our mind. If you are planning a dinner party you must hold in your thought process some representation concerning objects that will form the menu and other objects that represent your guests. A machine simulation of intelligence will also need representations of the external world. A great facet of human intelligence is that we can reason by simulating actions in the real world. This simulation is important because we can predict things that might happen and do not put ourselves at disadvantage through trial and error. Suppose you want to rescue a cat from a tree. You can reason about the use of a ladder, whether the ladder is long enough, whether the cat has the temperament to let you pick her up, whether you can balance with the cat, etc. Trial and error means that you do not give much thought about what you are about to do and you go ahead in the hope that somehow everything fits together. To do this reasoning, we need to represent knowledge about the world. We need internal representations, substitutes for things, such as the cat, the ladder, and the tree. We also hold some representation for abstract things such as fear, distress, or love. The thought of losing our beloved cat might cause us to overcome a fear of heights.
The whole science of AI has largely revolved around issues of knowledge representation. That said, there is a school of people within AI building machines without concern for specifics of knowledge representation. These machines are not built with the purpose of reasoning in the same way as humans. These machines use pattern recognition, and the association of meaning to a pattern is achieved through interacting with the environment. The idea is for these machines to exhibit intelligent behaviour, but their processes may be very different to our own. For example, a neural network can be fed sentences with no knowledge of word category such as noun or verb, and yet it develops its own internal representations that exhibit patterns for verbs and inanimate and animate objects. We may suppose that teachers use much expert knowledge when marking studentsʼ essays and that they follow a complex chain of reasoning that uses this knowledge. It is not uncommon for a teacher to explain how a mark was derived. Such explanation suggests access to expert knowledge. It is surprising then to know that a purely pattern-based text system that has no explicitly programmed domain expertise can automatically grade studentsʼ essays and that the grades correlate highly with those given by expert markers. Pattern-based reasoners still possess a representation, but their representations are not necessarily accessible by us. We may not be able to open and inspect their knowledge base in the same way you could with a rule-based expert system. In a rule-based system the rules are syntactic structures that we can read and interpret. The elements that represent the knowledge in a pattern-based reasoner may look totally meaningless to us. It is a bit like looking at the internal representation a computer has of a photograph. In the computer the photograph is a huge array of numbers. Looking at these numbers means nothing to us, but when the computer displays the photograph these numbers are transformed into colour and intensity levels that reveal the image and its meaning.
Applications that exhibit intelligent behaviour are being built without concern for explicitly structuring the representation of domain knowledge. However, the representation of knowledge has played a key role in the development of intelligent machines, and it continues, and will continue, to do so. For any application, you need to understand what the objectives are and then you need to decide which problem-solving strategy offers the best approach – the application may require explicitly programmed knowledge or the system may be able to learn such that its behaviour, once trained, offers the intelligent simulation that the application demands. A whole chapter could have been written about knowledge representation, but instead I have taken the approach of dealing with representation implicitly in the context of different problem-solving strategies. Although representation will be specifically discussed only rarely, it is always there in the background, be it in the form

of logical structures or some semantically enriched graph structure. It is, however, important to have an understanding of what is meant by knowledge representation as the term crops up time and again.
A representation, by definition, is a substitute for the real thing. A street map of a town is a representation. The map holds just enough detail for the purpose it is intended – finding a street and navigating between locations. You do not get an image of what the town looks like and you are not supplied with irrelevant information such as the routing of telecommunication cables. Pilots use maps to plan flight paths, but their maps look quite different from ground-based maps. An air map shows key ground features such as major roads and railways, but it also contains information concerning restricted airspace and hazards to flight. The information represented by an air map can be used to work out a heading to be flown. Pilots are trained to navigate. They operate in many ways like a programmed machine. They have access to a functional tool, the flight computer, and have a representation of key data to which they apply their navigational inference. The answer they derive also depends on data received at the time of planning, e.g. wind direction and speed.
Knowledge-based inference machines work in a similar way. They represent knowledge and they have a program that knows how to apply this knowledge to case-specific data to derive answers.
There are many forms of representation; three types will be briefly discussed and then some general points will be considered. Later chapters will include more detail of specific forms of representations such as logic, connectionist networks, natural language grammars, and causal networks.
1.5.1 Production systems
A production system is made up of IF … THEN rules. The IF part of the rule will contain one or more conditions and is called the antecedent. The THEN part is the consequent. A number of these rules exist in long-term memory (are stored when the system is not running), and the working memory holds information that is currently being processed. The basic mechanism of inferencing (reasoning) is to match the information in the working memory with rule antecedents in long-term memory. If a rule matches, then it is triggered, and if a triggered rule is fired its consequent is added to working memory and this new information may cause other rules to fire. Sometimes conflict resolution strategies are defined should more than one rule be triggered. The strategies determine which one of the triggered rules to fire.
An example of a production rule is:
IF   X has skin AND
     X eats food AND
     X breathes
THEN X is an animal
In this rule the antecedent has three conditions. The rule also contains a variable, X, that can refer to any object.
Figure 1.2 shows a simple example of how matching and rule firing are used to achieve inferencing.
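To make the cycle concrete, the following is a minimal sketch of a forward-chaining production system (Python), using the laundry rules of Figure 1.2. The wording of the rules and facts is normalized for the sketch, the numeric test on dirty items is written as a literal fact, and no conflict resolution strategy is needed because every triggered rule is simply fired.

    # Each rule is a set of antecedent conditions and a single consequent.
    rules = [
        ({"clothes are wet", "day is sunny"}, "dry clothes outside"),
        ({"clothes are wet", "day is wet"}, "dry clothes in dryer"),
        ({"clothes are dirty", "more than 15 dirty items"}, "wash clothes"),
        ({"wash clothes"}, "clothes are wet"),
    ]

    working_memory = {"clothes are dirty", "more than 15 dirty items", "day is sunny"}

    fired = True
    while fired:                                    # cycle until no new rule fires
        fired = False
        for antecedent, consequent in rules:
            if antecedent <= working_memory and consequent not in working_memory:
                working_memory.add(consequent)      # firing adds the conclusion
                fired = True

    print(working_memory)   # now also holds 'wash clothes', 'clothes are wet'
                            # and 'dry clothes outside'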
1.5.2 Semantic networks
A semantic network consists of nodes that denote concepts or objects, links between the nodes that denote object relations (also called associations), and link labels that denote specific relations. For

Figure 1.2 Example working of a small production system. Data come into the working memory and are matched against rules in the long-term memory. When a rule is fired, its conclusion is added to the working memory and this newly derived information can lead to the firing of other rules. The figure shows a long-term memory holding four rules:
1 IF clothes are wet AND day is sunny THEN dry clothes outside
2 IF clothes are wet AND day is wet THEN dry clothes in dryer
3 IF clothes are dirty AND more than 15 dirty items THEN wash clothes
4 IF wash clothes THEN clothes wet
The working memory initially holds: clothes are dirty, 20 dirty items, day is sunny. Step 1 matches rule 3, adding ‘wash clothesʼ; step 2 matches rule 4, adding ‘wet clothesʼ; step 3 matches rule 1, adding ‘dry clothes outsideʼ.
example, Figure 1.3 shows a semantic network for representing the concept arch. An arch consists of three parts: a top and two sides. The sides, which must not be touching, support the top. If the objects for the tops and sides are allowed to be of any type, the semantic network in Figure 1.3 provides a general description that can be used to recognize any arch. For example, the sides could be cylindrical and the top could be a pitched roof.
The labelled links can be used to express various forms of relations. These could be structural, such as is-part-of or x-contains-y; subtyping, as in ako (a kind of); similarity based, in which a numeric weight indicates the strength of association between one object and another; or more complex, e.g. hit, loves, and go. Another example of a semantic network is shown in Figure 1.4.
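As a sketch of how such a network might be held inside a machine, the fragment below (Python; the storage format and relation names are assumptions made for illustration) stores labelled links for the animal concepts of Figure 1.4 and answers a query by inheriting along the ako links:

    # A minimal sketch of a semantic network as labelled links. A query climbs
    # the 'ako' (a kind of) links so that properties are inherited from parents.
    network = {
        "animal": [("eats", "food"), ("must", "breathe"), ("has", "skin")],
        "bird":   [("ako", "animal"), ("has", "wings"), ("has", "feathers"), ("can", "fly")],
        "fish":   [("ako", "animal"), ("has", "fins"), ("can", "swim")],
        "rook":   [("ako", "bird"), ("colour", "black")],
    }

    def lookup(concept, relation):
        while concept is not None:
            links = network.get(concept, [])
            values = [value for rel, value in links if rel == relation]
            if values:
                return values
            concept = next((value for rel, value in links if rel == "ako"), None)
        return []

    print(lookup("rook", "must"))   # ['breathe'], inherited via bird -> animal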
1.5.3 Frames and object-oriented concepts
Marvin Minsky proposed the concept of frames to represent stereotypical situations. You have, for example, an expectation of what will be found in a kitchen, e.g. a cooker, storage cupboards, a sink, and various forms of utensils. Events can also be described in terms of stereotypical situations.

Figure 1.3 A semantic network for the concept arch (Winston, 1970). The Arch node has three has-part links, to Top, Side 1 and Side 2; the Top is connected to each side by a supported-by link, and Side 1 and Side 2 are connected by does-not-touch links.
Eating at a restaurant generally entails being seated, selecting from the menu, order taking, eating, and paying the bill. Minsky proposed frames as structures that could be called up from memory and adapted when a new situation is encountered. The frame would be adapted by changing certain features, such as the colour of the walls in a kitchen or whether the floor is tiled or carpeted. A frame could also have properties that must be present for the frame to be applicable to the situation. A kitchen frame may insist on the presence of a sink and a cooking appliance. Frames could also be linked together to organize knowledge and express relations.
The concept of frames can be found in the more general object-oriented programming paradigm that is popular today. The terminology of frames is different from that of the object-oriented school, and the motivation for frames is to perform inferencing, which is a specific type of computational task that is not generally required in the wider field of object-oriented programming. Nonetheless, we shall introduce a number of key object-oriented concepts as they support the structures and mechanisms that are suggested by frames. Indeed, many AI systems are implemented using an object-oriented language. The wider concepts of object-oriented techniques are also being utilized in some advanced frameworks for inferencing, one example being an extension to structuring probabilistic causal inferencing, which is introduced in Chapters 5 and 6.
An object can be used to denote anything we wish. It could be a concrete concept such as a book or car, or something more abstract, e.g. shape. A class provides a framework for describing a type of object. The class is the blueprint, much as a drawing is a blueprint for a house. An object is an instantiation of a class. For instance, a book has features such as title, author, publication date, and publisher. The class ‘bookʼ provides a generic description using features, but an instance is an object that has values assigned to these features. Instead of the term ‘featureʼ, the word ‘propertyʼ is used to denote an attribute of an object. Properties are pieces of information about an object, such as colour, author, publisher, and number of pages. Methods describe behaviour in that they declare what an object can do. For example, a book object might have a list of keywords that describe the bookʼs content. The list of keywords is a property, but a method could be used to see if a particular word exists in the list of keywords. A graphical object that is part of a user interface will have methods such as draw, move, hide, and resize.

Figure 1.4 Semantic network showing some animal concepts and properties. The Animal node has links eats Food, must Breathe and has Skin. Bird and Fish are linked to Animal by ako links: a Bird has Wings, has Feathers and can Fly; a Fish has Fins and can Swim. Rook is linked to Bird by an ako link and has colour Black. This network is an efficient way of memorizing data. For example, we do not need to store explicitly the knowledge that a bird must breathe. We can infer this fact because a bird is a kind of animal and animals must breathe.
There may be some confusion as to the granularity of an object. For example, this text is an instance of the class ‘bookʼ, but there are many instances that are all copies of this text. The important thing is that an object should be identifiable and so every object should have some property that is unique. A library may contain many copies of the same book, but every one can be uniquely identified in terms of a number.
Objects can be linked (associated). One special type of association is inheritance. Class is synonymous with type and inheritance is used to define type specialization. This text is of type book, but a book is a subtype of a more general class called publication. All publications have a title, publisher, and date. Journals, newspapers, and books are all subtypes of publication, and they differ from each other in the way that they specialize publication. A book has a property that is a list of chapters, whereas a journal will contain a list of papers. The publication class is said to be a parent of book, journal, and newspaper. Also, book and journal can be referred to as child classes, or derived classes, or specializations, or subtypes, or subclasses. Inheritance is a special form of association because it usually has built-in support that allocates by default all properties and methods found in a parent class to a derived class. For example, in Figure 1.4, bird is derived from animal and so all birds have properties must-breathe, eats-food, and has-skin. There are restrictions that can be enforced to determine which properties and methods are inherited, but this is detail that need not

concern us here. Any other form of association can be modelled. The links that form associations are usually implemented by having a property in one object hold a reference to the associated object.
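A small sketch of these ideas in an object-oriented language (Python here; the class, property and method names and the example values are invented for illustration) shows a derived class inheriting the properties of its parent and adding behaviour of its own:

    # A parent class 'Publication' and a derived class 'Book' that inherits its
    # properties and adds a keyword list plus a method.
    class Publication:
        def __init__(self, title, publisher, date):
            self.title = title            # properties shared by all publications
            self.publisher = publisher
            self.date = date

    class Book(Publication):              # a book is a specialization of publication
        def __init__(self, title, publisher, date, keywords):
            super().__init__(title, publisher, date)
            self.keywords = keywords      # a property specific to books

        def has_keyword(self, word):      # a method describes what a book can do
            return word in self.keywords

    text = Book("An Introduction to AI", "Example Press", "2003", ["search", "logic", "learning"])
    print(text.title)                     # 'title' is inherited from Publication
    print(text.has_keyword("logic"))      # True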
Another useful feature of objects is that they can be programmed to respond to events. Probably the most commonly encountered event with computers is the press of a key when typing or the click of a mouse button. An event will invoke a method that describes how the object should respond. Clicking the mouse on part of a window might cause the window to close. When the water in a kettle boils (an event), the kettle should switch itself off after a short delay.
A method can have different behaviour depending on the type of object being processed. The method for drawing an object to the screen is dependent on the shape of the object. For instance, the draw method for a circle-shaped object is different to that for a cube.
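A brief sketch of this idea (again Python, with invented class names): each shape class supplies its own draw method, and the behaviour that runs depends on the type of the object being processed.

    # Each class provides its own version of 'draw'; calling shape.draw() runs
    # the version belonging to that object's type.
    class Circle:
        def draw(self):
            return "drawing a circle"

    class Cube:
        def draw(self):
            return "drawing a cube"

    for shape in (Circle(), Cube()):
        print(shape.draw())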
The object-oriented paradigm is excellent for modelling and implementing programs. There is growing demand for software components that can be used in different applications. The component may be a learning algorithm, a component for displaying graph structures and executing graphical operations, or a component that filters emails and routes them to particular storage folders. The object-oriented programming paradigm and component software techniques are of practical interest to AI in that they are tools that assist in building large complex systems out of smaller parts.
There are many real-world examples that can be modelled using object-oriented techniques. Consider simulating a lift system with two lift shafts. The objects are the two lifts, the people using the lifts, the request service panels on each floor, and the control panels inside each lift. The behaviour of each lift can be described using methods (e.g. move up, move down, and wait). The behaviour can also be constrained in terms of control logic rules. For example, ‘if moving down and a request for service to a lower floor is received after a request to move to a higher floor, continue to move downʼ. Passenger requests issued in the lift or on the various service levels are events. A request to go to a particular floor is an event that the lift must handle along with all other events. Responding to events in strict order of receipt is not likely to be efficient, and so some careful control logic is applied. A sophisticated simulation might get both lift objects communicating in an attempt to optimize the service to customers. The simulation allows engineers to see how the lifts respond to different scenarios of user demand.
1.5.4 Key features of a representation
Davis et al. (1993) state that a knowledge representation is best understood in terms of five distinct roles that it plays:
1 A representation is a substitute. The substitute allows us to think rather than act. For instance, we often play out scenarios in our own minds to reason about the consequences of our actions. Planning an evening out requires substitutes for venues, transport, friends, likes and dislikes, etc.
2 A representation suggests in what way you should think about the world. For example, a procedural representation for instructions to change a carʼs wheel would be written as a sequence of instructions: apply handbrake, loosen nuts, jack car, and so on. Representing this type of knowledge using objects does not appear so natural. On the other hand, modelling knowledge concerning the components that make up a car engine might fit naturally with an object description. Some objects are composed of other objects and there exist objects of the same type, such as hoses, pistons, etc.
3 A representation is a fragmentary theory of intelligent reasoning. Representations are based on theories about what reasoning is. Often these theories are incomplete or are generated by some small insight, for example observations of behaviour from psychological studies. Prolog is an AI programming language that has a representation rooted in mathematical logic. Knowledge is represented in Prolog using logical expressions and these expressions support reasoning, as we shall see in later chapters. The rules in expert systems are inspired by psychological studies of how domain experts reason. Connectionist networks are biologically inspired. Causal networks use statistical reasoning. All of these representations present a view on what reasoning is and how it is to be done. A connectionist view is that knowledge can be represented in the weighted connections of nodes and that inference is basically achieved by pattern association.
4 A knowledge representation is a medium for efficient computation. A representation is useful only if we can compute with it. Much research concerns the efficiency of the computation. Often efficiency is judged in terms of space (memory requirements) and speed of inference. The question of efficiency is very important for real-world applications.
5 A knowledge representation is a medium of human expression. A representation is a way for us to impart knowledge to our machines. It is also a way for humans to exchange and share knowledge with one another. Software engineers are familiar with this role for representation. They will often use a modelling language to document the design of a program and to communicate ideas with other members of the team. They will judge a representation on how easy it is to express things, how concise the representation is, and how readable it is to others.
Further reading
For a good overview of work in AI during the first 40 years see Crevier (1993). Crevier also provides a good general and readable introduction to AI, as does Copeland (1993), albeit from a more philosophical perspective. An interesting read on intelligence and a general discussion about its computational modelling is provided in Davis (1998).
A general book on representation is Ringland and Duce (1988). Davis et al. (1993) explore in depth the nature of representation. For specific discussions of various representation formalisms see Collins and Quillian (1969) for semantic networks, Minsky (1975) for frames and Schank and Abelson (1977) for scripts. An early example of learning semantic structures is Winston (1970).
A general read on some of the capabilities of connectionist architectures is Clark (1993). An example of intelligent-type behaviour without explicit programming of knowledge is Brooks (1991). Two examples of text processing without explicit knowledge of language are Elman (1990) and Deerwester et al. (1990).

CHAPTER 15
Neural Networks I
Artificial neural networks are parallel computing devices consisting of many interconnected simple processors. These processors really are simplistic, especially when compared with the type of processor found in a computer. Each processor in a network is aware only of signals it periodically receives and the signal it periodically sends to other processors, and yet such simple local processes are capable of performing complex tasks when placed together in a large network of orchestrated cooperation.
Work on artificial neural networks dates back to the 1940s, but it has only really been since the late 1980s that this form of computation has become popular, following some theoretical leaps in learning and the growth in available computing power. The word ‘artificialʼ is sometimes used to make it clear that discussion is about an artificial device and not about neural circuitry found in animals. The human computer (the brain) performs some tasks that computers have not, as yet, come close to matching in performance, and some would argue that a number of these tasks might always prove elusive to computer scientists. It is not surprising then that our own neural architecture should be the inspiration for computing machines. At this point in time, in comparison with the human brain, artificial neural networks are simplistic abstractions. It is common practice to drop the prefix ‘artificialʼ when it is clear in which context these networks are being discussed. Also, artificial neural networks are often referred to as connectionist networks when computing ability is emphasized rather than biological fidelity. In other words, connectionists aim to make neural networks solve a task rather than attempt to mimic faithfully some part of biological computation.
Although neural networks can be (and are) implemented as fast hardware devices, much research is performed with a conventional computer using software simulation in place of hardware implementation. Not only does software simulation provide a relatively cheap and flexible environment with which to develop ideas, but also real-world applications can be delivered in the form of a conventional piece of software. For example, neural networks are applied to credit scoring of loan applicants, and many other applications exist that do not demand a hardware implementation.
Neural network solutions are growing more sophisticated, and no doubt over the coming years our skills for engineering these computing devices will improve. Already, though, there is a vast array of exciting developments. The application base for neural networks is enormous: credit card fraud detection, stock market forecasting, credit scoring, optical character recognition, machine health monitoring, road vehicle autopilots, learning to land damaged aircraft, etc. Further understanding of the human brain will inspire future artificial neural networks, but there is mutual support in that artificial neural networks are being used as models for brain processes to assist in our understanding of the human brain.
Neural networks are typically programmed through learning, and hence their place within this part of the book. In this chapter, we shall introduce a type of architecture known as the feed-forward network and present an algorithm for learning called backpropagation. Most practitioners would accept that the backpropagation algorithm played a major role in bringing neural networks back to mainstream computing.

15.1 The basic components
A neural network is a collection of nodes that are connected in some pattern to allow communication between the nodes. These nodes, also referred to as neurons or units, are simple processors whose computing ability is typically restricted to a rule for combining input signals and an activation rule that takes the combined input to calculate an output signal. Output signals may be sent to other nodes along connections known as weights. The weights usually excite or inhibit the signal that is being communicated. A neural network node is illustrated in Figure 15.1.
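The computation performed by a single node can be sketched in a few lines. In the fragment below (Python), the rule for combining input signals is a weighted sum and the activation rule is a sigmoid-shaped squashing function; both are illustrative choices rather than the only possibilities, and the general features a network needs are listed shortly.

    # A single node: combine the incoming signals, then apply an activation rule.
    import math

    def node_output(inputs, weights):
        combined = sum(x * w for x, w in zip(inputs, weights))   # rule for combining inputs
        return 1.0 / (1.0 + math.exp(-combined))                 # activation rule (sigmoid)

    print(node_output([0.5, -1.0, 0.2], [0.8, -0.3, 1.5]))       # an output signal between 0 and 1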
One of the intriguing aspects of neural networks is that, although they have nodes with very limited computing capability, when many of these nodes are connected together the complete network is capable of performing complicated tasks.
The pattern of connectivity refers to a networkʼs wiring detail, i.e. the detail of which nodes connect, their direction of connection and the values of their weighted connections. The task that a network knows (or its program) is coded in the weights. The connection pattern is usually determined by a two-stage process: first, the system designer specifies which nodes are connected and in which direction and, second, the weight values are learned during a training phase.
Sometimes the weights can be determined without training, but the great appeal of neural networks is their ability to learn a task from being exposed to the kind of data that the network will be expected to process when fully operational. Indeed, for many applications, training is the only option for programming a network because we do not have the task-solving knowledge.
There are many different types of neural network but, in general, they will share the following features:
• a set of simple processing nodes;
• a pattern of connectivity;
• a rule for propagating signals through the network;
• a rule for combining input signals;
• a rule for calculating an output signal;
• a learning rule to adapt the weights.
15.1.1 A set of simple processing nodes
Each processing node usually has incoming weights to receive signals from other network nodes and outgoing weights to transmit signals to other network nodes. Some nodes, input nodes, exist to receive input signals. There are also nodes, termed output nodes, for outputting the results of computation. For example, if a network is used as a classifier, the inputs would be attribute values and the outputs would correspond to the different classes.
Figure 15.1 A single network node: incoming weights receive signals from other nodes, and outgoing weights send signals on to other nodes.

15.1.2 A pattern of connectivity
The pattern of connectivity refers to the way in which the nodes are connected. In one type of network, each node may connect to every other node; in another network type, nodes may be arranged into an ordered hierarchy of layers in which connections are allowed only between nodes in immediately adjacent layers; yet other types of network allow feedback connections between adjacent layers, or within a layer, or allow nodes to send signals back to themselves. A connection is parameterized by a weight. A weight is specified by three parameters:
1 the node that the weight connects from;
2 the node that the weight connects to;
3 a number (usually real) that denotes the weight value.
A negative weight value will inhibit the activity of the connected-to node whereas a positive weight will serve to excite the connected-to node. The absolute weight value specifies the strength of the connection.
The pattern of connectivity is conveniently described in a matrix, W, where the entry wij represents the weight value from node i to node j (note that, in many texts, the matrix is written so that connections go from node j to node i: the important thing is to be consistent when performing matrix and vector operations). More than one weight matrix may be used to describe the pattern of connectivity where nodes are grouped into layers. Figures 15.2 and 15.3 give examples of the connectivity pattern written as matrices.
The weight matrix (or matrices) is the networkʼs memory, which holds the knowledge of how to perform a task.
For a four-node network, the weight matrix might be:

W =  0.0  –2.6   4.3   0.0
     1.2   0.0   0.0   0.5
    –0.8   0.0  –0.4   0.9
     0.7  –1.0   2.3   0.0

For example, the weight that connects node 3 (row 3) with node 1 (column 1) is denoted by the value –0.8 in the weight matrix above.

Figure 15.2 A four-node network and the matrix that describes its connections.
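For readers who like to see the indexing spelt out, the matrix of Figure 15.2 can be held as a nested list in Python (an illustration only; the book itself uses no Python), with w[i][j] giving the weight from node i to node j:

# Weight matrix W of Figure 15.2: W[i][j] is the weight from node i to node j.
# (Python indices run from 0, so node 1 is row/column 0.)
W = [
    [ 0.0, -2.6,  4.3,  0.0],
    [ 1.2,  0.0,  0.0,  0.5],
    [-0.8,  0.0, -0.4,  0.9],
    [ 0.7, -1.0,  2.3,  0.0],
]
print(W[2][0])   # -0.8: the weight connecting node 3 to node 1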

Figure 15.3 For a network whose nodes are grouped into layers (layer-1, layer-2, layer-3), a separate weight matrix (W1, W2, ...) describes the connections between each pair of adjacent layers.
15.1.3 A rule for controlling the propagation of signals through the network
Conventional computer programs impose conditions on when certain processes can start and finish. The same is true for neural networks. For a particular network type, some rule will exist to control when nodes can be updated (i.e. combine input signals and calculate an output signal) and when a signal can be sent on to other nodes. With some network types, a node is selected at random for updating, whereas other types insist that one group of nodes must update before another group can be updated.
15.1.4 A rule for combining input signals
It is typical for a nodeʼs incoming signals to be combined by summing their weighted values. This summation method is illustrated in Figure 15.4, in which netj is the resultant combined input to the node j, xi is the output from node i and n is the number of incoming connections. Other methods for combining input signals exist.
15.1.5 A rule for calculating an output signal
Nodes have a rule for calculating an output value that will be transmitted to other nodes or for presenting output results. This rule is known as an activation function, and the output value is referred to as the activation for the node. The activation may be real-valued, sometimes restricted to the interval [0, 1], or discrete such as {0, 1} or {+1, –1}. The value passed to the activation function is the net combined input to a node. Some activation functions are given next.

netj = Σ_{i=1}^{n} xi wij

For the node j shown in Figure 15.4, with incoming signals 0.7, 0.1 and 0.3 arriving along weights –0.3, 3.1 and 0.5:

netj = (0.7 × –0.3) + (0.1 × 3.1) + (0.3 × 0.5) = 0.25

or, alternatively, in vector notation:

[0.7, 0.1, 0.3] [–0.3, 3.1, 0.5]ᵀ = 0.25

Figure 15.4 The typical way of summing the signals impinging on a node.
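The same summation can be written as a line of code. The Python fragment below (illustrative only; the book contains no code) reproduces the calculation of Figure 15.4.

# Weighted-sum rule for combining a node's input signals (Figure 15.4).
def net_input(signals, weights):
    """Return net_j = sum_i x_i * w_ij for one node."""
    return sum(x * w for x, w in zip(signals, weights))

x = [0.7, 0.1, 0.3]                   # outputs x_i from the sending nodes
w = [-0.3, 3.1, 0.5]                  # weights w_ij into node j
print(round(net_input(x, w), 2))      # 0.25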
Identity function
The activation function for input nodes is the identity function, which simply means that the activation (signal sent on to other nodes) is the same as the net input (Figure 15.5).
Binary threshold function
Most network types rely on a non-linear activation function. A binary threshold function will limit the activation to 1 or 0 depending on the net input relative to some threshold θ (Figure 15.6). Usually, it is more convenient to add a bias term to the net input and change the threshold to its mathematical equivalent form shown in Figure 15.7. The bias w0 is the negative of the threshold and in this case the net input is calculated as:
netj = w0 + Σ_{i=1}^{n} xi wij    (15.1)
The bias is normally thought of as a weight coming from a node that always has an activation of 1, as shown in Figure 15.8.
The net input can be expressed as:
netj = Σ_{i=0}^{n} xi wij    (15.2)

where x0 always has a value of 1.

Figure 15.5 The identity function: the activation is the same as the net input, f(net) = net. Note that f(net) refers to the activation.
Figure 15.6 Binary threshold function:

f(net) = 1 if net ≥ θ
         0 if net < θ

Figure 15.7 Binary threshold function with the bias term added in:

f(net) = 1 if net ≥ 0
         0 if net < 0

Figure 15.8 For convenience of implementation, the bias term is often thought of as being a weight w0j that is connected to a node in the previous layer with an activation permanently set to 1.

Sigmoid function

The sigmoid function is one of the most common forms of activation function; its output falls in a continuous range from 0 to 1. An example is the logistic function shown in Figure 15.9:

f(net) = 1 / (1 + exp(–net))

The slope and output range of the logistic function can vary. The bipolar sigmoid, for example, has an output in the range –1 to 1.

Figure 15.9 The sigmoid (logistic) function.

Example 1 • • • • •

The following example brings together some of the points discussed so far. A network is used to express the XOR relationship. The XOR relationship maps two binary inputs to 0 or 1; the definition is given in Table 15.1. The network model shown in Figure 15.10 is a layered feed-forward network with two input nodes, two hidden nodes and one output node. The hidden nodes are so called because they do not take direct input from the environment or send information directly out to the environment. In this example, we can think of the environment simply as ourselves feeding values to the input nodes and monitoring the result via the output nodes. The nodes are arranged in layers: the input layer containing the input nodes, the hidden layer containing the hidden nodes and the output layer containing the output nodes. The number of nodes in each layer depends on the problem being solved and will be discussed later in this chapter. For the moment, we simply observe that the number of input nodes matches the number of attribute values describing the training examples and the number of output nodes matches the number of attributes describing the output vector associated with each training example.

Table 15.1 XOR relationship

x1   x2   Output
1    1    0
0    0    0
1    0    1
0    1    1

Figure 15.10 Network used to model the XOR relation. Each hidden node receives both inputs through weights of –1; the hidden nodes have biases of 1.5 and 0.5, and the output node has a bias of –0.5 and hidden-to-output weights of 1 and –1.

For this example the net input is calculated according to:

netj = Σ_{i=0}^{n} xi wij

and the output will be calculated using the threshold function:

f(net) = 1 if net ≥ 0
         0 if net < 0

The activation is the same as the net input for nodes in the input layer. Signals propagate through the network from the input layer to the output layer. The first input will be the first sample in Table 15.1, which is < 1, 1 >. For the first hidden node with a bias of 1.5:
net = (x0 × 1.5) + (x1 × –1) + (x2 × –1)
net = (1 × 1.5) + (1 × –1) + (1 × –1) = –0.5
so the output = 0. For the second hidden node with a bias of 0.5: net = (x0 × 0.5) + (x1 × –1) + (x2 × –1)
net = (1 × 0.5) + (1 × –1) + (1 × –1) = –1.5 so the output = 0.
For the output node with a bias of –0.5: net = (x0 × –0.5) + (x1 × 1) + (x2 × –1)
net = (1 × –0.5) + (1 × 1) + (1 × –1) = –0.5
so the output is 0.
If the procedure is followed for the other three samples, the output of the network will match the output column in Table 15.1.
•••••
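As a check on Example 1 (a short Python illustration; the book itself gives no code), the four samples of Table 15.1 can be pushed through the network of Figure 15.10:

# Forward pass through the 2-2-1 XOR network of Figure 15.10.
# Weight layout: (bias, w_x1, w_x2) for each hidden node, (bias, w_h1, w_h2) for the output node.
HIDDEN = [(1.5, -1.0, -1.0),    # first hidden node
          (0.5, -1.0, -1.0)]    # second hidden node
OUTPUT = (-0.5, 1.0, -1.0)      # output node

def threshold(net):
    return 1 if net >= 0 else 0

def forward(x1, x2):
    h = [threshold(b + w1 * x1 + w2 * x2) for b, w1, w2 in HIDDEN]
    b, w1, w2 = OUTPUT
    return threshold(b + w1 * h[0] + w2 * h[1])

for x1, x2 in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    print(x1, x2, forward(x1, x2))   # reproduces the Output column of Table 15.1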

15.1.6 A learning rule to adapt the weights
The network illustrated in Example 1 implements the XOR function. The correct operation of this XOR network is dependent on the arrangement of nodes, the choice of activation function and the weights. The arrangement of nodes is usually fixed at the start of learning, and so is the choice of activation function. The task during learning then is to adapt the weights so as to produce the desired response.
It is usual for the weights to be set to small random values at the start of training and so, when an input sample is presented for the first time, it is unlikely that the network will produce the correct output. This discrepancy between what the network actually outputs and what it is required to output constitutes an error, and this error can be used to adapt the weights. An example of an error-correcting rule is the delta rule or Widrow–Hoff rule. For an output node that has a single input weight, an activation (i.e. output) of y and a target output t, the error, δ, is given by:
δ = t – y    (15.3)

The signal into the output node is x. The delta rule states that the adjustment to be made, ∆w, is:

∆w = ηδx    (15.4)

where η is a real number known as the learning rate. The new weight is then the adjustment added to the old weight:

wnew = w + ∆w    (15.5)
At the start of training, the weights are set to small random values, for instance in the range [–0.7, +0.7]. The backpropagation algorithm is a generalized version of the delta rule. During training, each example in turn is presented to the network and the weights are continually adapted until, for any input example, the error drops to an acceptably low value. Upon completion of training, the network is tested on data that were not seen during training to assess generalization.
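As an illustration of eqns (15.3)–(15.5), here is a minimal Python sketch (the book gives no code, and the target line y = 2x + 1 is invented purely for the demonstration) that trains a single linear node, with one input weight and a bias, using the delta rule:

import random

random.seed(0)

# Delta (Widrow-Hoff) rule for a single linear node:
#   delta = t - y          (15.3)
#   Dw    = eta * delta * x  (15.4)
#   w_new = w + Dw           (15.5)
eta = 0.05                                      # learning rate
w = random.uniform(-0.7, 0.7)                   # input weight, small random start
b = random.uniform(-0.7, 0.7)                   # bias weight (its input is fixed at 1)

samples = [(x, 2 * x + 1) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]   # targets from y = 2x + 1

for epoch in range(2000):
    for x, t in samples:
        y = w * x + b                           # linear node: output equals net input
        delta = t - y                           # eqn (15.3)
        w += eta * delta * x                    # eqns (15.4) and (15.5)
        b += eta * delta * 1.0                  # bias treated as a weight from a node fixed at 1

print(round(w, 3), round(b, 3))                 # approaches 2 and 1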
15.2 Fundamental concepts
This section will consider basic concepts before moving on to look at the backpropagation model.
15.2.1 Decision function
To introduce the concept of how a neural network operates, a trivial task is presented and the most basic type of network is used to solve that task.
Figure 15.11 illustrates a classification task. The task is to provide rules to classify an instance as either a member of class A or class B. The rules could be given in symbol form:
IF x2 > 0.80 AND x1 < 0.55 THEN class A
IF x2 < 0.90 AND x1 > 0.25 THEN class B
The above rules1 use discrete limit values that effectively partition the space into rectangular regions. An alternative approach to using rules is to derive a classification function by defining a line that separates the two classes.
1Note that x1 is used in place of x and x2 in place of y. Subscripting on x is used when problems have a high number of dimensions. For example, x3 is used in place of z.

Figure 15.11 The separation of two classes using idealized data.
The class for a new instance can be judged by plotting its location in the x, y plane and observing on which side of the line the instance is. In the present scenario it is possible to visualize the data as a two-dimensional picture. However, when handling hundreds of samples with a large number of attributes (i.e. a high-dimensional problem) it is no longer possible to visualize the task (not using standard charts).
The solution then is to use a decision function. The equation of the line that separates the two classes is:
x2 = 1.5x1 + 0.5

This equation can be used to create a decision function:

f(x1, x2) = –x2 + 1.5x1 + 0.5

d = class B if f(x1, x2) ≥ 0
    class A if f(x1, x2) < 0

For example, a sample from class B, < 0.4, 0.5 >, will give:

f(0.4, 0.5) = –0.5 + (1.5 × 0.4) + 0.5 = 0.6
The decision function will classify the sample correctly as a member of class B.
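A couple of lines of Python (again, purely an illustration rather than anything from the book) make the decision function concrete:

# Decision function for the separating line x2 = 1.5*x1 + 0.5.
def f(x1, x2):
    return -x2 + 1.5 * x1 + 0.5

def d(x1, x2):
    return "class B" if f(x1, x2) >= 0 else "class A"

print(round(f(0.4, 0.5), 2), d(0.4, 0.5))   # 0.6 class B, as in the text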
The above decision function could be modelled using a neural network, as shown in Figure 15.12.
The node in Figure 15.12 calculates an output value according to the threshold criteria:
output = 1 if total input ≥ 0
         0 if total input < 0

An output value of 1 would indicate a member of class B and an output of 0 would indicate a member of class A.

Figure 15.12 A simple neural network that models a line: inputs x1 and x2 connect to the node with weights 1.5 and –1.0, and a bias node fixed at 1 connects with weight 0.5.

Example 2 • • • • •

Find the weights for a neural model similar to Figure 15.12 that represents the equation:

2x2 = –4x1 + 8

For any point that falls on the line, the weights will be defined such that:

w0 + x1w1 + x2w2 = 0

Rearranging gives:

x2 = –x1(w1/w2) – w0/w2

Equating terms gives:

–w1/w2 = –4/2,    –w0/w2 = 8/2

So, w0 = –8, w1 = 4 and w2 = 2.

•••••

15.2.2 Adapting the weights

It should be obvious from Figure 15.11 that many straight lines can be drawn as the separating boundary and therefore many sets of weights exist to provide a solution. If tj denotes the target or desired output from node j and oj the actual output, then the error Ep for an example p can be defined as:

Ep = 1/2 Σ_j (tj – oj)²    (15.6)

and the overall error E = Σ Ep. The factor 1/2 is included for convenience later on. The activation for any node is dependent on the net input to that node and therefore is dependent on the weights impinging on that node. Imagine a node as in Figure 15.12 but with no bias. Such a node can model any straight line that passes through the origin. For a linear activation function and a single input vector, eqn (15.6) can be written as:

E = 1/2 (t – net)²    (15.7)

because for a linear activation function the output is the same as the input. Expanding gives:

E = 1/2 [t² – 2t net + net²]    (15.8)
  = 1/2 [t² – 2t(x1w1 + x2w2) + x1²w1² + 2x1w1x2w2 + x2²w2²]

with net = x1w1 + x2w2. Differentiating eqn (15.8) with respect to w1:

∂E/∂w1 = (–t + x1w1 + x2w2) x1    (15.9)

Equation (15.8) shows that if the squared error is plotted against w1 the shape is parabolic, as shown in Figure 15.13, and if eqn (15.9) is set to 0 and solved the minimum point on the curve can be found. Training starts with random weights and the initial state can therefore be anywhere on the error surface but is unlikely to be at its minimum. During training, the network should adapt its weights so that the overall error is reduced. The weights should therefore be adjusted in the direction of steepest descent.

Figure 15.13 The squared error E plotted against a weight w; the line represents the derivative of the error with respect to the weight at time n.

15.2.3 Minimizing the squared error

A science experiment that many children will have performed at school is to plot the deflection of a beam against different loads and then to fit a best straight line by minimizing the sum of the squared differences between each point and the line. The same principle can be used to adapt weights and one such rule for adapting weights is the delta rule introduced earlier. This rule is:

∆wij = ηδj xi,    δj = (tj – oj)    (15.10)

where tj is the target value at node j, oj is the actual output, xi is the signal coming from node i, η is the learning rate (by how much to adapt the weight) and ∆wij is the amount by which to change the weight connecting node i with j. The rule is simple to derive for a linear node with the output defined as:

oj = Σ_i xi wij

The chain rule can be used to express the derivative of the error surface with respect to a weight as a product that reflects how the error changes with a node's output and how the output changes with an impinging weight:
∂E/∂wij = (∂E/∂oj)(∂oj/∂wij)    (15.11)

∂E/∂oj = –δj

∂oj/∂wij = xi

and substituting back into eqn (15.11):

–∂E/∂wij = δj xi    (15.12)

Taking into account the fact that the weights need to change in a direction that is opposite to the direction of the gradient vector and that a learning rate is factored in, eqn (15.10) is derived. A modification to this method of adapting weights will be given in section 15.4 to train networks with multiple layers of nodes, but before that we shall take a look at what advantage such a network offers over one with just a single layer of weights (i.e. a layer of input nodes and a layer of output nodes).

Figure 15.14 The error can be plotted for different combinations of weights (w1, w2); the weight vector that gives the minimum error lies at the lowest point of the error surface.

15.3 Linear and non-linear problems

The network in Figure 15.12 has two input nodes for two attributes. The number of attributes denotes the dimension of the space that all input samples are drawn from: so a simple network model consisting of three inputs and a single output node will model a plane, and for n inputs the model will be an n-dimensional plane. For a classification task, if a line (for two dimensions) or a plane (for n dimensions) can separate all samples into their correct class, then the problem is linear. If the problem requires multiple lines or planes to separate the samples then the problem is non-linear. A famous non-linear example is the XOR problem introduced earlier. The XOR problem, then, is non-linear, and there are two options for solving it with a neural network: either use a network that will model two or more lines to separate the data or change the input vectors. The latter option can make the problem linear by augmenting the two input attributes with a third to make the samples three-dimensional (the two classes will then reside on two different ends of a cube). For most problems we prefer the network to sort out the non-linearity; after all, a main attraction of neural networks is their ability to solve problems for which a solution is not immediately obvious. Therefore, we shall concentrate on a solution that uses two lines to separate the data. The network then will require two hidden nodes, both connected to two input nodes to represent the two lines, and a third node to combine the outputs from the two hidden (middle) nodes. The XOR task with separating boundaries is illustrated in Figure 15.15. Figure 15.16 shows a neural network architecture that will model the two separating boundaries.

Figure 15.15 The XOR problem solved with two separating lines: the two lines separate (0, 1) and (1, 0) from (0, 0) and (1, 1).

Figure 15.16 Neural net for modelling the XOR problem. The second layer of weights is shown later.

What does the first layer of weights do to each input vector? Nodes on the input layer will be indexed by i, nodes on the hidden layer by j and the output nodes by k. The network in Figure 15.16 has the nodes split into the three layers. Table 15.2 shows the net input and output (using a threshold function) of the nodes in the hidden layer in response to the inputs for the XOR problem.

Table 15.2 Response of hidden nodes for the network shown in Figure 15.16

              Net input to hidden layer      Output from hidden layer
x1    x2      Node 1        Node 2           Node 1        Node 2
1     1       –0.5          –1.5             0             0
0     0        1.5           0.5             1             1
1     0        0.5          –0.5             1             0
0     1        0.5          –0.5             1             0
Each sample is transformed by the action of the first layer of weights and the hidden nodes. The second layer of weights connecting the hidden layer to the output layer will also model a line as there will be two inputs and a bias passing values to the single output node. If the output node is to respond with the correct class for each input sample, then the input to the second layer of weights must be linearly separable. This can be checked by plotting the response of the hidden nodes as shown in Figure 15.17. Figure 15.18 shows the weights that define the separating line given in Figure 15.17.

Figure 15.17 The effect of the first layer of weights has been to move the original point (0, 1) to (1, 0).

Figure 15.18 The complete network to solve the XOR problem (the second-layer weights are 1 and –1, with an output bias of –0.5).

A network with only linear activation functions is restricted to modelling linear problems. The associative law of matrix multiplication tells us that for linear activation functions a single layer of weights can be found that will achieve the same result as a network that has more than one layer of weights. In other words, a multilayered network with linear activation functions will only be able to solve a problem that a single-layered network is capable of solving. Therefore, non-linear problems are ruled out. For multilayered networks, then, a non-linear activation function is required, and for the backpropagation algorithm the function should be continuous, differentiable and monotonically increasing. A function satisfying these properties is the sigmoid function.

15.4 Backpropagation learning

For many years there was no rule available for updating the weights of a multilayered network undergoing supervised training. In the 1970s, Werbos developed a technique for adapting the weights, but it was the publication by Rumelhart, Hinton and Williams (1986a) that gave a new lease of life to neural networks. The weight adaptation rule is known as backpropagation. For what follows, a fully connected feed-forward network is assumed, which means activation travels in a direction from the input layer to the output layer, and the nodes in one layer are connected to every other node in the next layer up. The backpropagation algorithm defines two sweeps of the network: first a forward sweep from the input layer to the output layer and then a backward sweep from the output layer to the input layer. The backward sweep is similar to the forward sweep except that error values are propagated back through the network to determine how the weights are to be changed during training. This double sweep is illustrated in Figure 15.19. During training, each input sample will have an associated target vector. The objective of training is to find a set of network weights that provide a solution to the particular problem at hand. Before training commences, the weights are set to small random values, for example between –0.7 and 0.7. For reasons discussed earlier, a non-linear activation function, the sigmoid, will be used. The logistic sigmoid can only produce values between 0 and 1, and, because the function can never actually attain an exact 0 or 1, sometimes 0.1 and 0.9 are used instead of 0 and 1. The network is usually considered to have learned a task once all outputs fall within a specified tolerance of their target values.
For example, if the target output is 1.0 and the tolerance is 0.1 then any actual output between 0.9 and 1.0 will be within the specified tolerance.

Figure 15.19 The shaded hidden node sends activation to each output node and so during the backward sweep this hidden node will receive error signals from these output nodes.

The procedure for backpropagation training is as follows:

while not STOP
    STOP = TRUE
    for each input vector
        perform a forward sweep to find the actual output
        obtain an error vector by comparing the actual and target output
        if the actual output is not within tolerance
            set STOP = FALSE
        end if
        perform a backward sweep of the error vector
        use the backward sweep to determine weight changes
        update weights
    end for
end while

In summary, a sample is presented to the network and an error vector is calculated to determine how the weights should change; the process is then repeated for each other sample. An epoch is a complete cycle through all samples (i.e. each sample has been presented to the network). The samples are continually presented to the network, epoch after epoch, until during one epoch all actual outputs for each sample are within tolerance.

15.4.1 Some theory

Backpropagation uses a generalization of the delta rule. For a more in-depth presentation than that given here, the interested reader is referred to Rumelhart et al. (1986a) or Werbos (1990). The error derivative can be expressed as:

∂E/∂wij = (∂E/∂oj)(∂oj/∂netj)(∂netj/∂wij)    (15.13)

δj is defined as:

δj = –∂E/∂netj    (15.14)

The original delta rule in section 15.2.3 had the definition as:

δj = –∂E/∂oj

This definition is consistent with eqn (15.14) as the original delta rule is for linear nodes in which the output is the same as the input. The definition for δj can be expressed as:

δj = –(∂E/∂oj)(∂oj/∂netj)    (15.15)

As:

Ep = 1/2 Σ_j (tj – oj)²

we have:

∂E/∂oj = –(tj – oj)    (15.16)

For activation function f (typically the logistic function) the output is:

oj = f(netj)    (15.17)

and so the derivative f′ is given by:

∂oj/∂netj = f′(netj)    (15.18)

So:

δj = (tj – oj) f′(netj)    (15.19)

The standard summation of products is used to find the net total input:

netj = Σ_{i=0}^{n} xi wij

and so:

∂netj/∂wij = xi    (15.20)

So taking the product of each derivative and substituting back into eqn (15.13) gives:

–∂E/∂wij = (tj – oj) f′(netj) xi    (15.21)

Noting that the weight change should be in the direction opposite to the derivative of the error surface, the weight change for a node is:

∆wij = ηδj xi

The error δj given above is applicable to an output node, but the error for a hidden node is not directly related to the target output. However, a hidden node can be adapted in proportion to its assumed contribution to the error in the next layer up (i.e. the output layer for a network with a single hidden layer). For a network with a single hidden layer, the error from each output node will make a contribution to the error of each hidden node by propagating error signals back through the network. The contribution to a hidden node will depend on the size of the error for an output node and the strength of the weight that connects both nodes. In other words, an output node with a high error makes a strong contribution to the error of any hidden node that is connected by a weight with a high strength.
For a hidden node the error is given by:

δj = f′(netj) Σ_k δk wkj    (15.22)

where k indexes the layer sending back the error (the output layer in a network with a single hidden layer). A suitable activation function is the logistic function:

f(netj) = 1 / (1 + exp(–netj))

The derivative of this activation function is:

f′(netj) = exp(–netj) / (1 + exp(–netj))²
         = [1 / (1 + exp(–netj))] [1 – 1 / (1 + exp(–netj))]
         = f(netj) [1 – f(netj)]    (15.23)

15.4.2 The backpropagation algorithm

The first stage is to initialize the weights to small random values. Training continues until the change in the absolute value of the averaged squared error falls within some tolerance between one epoch and the next epoch. For example, a tolerance of 0.01 means that the averaged squared error must not change by more than ±0.01 between successive epochs. If a network meets the tolerance during training it is said to have converged. An alternative way to judge the end of training is to insist that each target output for each training sample be within some specified tolerance. To reduce the likelihood of the weight changes oscillating, a momentum term, α, is introduced that adds in a proportion of the previous weight change. So, the weight change for sample m + 1 is dependent on the weight change for sample m:

∆wij(m + 1) = η(δj oi) + α∆wij(m)    (15.24)

The backpropagation algorithm is summarized in Figure 15.20.

Example 3 • • • • •

Figure 15.21 illustrates a single forward and backward sweep through a 2–2–1 network with input < 0.1, 0.9 >. The learning rate is 0.8 and the momentum 0. The weights are:
–2 3–2 23 –2 3–4 21
 2 2  3 –2 –2   
where each matrix represents a layer of weights. The last row is the bias weights.
•••••

Step 1. Read the first input sample and its associated output sample. CONVERGE ← TRUE.

Step 2. For the input layer – assign as net input to each node its corresponding element in the input vector. The output for each node is its net input.

Step 3. For the first hidden layer nodes – calculate the net input and output:

    netj ← w0 + Σ_{i=1}^{n} xi wij,    oj ← 1 / (1 + exp(–netj))

    Repeat step 3 for all subsequent hidden layers.

Step 4. For the output layer nodes – calculate the net input and output:

    netj ← w0 + Σ_{i=1}^{n} xi wij,    oj ← 1 / (1 + exp(–netj))

Step 5. Is the difference between the target and output sample within tolerance? If no, CONVERGE ← FALSE.

Step 6. For each output node calculate its error:

    δj ← (tj – oj) oj (1 – oj)

Step 7. For the last hidden layer calculate the error for each node:

    δj ← oj (1 – oj) Σ_k δk wkj

    Repeat step 7 for all other hidden layers.

Step 8. For all layers update the weights for each node:

    ∆wij(m + 1) ← η(δj oi) + α∆wij(m)

If the last sample has not yet been presented, read the next input sample and its associated output sample and return to step 2; otherwise, if CONVERGE = TRUE then STOP, else return to step 1.

Figure 15.20 The backpropagation algorithm. The index k refers to a previous layer in the reverse sense (i.e. when moving back through the net).
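The steps of Figure 15.20 can be turned directly into a short program. The Python sketch below is purely illustrative (the book gives no code, and the 2–2–1 architecture, learning rate, momentum and tolerance are choices made here for the demonstration): it trains a small network on the XOR data of Table 15.1, using pattern updating, the logistic activation and the weight-change rule of eqn (15.24). With these settings the network usually converges, although, like any backpropagation run, it can occasionally become stuck in a local minimum, in which case a different random seed is needed.

import math
import random

random.seed(1)

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

class Backprop221:
    """A 2-2-1 feed-forward network trained as in Figure 15.20 (pattern updating)."""

    def __init__(self, eta=0.5, alpha=0.7):
        self.eta, self.alpha = eta, alpha                    # learning rate and momentum
        # hidden[j][i]: weight into hidden node j from input i; index 0 is the bias weight
        self.hidden = [[random.uniform(-0.7, 0.7) for _ in range(3)] for _ in range(2)]
        self.output = [random.uniform(-0.7, 0.7) for _ in range(3)]
        self.d_hidden = [[0.0] * 3 for _ in range(2)]        # previous changes, for momentum
        self.d_output = [0.0] * 3

    def forward(self, x):
        xb = [1.0] + list(x)                                 # bias input fixed at 1
        h = [logistic(sum(w * v for w, v in zip(ws, xb))) for ws in self.hidden]
        o = logistic(sum(w * v for w, v in zip(self.output, [1.0] + h)))
        return h, o

    def train_pattern(self, x, t):
        h, o = self.forward(x)
        delta_o = (t - o) * o * (1.0 - o)                    # step 6
        delta_h = [h[j] * (1.0 - h[j]) * delta_o * self.output[j + 1]   # step 7
                   for j in range(2)]
        xb, hb = [1.0] + list(x), [1.0] + h
        for i in range(3):                                   # step 8, eqn (15.24)
            self.d_output[i] = self.eta * delta_o * hb[i] + self.alpha * self.d_output[i]
            self.output[i] += self.d_output[i]
        for j in range(2):
            for i in range(3):
                self.d_hidden[j][i] = (self.eta * delta_h[j] * xb[i]
                                       + self.alpha * self.d_hidden[j][i])
                self.hidden[j][i] += self.d_hidden[j][i]
        return abs(t - o)

# XOR training data with 0.1/0.9 targets, as the text suggests for the logistic function.
samples = [((0, 0), 0.1), ((0, 1), 0.9), ((1, 0), 0.9), ((1, 1), 0.1)]
net = Backprop221()
for epoch in range(20000):
    if max(net.train_pattern(x, t) for x, t in samples) < 0.1:   # every output within tolerance
        break
for x, _ in samples:
    print(x, round(net.forward(x)[1], 2))                    # close to the targets once converged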

Figure 15.21 Illustrating a forward and backward sweep of the backpropagation algorithm. In the forward pass the input < 0.1, 0.9 > produces an output activation of 0.288; the target value is 0.9, so the error for the output node is (0.900 – 0.288) × 0.288 × (1 – 0.288) = 0.125. In the backward sweep the error from the output layer is multiplied by the connecting weights and passed back: a hidden node with an output of 0.122 and a weighted error sum of 0.375, for example, has an actual error of 0.122 × (1 – 0.122) × 0.375.
15.4.3 Some practical considerations
The successful application of a neural network usually demands much experimentation. There are a number of parameters to be set that can affect how easy it is to find a solution. For feed-forward networks, the number of hidden nodes and the number of layers can be varied. Certainly, an important factor is the training data, and a good deal of attention must be paid to the test data so that we can be sure the network will generalize correctly on data it has not been trained on.
Indeed, it is possible to train a network on data with complete success only to find that it fails when processing new data. For any new application, there is no prescribed way of finding a solution. Sometimes the problem appears intractable, but failure to find a solution does not mean that the application is not amenable to a neural solution. Although applying neural networks involves a 'try and see' approach, the requirement for knowledge of the application domain and knowledge of neural networks should not be underestimated.
Theory tells us that a network with a single hidden layer can represent any continuous mapping of the form:
y = f(x)
Since this form of mapping is the basis for many real-world problems, is there a need for more than one hidden layer? Well, a network with two hidden layers can sometimes be easier to train, and a variant of the feed-forward net known as an auto-associator may use multiple hidden layers to perform data compression.
The learning algorithm as it has been presented uses pattern updating of the weights, which simply means that the weights are updated after each sample presentation. It can sometimes prove quicker to train a network using batch updating, in which the error for a node is accumulated for one epoch before adapting the nodeʼs weights. However, it can be beneficial to randomize the order of presentation for each training sample between epochs, in which case pattern updating is required. Usually, it is best to experiment with different approaches.
Generalization
If a network can produce the correct output for the majority of input samples in the test data set, the network is said to generalize well. It is assumed that the test data set was not used during training.
If a network is trained well to produce a smooth non-linear mapping, it should be able to interpolate to new samples that are similar but not exactly the same as those samples used for training. A non-smooth mapping results when the network is overtrained. In this situation, the network will work more like a memory, looking up an output for a certain input from the training set.
Good generalization is dependent on the training set and network architecture. The training data set must be representative of the problem being tackled, but the number of hidden nodes is also an important factor. If there are more hidden nodes than are required to learn the input–output relationship, there will be more weights than necessary and overfitting of the data can result if training continues for too long. Sometimes a subset of the training data is used to act as an interim test for detecting overtraining.
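One common way of acting on such an interim test is early stopping: after each epoch the error on the held-out subset is measured, the best weights so far are remembered, and training stops once that error has not improved for some number of epochs. The Python sketch below shows only the bookkeeping involved; train_one_epoch and validation_error are stand-in functions invented for this illustration, not anything defined in the book.

import random

def train_one_epoch(weights):
    # stand-in for one epoch of backpropagation over the training subset
    return [w + random.uniform(-0.01, 0.01) for w in weights]

def validation_error(weights):
    # stand-in for the squared error measured on the held-out subset
    return sum(w * w for w in weights)

weights = [random.uniform(-0.7, 0.7) for _ in range(10)]
best_weights, best_error, patience = list(weights), validation_error(weights), 0
for epoch in range(200):
    weights = train_one_epoch(weights)
    err = validation_error(weights)
    if err < best_error:
        best_weights, best_error, patience = list(weights), err, 0
    else:
        patience += 1
        if patience >= 20:     # no improvement for 20 epochs: assume overtraining has begun
            break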
The application of neural networks is an experimental approach to engineering. General guidelines sometimes exist for network design. Haykin (1998) uses a result derived by Baum and Haussler (1989) to give a guideline as to the size of the training data set. The guideline states that:
N > W/ε
where N is the number of training examples, W is the number of network weights and ε is the percentage of allowed errors in testing. So, for a 10% error the number of training examples should be 10 times the number of weights.
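As a rough illustration (a Python calculation invented here, not taken from the book), take the 63–6–9 network of the next section and count one bias weight for every hidden and output node: that gives 63 × 6 + 6 + 6 × 9 + 9 = 447 weights, so with ε = 0.1 the guideline asks for more than about 4470 training examples.

# Baum-Haussler guideline N > W / eps (as quoted by Haykin, 1998).
def examples_needed(n_in, n_hidden, n_out, eps):
    # one bias weight per hidden and per output node is assumed in this sketch
    n_weights = n_in * n_hidden + n_hidden + n_hidden * n_out + n_out
    return n_weights / eps

print(round(examples_needed(63, 6, 9, 0.1)))   # 4470 for the 63-6-9 network of section 15.5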

15.5 An example of character classification
The task is to classify the digits 0 through to 9. There are 10 classes and so each target vector could be a 1-in-10 vector. For example, the target for the character 2 could be < 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 >, which indicates that the third output node should be on and all others off.
In the case study presented here, the target values are represented by 1-in-9 vectors for the numerals 1 to 9, with zero being represented by having all output nodes set to 0. The numerals to be learned are shown in Figure 15.22. Each numeral is represented by a 9 × 7 grid, with a grey pixel being 0 and a black pixel being 1.
A network architecture of 63–6–9 is chosen: (9 × 7) input nodes, one for each pixel, six hidden nodes and nine output nodes for the target vectors. The pixels are mapped into the input layer as shown in Figure 15.23.
The network was trained for 600 epochs using a learning rate of 0.3 and a momentum of 0.7. An output node was considered to be on if its activation was greater than 0.9 and off if its activation was less than 0.1.
The network trained successfully. It was then tested on the data in Figure 15.24.
Figure 15.22 Training data.
Figure 15.23 The grid is mapped to the input layer by treating it as a long string of bits that are set to 0 or 1. The bits are mapped by starting at the top left-hand corner, working down the grid for the first column and then repeating for each other column.
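In code, the mapping of Figure 15.23 amounts to reading the grid column by column (a Python illustration; the grid values and helper name are assumptions of this sketch):

# Map a 9 x 7 grid to a 63-element input vector: start at the top left-hand corner,
# work down the first column, then repeat for each other column (Figure 15.23).
def grid_to_input(grid, rows=9, cols=7):
    return [grid[r][c] for c in range(cols) for r in range(rows)]

blank = [[0] * 7 for _ in range(9)]          # a grid of grey (0) pixels as a trivial example
assert len(grid_to_input(blank)) == 63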

Figure 15.24 Noisy test data.
Each of the numbers has one or more bits missing. All the test numbers were correctly identified apart from 8. For that sample the sixth output node had an activation of 0.53 and the eighth node an activation of 0.41: the network was confused between the numerals 6 and 8, and was not confident in either.
Summary
This chapter has looked at supervised learning, in which the object of learning is to map an input vector to a target vector. For supervised learning it is necessary to have:
• data that have a known classification;
• sufficient data that represent all aspects of the problem being solved;
• sufficient data to allow testing (the test data set and the training data set should be disjoint, i.e. test data are not used during training).

The backpropagation algorithm is in wide use:
• It learns by adapting its weights using the generalized delta rule, which attempts to minimize the squared error between the desired network output and the actual network output.
• During learning, it continually cycles through the data until the error is at a low enough value for the task to be considered solved. Even when the squared error is low, it is important to check that individual training samples are all correctly classified.
• Following training, the networkʼs weights are fixed and the network can then be used to classify unseen data.
• The knowledge of the solved task resides in the network's weights.
• A single hidden layer of nodes is theoretically sufficient for a task to be learned, but in practice more than one hidden layer could lead to better performance.
• Backpropagation networks can take a long time to train.
• Generalization is a measure of how well the network performs on data not seen during training. The number of hidden nodes and the length of training can have a significant effect on generalization.

Further reading
The original exposition of the backpropagation algorithm by Rumelhart et al. (1986a) is still worth studying. The later paper by Werbos (1990) also describes the theory behind backpropagation and a variation with recurrent connections. There are many variations of the backpropagation algorithm to speed up training and to help the network avoid the problem of local minima, which can result in the network failing to learn. The keen reader is referred to Haykin (1998) and to Masters (1995), who provides practical insight into solving these issues using the C++ language. For a statistical treatment of neural networks, Bishop (1995) provides good coverage.
Exercises




1 Sketch a neural network along with weight values that will model the line:

   y = –0.3x + 0.7

2 The figure below shows a backpropagation network that is currently processing the training vector [1.0 0.9 0.9]; the associated target vector is [0.1 0.9 0.1].
   (The figure shows the network, with units labelled a, b and c; the printed weights are –0.3, 0.9, –0.6, –0.1, 0.4, 1.2, –0.8 and –0.3.)
Given that the output from unit b is 0.6 and from c is 0.8, and assuming that the logistic function is the activation function:
a Calculate the actual output vector.
b Calculate the error for each output unit.
c Calculate the error for each hidden unit.
d Calculate the weight changes for the weights connecting from unit a. Use a learning rate of 0.25 and momentum of zero.
3 Given that the points {(–1, 1), (–1, –1), (1, –1)} belong to class A and that {(–2, –2), (1, 1), (2, 2), (4, 1)} belong to class B:

a Show that the classes are not linearly separable.

b Assuming a network with units that have outputs according to:

   output = 1 if total input ≥ 0
            0 if total input < 0

   show that the first layer of weights W1 in a three-layer network will transform the problem into a linear one. The first row in W1 defines the bias weights.

   W1 =  1  –6
        –2  –2
             4

c Derive a second layer of weights so that the network will correctly classify all the patterns. Assume a single output unit.

4 The points {(4, –1), (8, –2), (1, 1), (3, 6)} belong to class A, and the points {(–8, 4), (–2, –3), (–1, –1), (2, –9)} belong to class B. Derive a minimal network to classify these points correctly.

5 Derive a suitable feed-forward network that models the logical AND.

6 The bipolar sigmoid is another activation function that is commonly used with a backpropagation network. The range of the function is (–1, 1) and it is defined as:

   f(x) = 2 / (1 + exp(–x)) – 1

   The derivative can be expressed as:

   f′(x) = 1/2 [1 + f(x)] [1 – f(x)]

   Give the error-correcting rule for both an output and a hidden unit.
marginalization, 122 sepset, 121 serial connection, 117 stochastic sampling, 150–2 variable, 106 BDI see belief desire intention architecture belief, in Dempster–Shafer theory, 164–5 belief desire intention architecture, 439–40 belief network see Bayesian networks Bellman, R., 284 Bernard, D., 473 best-first search, 57 bigram model, 391 Bishop, C., 310, 325 blending method for genetic operator, 343 blind search, 50–5 blind sight, 445 Blum, A., 194, 224 Blythe, J., 224 Boden, M., 451 bottom-up hypothesis derivation, 256 bottom-up parser, 355–6 bound variable in FOPC, 40–1 unification, 82 boundary extraction, 424–8 branching factor, 101 Bratko, I., 101, 491 breadth-first search, 50–2 Breiman, L., 251 Brooks, R., 16, 440, 473 Brown, C., 432 Buntine, W., 152 camera calibration, 413 Cameron-Jones, R., 267 candidate elimination algorithm, 234–7 Carpenter, G., 329 case frame, for representing sentences, 382 Cassandra, A., 224 centre of curvature, 418 chain rule, 113 Chalmers, D., 445, 450, 451 Charniak, E., 394 chart parser, 367–70 Chinchor, N., 458 Chinese room, 442–4 chord, in a graph, 135 chromosome, 331 Clark, A., 16, 451 class, object-oriented, 13 classification learning, concept of, 231–2 clause, in natural language, 349 Clocksin, W., 101, 491 closed-world assumption, 90, 195 of Prolog, 475 Clowes, M., 432 cluster tree, construction of, 133–44 cluster tree of a Bayesian network, 121 clustering with neural networks, 312–25 CNF see conjunctive normal form Cohen, W., 251 Cole, R., 394 Collins, A., 16 common sense, 350, 467–9 commutativity, in logic, 26 competing needs, mutex relations, 196 complementary pairs, in resolution, 75 complete, in logic, 32, 33 complete rule in chart parser, 368 completeness in search, 47 in ILP, 254 compound proposition, 19 concept learning, 233–7 conclusion, in logic, 18 conditional dependence, 111–13 conditional effects, in planning, 208–11 conflict resolution, in production system, 11 conjunction, in logic, 20 conjunctive normal form, 73–4, 77–9 connectionism, 9 connectionist, 8 connectionist networks, 286 connective, in natural language, 351 connectives, in logic, 19–22 consciousness, 444–6 consequent in logic, 21 of a rule, 11 consistency, in ILP, 254 consistent, in logic, 33 in concept learning, 233 consistent junction tree, 141 consistent link, in junction tree, 133 constant, in predicate logic, 34 constraint satisfaction, 66–8 context analysis, in natural language processing, 363–4, 387–90 context-dependent planning, 208 context knowledge in natural language, 350 context-free grammar, 353 contradiction, 23 converged, backpropagation training, 304 converging connection, in a Bayesian network, 115–16 convolution, 416 Copeland, J., 16, 442, 444, 451 corpora, in text processing, 391 correspondence, 414 cost, in search, 49 cost function, 57–60 cost, of a sepset, 138 cover, in concept learning, 233 critics, in planning, 222 cross-validation, 466 502 Artificial Intelligence Crystal, D., 394 cueing, a script, 389 Currie, K., 224, 473 curse of dimensionality, 229 curvature, 418 cut, in Prolog, 485–6 CVOnline, 432 Cyc, 468–9 Cytyc, 463 D–S theory see Dempster– Shafer theory DAG see directed acyclic graph Darwiche, A., 152 data mining, 463–7 data-driven search, 100 Davis, R., 15, 16, 451 de Kleer, J., 170 De Morgan’s laws, equivalence of quantifiers, 24, 38 De Raedt, L., 267 decision networks, 174–86 decision tree learning, 241–51 declarative language, 88 declarative mood, 361 deduction, in logic, 28–32 Deep Space 1, 452–4 Deerwester, S., 16 defuzzification, 162 
DeJong, K., 338 delete edge, in planning, 194 delete list, in planning, 190 delta rule, 294 Dempster, A., 170, 324 Dempster–Shafer theory, 163–6 DENDRAL, 2–3 Dennett, D., 451 dependency-directed backtracking, 168 depth-first search, 50, 52 derived class, object-oriented, 14 Derthick, M., 451 designation of symbols, 449 desJardins, M., 224 determiner, 351 deterministic outcome, 105 deterministic planner, 212, 214 dictionary, 349 directed acyclic graph, 93 Discern Communication, 461 discounting, 272 discourse anaphora, 388 discrete state, in Bayesian networks, 106 disjunction, logic connective, 20 disjunctive normal form, 73 distributivity, in logic, 26 diverging connection, in a Bayesian network, 116 DNF see disjunctive normal form domain, in predicate logic, 35 double implication see if and only if Doyle, J., 170 Dreyfus, H., 451 d-separation, 115–17 Duce, D., 16 Duda, R., 322, 432 Earley, J., 394 edge, 418 edge detection, 418–21 effect, in planning, 190 eigenfaces, 462 eight-tile puzzle, 47–8 eliminating, a node in a graph, 135–6 ELIZA, 446 Elkan, C., 170 Elman, J., 16 EM see expectation maximization entailment, in logic, 33 entropy, 244 epipole, 414 episode, 272 epistemology, 442 epoch, in reinforcement learning, 272 epoch, in backpropagation learning, 302 equivalence, in logic, 24–5, 26 Erol, K., 222, 224 error, in learning, 228 Essa, I., 432 Euclidean squared distance, 312–3 evidence entering in Bayesian network, 145–9 in Bayesian networks, 107 exclusive or, logical, 20 existential quantifier, 36 expectation maximization, 324–5 expert systems, 2–4, 9 explicit representation, 231 extended rule, in chart parser, 368–9 extensional meaning, in FOPC, 43 Eysenck, M., 451 factored expansion, in planning, 210–11 fail, in Prolog, 486–7 false-positive diagnosis, 181 falsum, in logic, 75 FASTUS, 459–61 Fausett, L., 329 features, used in grammars, 370–7 feed-forward network, 286 Feigenbaum, E., 2 Fikes, R., 189, 206 fill-ins, in moral graph, 134 filter, in vision, 416 finding, in Bayesian networks, 107 first-order predicate calculus, 34 see also predicate calculus Flach, P., 267 focal element, in Dempster– Shafer theory, 164 focal length, 411 Fodor, J., 450, 451 FOIL, 257–61 FOPC see first-order predicate calculus fork junction, 418 FORMS, 432 formula, in logic, 32 Forsyth, D., 432 forward chaining, 100–1 Fourier transform, 398 frame, in speech processing, 398 frame representation, 12–5 frame of discernment, in Dempster–Shafer theory, 163 frame problem, 203–6 Index 503 Fredman, M., 69 free variable, in FOPC, 40–1 frequency, 397 Friedman, N., 152 function optimization, in search, 60–6 functions, in FOPC, 38–9 fundamental rule, 111–2 Furst, M., 194 fuzzification, 157 fuzzy inferencing, 157–63 fuzzy logic, 154–63 and connective, 158 fuzzification, 157 inferencing, 157–63 linguistic value, 154 membership function, 155–6 or connective, 159 fuzzy membership, 155 fuzzy membership function, 155–6 fuzzy set, 154 fuzzy value, 154 fuzzy variable, 154 Gabbay, D., 170 GABIL, 338 gain, as used in FOIL, 260 gain, information used in ID3, 244–5 Gaussian function, 228, 229 Gazdar, G., 394 Geman, D., 152 Geman, S., 152 gene, 331 general boundary set, 234 generalization in learning, 248 in neural networks, 307 generalization operators, in learning, 233 genetic algorithms, 330–44 genetic operator, 330, 332 genetic programming, 343 Gent, I., 69 Georgeff, M., 438, 439 Ghallab, M., 224 Gibbs sampling, 151–2 global minimum, in search, 64 goal, in search, 46 goal description, in planning, 190 goal state, in search, 49 
goal-driven search, 100 Goldberg, D., 344 Golumbic, N., 152 gradient descent, 63 gradient, in edge detection, 420 grammar, 349, 352 GraphPlan, 193–206 Gray codes, 341 greatest lower bound, 264 Grevier, D., 16, 451 Grossberg, S., 329 ground literal, 190 GSAT, 68 Hart, P., 322, 432 Haupt, R., 344 Haupt, S., 344 Haussler, D., 307 Haykin, S., 307, 310 head of phrase, 375 Hecht-Nielsen, R., 329, 468 Heckerman, D., 152 Henrion, M., 152 heuristic function, in search, 57 heuristic search, 55–60 hidden Markov model, 402–8 hidden node, 292 hierarchical task network, 219–23 hill-climbing search, 68 Hinton, G., 301, 451 history list, in NLP, 388 HNC, 468 Hobbs, J., 473 Holland, J., 341 Honda, 473 Horn, B., 432 Horn clause, 97 Hough transform, 425–6 Howard, R., 186 HTN see hierarchical task network Huang, C., 152 Huffman, D., 432 HUGIN, 473 Hugin propagation, 141–5 Hwee Tou Ng, 394 hypothesis learning, 227 IBM, 464 ID3 algorithm, 242–50 identity function, 290 if and only if, 19, 21–2 if ... then, 19, 21 ILP see inductive logic programming image plane, 411 implication see if ... then implicit parallelism, 342 inclusive or, in logic, 20 independent event, 113 induction, as inverse of deduction, 261 inductive bias, 239 inductive inference, 227 see also inductive logic programming inductive logic programming, 253–67 inexact inferencing in Bayesian networks, 149–502 inference engine, 9 information gain, 244–5 inheritance, object-oriented, 14 initial state, in search, 49 instantiation, object-oriented, 13 instrument, linguistic role in NLP, 382 intensional meaning, in FOPC, 43 interference, mutex relations, 196 interfering goal, 191 intermediate-logical form, 383–6 interpretation, in FOPC, 41–4 interpretation of symbols, 449 interrogative mood, 361 intervening action, 174, 177–81 intonation planning, 397 intransitive verb, 351 intrasentential anaphora, 388 introduced rule in chart parser, 368–9 inverting resolution, 261–2 iRobot, 473 irregular verb, 351 iterative-deepening search, 54–5 504 Artificial Intelligence Jennings, N., 437, 440 Jensen, F., 141, 152, 473 Johnson, D., 69 joint probability distribution, 111 Juang, B., 408 junction tree, 133 see also cluster tree Jurafsky, D., 392, 394, 408 Kambhampati, S., 224 Kautz, H., 206 Kay, M., 394 Keane, M., 451 Kelly, J., 101 kernel, of a filter, 416 Kinny, D., 438 Kirkpatrick, S., 66 Knight, K., 101 knowledge query and manipulation language, 438 knowledge-based planning, 219–23 system, 4 Koehler, J., 224 Kohonen, T., 314, 320, 329 Koller, D., 152 Kosko, B., 170 Kowalski, R., 101 Koza, J., 343 KQML see knowledge query and manipulation language L junction, in vision, 418 lambda calculus, 362 reduction, 362 Langford, J., 224 language bias, in ILP, 265–6 language of thought hypothesis, 449–50 Lauritzen, S., 152 law of contradiction, 26 learning, 225–344 as search, 233–7 associative, concept of, 226 backpropagation, 301–9 bias, 238–9 candidate elimination, 234–7 classification, concept of, 231–2 clustering, 312–25 concept, 233–7 decision tree, 241–51 expectation maximization, 324–5 FOIL, 257–61 generalization, 248 neural nets, 307 operators, 233 genetic algorithms, 330–44 ID3, 242–51 inductive logic programming, 253–67 introduction to, 226–40 neural networks, 286–329 overfitting, 248 Q-learning, 281–4 radial-basis functions, 325–8 reinforcement, 269–84 self-organizing feature map, 314–21 soundness and completeness, 32–3 specialization operator, 233 supervised, 229, 231 target function, 227, 229–31 temporal difference, 282 types of, 231–2 unsupervised, 229, 231 
validation data, 248 version space, 234 learning as search, 233–7 learning bias, 238–9 least general generalization, 264 least upper bound, 264 Lecun, Y., 432 Lee, K., 408 Lenat, D., 467, 468 levelled off graph, in GraphPlan, 203 lexicon, 349 Lindsay, R., 2 linear predictive coding, 399 linear resolution, 98 linguistic value, in fuzzy logic, 154 linguistics, 352 lists, in Prolog, 482–4 local minimum, in search, 64 location, linguistic role, 382 Loebner, H., 447 logic, 18–44 and connective, 20 antecedent, 21 arguments and validity, 25–8 assumptions, 28 atomic formula, 72 atomic proposition, 18 atomic sentence, 35 commutativity, 26 conjunctive normal form, 73–4, 77–9 connectives, 19–22 consequent, 21 deduction, 28–32 disjunction, 20 disjunctive normal form, 73 distributivity, 26 entailment, 33 equivalence, 24–5, 26 FOPC, 33–44 formula, 32 fuzzy see fuzzy logic if and only if, 19, 21–2 if ... then, 19, 21 inductive programming see inductive logic programming premise, 18 Prolog, see Prolog propositional, 18–33 reasoning, automation of, 72–101 refutation proof, 27 resolution, FOPC, 77–88 resolution, propositional, 72–7 semantics, 19 skolemization, 79–82 soundness and completeness, 32–3 syntax, 19 truth tables, 19, 22–4 unification, 82–5 valid argument, 25–8 well-formed formulas, 39–41 logic connectives, 19–22 Index 505 logic operators see logic connectives logical and, 19, 20 logical if and only if, 19, 21–2 logical if ... then, 19, 21 logical not, 19, 21 logical or, 19, 20 logistic function, 291 LOTH see language of thought hypothesis LPC see linear predictive coding LUNAR, 394 Lynch, M., 467 McAllester, D., 170 McCarthy, J., 6, 451 McDermott, D., 170 McGeoch, L., 69 Manning, D., 394 marginalization, 122 Markov assumption, 214 Markov decision process, 214 Markov model, 214 Markov property, in reinforcement learning, 273 Marr, D., 432 Martin, J., 392, 394, 408 mass, of a sepset, 138 Masters, T., 310 Matheson, J., 186 MDP see Markov decision process Mellish, C., 101, 394, 491 memoization, in GraphPlan, 203 message passing, in junction tree, 141, 144 method, object oriented, 13 method, in planning, 222 Metropolis, N., 64, 69 Michell, M., 344 Microsoft, 464, 473 Miikkulainen, R., 394 Minker, J., 170 Minsky, M., 6, 12, 13, 16, 451 missionaries and cannibals, 50 Mitchell, T., 233, 234, 239, 240 Mlnet, 240 modal operator, in non- monotonic inferencing, 167 modal operator, in NLP, 384 model, in logic, 32 momentum in backpropagation learning, 304 monotonic reasoning, 166 moral graph, 134 morpheme, 350 morphological knowledge, 350 most general unifier, 83 MUC-5, 458 Muggleton, S., 255, 266, 267 multiagent system, 437 mutate, 330, 332 mutex relations, 193, 196 MYCIN, 3 NASA, 473 natural language processing, 348–93 bottom-up parser, 355–6 chart parser, 367–70 context analysis, 363–4, 387–90 features, in grammars, 370–7 parts of speech, 351–2 in Prolog, 487–91 phrase, 349 semantic analysis, 357–63, 377–83 speech, 396–408 stages of, 349–50 statistical approaches, 390–3 structure analysis, 352–6 top-down parser, 355, 356 with Prolog, 487–91 Ndumu, D., 437, 438 negation, logical, 21 negative literal, 72 neural networks, 8, 286–329 activation functions, versions of, 289–91 backpropagation, 301–9 basic description, 287–94 clustering, 312–25 linear and non-linear problems, 298–301 radial-basis functions, 325–8 self-organizing feature map, 314–21 Neural Technologies, 468 Neurodynamics, 467 neuron, 287 Newell, A., 206 N-gram model, 391 Nilsson, N., 44, 69, 189, 192, 206 NLP see natural language processing node, 
noise, in an image, 419
non-intervening action, 174, 175–6
non-monotonic reasoning, 166–9
non-terminal node, in a tree, 353
no-op operator, 194
Norsys, 473
noun, 351
noun phrase, 351
Nwana, H., 437
object, linguistic role, 382
object, object-oriented, 13
object, in natural language, 349, 351
objective function for genetic algorithm, 330, 332
object-oriented concepts, 13
OLAP, 466
ontology, 442
operator, in search, 49
operator, in planning, 190
O-plan, 224, 454
optical axis, 411
optimality, in search, 47
or, fuzzy variables, 159
Oracle, 464
orthographic projection, 412
other minds response, 444
overfitting, in learning, 248
overtrained neural network, 307
parallel processing, with genetic algorithms, 343
parent, object-oriented, 14
parse tree, 355
parser, 353
parsing, 353–6, 367–70
  bottom-up, 355–6
  chart, 367–70
  top-down, 355, 356
  shift-reduce, 355–6
partially observable Markov decision process, 216–17, 224
partially ordered plan, 222
part-of-speech tagging, 392
Passino, K., 170
pattern updating, 307
PCA see principal component analysis
PDDL see planning domain definition language
Pearl, J., 152
Pednault, E., 224
PEGASUS, 462
Penn treebank, 392
Pentland, A., 432, 462
Peot, M., 186
Pereira, F., 394
perfect tense, 352
perfect progressive tenses, 352
perspective projection, 411–12
philosophy, 441–51
  Chinese room, 442–4
  strong and weak AI, 442–4
  thinking machines, 444–6
  thought as language, 448–51
  Turing test, 446–8
  what is it?, 441–2
phone, 397
phoneme, 397
phonetic knowledge, 350
phrase, in natural language, 349
physical symbol system hypothesis, 6
pinhole camera, 411
pixel, 409, 410
plan recognition, 390
planning, 189–224
  action description, 190
  as search, 191–3
  conditional effects, 208–11
  context dependent, 208
  critics, 222
  goal description, 190
  effect, 190
  GraphPlan, 193–206
  HTN, 219–23
  knowledge-based methods, 219–23
  method, 222
  operator, 190
  O-plan, 224, 454
  possible worlds, 218
  precondition, 190
  primitive task, 219
  regression, 192
  replanning, 223
  state description, 190
  STRIPS, 189–91
  task, 219
  with uncertainty, 211–19
planning domain definition language, 224
plausibility, in Dempster–Shafer theory, 164–5
Plotkin, G., 262
Poggio, T., 432
policy, 271
policy in Markov models, 215
policy iteration, 280–1
Pollack, J., 451
POMDP see partially observable Markov decision process
Ponce, J., 432
population of hypotheses, 331
positive literal, 72
possible worlds, in planning, 218
posterior probability, 111
power set, 163, 238
precedence, of logic symbols, 26
precondition edge, 194
precondition, in planning, 190
predicate, 34
predicate calculus, 33–44
preference bias, in ILP, 266
preferences between quantifiers, 387
premise, in logic, 18
prenex normal form, 78
preposition, 351
primitive task, in planning, 219
principal component analysis, 429–31
prior probability, 111
probabilistic network see Bayesian networks
probability distribution, 110
probability mass, in Dempster–Shafer theory, 163
probability theory, 109–13
ProDAPS, 470–2
production system, 11
productivity, 450
Progol, 255, 266
progressive tenses, 351
Prolog, 88–101, 474–91
  arithmetic, 480–2
  assert statement, 475
  cut, 485–6
  fail, 486–7
  lists, 482–4
  NLP, 487–91
  recursion, 480
pronoun, 351
proposition, 18
proposition nodes, 194
propositional logic, 18–33
propositional satisfiability, 66
pruning decision tree, 248–9
Pylyshyn, Z., 450, 451
Q-learning, 281–4
quantification, in natural language, 383–6
quantifier, in natural language, 351
quantization, 398
Quillian, R., 16
Quinlan, J., 242, 251, 257, 267
Rabiner, L., 408
Radcliffe, N., 343
radial basis function network, 325–8
random variable, 110
Rao, A., 438, 439
reactive architecture, 440
reasoning
  abductive, 105, 167
  automation of, 72–101
  monotonic, 166
  non-monotonic, 166–9
  see also Bayesian networks; logic, deduction and resolution
recursion, in Prolog, 480
refutation proof, in logic, 27
regression planning, 192
regular verb, 351
reinforcement learning, 269–84
remote agent experiment, 452
replanning, 223
representation, 10–16
resolution, in logic inferencing, 74–88
resolution, of an image, 411
resolution deduction, 76
resolution in FOPC, 77–88
resolution in propositional logic, 74–7
resolution refutation, 75
return, 272
reward, 269, 271
rewrite rule, 352
Ringland, G., 16
Roberts, L., 432
Robinson, J., 101
robot response, 444
robotics, 473
roles, linguistic, 382
Rosenberg, C., 408
roulette wheel selection, 336
Rowley, H., 432
rules of deduction, 28–32
Rumelhart, D., 301, 302, 310, 394
St. John, 451
sample space, 109
sampling, 397
SAT, 66
satisfaction, in FOPC, 43–4
satisfy
  in logic, 33
  in concept learning, 233
Schank, R., 16, 394
schema theorem, 341–2
schemata, 341
Schütze, H., 394
scoping, in FOPC, 40–1
scorecard, 468
script, 389
SDR-4X, 473
search, 46–69
  A*, 57–60
  as function optimization, 60–6
  backward, 192
  backward chaining, 100–1
  best-first, 57
  blind, 50–5
  breadth-first, 50–2
  completeness, 47
  constraint satisfaction, 66–8
  cost, 49, 57–60
  data-driven, 100
  depth-first, 50, 52
  forward chaining, 100–1
  function optimization, 60–6
  global minimum, 64
  goal, 46, 49
  goal-driven, 100
  heuristic, 55–60
  hill-climbing, 68
  iterative deepening, 54–5
  local minimum, 64
  operator, 49
  optimality, 47
  state, 49
  simulated annealing, 63–6
Searle, J., 438, 442
segmentation, in speech, 407
segmentation, in vision, 422–4
segmentation, through thresholding, 423–4
Sejnowski, T., 408
selectional restriction, 388
self-organizing feature map, 314–21
Selman, B., 68, 69, 206
semantic
  analysis, 357–63
  features, 361
  knowledge, 349–50
  network, 11–12
semantics, in logic, 19
semantics, in NLP, 357–62, 377–83
Sensory GraphPlan, 217–19
sepset, in Bayesian networks, 121
serial connection, in Bayesian network, 117
SGP see Sensory GraphPlan
Shachter, R., 186
Shafer, G., 170
shape description, 418
Shenoy, P., 186
Shieber, S., 394
shift-reduce parser, 355–6
SHRDLU, 394
sigmoid function, 291
Simon, H., 206
simple tenses, 351
simulated annealing, 63–6
single-point crossover, 332, 337
singly connected DAG, 134
singular value decomposition, 326
SIPE, 224
skolem function, 80
skolemization, 79–82
Sloman, A., 441, 445, 451
SmartAirport, 458
Smets, P., 170
Smiths Aerospace, 469
Smolensky, P., 451
Sobel operator, 421
SOFM see self-organizing feature map
Sony, 473
sound logic, 32
specialization, object-oriented, 14
specialization operator, in learning, 233
specific boundary set, 234
spectrum, 398
speech processing, 396–408
  frame, 398
  hidden Markov model, 402–8
  recognition, 401–8
  segmentation, 407
  signal processing, 397–401
spherical projection, 412
Spiegelhalter, D., 152
Spirtes, P., 152
Spivey, M., 101
state, in search, 49
state description, in planning, 190
state space, in search, 49
statistical approaches, to natural language processing, 390–3
stereo imaging, 414
stochastic outcome, 105
stochastic sampling, inexact inferencing in Bayesian networks, 150–2
Stolcke, A., 408
strength of a weight, 288
STRIPS, 189–91
strong AI, 442–4
subclass, object-oriented, 14
subject, in natural language, 349, 351
θ-subsumption, 261–3
subsumption architecture, 440
subtype, object-oriented, 14
Sugihara, K., 432
Sundheim, B., 458
supervised learning, 229, 231
surface normal, 418
Sussman anomaly, 191
Sutton, R., 284
SVD see singular value decomposition, 326
symbol system hypothesis, 449
symbolism, 9
syntactic knowledge, 349
syntax, in logic, 19
system response, 444
systematicity, 450
T junction, in vision, 418
tagset, in NLP, 392
target function, in learning, 227, 229–31
task, planning, 219
Tate, A., 224, 473
tautology, 23
temporal difference learning, 282
tense, 351
term, in FOPC, 39
terminal node, in a tree, 353
test decision, 174, 181–3
texture, 416
threshold function, 290
TMS see truth maintenance system
top-down hypothesis derivation, 256
top-down parser, 355, 356
Touretzky, D., 451
transitive verb, 351
travelling salesman, 341
trellis, 404
triangulated graph, 135–8
triangulation, in image processing, 414
triangulation, of a graph, 136–8
trigram model, 391
Tripath Imaging, 463
truth maintenance system, 168–9
truth table, 19, 22–4
Turing test, 446
Turk, M., 432, 462
turnstile, logic symbol, 28
two-point crossover, 337
type, object-oriented, 14
types of learning, 231–2
UCPOP, 224
Ullman, J., 394
uncertainty, 103–70
  Bayesian networks, 104–52
  Dempster–Shafer theory, 163–6
  explanation of, 104–5
  fuzzy logic, 154–63
  in planning, 211–19
unification, 82–5
unifier, 83
uniform crossover, 337
unit clause, 77
universal quantifier, 35
unsupervised learning, 229, 231
utility, 174
utility value, 272
valid argument, in logic, 25–8
valid deduction, in logic, 18
validation data, in learning, 248
valuation function, in logic, 32
value iteration, 279–80
value of information, 183–5
variable, in Bayesian network, 106
verb, 349, 351
verb phrase, 351
version space, 234
vision, 409–32
  classifying objects, 428–31
  edge detection, 418–21
  filter, 416
  Hough transform, 425–6
  image, basic concepts, 410–13
  segmentation, 422–4
  stereo imaging, 414
  visual cues, 414–17
Visionic, 463
visual cues, 414–17
Viterbi algorithm, 407
Waibel, A., 408
Waldinger, R., 206
Walsh, T., 69
weak AI, 442–4
weakest precondition, 192
weight, in neural networks, 287–8
weight, of a cluster, 138
Weizenbaum, J., 446
Weld, D., 193, 195, 206, 217
well-formed formula, in FOPC, 39–41
Werbos, P., 301, 302, 310
Whitley, D., 344
Widrow–Hoff rule, 294
Wilkins, D., 224
Williams, R., 301
Williamson, J., 152
Winograd, T., 393
Winston, P., 16, 458
Wittgenstein, L., 445
Woods, W., 394
Wooldridge, M., 440
working memory, production system, 11
world knowledge, 350
Wos, L., 101
XOR definition, 292
Yuille, A., 432
Yurkovich, S., 170
Zadeh, L., 154, 170
Zelle, J., 394
Zhu, S., 432