Introduction to NLE
Natural Language Engineering
Informatics Data Science Group
Data Science Group (Informatics)
Introduction to NLE
Autumn 2015 1 / 34

About This Module
An introduction to concepts, tools and techniques in computational processing of natural language
You will learn about software technology that can be used to process textual data
The focus will be on applications of the technology to specific tasks

Module Aims/Learning Outcomes
On successful completion of this module you should be able to:
deploy generic NLP technologies to large quantities of realistic data.
design and run an empirical investigation that would establish whether or not there is scope to successfully deploy existing text processing technologies.
determine which language processing technologies would be effective in a given scenario.
build a prototype system that combines off-the-shelf technologies into a practical language processing system.

Teaching Methods
Lectures: Two hours of lectures a week covering key concepts, techniques and applications
Lab Classes: Lab classes give hands-on experience in constructing practical language processing software
Use Python and the Natural Language Toolkit (NLTK)
Study Direct: All teaching materials – lecture notes, lab materials, assessment details, etc. – available from the site.

Module Prerequisites
Does not assume knowledge of Python
— but does assume you can program and learn Python quickly
Explains any machine learning techniques used
Does not assume any knowledge of linguistics

Assessment
Assessed coursework
This is worth 50 percent of the total mark for the module
A report on work completed during the lab sessions: opinion extraction
Submission of preliminary report for feedback around middle of term
Submission of final report for formal assessment in Week 10
Unseen Exam
The exam will be in the January assessment period
Questions may be about any topic covered in the module
Further guidance about the exam and revision towards end of module

Recommended Reading
The main textbook uses NLTK to introduce the field of natural language processing:
Bird, S., Klein, E. and Loper, E. (2009) Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media. http://www.nltk.org/book_1ed/

Recommended Reading
Two further introductions to language processing, although the second one is now rather dated:
Jurafsky, D. and Martin, J. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall. (Second Edition)
Manning, C. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing, MIT Press.

Other Useful Reading
An introduction to the field of information retrieval:
Manning, C.D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval, Cambridge University Press.
An introduction to opinion mining:
Pang, B. and Lee, L. (2008) Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, Vol. 2, No. 1–2, pp. 1–135. (available online)

Syllabus Overview
Introduction to natural language engineering
Text documents
Text preprocessing/relating documents
Document classification
Classification scenarios and approaches
Performance and evaluation
Sequence labelling
Part-of-speech labelling / hidden Markov models
Syntactic analysis
Context-free grammar and parsing/statistical methods
Dependency parsing
Semantic analysis
Lexical meaning/distributional similarity
Information extraction and information retrieval
Example applications

A Note about Python and the NLTK
NLTK
A platform for building language processing applications in Python
Many built-in language resources and processing modules
Used by language processing novices and industry experts alike
Python
An interpreted, object-oriented programming language
Available on a wide range of platforms and used in a wide range of applications
Extensible – encourages code re-use
Free!

What is Natural Language Engineering?
Computer processing of natural language for practical applications
The academic discipline of natural language processing is also sometimes called computational linguistics
Technological / engineering approaches to natural language analysis are sometimes called text analytics or text mining
NLP is mostly concerned with written language, rather than spoken language (speech)

Why is this Topic Important?
Natural language is overwhelmingly the preferred human medium for information exchange
Vast amounts of machine-readable text produced and available online
too much for humans to read and deal with

Machine-Readable Text
online news
emails
weblogs (blogs)
microblogs (Twitter, Tumblr, instant messages)
wikis (Slashdot, Stack Exchange, etc.)
online encyclopaedias
online tutorials
scientific articles
product descriptions
product reviews
job adverts
etc.

Currently a Hot Topic
Lots of applications with significant potential value
Benefits from great strides made in field of machine learning
Automatic acquisition of knowledge from massive volume of electronically available text
Current compute, storage and network performance allows very large scale data processing

Language Processing Applications
Retrieve and rank documents on the web that are relevant to a set of query terms
— information retrieval
Translate web pages from one language to another
— machine translation
Provide answers to questions
— question answering
Find a document that contains the answer to a question
Simplify a document (e.g. so that a child can understand it)
Summarise today’s important news in 500 words

More Applications: Tracking News and Views
Monitor the views of tweeters (e.g. to predict the outcome of an election)
Monitor general mood of tweeters (e.g. to predict overall changes in stock market)
Monitor latest news about a company (e.g. to predict change in company’s stock value)
Monitor people’s views on a new initiative/product (to find out how effective an advertising campaign has been)
Monitor employee emails for evidence of illegality

Even More Applications: Finding and Filtering
Find recent scientific articles on genes associated with dementia
Find jobs that match the skills described in a CV
Identify people considered experts on some topic
Filter user-generated content for inappropriate language
Automatically anonymise documents
Target adverts shown on a web page based on content of pages you view

Yet More Applications: Discovery and Detection
Automatically discover new biomedical terminology
Detect new events being mentioned on Twitter
Provide real-time situational awareness during a riot using tweets sent by citizen journalists
Determine age/gender profile of people who are fans of Metallica
Determine which authors have similar writing styles or write about similar topics

From Applications to Tasks
The potential applications of language engineering are growing
There is a set of underlying generic tasks that arise with these applications
We will study (mostly machine learning) methods that address these underlying tasks
Applying these methods to particular applications often requires skilled researchers and developers

Text Analysis: generic modules
tokenisation: convert a string of characters into a sequence of tokens (words/punctuation)
segmentation: convert a sequence of tokens into a sequence of sentences
stemming: strip affixes (prefixes and suffixes) from tokens, a form of canonicalisation
lemmatisation: use morphological analysis to map tokens to their dictionary form (lemma)
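
The first three modules above can be sketched in a few lines of plain Python. This is a minimal illustration, not how NLTK implements them: the function names `tokenise`, `segment` and `stem` are hypothetical, the segmenter assumes every '.', '!' or '?' ends a sentence (so it will break on abbreviations), and the stemmer strips only four English suffixes.

```python
import re

def tokenise(text):
    """Split a string into word and punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def segment(tokens):
    """Group tokens into sentences, ending each at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences

def stem(token):
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenise("The cats chased mice. Dogs barked!")
sentences = segment(tokens)
stems = [stem(t.lower()) for t in tokens]
print(tokens)
print(sentences)
print(stems)
```

Note that the stemmer maps "chased" to "chas", which is not a word: stemming only aims to give related forms the same canonical string, whereas lemmatisation aims for the dictionary form.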

Text Analysis: generic modules (cont.)
part-of-speech tagging: label tokens with part-of-speech tags (noun/verb/adjective/preposition/determiner)
phrasal chunking: identify sub-sequences of tokens that form phrasal units, typically referring to entities and actions
syntactic analysis: hierarchically analyse sentences into components (subject/object/main verb/prepositional phrase)
semantic analysis: express literal meaning of sentence using a formal logical language

Text Analysis: generic modules (cont.)
named entity recognition: identify type of entity being referred to (person/institution/place/time)
reference resolution: link different references to the same entity
word sense disambiguation: use context to disambiguate the intended meaning of a word
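
As a rough illustration of word sense disambiguation, the sketch below picks the sense whose dictionary gloss shares the most words with the context, in the spirit of the simplified Lesk algorithm. The sense inventory, the glosses, the stopword list and the function name `disambiguate` are all made up for this example; a real system would draw senses from a lexical resource.

```python
# Hypothetical two-sense inventory for "bank"; the glosses are illustrative only.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits of money and makes loans",
        "river bank": "sloping land beside a river or lake",
    }
}

# Small hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "on", "at", "that"}

def disambiguate(word, context, inventory=SENSES, stopwords=STOPWORDS):
    """Return the sense of `word` whose gloss overlaps most with `context`."""
    context_words = set(context.lower().split()) - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in inventory[word].items():
        gloss_words = set(gloss.split()) - stopwords
        overlap = len(context_words & gloss_words)
        if overlap > best_overlap:  # ties keep the sense listed first
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "she sat on the bank of the river"))
print(disambiguate("bank", "he deposited money at the bank"))
```

The method relies on exact word overlap, so "deposited" does not match "deposits" here; stemming the context and glosses first would be one way to address that.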

Text Analysis: generic modules (cont.)
relation detection: identify and classify relationships between entities
event detection: identify, classify and order events in time
topic identification: identify words/phrases that convey the topic of a document
text similarity: measure relevance of a document to a query, or of one document to another document
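
One common way to measure text similarity is to represent each text as a bag of word counts and take the cosine of the angle between the two count vectors. The sketch below assumes that representation; the helper names `bag_of_words`, `cosine_similarity` and `rank`, and the three example documents, are invented for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lower-cased word tokens in a text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = bag_of_words(query)
    scored = [(cosine_similarity(q, bag_of_words(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "Parsing sentences with a context-free grammar",
    "Stock market news and company share prices",
    "Tagging words in sentences with part-of-speech labels",
]
for score, doc in rank("company stock news", documents):
    print(round(score, 3), doc)
```

Raw counts treat every word as equally informative; weighting schemes that discount very frequent words are a common refinement.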

Traditional Natural Language Processing Pipeline
raw text → tokenizer → tokenized text → syntactic analyser → parsed text → semantic analyser → meaning

Why is NLP Hard?
variation: many different ways of expressing the same thing
ambiguity: same bit of text can have many distinct interpretations
thousands of different languages, and even more dialects:
different languages are structured differently
diverse linguistic styles: formal/informal, technical/nontechnical, …
messiness: text will sometimes be ungrammatical / fragmentary / non-standard

Why is NLP Hard?
language use evolves: introduction of new terms and new meanings of existing terms
language is used in highly creative ways: literal interpretation often not useful
world knowledge needed to understand language: ultimately an AI-complete problem

Machine Learning in NLP
Building models from data
Huge amounts of data now available
Field of Machine Learning is about how to learn from data
Supervised vs. unsupervised approaches
Most of the best-performing NLP systems incorporate some form of machine learning

Data to Learn from
Text in which tokens are labelled with their part of speech (noun, verb etc.)
Text in which groups of tokens are labelled as forming a phrase
Text labelled with hierarchical phrase structure
Text in which noun phrases are marked (labelled) with class of entity being described
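
To show how part-of-speech-labelled text can be learnt from, here is a minimal sketch of a baseline tagger that assigns each word the tag it carried most often in the training data. The tiny tagged corpus, the tag names, the `train` and `tag` functions and the choice of NOUN as the default for unseen words are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences: lists of (token, tag) pairs.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("is", "VERB"), ("rough", "ADJ")],
    [("dogs", "NOUN"), ("bark", "VERB"), ("loudly", "ADV")],
]

def train(sentences):
    """Map each word to the tag it was most frequently labelled with."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(tokens, model, default="NOUN"):
    """Tag each token with its most frequent training tag, or `default` if unseen."""
    return [(t, model.get(t.lower(), default)) for t in tokens]

model = train(tagged_sentences)
print(tag(["the", "cat", "bark"], model))
```

The tagger ignores context, so "bark" is always tagged VERB even after "the"; sequence models such as hidden Markov models use neighbouring tags to resolve cases like this.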

Data to Learn from
Corpora in which documents are labelled
— positive, negative, neutral sentiment
— relevant, irrelevant to some topic
Parallel texts
— multiple texts saying same thing in different languages
— texts are aligned
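
For the labelled-document case, the sketch below trains a small Naive Bayes classifier with add-one smoothing on four sentiment-labelled texts. Naive Bayes is used here only as one familiar supervised method; the training documents, labels and function names are invented for illustration, and words never seen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled corpus: (document text, sentiment label).
labelled_docs = [
    ("great film loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("terrible film hated it", "negative"),
    ("boring story awful acting", "negative"),
]

def train(docs):
    """Collect label counts, per-label word counts and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so unseen (word, label) pairs get non-zero probability.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(labelled_docs)
print(classify("loved the great acting", model))
print(classify("awful boring film", model))
```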

Translating without Understanding
Reasonable machine translation is possible without understanding
Learn from examples that specific phrases in one language map to specific phrases in another language
No need to build representation of sentence meaning
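
The phrase-mapping idea can be illustrated with a toy phrase table and a greedy longest-match lookup. This is only a sketch: the table below is written by hand, whereas a real system would learn phrase pairs and their probabilities from aligned parallel text, and it would also reorder and score candidate translations. The `translate` function and the `max_len` parameter are hypothetical.

```python
# Hypothetical English-to-French phrase table (hand-written for illustration).
phrase_table = {
    ("good", "morning"): ["bonjour"],
    ("thank", "you"): ["merci"],
    ("very", "much"): ["beaucoup"],
    ("the", "cat"): ["le", "chat"],
    ("sleeps",): ["dort"],
}

def translate(tokens, table, max_len=3):
    """Translate left to right, always consuming the longest known phrase."""
    output, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                output.extend(table[phrase])
                i += n
                break
        else:
            output.append(tokens[i])  # unknown word: pass through unchanged
            i += 1
    return output

print(translate("thank you very much".split(), phrase_table))
print(translate("the cat sleeps".split(), phrase_table))
```

At no point does the code build a representation of what the sentence means; it only matches surface phrases.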

Question Answering without Understanding
Answers to questions can sometimes be found by keyword search and matching patterns of words
Transform questions into phrases that might be part of the answer, then look for these phrases
No need to try and work out meaning of question
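
A minimal sketch of this pattern-matching approach: a question of the form "Who VERB X?" is rewritten as the phrase "ANSWER VERB X", and the text is searched for that phrase with a run of capitalised words in the answer slot. The function names, the single question template and the example text are assumptions for illustration; real systems use many such templates.

```python
import re

def question_to_pattern(question):
    """Turn 'Who <verb> <rest>?' into a regex for '<Answer> <verb> <rest>'."""
    match = re.fullmatch(r"Who (\w+) (.+)\?", question.strip())
    if match is None:
        return None  # question does not fit the single template handled here
    verb, rest = match.groups()
    # Answer slot: one or more capitalised words directly before the verb.
    answer_slot = r"([A-Z]\w+(?: [A-Z]\w+)*)"
    return re.compile(answer_slot + " " + re.escape(verb) + " " + re.escape(rest))

def answer(question, text):
    """Return the matched answer string, or None if nothing matches."""
    pattern = question_to_pattern(question)
    if pattern is None:
        return None
    found = pattern.search(text)
    return found.group(1) if found else None

text = "The novel is famous. Jane Austen wrote Pride and Prejudice in 1813."
print(answer("Who wrote Pride and Prejudice?", text))
```

The approach fails as soon as the answer is phrased differently in the text (for example in the passive, "Pride and Prejudice was written by ..."), which is the variation problem noted earlier.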

Summary
Automated analysis of text essential to make sense of vast quantities of machine-readable text
Many application areas for language processing technology
Complex applications tackled using combinations of simpler, generic language processing modules
Machine learning methods allow language processing modules to be learnt from large quantities of text data
Text data to learn from will usually suffer from sparse data problems

Next Topic: Text Document Pre-Processing
Feature extraction
Sentence segmentation
Tokenisation
Morphology
Analysing the structure of words
Canonicalisation
Lemmatisation and stemming