MFIN 290 Application of Machine Learning in
Finance: Lecture 8
Yujie He
8/14/2021
Agenda
Solution to HW 2
Natural Language Processing
State-of-the-Art AI systems
Introduction to NLP and its fundamentals
Word2Vec and embeddings
Deep learning applications in Natural Language Understanding (NLU) and Natural Language Generation (NLG), and trends
Transformer models
Transfer learning
Current Status and Trends
2
Goal
Build intuition for multiple commonly used models
Need to know some key math components of models
Pros and cons of different classic NLP models and their applications
Know industry practices and use cases of different models
Excellent resources:
https://scikit-learn.org/stable/tutorial/index.html
https://github.com/huggingface
Machine Learning Roadmap 2020 (roadmap on GitHub)
3
State-of-the-Art AI Systems
4
AI Empowered Technology
Visual Studio 2022 code completion
GitHub Copilot
The poet in the machine (Microsoft, 2018)
https://www.microsoft.com/en-us/research/blog/the-poet-in-the-machine-auto-generation-of-poetry-directly-from-images-through-multi-adversarial-training-and-a-little-inspiration/
DALL-E (OpenAI, 2021)
https://openai.com/blog/dall-e/
AlphaFold (Google DeepMind, 2020)
A solution to a 50-year-old grand challenge in biology
AlphaFold can accurately predict 3D models of protein structures and has the potential to accelerate research in every field of biology.
https://deepmind.com/research/case-studies/alphafold
Introduction to NLP and its fundamentals
10
What is NLP?
Natural language processing is a field at the intersection of
computer science
artificial intelligence
linguistics
Goal: for computers to process or “understand” natural language in order to perform tasks that are useful
Performing tasks, e.g., making appointments using a chatbot
Machine translation
Question answering, e.g., which movie won Best Picture at the last Oscars?
Summarization, automatic text generation
Fully understanding and representing the meaning of language (or even defining it) is a difficult goal
Perfect language understanding is AI-complete, but we are getting there
11
Stanford CoreNLP https://corenlp.run/
12
Allen NLP
13
https://demo.allennlp.org/coreference-resolution
Some examples of NLP applications
Spell/grammar checking, neural rewrite/style transfer
Knowledge mining
Knowledge graph building (entities, relations, acronyms, who knows what etc.)
Sentiment analysis guiding trading strategy
Question Answering, Machine reading comprehension
Search relevance, auto-suggest, recommendations
Text summarization and generation (e.g. Ads)
Machine translation
Personal assistance/chatbot/customer service
14
Search
15
Personalized Search Results
16
Question Answering and Semantic Search
17
What is special about human language?
The categorical symbols of a language can be encoded as a signal for communication in several ways:
Sound
Gesture
Writing/Images
The symbol is invariant across different encodings!
The large vocabulary and symbolic encoding of words create a problem for machine learning: sparsity!
Sparse encoding (traditional NLP) vs. continuous encoding
19
Fundamental NLP techniques
Tokenization/Stemming/Lemmatization: determining word units
Part-of-speech tagging: determining the grammatical function of each word: e.g. noun, verb, pronoun,
preposition, adjective, etc.
Named entity recognition: determining the correspondence between words/phrases and real-world
things: people, places, organizations, etc.
Chunking: determining grouping for word sequences
Dependency Parsing: determining the syntactic structure of a sentence
Text Classification: assign categories/labels to documents
Topic Models: represent documents and individual words in terms of topics
Co-reference resolution: determining words and phrases that refer to the same entity, e.g., “John Lee”, “he”, “John”
20
Tokenization
The process of breaking a stream of text into words, phrases, symbols, and other meaningful elements called tokens
21
Tokenization
Tokens are usually* the units of NLP analysis
https://www.nltk.org/api/nltk.tokenize.html
Main question of tokenization: where to split?
Punctuation
White space
Language specific issues
Word-breaker or word-segmentation (maximum matching)
Consistent tokenizer
Can be considered a “solved” problem in NLP
22
Tokenization
Left: word tokenizer
Right: subword tokenizer (used a lot in deep learning models)
23
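A minimal sketch comparing the two tokenizer styles above, assuming NLTK and the Hugging Face transformers package are installed; "bert-base-uncased" is used only as an illustrative subword vocabulary:

```python
# Word-level vs. subword tokenization (illustrative sketch).
# Requires: nltk.download("punkt") for the word tokenizer.
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer

text = "Tokenization isn't always straightforward."

# Word tokenizer: splits on whitespace and punctuation
print(word_tokenize(text))
# e.g. ['Tokenization', 'is', "n't", 'always', 'straightforward', '.']

# Subword tokenizer (WordPiece), used a lot in deep learning models:
# rare words are split into smaller pieces such as 'token', '##ization'
subword = AutoTokenizer.from_pretrained("bert-base-uncased")
print(subword.tokenize(text))
```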
Stemming and Lemmatization
The goal of both stemming and lemmatizing is to reduce inflectional forms to a common/root form
A stem might not be an actual word, whereas a lemma is
Stemming can be considered a simplified form of lemmatization
24
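A small NLTK sketch of the difference, assuming the WordNet data has been downloaded (nltk.download("wordnet")); the example words are chosen for illustration:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "ran"]:
    # The stem may not be an actual word (e.g. "studi"),
    # while the lemma is a dictionary form (e.g. "study", "run").
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
```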
Part-of-speech tagging (POS Tagging)
POS tagging assigns grammatical categories to individual words
POS tagging underpins many other NLP tools (as features), such as NER
POS tagging is important for syntactic and semantic analysis
26
Part-of-speech tagging (POS Tagging)
Stochastic sequence models (e.g., conditional random fields) are widely used
Accuracy up to 97% (trained/tested on Penn Treebank WSJ)
Rule-based taggers
Accuracy up to 95% (trained/tested on the Brown corpus) with automated rule generation
No model is perfect!
27
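A quick POS-tagging sketch with NLTK's default averaged-perceptron tagger (assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")); the sentence is made up for illustration:

```python
import nltk

tokens = nltk.word_tokenize("Apple is looking at buying a U.K. startup for $1 billion")
print(nltk.pos_tag(tokens))
# e.g. [('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ...]
```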
Named Entity Recognition (NER)
Detect named entity mentions in text and classify them into pre-defined categories
Person, organization, location (3 most common)
Others: datetime, numbers, monetary values or domain specific types
Standard dataset (e.g. CoNLL-2003) for NLP research
28
Named Entity Recognition (NER)
BIO encoding for NER to define tags
B-PERS, B-DATE… : beginning of a mention of a person/date type
I-PERS, I-DATE… : inside of a mention of a person/date type
O: outside of any mention of a named entity
29
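A small NER sketch with spaCy that also prints the per-token BIO/IOB tag (assumes the en_core_web_sm model has been downloaded; the sentence is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Lee joined Microsoft in Seattle on June 1, 2021.")

# Entity-level view: text span and its predicted category
for ent in doc.ents:
    print(ent.text, ent.label_)        # e.g. John Lee PERSON, Microsoft ORG, Seattle GPE

# Token-level view: BIO/IOB tag plus entity type (empty for O tokens)
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)
```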
Naïve Bayes for classification
Applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair
of features given the value of the class variable.
30
Bayes’ theorem: P(y | x_1, …, x_n) = P(y) P(x_1, …, x_n | y) / P(x_1, …, x_n)
Conditional independence between features: P(x_i | y, x_1, …, x_{i−1}, x_{i+1}, …, x_n) = P(x_i | y)
Simplify (P(x_1, …, x_n) is constant given the input): P(y | x_1, …, x_n) ∝ P(y) ∏_i P(x_i | y)
Maximum A Posteriori (MAP) estimation: ŷ = argmax_y P(y) ∏_i P(x_i | y)
31
Convert dataset to frequency table
Calculate likelihood table by calculating marginal and conditional probability
Use Bayes rule to calculate posterior probability for each class
How to adapt to spam filtering?
32
P(word) is constant across classes, so it can be ignored
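A toy worked example of the frequency-table → likelihood → posterior steps above; the counts are invented for illustration (the classic Weather/Play setup), and the same logic carries over to spam filtering with word counts per class:

```python
# Frequency table: (outlook, play) -> count (made-up numbers)
counts = {
    ("Sunny", "Yes"): 3, ("Sunny", "No"): 2,
    ("Rainy", "Yes"): 6, ("Rainy", "No"): 3,
}
total = sum(counts.values())                                    # 14
n_yes = sum(v for (o, p), v in counts.items() if p == "Yes")    # 9
n_no = total - n_yes                                            # 5

# Likelihoods and priors from the table
p_sunny_given_yes = counts[("Sunny", "Yes")] / n_yes            # 3/9
p_sunny_given_no = counts[("Sunny", "No")] / n_no               # 2/5
p_yes, p_no = n_yes / total, n_no / total

# Bayes rule; P(Sunny) is the same for both classes, so it can be ignored
score_yes = p_sunny_given_yes * p_yes
score_no = p_sunny_given_no * p_no
print("Yes" if score_yes > score_no else "No", score_yes, score_no)
```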
Naïve Bayes
The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution of P(x_i | y)
Pros:
Works quite well for document classification and spam detection
Requires a small amount of training data
Fast to compute
The conditional independence assumption lets each distribution be estimated independently as a one-dimensional distribution, which alleviates the curse of dimensionality
Cons:
The independence assumption is often too strong
Zero-frequency issue (use smoothing, e.g. Laplace/add-one smoothing)
NB probability estimates are poor; take them with a grain of salt
33
Gaussian NB: assumes P(x_i | y) = (1 / sqrt(2π σ_y²)) · exp(−(x_i − μ_y)² / (2σ_y²)), with μ_y and σ_y estimated from the training data for each class
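A minimal spam-filtering sketch with scikit-learn's multinomial Naive Bayes; the four example messages and their labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "lowest price guaranteed, click here",
         "meeting moved to 3pm", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

# alpha=1.0 is Laplace (add-one) smoothing, which avoids the zero-frequency issue
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free prize, click now"]))        # likely 'spam'
print(model.predict_proba(["free prize, click now"]))  # probabilities are poorly calibrated
```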
Traditional NLP applications: Sentiment analysis
Traditional: treat the sentence as a bag of words (ignoring word order) and consult a curated list of “positive” and “negative” words to determine the sentiment of the sentence. Hand-designed features are needed to capture negation, and even then they won’t capture everything.
34
Word2Vec and Embeddings
35
Representation for all levels? Vectors!
Semantic vector space model
Represent each word with a real-valued vector
These vectors can be used as features in a variety of applications
36
Inspirations
The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations.
We can use these statistics to address the following questions:
How meaning is generated from these statistics
How the resulting word vectors might represent that meaning
37
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols:
Hotels, conference, motel – a localist representation
38
Problem with words as discrete symbols
Example: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel” -> semantic search
But:
These two vectors are orthogonal -> cosine similarity is 0!
There is no natural notion of similarity for one-hot vectors!
Solution:
Could rely on WordNet’s list of synonyms to get similarity?
But it is well-known to fail: incompleteness, etc.
Instead: learn to encode similarity in the vectors themselves
39
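A quick NumPy illustration of the orthogonality problem, contrasted with dense vectors; the dense values are invented for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: "motel" and "hotel" share no dimensions -> similarity 0
motel_onehot = np.array([0, 0, 0, 1, 0, 0])
hotel_onehot = np.array([0, 0, 0, 0, 1, 0])
print(cosine(motel_onehot, hotel_onehot))   # 0.0

# Dense embeddings (toy values): similar words get similar vectors
motel_vec = np.array([0.21, -0.44, 0.68, 0.10])
hotel_vec = np.array([0.25, -0.40, 0.71, 0.05])
print(cosine(motel_vec, hotel_vec))         # close to 1
```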
Word2vec overview
43
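For reference, a sketch of the standard skip-gram formulation that word2vec optimizes (the usual notation is assumed here, not taken from the slides): for a corpus of T words with window size m, center-word vectors v and outside-word vectors u,

```latex
J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t;\ \theta),
\qquad
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```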
Illustration of Word Embeddings
52
Summary of word2vec
Go through each word of the whole corpus
Predict surrounding words of each word
This captures co-occurrence of words one at a time
55
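A minimal training sketch with gensim's Word2Vec; the toy sentences and hyperparameters are for illustration only (a real corpus needs far more text):

```python
from gensim.models import Word2Vec

sentences = [["the", "stock", "price", "rose", "sharply"],
             ["the", "share", "price", "fell", "sharply"],
             ["investors", "bought", "the", "stock"]]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep all words in this toy corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

print(model.wv["stock"][:5])                    # first few dimensions of one embedding
print(model.wv.most_similar("stock", topn=3))   # nearest neighbors in the vector space
```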
Some findings about word embeddings
Performance is better on the syntactic subtask for small and asymmetric context windows.
Syntactic info is mostly drawn from the immediate context and can depend strongly on word order
Semantic info is more frequently non-local, and more of it is captured with larger window sizes
62
Deep learning applications in NLP and recent trends
66
Deep learning in NLP
Traditional NLP dominated up until about 2012; starting around 2014, with the onset of neural machine translation, deep learning-based NLP took off
Statistical Machine Translation systems, built by hundreds of engineers over many years, were outperformed by Neural Machine Translation systems trained by a handful of engineers in a few months
In 2016, Google replaced its monolithic phrase-based machine translation system of 500k lines of code with a 500-line neural model
67
State-of-the-Art Progress
GLUE Benchmark
The General Language Understanding Evaluation (GLUE) benchmark
Introduced in 2019
Human ranks #16
Example tasks: sentiment analysis, semantic similarity
71
SuperGLUE Benchmark
GLUE, a single metric over a collection of tasks, is approaching non-expert human performance
Limited headroom for further research
SuperGLUE is a more difficult set of language understanding tasks
Pre-transformer Era
RNN, LSTM, GRU etc.
All have recurrent structure
All use static word embedding as inputs
Differ in how inputs get encoded within each cell
Classification
Seq2Seq model (translation, token classification etc.)
Often has encoder-decoder structure
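A minimal PyTorch sketch of a pre-transformer text classifier: static word embeddings fed into an LSTM, with the final hidden state used for classification. Vocabulary size, dimensions, and the toy batch are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # static word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])           # class logits

model = LSTMClassifier()
toy_batch = torch.randint(0, 10_000, (4, 12))     # 4 sequences of 12 token ids
print(model(toy_batch).shape)                     # torch.Size([4, 2])
```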
What is a transformer network?
Common in SOTA models
Attention mechanism
More parallelizable
Trains faster
Illustrated Transformer
Attention Mechanism
Why we need it
Seq2seq models like RNN/LSTM can’t encode long sequences well
Words have ordering
Types of attention
General attention (between input and output elements)
Self-attention (within input elements)
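A compact NumPy sketch of scaled dot-product attention, the building block behind both attention types above (self-attention is simply Q, K, V derived from the same input):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 over the keys
    return weights @ V, weights

# Toy self-attention: 4 tokens with dimension 8 (Q, K, V would normally be
# learned linear projections of the token representations)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(X, X, X)
print(out.shape, w.shape)                # (4, 8) (4, 4)
```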
Transformer based models – BERT as an example
Contextual word embeddings
Pretrained with Masked Language Modeling + Next Sentence Prediction tasks
76
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Devlin et al. (2018)
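A quick sketch of BERT's masked-language-model behavior via the Hugging Face pipeline ("bert-base-uncased" is used as an illustrative checkpoint; the sentence is made up):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The central bank raised interest [MASK] this quarter."):
    print(pred["token_str"], round(pred["score"], 3))   # top predictions for the masked token
```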
Transfer Learning in NLP
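A sketch of the pretrain-then-finetune pattern: load a pretrained transformer, attach a small classification head, and train on task-specific data. The checkpoint name, two-label setup, and toy batch are assumptions for illustration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["earnings beat expectations", "the company missed guidance badly"]  # toy task data
labels = torch.tensor([1, 0])    # e.g. 1 = positive sentiment, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # the model returns a loss when labels are given
outputs.loss.backward()                   # one fine-tuning step; loop over batches in practice
optimizer.step()
```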
Trends and Challenges in NLP
80
GPT-3 Demo
81
Exploding Model Capacity
Observations
Model quality improves as a function of parameter count, with no signs of stopping
Similar findings from our own experiments
82
Figure 1: Log-linear relationship between model size and model quality (from GPT-2)
Figure 2: Log-linear trend of model size vs. accuracy for image recognition models on ImageNet.
Large-scale Training Challenges
Compute and Data Size
For example, it takes a CPU with 1 Tflop/s throughput 8.2 years to pretrain BERT-Large
Data Parallelism
Memory and Interconnect
A large number of weights and activations must be stored and transferred
Model Parallelism
Optimization Challenges
Optimization Instability and Convergence
Train Target Selection
Baby Steps: Pretrain — A Challenging Task
Pretrain BERT-Large
Compute
1.01 Tflop per sample × 256M samples
Total: 2.6 × 10^20 flop
CPU with 1 Tflop/s throughput: 8.2 years
Memory
Weights: 0.6 GB
Activations per sample: 0.7 GB
Minibatch size of 1: 1.3 GB
84
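A quick sanity check of the compute estimate above:

```python
flops_per_sample = 1.01e12                    # 1.01 Tflop per sample
num_samples = 256e6                           # 256M samples
total_flops = flops_per_sample * num_samples  # ~2.6e20 flop
seconds = total_flops / 1e12                  # on a device with 1 Tflop/s throughput
print(seconds / (365 * 24 * 3600))            # ~8.2 years
```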
Knowledge Distillation
Knowledge distillation is a model compression method in which a small model is trained to mimic a pre-trained, larger model (or an ensemble of models).
Teacher-student setting
Speed-up from a 21 ms, 24-layer teacher to a 3 ms, 3-layer student
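A sketch of the standard teacher-student distillation loss (KL divergence against temperature-softened teacher outputs plus the usual task loss); the temperature, mixing weight, and toy tensors are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-softened distributions (scaled by T^2, as is conventional)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 3)   # toy batch: 8 examples, 3 classes
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```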
Summary — Change in how we do machine learning
The Classical ML “Silo” – A Task-driven ML Process
Collect training data for a task
Design model for a task
Feature engineering for a task
Evaluate and inference model for a task
Transfer learning makes the ML process more efficient
Pretrain a model on a large corpus, then fine-tune on a small task-specific dataset
Multi-task adaptation
Rethink the ML development process
How do we organize teams and find a good collaboration model?
How do we run inference on DL models in a cost-conscious way?
How do people contribute, and how are improvements integrated?
…
Next Step
Lecture 9: Review of all lectures, HW3, and mid-term exam
87