
MFIN 290 Application of Machine Learning in
Finance: Lecture 8

Yujie He

8/14/2021

Agenda

Solution to HW 2

Natural Language Processing

State-of-the-Art AI systems

Introduction to NLP and its fundamentals

Word2Vec and embeddings

Deep learning applications of Natural Language Understanding (NLU) and Natural Language Generation (NLG) and Trends

Transformer models

Transfer learning

Current Status and Trends

2

Goal

Build intuition for multiple commonly used models

Need to know some key math components of models

Pros and cons of different classic NLP models and their applications

Know industry practices and use cases of different models

Excellent resource:

https://scikit-learn.org/stable/tutorial/index.html

https://github.com/huggingface

Machine Learning Roadmap 2020 (Roadmap on Github)

3

State-of-the-Art AI Systems

4


AI Empowered Technology

Visual Studio 2022 code completion

GitHub Copilot

The poet in the machine (Microsoft, 2018)

https://www.microsoft.com/en-us/research/blog/the-poet-in-the-machine-auto-generation-of-poetry-directly-from-images-through-multi-adversarial-training-and-a-little-inspiration/

DALL-E (OpenAI, 2021)

https://openai.com/blog/dall-e/

AlphaFold (Google DeepMind, 2020)

A solution to a 50-year-old grand challenge in biology

AlphaFold can accurately predict 3D models of protein structures and has the potential to accelerate research in every field of biology.

https://deepmind.com/research/case-studies/alphafold

Introduction to NLP and its fundamentals

10

What is NLP?

Natural language processing is a field at the intersection of

computer science

artificial intelligence

linguistics

Goal: for computers to process or “understand” natural language in order to perform tasks that are useful

Performing tasks, e.g. making appointments using a chatbot

Machine translation

Question answering, e.g. which movie won Best Picture at the last Oscars?

Summarization, auto text generation

Fully understanding and representing the meaning of language (or even defining it) is a difficult goal

Perfect language understanding is AI-complete, but we are getting there

11

Stanford CoreNLP https://corenlp.run/

12

AllenNLP

13

https://demo.allennlp.org/coreference-resolution

Some examples of NLP applications

Spell/grammar checking, neural rewrite/style transfer

Knowledge mining

Knowledge graph building (entities, relations, acronyms, who knows what etc.)

Sentiment analysis guiding trading strategy

Question Answering, Machine reading comprehension

Search relevance, auto-suggest, recommendations

Text summarization and generation (e.g. Ads)

Machine translation

Personal assistance/chatbot/customer service

14

Search

15

Personalized Search Results

16

Question Answering and Semantic Search

17

Question Answering and Semantic Search

18

What is special about human language?

The categorical symbols of a language can be encoded as a signal for communication in several ways:

Sound

Gesture

Writing/Images

the symbol is invariant across different encodings!

The large vocabulary, symbolic encoding of words creates a problem for machine learning – sparsity!

Sparse encoding (traditional NLP) vs. continuous encoding

19

Fundamental NLP techniques

Tokenization/Stemming/Lemmatization: determining word units

Part-of-speech tagging: determining the grammatical function of each word, e.g. noun, verb, pronoun, preposition, adjective, etc.

Named entity recognition: determining the correspondence between words/phrases and real-world things: people, places, organizations, etc.

Chunking: determining grouping for word sequences

Dependency Parsing: determining the syntactic structure of a sentence

Text Classification: assign categories/labels to documents

Topic Models: represent documents and individual words in terms of topics

Co-reference resolution: determining words and phrases that refer to the same entity, e.g. “John Lee”, “he”, “John”

20

Tokenization

Is the process of breaking a stream of text up into words, phrases, symbols, and other meaningful elements called tokens

21

Tokenization

Tokens are usually* the units of NLP analysis

https://www.nltk.org/api/nltk.tokenize.html

Main question of tokenization: where to split?

Punctuation

White space

Language specific issues

Word-breaker or word-segmentation (maximum matching)

Consistent tokenizer

Can be considered a “solved” problem in NLP

22

Tokenization

Left: word tokenizer

Right: subword tokenizer (used a lot in deep learning models)
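Below is a minimal sketch (not from the slides) comparing the two styles, assuming NLTK and the Hugging Face transformers package are installed; the checkpoint name "bert-base-uncased" is just one common choice of WordPiece tokenizer.

import nltk
from transformers import AutoTokenizer

nltk.download("punkt", quiet=True)  # newer NLTK versions may also need "punkt_tab"

text = "Tokenization of unaffordable words"

# Word-level tokenizer (left-hand example)
print(nltk.word_tokenize(text))
# ['Tokenization', 'of', 'unaffordable', 'words']

# Subword (WordPiece) tokenizer of the kind used by BERT-style models (right-hand example)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize(text))
# rare words get split into pieces such as 'token', '##ization', ...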

23

Stemming and Lemmatization

The goal of both stemming and lemmatizing is to reduce inflectional forms to a common/root form

A stem might not be an actual word, whereas a lemma is

Stemming can be considered as simplified lemmatization
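A quick illustration of the difference, assuming NLTK's Porter stemmer and WordNet lemmatizer (one possible toolkit choice):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "caring"]:
    # the stem may not be a real word (e.g. "studi"); the lemma is a dictionary form
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))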

24

Stemming and Lemmatization

25

Part-of-speech tagging (POS Tagging)

POS tagging assigns grammatical categories to individual words

POS tagging underpins many other NLP tools (as features), such as NER

POS tagging is important for syntactic and semantic analysis
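A short sketch of off-the-shelf POS tagging with NLTK (the sentence is made up; tag names follow the Penn Treebank convention):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Apple is looking at buying a U.K. startup")
print(nltk.pos_tag(tokens))
# e.g. [('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ...]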

26

Part-of-speech tagging (POS Tagging)

Stochastic sequence models (conditional random field) are widely used

Accuracy up to 97% (trained/tested on Penn Treebank WSJ)

Rule-based

Accuracy up to 95% (trained/tested on Brown corpus) with automated rule generation

No model is perfect!

27

Named Entity Recognition (NER)

Detect named entity mentions in text and classify them into pre-defined categories

Person, organization, location (3 most common)

Others: datetime, numbers, monetary values or domain specific types

Standard dataset (e.g. CoNLL-2003) for NLP research

28

Named Entity Recognition (NER)

BIO encoding for NER to define tags

B-PERS, B-DATE… : beginning of a mention of a person/date type

I-PERS, I-DATE… : inside of a mention of a person/date type

O: outside of any mention of a named entity
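A minimal sketch with spaCy (one common NER library, not necessarily the one used in the lecture); the small English model must be downloaded separately, and the example sentence is made up:

import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("John Lee joined Google in California on January 5, 2021.")

# Entity spans with their predicted types
print([(ent.text, ent.label_) for ent in doc.ents])

# Token-level BIO (IOB) tags, matching the encoding described above
print([(tok.text, f"{tok.ent_iob_}-{tok.ent_type_}" if tok.ent_type_ else "O") for tok in doc])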

29

Naïve Bayes for classification

Applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

30

Conditional independence between features:

P(y | x_1, ..., x_n) = P(y) · ∏_{i=1}^{n} P(x_i | y) / P(x_1, ..., x_n)

Simplify: P(x_1, ..., x_n) is constant given the input, so

P(y | x_1, ..., x_n) ∝ P(y) · ∏_{i=1}^{n} P(x_i | y)

Maximum A Posteriori (MAP) estimation:

ŷ = argmax_y P(y) · ∏_{i=1}^{n} P(x_i | y)

31

Convert dataset to frequency table

Calculate likelihood table by calculating marginal and conditional probability

Use Bayes rule to calculate posterior probability for each class

How to adapt to spam filtering?


32

P(word) is constant, can be ignored
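A minimal sketch of the spam-filtering adaptation with scikit-learn (listed as a resource earlier); the tiny corpus and labels below are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "limited offer, click now",
         "meeting moved to 3pm", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer builds the word-frequency table; MultinomialNB estimates
# P(class) and P(word | class), with alpha=1.0 giving Laplace/add-one smoothing
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free prize, click now"]))   # likely 'spam'
print(model.predict(["see you at the meeting"]))  # likely 'ham'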

Naïve Bayes

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(x_i | y)

Pros:

Works quite well for document classification and spam detection

Requires only a small amount of training data

Fast to compute

The conditional independence assumption means each distribution can be estimated independently as a one-dimensional distribution, which alleviates the curse of dimensionality

Cons:

Independence assumption is too strong

Zero-frequency issue (use smoothing, e.g. Laplace/add-one smoothing)

NB probability estimates are poor; take them with a grain of salt

33

Gaussian NB: P(x_i | y) = 1/√(2πσ_y²) · exp(−(x_i − μ_y)² / (2σ_y²)), i.e. continuous features are modeled with a class-conditional Gaussian

Traditional NLP applications: Sentiment analysis

Traditional: treat the sentence as a bag of words (ignore word order); consult a curated list of “positive” and “negative” words to determine the sentiment of the sentence. Hand-designed features are needed to capture negation, and they won’t capture everything!
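To make the limitation concrete, here is a toy lexicon-based scorer (the word lists are hypothetical); because word order is ignored, negation such as "not great" is scored the same as "great":

POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def sentiment(sentence):
    # bag-of-words: count positive words minus negative words
    tokens = sentence.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(sentiment("the movie was great"))      # +1
print(sentiment("the movie was not great"))  # also +1 -- negation is missed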

34

Word2Vec and Embeddings

35

Representation for all levels? Vectors!

Semantic vector space model

Represent each word with a real-valued vector

These vectors can be used as features in a variety of applications

36

Inspirations

The statistics of word occurrences in a corpus are the primary source of information available to all unsupervised methods for learning word representations.

We can use them to address the following questions:

How meaning is generated from these statistics

How the resulting word vectors might represent that meaning

37

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols:

Hotels, conference, motel – a localist representation

38

Problem with words as discrete symbols

Example: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel” -> semantic search

But:

These two vectors are orthogonal -> cosine similarity is 0!

There is no natural notion of similarity for one-hot vectors!

Solution:
Could rely on WordNet’s list of synonyms to get similarity?

But it is well-known to fail: incompleteness, etc.

Instead: learn to encode similarity in the vectors themselves
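The orthogonality problem is easy to verify numerically; a small sketch with a made-up four-word vocabulary:

import numpy as np

vocab = ["seattle", "motel", "hotel", "conference"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot (localist) vectors for different words are orthogonal
print(cosine(one_hot("motel"), one_hot("hotel")))  # 0.0 -- no notion of similarity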

39


Word2vec overview

43



Illustration of Word Embeddings

52


Summary of word2vec

Go through each word of the whole corpus

Predict surrounding words of each word

This captures co-occurrence of words one at a time
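A minimal sketch of training a skip-gram word2vec model with gensim (one common implementation; the three toy sentences are made up and far too small for a real run):

from gensim.models import Word2Vec

sentences = [["the", "hotel", "room", "was", "clean"],
             ["the", "motel", "room", "was", "cheap"],
             ["we", "booked", "a", "conference", "room"]]

# sg=1 selects skip-gram: predict the surrounding words within the window of each center word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["hotel"][:5])                  # dense, real-valued embedding
print(model.wv.similarity("hotel", "motel"))  # similarity learned from co-occurrence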

55


Some findings about word embeddings

Performance is better on the syntactic subtask for small and asymmetric context windows.

Syntactic info is mostly drawn from the immediate context and can depend strongly on word order

Semantic info is more frequently non-local, and more of it is captured with larger window sizes

62


Deep learning applications in NLP and recent
trends

66

Deep learning in NLP

Traditional NLP dominated up until around 2012; from 2014 on, with the onset of neural machine translation, deep learning-based NLP took off

Statistical Machine Translation systems, built by hundreds of engineers over many years, were outperformed by Neural Machine Translation systems trained by a handful of engineers in a few months

In 2016, Google replaced its monolithic phrase-based machine translation system of ~500k lines of code with a ~500-line neural model

67


State-of-the-Art Progress

GLUE Benchmark

The General Language Understanding Evaluation (GLUE) benchmark

Introduced in 2019

Human ranks #16

Example tasks: sentiment analysis, semantic similarity

71

SuperGLUE Benchmark

GLUE, a single metric over a collection of tasks, is approaching non-expert human performance

Limited headroom for further research

SuperGLUE is a more difficult set of language understanding tasks

Pre-transformer Era

RNN, LSTM, GRU etc.

All have recurrent structure

All use static word embedding as inputs

Differ in how inputs get encoded within each cell

Classification

Seq2Seq model (translation, token classification etc.)

Often has encoder-decoder structure
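As a concrete (if simplified) picture of this recipe, here is a PyTorch sketch of an LSTM text classifier: static word embeddings feed a recurrent encoder, and the final hidden state is classified. All sizes are arbitrary illustrations, not values from the lecture.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # static word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])        # class logits

logits = LSTMClassifier()(torch.randint(0, 10000, (4, 12)))  # toy batch of 4 sequences
print(logits.shape)  # torch.Size([4, 2])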

What is a transformer network?

Common in SOTA models

Attention mechanism

More parallelizable

Trains faster

Illustrated Transformer

Attention Mechanism

Why we need it

Seq2seq models like RNN/LSTM can’t encode long sequences well

Words have ordering

Types of attention

General attention (between input and output elements)

Self-attention (within input elements)
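The core computation is scaled dot-product attention; a NumPy sketch of self-attention, where queries, keys, and values all come from the same input, and every position attends to every other position in parallel (which is why transformers parallelize better than recurrent models):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity between every query and every key
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 for each query
    return weights @ V                  # weighted sum of values

X = np.random.randn(5, 16)                   # 5 tokens, 16-dimensional representations
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V = X
print(out.shape)                             # (5, 16)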

Transformer based models – BERT as an example

Contextual word embeddings

Pretrained on Masked Language Modeling + Next Sentence Prediction tasks

76
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Devlin et al. (2018)
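A quick way to see the masked-language-model objective in action, assuming the Hugging Face transformers package (listed as a resource earlier) and the bert-base-uncased checkpoint; the example sentence is made up:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The stock market [MASK] sharply after the announcement."):
    print(pred["token_str"], round(pred["score"], 3))
# Predictions depend on the whole sentence: the same word receives different
# contextual embeddings in different contexts, unlike static word2vec vectors.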

Transfer Learning in NLP

Trends and Challenges in NLP

80

GPT3 Demo

81

Exploding Model Capacity

Observations
Model quality improves as a function of parameter size

Shows no signs of stopping

Similar findings from our own experiments

82

Figure 1: Log-linear relationship between model size and model quality (from GPT-2)

Figure 2: Log-linear trend of model size vs. accuracy for image recognition models on ImageNet.

Large-scale Training Challenges

Compute and Data Size
For example, it takes a 1 Tflop/s CPU 8.2 years to pretrain BERT-Large

Data Parallelism

Memory and Interconnect
A large number of weights and activations must be stored and transferred

Model Parallelism

Optimization Challenges
Optimization Instability and Convergence

Train Target Selection

Baby Steps: Pretrain — A Challenging Task

Pretrain BERT-Large

Compute

1.01 Tflop per sample × 256M samples

Total: 2.6 × 10^20 flop

CPU with 1 Tflop/s throughput: 8.2 years
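A back-of-the-envelope check of these numbers (a sketch, not lecture code):

flop_per_sample = 1.01e12                       # ~1.01 Tflop per sample
num_samples = 256e6                             # ~256M samples
total_flop = flop_per_sample * num_samples      # ~2.6e20 flop

seconds = total_flop / 1e12                     # on a 1 Tflop/s CPU
print(total_flop, seconds / (3600 * 24 * 365))  # ~2.6e20 flop, ~8.2 years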

Memory

Weights: 0.6GB

Activations per sample: 0.7GB

Minibatch size of 1: 1.3GB

84

Knowledge Distillation

Knowledge distillation is a model compression method in which a small model (the student) is trained to mimic a pre-trained, larger model or ensemble of models (the teacher).
Teacher-student setting
Speed-up from a 21 ms, 24-layer teacher to a 3 ms, 3-layer student
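A sketch of the standard teacher-student distillation loss in PyTorch (soft targets with a temperature plus the usual hard-label loss); the temperature T and mixing weight alpha are illustrative choices, not values from the lecture:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # The student matches the teacher's softened output distribution...
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # ...while still fitting the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 3)              # toy logits from a small student model
teacher = torch.randn(8, 3)              # toy logits from a large teacher model
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student, teacher, labels))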

Summary — Change in how we do machine learning

The Classical ML “Silo” – A Task-driven ML Process
Collect training data for a task

Design model for a task

Feature engineering for a task

Evaluate and inference model for a task

Transfer learning makes ML process more efficient
Pretrain a model on a large corpus, then finetune on a small task-specific dataset (see the sketch after this list)

Multi-task adaptation
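A minimal sketch of that pretrain-then-finetune recipe with the Hugging Face transformers library (one common toolchain; the two training sentences and labels below are made up):

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Start from pretrained BERT weights and add a fresh classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["earnings beat expectations", "the company missed revenue guidance"]
labels = [1, 0]
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

# Finetune on the small task-specific dataset
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="finetune_out", num_train_epochs=1,
                                         per_device_train_batch_size=2),
                  train_dataset=ToyDataset())
trainer.train()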

Rethink ML development process
How do we organize teams and find a good collaboration model?

How do we run inference on DL models in a cost-conscious way?

How do people contribute, and how are the improvements integrated?

Next Step

Lecture 9: Review of all lectures, HW3, and mid-term exam

87
