
Computational Linguistics
CSC 485/2501 Fall 2021
1
1. Introduction to computational linguistics

Department of Computer Science, University of Toronto (many slides taken or adapted from others)
Reading: Jurafsky & Martin: 1. Bird et al.: 1, [2.3, 4].
Copyright © 2021. All rights reserved.

Why would a computer need to use natural language?
Why would anyone want to talk to a computer?
2

• Computer as autonomous agent.
Has to talk and understand like a human.
3

• Computer as servant. Has to take orders.
4

• Computer as personal assistant. Has to take orders.
5

• Computer as researcher.
Needs to read and listen to everything.
6

• Computer as researcher.
Brings us the information we need.
Find me a well-rated hotel in or near Stockholm where the food is good, but not one that has any complaints about noise.
7

• Computer as researcher.
Brings us the information we need.
Did people in 1878 really speak like the characters in True Grit?
8

• Computer as researcher.
Brings us the information we need.
Is it true that if you turn a guinea pig upside down, its eyes will fall out?
9

• Computer as researcher. Organizes the information we need.
Please write a 500-word essay for me on “Why trees are important to our environment”.
And also write a thank-you note to my grandma for the birthday present.
10

• Computer as researcher. Wins television game shows.
IBM’s Watson on Jeopardy!, 16 February 2011 https://www.youtube.com/watch?v=yJptrlCVDHI
https://www.youtube.com/watch?v=Y2wQQ-xSE4s 11

• Computer as language expert. Translates our communications.
12

• Input: Spoken
Written
• Output:
An action
A document or artifact
Some chosen text or speech
Some newly composed text or speech
13

Intelligent language processing
• Document applications
Searching for documents by meaning
Summarizing documents
Answering questions
Extracting information
Content and authorship analysis
Helping language learners
Helping people with disabilities

14

Example: Early detection of Alzheimer’s
• Look for deterioration in complexity of vocabulary and syntax.
• Study: Compare three British writers
P.D. of Alzheimer’s
No Alzheimer’s
Suspected Alzheimer’s
15

Increase in short-distance word repetition
[Chart: word-repetition trend for the three writers — Rise (p < .01), Rise (p < .01), n.s.]
17

Speech recognition for dysarthria
• Use articulation data to improve speech recognition for people with speech disabilities
• Created large database of dysarthric speech and articulation data for study
18

Language Change through Time - Disney Dynamics (Randle & Hanson, 2013)
19

Mathematics of syntax and language
• Fowler’s algorithm (2009): first quasi-polynomial time algorithm for parsing with Lambek categorial grammars
• McDonald’s algorithm (2005): novel dependency-grammar parsing algorithm based upon minimum spanning trees
• Parsing in freer-word-order languages
20

[Diagram: CL/NLP at the intersection of statistical pattern recognition, information science, linguistics, psycholinguistics, knowledge representation and reasoning, and signal processing]
21

Computational linguistics 1
• Anything that brings together computers and human languages ...
• ... using knowledge about the structure and meaning of language (i.e., not just string processing).
• The dream: “The linguistic computer”.
• Human-like competence in language.
22

Computational linguistics 2
• The development of computational models with natural language as input and/or output.
• Goal: A set of tools for processing language (semi-)automatically:
• To access linguistic information easily and to transform it — e.g., summarize, translate, ....
• To facilitate communication with a machine.
• “NLP”: Natural language processing.
23

Computational linguistics 3
• Use of computational models in the study of natural language.
• Goal: A scientific theory of communication by language:
• To understand the structure of language and its use as a complex computational system.
• To develop the data structures and algorithms that can implement/approximate that system.
24

Current research trends
• Emphasis on large-scale NLP applications.
• Combines: language processing and machine learning.
• Availability of large text corpora, development of statistical methods.
• Combines: grammatical theories and actual language use.
• Embedding structure into known problem spaces (especially with neural networks).
• Combines: statistical pattern recognition and some relatively simple linguistic knowledge.
25

Focus of this course 1
• “Grammars”
• “Parsing”
• Resolving “syntactic” ambiguities
• Determining “argument structure”
• Lexical semantics, resolving word-sense ambiguities
• “Compositional” semantics
• Understanding pronoun reference
26

Focus of this course 2
• Current methods
• Integrating statistical knowledge into grammars and parsing algorithms.
• Using text corpora as sources of linguistic knowledge.
27

Not included
• Machine translation, language models, text classification, part-of-speech tagging...*
• Graph-theoretic and spectral methods%
• Speech recognition and synthesis*¶
• Cognitively based methods§
• Semantic inference,% semantic change/drift^
• Understanding dialogues and conversations¶
• Bias, fake news detection, ethics in NLP$
* CSC 401 / 2511. % CSC 2517. ¶ CSC 2518. § CSC 2540. ^ CSC 2519. $ CSC 2528.
28

What about “Deep Learning?”
• Until very recently, more of a euphemism in NLP – the depth is mostly unrelated to abstraction or cognitive plausibility.
• It is very useful, however, for plugging different components together – modular?
• It would be more accurate to call what we do “fat learning” – neural models work well because they take lots of earlier/later context into account.
• But deep/fat learning hasn’t solved all of our problems...
29

Grammaticality
“well formed; in accordance with the productive rules of the grammar of a language” - lexico.com (Oxford)
From grammatical, “of or pertaining to grammar”
16th century: ≈ literal
18th century: a state of linguistic purity
19th century: relating to mere arrangement of words, as opposed to logical form or structure
30

Grammaticality vs. Probability
“I think we are forced to conclude that ... probabilistic models give no particular insight into some of the basic problems of syntactic structure.” - Chomsky (1957)
31

Grammaticality vs. Probability (Chomsky, 1955)
colorless green ideas sleep furiously
furiously sleep ideas green colorless
32

Grammaticality vs. Probability (Pereira, 2000)
colorless green ideas sleep furiously (-40.44514457)
furiously sleep ideas green colorless (-51.41419769)
This is not only a probabilistic model, but a probabilistic language model (Agglomerative Markov Process, Saul & Pereira, 1997).
33

Language Modelling (Shannon, 1951; Jelinek, 1976)
w_i = argmax_w P(w | w_1 ... w_{i-1})
Examples:
Athens is the capital __
Athens is the capital of __
What do you need to know to predict the first? What do you need to know to predict the second?
34

Grammaticality vs. Probability (Pereira, 2000)
colorless green ideas sleep furiously (-40.44514457)
furiously sleep ideas green colorless (-51.41419769)
This is not only a probabilistic model, but a probabilistic language model (Agglomerative Markov Process, Saul & Pereira, 1997).
35

(-39.5588693) colorless sleep green ideas furiously
colorless ideas furiously green sleep
colorless sleep furiously green ideas
colorless green ideas sleep furiously (-40.44514457)
furiously sleep ideas green colorless (-51.41419769)
green furiously colorless ideas sleep
green ideas sleep colorless furiously (-51.69151925)
36

Point-Biserial Correlations
• Grammaticality taken to be a binary variable (yes/no).
• The probability produced by a language model for a string of words is continuous.
• Point-biserial correlations:
• M1 = mean of the continuous values assigned to samples that received the positive binary value.
• M0 = mean of the continuous values assigned to the samples that received the negative binary value.
• Sn = standard dev. of all samples’ continuous values.
• p = proportion of samples with negative binary value.
• q = proportion of samples with positive binary value.
37

Corrected Point-biserial Correlations with CGISF Permutation Judgements
• Grammaticality taken to be a discrete variable.
• Two linguists scored then adjudicated the permutations (27 grammatical, 93 ungrammatical).
38

Corrected Point-biserial Correlations with CoLA (Warstadt et al., 2018)
• 10,657 (English) examples taken from linguistics papers.
• Roughly 71% of their development set labelled positively.
39
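
To make the statistic above concrete: a minimal Python sketch of the (uncorrected) point-biserial correlation between binary grammaticality judgements and continuous language-model scores. This is illustrative only – the slides report a corrected variant of this statistic, and the function and variable names here are not from the actual evaluation code.

import math

def point_biserial(scores, labels):
    """scores: continuous LM scores; labels: 1 = judged grammatical, 0 = ungrammatical."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    m1 = sum(pos) / len(pos)            # M1: mean score of the grammatical samples
    m0 = sum(neg) / len(neg)            # M0: mean score of the ungrammatical samples
    mean = sum(scores) / len(scores)
    sn = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))   # Sn
    p = len(neg) / len(scores)          # proportion judged ungrammatical
    q = len(pos) / len(scores)          # proportion judged grammatical
    # standard point-biserial formula: r_pb = (M1 - M0) / Sn * sqrt(p * q)
    return (m1 - m0) / sn * math.sqrt(p * q)

# Toy call with made-up numbers (not the CGISF data):
print(point_biserial([-40.4, -39.6, -51.4, -51.7], [1, 1, 0, 0]))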

What about GPT-2?
OpenAI’s GPT-2 has been promoted as “an AI” that exemplifies an emergent understanding of language after mere unsupervised training on about 40GB of webpage text. It sounds really convincing in interviews:
• Q: Which technologies are worth watching in 2020?
A: I would say it is hard to narrow down the list. The world is full of disruptive technologies with real and potentially huge global impacts. The most important is artificial intelligence, which is becoming exponentially more powerful. There is also the development of self-driving cars. There is a lot that we can do with artificial intelligence to improve the world.... 
• Q: Are you worried that ai [sic] technology can be misused?
A: Yes, of course. But this is a global problem and we want to tackle it with global solutions....
– “AI can do that”, The World in 2020, The Economist
Surely something this sophisticated can predict grammaticality, right?
40

Wrong
Model           GPT-2 small                     GPT-2 XL
normalization   raw            normalized       raw            normalized
Score           log     exp    log     exp      log     exp    log     exp
PB              0.15    0.012  0.224   0.159    0.184   0.012  0.25    0.164
Adj. B          0.161   0.014  0.244   0.174    0.201   0.013  0.272   0.18
(A sketch of how such raw and normalized log/exp scores can be computed appears after slide 46 below.)
• Should conclusions about grammaticality be based upon scientific experimentation or self-congratulatory PR stunts?
• People are very good at attributing interpretations to natural phenomena that defy interpretation.
• The misrepresentation isn’t limited to this, moreover...
41

The Deep Learning Advantage?
• There is now a robust thread of research that uses language models for tasks other than predicting the next word – not because they are the best approach, but because the people using them are scientifically illiterate about:
• what language consists of and how it works,
• how to evaluate performance and progress in the task.
• When these models work well at all, they often get credit just for placing.
• Grammaticality prediction is one of these tasks.
42

The Deep Learning Retort
• In the case of grammaticality, the reply by this community has been:
• to blame linguists for coining a task (they didn’t) that is ill posed (it isn’t),
• to shift to a different, easier task, relative grammaticality, which is also known to be more stable across samples of human annotations.
• Pedestrian attempts at promoting deep learning often represent fields such as CL as blindly hunting for “hand-crafted” features in order to improve the performance of their classifiers.
• In fact, several discriminative pattern-recognition methods were already in widespread use before the start of the “deep learning revolution”, and they had already made that approach very unattractive.
43

The Deep Learning Advantage
• Nevertheless, deep learning is adding value, but more in terms of:
• modularity of the different network layers, which allows for separation and recombination,
• novelty of the approaches, even if performance isn’t state of the art, and
• the “liberated practitioner,” who can now produce, with very little expertise, a baseline system that has a higher accuracy than earlier naïve baselines.
44

Legitimate Points of Concern
• Is grammaticality really a discrete variable?
• Several have argued that a presumed correlation between neural language models and grammaticality suggests that grammaticality should be viewed as gradient (Lau et al., 2017; Sprouse et al., 2018).
• Eliciting grammaticality ≠ blindly probing the elephant.
• Numerous papers on individual features of grammaticality (Linzen et al., 2016; Bernardy & Lappin, 2017; Gulordava et al., 2018).
• How do you sample grammaticality judgements?
• Acceptability judgements (Sprouse & Almeida, 2012; Sprouse et al., 2013) are not quite the same thing – experimental subjects can easily be misled by interpretability.
• Round-trip machine translation of grammatical sentences for generating ungrammatical strings (Lau et al., 2014; 2015).
45

Grammaticality vs. Interpretability
We sampled the British National Corpus by:
1) using the 27 grammatical part-of-speech tag sequences from the CGISF permutations,
2) using ClausIE (Del Corro and Gemulla, 2013) to ensure that our sequences exactly matched 5-word clauses.
This resulted in 36 5-word clauses from the BNC that are both grammatical and interpretable.
46
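
Before the results table on the next slide: a minimal sketch of how raw and length-normalized log-probability scores of the kind shown in the “Wrong” table can be obtained from a pretrained language model. The "gpt2" checkpoint, the Hugging Face transformers API, and natural-log units are assumptions made only for this illustration; the numbers on these slides come from the models named there and will not match.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def score(text):
    """Return (raw_logprob, per_token_logprob) of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean negative
        # log-likelihood over the predicted tokens (natural log).
        loss = model(ids, labels=ids).loss
    n_predicted = ids.size(1) - 1        # the first token is not predicted
    return -loss.item() * n_predicted, -loss.item()

for s in ["colorless green ideas sleep furiously",
          "furiously sleep ideas green colorless"]:
    print(s, score(s))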

Corrected Point-biserial Correlations with BNC/CGISF
[Table: corrected point-biserial correlations for several language models (Mikolov, QRNN, GPT, GPT-2, GPT-2 XL, with “(regular)”, “(TAS)”, “colorless”, and “colourless” variants), by training provenance (BNC, WT-103) and by LOG vs. EXP scoring; the individual cell values are garbled in this extraction and are not reproduced here.]
47

Is Grammaticality Testing Beyond our Grasp?
There are more successful methods (Warstadt et al., 2018; Liu et al., 2019; Lan et al., 2019; Raffel et al., 2019), but:
1) they are supervised (unlike most language models),
2) they are trained on CoLA (which has been split into training, development and test sets),
3) many use neural networks,
4) but they clamp on a classifier that takes softmax outputs as inputs (and so there is more there than a language model).
The results:
• Accuracy: between 65% and 71% (not everyone reports this)
• MCC: 0.253 – 0.7
• But uniformly guessing “grammatical” on CoLA has 71% accuracy, also (a short worked example follows the last slide).
48

Future Work
• Grammaticality prediction remains a practical challenge.
• It is important, even for practical considerations, e.g., grammar checking.
• It is hard to imagine that the next breakthrough in accuracies/correlations of grammaticality prediction would not use statistical modelling of some kind.
• But it also seems unlikely that that breakthrough would come solely from a statistical language model.
• Language models have developed the way that they have because making sense seems to be more important than being grammatical.
49
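
A short worked example for the CoLA baselines above (illustrative numbers, not the actual CoLA counts): a classifier that labels everything “grammatical” on a 71%-positive set matches the quoted 71% accuracy, but its Matthews correlation coefficient is zero under the usual convention that 0/0 counts as 0.

import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; a zero denominator is treated as MCC = 0."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

tp, fp, tn, fn = 710, 290, 0, 0           # 1,000 examples, all predicted "grammatical"
print((tp + tn) / (tp + fp + tn + fn))    # accuracy: 0.71
print(mcc(tp, fp, tn, fn))                # MCC: 0.0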