Computational Linguistics
CSC 485 Summer 2020
1
1. Introduction to computational linguistics
Gerald Penn
Department of Computer Science, University of Toronto (many slides taken or adapted from others)
Reading: Jurafsky & Martin: 1. Bird et al.: 1, [2.3, 4].
Copyright © 2020 Graeme Hirst, Suzanne Stevenson and Gerald Penn. All rights reserved.
Why would a computer need to use natural language?
Why would anyone want to talk to a computer?
2
• Computer as autonomous agent.
Has to talk and understand like a human.
3
• Computer as servant. Has to take orders.
4
• Computer as personal assistant. Has to take orders.
Schedule a meeting tomorrow with George. Book me a flight to Vancouver for the conference. Find out why our sales have dropped in Lithuania. And write a thank-you note to my grandma for the birthday present.
5
• Computer as researcher.
Needs to read and listen to everything.
6
• Computer as researcher.
Brings us the information we need.
Find me a well-rated hotel in or near Stockholm where the food is good, but not one that has any complaints about noise.
Did people in 1878 really speak like the characters in True Grit?
Is it true that if you turn a guinea pig upside down, its eyes will fall out?
7
• Computer as researcher. Organizes the information we need.
Please write a 500-word essay for me on “Why trees are important to our environment”.
And also write a thank-you note to my grandma for the birthday present.
8
• Computer as researcher. Wins television game shows.
IBM’s Watson on Jeopardy!, 16 February 2011 https://www.youtube.com/watch?v=yJptrlCVDHI
https://www.youtube.com/watch?v=Y2wQQ-xSE4s
9
• Computer as language expert. Translates our communications.
10
• Input:
Spoken
Written
• Output:
An action
A document or artifact
Some chosen text or speech
Some newly composed text or speech
11
Intelligent language processing
• Document applications
Searching for documents by meaning
Summarizing documents
Answering questions
Extracting information
Content and authorship analysis
Helping language learners
Helping people with disabilities
…
12
Example: Answering clinical questions at the point of care
In a patient with suspected MI, does thrombolysis decrease the risk of death even if it is administered ten hours after the onset of chest pain?
13
Example: Early detection of Alzheimer’s
• Look for deterioration in complexity of vocabulary and syntax.
• Study: Compare three British writers
Iris Murdoch: died of Alzheimer’s
P.D. James: no Alzheimer’s
Agatha Christie: suspected Alzheimer’s
14
Increase in short-distance word repetition
[Figure: repetition rises significantly (p < .01) for Murdoch and Christie; no significant change (n.s.) for James.]
16
Spoken documents
• “Google for speech”
Search, indexing, and browsing through audio documents.
• Speech summarization: automatically select the 5–20% most important sentences of audio documents.
17
Speech recognition for dysarthria
• Use articulation data to improve speech recognition for people with speech disabilities
• Created large database of dysarthric speech and articulation data for study
18
Speech transformation for dysarthria
• Transform dysarthric speech to improve comprehensibility
19
Models of human language processing
• Highly multidisciplinary approach
• Exploit the relation between linguistic knowledge and statistical behaviour of words
20
Models of children’s language acquisition
• Models of how children learn their language just from what they hear and observe
• Apply machine-learning techniques to show how children can learn:
° to map words in a sentence to real world objects
° the relation between verbs and their arguments
21
Mathematics of syntax and language
• Fowler’s algorithm (2009): first quasi-polynomial time algorithm for parsing with Lambek categorial grammars
• McDonald’s algorithm (2005): novel dependency-grammar parsing algorithm based upon minimum spanning trees
• Parsing in freer-word-order languages
22
[Figure: CL/NLP at the intersection of linguistics, psycholinguistics, machine learning, information science, knowledge representation and reasoning, and signal processing.]
23
Computational linguistics 1
• Anything that brings together computers and human languages ...
• ... using knowledge about the structure and meaning of language (i.e., not just string processing).
• The dream: “The linguistic computer”.
• Human-like competence in language.
24
Computational linguistics 2
• The development of computational models with natural language as input and/or output.
• Goal: A set of tools for processing language (semi-) automatically:
• To access linguistic information easily and to transform it — e.g., summarize, translate, ....
• To facilitate communication with a machine.
• “NLP”: Natural language processing.
25
Computational linguistics 3
• Use of computational models in the study of natural language.
• Goal: A scientific theory of communication by language:
• To understand the structure of language and its use as a complex computational system.
• To develop the data structures and algorithms that can implement/approximate that system.
26
What does it mean to “understand” language?
27
The Turing Test
In the first line of your sonnet which reads “Shall I compare thee to a summer’s day,” would not “a spring day” do as well or better?
It wouldn’t scan.
How about “a winter’s day”? That would scan all right.
Yes, but nobody wants to be compared to a winter’s day.
Alan Turing, “Computing machinery and intelligence”, Mind, 59, 1950, 433–460.
28
The Turing Test
Would you say Mr Pickwick reminded you of Christmas?
In a way.
Yet Christmas is a winter’s day, and I do not think Mr Pickwick would mind the comparison.
I don’t think you’re serious. By a winter’s day one means a typical winter’s day, rather than a special one like Christmas.
Alan Turing, “Computing machinery and intelligence”, Mind, 59, 1950, 433–460.
29
Using language
is not necessarily understanding language
30
A computer psychiatrist
Men are all alike.
In what way?
They’re always bugging us about something or the other.
Can you think of a specific example?
Well, my boyfriend made me come here.
Your boyfriend made you come here.
He says I’m depressed.
Joseph Weizenbaum, Computer Power and Human Reason, W.H. Freeman, 1976.
31
A computer psychiatrist
I’m sorry to hear you are depressed.
It’s true. I am unhappy.
Do you think coming here will help you not to be unhappy?
Perhaps I could learn to get along with my mother.
Tell me more about your family.
Joseph Weizenbaum, Computer Power and Human Reason, W.H. Freeman, 1976.
32
Levels of understanding 1
• 0. Keyword processing:
Limited knowledge of particular words or phrases, or their collocations.
• Chatbots.
• Information retrieval.
• Web searching.
33
Levels of understanding 2
• 1. Limited linguistic ability: Appropriate response to simple, highly constrained sentences.
• Database queries in NL.
“Show all sales staff who exceeded their quota in May.”
• Simple NL interfaces.
“I want to fly from Toronto to Vancouver next Sunday.”
34
Levels of understanding 2
35
Levels of understanding 3
• 2. Full text comprehension: Understanding multi-sentence text and its relation to the “real world”.
• Conversational dialogue.
• Automatic knowledge acquisition
• Machine translation?
• 3. Emotional understanding/generation:
• Responding to literature, poetry, humour
• Story narration.
36
Current research trends
• Emphasis on large-scale NLP applications.
• Combines: language processing and machine learning.
• Availability of large text corpora, development of statistical methods.
• Combines: grammatical theories and actual language use.
• Embedding structure into known problem spaces (especially with neural networks).
• Combines: statistical pattern recognition and some relatively simple linguistic knowledge.
43
Levels of linguistic structure and analysis 1
• Phonology
• The sound system of a language.
• Morphology
• The minimal meaningful units of language (root of a word; suffixes and prefixes), and how they combine.
• Lexicon
• The semantic and syntactic properties of words.
44
Levels of linguistic structure and analysis 2
• Syntax
• The means of expressing meaning: how words can combine, and in what order.
• Semantics
• The meaning of a sentence (a logical statement?).
• Pragmatics
• The use of a sentence: pronominal referents; intentions; multi-sentence structure.
45
“Building blocks” of CL systems 1
• Language interpretation, language generation, and machine translation.
• Part-of-speech (PoS) tagging.
• Parsing and grammars.
• Reference resolution.
• Dialogue management.
• These are better thought of as functional units now rather than as modular components of modern NLP architectures.
46
Natural language interpretation
Does Flight 207 serve lunch?
YNQ ( ∃e SERVING(e) ∧ SERVER(e, flight-207) ∧ SERVED(e, lunch) )
47
Natural language generation
(spray-1 (CAUSER sally-1)
         (OBJECT paint-1)
         (PATH (path-1 (DESTINATION wall-1))))
Sally sprayed paint onto the wall.
48
Machine translation
• History lesson: the Vauquois triangle (1968).
• Current systems based purely on statistical associations and lexical semantic embeddings.
• Getting incrementally better as they learn from more data.
• Probably more emergent knowledge of linguistics in there than we give them credit for, but it’s awfully difficult for us to extract it.
49
http://www.duchcov.cz/gymnazium/
50
http://www.duchcov.cz/gymnazium/ Translated by Google Translate, 14 July 2008
51
http://gymdux.sokolici.eu/index.php/informace/historie-koly Translated by Google Translate, 3 August 2010.
53
http://gymdux.sokolici.eu/index.php/informace/historie-koly Translated by Google Translate, 17 June 2013.
55
http://www.gspsd.cz/historie/historie-skoly Translated by Google Translate, 26 May 2014.
56
https://www.gspsd.cz/index.php?type=Post&id=256&ids=249 Translated by Google Translate, 5th September 2019.
57
“Building blocks” of CL systems 2
• Information extraction
• Chunking (instead of parsing).
• Template filling.
• Named-entity recognition.
58
Information extraction
“Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990.”
Tie-up-1:
  Relation: Tie-up
  Entities: Bridgestone Sports Co., a local concern, a Japanese trading house
  Joint venture: Bridgestone Sports Taiwan Co.
  Activity: Activity-1
  Amount: NT$20,000,000
Activity-1:
  Company: Bridgestone Sports Taiwan Co.
  Product: golf clubs
  Start date: January 1990
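A loose illustration of the named-entity recognition step behind such template filling; this sketch uses spaCy and its small English model, which are my choices and not the system that produced the template above:

```python
# Illustrative sketch only: NER as the raw material for slot filling.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan "
        "with a local concern and a Japanese trading house to produce golf clubs "
        "to be shipped to Japan.")
doc = nlp(text)

# Entities found by the model, with their types (ORG, GPE, DATE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)

# The template itself is just a record whose slots are filled from entities and
# shallow chunk-level patterns; shown here hand-filled for illustration.
tie_up_1 = {
    "Relation": "Tie-up",
    "Entities": ["Bridgestone Sports Co.", "a local concern", "a Japanese trading house"],
    "Joint venture": "Bridgestone Sports Taiwan Co.",
    "Activity": "Activity-1",
}
```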
59
“Building blocks” of CL systems 3
• Lexical semantics
• Word sense disambiguation (WSD).
• Taxonomies of word senses.
• Analysis of verbs and other predicates
• Embeddings of words into continuous vector space (word2vec, BERT, XLNet, etc.); see the sketch below.
• Computational morphology
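A toy sketch (mine) of static word embeddings with the gensim library; the corpus and hyperparameters are placeholders, and note that contextual models such as BERT, unlike word2vec, give each occurrence of a word its own vector:

```python
# Illustrative only: train tiny word2vec embeddings with gensim.
from gensim.models import Word2Vec

sentences = [
    ["plant", "and", "animal", "life"],
    ["microscopic", "plant", "life", "in", "the", "water"],
    ["the", "manufacturing", "plant", "closed"],
]  # in practice: millions of tokenized sentences

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vec = model.wv["plant"]                         # one 50-dimensional vector per word type
print(model.wv.most_similar("plant", topn=3))   # nearest neighbours in the vector space
```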
60
Why is understanding hard? 1
• The structures that we are interested in are richer than strings – often hierarchical or scope-bearing.
Nadia knows Ross left.
[S [NP Nadia] [VP [V knows] [S [NP Ross] [VP left]]]]
KNOWS(Nadia, LEFT(Ross))
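The same hierarchical structure can be manipulated in code with NLTK’s Tree class (NLTK being the toolkit from the Bird et al. readings); the bracketing is my rendering of the tree above:

```python
# Illustrative only: a parse tree as a data structure rather than a string.
from nltk import Tree

t = Tree.fromstring("(S (NP Nadia) (VP (V knows) (S (NP Ross) (VP left))))")
t.pretty_print()     # draws the tree as ASCII art
print(t.leaves())    # ['Nadia', 'knows', 'Ross', 'left']
```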
61
Why is understanding hard? 2
• Mapping from surface-form to meaning is many-to-one: Expressiveness.
Nadia kisses Ross.
Ross is kissed by Nadia.
Nadia gave Ross a kiss.
Nadia gave a kiss to Ross.
KISS(Nadia, Ross)
62
Why is understanding hard? 3
• Mapping is one-to-many: Ambiguity at all levels.
• Lexical
• Syntactic
• Semantic
• Pragmatic
63
Lexical ambiguity
The lawyer walked to the bar and addressed the jury.
The lawyer walked to the bar and ordered a beer.
You held your breath and the door for me. (Alanis Morissette)
• Computational issues
• Representing the possible meanings of words, and their frequencies and their indications.
• Representing semantic relations between words.
• Maintaining adequate context.
64
[Concordance lines for plant: one group of contexts is about plant life (plant and animal life, aquatic plant life, plant species, plant cells and tissue); the other is about manufacturing plants (assembly and chemical manufacturing plants, the Nissan car and truck plant, plant closures).]
Decision list for plant
LogL  Collocation                        Sense
8.10  plant life                         → A
7.58  manufacturing plant                → B
7.39  life (within ±2-10 words)          → A
7.20  manufacturing (in ±2-10 words)     → B
6.27  animal (within ±2-10 words)        → A
4.70  equipment (within ±2-10 words)     → B
4.39  employee (within ±2-10 words)      → B
4.30  assembly plant                     → B
4.10  plant closure                      → B
3.52  plant species                      → A
3.48  automate (within ±2-10 words)      → B
3.45  microscopic plant                  → A
...
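The decision list is applied by testing collocations in order of decreasing log-likelihood and letting the first match decide the sense; a toy sketch (mine, not Yarowsky’s code) with a few of the rules above:

```python
# Illustrative only: apply a decision list to disambiguate "plant".
DECISION_LIST = [                    # (LogL, test, sense), already sorted by LogL
    (8.10, ("phrase", "plant life"), "A"),
    (7.58, ("phrase", "manufacturing plant"), "B"),
    (7.39, ("window", "life"), "A"),
    (7.20, ("window", "manufacturing"), "B"),
    (6.27, ("window", "animal"), "A"),
    (4.30, ("phrase", "assembly plant"), "B"),
]

def disambiguate(tokens, target="plant", window=10):
    """Return sense 'A' (living plant) or 'B' (factory) for the first occurrence of target."""
    text = " ".join(tokens)
    i = tokens.index(target)
    context = tokens[max(0, i - window): i + window + 1]
    for _logl, (kind, colloc), sense in DECISION_LIST:
        if kind == "phrase" and colloc in text:
            return sense
        if kind == "window" and colloc in context:
            return sense
    return None                      # no rule fires

print(disambiguate("the Nissan assembly plant closed last year".split()))   # -> B
print(disambiguate("microscopic plant life in the water".split()))          # -> A
```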
Syntactic ambiguity 1
Nadia saw the cop with the binoculars.
[S [NP Nadia] [VP [VP [V saw] [NP the cop]] [PP with the binoculars]]]   (Nadia used the binoculars)
[S [NP Nadia] [VP [V saw] [NP [NP the cop] [PP with the binoculars]]]]   (the cop had the binoculars)
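Both attachments fall out of an ordinary chart parser given a small context-free grammar; the toy grammar below is mine, written just for this sentence:

```python
# Illustrative only: a toy CFG in which the PP can attach to the VP or to the NP.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
VP -> V NP | VP PP
NP -> 'Nadia' | Det N | NP PP
PP -> P NP
Det -> 'the'
N  -> 'cop' | 'binoculars'
V  -> 'saw'
P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
tokens = "Nadia saw the cop with the binoculars".split()
for tree in parser.parse(tokens):   # two parses: VP attachment vs. NP attachment
    print(tree)
```

The parser enumerates both trees; choosing between them is the disambiguation problem discussed in the following slides.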
67
Syntactic ambiguity 2
Put the book [in the box] [on the table].
Put the book [in [the box [on the table]]].
Put the book in [the [[red book] box]].
Visiting relatives can be trying.
[Adj Noun] noun phrase (relatives who are visiting)  vs.  [Verb Noun] verb phrase (the act of visiting relatives)
68
Syntactic ambiguity 3
• These are absolutely everywhere. Some real headlines:
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Stolen Painting Found by Tree
Clinton Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
Ban on Nude Dancing on Governor’s Desk
• Usually we don’t even notice – we’re that good at this kind of resolution.
69
Syntactic ambiguity 4
• Most syntactic ambiguity is local — resolved by syntactic or semantic context.
Visiting relatives is trying.
Visiting relatives are trying.
Nadia saw the cop with the gun.
• Sometimes, resolution comes too fast!
The cotton clothing is made from comes from Mississippi.
(mis-parse: [The cotton clothing] [is made from] ???? ; intended parse: [[The cotton] [clothing is made from]] [comes [from Mississippi]])
“Garden-path” sentences.
70
Syntactic ambiguity 5
• Computational issues
• Representing the possible combinatorial structure of words.
• Capturing syntactic preferences and frequencies.
• Devising incremental parsing algorithms.
71
Semantic ambiguity
• Sentence can have more than one meaning, even when the words and structure are agreed on.
Nadia wants a dog like Ross’s.
Everyone here speaks two languages.
Iraqi Head Seeks Arms.
DCS Undergrads Make Nutritious Snacks.
72
Pragmatic ambiguity
• A sample dialogue
Nadia: Do you know who’s going to the party?
Emily: Who?
Nadia: I don’t know.
Emily: Oh ... I think Carol and Amy will be there.
• Computational issues
• Representing intentions and beliefs.
• Planning and plan recognition.
• Inferencing and diagnosis.
73
Need for domain knowledge 1
Derivatization of the carboxyl function of retinoic acid by fluorescent or electroactive reagents prior to liquid chromatography was studied. Ferrocenylethylamine was synthesized and could be coupled to retinoic acid. The coupling reaction involved activation by diphenylphosphinyl chloride. The reaction was carried out at ambient temperature in 50 min with a yield of ca. 95%. The derivative can be detected by coulometric reduction (+100 mV) after on-line coulometric oxidation (+400 mV). The limit of detection was 1 pmol of derivative on-column, injected in a volume of 10μl, but the limit of quantification was 10 pmol of retinoic acid.
S. El Mansouri, M. Tod, M. Leclercq, M. Porthault, J. Chalom, “Precolumn derivatization of retinoic acid for liquid chromatography with fluorescence and coulometric detection.” Analytica Chimica Acta, 293(3), 29 July 1994, 245–250.
74
Need for domain knowledge 2
In doing sociology, lay and professional, every reference to the “real world”, even where the reference is to physical or biological events, is a reference to the organized activities of everyday life. Thereby, in contrast to certain versions of Durkheim that teach that the objective reality of social facts is sociology’s fundamental principle, the lesson is taken instead, and used as a study policy, that the objective reality of social facts as an ongoing accomplishment of the concerted activities of daily life, with the ordinary, artful ways of that accomplishment being by members known, used, and taken for granted is, for members doing sociology, a fundamental phenomenon.
Harold Garfinkel, Preface, Studies in Ethnomethodology, Prentice-Hall, 1967, page vii.
75
Focus of this course 1
• Grammars and parsing.
• Resolving "syntactic" ambiguities.
• Determining "argument structure."
• Lexical semantics, resolving word-sense ambiguities.
• “Compositional” semantics.
• Understanding pronouns.
76
Focus of this course 2
• Current methods
• Integrating statistical knowledge into grammars and parsing algorithms.
• Using text corpora as sources of linguistic knowledge.
77
Not included
• Machine translation, language models, text classification, part-of-speech tagging...*
• Graph-theoretic and spectral methods%
• Speech recognition and synthesis*¶
• Cognitively based methods§
• Semantic inference,% semantic change/drift^
• Understanding dialogues and conversations¶
• Bias, fake news detection, ethics in NLP$
* CSC 401 / 2511. % CSC 2517. ¶ CSC 2518. § CSC 2540. ^ CSC 2519. $ CSC 2528.
78
What about “Deep Learning?”
• Yes, we’ll definitely cover neural methods.
• Until recently, deep learning has been more of a euphemism in NLP – the depth of the networks still hasn’t paid off to the same extent that it has in some other areas.
• It would be more accurate to call what we do “fat learning” – what seems to matter more is the ability of the network to take earlier/later input into account.
• But deep/fat learning hasn’t solved all of our problems...
79
Grammaticality
80
“well formed; in accordance with the productive rules of the grammar of a language”
- lexico.com (Oxford)
From grammatical, “of or pertaining to grammar”
16th century: ≈ literal
18th century: a state of linguistic purity
19th century: relating to mere arrangement of words, as opposed to logical form or structure
Grammaticality vs. Probability
81
“I think we are forced to conclude that ... probabilistic models give no particular insight into some of the basic problems of syntactic structure.”
- Chomsky (1957)
Grammaticality vs. Probability (Chomsky, 1955)
colorless green ideas sleep furiously
furiously sleep ideas green colorless
82
Grammaticality vs. Probability (Pereira, 2000)
colorless green ideas sleep furiously (-40.44514457)
furiously sleep ideas green colorless (-51.41419769)
This is not only a probabilistic model, but a probabilistic language model
83
(Agglomerative Markov Process, Saul & Pereira, 1997).
Language Modelling (Shannon, 1951; Jelinek, 1976)
84
w_i = argmax_w P(w | w_1 ... w_{i-1})
Examples:
Athens is the capital __
Athens is the capital of __
What do you need to know to predict the first? What do you need to know to predict the second?
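A minimal sketch (mine, with a toy corpus) of the prediction rule above, using bigram counts:

```python
# Illustrative only: predict the next word as argmax_w P(w | previous word),
# with probabilities estimated from bigram counts in a toy corpus.
from collections import Counter, defaultdict

corpus = "athens is the capital of greece and toronto is the largest city of canada".split()

bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def predict_next(prev_word):
    """Return the most probable next word given only the previous word."""
    following = bigram_counts[prev_word]
    if not following:
        return None
    word, count = following.most_common(1)[0]
    return word, count / sum(following.values())

print(predict_next("capital"))   # ('of', 1.0) in this toy corpus
print(predict_next("the"))       # whichever word follows 'the' most often
```

Filling the first blank needs only local statistics, since “of” reliably follows “capital”; filling the second needs the longer history, and ultimately world knowledge about Athens, which a short n-gram window cannot supply.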
Grammaticality vs. Probability (Pereira, 2000)
colorless green ideas sleep furiously (-40.44514457)
furiously sleep ideas green colorless (-51.41419769)
This is not only a probabilistic model, but a probabilistic language model
85
(Agglomerative Markov Process, Saul & Pereira, 1997).
86
colorless sleep green ideas furiously (-39.5588693)
colorless ideas furiously green sleep
colorless sleep furiously green ideas
colorless green ideas sleep furiously (-40.44514457)
furiously sleep ideas green colorless (-51.41419769)
green furiously colorless ideas sleep
green ideas sleep colorless furiously (-51.69151925)
Point-Biserial Correlations
• Grammaticality taken to be a binary variable (yes/no).
• The probability produced by a language model for a string of words is continuous.
• Point-biserial correlations:
• M1 = mean of the continuous values assigned to samples that received the positive binary value.
• M0 = mean of the continuous values assigned to the samples that received the negative binary value.
• Sn = standard dev. of all samples’ continuous values.
• p = Proportion of samples with negative binary value.
• q = Proportion of samples with positive binary value.
87
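Assembling the quantities above gives the point-biserial correlation r_pb = ((M1 − M0) / Sn) · sqrt(p · q). A minimal sketch (mine; scipy.stats.pointbiserialr computes the same statistic):

```python
# Illustrative only: point-biserial correlation between binary grammaticality
# judgements and continuous language-model scores.
import math

def point_biserial(binary, scores):
    """binary: list of 0/1 grammaticality judgements; scores: LM scores (floats)."""
    pos = [s for b, s in zip(binary, scores) if b == 1]
    neg = [s for b, s in zip(binary, scores) if b == 0]
    n = len(scores)
    m1 = sum(pos) / len(pos)              # mean score of the grammatical samples
    m0 = sum(neg) / len(neg)              # mean score of the ungrammatical samples
    mean = sum(scores) / n
    sn = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)   # population std. dev.
    q = len(pos) / n                      # proportion with the positive label
    p = len(neg) / n                      # proportion with the negative label
    return (m1 - m0) / sn * math.sqrt(p * q)

# e.g. 27 grammatical vs. 93 ungrammatical permutations, each scored by a language model:
# r = point_biserial(judgements, logprobs)
```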
Corrected Point-biserial Correlations with CGISF Permutation Judgements
88
• Grammaticality taken to be a discrete variable.
• Two linguists scored then adjudicated the permutations (27 grammatical, 93 ungrammatical).
Corrected Point-biserial Correlations with CoLA (Warstadt et al., 2018)
89
• 10,657 (English) examples taken from linguistics papers.
• Roughly 71% of their development set labelled positively.
What about GPT-2?
OpenAI’s GPT-2 has been promoted as “an AI” that exemplifies an emergent understanding of language after mere unsupervised training on about 40GB of webpage text. It sounds really convincing in interviews:
• Q: Which technologies are worth watching in 2020?
A: I would say it is hard to narrow down the list. The world is full of disruptive technologies with real and potentially huge global impacts. The most important is artificial intelligence, which is becoming exponentially more powerful. There is also the development of self-driving cars. There is a lot that we can do with artificial intelligence to improve the world....
• Q: Are you worried that ai [sic] technology can be misused?
A: Yes, of course. But this is a global problem and we want to tackle it with global solutions....
– “AI can do that”, The World in 2020, The Economist
Surely something this sophisticated can predict grammaticality, right?
90
Wrong
Model:           GPT-2 small                  GPT-2 XL
Normalization:   raw           normalized     raw           normalized
Score:           log    exp    log     exp    log    exp    log     exp
PB               0.15   0.012  0.224   0.159  0.184  0.012  0.25    0.164
Adj. Becker PB   0.161  0.014  0.244   0.174  0.201  0.013  0.272   0.18
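The “raw”/“normalized” and “log”/“exp” scores above refer to how GPT-2’s probability for a string is reported. A sketch of computing such scores with the Hugging Face transformers library (my choice of tooling, not necessarily the setup behind these numbers):

```python
# Illustrative only: score a string with GPT-2, raw (summed) or length-normalized.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")          # "gpt2-xl" for the XL model
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def score(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the returned loss is the mean cross-entropy over the
        # n-1 predicted token positions.
        loss = model(ids, labels=ids).loss.item()
    n = ids.size(1)
    return {"raw log-prob": -loss * (n - 1), "normalized log-prob": -loss}

print(score("colorless green ideas sleep furiously"))
print(score("furiously sleep ideas green colorless"))
```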
91
• Should conclusions about grammaticality be based upon scientific experimentation or self-congratulatory PR stunts?
• People are very good at attributing interpretations to natural phenomena that defy interpretation.
Control for lexical choice, context, case, punctuation...
[Table: point-biserial correlations between grammaticality and GPT-2 / GPT-2 xl scores (raw vs. averaged normalization; log vs. exp scores) when CoLA is broken down by group, reporting the median, standard deviation, maximum (1.00), minimum (−1.00), and upper and lower breakpoints of the per-group correlations.]
92
• Breaking down CoLA data by group in publication helps: that partially controls for lexical choice and context.
• But the correlations are substantially lower if we confine ourselves to groupings of size greater than 4: 67% are of size 2–4, with 29% of size 2, on which there is a 50% chance of guessing correctly.
• Correlations are also lower if we downcase the text and strip away punctuation in every model except Pereira’s, which was trained on such text. Should a grammaticality test care about this?
Legitimate Points of Concern
93
• Is grammaticality really a discrete variable?
• Several have argued that a presumed correlation between neural language models and grammaticality suggests that grammaticality should be viewed as gradient (Lau et al., 2017; Sprouse et al., 2018).
• Eliciting grammaticality ≠ blindly probing the elephant.
• Numerous papers on individual features of grammaticality (Linzen et al., 2016; Bernardy & Lappin, 2017; Gulordava et al., 2018).
• How do you sample grammaticality judgements?
• Acceptability judgements (Sprouse & Almeida 2012; Sprouse et al., 2013) are not quite the same thing – experimental subjects can easily be misled by interpretability.
• Round-trip machine translation of grammatical sentences for generating ungrammatical strings (Lau et al., 2014; 2015).
Grammaticality vs. Interpretability
95
We sampled the BNC by:
1) Using the 27 grammatical part-of-speech tag sequences from the CGISF permutations,
2) Using ClausIE (Del Corro and Gemulla, 2013) to ensure that our sequences exactly matched 5-word clauses.
This resulted in 36 5-word clauses from the BNC that are both grammatical and interpretable.
Corrected Point-biserial Correlations with BNC/CGISF provenance
[Table: corrected point-biserial correlations, with log and exp scores, for Pereira’s model (trained on the BNC; “colorless” vs. “colourless”), Mikolov’s model, QRNN (regular and TAS; trained on the BNC and WT-103), GPT, GPT-2 and GPT-2 XL; the values shown range from −0.39 to 0.90.]
96
Is Grammaticality Testing Beyond our Grasp?
97
There are more successful methods (Warstadt et al., 2018; Liu et al., 2019; Lan et al., 2019; Raffel et al., 2019), but:
1) They are supervised (unlike most language models),
2) They are trained on CoLA (which has been split into training, development and test sets),
3) Many use neural networks,
4) But they clamp on a classifier that takes softmax outputs as inputs (and so there is more there than a language model).
The results:
• Accuracy: between 65% and 71% (not everyone reports this)
• MCC: 0.253–0.7
• But uniformly guessing “grammatical” on CoLA has 71% accuracy, also.
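MCC here is the Matthews correlation coefficient, which, unlike accuracy, is not flattered by the 71/29 class imbalance. A small sketch (mine, using scikit-learn) making the last point concrete:

```python
# Illustrative only: on a 71%-grammatical test set, the trivial "always grammatical"
# classifier already gets 71% accuracy, but its MCC is 0.
from sklearn.metrics import accuracy_score, matthews_corrcoef

gold = [1] * 71 + [0] * 29        # 71% labelled grammatical, as in CoLA's dev set
always_yes = [1] * 100

print(accuracy_score(gold, always_yes))      # 0.71
print(matthews_corrcoef(gold, always_yes))   # 0.0
```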
Future Work
98
• Grammaticality prediction remains a practical challenge.
• It is important, even for practical considerations, e.g. grammar checking.
• It is hard to imagine that the next breakthrough in accuracies/correlations of grammaticality prediction would not use statistical modelling of some kind.
• But it also seems unlikely that that breakthrough would come solely from a statistical language model.
• Language models have developed the way that they have because making sense seems to be more important than being grammatical.