Introduction to Weak Supervision
Chris Ré
CS229
Messages for Today
Introduce two key concepts:
• Method of moments for latent-variable probabilistic models
  • Provable global solutions (compare with EM methods)
  • Widely used in "tensor methods"
• Probability distributions on graphs (graphical models)
  • Fun facts about Gaussians that are good for your soul (inverse covariance matrix structure and graphs)
High-level overview of a new area called weak supervision:
• Why supervision is so critical in this age, and its (nascent) resources
• Very recent work, biased toward our own group's work, but you have likely used it today!
Various techniques for limited labeled data
• Active learning: Select points to label more intelligently
• Semi-supervised learning: Use unlabeled data as well
• Transfer learning: Transfer from one training dataset to a new task
• Weak supervision: Label data in cheaper, higher-level ways (this lecture)
https://www.snorkel.org/blog/weak-supervision
Related Work in Weak Supervision
• Crowdsourcing: Dawid & Skene 1979, Karger et al. 2011, Dalvi et al. 2013, Ruvolo et al. 2013, Zhang et al. 2014, Berend & Kontorovich 2014, etc.
• Distant Supervision: Mintz et al. 2009, Alfonseca et al. 2012, Takamatsu et al. 2012, Roth & Klakow 2013, Augenstein et al. 2015, etc.
• Co-Training: Blum & Mitchell 1998
• Noisy Learning: Bootkrajang et al. 2012, Mnih & Hinton 2012, Xiao et al. 2015, etc.
• Indirect Supervision: Clarke et al. 2010, Guu et al. 2017, etc.
• Feature and Class-distribution Supervision: Zaidan & Eisner 2008, Druck et al. 2009, Liang et al. 2009, Mann & McCallum 2010, etc.
• Boosting & Ensembling: Schapire & Freund, Platanios et al. 2016, etc.
• Constraint-Based Supervision: Bilenko et al. 2004, Koestinger et al. 2012, Stewart & Ermon 2017, etc.
• Propensity SVMs: Joachims 2017
More Related work
• So much more! This work was inspired by classics and newer ideas: co-training, GANs, capsule networks, semi-supervised learning, crowdsourcing, and much more!
• Please see the blog for a summary: https://www.snorkel.org/blog/weak-supervision
…biased by ongoing work…
ML Application =
Model + Data + Hardware
State-of-the-art models and hardware are available. Training data is not.
Supervision, in theory, comes from god herself…
…but training data usually comes from a dirty, messy process.
Can we provide mathematical and systems structure for this messy process?
Supervision is where the action is…
Model differences are overrated; supervision differences are underrated.
J. Dunnmon, D. Yi, C. Langlotz, C. Ré, D. Rubin, M. Lungren. "Assessing Convolutional Neural Networks for Automated Radiograph Triage." Radiology, 2019.
Model          Test Accuracy
BOVW + KSVM    0.88
AlexNet        0.87
ResNet-18      0.89
DenseNet-121   0.91
We spent a year on this challenge:
• Created a large dataset of clinical labels
• Evaluated the effect of label quality
• Work published in a clinical journal
Often: Differences in models ~ 2-3 points.
Label quality & quantity > model choice.
Data augmentation by specifying invariances
Images:
• Rotations
• Scaling / zooms
• Brightness
• Color shifts
• Etc.

Text:
• Synonymy
• Positional swaps
• Etc.

Medical: domain-specific transformations. Ex:
1. Segment tumor mass
2. Move
3. Resample background tissue
4. Blend

How do we choose which to apply? In what order?
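To make this concrete, here is a minimal sketch (my illustration, not the lecture's implementation) of applying a random sequence of user-specified invariance transformations to an image; the specific transforms and parameter ranges are hypothetical:

import random
from PIL import Image, ImageEnhance

# Hypothetical pool of user-specified invariances (ranges are illustrative).
def rotate(img):
    return img.rotate(random.uniform(-15, 15))

def brighten(img):
    return ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))

def zoom(img):
    w, h = img.size
    f = random.uniform(1.0, 1.2)
    big = img.resize((int(w * f), int(h * f)))
    left, top = (big.width - w) // 2, (big.height - h) // 2
    return big.crop((left, top, left + w, top + h))

TRANSFORMS = [rotate, brighten, zoom]

def augment(img, n_ops=2):
    # Apply a random subset of the transforms, in a random order.
    for op in random.sample(TRANSFORMS, n_ops):
        img = op(img)
    return img

Learned augmentation policies (e.g., the AutoAugment-style work mentioned below) replace this uniform random choice with a learned distribution over transformation sequences.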
Simple Benchmarks: Data Augmentation is Critical
Ex: 13.4 pt. average accuracy gain from data augmentation across the top ten CIFAR-100 models; the difference among the top-10 models is less than that!
Training Signal is key to pushing SotA
New methods for gathering training signal are leading the state of the art; there is lots of exciting ML progress here (SotA due to a noisy teacher!)
• Google AutoAugment: using learned data augmentation policies
  • Augmentation policies first in Ratner et al., NIPS '17
• Facebook hashtag weakly supervised pre-training
  • Pre-train using a massive dataset with hashtags
Henry Ehrenberg; Alex Ratner (to: Washington); Sharon Y. Li (to: Wisconsin)
Check out Sharon's series on hazyresearch.stanford.edu:
http://ai.stanford.edu/blog/data-augmentation/
Training data: the new bottleneck
Manual labels: slow, expensive, and static.
• Slow and expensive: $10 – $100/hr of annotator time
• Static: moving from {Positive, Negative} to {Positive, Neutral, Negative} over time means re-labeling from scratch
Programmatic labels: fast, cheap, and dynamic.
• Cheap: roughly $0.10/hr
• Dynamic: write programs → run programs → labels; re-run as the task changes over time
Trade-off: programmatic labels are noisy…
Snorkel: Formalizing Programmatic Labeling
Weak supervision has historically taken many forms:
• Pattern matching [e.g. Hearst 1992, Snow 2004]: regex.match(r"{A} is caused by {B}")
• Distant supervision [e.g. Mintz 2009]: heuristically labeled subsets A, B, C from an external knowledge base
• Third-party models [e.g. Schapire 1998]
• Augmentation: "Change abbreviated names, and replace…"
• Topic models [e.g. Hingmire 2014]
• Crowdsourcing [e.g. Dalvi 2013]

Observation: Weak supervision is applied in ad hoc and isolated ways.
Snorkel: Formalizing Programmatic Labeling
UNLABELED DATA + WEAK SUPERVISION SOURCES → PROBABILISTIC LABELS
Sources range over, e.g., heuristics ("If A is mentioned in the same…"), labeled subsets A, B, C, and patterns like regex.match(r"{A} is caused by {B}").

Goal: Replace ad hoc weak supervision with a formal, unified, theoretically grounded approach for programmatic labeling.
The Real Work
Stephen Bach, Braden Hancock, Henry Ehrenberg, Alex Ratner, Paroma Varma
Snorkel.org
Running Example: NER
Dr. Bob Jones [PER:DOCTOR] is a specialist in cardiomyopathy treatment, leading the cardiology division at Saint Francis [ORG:HOSPITAL].
Goal: Label training data using weak supervision strategies for these tasks
Let’s look at labeling “Person” versus “Hospital”
Weak Supervision as Labeling Functions
def existing_classifier(x):
    return off_shelf_classifier(x)

def upper_case_existing_classifier(x):
    if all(map(is_upper, x.split())) and \
       off_shelf_classifier(x) == "PERSON":
        return "PERSON"

def is_in_hospital_name_DB(x):
    if x in HOSPITAL_NAMES_DB:
        return "HOSPITAL"

On the example, the first two sources vote "PERSON" on Dr. Bob Jones, and the third votes "HOSPITAL" on Saint Francis:
Dr. Bob Jones is a specialist in cardiomyopathy treatment, leading the cardiology division at Saint Francis.
Problem: These noisy sources conflict and are correlated
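A minimal sketch of how such labeling functions can be run over unlabeled candidates to produce a label matrix; the stubs for off_shelf_classifier, is_upper, and HOSPITAL_NAMES_DB are hypothetical stand-ins for the resources named above:

# Label constants: labeling functions may also abstain.
ABSTAIN, PERSON, HOSPITAL = -1, 0, 1

HOSPITAL_NAMES_DB = {"Saint Francis"}          # stub knowledge base

def off_shelf_classifier(x):                   # stub third-party model
    return "PERSON" if x.startswith("Dr.") else "HOSPITAL"

def is_upper(word):
    return word[:1].isupper()

def lf_existing_classifier(x):
    return PERSON if off_shelf_classifier(x) == "PERSON" else ABSTAIN

def lf_upper_case(x):
    if all(map(is_upper, x.split())) and off_shelf_classifier(x) == "PERSON":
        return PERSON
    return ABSTAIN

def lf_hospital_db(x):
    return HOSPITAL if x in HOSPITAL_NAMES_DB else ABSTAIN

lfs = [lf_existing_classifier, lf_upper_case, lf_hospital_db]
spans = ["Dr. Bob Jones", "Saint Francis"]
# Label matrix: one row per candidate, one column per labeling function.
L = [[lf(x) for lf in lfs] for x in spans]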
The Snorkel Pipeline
LABELING FUNCTIONS:

def LF_short_report(x):
    if len(x.words) < 15:
        return "NORMAL"

def LF_off_shelf_classifier(x):
    if off_shelf_classifier(x) == 1:
        return "NORMAL"

def LF_pneumo(x):
    if re.search(r"pneumo.*", x.text):
        return "ABNORMAL"

def LF_ontology(x):
    if DISEASES & x.words:
        return "ABNORMAL"

[Figure: label model graph relating the LF outputs λ1…λ4 to the latent label Y]

1. Users write labeling functions to generate noisy labels (LABELING FUNCTIONS)
2. Snorkel models and combines the noisy labels into probabilities (LABEL MODEL → PROBABILISTIC TRAINING DATA)
3. The resulting probabilistic labels train a model (END MODEL)

KEY IDEA: Each probabilistic training point carries an accuracy. No hand-labeled data needed.
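For reference, a hedged sketch of what this pipeline looks like with the open-source Snorkel package (snorkel.org, v0.9-era API); the labeling functions and toy reports are illustrative stand-ins for the radiology example above:

import re
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1

@labeling_function()
def lf_short_report(x):
    return NORMAL if len(x.text.split()) < 15 else ABSTAIN

@labeling_function()
def lf_pneumo(x):
    return ABNORMAL if re.search(r"pneumo", x.text) else ABSTAIN

df = pd.DataFrame({"text": [
    "Clear lungs. No acute cardiopulmonary disease.",
    "Findings are consistent with a left-sided pneumothorax.",
]})

# Step 1: apply the labeling functions to get the label matrix L.
L = PandasLFApplier([lf_short_report, lf_pneumo]).apply(df)

# Step 2: the label model estimates LF accuracies and combines votes.
label_model = LabelModel(cardinality=2)
label_model.fit(L)

# Step 3: probabilistic labels, used to train the end model.
probs = label_model.predict_proba(L)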
People use it...
http://snorkel.org
"Snorkel DryBell": a collaboration with Google Ads. Bach et al., SIGMOD 2019.
Used in production in many industries, startups, and other tech companies!
Collaboration Highlight: Google + Snorkel
• Snorkel DryBell is a production version of Snorkel focused on:
  • Using organizational knowledge resources to train ML models
  • Handling web-scale data
  • Non-servable to servable feature transfer
• Weak supervision sources come from organizational resources: custom taggers & classifiers, web crawlers, semantic categorization, the knowledge graph, aggregate statistics, and heuristics & rules (e.g., Pattern(p_1), Pattern(p_2)), executed on production infrastructure and combined by the label model into probabilistic labels.
[Bach et al., SIGMOD 2019]
Thank you, Google! Even the best-funded teams…
Maybe you have used it?
It has changed real systems…
https://arxiv.org/abs/1909.05372 (CIDR 2020)
A couple of highlights
• Used by multiple teams with good error reduction over production.
• Take away: many systems are almost entirely weak supervision based.
Weak Supervision in Science & Medicine
Cross-Modal Weak Supervision
J. Dunnmon et al., “Cross-Modal Data Programming Enables Rapid Medical Machine Learning,” 2020.
Blog: http://hazyresearch.stanford.edu/ws4science
Text & Extraction
A. Callahan et al., NPJ Dig Med, 2020
V. Kuleshov et al., Nat Comms, 2019
Imaging & Diagnostics
J. Fries et al., Nat Comms, 2019
J. Dunnmon et al., Radiology, 2019
K. Saab et al., NPJ Dig Med, 2020
High-Level Related Work
Alex Ratner (to Washington), Fred Sala (to Wisconsin)
Let's look under the hood and take a peek at some math (to the whiteboard soon…)
The Snorkel Pipeline
def lf_1(x):
    return per_heuristic(x)

def lf_2(x):
    return doctor_pattern(x)

def lf_3(x):
    return hosp_classifier(x)

[Figure: task graph and label model relating λ1, λ2, λ3 to the latent task labels Y1, Y2, Y3]

1. Users write labeling functions for multiple related tasks (MULTI-TASK LABELING FUNCTIONS + TASK GRAPH)
2. We model the labeling functions' behavior to de-noise them (LABEL MODEL → PROBABILISTIC TRAINING DATA)
3. We use the probabilistic labels to train a multi-task model (MULTI-TASK MODEL)
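One standard way to use probabilistic labels for training the end model (a sketch of the noise-aware loss described in the Snorkel papers; PyTorch here for concreteness) is to minimize the expected cross-entropy under the label model's posterior rather than a hard-label loss:

import torch
import torch.nn.functional as F

def noise_aware_loss(logits, probs):
    # Expected cross-entropy of the end model's predictions under the
    # label model's probabilistic labels (i.e., soft cross-entropy).
    return -(probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Usage: probs come from the label model; logits from the end model.
# loss = noise_aware_loss(end_model(x_batch), probs_batch)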
How can we do anything without the ground truth labels?
Model as Generative Process
def existing_classifier(x):
    return off_shelf_classifier(x)

def upper_case_existing_classifier(x):
    if all(map(is_upper, x.split())) and \
       off_shelf_classifier(x) == "PERSON":
        return "PERSON"

def is_in_hospital_name_DB(x):
    if x in HOSPITAL_NAMES_DB:
        return "HOSPITAL"

[Figure: generative model with the latent true label Y and the observed LF votes λ1 = "PERSON", λ2 = "PERSON", λ3 = "HOSPITAL"]

Later: We will define what this picture means precisely.
How do we learn the parameters of this model (accuracies & correlations) without Y?
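To make the picture concrete before the whiteboard, here is a toy sketch of the generative process it depicts, under the simplifying assumptions (mine, for illustration) that the sources never abstain and are conditionally independent given Y:

import numpy as np

rng = np.random.default_rng(0)

def sample_votes(n, accuracies):
    # Latent true label Y in {-1, +1}, never observed by the learner.
    Y = rng.choice([-1, 1], size=n)
    L = np.empty((n, len(accuracies)), dtype=int)
    for i, a in enumerate(accuracies):
        # Source i agrees with Y with probability a, independently given Y.
        agree = rng.random(n) < a
        L[:, i] = np.where(agree, Y, -Y)
    return Y, L

Y, L = sample_votes(100_000, accuracies=[0.9, 0.7, 0.6])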
Intuition: Learn from the Overlaps
The sources vote on many data points:

       Source 1    Source 2     Source 3
x1:    "PERSON"    "PERSON"     "HOSPITAL"
x2:    "PERSON"    "HOSPITAL"   "HOSPITAL"
...

def existing_classifier(x):
    return off_shelf_classifier(x)

def upper_case_existing_classifier(x):
    if all(map(is_upper, x.split())) and \
       off_shelf_classifier(x) == "PERSON":
        return "PERSON"

def is_in_hospital_name_DB(x):
    if x in HOSPITAL_NAMES_DB:
        return "HOSPITAL"
Key idea: We observe agreements (+1) and disagreements (-1) on many points! (More later!)
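A hedged sketch of where the agreements and disagreements lead (the method-of-moments "triplet" trick; assumptions as in the toy generative model above): for λi, Y in {-1, +1} and conditionally independent sources, E[λi λj] = E[λi Y] · E[λj Y], so observed pairwise agreement rates determine each source's accuracy, up to sign, without ever seeing Y:

import numpy as np

def estimate_accuracies(L):
    # O[i, j] estimates E[lambda_i * lambda_j] from observed votes alone.
    n, m = L.shape
    assert m == 3, "triplet trick shown for exactly three sources"
    O = (L.T @ L) / n
    acc = np.empty(m)
    for i in range(m):
        j, k = [t for t in range(m) if t != i]
        # E[li Y]^2 = E[li lj] * E[li lk] / E[lj lk]
        a_i = np.sqrt(O[i, j] * O[i, k] / O[j, k])
        acc[i] = (a_i + 1) / 2        # convert E[li Y] to P(li == Y)
    return acc

# With Y, L from the sketch above: estimate_accuracies(L) is close to
# [0.9, 0.7, 0.6], recovered without ever using Y.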