Named_Entity_Extraction_Tutorial
Named Entity Extraction Tutorial¶
This tutorial is a slight modification of the tutorial by Sam Galen.
In [1]:
from __future__ import print_function
from sklearn.metrics import confusion_matrix
import io
import nltk
import scipy
import codecs
import sklearn
import pycrfsuite
import pandas as pd
from itertools import chain
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
print(‘sklearn version:’, sklearn.__version__)
print(‘Libraries succesfully loaded!’)
sklearn version: 0.19.2
Libraries succesfully loaded!
In [22]:
def sent2features(sent, feature_func):
return [feature_func(sent, i) for i in range(len(sent))]
def sent2labels(sent):
#print(‘sent’, sent)
return [s[-1] for s in sent]
def sent2tokens(sent):
return [s[0] for s in sent]
def bio_classification_report(y_true, y_pred):
“””
Classification report for a list of BIO-encoded sequences.
It computes token-level metrics and discards “O” labels.
Note that it requires scikit-learn 0.15+ (or a version from github master)
to calculate averages properly!
“””
lb = LabelBinarizer()
y_true_combined = lb.fit_transform(y_true)
y_pred_combined = lb.transform(y_pred)
tagset = set(lb.classes_) – {‘O’}
tagset = sorted(tagset, key=lambda tag: tag.split(‘-‘, 1)[::-1])
class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
return classification_report(
y_true_combined,
y_pred_combined,
labels = [class_indices[cls] for cls in tagset],
target_names = tagset,
)
def generate_kaggle_res_file(ids, labels, file_path):
“””
Generate result file for submitting to Kaggle.
ids – the id for the tokens in test file
should be in the same order as test file
labels – the predictted label for each token
file_path – the path includes the filename where
you want to save the result
“””
with open(file_path, ‘w’) as res_file:
res_file.write(‘id,label\n’)
for i,l in zip(ids, labels):
res_file.write(‘{},{}\n’.format(i,l))
def word2simple_features(sent, i):
”’
This makes a simple baseline.
You can add and/or remove features to get (much?) better results.
Experiment with it as you will need to do this for assignment.
”’
word = sent[i][0]
features = {
‘bias’: 1.0,
‘word.lower()’: word.lower(),
‘word[-2:]’: word[-2:],
}
if i == 0:
features[‘BOS’] = True
if i == len(sent)-1:
features[‘EOS’] = True
return features
# load data and preprocess
def extract_data(path):
“””
Extracting data from train file or test file.
path – the path of the file to extract
return:
res – a list of sentences, each sentence is a
a list of tuples. For train file, each tuple
contains token and label. For test file, each
tuple only contains token.
ids – a list of ids for the corresponding token. This
is mainly for Kaggle submission.
“””
#with open(path) as file:
file = io.open(path, mode=”r”, encoding=”utf-8″)
next(file)
res = []
ids = []
sent = []
for line in file:
if line != ‘\n’:
parts = line.strip().split(‘ ‘)
sent.append(tuple(parts[1:]))
ids.append(parts[0])
else:
res.append(sent)
sent = []
return res, ids
Build a NER classifier¶
Load data and extract features¶
In [23]:
# Load train and test data
train_data, train_ids = extract_data(‘train’)
test_data, test_ids = extract_data(‘test’)
# Load true labels for test data
test_labels = list(pd.read_csv(‘test_ground_truth’).loc[:, ‘label’])
print(‘Train and Test data upload succesfully!’)
# Feature extraction using the word2simple_features function
train_features = [sent2features(s, feature_func=word2better_features_tag) for s in train_data]
train_labels = [sent2labels(s) for s in train_data]
test_features = [sent2features(s, feature_func=word2better_features_tag) for s in test_data]
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(train_features, train_labels):
trainer.append(xseq, yseq)
print(‘Feature Extraction done!’)
# Explore the extracted features
sent2features(train_data[0], word2simple_features)
Train and Test data upload succesfully!
Feature Extraction done!
Out[23]:
[{‘BOS’: True,
‘bias’: 1.0,
‘word.lower()’: u’tambi\xe9n’,
‘word[-2:]’: u’\xe9n’},
{‘bias’: 1.0, ‘word.lower()’: u’el’, ‘word[-2:]’: u’el’},
{‘bias’: 1.0, ‘word.lower()’: u’secretario’, ‘word[-2:]’: u’io’},
{‘bias’: 1.0, ‘word.lower()’: u’general’, ‘word[-2:]’: u’al’},
{‘bias’: 1.0, ‘word.lower()’: u’de’, ‘word[-2:]’: u’de’},
{‘bias’: 1.0, ‘word.lower()’: u’la’, ‘word[-2:]’: u’la’},
{‘bias’: 1.0, ‘word.lower()’: u’asociaci\xf3n’, ‘word[-2:]’: u’\xf3n’},
{‘bias’: 1.0, ‘word.lower()’: u’espa\xf1ola’, ‘word[-2:]’: u’la’},
{‘bias’: 1.0, ‘word.lower()’: u’de’, ‘word[-2:]’: u’de’},
{‘bias’: 1.0, ‘word.lower()’: u’operadores’, ‘word[-2:]’: u’es’},
{‘bias’: 1.0, ‘word.lower()’: u’de’, ‘word[-2:]’: u’de’},
{‘bias’: 1.0, ‘word.lower()’: u’productos’, ‘word[-2:]’: u’os’},
{‘bias’: 1.0, ‘word.lower()’: u’petrol\xedferos’, ‘word[-2:]’: u’os’},
{‘bias’: 1.0, ‘word.lower()’: u’,’, ‘word[-2:]’: u’,’},
{‘bias’: 1.0, ‘word.lower()’: u’aurelio’, ‘word[-2:]’: u’io’},
{‘bias’: 1.0, ‘word.lower()’: u’ayala’, ‘word[-2:]’: u’la’},
{‘bias’: 1.0, ‘word.lower()’: u’,’, ‘word[-2:]’: u’,’},
{‘bias’: 1.0, ‘word.lower()’: u’ha’, ‘word[-2:]’: u’ha’},
{‘bias’: 1.0, ‘word.lower()’: u’negado’, ‘word[-2:]’: u’do’},
{‘bias’: 1.0, ‘word.lower()’: u’la’, ‘word[-2:]’: u’la’},
{‘bias’: 1.0, ‘word.lower()’: u’existencia’, ‘word[-2:]’: u’ia’},
{‘bias’: 1.0, ‘word.lower()’: u’de’, ‘word[-2:]’: u’de’},
{‘bias’: 1.0, ‘word.lower()’: u’cualquier’, ‘word[-2:]’: u’er’},
{‘bias’: 1.0, ‘word.lower()’: u’tipo’, ‘word[-2:]’: u’po’},
{‘bias’: 1.0, ‘word.lower()’: u’de’, ‘word[-2:]’: u’de’},
{‘bias’: 1.0, ‘word.lower()’: u’acuerdos’, ‘word[-2:]’: u’os’},
{‘bias’: 1.0, ‘word.lower()’: u’sobre’, ‘word[-2:]’: u’re’},
{‘bias’: 1.0, ‘word.lower()’: u’los’, ‘word[-2:]’: u’os’},
{‘bias’: 1.0, ‘word.lower()’: u’precios’, ‘word[-2:]’: u’os’},
{‘bias’: 1.0, ‘word.lower()’: u’,’, ‘word[-2:]’: u’,’},
{‘bias’: 1.0, ‘word.lower()’: u’afirmando’, ‘word[-2:]’: u’do’},
{‘bias’: 1.0, ‘word.lower()’: u’que’, ‘word[-2:]’: u’ue’},
{‘bias’: 1.0, ‘word.lower()’: u’\xfanicamente’, ‘word[-2:]’: u’te’},
{‘bias’: 1.0, ‘word.lower()’: u’es’, ‘word[-2:]’: u’es’},
{‘bias’: 1.0, ‘word.lower()’: u’la’, ‘word[-2:]’: u’la’},
{‘bias’: 1.0, ‘word.lower()’: u’cotizaci\xf3n’, ‘word[-2:]’: u’\xf3n’},
{‘bias’: 1.0, ‘word.lower()’: u’internacional’, ‘word[-2:]’: u’al’},
{‘bias’: 1.0, ‘word.lower()’: u’la’, ‘word[-2:]’: u’la’},
{‘bias’: 1.0, ‘word.lower()’: u’que’, ‘word[-2:]’: u’ue’},
{‘bias’: 1.0, ‘word.lower()’: u’pone’, ‘word[-2:]’: u’ne’},
{‘bias’: 1.0, ‘word.lower()’: u’de’, ‘word[-2:]’: u’de’},
{‘bias’: 1.0, ‘word.lower()’: u’acuerdo’, ‘word[-2:]’: u’do’},
{‘bias’: 1.0, ‘word.lower()’: u’a’, ‘word[-2:]’: u’a’},
{‘bias’: 1.0, ‘word.lower()’: u’todos’, ‘word[-2:]’: u’os’},
{‘bias’: 1.0, ‘word.lower()’: u’los’, ‘word[-2:]’: u’os’},
{‘bias’: 1.0, ‘word.lower()’: u’pa\xedses’, ‘word[-2:]’: u’es’},
{‘EOS’: True, ‘bias’: 1.0, ‘word.lower()’: u’.’, ‘word[-2:]’: u’.’}]
Explore the classifier parameters¶
In [27]:
trainer.params()
Out[27]:
[‘feature.minfreq’,
‘feature.possible_states’,
‘feature.possible_transitions’,
‘c1’,
‘c2’,
‘max_iterations’,
‘num_memories’,
‘epsilon’,
‘period’,
‘delta’,
‘linesearch’,
‘max_linesearch’]
Set the classifier parameters¶
In [25]:
trainer.set_params({
‘c1’: 100.0, # coefficient for L1 penalty
‘c2’: 1e-3, # coefficient for L2 penalty
‘max_iterations’: 50, # stop earlier
# include transitions that are possible, but not observed
‘feature.possible_transitions’: True
})
Train a NER model¶
In [26]:
%%time
trainer.train(‘ner-esp.model’)
print(‘Training done :)’)
Training done 🙂
CPU times: user 17 s, sys: 12 ms, total: 17 s
Wall time: 17 s
Make predictions with your NER model¶
Make predictions and evaluate your model on the test set.
To use your NER model, create pycrfsuite.Tagger, open the model, and use the “tag” method, as follows:
In [15]:
# Make predictions
tagger = pycrfsuite.Tagger()
tagger.open(‘ner-esp.model’)
test_pred = [tagger.tag(xseq) for xseq in test_features]
test_pred = [s for w in test_pred for s in w]
# Generate Kaggle file
generate_kaggle_res_file(test_ids, test_pred, ‘result.csv’)
## Print evaluation
print(bio_classification_report(test_pred, test_labels))
precision recall f1-score support
B-LOC 0.02 0.97 0.04 38
B-MISC 0.00 0.12 0.00 8
I-MISC 0.00 0.00 0.00 1952
B-ORG 0.12 0.97 0.21 388
I-ORG 0.00 0.03 0.00 31
B-PER 0.02 0.69 0.03 48
I-PER 0.03 0.79 0.06 67
avg / total 0.02 0.20 0.04 2532
In [8]:
print (len(trainer.logparser.iterations), trainer.logparser.iterations[-1])
50 {‘loss’: 100898.779131, ‘error_norm’: 3205.740655, ‘linesearch_trials’: 1, ‘active_features’: 224, ‘num’: 50, ‘time’: 0.157, ‘scores’: {}, ‘linesearch_step’: 1.0, ‘feature_norm’: 17.142624}
Check what the classifier has learned¶
In [9]:
from collections import Counter
info = tagger.info()
def print_transitions(trans_features):
for (label_from, label_to), weight in trans_features:
print(“%-6s -> %-7s %0.6f” % (label_from, label_to, weight))
print(“Top likely transitions:”)
print_transitions(Counter(info.transitions).most_common(15))
print(“\nTop unlikely transitions:”)
print_transitions(Counter(info.transitions).most_common()[-15:])
Top likely transitions:
B-PER -> I-PER 5.106086
I-ORG -> I-ORG 4.577166
B-MISC -> I-MISC 4.393290
I-MISC -> I-MISC 4.270381
I-LOC -> I-LOC 4.204211
B-ORG -> I-ORG 4.126288
B-LOC -> I-LOC 3.718146
I-PER -> I-PER 3.667023
O -> B-ORG 2.404751
O -> B-LOC 1.634268
O -> O 1.560973
O -> B-MISC 1.326638
O -> B-PER 1.316798
B-ORG -> O 0.586847
B-LOC -> O 0.363073
Top unlikely transitions:
O -> O 1.560973
O -> B-MISC 1.326638
O -> B-PER 1.316798
B-ORG -> O 0.586847
B-LOC -> O 0.363073
I-PER -> O 0.265646
I-ORG -> O 0.018350
I-ORG -> I-MISC -0.000135
B-MISC -> O -0.056457
I-LOC -> O -0.123269
I-MISC -> O -0.306588
O -> I-LOC -1.506816
O -> I-MISC -2.224771
O -> I-PER -2.417485
O -> I-ORG -2.586834
We can see that, for example, it is very likely that the beginning of a person name (B-PER) will be followed by a token inside person name (I-PER). Also note O -> B-LOC are penalized.
Check the state features¶
In [10]:
def print_state_features(state_features):
for (attr, label), weight in state_features:
print(“%0.6f %-6s %s” % (weight, label, attr))
print(“Top positive:”)
print_state_features(Counter(info.state_features).most_common(20))
print(“\nTop negative:”)
print_state_features(Counter(info.state_features).most_common()[-20:])
Top positive:
2.562545 O word.lower():el
2.328283 O bias
2.295047 O EOS
2.139253 O word.lower():en
2.137480 I-PER word[-2:]:ez
2.084413 B-ORG word.lower():efe
2.001973 B-ORG word.lower():gobierno
1.790164 B-LOC BOS
1.670007 O word.lower():con
1.605402 B-PER BOS
1.426642 O word.lower():para
1.372486 O word.lower():una
1.365916 O word.lower():,
1.365916 O word[-2:]:,
1.365916 O word[-3:]:,
1.352841 O word[-2:]:se
1.308297 O word[-2:]:de
1.269299 O word.lower():la
1.257194 B-ORG word[-2:]:FE
1.216900 B-ORG word[-3:]:EFE
Top negative:
-0.262594 I-PER bias
-0.280604 O word[-2:]:ga
-0.286225 O word[-3:]:uel
-0.292385 B-ORG word[-2:]:es
-0.298937 O word[-2:]:sé
-0.325356 B-PER word.lower():la
-0.328440 O word[-3:]:rra
-0.418856 O word[-3:]:opa
-0.418973 I-LOC bias
-0.436460 B-MISC bias
-0.445661 O word[-2:]:pa
-0.477377 O word[-3:]:ina
-0.498014 I-ORG bias
-0.572642 B-ORG word[-2:]:de
-0.594455 O word.lower():efe
-0.648540 O word[-3:]:ona
-0.655822 O word[-2:]:ez
-0.687738 B-ORG bias
-0.729016 O word[-2:]:ia
-0.896983 O word[-2:]:ña