FIT5196 Assessment 1¶
Student Name:¶
Student ID:¶
Date: 02/04/2017
Version: 2.0
Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)
Libraries used:
xml.etree.ElementTree (for parsing XML doc, included in Anaconda Python 3.6)
pandas 0.19.2 (for data frame, included in Anaconda Python 3.6)
re 2.2.1 (for regular expression, included in Anaconda Python 3.6)
nltk 3.2.2 (Natural Language Toolkit, included in Anaconda Python 3.6)
nltk.collocations (for finding bigrams, included in Anaconda Python 3.6)
nltk.tokenize (for tokenization, included in Anaconda Python 3.6)
nltk.corpus (for stop words, not included in Anaconda, nltk.download('stopwords') provided)
1. Introduction¶
This assignment involves several text processing and analysis tasks applied to patent documents in XML format. There are a total of 2500 patents in one 158 MB file named patents.xml. The required tasks are the following:
Extract the IPC code for each patent and store the list into a .txt file.
Extract all the citations for each patent and store the list into a .txt file.
Analyse the abstracts for each patent and generate a vocabulary count vector. Store the results into a .txt file.
More details for each task will be given in the following sections.
2. Import libraries¶
In [2]:
import xml.etree.ElementTree as ET
import pandas as pd
import re
import nltk
from nltk.collocations import *
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.corpus import stopwords
3. Examining and loading data¶
As a first step, the file patents.xml will be loaded so its first 10 lines can be inspected.
In [3]:
# print first ten lines of the file
with open('/Users/land/git_hub_repo/FIT5196_S1_2017/Assessments/Assessment_1_Text_Preprocessing/patents.xml', 'r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))
We can see that the first XML document starts with an XML declaration followed by the root opening tag (us-patent-grant).
A regex is defined so that each string starting with an XML declaration is captured individually. The non-greedy quantifier *? is necessary so that the whole file is not matched as a single document. The regex also uses the character class [\s\S] (any whitespace or non-whitespace character), which captures everything, including line breaks, between the XML declaration and the root closing tag.
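As a minimal illustration of the difference between greedy and non-greedy matching, here is a small sketch on a made-up string (the toy variable and documents are invented for the example):

# illustrative sketch: greedy vs non-greedy matching on a toy string
import re
toy = '<?xml version="1.0"?><doc>first</doc>\n<?xml version="1.0"?><doc>second</doc>'
greedy = re.findall(r'<\?xml[\s\S]*</doc>', toy)       # grabs everything in one match
non_greedy = re.findall(r'<\?xml[\s\S]*?</doc>', toy)  # stops at the first closing tag
print(len(greedy))      # 1
print(len(non_greedy))  # 2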
In [4]:
# read the whole file
with open('/Users/land/git_hub_repo/FIT5196_S1_2017/Assessments/Assessment_1_Text_Preprocessing/patents.xml', 'r') as infile:
    text = infile.read()
# matches everything between the XML declaration and the root closing tag
regex = r'<\?xml[\s\S]*?</us-patent-grant>'
patents = re.findall(regex, text)
print(len(patents))
2500
The result is a list of strings (patents) with length 2500. All patents have been successfully extracted from the main file. Let's inspect the last one.
In [5]:
lp_lines = patents[len(patents) - 1].split('\n', 10)  # get first 10 lines of last patent, discard the rest
print('\n'.join([lp_lines[i] for i in range(0, 10)]))
4. Parsing XML and IPC code extraction¶
The first task is to parse each patent's XML and extract its IPC code. According to the World Intellectual Property Organization (2016), the International Patent Classification (IPC) code has the following structure:
Section (one of the capital letters A through H)
Class (two digit number)
Subclass (a capital letter)
Main group (a one- to three-digit number)
Subgroup (a number of at least two digits, or 00 if no subgroup is considered)
A .txt file should be generated with a patent ID and its IPC code per line in the following format:
patent_id:section,class,subclass,main_group,subgroup
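To make the target format concrete, here is a minimal illustrative sketch of how one output line is assembled; the field values are taken from the first patent examined later in this section, and the variable names are invented for the example:

# illustrative sketch: one line of the required output, built from sample values
patent_id, section, ipc_class, subclass, main_group, subgroup = 'PP021722', 'A', '01', 'H', '5', '00'
line = patent_id + ':' + ','.join([section, ipc_class, subclass, main_group, subgroup])
print(line)  # PP021722:A,01,H,5,00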
The structure of the first patent's XML, which serves as our sample, is explored to find the root's direct children. XML parsing is carried out as described in Python's official xml.etree.ElementTree documentation (Python Software Foundation, 2017).
In [6]:
# pick first patent as sample and get the first level nodes
sample = patents[0]
root_sample = ET.fromstring(sample)
for node in root_sample:
    print(node.tag)
us-bibliographic-data-grant
abstract
drawings
description
us-claim-statement
claims
To explore the tree structure of an XML node, a small recursive routine is defined that prints the tag names (and text) of a node and all of its descendants, inspired by a Stack Overflow answer (Whitmore, 2013).
In [7]:
# prints the tree structure of any XML node (tag + text)
def print_tree(element, indent):
    line = ''
    for i in range(0, indent):  # tree-like format
        line += ' '
    print(line, element.tag, '' if not element.text else element.text.strip())
    for child in element:
        print_tree(child, indent + 1)
Using the print_tree routine, the direct children of us-bibliographic-data-grant whose tag names contain the text classification can be explored. The goal is to find a path to specific tags holding information about the IPC code. This can be accomplished with the find() function (Python Software Foundation, 2017), keeping only the tags whose names contain the string classification.
In [8]:
# explore the 'us-bibliographic-data-grant' children for tags containing the word 'classification'
for node in [t for t in root_sample.find('./us-bibliographic-data-grant') if 'classification' in t.tag]:
    print_tree(node, 0)
('', 'classifications-ipcr', '')
(' ', 'classification-ipcr', '')
(' ', 'ipc-version-indicator', '')
(' ', 'date', '20060101')
(' ', 'classification-level', 'A')
(' ', 'section', 'A')
(' ', 'class', '01')
(' ', 'subclass', 'H')
(' ', 'main-group', '5')
(' ', 'subgroup', '00')
(' ', 'symbol-position', 'F')
(' ', 'classification-value', 'I')
(' ', 'action-date', '')
(' ', 'date', '20110222')
(' ', 'generating-office', '')
(' ', 'country', 'US')
(' ', 'classification-status', 'B')
(' ', 'classification-data-source', 'H')
('', 'classification-national', '')
(' ', 'country', 'US')
(' ', 'main-classification', 'PLT161')
('', 'us-field-of-classification-search', '')
(' ', 'classification-national', '')
(' ', 'country', 'US')
(' ', 'main-classification', 'PLT161')
The tags section, class, subclass, main-group and subgroup have been found in the previous tree. They are all inside the following XPath (Python Software Foundation, 2017): './us-bibliographic-data-grant/classifications-ipcr/classification-ipcr/'.
The next step is to find the patent ID. Let's still consider the tag us-bibliographic-data-grant, but this time only the first two levels will be explored.
In [9]:
# explore first two levels of node 'us-bibliographic-data-grant'
for node in root_sample.find('./us-bibliographic-data-grant'):
    print(node.tag)
    for child in node:
        print(' ' + child.tag)
publication-reference
document-id
application-reference
document-id
us-application-series-code
us-term-of-grant
us-term-extension
classifications-ipcr
classification-ipcr
classification-national
country
main-classification
invention-title
us-botanic
latin-name
variety
references-cited
citation
citation
citation
number-of-claims
us-exemplary-claim
us-field-of-classification-search
classification-national
figures
number-of-drawing-sheets
number-of-figures
us-related-documents
related-publication
parties
applicants
agents
assignees
assignee
examiners
primary-examiner
Both publication-reference and application-reference appear to contain a tag named document-id. Let's explore these two tags further.
In [10]:
print_tree(root_sample.find('./us-bibliographic-data-grant/publication-reference'), 0)
print_tree(root_sample.find('./us-bibliographic-data-grant/application-reference'), 0)
('', 'publication-reference', '')
(' ', 'document-id', '')
(' ', 'country', 'US')
(' ', 'doc-number', 'PP021722')
(' ', 'kind', 'P3')
(' ', 'date', '20110222')
('', 'application-reference', '')
(' ', 'document-id', '')
(' ', 'country', 'US')
(' ', 'doc-number', '12316880')
(' ', 'date', '20081216')
The tag doc-number contains the patent identifier for both application and publication documents. Taking the publication document number as the patent identifier, the full XPath is: './us-bibliographic-data-grant/publication-reference/document-id/doc-number'.
Retrieved XPaths so far:
Patent ID: 'root/us-bibliographic-data-grant/publication-reference/document-id/doc-number'
IPC Classification: 'root/us-bibliographic-data-grant/classifications-ipcr/classification-ipcr/'
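As a quick sanity check, here is a minimal sketch applying both XPaths to the sample patent parsed earlier (root_sample); the expected values match the sample output shown above:

# illustrative sketch: verify both XPaths on the sample patent
sample_id = root_sample.find('./us-bibliographic-data-grant/publication-reference/document-id/doc-number')
sample_ipcr = root_sample.find('./us-bibliographic-data-grant/classifications-ipcr/classification-ipcr')
print(sample_id.text.strip())                      # PP021722
print(sample_ipcr.find('./section').text.strip())  # A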
Now it's time to extract this data for each patent. Since the XPaths are already retrieved, the data can easily be accessed with the find() function (Python Software Foundation, 2017).
In [11]:
# patent ID and IPC code extraction
lst_section = []
lst_class = []
lst_subclass = []
lst_maingroup = []
lst_subgroup = []
lst_patent_id = []
for patent in patents:
    root = ET.fromstring(patent)
    patent_id = root.find('./us-bibliographic-data-grant/publication-reference/document-id/doc-number')
    lst_patent_id.append(patent_id.text.strip())
    cl = root.find('./us-bibliographic-data-grant/classifications-ipcr/classification-ipcr')
    lst_section.append(cl.find('./section').text.strip())
    lst_class.append(cl.find('./class').text.strip())
    lst_subclass.append(cl.find('./subclass').text.strip())
    lst_maingroup.append(cl.find('./main-group').text.strip())
    lst_subgroup.append(cl.find('./subgroup').text.strip())
dic_ipc = {}  # new data dictionary
dic_ipc['patentid'] = lst_patent_id
dic_ipc['section'] = lst_section
dic_ipc['class'] = lst_class
dic_ipc['subclass'] = lst_subclass
dic_ipc['maingroup'] = lst_maingroup
dic_ipc['subgroup'] = lst_subgroup
df_ipc = pd.DataFrame(dic_ipc)  # data frame from dictionary
print(df_ipc.head())
print(df_ipc.shape)
class maingroup patentid section subclass subgroup
0 01 5 PP021722 A H 00
1 01 7 RE042159 G B 14
2 06 11 RE042170 G F 00
3 41 13 07891018 A D 00
4 41 13 07891019 A D 00
(2500, 6)
According to the export format, the patent ID is followed by a colon ':' and then the section. For this purpose, a new column is generated and the final data frame is exported to a .txt file using the to_csv() function (The pandas Project, 2016a).
In [12]:
# add extra column and export
df_ipc['patidsection'] = df_ipc['patentid'] + ':' + df_ipc['section']
df_ipc.sort_values(by = 'patentid', ascending = True, inplace = True)  # sort by patent ID
df_ipc.reset_index(inplace = True, drop = True)  # reset index
df_ipc.to_csv('./classification.txt', columns = ['patidsection', 'class', 'subclass', 'maingroup', 'subgroup'],
              header = False, index = False)  # export to .txt file
# verify (print first 10 lines)
with open('classification.txt', 'r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))
07891018:A,41,D,13,00
07891019:A,41,D,13,00
07891020:A,41,D,13,00
07891021:A,62,B,17,00
07891023:A,41,F,19,00
07891025:A,61,F,9,02
07891026:A,41,D,13,00
07891027:E,03,D,9,00
07891029:A,61,G,9,00
07891030:A,47,K,11,06
5. Extracting the citation network¶
This task requires extracting all patent citations for every document. The final result should be exported to a .txt file in the following format:
citing_patent_id:cited_patent_id,cited_patent_id,…
From the exploration in Section 4, a citation-related tag named references-cited was discovered inside ./us-bibliographic-data-grant.
In [13]:
print_tree(root_sample.find('./us-bibliographic-data-grant/references-cited'), 0)
('', 'references-cited', '')
(' ', 'citation', '')
(' ', 'patcit', '')
(' ', 'document-id', '')
(' ', 'country', 'US')
(' ', 'doc-number', 'PP17672')
(' ', 'kind', 'P3')
(' ', 'name', 'Hofmann')
(' ', 'date', '20070500')
(' ', 'category', 'cited by examiner')
(' ', 'classification-national', '')
(' ', 'country', 'US')
(' ', 'main-classification', 'PLT161')
(' ', 'citation', '')
(' ', 'patcit', '')
(' ', 'document-id', '')
(' ', 'country', 'US')
(' ', 'doc-number', 'PP18482')
(' ', 'kind', 'P3')
(' ', 'name', 'Ligonniere')
(' ', 'date', '20080200')
(' ', 'category', 'cited by examiner')
(' ', 'classification-national', '')
(' ', 'country', 'US')
(' ', 'main-classification', 'PLT161')
(' ', 'citation', '')
(' ', 'patcit', '')
(' ', 'document-id', '')
(' ', 'country', 'US')
(' ', 'doc-number', 'PP18483')
(' ', 'kind', 'P3')
(' ', 'name', 'Ligonniere')
(' ', 'date', '20080200')
(' ', 'category', 'cited by examiner')
(' ', 'classification-national', '')
(' ', 'country', 'US')
(' ', 'main-classification', 'PLT161')
Every citation tag includes a patcit tag that encapsulates a patent identifier in doc-number. The task is to iterate through every citation element inside references-cited for each patent. To achieve this, the findall() function (Python Software Foundation, 2017) is applied to the XPath of references-cited to obtain all nodes under it.
In [14]:
# get all citations for each patent
lst_patent_id = []
lst_cited_patent_id = []
for patent in patents:
    root = ET.fromstring(patent)
    patent_id = root.find('./us-bibliographic-data-grant/publication-reference/document-id/doc-number')
    citations = root.findall('./us-bibliographic-data-grant/references-cited/*')  # get all citations
    for ct in citations:  # ct is an element with tag 'citation'
        lst_patent_id.append(patent_id.text)
        lst_cited_patent_id.append(ct.find('./patcit/document-id/doc-number').text)  # extract patent ID
dic_citations = {}
dic_citations['patentid'] = lst_patent_id
dic_citations['citedpatentid'] = lst_cited_patent_id
df_citations = pd.DataFrame(dic_citations)
print(df_citations.head())
print(df_citations.shape)
citedpatentid patentid
0 PP17672 PP021722
1 PP18482 PP021722
2 PP18483 PP021722
3 4954776 RE042159
4 4956606 RE042159
(47041, 2)
To export the df_citations data frame in the desired format, the cited patents must be grouped by their corresponding citing patents. To achieve this, df_citations is grouped by patentid with the DataFrame.groupby() function (The pandas Project, 2016b); the members of each resulting group are joined with a comma delimiter, as explained by Biek (2008); and the concatenation of each citing patent with its group of cited documents is exported to a .txt file.
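Before applying this to df_citations, here is a minimal sketch of the grouping-and-joining idea on a toy data frame (the values are invented for the example):

# illustrative sketch (toy values): one comma-separated line per citing patent
import pandas as pd
toy = pd.DataFrame({'patentid': ['P1', 'P1', 'P2'], 'citedpatentid': ['A1', 'A2', 'B1']})
for name, group in toy.groupby('patentid'):
    print(name + ':' + ','.join(group.citedpatentid))
# expected:
# P1:A1,A2
# P2:B1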
In [15]:
# group by patent ID and print citing patent with group of cited patents delimited by ','
c_writer = open('citations.txt', 'w')
df_citations.sort_values(by = ['patentid', 'citedpatentid'], ascending = True, inplace = True)  # sort by patent ID
df_citations.reset_index(inplace = True, drop = True)  # reset index
for name, group in df_citations.groupby(['patentid']):
    c_writer.write(name + ':' + ','.join(group.citedpatentid) + '\n')
c_writer.close()
# verify
with open('citations.txt', 'r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 5)]))
07891018:4561124,4831666,4920577,5105473,5134726,5611081,5729832,5845333,6115838,6332224,6805957,7089598,D338281
07891019:4355632,4702235,5032705,5148002,5603648,6439942,6757916,6910229
07891020:101 55 935,103 11 185,103 50 869,103 57 193,197 49 862,2003/0214408,2004/0009729,203 08 642,4599609,4734072,4843014,5061636,5493730,5635909,6080690,6267232,6388422,6767509,WO 00/62633,WO 2004/073798
07891021:4507808,4627112,4864655,5010591,5031242,5165110,5410759
07891023:1335927,1398962,1446948,1839143,1852030,1983636,2002/0112275,2003/0110550,2006/0185056,2006/0289585,2009/0070915,2133505,2411724,2682669,3167786,3401857,4923105,5214806,5319806,5413262,5488738,5497923,5611079,5623735,6021528,6088831,6216931,6766532,6804834,6959455,7318542,7596813,770761,D581633
6. Abstract extraction and processing¶
This task consists of extracting the abstract of every patent and calculating its corresponding sparse count vector. This vector is a collection of (word, frequency) pairs counting the number of occurrences of every word in the abstract. Not every word is counted: words from all abstracts are filtered/transformed and stored in a vocabulary according to the following rules:
Meaningful word pairs are merged into bigrams (e.g. 'data' and 'wrangling' would be transformed into the bigram 'data_wrangling'). At least 100 bigrams are considered.
No stop words are included in the final vocabulary.
The top-20 most frequent words are also not included.
Words appearing in only one abstract are not included.
Final sparse count vectors are stored into a .txt file with the following format for every patent:
patent_id,word_id:freq,word_id:freq,…
The vocabulary file is also a .txt file with each word having the following format:
word_id:word
The root element has an abstract tag. In order to obtain the full XPath, let's explore that tag.
In [16]:
print_tree(root_sample.find('./abstract'), 0)
('', 'abstract', '')
(' ', 'p', u'A new apple tree named \u2018Daligris\u2019 is disclosed. The fruit of the new variety is particularly notable for its eating quality and distinctive flavor and appearance. The fruit is very sweet and has a pronounced aniseed flavor, and takes on a distinctive red orange coloration as it ripens on the tree.')
The full XPath for the abstract is: 'root/abstract/p'
Now, as a first step, let's tokenize every abstract and store each token in a data frame together with its corresponding patent ID. For the tokenization, only words containing 2 or more lowercase letters (a to z) are considered (regex [a-z]{2,}), and a character may appear at most two times in a row. According to Blank (2011), matching a character that is not immediately followed by another copy of itself is achieved with the negative lookahead (?!\1), where \1 is the group capturing the repeated character; allowing one extra occurrence expands this to (?!\1{2}). The final pattern is (?:([a-z])(?!\1{2})){2,}. However, the documentation of NLTK's RegexpTokenizer (NLTK Project, 2017) clearly states that the regex pattern must not contain capturing parentheses, so non-capturing ones (?:...) would have to be used instead, which breaks the backreference and causes the tokenizer to return partial matches. To solve this, the finditer() function from the re package is used instead, retrieving the full match of each result with group().
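As a minimal illustration of how this pattern behaves (the strings below are made up for the example): single letters and non-letter characters are dropped, and a run of three identical letters breaks the match.

# illustrative sketch (made-up text): how the token regex behaves
import re
regex = r"(?:([a-z])(?!\1{2})){2,}"
for s in ['a sweet apple', 'coool']:
    print([m.group() for m in re.finditer(regex, s)])
# ['sweet', 'apple']   the single letter 'a' is dropped
# ['ool']              a run of three identical letters breaks the match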
In [17]:
regex = r"(?:([a-z])(?!\1{2})){2,}"  # regex pattern
lst_token = []
lst_patent_id = []
for patent in patents:
    root = ET.fromstring(patent)
    patent_id = root.find('./us-bibliographic-data-grant/publication-reference/document-id/doc-number')
    abstract = root.find('./abstract/p').text.strip().lower()
    matches = re.finditer(regex, abstract)  # dismiss the NLTK tokenizer, use re.finditer() directly
    tokens = [m.group() for i, m in enumerate(matches)]  # retrieve the full match of each result
    lst_token += tokens
    lst_patent_id += ([patent_id.text] * len(tokens))
token_dic = {}
token_dic['token'] = lst_token
token_dic['patentid'] = lst_patent_id
df_tokens = pd.DataFrame(token_dic)
print(df_tokens.head())
print(df_tokens.shape)
patentid token
0 PP021722 new
1 PP021722 apple
2 PP021722 tree
3 PP021722 named
4 PP021722 daligris
(262397, 2)
Next, a list of bigrams is calculated from the token list df_tokens. The function BigramCollocationFinder.from_words() (NLTK Project, 2015) takes that list as input to generate pairs of words that occur together. Since only meaningful bigrams are required, the pairs are filtered so that no bigram consists of two stop words or of the same word twice. This is achieved with apply_ngram_filter() (NLTK Project, 2015), which uses a lambda expression to test the bigram components w1 and w2 against the desired condition. Additionally, bigrams can be filtered by their frequency in the corpus and ranked by an association measure such as PMI, which measures how much more likely two words are to occur together than their individual distributions in the corpus would predict (Bouma, 2009). Filtering only by high frequency can produce bigrams with good semantics, but it also promotes pairs built from stop words that the stop word list does not capture. Lowering the frequency threshold and ranking by PMI removes these obscure stop-word pairs and lets new meaningful pairs with high scores reach the top results. In this case, bigrams with fewer than 7 occurrences are filtered out and the best 100 remaining bigrams by PMI are kept.
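For reference, here is a minimal sketch of the PMI idea with invented counts; this is the plain probability formulation from Bouma (2009), not the author's code:

# illustrative sketch (toy counts): PMI of a bigram (w1, w2)
from math import log2
n_total = 1000   # total number of bigrams in a toy corpus (assumed)
n_w1 = 10        # occurrences of w1
n_w2 = 8         # occurrences of w2
n_w1w2 = 7       # occurrences of the pair (w1, w2)
pmi = log2((n_w1w2 / n_total) / ((n_w1 / n_total) * (n_w2 / n_total)))
print(round(pmi, 2))  # 6.45 -- the pair occurs far more often than chance would predict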
In [18]:
nltk.download('stopwords')
stopwords_list = stopwords.words('english')  # list of stop words from the NLTK package
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(list(df_tokens.token))
finder.apply_freq_filter(7)  # bigrams with fewer than 7 occurrences are ignored
# no bigrams with both words the same or both as stop words
finder.apply_ngram_filter(lambda w1, w2: w1 == w2 or (w1 in stopwords_list and w2 in stopwords_list))
bigrams = finder.nbest(bigram_measures.pmi, 100)  # PMI filter (best 100)
print(len(bigrams))
print(bigrams)
[nltk_data] Downloading package stopwords to /Users/land/nltk_data…
[nltk_data] Package stopwords is already up-to-date!
100
[(u’raman’, u’amplification’), (‘promotion’, ‘codes’), (‘flip’, ‘flop’), (‘hydrophobic’, ‘drug’), (‘white’, ‘saturation’), (‘gray’, ‘scale’), (‘fire’, ‘suppression’), (‘millimeter’, ‘wave’), (‘positron’, ’emission’), (‘shipping’, ‘label’), (‘impurity’, ‘diffused’), (‘drill’, ‘string’), (‘diversity’, ‘combining’), (‘bond’, ‘fingers’), (‘golf’, ‘club’), (‘ejection’, ‘outlets’), (‘lift’, ‘tappet’), (‘resource’, ‘allocation’), (‘calling’, ‘party’), (‘dunnage’, ‘bag’), (‘unused’, ‘locations’), (‘charged’, ‘particle’), (‘pseudo’, ‘random’), (‘diffused’, ‘resistor’), (‘accelerator’, ‘pedal’), (u’direct’, u’methanol’), (‘packing’, ‘list’), (‘hermetic’, ‘seal’), (‘permanent’, ‘magnet’), (‘moisture’, ‘removal’), (‘cross’, ‘sectional’), (‘rfid’, ‘reader’), (‘capacity’, ‘utilization’), (‘spaced’, ‘apart’), (‘planetary’, ‘gear’), (‘hard’, ‘mask’), (‘replaceable’, ‘lamp’), (‘writing’, ‘implement’), (‘native’, ‘code’), (‘movable’, ‘stile’), (‘turbo’, ‘machine’), (‘esd’, ‘protection’), (‘permeable’, ‘membrane’), (‘refractive’, ‘index’), (‘radially’, ‘inwardly’), (‘rfid’, ‘tag’), (‘clock’, ‘buffering’), (u’cooler’, u’box’), (‘floating’, ‘diffusion’), (‘sun’, ‘gear’), (‘buy’, ‘order’), (‘sell’, ‘order’), (‘articulating’, ‘paper’), (‘link’, ‘aggregation’), (‘alignment’, ‘mark’), (‘build’, ‘up’), (‘three’, ‘dimensional’), (‘wind’, ‘turbine’), (‘order’, ‘price’), (‘search’, ‘results’), (‘pick’, ‘up’), (‘hand’, ‘held’), (‘color’, ‘gamut’), (‘trip’, ‘actuator’), (‘vibrational’, ‘energy’), (‘positive’, ‘refractive’), (’emitting’, ‘diodes’), (‘encryption’, ‘key’), (‘working’, ‘hose’), (‘respective’, ‘ones’), (‘ink’, ‘jet’), (‘default’, ‘order’), (‘resilient’, ‘fiber’), (‘identification’, ‘tag’), (‘incoming’, ‘call’), (‘electromagnetic’, ‘radiation’), (u’methanol’, u’fuel’), (‘dry’, ‘down’), (‘heat’, ‘dissipating’), (‘following’, ‘steps’), (‘field’, ‘sensitive’), (‘non’, ‘volatile’), (‘heat’, ‘exchanger’), (‘conductivity’, ‘type’), (‘ammunition’, ‘container’), (‘developer’, ‘carrying’), (‘air’, ‘humidifier’), (‘heat’, ‘sink’), (‘turned’, ‘off’), (‘magazine’, ‘well’), (‘fan’, ‘noise’), (‘semi’, ‘annular’), (‘page’, ‘table’), (‘non’, ‘recumbent’), (‘spring’, ‘accumulator’), (‘electrostatic’, ‘discharge’), (‘crank’, ‘angle’), (‘mos’, ‘transistor’), (‘voice’, ‘band’), (‘knife’, ‘blade’)]
With the list of bigrams, every abstract is re-tokenized individually so that the reference to the patent ID is not lost. To do that, the tokens are grouped by patent ID, re-tokenized, and stored again in a data frame with the same structure.
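Here is a minimal sketch of how MWETokenizer merges bigram tuples, using the example pair from the task description (the token list is invented for the example):

# illustrative sketch (toy tokens): how MWETokenizer merges bigram tuples
from nltk.tokenize import MWETokenizer
toy_tokenizer = MWETokenizer([('data', 'wrangling')])  # bigram list as tuples
print(toy_tokenizer.tokenize(['basic', 'data', 'wrangling', 'tasks']))
# ['basic', 'data_wrangling', 'tasks']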
In [19]:
# bigram re-tokenization
lst_token_bigram = []
lst_patent_id_bigram = []
tokenizer = MWETokenizer(bigrams)  # tokenize using the bigram tuples generated previously
for name, group in df_tokens.groupby(['patentid']):  # group by patent ID
    token_list = list(group.token)  # get all tokens for current patent
    tokens_bigrams = tokenizer.tokenize(token_list)  # tokenize with bigrams
    lst_token_bigram += tokens_bigrams  # get new list of tokens
    lst_patent_id_bigram += ([name] * len(tokens_bigrams))  # with their corresponding patent ID
dic_token_bigram = {}
dic_token_bigram['token'] = lst_token_bigram
dic_token_bigram['patentid'] = lst_patent_id_bigram
df_tokens_bigram = pd.DataFrame(dic_token_bigram)  # new data frame of tokens
print(df_tokens_bigram.shape)
(261382, 2)
The next step is to remove stop words from the data frame df_tokens_bigram.
In [20]:
# removing stop words
# filter tokens not in stop words list
df_tokens_no_sw = df_tokens_bigram[~df_tokens_bigram.token.isin(stopwords_list)]
print(df_tokens_no_sw.shape)
(164296, 2)
Now, let's remove the top-20 most frequent words.
In [21]:
# removing top-20 most frequent words
freq = nltk.FreqDist(df_tokens_no_sw.token) # get distribution of tokens
common_20 = [token for token, freq in freq.most_common(20)] # get top 20 in tuples (token, frequency)
print(common_20)
df_tokens_no_20fw = df_tokens_no_sw[~df_tokens_no_sw.token.isin(common_20)] # filter previous data frame
print(df_tokens_no_20fw.shape)
['first', 'second', 'includes', 'one', 'device', 'portion', 'least', 'system', 'data', 'signal', 'surface', 'method', 'unit', 'plurality', 'layer', 'member', 'provided', 'may', 'control', 'circuit']
(142292, 2)
Next, words appearing in only one abstract are also removed. To achieve this, the previous data frame is grouped by token, and each group is checked to see whether it contains exactly one distinct patent. Tokens that appear in a single patent are collected and removed from the data frame.
In [22]:
# removing tokens that appear in only one patent
one_occu_words = []
for name, group in df_tokens_no_20fw.groupby(['token']):  # group by token
    if len(set(group.patentid)) == 1:  # if the group contains only one patent (token appearing in one patent)
        one_occu_words.append(name)
df_tokens_no_onepat = df_tokens_no_20fw[~df_tokens_no_20fw.token.isin(one_occu_words)]  # filter
print(df_tokens_no_onepat.shape)
(135174, 2)
In [26]:
[word for word in one_occu_words if "_" in word]
Out[26]:
['air_humidifier',
'ammunition_container',
'articulating_paper',
'bond_fingers',
'buy_order',
'capacity_utilization',
'clock_buffering',
u'cooler_box',
'crank_angle',
'default_order',
u'direct_methanol',
'diversity_combining',
'dry_down',
'dunnage_bag',
'fan_noise',
'hydrophobic_drug',
'impurity_diffused',
'knife_blade',
'lift_tappet',
'link_aggregation',
'magazine_well',
'millimeter_wave',
'moisture_removal',
'movable_stile',
'native_code',
'non_recumbent',
'packing_list',
'permeable_membrane',
'positron_emission',
'promotion_codes',
u'raman_amplification',
'replaceable_lamp',
'resilient_fiber',
'sell_order',
'semi_annular',
'shipping_label',
'spring_accumulator',
'turbo_machine',
'unused_locations',
'voice_band',
'white_saturation',
'working_hose']
With all tokens from all patents filtered according to the initial rules, the final vocabulary is generated as a dictionary mapping each word to a word_id; it will later be exported in the word_id:word format.
In [ ]:
# final vocabulary dictionary
lst_vocabulary = list(set(df_tokens_no_onepat.token)) # removing duplicates
vocabulary_dic = {token:index for index, token in enumerate(lst_vocabulary)} # dictionary word:word_id
print(len(vocabulary_dic))
The last data frame is df_tokens_no_onepat. Let's use it to count the frequency of words in each abstract (patent ID). One approach is to group the data frame by patent ID and token, so that the size of each group corresponds to the frequency of the pair (patentid, token). The grouping also forms a new data frame with unique words per patent and a new column, frequency, which is simply the count of each group. McKinney (2012) provides an elegant way to extract data from a groupby object directly into a data frame without iteration: the size() function applied to the groupby object is equivalent to the COUNT aggregation in languages like SQL, and the new dictionary key frequency is calculated as the size of the groups, whose keys are included by default in the new data frame.
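A minimal sketch of the groupby/size() idea on a toy data frame (the values are invented for the example):

# illustrative sketch (toy values): group sizes become a 'frequency' column
import pandas as pd
toy = pd.DataFrame({'patentid': ['P1', 'P1', 'P1', 'P2'],
                    'token': ['tree', 'tree', 'fruit', 'tree']})
toy_freq = pd.DataFrame({'frequency': toy.groupby(['patentid', 'token']).size()}).reset_index()
print(toy_freq)
# expected rows: (P1, fruit, 1), (P1, tree, 2), (P2, tree, 1)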
In [ ]:
# generate new data frame grouping by patent ID and token, the group count is a new column: frequency
df_tokens_freq = pd.DataFrame(
    {'frequency': df_tokens_no_onepat.groupby(['patentid', 'token']).size()}
).reset_index()
print(df_tokens_freq.head())
print(df_tokens_freq.shape)
A mapping between each token and its corresponding index in the vocabulary is needed. The index is added to the data frame df_tokens_freq as a new column, tokenid, calculated by mapping the column token through the vocabulary dictionary vocabulary_dic. To do this, the Series.map() function (The pandas Project, 2016c) is applied to the column token with vocabulary_dic as its argument. The result is a series of indices for the matching tokens; since the dictionary contains exactly the unique tokens of the data frame, every row is matched.
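A minimal sketch of Series.map() with a dictionary, on invented values:

# illustrative sketch (toy values): mapping tokens to vocabulary indices
import pandas as pd
toy_vocab = {'fruit': 0, 'tree': 1}
toy_tokens = pd.Series(['tree', 'fruit', 'tree'])
print(list(toy_tokens.map(toy_vocab)))  # [1, 0, 1]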
In [ ]:
# final mapping to calculate the token ID according to the vocabulary dictionary
df_tokens_freq['tokenid'] = df_tokens_freq.token.map(vocabulary_dic)
print(df_tokens_freq.head())
print(df_tokens_freq.shape)
# validation
print([(df_tokens_freq.token[i], vocabulary_dic[df_tokens_freq.token[i]]) for i in range(0, 5)])
Finally, the vocabulary dictionary is exported into a .txt file.
In [ ]:
v_writer = open('vocab.txt', 'w')
for token, tokenid in vocabulary_dic.items():  # for each key, value in the dictionary
    v_writer.write(str(tokenid) + ':' + token + '\n')
v_writer.close()
# verify
with open('vocab.txt', 'r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))
The abstracts are also exported as sparse count vectors, one per patent. To achieve this, the data frame is grouped by patent ID, and the tokenid and frequency values of each group member are concatenated into a comma-separated list, which is prefixed with the patent ID to form one line of the final .txt file. Since the columns tokenid and frequency hold numerical values, they must be converted to strings first; because join() only accepts strings, both columns are mapped to strings with .map(str), as proposed by silvado (2013).
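A minimal sketch of the .map(str) conversion and join on a toy data frame (the values are invented for the example):

# illustrative sketch (toy values): numeric columns must become strings before join()
import pandas as pd
toy = pd.DataFrame({'tokenid': [3, 7], 'frequency': [2, 1]})
print(','.join(toy.tokenid.map(str) + ':' + toy.frequency.map(str)))  # 3:2,7:1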
In [ ]:
a_writer = open('count_vectors.txt', 'w')
for patentid, group in df_tokens_freq.groupby(['patentid']):
    a_writer.write(patentid + ',' + ','.join(group.tokenid.map(str) + ':' + group.frequency.map(str)) + '\n')
a_writer.close()
# verify the first 10 lines
with open('count_vectors.txt', 'r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))
7. Summary¶
This assessment measured the understanding of basic text file processing techniques in the Python programming language. The main outcomes achieved while applying these techniques were:
XML parsing and data extraction. Using the built-in xml.etree.ElementTree module with helpers such as XPaths and functions like find() and findall(), it was possible to access hierarchical data with only a few inspections.
Data frame manipulation. By using the pandas package, importing dictionaries into data frames was quite straightforward. Additional operations like filtering, slicing, grouping and mapping also made data transformation tasks more manageable.
Exporting data to specific formats. Using built-in functions like DataFrame.to_csv(), it was possible to export data frames to .txt files without additional formatting and transformations. In other cases, native file operations like open() and write() were required where data had to be processed line by line; Python functions like join() and in-line iterators made such tasks easier and more readable.
Tokenization and collocation extraction. Using the nltk and re packages, regular expressions were applied to tokenize the text and obtain letter-only words of at least two characters with no letter appearing more than two times in a row. 100 bigrams were also generated to further tokenize the initial corpus. The PMI measure was used to detect pairs of words with a high probability of appearing together, and additional bigram filters were used to refine the results further.
Vocabulary and sparse vector generation. A vocabulary covering words from different abstracts was obtained by removing stop words, the top-20 most frequent words, and words that appeared in only one abstract. Basic series and data frame functions like Series.isin() and DataFrame.groupby(), filtering based on nltk's frequency distribution function FreqDist(), and the built-in functions set() and enumerate() were used to build the final vocabulary dictionary. Finally, a sparse vector was calculated for every abstract by counting the frequency of vocabulary word occurrences.
8. References¶
Biek M. (2008, September 4). How would you make a comma-separated string from a list? [Response to]. Retrieved from http://stackoverflow.com/a/44781
Blank B. (2011, January 19). Regex non-consecutive chars [Response to]. Retrieved from http://stackoverflow.com/a/4739636
Bouma G. (2009). Normalized (Pointwise) Mutual Information in Collocation Extraction. Proceedings of the Biennial GSCL Conference.
McKinney W. (2012, April 29). Converting a Pandas GroupBy object to DataFrame [Response to]. Retrieved from http://stackoverflow.com/a/10374456
NLTK Project. (2017). NLTK 3.0 documentation: nltk.tokenize.regexp module. Retrieved from http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer
NLTK Project. (2015). Collocations. Retrieved from http://www.nltk.org/howto/collocations.html
Python Software Foundation. (2017). xml.etree.ElementTree — The ElementTree XML API. Retrieved from https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
silvado. (2013, October 15). Combine two columns of text in dataframe in pandas/python [Response to]. Retrieved from http://stackoverflow.com/a/19378497
The pandas Project. (2016a). pandas 0.19.2 documentation: pandas.DataFrame.to_csv. Retrieved from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv
The pandas Project. (2016b). pandas 0.19.2 documentation: pandas.DataFrame.groupby. Retrieved from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby
The pandas Project. (2016c). pandas 0.19.2 documentation: pandas.Series.map. Retrieved from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
Whitmore T. (2013, October 11). recursive printing of tree structure from XML having strange behavior in java [Response to]. Retrieved from http://stackoverflow.com/a/19328949
World Intellectual Property Organization. (2016). Guide to the International Patent Classification. Retrieved from http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf