
W01L2Python

Basic Python¶
Beginning Python¶

In [1]:

def hello(who):                     # 1
    """Greet somebody"""            # 2
    print("Hello " + who + "!")     # 3

1 Defines a new function/procedure called hello which takes a single argument. Note that Python variables are not typed: who could be a string, an integer, a list, etc. The line ends with a colon (:), which means we're beginning an indented code block.

2 Firstly, note that there are no brackets delimiting the body of the procedure; Python instead uses indentation to delimit code blocks. So getting the indentation right is crucial!

This line (2) is a documentation string (docstring) for the procedure, which gets associated with it in the Python environment. The three double quotes delimit a multi-line string (either ' or " works in this context).

3 This is the body of the procedure. print is a built-in function in Python 3. Note that Python 2.x used a print statement without round brackets; this is a major difference from Python 3.x. We also see here the + operator used on strings (assuming who is a string) to perform concatenation, so we have operator overloading based on object type just like in other OO languages.

In [2]:

help(hello)

Help on function hello in module __main__:

hello(who)
Greet somebody

In [3]:

hello("Steve")

Hello Steve!

Here I’m calling the new procedure with a literal string argument delimited by “.

In [4]:

hello('world')

Hello world!

And here delimited by '. Both of these delimiters are equivalent; use one when you want to include the other in the string, e.g. "Steve's".
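For example (a quick sketch of the two delimiter styles):

```python
# Use one quote style to include the other without escaping.
a = "Steve's"          # apostrophe inside double quotes
b = 'He said "hi"'     # double quotes inside single quotes
print(a)
print(b)
```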

In [5]:

people = ['Steve', "Mark", 'Diego']    # 6
for person in people:                  # 7
    hello(person)                      # 8

Hello Steve!
Hello Mark!
Hello Diego!

6 This defines a variable people whose value is a list of strings. Lists are one-dimensional arrays and the elements can be any Python object (including other lists).

7 A for loop over the elements of the list. Again the line ends with a colon indicating a code block to follow.

8 Call the procedure with the variable which will be bound to successive elements of the list.

Core Data Types¶
Strings
Numbers (integers, float, complex)
Lists
Tuples (immutable sequences)
Dictionaries (associative arrays)

Lists¶

In [6]:

a = ['one', 'two', 3, 'four']

In [7]:

a[0]

Out[7]:

'one'

In [8]:

a[-1]

Out[8]:

'four'

In [9]:

a[0:3]

Out[9]:

['one', 'two', 3]

In [10]:

len(a)

Out[10]:

4

In [11]:

len(a[0])

Out[11]:

3

In [12]:

a[1] = 2
a

Out[12]:

['one', 2, 3, 'four']

In [13]:

a.append('five')
a

Out[13]:

['one', 2, 3, 'four', 'five']

In [14]:

top = a.pop()
a

Out[14]:

['one', 2, 3, 'four']

In [15]:

top

Out[15]:

'five'

List Comprehensions¶
List comprehensions are a very powerful feature of Python. They reduce the need to write simple loops.

In [16]:

a = ['one', 'two', 'three', 'four']
len(a[0])

Out[16]:

3

In [17]:

b = [w for w in a if len(w) > 3]
b

Out[17]:

['three', 'four']

In [18]:

c = [[1, 'one'], [2, 'two'], [3, 'three']]
d = [w for [n, w] in c]
d

Out[18]:

['one', 'two', 'three']
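For comparison, the filtering comprehension above can be written as an explicit loop (a sketch; same result, more lines):

```python
a = ['one', 'two', 'three', 'four']

# b = [w for w in a if len(w) > 3], written as a plain loop
b = []
for w in a:
    if len(w) > 3:
        b.append(w)
print(b)  # ['three', 'four']
```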

Tuples¶
Tuples are a sequence data type like lists but are immutable:

Once created, elements cannot be added or modified.

Create tuples as literals using parentheses:

In [19]:

a = ('one', 'two', 'three')

Or from another sequence type:

In [20]:

a = ['one', 'two', 'three']
b = tuple(a)

Use tuples as fixed length sequences: memory advantages.
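A quick check of the immutability (assigning to a tuple element raises a TypeError):

```python
a = ('one', 'two', 'three')
try:
    a[0] = 'zero'   # not allowed: tuples are immutable
except TypeError as e:
    print('cannot modify a tuple:', e)
```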

Dictionaries¶
Associative array datatype (hash)
Store values under some hash key
Key can be any immutable type: string, number, tuple
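A short sketch of which key types are accepted: immutable tuples work as keys, but mutable lists do not:

```python
d = dict()
d[('x', 1)] = 'tuples are immutable, hence hashable'   # fine
try:
    d[['x', 1]] = 'lists are mutable'                  # raises TypeError
except TypeError as e:
    print(e)
```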

In [21]:

names = dict()
names['madonna'] = 'Madonna'
names['john'] = ['Dr.', 'John', 'Marshall']
names.keys()

Out[21]:

dict_keys(['madonna', 'john'])

In [22]:

list(names.keys())

Out[22]:

['madonna', 'john']

In [23]:

ages = {'steve': 41, 'john': 22}
'john' in ages

Out[23]:

True

In [24]:

41 in ages

Out[24]:

False

In [25]:

'john' in ages.keys()

Out[25]:

True

In [26]:

for k in ages:
    print(k, ages[k])

steve 41
john 22

Organising Source Code: Modules¶
In Python, a module is a single source file which defines one or more procedures or classes.
Load a module with the import directive.
After importing the module, all functions are grouped in the module namespace.
Python provides many useful modules.

In [27]:

import math
20 * math.log(3)

Out[27]:

21.972245773362197

Defining Modules¶
A module is a source file containing Python code, usually class/function definitions.

The first non-comment item can be a docstring for the module.

# my python module
"""This is a python module to
do something interesting"""

def foo(x):
    'foo the x'
    print('the foo is ' + str(x))

NLTK¶
NLTK is a Python module

In [28]:

import nltk

Let’s do some simple statistics on the Gutenberg corpus

In [29]:

nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /home/diego/nltk_data…
[nltk_data] Unzipping corpora/gutenberg.zip.

Out[29]:

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [30]:

emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

Out[30]:

192427

In [31]:

nltk.download('punkt')
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

[nltk_data] Downloading package punkt to /home/diego/nltk_data…
[nltk_data] Package punkt is already up-to-date!
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt

Counting Words¶

In [32]:

import collections
emma_counter = collections.Counter(emma)
emma_counter.most_common(10)

Out[32]:

[(',', 11454),
 ('.', 6928),
 ('to', 5183),
 ('the', 4844),
 ('and', 4672),
 ('of', 4279),
 ('I', 3178),
 ('a', 3004),
 ('was', 2385),
 ('her', 2381)]

In [33]:

emma_counter['Emma']

Out[33]:

865

Exercises¶
Identify the 10 most common words in each file of the Gutenberg corpus. Can you see any similarities among them?
Find the most frequent word with length of at least 7 characters.
Find the words that are longer than 7 characters and occur more than 7 times.
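As a hint for these exercises, Counter's most_common plus a comprehension over the counter cover most of the work. A sketch on a toy word list (not the real corpus):

```python
import collections

words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the',
         'elephant', 'elephant', 'trumpeted']
counter = collections.Counter(words)

print(counter.most_common(2))                     # the two most frequent words
long_words = [w for w in counter if len(w) >= 7]  # words of length >= 7
print(long_words)
```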

Count Bigrams¶
A bigram is a sequence of two words.

In [34]:

list(nltk.bigrams([1,2,3,4,5,6]))

Out[34]:

[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

In [35]:

list(nltk.bigrams(emma))[:5]

Out[35]:

[('[', 'Emma'),
 ('Emma', 'by'),
 ('by', 'Jane'),
 ('Jane', 'Austen'),
 ('Austen', '1816')]

A bigram is an n-gram where n is 2.
A trigram is an n-gram where n is 3.

In [36]:

list(nltk.ngrams(emma,4))[:5]

Out[36]:

[('[', 'Emma', 'by', 'Jane'),
 ('Emma', 'by', 'Jane', 'Austen'),
 ('by', 'Jane', 'Austen', '1816'),
 ('Jane', 'Austen', '1816', ']'),
 ('Austen', '1816', ']', 'VOLUME')]

Exercises¶
Find the most frequent bigram in Austen's Emma.
Find the most frequent bigram that begins with 'the'.
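As a hint, nltk.bigrams behaves like zip(tokens, tokens[1:]), so a Counter over bigrams gives their frequencies. A sketch on a toy token list:

```python
import collections

tokens = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'cat']

# Bigram frequencies: zip pairs each token with its successor
bigram_counts = collections.Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(1))

# Restrict to bigrams beginning with 'the'
the_bigrams = {bg: n for bg, n in bigram_counts.items() if bg[0] == 'the'}
print(the_bigrams)
```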

Text Processing in Python¶

Sorting¶
The function sorted() returns a sorted copy.
Sequences can be sorted in place with the sort() method.
Python 3 does not support sorting of lists with mixed contents.

In [37]:

foo = [2,5,9,1,11]
sorted(foo)

Out[37]:

[1, 2, 5, 9, 11]

In [38]:

foo

Out[38]:

[2, 5, 9, 1, 11]

In [39]:

foo.sort()

In [40]:

foo

Out[40]:

[1, 2, 5, 9, 11]

In [41]:

foo2 = [2, 5, 6, 1, 'a']
sorted(foo2)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
      1 foo2 = [2, 5, 6, 1, 'a']
----> 2 sorted(foo2)

TypeError: '<' not supported between instances of 'str' and 'int'

Sorting with a custom sorting criterion¶

In [42]:

l = ['a', 'abc', 'b', 'c', 'aa', 'bb', 'cc']

In [43]:

sorted(l)

Out[43]:

['a', 'aa', 'abc', 'b', 'bb', 'c', 'cc']

In [44]:

sorted(l, key=len)

Out[44]:

['a', 'b', 'c', 'aa', 'bb', 'cc', 'abc']

In [45]:

sorted(l, key=len, reverse=True)

Out[45]:

['abc', 'aa', 'bb', 'cc', 'a', 'b', 'c']

In [46]:

def my_len(x):
    return -len(x)

In [47]:

sorted(l, key=my_len)

Out[47]:

['abc', 'aa', 'bb', 'cc', 'a', 'b', 'c']

In [48]:

sorted(l, key=lambda x: -len(x))

Out[48]:

['abc', 'aa', 'bb', 'cc', 'a', 'b', 'c']

Exercises¶
You're given data of the following form:

namedat = dict()
namedat['mc'] = ('Madonna', 45)
namedat['sc'] = ('Steve', 41)

How would you print a list ordered by name?
How would you print a list ordered by age?

Strings in Python¶
String is a base type.
Strings are sequences and support the same operations as lists and tuples.

In [49]:

foo = "A string"
len(foo)

Out[49]:

8

In [50]:

foo[0]

Out[50]:

'A'

In [51]:

foo[0:3]

Out[51]:

'A s'

In [52]:

multifoo = """A multiline
string"""

In [53]:

multifoo

Out[53]:

'A multiline \nstring'

In [54]:

"my string".capitalize()

Out[54]:

'My string'

In [55]:

capitalize("my string")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
in
----> 1 capitalize("my string")

NameError: name 'capitalize' is not defined

In [56]:

"my string".upper()

Out[56]:

'MY STRING'

In [57]:

"My String".lower()

Out[57]:

'my string'

In [58]:

a = "my string with my other text"
a.count("my")

Out[58]:

2

In [59]:

a.find("with")

Out[59]:

10

In [60]:

a.find("nothing")

Out[60]:

-1

Out[60]:

-1

Split¶
split(sep) is a central string operation.
It splits a string wherever sep occurs (any run of whitespace by default).

In [61]:

foo = "one :: two :: three"
foo.split()

Out[61]:

['one', '::', 'two', '::', 'three']

In [62]:

foo.split('::')

Out[62]:

['one ', ' two ', ' three']

In [63]:

foo.split(' :: ')

Out[63]:

['one', 'two', 'three']

In [64]:

"this is a test".split()

Out[64]:

['this', 'is', 'a', 'test']

Join¶
join() is another useful string method.
It is called on a delimiter string and joins the elements of a list.

In [65]:

text = "this is some text to analyse"
words = text.split()
print(words)
words.sort()
print(words)
print(", ".join(words))

['this', 'is', 'some', 'text', 'to', 'analyse']
['analyse', 'is', 'some', 'text', 'this', 'to']
analyse, is, some, text, this, to

Replace¶

In [66]:

def censor(text):
    'replace bad words in a text with XXX'
    badwords = ['poo', 'bottom']
    for b in badwords:
        text = text.replace(b, 'XXX')
    return text

In [67]:

censor("this is all poo and more poo")

Out[67]:

'this is all XXX and more XXX'

Text Preprocessing with NLTK¶
Tokenisation¶

In [68]:

import nltk
nltk.download("punkt")
text = "This is a sentence. This is another sentence."
nltk.sent_tokenize(text)

[nltk_data] Downloading package punkt to /home/diego/nltk_data…
[nltk_data] Package punkt is already up-to-date!

Out[68]:

['This is a sentence.', 'This is another sentence.']

In [69]:

for s in nltk.sent_tokenize(text):
    for w in nltk.word_tokenize(s):
        print(w)
    print()

This
is
a
sentence
.

This
is
another
sentence
.

Part of speech tagging¶
Often it is useful to know whether a word is a noun, or an adjective, etc. These are called parts of speech.
NLTK has a part of speech tagger that tags a list of tokens.
The default list of parts of speech is fairly detailed but we can set a simplified version (called universal by NLTK).

The universal tagset:

Tag Meaning English Examples
ADJ adjective new, good, high, special, big, local
ADP adposition on, of, at, with, by, into, under
ADV adverb really, already, still, early, now
CONJ conjunction and, or, but, if, while, although
DET determiner, article the, a, some, most, every, no, which
NOUN noun year, home, costs, time, Africa
NUM numeral twenty-four, fourth, 1991, 14:24
PRT particle at, on, out, over per, that, up, with
PRON pronoun he, their, her, its, my, I, us
VERB verb is, say, told, given, playing, would
. punctuation marks . , ; !
X other ersatz, esprit, dunno, gr8, univeristy

In [70]:

nltk.download("averaged_perceptron_tagger")
nltk.pos_tag(["this", "is", "a", "test"])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/diego/nltk_data…
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!

Out[70]:

[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]

In [71]:

nltk.download("universal_tagset")
nltk.pos_tag(["this", "is", "a", "test"], tagset="universal")

[nltk_data] Downloading package universal_tagset to
[nltk_data] /home/diego/nltk_data…
[nltk_data] Unzipping taggers/universal_tagset.zip.

Out[71]:

[('this', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('test', 'NOUN')]

In [72]:

nltk.pos_tag(nltk.word_tokenize("this is a test"), tagset="universal")

Out[72]:

[('this', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('test', 'NOUN')]

In [73]:

text = "This is a sentence. This is another sentence."
text_sent_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
text_sent_tokens

Out[73]:

[['This', 'is', 'a', 'sentence', '.'],
 ['This', 'is', 'another', 'sentence', '.']]

In [74]:

nltk.pos_tag_sents(text_sent_tokens, tagset="universal")

Out[74]:

[[('This', 'DET'),
  ('is', 'VERB'),
  ('a', 'DET'),
  ('sentence', 'NOUN'),
  ('.', '.')],
 [('This', 'DET'),
  ('is', 'VERB'),
  ('another', 'DET'),
  ('sentence', 'NOUN'),
  ('.', '.')]]

Below is an implementation that has the same behaviour as pos_tag_sents. Hopefully this can help you understand how it works:

In [75]:

def my_pos_tag_sents(text_sent_tokens, tagset="universal"):
    return [nltk.pos_tag(s, tagset=tagset) for s in text_sent_tokens]

In [76]:

my_pos_tag_sents(text_sent_tokens, tagset="universal")

Out[76]:

[[('This', 'DET'),
  ('is', 'VERB'),
  ('a', 'DET'),
  ('sentence', 'NOUN'),
  ('.', '.')],
 [('This', 'DET'),
  ('is', 'VERB'),
  ('another', 'DET'),
  ('sentence', 'NOUN'),
  ('.', '.')]]

Stemming¶
Often it is useful to remove information such as verb form, or the difference between singular and plural.
NLTK offers stemming, which removes suffixes. The Porter stemmer is a popular stemmer.

The remaining stem is not a word but can be used, for example, by search engines (we’ll see more of this in another lecture).

In [77]:

s = nltk.PorterStemmer()

In [78]:

s.stem("books")

Out[78]:

'book'

In [79]:

s.stem("running")

Out[79]:

'run'

In [80]:

s.stem("run")

Out[80]:

'run'

In [81]:

s.stem("goes")

Out[81]:

'goe'

In [82]:

[s.stem(w) for w in nltk.word_tokenize("I'm running and he goes")]

Out[82]:

['I', "'m", 'run', 'and', 'he', 'goe']

Exercises¶
What is the sentence with the largest number of tokens in Austen's "Emma"?
What is the most frequent part of speech in Austen's "Emma"?
What is the number of distinct stems in Austen's "Emma"?
What is the most ambiguous stem in Austen's "Emma"? (meaning, which stem in Austen's "Emma" maps to the largest number of distinct tokens?)
