EECS 485 Lecture 15
Text Analysis

Groupwork strategies for P4
• Hard part of P4: debugging
• Don’t use your team to write code in parallel – none of the code can be
written in parallel
• Use your team to write code that’s more likely to work the first time
• My recommended strategy
• Code together (Zoom or in-person)
• Set a timer for 20 minutes, rotate who is doing the typing
• Everyone should be able to run the project on their own computer at all times
• Use gitlab.eecs.umich.edu

Mask policy reminder
• “Masks will remain required in classrooms and other instructional spaces, patient care areas, campus buses and in campus COVID-19 testing sites at least through the end of the winter term.”

Kids and search engines
“There’s a creepy guy on the other end at Google!: engaging middle school students in a drawing activity to elicit their mental models of Google”, Kodama et al., 2017
https://link.springer.com/article/10.1007/s10791-017-9306-x

Learning Objectives
• Analyze the text of documents to create an index used in web search
• Build Boolean retrieval models which track whether a term is present in a document
• Use vector space models which use the angle between vectors to quantify similarity
• Use the tf-idf metric to build more precise vectors for a vector space model

Boolean Retrieval

Key problem: ranking results
• 33% of clicks go to the top result
• Different ranking methods
• Words on page
• Importance of page using links

Goal of ranking algorithms
• Which web pages (documents) does the person searching want to find?
Kangaroos live in Australia and jump.
Cows live all over the world. Unlike kangaroos, they cannot jump.
Aluminum foil is shiny.

Simplest ranking algorithm: Boolean retrieval
Kangaroos live in Australia and jump.
Cows live all over the world. Unlike kangaroos, they cannot jump.
Aluminum foil is shiny.

Index for Boolean retrieval
• Inverted index: words to documents
Document 0
Kangaroos can jump.
Document 1
Cows can not jump.

Boolean search using inverted index
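A minimal sketch of this idea in Python (the two example documents come from the previous slide; the and_query helper name is my own, not the project's API):

# Build an inverted index: each term maps to the set of documents containing it.
documents = {
    0: "Kangaroos can jump.",
    1: "Cows can not jump.",
}

inverted_index = {}
for doc_id, text in documents.items():
    for term in text.lower().replace(".", "").split():
        inverted_index.setdefault(term, set()).add(doc_id)

def and_query(terms):
    """Boolean AND: return documents that contain every query term."""
    result = None
    for term in terms:
        docs = inverted_index.get(term, set())
        result = docs if result is None else result & docs
    return result or set()

print(and_query(["kangaroos", "jump"]))  # {0}
print(and_query(["cows", "jump"]))       # {1}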

Exercise: repl.it
• jkloosterman.net/485
• Lecture 15 Exercise 1
• Build a search engine!

Vector Space Model

Boolean index to vectors

Why vectors?
doc0 = [1, 1, 1, 0, 0]
doc1 = [0, 1, 1, 1, 1]
doc2 = [1, 0, 0, 1, 1]

Documents as vectors
• A document is a vector
• Each dimension represents a word
• # of dimensions: # of unique words in all documents
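A minimal sketch of building 0/1 term vectors from the example documents (illustrative only; the repl.it exercise provides its own code, and the tokenize helper here is a simplification):

documents = [
    "Kangaroos live in Australia and jump.",
    "Cows live all over the world. Unlike kangaroos, they cannot jump.",
    "Aluminum foil is shiny.",
]

def tokenize(text):
    return text.lower().replace(".", "").replace(",", "").split()

# The vocabulary: one dimension per unique word across all documents.
vocabulary = sorted({term for doc in documents for term in tokenize(doc)})

# Each document becomes a vector of 0s and 1s over the vocabulary.
vectors = []
for doc in documents:
    terms = set(tokenize(doc))
    vectors.append([1 if word in terms else 0 for word in vocabulary])

print(vocabulary)
print(vectors[0])  # 1 in the dimensions for words document 0 contains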

Investigation: repl.it
• jkloosterman.net/485 : Lecture 15 Ex 2
• Code provided for you to build vectors from documents
• Example output:

tf-idf: Term frequency-inverse document frequency

Adding more information
• Right now, each vector entry is only 0 or 1, depending on whether the document contains that word
doc1 = [1, 1, 1, 0, 0]
• Idea: use entire range [0, 1] to give more information
doc1 = [0.2, 0.1, 0.5, 0, 0]

Term frequency
Kangaroos can jump. Kangaroos live in Australia.
Australia is in the southern hemisphere.
One of the longest flights is from London to Sydney, Australia.

Document frequency
Kangaroos can jump. Kangaroos live in Australia.
Australia is in the southern hemisphere.
One of the longest flights is from London to Sydney, Australia.
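A minimal sketch of computing term frequency (how often a term appears within one document) and document frequency (how many documents contain the term), using these example sentences (the tokenize helper is my simplification):

from collections import Counter

documents = [
    "Kangaroos can jump. Kangaroos live in Australia.",
    "Australia is in the southern hemisphere.",
    "One of the longest flights is from London to Sydney, Australia.",
]

def tokenize(text):
    return text.lower().replace(".", "").replace(",", "").split()

# Term frequency: counts of each term within a single document.
term_freqs = [Counter(tokenize(doc)) for doc in documents]
print(term_freqs[0]["kangaroos"])  # 2
print(term_freqs[0]["australia"])  # 1

# Document frequency: how many documents contain the term at least once.
doc_freq = Counter()
for counts in term_freqs:
    doc_freq.update(counts.keys())
print(doc_freq["australia"])  # 3
print(doc_freq["kangaroos"])  # 1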

tf-idf Formula
Kangaroos can jump.
Kangaroos live in Australia.
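As a worked illustration (my numbers, treating these two sentences as the entire collection so $N = 2$, and using the non-normalized weight $w_{ik} = tf_{ik}\,\log(N/n_k)$ from the normalization slide):

$w_{0,\text{kangaroos}} = 1 \cdot \log(2/2) = 0 \qquad w_{1,\text{australia}} = 1 \cdot \log(2/1) = \log 2$

A term that occurs in every document ("kangaroos") gets weight 0, while a term concentrated in one document ("australia") gets a positive weight.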

TF-IDF normalization
• Normalize term weights
• Longer docs are not given more weight
• Normalize to sum-of-squares:
$w_{ik} = \dfrac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N/n_k)]^2}}$
• Some references use non-normalized tf-idf: $w_{ik} = tf_{ik}\,\log(N/n_k)$
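A minimal sketch of computing normalized tf-idf vectors with this formula (illustrative; I use the natural log, and the tokenize helper is a simplification):

import math
from collections import Counter

documents = [
    "Kangaroos can jump. Kangaroos live in Australia.",
    "Australia is in the southern hemisphere.",
    "One of the longest flights is from London to Sydney, Australia.",
]

def tokenize(text):
    return text.lower().replace(".", "").replace(",", "").split()

term_freqs = [Counter(tokenize(doc)) for doc in documents]
N = len(documents)

# n_k: the number of documents that contain each term at least once.
doc_freq = Counter()
for counts in term_freqs:
    doc_freq.update(counts.keys())

def tfidf_vector(counts):
    """Normalized tf-idf weights for one document (term -> weight)."""
    weights = {term: tf * math.log(N / doc_freq[term])
               for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: (w / norm if norm > 0 else 0.0)
            for term, w in weights.items()}

doc0 = tfidf_vector(term_freqs[0])
print(doc0["kangaroos"])  # largest weight: frequent here, rare elsewhere
print(doc0["australia"])  # 0.0: "australia" appears in every document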

Vector space similarity
• Similarity of two docs is:
$Sim(D_i, D_j) = \sum_k w_{ik}\,w_{jk}$ when the term weights were normalized ahead of time, when computing term weights.
$Sim(D_i, D_j) = \dfrac{\sum_k w_{ik}\,w_{jk}}{\sqrt{\sum_k w_{ik}^2}\,\sqrt{\sum_k w_{jk}^2}}$ when the weights were not normalized ahead of time.
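A minimal sketch of the two variants (my helper names; vectors are dicts mapping term to weight, as in the tf-idf sketch above):

import math

def dot(vec_a, vec_b):
    """Dot product of two sparse term-weight vectors."""
    return sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())

def similarity_normalized(vec_a, vec_b):
    """Cosine similarity when weights were already normalized: just the dot product."""
    return dot(vec_a, vec_b)

def similarity_unnormalized(vec_a, vec_b):
    """Cosine similarity when weights were not normalized ahead of time."""
    norm_a = math.sqrt(dot(vec_a, vec_a))
    norm_b = math.sqrt(dot(vec_b, vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(vec_a, vec_b) / (norm_a * norm_b)

vec_a = {"kangaroos": 0.8, "jump": 0.6}
vec_b = {"cows": 0.6, "jump": 0.8}
print(similarity_normalized(vec_a, vec_b))  # 0.48 (both vectors already unit length)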

Learning Objectives
• Analyze the text of documents to create an index used in web search
• Build Boolean retrieval models which track whether a term is present in a document
• Use vector space models which use the angle between vectors to quantify similarity
• Use the tf-idf metric to build more precise vectors for a vector space model
