
COMP3430 / COMP8430 Data wrangling
Hope you all had a good semester break!
Interactive lecture week 7: Labs and assignments, and questions for topic 7 (Lecturer: )

Lecture outline
● Administrative matters – Assignments and Labs
● Questions on Wattle
● Q and A Session
● Quick recap on Python scripts used in labs

Labs
● We had lab 4 this week, a continuation of the record linkage project
● Different comparison functions were discussed
● Sample solutions for the lab will be released on Monday
● All of labs 3 to 7 are highly relevant to assignment 3

Assignments
● Assignment 1 marks are now released
● You have two weeks to query your marks (until
Tuesday 5 October 2021, 5pm)
● You must carefully read the marking feedback document provided

Assignments Cont.
● Some students did not properly submit their assignment 1 – we emailed those students their feedback
● For future assignments, we will apply a 20% penalty for students who do not properly submit their assignments!

Assignments Cont.
● If you believe we have made a mistake in marking, then email me (Anushka); do not email the tutors, and do NOT ask them in the lab sessions
● If you ask for your assignment to be remarked, then we will do so but your mark might go up or down
● So far we have identified one marking issue
– MAD (median absolute deviation) is calculated differently in R and Python: R’s mad() applies a scaling factor of 1.4826 by default
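To see where the discrepancy comes from, here is a minimal sketch of a MAD calculation; the function name and sample data are illustrative, not from the assignment:

```python
from statistics import median

def mad(values, scale=1.0):
    """Median absolute deviation (MAD) of a sequence of numbers.

    R's mad() multiplies by a scaling constant of 1.4826 by default
    (so that MAD estimates the standard deviation for normally
    distributed data); a plain Python computation like this one does
    not, unless the scale argument is set explicitly.
    """
    med = median(values)
    return scale * median(abs(v - med) for v in values)

data = [1, 2, 3, 4, 5, 6, 7, 100]
print(mad(data))          # 2.0    (unscaled, typical Python result)
print(mad(data, 1.4826))  # 2.9652 (matches R's default behaviour)
```

Both results are valid MAD values; they simply use different conventions, which is why the same data can produce different outlier decisions in R and Python.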

Assignments 2, 3, and 4, and Quiz 3
● All assignment specifications are available on Wattle
● Be careful when generating your data sets (these are specific to each assignment)
● Submissions for assignments 2 and 3 are open; the assignment 4 submission (COMP8430 students only) will open soon
● Start working on these assignments now
● Quiz 3 will be open on Monday 27th September

Assignments 2 and 3
● Assignment 2 – You need to merge (join) the two data sets using the common attribute Social Security Number (SSN)
● Assignment 3 – You need to perform a Record Linkage process (Blocking, Comparison, Classification, and Evaluation) on the two data sets
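As a rough illustration of the assignment 2 join step, the sketch below merges two record lists on a shared key. The field names and sample records are hypothetical; the actual assignment data sets will differ:

```python
# A minimal sketch of merging (joining) two data sets on a common
# attribute, here called 'ssn'. Records are plain dictionaries.

def merge_on_key(records_a, records_b, key="ssn"):
    """Inner join: pair up records from both data sets sharing a key value."""
    index_b = {}
    for rec in records_b:
        index_b.setdefault(rec[key], []).append(rec)

    merged = []
    for rec_a in records_a:
        for rec_b in index_b.get(rec_a[key], []):
            combined = dict(rec_a)  # copy so the input is not modified
            for field, value in rec_b.items():
                if field != key:
                    combined[field] = value
            merged.append(combined)
    return merged

data_a = [{"ssn": "123", "name": "amy"}, {"ssn": "456", "name": "bob"}]
data_b = [{"ssn": "123", "city": "sydney"}]
print(merge_on_key(data_a, data_b))
# [{'ssn': '123', 'name': 'amy', 'city': 'sydney'}]
```

Building an index on the key first keeps the join linear in the size of the inputs, rather than comparing every record pair.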

Assignment submissions – Important
● To save (partial) answers for an assignment, click ‘Finish attempt’ and then ‘Return to attempt’
● To submit your assignment, click on ‘Finish attempt’ and then ‘Finish all and submit’
(you can only do this once!)
● You MUST click on ‘Finish all and submit’ before the submission deadline

From last topic (Topic 6)
● Numerical value comparison with maximum percentage
difference
– Given two numerical values n1 and n2, the percentage difference pd can be calculated as:
pd = 100 * |n1 – n2| / max(n1, n2)
– Then we define pd_max and calculate the approximate similarity value as:
sim = 1.0 – (pd / pd_max)  if pd < pd_max
sim = 0.0                  else

From last topic (Topic 6)
● Levenshtein edit distance
– Insertion of a character: “pete” → “peter”
– Deletion of a character: “miller” → “miler”
– Substitution of a character: “smith” → “smyth”
● Damerau-Levenshtein edit distance
– Insertion of a character: “pete” → “peter”
– Deletion of a character: “miller” → “miler”
– Substitution of a character: “smith” → “smyth”
– Transposition of two adjacent characters: “sydney” → “sydeny”

Summary of week 7
● Record linkage / data matching / entity resolution
– The process of identifying records in one or more databases that correspond to the same real-world entity (person, business, product, etc.)
– In weeks 5 and 6 we talked about blocking and comparison
● Classification
– Classify record pairs into matches and non-matches based on their calculated similarity vectors
– Sometimes we also have potential matches
– Different classification techniques can be used (threshold-based classification, probabilistic classification, etc.)

Questions from Wattle forum (1)
● My question is: could I know the strengths and weaknesses of each classification method, such as threshold-based classification, probabilistic classification, cost-based classification, rule-based classification, machine learning based classification, and so on? What are the quality and complexity of each method? For quality, I mean: does one method get a higher accuracy or F1 score than another method in general? For complexity, I mean: what is the Big O notation for each method? Also, which method is prevalent in industry and research?

Questions from Wattle forum (1)
● Threshold-based classification – record pairs are classified using user-defined threshold values
● Probabilistic classification – record pairs are classified by assigning probabilistic weights to attributes
● Cost-based classification – record pairs are classified in a way that the overall cost of classification is minimised
● Rule-based classification – record pairs are classified using a set of rules
● Machine learning based classification – ML techniques such as decision trees are used to classify record pairs

Questions from Wattle forum (1)
● Threshold-based classification
– Simplest form of classification
– Assuming all similarity values are normalised, the same importance is given to all attributes
– Detailed information is lost when summing all similarities together
● Probabilistic classification
– Accurate calculation of error estimates is required
– Finding optimal thresholds is difficult
– The assumption of independence is not always true
● Cost-based classification
– Costs need to be determined
– A model needs to be built that minimises the overall cost

Questions from Wattle forum (1)
● Rule-based classification
– Generally, a small set of rules is preferable
– The order of the rules needs to be taken into account
– Rules need to be highly accurate and have high coverage
● Machine learning based classification
– Both supervised and unsupervised techniques can be used
– Active learning is an ongoing research area
– Need training data and/or domain expertise

Q and A Session
● Socrative - https://b.socrative.com/login/student/ - Room Name: COMP3430
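To tie the classification recap back to the lab scripts, here is a minimal sketch of threshold-based classification. It assumes each record pair already has a normalised similarity vector (one value per compared attribute); the threshold values are illustrative only:

```python
# A minimal sketch of threshold-based classification of record pairs.
# Each similarity vector holds one normalised similarity (0.0 to 1.0)
# per compared attribute. Thresholds here are illustrative.

def classify(sim_vector, match_t=0.85, non_match_t=0.5):
    """Classify a record pair by its average attribute similarity."""
    # Summing (or averaging) loses the per-attribute detail, and gives
    # every attribute the same importance - both noted weaknesses of
    # this simplest form of classification.
    total = sum(sim_vector) / len(sim_vector)
    if total >= match_t:
        return "match"
    if total <= non_match_t:
        return "non-match"
    return "potential match"

print(classify([0.9, 1.0, 0.8]))  # 'match'
print(classify([0.1, 0.2, 0.0]))  # 'non-match'
print(classify([0.9, 0.4, 0.6]))  # 'potential match'
```

Using two thresholds, as here, yields the three classes discussed above (matches, non-matches, and potential matches for manual review); a single threshold would give only the first two.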