Text, Web And Media Analytics
Week 1 – Lecture
Introduction
Yuefeng Li | Professor
School of Computer Science
Queensland University of Technology
S Block, Level 10, Room S-1024, Gardens Point Campus
ph 3138 5212 | email
Teaching Staff
Unit Coordinator:
Professor Yuefeng Li
Phone: 3138 5212
Office: 1024, S Block, GP
Name: Mr Darshika Koggalahewa
1. Overview
Web, Data Mining
Data Science
AI, Machine Learning & Big Data
2. Text Analysis
Data Formats
Information Retrieval (IR)
Text Analysis & Machine Learning
3. Web And Media Analytics
4. Application Examples
5. IFN647 Weekly Schedule
6. Teaching Methodology
1. Overview
Web, Data Mining
Data Science
AI, Machine Learning & Big Data
Related Knowledge
Pre-requisite:
Assumed Knowledge
Programming languages experience (e.g., IFN554)
Basic statistics functions (required by IFN509)
Why is this unit important?
Big data is a collection of data so large and complex that it becomes very difficult to process. It extends beyond structured data, including unstructured data: text, audio, video, click streams, log files and more.
It is estimated that more than 80% of big data is stored in text format, which contains a large amount of valuable knowledge to be extracted automatically.
The dramatic increase in the availability of massive text data from various sources creates a number of issues and challenges for text analysis, such as noisy and uncertain information, scalability and effectiveness.
Big data is more than a challenge for enterprises. It is also an opportunity to find insight in new and emerging types of data, to make business more agile, and to answer questions that, in the past, were beyond reach.
Therefore, there is a significant need for more efficient and effective techniques to facilitate this process.
Technology is changing jobs
https://eab.com/insights/daily-briefing/workplace/the-top-10-emerging-jobs-for-2022/
Data Analysts and Scientists
AI and Machine Learning Specialists
Data Mining to Data Science
The aims of Data Science
According to Dhar:
“to find interesting and robust patterns that satisfy the data”, where
“interesting” – unexpected or of value, and actionable
“robust” – a pattern expected to occur in the future, i.e., it has to exhibit predictive power.
A famous example in the world of data mining:
We may find that beer and diapers are often bought together at the supermarket (unexpected?)
Then perhaps putting beer next to the diaper section will increase sales.
Key Processes of the DS workflow
Stage 1: define a question of interest
Stage 2: define the ideal dataset for answering the question
Stage 3: acquire the data (ideal or approximation)
Stage 4: clean the data
Stage 5: explore the data
Stage 6: statistical prediction and modelling of the data
Stage 7: interpret the results
Stage 8: communicate and distribute the results through synthesis and write-up
[Workflow figure labels: pre-processing; visualization and reporting]
Job Titles
Data Analyst:
They develop insight and gain information through the collection, analysis and interpretation of data. They work for businesses and other types of organizations, identifying and helping to solve problems. As a data analyst, you’ll use programming and computer software skills to complete statistical analysis of data.
Data Scientist
A data scientist is in the same broad career stream as a data analyst (see above). Perhaps the main difference is that data scientists are expected to use advanced programming skills more routinely. They don’t just gain insights from data; they also do things like building complex behavioural models using big data.
What types of data analysis?
Descriptive
Exploratory
Inferential
Predictive
Mechanistic
(in increasing level of complexity and power)
AI for Data Science
Artificial intelligence (AI) is intelligence demonstrated by machines; it was founded as an academic discipline in 1956.
A famous early AI system is MYCIN, which uses expert medical knowledge to diagnose and prescribe treatment for spinal meningitis and bacterial infections of the blood.
Expert knowledge – Rule example
IF (a) the infection is primary bacteremia, (b) the site of the culture is one of the sterile sites, and (c) the suspected portal of entry is the gastrointestinal tract, THEN there is suggestive evidence (0.7) that the infection is bacteroid.
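To make the idea concrete, here is a minimal sketch (in Python, not MYCIN's original encoding) of how such a production rule with a certainty factor might be represented and fired; all identifiers below are illustrative assumptions.

```python
# A hedged sketch: one production rule represented as plain Python data,
# with its certainty factor, plus a trivial matcher.
rule = {
    "conditions": [
        ("infection", "is", "primary-bacteremia"),
        ("culture-site", "in", "sterile-sites"),
        ("portal-of-entry", "is", "gastrointestinal-tract"),
    ],
    "conclusion": ("infection", "is", "bacteroid"),
    "certainty": 0.7,
}

def rule_fires(rule, facts):
    """Return (conclusion, certainty) if every condition holds in `facts`."""
    if all(cond in facts for cond in rule["conditions"]):
        return rule["conclusion"], rule["certainty"]
    return None

facts = {
    ("infection", "is", "primary-bacteremia"),
    ("culture-site", "in", "sterile-sites"),
    ("portal-of-entry", "is", "gastrointestinal-tract"),
}
print(rule_fires(rule, facts))  # (('infection', 'is', 'bacteroid'), 0.7)
```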
Acquiring expert knowledge manually is expensive. The problem is how to exploit the useful information and knowledge hidden in huge amounts of data. There are now many data mining and machine learning algorithms and tools that can be used for this purpose.
How do we use these tools in AI-based applications, and how do we design (or implement) such a system?
Data mining refers to extracting or “mining” knowledge from data (or large amounts of data); it is also called Knowledge Discovery in Databases (KDD).
Machine learning systems automatically learn programs from data. This is often a very attractive alternative to constructing programs manually. They are used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications.
Machine learning will be the driver of the next big wave of innovation.
How Does AI Contribute to Big Data?
It can contribute to the velocity of data by facilitating rapid decisions that lead to other decisions, e.g., many of the world’s stock trades are executed using AI-based systems.
It can be used to mitigate the problem of variety by capturing, structuring, and understanding unstructured data or information. E.g., understanding the impact of unstructured data (text) issues such as a firm’s reputation, or intent analysis of comments.
AI can be used to identify and clean dirty data (an issue for volume) or use dirty data as a means of establishing context knowledge for the data.
It can make intelligent data visualization apps available, possibly for particular types of data.
AI can be used to process and interpret knowledge or insights in natural languages and/or other forms of data, e.g., audio, video or image data.
Big Data: the 3 Vs
Big data is about volume: volumes of data that can, in fact, reach unprecedented heights.
Velocity essentially measures how fast the data is coming in. Some data will arrive in real time, whereas other data will arrive in fits and starts, sent to us in batches.
Variety refers to different data formats. Data that once took the shape of database files (such as Excel, CSV and Access files) is now presented in non-traditional forms, like video, text, PDF and graphics on social media, as well as via technologies such as wearable devices.
What’s this unit?
Data is becoming central to every organization’s decision making process, and the demand for data savvy software engineers is rapidly increasing.
Modern computational approaches to data analysis have to enable users to acquire, manage and interpret large volumes of heterogeneous data (Big data).
It is characterized by three “Vs”: volume, variety and velocity.
The figure (on the left of the original slide) shows the rapid growth of the volumes of unstructured and structured data.
This unit provides an understanding of the principles and techniques underlying the development of Text, Web and social media analysis solutions to some of the varied and complex problems that involve big data.
It will introduce you to a wide range of methods for analysing text data in order to design advanced AI-based systems (or learning systems).
2. Text analysis
Data Formats
Information Retrieval (IR)
Applications of Text Analysis
Text Analysis
Data formats & Pre-processing
Representation
Information Retrieval (IR) models
Information Filtering and Classification
Web and Media Data Analysis
It is the process of deriving high-quality information from a set of documents.
High-quality information is typically derived through the discovery of topics, patterns and trends, by means such as learning algorithms.
Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure text information for business intelligence, exploratory data analysis, research, or investigation.
Data Formats
There are many kinds of data formats
Unstructured: no pre-defined manner or data model; e.g., emails (Free-text), or HTML pages.
Semi-structured: self-described structure; e.g., XML files, JSON, or Delimiter-separated values files.
Structured formats: organized into a formatted repository; e.g., SQL databases.
The four most common formats used to markup text are: HTML, XML, JSON and CSV
Examples of Free-text and HTML
Donald J. Trump (Verified account) @realDonaldTrump
The NFL National Anthem Debate is alive and well again – can’t believe it! Isn’t it in contract that players must stand at attention, hand on heart? The $40,000,000 Commissioner must now make a stand. First time kneeling, out for game. Second time kneeling, out for season/no pay!
XML example
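The original slide's XML figure is not reproduced here; as a hedged stand-in, the snippet below marks up the tweet from the previous slide as XML and parses it with Python's standard library (the element and attribute names are illustrative assumptions).

```python
import xml.etree.ElementTree as ET

# A small XML document: elements nest, and elements carry attributes.
xml_text = """<tweet>
  <user verified="true">realDonaldTrump</user>
  <text>The NFL National Anthem Debate is alive and well again</text>
</tweet>"""

root = ET.fromstring(xml_text)
print(root.find("user").text)             # realDonaldTrump
print(root.find("user").get("verified"))  # true
```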
JSON (JavaScript Object Notation) is a data interchange format
Readable to humans & can be transmitted/interpreted by computer
Data represented as attribute-value pairs
Values can be:
strings (John)
digits (25)
objects (like for address)
arrays (like for children)
boolean (true)
the null value (as for spouse)
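A minimal sketch tying the list above together: a JSON record with the slide's example values (John, 25, an address object, a children array, a boolean, and a null spouse) parsed with Python's json module; field values not named on the slide are illustrative assumptions.

```python
import json

# The slide's example record; "Brisbane", "Anna" and "Ben" are assumed values.
json_text = """{
  "name": "John",
  "age": 25,
  "isAlive": true,
  "address": {"city": "Brisbane"},
  "children": ["Anna", "Ben"],
  "spouse": null
}"""

record = json.loads(json_text)    # parse JSON text into Python objects
print(record["address"]["city"])  # nested object -> Brisbane
print(record["children"][0])      # array element -> Anna
print(record["spouse"])           # null          -> None
```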
Converting JSON files
JSON can have nested structure
XML, HTML also have nested data
CLI tools treat files as plain text (i.e. no explicit structure, or only contain very simple delimiter-separated rows)
CLI commands seen so far may be difficult to apply to JSON & friends
Two solutions:
use command line tools designed to operate on such data
convert nested/structured data into tabular plain text
Read .json files in Python
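A minimal sketch, assuming a file named data.json exists in the working directory (the filename is an assumption):

```python
import json

# json.load deserialises an open file into Python dicts/lists/values.
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(type(data))  # dict or list, depending on the file's top-level element

# For solution 2 above (nested JSON -> tabular plain text), one common
# option is pandas' json_normalize, if pandas is available:
# import pandas as pd
# table = pd.json_normalize(data)
```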
Delimiter-separated files
They are used to store data expressed in two-dimensional arrays A(i, j)
Rows can be separated by new line character (\n)
Columns can be separated using a delimiter character
commas (CSV), space, tabs (TSV)
Please note column headers are sometimes included in the first row of the file.
Student ID   First name   Last name   Grade
3050395                   -Jones      6
4749573                   Johnson     5
3729173                               7
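A minimal sketch reading a table like the one above, assuming it has been saved as students.csv with the headers in the first row (the filename is an assumption):

```python
import csv

with open("students.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # uses the first row as column headers
    for row in reader:
        print(row["Student ID"], row["Grade"])
```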
What is a Document?
web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, IM sessions, etc.
Common properties
Significant text content
Some structure (e.g., an XML document may include the title, author, date for papers; subject, sender, destination for email)
Documents vs. Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
Easy to compare fields with well-defined semantics to queries in order to find matches
Text is more difficult !!!
Documents vs. Records
Example bank database query
Find records with balance > $50,000 in branches located in Amherst, MA.
Matches easily found by comparison with field values of records
Example search engine query
bank scandals in western mass
This text must be compared to the text of entire news stories, and consider semantic issues, e.g.,
Synonym of “scandal”
Disgrace, Dishonor or Indignity;
Stemming (e.g., plural)
Scandals, Disgraces, Dishonors, Indignities
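As a hedged illustration of the stemming step (this unit's own methods come later), NLTK's PorterStemmer reduces such variants to a common stem; the snippet assumes the nltk package is installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["scandals", "disgraces", "dishonors", "indignities"]:
    # Reduce each plural/inflected form to its stem, e.g. scandals -> scandal
    print(word, "->", stemmer.stem(word))
```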
Information Retrieval (IR)
IR is more than just text, and more than just web search
although these are central
People doing IR work with different media, different types of search applications, and different tasks
Comparing Text
Comparing the query text to the document text and determining what is a good match is the core issue of IR
Exact matching of words is not enough
Many different ways to write the same thing in a “natural language” like English
e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
Some stories will be better matches than others
Other Media
New applications increasingly involve new media
e.g., video, photos, music, speech
Like text, content is difficult to describe and compare
text may be used to represent them (e.g. tags)
IR approaches to search and evaluation are appropriate
Speech-to-Text
Accurately convert speech into text using an API powered by Google’s AI technologies.
Voice recognition technology, designed to automatically transcribe audio, is becoming steadily more advanced. Throw the concurrent advances in smartphones into the mix and you can now have dictation available whenever you need it.
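As a hedged sketch of the API mentioned above, the snippet below follows the documented google-cloud-speech v1 Python quickstart; exact class names can differ between client-library versions, and the filename, encoding and sample rate are illustrative assumptions.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Load a local audio file (filename and format are assumptions).
with open("audio.raw", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Synchronous recognition: returns transcripts for short audio clips.
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```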
Best paid-for speech-to-text apps
Dragon Anywhere
Dragon Professional
Amazon Transcribe
Microsoft Azure Speech to Text
Watson Speech to Text
Dimensions of IR
Content: Text, Images, Video, Scanned docs, Audio, Music
Applications: Web search, Vertical search, Enterprise search, Desktop search, Forum search, P2P search, Literature search
Tasks: Ad hoc search, Filtering, Classification, Question answering
Big Issues in IR
Relevance: what is it?
Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style
Topical relevance (same topic) vs. user relevance (everything else)
Example of relevance
Q: “Australia great barrier reef”
Example of relevance cont.
Big Issues in IR
Retrieval models define a view of relevance
Ranking algorithms used in search engines are based on retrieval models
Most models describe statistical properties of text rather than linguistic
i.e., counting simple text features such as words instead of parsing and analyzing the sentences (see the sketch below)
Statistical approach to text processing started with Luhn in the 50s
Linguistic features can be part of a statistical model
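A minimal sketch of that statistical view, counting word features with only the Python standard library (the sample text reuses the earlier bank example):

```python
import re
from collections import Counter

text = "Bank director in Amherst steals funds. Bank scandals in western mass."
tokens = re.findall(r"[a-z]+", text.lower())  # crude word tokenisation
print(Counter(tokens).most_common(3))         # e.g. [('bank', 2), ('in', 2), ...]
```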
Big Issues in IR
Evaluation
Experimental procedures and measures for comparing system output with user expectations
Originated in Cranfield experiments in the 60s
IR evaluation methods now used in many fields
Typically use test collection of documents, queries, and relevance judgments
Most commonly used are TREC collections
Recall and precision are two examples of effectiveness measures
In pattern recognition, information retrieval and classification (machine learning), precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. Both precision and recall are therefore based on relevance.
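In symbols, for a single query's result set:

```latex
\mathrm{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}
\qquad
\mathrm{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
```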
Big Issues in IR
Users and Information Needs
Search evaluation is user-centered
Keyword queries are often poor descriptions of actual information needs
Interaction and context are important for understanding user intent
Query refinement techniques such as query expansion, query suggestion, relevance feedback improve ranking
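As a hedged sketch of one classic relevance-feedback method (Rocchio, not necessarily the formulation this unit will use), the query vector is moved toward judged-relevant documents and away from non-relevant ones; the weights and toy vectors below are illustrative assumptions.

```python
import numpy as np

query = np.array([1.0, 0.0, 0.5])          # term weights of the original query
relevant = [np.array([0.9, 0.1, 0.7])]     # vectors of judged-relevant docs
nonrelevant = [np.array([0.0, 0.8, 0.1])]  # vectors of judged-non-relevant docs

alpha, beta, gamma = 1.0, 0.75, 0.15       # assumed Rocchio weights
new_query = (alpha * query
             + beta * np.mean(relevant, axis=0)
             - gamma * np.mean(nonrelevant, axis=0))
print(new_query)  # moved toward relevant docs, away from non-relevant ones
```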
Text Analysis & Machine Learning
Information Filtering Systems (Information Gathering)
An information filtering system monitors an incoming document stream, either to rank the documents or to make binary decisions about which are relevant and non-relevant.
Text Classification
Identify relevant labels (or categories) for documents
A process of classifying text data into categories (multi-class or multi-label categorization) using classifiers learned from training samples.
Text feature selection for text mining or opinion mining
Feature selection is a technique that selects a subset of the features available for describing the data.
Feature selection techniques aim to remove non-informative features according to corpus statistics and to reduce the dimensionality of the data.
Feature selection can increase classification or clustering accuracy and decrease computational complexity by eliminating noise features (see the sketch after this list).
Features: simple structures (words), complex linguistic structures (phrases or parts of speech), statistical structures (n-grams, patterns or topics), supporting information (a word’s first position), or named entities (organization names)
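A hedged sketch of corpus-statistics feature selection, using scikit-learn's chi-square filter over bag-of-words features; it assumes scikit-learn is installed, and the toy corpus, labels and k are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills online", "meeting agenda attached",
        "win money now", "project deadline tomorrow"]
labels = [1, 0, 1, 0]  # assumed labels: 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(docs)  # word (bag-of-words) features
X_new = SelectKBest(chi2, k=4).fit_transform(X, labels)
print(X_new.shape)  # (4 documents, 4 selected features)
```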
3. Web And Media Analytics
It is the collection, analysis and reporting of web data and media information for business or market research. It can also be used to improve the quality of web pages.
Web Information & Data
The vast amount of information available on the Web has great potential to improve the quality of decisions and the productivity of consumers.
In many cases, manual browsing through even a limited portion of the relevant information obtainable through search engines is no longer effective.
However, the Web’s large number of information sources and their different levels of accessibility, reliability, completeness, and associated costs present human decision makers with a complex information gathering planning problem that is too difficult to solve without high-level filtering of information.
There are different types of web information and data, e.g., text and images (e.g., homepages, emails), graphs or logs, to describe Web content, structure and usage.
Web Mining
Web server log file analysis
Usually web servers record some of their transactions in a log file.
Log files contain information on visits.
A typical example is a web server log which maintains a history of page requests.
Each entry typically records:
The client IP address
User id
Date, time and time zone
The client’s request
HTTP status code
The size of the returned object
For example:
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
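A minimal sketch parsing that Common Log Format line with a regular expression (the group names follow the field list above):

```python
import re

line = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')

pattern = (r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
           r'"(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)')

m = re.match(pattern, line)
print(m.group("ip"), m.group("status"), m.group("size"))  # 127.0.0.1 200 2326
```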
It is the application of data mining algorithms to discover patterns or insights from Web information.
It normally includes Web content mining (text analysis), Web structure mining (linkage analysis) and Web usage mining (log analysis).
For example, the W3C specifications define how a hyperlink and its anchor text are marked up.
Social Search and Web mining
Social search
Communities of users actively participating in the search process
Goes beyond classical search tasks
Key differences
Users interact with the system
Users interact with other users either implicitly or explicitly
Social Search Topics
Searching within communities
Adaptive filtering
Recommender systems
Peer-to-peer and metasearch
Social Media Analysis
Social media platforms such as Twitter an