Working with Data
Data Science for Design, Week 2
Overview
● What is data?
● Data types
● Data formats
● Data shapes
● Operations on data
Data?
● Data – plural of datum
● Latin: datum, from dare (to give) – that which is given
Data, capta, information and knowledge. / Checkland, Peter; Holwell, S E. Introducing Information Management: the business approach. London, New York and Amsterdam : Elsevier, 2006. p. 47-55.
Data?
Avison and Fitzgerald (1995)
Data represent unstructured facts (p. 12)
Clare and Loucopoulos (1987)
Facts collected from observations or recordings about events, objects, or people (p. 2)
Galland (1982)
Facts, concepts or derivatives in a form that can be communicated and interpreted (p. 57)
Hicks (1993)
A representation of facts, concepts or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means (p. 668)
Knight and Silk (1990)
Numbers representing an observable object or event (fact) (p. 22)
Laudon and Laudon (1991)
Raw facts that can be shaped and formed to create information
Maddison (ed.) (1989)
Natural language: facts given, from which others may be deduced, inferred. Info. processing and computer science: signs or symbols, especially as for transmission in communication systems and for processing in computer systems; usually but not always representing information (sic), agreed facts or assumed knowledge; and represented using agreed characters, codes, syntax and structure
Martin and Powell (1992)
The raw material of organizational life; it consists of disconnected numbers, words, symbols and syllables relating to the events and processes of the business (p. 10)
Information?
Avison and Fitzgerald (1995)
Information has a meaning … [it] comes from selecting data, summarizing it and presenting it in such a way that it is useful to the recipient (p. 12)
Clare and Loucopoulos (1987)
A pre-requisite for a decision to be taken. Information is the product of the meaningful processing of data (p. 2)
Galland (1982)
Information is that which results when some human mental activity (observation, analysis) is successfully applied to data to reveal its meaning or significance (p. 127)
Hicks (1993)
Data that has been processed so that it is meaningful to a decision maker to use in a particular decision (p. 675)
Knight and Silk (1990)
Human significance associated with an observable object or event (p. 22)
Laudon and Laudon (1991)
Data that have been shaped or formed by humans into a meaningful and useful form (p. 14)
Maddison (ed.) (1989)
Understandable useful relevant communication at an appropriate time; any kind of knowledge about things and concepts in a universe of discourse that is exchangeable between users; it is the meaning that matters, not the representation (p. 174)
Martin and Powell (1992)
Information comes from data that has been processed to make it useful in management decision making (p. 10)
Data Details
Quick Glossary
● Fields: individual pieces of data
● Types: what kind of thing is an individual piece of data?
● Records: a set of data about one thing
● Attributes: fields belonging to a record
● Schema: what are the types and values the data can take
● Formats: how data is structured so that fields are intelligible
● Forms: how the data is contained or transmitted
● Metadata: Data that describes the data
Binary: Bits and Bytes
● 01010010101011000101010
● Each binary value is a “bit” – a fundamental unit of information (Shannon)
● Often grouped together into “bytes” – 8 bits
● One byte can have 256 values, often used to represent numbers from 0 to 255, or 00 to FF in hex
Counting in binary
Binary   Base 10
0000     0
0001     1
0010     2
0011     3
0100     4
0101     5
0110     6
0111     7
1000     8

Byte
Bit:          0    1    1    0    0    0    0    1
Place value:  128  64   32   16   8    4    2    1
(1 * 64) + (1 * 32) + (1 * 1) = 97

(base 10 version)
Digit:        0    9    7
Place value:  100  10   1
(0 * 100) + (9 * 10) + (7 * 1) = 97
Binary interpretations
01100001
The same “bit pattern” is:
● The character ‘a’ in ASCII
● 97 in decimal
● 61 in hexadecimal
● sometimes written 0x61
http://www.rapidtables.com/convert/number/ascii-hex-bin-dec-converter.htm
http://www.asciitohex.com/
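The place-value sum and the different readings of one bit pattern can be sketched with Python built-ins alone. This is a minimal illustration; the variable names are made up for the example.

```python
bits = "01100001"

# Place values for an 8-bit byte: 128, 64, 32, 16, 8, 4, 2, 1
place_values = [2 ** i for i in range(7, -1, -1)]

# Add up the place values wherever the bit is 1: 64 + 32 + 1
value = sum(p for p, b in zip(place_values, bits) if b == "1")

# Built-ins give the same interpretations of the pattern
as_decimal = int(bits, 2)   # parse the text as a base 2 number
as_hex = hex(as_decimal)    # hexadecimal text, with the 0x prefix
as_char = chr(as_decimal)   # the character with that code

print(value, as_decimal, as_hex, as_char)
```

The bits never change; only the interpretation we choose does.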
Data Types
● Data stored within an individual field
● What do they look like, what operations might we do?
Boolean
● Values that are “true” or “false”, 1 or 0
● Used heavily in logic and conditionals
● Examples: has something been processed or checked? does a person have an attribute?
● Operations: AND, OR, e.g.
○ people who have been vaccinated AND are showing symptoms
○ people who have a UK Visa OR are EU Citizens
Booleans
● True or False
● Combine with
○ “or”, “|”
○ “and”, “&”
● “Negated” with “not” (sometimes written “!” or “~”)
● “True” and “False” in Python
● Use “and” / “or” in Python
Truth Tables / Truth Functions
https://en.wikipedia.org/wiki/Truth_function
Example – student registers
id  registered  seen PT  registered & seen PT  ! registered  reg’d & !seenPT
1   1           1        1                     0             0
2   1           0        0                     0             1
3   0           0        0                     1             0
4   0           1        0                     1             0
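The derived columns of the student register can be computed with Python's "and" and "not". This is a small sketch using the four records from the example; the names students, rows and to_chase are illustrative.

```python
# Each record: (id, registered, seen_pt), values from the register example
students = [
    (1, True, True),
    (2, True, False),
    (3, False, False),
    (4, False, True),
]

rows = []
for student_id, registered, seen_pt in students:
    rows.append({
        "id": student_id,
        "registered and seen PT": registered and seen_pt,
        "not registered": not registered,
        "registered and not seen PT": registered and not seen_pt,
    })

# Ids of students who are registered but have not seen their PT
to_chase = [r["id"] for r in rows if r["registered and not seen PT"]]
print(to_chase)
```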
Numbers
● Integers: counting numbers / whole numbers
○ “int” in Python
○ Can: add, subtract, multiply, divide
○ But division may be funny!
■ 7/2 = 3 (Python 2)
■ 7/2 = 3.5 (Python 3)
○ Remainder: 7%2 = 1
● Floating Point: numbers with a fractional part (stored as approximations)
○ “float” in Python
○ Add, subtract, divide etc.
○ Division and remainder work
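A short sketch of how these operations behave in Python 3. The last two lines are an added illustration of why floating point values are usually compared with a tolerance.

```python
import math

true_division = 7 / 2       # Python 3: always gives a float
floor_division = 7 // 2     # rounds down to a whole number
remainder = 7 % 2           # what is left after floor division
float_remainder = 7.5 % 2   # remainder also works on floats

# Floats are approximations, so exact comparison can surprise you
exactly_equal = (0.1 + 0.2 == 0.3)
close_enough = math.isclose(0.1 + 0.2, 0.3)
```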
Enumerations
● Also called: categorical variables, controlled vocabulary
● When you have a precise set of values that something can take
● Possible values come from the data schema
● Examples: country, marital status, eye color, sex, lots of status variables, e.g. waiting, processing, complete
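One way to express an enumeration in Python is the standard library's Enum class. The sketch below uses the status values from the examples; the class name Status and the helper is_valid_status are hypothetical.

```python
from enum import Enum


class Status(Enum):
    """The only values a status field is allowed to take."""
    WAITING = "waiting"
    PROCESSING = "processing"
    COMPLETE = "complete"


def is_valid_status(text):
    """Return True if text is one of the allowed status values."""
    try:
        Status(text)
        return True
    except ValueError:
        return False


current = Status("processing")        # look up a member by its value
allowed = [s.value for s in Status]   # the full controlled vocabulary
```

Anything outside the vocabulary (for example "finished") is rejected, which is the point of an enumeration.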
Strings
● Text – a sequence of characters
● Used for many things – names, descriptions, user feedback, tweets
● Operations: joining (concatenating), splitting, parsing (extracting structure), modifying, regular expressions
● Transformed (e.g. lowercase)
● Different representations (ASCII, UTF-8)
● Can have structure (e.g. dates)
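A few of these string operations, sketched in Python. The example text is made up.

```python
import re

name = "Ada" + " " + "Lovelace"       # joining (concatenating)
words = "red,green,blue".split(",")   # splitting
lowered = "Data SCIENCE".lower()      # transforming

# Regular expression: pull the hashtags out of a made-up tweet
tweet = "Loving #pandas and #python this week"
hashtags = re.findall(r"#(\w+)", tweet)

# Different representations: "\u00e9" is e-acute, which takes
# two bytes in UTF-8 and cannot be written in ASCII at all
utf8_bytes = "caf\u00e9".encode("utf-8")
```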
Compound Types
● Dates/Times
○ many different formats, cause horrible problems!
● Location (Lat/Long)
● Tuples: ordered collections of fields
● Colours: #FF0000, #00FF00, #FFFF00, #55C0D1
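A minimal sketch of working with compound types in Python. The date and location values are made up, and hex_to_rgb is a hypothetical helper.

```python
from datetime import datetime

# The same date written in two different formats
uk_style = datetime.strptime("05/10/2020", "%d/%m/%Y")
iso_style = datetime.strptime("2020-10-05", "%Y-%m-%d")

# A location as a tuple: (latitude, longitude)
location = (55.95, -3.19)
latitude, longitude = location


def hex_to_rgb(colour):
    """Split '#RRGGBB' into three integers from 0 to 255."""
    digits = colour.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


red = hex_to_rgb("#FF0000")
```

Parsing "05/10/2020" with the wrong format string would silently give 10 May instead of 5 October, which is one source of the horrible problems with dates.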
Falsehoods Programmers Believe About Names
● People have exactly one canonical full name.
● People have exactly one full name which they go by.
● People have, at this point in time, exactly one canonical full name.
● People have, at this point in time, one full name which they go by.
● People have exactly N names, for any value of N.
● People’s names fit within a certain defined amount of space.
● People’s names do not change.
● People’s names change, but only at a certain enumerated set of events.
● People’s names are written in ASCII.
● People’s names are written in any single character set.
● People’s names are all mapped in Unicode code points.
● People’s names are case sensitive.
● People’s names are case insensitive.
● People’s names sometimes have prefixes or suffixes, but you can safely ignore those.
● People’s names do not contain numbers.
● People’s names are not written in ALL CAPS.
● People’s names are not written in all lower case letters.
● People’s names have an order to them. Picking any ordering scheme will automatically result in consistent ordering among all systems, as long as both use the same ordering scheme for the same name.
● People’s first names and last names are, by necessity, different.
● People have last names, family names, or anything else which is shared by folks recognized as their relatives.
Names…
Scunthorpe problem
URI / URLs
● Uniform Resource Identifier – URI
○ Slightly magic – there is only one “thing” in the universe that a URI “points” to
○ Different types:
■ https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
■ ISBN 0-486-27557-4 (Romeo and Juliet)
● A URL is simply a URI that happens to point to a resource over a network
● Structured
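To illustrate that a URL is structured, Python's standard urllib.parse module can split one into its parts. This sketch uses the Wikipedia address from this slide.

```python
from urllib.parse import urlparse

parts = urlparse("https://en.wikipedia.org/wiki/Uniform_Resource_Identifier")

scheme = parts.scheme   # how to fetch the resource
host = parts.netloc     # which machine to ask
path = parts.path       # which resource on that machine
```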
Ontologies
● Structured dictionaries of terms, and relations between them
● Define the kinds of data, their meanings and their structure
http://visualdataweb.de/webvowl/
Conversions
● Text to numbers
● Numbers to text
● Strings to dates (and back again)
● Text to codes/enumerations
● URLs to documents
● Terms in one ontology to another
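Some of these conversions, sketched in Python. The values are made up, and STATUS_CODES is a hypothetical lookup table.

```python
from datetime import datetime

age = int("35")         # text to number
price = float("9.50")   # text to number with a fractional part
label = str(35)         # number to text

when = datetime.strptime("2020-10-05", "%Y-%m-%d")   # string to date
back = when.strftime("%d/%m/%Y")                     # and back again

# Text to codes: tidy the text first, then look it up
STATUS_CODES = {"waiting": 0, "processing": 1, "complete": 2}
code = STATUS_CODES[" Processing ".strip().lower()]
```

Conversions can fail: int("thirty five") raises a ValueError, so real data usually needs checking along the way.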
Fields and Formats
● Fields: individual pieces of data
● Formats: how data is structured so that fields are intelligible
● Forms: how the data is contained or transmitted
● Metadata: Data that describes the data
[Figure: diagram labelling Form (file), Format, Fields, Schema, Record, List Entry, Field, Field name, Value and Nested Object]
Data Formats
● How the data is organised and made accessible
● Long history of different kinds of format
● Relates to data representation, and “shape”
CSV, TSV
● Table of fields, separated by commas (CSV) or tabs (TSV)
● Easy to use, easy to parse
● Read and written by many programs
● Only handles simple tables
● Complex values are often quoted
● Doesn’t say what the data is
● Represents tabular data
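A small sketch of reading CSV with Python's built-in csv module. The CSV text is made up and held in a string, so no file is needed.

```python
import csv
import io

# Made-up CSV text; the quoted value contains a comma
text = 'name,age,city\nJan,35,"Edinburgh, UK"\nJim,25,Leeds\n'

# io.StringIO lets us treat the string like an open file
rows = list(csv.DictReader(io.StringIO(text)))

# CSV doesn't say what the data is: every value arrives as a string
ages = [int(row["age"]) for row in rows]
```

The quoting keeps "Edinburgh, UK" together as a single field, and the conversion to int is something we have to do ourselves.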
Excel
● Common
● Horrible
● (can export CSV…)
https://www.theguardian.com/politics/2020/oct/05/ministers-accused-of-putting-lives-at-risk-with-covid-data-error
XML
● Looks a bit like HTML
● Stricter
● Includes metadata
● Matched tags
● Tags can have attributes
● Represents hierarchical data
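As a hedged illustration, a made-up XML document can be read with Python's built-in xml.etree.ElementTree. The tag and attribute names here are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document: matched tags, with an attribute on each <person>
text = """
<people>
  <person id="1"><name>Jan</name><age>35</age></person>
  <person id="2"><name>Jim</name><age>25</age></person>
</people>
""".strip()

root = ET.fromstring(text)

# Walk the hierarchy: people -> person -> name / age
people = [
    {
        "id": person.get("id"),                 # attribute on the tag
        "name": person.findtext("name"),        # text inside a child tag
        "age": int(person.findtext("age")),
    }
    for person in root.findall("person")
]
```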
JSON
● Grew out of Javascript
● Widespread
● Hierarchical
● Similar structure to XML
● (but more ‘readable’, although no attributes)
● Returned by a lot of web services
● Close to computing data structures – arrays, dictionaries
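The closeness to programming data structures shows up when reading JSON in Python: objects become dictionaries and arrays become lists. A minimal sketch with a made-up record:

```python
import json

text = (
    '{"name": "Jan", "age": 35, '
    '"phones": ["3545", "22232"], '
    '"address": {"city": "Edinburgh"}}'
)

record = json.loads(text)             # JSON text to Python objects

city = record["address"]["city"]      # nested object, now a dictionary
first_phone = record["phones"][0]     # array, now a list

round_trip = json.dumps(record)       # Python objects back to JSON text
```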
Other formats
● Many, many, many more
● Good questions:
○ can I read it?
○ what kind of data can it hold? Datatypes? Tabular? Hierarchical?
○ can other people read it?
○ how many different programs can read it?
○ do all programs read it the same?
○ how well does it map to my data?
○ how efficient is it? (Time, Space)
Data Shapes
● We’ve heard about tabular and hierarchical data – what does that mean?
● Are there other types?
Tabular Data
● Data is in a table, with rows and columns
● All the entries in a column have the same type
● All the entries in a row relate to the same thing
Name  Age  Phone Number
Jan   35   3545
Jim   25   22232
Kev   53   22353
Relational Data
● Data doesn’t fit in a single table
● Adds relations between tables – one to many, many to one etc.
● “Primary Key” – id of thing in this table – unique within table
● “Foreign Key” – id of a thing in another table
● Primary model of Relational Databases
id  Name  Age  Phone Number
1   Jan   35   3545
2   Jim   25   22232
3   Kev   53   22353
One to Many
● In an online store, each person might have bought many items, but each transaction is made by one person
● Can’t add a column per purchase to the people table – we don’t know how many there might be
● Add a table for transactions that references the person who made each purchase
● “Foreign Key” – id of a thing in another table
People
id  Name  Age  Phone Number
1   Jan   35   3545
2   Jim   25   22232
3   Kev   53   22353

Transactions
id  person id  item    price
1   1          shoes   £5
2   1          jacket  £6
3   3          shoes   £9

People.id is the primary key; Transactions.person id is the foreign key that refers to it.
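A minimal pandas sketch of following the foreign key, using the people and transactions from this example. The column name person_id and the variable names are illustrative, and prices are stored as plain numbers of pounds.

```python
import pandas as pd

people = pd.DataFrame({
    "id": [1, 2, 3],
    "Name": ["Jan", "Jim", "Kev"],
    "Age": [35, 25, 53],
})
transactions = pd.DataFrame({
    "id": [1, 2, 3],
    "person_id": [1, 1, 3],
    "item": ["shoes", "jacket", "shoes"],
    "price": [5, 6, 9],
})

# Follow the foreign key (person_id) to the primary key (id) in people.
# Both tables have an "id" column, so the suffixes keep them apart.
purchases = pd.merge(
    transactions, people,
    left_on="person_id", right_on="id",
    suffixes=("_transaction", "_person"),
)

# One person, many transactions: total spend per person
spend = purchases.groupby("Name")["price"].sum()
```

Jim has no transactions, so he does not appear in the result of this default (inner) merge.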
Many to Many
● On Facebook, you can “poke” people. Each person can poke many people, and be poked by many people
● Add a table for Pokes, that points to the poker and the pokee
People
id  Name  Age  Phone Number
1   Jan   35   3545
2   Jim   25   22232
3   Kev   53   22353

Pokes
id  poker  pokee  date
1   1      2      …
2   1      3      …
3   2      3      …
4   3      1      …
Many to Many
● At Uni, you are enrolled on many courses, and each course has many people on it
● Create a table for Enrollments, that points to both tables

People
id  Name  Age  Phone Number
1   Jan   35   3545
2   Jim   25   22232
3   Kev   53   22353

Enrollments
id  Student  Course     Mark
1   1        DESI11100  NULL
2   1        DESI11073  NULL
3   2        DESI11100  NULL
4   3        DESI11100  NULL

Courses
id         Name                     Credits
DESI11100  Data Science for Design  20
DESI11073  Histories and Futures    20
INFR11094  Case Studies 1           20
(Joins)
Enrollments Joined with People and Courses
id  Student  Course     Mark  Person.Name  Person.Age  Course.Credits
1   1        DESI11100  NULL  Jan          35          20
2   1        DESI11073  NULL  Jan          35          20
3   2        DESI11100  NULL  Jim          25          20
4   3        DESI11100  NULL  Kev          53          20
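The joined table can be reproduced with two pandas merges. This is a sketch under the assumption that the three tables are held as DataFrames; prefixing the lookup tables' columns is one possible way to keep the names distinct, not the only one.

```python
import pandas as pd

people = pd.DataFrame({
    "id": [1, 2, 3],
    "Name": ["Jan", "Jim", "Kev"],
    "Age": [35, 25, 53],
})
courses = pd.DataFrame({
    "id": ["DESI11100", "DESI11073", "INFR11094"],
    "Name": ["Data Science for Design", "Histories and Futures", "Case Studies 1"],
    "Credits": [20, 20, 20],
})
enrollments = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Student": [1, 1, 2, 3],
    "Course": ["DESI11100", "DESI11073", "DESI11100", "DESI11100"],
    "Mark": [None, None, None, None],
})

# Prefix each lookup table's columns so the joined names stay distinct
person_cols = people.add_prefix("Person.")
course_cols = courses.add_prefix("Course.")

# Follow each foreign key in Enrollments to the table it points at
joined = (
    enrollments
    .merge(person_cols, left_on="Student", right_on="Person.id")
    .merge(course_cols, left_on="Course", right_on="Course.id")
)
```

Nobody is enrolled on INFR11094, so that course does not appear in the joined result.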
More info
● http://www.ucl.ac.uk/archaeology/cisp/database/manual/node1.html
Hierarchical Data
● Data that contains other data
● Lists of items, items with variable numbers of fields, fields that are items etc.
● Can be more or less defined
○ JSON on its own can have a huge range of structures
○ An application’s JSON probably has a particular structure, so people can make use of it
Translating
Graph Data
● Nodes with links between them
● Nodes are things, links are relations
● Image from: https://programminghistorian.org/lessons/graph-databases-and-SPARQL
● (lots of good information about graph queries there)
Graph Data
● Often stored as “triples”
● Subject, Predicate, Object
● Using URIs relates your data to other people’s

Subject                 Predicate       Object
Rembrandt van Rijn      hasNationality  Dutch
Dave                    likes           ice cream
d.murray-rust@ed.ac.uk  food:likes      /resource/Ice_cream
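Triples map naturally onto tuples. The sketch below stores the three example triples in a plain Python list and answers a simple query; a real graph database would offer a query language such as SPARQL instead. The helper objects_of is hypothetical.

```python
# Each triple is (subject, predicate, object)
triples = [
    ("Rembrandt van Rijn", "hasNationality", "Dutch"),
    ("Dave", "likes", "ice cream"),
    ("d.murray-rust@ed.ac.uk", "food:likes", "/resource/Ice_cream"),
]


def objects_of(subject, predicate):
    """Everything that subject is linked to by predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Nodes are every subject or object; the links are the triples themselves
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}

dave_likes = objects_of("Dave", "likes")
```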
Semantic Web / Linked Open Data
Summary
● Tabular: simple, easy to work with and author, maps onto DataFrames
● Relational: adds links between tables, can represent more structure. Use SQL or other query languages to access
● Hierarchical: create complex nested structures. Maps onto programming concepts, new generation of databases (object- or document-databases)
● Graph: good for linking
Data Wrangling
● Read data – get it available within your system
● Clean data – missing values, incorrect fields, outliers – filtering, harmonising
● Bring data together
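The three steps can be sketched in pandas as follows. The data is made up, and the choice to drop missing and implausible ages is an assumption for illustration; other datasets may call for filling or correcting values instead.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk
raw = io.StringIO("name,age\nJan,35\nJim,\nKev,530\n")
cities = pd.DataFrame({
    "name": ["Jan", "Jim", "Kev"],
    "city": ["Leeds", "York", "Bath"],
})

people = pd.read_csv(raw)                        # read
people = people.dropna(subset=["age"])           # clean: missing values
people = people[people["age"].between(0, 120)]   # clean: outliers
combined = people.merge(cities, on="name")       # bring data together
```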
Tools
● Many many tools
● Excel – can do many things, but you will hit limitations
● Programming Languages vs Toolkits (some overlap)
○ R is a reasonable language; its built-in data structures and good graphing tools make it popular
○ Python is a good all purpose language, but needs a good toolkit for serious data processing (but there are several)
○ Javascript (with D3) can be great for visualisation
○ Processing can be good for interactives
○ Other languages (C++, Perl, Scala, Java, Haskell etc.) get used
● Text editors – try Atom if you don’t have a favourite
● Command line: if you get good, you can do complicated things really quickly
Pandas
pandas is well suited for many different kinds of data:
● Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
● Ordered and unordered (not necessarily fixed-frequency) time series data.
● Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
● Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
Two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional)
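A minimal sketch of the two structures, reusing the names and ages from the tabular data example. Storing phone numbers as strings is a choice made for this example.

```python
import pandas as pd

# Series: one dimension, with a label for every value
ages = pd.Series([35, 25, 53], index=["Jan", "Jim", "Kev"], name="Age")

# DataFrame: two dimensions, with named columns that can differ in type
people = pd.DataFrame({
    "Name": ["Jan", "Jim", "Kev"],
    "Age": [35, 25, 53],
    "Phone Number": ["3545", "22232", "22353"],
})

jan_age = ages["Jan"]                  # look up by label
mean_age = people["Age"].mean()        # each column is itself a Series
over_30 = people[people["Age"] > 30]   # filter rows with a boolean test
```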
Pandas Features
● Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
● Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
● Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
● Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
● Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
● Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
● Intuitive merging and joining data sets
● Flexible reshaping and pivoting of data sets
● Hierarchical labeling of axes (possible to have multiple labels per tick)
● Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
● Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
Data Frames
● Data structures with named columns of specific types
Documentation
● 10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html
● Intro to data structures: http://pandas.pydata.org/pandas-docs/stable/dsintro.html
● All of their tutorials: http://pandas.pydata.org/pandas-docs/stable/tutorials.html
● Pandas cookbook: https://github.com/jvns/pandas-cookbook (see the chapters on slicing, Chapters 2 and 3)
Bringing data together
● The concat function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes. Note that I say “if any” because there is only a single possible axis of concatenation for Series.
● https://pandas.pydata.org/pandas-docs/stable/merging.html
Vertical Concatenation
● pd.concat([df1, df2])
● Same Fields, Same Order
Horizontal Concatenation
● pd.concat([df1, df3], axis=1)
● Same Rows, Same Order
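Both kinds of concatenation, sketched with small made-up DataFrames. The contents of df1, df2 and df3 are assumptions; the slides only name them.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 2], "Name": ["Jan", "Jim"]})
df2 = pd.DataFrame({"Id": [3], "Name": ["Kev"]})   # same fields as df1
df3 = pd.DataFrame({"Age": [35, 25]})              # same rows as df1

# Vertical: stack the rows; ignore_index renumbers them 0, 1, 2
vertical = pd.concat([df1, df2], ignore_index=True)

# Horizontal: put the columns side by side, matched on the row index
horizontal = pd.concat([df1, df3], axis=1)
```

Horizontal concatenation lines rows up by index, so it only gives sensible results when both tables really do describe the same rows in the same order.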
Joining
?
Joining
pd.merge(df1, df4b, left_on="Id", right_on="Ident")
Use ‘how’ parameter to specify:
● inner (default): only keep rows that are in both tables
● outer: keep rows that are in either table
● left / right: keep all the rows in the left or right table
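To see what the 'how' parameter changes, the sketch below runs the same merge four ways and counts the rows in each result. The contents of df1 and df4b are made up; only ids 2 and 3 appear in both.

```python
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 2, 3], "Name": ["Jan", "Jim", "Kev"]})
df4b = pd.DataFrame({"Ident": [2, 3, 4], "Age": [25, 53, 41]})

# Run the same merge with each join type
joins = {
    how: pd.merge(df1, df4b, left_on="Id", right_on="Ident", how=how)
    for how in ("inner", "outer", "left", "right")
}

# How many rows does each join type keep?
sizes = {how: len(table) for how, table in joins.items()}
print(sizes)
```

Rows that have no match on the other side are kept by outer, left or right joins, with the missing fields filled in as NaN.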
Join Types
visual explanation: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
Summary
● Philosophical/historical introduction to data
● Records, Fields, Schema, Format etc.
● Data types: boolean, numbers, strings, compound types
● Data formats: CSV, JSON, XML
● Data shapes: Tabular, Relational, Hierarchical, Graph
● Pandas
● Joining data together
Homework!
● Reading
● Coding
● git
● Start with Assignment 1
Reading
● Read: Thatcher, Jim, David O’Sullivan, and Dillon Mahmoudi. “Data colonialism through accumulation by dispossession: New metaphors for daily data.” Environment and Planning D: Society and Space 34.6 (2016): 990-1006.
● Work through some of the Pandas cookbook (https://github.com/jvns/pandas-cookbook), at least Chapters 1, 2, 3, possibly 4
Coding
● Data Visualisation Notebooks – essential for Assignment 1
● Defensive Programming Notebook – also essential for Assignment 1
● Join / Merge Notebooks
– Less essential, but very useful. Try the exercises, and also check out the example joins
● Extra Learning: Pandas Cookbook
– https://github.com/jvns/pandas-cookbook
not essential right now, but if you want to get good, work through some of the Pandas cookbook – might mean you know the right things to do when you have data. Look at Chapters 1, 2, 3 possibly 4.
To get set up, download the whole thing as a zip file from their github, upload individual notebooks and data to your server as necessary
git – version control
● Git will save your life (or at least your code)
● Code Repository: stores every change you make, so when you break something, you’ve always got a previous version that works
● … and you don’t have 100 files all called real_final_version_1.4_finished.py
● GitHub is widely used to host and manage software projects
● You will be assessed on your use of git – start early!
● Step 1: go through Evan Morgan’s Intro to Git – short videos and tutorial sheet
● Step 2: make a private github repository for your group
● Step 3: make sure that everyone in your group has cloned it, added a file and pushed the results