3/8/22, 7:32 PM Data Formats
Data Formats CSCI-UA.0479-001
Data Formats
We’ll mainly be discussing data available in text files… Ummm… what do we mean by saying that? →
the content is only plain text (the text itself isn’t formatted in any way; it’s just character data)
typically contrasted with binary files (images, video, etc.)
can be opened in text editor (notepad, textedit)
Data Formats – Examples
Can you think of examples of some plain text file formats and their file name extensions? →
TXT (general extension for plain text… or if tabular data: tab-delimited or fixed width)
CSV (comma-separated values; comma-delimited)
XML (Extensible Mark-up Language)
HTML (Hyper Text Mark-up Language)
JSON (JavaScript Object Notation)
Are there any other data formats that you’ve encountered? →
Let’s go through these formats in more detail! →
.TXT and .CSV Files
Tab-delimited and Comma Separated Values files (.txt and .csv respectively) are widely available
online and from a variety of data sources.
These files are “human-readable” and can be easily imported into spreadsheets and databases.
Some examples:
Tab-delimited: U.S. Census School districts (search for “School Districts – Unified”)
Comma-separated values or comma-delimited: NYC Open data list of colleges
https://cs.nyu.edu/courses/spring22/CSCI-UA.0479-001/_site/slides/python/file-types.html?print 1/5
.TXT and .CSV Files Continued
What are some observations that you can make about these files? →
may or may not have a first row of column headers
headers may be difficult to interpret without companion documentation or data dictionary
here’s a crazy example of some very thorough documentation for this set of data for the USDA National Nutrient Database
CSV can mean delimited by a character other than a comma!
what if you want the separator character in your actual data? (one strategy: “wrap it in quotation marks”)
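To make the last two points concrete, here is a minimal sketch using Python's built-in `csv` module. The sample data is made up for illustration: it is pipe-delimited, and one value contains a pipe, so that value is wrapped in quotation marks.

```python
import csv
import io

# Hypothetical pipe-delimited data. The first data row's name contains a pipe,
# so the whole value is wrapped in double quotes.
raw = 'name|city\n"Smith|Jones LLC"|New York\nAcme|Boston\n'

# Reading: tell the reader which delimiter and quote character to expect.
reader = csv.reader(io.StringIO(raw), delimiter="|", quotechar='"')
rows = list(reader)
print(rows[1])  # the quoted value comes back as a single field

# Writing: by default the writer adds quotes only where they are needed.
out = io.StringIO()
csv.writer(out, delimiter="|", lineterminator="\n").writerows(rows)
print(out.getvalue())
```

The quotation marks are part of the file format, not part of the data: after parsing, the field is just `Smith|Jones LLC`.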
.TXT and .CSV Summary
human readable, plain text format
language and platform independent
represents tabular data: columns represent fields, and each row represents an entity (the thing being stored)
first row may be a header (defines columns / fields)
values in rows are separated by a specific character
common delimiters (separator character) include: commas, tabs and pipes
field1, field2, field3
foo, bar, baz
qux, quux, corge
Let’s check out some of these files in a text editor, as well as a spreadsheet app like LibreOffice… →
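The small sample table above can also be read programmatically. This is a sketch that assumes the first row is a header. It uses `csv.DictReader` from the standard library, and the text is pasted into a string rather than read from a real file.

```python
import csv
import io

# The sample rows from the summary slide, including the space after each comma.
text = "field1, field2, field3\nfoo, bar, baz\nqux, quux, corge\n"

# DictReader treats the first row as the header (the field names).
# skipinitialspace=True drops the space that follows each comma in this sample.
reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
records = list(reader)

for record in records:
    # Each row becomes a mapping from column header to value.
    print(record["field1"], record["field3"])
```

Without `skipinitialspace=True`, the header names and values would keep their leading spaces (for example `" field2"`), which is an easy mistake to miss.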
Markup Languages
Before we talk about XML, let’s discuss markup languages:
system of annotating content (text, images, etc.)
the syntax of the annotations makes them easily distinguishable from the actual content itself
some examples:
text processing commands in LaTeX …
tags in both HTML (“view source” on this page!) and XML
Speaking of XML and tags…
eXtensible Markup Language
looks a lot like another markup language: HTML
known for being verbose…
particularly suited for modeling data that:
has a nested structure …
like a parent/child or a tree-like relationship among data
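As a rough illustration of that parent/child structure, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree` module. The tag names (`library`, `book`, `title`, `author`) are invented for this example and do not come from any real schema.

```python
import xml.etree.ElementTree as ET

# A made-up document: <library> is the parent, each <book> is a child,
# and <title> / <author> are children of each <book>.
xml_text = """
<library>
  <book id="1">
    <title>War and Peace</title>
    <author>Leo Tolstoy</author>
  </book>
  <book id="2">
    <title>Anna Karenina</title>
    <author>Leo Tolstoy</author>
  </book>
</library>
""".strip()

root = ET.fromstring(xml_text)

# Walk the tree: find every <book> child, then its <title> child.
titles = [book.find("title").text for book in root.findall("book")]
print(root.tag, titles)
```

Notice how verbose the format is: every value is wrapped in an opening and a closing tag.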
XML Continued
XML itself is the building block of other languages / file formats
it’s a general markup language in that it defines syntax, but not tags; you must define tags yourself through an XML Schema (or alternatively, through a Document Type Definition)
an XML Schema imposes user-defined constraints on the structure and content of an XML document
essentially, the schema defines the semantic rules that an XML file must conform to!
…which means other markup languages (and consequently other file formats) are built on top of XML
Note that HTML and XML use tags for annotation
a tag is an annotation where the annotation itself appears between less than (<) and greater than signs (>)
and the content often appears between a set of two tags (an opening and closing tag)
Formats Built on XML
Can you think of any file formats that are built on top of XML? →
Displaying XML
Typically, viewing XML in a text editor (or even your browser) results in all of the text
(including the tags) being displayed …
one way to format and style the display of an XML file is to use XSL (eXtensible Stylesheet Language)
XSL is a family (XSL Transformation, XML Path Language, etc.) of powerful languages that transform and render styled and formatted XML
(or even transform XML into different formats, such as regular HTML)
XML Summary
human readable, plain text format
language and platform independent
self-describing (through a schema and tags) and great for nested or tree-like data
tags annotate text and provide structure, semantics and metadata to content
JSON: JavaScript Object Notation
a text-based data interchange format
it is language and platform independent and, like XML, is also considered “self-describing”
JSON is known for its key-value pairs, where keys are similar to names/labels/properties, and values can be numbers, text, “objects” or lists of values
for those with Python experience, this looks like a dictionary! (let’s see some examples… →)
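Here is a small sketch of that dictionary connection, using the standard `json` module. The JSON text is a made-up record (the keys `course`, `weeks`, `topics` and `room` are hypothetical), chosen to show a number, text, a list and a nested object as values.

```python
import json

# Hypothetical JSON text: keys are always strings; values can be numbers,
# text, lists, or nested "objects".
text = """
{
  "course": "Data Formats",
  "weeks": 14,
  "topics": ["CSV", "XML", "JSON"],
  "room": {"building": "unknown", "number": 101}
}
"""

data = json.loads(text)  # JSON object -> Python dict

print(type(data).__name__)       # the parsed value is a dict
print(data["topics"][2])         # JSON array -> Python list
print(data["room"]["number"])    # nested object -> nested dict

# Going the other way: Python dict -> JSON text.
round_trip = json.loads(json.dumps(data))
```

JSON and Python dictionaries look alike but are not identical: JSON requires double-quoted string keys, and uses `true`, `false` and `null` where Python uses `True`, `False` and `None`.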
SVG – Scalable Vector Graphics
(let’s draw a rectangle or check out an example of some basic shapes) →
TEI – Text Encoding Initiative – markup that provides structural and descriptive meta data, like sentences, words and lines about text
(see the examples on Wikipedia or take a look at a poem by Williams)
and sooooo many others… including Resource Description Framework, Really Simple Syndication, XHTML, and MathML
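Because SVG is just XML, a rectangle can be "drawn" by producing a few tags. The sketch below builds one with Python's standard `xml.etree.ElementTree`. The sizes and the fill color are arbitrary choices for illustration.

```python
import xml.etree.ElementTree as ET

# The root <svg> element; the xmlns attribute marks the document as SVG.
svg = ET.Element(
    "svg",
    {"xmlns": "http://www.w3.org/2000/svg", "width": "200", "height": "100"},
)

# One child element: a rectangle positioned at (10, 10).
ET.SubElement(
    svg,
    "rect",
    {"x": "10", "y": "10", "width": "120", "height": "60", "fill": "steelblue"},
)

svg_text = ET.tostring(svg, encoding="unicode")
print(svg_text)
```

Saving `svg_text` to a file with an `.svg` extension and opening it in a browser should display the rectangle. Opening the same file in a text editor shows the tags instead.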
Unstructured Data
the previous data we saw was structured: it was organized in a predefined way
scholars who use literature for textual analysis will often refer to their sources as unstructured data.
for example, Tolstoy’s War and Peace might be considered unstructured data consisting of
what are some other unstructured datasets that you can think of? → (hint: see Wikipedia)
journals, health records, songs / popular music, film / tv, images
portions of structured data may be unstructured: body of an email, parts of a web page,
or parts of a word-processor doc, a tweet, blog entry
JSON Examples
Unlike XML, JSON does not require a predefined schema. With that said, you’ll find that JSON is still used as a common building block for other formats
…for example: GeoJSON for geo-spatial data, such as longitude and latitude coordinates
other examples include:
a complete set of U.S. zip codes in JSON from mongodb.org
stats.nba.com loads data from JSON files (check out Chrome’s developer tools)
JSON is a very common data interchange format for the web
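To show what "built on JSON" means for GeoJSON, here is a minimal sketch that reads a single point feature with the standard `json` module. The coordinates and the `name` property are made-up sample values, not taken from any dataset mentioned above.

```python
import json

# A hypothetical GeoJSON Feature. It is ordinary JSON; GeoJSON only fixes
# which keys are expected ("type", "geometry", "coordinates", "properties").
feature_text = """
{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [-73.9965, 40.7295]},
  "properties": {"name": "example point"}
}
"""

feature = json.loads(feature_text)

# GeoJSON lists longitude first, then latitude.
lon, lat = feature["geometry"]["coordinates"]
print(feature["properties"]["name"], lon, lat)
```

No special library is needed to read the file, since any JSON parser works. The GeoJSON format only adds agreed-upon meaning to the keys.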