Microsoft Word – CISC6325 Final ExamFall2018 updated.docx
CISC6325 Final Exam, Fall 2018
Part 1: (clarifications/corrections added in this version are in bold face)
Problem 1: Data Aggregation with SQLite, Python, R, Excel
Given the three tables below (same as those in Hw7. Note in Hw7, the blank line in Catalog table was
an accident. No blank line in Catalog), do the following:
-
Write SQL code and run it in SQLite without Python, R, Excel, to create a pivot table for the
number of different parts on sale, where the pivot table row heading is number of GB for the
part and column heading is city name. Note that if a part sold has no entry in supplier table, the
column heading for that part needs to be either blank or NA. If there is no part sold for any
given cell of the pivot table (any given combination of city and number of GB), it should show
either blank or NA, not 0. The code can be tailored to (or “hard coded for”) the exact four cities
in the supplier table. Other than that, use as little hard coding as possible (so even if the data in
the table changes, the code stays the same). Show both the SQL code and the screenshots (do
not pack SQL code into one long line. Break it to multiple lines in structured form).
Hint: You may need to use SUM, CASE to create the pivot columns (since SQLite does not allow pivot). - Repeat 1. except to replace number of parts by average price per 1GB.
-
Repeat 2 except using Excel only and not SQLite
- Use Excel formula only.
- Use Excel pivot table capability and any other formula when necessary
- Repeat 2 except using either Python/Pandas or R (either one). You get extra credit if you do both.
Note: The attached file “CISC6325 materials v1.xlsx” (an earlier version was sent to you a few weeks ago) contains useful information for pivoting
Note: If you cannot do any of the four parts of this problem, it should be prevent you from doing other parts. Report whether SQLite, Python/R, Excel produce the same outcome
Problem 2:
- Repeat problem 1 except to use the attached data file “SalesData1” and the pivot table row heading is month, column heading is year, and the pivot result is for profit (total aggregated profit) and for count. Note that profit=revenue – cost = (unit price – unit cost)* Units sold. Note that in case the version of your Python or R code/libraries methods does not like spaces within the names of the headings, you may need to pre-process the name to get rid of the spaces.
-
Use the forecast method shown in class and in two of the sheets/tabs the attached file
“CISC6325 materials v1.xlsx” to forecast 2012 total profit for each month.
- Use Excel only
- Use Python or R (extra credit if both)
Note: Since there is curving, if you cannot do any part does not mean you will get bad score.
Part 2:
For the final project, the more you satisfy items in the following list, the better (word doc report and pptx powerpoint slides and in-person in-class presentations are required):
- The problem you are addressing is interesting or useful
- The data you found is valuable and interesting (see list of data sources below but you can find others elsewhere). If you talk about (in your presentation and/or report) your journey of navigating the world of data and what you find and how you decide to choose this one (make sure you provide the link), that would be interesting and valuable.
- You provide interesting, insightful and useful results and conclusions based on your analysis.
- As part of your analysis, you use SQLite through R or Python. You articulate what R or Python offers you that SQLite alone could not. When possible, you show ways from both sides: with or without SQL (but through R or Python in both).
- You provide good analysis and visualization. Note that although models (such as regression and other predictive models) are useful, sometimes visualization of data (without any models) can reveal a lot.
- You write good reports, including good introduction of the problem and the state of the art (what others have done with respect to this problem if any), your plan of attack, clear description of how to clean and preprocess the data, how you do the analysis, and what your findings are, and what obstacles you encounter and how you tackle it, and what more could be done in the future.
7. Your in-person presentation and powerpoint slides in our last day of class (Dec 11) are effective, inspiring and informative.
Suggested data sources:
• UCI Machine Learning Repository, UCI Machine Learning Repository 351 Data Sets
oMachine Learning Datsets in R (mostly from UCI repository. Though you are not using R but SPSS Modeler, this link helps)
• Public datasets for machine learning
• List of Public Data Sources Fit for Machine Learning
• .sav SPSS Statistics files (can be used with source node “Statistics file”
• Delve datasets for regression and classifications
oFor regression, data includes bank applications, predicting house prices from census data, predict computer system activity from system performance measures, predict sea animal age from physical measurements
oFor classifications, data includes prediction of income based on census data, DNA splice functions prediction, prediction of Titanic passengers info.
• Datasets for Data Mining and Dat Science – KDnuggets
• CIA World Factbook (info on history people, government, economy, geography,
communications, transportation, military, transnational issues of 267 countries. See
attached CIA World Factbook Instructions for help on how to access these data. • Healthcare related data
o1. data.medicare.gov From this link, you can click Datasets to get a pull-down menu of difference choices, such as Hospital Compare Data, Nursing Home Compare Data, Physician Compare Data, Home Health Compare Data, Dialysis Facility Compare Data, Supplier Directory Data. During our last class, we looked at the Re-Admissions Complications and Deaths-Hospital (not the State or National), under the category of “Hospital Compare Data”
2. kff.org.statedata. From this link, you can choose many categories of datasets. For example, if you click “Health Reform”, and from there,, you can choose Marketplace and Medicaid/CHIP Enrollment (where you can download marketplace enrollment statistics), Health Insurance Marketplace, and many other categories
3. healthdata.gov/dataset/search. You can find additional data for healthcare related issues.
4. http://www.ideal.ece.utexas.edu/courses/ee380l/Health-Care-Data-Sets.xls This excel file lists many healthcare data sources.
5.Google search with key words: healthcare data download or these key words: list of healthcare data sets
6. www.bea.gov/regional This site allows you to download personal income and state-level GDP data. This is needed if you like to combine this data with the healthcare data in 1. and 2. to find correlation or relationship between income and healthcareperformance or score. A more concise table for regional/state level economy or income
from www.usgovernemnentrevenue.com
World
- The World Bank http://www.worldbank.org/
- Gapminder http://www.gapminder.org/data/
- United Nations Datasets http://data.un.org/
- International Monetary Fund http://www.imf.org/external/data.htm
- Open Spending https://openspending.org/
- CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/
- NOAA/NCEI weather and climate data http://www.ncdc.noaa.gov/
Government
- Data.gov http://www.data.gov/
- Federal Reserve Economic Research https://research.stlouisfed.org/fred2/
- U.S. Federal Statistics Data http://fedstats.sites.usa.gov/data-releases/
- U.S. Bureau of Labor Statistics http://www.bls.gov/
- U.S. Federal Agency Expenditures https://www.usaspending.gov/
- U.S. Energy Information Administration http://www.eia.gov/
- U.S. Census Bureau Data http://www.census.gov/
- U.S. Department of Health & Human Services https://www.healthdata.gov/
- U.S. Department of Education http://www2.ed.gov/
- OSCAR data.gov.uk https://data.gov.uk/dataset/oscar
- European Union Open Data Portal http://open-data.europa.eu/en/data/
• Eurostat European Statistics http://ec.europa.eu/eurostat/data/database Data Mining
- UC Irvine Machine Learning list of data mining sets
- KD Nuggets list of data mining sets
- BigML list of data mining sets
Miscellaneous and More Lists
- Tableau Public Sample Data https://public.tableau.com/en-us/s/resources
- Amazon Web Services Public Data http://aws.amazon.com/datasets
- Google Public Data https://www.google.com/publicdata/directory
- Google Trends http://www.google.com/trends/explore
- Freebase People, Places, and Things http://www.freebase.com/
- Datahub 10K+ collection of datasets https://datahub.io/
- Github Public Datasets https://github.com/caesar0301/awesome-public-datasets
- Million Song Data Set http://aws.amazon.com/datasets/6468931156960467
- ESPN Sports API http://espn.go.com/apis/devcenter/
- Sports Reference Data http://www.sports-reference.com/
1. Freebase
Freebase is an open platform for data sharing. It contains a wide range of topics from fictional characters to Modest Mouse. You can even curate your data with data plotting feature. You can plot your datasets in timeline or map.
2. UN Data
This database contains large datasets, consisting virtually all the public data collected by the United Nation. To access the API you have to sign up (it will only take a couple of minutes).
3. WorldBank
Where else to look for financial data of the world but the WorldBank? You can get virtually any country’s financial and economy standings here. Some other topics included are:
- Agriculture & Rural Development
- Aid Effectiveness
- Economic Policy and External Debt
- Education
- Energy & Mining
- Environment
- Financial Sector
- Health
- Infrastructure
- Labor & Social Protection
- Poverty
- Private Sector
- Public Sector
- Science & Technology
- Social Development
- Urban Development
4. Data.gov
Data.gov is leading the way in democratizing public sector data and driving innovation. This movement has spread throughout cities, states, and countries. 5 of 50+ categories:
• Agriculture
- Arts, Recreation, and Travel
- Banking, Finance, and Insurance
- Births, Deaths, Marriages, and Divorces • Business
5. Infochimps
Infochimps contains paid and free datasets just about anything. What’s cool about Infochimps is that you can download datasets into csv format. What’s more is that you can fiddle with the API to extract the data specific to your needs. Try Twitter as your search metric and you will see what I mean.
6. Google Public Data
The Google Public Data Explorer makes large datasets easy to explore, visualize
and communicate.
7. Google Scholar
The Google Scholar is a free search engine that contains all kinds of academic literatures. Citing journal publishers, universities research papers, and other scholarly materials do not just make your content looks smarter, but as well as more trustworthy.
8. Data Market
Data Market contains in-house and third party datasets. It’s a good place to explore data related to economics, healthcare, food and agriculture, and the automotive industry.
- Torrent downloads and uploads on Pirate Bay
- Social media & networks – from Stanford Uni
- Human Emotions by We Feel Fine: to allow other artists to more easily make pieces that explore these human emotions
- LittleSis profiles who’s who in the biggest organisations in the world
- NY Times bestseller
- Google Flu Trends
- NY Times People: User data for com, including the user profiles, activities, news feeds, and networks.
- CrunchBase: Plenty of information about startups and large tech companies
- Wunderground: Provides detailed weather info and lets you search historical data by zip code or city.
•• •
United States Census Data: The U.S. Census Bureau publishes reams of demographic data at the state, city, and even zip code level. The data set is fantastic for creating geographic data visualizations and can be accessed on the Census Bureau website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr. In general, this data is very clean and very comprehensive.
- • FBI Crime Data: The FBI crime data set is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
- • CDC Cause of Death: The Centers for Disease Control and Prevention maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
- • Medicare Hospital Quality: The Centers for Medicare & Medicaid Services maintains a database on quality of care at more than 4,000 Medicare-certified hospitals across the U.S., providing for interesting comparisons.
- • SEER Cancer Incidence: The U.S. government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors. It comes from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program.
- • Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
- • Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, including gross domestic product and exchange rates.
- • IMF Economic Data: For access to global financial statistics and other data, check out the International Monetary Fund’s website.
- • Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One relevant data set to explore is the weekly returns of the Dow Jones Index from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
- • Data.gov.uk: The British government’s official data portal offers access to tens of thousands of data sets on topics such as crime, education, transportation, and health.
- • Enron Emails: After the collapse of Enron, a data set of roughly 500,000 emails with message text and metadata were released. The data set is now famous and provides an excellent testing ground for text-related analysis. You also can explore other research uses of this data set through the page.
- • Google Books Ngrams: If you’re interested in truly massive data, the Ngram viewer data set counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
- • UNICEF: If data about the lives of children around the world is of interest, UNICEF is the most credible source. The organization’s public data sets touch upon nutrition, immunization, and education, among others.
- • Reddit Comments: Reddit released a data set of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller data set to work with Kaggle has hosted the comments from May 2015 on their site.
- • Wikipedia: Wikipedia provides instructions for downloading the text of English-language articles, in addition to other projects from the Wikimedia Foundation.
- • Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The data set lends itself both to categorization techniques (will a given loan default) as well as regressions (how much will be paid back on a given loan).
- • Walmart: Walmart has released historical sales data for 45 stores located in different regions across the United States.
- • Airbnb: Inside Airbnb offers different data sets related to Airbnb listingsin dozens of cities around the world.
- • Yelp: Yelp maintains a dataset for use in personal, educational, and academic purposes. It includes 6 million reviews spanning 189,000 businesses in 10 metropolitan areas. Students are welcome to participate in Yelp’s dataset challenge.
Wikipedia:Database offers free copies of all available content to interested
users. data is available in multiple languages. Content along with images
could be downloaded.
Common crawl builds and maintains an open crawl of the web accessible to
everyone. The data is stored in amazon s3bucket and the requester may have
spend some money to access it.
EDRM File Formats Data Set, consists of 381 files covering 200 file formats.
Apache Mahout TLP project to create scalable, machine learning algorithms.
Mahout has many links to get free and paid corpus data.
EDRM Enron Email Data Set v2 consist of Enron e-mail messages and
attachments in two sets of downloadable compressed files: XML and PST.
ClueWeb09 dataset was created to support research on information retrieval
and related human language technologies. It consists of about 1 billion web
pages in ten languages that were collected in January and February 2009.
The dataset is used by several tracks of the TREC conference.
DMOZ – Open Directory Project is the largest, most comprehensive human-
edited directory of the Web. It has collections of URLs in different category.
Dmoz is one main source for internet search engines.
theinfo.org – This is a site for large data sets and the people who love them:
the scrapers and crawlers who collect them, the academics and geeks who
process them, the designers and artists who visualize them. It’s a place where
they can exchange tips and tricks, develop and share tools together, and
begin to integrate their particular projects.
Project Gutenberg offers over 36,000 free ebooks to download to your PC,
Kindle, Android, iOS or other portable device.
Million song data set, has data related to tracks and artist.
AWS (Amazon Web Services) Public Data Sets, provides a centralized
repository of public data sets that can be seamlessly integrated into AWS
cloud-based applications.
BigML big list of public data sources.
Bioassay data, described in Virtual screening of bioassay data, by Amanda
Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive
compounds) available for download.
Bitly 1.usa.gov data, anonymized clicks on gov links.
Canada Open Data, pilot project with many government and geospatial
datasets.
Causality Workbench data repository.
Corral Big Data repository at Texas Advanced Computing Center, supporting
data-centric science.
Data Source Handbook, A Guide to Public Data, by Pete Warden, O’Reilly
(Jan 2011).
Datacatalogs.org, open government data from US, EU, Canada, CKAN, and
more.
Data.gov.uk, publicly available data from UK (also London datastore.)
Data.gov/Education, central guide for education data resources including
high-value data sets, data visualization tools, resources for the classroom,
applications created from open data and more.
DataMarket, visualize the world’s economy, societies, nature, and industries,
with 100 million time series from UN, World Bank, Eurostat and other
important data providers.
Datamob, public data put to good use.
DataSF.org, a clearinghouse of datasets available from the City & County of
San Francisco, CA.
DataFerrett, a data mining tool that accesses and manipulates TheDataWeb,
a collection of many on-line US Goverment datasets.
Delve, Data for Evaluating Learning in Valid Experiments
EconData, thousands of economic time series, produced by a number of US
Government agencies.
Enron Email Dataset, data from about 150 users, mostly senior management
of Enron.
Europeana Data, contains open metadata on 20 million texts, images, videos
and sounds gathered by Europeana – the trusted andcomprehensive resource
for European cultural heritage content.
FEDSTATS, a comprehensive source of US statistics and more
FIMI repository for frequent itemset mining, implementations and datasets.
Financial Data Finder at OSU, a large catalog of financial data sets.
GDELT: The Global Data on Events, Location and Tone, described by
Guardian as “a big data history of life, the universe and everything.”
GEO (GEO Gene Expression Omnibus), a gene expression/molecular
abundance repository supporting MIAME compliant data submissions,and a
curated, online resource for gene expression data browsing, query and
retrieval.
GeoDa Center, geographical and spatial data.
Google ngrams datasets, text from millions of books scanned by Google.
Grain Market Research, financial data including stocks, futures, etc.
Hilary Mason research-quality Big Data sets collection – many text and image
datasets.
HitCompanies Datasets, comprehensive data on random 10,000 UK
companies sampled from HitCompanies, updated automatically using
AI/Machine Learning.
ICWSM-2009 dataset contains 44 million blog posts made between August
1st and October 1st, 2008.
Infochimps, an open catalog and marketplace for data. You can share, sell,
curate, and download data about anything and everything.
Investor Links, includes financial data
KDD Cup center, with all data, tasks, and results.
Kevin Chai list of datasets, for text, SNA, and other fields.
KONECT, the Koblenz Network Collection, with large network datasets of all
types in order to perform research in the area of network mining.
Linking Open Data project, at making data freely available to everyone.
MIT Cancer Genomics gene expression datasets and publications, from MIT
Whitehead Center for Genome Research.
ML Data, the data repository of the EU Pascal2 networks.
NASDAQ Data Store, provides access to market data.
National Government Statistical Web Sites, data, reports, statistical
yearbooks, press releases, and more from about 70 web sites, including
countries from Africa, Europe, Asia, and Latin America.
National Space Science Data Center (NSSDC), NASA data sets from
planetary exploration, space and solar physics, life sciences, astrophysics,
and more.
Open Data Census, assesses the state of open data around the world.
OpenData from Socrata, access to over 10,000 datasets including business,
education, government, and fun.
Open Source Sports, many sports databases, including Baseball, Football,
Basketball, and Hockey.
Peter Skomoroch dataset Bookmarks PubGene(TM) Gene Database and
Tools, genomic-related publications database
Quandl, a collaboratively curated portal to millions of financial and economic
time-series datasets.
qunb, a platform to find and visualize quantitative data.
Robert Schiller data on housing, stock market, and more from his
book Irrational Exuberance.
SMD: Stanford Microarray Database, stores raw and normalized data from
microarray experiments.
Jerry Smith dataset collection, with Finance, Government, Machine Learning,
Science, and other data.
SourceForge.net Research Data, includes historic and status statistics on
approximately 100,000 projects and over 1 million registered users’ activities
at the project management web site.
StatLib, CMU Datasets Archive.
STATOO Datasets part 1 and STATOO Datasets part 2
Time Series Data Library
Visual Analytics Benchmark Repository.
UCI KDD Database Repository for large datasets used in machine learning
and knowledge discovery research.
UCI Machine Learning Repository.
UCR Time Series Data Archive, offering datasets, papers, links, and code.
United States Census Bureau.
Wikiposit, a (virtual) amalgamation of (mostly financial) data from many
different sites, allowing users to merge data from different sources.
Wolfram Alpha disease and patient level dat.
Yahoo Sandbox datasets, Language, Graph, Ratings, Advertising and
Marketing, Competition
Yelp Academic Dataset, all the data and reviews of the 250 closest
businesses for 30 universities for students and academics to explore and
research.
data.world is a platform where data scientists can find and use a vast array of
high-quality open data, collaborate on data projects, and meet other like-
minded data nerds.