– – –
CSCI 4144/6405 – Data Mining and Data Warehousing
Assignment 2: Data Warehousing Techniques – ETL and OLAP Due: 11:55pm, Feb. 21, 2020
TA: Serikzhan Kazi (sr520720@dal.ca), Miheer Kulkarni (mh444464@dal.ca) Tutorial/Lab: 11:35am – 12:55pm, Wednesdays; Room: Goldberg 127 Additional TA Help Hours at CS Learning Center:
o Mondays (2pm-4pm): Zhenbang Wang (zh961592@dal.ca) o Wednesdays (2pm-4pm): Hui Huang (huihuang@dal.ca)
o Fridays (3:35pm-5pm): Lauchlan Toal (lc790935@dal.ca)
1. Assignment Overview
In this assignment, you need to write a program to accomplish a few ETL and OLAP operations. The major objective of this assignment is to get yourself familiar with the process of constructing and using a data warehouse.
2. Detailed Requirements
1) Overview: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making. The following figure illustrates the process of constructing and using a data warehouse.
2) ETL: The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading. Note that ETL refers to a broad process, and not three well-defined steps. Normally, ETL includes extraction, cleaning, transformation, loading, and refreshing. Nevertheless, the entire process is known as ETL.
1
3) OLAP: Data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Typical OLAP operations include roll up, drill down, slice, dice, and pivot.
4) Sample Data Set: A sample data set file titled “Car_Sales_Data_Set.csv” is used as the original data set for our data warehouse. Specifically:
a) CSV stands for Comma-Separated Values. A CSV file is a text file that uses a comma to separate values. Often, the first record in a CSV file is a header line including a list of field names. Therefore, it is very easy to dig into a CSV file and look for useful information. You can use any text editor to open a CSV file and view its content. More details about CSV can be found here: https://en.wikipedia.org/wiki/Comma-separated_values
b) The sample data set is provided via brightspace.
c) There are 64 records (not including the header line) in the data set.
d) The data set includes 5 fields: Country, Time_Year, Time_Quarter, Car_Manufacturer,
Sales_Units.
e) In this data warehouse, there are 3 dimensions (i.e. Country, Time, and Car_
Manufacturer) and 1 measure (i.e. Sales_Units). In addition, the dimension of Time is associated with a two-level concept hierarchy: Time_Quarter < Time_Year. The valid values for each of the dimensions are listed below:
a. Country: Canada, United States
b. Time_Year: 2017, 2018
c. Time_Quarter: 1, 2, 3, 4
d. Car_ Manufacturer: Toyota, Honda, Ford, BMW
5) Required ETL Operation: In practice, the ETL involves many different operations. In this assignment, we focus on the operation of sorting. Specifically, your program needs to sort the records three times.
a) The first sorting deals with the field of “Country”. Once it is done, records associated with Canada are in front of records associated with United States. Now we have two sequential groups of records: Canada_Set and US_Set. Note that both groups are still in the same CSV file.
b) The second sorting is based on the result of the first sorting. The second sorting deals with the field of “Time_Year”. Once it is done, records associated with Canada are still in front of records associated with United States; however, within the Canada_Set, records associated with 2017 are in front of records associated with 2018 (this also applies the US_Set). Now we have four sequential groups of records in the same CSV file: Canada_2017_Set, Canada_2018_Set, US_2017_Set, US_2018_Set.
c) The third sorting is based on the result of the second sorting. The third sorting deals with the field of “Time_Quarter” and it does not change the order of the data groups mentioned previously. However, within each data group (e.g. Canada_2017_Set), records are sorted according to the quarter values (in ascending order, i.e. 1, 2, 3, 4). Now we have
2
sixteen sequential groups of records: Canada_2017_1_Set, Canada_2017_2_Set, ..., US_2018_3_ Set, US_2018_4_ Set.
6) Required OLAP Operations: For an n-dimensional data cube with each dimension being associated with Li levels, the number of available cuboids is:
The data warehouse in this assignment involves 3 dimensions (i.e. Country, Time, and Car_ Manufacturer). In addition, the dimension of Time is associated with a two-level concept hierarchy: Time_Quarter < Time_Year. Therefore, the number of cuboids is (1+1) x (2+1) x (1+1) = 12. Specifically, the cuboids correspond to the following tuples:
1. ()
2. (Country)
3. (Time_Year)
4. (Time_Quarter - Time_Year)
5. (Car_ Manufacturer)
6. (Country, Time_Year)
7. (Country, Time_Quarter - Time_Year)
8. (Country, Car_ Manufacturer)
9. (Time_Year, Car_ Manufacturer)
10. (Time_Quarter - Time_Year, Car_ Manufacturer)
11. (Country, Time_Year, Car_ Manufacturer)
12. (Country, Time_Quarter - Time_Year, Car_ Manufacturer)
Note that “Time_Quarter - Time_Year” means the combo of quarter and year. One example value of this concept is 1-2017 (i.e. the first quarter of 2017). These cuboids could be utilized to perform varied OLAP operations.
7) Required Program: You need to write a program that sorts the sample data set and responds to varied OLAP queries. Here are the detailed requirements:
a) Bluenose is used as the testing/marking platform. Therefore, the TA should be able to compile and execute your program via the command-based interface on “bluenose.cs.dal.ca”. Note that you need to have a CSID in order to access bluenose via SSH. If you do not have a CSID or you have a problem with your CSID, please refer to the following webpage to set up your CSID: https://csid.cs.dal.ca/
b) You should place the sample data set file titled “Car_Sales_Data_Set.csv” in the directory where your program file is located.
c) The name of your program should be “ETL_OLAP”. After running ETL_OLAP via the command-based interface, ETL_OLAP should read “Car_Sales_Data_Set.csv”, sort the records according to the instructions in Section 2.5. After the first sorting, ETL_OLAP should save the sorting result in a file named “Car_Sales_Data_Set_First_Sorting.csv”. After the second sorting, ETL_OLAP should save the sorting result in a file named “Car_Sales_Data_Set_Second_Sorting.csv”. After the third sorting, ETL_OLAP should save the sorting result in a file named “Car_Sales_Data_Set_Third_Sorting.csv”. Note that all
T = n (L +1) iÕ=1 i
3
of the sorting result files should use the CSV format and be placed in the same directory
as ETL_OLAP.
d) After saving the sorting result, ETL_OLAP should display a list of 12 tuples (i.e. the tuples
specified in Section 2.6) on the screen. Note that the sequence number associated with each tuple should also be displayed. Each of these tuples corresponds to a cuboid (i.e. a possible OLAP query) in the data cube.
e) Then ETL_OLAP should prompt the user to enter a number in the range of 1-12. Namely, ETL_OLAP should prompt the user to select an OLAP query.
f) Once the user enters a valid number (i.e. a number in the range of 1-12), ETL_OLAP will process the corresponding OLAP query and display the result (including a header line, the corresponding dimension values, and aggregated Sales_Units values) on the screen. For example, if the user enters 2, then the first line of the result should be the header line:
Country Sales_Units
Then the detailed Country values and aggregated Sales_Units values should be displayed:
Canada XXX
United States XXX
Here, “XXX” represents the corresponding aggregated value. Note that when the user enters 11 or 12, the corresponding cuboids are 3D. In this case, ETL_OLAP will use two 2D tables to display the results: the first one corresponds to Canada and the second one corresponds to United States. Table 4.3 on page 137 of the textbook illustrates how to use 2D tables to display 3D cuboids.
8) Required Programming Language: You can use Java, C, C++, or Python as the programming language because bluenose supports these languages (note that you need to have a CSID to access bluenose via SSH). In addition, you can only use the following header file or libraries in your program:
a) Java: java.io.*, java.util.*
b) C: stdio.h, stdlib.h, string.h
c) C++: Standard Template Library (Refer to: https://en.wikipedia.org/wiki/Standard_
Template_Library)
d) Python: No library should be used. Namely, no “import” statement should be included in
your program.
9) Readme File: You need to complete a readme file named “Readme.txt”, which includes the instructions that the TA could use to compile and execute your program on bluenose.
10) Submission: Please pay attention to the following submission requirements:
a) You should place “Readme.txt” in the directory where your program file is located.
b) You should place “Car_Sales_Data_Set.csv” in the directory where your program file is
located.
c) Your program file, “Readme.txt”, and “Car_Sales_Data_Set.csv” should be compressed
into a zip file named “YourFirstName-YourLastName-ASN2.zip”. For example, my zip file 4
should be called “Qiang-Ye-ASN2.zip”. Finally, you need to submit your zip file for this assignment via brightspace.
a. Note that there is an appendix at the end of this document, which includes the commands that you can use to compress your files on bluenose.
3. Grading Criteria
The marker will use your submitted zip file to evaluate your assignment. The full grade is 22 points. The details of the grading criteria are presented as follows.
1) Does “Readme.txt” include enough information so that the TA can easily compile and execute the program on bluenose? (1 Point)
2) ETL Function: After running ETL_OLAP, “Car_Sales_Data_Set_Sorted.csv” (the file contains the sorted records) is generated and placed in the directory where the compiled program is located. (6 Points: each sorting is worth 2 points)
3) OLAP Function: After sorting, ETL_OLAP should display a list of 12 tuples. Then ETL_OLAP should prompt the user to enter a number in the range of 1-12. Once the user enters a valid number, ETL_OLAP will process the OLAP query and display the correct result on the screen. (12 Points: each of the cuboids is worth 1 point)
4) Overall Quality of the Program. (3 Points)
4. Academic Integrity
At Dalhousie University, we respect the values of academic integrity: honesty, trust, fairness, responsibility and respect. As a student, adherence to the values of academic integrity and related policies is a requirement of being part of the academic community at Dalhousie University.
1) What does academic integrity mean?
Academic integrity means being honest in the fulfillment of your academic responsibilities thus establishing mutual trust. Fairness is essential to the interactions of the academic community and is achieved through respect for the opinions and ideas of others. Violations of intellectual honesty are offensive to the entire academic community, not just to the individual faculty member and students in whose class an offence occur (See Intellectual Honesty section of University Calendar).
2) How can you achieve academic integrity?
- Make sure you understand Dalhousie’s policies on academic integrity. 5
- Give appropriate credit to the sources used in your assignment such as written or oral work, computer codes/programs, artistic or architectural works, scientific projects, performances, web page designs, graphical representations, diagrams, videos, and images. Use RefWorks to keep track of your research and edit and format bibliographies in the citation style required by the instructor. (See http://www.library.dal.ca/How/RefWorks)
- Do not download the work of another from the Internet and submit it as your own.
- Do not submit work that has been completed through collaboration or previously submitted for another assignment without permission from your instructor.
- Do not write an examination or test for someone else.
- Do not falsify data or lab results.
These examples should be considered only as a guide and not an exhaustive list.
3) What will happen if an allegation of an academic offence is made against you?
I am required to report a suspected offence. The full process is outlined in the Discipline flow chart, which can be found at: http://academicintegrity.dal.ca/Files/AcademicDisciplineProcess.pdf and includes the following: a. Each Faculty has an Academic Integrity Officer (AIO) who receives allegations from instructors. b. The AIO decides whether to proceed with the allegation and you will be notified of the process. c. If the case proceeds, you will receive an INC (incomplete) grade until the matter is resolved. d. If you are found guilty of an academic offence, a penalty will be assigned ranging from a warning to a suspension or expulsion from the University and can include a notation on your transcript, failure of the assignment or failure of the course. All penalties are academic in nature.
4) Where can you turn for help?
- If you are ever unsure about ANYTHING, contact myself.
- The Academic Integrity website (http://academicintegrity.dal.ca) has links to policies, definitions, online tutorials, tips on citing and paraphrasing.
- The Writing Center provides assistance with proofreading, writing styles, citations.
- Dalhousie Libraries have workshops, online tutorials, citation guides, Assignment Calculator, RefWorks, etc.
- The Dalhousie Student Advocacy Service assists students with academic appeals and student discipline procedures.
- The Senate Office provides links to a list of Academic Integrity Officers, discipline flow chart, and Senate Discipline Committee.
6
Appendix: How to Use Zip and Unzip on Bluenose
7