CSCI 4144/6405 – Data Mining and Data Warehousing
Assignment 2: Data Warehousing Techniques – ETL and OLAP Due: 11:55pm, Feb. 25 (Friday), 2022
TA Office Hours via MS Teams (Specifically, via the channel “Office Hour – TA”):
• Mondays (11:00am-11:59am, Java and Python): -MacLeod
Copyright By PowCoder代写 加微信 powcoder
• Tuesdays (11:00am-11:59am, Python Only):
• Thursdays (11:00am-11:59am, Java and Python):
• Fridays (11:30am-12:30pm, Java and Python):
1. Assignment Overview
In this assignment, you need to write a program to accomplish a few ETL and OLAP operations. The major objective of this assignment is to get yourself familiar with the process of constructing and using a data warehouse.
2. Detailed Requirements
1) Overview: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making. The following figure illustrates the process of constructing and using a data warehouse.
2) ETL: The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading. Note that ETL refers to a broad process, and not three well-defined steps. Normally, ETL includes
extraction, cleaning, transformation, loading, and refreshing. Nevertheless, the entire process is known as ETL.
3) OLAP: Data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Typical OLAP operations include roll up, drill down, slice, dice, and pivot.
4) Sample Data Set: A sample data set file titled “Car_Sales_Data_Set.csv” is used as the original data set for our data warehouse. Specifically:
a) CSV stands for Comma-Separated Values. A CSV file is a text file that uses a comma to separate values. Often, the first record in a CSV file is a header line including a list of field names. Therefore, it is very easy to dig into a CSV file and look for useful information. You can use any text editor to open a CSV file and view its content. More details about CSV can be found here: https://en.wikipedia.org/wiki/Comma-separated_values
b) The sample data set is provided via brightspace.
c) There are 64 records (not including the header line) in the data set.
d) The data set includes 5 fields: Country, Time_Year, Time_Quarter, Car_Manufacturer,
Sales_Units.
e) In this data warehouse, there are 3 dimensions (i.e. Country, Time, and Car_
Manufacturer) and 1 measure (i.e. Sales_Units). In addition, the dimension of Time is associated with a two-level concept hierarchy: Time_Quarter < Time_Year. The valid values for each of the dimensions are listed below:
a. Country: Canada, United States
b. Time_Year: 2017, 2018
c. Time_Quarter: 1, 2, 3, 4
d. Car_ Manufacturer: Toyota, Honda, Ford, BMW
5) Required ETL Operation: In practice, the ETL involves many different operations. In this assignment, we focus on the operation of sorting. Specifically, your program needs to sort the records three times.
a) The first sorting deals with the field of “Country”. Once it is done, records associated with Canada are in front of records associated with United States. Now we have two sequential groups of records: Canada_Set and US_Set. Note that both groups are still in the same CSV file.
b) The second sorting is based on the result of the first sorting. The second sorting deals with the field of “Time_Year”. Once it is done, records associated with Canada are still in front of records associated with United States; however, within the Canada_Set, records associated with 2017 are in front of records associated with 2018 (this also applies the US_Set). Now we have four sequential groups of records in the same CSV file: Canada_2017_Set, Canada_2018_Set, US_2017_Set, US_2018_Set.
c) The third sorting is based on the result of the second sorting. The third sorting deals with the field of “Time_Quarter” and it does not change the order of the data groups
mentioned previously. However, within each data group (e.g. Canada_2017_Set), records are sorted according to the quarter values (in ascending order, i.e. 1, 2, 3, 4). Now we have sixteen sequential groups of records: Canada_2017_1_Set, Canada_2017_2_Set, ..., US_2018_3_ Set, US_2018_4_ Set.
6) Required OLAP Operations: For an n-dimensional data cube with each dimension being associated with Li levels, the number of available cuboids is:
The data warehouse in this assignment involves 3 dimensions (i.e. Country, Time, and Car_ Manufacturer). In addition, the dimension of Time is associated with a two-level concept hierarchy: Time_Quarter < Time_Year. Therefore, the number of cuboids is (1+1) x (2+1) x (1+1) = 12. Specifically, the cuboids correspond to the following tuples:
2. (Country)
3. (Time_Year)
4. (Time_Quarter - Time_Year)
5. (Car_ Manufacturer)
6. (Country, Time_Year)
7. (Country, Time_Quarter - Time_Year)
8. (Country, Car_ Manufacturer)
9. (Time_Year, Car_ Manufacturer)
10. (Time_Quarter - Time_Year, Car_ Manufacturer)
11. (Country, Time_Year, Car_ Manufacturer)
12. (Country, Time_Quarter - Time_Year, Car_ Manufacturer)
Note that “Time_Quarter - Time_Year” means the combo of quarter and year. One example value of this concept is 1-2017 (i.e. the first quarter of 2017). These cuboids could be utilized to perform varied OLAP operations.
7) Required Program: You need to write a program that sorts the sample data set and responds to varied OLAP queries. Here are the detailed requirements:
a) You should place the sample data set file titled “Car_Sales_Data_Set.csv” in the directory where your program file is located.
b) The name of your program should be “ETL_OLAP”. After running ETL_OLAP via the command-based interface, ETL_OLAP should read “Car_Sales_Data_Set.csv”, sort the records according to the instructions in Section 2.5. After the first sorting, ETL_OLAP should save the sorting result in a file named “Car_Sales_Data_Set_First_Sorting.csv”. After the second sorting, ETL_OLAP should save the sorting result in a file named “Car_Sales_Data_Set_Second_Sorting.csv”. After the third sorting, ETL_OLAP should save the sorting result in a file named “Car_Sales_Data_Set_Third_Sorting.csv”. Note that all of the sorting result files should use the CSV format and be placed in the same directory as ETL_OLAP.
T = n (L +1) iÕ=1 i
c) After saving the sorting result, ETL_OLAP should display a list of 12 tuples (i.e. the tuples specified in Section 2.6) on the screen. Note that the sequence number associated with each tuple should also be displayed. Each of these tuples corresponds to a cuboid (i.e. a possible OLAP query) in the data cube.
d) Then ETL_OLAP should prompt the user to enter a number in the range of 1-12. Namely, ETL_OLAP should prompt the user to select an OLAP query.
e) Once the user enters a valid number (i.e. a number in the range of 1-12), ETL_OLAP will process the corresponding OLAP query and display the result (including a header line, the corresponding dimension values, and aggregated Sales_Units values) on the screen. For example, if the user enters 2, then the first line of the result should be the header line:
Country Sales_Units
Then the detailed Country values and aggregated Sales_Units values should be displayed:
Canada XXX
United States XXX
Here, “XXX” represents the corresponding aggregated value. Note that:
• When user enters 8, the corresponding cuboid is 2D. Table 4.2 on page 137 of the textbook illustrates how to use a 2D table to display a 2D cuboid.
• When the user enters 11 or 12, the corresponding cuboids are 3D. In this case, ETL_OLAP will use two 2D tables to display the results: the first one corresponds to Canada and the second one corresponds to United States. Table 4.3 on page 137 of the textbook illustrates how to use 2D tables to display 3D cuboids.
8) Testing Platform and Required Language: The details of the testing platform and the required programming language are presented as follows.
a) Testing Server: “timberlea.cs.dal.ca” is the computer used by the TA to evaluate your program. Therefore, you need to make sure that your program works on timberlea.
a. You can use your CS ID to log on to “timberlea.cs.dal.ca” in order to write your program. Alternatively, you can write your program on other machines, then transfer your program to timeberlea and thereafter test it on timberlea.
b. If you do not know your CS ID, you can visit the following webpage to get your CS ID. If your CS ID does not work or you have a question about your CS ID, please send an email to
https://csid.cs.dal.ca/
b) Required Programming Language: You need to use Java or Python as the programming language because timberlea supports these languages. Note that both Python 2 and Python 3 are available on timberlea.cs.dal.ca. You can use “python2 --version” and “python3 --version” to check the specific versions on timberlea. In addition, you can only use the following header file or libraries in your program:
a. Java: java.io.*, java.util.*, java.lang.Math
b. Python: csv, math, itertools
c) Compiling and running your program on timberlea.cs.dal.ca should not lead to errors or warnings. To compile and run your program on timberlea, you need to be able to
access the command-line interface of timerlea. In addition, you need to be able to upload a file to or download a file from timberlea.
a. To access command-line interface of timerlea, you can use the software tool “putty” on MS Windows computers. “putty” can be downloaded here: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html . On Mac and Linux computers, you can use the command “ssh” to access timberlea via the program called “Terminal”.
b. To transfer files between your computer and timberlea, several different methods could be used. Here are two methods for MS Windows and macOS/Linux computers.
i. MS Windows Computer: WinSCP is popular tool used to transfer files between two computers. You can download WinSCP from the following webpage: https://winscp.net/eng/download.php . The documentation for WinSCP can be found here: https://winscp.net/eng/docs/start. Specifically, you can focus on the “Uploading Files” and “Downloading Files” section of this document to understand how to transfer files.
ii. Mac and Linux Computer: On Mac and Linux computers, you can use the command “scp” to transfer files. Here is a tutorial on the command “scp”: https://www.linuxtechi.com/scp-command-examples-in-linux/.
9) Readme File: You need to complete a readme file named “Readme.txt”, which includes the instructions that the TA could use to compile and execute your program on timberlea.
10) Submission: Please pay attention to the following submission requirements:
a) You should place “Readme.txt” in the directory where your program file is located.
b) You should place “Car_Sales_Data_Set.csv” in the directory where your program file is
c) Your program file, “Readme.txt”, and “Car_Sales_Data_Set.csv” should be compressed
into a zip file named “ASN2-YourFirstName-YourLastName.zip”. For example, my zip file should be called “ASN2-Qiang-Ye.zip”. Finally, you need to submit your zip file for this assignment via brightspace.
a. Note that there is an appendix at the end of this document, which includes the commands that you can use to compress your files on timberlea.
3. Grading Criteria
The marker will use your submitted zip file to evaluate your assignment. The full grade is 22 points. The details of the grading criteria are presented as follows.
1) Does “Readme.txt” include enough information so that the TA can easily compile and execute the program on timberlea? (1 Point)
2) ETL Function: After running ETL_OLAP, “Car_Sales_Data_Set_Sorted.csv” (the file contains the sorted records) is generated and placed in the directory where the compiled program is located. (6 Points: each sorting is worth 2 points)
3) OLAP Function: After sorting, ETL_OLAP should display a list of 12 tuples. Then ETL_OLAP should prompt the user to enter a number in the range of 1-12. Once the user enters a valid number, ETL_OLAP will process the OLAP query and display the correct result on the screen. (12 Points: each of the cuboids is worth 1 point)
4) Overall Quality of the Program (i.e. whether the structure of the program is clear and reasonable, whether the program is properly commented, whether the indentation is appropriate, etc). (3 Points)
4. Academic Integrity
At Dalhousie University, we respect the values of academic integrity: honesty, trust, fairness, responsibility and respect. As a student, adherence to the values of academic integrity and related policies is a requirement of being part of the academic community at Dalhousie University.
1) What does academic integrity mean?
Academic integrity means being honest in the fulfillment of your academic responsibilities thus establishing mutual trust. Fairness is essential to the interactions of the academic community and is achieved through respect for the opinions and ideas of others. Violations of intellectual honesty are offensive to the entire academic community, not just to the individual faculty member and students in whose class an offence occur (See Intellectual Honesty section of University Calendar).
2) How can you achieve academic integrity?
- Make sure you understand Dalhousie’s policies on academic integrity.
- Give appropriate credit to the sources used in your assignment such as written or oral work, computer codes/programs, artistic or architectural works, scientific projects, performances, web page designs, graphical representations, diagrams, videos, and images. Use RefWorks to keep track of your research and edit and format bibliographies in the citation style required by the instructor. (See http://www.library.dal.ca/How/RefWorks)
- Do not download the work of another from the Internet and submit it as your own.
- Do not submit work that has been completed through collaboration or previously submitted for another assignment without permission from your instructor.
- Do not write an examination or test for someone else.
- Do not falsify data or lab results.
These examples should be considered only as a guide and not an exhaustive list.
3) What will happen if an allegation of an academic offence is made against you?
I am required to report a suspected offence. The full process is outlined in the Discipline flow chart, which can be found at: http://academicintegrity.dal.ca/Files/AcademicDisciplineProcess.pdf and includes the following: a. Each Faculty has an Academic Integrity Officer (AIO) who receives allegations from instructors. b. The AIO decides whether to proceed with the allegation and you will be notified of the process. c. If the case proceeds, you will receive an INC (incomplete) grade until the matter is resolved. d. If you are found guilty of an academic offence, a penalty will be assigned ranging from a warning to a suspension or expulsion from the University and can include a notation on your transcript, failure of the assignment or failure of the course. All penalties are academic in nature.
4) Where can you turn for help?
- If you are ever unsure about ANYTHING, contact myself.
- The Academic Integrity website (http://academicintegrity.dal.ca) has links to policies, definitions, online tutorials, tips on citing and paraphrasing.
- The Writing Center provides assistance with proofreading, writing styles, citations.
- Dalhousie Libraries have workshops, online tutorials, citation guides, Assignment Calculator, RefWorks, etc.
- The Dalhousie Student Advocacy Service assists students with academic appeals and student discipline procedures.
- The Senate Office provides links to a list of Academic Integrity Officers, discipline flow chart, and Senate Discipline Committee.
Appendix: How to Use Zip and Unzip on Timberlea
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com