MFIN6201 Lecture 2
Programming and Data Management
Leo Liu
February 26, 2020
About Me
• Leo Liu
• leo.liu@unsw.edu.au
• Consultation hours: Tuesday 5-6pm, from week 7 to week 10
• West Wing, Level 3 of the Business School
• Please email me before you come
• Around exam time, I can be more flexible
This lecture
This lecture will NOT be examined in the final, but it should help you with your assignment and your career going forward.
1. Data structure in Finance
2. Data processing techniques
3. Tools for data processing and analysis
4. Play with Stata
Common types of finance data
• Company reports: financial reports, proxy statements, management proposals, etc. Usually at low frequency, e.g. at the yearly or quarterly level.
• Price series: public company stock prices – daily open, close, high, low …
• Intra-day prices: trade-by-trade, quote-by-quote data; can be at very high frequency, down to the nanosecond level.
• We will also come across experimental datasets, survey data and other types in this course.
The econometric treatment for each type can be very different because the data structures differ.
What do we get out of data
• Get insights
1. Understand the pattern of the data to understand its behavior
2. Uncover relationships
• Answer economic questions
1. Testing relationships suggested by theory (most likely we want to uncover causal relationships)
2. Testing theoretical models (whether the theory fits the pattern of the data)
• Make predictions
Time-series, cross-section and panel data
• We can look at data points for the same unit at different times. For example, Apple’s stock price today and tomorrow. This is characterized as time-series data.
Company  Date        Price
Apple    2010-10-01  $100
Apple    2010-10-02  $102
Apple    2010-10-03  $104
Apple    2010-10-04  $99
Apple    2010-10-05  $95
Analysing time-series data presents serious econometric issues because of the correlation structure among observations. For example, recall that one of the OLS assumptions is that the error terms are i.i.d.; this is often not the case in a time series.
Time-series, cross-section and panel data
• We can look at data points for different units in the same time period. This is characterized as cross-sectional data.
Company    Date        Price
Apple      2010-10-01  $100
Microsoft  2010-10-01  $50
Tesla      2010-10-01  $200
It is often difficult to make economic inferences from cross-sectional data because of individual-specific omitted-variable bias, i.e. you are not comparing apples to apples: each individual has many different characteristics that could drive the cross-sectional differences.
Panel data
Combining time series and cross-sections.
Company    Date        Price
Apple      2010-10-01  $100
Apple      2010-10-02  $110
Apple      2010-10-03  $92
Microsoft  2010-10-01  $50
Microsoft  2010-10-02  $60
Microsoft  2010-10-03  $82
Tesla      2010-10-01  $210
Tesla      2010-10-02  $200
Tesla      2010-10-03  $220
We can apply econometric treatments to panel data that mitigate the concerns with pure time-series or cross-sectional data. These will be discussed in Lecture 7.
Data processing techniques
Let’s first talk about computer hardware issues:
• Three things in your computer matter: CPU, Memory and Disk
• All computations are done by the CPU; it fetches data from memory, processes them, and writes the results back to memory
• Most data processing and statistical software requires loading the whole dataset into memory. However, the physical memory in your computer is often quite limited.
• In this course, we will not use large datasets, so everything should fit into your memory quite nicely. In real-world applications, serious data work often relies on distributed systems to manage large datasets.
• Memory cannot hold data permanently, so data must be saved to disk
Data types
• Most statistical software requires you to specify a data type for each column of your dataset (see the sketch after this list).
• The major types are int, float, str, DateTime, boolean, etc.
• The software will force you to have only one data type per column. In the data world, we often call this a schema. This is to ensure consistency.
• Some of the more advanced systems do not have a pre-set schema, to allow some freedom.
• Depending on the software, the data type can be very specific; for example, a float can be float32, float64 or double, which allow different numbers of digits. This distinction is often unimportant for us as normal users.
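A minimal sketch of column data types in pandas (Python); the column names and values are made up for illustration:

```python
import pandas as pd

# Each column of a DataFrame has exactly one data type -- its "schema"
df = pd.DataFrame({
    "company": ["Apple", "Tesla"],                         # str (stored as object)
    "price": [100.5, 200.25],                              # float64
    "volume": [1000, 2000],                                # int64
    "date": pd.to_datetime(["2010-10-01", "2010-10-02"]),  # datetime64[ns]
})
print(df.dtypes)
```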
Getting data into/out of the software
• Many programs support importing Excel files, but in data science we try to avoid software such as Excel, because it is not made for managing big datasets. In addition, in many cases what you see in Excel is not the “true” value a cell represents, due to the complex formatting Excel applies. We would like to avoid such cases.
• Instead, you should use a plain text file such as txt or CSV (comma-separated values). For Excel files, you should export them into CSV files (see the sketch after this list).
• Plain text files are much quicker for a computer to digest, and the file size is smaller and easier to inspect.
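A sketch of importing and exporting a plain-text CSV with pandas; the file names here are hypothetical:

```python
import pandas as pd

# Import: pandas parses the text and infers a schema for each column
df = pd.read_csv("prices.csv", parse_dates=["Date"])
print(df.head())

# Export: write back to plain text, free of Excel's cell formatting
df.to_csv("prices_clean.csv", index=False)
```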
Get data into/out of the software
• Datasets are better exported in a binary format, such as Stata’s .dta file format (sketched below). This avoids accidental modification of the file and also remembers the schema; otherwise the software has to parse the file and infer the schema each time.
• Many binary formats also have options to compress the file, saving disk space.
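A minimal sketch of a binary export in pandas; the data and file name are made up:

```python
import pandas as pd

df = pd.DataFrame({"company": ["Apple", "Tesla"], "price": [100.0, 200.0]})

# Stata's binary .dta format remembers the schema and resists accidental edits
df.to_stata("prices.dta", write_index=False)

# Reading it back requires no parsing or schema inference
df2 = pd.read_stata("prices.dta")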
Common data processing operations
• Change datatypes
• Append, merge datasets
• Group-by operations (aggregations)
• Lead/lag operations (both sketched after this list)
• Deal with missing values
• Pivot tables (transposing)
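A sketch of the group-by and lead/lag operations in pandas, using made-up prices:

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["Apple", "Apple", "Tesla", "Tesla"],
    "date": pd.to_datetime(["2010-10-01", "2010-10-02"] * 2),
    "price": [100.0, 110.0, 210.0, 200.0],
})

# Group-by aggregation: average price per company
print(df.groupby("company")["price"].mean())

# Lag operation: previous day's price within each company
df = df.sort_values(["company", "date"])
df["price_lag1"] = df.groupby("company")["price"].shift(1)
print(df)
```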
Change data types
Data types can be converted into one another, as long as they are compatible. For example, an integer can be cast as a string (text), but a string with non-numeric characters cannot be cast as an integer.
• 123 -> “123” OK!
• “hello world” -> int will report an error…!
• We can perform arithmetic operations on int, but not string
• 123 + 123 = 246 OK!
• “123” + “123” will report an error…! Or some programs will concatenate the strings
• i.e. “123” + “123” = “123123” (see the sketch after this list)
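A quick illustration of these casting rules in Python:

```python
# Compatible casts work
s = str(123)     # 123 -> "123"
n = int("123")   # "123" -> 123

# int("hello world") would raise a ValueError: not castable

# Arithmetic on ints vs. strings
print(123 + 123)      # 246
print("123" + "123")  # "123123" -- Python concatenates the strings
```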
Concatenate or append datasets
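A minimal pandas sketch of appending two datasets with the same columns, using made-up rows:

```python
import pandas as pd

top = pd.DataFrame({"Company": ["Apple", "Apple"], "Price": [100, 102]})
bottom = pd.DataFrame({"Company": ["Tesla"], "Price": [200]})

# Appending (concatenating) stacks datasets with the same columns vertically
combined = pd.concat([top, bottom], ignore_index=True)
print(combined)  # 3 rows
```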
Merge datasets, basics
Merge (also called join) datasets
Suppose we want to find the Country of the customer for each order
Order table:

OrderID  CustomerID  OrderDate
10308    1           1996-09-18
10309    2           1996-09-19
10310    7           1996-09-20

Customer table:

CustomerID  CustomerName  Country
1           Alfreds       Germany
2           Ana           Mexico
5           Antonio       Mexico
Left join, inner join, right join, full outer join
Left join by CustomerID:

OrderID  CustomerID  OrderDate   CustomerName  Country
10308    1           1996-09-18  Alfreds       Germany
10309    2           1996-09-19  Ana           Mexico
10310    7           1996-09-20  .             .

Right join by CustomerID:

OrderID  CustomerID  OrderDate   CustomerName  Country
10308    1           1996-09-18  Alfreds       Germany
10309    2           1996-09-19  Ana           Mexico
.        5           .           Antonio       Mexico
Left join, inner join, right join, full outer join
Inner join by CustomerID:

OrderID  CustomerID  OrderDate   CustomerName  Country
10308    1           1996-09-18  Alfreds       Germany
10309    2           1996-09-19  Ana           Mexico

Full outer join by CustomerID:

OrderID  CustomerID  OrderDate   CustomerName  Country
10308    1           1996-09-18  Alfreds       Germany
10309    2           1996-09-19  Ana           Mexico
10310    7           1996-09-20  .             .
.        5           .           Antonio       Mexico
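A sketch of these four join types in pandas, rebuilding the order and customer tables above; `how` selects the join type:

```python
import pandas as pd

orders = pd.DataFrame({
    "OrderID": [10308, 10309, 10310],
    "CustomerID": [1, 2, 7],
    "OrderDate": ["1996-09-18", "1996-09-19", "1996-09-20"],
})
customers = pd.DataFrame({
    "CustomerID": [1, 2, 5],
    "CustomerName": ["Alfreds", "Ana", "Antonio"],
    "Country": ["Germany", "Mexico", "Mexico"],
})

# how= chooses the join type: "left", "right", "inner" or "outer"
left  = orders.merge(customers, on="CustomerID", how="left")   # keeps all orders
inner = orders.merge(customers, on="CustomerID", how="inner")  # keeps matches only
outer = orders.merge(customers, on="CustomerID", how="outer")  # keeps everything
```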
Merging
What if we have duplicated keys in one of the tables?

Order table:

OrderID  CustomerID  OrderDate
10308    1           1996-09-18
10309    2           1996-09-19
10311    2           1996-09-20
10310    7           1996-09-20

Customer table:

CustomerID  CustomerName  Country
1           Alfreds       Germany
2           Ana           Mexico
5           Antonio       Mexico
Merging
Inner join by CustomerID:

OrderID  CustomerID  OrderDate   CustomerName  Country
10308    1           1996-09-18  Alfreds       Germany
10309    2           1996-09-19  Ana           Mexico
10311    2           1996-09-20  Ana           Mexico
Merging
What if we have duplicated keys in both tables?

Order table:

OrderID  CustomerID  OrderDate
10308    1           1996-09-18
10309    2           1996-09-19
10311    2           1996-09-20
10310    7           1996-09-20

Customer table:

CustomerID  CustomerName  Country
1           Alfreds       Germany
2           Ana           Mexico
2           Chris         US
5           Antonio       Mexico
Merging
The join will create every possible combination of the duplicated values.

Inner join by CustomerID:

OrderID  CustomerID  OrderDate   CustomerName  Country
10308    1           1996-09-18  Alfreds       Germany
10309    2           1996-09-19  Ana           Mexico
10309    2           1996-09-19  Chris         US
10311    2           1996-09-20  Ana           Mexico
10311    2           1996-09-20  Chris         US
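A sketch of this many-to-many case in pandas; with CustomerID 2 duplicated on both sides, the merge produces every combination:

```python
import pandas as pd

orders = pd.DataFrame({"OrderID": [10309, 10311], "CustomerID": [2, 2]})
customers = pd.DataFrame({"CustomerID": [2, 2],
                          "CustomerName": ["Ana", "Chris"]})

# 2 order rows x 2 customer rows with the same key -> 4 result rows
result = orders.merge(customers, on="CustomerID", how="inner")
print(result)
```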
Missing values
• Missing values are data points for which we cannot find a value
• When you run a regression, any X variable with a missing value will cause the entire row to be dropped
Suppose you estimate an OLS regression Y = β1X1 + β2X2 (see the sketch after the table)
Y   X1  X2
10  1   8
20  3   0
30  5   .
40  .   3
.   9   2
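A pandas sketch of how those rows drop out; `.` above denotes a missing value (NaN), and any row missing Y, X1 or X2 is excluded:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Y":  [10, 20, 30, 40, np.nan],
    "X1": [1, 3, 5, np.nan, 9],
    "X2": [8, 0, np.nan, 3, 2],
})

# Only the first two rows have no missing values, so only they
# would enter the regression
est_sample = df.dropna()
print(est_sample)  # rows 0 and 1
```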
Missing values
• Data often contain missing values; how should we treat them?
• Delete? Fill with zero? Fill with some other values?
• The treatment depends on why they are missing.
• Often you need to exercise your own judgment, and there is no single answer.
Missing values
Case 1:
US firms are not obligated to report their R&D in their annual reports. You use a dataset from S&P to collect firms’ R&D expense data. S&P uses a crawler to gather the numbers and reports a missing value if the crawler doesn’t find an R&D entry in the annual report. You find that half of the observations are missing. You also obtain an independent report saying that the majority of these firms did not do R&D.
Missing values
Case 2:
Suppose you run a consulting firm and today you are helping a Fortune 500 firm evaluate how R&D promotes firms’ sales growth. You find that many of the firms have missing sales figures. You later discover that the majority of these firms are shell companies which do not have actual operations.
Missing values
Case 3:
Suppose you are helping university professors evaluate how local job satisfaction promotes local employment rates. Due to data limitations, you only have job satisfaction data for 2012 and 2016, but the employment rate is for 2014. Your research experience tells you that job satisfaction is quite sticky and does not vary much over time.
Tools in Data Science – Statistical
• Python is the number one language in data science due to its popularity, ease of use, and rich set of computational and machine-learning libraries
• The R language has rich econometric libraries that allow one to estimate complicated models
• EViews is also a popular econometric software, but it is proprietary; it specialises in time-series data
• Stata is an easy-to-use statistical software that can accommodate most general usage
Tools in Data Science – Data processing
• SAS is good data processing software, including out-of-memory computation, with many add-on functions, but it lacks econometric libraries
• Apache Hadoop is a data processing system that can be distributed across many computers
• Apache Spark is a next-generation distributed data processing system that is faster, smarter and has rich functions
• If you have terabytes of data, the only hope is to use a database (such as PostgreSQL, Google BigQuery, MongoDB, etc.)
What should you learn: it depends on the job you are aiming for
• Commercial banks primarily use SAS, plus SQL databases
• IB / analyst firms / prop trading firms use a large range of languages, from C++ and Java to Python and R
• A more general data analyst role requires database (SQL and NoSQL) skills
My advice if you would like to be a Data Scientist / Econometrician
• Learn Python; it is (almost) a must. Probably also Java (many data libraries are written in Java as well).
• Learn SQL. It is (almost) a must for any database-related job.
• Learn machine learning and deep learning libraries in Python, such as TensorFlow and Keras
• Master statistical techniques (such as those in this course)