1 Instructions
2 Objectives
There are three main objectives in this assignment:
- Is there indirect evidence suggestive of data manipulation across the universe of publicly traded US companies?
- Could we have forecasted the accounting scandal at Satyam based on the Beneish M-Score model?
- Despite concerns about data quality, is accounting information incrementally useful for predicting future cash flows across the universe of publicly traded US companies?
MSBA 6030: Data Assignment 1
This is an individual assignment, but you are permitted to talk to other students in case you get stuck. Download the associated Compustat zip file and codebook, which contains annual financial information for US companies over the period January 1950 to July 31, 2018. The dataset contains all quantitatively reported financial statement variables (over 700), but for this homework we will focus on a restricted subset.1
When you submit the homework, include the output+answers to the questions along with the code used to generate the output (we will run the code to verify the output). Ideally, you should use R in conjunction with Markdown to solve this homework.2
1 The entire dataset is provided so that you can decide what you wish to use for the final project.
2 If you strongly prefer to use Python, Jupyter notebooks are acceptable as well.
3 Homework
3.1 Pre-processing
- The first step is to make sure the data is ready for analysis. For this assignment we only need to use gvkey (company identifier), datadate (reporting period), fyear (fiscal year), revt, rect, ppegt, epspi, ni, at, oancf, sic, and rdq. Use the codebook to familiarize yourself with what these variables represent.
- Restrict the sample to firms with fiscal years 1988 to 2017, inclusive.
- Drop any observations with missing assets, revenue, net income, EPS, accounts receivables, or operating cash flows (data errors or odd accounting rules).
- If PPE is missing, set it to zero (the dataset often reports missing when it is zero).
- Drop any observations where revenue is negative (data errors or odd accounting rules).
- The dataset has some duplicate observations at the firm-fiscal year level where all variable values are identical except rdq. Drop duplicates at the firm-fiscal year level based on every variable except rdq.
- Your final dataset should have approximately 248,288 observations for all variables except rdq, where it is 215,786.3
3 Depending on how you dropped the duplicates, you may have a slightly different count.
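The pre-processing steps above can be sketched in Python/pandas (footnote 2 allows Python). This is a minimal illustration, not the required solution; the column names follow the codebook, and the function name `preprocess` is just for illustration:

```python
import pandas as pd
import numpy as np

def preprocess(df):
    """Sketch of the pre-processing steps on a Compustat-style frame."""
    keep = ["gvkey", "datadate", "fyear", "revt", "rect", "ppegt",
            "epspi", "ni", "at", "oancf", "sic", "rdq"]
    df = df[keep].copy()
    # Fiscal years 1988-2017 inclusive
    df = df[df["fyear"].between(1988, 2017)]
    # Drop rows missing assets, revenue, net income, EPS, receivables, or OCF
    df = df.dropna(subset=["at", "revt", "ni", "epspi", "rect", "oancf"])
    # Missing PPE is usually a true zero
    df["ppegt"] = df["ppegt"].fillna(0)
    # Negative revenue reflects data errors or odd accounting rules
    df = df[df["revt"] >= 0]
    # Drop duplicates at the firm-fiscal-year level ignoring rdq
    df = df.drop_duplicates(subset=[c for c in keep if c != "rdq"])
    return df
```

Note the ordering: the duplicate drop comes last so that rows made identical by the PPE fill are still caught.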
3.2 Descriptive statistics
1. The first step in any analysis is to understand the raw descriptive statistics, so report them. Note that all variables are reported in millions (USD).
- What is the average revenue, net income, and total assets for a firm in fiscal year 2017?
- Plot the average revenue, net income, and total assets by fiscal year (1988-2017) and comment on any trends.
- The data you generated from the preprocessing steps is obviously different from the raw data. Is there any potential bias in the “cleaned” data? What would you do to learn whether the cleaned data is representative or non-representative of the raw data? Provide some simple analysis to establish the degree of representativeness.
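The by-year averages can be computed with a single groupby; a minimal pandas sketch (the panel and numbers below are hypothetical stand-ins for the cleaned data):

```python
import pandas as pd

# Toy frame standing in for the cleaned Compustat panel (hypothetical numbers)
panel = pd.DataFrame({
    "fyear": [2016, 2016, 2017, 2017],
    "revt":  [100.0, 300.0, 120.0, 340.0],
    "ni":    [10.0, 25.0, 12.0, 30.0],
    "at":    [500.0, 900.0, 520.0, 950.0],
})

# Average revenue, net income, and total assets by fiscal year (in $ millions)
yearly = panel.groupby("fyear")[["revt", "ni", "at"]].mean()
```

The resulting frame is indexed by fiscal year and plots directly with `yearly.plot()`.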
3.3 Indirect evidence of earnings management
2. Are firms relatively more likely to report performance just better or just worse than last year’s performance?
- Calculate the change in EPS (EPS in year t minus EPS in year t − 1) for each firm-year.4 Plot the histogram of the change in EPS, restricting change in EPS ∈ [−.10, +.10], in 1-cent increments.
- Calculate the change in ROA (ROA in year t minus ROA in year t − 1) for each firm-year. ROA (return-on-assets) is another common performance metric and is defined as net income scaled by lagged total assets. Plot the histogram of the change in ROA, restricting change in ROA ∈ [−.10, +.10], in 1% increments.
- Compare the two distributions around zero: which variable is relatively more asymmetric, and in which direction?5 Note that both EPS and ROA share the same numerator (net income). Conjecture possible explanations for the difference in symmetry between the two variables.
4 If the lagged variable is missing, then set the change in EPS as missing. Follow this rule throughout the exercise.
5 You do not need to run a formal statistical test here; raw summary statistics will be enough. The formal method involves local linear regressions.
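One way to build the within-firm lag and the 1-cent bins, sketched in pandas on a hypothetical toy panel (per footnote 4, the change is left missing for a firm's first year):

```python
import pandas as pd

# Toy panel; epspi in dollars (hypothetical values)
df = pd.DataFrame({
    "gvkey": [1, 1, 1, 2, 2],
    "fyear": [2000, 2001, 2002, 2001, 2002],
    "epspi": [1.00, 1.01, 0.98, 0.50, 0.51],
}).sort_values(["gvkey", "fyear"])

# Lag within firm; the change is missing when the lag is missing
df["lag_eps"] = df.groupby("gvkey")["epspi"].shift(1)
df["d_eps"] = df["epspi"] - df["lag_eps"]

# Keep changes in [-0.10, +0.10] and bin in 1-cent increments
in_win = df["d_eps"].between(-0.10, 0.10)
bins = (df.loc[in_win, "d_eps"] * 100).round().astype(int)  # cents
counts = bins.value_counts().sort_index()
```

The rounding step avoids floating-point drift (e.g., 1.01 − 1.00 is not exactly 0.01 in binary floats); `counts` then feeds directly into a bar plot.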
3. Consider the following variables: revenue, operating cash flows, and total assets. Feel free to use an R package such as benford.analysis to generate χ2 test statistics and p-values.6
- Without looking at the data, rank-order which variables you think are most likely to be manipulated in the data and why.
- Plot the distribution of the first digit (against Benford's law distribution on the same graph). Report the χ2 test statistic and interpret its significance. Are you surprised by which variables are significant or insignificant?
- Repeat the previous part for the second digit (note that this test only applies to numbers that have at least two digits).
- For this question, pool the digits from all three variables together for analysis [imagine they all came from a single variable]. Focusing on the first digit's distribution, calculate the difference between the actual frequency and the expected frequency under Benford's law. Take the maximum absolute value of this difference across the 9 digits, and call it MAD (maximum absolute deviation). Calculate MAD by fiscal year and plot it across time. Discuss any patterns, e.g., overall trends or periods associated with corporate scandals and regulation.
- Thus far we have analyzed violations of Benford from the perspective of all firms across the entire sample period or by fiscal year. Can you think of a way to test at the firm-year level instead? If so, try it on your project firm [hint: use more data].
6 https://cran.r-project.org/web/packages/benford.analysis/README.html
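If you prefer to roll your own test rather than rely on benford.analysis, a stdlib Python sketch of the first-digit χ2 statistic follows; the resulting statistic is compared against a χ2 distribution with 8 degrees of freedom (5% critical value ≈ 15.51). The function names are illustrative:

```python
import math
from collections import Counter

def first_digit(x):
    """Leading digit of a positive number (Benford applies to positive values)."""
    s = f"{abs(x):.15e}"  # scientific notation: first character is the leading digit
    return int(s[0])

def benford_chi2(values):
    """Chi-square statistic of observed first digits vs. Benford's law."""
    digits = [first_digit(v) for v in values if v > 0]
    n = len(digits)
    obs = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford: P(d) = log10(1 + 1/d)
        chi2 += (obs.get(d, 0) - expected) ** 2 / expected
    return chi2
```

Non-positive values are filtered out since Benford's law is defined for positive magnitudes; the second-digit test follows the same pattern with the appropriate expected probabilities.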
4. The SEC uses a discretionary accruals model to screen firms for improper earnings management.
- Estimate the following OLS regression for each year separately:

Accruals_it = β0 + β1 Cash Revenue Growth_it + β2 Gross PPE_it + ψ SIC_i + ε_it    (1)

- where Accruals_it is net income minus operating cash flows for firm i in fiscal year t
- where cash revenue growth is Δ revenue − Δ rect
- where SIC is an industry classifier and should be coded as a vector of indicator variables which equal one for each SIC code and zero otherwise.7
- where all variables above (except SIC), including gross PPE, are scaled by prior year's total assets
- Based on the estimated parameters, obtain the fitted residuals ε̂_it for each firm-year observation. Check to make sure that the fitted residuals have zero mean, or you must have done something wrong.
- Firms may manage earnings upwards or downwards; all we are interested in is any evidence of earnings management. Create a new variable which is the absolute value of ε̂_it. This is commonly known as the unsigned discretionary accruals. We will use it in the next step.
- The SEC prefers to detect earnings management with multiple signals. One such signal is that firms that tend to manage earnings also tend to delay their financial reporting (presumably to give themselves time to manage). The variable rdq measures the date of the earnings announcement (when financial data is released). datadate is the date on which the firm's operating period ends, so rdq − datadate represents the delay in reporting. Firms are obligated to report within a certain time period (mostly 120 days). Set delay as missing for any observation where delay is negative or more than 180 days, as those are unusual circumstances or data errors. What is the average delay (in days) in the sample?
- Estimate the following pooled OLS regression:

Delay_it = β0 + β1 Unsigned Discretionary Accruals_it + ψ Firm_i + ε_it    (2)

- where Firm_i represents a vector of indicators equaling 1 for each gvkey and zero otherwise.
- What is the source of variation in the data used to identify β1?
- Provide the regression results (do not report the ψs) and interpret them.
7 The regression software should automatically omit one SIC indicator, which is the baseline group.
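One year's cross-section of regression (1) can be sketched in Python with numpy, hand-building the SIC indicator columns (the function name and interface are illustrative; statistical packages will build the dummies for you). With an intercept included, the OLS residuals have numerically zero mean by construction, which is the sanity check asked for above:

```python
import numpy as np

def yearly_accruals_residuals(accruals, cash_rev_growth, gross_ppe, sic):
    """One year's cross-section: regress accruals on cash revenue growth,
    gross PPE, and SIC indicators (first SIC code is the omitted baseline).
    All inputs are 1-D sequences already scaled by prior-year total assets."""
    accruals = np.asarray(accruals, dtype=float)
    codes = sorted(set(sic))
    # Indicator columns for every SIC code except the omitted baseline
    if len(codes) > 1:
        dummies = np.column_stack([[1.0 if s == c else 0.0 for s in sic]
                                   for c in codes[1:]])
    else:
        dummies = np.empty((len(accruals), 0))
    X = np.column_stack([np.ones(len(accruals)), cash_rev_growth,
                         gross_ppe, dummies])
    beta, *_ = np.linalg.lstsq(X, accruals, rcond=None)
    resid = accruals - X @ beta
    return beta, resid

# Unsigned discretionary accruals are then np.abs(resid)
```

Running this once per fiscal year and stacking the residuals yields the firm-year panel of unsigned discretionary accruals used in regression (2).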
5. On January 8, 2009, Satyam Computer Services, a leading Indian outsourcing company that serves more than a third of the Fortune 500 companies, announced that it had systematically falsified accounts over the past several years. It is not hard to spot that fateful moment on the price chart above. [For more detail, see the attached New York Times article, "Satyam Chief Admits Huge Fraud".]
- We want to see whether the Beneish Earnings Manipulation Detection model could have picked up some early warning signs from the company's financial statements. Because it is an Indian firm that trades as an ADR on the NYSE, the firm files an annual 20-F rather than a 10-K.
- Using the Beneish Earnings Manipulation Model, compute the company's M-Score for both 2008 and 2007. How does the company score each year on this model? Would these irregularities have been flagged in advance? Use the provided spreadsheet to fill in the numbers. The M-Score components' formulas have been written for you, so the calculations will be generated automatically once you provide the raw data. All the information needed to make the calculations is provided in the attached exhibits (you will be able to find the depreciation expense on the statement of cash flows). Note that in 2007 the firm classified investment in bank deposits as a non-current asset. In 2008, the bank deposits were re-classified as a current asset. To increase comparability across time, treat the bank deposits as current assets in both periods in your calculations.
- Interpreting the individual components of the M-Score, which input factors seem to suggest warning signs?
- The Sarbanes Oxley Act of 2002 required boards of directors to form an audit committee staffed by a financial expert. Did Satyam disclose such an expert? Reading the background characteristics of the various board members, who in your mind would have been the most qualified expert?
- Overall, would you say this “Enron of India” was potentially detectable in advance?
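For reference, the standard eight-factor Beneish (1999) M-Score can be written as a small function. This is a sketch of the published coefficients; verify that the provided spreadsheet uses the same eight-factor version before comparing numbers. Scores above roughly −2.22 are conventionally flagged as potential manipulators:

```python
def beneish_m_score(dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi):
    """Eight-factor Beneish (1999) M-Score. Inputs are the eight index
    components (DSRI, GMI, AQI, SGI, DEPI, SGAI, TATA, LVGI); scores
    above about -2.22 are conventionally flagged as potential manipulators."""
    return (-4.84 + 0.920 * dsri + 0.528 * gmi + 0.404 * aqi + 0.892 * sgi
            + 0.115 * depi - 0.172 * sgai + 4.679 * tata - 0.327 * lvgi)
```

As a benchmark, a "no-change" firm (every index equal to 1 and total accruals to total assets of 0) scores −2.48, safely below the flag threshold.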
6. The objective of an accrual-based accounting system is to provide more relevant information to investors, at the expense of worse reliability, in comparison to pure cash flows. How can we evaluate the relevance of the information? One of the principal tasks of an investor is to determine a firm's future cash flows for the purposes of valuation. Using the post-processed data you assembled previously, estimate the following predictive regression (note the lag operator), which asks how well the current year's accounting accruals and cash flows predict next year's cash flows:
operating cash flows_i,t+1 = β0 + β1 accruals_i,t + β2 operating cash flows_i,t + ε_i,t+1    (3)

where all variables are defined as previously (e.g., scaled by prior year's total assets).
- Before you run the regression, provide an economic interpretation for the following three cases:
i. if β̂1 = 0, if β̂1 < 0, if β̂1 > 0
ii. Show the actual regression results. Are accounting accruals useful in forecasting future cash flows? Are cash flows a useful predictor?
- The sample period is 30 years (1988-2017); divide it into period 1 (1988-2002) and period 2 (2003-2017) and repeat the regression exercise. Comment on any differences between the two sample periods. Are you surprised?
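Regression (3) is a pooled OLS with two regressors; a minimal numpy sketch (the function name is illustrative, and subsample splits just filter the inputs by fiscal year before calling it):

```python
import numpy as np

def predict_next_ocf(accruals_t, ocf_t, ocf_next):
    """Pooled OLS of next year's operating cash flows on this year's
    accruals and operating cash flows (all scaled by lagged total assets).
    Returns [intercept, beta1 (accruals), beta2 (cash flows)]."""
    ocf_next = np.asarray(ocf_next, dtype=float)
    X = np.column_stack([np.ones(len(ocf_next)), accruals_t, ocf_t])
    beta, *_ = np.linalg.lstsq(X, ocf_next, rcond=None)
    return beta
```

With real data you would also want standard errors (e.g., via statsmodels), but the point estimates above are what the interpretation questions turn on.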