COMP9321 Data Services Engineering
Term 1, 2021
Week 2: Exploring your Data in Pandas

What are Pandas Data Structures?
• Series: A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.
Example:
>>> import pandas as pd
>>> myseries = pd.Series([4, 7, -5, 3])
>>> myseries
0    4
1    7
2   -5
3    3
dtype: int64
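
The Series above gets the default integer index 0 to 3. As a minimal sketch of the "data labels" idea, the same values can be given explicit labels; the labels "d", "b", "a", "c" below are made up for illustration and do not come from the slides.

```python
import pandas as pd

# Same values as the slide example, but with explicit (hypothetical) labels
labelled = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

values = labelled.values        # the underlying array of data
labels = list(labelled.index)   # the associated data labels (the index)
picked = labelled["a"]          # look a value up by its label

print(values, labels, picked)
```

Looking values up by label rather than by position is the main practical difference between a Series and a plain array.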

What are Pandas Data Structures? (Cont’d)
• DataFrame: A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row index and a column index.
Example:
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
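
To make the row index, the column index, and the per-column value types concrete, here is a small sketch that rebuilds the same frame and inspects it. The variable names `column_labels`, `row_labels`, and `column_types` are only illustrative.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

column_labels = list(frame.columns)  # the column index
row_labels = list(frame.index)       # the row index (default: 0..5)
column_types = frame.dtypes          # one value type per column

print(column_labels)
print(row_labels)
print(column_types)
```

Each column keeps its own type here: "year" holds integers, "pop" holds floats, and "state" holds strings.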

Understanding the Data (ask the right Questions)
• What is this dataset?
• What should I expect within this dataset?
• Basic concepts (e.g., domain knowledge)
• What are the questions that I need to answer?
• Does the dataset have some sort of a schema? (utilize domain knowledge)

Understanding the Data using Python
• Use the describe() function to get a summary of the data; NaN values are excluded. By default it returns the count, mean, standard deviation, minimum and maximum values, and the quartiles (25%, 50%, 75%) of each numeric column.
• Use the pandas .shape attribute to view the number of samples (rows) and features (columns) you are dealing with.
• It is also a good idea to take a closer look at the data itself. With the head() and tail() functions of the Pandas library, you can check the first and last 5 rows of your DataFrame, respectively.
• Use the pandas sample() method to view a random selection of rows from the dataset.
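
The four inspection tools listed above can be tried together on a small frame. This is a sketch that reuses the state/year/pop data from the earlier DataFrame slide; the `random_state=0` argument is an added assumption so that the random sample is repeatable.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
})

n_rows, n_cols = frame.shape      # (samples, features)
summary = frame.describe()        # statistics for the numeric columns
first_rows = frame.head()         # first 5 rows by default
last_rows = frame.tail(2)         # pass a number to change how many rows
random_rows = frame.sample(n=3, random_state=0)  # 3 randomly chosen rows

print(n_rows, n_cols)
print(summary)
print(random_rows)
```

Note that `shape` is an attribute (no parentheses), while `describe()`, `head()`, `tail()`, and `sample()` are methods and must be called.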

*http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
Understanding your Data
>>> df = pd.read_csv('MyLovelyDataset.csv')
>>> df.head()
   Identifier Type of Company                  Location
0         206             NaN                    Boston
1         216             Law  London; Virtue & Yorston
2         218             n/a                    Sydney
3         472         Finance                    London
4         480          Health                        NY
# You can also use df.tail() to get the last 5 rows

Understanding your Data (Cont’d)
• If you have many columns and want to see what you have, list the column names:
>>> df = pd.read_csv('MyLovelyDataset.csv')
>>> list(df) # gets list of column names
['Identifier', 'Type of Company', 'Location']
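
Since MyLovelyDataset.csv is not available here, the following sketch reads a small in-memory CSV instead. The three rows are made up for illustration; only the column names are taken from the slide. It shows two equivalent ways to list the column names, plus a per-column count of missing values as one possible next check.

```python
import io
import pandas as pd

# Hypothetical stand-in for MyLovelyDataset.csv (rows are illustrative only)
csv_text = """Identifier,Type of Company,Location
206,,Boston
216,Law,London
218,Finance,Sydney
"""
df = pd.read_csv(io.StringIO(csv_text))

names_from_list = list(df)                 # iterating a DataFrame yields column names
names_from_columns = df.columns.tolist()   # same result via the column index
missing_per_column = df.isna().sum()       # number of NaN values in each column

print(names_from_list)
print(missing_per_column)
```

The empty "Type of Company" field in the first row is read as NaN, which is why that column reports one missing value.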

Useful Resource
• Book: Python for Data Analysis, Second Edition, Wes McKinney