5. Practical week 4, part A (More about Data Frames)
5.1. Changing and manipulating pandas’ data frame objects
In this practical we will cover some basic operations with data frames:
– How to add new (computed) columns in a data frame
– How to sort data frame tables by column values
– How to do simple statistics on data frame columns
– How to make simple visualizations of the data
– How to generate test (fake) data
5.2. Why knowing pandas is important
Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. This is encouraging because it means pandas is not only helping users to handle their data tasks but also that it provides a better starting point for developers to build powerful and more focused data tools. The creation of libraries that complement pandas’ functionality also allows pandas development to remain focused around its original requirements.
For example, Statsmodels is the prominent Python “statistics and econometrics library”, and it has a long-standing special relationship with pandas. Statsmodels provides powerful statistics, econometrics, analysis, and modeling functionality that is out of pandas’ scope, and it leverages pandas objects as the underlying data container for computation.
Even libraries that are not built on top of pandas can be part of the pandas “ecosystem”. For example, Seaborn is a Python visualization library based on matplotlib. It provides a high-level, dataset-oriented interface for creating attractive statistical graphics. The plotting functions in Seaborn understand pandas objects and leverage pandas grouping operations internally to support concise specification of complex visualizations. Seaborn also goes beyond matplotlib and pandas by offering the option to perform statistical estimation while plotting: aggregating across observations and visualizing the fit of statistical models to emphasize patterns in a dataset.
5.3. Starting with a simple example
First, we consider a csv file that contains the restaurant list we used before (download it from Nestor, ‘restRanking.csv’), structured in only two columns (restaurant name and address), this time using the ‘;’ (semicolon) as a separator. (NOTE: csv files are usually said to be “comma separated value files”, but the abbreviation is better read as “CHARACTER separated value files”: almost any character or set of characters can be used as a separator. Of course, commas and semicolons make the textual content of the files easier to read with the human eye.)
EXERCISE 1: (write the code in a .py file named for example practical_week4_ex_1):
Read the content of this file into a pandas data frame object (you should know by now how to do it). The read_csv() method has multiple parameters; here you need sep (indicating which separator character is used in the file), which has the default value ‘,’ (comma). As you would expect, for this file the parameter has to be set to the semicolon character. Display in the console the shape, columns, and dtypes properties of the newly created data frame. Display the whole data frame (all rows).
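A minimal sketch of this step (the data frame name restaurants and the use of os.path.join(sys.path[0], …) to locate the file next to the script are choices, not requirements):

import os, sys
import pandas as pd

restaurants = pd.read_csv(os.path.join(sys.path[0], 'restRanking.csv'), sep=';')
print(restaurants.shape)          # (number of rows, number of columns)
print(restaurants.columns)        # the column names
print(restaurants.dtypes)         # the data type of each column
print(restaurants.to_string())    # to_string() prints all rows, not a truncated view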
We now want to add some information to this data frame: for example, two columns, one representing the number of positive reviews by customers and another one representing the negative reviews, giving us enough information to compute a 0-10 score for each restaurant. We consider here that customers respond to a review request and can answer in three ways: “good”, “bad”, “not sure”. That means that we need a third column for this last kind of response.
Initially, we do not know these results, and we need a placeholder, like the number 0.
A solution would be to generate a list full of zeroes, as long as the table’s length in rows (you have to fill in the name you gave to your data frame object, without the <> characters below):

zeroes = [0] * len(<your data frame>)

…and then we simply use the same technique as with a Python dictionary: assign the list of zero values to the data frame, indexed with a new column name:

<your data frame>['nrGoodReviews'] = zeroes
<your data frame>['nrBadReviews'] = zeroes
<your data frame>['nrUndecided'] = zeroes

Try this, and print the data frame’s content again. However, doing it like this is a bad idea and ugly programming style in Python. Due to that, comment out these 4 lines in your code.
However, customers might have responded to the review request only with “I am not sure”, and in that case we would leave in the dataset values like 0 good and 0 bad reviews. It is obvious that 0 is not a good value to use as a default for any of these columns. Programmers tend to use -1 in these situations, but pandas users typically employ a special value named ‘NaN’ (meaning Not a Number, and not a granny). To add three columns filled with these values, we can write code that is more elegant than the above:
import numpy as np   # if necessary; np is the usual alias for the numpy library,
                     # which provides the nan value for Python

for newCol in ['nrGoodReviews', 'nrBadReviews', 'nrUndecided']:
    <your data frame>[newCol] = np.nan

There are many ways to add new columns, but this one has the advantage that it takes a list of new column names as a sort of input, and it works for a list of any length.
INTERESTING NOTE: You can also use the Python built-in placeholder None instead of the numpy-related np.nan value typically used with pandas. We need these placeholders to show in data repositories that we do not yet have a value for that particular data point or record. In other programming languages (C, C++, Java), this “nothing” placeholder value is called null, which is a very different thing from having a 0 or a -1. To understand better the nuances and implications of this particular kind of value for data repositories, watch the following excellent educational video: https://www.youtube.com/watch?v=bjvIpI-1w84
Display again the whole data frame content in the console. You see that the restaurants are in a random sequence, and if somebody would like to input the test results manually, by overwriting the NaN values, it would be slow to find the right row. For example, if the restaurants appeared in the alphabetical order of their names, the task of finding a specific restaurant’s row would be easier. Sorting the content of a data frame, based on one or more columns, is very easy. The DataFrame class has a sort_values() method, and you can use it to order the restaurants on two criteria, name first and address second, as in the sketch below:
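df = df.sort_values(by=['name', 'address'])   # here df stands for the name of your data frame object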
Next, we want to compute a score for each restaurant, using a variant of the Net Promoter Score formula: score = (g - b) / (g + b + u) * 5 + 5, where g is the number of good responses (for example, grades 9 and 10), b the number of bad ones (grades between 0 and 5), and u the number of in-between ones (6, 7, or 8). The grade thresholds for good, bad, and undecided are up to the data analyst to decide. Note that the formula above will always yield a score (a real number) between 0 and 10, and it will generate a “division by 0” error if there are 0 customer responses. Therefore, we should never try to compute a score when (g + b + u) is zero. Normally, marketing analysis is done only if there are enough answers (over a certain threshold), because statistics is meaningless for small numbers.
EXERCISE 2
First, write a function that computes the score:
def nps_formula(g, b, u):
    # computes the Net Promoter Score in a range from 0 to 10
    value = (g / (g + b + u) - b / (g + b + u)) * 5.0 + 5.0
    return value
This function can be used, by invoking (calling) it, to add a new score column to the data frame with all the computed scores:

df['score'] = nps_formula(df.nrGoodReviews,
                          df.nrBadReviews,
                          df.nrUndecided)

where df is the name of your data frame object. Because the arguments are entire columns, pandas applies the formula element-wise, computing the score for every row in one call.
Order the rows of the data frame by the values of this new score column, descending (search the online documentation, e.g. https://www.geeksforgeeks.org/python-pandas-dataframe-sort_values-set-1/ , to see how to achieve that, because the default sort order is ascending).
Move the score column to make it the first (leftmost) column of the table. Display the ordered table and save the ordered data frame into an indexed csv file (indexed this time, because the index will show the ranking) named for example ‘finalRestaurantScores.csv’. One possible way to combine these steps is sketched below.
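df = df.sort_values(by='score', ascending=False)              # descending order
df = df.reset_index(drop=True)                                # renumber the rows, so the index reflects the ranking
df = df[['score'] + [c for c in df.columns if c != 'score']]  # move 'score' to the front
df.to_csv(os.path.join(sys.path[0], 'finalRestaurantScores.csv'))   # the index is written by default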
HOMEWORK (write the code in a NEW .py file, named for example practical_week4_ex_2): This program will read the content of (the given) rankingsRaw.csv into a data frame object, compute the score for all restaurants, add it to the data frame, and save the data into a new csv file. Then take the first csv file you created after extending the data frame with NaN or None values. This file has no values in the columns for the good and bad answer counts. Complete the results in the file (by hand, with Notepad) for a few (4-7) restaurants only, leaving the rest of the rows unchanged. Read the file into a data frame and compute the scores only for the rows whose raw results you completed. Create two output files: one with the restaurants that had results and now have a final score, and one with the restaurants that do not (the format of this one should be similar to the initial file, but the restaurants that now have a score should not appear here anymore). HINT: explore the online documentation to learn how NaN is used in Python and how the isnull() method works; a possible skeleton for the splitting step follows below.
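A minimal sketch of that splitting step, assuming the column names from Exercise 1 and example output file names of your own choice:

has_results = df['nrGoodReviews'].notnull() & df['nrBadReviews'].notnull()
scored = df[has_results].copy()
scored['score'] = nps_formula(scored.nrGoodReviews, scored.nrBadReviews, scored.nrUndecided)
unscored = df[~has_results]
scored.to_csv('scoredRestaurants.csv')                    # restaurants with a final score
unscored.to_csv('unscoredRestaurants.csv', index=False)   # same format as the initial file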
5.4. Some simple statistics using data frames from pandas
After computing the scores, we want to get some statistical knowledge about the scores and their distribution. HINT: start a new .py file that reads the finalRestaurantScores.csv file into a data frame (for simplicity, in the code below, the name of this data frame object is df; however, in your code, better use a more specific identifier, like restaurants).
First, a very basic operation is to count how many non-NaN/null scores we have, by using the count() method of the DataFrame class:

print(df['score'].count())

The method can also be invoked for the entire data frame:

print(df.count())
Delete some score values from the input csv file (use Atom!), leaving the commas untouched (by this you introduce NaN/null values), and run the code again. See the difference with the first run.
The complete documentation of this method is at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html#pandas.DataFrame.count
Two other useful methods of the DataFrame class are similar to the built-in min and max Python functions that can be used for all built-in collections, like lists. Here, the methods can be applied either to a single column (or row, by changing the axis parameter) or to an entire data frame:
print(df['score'].min())
print(df['score'].max())
print(df.min())
print(df.max())
The mean() method computes the average of a given column (or row, if asked accordingly):

print(df['score'].mean())

If applied to the entire data frame:

print(df.mean())
This will compute the average for all columns that are numerical and can be computed; obviously, the columns ‘name’ and ‘address’ cannot be “averaged”.
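NOTE: in recent pandas versions, calling mean() (and similar methods) on a data frame that still contains text columns raises an error instead of silently skipping them; if you run into this, restrict the computation to the numerical columns explicitly:

print(df.mean(numeric_only=True))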
The statistical analysis of any numerical data set gets more meaning if the standard deviation of a sample is known. The method std() does this for a data frame:
print(df.std())
It can also be applied to an individual column only (try it). You should also apply the sum(), median(), and nunique() methods to the score column:
print(df['score'].sum())       # total sum of the column values
print(df['score'].median())    # median of the column values
print(df['score'].nunique())   # number of unique entries
Finally, a powerful method that will give multiple answers in the form of a statistics summary of the data in a data frame or a part of it (like a single column) is describe(). This will summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. You can parametrize the percentiles, like below:
print(df.describe())   # here the default is [0.25, 0.5, 0.75], which
                       # returns the 25th, 50th, and 75th percentiles
print(df.describe(percentiles=[0.15, 0.3, 0.45, 0.6, 0.85]))
As with the previous statistical methods, the result will include all the numerical columns only.
5.5. Graphic visualization of data frame content
A useful insight would be to see visually the distribution of the scores in this specific data frame. The easiest way is to use a specialized method named hist() (which displays one or more histograms, depending on how many numerical columns can be analyzed). Try both: for one column only, and for an entire data frame:

df['score'].hist()
df.hist()
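Note that hist() draws via matplotlib, so when running a plain .py script (rather than an interactive console) you may need to ask matplotlib explicitly to show the figure:

import matplotlib.pyplot as plt

df['score'].hist()
plt.show()   # opens the window with the histogram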
For this particular set of data, it is not necessarily a good idea to try a pie chart. To show how such a chart would work, use the following code (in a separate .py file), which creates a data frame object from a simple dictionary with three entries for each of two keys. Each entry refers to a planet (the masses are in units of 10^24 kg, and the radii are in kilometers):

df = pd.DataFrame({'mass': [0.330, 4.87, 5.972],
                   'radius': [2439.7, 6051.8, 6378.1]},
                  index=['Mercury', 'Venus', 'Earth'])

# we can select only one key/column if we want
df.plot.pie(y='mass', figsize=(5, 5))
df.plot.pie(y='radius', figsize=(5, 5))

# but we can also plot all in one shot
df.plot.pie(subplots=True, figsize=(6, 3))
You can find more information about plotting various types of data graphics at:
https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
5.6. Measure and visualize a real data file
In the previous week, for practical B, you were asked to visualize on a map the track of a stroll that was traced via a GPS device, where the generated data was saved in a .gpx file. This exercise continues that one, but the data this time is about taxi rides (in NY city). In the JSON file taxiRuns.json, provided on Nestor, there is detailed data about a limited number of taxi runs, in a record-oriented JSON format (each “{ … }” object in the file describes one taxi run).
First, start a new Python program (named like week4_ex5.6.py). To read the content of the file in your program, use the simplest way possible:
import os, sys
import pandas as pd

taxi = pd.read_json(os.path.join(sys.path[0], "taxiRuns.json"),
                    orient='records')
This will create a pandas DataFrame object, where each row is a taxi run. You have to use the pandas methods you have already learned to investigate the shape, size, and statistics output of this data collection. You will see that the data for each run contains geolocation coordinates for the pickup and drop off points of the taxi run.
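For example (a quick sketch using the inspection methods from earlier in this practical):

print(taxi.shape)       # number of runs and number of columns
print(taxi.dtypes)      # which columns are numerical
print(taxi.describe())  # summary statistics of the numerical columns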
EXERCISE: By using the smopy library you have used in the previous B practical, visualize these points on the NY map as below, and also unite, for each run, the start and the end of the run.
HINT: to show a line on the map, use the plotting statement:

ax.plot([x1, x2], [y1, y2])

where these pixel coordinates are generated (pair by pair, in a for statement), as in the previous exercise, from geolocation coordinates, using the to_pixels(…) method from smopy:
x1, y1 = map.to_pixels(pickup_points[i][latitude],
pickup_points[i][longitude])
x2, y2 = map.to_pixels(dropoff_points[i][latitude],
dropoff_points[i][longitude])
where i is the index used to iterate through all the runs, and the object map is generated by using smopy.Map(…). The output of your program should be the NY map with a line drawn between the pickup and drop-off points of each run.
Obviously, you will first have to find out the corner coordinates of this map, by identifying the minimum and maximum latitude and longitude of the existing geolocation points in the given data collection.
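A minimal sketch of how the bounds and the drawing loop could fit together; the column names used below (pickup_latitude and so on) are assumptions, so check the real ones first with print(taxi.columns):

import smopy
import matplotlib.pyplot as plt

# NOTE: all four column names are assumptions; adapt them to your file
lat_min = min(taxi['pickup_latitude'].min(), taxi['dropoff_latitude'].min())
lat_max = max(taxi['pickup_latitude'].max(), taxi['dropoff_latitude'].max())
lon_min = min(taxi['pickup_longitude'].min(), taxi['dropoff_longitude'].min())
lon_max = max(taxi['pickup_longitude'].max(), taxi['dropoff_longitude'].max())

nyMap = smopy.Map((lat_min, lon_min, lat_max, lon_max), z=12)   # corner coordinates plus a zoom level
ax = nyMap.show_mpl(figsize=(8, 8))
for i in range(len(taxi)):
    x1, y1 = nyMap.to_pixels(taxi['pickup_latitude'][i], taxi['pickup_longitude'][i])
    x2, y2 = nyMap.to_pixels(taxi['dropoff_latitude'][i], taxi['dropoff_longitude'][i])
    ax.plot([x1, x2], [y1, y2])
plt.show()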
5.7. For the curious student
EXERCISE 1:
Smopy is a rather limited and simplistic library, easy to use but not allowing you to add map place markers or to zoom interactively after the map is generated by your script. A more powerful tool is folium (https://python-visualization.github.io/folium/), which generates an html map file that you can visualize in a browser (you cannot see the output immediately when running the program in Python, because Atom does not visualize html). On Nestor, a foliumExample.py is provided, along with its output file, foliumTest.html (try to zoom in on this file, and see in which US city the place markers are placed).
Change your exercise 5.6 solution from using smopy to folium.
EXERCISE 2:
Horeca owners will feel uncomfortable when their restaurant name appears in files used by students, as in this example file with randomly generated scores. This negative feeling is normal and somewhat expected, and in more formalized institutions it is not even allowed to use real data, even as test data. Luckily, others have worked on this issue, and we can use a special module named (appropriately) faker, which offers a Faker class that can be instantiated into objects that generate names of different kinds (more about this module at: http://zetcode.com/python/faker/). See an example of code below, generating people’s names; you can use it in a new .py file (don’t forget you have to install the faker module with pip on your computer):
from faker import Faker
import numpy as np

output_file = "fake_data.csv"
fake = Faker('nl_NL')

with open(output_file, mode='w') as output:
    output.write("first_name,last_name,scores\n")
    for _ in range(20):
        output.write("%s,%s,%f\n" % (
            fake.first_name(),
            fake.last_name(),
            np.nan))
You can change the parameter of the Faker() constructor to, for example, ‘fr_FR’ or ‘hu_HU’ and get French- or Hungarian-sounding names. Observe above also the use of a special (anonymous) variable in the for construct, denoted by a single underscore, and the formatted-string write operation. You can study more about this Python formatting feature on a very nice free webpage: https://www.learnpython.org/en/String_Formatting
Make a similar program, one that generates restaurant names and addresses; a possible starting point is sketched below.
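A minimal sketch, with the caveat that faker has no dedicated restaurant-name provider, so company() and address() are used here as stand-ins:

from faker import Faker

fake = Faker('nl_NL')

with open("fake_restaurants.csv", mode='w') as output:
    output.write("name;address\n")   # semicolon separator, as in restRanking.csv
    for _ in range(20):
        # address() returns a multi-line string, so the newlines are flattened
        address = fake.address().replace("\n", ", ")
        output.write("%s;%s\n" % (fake.company(), address))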