

Part 2: Python exercises (15 points)
Look out for the 'YOUR CODE HERE:' comment in the following cells.

2.a) Operators (3 points)
Write a function that applies the logistic sigmoid function to all elements of a list.
The logistic sigmoid function is defined as:

$f(x) = \frac{1}{1+e^{-x}}$

A sigmoid function is a mathematical function having a characteristic “S”-shaped curve or sigmoid curve.

The output of the sigmoid function lies in the range (0, 1).

In [ ]:

# Loading libraries

import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd

In [ ]:

# Write a function that applies the logistic sigmoid function to all elements of a NumPy array
# The NumPy and math libraries have useful predefined mathematical functions

def sigmoid(inputArray):

    modifiedArray = np.zeros(len(inputArray))

    # YOUR CODE HERE:

    return modifiedArray

def test_2a():

    inputs = np.arange(-10, 10, 0.2)
    outputs = sigmoid(inputs)
    plt.figure(1)
    plt.plot(inputs)
    plt.title('Input')
    plt.xlabel('Index')
    plt.ylabel('Value')
    plt.show()
    plt.figure(2)
    plt.plot(outputs, color='red')
    plt.title('Output')
    plt.xlabel('Index')
    plt.ylabel('Value')
    plt.show()

test_2a()
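
A possible completion of sigmoid is sketched below. It is only one approach: np.exp operates element-wise on an array, so the whole formula can be applied without an explicit Python loop.

In [ ]:

# One possible solution sketch for 2.a (many other approaches work)

def sigmoid(inputArray):

    modifiedArray = np.zeros(len(inputArray))

    # np.exp is applied element-wise, so the formula transforms the whole array at once
    modifiedArray = 1.0 / (1.0 + np.exp(-np.asarray(inputArray)))

    return modifiedArray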

2.b) Noise addition (3 points)
Write a function that adds random noisy values to all elements of a NumPy array.

Standard Gaussian (white) noise has a mean of 0 and a standard deviation of 1.

The addition of random Gaussian noise to data is often referred to as 'jittering'.

Choose your noisy values such that they are randomly sampled from a Gaussian distribution that has a mean of 0 and a standard deviation of 1.

In [ ]:

# Write a function that adds random noise to all elements of a NumPy array
# The NumPy and math libraries have useful predefined mathematical functions
# The numpy.random module may be useful

def addNoise(inputArray):

    modifiedArray = np.zeros(len(inputArray))

    # YOUR CODE HERE:

    return modifiedArray

def test_2b():

    inputs = np.arange(-10, 10, 0.2)
    outputs = addNoise(inputs)
    plt.figure(1)
    plt.plot(inputs)
    plt.title('Input')
    plt.xlabel('Index')
    plt.ylabel('Value')
    plt.show()
    plt.figure(2)
    plt.plot(outputs, color='red')
    plt.title('Output')
    plt.xlabel('Index')
    plt.ylabel('Value')
    plt.show()

test_2b()
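
A possible completion of addNoise is sketched below; it assumes drawing the noise with np.random.normal is acceptable (numpy.random.default_rng().normal(...) would work equally well).

In [ ]:

# One possible solution sketch for 2.b (many other approaches work)

def addNoise(inputArray):

    modifiedArray = np.zeros(len(inputArray))

    # Draw one standard-normal (mean 0, std 1) sample per element and add it to the input
    noise = np.random.normal(loc=0.0, scale=1.0, size=len(inputArray))
    modifiedArray = np.asarray(inputArray) + noise

    return modifiedArray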

2.c) Text processing (9 points)
For data analyses involving raw text data, different preprocessing tasks are applied to clean the data.

This section exercises some of the basic steps involved in text cleaning:

1. Find the total number of alphanumeric characters in the original input text
2. Find the total number of unique numeric characters in the original input text
3. Remove all non-alphabetic characters (don't worry about punctuation rules for this task, e.g. "don't" would become "dont"). Remember to preserve the space " " between words.
4. Convert all text to lowercase
5. Find the total number of words after preprocessing
6. Find the total number of unique words after preprocessing

In [ ]:

# This cell contains the sample text string

sampleText = ("Northeastern University (NU or NEU) is a private research university in Boston, Massachusetts. Established in 1898, the university offers undergraduate and graduate programs on its main campus in Boston as well as satellite campuses in Charlotte, North Carolina; Seattle, Washington; San Jose, California; San Francisco, California; Portland, Maine; and Toronto and Vancouver in Canada. In 2019, Northeastern purchased the New College of the Humanities in London, England. The university's enrollment is approximately 19000 undergraduate students and 8600 graduate students.[5] It is classified among R1: Doctoral Universities – Very high research activity[6]."
              "Northeastern University (NU or NEU) is a private research university in Boston, Massachusetts. Established in 1898, the university offers undergraduate and graduate programs on its main campus in Boston as well as satellite campuses in Charlotte, North Carolina; Seattle, Washington; San Jose, California; San Francisco, California; Portland, Maine; and Toronto and Vancouver in Canada. In 2019, Northeastern purchased the New College of the Humanities in London, England. The university's enrollment is approximately 19000 undergraduate students and 8600 graduate students.[5] It is classified among R1: Doctoral Universities – Very high research activity[6].")

In [ ]:

# Your function should return 4 variables: alphaNumCount, uniqueNumCount, wordCount, uniqueWordCount

def textPreprocessing(textString):

    alphaNumCount, uniqueNumCount, wordCount, uniqueWordCount = (0, 0, 0, 0)

    # YOUR CODE HERE:

    return alphaNumCount, uniqueNumCount, wordCount, uniqueWordCount

def test_2c(sampleText):

    alphaNumCount, uniqueNumCount, wordCount, uniqueWordCount = textPreprocessing(sampleText)
    print("Total number of alphanumeric characters in the original input text: ", alphaNumCount)
    print("Total number of unique numeric characters in the original input text: ", uniqueNumCount)
    print("Total number of words after preprocessing: ", wordCount)
    print("Total number of unique words after preprocessing: ", uniqueWordCount)

test_2c(sampleText)
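
One possible implementation of textPreprocessing is sketched below. It assumes the built-in str.isalnum, str.isdigit and str.isalpha checks match what the exercise means by "alphanumeric", "numeric" and "alphabetic" characters, and it uses sets to count unique items.

In [ ]:

# One possible solution sketch for 2.c (assumes str.isalnum/isdigit/isalpha semantics)

def textPreprocessing(textString):

    # Steps 1-2: character-level counts on the original text
    alphaNumCount = sum(ch.isalnum() for ch in textString)
    uniqueNumCount = len({ch for ch in textString if ch.isdigit()})

    # Steps 3-4: keep only alphabetic characters and spaces, then lowercase
    cleaned = "".join(ch for ch in textString if ch.isalpha() or ch == " ").lower()

    # Steps 5-6: word-level counts after preprocessing
    words = cleaned.split()
    wordCount = len(words)
    uniqueWordCount = len(set(words))

    return alphaNumCount, uniqueNumCount, wordCount, uniqueWordCount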

Part 3: Pandas: data processing (10 points)
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. In this section, we will look at some of the preprocessing steps involved in analysing data.

We will be using the 'cars.csv' dataset [this csv file is provided inside your lab zipfile download].

Dataset resource: https://www.kaggle.com/abineshkumark/carsdata

About the dataset: the cars dataset contains information about cars of 3 brands/makes, namely US, Japan, and Europe. The target of the dataset is to find the brand of a car using parameters such as horsepower, cubic inches, make year, etc.

Look out for the 'YOUR CODE HERE:' comment in the following cells.

3.a) Loading a .csv file (2 points)

In [ ]:

# Loading a .csv file
# Use a pandas function to read a csv file (cars.csv)
# Store the csv file as a pandas dataframe called 'df'

# YOUR CODE HERE:
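
A minimal sketch for 3.a, assuming 'cars.csv' sits in the same directory as the notebook:

In [ ]:

# One possible solution sketch for 3.a

df = pd.read_csv('cars.csv')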

In [ ]:

# Viewing a sample of the dataframe

df.head(10)

3.b) Removing columns (2 points)
Find a Pandas function that can be used to remove columns based on the column name.

Store the modified dataframe in a new dataframe ‘df_new’

Remove the cubicinches, weightlbs, brand columns

In [ ]:

# Removing columns
# Remove the 'cubicinches', 'weightlbs', 'brand' columns
# Store the modified dataframe in a new dataframe 'df_new'

# YOUR CODE HERE:
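
A possible sketch for 3.b using DataFrame.drop. It assumes the column labels in cars.csv are exactly 'cubicinches', 'weightlbs' and 'brand'; some copies of this dataset ship with stray spaces in the header row, in which case the labels passed to drop would need to match those instead.

In [ ]:

# One possible solution sketch for 3.b

df_new = df.drop(columns=['cubicinches', 'weightlbs', 'brand'])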

In [ ]:

# Generate descriptive statistics after column-dropping

df_new.describe()

3.c) Min-max scaling (3 points)
Feature scaling is a technique for standardizing the independent features present in the data to a fixed range. It is performed during data pre-processing to handle features with highly varying magnitudes, values, or units. In this approach, the data is scaled to a fixed range, usually 0 to 1.

Min-max scaling is typically done via the following equation:

$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$

E.g. given the sequence [1, 2, 3, 4, 5],

the min-max scaled version would be [0, 0.25, 0.50, 0.75, 1.0].

In [ ]:

# Perform min-max scaling on all the columns of 'df_new'
# Store the modified dataframe in a new dataframe 'df_scaled'

# YOUR CODE HERE:
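
A possible sketch for 3.c, assuming every remaining column of df_new is numeric so the column-wise min()/max() reductions can be applied directly:

In [ ]:

# One possible solution sketch for 3.c: apply (X - X_min) / (X_max - X_min) column-wise

df_scaled = (df_new - df_new.min()) / (df_new.max() - df_new.min())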

In [ ]:

# Generate descriptive statistics after min-max scaling
# Observe that the min and max statistics have been modified

df_scaled.describe()

3.d) Standardization (3 points)
Standardization is a technique that re-scales a feature so that its distribution has a mean of 0 and a variance of 1.

In other words, standardization rescales data to have a mean (𝜇) of 0 and standard deviation (𝜎) of 1 (unit variance).

$X_{std} = \frac{X - \mu}{\sigma}$

Standardization can be visualized as a shifting and stretching/shrinking process.

In [ ]:

# Perform standardization on all the columns of 'df_new'
# Store the modified dataframe in a new dataframe 'df_std'

# YOUR CODE HERE:
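
A possible sketch for 3.d, again assuming the columns of df_new are numeric. Note that pandas' std() uses the sample standard deviation (ddof=1) by default, which is a common choice here:

In [ ]:

# One possible solution sketch for 3.d: apply (X - mean) / std column-wise

df_std = (df_new - df_new.mean()) / df_new.std()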

In [ ]:

# Generate descriptive statistics after standardization
# Observe that the mean and standard deviation statistics have been modified

df_std.describe()

In [ ]:

# Displaying histogram plots for the 'hp' column

plt.figure()
plt.subplot(3, 1, 1)
plt.title("Histogram: Original data - 'hp'")
ax1 = df_new["hp"].plot.hist(bins=50, alpha=1)
plt.subplot(3, 1, 2)
plt.title("Histogram: After min-max scaling - 'hp'")
ax2 = df_scaled["hp"].plot.hist(bins=50, alpha=1)
plt.subplot(3, 1, 3)
plt.title("Histogram: After standardization - 'hp'")
ax3 = df_std["hp"].plot.hist(bins=50, alpha=1)
plt.tight_layout()

In [ ]: