COMP132 Assignment #6
Column      Explanation
age         Age of primary beneficiary.
sex         Gender of the insurance contractor: female or male.
bmi         Body Mass Index. Objective index of body weight (kg/m^2), the ratio of weight to height squared.
children    Number of children covered by health insurance.
smoker      Regular smoker?
region      The beneficiary’s residential area in the US: northeast, southeast, southwest, or northwest.
charges     Individual medical costs billed by health insurance.
group. What could you conclude from the plots? Briefly explain your observation. [8 pts]
HINT: Python packages such as “seaborn” can draw this plot in a single short command. To practice your programming skills, however, try to use only the “matplotlib” package covered in the lecture.
HINT: A good first command is “groupby”, which groups the dataset by “smoker”. The “groupby” method returns a special GroupBy object that behaves much like a dictionary. Print its keys and values (see lecture Week9.3, slide 6, for help). Also try printing the data you need (two columns) first, and then use that data to make the plot.
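The grouping step described in the hint can be sketched as follows. This is a minimal illustration on a made-up five-row DataFrame, not the assignment data; the values are hypothetical, and only the column names “smoker” and “charges” come from the table above.

```python
import pandas as pd

# Hypothetical miniature of the insurance data: only the two columns needed here.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [30000.0, 4000.0, 6000.0, 28000.0, 5000.0],
})

grouped = df.groupby("smoker")

# .groups is a dict-like mapping: group label -> index labels of the rows in that group.
for key, rows in grouped.groups.items():
    print(key, list(rows))

# Iterating over the GroupBy object yields (label, sub-DataFrame) pairs.
charges_by_group = {label: part["charges"].tolist() for label, part in grouped}
print(charges_by_group)
```

Each list in charges_by_group can then be passed to a matplotlib plotting call, one call per group.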
Your plot should look like this:
1.2 Simple Linear Regression (24 pts)
Simple linear regression is a linear regression model with a single explanatory variable (feature). In our case, let us first investigate the relationship between the feature “bmi” and the target “charges”.
(a) Extract the two columns from the dataframe and save them to the variables X and y, respectively. Then split the feature X and the target y into training and test sets with a ratio of 70:30. Print the sizes of the training and test sets. [4 pts]
(b) Build a training model using the LinearRegression algorithm and fit your model to the training data. [4 pts]
(c) Before touching the test data, first make predictions on the training set. Print the training data and your predicted data together. [4 pts]
(d) Using the training set, calculate Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). [4 pts]
(e) Make predictions on the test set. Evaluate your model using the same error measurements. [4 pts]
(f) Make a scatter plot using the original values and the predicted values of y of your test set. [4 pts]
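The workflow in steps (a) to (e) can be sketched roughly as below. The data here is synthetic (an assumption standing in for the “bmi” and “charges” columns), and random_state=42 and the helper name report are arbitrary choices for illustration, not values prescribed by the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real columns (assumption, not the assignment data).
rng = np.random.default_rng(0)
X = rng.uniform(18, 45, size=(100, 1))            # one feature, shaped (n_samples, 1)
y = 400.0 * X[:, 0] + rng.normal(0, 2000, 100)    # noisy linear target

# 70:30 split; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)

model = LinearRegression().fit(X_train, y_train)

def report(name, y_true, y_pred):
    """Print and return MAE, MSE and RMSE for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # RMSE is the square root of MSE
    print(f"{name}: MAE={mae:.1f} MSE={mse:.1f} RMSE={rmse:.1f}")
    return mae, mse, rmse

train_scores = report("train", y_train, model.predict(X_train))
test_scores = report("test", y_test, model.predict(X_test))
```

Note that scikit-learn expects the feature matrix to be two-dimensional, so a single column still needs the shape (n_samples, 1).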
1.3 Multiple Linear Regression (24 pts)
Apart from the feature “bmi”, there might be linear relationships between other features and the medical cost “charges”. Now we will create a model considering all the features in the dataset.
(a) Check the data types of the features and state the differences between “bmi” and other features. Briefly explain why some features cannot be directly fed into the learning model. [4 pts]
(b) Create the dummy variables for the categorical features and print the data information of the dataframe. [4 pts]
HINT: You will need to change the data type to “category” for some features, and then call a function like this:
new_dataframe_name = pd.get_dummies(dataframe_name, drop_first=True)
(c) Extract all the features you now have and put them in X2. Repeat all the steps you carried out for simple linear regression and build a new multiple linear regression model. Evaluate your model using the same measurements on the test data and compare it with the simple model in which “bmi” is the only feature. [12 pts]
NOTE: To make a fair comparison, we need to use the same test dataset. So when you split the training and test dataset, keep “random_state” the same as the number you used in 1.1(a).
(d) Plot the original y values against the predicted y values again. Do you see any change in the pattern? Briefly explain your observations. [4 pts]
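As a rough sketch of the dummy-variable step in (b), the following uses a made-up three-row DataFrame; the rows are hypothetical and only the column names follow the table at the top of the assignment.

```python
import pandas as pd

# Hypothetical three-row sample with one numeric and two categorical features.
df = pd.DataFrame({
    "bmi": [27.9, 33.8, 22.7],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Mark the text columns as categorical before encoding.
for col in ["smoker", "region"]:
    df[col] = df[col].astype("category")

# drop_first=True drops one level per feature; the dropped level acts as the baseline.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
encoded.info()
```

With drop_first=True a feature with k levels produces k - 1 indicator columns, which avoids feeding the linear model a redundant, perfectly collinear column.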
Question 2: K-Nearest Neighbors (KNN) (40 pts)
The classification algorithm we use in this assignment is the K-Nearest Neighbors (KNN) algorithm. The dataset consists of several medical features, including the number of pregnancies the patient has had, their BMI, insulin level, age, etc., and one target, “Outcome”.
(a) Download the dataset diabetes.csv from the assignment page. Extract the features and target to two variables, and print the first 5 records separately. [4 pts]
(b) Make a plot showing the distribution of the target. Use a proper legend and add a title to your plot. Briefly comment on your observation of the distribution. [4 pts]
NOTE: Be aware of the axis information. The x axis of the plot should show either 0 or 1, and the y axis should show the count of the corresponding outcome group.
(c) Create the training and test sets with a ratio of 60:40. [4 pts]
(d) Create an instance (estimator object) of the KNN class in which K = 1 and fit it to the training set. [4 pts]
(e) Predict the outcome of the test set. Measure the mean accuracy of your KNN model using the score method of the KNeighborsClassifier class and print the result. [4 pts]
(f) Create a new KNN model with K = 30 and repeat the above steps. Which model has a better accuracy score? [5 pts]
(g) Measure the accuracy score of KNN models on both the training set and the test set with K ranging from 1 to 10. Store the accuracy scores in two lists, “training_accuracy” and “testing_accuracy”. Make a plot with K on the x axis and the accuracy scores on the y axis. Explain how to choose K to get the highest accuracy score. [15 pts]
NOTE: Make sure you keep using the same training and test data for training and evaluating the models.
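The sweep over K in step (g) might look something like the sketch below. It runs on synthetic data from make_classification (an assumption standing in for diabetes.csv), and random_state=0 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for diabetes.csv (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 60:40 split, done once so that every model sees the same data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

training_accuracy, testing_accuracy = [], []
k_values = range(1, 11)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data and labels.
    training_accuracy.append(knn.score(X_train, y_train))
    testing_accuracy.append(knn.score(X_test, y_test))

# K with the highest test accuracy (ties resolve to the smallest K).
best_k = max(k_values, key=lambda k: testing_accuracy[k - 1])
print(best_k, testing_accuracy[best_k - 1])
```

With K = 1 every training point is its own nearest neighbor, so the training accuracy is perfect; that is why the test curve, not the training curve, is the one to read when choosing K. The two lists can be plotted against k_values with matplotlib.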
Question 3: BioPython (40 pts)
The aim of this question is to learn the fundamental features of BioPython. BioPython is a set of freely available tools for biological computation written in Python by an international team of developers.
Please note that you will need to install BioPython on your own computer.
All the packages you may need to import into Jupyter Notebook for this question are as follows:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import GC
from Bio.Seq import transcribe
from Bio.Data import CodonTable
3.1 Sequence Basics [16 pts]
(a) Create a sequence object seq from “GATCGATGGGCCTATATAGGATCGAAAATCGC”. Print the length of the sequence. [2 pts]
(b) Print a slice of seq, which contains the 4th–14th letters. [2 pts]
(c) Count the occurrences of the pattern ‘TA’ in seq. [2 pts]
(d) Print seq using all lower case letters and then print it again using all upper case letters. [2 pts]
(e) Change the 5th letter of seq to ‘T’. [2 pts]
(f) Check and print the GC content of seq. The GC content is the number of G and C nucleotides divided by the total number of nucleotides. [2 pts]
(g) Count the proportions of A, T, G and C in seq. [2 pts]
(h) Print the complement and reverse complement of seq. [2 pts]
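The operations in this subsection can be sketched as below. The GC content is computed by hand with count rather than with the GC helper from the import list, on the assumption that helper names in Bio.SeqUtils may differ between Biopython releases; the variable names mutable, edited, gc_content and proportions are illustrative.

```python
from Bio.Seq import MutableSeq, Seq

seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")

print(len(seq))            # sequence length
print(seq[3:14])           # 4th-14th letters: slices are 0-based and end-exclusive
print(seq.count("TA"))     # non-overlapping occurrences of 'TA'
print(seq.lower(), seq.upper())

# Seq objects are immutable, so edit a MutableSeq copy instead.
mutable = MutableSeq(str(seq))
mutable[4] = "T"           # 5th letter has index 4
edited = Seq(str(mutable))
print(edited)

# GC content computed by hand, independent of any Bio.SeqUtils helper.
gc_content = (seq.count("G") + seq.count("C")) / len(seq)
proportions = {base: seq.count(base) / len(seq) for base in "ATGC"}
print(gc_content, proportions)

print(seq.complement())
print(seq.reverse_complement())
```

Note that count only finds non-overlapping matches, which matters for repetitive patterns.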
3.2 Advanced Sequence Operations [10 pts]
(a) Read the file cor6_6.fasta, and save the output as record1. Print the id, description, and sequence of the first 3 records in record1. [2 pts]
(b) Print the maximum sequence length of record1. [2 pts]
(c) Find all sequences in record1 with a length less than 300. Print their ids. [2 pts]
(d) Make a line plot to show the length of each sequence in record1. [2 pts]
(e) Make a plot to show the number of times ‘AAAAAAA’ appears in each sequence in record1. [2 pts]
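A minimal sketch of reading and summarising FASTA records follows. To stay self-contained it parses a small in-memory FASTA string (hypothetical records, not the assignment file) and uses a length threshold of 10 instead of 300; with a real file, the path would be passed to SeqIO.parse instead of the StringIO handle.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA text standing in for the assignment file.
fasta_text = """>rec1 first example
GATCGATGGGCC
>rec2 second example
AAAAAAATTTAAAAAAAC
>rec3 third example
ACGT
"""

# SeqIO.parse returns an iterator; list() allows indexing and repeated loops.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for rec in records[:3]:
    print(rec.id, rec.description, rec.seq)

lengths = [len(rec.seq) for rec in records]
print(max(lengths))

# Ids of sequences shorter than the (illustrative) threshold of 10.
short_ids = [rec.id for rec in records if len(rec.seq) < 10]

# Non-overlapping occurrences of the poly-A pattern in each sequence.
poly_a_counts = [rec.seq.count("AAAAAAA") for rec in records]
print(short_ids, poly_a_counts)
```

The lists lengths and poly_a_counts have one entry per record, so either can be handed directly to a matplotlib plotting call.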
3.3 Transcription and Translation [14 pts]
(a) Read the file ls_orchid.gbk, and save the output as record2. Define a function that finds a particular record with the given id, and returns its sequence length and index. Call your function three times with the parameters Z78514.1, Z78507.1 and Z78476.1. [4 pts]
(b) Print the transcription of the first 10 records of record2. [2 pts]
(c) Create a sequence record of DNA, RNA and Proteins using AGTGCACAGT. [2 pts]
(d) Connect the two sequences GATC and AGTACACTGGT and print the new sequence. [2 pts]
(e) Reverse transcribe the sequence “AUGCCGAUCGUAU” into a DNA sequence. [2 pts]
(f) Translate the RNA saved in rosalind_prot.txt into proteins based on the standard codon table. [2 pts]
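The transcription, back-transcription, translation and concatenation steps can be sketched on a short example. The DNA string below is a hypothetical coding sequence chosen to be a multiple of three letters long and to end in a stop codon; it is not taken from the assignment files.

```python
from Bio.Seq import Seq

# Hypothetical coding DNA: 8 codons, the last one a stop codon.
dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

rna = dna.transcribe()           # DNA -> RNA (T replaced by U)
back = rna.back_transcribe()     # RNA -> DNA (U replaced by T)

# Translate with the standard codon table, stopping at the first stop codon.
protein = rna.translate(table="Standard", to_stop=True)

# Seq objects concatenate with +, like Python strings.
joined = Seq("GATC") + Seq("AGTACACTGGT")

print(rna)
print(back)
print(protein)
print(joined)
```

Biopython names the RNA-to-DNA direction back_transcribe. Translation reads the sequence three letters at a time, so a sequence whose length is not a multiple of three leaves a partial codon at the end.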