INSTRUCTIONS: If you prefer to work on the bonus question, please create a folder called cgi-bin under the final folder and put all of the source code / setup needed to host an interactive website into cgi-bin. Also, if you complete the bonus question, please email me the link to your website, or show me your website in person.

To increase your chances of earning partial credit, comment your code thoroughly and organize it well using best principles to aid readability and decrease redundancy.

Problem 1 Training a Machine Learning Classifier (50pts)

NOTE: The main thing that you will be graded on here is your overall performance measured by ROC_AUC when your classifier is used to make predictions on the testing set. MAKE SURE you do not accidentally train on the testing set or your performance metric will be invalid and you will lose points.

That being said, I have a very lengthy description below. DO NOT feel that you HAVE to include all of these things. Most of this is only important for evaluating partial / bonus credit. If your classifier performs well and everything checks out I won’t specifically look for all the intermediate work. Just organize a Notebook / python script to generate the classifier that is readable (GOOD COMMENTS) / makes sense, and I won’t be too strict about the exact intermediate material that gets you to your final classifier.

Problem / Data Sets¶
For the first chunk of your final, we will train a random forest classifier to predict secondary structure (namely alpha-helices) from sequence data and any additional features that can be derived purely from knowledge of protein sequence. I have provided you with a training set and a testing set containing UniProt ID / Protein Position pairs along with their label. A zero here indicates that the position has either unknown structure or is known not to form an alpha-helix. A one here indicates that the position is known to form an alpha-helix.
All of this information (and any other information I provide you) is included in the public/final_files directory.
• Training_Labels – A table containing the labels for protein positions that you may use to train your classifier.
• Testing_Labels – A table containing the labels for protein positions that you ARE ONLY TO USE FOR FINAL PERFORMANCE EVALUATION. You SHOULD NOT touch this data until you have COMPLETELY TRAINED YOUR CLASSIFIER. The only reason you have access to this data is that I know if I try to automatically apply your classifier to the testing set, I will have to manually fix the code for a majority of you. I reserve the right to change this data set at the last minute to make sure that no one has trained their model using it.

Feature Starting Points¶
To train your classifier, you are encouraged to come up with whatever features may be informative. I have provided you with a number of amino acid scales which you may wish to use to derive some basic sequence features. These scales are all obtained from the following site (https://web.expasy.org/protscale/) and are stored under public/final_files:
• Expasy_AA_Scales – A table containing all of the amino acid scales included on Expasy. Rows indicate the 20 amino acids, columns indicate the value for that amino acid in the given scale.
You may use this lookup, for instance, to convert the amino acid at each position in a protein sequence to the corresponding quantitative value on an Expasy scale.
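For instance, a minimal lookup sketch using pandas (the exact filename, separator, and scale column name below are assumptions, so adjust them to match the actual files under public/final_files):

import pandas as pd

# Load the scale table; rows are amino acids, columns are scales
scales = pd.read_csv("public/final_files/Expasy_AA_Scales.txt", sep="\t", index_col=0)

# Convert a toy sequence to values on one (hypothetically named) scale
seq = "MKVLAA"
feature = [scales.loc[aa, "Hphob. / Kyte & Doolittle"] for aa in seq]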
You may additionally want to do transformations that take into account the sequence context surrounding a position. For instance, looking in a window x residues around a position, what is the average value on one of the amino acid scales (blurring the feature)? The max? The min? How close / far away is the nearest residue that falls above or below a certain threshold? It may also make sense to rank these features (specifically the blurred values) to figure out which positions have the highest values relative to the protein they belong to. Or you may want to normalize feature values per protein or globally.
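As a rough sketch of such windowed transformations, here is one way to do the blurring with pandas (the toy values and half-window size are placeholders for a real scale feature):

import pandas as pd

# Toy per-position scale values standing in for a real protein
vals = pd.Series([1.8, -3.9, 4.2, 1.8, 1.8, -0.4, 2.8])

x = 2  # half-window size (an arbitrary choice)
win = vals.rolling(window=2 * x + 1, center=True, min_periods=1)

blur_mean = win.mean()  # windowed average (the "blurred" feature)
blur_max = win.max()    # windowed max
blur_min = win.min()    # windowed min

# Rank the blurred values within the protein, or z-score normalize them
blur_rank = blur_mean.rank(pct=True)
blur_z = (blur_mean - blur_mean.mean()) / blur_mean.std()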
This is a difficult machine learning problem. By that I do not mean that it is difficult to receive credit for the assignment, but this is not a dataset that is easy to classify. You should not expect to see excellent performance. I do not expect to see any ROC_AUC values above 0.75. However, working with the potential features that I have provided / detailed how to generate, you should be able to achieve a performance around 0.70.
Anyone who comes up with additional (and informative) features not dependent on the Expasy AA Scales / neighboring sequence context will receive extra credit for their ingenuity. You may UNDER NO CIRCUMSTANCES fetch the actual labels and use them as a feature. If you go for instance to UniProt, you could scrape all of the secondary structure annotations off of the website (https://www.uniprot.org/uniprot/P30615). Any attempt to do this will result in NO CREDIT.
I have also given you two additional tables…
• UniProt2Seq – Contains UniProt, Position, Amino Acid triples (this is technically redundant with your Testing_Labels and Training_Labels tables).
• UniProtPos2PDBPos – Contains a mapping between UniProt and PDB positions. This SHOULD ONLY be used if you try to visualize the protein and your predictions for extra credit, and MAY NOT be used for any part of your features.

Feature Saving¶
All code used to generate any tables you create should be available to me somewhere. If done through python, I want to see the code.
You can also achieve this by keeping your code / commands in a text file if you are more comfortable with text files.

Feature Selection¶
I will not require you to do any formal feature selection / filtering of features, but you are encouraged to do some data exploration to give me an approximate estimate of which features are contributing most to your classifier. Some things you may want to consider (a short pandas sketch follows this list)…
• How correlated are different features to one another (are there many redundant features)? Can you weed out redundant features?
▪ https://stackoverflow.com/questions/29432629/correlation-matrix-using-pandas
• How correlated is each feature with the final label?
• How much predictive power can you get from a limited number of splits on each feature individually?
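For instance, pandas can compute both kinds of correlation directly; X and y below are random stand-ins for your real feature table and labels:

import numpy as np
import pandas as pd

# Toy stand-ins: rows are positions, columns are features
X = pd.DataFrame(np.random.rand(100, 4), columns=["f1", "f2", "f3", "f4"])
y = pd.Series(np.random.randint(0, 2, 100))

corr = X.corr()             # feature-feature correlation matrix
label_corr = X.corrwith(y)  # correlation of each feature with the label

# Flag near-duplicate feature pairs (the 0.9 cutoff is arbitrary)
redundant = (corr.abs() > 0.9) & (corr.abs() < 1.0)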
If you want to do feature selection, that may gain you bonus points / favor with me. A rough algorithm for VERY GREEDY (i.e. not optimal) feature selection is as follows…
1. Start with an empty bag of features.
2. Train / evaluate the classifier using each of the candidate features alone.
• Or in combination with the current feature bag.
3. Select the single feature that gives the best performance and add it to the feature bag.
4. Repeat steps 2 and 3 until desired number of features is selected or the addition of new features no longer considerably increases performance.
5. Use final selection of features for your final classifier.
If you do anything like this include a plot that shows how your performance changes with 1 to n features (Should be a curve that flattens out at the top).
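A minimal sketch of that greedy loop, assuming your candidate features are already assembled in a DataFrame X with labels y (hypothetical names carried over from the exploration sketch above):

from sklearn import ensemble
from sklearn.model_selection import cross_val_score

selected, remaining, history = [], list(X.columns), []
while remaining:
    # Step 2: score each candidate feature combined with the current bag
    scores = {feat: cross_val_score(ensemble.RandomForestClassifier(n_estimators=100),
                                    X[selected + [feat]], y,
                                    cv=5, scoring="roc_auc").mean()
              for feat in remaining}
    # Step 3: add the single best feature to the bag
    best = max(scores, key=scores.get)
    # Step 4: stop once a new feature no longer considerably helps (0.005 is arbitrary)
    if history and scores[best] - history[-1] < 0.005:
        break
    selected.append(best)
    remaining.remove(best)
    history.append(scores[best])

# Plot history against range(1, len(history) + 1) for the flattening curve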

Performance Metrics¶
Use whatever performance metrics you consider necessary to judge the performance of your model. The only formal metric that I expect to see is a ROC curve with an AUC value, but I will appreciate seeing anything else that was relevant to you in working through the problem.
You MUST be able to get an accurate measure of your performance. If this value is considerably off from your performance on the testing set (for instance, 0.95 vs. 0.68), then you have overfit your model. Since you are not allowed to evaluate your performance on the testing set, you must implement some method of splitting the training data. You may either generate a single train-test split on the training data, or you may implement a cross-validation method to get an accurate estimate of your performance.
Make sure that you do not mix your labels or other uninformative information (e.g. UniProt ID or Position) into your features. If you suddenly see yourself with > 0.90 performance, make sure you didn’t mess something up and train with your labels or train on your testing / validation set.
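One possible sketch of such an internal evaluation, assuming your features and labels are in X and y (hypothetical names), using a single stratified train-test split:

from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Hold out part of the TRAINING data; the provided testing set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            stratify=y, random_state=0)

clf = ensemble.RandomForestClassifier(n_estimators=100)
clf.fit(X_tr, y_tr)

# Evaluate on the held-out internal validation split
probs = clf.predict_proba(X_val)[:, 1]
print("Internal validation ROC_AUC:", roc_auc_score(y_val, probs))

fpr, tpr, _ = roc_curve(y_val, probs)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()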

The Classifier¶
Your classifier should be a basic random forest. You should modify the random forest to include several helper functions.
1. A function to generate all of your features given a UniProt ID.
2. A function to evaluate your performance / generate whatever performance visualizations you feel appropriate.
3. A function to make predictions (both binary classifications and probabilities) for all positions in a UniProt given only its ID.
• Must call your function to generate features and then call its own predict or predict_proba functions
• This is the only one that I definitely want to see.
You can add a function to a classifier as follows…
from sklearn import ensemble

# Create Function
def add(x, y):
    return x + y
# FUNCTION END

# Create Classifier
clf = ensemble.RandomForestClassifier()

# Add the function as an attribute of the classifier
clf.add = add

# Use this added function
clf.add(4, 5)
To accomplish this you may need to pass the classifier itself into the function (so that it can call its own functions). So your code may look something like this…
# Create Function for obtaining features
def get_features(UniProt):
    # Get / return the features (placeholder, your feature code goes here)
    pass
# FUNCTION END

# Create Function for obtaining predictions
def predict_uniprot(self, UniProt):
    # Get features
    X = self.get_features(UniProt)

    # Make / return predictions
    return self.predict(X)
# FUNCTION END

clf = ensemble.RandomForestClassifier()
clf.get_features = get_features
clf.predict_uniprot = predict_uniprot

clf.predict_uniprot(clf, "P30615")
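A predict_uniprot_proba companion (required in the Formal Expectations below) can be attached in exactly the same way; a minimal sketch:

# Create Function for obtaining prediction probabilities
def predict_uniprot_proba(self, UniProt):
    # Get features
    X = self.get_features(UniProt)

    # Return the probability of the positive (alpha-helix) class
    return self.predict_proba(X)[:, 1]
# FUNCTION END

clf.predict_uniprot_proba = predict_uniprot_proba
clf.predict_uniprot_proba(clf, "P30615")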

Formal Expectations¶
1. Your overall notebook should contain all of the code required to generate your model as well as any visualizations / data exploration that you did. This notebook should be commented EXTREMELY well so that I do not have to read any of your code to understand what you are doing.
• You do not have to work in THIS notebook, you may make a clean notebook as long as you name it Final.ipynb and I can find it.
• Technically you could also just make one python script (.py file rather than .ipynb file). But make sure it is extremely clear to me where this is (name it Final.py) and that it is a VERY READABLE python script if you opt not to construct your classifier in a Notebook.
2. You must derive features on your own, explain what you did to derive them, and why you think that may be a valuable feature.
3. You must include some data exploration to see how (at least some of) these features separate the data by positives and negatives (colored scatter-plot, violin plot, boxplot, or similar). You should also include some evaluation of correlation between features.
4. You should store all features in a table (or tables) so that they can be looked up rather than re-calculated every time. Not technically required, but if you don’t do this and your code / web page takes an obscene amount of time to load, I won’t be happy.
5. For performance, you must at the very least provide a ROC curve showing your performance on the training set, on whatever internal testing / validation set you use, and finally on the testing set.
• Depending on what kind of internal validation you use, it may not be possible / necessary to show a ROC curve, but you should at least be able to provide an average AUC.
6. Your final classifier must be saved / exported as an object so that I can load it and evaluate it if necessary. Save your classifier as “final_clf.joblib” (see the sketch after this list).
• https://scikit-learn.org/stable/modules/model_persistence.html
7. Your classifier should include a function called predict_uniprot that generates predictions for a given UniProt. And one called predict_uniprot_proba that generates prediction probabilities for a given UniProt.
• This is required both so that you can use your classifier on your web-site without having to re-train it from scratch, and so that I can work with your classifier easily if I need to.
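A minimal persistence sketch with joblib (following the scikit-learn link above):

from joblib import dump, load

# Save the trained classifier...
dump(clf, "final_clf.joblib")

# ...and load it later (e.g. in your cgi-bin script) without retraining.
# Note: functions attached as attributes are pickled by reference, so
# get_features / predict_uniprot must be defined or importable wherever
# you load the classifier.
clf = load("final_clf.joblib")
clf.predict_uniprot(clf, "P30615")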


Bonus Question (25pts): A Website to Access Your Predictions

(There are still some technical issues to be solved related to setting up the web server on the teaching server, and it will take about 3 more days to fix them. I will post an announcement when the problems are solved.

If you plan to start this question right now, I recommend setting up a web server on your local computer (search online for how to do that) and beginning work there; you can transfer all of the files to the teaching server after it has been fixed.)


After the teaching server has been fixed: 

The teaching server has been configured to use Apache to display web pages. Create a folder called ‘final’ in your home directory, and within this folder, create another folder called ‘cgi-bin’. It is essential that you name these folders correctly, otherwise your site will not display. If you have configured this correctly, you will be able to view .html and .shtml files through a browser using addresses of the following form:
• http://bscb-teaching.cb.bscb.cornell.edu/~sdw95/index.shtml
• http://bscb-teaching.cb.bscb.cornell.edu/~sdw95/cgi-bin/Results.cgi?gene_search=MAD2
Here you will replace sdw95 with your own user name. The examples linked above are from a previous TA. You need to be on campus or connected to the VPN to view these links in a web browser.

Your task is to design a website to make predictions / display results for given UniProt IDs based on your classifier. Your website should be designed however you want but should roughly at least have the following functionality and pages…
NOTE: Again, these are loose guidelines. Make a website I will be happy with; if you add creative / neat features that are not exactly the ones I describe below, I won’t be too strict about the specifics as long as it is a good website.
Index.shtml¶
A basic home page that provides at the minimum a text box where users can input UniProt IDs, click a submit button, and be redirected to a results page (python cgi-bin script). There should be additional options to obtain results as raw text, and to obtain results as binary predictions or predicted probabilities.
Your website may opt to only accept UniProt IDs included in the UniProt2Seq.txt (so that you do not need to fetch any additional sequences). But you may choose to try to fetch new sequences yourself for extra points. You may also add the possibility to make predictions on a user-provided sequence rather than a sequence looked up from a UniProt ID.
The home page should also include links to all of the other pages in a header or other easily navigable format.
Results Page¶
A basic cgi-bin script that takes in at least one parameter (probably provided through POST) indicating which UniProt ID is being queried, and generates results for that protein. Results should be reported in a tabular format including, at minimum, Position and Prediction columns. You may also include a Prediction Probabilities column and may add a parameter to select which columns are included in the output. You may also want to include a column containing the true label where applicable (if you do, do so only for training-set proteins, so that I can compare with a testing ID that does not have the true label available).
You may include additional text / information about the UniProt entry (you would have to fetch this from UniProt’s website, since I don’t explicitly provide it) to make this results page more appealing / informative.
You may include download links to obtain the raw features used in classification.
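A minimal sketch of such a script, assuming the classifier was saved as final_clf.joblib and the form parameter is named uniprot_id (both hypothetical choices):

#!/usr/bin/env python3
# Results.cgi: look up a UniProt ID and print predictions as an HTML table
import cgi
from joblib import load

form = cgi.FieldStorage()
uniprot = form.getvalue("uniprot_id", "")

# The attached helper functions must be defined / importable here too
clf = load("final_clf.joblib")
preds = clf.predict_uniprot(clf, uniprot)

print("Content-Type: text/html\n")
print("<html><body><table border='1'>")
print("<tr><th>Position</th><th>Prediction</th></tr>")
for pos, pred in enumerate(preds, start=1):
    print("<tr><td>{}</td><td>{}</td></tr>".format(pos, pred))
print("</table></body></html>")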
Performance Page¶
A page that includes information on your performance. It should include whatever metrics you feel appropriate (at least a ROC curve with AUC values). This information should be included for your training set and the testing set (and, if relevant, whatever internal control you used).
Help Page¶
A basic page with instructions on how to use your website / outlining any hidden features that I might not know about.
About Page¶
A basic page including details on what the purpose of the site is. It should include details on the features that you calculate / use. This is a main summary for you to tell me what you did, why, and how it worked out for you for the final. You should also describe any extra features that you added / worked on beyond the expected scope of the final (i.e. anything that you would like me to take particular note of in evaluating potential bonus credit).