Biostat 625 Homework #4

In this homework, you will use R to access and analyze a dataset on Google Cloud
(https://cloud.google.com/) via BigQuery (https://cloud.google.com/bigquery) by following
the steps below:

1. Get familiar with the Google Cloud Console and BigQuery by going through the online
tutorial “Query the Wikipedia dataset in BigQuery” at
https://codelabs.developers.google.com/codelabs/cloud-bigquery-wikipedia/#0.

2. Get familiar with using R to work with BigQuery by going over the online tutorial “Data
science with R on Google Cloud: Exploratory data analysis tutorial” at
https://cloud.google.com/architecture/data-science-with-r-on-gcp-eda.

Note: Do NOT follow this tutorial step by step, as doing so requires creating a paid
account on Google Cloud to use the JupyterLab environment in the Vertex AI Workbench
product (although the $300 in free credits provided to new customers will likely cover all
of the cost). Instead, avoid creating such a paid account, read the tutorial to understand
it, and then use the BigQuery API directly from your local R, RStudio, or Jupyter
installation to conduct all the analyses.

3. Based on the tutorial, write a BigQuery SQL query to extract the baby’s weight, year, gender,
plurality, gestation weeks, mother’s age, and cigarette and alcohol use for 1000 babies from
each year from 2001 to 2008. (10 pts)

4. Randomly split the data you extracted in 3) into training and testing datasets, each with
50% of the data. Using the training dataset, train a prediction model with a method of your
choice (e.g., regression model, SVM, random forest, etc.) to predict the baby’s weight from
all other variables. Test your prediction model on the testing dataset. Report both training
and testing errors using the RMSE (root mean squared error,
https://en.wikipedia.org/wiki/Root-mean-square_deviation) metric. (10 pts)

5. Based on the tutorial, write a BigQuery SQL query to extract the number of babies and the
average baby’s weight for each year from 2001 to 2008, and plot both variables against
year. (10 pts)

6. Conduct the same analyses from 3) to 5), but use dplyr and DBI to extract data instead of
directly writing SQL queries. (10 pts)

7. Print your R source code (e.g., in R, Rmd, or ipynb) together with all run-time output and
figures as a single PDF file and submit it. Your code and output should be clearly written,
well organized, and documented.
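
For step 3, one possible way to take 1000 babies from each year is to stack one small per-year query per year with UNION ALL. The sketch below is only an illustration under stated assumptions: the table name `bigquery-public-data.samples.natality` and the column names are taken to match the public natality sample used in the tutorial, `build_natality_query` is a hypothetical helper, and `GCP_BILLING_PROJECT` is a hypothetical environment variable holding your own billing project ID. Note that LIMIT without ORDER BY returns arbitrary rows rather than a random sample.

```r
# Sketch for step 3: build one standard-SQL query that takes up to
# `n_per_year` rows from each year and stacks them with UNION ALL.
# Assumptions: table and column names follow the public natality sample;
# LIMIT without ORDER BY returns arbitrary rows, not a random sample.
build_natality_query <- function(years = 2001:2008, n_per_year = 1000) {
  per_year <- sprintf(
    paste(
      "(SELECT weight_pounds, year, is_male, plurality, gestation_weeks,",
      "        mother_age, cigarette_use, alcohol_use",
      " FROM `bigquery-public-data.samples.natality`",
      " WHERE year = %d",
      " LIMIT %d)",
      sep = "\n"
    ),
    as.integer(years), as.integer(n_per_year)
  )
  paste(per_year, collapse = "\nUNION ALL\n")
}

query <- build_natality_query()

# Run the query only when a billing project is configured and the
# bigrquery package is installed; otherwise just keep the query string.
# GCP_BILLING_PROJECT is a hypothetical environment variable name.
billing <- Sys.getenv("GCP_BILLING_PROJECT")
if (nzchar(billing) && requireNamespace("bigrquery", quietly = TRUE)) {
  job <- bigrquery::bq_project_query(billing, query)
  natality <- bigrquery::bq_table_download(job)
  print(table(natality$year))  # row count per year
}
```

Calling `cat(query)` prints the generated SQL, which can be pasted into the BigQuery console to check it before running it from R.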

Total points: 40 pts.
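
For step 4, the split-and-evaluate workflow might look like the minimal sketch below. It uses simulated stand-in data rather than the real natality rows, so the printed numbers mean nothing by themselves; the data frame `babies`, its coefficients, and the `rmse` helper are all illustrative assumptions, and a linear regression is just one of the methods the assignment allows.

```r
# Sketch for step 4: 50/50 random split, linear regression, RMSE.
# The data frame below is simulated stand-in data, NOT the natality table;
# replace it with the rows downloaded from BigQuery.
set.seed(625)
n <- 8000
babies <- data.frame(
  gestation_weeks = round(rnorm(n, mean = 39, sd = 2)),
  mother_age      = sample(16:45, n, replace = TRUE),
  is_male         = sample(c(TRUE, FALSE), n, replace = TRUE)
)
# Made-up relationship, used only so the model has something to fit.
babies$weight_pounds <- 0.2 * babies$gestation_weeks +
  0.25 * babies$is_male + rnorm(n, sd = 1)

# Root mean squared error between observed and predicted values.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Random 50/50 split into training and testing rows.
train_idx <- sample(seq_len(n), size = n / 2)
train <- babies[train_idx, ]
test  <- babies[-train_idx, ]

# Fit on the training half only, using all other columns as predictors.
fit <- lm(weight_pounds ~ ., data = train)

train_rmse <- rmse(train$weight_pounds, predict(fit, newdata = train))
test_rmse  <- rmse(test$weight_pounds,  predict(fit, newdata = test))
cat(sprintf("Training RMSE: %.3f\nTesting RMSE: %.3f\n", train_rmse, test_rmse))
```

With the real data, rows are likely to contain missing values, so some handling such as `na.omit()` before the split would probably be needed; that choice is an assumption and not part of the assignment text.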
