
Final Project

Final Project Guidelines and Evaluation Criteria

Sections:
1. Standard final projects guidelines
2. Data set selection
3. Include three supervised learning models taught in class
4. Provide and describe a variety of quality graphics used for data exploration
5. Optional inclusion of clustering methods and graphics
6. What to include in the report
7. What to avoid
8. Some general suggestions for writing the paper and/or the presentation outline
9. Standard project evaluation
10. Special projects

1. Standard final projects guidelines

In a standard final project the team selects, explores and describes different aspects of one to three data sets using a combination of models, graphics and well-written text. The selection of one data set may suffice.

The data exploration project is to feature a linear regression model and two other supervised learning models from those listed in Section 3. The project is also to show a variety of quality exploratory graphics based on the methodology taught in class.

The class presentation time is limited to 7 minutes. The project report is limited to 16 pages.

The report may include an appendix that does not count against the page limit. This may include documentation of extra effort for data gathering and preparation. Such documentation may have a small impact on the grade. Other inclusions such as R scripts and enlarged versions of key project graphics are nice for archival purposes but are not considered in the grading process.

2. Data set selection

A team can select a single data set of sufficient size and complexity (in terms of cases and variables) to support the full variety of models and exploratory graphics needed for this project. The whole data set does not need to be used. It is also okay for each team member to select a different data set.

Some data sets take a lot of pre-processing to be suitable for use in analysis tools. Pre-processing is not the focus of this class or this class project. Little credit is given for it, so it is advisable to select data that involves little or no pre-processing.

If much pre-processing is done, an appendix entry can provide evidence concerning the level of effort.

3. The project is to include a linear regression model and two supervised learning models taught in class

For count data, the supervised learning model options include one of three generalized linear models (logit, probit, or Poisson models). Random forest models can be used for either classification or regression. A lasso model can be used for regression.
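As a minimal sketch of the Poisson option, the following uses base R's glm() on the built-in warpbreaks data set. The data set and the model formula are assumptions chosen only for illustration; they are not taken from the class materials.

```r
# Poisson regression for a count response using base R only.
# warpbreaks is a built-in data set: number of warp breaks per loom,
# by wool type (A, B) and tension level (L, M, H).
data(warpbreaks)

pois_fit <- glm(breaks ~ wool + tension,
                family = poisson(link = "log"),
                data = warpbreaks)

# Tabular summary of the kind to include and discuss in the report
print(summary(pois_fit))

# Coefficients are on the log scale; exponentiating gives
# multiplicative effects on the expected count.
rate_ratios <- exp(coef(pois_fit))
print(round(rate_ratios, 3))
```

For a binary response, the same call with family = binomial(link = "logit") or family = binomial(link = "probit") fits the logit or probit model.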

For example, a project might use a linear regression model, a random forest regression model and a lasso regression model. Typically each model will address a different part of a large data set or a different data set. The direct comparison of two models using the same data set is less than ideal but okay.

For linear regression models, a project with a cross-validated linear regression model will likely score a bit higher than one using an ordinary linear regression model. Similarly, linear regression models that use interaction terms and polynomials and fit significantly better than a linear regression on the original predictor values score high. In this case the tabular summaries of both models should be included.
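One way to carry out such a comparison is sketched below with base R only. The built-in mtcars data set, the two formulas, the helper name cv_rmse, and the choice of 5 folds are all assumptions made for illustration; the class may have used a different cross-validation procedure.

```r
# K-fold cross-validation comparing two linear regression models.
# mtcars is a built-in data set; the formulas are illustrative only.
set.seed(1)
data(mtcars)

# Assign each row to one of k folds; both models share the same folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Hypothetical helper: out-of-sample root mean squared error
cv_rmse <- function(formula, data, fold) {
  response <- all.vars(formula)[1]
  sq_err <- numeric(0)
  for (i in sort(unique(fold))) {
    train <- data[fold != i, , drop = FALSE]
    test  <- data[fold == i, , drop = FALSE]
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    sq_err <- c(sq_err, (test[[response]] - pred)^2)
  }
  sqrt(mean(sq_err))
}

f_plain <- mpg ~ wt + hp                    # original predictor values
f_rich  <- mpg ~ poly(wt, 2) + hp + wt:hp   # polynomial and interaction

rmse_plain <- cv_rmse(f_plain, mtcars, fold)
rmse_rich  <- cv_rmse(f_rich,  mtcars, fold)
print(c(plain = rmse_plain, rich = rmse_rich))

# Tabular summaries of both models, fit to the full data
print(summary(lm(f_plain, data = mtcars)))
print(summary(lm(f_rich,  data = mtcars)))
```

Using the same fold assignment for both models keeps the comparison from depending on which rows happened to land in which fold.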

Linear regression models have four diagnostic plots. The spread-location and Q-Q normal plots should be included and briefly discussed.
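A minimal sketch of producing these two plots in base R follows. The mtcars data set and the model formula are assumptions for illustration. Note that R titles the spread-location plot "Scale-Location".

```r
# Two of the four standard lm diagnostic plots in base R.
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

op <- par(mfrow = c(1, 2))   # two panels side by side
plot(fit, which = 2)         # normal Q-Q plot of standardized residuals
plot(fit, which = 3)         # scale-location (spread-location) plot
par(op)                      # restore the previous graphics settings
```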

All the modeling methods have tabular summaries. The model summaries should be included in the project report, briefly discussed and interpreted. A preliminary and final

The generalized linear models don’t need a plot, but a coefficient or fitted model plot can be included. If chosen, a random forest model has a variable importance plot that should be included and briefly discussed. If chosen, lasso regression has two plots related to variable coefficient shrinkage. At least one of these should be included and briefly discussed.

A student seeking to use a different modeling method in the ISLR or RFE texts can usually get instructor permission.

4. Provide and describe a variety of quality graphics used for data exploration.

Scatterplot matrices are often good to use as they may reveal outliers, bivariate functional relationships, and density patterns, and provide information such as the range of variable values. Maps can show geospatial patterns.

There are many class graphics from which to choose. The partial description here is to serve as a reminder.

Row-labeled plots were taught in class. These are not suitable for showing a large number of cases but may be used to feature selected cases.

The simplest row-labeled dot plots use the y-axis to plot horizontal case labels at implicit counting integer locations. They use the x-axis scale to determine the position of the dot that encodes the value of an ordered numeric variable.

This design has many extensions. When dots represent estimates, confidence intervals for the estimates can also be shown using line segments. A pair of values for two different times can be shown as an arrow. A whole distribution for each case can be summarized using a box plot or shown as a density or violin plot. Additional variables or distributions can be shown in additional columns.

The sorting of cases can be based on the estimate values. This simplifies appearance and supports rapid identification of cases with similar values and more accurate comparison of the cases.
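The sorted row-labeled dot plot with interval segments described above can be sketched in base R as follows. The case names, estimates and standard errors are made-up illustration values, and the intervals are approximate 95% normal intervals; none of this comes from the class examples.

```r
# Sorted row-labeled dot plot with confidence interval segments.
# The values below are made-up illustration data, not real estimates.
est <- data.frame(
  case     = c("Region A", "Region B", "Region C", "Region D", "Region E"),
  estimate = c(12.1, 8.4, 15.3, 10.0, 9.2),
  se       = c(0.9, 1.2, 0.7, 1.0, 1.5)
)

# Sort cases by estimate so similar values sit next to each other
est <- est[order(est$estimate), ]

# Approximate 95% intervals, assuming normally distributed estimates
lower <- est$estimate - 1.96 * est$se
upper <- est$estimate + 1.96 * est$se

# Case labels on the y-axis, dot position on the x-axis
dotchart(est$estimate, labels = est$case,
         xlim = range(lower, upper), pch = 19,
         xlab = "Estimate (illustrative units)")

# Without groups, dotchart() places row i at y = i, so segments line up
segments(lower, seq_along(lower), upper, seq_along(upper))
```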

When there are one or two additional categorical variables with a few labels (factor levels), juxtaposed and superposed row-labeled plots can encode the categorical variables. Row and/or column juxtaposed panel plots use panel containment and panel strip labels to show category-specific values of other variables. In superposed plots, symbol shape and/or symbol color plus legend(s) provide the categorical encodings.

The family type and race dot plots used conditioned (juxtaposed) panels. These examples introduced the sorting of both cases and panels to simplify appearance. The examples also introduced a reference value to support visual comparisons against a common standard. It can be useful to show deviations from such a reference value explicitly.

Long lists of cases can be visually intimidating. After sorting based on a selected variable, cases can be partitioned into small and less intimidating groups. Focusing on the cases of a small group across multiple columns of statistics in a plot or a table becomes easy. The grouping also supports the inclusion of columns of small maps and icons that feature the cases of the group, and even cumulative cases from reading top down or bottom up. Linked micromaps provide examples.

Many graphics forgo case labels to encode more data and provide higher resolution encodings. Scatterplots of two continuous variables may involve millions of cases. Scatterplots often serve to convey data density and functional relationships. Statistical graphics commonly encode computed density values and functional relationships because human visual impressions of these are hard to communicate and don’t address overplotting.

For showing local data density, hexagon (or square) bin counts address overplotting issues and can be encoded using symbol size, color or both. Bivariate kernel density estimates can be conveyed using contours, perspective views of lines on the surface, and lighting model rendered rotatable 3D surfaces.
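As a rough sketch of the square-bin variant using base R only, the following counts points per cell and encodes the counts with color. The simulated data and the choice of 30 bins per axis are assumptions for illustration. Hexagon binning would need an additional package and is not shown here.

```r
# Square-bin counts to show local data density despite overplotting.
# Simulated data; the bin count (30 per axis) is an arbitrary choice.
set.seed(42)
n <- 20000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)

nbins <- 30
x_breaks <- seq(min(x), max(x), length.out = nbins + 1)
y_breaks <- seq(min(y), max(y), length.out = nbins + 1)

# Count the points falling into each square cell
counts <- table(cut(x, x_breaks, include.lowest = TRUE),
                cut(y, y_breaks, include.lowest = TRUE))
counts <- matrix(counts, nrow = nbins, ncol = nbins)
counts[counts == 0] <- NA   # leave empty cells blank

# Encode counts with color; image() accepts cell boundaries for x and y
image(x_breaks, y_breaks, counts,
      col = rev(heat.colors(12)),
      xlab = "x", ylab = "y", main = "Square-bin counts")
```

A plain scatterplot of the same 20,000 points would be a solid blob in the center; the binned counts keep the density gradient visible.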

Various scatterplot smoothes with or without confidence bands can suggest functional relationships.

Scatterplot matrices can give a quick overview of several variables and convey patterns related to data density and functional relationships. The class included a shiny version that supported interactive changes to the density and functional relationship computation and rendering parameters.
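A static scatterplot matrix with smoothes can be sketched in base R as shown here. The mtcars data set and the selected variables are assumptions for illustration, and this is not the shiny version used in class.

```r
# Scatterplot matrix with a smooth in each panel (base R).
data(mtcars)
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# panel.smooth adds a lowess curve to each pairwise scatterplot
pairs(vars, panel = panel.smooth, pch = 19, cex = 0.6,
      main = "Scatterplot matrix with lowess smoothes")
```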

When data sets have a geospatial context, the graphics can include maps. In this class, linked, conditioned and comparative micromaps are often preferred to conventional choropleth maps since they often provide more context for interpretation and/or communication. Still, inclusion of a conventional choropleth map with a good legend is okay.

Interactive and dynamic graphics using shiny, plotly, and leaflet also count toward graphics diversity. If the graphics use R packages other than those used in class, this should be stated. This may warrant a little additional credit for taking initiative, and it makes clear that other software did not produce the plot.

One plot using Tableau, or other software not used in the class, is permitted provided that the use of such software is clearly stated. Expect deductions for including additional plots using other software.

5. Optional inclusion of clustering methods and graphics

The class provided a very short introduction to the unsupervised learning methods of clustering (for case reduction) and principal components (for variable reduction). These can be included in the project for graphics diversity, but projects don’t need them to be outstanding.

Clustering methods can be based on distance or dissimilarity matrices and address either cases or variables. When the matrix is of modest size, it can be shown as a cell-colored matrix with a legend. (That is increasingly being called a heat map.) The results of both agglomerative and divisive clustering can be represented using trees shown as dendrograms or as containment trees. The dendrograms can be placed on the row and column edges of the matrix to provide labels and motivate the ordering. (Often heat maps are of little use by themselves. They may be useful in a dynamic interactive setting that is not addressed in this class.)
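A small sketch of a cell-colored distance matrix with dendrograms on its edges follows, using base R only. The 15-row subset of mtcars, the Euclidean distance, and the average-linkage method are assumptions for illustration. Base R's heatmap() does not draw a color legend, so one would have to be added separately.

```r
# Cell-colored distance matrix with dendrograms on the row and column edges.
data(mtcars)
x <- scale(mtcars[1:15, c("mpg", "wt", "hp", "disp")])  # modest-size subset
d <- dist(x)                         # Euclidean distances between cases
hc <- hclust(d, method = "average")  # agglomerative clustering

# symm = TRUE reuses the row dendrogram for the columns;
# scale = "none" keeps the distances as computed.
heatmap(as.matrix(d),
        Rowv = as.dendrogram(hc),
        symm = TRUE, scale = "none",
        main = "Distances between cases")
```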

This class focused on a simple k-means approach to clustering. This and related variants are widely used because they are fast. The task of selecting the number of clusters doesn’t have an all-purpose solution. The course content includes one algorithm and plot for selecting the number of clusters. It also includes a diagnostic plot that calls attention to cases that are suspiciously similar to another cluster.
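One common heuristic for choosing the number of clusters is an "elbow" plot of the total within-cluster sum of squares, sketched below in base R. This is an assumption about a reasonable approach and may not be the algorithm and plot covered in the course. The iris data set and the range of k are also illustrative choices.

```r
# Total within-cluster sum of squares for a range of k (an "elbow" plot).
set.seed(123)
data(iris)
x <- scale(iris[, 1:4])   # standardize so no variable dominates distances

k_values <- 1:8
wss <- sapply(k_values, function(k) {
  # nstart = 20 reruns k-means from several random starts and keeps the best
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

plot(k_values, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

A bend in the curve, after which additional clusters reduce the sum of squares only slightly, suggests a candidate value for k.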

6. What to include in the report.

Include the source of each data set used in the project and describe the data set briefly. Provide a coherent written description of the facets of the data addressed in modeling and data exploration. Section 3 describes what to report for models and their associated graphics. Exploratory plots should fill the rest of the report, and each plot should have at least one comment about what it shows. A comment might be that there isn’t much of a pattern. However, it is better to pick a plot that has a pattern worthy of a comment.

There is no need for the models or the exploratory graphics to find a result of scientific or social importance.

7. What to avoid

Avoid the use of bulleted lists in the report. The report is not a PowerPoint presentation.

Avoid misspellings and grammar errors that are easily caught with a grammar checker.

8. Some general suggestions for writing the paper and/or the presentation outline

For each data set the report can provide a quick answer to the question, why this data set? If a particular interest motivated the data set choice, it is good to express this. If you have questions you hoped to answer using the data it is fine to mention this even if the data set didn’t provide an answer. Some students pick data sets because classmates will likely find them interesting. That is fine.

Briefly describe the data set. Think of questions to answer. What does the data set address? Does it include geospatial and/or temporal data? What variables may be related based on prior knowledge? What is the source of the data? How was the original data obtained? If a significant amount of pre-processing was involved, indicate this briefly and provide a little more description in the report appendix.

The data exploration and data modeling descriptions can be interwoven. If the data set’s context suggests a dependent variable to model, univariate graphics might show its density and the density of a couple of explanatory variables. A scatterplot matrix with the dependent variable and more explanatory variables could show smoothes that suggest functional relationships. It could also use hexagon bins to show bivariate density patterns. The scatterplot matrix may show explanatory variables that are highly correlated.

If there is no obvious choice for a dependent variable, the scatterplot matrix smoothes may suggest a variable. Otherwise it may be best to choose a different data set.

If the team uses different data sets the report might address each data set in turn.

The challenge is often to limit the presentation examples based on the presentation time limit and to limit the project description based on the page limit.

9. Standard Project Evaluation

The thoughtful investigation or description of a data set or data sets following the criteria below can warrant an A grade whether or not some exciting result is found.

There are four broad evaluation areas: 1) appropriate use, description and interpretation of models and their output, 2) the variety and quality of graphics used to explore the data as shown in the report, 3) the level of effort, and 4) the writing quality.

• Level of effort areas
    o Effort to find a better model, for example using variable transformations in linear regression
    o Use of class-related or suggested methods, such as plotly, that were little taught in class
    o Development of a GUI to access data or facilitate graphics production
    o Development and/or integration of visual analytics
    o Substantial adaptation of existing graphics or using graphics for new data
    o Development of new graphics
    o Creation of R functions or even an R package
    o Data cleaning and preparation (counts a little)

• Adequate variety and correct use of methods taught in class
    o The different kinds of models are to be from the following list
        ▪ Multiple linear regression, preferably a cross-validated model
        ▪ Generalized linear regression (logit, probit or Poisson models)
        ▪ Random forest regression and/or classification
        ▪ Lasso regression model
    o Exploratory graphics
        ▪ One, two or more factors
        ▪ One, two or more continuous variables
        ▪ Time series
        ▪ Map showing spatial context
        ▪ Data distribution and functional relationship graphics

• Graphic labeling and following guidelines
    o Provide the units of measure, etc.
    o Make accurate comparisons
    o Add context for interpretation
    o Strive for simple appearance
    o Engage the reader
        ▪ Interactive and dynamic graphics, such as Shiny, CCmaps and TCmap, can work in presentations.
        ▪ The project can show a print and include comments referencing the dynamic graphics shown in class.
        ▪ Increasingly, dynamic html documents will reach more people.

• Avoid graphics deprecated in class. The deprecated methods in some cases go against common practice and perhaps even some perceptual studies.
    o Pie charts
    o Perspective bars without watermarks
    o Symbols plotted on the plot outline (exceptions will be made for ggplot facetted graphics because the axis limits cannot be controlled)
    o Axes with missing labels and/or units of measure

• Writing quality
    o Papers are to be well written. Non-native English speakers are advised to get help with their writing. There are GMU resources for this. Native English speakers are also encouraged to get help with their writing. Some students are already excellent writers and need no help, but most people can benefit from the input.
    o Indicate the nature of the data set and its source
    o Indicate the goals for the graphics and/or analysis
    o Logical and/or systematic description and reasoning
    o Statements of results or conclusions
    o Clear labeling of figures and tables

10. Special projects

A team can suggest a special project for consideration. This is a project that deviates from focusing directly on methods taught in class. Projects that have strong educational merit and are reasonably close to class learning objectives are likely to be approved. For example, topics in the class texts are likely to be approved. Also, there are numerous R packages of potential interest.