PREAMBLE
STAT 430/830: Final Project
DUE: Friday August 14 by 11:59pm EST
Netflix, at one time just an online DVD rental service, has become a titan in the entertainment industry. While predominantly a streaming service, Netflix has also become well-known for its original programming, such as the television series Stranger Things and the Oscar-nominated film Marriage Story.
The success of Netflix is due, in part, to its well-known data-driven culture. Enmeshed within this culture is a strong appreciation for, and exploitation of, designed experiments. Netflix’s home-grown ABlaze experimentation platform is renowned in the industry for its sophistication and the “wins” it has helped the company achieve. It is perhaps unsurprising, then, that Netflix is a leader in online experimentation. Though not recent, this job ad from 2016 for a Senior Data Scientist illustrates the organization’s experimental maturity. In this role, you would “design, run, and analyze A/B and multivariate tests”, “analyze experimental data with statistical rigor”, and “adapt existing methods such as Response Surface Methodology (RSM) to online A/B testing”.
In this project you will embark on a Netflix-inspired experimental journey with a hypothetical problem and a web-based response surface simulator.
THE PROBLEM
In this project you will be concerned with optimizing the www.netflix.com homepage by way of minimizing browsing time. For those unfamiliar with Netflix, a screenshot of the homepage is included above. As depicted in the screenshot, the homepage is laid out in a grid system in which movies and TV shows appear as tiles, with rows differing with respect to some categorization. Though not depicted in the screenshot, when one hovers their mouse over a tile, the tile is enlarged and a preview of the show/movie is automatically played in the enlarged window.
When faced with so many viewing options, Netflix users often experience choice-overload and can be overcome by a psychological phenomenon known as decision paralysis. The problem is that it becomes harder to make a decision, and it takes longer to make a decision, when faced with a large number of options to choose from. Decision paralysis negatively impacts Netflix because a user may become overwhelmed by all of the options and fatigued by the prospect of making a choice, and may ultimately lose interest and not watch anything.
To overcome this, Netflix tries to help you choose what to watch, and by a variety of mechanisms tries to help you choose quickly. Of particular relevance is browsing time: the length of time a user spends browsing (as opposed to watching) Netflix. Ideally, browsing time, and in particular average browsing time, would be small. In this project you will conduct a series of experiments to learn what influences browsing time and how that may be exploited in order to minimize average browsing time. There are infinitely many things that likely influence the amount of time someone spends browsing Netflix, but just four factors will be explored in this project. Each of these is described below.
• Tile Size: The ratio of a tile’s height to the overall screen height. Note the tile’s aspect ratio is fixed so changing this factor changes the size of the tile, but not its shape. Smaller values correspond to a larger number of tiles visible on the screen, and larger values correspond to fewer visible tiles.
• Preview Size: The ratio of the preview window’s height to the overall screen height. Note the preview window’s aspect ratio is fixed so changing this factor changes the size of the window, but not its shape. Smaller values correspond to a smaller viewing window, and larger values correspond to a larger viewing window.
• Preview Length: The duration (in seconds) of a show or movie’s preview.
• Top Row: The viewing category of a user’s first row of tiles.
The table below summarizes the design space (region of operability) for each of these factors, and the default values they take on when not being experimented with.
Through a series of experiments you will seek to determine which of these factors significantly influences browsing time, and you will attempt to find an optimal configuration of them that minimizes expected browsing time. You will do this by interacting with a web-based simulator, into which you will submit experimental designs and out of which you will receive response observations.
The remainder of this document provides guidelines for using the simulator, an overview of the sequential experimentation process you will undertake, and a description of the deliverable that you must submit. An outline of the marking scheme is included as an Appendix to make clear my expectations and to make transparent the manner in which you will be graded.
Factor           Code Name     Region of Operability         Default Value
Tile Size        Tile.Size     [0.1, 0.5]                    0.2
Preview Size     Prev.Size     [max(Tile.Size, 0.2), 0.8]    0.5
Preview Length   Prev.Length   [30, 120]†                    75
Top Row          Top.Row       TC, NO‡                       TC

† For purposes of experimentation, Prev.Length can only be changed in increments of 5 seconds.
‡ TC stands for Top 10 in Canada and NO stands for Netflix Originals.
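If you would like to check a design programmatically before uploading it, the following is a minimal R sketch (not part of the official instructions) of a helper that verifies a candidate design respects the regions of operability above. The function name and the data frame d are placeholders; the column names follow the required .csv headings described in THE SIMULATOR section.

    # Hypothetical helper (not required): sanity-check a design against the
    # regions of operability before uploading it to the simulator.
    valid_design <- function(d) {
      all(d$Tile.Size >= 0.1, d$Tile.Size <= 0.5,
          d$Prev.Size >= pmax(d$Tile.Size, 0.2), d$Prev.Size <= 0.8,
          d$Prev.Length >= 30, d$Prev.Length <= 120,
          d$Prev.Length %% 5 == 0,                 # 5-second increments only
          d$Top.Row %in% c("TC", "NO"))            # vacuously TRUE if Top.Row is absent
    }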
THE SIMULATOR
The response surface simulator can be accessed at the following URL:
https://nathaniel-t-stevens.shinyapps.io/Netflix_Simulator/
The interface (pictured above) and the manner in which you interact with it are straightforward: you upload a design matrix and then collect your results. Interaction with the simulator should consist of three distinct steps:
1. Upload a .csv file containing your design matrix. The .csv file must adhere to the following formatting guidelines:
• The file name must be your 8-digit student number, e.g., 20208083.csv. Any other file name will result in an error.
• The columns correspond to design factors with headings Tile.Size, Prev.Size, Prev.Length, Top.Row. Any heading other than these will result in an error. The order of the headings does not matter. You do not need to experiment with every factor in every experiment.
• Each row corresponds to a distinct experimental condition, and each element indicates the level of the corresponding factor.
• Factor levels must be in natural units. (A sketch of a valid design file appears after this list of steps.)
2. Click the “Visualize my Design” button. This will render a plot of the design space and indicate the experimental conditions you plan to run.
• If the design is not the one you intended, you may reset the simulator (by clicking the “Reset” button) and upload a different design matrix.
• If there is anything amiss with the file you uploaded, an error (instead of a plot) will be returned.
3. Supposing you are happy with the design, click the “Run the Experiment” button. This will generate n = 250 browsing times (in minutes) for each condition. The results will be automatically downloaded in a .csv file.
• Remark 1: This mimics the random assignment of n = 250 users to each condition and the observation of their response variable.
• Remark 2: You may assume without justification that n = 250 is a sufficient sample size in each condition for the task at hand.
• Remark 3: You may assume that browsing time observations do not include the amount of time spent watching previews; browsing time is simply the number of minutes spent scrolling and searching.
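As an illustration only (this is not part of the official instructions), the R sketch below writes a design file in the required format. It assumes the Phase I two-level factorial for the three STAT 430 factors, with the low and high levels listed in THE EXPERIMENTS section, and uses 20208083 as a placeholder student number.

    # Illustrative sketch: build a 2^3 factorial design in natural units and
    # save it with the required column headings and file name (placeholder number).
    design <- expand.grid(Tile.Size   = c(0.1, 0.3),
                          Prev.Size   = c(0.3, 0.5),
                          Prev.Length = c(30, 90))
    write.csv(design, file = "20208083.csv", row.names = FALSE)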
THE EXPERIMENTS
Your experimental journey will consist of three phases as outlined below. Note that STAT 430 students may ignore the Top.Row factor for the entirety of this project. STAT 830 students, however, must consider all four factors.
PHASE I: Factor Screening
Use a two-level experiment (i.e., a 2^K factorial or 2^(K−p) fractional factorial) to determine which factors significantly influence the response. A factor deemed insignificant can be ignored in all subsequent phases of experimentation.
STAT 430 Instructions
You will experiment with three factors: Tile.Size, Prev.Size, Prev.Length. The low and high levels of these factors (for this experiment) are shown below.

Factor        Low    High
Tile.Size     0.1    0.3
Prev.Size     0.3    0.5
Prev.Length   30     90
Using the data collected from your two-level experiment, determine which factors significantly influence browsing time. Be sure to include formal hypothesis tests and main effect plots in your analysis.
STAT 830 Instructions
You will experiment with all four factors: Tile.Size, Prev.Size, Prev.Length, Top.Row. The low and high levels of these factors (for this experiment) are shown below.

Factor        Low    High
Tile.Size     0.1    0.3
Prev.Size     0.3    0.5
Prev.Length   30     90
Top.Row       TC     NO
Using the data collected from your two-level experiment, determine which factors significantly influence browsing time. Be sure to include formal hypothesis tests and main effect plots in your analysis.
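To make the analysis concrete, here is a minimal R sketch of one way the screening analysis might look, shown for the three STAT 430 factors (STAT 830 students would also include Top.Row as a coded or factor variable). The response column name Browse.Time and the name of the downloaded results file are assumptions and may differ from the simulator's actual output. These sketches are for your own analysis work only; recall that the report itself must not contain R code or R output.

    # Sketch only: fit a full 2^3 factorial model in coded units and test effects.
    results <- read.csv("results.csv")   # downloaded simulator output (assumed name)
    code <- function(x) 2 * (x - mean(range(x))) / diff(range(x))
    results$x1 <- code(results$Tile.Size)
    results$x2 <- code(results$Prev.Size)
    results$x3 <- code(results$Prev.Length)

    # t-tests on the coefficients serve as formal tests of main and interaction effects
    fit <- lm(Browse.Time ~ x1 * x2 * x3, data = results)
    summary(fit)

    # Main effect plot for one factor (repeat for the others)
    plot(aggregate(Browse.Time ~ x1, data = results, FUN = mean),
         type = "b", xlab = "Tile.Size (coded)", ylab = "Average browsing time (min)")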
PHASE II: Method of Steepest Descent
Considering only those factors deemed to significantly influence browsing time in Phase I, perform a method of steepest descent analysis to move from the initial region of experimentation toward the vicinity of the optimum. Note that this may require intermediate two-level designs to reorient toward the optimum. You will find tests for curvature and a plot of average browsing time vs. step number useful.
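The sketch below illustrates one way the steepest-descent path might be generated in R, assuming (purely for illustration) that only Tile.Size and Prev.Size survived screening and that a first-order model fit in coded units is available as fit. The centre point, half-ranges, number of steps, step-size convention, and file name are all assumptions; choose your own and justify them.

    # Sketch: generate conditions along the path of steepest DESCENT from a
    # first-order model 'fit' with coded factors x1 (Tile.Size) and x2 (Prev.Size).
    b <- coef(fit)[c("x1", "x2")]
    step <- -b / abs(b["x1"])        # move opposite the gradient; x1 moves one coded unit per step

    path_coded <- outer(0:5, step)   # six conditions: the centre plus five steps
    centre <- c(Tile.Size = 0.2, Prev.Size = 0.4)    # assumed centre of the current region
    half   <- c(Tile.Size = 0.1, Prev.Size = 0.1)    # assumed half-ranges
    path_natural <- sweep(sweep(path_coded, 2, half, "*"), 2, centre, "+")
    colnames(path_natural) <- names(centre)
    # Check that every condition stays inside the regions of operability before uploading
    write.csv(path_natural, "20208083.csv", row.names = FALSE)   # placeholder file name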
PHASE III: Response Optimization
Once you are confident that you are in the vicinity of the optimum, conduct a central composite design and use a second order response surface model to identify the location of the optimum (i.e., the factor levels that minimize expected browsing time). Report the estimate and a 95% confidence interval for the expected browsing time at this location.
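As a rough illustration of the Phase III analysis, the R sketch below fits a second-order model in two coded factors and locates the stationary point. The data frame name ccd_data, the response name Browse.Time, and the choice of exactly two factors are assumptions; adapt the model to whichever factors you carried forward.

    # Sketch: second-order response surface analysis for a two-factor CCD.
    fit2 <- lm(Browse.Time ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = ccd_data)

    # Stationary point of y = b0 + x'b + x'Bx is x_s = -0.5 * solve(B) %*% b
    b <- coef(fit2)[c("x1", "x2")]
    B <- matrix(c(coef(fit2)["I(x1^2)"],   coef(fit2)["x1:x2"] / 2,
                  coef(fit2)["x1:x2"] / 2, coef(fit2)["I(x2^2)"]), nrow = 2)
    x_s <- -0.5 * solve(B, b)        # stationary point in coded units; convert back to natural units
    eigen(B)$values                  # both eigenvalues positive => the stationary point is a minimum

    # Estimated expected browsing time at the stationary point, with a 95% confidence interval
    predict(fit2, newdata = data.frame(x1 = x_s[1], x2 = x_s[2]),
            interval = "confidence", level = 0.95)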
THE DELIVERABLE
You will prepare and submit a report (saved as five separate .pdf files) via Crowdmark by the due date listed at the top of this document. The five files constituting the report will be based on the following elements:
• File #1: Executive Summary (1 page max)
– Summary of the problem, your experimental journey, and the ensuing findings.
– Be sure to state the location and value of the optimum.
• File #2: Introduction (2 pages max)
– Describe in your own words the problem you are trying to solve.
– Describe in your own words the goals of response surface methodology.
• File #3: Factor Screening (2 pages max)
– Explain your factor screening experiment through the lens of QPDAC. State the objective, explain your design, collect the data, analyze the data, and draw a conclusion.
– Be sure to justify any decisions you made in either the design or the analysis. For instance, why did you use a 2^K factorial experiment as opposed to a 2^(K−p) fractional factorial experiment (or vice versa)?
– Be sure to include visual and/or tabular summaries of the experiment.
• File #4: Method of Steepest Descent (2 pages max)
– Explain your MSD experiments through the lens of QPDAC. State the objective, explain your design, collect the data, analyze the data, and draw a conclusion.
– Be sure to justify any decisions you made in either the design or the analysis. For instance, how did you choose your step sizes? How did you know when to stop?
– Be sure to include visual and/or tabular summaries of the experiment.
• File #5: Response Optimization (2 pages max)
– Explain your response surface experiment through the lens of QPDAC. State the objective, explain your design, collect the data, analyze the data, and draw a conclusion.
– Be sure to justify any decisions you made in either the design or the analysis. For instance, how did you choose the low and high levels of the factors? How did you choose where to place your axial conditions?
– Be sure to include visual and/or tabular summaries of the experiment.
IMPORTANT: Your report will not contain R code or R output. Discussion of your analyses should be succinct, and analysis results should be included as figures and/or nicely formatted tables. Note that figures and tables count toward the page limit. Include only that which is necessary to tell your story and to justify your decisions.
APPENDIX: Marking Scheme
Your project will be marked out of 50 points. The points are allocated as follows.
• Executive Summary [5 points]
– [2] Grammar, professionalism
– [3] Clarity, relevance
• Introduction [10 points]
– [2] Grammar, professionalism
– [3] Clarity of problem recapitulation
– [5] Clarity, coverage/depth, relevance of RSM discussion
• Factor Screening [10 points]
– [2] Grammar, professionalism
– [1] Clarity of question
– [3] Suitability of design and clarity of design choices
– [3] Suitability of analysis and clarity of analysis choices
– [1] Suitability and clarity of conclusions
• Method of Steepest Descent [10 points]
– [2] Grammar, professionalism
– [1] Clarity of question
– [3] Suitability of design and clarity of design choices
– [3] Suitability of analysis and clarity of analysis choices
– [1] Suitability and clarity of conclusions
• Response Optimization [10 points]
– [2] Grammar, professionalism
– [1] Clarity of question
– [3] Suitability of design and clarity of design choices
– [3] Suitability of analysis and clarity of analysis choices
– [1] Suitability and clarity of conclusions
• Accuracy of Optimum [3 points]
– [3] If the percent difference between the true minimum browsing time and the estimate at your stationary point is less than or equal to 1%.
– [2] If the percent difference between the true minimum browsing time and the estimate at your stationary point is more than 1% but less than or equal to 5%.
– [1] If the percent difference between the true minimum browsing time and the estimate at your stationary point is more than 5% but less than or equal to 10%.
– [0] If the percent difference between the true minimum browsing time and the estimate at your stationary point is more than 10%.
• Efficiency of Experimentation* [2 points]
– [2] If the total number of experimental conditions performed is ≤ 20 (STAT 430), ≤ 40 (STAT 830)
– [1] If the total number of experimental conditions performed is > 20 and ≤ 25 (STAT 430), > 40 and ≤ 50 (STAT 830)
– [0] If the total number of experimental conditions performed is > 25 (STAT 430), > 50 (STAT 830)
* If you want to play around with the simulator without sacrificing your condition count, feel free to play with 20203083, but note that it has a different underlying response surface than yours. Exploring it will not provide any insight into your surface.