Project
The data for this project come from mobile advertising. To encourage consumers to install its app (e.g., a game), an app developer advertises the app in other apps (e.g., other games) through a mobile advertising platform. Consumers who see these ads in the other apps can click on an ad to install the developer’s app. We will refer to the advertising app developer as the advertiser. See the figure below.
[Figure: Advertiser -> Advertising Platform -> Publisher 1, Publisher 2, ..., Publisher k-1, Publisher k -> Consumer -> Install / Not Install]
The dataset for this project contains data on ads from one particular advertiser shown through multiple publishers. Each observation corresponds to one ad shown to a consumer in a particular publisher’s app. It records the publisher ID, the consumer’s device characteristics, and whether the advertiser’s app was installed. The variables are described below.
Variable               | Type        | Description
-----------------------|-------------|-----------------------------------------------------------------
publisher_id_class     | Categorical | Publisher ID
device_make_class      | Categorical | Device Manufacturer
device_platform_class  | Categorical | Phone OS Type (iPhone / Android)
device_os_class        | Categorical | Phone OS Version
device_height          | Numerical   | Display Height (in pixels)
device_width           | Numerical   | Display Width (in pixels)
Resolution             | Numerical   | Display Resolution (pixels per inch)
device_volume          | Numerical   | Device Volume when Ad was displayed
Wifi                   | Numerical   | Whether WiFi was enabled when ad was displayed (Yes = 1, No = 0)
Install                | Binary      | Whether Consumer Installed Advertiser’s App (Yes = 1, No = 0)
Part I
The advertiser needs to determine how much to pay for placing an ad, depending on the publisher and the consumer’s characteristics. The optimal payment is proportional to the probability that a consumer who sees the ad will install the app.
1. Develop a linear probability model to estimate the probability that the app is installed, based on publisher and consumer characteristics. Present only the final model, and explain the procedure and the measures you used to arrive at it.
2. Develop a logistic regression model to estimate the probability that the app is installed, based on publisher and consumer characteristics. Present only the final model, and explain the procedure and the measures you used to arrive at it. Do you need to model installs as rare events in this case? Why or why not? Present the results of both approaches: (i) estimate the model without accounting for rare events, and (ii) estimate the model using oversampling to handle rare events, then apply the correction to obtain the corrected intercept (see lecture; also see http://support.sas.com/kb/22/601.html for how to handle this directly in SAS).
3. Plot the ROC curves for all models. (Hint: You can use PROC LOGISTIC to plot the ROC curve for the linear probability model without fitting the model; see lecture. Similarly, for the logistic regression model for rare outcomes, estimate the model on the oversampled data and then plot its ROC curve using the original dataset. Using the oversampled data for the ROC curve would give a misleading comparison.)
4. Which of the above models has the highest AUC (area under the curve)? At the 95% confidence level, is the AUC of this model higher than those of the other models? (Hint: Look at the confidence intervals reported for the AUC.)
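The intercept correction mentioned in item 2 can be illustrated outside SAS. The following is a minimal sketch in Python, not the SAS code the deliverables require. It assumes the standard prior correction for oversampled data, in which only the intercept is shifted by the log of the ratio of sample odds to population odds of an install and the slope coefficients are left unchanged. The function names and the example rates are hypothetical.

```python
import math


def corrected_intercept(b0_oversampled, sample_rate, population_rate):
    """Shift an intercept estimated on oversampled data back to the population scale.

    sample_rate: share of installs in the oversampled training data (assumed known).
    population_rate: share of installs in the original data (assumed known).
    """
    for rate in (sample_rate, population_rate):
        if not 0.0 < rate < 1.0:
            raise ValueError("rates must lie strictly between 0 and 1")
    sample_odds = sample_rate / (1.0 - sample_rate)
    population_odds = population_rate / (1.0 - population_rate)
    return b0_oversampled - math.log(sample_odds / population_odds)


def predict_probability(intercept, slopes, features):
    """Logistic prediction from an intercept and matching slope/feature lists."""
    linear = intercept + sum(b * x for b, x in zip(slopes, features))
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    # Hypothetical numbers: 50% installs after oversampling, 1% in the original data.
    b0 = corrected_intercept(0.3, sample_rate=0.5, population_rate=0.01)
    print(round(b0, 4))
```

A quick sanity check on the design: for an intercept-only model fitted on a sample with 50% installs, the uncorrected intercept is 0, and the corrected intercept maps back to the population install rate.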
Part II
The advertising platform would like to determine whether to show the ad from this advertiser, depending on the publisher and consumer characteristics. In particular, the advertising platform needs to choose a threshold such that the ad is shown to the consumer if the probability of installing the app is above that threshold.
Showing an ad to a consumer who would not install the app imposes an inconvenience cost on the consumer, which in turn reduces participation and causes a loss of 1 cent to the platform. On the other hand, not showing an ad to a consumer who would have installed the app results in a missed-opportunity cost of 100 cents to the platform. The platform would like to minimize the total expected cost.
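The cost structure above can be written compactly. As a sketch, let t be the threshold, FP(t) the number of consumers who are shown the ad but do not install, FN(t) the number who are not shown the ad but would have installed, and p a consumer's install probability. These symbols are introduced here for illustration and are not part of the original handout.

```latex
\begin{align*}
C(t) &= 1 \cdot FP(t) + 100 \cdot FN(t) \quad \text{(cents)} \\
\text{show the ad when}\quad (1 - p) \cdot 1 &< p \cdot 100
  \;\Longleftrightarrow\; p > \tfrac{1}{101} \approx 0.0099
\end{align*}
```

The second line holds only under the assumption that the predicted probabilities are well calibrated. With estimated models, the threshold that minimizes the empirical total cost may differ, which is why the cost should still be evaluated over a range of thresholds.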
For each of the models you estimated above, generate the ROC table using SAS and plot the total cost for different threshold values. Note that for the linear probability model (unlike the logistic regression model), SAS does not generate the ROC table automatically. You will need to write a PROC or DATA step to create the table yourself. To make your job easier, you can calculate the total cost at these thresholds:
0.001 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050
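The total-cost calculation at these thresholds can be sketched as follows. This is an illustration in Python of the logic only; the deliverable itself must be written in SAS. The function name `cost_table` and the sample data are hypothetical. The sketch assumes the ad is shown when the predicted probability is strictly above the threshold, as stated earlier in Part II.

```python
THRESHOLDS = [0.001, 0.005, 0.010, 0.015, 0.020, 0.025,
              0.030, 0.035, 0.040, 0.045, 0.050]
COST_FALSE_POSITIVE = 1    # cents: ad shown, app not installed
COST_FALSE_NEGATIVE = 100  # cents: ad not shown, app would have been installed


def cost_table(installs, probabilities, thresholds=THRESHOLDS):
    """Return one (threshold, false_positives, false_negatives, total_cost) row per threshold.

    installs: observed 0/1 outcomes; probabilities: predicted install probabilities.
    """
    if len(installs) != len(probabilities):
        raise ValueError("installs and probabilities must have the same length")
    rows = []
    for t in thresholds:
        pairs = list(zip(installs, probabilities))
        fp = sum(1 for y, p in pairs if p > t and y == 0)   # shown, not installed
        fn = sum(1 for y, p in pairs if p <= t and y == 1)  # not shown, would install
        total = COST_FALSE_POSITIVE * fp + COST_FALSE_NEGATIVE * fn
        rows.append((t, fp, fn, total))
    return rows


if __name__ == "__main__":
    # Made-up outcomes and predictions, for illustration only.
    y = [0, 0, 1, 0, 1, 0]
    p = [0.002, 0.012, 0.030, 0.004, 0.008, 0.047]
    table = cost_table(y, p)
    for row in table:
        print(row)
    print("lowest-cost threshold:", min(table, key=lambda r: r[3])[0])
```

The same function applies to every model as long as the outcomes and predicted probabilities both come from the original dataset rather than from an oversampled one.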
Which of these models provides the lowest total cost? (For the logistic regression model for rare outcomes, you cannot use the oversampled data to calculate the cost, since it is not representative of the actual distribution of outcomes.)
Deliverables
• Project Report: For each question above, describe the model building and selection process that you followed, and include suitable tables and graphs as necessary.
• SAS code: Include a SAS file with detailed comments that reproduces all the results, tables, and figures in the report. The code must be clearly labeled so that it is straightforward to see how to reproduce a particular result, table, or figure. Points will be deducted if the code does not execute. The code should assume that it will be executed in the folder containing the dataset.