
Workshop 7: Bayesian Approach to A/B Testing¶
Original content adapted from Cam Davidson-Pilon's book Bayesian Methods for Hackers.
This workshop introduces more PyMC3 syntax (a package dedicated to Bayesian analysis), its variable types, and ways to think about how to model a system from a Bayesian perspective. It also contains tips and data visualization techniques for assessing the goodness-of-fit of your Bayesian model.
In [1]:
%matplotlib inline
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt
import scipy.stats as stats

A little more on PyMC3¶
Model Context¶
In PyMC3, we typically handle all the variables we want in our model within the context of the Model object.
In [3]:
import pymc3 as pm

with pm.Model() as model:
    parameter = pm.Exponential("poisson_param", 1.0)
    data_generator = pm.Poisson("data_generator", parameter)

with model:
    data_plus_one = data_generator + 1


This is an extra layer of convenience compared to PyMC. Any variables created within a given Model’s context will be automatically assigned to that model. If you try to define a variable outside of the context of a model, you will get an error.
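As a quick check (a sketch only, assuming PyMC3 is installed), creating a distribution outside of any model context raises an error because there is no model on the context stack:
In [ ]:
# Sketch: instantiating a distribution with no enclosing `with pm.Model():`
# block should raise a TypeError ("No model on context stack ...").
try:
    orphan = pm.Exponential("orphan_param", 1.0)
except TypeError as err:
    print("Error as expected:", err)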
In [4]:
with pm.Model() as ab_testing:
    p_A = pm.Uniform("P(A)", 0, 1)
    p_B = pm.Uniform("P(B)", 0, 1)


PyMC3 Variables¶
All PyMC3 variables have an initial value (i.e. test value). Using the same variables from before:
In [5]:
print("parameter.tag.test_value =", parameter.tag.test_value)
print("data_generator.tag.test_value =", data_generator.tag.test_value)
print("data_plus_one.tag.test_value =", data_plus_one.tag.test_value)


The test_value is used only for the model, as the starting point for sampling if no other start is specified. It will not change as a result of sampling. This initial state can be changed at variable creation by specifying a value for the testval parameter.
In [6]:
with pm.Model() as model:
    parameter = pm.Exponential("poisson_param", 1.0, testval=0.5)

print("\nparameter.tag.test_value =", parameter.tag.test_value)


Deterministic variables¶
Calling pymc3.Deterministic is the most obvious way, but not the only way, to create deterministic variables. Elementary operations, like addition, exponentiation, etc., implicitly create deterministic variables. For example, the following returns a deterministic variable:
In [ ]:
with pm.Model() as model:
    lambda_1 = pm.Exponential("lambda_1", 1.0)
    lambda_2 = pm.Exponential("lambda_2", 1.0)
    tau = pm.DiscreteUniform("tau", lower=0, upper=10)

    new_deterministic_variable = lambda_1 + lambda_2
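
If you want the deterministic quantity to be tracked by name (for example, so it shows up in the trace), you can wrap it in pm.Deterministic instead. A minimal sketch, reusing the variables above:
In [ ]:
with pm.Model() as model:
    lambda_1 = pm.Exponential("lambda_1", 1.0)
    lambda_2 = pm.Exponential("lambda_2", 1.0)

    # Named deterministic variable: recorded in the trace under "sum_of_lambdas".
    sum_of_lambdas = pm.Deterministic("sum_of_lambdas", lambda_1 + lambda_2)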

Theano¶
The majority of the heavy lifting done by PyMC3 is taken care of with the theano package. The notation in theano is remarkably similar to NumPy.
In [ ]:
import numpy as np
import theano.tensor as tt

with pm.Model() as theano_test:
    p1 = pm.Uniform("p", 0, 1)
    p2 = 1 - p1
    p = tt.stack([p1, p2])

    assignment = pm.Categorical("assignment", p)

Including observations in the Model¶
At this point, it may not look like it, but we have fully specified our priors. For example, we can ask and answer questions like “What does my prior distribution of $\lambda_1$ look like?”
In [ ]:
%matplotlib inline
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt
import scipy.stats as stats
figsize(12.5, 4)

samples = lambda_1.random(size=20000)
plt.hist(samples, bins=70, density=True, histtype="stepfilled")
plt.title("Prior distribution for $\lambda_1$")
plt.xlim(0, 8);

To frame this in the notation of the first chapter, though this is a slight abuse of notation, we have specified $P(A)$. Our next goal is to include data/evidence/observations $X$ into our model.
PyMC3 stochastic variables have a keyword argument observed. The keyword observed has a very simple role: fix the variable’s current value to be the given data, typically a NumPy array or pandas DataFrame. For example:
In [ ]:
data = np.array([10, 5])
with model:
    fixed_variable = pm.Poisson("fxd", 1, observed=data)
print("value: ", fixed_variable.tag.test_value)

This is how we include data into our models: initializing a stochastic variable to have a fixed value.
To complete the text-message example, we fix the PyMC3 observed variable to the observed dataset.
In [ ]:
# We're using some fake data here
data = np.array([10, 25, 15, 20, 35])
with model:
    # lambda_ is the switching Poisson rate from the original text-message
    # example; it is not defined in this notebook (see the model sketch in
    # the "Modeling approaches" section below).
    obs = pm.Poisson("obs", lambda_, observed=data)
print(obs.tag.test_value)

Modeling approaches¶
A good starting point for Bayesian modeling is to think about how your data might have been generated. Put yourself in an omniscient position, and try to imagine how you would recreate the dataset.
Ask yourself how your observations may have been generated:
1. Start by thinking "what is the best random variable to describe this data?" A Poisson random variable is a good candidate because it can represent count data, so we model the number of SMS messages received as sampled from a Poisson distribution.

2. Next, "Ok, assuming the SMS counts are Poisson-distributed, what does the Poisson distribution need?" Well, the Poisson distribution has a parameter $\lambda$.

3. Do you know $\lambda$? No. In fact, you have a suspicion that there are two $\lambda$ values, one for the earlier behaviour and one for the later behaviour. We don't know when the behaviour switches, but call the switchpoint $\tau$.

4. What is a good distribution for the two $\lambda$s? The exponential is good, as it assigns probabilities to positive real numbers. The exponential distribution has a parameter too; call it $\alpha$.

5. Do you know what the parameter $\alpha$ might be? No. At this point, you could continue and assign a distribution to $\alpha$, but it's better to stop once you reach a set level of ignorance: whereas you have a prior belief about $\lambda$ ("it probably changes over time", "it's likely between 10 and 30", etc.), you don't really have any strong beliefs about $\alpha$. So it's best to stop here.

6. What is a good value for $\alpha$ then? You think the $\lambda$s are between 10 and 30, so setting $\alpha$ really low (which puts more probability on large $\lambda$ values) would not reflect your prior well; similarly, a too-high $\alpha$ misses your prior belief as well. A good way to reflect your belief is to set $\alpha$ so that the mean of $\lambda$, given $\alpha$, equals your observed mean.

7. You have no expert opinion about when $\tau$ might have occurred, so you will suppose $\tau$ is drawn from a discrete uniform distribution over the entire timespan.


Below is a graphical visualization of the above reasoning, where arrows denote parent-child relationships between the variables (figure produced with the Daft Python library).

PyMC3, and other probabilistic programming languages, have been designed to tell these data-generation stories.
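Putting the seven steps above together, a minimal PyMC3 sketch of the text-message model might look like the following. The count_data array below is made-up placeholder data; in the original example it is loaded from a file of daily SMS counts.
In [ ]:
import numpy as np
import pymc3 as pm

# Hypothetical daily SMS counts standing in for the real dataset.
count_data = np.array([13, 24, 8, 24, 7, 35, 14, 11, 15, 11, 22, 22, 11, 57])
n_count_data = len(count_data)

with pm.Model() as sms_model:
    alpha = 1.0 / count_data.mean()                   # step 6: match the observed mean
    lambda_1 = pm.Exponential("lambda_1", alpha)      # step 4: rate before the switch
    lambda_2 = pm.Exponential("lambda_2", alpha)      # step 4: rate after the switch
    tau = pm.DiscreteUniform("tau", lower=0,
                             upper=n_count_data - 1)  # step 7: uniform switchpoint

    # Use lambda_1 before tau and lambda_2 afterwards.
    idx = np.arange(n_count_data)
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

    # step 1: the observed counts are modelled as Poisson draws.
    observation = pm.Poisson("obs", lambda_, observed=count_data)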
The Philosophy of Bayesian Inference¶
You are a skilled programmer, but bugs still slip into your code. After a particularly difficult implementation of an algorithm, you decide to test your code on a trivial example. It passes. You test the code on a harder problem. It passes once again. And it passes the next, even more difficult, test too! You are starting to believe that there may be no bugs in this code…

A/B testing¶
A/B testing is a statistical design pattern for determining the difference in effectiveness between two different treatments.
For example: front-end web developers are interested in which design of their website yields more conversions. They will route some fraction of visitors to site A, and the other fraction to site B, and record whether each visit yielded a sale or not. The data is recorded (in real time) and analyzed afterwards.
Often, the post-experiment analysis is done using a classical hypothesis test, such as a difference-of-means or difference-of-proportions test with its associated "p-values". If those derivations have never felt intuitive to you, the Bayesian approach may be a natural alternative.

A Simple Case¶
As this workshop is adapted from a hacker book, we'll continue with the web-dev example. For the moment, we will focus on the analysis of site A only. Assume that there is some true probability $0 \lt p_A \lt 1$ that users shown site A eventually purchase from the site. This is the true effectiveness of site A. Currently, this quantity is unknown to us.
Suppose site A was shown to $N$ people, and $n$ people purchased from the site. One might conclude hastily that $p_A = \frac{n}{N}$. Unfortunately, the observed frequency $\frac{n}{N}$ does not necessarily equal $p_A$ — there is a difference between the observed frequency and the true frequency of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is $\frac{1}{6}$. Knowing the true frequency of events like:
• fraction of users who make purchases,
• frequency of social attributes,
• percent of internet users with cats etc.
are common requests we ask of Nature. Unfortunately, often Nature hides the true frequency from us and we must infer it from observed data.
The observed frequency is then the frequency we actually observe: say you roll the die 100 times and observe 20 rolls of 1. The observed frequency, 0.2, differs from the true frequency, $\frac{1}{6}$. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and the observed data.
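As a quick illustration of this gap, the snippet below simulates 100 rolls of a fair die; the observed frequency of rolling a 1 will generally differ from the true frequency of $\frac{1}{6}$ (a sketch using NumPy; the exact numbers vary from run to run):
In [ ]:
import numpy as np

rolls = np.random.randint(1, 7, size=100)   # 100 rolls of a fair six-sided die
observed_freq = np.mean(rolls == 1)         # observed frequency of rolling a 1

print("observed frequency of 1:", observed_freq)
print("true frequency of 1:    ", 1 / 6)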
With respect to our A/B example, we are interested in using what we know, $N$ (the total trials administered) and $n$ (the number of conversions), to estimate what $p_A$, the true frequency of buyers, might be.
To set up a Bayesian model, we need to assign prior distributions to our unknown quantities. A priori, what do we think $p_A$ might be? For this example, we have no strong conviction about $p_A$, so for now, let's assume $p_A$ is uniform over [0,1]:
In [ ]:
import pymc3 as pm

# The parameters are the bounds of the Uniform.
with pm.Model() as model:
    p = pm.Uniform('p', lower=0, upper=1)

There are no stronger beliefs expressed in the prior distribution.
In [ ]:
#set constants
p_true = 0.05 # remember, this is unknown.
N = 1500

# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = stats.bernoulli.rvs(p_true, size=N)

print(occurrences) # Remember: Python treats True == 1, and False == 0
print(np.sum(occurrences))

The observed frequency is:
In [ ]:
# occurrences.mean() is equal to n/N.
print("What is the observed frequency in Group A? %.4f" % np.mean(occurrences))
print("Does this equal the true frequency? %s" % (np.mean(occurrences) == p_true))

We combine the observations into the PyMC3 observed variable, and run our inference algorithm:
In [ ]:
# include the observations, which are Bernoulli
with model:
    obs = pm.Bernoulli("obs", p, observed=occurrences)
    step = pm.Metropolis()
    trace = pm.sample(18000, step=step)
    burned_trace = trace[1000:]

We plot the posterior distribution of the unknown $p_A$ below:
In [ ]:
figsize(12.5, 4)
plt.title("Posterior distribution of $p_A$, the true effectiveness of site A")
plt.vlines(p_true, 0, 90, linestyle="--", label="true $p_A$ (unknown)")
plt.hist(burned_trace["p"], bins=25, histtype="stepfilled", density=True)
plt.legend();

Our posterior distribution puts most weight near the true value of $p_A$, but also places some weight in the tails. This is a measure of how uncertain we should be, given our observations. Try changing the number of observations, N, and observe how the posterior distribution changes.
A and B Together¶
A similar analysis can be done for site B's response data to determine the analogous $p_B$. But what we are really interested in is the difference between $p_A$ and $p_B$. Let's infer $p_A$, $p_B$, and $\text{delta} = p_A - p_B$ all at once. We can do this using PyMC3's deterministic variables. (For this exercise we'll assume $p_B = 0.04$, so $\text{delta} = 0.01$, and use $N_B = 10000$, which is less than $N_A = 15000$; we will simulate site B's data the same way we simulated site A's data.)
In [ ]:
import pymc3 as pm
figsize(12, 4)

# these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04

# notice the unequal sample sizes -- no problem in Bayesian analysis.
N_A = 15000
N_B = 10000

# generate some observations
observations_A = stats.bernoulli.rvs(true_p_A, size=N_A)
observations_B = stats.bernoulli.rvs(true_p_B, size=N_B)
print("Obs from Site A: ", observations_A[:30], "...")
print("Obs from Site B: ", observations_B[:30], "...")

Suppose you apply the two designs, A and B, in Google Optimizer and observe 15000 visitors to site A and 10000 visitors to site B, as the code above specifies. Here are the conversion rates you observed.
In [ ]:
print("Conversion rate in Site A version =", np.mean(observations_A))
print("Conversion rate in Site B version =", np.mean(observations_B))

Q1 Conduct a simple classical A/B test at a 5% significance level. Assume that your company will stay with the conventional design A (which is used to attract existing customers) unless there is a 15% increase in conversion rate. After the test, what is the decision? What sample size is required if the power of the test must be 80%?¶
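One possible starting point for Q1 (a sketch only, assuming the statsmodels package is available; how exactly the 15% increase threshold enters your hypotheses, and the baseline rate you plug into the power calculation, are part of the exercise):
In [ ]:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Two-sample, one-sided z-test of H1: p_B > p_A at the 5% level.
successes = np.array([observations_B.sum(), observations_A.sum()])
trials = np.array([len(observations_B), len(observations_A)])
z_stat, p_value = proportions_ztest(successes, trials, alternative="larger")
print("z = %.3f, p-value = %.4f" % (z_stat, p_value))

# Required sample size per group for 80% power at the 5% level, if the smallest
# effect of interest is a 15% relative lift over a baseline rate of 0.05.
effect = proportion_effectsize(0.05 * 1.15, 0.05)
n_required = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                          power=0.8, alternative="larger")
print("required sample size per group: %d" % np.ceil(n_required))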
In [ ]:
# Set up the pymc3 model. Again assume Uniform priors for p_A and p_B.
with pm.Model() as model:
    p_A = pm.Uniform("p_A", 0, 1)
    p_B = pm.Uniform("p_B", 0, 1)

    # Define the deterministic delta function. This is our unknown of interest.
    delta = pm.Deterministic("delta", p_A - p_B)

    # Set of observations; in this case we have two observation datasets.
    obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
    obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)

    # To be explained in chapter 3.
    step = pm.Metropolis()
    trace = pm.sample(20000, step=step)
    burned_trace = trace[1000:]

Below we plot the posterior distributions for the three unknowns:
In [ ]:
p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]
In [ ]:
figsize(12.5, 10)

# histogram of posteriors

ax = plt.subplot(311)

plt.xlim(0, .1)
plt.hist(p_A_samples, histtype="stepfilled", bins=25, alpha=0.85,
         label="posterior of $p_A$", color="#A60628", density=True)
plt.vlines(true_p_A, 0, 80, linestyle="--", label="true $p_A$ (unknown)")
plt.legend(loc="upper right")
plt.title("Posterior distributions of $p_A$, $p_B$, and delta unknowns")

ax = plt.subplot(312)

plt.xlim(0, .1)
plt.hist(p_B_samples, histtype="stepfilled", bins=25, alpha=0.85,
         label="posterior of $p_B$", color="#467821", density=True)
plt.vlines(true_p_B, 0, 80, linestyle="--", label="true $p_B$ (unknown)")
plt.legend(loc="upper right")

ax = plt.subplot(313)
plt.hist(delta_samples, histtype="stepfilled", bins=30, alpha=0.85,
         label="posterior of delta", color="#7A68A6", density=True)
plt.vlines(true_p_A - true_p_B, 0, 60, linestyle="--",
           label="true delta (unknown)")
plt.vlines(0, 0, 60, color="black", alpha=0.2)
plt.legend(loc="upper right");

Notice that as a result of N_B < N_A, i.e. we have less data from site B, our posterior distribution of $p_B$ is fatter, implying we are less certain about the true value of $p_B$ than we are of $p_A$.
With respect to the posterior distribution of $\text{delta}$, we can see that the majority of the distribution is above $\text{delta}=0$, implying that site A's response is likely better than site B's response. The probability this inference is incorrect is easily computable:
In [ ]:
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, representing the probability that site A is worse than site B.
print("Probability site A is WORSE than site B: %.3f" % \
    np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % \
    np.mean(delta_samples > 0))
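
Because the posterior is represented by samples, other decision-relevant quantities are just as easy to estimate. For example, a sketch of how Q1's "15% increase" rule could be checked against the posterior:
In [ ]:
# Posterior probability that B's conversion rate beats A's by at least 15%
# (relative lift). With the simulated data this should be close to zero,
# since site A actually converts better.
rel_lift = (p_B_samples - p_A_samples) / p_A_samples
print("P(B beats A by >= 15%%): %.3f" % np.mean(rel_lift > 0.15))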

Q2 Compare the Bayesian test and the simple classical A/B test: what is your conclusion? Would the conclusion be affected if you had a different prior belief about the distribution of the true conversion rate of Design A and the true conversion rate of Design B?¶
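To experiment with the second part of Q2, one way to encode a different prior belief is to replace the Uniform(0, 1) priors with Beta priors. The Beta(2, 38) parameters below are purely illustrative (they give a prior mean of 0.05); this is a sketch, not the required answer:
In [ ]:
with pm.Model() as informative_model:
    # Informative priors centred near 5% conversion; illustrative values only.
    p_A = pm.Beta("p_A", alpha=2, beta=38)
    p_B = pm.Beta("p_B", alpha=2, beta=38)
    delta = pm.Deterministic("delta", p_A - p_B)

    obs_A = pm.Bernoulli("obs_A", p_A, observed=observations_A)
    obs_B = pm.Bernoulli("obs_B", p_B, observed=observations_B)

    step = pm.Metropolis()
    trace = pm.sample(20000, step=step)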

THE END