import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore')
import re          #needed for the column renaming with re.sub below
import pandas as pd
1 Read data¶
df = pd.read_csv(r"YouTube10k.csv", usecols=[0]+[3]+[i for i in range(6,13)])  #keep columns 0, 3 and 6-12
df = df.dropna(how='any', axis=0)  #drop rows with any missing value
country category_id views likes/views dislikes/views comment_count/views comments_disabled ratings_disabled video_error_or_removed
0 KR 17 3911688 0.014060682 0.0004801 0.001771614 FALSE FALSE FALSE
2 DE 10 436172 0.003677448 0.00079785 0.00019717 FALSE FALSE FALSE
3 DE 28 536117 0.035596707 0.00211521 0.016278163 FALSE FALSE FALSE
4 GB 10 159436 0.032853308 0.001273238 0.00172483 FALSE FALSE FALSE
5 IN 24 35270 0.014856819 0.00104905 0.000992345 FALSE FALSE FALSE
df.rename(columns=lambda x: re.sub(r'/', '_', x), inplace=True)  #rename: likes/views -> likes_views
country category_id views likes_views dislikes_views comment_count_views comments_disabled ratings_disabled video_error_or_removed
0 KR 17 3911688 0.014060682 0.0004801 0.001771614 FALSE FALSE FALSE
2 DE 10 436172 0.003677448 0.00079785 0.00019717 FALSE FALSE FALSE
3 DE 28 536117 0.035596707 0.00211521 0.016278163 FALSE FALSE FALSE
4 GB 10 159436 0.032853308 0.001273238 0.00172483 FALSE FALSE FALSE
5 IN 24 35270 0.014856819 0.00104905 0.000992345 FALSE FALSE FALSE
2 Variable transformations¶
#build a country_table that maps each country string to a numeric code
country_table = dict(zip(set(df['country']), range(1, len(set(df['country']))+1)))
df['country'] = df['country'].apply(lambda x: country_table[x])
country category_id views likes_views dislikes_views comment_count_views comments_disabled ratings_disabled video_error_or_removed
0 9 17 3911688 0.014060682 0.0004801 0.001771614 FALSE FALSE FALSE
2 3 10 436172 0.003677448 0.00079785 0.00019717 FALSE FALSE FALSE
3 3 28 536117 0.035596707 0.00211521 0.016278163 FALSE FALSE FALSE
4 5 10 159436 0.032853308 0.001273238 0.00172483 FALSE FALSE FALSE
5 4 24 35270 0.014856819 0.00104905 0.000992345 FALSE FALSE FALSE
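Note: building country_table from set(df['country']) assigns the codes in an arbitrary order, so the country-to-number mapping can change between runs (Python string hashing is randomised). A minimal, reproducible sketch that could replace the cell above (run on the raw data, before 'country' is overwritten; same variable names assumed) is:
#deterministic country encoding: codes follow the sorted country labels
country_table = {c: i for i, c in enumerate(sorted(df['country'].unique()), start=1)}
df['country'] = df['country'].map(country_table)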
#the 'category_id' column is not clean: some rows contain non-numeric data
df['category_id'].value_counts()
24 2650
22 1215
10 1150
25 855
23 652
1 507
17 504
26 469
20 272
28 214
27 183
2 123
15 98
29 72
19 34
43 31
– MotorSport” 1
MDDTV Divulgation 1
Lacrim 1
Brytiago – Bipolar” 1
SBI CLERK/IBPS 1
2017″ 1
Anitta – Machika” 1
Cracks MX 1
A Plus Entertainment 1
confirman prosticatálogo 1
zeetv 1
황장수의 뉴스브리핑o 1
Эдуард Адамян 1
The View 1
НОВЫЕ ПРИВОЗЫ ПОЕХАЛИ 1
EsMiHit AleDolores 1
NBA on TNT 1
NEWSONE 1
France 3 Bourgogne-Franche-Comté 1
Hindustan Times 1
Hidden Messages” 1
TÉLÉ NEWS 1
Hoodrich Pablo & Ness)” 1
. Wetter 1
Washington Post 1
SET India 1
но остался девственным” 1
러너 꽃빈TV 1
FIRST TEAM 1
30 1
The Death of Stalin 1
candidats de Nouvelle Star 1
broudge03 1
Cort & Sharmita!” 1
2018″ 1
VikatanTV 1
Name: category_id, dtype: int64
#handle the abnormal data: keep only rows whose category_id is purely numeric
df = df[df['category_id'].apply(lambda x: x.isnumeric())]
country category_id views likes_views dislikes_views comment_count_views comments_disabled ratings_disabled video_error_or_removed
0 9 17 3911688 0.014060682 0.0004801 0.001771614 FALSE FALSE FALSE
2 3 10 436172 0.003677448 0.00079785 0.00019717 FALSE FALSE FALSE
3 3 28 536117 0.035596707 0.00211521 0.016278163 FALSE FALSE FALSE
4 5 10 159436 0.032853308 0.001273238 0.00172483 FALSE FALSE FALSE
5 4 24 35270 0.014856819 0.00104905 0.000992345 FALSE FALSE FALSE
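An alternative clean-up sketch (not what this notebook does): pandas' own numeric coercion turns the malformed category_id entries into NaN, which can then be dropped. This assumes it is run in place of the filtering cell above:
#alternative: coerce category_id to numbers and drop the rows that fail to parse
df['category_id'] = pd.to_numeric(df['category_id'], errors='coerce')
df = df.dropna(subset=['category_id'])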
#numerical transformation: convert the numeric columns to 'float32'
df[df.columns[1:-3]] = df[df.columns[1:-3]].astype('float32')
country category_id views likes_views dislikes_views comment_count_views comments_disabled ratings_disabled video_error_or_removed
0 9 17.0 3911688.0 0.014061 0.000480 0.001772 FALSE FALSE FALSE
2 3 10.0 436172.0 0.003677 0.000798 0.000197 FALSE FALSE FALSE
3 3 28.0 536117.0 0.035597 0.002115 0.016278 FALSE FALSE FALSE
4 5 10.0 159436.0 0.032853 0.001273 0.001725 FALSE FALSE FALSE
5 4 24.0 35270.0 0.014857 0.001049 0.000992 FALSE FALSE FALSE
3 Categorical variables¶
#one-hot encode the remaining categorical (boolean) columns
df = pd.get_dummies(df)
country category_id views likes_views dislikes_views comment_count_views comments_disabled_FALSE comments_disabled_TRUE ratings_disabled_FALSE ratings_disabled_TRUE video_error_or_removed_FALSE video_error_or_removed_TRUE
0 9 17.0 3911688.0 0.014061 0.000480 0.001772 1 0 1 0 1 0
2 3 10.0 436172.0 0.003677 0.000798 0.000197 1 0 1 0 1 0
3 3 28.0 536117.0 0.035597 0.002115 0.016278 1 0 1 0 1 0
4 5 10.0 159436.0 0.032853 0.001273 0.001725 1 0 1 0 1 0
5 4 24.0 35270.0 0.014857 0.001049 0.000992 1 0 1 0 1 0
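Because get_dummies encodes every boolean flag as a _FALSE/_TRUE pair, each pair sums to 1 and is perfectly collinear, which contributes to the very large condition number reported in the regression summary further below. A hedged sketch of a more compact encoding that could replace the cell above is:
#keep only one dummy per original categorical column (drops the first level of each)
df = pd.get_dummies(df, drop_first=True)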
#Statistical description
df.describe()
country category_id views likes_views dislikes_views comment_count_views comments_disabled_FALSE comments_disabled_TRUE ratings_disabled_FALSE ratings_disabled_TRUE video_error_or_removed_FALSE video_error_or_removed_TRUE
count 9030.000000 9030.000000 9030.0 9030.000000 9030.000000 9030.000000 9030.000000 9030.000000 9030.000000 9030.000000 9030.000000 9030.000000
mean 4.931451 20.116943 1531658.0 0.036861 0.002028 0.005626 0.979291 0.020709 0.982614 0.017386 0.999114 0.000886
std 2.571872 7.300126 7766985.5 0.038326 0.006381 0.011549 0.142415 0.142415 0.130714 0.130714 0.029753 0.029753
min 1.000000 1.000000 711.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.000000 17.000000 50807.5 0.008653 0.000563 0.001197 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000
50% 5.000000 23.000000 197121.5 0.024425 0.000988 0.002997 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000
75% 7.000000 24.000000 727495.0 0.053765 0.001838 0.006594 1.000000 0.000000 1.000000 0.000000 1.000000 0.000000
max 10.000000 43.000000 337621568.0 0.700424 0.256352 0.541490 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
4 Variable selection¶
#Variable selection
import statsmodels.api as sm
from statsmodels.formula.api import ols
yvar = r"likes_views"                    #target variable
cn = list(df)                            #columns of the data
modeleq = yvar + ' ~'                    #initialise the model formula
for xvar in (
        'country',
        'category_id',
        'dislikes_views',
        'comment_count_views',
        'comments_disabled_FALSE',
        'comments_disabled_TRUE',
        'ratings_disabled_FALSE',
        'ratings_disabled_TRUE',
        'video_error_or_removed_FALSE',
        'video_error_or_removed_TRUE'):
    if xvar in cn:
        if modeleq[-1] == '~':
            modeleq = modeleq + ' ' + xvar
        else:
            modeleq = modeleq + ' + ' + xvar
if modeleq == yvar + ' ~':
    #none of the listed columns is in df; build the equation using all X columns
    cn.insert(0, cn.pop(cn.index(yvar)))
    df = df[cn]
    modeleq = ' + '.join(cn).replace('+', '~', 1)
bmodeleq = modeleq                       #best model so far
print('\n' + bmodeleq)
if modeleq.find(' + ') != -1:
    #eliminate X variables one by one:
    print('\nVariable Selection using |t-stats| & PR(>F) (or Adj R²):')
    #initialise the F-stat p-value & adjusted R²:
    #set to infinity (alternatives: sys.maxsize, or the float limits 1.7976931348623157e+308 / 2.2250738585072014e-308):
    minfpv = float('inf')                #min F-stat p-value found so far
    maxadjR2 = -minfpv                   #max Adj R² found so far
    #machine learns:
    while True:
        hout = ols(modeleq, df).fit()
        #print(dir(hout))  #lists all attributes of .fit(), e.g. .fvalue & .f_pvalue
        fpv = hout.f_pvalue
        ar2 = hout.rsquared_adj
        #see if a better model (smaller F-stat p-value) has been found:
        if fpv < minfpv:
            minfpv = fpv
            maxadjR2 = ar2
            bmodeleq = modeleq
        elif fpv == 0.0:
            #resolve ties using the adjusted R²:
            if ar2 >= maxadjR2:
                minfpv = fpv
                maxadjR2 = ar2
                bmodeleq = modeleq
        numx = modeleq.count(' + ')      #(number of Xs) - 1
        print('\nF-statistic =', hout.fvalue, ' ',
              ('Adj R² = ' + str(ar2) if fpv == 0.0 else 'PR(>F) = ' + str(fpv)),
              'for', numx + 1, 'Xs.')
        if modeleq.find(' + ') == -1:
            #only 1 xvar left; nothing to drop
            #adjusted R² for no xvar (the fit is y-bar) is 0; consider stopping if adjusted R² < 0 for 1 xvar
            break                        #quit the while loop
        #identify the X variable to drop: the one with the largest t-stat p-value (equivalently, PR(>F)):
        #the ANOVA table is needed when categorical variables are present
        prf = sm.stats.anova_lm(hout, typ=2).iloc[:-1, :].sort_values(['F']
              ).sort_values(['df'], ascending=False)['PR(>F)']
        maxp = max(prf)
        #print('\n', dict(prf))
        xdrop = prf[maxp == prf].axes[0][0]  #1st element of the row labels .axes[0]; prefers a categorical variable
        #remove xdrop from the model equation:
        if modeleq.find('~ ' + xdrop + ' + ') != -1:    #xdrop is the 1st x
            modeleq = modeleq.replace('~ ' + xdrop + ' + ', '~ ')
        elif modeleq.find('+ ' + xdrop + ' + ') != -1:  #xdrop is a middle x
            modeleq = modeleq.replace('+ ' + xdrop + ' + ', '+ ')
        else:                                           #xdrop is the last x
            modeleq = modeleq[:-len(xdrop) - 3]
        #print('Model equation:', modeleq, '\n')
        #print(prf)
        print('Variable to drop:', xdrop, ' p-value =', prf[xdrop])
    #end of while
    #machine learnt
    print('Variable left:', prf.loc[~prf.index.isin([xdrop])].axes[0][0])
    print('\nBest model equation:', bmodeleq)
    print('\n' + ('Maximum Adj R² = ' + str(maxadjR2) if minfpv == 0.0 else 'Minimum PR(>F) = ' + str(minfpv)),
          'for', bmodeleq.count(' + ') + 1, 'Xs.\n')
likes_views ~ country + category_id + views + dislikes_views + comment_count_views + comments_disabled_FALSE + comments_disabled_TRUE + ratings_disabled_FALSE + ratings_disabled_TRUE + video_error_or_removed_FALSE + video_error_or_removed_TRUE
Variable Selection using |t-stats| & PR(>F) (or Adj R²):
F-statistic = 362.6731359857049 Adj R² = 0.24268508914593045 for 11 Xs.
Variable to drop: comments_disabled_TRUE p-value = 0.6956129891857935
F-statistic = 362.6731359857054 Adj R² = 0.2426850891459308 for 10 Xs.
Variable to drop: video_error_or_removed_TRUE p-value = 0.5512280630369499
F-statistic = 362.6731359857057 Adj R² = 0.2426850891459309 for 9 Xs.
Variable to drop: video_error_or_removed_FALSE p-value = 0.9641121315118777
F-statistic = 414.52914821852534 Adj R² = 0.2427688601141592 for 8 Xs.
Variable to drop: comments_disabled_FALSE p-value = 0.014062897981056751
F-statistic = 482.34288264648353 Adj R² = 0.2423465142588208 for 7 Xs.
Variable to drop: dislikes_views p-value = 0.0012527418871353918
F-statistic = 576.1267156080957 Adj R² = 0.24155582439302692 for 6 Xs.
Variable to drop: views p-value = 0.0003446174739680706
F-statistic = 716.015420096383 Adj R² = 0.2405624394315352 for 5 Xs.
Variable to drop: category_id p-value = 8.232159433477303e-05
F-statistic = 947.9895781663021 Adj R² = 0.23934087764207057 for 4 Xs.
Variable to drop: ratings_disabled_TRUE p-value = 1.162826090078912e-10
F-statistic = 947.9895781663015 Adj R² = 0.23934087764207046 for 3 Xs.
Variable to drop: country p-value = 4.4303681572209554e-16
F-statistic = 1378.8724648377004 Adj R² = 0.2338400148768689 for 2 Xs.
Variable to drop: ratings_disabled_FALSE p-value = 9.165878024689756e-39
F-statistic = 2538.7014810736096 Adj R² = 0.21939716220964178 for 1 Xs.
Variable left: comment_count_views
Best model equation: likes_views ~ country + category_id + views + dislikes_views + comment_count_views + comments_disabled_FALSE + ratings_disabled_FALSE + ratings_disabled_TRUE
Maximum Adj R² = 0.2427688601141592 for 8 Xs.
o = ols(bmodeleq, df).fit()              #refit the best model; output from .fit()
#any categorical variable first, then in descending order of F-stat; but ultimately ascending order of PR(>F):
hlm = sm.stats.anova_lm(o, typ=2).sort_values(['df', 'F'], ascending=False).sort_values(['PR(>F)'])
last = sum(hlm['df'][:-1] == 1.0)        #number of bottom t-stats (numeric Xs) to display with more precision
if len(hlm) > last + 1:
    #print the coefficient table:
    print(hlm.replace(float('nan'), ''), '\n')
#print the ANOVA table, and more:
print(o.summary2())
#construct and print the numeric variables in order of importance:
#the p-values are the same as PR(>F) from .anova_lm typ=2 & typ=3
print('\n' + str(last) + (' quantitative' if len(hlm) > last + 1 else ''), "X-coefficients' |t-stats| ranked:")
nxvar = len(o.tvalues)
r = range(nxvar - last, nxvar)
i = o.params[r].index
d = pd.concat([pd.Series(i, index=i), o.params[r], o.tvalues[r], o.pvalues[r]], axis=1
    ).sort_values(2, key=abs, ascending=False)
d.columns = ['', 'Coefficient', 't-stat', 'P>|t-stat|']
d.index = range(1, len(d) + 1)
d                                        #display the ranked table (shown below)
Results: Ordinary least squares
=======================================================================
Model: OLS Adj. R-squared: 0.243
Dependent Variable: likes_views AIC: -35782.3046
Date: 2022-03-03 10:44 BIC: -35725.4381
No. Observations: 9030 Log-Likelihood: 17899.
Df Model: 7 F-statistic: 414.5
Df Residuals: 9022 Prob (F-statistic): 0.00
R-squared: 0.243 Scale: 0.0011123
———————————————————————–
Coef. Std.Err. t P>|t| [0.025 0.975]
———————————————————————–
Intercept 0.0109 0.0019 5.7921 0.0000 0.0072 0.0146
country -0.0011 0.0001 -8.3176 0.0000 -0.0014 -0.0009
category_id -0.0002 0.0000 -4.4615 0.0000 -0.0003 -0.0001
views -0.0000 0.0000 -3.5944 0.0003 -0.0000 -0.0000
dislikes_views 0.1853 0.0558 3.3181 0.0009 0.0758 0.2948
comment_count_views 1.5318 0.0309 49.5720 0.0000 1.4712 1.5924
comments_disabled_FALSE 0.0064 0.0026 2.4561 0.0141 0.0013 0.0115
ratings_disabled_FALSE 0.0215 0.0015 13.9864 0.0000 0.0185 0.0245
ratings_disabled_TRUE -0.0106 0.0019 -5.7321 0.0000 -0.0143 -0.0070
———————————————————————–
Omnibus: 4585.133 Durbin-Watson: 2.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 433490.633
Skew: 1.516 Prob(JB): 0.000
Kurtosis: 36.807 Condition No.: 9575461379507346210816
=======================================================================
* The condition number is large (1e+22). This might indicate
strong multicollinearity or other numerical problems.
8 X-coefficients' |t-stats| ranked:
Coefficient t-stat P>|t-stat|
1 comment_count_views 1.531793e+00 49.571983 0.000000e+00
2 ratings_disabled_FALSE 2.150027e-02 13.986351 5.429200e-44
3 country -1.137474e-03 -8.317626 1.027904e-16
4 ratings_disabled_TRUE -1.062067e-02 -5.732083 1.023936e-08
5 category_id -2.173810e-04 -4.461533 8.236153e-06
6 views -1.642277e-10 -3.594415 3.268559e-04
7 dislikes_views 1.853009e-01 3.318079 9.099747e-04
8 comments_disabled_FALSE 6.413187e-03 2.456131 1.406290e-02
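Given the large-condition-number warning in the summary above, a common follow-up is to compute variance inflation factors for the fitted design matrix. This is a minimal sketch (not part of the original notebook), assuming the fitted result o from the cell above:
#variance inflation factors of the best model's design matrix (includes the intercept)
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = o.model.exog
vif = pd.Series([variance_inflation_factor(X, j) for j in range(X.shape[1])],
                index=o.model.exog_names, name='VIF')
print(vif)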