L2 – BUC
Introduction¶
In this notebook, we implement a simpler version of the BUC (Bottom up cubing) algorithm.
The original BUC algorithm is a highly optimized and practically efficient algorithm to materialize the entire data cube given a fact table. We instead experiment with a simpler version, called BUC_rec (BUC recursive), based on our teaching slides. The implementation here emphasize more on the conceptual simplicity rather than efficiency.
Once you understand BUC_rec, you may try to understand BUC if you want to learn more by yourself.
The recursive formula used in BUC_rec is:
$$
Cube(R, \{A,B,C,\ldots,M\}) = \bigcup_{i \in \{1, 2, \ldots, m, \text{ALL}\}} Cube(R_i, \{B,C,\ldots,M\})
$$where $R_i$ is $\pi_{B,C,\ldots,M}(\sigma_{A = a_i}(R))$ and $R_{\text{ALL}}$ is $\pi_{B,C,\ldots,M}(R)$.
In [1]:
import sys
import pandas as pd
import numpy as np
ALL = -1
# DEBUG = True
DEBUG = False
##============================================================
# Data file format:
# * tab-delimited input file
# * 1st line: dimension names and the last dimension is assumed to be the measure
# * rest of the lines: data values.
def read_data(filename):
df = pd.read_csv(filename, sep=’\t’)
dims = df.shape[1] – 1 # the last dim is the measure
return (df, dims)
def dump_input2(input):
if DEBUG:
print(“\n.. BUC_rec invoked on:”)
print(input)
print(“………………….\n”)
# helper functions
def project_data(input, d):
# Return only the d-th column of INPUT
return input.iloc[:, d]
def select_data(input, d, val):
# SELECT * FROM INPUT WHERE INPUT.d = VAL
col_name = input.columns[d]
return input[input[col_name] == val]
def remove_first_dim(input):
# Remove the first dim of the input
return input.iloc[:, 1:]
def slice_data_dim0(input, v):
# syntactic sugar to get R_{ALL} in a less verbose way
df_temp = select_data(input, 0, v)
return remove_first_dim(df_temp)
def output(val):
print(‘=>\t{}’.format(val))
In [2]:
data, d = read_data(‘./asset/a_.txt’)
print(d)
data
2
Out[2]:
A B M
0 1 1 20
1 2 1 50
2 1 2 30
3 1 3 40
In [3]:
project_data(data, 0)
Out[3]:
0 1
1 2
2 1
3 1
Name: A, dtype: int64
In [4]:
select_data(data, 1, 2)
Out[4]:
A B M
2 1 2 30
In [5]:
slice_data_dim0(data, 2)
Out[5]:
B M
1 1 50
Now we can implement the buc_rec() algorithm and test it.
In [24]:
# Note that input is a DataFrame
def buc_rec_help(input, result, cur):
# Note that input is a DataFrame
dump_input2(input)
dims = input.shape[1]
if dims == 1:
# only the measure dim
input_sum = sum( project_data(input, 0) )
output(input_sum)
cur.append(input_sum)
result.append(cur)
else:
# the general case
dim0_vals = set(project_data(input, 0).values)
for dim0_v in dim0_vals:
sub_data = slice_data_dim0(input, dim0_v)
print(dim0_v)
buc_rec_help(sub_data, result, cur + [dim0_v] )
## for R_{ALL}
# print(“ALL”)
#cur.append(“ALL”)
sub_data = remove_first_dim(input)
buc_rec_help(sub_data, result, cur + [“ALL”])
def buc_rec_my(input):
result = []
buc_rec_help(input, result, [])
return pd.DataFrame(result, columns=input.columns)
buc_rec_my(data)
.. BUC_rec invoked on:
A B M
0 1 2 100
1 2 1 20
………………….
1
.. BUC_rec invoked on:
B M
0 2 100
………………….
2
.. BUC_rec invoked on:
M
0 100
………………….
=> 100
.. BUC_rec invoked on:
M
0 100
………………….
=> 100
2
.. BUC_rec invoked on:
B M
1 1 20
………………….
1
.. BUC_rec invoked on:
M
1 20
………………….
=> 20
.. BUC_rec invoked on:
M
1 20
………………….
=> 20
.. BUC_rec invoked on:
B M
0 2 100
1 1 20
………………….
1
.. BUC_rec invoked on:
M
1 20
………………….
=> 20
2
.. BUC_rec invoked on:
M
0 100
………………….
=> 100
.. BUC_rec invoked on:
M
0 100
1 20
………………….
=> 120
Out[24]:
A B M
0 1 2 100
1 1 ALL 100
2 2 1 20
3 2 ALL 20
4 ALL 1 20
5 ALL 2 100
6 ALL ALL 120
In [63]:
–
Out[63]:
A B M
0 1 2 100
1 1 ALL 100
2 ALL 2 100
3 ALL ALL 100
In [54]:
my.iloc[0]
Out[54]:
B 1
M 20
Name: 1, dtype: int64
In [43]:
data.ix[0]
Out[43]:
A 1
B 2
M 100
Name: 0, dtype: int64
In [5]:
# Note that input is a DataFrame
def buc_rec(input):
# Note that input is a DataFrame
dump_input2(input)
dims = input.shape[1]
if dims == 1:
# only the measure dim
input_sum = sum( project_data(input, 0) )
output(input_sum)
else:
# the general case
dim0_vals = set(project_data(input, 0).values)
for dim0_v in dim0_vals:
sub_data = slice_data_dim0(input, dim0_v)
print(dim0_v)
buc_rec(sub_data)
## for R_{ALL}
print(“ALL”)
sub_data = remove_first_dim(input)
buc_rec(sub_data)
In [13]:
data, d = read_data(‘./asset/a2.txt’)
print(d)
data
2
Out[13]:
A B M
0 1 2 100
1 2 1 20
In [16]:
DEBUG = True
buc_rec(data)
#select_data(data, 1, 2)
#buc_rec(select_data(data, 1, 2))
.. BUC_rec invoked on:
A B M
0 1 2 100
1 2 1 20
………………….
1
.. BUC_rec invoked on:
B M
0 2 100
………………….
2
.. BUC_rec invoked on:
M
0 100
………………….
=> 100
ALL
.. BUC_rec invoked on:
M
0 100
………………….
=> 100
2
.. BUC_rec invoked on:
B M
1 1 20
………………….
1
.. BUC_rec invoked on:
M
1 20
………………….
=> 20
ALL
.. BUC_rec invoked on:
M
1 20
………………….
=> 20
ALL
.. BUC_rec invoked on:
B M
0 2 100
1 1 20
………………….
1
.. BUC_rec invoked on:
M
1 20
………………….
=> 20
2
.. BUC_rec invoked on:
M
0 100
………………….
=> 100
ALL
.. BUC_rec invoked on:
M
0 100
1 20
………………….
=> 120
In [8]:
DEBUG = True
buc_rec(data)
.. BUC_rec invoked on:
A B M
0 1 1 20
1 2 1 50
2 1 2 30
3 1 3 40
………………….
1
.. BUC_rec invoked on:
B M
0 1 20
2 2 30
3 3 40
………………….
1
.. BUC_rec invoked on:
M
0 20
………………….
=> 20
2
.. BUC_rec invoked on:
M
2 30
………………….
=> 30
3
.. BUC_rec invoked on:
M
3 40
………………….
=> 40
ALL
.. BUC_rec invoked on:
M
0 20
2 30
3 40
………………….
=> 90
2
.. BUC_rec invoked on:
B M
1 1 50
………………….
1
.. BUC_rec invoked on:
M
1 50
………………….
=> 50
ALL
.. BUC_rec invoked on:
M
1 50
………………….
=> 50
ALL
.. BUC_rec invoked on:
B M
0 1 20
1 1 50
2 2 30
3 3 40
………………….
1
.. BUC_rec invoked on:
M
0 20
1 50
………………….
=> 70
2
.. BUC_rec invoked on:
M
2 30
………………….
=> 30
3
.. BUC_rec invoked on:
M
3 40
………………….
=> 40
ALL
.. BUC_rec invoked on:
M
0 20
1 50
2 30
3 40
………………….
=> 140
With the following pivot table, we can easily see the output is correct (i.e., all the (non-empty) aggregates are computed).
But did you notice anything else interesting?
In [8]:
data.pivot_table(index = [‘A’], columns = [‘B’], aggfunc = np.sum, margins = True)
Out[8]:
M
B 1 2 3 All
A
1 20.0 30.0 40.0 90.0
2 50.0 NaN NaN 50.0
All 70.0 30.0 40.0 140.0
Exercise¶
The above implementation is not complete in that the aggregated values are not associated with their “coordinates”. We would like to have sth like
1 1 => 20
1 2 => 30
1 3 => 40
1 * => 90
2 1 => 50
2 * => 50
* 1 => 70
* 2 => 30
* 3 => 40
* * => 140
Your task is to enhance the implementation of buc_rec so that you can generate the above more readable output.
In [ ]:
In [ ]:
In [ ]:
In [ ]: