Deadline + Late Penalty¶
Note: This project will take quite some time to complete, so we strongly recommend that you start working on it as early as possible.
• The submission deadline for the Project is 20:59:59 on 9th Aug 2020 (Sydney Time).
• LATE PENALTY: 10% on day-1 and 30% on each subsequent day.
Instructions¶
1. This notebook contains instructions for COMP9313 Project 2.
2. You are required to complete your implementation in the file submission.py provided along with this notebook.
3. You are not allowed to print out unnecessary output. We will not consider any output printed to the screen; all results should be returned in appropriate data structures via the corresponding functions.
4. You are required to submit the following files via CSE give:
• (i) submission.py (your code),
• (ii) report.pdf (illustrating your implementation details).
• Note: detailed submission instructions will be announced later.
5. We provide you with detailed instructions for the project in this notebook. In case of any problem, you can post your query on Piazza. Please do not post questions regarding the implementation details.
6. You are allowed to add other functions and/or import modules (you may have to for this project), but you are not allowed to define global variables. All the functions should be implemented in submission.py.
7. In this project, you may need to use the provided development dataset to evaluate the performance of your stacking model.
8. The testing environment is the same as that of Lab3. Note: importing other modules (not part of the Lab3 test environment) may lead to errors, which will result in a ZERO score for the ENTIRE project.
Task 1: Stacking Model (90 points)¶
In this task, you will implement several core parts of the stacking machine learning method in Pyspark. More specifically, you are required to complete a series of functions in the file submission.py with methods in PySpark SQL and PySpark MLlib. Details are listed as follows:
Dataset Description¶
1. The dataset consists of sentences from customer reviews of different restaurants. There are 2241, 800, and 800 customer reviews in the train, development, and test datasets, respectively. Note that each customer review contains at least one sentence, and a review may not end with punctuation such as . or ?.
2. The task is to identify the category of each customer review using the review text and the trained model.
3. The categories include:
• FOOD: reviews that involve comments on the food.
▪ e.g. “All the appetizers and salads were fabulous , the steak was mouth watering and the pasta was delicious”
• PAS: reviews that only involve comments on price, ambience, or service.
▪ e.g. “Now it ‘s so crowded and loud you ca n’t even talk to the person next to you”
• MISC: reviews that do not belong to the above categories, including general recommendations and sentences describing the reviewer’s personal experience or context, which do not usually provide information on the restaurant quality.
▪ e.g. “Your friends will thank you for introducing them to this gem!”
▪ e.g. “I knew upon visiting NYC that I wanted to try an original deli”
4. You can view samples from the dataset using dataset.show(5), which displays five samples with the descript column showing the review text and the category column showing the annotated class.
Task 1.1 (30 points): Build a Preprocessing Pipeline¶
In this task, you need to complete the base_features_gen_pipeline() function in submission.py, which outputs a pipeline (NOTE: not a pipeline model). The returned pipeline will be used to process the data, construct the feature vectors and labels.
More specifically, the function is defined as
def base_features_gen_pipeline(input_descript_col="descript", input_category_col="category", output_feature_col="features", output_label_col="label"):
The function needs to tokenize each customer review (i.e., the descript column) and generate bag-of-words count vectors as the features. It also needs to convert each category into a label, which is an integer between 0 and 2.
The returned type of this function should be pyspark.ml.pipeline.Pipeline.
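For illustration, a minimal sketch of such a pipeline is shown below, assuming a Tokenizer for tokenization, a CountVectorizer for the bag-of-words counts, and a StringIndexer to map category to label; this is a sketch under these assumptions, not the required implementation, and your choice of stages may differ:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer

def base_features_gen_pipeline(input_descript_col="descript", input_category_col="category", output_feature_col="features", output_label_col="label"):
    # split each review into tokens
    word_tokenizer = Tokenizer(inputCol=input_descript_col, outputCol="words")
    # bag-of-words count vectors over the tokens
    count_vectors = CountVectorizer(inputCol="words", outputCol=output_feature_col)
    # index the three category strings into labels 0.0, 1.0, 2.0
    label_maker = StringIndexer(inputCol=input_category_col, outputCol=output_label_col)
    return Pipeline(stages=[word_tokenizer, count_vectors, label_maker])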
Task 1.2 (30 points): Generate Meta Features for Training¶
In this task, you need to complete the gen_meta_features() function in submission.py, which outputs a dataframe with generated meta features for training the meta classifier.
More specifically, the function is defined as
def gen_meta_features(training_df, nb_0, nb_1, nb_2, svm_0, svm_1, svm_2):
The descriptions of the input parameters are as follows:
• training_df: the dataframe containing features, labels, and group ids for the training data. The schema of training_df is:
root
 |-- id: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)
 |-- label_0: double (nullable = false)
 |-- label_1: double (nullable = false)
 |-- label_2: double (nullable = false)
 |-- group: integer (nullable = true)
where features and label are generated using the pipeline built in Task 1.1. label_x is the binary label for class x (e.g., label_0==0 means that label!=0), and group is the group id as defined in the lecture slides (i.e., L7P45).
• nb_x: the predefined x-th Naive Bayes model (i.e., the one that will be trained using label_x)
• svm_x: the predefined x-th SVM model (i.e., the one that will be trained using label_x)
The output of the function is a dataframe with the following schema:
root
 |-- id: integer (nullable = true)
 |-- group: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)
 |-- label_0: double (nullable = false)
 |-- label_1: double (nullable = false)
 |-- label_2: double (nullable = false)
 |-- nb_pred_0: double (nullable = false)
 |-- nb_pred_1: double (nullable = false)
 |-- nb_pred_2: double (nullable = false)
 |-- svm_pred_0: double (nullable = false)
 |-- svm_pred_1: double (nullable = false)
 |-- svm_pred_2: double (nullable = false)
 |-- joint_pred_0: double (nullable = false)
 |-- joint_pred_1: double (nullable = false)
 |-- joint_pred_2: double (nullable = false)
where nb_pred_x is the prediction of model nb_x, svm_pred_x is the prediction of model svm_x, and joint_pred_x is the joint prediction of model nb_x and svm_x.
More specifically, the value of joint_pred_x is the decimal value of the joint prediction defined in L7P51 (hence it ranges from 0 to 3), with nb_pred_x as the high bit and svm_pred_x as the low bit. E.g., if nb_pred_1==1 and svm_pred_1==0, then joint_pred_1==2 (binary 10).
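For illustration, a minimal sketch of one possible implementation follows, assuming five groups as in the execution example at the end of this notebook; it applies the joint-prediction encoding above and is not the required implementation:
from functools import reduce
from pyspark.sql.functions import col

def gen_meta_features(training_df, nb_0, nb_1, nb_2, svm_0, svm_1, svm_2):
    parts = []
    for g in range(5):
        # train each base model on every group except g, then predict on group g
        fit_part = training_df.filter(col('group') != g)
        pred_part = training_df.filter(col('group') == g)
        for base_model in (nb_0, nb_1, nb_2, svm_0, svm_1, svm_2):
            pred_part = base_model.fit(fit_part).transform(pred_part)
        parts.append(pred_part)
    result = reduce(lambda a, b: a.union(b), parts)
    # joint prediction: nb_pred_x is the high bit, svm_pred_x the low bit
    for x in range(3):
        result = result.withColumn('joint_pred_{}'.format(x),
                                   (2 * col('nb_pred_{}'.format(x)) + col('svm_pred_{}'.format(x))).cast('double'))
    # keep only the columns of the required schema (drops the raw/probability columns)
    keep = ['id', 'group', 'features', 'label', 'label_0', 'label_1', 'label_2']
    keep += ['{}_pred_{}'.format(m, x) for m in ('nb', 'svm', 'joint') for x in range(3)]
    return result.select(keep)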
Task 1.3 (30 points): Obtain the Prediction for the Test Data¶
In this task, you need to complete the test_prediction() function in submission.py, which outputs a dataframe with predicted labels of the test data.
More specifically, the function is defined as
def test_prediction(test_df, base_features_pipeline_model, gen_base_pred_pipeline_model, gen_meta_feature_pipeline_model, meta_classifier):
The descriptions of the input parameters are as follows:
• test_df: the dataframe containing the ids, categories, and review texts of the test data. The schema of test_df is:
root
 |-- id: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- descript: string (nullable = true)
• base_features_pipeline_model is the fitted pipeline model for the pipeline built in Task 1.1.
• gen_base_pred_pipeline_model is the fitted pipeline model that generates predictions of base classifiers for the test data.
• gen_meta_feature_pipeline_model is the fitted pipeline model that generates the meta features from the single and joint predictions of the base classifiers.
• meta_classifier is the fitted meta classifier.
• You will see how all three pipeline models above are declared in the example below.
The output of the function is a dataframe with the following schema:
root
 |-- id: integer (nullable = true)
 |-- label: double (nullable = false)
 |-- final_prediction: double (nullable = false)
where labels are generated using the pipeline built in Task 1.1, and final_prediction is the prediction result of the test data.
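Below is a minimal sketch of how these models could be chained, assuming the joint predictions are re-derived between the base-prediction and meta-feature steps; treat it as an illustration rather than the required implementation:
from pyspark.sql.functions import col

def test_prediction(test_df, base_features_pipeline_model, gen_base_pred_pipeline_model, gen_meta_feature_pipeline_model, meta_classifier):
    # features and label via the fitted Task 1.1 pipeline model
    df = base_features_pipeline_model.transform(test_df)
    # predictions of all six base classifiers
    df = gen_base_pred_pipeline_model.transform(df)
    # joint predictions, computed as in Task 1.2
    for x in range(3):
        df = df.withColumn('joint_pred_{}'.format(x),
                           (2 * col('nb_pred_{}'.format(x)) + col('svm_pred_{}'.format(x))).cast('double'))
    # one-hot encode and assemble the meta features, then apply the meta classifier
    df = gen_meta_feature_pipeline_model.transform(df)
    df = meta_classifier.transform(df)
    return df.select('id', 'label', 'final_prediction')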
Evaluation¶
The evaluation of the project is based on the correctness of your implementation. The three subtasks will be tested independently (e.g., even if you do not complete Tasks 1.1 and 1.2, you may still get 30 points if you have correctly implemented Task 1.3).
Similar to Project 1, we will set a very loose time threshold T in case your code takes too long to complete. If your implementation does not finish the prediction within this threshold, it will be killed, resulting in a score of 0.
Task 2: Report (10 points)¶
You are also required to submit a report named report.pdf. Specifically, in the report, you are at least expected to answer the following questions:
1. Evaluation of your stacking model on the test data.
2. How would you improve the performance (e.g., the F1 score) of the stacking model?
For Task 2.2, you may explore the following directions:
• the base feature generation
• the meta feature generation
• the hyper-parameters of base and meta models
Hint: make proper use of the development data.
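For instance, reusing the fitted models from the execution example below, the stacking model could be scored on the development set; this is a sketch, and the file name proj2dev.csv is an assumption:
# proj2dev.csv is an assumed file name for the development data
dev_data = spark.read.load("proj2dev.csv", format="csv", sep="\t", inferSchema="true", header="true")
pred_dev = test_prediction(dev_data, base_features_pipeline_model, gen_base_pred_pipeline_model, gen_meta_feature_pipeline_model, meta_classifier)
# evaluate the F1 score on the development predictions
evaluator = MulticlassClassificationEvaluator(predictionCol='final_prediction', labelCol='label', metricName='f1')
print(evaluator.evaluate(pred_dev))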
How to execute your implementation (EXAMPLE)¶
In [1]:
from pyspark.sql import *
from pyspark import SparkConf
from pyspark.sql import DataFrame
from pyspark.sql.functions import rand
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, VectorAssembler
from pyspark.ml.classification import LogisticRegression, LinearSVC, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from submission import base_features_gen_pipeline, gen_meta_features, test_prediction
import random
rseed = 1024
random.seed(rseed)
def gen_binary_labels(df):
    df = df.withColumn('label_0', (df['label'] == 0).cast(DoubleType()))
    df = df.withColumn('label_1', (df['label'] == 1).cast(DoubleType()))
    df = df.withColumn('label_2', (df['label'] == 2).cast(DoubleType()))
    return df
# Create a Spark Session
conf = SparkConf().setMaster("local[*]").setAppName("lab3")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# Load data
train_data = spark.read.load("proj2train.csv", format="csv", sep="\t", inferSchema="true", header="true")
test_data = spark.read.load("proj2test.csv", format="csv", sep="\t", inferSchema="true", header="true")
# build the pipeline from task 1.1
base_features_pipeline = base_features_gen_pipeline()
# Fit the pipeline using train_data
base_features_pipeline_model = base_features_pipeline.fit(train_data)
# Transform the train_data using fitted pipeline
training_set = base_features_pipeline_model.transform(train_data)
# assign random groups and binarize the labels
training_set = training_set.withColumn('group', (rand(rseed)*5).cast(IntegerType()))
training_set = gen_binary_labels(training_set)
# define base models
nb_0 = NaiveBayes(featuresCol='features', labelCol='label_0', predictionCol='nb_pred_0', probabilityCol='nb_prob_0', rawPredictionCol='nb_raw_0')
nb_1 = NaiveBayes(featuresCol='features', labelCol='label_1', predictionCol='nb_pred_1', probabilityCol='nb_prob_1', rawPredictionCol='nb_raw_1')
nb_2 = NaiveBayes(featuresCol='features', labelCol='label_2', predictionCol='nb_pred_2', probabilityCol='nb_prob_2', rawPredictionCol='nb_raw_2')
svm_0 = LinearSVC(featuresCol='features', labelCol='label_0', predictionCol='svm_pred_0', rawPredictionCol='svm_raw_0')
svm_1 = LinearSVC(featuresCol='features', labelCol='label_1', predictionCol='svm_pred_1', rawPredictionCol='svm_raw_1')
svm_2 = LinearSVC(featuresCol='features', labelCol='label_2', predictionCol='svm_pred_2', rawPredictionCol='svm_raw_2')
# build pipeline to generate predictions from base classifiers, will be used in task 1.3
gen_base_pred_pipeline = Pipeline(stages=[nb_0, nb_1, nb_2, svm_0, svm_1, svm_2])
gen_base_pred_pipeline_model = gen_base_pred_pipeline.fit(training_set)
# task 1.2
meta_features = gen_meta_features(training_set, nb_0, nb_1, nb_2, svm_0, svm_1, svm_2)
# build onehotencoder and vectorassembler pipeline
onehot_encoder = OneHotEncoderEstimator(inputCols=['nb_pred_0', 'nb_pred_1', 'nb_pred_2', 'svm_pred_0', 'svm_pred_1', 'svm_pred_2', 'joint_pred_0', 'joint_pred_1', 'joint_pred_2'], outputCols=['vec{}'.format(i) for i in range(9)])
vector_assembler = VectorAssembler(inputCols=['vec{}'.format(i) for i in range(9)], outputCol='meta_features')
gen_meta_feature_pipeline = Pipeline(stages=[onehot_encoder, vector_assembler])
gen_meta_feature_pipeline_model = gen_meta_feature_pipeline.fit(meta_features)
meta_features = gen_meta_feature_pipeline_model.transform(meta_features)
# train the meta classifier
lr_model = LogisticRegression(featuresCol='meta_features', labelCol='label', predictionCol='final_prediction', maxIter=20, regParam=1., elasticNetParam=0)
meta_classifier = lr_model.fit(meta_features)
# task 1.3
pred_test = test_prediction(test_data, base_features_pipeline_model, gen_base_pred_pipeline_model, gen_meta_feature_pipeline_model, meta_classifier)
# Evaluation
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", metricName='f1')
print(evaluator.evaluate(pred_test, {evaluator.predictionCol: 'final_prediction'}))
spark.stop()
0.7483312619309965
Project Submission and Feedback¶
For the project submission, you are required to submit the following files:
1. Your implementation in the python file submission.py.
2. The report report.pdf.
Detailed instructions about using give to submit the project files will be announced later via Piazza.