Cover sheet for submission of work for assessment
School:
Business IT & Logistics
Program name
Program code
Office use only
School date stamp
Course/unit name
Foundations of Database Development
Course/unit code
ISYS2038
TAFE National Module Unit of Competency (UOC) name
TAFE National Module Unit of Competency (UOC) ID
Assignment no.
2
Due date
16 Oct 2016
Name of lecturer/teacher
Dr Yee Ling Boo
Campus
City
Class day/ time
Tutor/marker’s name
Student/s
Declaration and statement of authorship:
1. I/we hold a copy of this work which can be produced if the original is lost/damaged.
2. This work is my/our original work and no part of it has been copied from any other student’s work or from any other source except where due acknowledgement is made.
3. No part of this work has been written for me/us by any other person except where such collaboration has been authorised by the lecturer/teacher concerned.
4. I/we have not previously submitted this work for this or any other course/unit.
5. I/we give permission for this work to be reproduced, communicated, compared and archived for the purpose of detecting plagiarism.
6. I/we give permission for a copy of my/our marked work to be retained by the school for review and comparison, including review by external examiners.
I/we understand that:
7. Plagiarism is the presentation of the work, idea or creation of another person as though it is my/our own. It is a form of cheating and is a very serious academic offence that may lead to exclusion from the University. Plagiarised material can be drawn from, and presented in, written, graphic and visual form, including electronic data and oral presentations. Plagiarism occurs when the origin of the material used is not appropriately cited.
8. Plagiarism includes the act of assisting or allowing another person to plagiarise or to copy my/our work.
Student signature/s
1)
2)
3)
4)
5)
6)
Further information relating to the penalties for plagiarism, which range from a notation on your student file to expulsion from the University, is contained in Regulation 6.1.1 Student Discipline and the Plagiarism Policy which are available on the Policies and Procedures website at www.rmit.edu.au/policies
ISYS2038 Foundations of Database Development
Group Assignment 2: Advanced SQL Statements
Group Members:
Table of Contents
1. Advanced SQL Statements
1.1 Trigger
1.1.1 Business Purpose
1.1.2 SQL Design Statement
1.1.3 SQL Scripts
1.1.4 SQL Scripts Testing
1.2 Procedure
1.2.1 Business Purpose
1.2.2 SQL Design Statement
1.2.3 SQL Scripts
1.2.4 SQL Scripts Testing
1.3 Function
1.3.1 Business Purpose
1.3.2 SQL Design Statement
1.3.3 SQL Scripts
1.3.4 SQL Scripts Testing
1.4 View
1.4.1 Business Purpose
1.4.2 SQL Design Statement
1.4.3 SQL Scripts
1.4.4 SQL Scripts Testing
1.5 Nested Query
1.5.1 Business Purpose
1.5.2 SQL Design Statement
1.5.3 SQL Scripts
1.5.4 SQL Scripts Testing
2. Database Administration Approaches
2.1 Data Protection
2.1.1 Privacy Reason: Customer Information
2.1.2 Integrity Reason: Resource Information
2.2 Data Backup
3. Data Analytics
3.1 Prediction Model (Orange)
3.1.1 Prediction Model Statement
3.1.2 Prediction Model in Orange
3.1.3 Prediction Aftermath
3.1.4 Root Mean Square Error Calculation
3.1.5 RMSE Analytics
3.1.6 Variation of the RMSE
3.1.7 Limitations
3.2 Missing Values
4. References
1. Advanced SQL Statements
1.1 Trigger
1.1.1 Business Purpose
Operational data is accumulating faster than ever, and the value hidden in this volume of data is yet to be discovered. If a business can extract that value through data mining and data analysis, it can refine and enhance its business strategy to a great extent.
This trigger is therefore designed to record the basic consumption behaviour of PTE's clients: before a new party event record is added to the event table, the client's total spending, average spending per event, and the number of events the client has held are recorded in a new table in the database. The PTE manager can then review each client's consumption data, which enables PTE to target appropriate advertising at different customers.
The trigger records only preliminary consumption data, so there is still considerable room for PTE to improve its collection of behavioural data, for example by recording personal preferences for performances or food.
1.1.2 SQL Design Statement
First, a table is created to store the client analysis data. At this stage it has four columns: ‘clientID’, ‘consumption_times’, ‘consumption_amount’ and ‘avg_spending’.
The trigger declares two variables, ‘no_row’ and ‘tot_row’, which serve as intermediate values for the trigger's logic. ‘tot_row’ stores the number of party events that the specific client has held, while ‘no_row’ checks whether the client is already recorded in the customer_analyse table. If the client already has a record there, the existing row is updated: the number of events is incremented by one, the new spending is added to the total amount, and a new average spending per event is calculated. If the client is new and has no record in the table, the query inserts a new row with the client's consumption information.
1.1.3 SQL Scripts
SQL script for creating the table ‘customer_analyse’
CREATE TABLE `s3588356`.`customer_analyse` (
`clientID` INT NOT NULL,
`consumption_times` INT NULL,
`consumption_amount` DECIMAL(8,2) NULL,
`avg_spending` DECIMAL(8,2) NULL,
PRIMARY KEY (`clientID`)); /*the count and amount columns are numeric so that the trigger can do arithmetic on them directly*/
SQL script for the trigger
DELIMITER $$ /*specifies the statement delimiter used while defining the trigger*/
create trigger client_analysis
before insert on event /*the trigger fires before a new record is inserted into the event table*/
for each row
begin /*begins the trigger body*/
declare no_row integer; /*declare the variable no_row as an integer*/
declare tot_row integer; /*declare the variable tot_row as an integer*/
select count(*) into tot_row
from event
where clientID = new.clientID; /*tot_row holds the number of party events the client has already held*/
select count(*) into no_row
from customer_analyse
where clientID = new.clientID; /*no_row checks whether the client is already recorded in customer_analyse*/
if no_row > 0 then
update customer_analyse
set consumption_times = (tot_row + 1),
consumption_amount = new.price + consumption_amount,
avg_spending = consumption_amount/(tot_row + 1)
where clientID = new.clientID; /*if the client has held a party event before: increment the consumption count, add the new event's price to the running total, and recalculate the average spending per event (MySQL applies SET clauses left to right, so consumption_amount here is already the updated total)*/
else
insert into customer_analyse
(clientID, consumption_times, consumption_amount, avg_spending)
values (new.clientID, 1, new.price, new.price); /*if the client has no spending history, insert a new record consisting of the client's ID, an event count of one, and the new event's price as both the total and the average spending*/
end if;
end $$ /*ends the trigger body*/
DELIMITER ; /*restores the default delimiter*/
1.1.4 SQL Scripts Testing
Insert valid value
Valid value test result
Insert invalid value
Invalid value test result
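The tests shown in the screenshots above can be reproduced with statements of the following kind. This is a sketch only: the exact column list of the event table and the ID values are assumed for illustration.
INSERT INTO event (eventID, clientID, locationID, date, no_participants, price)
VALUES (501, 12, 3, '2016-10-01', 40, 1500.00); /*valid insert: fires the trigger, which creates or updates this client's row in customer_analyse (column list and IDs assumed for illustration)*/
SELECT * FROM customer_analyse WHERE clientID = 12; /*checks the trigger's effect*/
INSERT INTO event (eventID, clientID, locationID, date, no_participants, price)
VALUES (502, 99999, 3, '2016-10-02', 40, 1500.00); /*invalid insert: a clientID that does not exist in the client table is rejected by the foreign key constraint (assuming one is defined), and InnoDB rolls back the trigger's changes together with the failed statement*/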
1.2 Procedure
1.2.1 Business Purpose
If a business makes full use of its existing consumption data, it can analyse supplier performance and uncover the underlying patterns in customer behaviour. Why is one supplier so popular? Why do certain suppliers account for only a small proportion of total spending? Such analysis is vital for PTE when examining the performance of cooperating suppliers and improving their service quality.
The objective of this procedure is to show what share of a party event's total spending goes to one specific supplier. Together with the frequency of a supplier's attendance, this proportion may help PTE improve the quality of its existing suppliers and choose appropriate suppliers for future cooperation.
1.2.2 SQL Design Statement
This procedure calculates the share of one particular party event's spending that is attributable to one specific supplier. It takes two inputs, the supplier's identification number ‘supplierno’ and the event's identification number ‘eventno’, and after processing it returns the supplier's share as its output.
Three intermediate variables are declared for the processing. First, the script determines whether the supplier provided any products for the specific party event: if not, the check variable ‘yon’ is zero, the supplier accounts for 0% of the event's total price, and the procedure exits early. If the supplier did service the event, the query sums the price of the products this supplier provided into ‘totalcost’, while the total price of the party event is stored in ‘totalprice’. The procedure then performs the division; the result is the share of the event's total spending attributable to the supplier.
1.2.3 SQL Scripts
DELIMITER $$ /*specifies the statement delimiter used while defining the procedure*/
CREATE PROCEDURE supplieranalyse (IN supplierno INT, IN eventno INT, OUT percent decimal(8,4)) /*the procedure supplieranalyse has two inputs and one output*/
proc: /*labels the procedure body so the leave command can exit it early*/
BEGIN /*begins the procedure*/
declare yon INT; /*declare the variable yon as an integer*/
declare totalcost decimal(8,2); /*declare the data type of totalcost as decimal*/
declare totalprice decimal(8,2); /*declare the data type of totalprice as decimal*/
set percent = 0; /*this value is returned if there is no relevant record for the specific supplier*/
select count(*) from demand inner join goods on demand.itemID = goods.itemID
where goods.supplierID = supplierno and demand.eventID = eventno into yon; /*store into yon the number of items this supplier provided for the event*/
if yon = 0 then leave proc; /*if yon is zero, the supplier provided nothing, so leave the procedure*/
end if;
select sum(cost) from demand inner join goods on demand.itemID = goods.itemID
where goods.supplierID = supplierno and demand.eventID = eventno into totalcost; /*store into totalcost the total cost of the products this supplier provided*/
select price from event where eventID = eventno into totalprice; /*store into totalprice the total price of the particular party event*/
set percent = totalcost/totalprice; /*the result of the division, the supplier's share of the event price expressed as a decimal fraction, is the procedure's output*/
END$$ /*ends the procedure body*/
DELIMITER ; /*restores the default delimiter*/
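A call of the following form exercises the procedure; the supplier and event numbers are illustrative and should be replaced with IDs that exist in the database.
CALL supplieranalyse(3, 10, @percent); /*the supplier's share of event 10's total price is returned through the user variable @percent*/
SELECT @percent; /*displays the returned share*/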
1.2.4 SQL Scripts Testing
Valid value test
Test 1
Test 2
(This supplier does not provide any goods to this particular party event, hence the result is 0.)
Test 3
Test 4
Invalid value test
1.3 Function
1.3.1 Business Purpose
This function is designed to calculate the salary of a part-time employee for a single party event. For the PTE manager, it is extremely helpful when calculating the wages of different part-time employees. Given the number of part-time employees and the irregularity of their hours, the function is a powerful aid for reducing the chance of calculation mistakes.
1.3.2 SQL Design Statement
The SQL for the function is quite straightforward. Two input values, ‘HourlySalary’ and ‘working_hour’, are defined first along with their data types. The query then defines the data type and routine characteristic of the return value. The final statement defines the returned value as the product of HourlySalary and working_hour.
In addition to the function itself, a second SQL query demonstrates how the function is used: it is called inside a SELECT statement to calculate the wages of the selected part-time employees.
1.3.3 SQL Scripts
create function ptimewage
(HourlySalary INT(5), working_hour INT(5))
returns decimal(8,2) DETERMINISTIC
return (HourlySalary*working_hour); /*the two inputs are the hourly salary and the hours worked at a specific party; the routine is declared DETERMINISTIC because it always produces the same result for the same input parameters; the two values are multiplied and returned*/
select surname, employee.employeeID, team.eventID, event.date, ptimewage(employee.hourlySalary, team.working_hour) as wage
from employee inner join (team inner join event on team.eventID = event.eventID)
on employee.employeeID = team.employeeID
where employee.employeeType = 'P'; /*three tables are joined with inner joins; in addition to surname, employeeID, eventID and date, the wage of each part-time employee is listed via the function ptimewage inside the query*/
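The function can also be checked directly with literal inputs before running the join query; the figures here are illustrative.
SELECT ptimewage(25, 8) AS wage; /*expected output: 200.00 (25 per hour multiplied by 8 hours)*/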
1.3.4 SQL Scripts Testing
1.4 View
1.4.1 Business Purpose
A view is a virtual table based on the result set of an SQL statement. More importantly, it can be used as a form of security, giving other users access to a table but only to certain of its columns. This view is therefore created to give PTE's cooperating suppliers limited access: it lets those suppliers review their cooperative dealings with PTE and supports their business decision making.
1.4.2 SQL Design Statement
The first section lists the SQL query that creates the view. Seven columns are selected for display: ‘event.eventID’, ‘event.date’, ‘event.no_participants’, ‘event.price’, ‘goods.itemname’, ‘demand.amount’ and ‘supplier.name’.
These values come from four different tables, so inner joins are used to connect them. The query's conditions are then defined: the relevant information is displayed only when the supplier's quality rating is 5 (the top rating) and the party event's spending is greater than the average spending.
1.4.3 SQL Scripts
create view vw_supplierbriefing as
select event.eventID, event.date, event.no_participants, event.price, goods.itemname,
demand.amount, supplier.name
from event inner join (demand inner join (goods inner join supplier on goods.supplierID = supplier.supplierID) on demand.itemID = goods.itemID) on event.eventID = demand.eventID
where goods.supplierID in (select supplierID from supplier where quality = 5) and
event.price > (select avg(price) from event)
order by event.price DESC; /*four tables are joined with inner joins; two nested queries define the conditions; rows in the outer query that satisfy the 'in' and '>' conditions are displayed in descending order of price (supplierID is qualified with its table name to avoid ambiguity with supplier.supplierID)*/
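Once created, the view is queried like an ordinary table, so a supplier's access can be as simple as the following; the supplier name used in the filter is illustrative.
SELECT * FROM vw_supplierbriefing; /*all qualifying events, most expensive first, as ordered by the view*/
SELECT * FROM vw_supplierbriefing WHERE name = 'ABC Catering'; /*the same briefing restricted to one supplier*/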
1.4.4 SQL Scripts Testing
1.5 Nested Query
1.5.1 Business Purpose
The purpose of this SQL statement is to list relevant information about party events that contain a performance activity and are held at a location that meets a condition: the cost of the location must be higher than the average location cost.
1.5.2 SQL Design Statement
Several columns are chosen for display: ‘eventID’, ‘date’, ‘name’, ‘price’ and ‘locationname’. These values come from various tables, so inner joins are used to achieve the query's outcome. The query's conditions contain nested queries: the relevant information is displayed only when the eventID exists in the performance table and the location price is greater than the average cost of all candidate locations.
1.5.3 SQL Scripts
select event.eventID, event.date, entertainer.name, location.price, location.locationname
from entertainer inner join (performence inner join (event inner join location
on event.locationID = location.locationID) on event.eventID = performence.eventID)
on entertainer.entertainerID = performence.entertainerID
where event.eventID in (select eventID from performence) and
location.price > (select avg(price) from location)
order by event.date desc; /*four tables are joined with inner joins; two nested queries define the conditions; rows in the outer query that satisfy the 'in' and '>' conditions are displayed in descending order of date (eventID is qualified with its table name to avoid ambiguity)*/
1.5.4 SQL Scripts Testing
2. Database Administration Approaches
Database administration approaches refer to the mechanisms for directing or performing all the activities involved in successfully building up a database environment. It is important to identify database administration approaches within the organisation that ensure data security for users, for both privacy and integrity reasons.
In addition, some data is confidential to the organisation itself, so database administrators must also consider what to show to external parties through database views without revealing important internal information to the public.
This task discusses two major database administration approaches: data protection and data backup. In the light of these two aspects, it investigates what data needs to be protected and what data backup challenges the database administrator may encounter as the company grows to a larger size.
2.1 Data Protection
Data is recognised as an important enterprise asset that needs to be safeguarded. A failure of data protection is likely to result in enormous direct financial losses, and in indirect losses from a drop in market confidence and from customers fleeing to competitors (Pfleeger and Pfleeger 2002). Data protection therefore plays a vital role in maintaining a safe database environment and avoiding the threats and risks that would harm corporate development. Specifically, data protection covers both personal data and commercially confidential corporate information that is not supposed to be visible to the public. To evaluate what data needs protection, two examples support the discussion: for privacy reasons, the importance of customer information; and for integrity reasons, resource information, which demonstrates how important database stability is to the business.
2.1.1 Privacy Reason: Customer Information
The information entrusted to the business by its customers must be protected, because it is tied to the organisation's brand and is a business bottom line (Massa 2014). The customer information an organisation holds is a very precious commodity. If the business can demonstrate that it has appropriate systems in place to protect customers' personal information, it is more likely to develop healthy commercial relationships, which benefits the business in the long run. Failing to protect customer information, on the other hand, results in unjustified adverse effects on the individuals concerned.
In this database design, the ‘client’ table includes customers' names, addresses, phone numbers and email addresses. This private information is sensitive personal data. If such entrusted details were exposed, intentionally or unintentionally, customers would feel insecure about the disclosure. The consequence is reduced confidence in the brand, followed by lost profit as customers shift to a competing organisation for their next event. It is therefore important, as a database administration approach, to protect customer information for privacy reasons, so that entrusted customer information is kept safe and the relationship between the organisation and its customers remains positive.
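One practical way to enforce this in MySQL is to expose only the non-sensitive columns through a view and to grant external accounts access to that view rather than to the base table. The following is a sketch only: the account name, the password and the exact column names of the client table are assumptions.
CREATE VIEW vw_client_public AS
SELECT clientID, name
FROM client; /*a view exposing only non-sensitive client columns (column names assumed for illustration)*/
CREATE USER 'event_staff'@'localhost' IDENTIFIED BY 'ChangeMe2016'; /*illustrative account*/
GRANT SELECT ON s3588356.vw_client_public TO 'event_staff'@'localhost'; /*the staff account sees the view only; no privilege is granted on the underlying client table, so address, phone and email stay hidden*/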
2.1.2 Integrity Reason: Resource Information
For the organisation as a whole, on the other hand, the integrity reason for data protection is to protect resource information as business confidential material. In this case, resource information includes the ‘location’ details, the ‘entertainer’ and ‘supplier’ records and so on; this entrusted data is an important resource connecting the business with its providers. These connections are a fundamental element of the business's success, so they need to be protected from competitors in the same industry.
In addition, the cost of the events the company presents to its customers is important information with a significant impact on the development of the business. It should be protected as well, so that customers see only the price and never the real cost of the event: the difference between event price and event cost is where the company makes its profit. This kind of information should therefore be kept secure within the organisation, with access limited to authorised staff.
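MySQL also supports column-level privileges, which give a direct way to let a customer-facing account read an event's price without seeing any cost figures. The account name and the column list below are illustrative.
CREATE USER 'front_desk'@'localhost' IDENTIFIED BY 'ChangeMe2016'; /*illustrative account*/
GRANT SELECT (eventID, date, no_participants, price)
ON s3588356.event TO 'front_desk'@'localhost'; /*only the listed columns of event are readable by this account; the cost figures recorded in the demand table remain hidden from it*/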
2.2 Data Backup
In this case, the company aims to expand its business to other cities, which means the database will grow with ever more information and data. The volume of data to be backed up has always been a challenge. In dealing with large volumes of information as the company grows, the major challenges backup administrators face include coping with tape capacity limitations and ensuring that the backup techniques deliver sufficient performance both to back up and to restore. The administrator's challenge is therefore to back up the database frequently and in a timely manner, keeping pace with data growth, so that the data can be kept and restored when needed (Farley 2013). This is especially true because database data is not only growing exponentially, but is also read and written more frequently, making the system more vulnerable to failure and subject to stricter recovery point objectives. These backup challenges are particularly painful for a small-to-medium company like PTE, where increasing cost and complexity, together with a lack of expertise, make it difficult to manage backups effectively.
Beyond that, database backups require expertise in both data backup and database administration, and must take the company's resources into account. For these reasons, PTE could consider managed backup from a third-party provider instead of handling it on-site. This service outsources the planning and management of the backup environment and its requirements to an outside provider that meets the company's specific business needs. It would reduce cost and administrative pressure; however, the security of the database information handed to the provider must also be borne in mind.
3. Data Analytics
3.1 Prediction Model (Orange)
3.1.1 Prediction Model Statement
Based on the concepts of regression and data exploration, one Orange model has been produced, as shown below. To increase the precision and accuracy of the prediction, five different kinds of regression were added to the model: Linear Regression, Nearest Neighbors, Regression Tree, SGD Regression and AdaBoost. Furthermore, to obtain a more comprehensive picture and observe how the prediction varies with the amount of input data, the Data Sample widget in Orange was used to set five fixed proportions of input data, 5%, 30%, 50%, 75% and 99%, and the resulting Root Mean Square Error (RMSE) figures were compared.
The interaction between these two dimensions, the different regressions and the different proportions of input data, makes it possible to identify the most appropriate regression for predicting PTE's likely total sales. The variation of each regression's RMSE with the proportion of input data is also demonstrated.
3.1.2 Prediction Model in Orange
3.1.3 Prediction Aftermath
Based on the prediction model created above, PTE's sales data was imported after format modification. The target column, SalesAmount, was selected, so the prediction model generates predicted values for SalesAmount. Below is one example screenshot of the prediction outcome with the fixed proportion of input data set to 30%. The full PTE sales dataset contains 2000 records, so 30% corresponds to 600 records.
Example – Fixed Proportion of data is 30%
Example – Prediction outcomes for rows 1 to 600 when the data proportion is 30%
3.1.4 Root Mean Square Error Calculation
Since there are five different proportions of input data, 5%, 30%, 50%, 75% and 99%, the prediction outcomes of the Orange model were exported five times. Below is one exported example screenshot, for the 75% proportion of the total sales data.
Exported Example – 75% Proportion of Total data
At this stage the predictions have been produced, but the regression models still have to be evaluated. Which regression is the most accurate and appropriate for predicting PTE's sales amount? According to the Week 10 tutorial material, a popular evaluation measure for regression models is the Root Mean Square Error (RMSE). The RMSE of each regression was therefore calculated for each fixed proportion of input data. First, the deviations between the actual sales amounts and the predictions of each regression are computed. Second, the deviations are squared. Next, the average of the squared deviations is taken. Finally, the square root of that average is the RMSE, as written in the formula below. The screenshot that follows shows this calculation in Excel for the 99% proportion of input data.
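In symbols, writing $y_i$ for the actual sales amount, $\hat{y}_i$ for a regression's prediction and $n$ for the number of predicted rows, the steps above amount to:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$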
Excel Calculation Example – Fixed Proportion of data is 99%
After all the RMSE values were produced, the figures were gathered into one table, making it easy to compare the differences between regression methods and between proportions of input data.
3.1.5 RMSE Analytics
The smaller the RMSE, the closer the Orange model's prediction is to the actual situation. According to the RMSE table above, whichever fixed proportion of data is used, the smallest RMSE among all the regression methods belongs to the Regression Tree. Even though some RMSE figures of other methods are not far from the optimum, the Regression Tree's RMSE remains the smallest throughout, so it is the first choice of regression method for this prediction. Setting the Regression Tree aside, AdaBoost is the best of the remaining methods for PTE sales prediction: although its RMSE is not the second smallest in the 30% section, it is the second smallest in the other four data proportion sections. The analysis therefore identifies the Regression Tree as the best regression method for sales prediction in the Orange model, with AdaBoost as the second-best choice.
A regression tree is built through a process known as binary recursive partitioning: an iterative process that splits the data into partitions and then continues dividing each partition into smaller groups as the method moves up each branch (‘Regression Trees’ 2016). The PTE sales dataset consists of three discrete and three continuous attributes, and, as the literature notes, regression trees are suitable for dependent variables that take continuous or ordered discrete values (Loh 2011). To some extent this may be why the regression tree is the optimal regression method for PTE sales prediction; however, given the complexity of data prediction, other unknown factors may also contribute to this outcome.
3.1.6 Variation of the RMSE
In addition to the transverse comparison between regression methods, an interesting phenomenon emerges from the vertical comparison between the different fixed proportions of input data for a single regression method. As can be observed, all the RMSE values are at their smallest, around 200 to 300, when the amount of data is small (5%). Once the proportion of input data is set to 30% or 50%, the RMSE rises dramatically, to approximately 600 or 700. However, as the proportion continues to rise (75%, 99%), the RMSE decreases slightly, to around 500 or 600. One deduction from this pattern is that with a small amount of data the logic behind the changing predicted values is easier to tease out; as the data grows, the increased complexity produces greater deviation; but once the data reaches a sufficiently large amount, the prediction becomes more accurate again and the deviation drops off slightly.
3.1.7 Limitations
Since only five regression methods were selected for this particular Orange prediction model, other regression methods that were not selected may be more suitable for PTE sales prediction. To achieve a more accurate and precise prediction, more regression methods and prediction models would be needed. Also, the PTE sales dataset contains only 2000 records, which is relatively small for a rigorous prediction; with a larger amount of data, the predictions would likely be more accurate than the present ones.
3.2 Missing Values
Missing values are a common issue in datasets, and they distort analysis, so dealing with them is important for ensuring the accuracy and integrity of the database. According to past research, listwise and pairwise deletion are the most common methods for dealing with missing values (Peugh and Enders 2004), so these two techniques are the main suggestions for solving the missing data problem.
Listwise deletion removes the rows in which one or more values are missing, so the analysis uses only the cases where all values are available. Pairwise deletion attempts to minimise the loss that occurs in listwise deletion by performing each calculation on the target attributes using whatever values are available, regardless of missing data elsewhere in the row; the SQL sketch below illustrates both approaches.
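In SQL terms the two approaches can be sketched as follows; the sales table name and the second analysed column are assumptions for illustration.
SELECT AVG(SalesAmount)
FROM sales
WHERE SalesAmount IS NOT NULL
AND no_participants IS NOT NULL; /*listwise deletion: only rows where every analysed column is present take part in the calculation*/
SELECT AVG(SalesAmount), AVG(no_participants)
FROM sales; /*pairwise deletion: each aggregate uses every row available for its own column (SQL's AVG already skips NULLs), so the two averages may cover different subsets of rows*/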
The comparison below sets out the advantages and disadvantages of listwise deletion and pairwise deletion for better understanding:
Listwise Deletion
Advantages: easy to apply; results are accurate.
Disadvantages: available values that might be usable have to be discarded; there is no result for every entry; fewer entries result in less statistical power.
Pairwise Deletion
Advantages: no value is discarded; each entry is used.
Disadvantages: harder to apply than listwise deletion; results are biased; results of different analyses are harder to compare.
Besides these two major techniques for handling missing values, other studies also suggest methods such as imputation and maximum likelihood estimation (Allison 2001). Imputation replaces each missing value with a substituted value such as the mean of the column. As with pairwise deletion, its advantage is that the analysis can use every entry of data, preserving statistical power, but the results can be heavily biased. Maximum likelihood estimation is considered the most robust technique, producing the most accurate estimates, but its complexity limits its practical application.
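For completeness, mean imputation can also be expressed in SQL. This is a sketch: the table name is assumed, and the derived-table wrapper works around MySQL's restriction on reading from the table that is being updated.
UPDATE sales
SET SalesAmount = (SELECT m FROM (SELECT AVG(SalesAmount) AS m FROM sales) AS t)
WHERE SalesAmount IS NULL; /*replaces missing SalesAmount values with the column mean; AVG ignores NULLs, so the mean is computed over the values that are present*/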
4. References
Allison, P.D. 2001, Missing Data (vol. 136), Sage Publications.
Farley, M. 2013, Rethinking Enterprise Storage: A Hybrid Cloud Model, Microsoft Press.
Loh, W. 2011, ‘Classification and regression trees’, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 14-23.
Massa, P.M. 2014, ‘Brand-led transformation: operationalising the brand’, in Handbook of Research on Managing and Influencing Consumer Behavior, p. 439.
Peugh, J.L. and Enders, C.K. 2004, ‘Missing data in educational research: a review of reporting practices and suggestions for improvement’, Review of Educational Research, vol. 74, no. 4, pp. 525-556.
Pfleeger, C.P. and Pfleeger, S.L. 2002, Security in Computing, Prentice Hall Professional Technical Reference.
‘Regression Trees’ 2016, viewed 16 October 2016.