PLAN AND PROPOSAL
BRIAN SU 11525425 | ASAD UR REHMAN 11528015 | TIM BURGESS 11433404
ANALYTICS CAPSTONE PROJECT 41004
Contents
1. BAT Analytics
  1.1 About Our Company
  1.2 Mission Statement
2. Our Team
  2.1 Meet the Team
    2.1.1 Brian Xinda Su
    2.1.2 Asad Ur Rehman
    2.1.3 Tim Burgess
  2.2 Our Roles
3. The Project
  3.1 Our Client
  3.2 How we can help
4. The Data Mining Problem
5. Project Proposal
6. Project Plan
  6.1 CRISP-DM methodology
    6.1.1 Business Understanding
    6.1.2 Data Understanding
    6.1.3 Data Preparation
    6.1.4 Modeling
    6.1.5 Evaluation/Deployment
  6.2 Work plan
  6.3 Milestones
    6.3.1 Project Commencement
    6.3.2 Plan Proposal
    6.3.3 Mid-project update
    6.3.4 Final Report (Client)
1. BAT Analytics
1.1 About Our Company
BAT Analytics is a data analytics company founded in 2015 by Brian Xinda Su, Asad Ur Rehman, and Tim Burgess. We help clients better understand their own businesses, with the aim of solving problems, improving performance, and streamlining business processes through data analytics services.
1.2 Mission Statement
Our mission is to provide first-class data analytics services to a broad array of clients in a wide range of sectors. We aim to take data from all available sources and transform it into business solutions, by helping clients better understand their own businesses. We also aim to maximise the usefulness of existing and new data, bringing business and social value to our clients.
2. Our Team
2.1 Meet the Team
2.1.1 Brian Xinda Su
Brian is a data analytics student at the University of Technology, Sydney, with a background spanning all levels of data analytics subjects. He has a passion for statistics, data gathering, and analysis, and wants to bring value to our clients by transforming raw data into insightful business solutions. He also has a keen interest and strong background in programming, which adds further to the value he can bring to our clients.
Brian hopes to continue in the field of data analytics, and will be working as the Lead Data Gatherer on this client project.
2.1.2 Asad Ur Rehman
Asad is a data analytics student at the University of Technology, Sydney. His background is in data analytics, with a strong focus on statistical analysis, Linear Dynamical Systems, Intelligent Agents, and Patterns Programming. He has an interest in image processing and pattern recognition. Asad’s key strengths include data pre-processing, data modelling, and R programming.
Asad plans to continue as a data analyst upon completion of his time at UTS, and will be working as the Project Manager for this engagement.
2.1.3 Tim Burgess
Tim is a student of data analytics and business at the University of Technology, Sydney. He has a multi-disciplinary background, with an analytics component comprising advanced analytics processes and database programming, and a business focus on accounting and financial reporting. In particular, he is keen on the interplay between the two disciplines, and wants to use data analytics as a means to improve business processes.
Tim plans to work in a risk assessment or advisory role, and will be the Lead Data Analyst for this project.
2.2 Our Roles
The structure of our team is quite democratic and fluid; however, the nature of the project does require the allocation of individual roles to ensure that all work is completed in a timely and efficient manner. Roles and work allocations are assessed weekly at team meetings, to ensure an equitable and balanced distribution of tasks.
Asad Ur Rehman will be acting as the team manager in this project. His duties are to ensure that the team abides by BAT’s high quality and service standards. It is his job to oversee the project and intervene in any areas of concern, including quality control and staff conduct.
Brian is charged with gathering the data appropriate to the project. Accurate data collection is essential to maintaining the integrity of the research. To safeguard data integrity, we need to be vigilant for errors in the data collection process. Brian will do this using two methods: quality assurance, which will take place before data collection begins, and quality control, which will take place during and after data collection. It will be Brian’s responsibility to ensure prompt action is taken should any errors be detected, either in individual data items or in the systematic collection or pre-processing of data. Issues with individual staff conduct, including fraud or scientific misconduct, will also be Brian’s responsibility.
Tim will be responsible for leading the data analysis portion of the project. Data analysis is used in many industries to allow companies to make better decisions and operate more efficiently. The goal of data analysis is to benefit the client by reducing costs, enabling better-targeted marketing and, ultimately, improving profit. There are a number of tools available for many different data analysis purposes; for this project, we anticipate that Weka, Rattle, and R will be sufficient for the client’s needs.
3. The Project
3.1 Our Client
Our client is a digital publication called Tip of the Tongue. Tip of the Tongue aims to challenge the perception of Australia as a monolingual nation, and create awareness among Australians about the diversity of languages spoken throughout the country. The client wishes to uncover the challenges and issues faced by minorities. In order to accomplish these objectives, Tip of the Tongue’s strategy is to dig deep and get involved in capturing stories and social attitudes from the general public, rather than simply paraphrasing official press releases and pronouncements.
3.2 How we can help
While we respect our client’s mission and believe there is a lot of value in their qualitative approach to exploring this issue, we also recognise the wealth of information that is available and could be valuable to their publication. We want to explore this data and provide a new and different perspective to complement their existing approach.
4. The Data Mining Problem
The Australian Census is conducted once every five years and is used to collect a wealth of demographic data about people in Australia, from family characteristics and relationship status, to housing situation and cultural background. The census produces a phenomenal amount of information, which the Australian government uses as a guide for the appropriate provision and distribution of services to Australian communities. While this remains its primary practical purpose, the Census also serves as a comprehensive resource of information for research and social analysis purposes.
Our client, Tip of the Tongue magazine, aims to shine a light on cultural and language diversity issues around Australia. We believe that we have the capability to assist them in their goal, by providing an analysis of census and other data to help highlight issues and trends affecting Australia’s non-English-speaking and ESL populations.
Tip of the Tongue’s mission statement suggests a preference toward telling individual stories and promoting social conversations about national language issues. Of course, these sorts of stories have their own value, but we at BAT Analytics believe we are uniquely positioned to prepare, analyse, and present the data available to help provide a different perspective with which Tip of the Tongue can inform their readership.
We anticipate that our approach will primarily be one of unsupervised learning. Unsupervised learning is the process of finding structure in data, identifying correlations or clusters that might not be immediately obvious. By comparison, supervised learning aims to infer or predict information by analysing labelled training data. Because our client wants to explore existing data rather than predict an outcome, supervised learning does not seem to be an appropriate approach for this project.
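As an illustration of what finding structure without labels can look like, the following is a minimal sketch of k-means clustering in Python. It is not our chosen method, which has yet to be decided; the `kmeans` function, the region values, and the choice of two clusters are all assumptions made up for this example, and the numbers are not census figures.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Group points into k clusters; returns one cluster index per point."""
    # Deterministic start (an assumption for this sketch):
    # use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

# Hypothetical regions, NOT real census values:
# (share speaking a non-English language at home, median age in years)
regions = [(0.05, 45.0), (0.40, 31.0), (0.07, 47.0), (0.38, 33.0), (0.06, 44.0)]
labels = kmeans(regions, k=2)
print(labels)
```

No region is labelled in advance; the grouping emerges from the data alone. In practice the attributes would need to be rescaled first, since here median age dominates the distance calculation.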
5. Project Proposal
We anticipate that the Australian Census data will form the majority of the data for our analysis. This resource includes extensive data about family characteristics, relationship status, housing situation, and, importantly, cultural and language information. As this information is from an official government source with a strong history of statistical analysis (the Australian Bureau of Statistics), we have deemed it to be reliable enough to serve as our core data source. Throughout the project, we will continue to assess alternate data sources and integrate them when and if we deem them to be of additional value.
We anticipate using Weka and Rattle, both data analysis tools, for most of our data analysis process. Using these tools, we will clean, structure, integrate, and transform the raw data for later use in the analysis.
We may need to use R as well, because it provides additional data modelling and analysis features that Rattle does not.
Since our client wants to explore existing data about languages rather than predict future outcomes, we consider this an unsupervised learning problem. Having established that, the initial analysis we think appropriate is of the distribution of languages: How many different languages are spoken? How are languages distributed geographically? How are languages distributed amongst different races, sexes, age groups, and cultures? Another interesting analysis would be to find correlations between different parameters. These are all areas of interest flagged by our client, and we intend to explore them fully, as well as raise any other findings we make in the process.
The findings of these analyses will allow our client to support their cause, and hopefully generate interesting and insightful content for their publication.
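The distribution and correlation questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the sample values, and the `pearson` helper are hypothetical and do not reflect the actual structure of the census data or the tools we will ultimately use.

```python
from collections import Counter
from math import sqrt

# Hypothetical person-level records, NOT real census data:
# (language spoken at home, state, age group)
records = [
    ("Mandarin", "NSW", "25-34"),
    ("Arabic", "NSW", "35-44"),
    ("Mandarin", "VIC", "25-34"),
    ("Greek", "VIC", "65+"),
    ("Mandarin", "NSW", "15-24"),
]

# How many different languages are there, and how common is each?
by_language = Counter(lang for lang, _, _ in records)

# How are languages distributed geographically?
by_language_and_state = Counter((lang, state) for lang, state, _ in records)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(len(by_language), by_language.most_common(1))
print(by_language_and_state[("Mandarin", "NSW")])
```

The same counting pattern extends to any pair of attributes (language by age group, language by sex), and `pearson` can be applied to any two numeric columns once they have been aggregated per region.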
6. Project Plan
6.1 CRISP-DM methodology
The standard process model for data analysis, conceived in 1996 and refined since, is the Cross-Industry Standard Process for Data Mining (CRISP-DM). The CRISP-DM model consists of six phases, common to all data mining projects. This is the model BAT Analytics intends to follow. These phases are, in order:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
The CRISP-DM methodology is illustrated in Figure 6.1, and the individual phases will be further discussed in the following sections.
Figure 6.1 – CRISP-DM Methodology
6.1.1 Business Understanding
This initial phase focuses on the client: it requires understanding the project objectives and requirements from the client’s business perspective, and then converting this understanding into a data mining problem definition. For this particular project, this phase is already well underway. Our client, Tip of the Tongue magazine, is a digital publication focused on the diversity of spoken languages throughout Australia. They aim to tell stories and create awareness of the rich linguistic landscape of this country.
At first glance, this may seem like a primarily qualitative endeavour, with little scope for quantitative analysis. However, we at BAT Analytics believe that there is a lot of cultural and linguistic data to be explored and analysed, which will help our client provide a fresh and novel perspective on Australian demographics, while still staying true to their publication’s vision.
6.1.2 Data Understanding
The data understanding phase encompasses the collection of data, as well as the initial familiarisation. At this stage, we will assess the quality of the data, and begin to form initial insights. This will include identifying any areas that we believe may be of interest to our client, and beginning to construct tentative hypotheses to be developed and tested.
We anticipate that census data will form the main dataset of this project, which should provide a relatively clean and uniform base from which to begin. However, we are looking to incorporate other datasets as well, which may not be as neat, and could thus pose their own challenges.
6.1.3 Data Preparation
This phase follows on from data understanding, and involves the activities required to construct the final dataset. This might include cleaning the data so that it is fit for analysis, and transforming certain data attributes into a form more conducive to analysis techniques.
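A minimal sketch of what such cleaning and transformation might involve is shown below in Python. The field names, the age bands, and the decision to drop incomplete records are all assumptions for illustration; the real preparation steps will depend on what the data understanding phase reveals.

```python
def age_group(age):
    """Map an age in years to a coarse band (band edges are illustrative)."""
    if age < 15:
        return "0-14"
    if age < 65:
        return "15-64"
    return "65+"

def prepare(rows):
    """Return cleaned rows, skipping records with missing values."""
    cleaned = []
    for row in rows:
        # Cleaning: trim whitespace and normalise capitalisation.
        language = (row.get("language") or "").strip().title()
        age = row.get("age")
        if not language or age is None:
            continue  # incomplete record: excluded rather than guessed
        # Transformation: replace the raw age with a categorical band.
        cleaned.append({"language": language, "age_group": age_group(int(age))})
    return cleaned

# Hypothetical raw records, NOT real census data.
raw = [
    {"language": "  mandarin ", "age": "29"},
    {"language": "ARABIC", "age": 70},
    {"language": "", "age": 40},
    {"language": "Greek", "age": None},
]
prepared = prepare(raw)
print(prepared)
```

Dropping incomplete records is the simplest policy and keeps the sketch short; for sparse datasets, flagging or imputing missing values may be preferable.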
6.1.4 Modeling
The modelling phase is the process of modelling the data in order to test hypotheses and gain new insights. There are numerous ways in which to model any dataset, and many different techniques to be employed. However, not all techniques are compatible with all data types, so there may be many different iterations of the Data Preparation and Modeling phases, with adjustments made to the data as required. It is too early to predict exactly what form the modelling phase may take in this particular instance, but updates will be provided in the mid-project report.
6.1.5 Evaluation/Deployment
Normally, the Evaluation and Deployment phases would be two separate steps. However, for this particular project it seems appropriate to combine them into one, as the deployment phase is largely inapplicable in this case. The evaluation phase involves assessing the quality of the model produced in the preceding phase. This assessment includes ensuring that it meets the objectives of the client. The deployment phase in this case consists simply of compiling and presenting the data in such a way that the client can make sense of it and use it.
6.2 Work plan
A preliminary work plan has been compiled (Figure 6.2) that outlines the individual duties of each member of the BAT team, as well as key dates of the project.
Figure 6.2 – Preliminary Work Plan
6.3 Milestones
There are four main milestones throughout the duration of this project. These have been identified in Figure 6.2 and are described below:
6.3.1 Project Commencement
February 26th, 2015
This marks the start date of the project, at which point the group was convened and work officially began on the client proposal.
6.3.2 Plan Proposal
March 26th, 2015
This is the official submission of this document, for client and internal review. This is the first deliverable of the project, and allows the client to assess our methodology, and provides an opportunity for further queries and clarifications to be raised.
6.3.3 Mid-project update
April 9th, 2015
This is the second deliverable of the project. It will allow us to communicate to the client our progress to date, and also raise any concerns or unforeseen limitations that we may have encountered in the beginning stages of the project. It will also provide another opportunity for the client to clarify any needs, or advise BAT of any changes to their requirements.
6.3.4 Final Report (Client)
May 21st, 2015
At this stage, this is a soft deadline for the final report to be delivered to the client. By this time, all analysis should be completed and presented in a form that is understandable to the client, and usable for their business needs. Depending on the needs of the client, this deadline may change.