CSIT314 Software Development Methodologies
Data-driven Software Development
Two perspectives:
Developing data-driven software products
• E.g. Many Artificial Intelligence (AI) applications are data-driven.
• Also referred to as AI Engineering
Leveraging software development data to generate insights and build tool support for business analysts, software developers, project managers, etc.:
• E.g. Help business analysts identify requirements from app reviews
• E.g. Help project managers predict delays and risks
• E.g. Help agile teams estimate effort (story points)
• E.g. Help software developers locate security vulnerabilities and bugs
• E.g. Automatically generate code comments, commit messages, test cases, etc.
• Also referred to as Software Analytics
• Many large organizations (e.g. Microsoft, Google and Facebook) have deployed Software Analytics in their software development process.
The traditional software development approach
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (1st ed.)
The traditional approach – an example
Example: the weather problem, i.e. the conditions for playing a cricket game. How can we build a software app that predicts whether a cricket game will be played?
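In the traditional approach, a developer studies the problem and writes the decision rules by hand. A minimal sketch of that idea follows; the function name `will_play` and every rule in it are illustrative assumptions, not rules given in the lecture.

```python
def will_play(outlook: str, humidity: str, windy: bool) -> bool:
    """Hand-written rules; each condition is an illustrative assumption."""
    if outlook == "overcast":
        return True
    if outlook == "sunny":
        return humidity != "high"
    if outlook == "rainy":
        return not windy
    raise ValueError(f"unknown outlook: {outlook!r}")


print(will_play("sunny", "high", False))    # False
print(will_play("rainy", "normal", False))  # True
```

Every change in the conditions (a new outlook value, a different threshold) requires a developer to edit the rules, which is the maintenance cost the data-driven approach tries to avoid.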
The data-driven approach
Data-driven software systems
Supervised learning (Machine Learning)
Data-driven software development
Develop data-driven learning models
How do we (automatically) build a model (or an app) for predicting if a game is played?
This model is a function which can be inferred (learned) from labelled training data (aka data driven)
• f(outlook, temperature, humidity, windy) returns true or false.
• Features/attributes: input variables, e.g. outlook, temperature, etc.
• Target/dependent variable, e.g. play = yes or no
• Training set consists of training examples
Classification = prediction.
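To make "a function inferred from labelled training data" concrete, here is a minimal sketch using the 1R (one-rule) learner, a deliberately simple stand-in for the learners discussed later. The training examples are invented for illustration, and the temperature feature is left out to keep the table small.

```python
from collections import Counter, defaultdict

# Hypothetical labelled training examples: (features, play).
TRAINING_SET = [
    ({"outlook": "sunny",    "humidity": "high",   "windy": False}, False),
    ({"outlook": "sunny",    "humidity": "high",   "windy": True},  False),
    ({"outlook": "overcast", "humidity": "high",   "windy": False}, True),
    ({"outlook": "rainy",    "humidity": "high",   "windy": False}, True),
    ({"outlook": "rainy",    "humidity": "normal", "windy": False}, True),
    ({"outlook": "rainy",    "humidity": "normal", "windy": True},  True),
    ({"outlook": "overcast", "humidity": "normal", "windy": True},  True),
    ({"outlook": "sunny",    "humidity": "normal", "windy": False}, False),
]


def learn(examples):
    """1R: keep the single feature whose value -> majority-label rule
    makes the fewest errors on the training examples."""
    best = None
    for feature in examples[0][0]:
        counts = defaultdict(Counter)
        for x, label in examples:
            counts[x[feature]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[x[feature]] != label for x, label in examples)
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    feature, rule, _ = best
    fallback = Counter(label for _, label in examples).most_common(1)[0][0]

    def f(x):
        # Unseen feature values fall back to the overall majority label.
        return rule.get(x[feature], fallback)

    return f


f = learn(TRAINING_SET)
print(f({"outlook": "sunny", "humidity": "normal", "windy": True}))    # False
print(f({"outlook": "overcast", "humidity": "high", "windy": False}))  # True
```

Nobody wrote the rule "sunny means no play"; it was derived from the labels, so retraining on new data changes the function without changing the code.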
The AI/ML approach – an example (cont.)
Some basic learning models (learners)
Decision Trees
Source: Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, by Ian H. Witten et al.
Some advanced learning models (learners)
Random Forests (RF)
• An ensemble learning method
• A significant improvement of the decision tree approach
• Generates many classification trees, each built using a random subset of variables at each node split, and aggregates the individual results by voting
Neural Networks
and many other ML models.
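The Random Forest idea (bootstrap samples, a random subset of variables at each split, majority voting) can be sketched in a few lines. This is a simplified illustration, not a full implementation: each "tree" here is a one-level stump, so the only split is the root. The data set and all function names are assumptions made for the example.

```python
import random
from collections import Counter

# Hypothetical labelled examples: (features, play).
EXAMPLES = [
    ({"outlook": "sunny",    "humidity": "high",   "windy": False}, False),
    ({"outlook": "sunny",    "humidity": "high",   "windy": True},  False),
    ({"outlook": "overcast", "humidity": "high",   "windy": False}, True),
    ({"outlook": "rainy",    "humidity": "high",   "windy": False}, True),
    ({"outlook": "rainy",    "humidity": "normal", "windy": False}, True),
    ({"outlook": "rainy",    "humidity": "normal", "windy": True},  True),
    ({"outlook": "overcast", "humidity": "normal", "windy": True},  True),
    ({"outlook": "sunny",    "humidity": "normal", "windy": False}, False),
]


def train_stump(examples, feature):
    """One-level tree: map each value of `feature` to its majority label."""
    by_value = {}
    for x, label in examples:
        by_value.setdefault(x[feature], Counter())[label] += 1
    default = Counter(label for _, label in examples).most_common(1)[0][0]
    rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return feature, rule, default


def stump_errors(stump, examples):
    feature, rule, default = stump
    return sum(rule.get(x[feature], default) != label for x, label in examples)


def train_forest(examples, n_trees=25, n_candidates=2, seed=0):
    rng = random.Random(seed)
    features = sorted(examples[0][0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw with replacement, same size as the data.
        sample = [rng.choice(examples) for _ in examples]
        # Random subset of variables considered for this split.
        candidates = rng.sample(features, n_candidates)
        stumps = [train_stump(sample, f) for f in candidates]
        forest.append(min(stumps, key=lambda s: stump_errors(s, sample)))
    return forest


def predict(forest, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(rule.get(x[feat], default) for feat, rule, default in forest)
    return votes.most_common(1)[0][0]


forest = train_forest(EXAMPLES)
print(predict(forest, {"outlook": "sunny", "humidity": "high", "windy": False}))
```

A real Random Forest grows full trees and repeats the random variable selection at every node; the sampling and voting steps are the same as shown here.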
Data-driven software development lifecycle
Model requirements:
Identify which components of the existing (or new) product are feasible to implement with machine learning/ML, data-driven technology.
Elicit requirements for these data-driven/ML components
Determine what types of models are needed (e.g. supervised vs. unsupervised).
Amershi et al., Software engineering for machine learning: a case study. Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, 2019.
Data collection:
Look for available datasets (e.g. internal data, public data, etc.)
or build their own datasets.
May use a mix of datasets (e.g. pre-training on public datasets, and then (post-)training on their own dataset).
Data cleaning:
Remove inaccurate or noisy data records.
Fill in missing data.
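Both cleaning steps can be sketched on a toy table. The records, the 0 to 100 validity range for humidity, and the choice of mean imputation are all assumptions made for this example; other projects may need different rules.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"temperature": 21.0, "humidity": 65.0},
    {"temperature": None, "humidity": 70.0},
    {"temperature": 19.0, "humidity": 250.0},  # impossible humidity: noisy record
    {"temperature": 25.0, "humidity": None},
]


def clean(records):
    # Step 1: remove records whose humidity lies outside the valid 0-100 range.
    valid = [
        r for r in records
        if r["humidity"] is None or 0 <= r["humidity"] <= 100
    ]
    # Step 2: fill each missing value with the mean of the observed values
    # in the same column (mean imputation).
    cleaned = [dict(r) for r in valid]
    for column in ("temperature", "humidity"):
        fill = mean(r[column] for r in valid if r[column] is not None)
        for r in cleaned:
            if r[column] is None:
                r[column] = fill
    return cleaned


for record in clean(raw):
    print(record)
```

The noisy record is dropped before the means are computed, so it cannot distort the values used to fill the gaps.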
Data labelling:
Assign ground-truth labels for each data record
Can be done by software engineers, domain experts or crowd workers (e.g. Mechanical Turk).
Feature engineering:
Extract features from data records – feature extraction
Select informative features (e.g. remove correlated features) – feature selection
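One simple way to remove correlated features is to keep a feature only if its Pearson correlation with every feature already kept stays below a threshold. The sketch below assumes numeric, non-constant columns; the feature names, values and the 0.9 threshold are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with a kept one."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical extracted features, one list of values per feature.
columns = {
    "num_comments": [1, 4, 2, 8, 5],
    "num_comment_words": [10, 40, 20, 80, 50],  # carries the same information
    "num_votes": [3, 0, 5, 1, 2],
}

print(drop_correlated(columns))  # ['num_comments', 'num_votes']
```

The result depends on the order in which features are visited: of two correlated features, the one seen first is the one that survives.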
Model training:
Split the data into training and test sets
Choose a model or a set of models (learning algorithms)
Train the chosen models
Tune hyper-parameters
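The first step, splitting the data, can be sketched as a shuffled hold-out split. The function name, the 25% test fraction and the fixed seed are assumptions chosen for the example.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


tasks = list(range(20))  # stand-ins for 20 labelled data records
train, test = train_test_split(tasks)
print(len(train), len(test))  # 15 5
```

Hyper-parameters are normally tuned on a further validation split carved out of the training set (or by cross-validation), so that the test set stays untouched until the final evaluation.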
Model evaluation:
Assess the model’s performance on test data (precision, recall, F-measure, AUC, MAE, etc.)
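Precision, recall and F-measure follow directly from counts of true positives, false positives and false negatives, and MAE is the average absolute difference between predicted and actual values. A minimal sketch with invented labels:

```python
def precision_recall_f1(y_true, y_pred, positive=True):
    """Precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def mean_absolute_error(y_true, y_pred):
    """MAE, used when the model predicts a number rather than a class."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test-set labels and model predictions.
actual    = [True, True,  True, False, False, False, True, False]
predicted = [True, False, True, True,  False, False, True, False]

print(precision_recall_f1(actual, predicted))  # (0.75, 0.75, 0.75)
```

Here 3 of the 4 predicted positives are correct (precision 0.75) and 3 of the 4 actual positives are found (recall 0.75).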
Model deployment:
Deploy on the targeted devices
Model monitoring:
Continuously monitor for performance and errors
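One possible form of monitoring is to track accuracy over a sliding window of recent predictions once the true outcomes become known. The class below is a hypothetical sketch; the window size and the alert threshold are assumptions, not values from the lecture.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions of a deployed model."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # old entries drop off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [("delay", "delay"), ("delay", "non-delay"),
                          ("non-delay", "delay"), ("delay", "delay")]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_attention())  # 0.5 True
```

A sustained drop below the threshold is a signal to investigate, for example by collecting fresh data and retraining.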
Data-driven software development lifecycle: an example
Model requirements:
Data-driven support for software engineers in developing and
managing software projects (Software Analytics)
• E.g. Help business analysts identify requirements from app reviews
• E.g. Help project managers predict delays and risks
• E.g. Help agile teams estimate effort (story points)
• E.g. Help software developers locate security vulnerabilities and bugs
• E.g. Automatically generate code comments, commit messages, test cases, etc.
• Also referred to as Software Analytics
• Many large organizations (e.g. Microsoft, Google and Facebook) have deployed Software Analytics in their software development process.
AI for Software Engineering (AI4SE)
[Figure: software engineering activities (requirements, implementation, verification and validation, maintenance and evolution, software project management) and the data artifacts available to AI4SE: commit messages, test cases, issues, bug reports, product backlog user stories, source code, system events, usage logs and app reviews.]
AI4SE: delay prediction in software projects
Model requirements:
Project manager: Which of these ongoing tasks are at risk of being delayed?
AI Engineering lifecycle – example
Model requirements:
[Figure: delay prediction system. Feature and label extraction turns past tasks from the project’s issue tracking system (e.g. JIRA) into training tasks t1, t2, t3, …, each with task features f1, f2, f3, …, fn and a known delay outcome (e.g. major delay, minor delay, non-delay). Supervised learning on these tasks produces a classifier that takes the features f1, f2, f3, …, fn of a new ongoing task and outputs a predicted delay outcome.]
AI Engineering lifecycle
Data collection:
Analyze 40,830 past tasks (i.e. issues) in 5 large software projects: Moodle, JBoss, Apache,
Duraspace, and Spring.
All these tasks are recorded in the JIRA issue tracking system
Data cleaning:
Remove outliers, e.g. incomplete tasks, issues/tasks long overdue, issues with no due date, etc.
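Two of these filters (incomplete tasks and issues with no due date) can be sketched as a simple predicate over issue records. The record layout, field names and issue keys are hypothetical and do not reflect the JIRA data model; the "long overdue" filter is omitted because the source gives no cut-off for it.

```python
from datetime import date

# Hypothetical issue records exported from an issue tracker.
issues = [
    {"key": "DEMO-1", "due": date(2015, 3, 1), "resolved": date(2015, 3, 4)},
    {"key": "DEMO-2", "due": None,             "resolved": date(2015, 3, 2)},  # no due date
    {"key": "DEMO-3", "due": date(2015, 3, 5), "resolved": None},              # incomplete
]


def usable_for_training(issue):
    """Keep only completed issues that had a due date."""
    return issue["due"] is not None and issue["resolved"] is not None


training_issues = [issue for issue in issues if usable_for_training(issue)]
print([issue["key"] for issue in training_issues])  # ['DEMO-1']
```

Both fields are needed later, since a delay outcome can only be derived for an issue that has a due date and a resolution date.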
Data labelling:
Feature engineering:
1. Discussion time
2. Waiting time
4. Number of times that an issue is reopened
5. Priority
6. Changing of priority
7. Number of comments
8. Number of fix versions
9. Changing of fix versions
10. Number of affect versions
11. Number of issue links
12. Number of issues that are blocked by this issue
13. Number of issues that block this issue
14. Topics of an issue’s description (NLP/LDA)
15. Changing of description
16. Number of votes
17. Number of watches
18. Reporter reputation
19. Developers’ workload
20. Percentage of delayed issues that a developer was involved with
21. Task dependencies (e.g. blocking, assigned to the same person, or affecting the same components)
Descriptive -penalized logistic regression model for risk probability, trained on all tasks collected from the five projects
Model training:
The data (e.g. the 40,830 past tasks) is split into a training set and a test set.
Use a number of classifiers: Random Forests, Neural Networks, Decision Tree (C4.5), Naïve Bayes and NBTree.
Training set is used to train these classifiers.
Model evaluation:
[Figure: bar chart comparing precision, recall and F-measure (y-axis from 0.2 to 1.0) for Random Forests, aNN, C4.5 and Naïve Bayes.]
• , Dam, Truyen Tran and , Characterization and prediction of issue-related risks in software projects, Proceedings of the 12th International Conference on Mining Software Repositories (MSR 2015), co-located with ICSE 2015, pages 280-291, IEEE (ACM SIGSOFT Distinguished Paper Award)
• , Dam, Truyen Tran and , Predicting the delay of issues with due dates in software projects, Empirical Software Engineering journal, Volume 22, Issue 3, pages 1223-1263, Springer.
Model deployment:
AI-powered plugin for the JIRA issue tracking system
• Recommending story points, labels, priority, type and components for each issue
• Visualization of issue dependency
2019 CSIT321 Project – viTech Team
See demo: https://youtu.be/iI-3Rj-AWRs
See the article featuring this project on the Atlassian developer website: https://blog.developer.atlassian.com/artificial-intelligence-for-issue-analytics-a-machine-learning-powered-jira-cloud-app/
Conceptualization of AI Engineering
Source: Developing AI Systems – New challenges for Software Engineering, ICSOC 2019
Top important jobs in AI
Source: Software Engineering for AI