Domain Drivers – Lecture 3-4
Professor Richard O. Sinnott
Director, Melbourne eResearch Group
University of Melbourne
Objectives
• To give the “big picture” of why we need Cluster and Cloud Computing
– This lecture is not focused on technologies, but on giving examples of how challenges are shaping the technological landscape
• …and how on-going/completed projects have met/are meeting those challenges
– Many perspectives
• Big Data – and the hype!
• Big Compute
• Big Distribution
• Big Collaboration
• Big Security
• Often similar challenges facing many research domains
• Tools, technologies and methodologies have been/can/are evolving to tackle these challenges
– That there is a huge amount of work still to be done
• Don’t believe the hype!!!
• The pace of research evolution FAR outweighs the pace of IT know-how to deal with the challenges
– Domain knowledge!!!
Focal Point
• Examples from different research domains
– {Computational/Data/Distributed/Collaboration/Security} bound…
• High Energy Physics
• Astrophysics
• Electronics
• Bioinformatics
• Biomedical and bioinformatics domain
• BREAK
• US Intelligence
• Social Sciences/Urban research domain
– (prelude to workshop and assignment 2)
Project Portfolio Subset of On-Going
• EU European Platform for Study of Wolfram, Alstrom, (EuroWABB)
• National e-Science Centre (I, II, III)
• Dynamic Virtual Organisations for e-Science Education
• Biomedical Research Informatics Delivered by Grid Enabled Services
• GridNet, GridNet-2
• Grid Enabled Microarray Expression Profile Search
• Glasgow early adoption of Shibboleth
• Joint Data Standards Survey
• ESP-Grid
• HPC Compute cluster award // Sun industrial sponsorship
• OGC Collision
• Multicenter prospective study of biochemical profiles of monoamine-producing tumors (PMT Study)
• European Society of Hypertension Study on Pheo/PGL
• International DSD
• EU FW7 European Network for Study of Adrenal Tumors Cancer Research Platform (ENSAT-CANCER)
• VicHealth Health Indicators and Spatial Objective Data
• National Spinal Injury Research Platform
• Australian Urban Research Infrastructure Network (AURIN)
• Epilepsy e-Learning portal
• Type-1 Diabetes study of environmental factors on onset of T1D
• Australian Diabetes Data Network (ADDN)
• International Niemann-Pick A, B and C Registry
• Data Journalism in the Big Data Era
• FAMIAN – Combined 18F-fluorodeoxyglucose positron emission tomography and 123I-Iodometomidate
• OMII-Security Portlets // OMII-RAVE
• Integrating VOMS and PERMIS for Superior Grid Authorization
• CESSDA PPP
• Pharming of Therapeutic RNA
• Grid Enabled Occupational Data Environment
• Towards an e-Infrastructure for e-Science Digital Repositories
Imaging for Adrenal Neoplasia
• Melbourne Genomics Health Alliance (variant DB)
• NeCTAR Cloud Encryption/Decryption and Secure Deletion
• CRE for Protection of Pancreatic Beta Cells
Airbox (Atmospheric Physics and Climate Research)
• Grid enabled Biochemical Pathway Simulator
• Virtual Organisations for Trials and Epidemiological Studies
• A European e-Infrastructure for e-Science Repositories
• Modelling, Inference and Analysis for Biological Systems up to the Cellular Level
• Drug Discovery Portal
• Parliamentary Discourse
• Scots Words and Placenames
• Qvolution stress management survey system
• Advanced Grid Authorisation through Semantic Technologies ShinTau
• AlstromUK VRE
• Grid-enabled Virtual Safe Settings
• Clinical Streaming Transcription Software
• Enhancing Repositories for Language and Literature Researchers (ENROLLER)
• Proxy Credential Auditing Infrastructure for the NGS
• Scottish Bioinformatics Research Network (SBRN)
• Generation Scotland Scottish Family Health Study
• Breast Cancer Tissue Biobank
• Australian Diabetes Data Network – Phase II (ADDN2)
• Helicopter advanced training system, Australian Department of Defence
• Hort-eye Cloud analytics
• Public Records Office Victoria Data Management Solutions
• Complex System Modelling Platform and GPU utilisation
• Public Records Office Victoria Data Management Solutions Follow-Up Grant
• VicHealth 2016 Indicators API
• Helicopter advanced training system Phase II, Australian Department of Defence
• Twitter data analytics for business
• Mobile Applications for Patients with
• Systems Genomics Support Platform
• SWARM: Smartly-assembled Wiki-style Argument Marshalling (SWARM)
ORCA Cognitive Assessment Platform
• Data Management through e-Social Science (DAMES)
• Meeting the Design Challenges of nanoCMOS Electronics (nanoCMOS)
• EU FW7 AvertIT
• EU FW7 EuroDSD
• NeSC Research Platform (NRP)
• NeSC Information Network (NIN)
• ESF Network for Study of Adrenal Tumors
• Scottish Health Informatics Platform for Research (SHIP)
• National E-Infrastructure for Social Simulation (NeISS)
• EU R4SME Diagnosis of Parkinsons Disease (DiPAR)
• Automating River Pollution Detection (CAPIM)
• Endocrine genomics Virtual Laboratory (endoVL)
• DSDNetwork Australasia
• NESP Clean Air and Urban Environments
• Application of omics-based strategies for improved diagnosis and treatment of endocrine hypertension
• Youth alcohol consumption database and mobile app
• LIDAR Data Analytics Research Environment
• Type-1 Diabetes Clinical Research Network
• American Asian Australian Adrenal Alliance
• International League Against Epilepsy
• Platform for Research Software Solutions (PRESS)
• Mobile applications for the Environmental Determinants of Islet Autoimmunity
• Secure Data Solutions for the Biomedical Communities of the Cloud
• Metabolomics Sample Management and Processing Platform
• Linked Data PolicyHub Stage II: Urban & Regional Planning & Communications
• Australian Genomics Health Alliance
• Melbourne Genomics Health Alliance
• 88days Backpacker app
• VicSpin Victoria-wide Flu Surveillance System
• ElectraNet LIDAR/VectorNZ Lidar
• Growing Landscape Carbon 5
• Replicats
• Bushfire data management platform
Compute Scaling
Network Scaling
• From tablets, to papyrus, to books
– (quite adequate for several thousand years)
• Enter silicon transistors circa 1960
– punch cards,
– punched streamer tape,
– magnetic tape,
– floppies,
– …
Data Present
• Data Storage today
– local (computer) hard disks,
– shared storage,
– tape storage,
– mobile storage,
– The Internet!
• Dropbox
• Google
• Clouds
• …
Data Deluge
• The combined space of all computer hard drives in the world was estimated at approximately 160 exabytes in 2006…
Gantz (March 2007) An IDC White Paper: The Expanding Digital Universe
• The total amount of global data is expected to grow to 2.7 zettabytes during 2012. This is 48% up from 2011.
International Data Corporation
• By 2015, U.S. IP traffic could reach an annual total of one zettabyte from YouTube, IPTV, and high-definition images, …. Internet of 2015 will be at least 50 times larger than it was in 2006.
http://www.circleid.com/posts/813110_internet_traffic_graph_zettabyte/
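To put the quoted figures side by side, here is a small back-of-envelope sketch. It assumes decimal SI prefixes (1 EB = 10^18 bytes, 1 ZB = 10^21 bytes), and the variable names are illustrative only. Note that the 2006 figure is drive capacity while the 2012 figure is data volume, so the ratio is only a rough indication of scale.

```python
# Back-of-envelope arithmetic on the figures quoted above.
EB = 10**18  # exabyte in bytes (decimal SI prefix, assumed)
ZB = 10**21  # zettabyte in bytes (decimal SI prefix, assumed)

drives_2006 = 160 * EB        # estimated worldwide hard-drive capacity, 2006
data_2012 = 2.7 * ZB          # projected global data volume, 2012
growth_2011_to_2012 = 0.48    # "48% up from 2011"

# The 2012 figure is 48% larger than 2011, so divide by 1.48.
data_2011 = data_2012 / (1 + growth_2011_to_2012)

# Rough scale comparison: 2012 data volume vs 2006 drive capacity.
ratio_2012_vs_2006 = data_2012 / drives_2006

print(f"Implied 2011 volume: {data_2011 / ZB:.2f} ZB")
print(f"2012 data vs 2006 drive capacity: {ratio_2012_vs_2006:.1f}x")
```

The implied 2011 volume comes out at a little over 1.8 ZB, and the 2012 projection is roughly 17 times the 2006 drive-capacity estimate.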
Data Intensive / Data driven Research
• Researchers need tools, methodologies
– To search for/discover data
– To use/analyse data
– To share data
– To store data
– To track data
– To destroy data
– To move data around
– To check authenticity of data
– To visualise data
– To overcome issues of data heterogeneity
– …
… and this should be tailored to the researchers' needs!!!
Compute Infrastructure for High Energy Physics
There is a “bunch crossing” every 25 nsecs. There are 100 “triggers” per second
Each triggered event is ~1 MByte in size
[Figure: tiered computing model for High Energy Physics. Labels: ~PBytes/sec; Online System; ~100 MBytes/sec; Offline Processor Farm (~20 TIPS); CERN Computer Centre; France, Germany and Italy Regional Centres; FermiLab (~4 TIPS); ~622 Mbits/sec; Tier2 Centres (Caltech ~1 TIPS, others ~1 TIPS); Institutes; Physics data cache; Physicist workstations; ~1 MBytes/sec; ~25 PB per …]
Raw data per event ~1 MB, produced at a rate of about 40 million events per second.
Basic filtering reduces to 100,000 events per second
Advanced filtering to 100 or 200 events per second
This two-stage data re-processing is performed several times a year on all data acquired since the LHC start-up.
1 TIPS is approximately 25,000 SpecInt95 equivalents
Physicists work on analysis “channels”.
Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
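The filtering figures above translate into data rates with simple arithmetic. The sketch below assumes a decimal megabyte (10^6 bytes) for the ~1 MByte event size; the names are illustrative and not taken from any LHC software.

```python
# Rough data rates implied by the bunch-crossing and trigger figures above.
BUNCH_SPACING_S = 25e-9   # one bunch crossing every 25 ns
EVENT_SIZE_BYTES = 1e6    # ~1 MByte per event (decimal megabyte assumed)

# 1 / 25 ns gives about 40 million crossings per second.
crossings_per_s = 1 / BUNCH_SPACING_S


def rate_bytes_per_s(events_per_s: float) -> float:
    """Data rate in bytes/second for a given event rate."""
    return events_per_s * EVENT_SIZE_BYTES


stages = {
    "raw (every crossing)": crossings_per_s,
    "after basic filtering": 100_000,
    "after advanced filtering (low)": 100,
    "after advanced filtering (high)": 200,
}

for name, events_per_s in stages.items():
    print(f"{name}: {rate_bytes_per_s(events_per_s) / 1e6:,.0f} MB/s")
```

Under these assumptions the raw rate is about 40 TB/s, basic filtering brings it to about 100 GB/s, and advanced filtering to 100 to 200 MB/s, which is the order of magnitude that can realistically be written to storage.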
Mapping the Skies
“Chipsets needed to process data access and SKA applications will need to be capable of 20-25 exaflops of processing power”, according to IBM Research’s Ton Engbersen, DOME scientist and project leader. “Take the current global daily Internet traffic, double it, and you are in the range of the data set that the SKA will collect each day.” This would equate to around 40,000 PB every 24 hours.
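As a rough illustration of what that daily volume means as a sustained rate, the following sketch assumes the figure is 40,000 petabytes (not petabits) and uses decimal prefixes; both are assumptions.

```python
# Sustained data rate implied by ~40,000 PB collected every 24 hours.
PB = 10**15                      # petabyte in bytes (decimal prefix, assumed)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

daily_volume_bytes = 40_000 * PB
sustained_rate = daily_volume_bytes / SECONDS_PER_DAY  # bytes per second

print(f"{daily_volume_bytes / 10**18:.0f} EB per day")
print(f"{sustained_rate / 10**12:.0f} TB/s sustained")
```

That is about 40 EB per day, or several hundred terabytes every second, around the clock.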
Meeting the Design Challenges of Nano-CMOS Electronics
e-Science Pilot Project (EPSRC)
A. Asenov (PI)
R. Sinnott (e-Science Director)
£3.7M EPSRC; £4.4M FEC £5.8M incl. industrial contributions
Industrial partners University partners
11 PDRAs (7 + 4) 7 PhD
Semiconductor device variability
Historic simulation paradigm
A 22 nm MOSFET In production 2011
A 4.2 nm MOSFET In production 2023
Challenges of NanoCMOS Design
+ Statistical
Challenges
Hierarchical statistical system simulations
• Very large device and circuit simulations
– 3D devices
– 10^5 circuit components
• Large statistical samples
– 1000 – 100000 3D simulations – 4D
– 1000 – 100000 circuit simulations
• Complex flow and storage of data
– Many files per simulation
– Metadata capture and data provenance
• Collaboration between 5 partners
– Multidisciplinary background
– Complex data exchange
• Stringent security requirements
– Commercial IP
– Expensive software licenses
• We started with a secure portal and a wiki!!!
Experiences
But ended up with…
• … a command-line based solution
– This community are very HPC savvy
• Security solutions involved integration of multiple technologies
• Secure, distributed file-based data management
• Meta-data capture through REST-based solution
• Job submission (& data management & seamless security) to massive (at the time) HPC systems…
– ScotGrid, NGS, TeraGrid, ECDF, partner clusters
– Millions of CPU hours clocked up!
The –g flag!!!
The e-Health Future…
+ environmental, social, geographic …
Nucleotide sequences
Cell signalling
Nucleotide structures
Gene expressions Protein Structures
Protein functions Protein-protein interaction (pathways)
Physiology
Organisms Populations
Life Sciences
• Extensive Research Community – Parkville Precinct for example
• Many people care about them
– Health, Food, Environment – truly interdisciplinary!
• Interacts with virtually every discipline
– Physics, Chemistry, Maths/Stats, Nano-engineering, …
• Thousands of databases relevant to bioinformatics (and growing!)
– Heterogeneity, Interdependence, Complexity, Change, …
• Some of the Big Questions/Challenges
– How does a cell work?
– How does a brain work?
– How does an organism develop?
– Why do people who eat less tend to live longer?
– …
More (and more and more) genomes…
pestis thaliana
elegans jejuni
Helicobacter Mycobacterium
Buchnera sp. APS
Chlamydia pneumoniae
Aquifex aeolicus
Vibrio cholerae
Neisseria meningitidis
Salmonella enterica
Archaeoglobus fulgidus
Drosophila melanogaster
burgorferi tuberculosis
Escherichia coli
Thermoplasma acidophilum
Rattus norvegicus
falciparum
aeruginosa
urealyticum
Xylella fastidiosa
Mus musculus
Saccharomyces cerevisiae
prowazekii
Bacillus subtilis
Thermotoga maritima
Distributed, completely heterogeneous data
LPSYVDWRSAGAVVDIKSQG ECGGCWAFSAIATVEGINKI TSGSLISLSEQELIDCGRTQQD NTRGCDGGYI TDGFQFIIND GGINTEENYPYTAQDGDCDV
AGGTATAGCGCGCGCGATATATA
AAATGTACGTACGGGCCCTTATA CGCGCGCGATATATAGCGCGCG
Gene expression
Morphology
Translational Research
Just one example!
[Figure: BRIDGES Project architecture. Labels: VO Authorisation; Information Integrator; Synteny Service; MagnaVista Service; CFG Virtual Curated Data; Ensembl; OMIM; SWISS-PROT; MGI; private data holdings (Glasgow, Edinburgh, London, Netherlands); www.nesc.ac.uk]
Importance of Data Visualisation
Data –> Knowledge?
• Once upon a time…
Crowdsourcing Knowledge & Reasoning
• Many approaches that work (?)…
CREATE Program
– Commenced in 2017
– Involve(d) four teams*
• TRACE – Trackable Reasoning and Analysis for Collaboration and Evaluation (Syracuse)
• Co-Arg – Cogent Argumentation System with Crowd Elicitation ( University)
• BARD – Bayesian Argumentation via Delphi (Monash)
• SWARM – Smartly-assembled Wiki-style Argument Marshalling (UniMelb)
– www.swarmproject.info
* Interesting paradigm of funding…
SWARM Overview
• Solution overview
SWARM Overview
• Key features
– Anonymous
• Avatars and rating
– No leader
– Team size flexibility
• Dockerized AWS/NeCTAR
• Kubernetes/Docker Swarm
– Public vs Private
• Off platform work supported
– Arbitrary contributions
• No payments/obligations
– Social interactions (chat)
• Social warmth
– Ad hoc use encouraged
Proof of the Pudding
• Does SWARM help to reason better?
– ASIO, VicPolice, …
• 81 reports on platform; 167 off platform (normal)
SWARM Clouds
• Developed on OpenStack (NeCTAR)
– Scripted solution using Docker & Kubernetes
• Deployed to AWS (US)
– $1000+ / month for basic use
• (costs ramp up a LOT!)
• Benefits of scripted solution
– Developed/tested/trialled on free Cloud (NeCTAR)
– Deployed to AWS when ready
– Not possible had we used AWS Elastic Container Service for Kubernetes (EKS)
• (…or KS or Google GKE etc etc)
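The slides do not show the deployment scripts themselves, so the following is only a minimal sketch of the idea behind a scripted, cloud-neutral solution: one function generates the same Kubernetes Deployment manifest for every target, and only a small per-cloud settings table differs. The registry URLs, application name, image tag and replica counts are all hypothetical placeholders.

```python
import json

# Hypothetical per-cloud settings; only these differ between targets.
TARGETS = {
    "nectar": {"registry": "registry.example.org/swarm", "replicas": 1},
    "aws": {"registry": "registry.example.com/swarm", "replicas": 3},
}


def deployment_manifest(target: str, app: str = "swarm-web", tag: str = "1.0") -> dict:
    """Build a Kubernetes apps/v1 Deployment manifest for the given target."""
    cfg = TARGETS[target]
    labels = {"app": app}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app, "labels": labels},
        "spec": {
            "replicas": cfg["replicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": app, "image": f"{cfg['registry']}/{app}:{tag}"}
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Emit one JSON manifest per target cloud.
    for target in TARGETS:
        print(json.dumps(deployment_manifest(target), indent=2))
```

Because the manifest is plain Kubernetes rather than a provider-specific managed service definition, the same script can be pointed at a free research cloud during development and at a commercial cloud for production.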
Australian Urban Research Infrastructure Network (AURIN)
• EIF/NCRIS federally funded project
• DIISRTE -> Innovation -> Education
• $40m+ project (+$18.9m)
• www.aurin.org.au
• University of Melbourne is the lead agent
• Establishing an e-Infrastructure for Urban and Built Environment Researchers
– Distributed, (completely!) heterogeneous datasets
– Data interrogation services
– Security (unit level data, health data, commercial data!)
– Online analysis tools
– Collaboration!!!
AURIN Context $2m+
• Urban and built environment is extremely broad
• transport,
• future population,
• liveability,
• housing,
• design,
• …
• Much research depends on access to and usage of data
• There is LOTS (and LOTS) of data of relevance to urban research!!!
• Completely heterogeneous, e.g. geospatial, statistical, temporal, survey, …
• Data is more often than not silo’d
• Requires tools to find, interrogate, analyze and visualize data and enforce good research methodologies
• Consolidate tools and best practice/community know-how!
• Allow researchers to share results, interact and collaborate
• No single expert!
• Allow data providers to keep control of their data and its use
• Authentication and authorisation (and auditing/accounting)
AURIN Simplified
BoM AEMO NPI
Citizen ? Science
• ~5922 data sets
• >147 organisations, e.g. ABS, VicRoads, GA, PSMA, …
• Australian geo-classifications and their changes over time
• Many data heterogeneity and security issues overcome
AURIN Example (in one slide!)
AURIN Clouds
• Original plan to use NeCTAR
• Early reliability issues
• Actual plan
• Servers purchased (VMware)
• Used for production system
• 8 years old/partial refresh 2016
• 3 days outage in 8 years!!!
• Failover system on NeCTAR
• Dev, Staging, Production
• 20,000+ users
• ~4 million lines of code
• CouchDB, web services (ReST, …), GeoJSON, …
• Much of the code not written by my team!!!
• Move to using Docker containers/Kubernetes ongoing
• e.g. L. Chen, Y. Pan, R.O. Sinnott, Auto-Scaling a Cloud-based Walkability Tool through Kubernetes and Docker Swarm, CLOSER 2020, Prague, Czech Republic, May 2020.
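The outage figure quoted for the production servers (three days in eight years) can be turned into an availability percentage. This is a simple sketch that assumes an average year of 365.25 days and treats the three days as total downtime.

```python
# Availability implied by 3 days of outage over 8 years of operation.
DAYS_PER_YEAR = 365.25   # average year length, including leap years (assumed)

outage_days = 3
period_years = 8

total_days = period_years * DAYS_PER_YEAR
availability = 1 - outage_days / total_days

print(f"Availability: {availability:.3%}")
```

Under these assumptions the availability works out at roughly 99.9%, i.e. close to "three nines".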
Demonstration in Workshops (note – Assignment II)
AURIN Homework
(https://portal.aurin.org.au)
Find the suburb (SA2) in Greater Melbourne that had the highest number of Jobseeker recipients in June 2020.
(not assessed!)
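One possible way to approach this kind of question once a dataset has been exported from the portal as GeoJSON is sketched below. The property names (`sa2_name`, `jobseeker_count`) and all values are made-up placeholders, not real AURIN field names or real figures; check the actual dataset schema before adapting it.

```python
import json

# Placeholder GeoJSON with invented SA2 names and counts (NOT real data).
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "geometry": null,
     "properties": {"sa2_name": "Example SA2 A", "jobseeker_count": 120}},
    {"type": "Feature", "geometry": null,
     "properties": {"sa2_name": "Example SA2 B", "jobseeker_count": 340}},
    {"type": "Feature", "geometry": null,
     "properties": {"sa2_name": "Example SA2 C", "jobseeker_count": 95}}
  ]
}
"""

collection = json.loads(geojson_text)

# Pick the feature whose (hypothetical) count property is largest.
top = max(
    collection["features"],
    key=lambda feature: feature["properties"]["jobseeker_count"],
)

print(top["properties"]["sa2_name"], top["properties"]["jobseeker_count"])
```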
Questions …?