• To understand
– What is big data
– New technologies
• Why we need them
• How they work: fundamentals and practice
– The fabric of these technologies in terms of
• System design for big data storage and data management
• Programming against big data
• Querying big data
Big data: Systems, programming, management
• Ambitious course
• Aiming to familiarize students with the fundamentals of big data and its 3 constituent pillars:
– Data management issues:
• Why new models/abstractions and new approaches to consistency, indexing, query processing, optimization, etc. are needed
– Systems issues:
• Distributed file systems vs NoSQL systems vs SQL systems
• Ensuring scalability, load balancing, high resource utilization, (near-) real-time processing
– Programming:
• New paradigms: MapReduce/Spark and extensions
– Learn the programming abstractions/tools available for accessing and manipulating massive amounts of data, procedurally
• Modern application domains
– Declarative query languages
– Machine learning
– Graph/stream processing
• Unique, in the sense that it
– Encompasses:
• (Distributed) Systems issues (fault tolerance, high performance, resource utilization, caching, load balancing)
– Operating Systems, File Systems
• Data Management
– Data modelling, consistency, query processing, indexing
– Databases
• Programming:
– Functional Map-Reduce
– Parallel programming
– Query languages / Environments
Programming for big data
• Learn how to programmatically access big data
MapReduce + Spark
– How to write programs that can access in a massively parallel fashion huge data collections stored in (geo-) distributed clusters of (up to thousands of) nodes
– Without worrying about
• Failures,
• Synchronization,
• Combining data from disk reads/writes, data localities
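The programming model mentioned above can be illustrated with a minimal, single-process sketch of the MapReduce pattern (word count). This is not Hadoop or Spark code: the function names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, chosen only for illustration, and the whole "cluster" is simulated in memory. In a real framework the programmer would typically supply only the map and reduce functions, while the runtime handles distribution, failures, synchronization and data locality.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical in-memory driver standing in for the framework runtime."""
    # Shuffle: group every value emitted by the mapper under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce: apply the reducer once per key.
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["big data systems", "big data programming"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Because each call to the mapper depends on only one line, and each call to the reducer on only one key, both phases could in principle be spread over many nodes without changing the user-written functions.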
Big data storage systems
• Learn big data systems
– Scalable distributed data storage / file systems
• Apache HDFS ~= Google's File System (GFS)
• Yahoo! is responsible for most of HDFS
– Modern scalable big data management systems, aka NoSQL systems
• Google's BigTable ~= Apache HBase,
• Amazon's Dynamo != Apache Cassandra
• [Riak, Voldemort, …]
(Possibly) Big data querying systems
• Graph data management/processing systems
– Pregel/Giraph
• Stream processing systems
– – Storm
• Machine learning over Big Data
– SparkML
– TensorFlow
• Declarative query languages for Big Data
• Going beyond a single datacentre
– Geo-distributed storage/computation
Goals: High Level
At the end of the course, you should:
• Understand
– Why new paradigms are needed for
• data management,
• systems, and
• programming
– What are the fundamental design decisions and trade-offs in each of these pillars
• Know how to:
– Write programs that access and manipulate in a massively parallel fashion huge volumes of data, which are stored in various modern DBs, or FSs
• First and foremost, though, be exposed to recent research results in the field of Big Data systems
• Recommended textbook:
– M. Kleppmann. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly
• Google around ;-)
• Moodle page
– Updated regularly as we go
– Slide decks, recordings, links, extra reading
– Student forum (please feel free to ask any and all questions regarding
the course material)
Introduction to Data Science & Systems
Schedule of lectures
Topics covered (tentative; timing difficulties still to be ironed out)
Intro to the course and Big Data
Distributed File Systems; case studies: Google File System, HDFS
Distributed Data Processing; case study: Hadoop MapReduce, YARN
Distributed Data Processing; case study: RDDs and Spark
Distributed NoSQL Systems; case study: Google BigTable, Apache HBase, Amazon Dynamo, Apache Cassandra
TBD + Revision
Lectures and Tutorials
• 3-4 hours per week
– 2-hour lectures: 1100-1300 on Wed
• Online
• Plus possibly extra hour (1500-1600) in weeks 1 and 2
– 2-hour labs/tutorials: 1500-1700 on Wed, starting in week 3
• Online + 1028 (alternate weeks for LB01 and LB02)
– Keep an eye on Moodle and your email inbox!
• Pre-session reading vital for the course
– To be uploaded on Moodle every Thursday/Friday, for the lectures of the following week
– Study the material and come prepared to discuss it
• Sessions a mix of typical lecturing and discussion/peer instruction-based activities
– Require an internet-enabled device
– Comprehension quizzes as well as conversation starters
• Beware of self-study!
– Not just about spending time on your own reading up
– You are in charge of what, how and when/for how long you study
Assessment
• Final exam
– 75% of the grade
• Assessed exercise (groups of up to 3)
– 20% of the grade
• Given a real-world problem and data
• Asked to design, implement and test a solution
• In-class quizzes
– 5% over all lectures
• Comprehension questions at the beginning of/during every lecture
– 1-minute feedback after every lecture
• Anonymous and optional, but highly useful for us so please do leave your comments (positive or negative)
Miscellanea
• Questions
– Will NOT be answered at the break/in person after the lecture
– Either ask your questions during the lecture or use the Student Forum under the respective topic on the course's Moodle page
• Feel free to both ask and answer questions posted on the forum
• You are assumed to know how to interact with a command line interface
• You are assumed to know how to write code in Java or Scala
• Pre-session reading in the form of academic/research papers
– Examinable
– Learn to read
• Read the abstract, intro and conclusions
• Glance over related work and experiments
• Pay close attention to the design/implementation sections (i.e., the rest)
• Extra reading available for those who want to go the extra mile
– Not examinable
• Working in groups (up to 3 students)
– Let us know your group composition ASAP (see Richard's further instructions)
– Note: after the AE hand-out, you won't have access to the course material unless you are in a group
• Lecture notes are not sufficient; you must study the pre-session reading material and keep your own notes
• You must schedule time each week to make sure you don't lose track
• You must pay attention to the assessment handouts and instructions
• Proper time management will be your friend throughout the semester
• Budget your time
– 10-credit course = 100 hours of work
• ~20 hours to revise for the exam + 1.5 hours to sit the exam
– Assuming attendance/participation in the course throughout the semester!
• 20% assessed exercise = ~20 hours
• Weekly:
– 2 hours attending lectures
– 1 hour attending lab/tutorial
– ~3 hours of self-study (plus extra time for AEs)
– ~6 hours per week
– ~9-10 hours per week when AEs are out