Apache Spark
Syllabus
Apache Spark Training Overview
Apache Spark is a unified framework for big data analytics that gives developers, data scientists, and analysts a single integrated API for a wide range of processing tasks. It supports popular languages such as Python, R, SQL, Java, and Scala. The main aim of this Apache Spark training is to give data scientists, data analysts, and software developers hands-on experience in building real-time data stream analysis and large-scale machine learning solutions.
Apache Spark Training Objectives
- Apache Spark Architecture
- How to use Spark with Scala
- How to deploy Spark projects to the cloud
- Machine Learning with Spark
Pre-requisites of the Course
- Basic knowledge of object-oriented programming is enough
- Knowledge of Scala will be an added advantage
- Basic knowledge of databases and SQL queries will be an added advantage for learning this course
Who should do the course
- Developers, Architects, IT Professionals
- Software Engineers, Data scientists, and Analysts
Apache Spark Course Content
Batch and Real-Time Analytics with Apache Spark
SCALA (Object Oriented and Functional Programming)
- Getting started With Scala
- Scala Background, Scala Vs Java and Basics
- Interactive Scala – REPL, data types, variables, expressions, simple functions
- Running the program with Scala Compiler
- Explore the type lattice and use type inference
- Define Methods and Pattern Matching
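For orientation, the following is a minimal sketch of the kind of program written in this module, covering values, type inference, a simple method, and pattern matching; the object and method names are illustrative only.

```scala
// Minimal sketch: values, type inference, a simple method, and pattern matching.
object ScalaBasics {
  def describe(x: Any): String = x match {
    case 0               => "zero"
    case n: Int if n > 0 => "a positive integer"
    case s: String       => s"the string '$s'"
    case _               => "something else"
  }

  def main(args: Array[String]): Unit = {
    val greeting = "Hello, Scala"   // type inferred as String
    println(greeting)
    println(describe(42))
    println(describe("spark"))
  }
}
```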
Scala Environment Setup
- Scala setup on Windows and UNIX
Functional Programming
- What is Functional Programming?
- Differences between object-oriented and functional programming
Collections (very important for Spark)
- Iterating, mapping, filtering, and counting
- Regular expressions and matching with them
- Maps, Sets, groupBy, Options, flatten, flatMap
- Word count, IO operations, file access, flatMap
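The collection operations listed above can be tied together in a few lines; here is a small sketch (sample data and names are illustrative only) showing flatMap, filter, groupBy, and a word count over plain Scala collections.

```scala
// Sketch: flatMap, filter, groupBy, and a word count over a List of lines.
object CollectionsDemo {
  def main(args: Array[String]): Unit = {
    val lines = List("spark makes big data simple", "scala makes spark concise")

    val words     = lines.flatMap(_.split("\\s+"))   // flatten lines into words
    val longWords = words.filter(_.length > 4)        // filtering
    val wordCount = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

    println(longWords)
    println(wordCount)   // e.g. Map(spark -> 2, makes -> 2, ...)
  }
}
```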
Object-Oriented Programming
- Classes and Properties
- Objects, Packaging, and Imports
- Traits
- Objects, classes, inheritance, Lists with multiple related types, apply
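A brief sketch of the object-oriented features in this module: a trait, a class with a property, a companion object with apply, and a list holding multiple related types. All names are illustrative.

```scala
// Sketch: trait, class with a property, companion object with apply, inheritance.
trait Greeter {
  def greet(): String
}

class Person(val name: String) extends Greeter {
  override def greet(): String = s"Hello, I am $name"
}

object Person {
  // apply lets callers write Person("Ada") instead of new Person("Ada")
  def apply(name: String): Person = new Person(name)
}

object OopDemo {
  def main(args: Array[String]): Unit = {
    // a List typed to the common supertype can hold related types
    val people: List[Greeter] = List(Person("Ada"), Person("Alan"))
    people.foreach(p => println(p.greet()))
  }
}
```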
Integrations
- What is SBT?
- Integration of Scala in Eclipse IDE
- Integration of SBT with Eclipse
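As a reference point for the SBT topics above, here is a minimal sketch of a build.sbt for a Spark project; the project name and version numbers are placeholders, so use whatever versions your course environment targets.

```scala
// build.sbt - minimal sketch of an SBT build definition for a Spark project.
// Versions below are placeholders; match them to your cluster and Scala version.
name := "spark-training"

version := "0.1.0"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8",
  "org.apache.spark" %% "spark-sql"  % "2.4.8"
)
```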
SPARK CORE
- Batch versus real-time data processing
- Introduction to Spark, Spark versus Hadoop
- The architecture of Spark
- Coding Spark jobs in Scala
- Exploring the Spark shell and creating a SparkContext
- RDD Programming
- Operations on RDD
- Transformations
- Actions
- Loading Data and Saving Data
- Key Value Pair RDD
- Broadcast variables
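The following compact sketch ties the RDD topics above together: creating a SparkContext, loading data, transformations, actions, a key-value pair RDD, a broadcast variable, and saving results. The application name and file paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: SparkContext, loading data, transformations, actions,
// key-value pair RDDs, a broadcast variable, and saving data.
object RddWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddWordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val stopWords = sc.broadcast(Set("a", "an", "the"))           // broadcast variable

    val lines  = sc.textFile("data/input.txt")                    // loading data
    val counts = lines
      .flatMap(_.toLowerCase.split("\\s+"))                       // transformation
      .filter(w => w.nonEmpty && !stopWords.value.contains(w))    // transformation
      .map(w => (w, 1))                                           // key-value pair RDD
      .reduceByKey(_ + _)                                         // transformation

    counts.take(10).foreach(println)                              // action
    counts.saveAsTextFile("data/word-counts")                     // saving data

    sc.stop()
  }
}
```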
Persistence
- Configuring and running the Spark cluster
- Exploring a multi-node Spark cluster
- Cluster management
- Submitting Spark jobs and running in cluster mode
- Developing Spark applications in Eclipse
- Tuning and Debugging Spark
CASSANDRA (NoSQL DATABASE)
- Learning Cassandra
- Getting started with architecture
- Installing Cassandra
- Communicating with Cassandra
- Creating a database
- Creating a table
- Inserting Data
- Modelling Data
- Creating a web application
- Updating and Deleting Data
Spark Integration with NoSQL (CASSANDRA) and Amazon EC2
- Introduction to Spark and Cassandra Connectors
- Setting up Spark with Cassandra
- Creating a SparkContext to connect to Cassandra
- Creating a Spark RDD on a Cassandra table
- Performing transformations and actions on the Cassandra RDD
- Running a Spark application in Eclipse to access data in Cassandra
- Introduction to Amazon Web Services
- Building a 4-node Spark cluster on Amazon Web Services
- Deploying in Production with Mesos and YARN
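A sketch of the Spark-to-Cassandra integration covered above, assuming the DataStax spark-cassandra-connector is on the classpath; the connection host, keyspace ("training"), table names, and column names are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // assumes the spark-cassandra-connector dependency

// Sketch: reading a Cassandra table into an RDD, transforming it, and writing back.
object SparkCassandraDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkCassandraDemo")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
    val sc = new SparkContext(conf)

    // Create an RDD over a Cassandra table and apply transformations/actions
    val users  = sc.cassandraTable("training", "users")
    val byCity = users
      .map(row => (row.getString("city"), 1))
      .reduceByKey(_ + _)

    byCity.collect().foreach(println)

    // Write the results back to another Cassandra table
    byCity.saveToCassandra("training", "user_counts", SomeColumns("city", "count"))

    sc.stop()
  }
}
```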
Spark Streaming
- Introduction to Spark Streaming
- Architecture of Spark Streaming
- Processing Distributed Log Files in Real Time
- Discretized Streams (DStreams) and their underlying RDDs
- Applying Transformations and Actions on Streaming Data
- Integration with Flume and Kafka
- Integration with Cassandra
- Monitoring streaming jobs
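For a feel of the streaming API, here is a minimal sketch: a StreamingContext with a 5-second batch interval, a DStream, and transformations/actions applied per batch. A socket source is used purely for illustration; in the course the source would typically be Flume or Kafka, and the host/port are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: StreamingContext, a DStream from a socket source, and a streaming word count.
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val lines  = ssc.socketTextStream("localhost", 9999)   // e.g. fed by `nc -lk 9999`
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    counts.print()        // output action on each batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```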
Spark SQL
- Introduction to Apache Spark SQL
- The SQL context
- Importing and saving data
- Processing text, JSON, and Parquet files
- DataFrames
- User-defined functions
- Using Hive
- Local Hive Metastore server
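The Spark SQL topics above can be sketched as follows: creating a SQL context, importing JSON into a DataFrame, registering a user-defined function, querying with SQL, and saving as Parquet. The file paths, column names, and UDF name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: SQL context, importing JSON, a DataFrame, a UDF, a SQL query, and saving data.
object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSqlDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val people = sqlContext.read.json("data/people.json")   // importing data
    people.printSchema()

    sqlContext.udf.register("shout", (s: String) => s.toUpperCase)  // user-defined function

    people.registerTempTable("people")
    sqlContext.sql("SELECT shout(name) AS name, age FROM people WHERE age > 21").show()

    people.write.parquet("data/people.parquet")              // saving data as Parquet

    sc.stop()
  }
}
```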
Spark MLlib
- Introduction to Machine Learning
- Types of Machine Learning
- Introduction to Apache Spark MLlib algorithms
- Machine Learning data types and working with MLlib
- Regression and Classification Algorithms
- Decision Trees in depth
- Classification with SVM, Naive Bayes
- Clustering with K-Means
- Building the Spark server
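As an illustration of the clustering topic above, here is a sketch of K-Means with the RDD-based MLlib API; the input file (space-separated numeric features per line), the number of clusters, and the iteration count are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Sketch: parsing features into MLlib vectors and clustering them with K-Means.
object KMeansDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansDemo").setMaster("local[*]"))

    // Parse each line into an MLlib vector (the MLlib data type for features)
    val points = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
      .cache()

    val model = KMeans.train(points, 3, 20)   // k = 3 clusters, 20 iterations

    model.clusterCenters.foreach(println)
    println(s"Within-cluster sum of squared errors: ${model.computeCost(points)}")

    sc.stop()
  }
}
```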