Description

Apache Spark with Java 8 Training: Why Spark?

Spark began as a research project at UC Berkeley's AMPLab and is now maintained by the Apache Software Foundation. It was created to speed up Hadoop-style cluster computing.

The main feature of Spark is its in-memory cluster computing, which greatly increases the processing speed of an application.

Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries and streaming applications. Handling all of these in one engine reduces the management burden of maintaining separate tools.

Apache Spark also has the following features:

Speed: Spark can run an application on a Hadoop cluster up to 100 times faster in memory, and up to 10 times faster on disk, than Hadoop MapReduce. It does this by reducing the number of read/write operations to disk and by keeping intermediate processing data in memory.

Supports multiple languages: Spark provides more than 80 high-level operators for interactive querying and offers built-in APIs for application development in Java, Scala and Python.

Advanced analytics: Spark supports not only 'map' and 'reduce' programming but also SQL queries, streaming data, machine learning (ML) and graph algorithms.

Apache Spark with Java 8 Training: Why Java 8?

With the introduction of lambda expressions, Java 8 added concise support for functional programming. It also introduced the Stream API, which can be thought of as a collection-like framework for functional-style processing that does not store the elements itself. With lambda expressions, code can be written more concisely and elegantly. The learning curve is also gentler, because a Java developer only has to learn the Apache Spark API, not Scala as well.
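To make the point about conciseness concrete, here is a minimal sketch that uses only the JDK (no Spark dependency). The class name `LambdaSketch`, its method names and the sample words are hypothetical and chosen purely for illustration. It sorts a list once with an anonymous inner class and once with a lambda, then runs a small Stream pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of any Spark API.
class LambdaSketch {

    // Pre-Java 8 style: the comparison logic is wrapped in an anonymous inner class.
    static List<String> sortByLengthAnonymous(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return copy;
    }

    // Java 8 style: the same comparison expressed as a lambda.
    static List<String> sortByLengthLambda(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort((a, b) -> Integer.compare(a.length(), b.length()));
        return copy;
    }

    // Stream pipeline: keep words longer than four characters and upper-case them.
    // The source list is not modified; the stream itself stores no elements.
    static List<String> longWordsUpper(List<String> words) {
        return words.stream()
                .filter(w -> w.length() > 4)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd", "java", "lambda");
        System.out.println(sortByLengthAnonymous(words));
        System.out.println(sortByLengthLambda(words));
        System.out.println(longWordsUpper(words));
    }
}
```

The lambda version carries the same logic as the anonymous class in a single expression, which is the style the Spark Java API expects when functions are passed to transformations.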

Apache Spark with Java – Overview of Java 8

Overview of Interface, Static method and Default method in interface

Anonymous Inner Classes

Introduction to Lambda Expressions

Functional Interface, type inference

Method references

Composing Lambda

Understanding Closure

Overview of Streams

Working with Streams

Infinite Streams
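Three of the Java 8 topics listed above (composing lambdas, closures and infinite streams) can be sketched with the JDK alone. The class `StreamTopicsSketch` and its method names are hypothetical and serve only as an illustration of the ideas, not as course material.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; names are assumptions for this sketch.
class StreamTopicsSketch {

    // Composing lambdas: andThen applies addOne first, then twice.
    static int addOneThenDouble(int x) {
        Function<Integer, Integer> addOne = n -> n + 1;
        Function<Integer, Integer> twice = n -> n * 2;
        return addOne.andThen(twice).apply(x);
    }

    // Closure: the returned lambda captures the effectively final parameter 'factor'.
    static Function<Integer, Integer> multiplier(int factor) {
        return n -> n * factor;
    }

    // Infinite stream: Stream.iterate is lazy, so limit() is what makes the result finite.
    static List<Integer> firstPowersOfTwo(int count) {
        return Stream.iterate(1, n -> n * 2)
                .limit(count)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(addOneThenDouble(3));
        System.out.println(multiplier(3).apply(5));
        System.out.println(firstPowersOfTwo(5));
    }
}
```

The lazy evaluation seen in `Stream.iterate` is a useful mental model for the lazy RDD transformations covered later in the course, although the two mechanisms are implemented differently.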

Apache Spark with Java – Introduction to Spark

Introduction to Big Data

Big Data Problem

Scale-Up Vs Scale-Out Architecture

Characteristics of Scale-Out

Introduction to Hadoop, Map-Reduce and HDFS

Introducing Spark

Hortonworks Data Platform (HDP) using VirtualBox

Importing the HDP VM image using VirtualBox on a local machine

Configuring HDP

Overview of Ambari and its components

Overview of services configuration using Ambari

Overview of Apache Zeppelin

Creating, importing and executing notebooks in Apache Zeppelin

IDEs for Spark Applications

IntelliJ IDEA

Eclipse

Resolving dependencies for Spark applications

Spark Basics

Spark Shell

Overview of Spark architecture

Storage layers for Spark

Initializing a SparkContext and building applications

Submitting a Spark Application

Use of Spark History Server

Spark Components

Spark Driver Process

Spark Executor

SparkConf and SparkContext

SparkSession object

Overview of spark-submit command

Spark UI

RDDs

Overview of RDD

RDD and Partitions

Ways of Creating RDD

RDD transformations and Actions

Lazy evaluation

RDD Lineage Graph (DAG)

Element-wise transformations

Map vs. FlatMap Transformation

Set Transformation

RDD Actions

Overview of RDD persistence

Methods for persisting RDD

Persisting RDD with Storage option

Illustration of Caching on an RDD in DAG

Removal of Cached RDD

Pair RDDs

Overview of Key-Value Pair RDD

Ways of creating Pair RDDs

Transformations on Pair RDD

reduceByKey(), foldByKey(), mapValues(), flatMapValues(), keys() and values() transformations

Grouping, Joining, Sorting on Pair RDD

reduceByKey() vs. groupByKey()

Pair RDD Action

Launching Spark on cluster

Configure and launch Spark Cluster on Google Cloud

Configure and launch Spark Cluster on Microsoft Azure

Logging and Debugging a Spark Application

Setting up a Windows environment for executing a Spark application from an IDE

Steps for using the SLF4J logging mechanism in a Spark application

Attaching a debugger to Spark Application

Example of debugging a Spark application running inside a cluster

Spark Application Architecture

Spark Application Distributed Architecture

Spark Application submission Mode

Overview of Cluster Manager

Example of using Standalone Cluster Manager

Driver and its responsibilities

Overview of Job, Stage and Tasks

Spark Job Hierarchy

Executor

spark-submit command and various submission options

YARN Cluster Manager

YARN Architecture

Client and Cluster Deploy-mode

Advanced concepts in Spark

Accumulator

Broadcast

RDD partitioning

Re-partition RDD

Determining RDD partitioner

Partition-based RDD operations such as mapPartitions, mapPartitionsWithIndex and mapPartitionsToPair

Spark SQL

Introduction to Spark SQL

Creating SparkSession with Hive Support

DataFrame

Ways of Creating DataFrame

Registering a DataFrame as View

DataFrame Transformations API

DataFrame SQL statement

Aggregate Operations

DataFrame Action

Catalyst Optimizer

Limitations of DataFrame

Introduction to Dataset

Introduction to Encoder

Creating Dataset

Functional transformation on Dataset

Loading CSV, JSON and Parquet files in Spark SQL

Loading and saving data from/in Hive, JDBC, HDFS, Cassandra

Introduction to User-Defined-Function (UDF)

Customizing a UDF

Usage of UDF in DataFrame Transformations API

Usage of UDF in Spark SQL statement

Introduction to Window Function

Steps of defining a window function

Illustration of Window function usage

Introduction to UDAF

Customizing a UDAF

Illustration of customized UDAF usage

Basic Spark Streaming

Introduction to data streaming

Spark Streaming framework

Spark Streaming and Micro batch

Introduction to DStreams

DStreams and RDD

Word Count example using Socket Text Stream

Streaming with Twitter feeds

Setting up a Twitter App

Resolving Twitter dependency in Spark Streaming Application

Steps of creating Uber Jar

Example of extracting hashtags from tweet data

Troubleshooting Twitter Streaming issue in Spark Application

Steps of creating Spark Streaming Application

Architecture of Spark Streaming

Stateless Transformations

Twitter Streaming examples using stateless transformation

Introduction to stateful Transformations

Window Transformations

Window Duration and Slide Duration

Window Operations

Naive and inverse window reduce operation

Checkpoint

Tracking State of an event using updateStateByKey operation

Interacting directly with RDDs using the transform() operation

Example of HDFS file streaming

Example of Spark-Kafka interaction

Saving DStreams to external file system

Spark Structured Streaming

Introduction to Structured Streaming

Structured Streaming-Programming Model

Introduction to Event Time

Various Input Sources

Window Operations on Event Time

Late Data and Watermarking

Output Modes

Output Sinks

Streaming Deduplication

Prerequisites of Apache Spark with Java 8:

An understanding of OOP concepts and Java programming constructs is required. Programming experience in Java 7 is mandatory. Familiarity with lambda expressions in Java 8 is an added advantage.