Apache Spark with Python (PySpark)

$400.00 $300.00


PySpark (Apache Spark with Python) – Importance of Python

Python is a general-purpose, dynamic programming language. Its many handy, high-performance packages for numerical and statistical computing make it popular among data scientists and data engineers. By some measures, Python is the world’s fastest-growing programming language, and it has claimed fourth place in the TIOBE index for the first time. However, neither Python nor any other language can process big data efficiently on its own; that requires a distributed framework such as Apache Spark.

Why PySpark?

Apache Spark is an Apache Software Foundation project designed to speed up Hadoop-style computation. Its main feature is in-memory cluster computing, which greatly increases the processing speed of an application.
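The benefit of keeping data in memory can be sketched in plain Python, with no Spark involved. The function `expensive_parse` and the sample data are hypothetical stand-ins for a costly processing step; the sketch only counts how often that step runs when an intermediate result is recomputed versus kept in memory and reused.

```python
# Plain-Python illustration only; this is not Spark code.
calls = {"expensive_parse": 0}

def expensive_parse(line):
    """Hypothetical costly step; counts how many times it is invoked."""
    calls["expensive_parse"] += 1
    return int(line)

raw = ["1", "2", "3", "4"]

# Without keeping the intermediate result, each downstream job re-runs the parse.
total = sum(expensive_parse(x) for x in raw)
largest = max(expensive_parse(x) for x in raw)
recompute_calls = calls["expensive_parse"]

# Keeping the parsed data in memory: parse once, then reuse it for both jobs.
calls["expensive_parse"] = 0
parsed = [expensive_parse(x) for x in raw]
total_cached = sum(parsed)
largest_cached = max(parsed)
cached_calls = calls["expensive_parse"]

print(recompute_calls, cached_calls)
```

Both approaches give the same answers; the second simply does half the parsing work because the intermediate list is reused.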

Apache Spark also has the following features:

Speed – Spark can run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk, by reducing the number of disk read/write operations and keeping intermediate processing data in memory.

Supports multiple languages – Spark provides built-in APIs in Java, Scala, and Python for application development, and comes with 80 high-level operators for interactive querying.

Advanced analytics – Spark supports not only ‘map’ and ‘reduce’ programming but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
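For readers new to the ‘map’ and ‘reduce’ style mentioned above, here is a minimal word-count sketch in plain Python. It runs on a single machine and uses only the standard library, so it shows the programming pattern rather than Spark's API; the sample lines and the helper `add_pair` are made up for illustration.

```python
from functools import reduce

lines = ["spark with python", "python with spark", "spark"]

# "Map" step: turn every word into a (word, 1) pair.
pairs = [(word, 1) for line in lines for word in line.split()]

# "Reduce" step: fold the pairs into per-word totals.
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

A distributed framework applies the same two steps, but spreads the pairs across many machines and combines the partial totals.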

Apache Spark is written in the Scala programming language. To support Python, the Apache Spark community released PySpark, which lets you work with RDDs (Resilient Distributed Datasets) from Python. This is made possible by a library called Py4J.

Apache Spark with Python – Introduction to Python

Variables in Python

String Handling

Operators and Operands

Conditional Statements

Typecast from one data type to another

Sequence or Collections

Lambda Functions

Advanced Concepts in Python

OOP concepts

Exception Handling

NumPy Arrays

Indexing and Slicing in NumPy Arrays

NumPy Array Functions

Pandas DataFrame

Basic pandas functionality such as reindexing, altering labels, head/tail, and sorting

Introduction to Spark

Introduction to Big Data

Big Data Problem

Scale-Up Vs Scale-Out Architecture

Characteristics of Scale-Out

Introduction to Hadoop, MapReduce and HDFS

Introducing Spark

Hortonworks Data Platform (HDP) using VirtualBox

Importing the HDP VM image using VirtualBox on a local machine

Configuring HDP

Overview of Ambari and its components

Overview of services configuration using Ambari

Overview of Apache Zeppelin

Creating, importing and executing notebooks in Apache Zeppelin

Spark Basics

Spark Shell

PySpark Shell

Overview of Spark architecture

Storage layers for Spark

Initializing a SparkContext and building applications

Submitting a Spark Application

Spark Components

Spark Driver Process

Spark Executor

SparkConf and SparkContext

SparkSession object

Overview of spark-submit command

Spark UI


Overview of RDD

RDD and Partitions

Ways of Creating RDD

RDD transformations and Actions

Lazy evaluation

RDD Lineage Graph (DAG)

Element-wise transformations

map vs. flatMap transformations

Set Transformation

RDD Actions

Overview of RDD persistence

Methods for persisting RDD

Persisting RDD with Storage option

Illustration of Caching on an RDD in DAG

Removal of Cached RDD

Pair RDDs

Overview of Key-Value Pair RDD

Ways of creating Pair RDDs

Transformations on Pair RDD

reduceByKey(), foldByKey(), mapValues(), flatMapValues(), keys() and values() transformations

Grouping, Joining, Sorting on Pair RDD

reduceByKey() vs. groupByKey()

Pair RDD Action

Launching Spark on cluster

Configure and launch Spark Cluster on Google Cloud

Configure and launch Spark Cluster on Microsoft Azure

Spark Application Architecture

Spark Application Distributed Architecture

Spark Application submission Mode

Overview of Cluster Manager

Example of using Standalone Cluster Manager

Driver and its responsibilities

Overview of Job, Stage and Tasks

Spark Job Hierarchy


spark-submit command and various submission options

YARN Cluster Manager

YARN Architecture

Client and cluster deploy modes

Advanced Concepts in Spark



RDD partitioning

Re-partition RDD

Determining RDD partitioner

Partition-based RDD operations such as mapPartitions, mapPartitionsWithIndex, and mapPartitionsToPair

PySpark SQL

Introduction to PySpark SQL


Catalyst Optimizer

Speeding up PySpark with DataFrames

Interoperating with RDDs

Inferring the Schema using Reflection

Programmatically Specifying the Schema

Ways of Creating DataFrame

Registering a DataFrame as View

DataFrame Transformations API

Using SQL to interact with DataFrame

Aggregate Operations

DataFrame Action

Handling missing observations

Loading CSV, JSON, and Parquet files in Spark SQL

Loading and saving data from/in Hive, JDBC, HDFS

Partition Discovery

Introduction to User-Defined Functions (UDFs)

Registering a UDF

Usage of UDF in DataFrame Transformations API

Usage of UDF in Spark SQL statement

Introduction to Window Functions

Steps of defining a window function

Illustration of Window function usage

Basic Spark Streaming (DStream)

Introduction to data streaming

Spark Streaming framework

Spark Streaming and Micro batch

Introduction of DStreams

DStreams and RDD

Word Count example using Socket Text Stream

Input DStreams and Receiver

Streaming with Twitter feeds

Setting up a Twitter App

Example of extracting hashtags from tweet data

Troubleshooting Twitter streaming issues in a Spark application

Steps of creating Spark Streaming Application

Architecture of Spark Streaming

Stateless Transformations

Twitter Streaming examples using stateless transformation

Introduction to stateful Transformations

Window Transformations

Window Duration and Slide Duration

Window Operations

Naive and inverse window reduce operation


Tracking State of an event using updateStateByKey and mapWithState operation

Interacting directly with RDDs using the transform() operation

Example of HDFS file streaming

Saving DStreams to external file system

Spark Structured Streaming

Introduction to Structured Streaming

Structured Streaming Programming Model

Data Stream as an unbounded table

Operations on Streaming Dataset

Introduction to Event Time

Window Operations on Event Time

Late Data and Watermarking

Input Sources

Output Modes

Output Sinks

Streaming Deduplication


Prerequisites:

As this course is based on Python, prior programming experience in Python or an understanding of Python program structure is an advantage.

Duration & Timings:

Duration – 30 Hours.

Training Type: Instructor Led Live Interactive Sessions.

Faculty: Experienced.

For upcoming schedules, please contact us:

USA: +1 734 418 2465 | India: +91 40 4018 1306


© 2019 LEARNTEK. ALL RIGHTS RESERVED | Privacy Policy | Terms & Conditions
