What is Hadoop?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes of storage capacity. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
- Large Volumes of Data: Hadoop can quickly store and process huge amounts of data of any variety (structured, unstructured and semi-structured). With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
- Computing Power: Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
- Fault Tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility: Unlike a traditional relational database, you don’t have to process data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data such as text, images and videos.
- Low Cost: The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability: You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
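To make the fault tolerance point concrete, here is a minimal Python sketch of the idea behind storing multiple copies of each data block. It is a toy model and not Hadoop code: the function names `place_blocks` and `readable_blocks`, the node and block names, and the simple round-robin placement are all assumptions made for illustration. Real HDFS chooses replica locations with a rack-aware policy, although its default replication factor is also 3.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical helper: real HDFS placement is rack-aware, not round-robin.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds node count")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive positions modulo len(nodes) are distinct nodes.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3"]

placement = place_blocks(blocks, nodes, replication=3)

# Two nodes fail; every block keeps at least one live replica.
survivors = readable_blocks(placement, failed_nodes=["node2", "node3"])
print(sorted(survivors))  # ['blk_0', 'blk_1', 'blk_2', 'blk_3']
```

With three replicas on distinct nodes, any two simultaneous node failures still leave every block readable, which is why losing a node does not stop the job in this model.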
The following topics will be covered in our Big Data and Hadoop Online Training:
Big Data and Hadoop Training Topics
Introduction to Data and System
Types of Data
Traditional way of dealing with large data and its problems
Types of Systems & Scaling
What is Big Data
Challenges in Big Data
Challenges in Traditional Application
What is Hadoop? Why Hadoop?
Brief history of Hadoop
Features of Hadoop
Hadoop and RDBMS
Hadoop Ecosystem overview
Installation in detail
Creating Ubuntu image in VMware
Downloading Hadoop
Configuring Hadoop, HDFS & MapReduce
Download, Installation & Configuration of Hive
Download, Installation & Configuration of Pig
Download, Installation & Configuration of Sqoop
Configuring Hadoop in Different Modes
Hadoop Distributed File System (HDFS)
File System – Concepts
Purpose of Name Node
Purpose of Data Node
Purpose of Secondary Name Node
Purpose of Job Tracker
Purpose of Task Tracker
HDFS Shell Commands – copy, delete, create directories etc.
Reading and Writing in HDFS
Differences between Unix commands and HDFS commands
Hadoop Admin Commands
Hands on exercise with Unix and HDFS commands
Read / Write in HDFS – Internal Process between Client, NameNode & DataNodes.
Accessing HDFS using Java API
Various Ways of Accessing HDFS
Understanding HDFS Java classes and methods
Admin:
- Commissioning / Decommissioning DataNode
- Replication Policy
- Network Distance / Topology Script
Map Reduce Programming
Understanding block and input splits
MapReduce Data types
Data Flow in MapReduce Application
Understanding MapReduce problem on datasets
MapReduce and Functional Programming
Writing MapReduce Application
Understanding Mapper function
Understanding Reducer Function
Usage of Combiner
Usage of Distributed Cache
Passing the parameters to mapper and reducer
Analysing the Results
Input Formats and Output Formats
Counters, Skipping Bad and unwanted Records
Writing Joins in MapReduce with 2 input files. Join types.
Execute MapReduce Job – Insights.
Exercises on MapReduce.
Job Scheduling: Types of Schedulers.
Schema on Read VS Schema on Write
Install and configure Hive on a cluster
Meta Store – Purpose & Type of Configurations
Different types of tables in Hive
Joins in Hive
Hive Query Language
Hive Data Types
Data Loading into Hive Tables
Hive Query Execution
Hive library functions
Install and configure Pig on a cluster
Pig library functions
Pig Vs Hive
Write sample Pig Latin scripts
Modes of running Pig
Running in Grunt shell
Running as Java program
Region server architecture
File storage architecture
HBase use cases
Install and configure HBase on a multi node cluster
Create database, Develop and run sample applications
Access data stored in HBase using Java API
Install and configure Sqoop on a cluster
Connecting to RDBMS
Import data from MySQL to Hive
Export data to MySQL
Internal mechanism of import/export
Introduction to Oozie
XML file specifications
Specifying Workflow
Oozie job coordinator
Introduction to Flume
Configuration and Setup
Flume Sink with example
Flume Source with example
Complex Flume architecture
Introduction to ZooKeeper
Challenges in distributed Applications
ZooKeeper: Design Goals
Data Model and Hierarchical namespace
Hadoop 1.0 Limitations
History of Hadoop 2.0
HDFS 2: Architecture
HDFS 2: Quorum based storage
HDFS 2: High availability
HDFS 2: Federation
Classic vs YARN
YARN Capacity Scheduler
Prerequisites: Knowledge of any programming language, databases and the Linux operating system. Core Java or Python knowledge is helpful.
Duration & Timings:
Duration – 30 Hours.
Course Fee: $300 (Discount Offer)
Training Type: Online Live Interactive Session.
Weekday Session – Mon – Thu 8:30 PM to 10:30 PM (EST) – 4 Weeks. March 6, 2017.
Weekend Session – Sat & Sun 6:00 PM – 9:00 PM (EST) – 5 Weeks. February 25, 2017.
Weekend Session – Sat & Sun 9:30 AM to 12:30 PM (EST) – 5 Weeks. March 4, 2017.
Weekend Session – Sat & Sun 10:00 AM to 1:00 PM (IST – India Time) – 5 Weeks. March 4, 2017.