Course Title: BigData Engineer
Course Duration: 40-45 Hours
Certification: Hadoop Developer
Course Outline:
Module 1:
Introduction to Big Data: Defining Data, Types of Data, Structured Data, Semi-Structured Data, Unstructured Data, How Data Is Generated, Sources of Data Generation, Rate at Which Data Is Generated, The Five V’s (Volume, Variety, Velocity, Veracity, Value), How a Single Person Contributes to Big Data, Significance of Big Data, Reasons for Big Data, Understanding RDBMS and Why It Fails to Store Big Data, Future of Big Data, Big Data Use Cases for the E-Commerce Industry and the Banking Sector.
Module 2:
Hadoop, Apache Community, Cluster, Node, Commodity Hardware, Rack Awareness, History of Hadoop, Need for Hadoop, Apache Hadoop Ecosystem, Hadoop 1.x Architecture, Apache Hadoop Framework, Master-Slave Architecture, Hadoop Distributed File System, Design of HDFS, HDFS Concepts, How Files Are Stored in HDFS, Hadoop File System, Replication Factor, Name Node, Secondary Name Node, Job Tracker, Task Tracker, Data Node, FS Image, Edit Logs, Checkpointing Concept, HDFS Federation, HDFS High Availability, Architectural Description of a Hadoop Cluster, When to Use and When Not to Use HDFS, Block Allocation in a Hadoop Cluster, Read Operation in HDFS, Write Operation in HDFS, Hadoop Archives, Data Integrity in HDFS, Compression & Input Splits, Advantages of Hadoop, Unix Shell Commands and HDFS Commands.
Module 3:
MapReduce, History, Internal Architecture, Input / Output Format Types, Text Input Format, Key Value Input Format, Sequence File Input Format, Input Split, Record Reader, Mapper Phase, Reducer Phase, Sort and Shuffle Phase, Data Flow, Counters, Combiner Function, Partition Function, Joins, Map Side Join, Reduce Side Join, MapReduce Web UI, Job Scheduling, Task Scheduling, Fault Tolerance, Writing a MapReduce Application, Driver Class, Mapper Class, Reducer Class, Serialization, File-Based Data Structures, Writing a Simple MapReduce Program to Count the Number of Words, MapReduce Workflows, Importance of MapReduce.
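As a rough illustration of the word count exercise listed in this module, the following minimal sketch mimics the mapper, shuffle-and-sort, and reducer phases in plain Python, running in memory on a single machine. It does not use the Hadoop API; the function names (mapper, shuffle, reducer, word_count) and the sample input are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort phase: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample input standing in for the lines of an input split.
    sample = ["big data big cluster", "data node"]
    print(word_count(sample))  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real Hadoop job, the mapper and reducer would run as separate tasks across the cluster, and the framework itself would perform the shuffle and sort between them.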
Module 4:
YARN: YARN Architecture, YARN Components, Resource Manager, Node Manager, Application Master, Difference between Hadoop 1.x and 2.x Architecture, Cluster Specification, Different Modes of Hadoop, Standalone Mode, Pseudo-Distributed Mode, Fully Distributed Mode.
Module 5:
Apache Pig, Pig on Hadoop, Pig Latin, Pig Philosophy, Pig’s History, Local Mode and MapReduce Mode, Pig’s Data Model, Scalar, Complex, Load, Dump, Store, Foreach, Filter, Join, Group, Order By, Distinct, Limit, Sample, Parallel, User-Defined Functions, Advanced Relational Operations, Using Different Join Implementations, Cogroup, Union, Cross, Nonlinear Data Flows, Controlling Execution, Parameter Substitution, Program for Word Count Job, Comparison of Apache Pig and MapReduce.
Module 6:
Apache Hive, Features of Apache Hive, Command Line Interface, History of Apache Hive, Hive Data Types & File Formats, Creating Managed Tables, External Tables, Partitioned Tables, Dropping Tables, Alter Table, Loading Data into Managed Tables, Inserting Data into Tables from Queries, Dynamic Partition Inserts, Exporting Data, SELECT ... FROM Clauses, WHERE Clauses, GROUP BY Clauses, JOIN Statements, ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY, Bucketing, UNION ALL, Views, Hive Metastore.
Module 7:
Apache Sqoop, Apache Sqoop Architecture, Apache Sqoop Features, Apache Sqoop Import, Apache Sqoop Export, Sqoop Job, Sqoop List Tables, Sqoop List Databases, Sqoop Codegen, Sqoop Hive Import, Sqoop Validation.
Module 8:
Apache Flume, Flume Architecture, Flume Data Flow, Flume Configuration, Fetching Twitter Data with Flume, Flume Fan-In and Fan-Out Architecture, Apache Flume Features.
Module 9:
Apache Impala, Impala Features, Impala Architecture, Impala Shell, Impala Query Language, Impala Create Database, Impala Create Table, Select Statement, Alter Table, Order By, Group By, Limit, Distinct Operator.