Apache Spark and Scala Training

4.0 (5,896 ratings) · 6,701 learners

24 hours of instructor-led live online training

Master the concepts of the Apache Spark framework and its development

In-depth exercises and real-time projects on Apache Spark

Learn about Apache Spark Core, Spark internals, RDDs, Spark SQL, and more

Gain comprehensive knowledge of the Scala programming language

Duration: 25+ Hours
Institution: Open Source
Language: English
Video Script: English

Overview

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clusters of computers. It can handle both batch and real-time analytics and data processing workloads.

You will understand the basics of Big Data and Hadoop, and learn how Spark enables in-memory data processing to run much faster than Hadoop MapReduce. You will also learn about RDDs, Spark SQL for structured data processing, and the different APIs offered by Spark, such as Spark Streaming and Spark MLlib. This course is an integral part of a Big Data Developer's career path. It also covers fundamental concepts such as data capture using Flume, data loading using Sqoop, and messaging systems such as Kafka.

What you will learn

  • Big Data Introduction
  • Introduction to Scala
  • Spark Introduction
  • Spark Framework & Methodologies
  • Spark Data Structure
  • Spark Ecosystem

Syllabus


Course Content

Introduction to Big Data Hadoop and Spark (17 Lectures)

  • What is Big Data?
  • Big Data Customer Scenarios
  • How Hadoop Solves the Big Data Problem
  • What is Hadoop?
  • Hadoop's Key Characteristics
  • Hadoop Ecosystem and HDFS
  • Hadoop Core Components
  • Rack Awareness and Block Replication
  • YARN and Its Advantages
  • Hadoop Cluster and Its Architecture
  • Hadoop: Different Cluster Modes
  • Hadoop Terminal Commands
  • Big Data Analytics with Batch & Real-Time Processing
  • Why Spark Is Needed
  • What is Spark?
  • How Spark Differs from Other Frameworks
  • Spark at Yahoo!

Introduction to Scala for Apache Spark (9 Lectures)

  • Why Scala for Spark?
  • Scala in Other Frameworks
  • Introduction to the Scala REPL
  • Basic Scala Operations
  • Variable Types in Scala
  • Control Structures in Scala
  • Foreach Loops, Functions, and Procedures
  • Collections in Scala: Array
  • ArrayBuffer, Map, Tuples, Lists, and More
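
To give a feel for these topics, here is a minimal sketch you can paste into the Scala REPL; the names and values are illustrative and not taken from the course material.

    // Variables: val is immutable, var is mutable
    val greeting: String = "Hello, Scala"
    var counter: Int = 1

    // Control structures: if/else is an expression that returns a value
    val parity = if (counter % 2 == 0) "even" else "odd"

    // A function and a procedure (a function returning Unit)
    def square(n: Int): Int = n * n
    def report(msg: String): Unit = println(msg)

    // Collections: Array, ArrayBuffer, Map, Tuple
    import scala.collection.mutable.ArrayBuffer
    val nums = Array(1, 2, 3)
    val buf  = ArrayBuffer(1, 2); buf += 3
    val caps = Map("India" -> "New Delhi", "France" -> "Paris")
    val pair = ("Spark", 2009)

    // foreach loop over a collection
    nums.foreach(n => report(s"$greeting: square($n) = ${square(n)}, counter is $parity"))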

Functional Programming and OOP Concepts in Scala (12 Lectures)

  • Functional Programming
  • Higher-Order Functions
  • Anonymous Functions
  • Classes in Scala
  • Getters and Setters
  • Custom Getters and Setters
  • Properties with Only Getters
  • Auxiliary and Primary Constructors
  • Singletons
  • Extending a Class
  • Overriding Methods
  • Traits as Interfaces and Layered Traits
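
As a rough preview of these ideas, the hypothetical sketch below combines a higher-order function, an anonymous function, a class with custom getters and setters plus an auxiliary constructor, a singleton, inheritance with method overriding, and a trait used as an interface.

    // Higher-order function: takes another function as an argument
    def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
    val inc = (n: Int) => n + 1            // anonymous function
    applyTwice(inc, 5)                     // evaluates to 7

    // Class with a primary constructor, custom getter/setter, and auxiliary constructor
    class Person(val name: String) {
      private var _age: Int = 0
      def age: Int = _age                  // getter
      def age_=(a: Int): Unit = {          // custom setter with validation
        require(a >= 0, "age must be non-negative")
        _age = a
      }
      def this(name: String, age: Int) = { // auxiliary constructor
        this(name)                         // must first call the primary constructor
        this.age = age
      }
    }

    // Trait as an interface, singleton object, extending a class, overriding a method
    trait Greeter { def greet(p: Person): String = s"Hello, ${p.name}" }
    object Registry extends Greeter        // singleton: exactly one instance
    class Employee(name: String, val id: Int) extends Person(name) {
      override def toString: String = s"Employee($name, $id)"
    }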

Deep Dive into the Apache Spark Framework (8 Lectures)

  • Spark's Place in the Hadoop Ecosystem
  • Spark Components & Architecture
  • Spark Deployment Modes
  • Introduction to the Spark Shell
  • Writing Your First Spark Job Using SBT
  • Submitting a Spark Job
  • Spark Web UI
  • Data Ingestion Using Sqoop
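
As a preview of the SBT workflow, here is a hypothetical skeleton for a first Spark job; the project name, version numbers, and jar path are assumptions rather than course requirements.

    // build.sbt (assumed versions):
    //   name := "first-spark-job"
    //   scalaVersion := "2.12.18"
    //   libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0"

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FirstJob")
        val sc   = new SparkContext(conf)
        val data = sc.parallelize(1 to 100)   // distribute a local collection
        println(s"sum = ${data.sum()}")       // an action triggers the actual job
        sc.stop()
      }
    }

After `sbt package`, such a job would typically be submitted with something like `spark-submit --class FirstJob --master local[*] <path-to-jar>`, after which its stages can be inspected in the Spark Web UI.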

Playing with Spark RDDs (11 Lectures)

  • Challenges in Existing Computing Methods
  • Probable Solution & How RDDs Solve the Problem
  • What is an RDD? Its Operations, Transformations & Actions
  • Data Loading and Saving Through RDDs
  • Key-Value Pair RDDs
  • Other Pair RDDs and Two-Pair RDDs
  • RDD Lineage
  • RDD Persistence
  • Word Count Program Using RDD Concepts
  • RDD Partitioning & How It Helps Achieve Parallelization
  • Passing Functions to Spark
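
The classic word count program ties most of these ideas together: lazy transformations building an RDD lineage, a key-value pair RDD, persistence, and actions. A minimal sketch, assuming hypothetical HDFS paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        // Transformations are lazy: they only extend the RDD lineage
        val counts = sc.textFile("hdfs:///input/sample.txt")   // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))                              // key-value pair RDD
          .reduceByKey(_ + _)

        counts.cache()                                    // persist for reuse without recomputation
        println(s"distinct words: ${counts.count()}")     // the first action triggers execution
        counts.saveAsTextFile("hdfs:///output/wordcount") // hypothetical output path
        sc.stop()
      }
    }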

DataFrames and Spark SQL (10 Lectures)

  • Need for Spark SQL
  • What is Spark SQL?
  • Spark SQL Architecture
  • SQLContext in Spark SQL
  • User-Defined Functions
  • DataFrames & Datasets
  • Interoperating with RDDs
  • JSON and Parquet File Formats
  • Loading Data Through Different Sources
  • Spark-Hive Integration
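
A brief sketch of the DataFrame and Spark SQL workflow, assuming a hypothetical people.json file with name and age fields:

    import org.apache.spark.sql.SparkSession

    object SqlDemo {
      def main(args: Array[String]): Unit = {
        val spark  = SparkSession.builder().appName("SqlDemo").getOrCreate()
        val people = spark.read.json("people.json")   // hypothetical JSON source

        // Register a user-defined function and query the data through SQL
        spark.udf.register("shout", (s: String) => s.toUpperCase)
        people.createOrReplaceTempView("people")
        spark.sql("SELECT shout(name) AS name, age FROM people WHERE age > 21").show()

        // Interoperating with RDDs: a DataFrame exposes its underlying RDD of Rows
        val names = people.select("name").rdd.map(_.getString(0))
        println(names.count())

        people.write.parquet("people.parquet")        // save in Parquet format
        spark.stop()
      }
    }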

Machine Learning Using Spark MLlib (7 Lectures)

  • Why Machine Learning?
  • What is Machine Learning?
  • Where is Machine Learning Used?
  • Different Types of Machine Learning Techniques
  • Introduction to MLlib
  • Features of MLlib and MLlib Tools
  • Various ML Algorithms Supported by MLlib

Deep Dive into Spark MLlib (3 Lectures)

  • Supervised Learning: Linear Regression, Logistic Regression, Decision Trees, Random Forests
  • Unsupervised Learning: K-Means Clustering & How It Works with MLlib
  • Analysis of US Election Data Using MLlib (K-Means)
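
To make the unsupervised case concrete, here is a minimal K-Means sketch against the RDD-based MLlib API; the toy two-dimensional points stand in for real data such as the election dataset.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch"))

        // Toy points forming two obvious clusters (illustrative only)
        val points = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
          Vectors.dense(9.0, 9.1), Vectors.dense(8.9, 9.3)))

        val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
        model.clusterCenters.foreach(println)     // one center per discovered cluster
        sc.stop()
      }
    }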

Understanding Apache Kafka and Apache Flume (16 Lectures)

  • Need for Kafka
  • What is Kafka?
  • Core Concepts of Kafka
  • Kafka Architecture
  • Where is Kafka Used?
  • Understanding the Components of a Kafka Cluster
  • Configuring a Kafka Cluster
  • Kafka Producer and Consumer Java API
  • Need for Apache Flume
  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration
  • Integrating Apache Flume and Apache Kafka
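
As a preview of the producer side of the Kafka Java API (which is directly usable from Scala), here is a minimal sketch; the broker address and topic name are assumptions.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")   // assumed broker address
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        producer.send(new ProducerRecord[String, String]("events", "key-1", "hello, kafka"))
        producer.close()   // flushes pending records before exiting
      }
    }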

Apache Spark Streaming - Processing Multiple Batches (12 Lectures)

  • Drawbacks of Existing Computing Methods
  • Why Streaming is Necessary
  • What is Spark Streaming?
  • Spark Streaming Features
  • Spark Streaming Workflow
  • How Uber Uses Streaming Data
  • StreamingContext & DStreams
  • Transformations on DStreams
  • Windowed Operators and Why They Are Useful
  • Important Windowed Operators
  • Slice, Window, and ReduceByWindow Operators
  • Stateful Operators
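
A minimal DStream sketch showing a StreamingContext, a transformation, and a windowed operator; the socket source (for example, fed by `nc -lk 9999`) and the batch and window durations are assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))     // 5-second batch interval

        val lines = ssc.socketTextStream("localhost", 9999)   // assumed text source
        val words = lines.flatMap(_.split(" ")).map((_, 1))

        // Windowed operator: word counts over the last 30 seconds, sliding every 10
        val windowed = words.reduceByKeyAndWindow(
          (a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
        windowed.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }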

Apache Spark Streaming - Data Sources (5 Lectures)

  • Apache Spark Streaming: Data Sources
  • Streaming Data Source Overview
  • Apache Flume and Apache Kafka Data Sources
  • Example: Using a Kafka Direct Data Source
  • Performing Twitter Sentiment Analysis Using Spark Streaming
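
For the Kafka direct data source, a sketch using the spark-streaming-kafka-0-10 integration; the broker, consumer group, and topic are assumptions.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object KafkaDirectSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("KafkaDirectSketch"), Seconds(5))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",            // assumed broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "spark-demo",                // assumed consumer group
          "auto.offset.reset"  -> "latest")

        // The direct approach tracks offsets itself instead of using a receiver
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        stream.map(record => (record.key, record.value)).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }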

Course Details

In this era of artificial intelligence, machine learning, and data science, algorithms built on distributed, iterative computation make it easy to distribute and process huge volumes of data. Spark is a lightning-fast, in-memory cluster computing framework that can be used for a variety of purposes. This JVM-based open-source framework can process and analyze huge volumes of data while distributing that data over a cluster of machines. It is designed to perform both batch and stream processing, which is why it is known as a cluster computing platform. Spark is developed in Scala, a powerful, statically typed programming language that combines object-oriented and functional programming without compromising type safety.

Do you know the secret behind Uber's flawless map functioning? Here's a hint: the images gathered by the Map Data Collection Team are accessed by the downstream Apache Spark team and assessed by operators responsible for map edits. Apache Spark supports a number of file formats that allow multiple records to be stored in a single file.

According to a recent survey by Databricks, 71% of Spark users use Scala for programming, and Spark with Scala is a perfect combination to stay grounded in the Big Data world; 9 out of 10 companies run this combination in their organizations. Spark has over 1,000 contributors across 250+ organizations, making it one of the most active open-source projects in big data. The Apache Spark market is expected to grow at a CAGR of 67% between 2019 and 2022, driving high demand for trained professionals.

Who should go for the course?

  • Data Scientists
  • Data Engineers
  • Data Analysts
  • BI Professionals
  • Research professionals
  • Software Architects
  • Testing Professionals
  • Software Developers

Prerequisites

Although there are no prerequisites for the Apache Spark and Scala certification training, familiarity with Python, Java, or Scala programming will be beneficial. In addition, you should ideally have:

  • A basic understanding of SQL, databases, and database query languages.
  • Working knowledge of Linux or Unix-based systems (helpful, but not mandatory).
  • Big Data Hadoop development training or certification (recommended).


Training Options

Self-paced Training

299
  • Lifetime access to high-quality self-paced eLearning content curated by industry experts
  • 3 simulation test papers for self-assessment
  • Lab access to practice live during sessions
  • 24x7 learner assistance and support

Live Virtual Classes

499
  • Online Classroom Flexi-Pass
  • Lifetime access 
  • Practice labs and hands-on projects

One on One Training

Enquire Now
  • Customized learning delivery model (self-paced and/or instructor-led)
  • Flexible pricing options
  • Enterprise grade learning management system (LMS)
  • Enterprise dashboards for individuals and teams
  • 24x7 learner assistance and support

Exam & Certification

No exam required.

You will be required to complete a project, which will be assessed by our certified instructors. On successful completion of the project, you will be awarded a training certificate.


Frequently Asked Questions

How will I execute the course assignments and case studies?

You will execute all your Spark and Scala course assignments and case studies in the CloudLab environment provided by Edureka, which you access via your browser. In case of any doubt, Edureka's support team is available 24×7 for prompt assistance.

What is CloudLab?

CloudLab is a cloud-based Spark and Hadoop environment offered with the Spark training course, where you can execute all the in-class demos and work on real-life Spark case studies. It not only saves you the trouble of installing and maintaining Spark and Scala on a virtual machine, but also gives you the experience of a real Big Data and Spark production cluster. You can access CloudLab via your browser with minimal hardware configuration. If you get stuck at any step, our support team is ready to assist 24×7.

What are the system requirements for the course?

You don't have to worry about system requirements, as you will execute your practicals in CloudLab, a pre-configured environment that already contains all the tools and services required for Wissenhive's Spark training.