Apache Spark and Scala Training

4.0 (5896) 6701 Learners

24 hours of instructor-led live online training

Master the concepts of the Apache Spark framework and Spark development

In-depth exercises and real-time projects on Apache Spark

Learn about Apache Spark Core, Spark Internals, RDD, Spark SQL, etc

Gain comprehensive knowledge of the Scala programming language

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.

You will understand the basics of Big Data and Hadoop. You will learn how Spark enables in-memory data processing and why this lets it run much faster than Hadoop MapReduce for many workloads. You will also learn about RDDs, Spark SQL for structured data processing, and other APIs offered by Spark, such as Spark Streaming and Spark MLlib. This course is an integral part of a Big Data Developer’s career path. It also covers fundamental concepts such as data capture using Flume, data loading using Sqoop, and messaging systems like Kafka.

What you will learn

  • Big Data Introduction
  • Introduction to Scala
  • Spark Introduction
  • Spark Framework & Methodologies
  • Spark Data Structure
  • Spark Ecosystem


Apache Spark with Scala is used by 9 out of 10 organizations for their big data needs. Let’s take a look at its benefits at the individual and organizational level: 

Individual Benefits:

  • Learn Apache Spark to have increased access to Big Data
  • There’s a huge demand for Spark Developers across organizations
  • With an Apache Spark with Scala certification, you will earn a minimum salary of $100,000.  
  • As Apache Spark is deployed across industries to process huge volumes of data, your skills will be in demand in many different sectors

Organization Benefits:

  • It supports multiple languages like Java, R, Scala, Python
  • Easier integration with Hadoop, as Spark can run on YARN and read data from the Hadoop Distributed File System (HDFS)
  • It enables faster processing of data streams in real time with accuracy
  • The same Spark code can be used for batch processing, joining streams against historical data, and running ad-hoc queries on stream state

According to Databricks - "The adoption of Apache Spark by businesses large and small is growing at an incredible rate across a wide range of industries, and the demand for developers with certified expertise is quickly following suit". 

Course Content

What is Big Data?
Big Data Customer Scenarios
How Hadoop Solves the Big Data Problem?
What is Hadoop?
Hadoop’s Key Characteristics
Hadoop Ecosystem and HDFS
Hadoop Core Components
Rack Awareness and Block Replication
YARN and its Advantage
Hadoop Cluster and its Architecture
Hadoop: Different Cluster Modes
Hadoop Terminal Commands
Big Data Analytics with Batch & Real-time Processing
Why Spark is needed?
What is Spark?
How Spark differs from other frameworks?
Spark at Yahoo!
Why Scala for Spark?
Scala in other Frameworks
Introduction to Scala REPL
Basic Scala Operations
Variable Types in Scala
Control Structures in Scala
Foreach loop, Functions and Procedures
Collections in Scala - Array
ArrayBuffer, Map, Tuples, Lists, and more
Functional Programming
Higher Order Functions
Anonymous Functions
Class in Scala
Getters and Setters
Custom Getters and Setters
Properties with only Getters
Auxiliary Constructor and Primary Constructor
Singletons
Extending a Class
Overriding Methods
Traits as Interfaces and Layered Traits
Spark’s Place in Hadoop Ecosystem
Spark Components & its Architecture
Spark Deployment Modes
Introduction to Spark Shell
Writing your first Spark Job Using SBT
Submitting Spark Job
Spark Web UI
Data Ingestion using Sqoop
Challenges in Existing Computing Methods
Probable Solution & How RDD Solves the Problem
What is RDD, Its Operations, Transformations & Actions
Data Loading and Saving Through RDDs
Key-Value Pair RDDs
Other Pair RDDs, Two Pair RDDs
RDD Lineage
RDD Persistence
Word Count Program Using RDD Concepts
RDD Partitioning & How It Helps Achieve Parallelization
Passing Functions to Spark
Need for Spark SQL
What is Spark SQL?
Spark SQL Architecture
SQL Context in Spark SQL
User Defined Functions
Data Frames & Datasets
Interoperating with RDDs
JSON and Parquet File Formats
Loading Data through Different Sources
Spark – Hive Integration
Why Machine Learning?
What is Machine Learning?
Where Machine Learning is Used?
Different Types of Machine Learning Techniques
Introduction to MLlib
Features of MLlib and MLlib Tools
Various ML algorithms supported by MLlib
Supervised Learning - Linear Regression, Logistic Regression, Decision Tree, Random Forest
Unsupervised Learning - K-Means Clustering & How It Works with MLlib
Analysis on US Election Data using MLlib (K-Means)
Need for Kafka
What is Kafka?
Core Concepts of Kafka
Kafka Architecture
Where is Kafka Used?
Understanding the Components of Kafka Cluster
Configuring Kafka Cluster
Kafka Producer and Consumer Java API
Need for Apache Flume
What is Apache Flume?
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration
Integrating Apache Flume and Apache Kafka
Drawbacks in Existing Computing Methods
Why Streaming is Necessary?
What is Spark Streaming?
Spark Streaming Features
Spark Streaming Workflow
How Uber Uses Streaming Data
Streaming Context & DStreams
Transformations on DStreams
Windowed Operators and Why They Are Useful
Important Windowed Operators
Slice, Window and ReduceByWindow Operators
Stateful Operators
Apache Spark Streaming: Data Sources
Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources
Example: Using a Kafka Direct Data Source
Perform Twitter Sentiment Analysis Using Spark Streaming

Course Details

In this era of artificial intelligence, machine learning, and data science, algorithms built on distributed iterative computation make it easier to distribute and process huge volumes of data. Spark is a fast, in-memory cluster computing framework that can be used for a variety of purposes. This JVM-based open-source framework can process and analyze huge volumes of data while distributing that data over a cluster of machines. It is designed to handle both batch and stream processing, which is why it is described as a general-purpose cluster computing platform. Spark itself is written in Scala, a powerful, expressive, and statically typed programming language that doesn’t compromise on type safety.

Do you know the secret behind Uber’s flawless map functioning? Here’s a hint: the images gathered by the Map Data Collection Team are accessed by the downstream Apache Spark team and assessed by the operators responsible for map edits. Apache Spark supports a number of file formats that allow multiple records to be stored in a single file.

According to a recent survey by Databricks, 71% of Spark users use Scala for programming. Spark with Scala is a perfect combination to stay grounded in the Big Data world. 9 out of 10 companies have this successful combination running in their organizations. Spark has over 1000 contributors across 250+ organizations, making it one of the most active open-source projects. The Apache Spark market is expected to grow at a CAGR of 67% between 2019 and 2022, driving high demand for trained professionals.

Who should go for this course?

  • Data Scientists
  • Data Engineers
  • Data Analysts
  • BI Professionals
  • Research professionals
  • Software Architects
  • Testing Professionals
  • Software Developers



Although there are no prerequisites for the Apache Spark and Scala certification training, familiarity with Python, Java, or Scala programming will be beneficial. In addition, the following are helpful:

  • A basic understanding of SQL, databases, and database query languages.
  • Working knowledge of Linux or Unix-based systems (helpful, but not mandatory).
  • Prior certification training in Big Data Hadoop development (recommended).

Course Info.

25+ Hours
2-3 Hours/week
Open Source
Video Script

Training Options

Self-paced Training

  • Lifetime access to high-quality self-paced eLearning content curated by industry experts
  • 3 simulation test papers for self-assessment
  • Lab access to practice live during sessions
  • 24x7 learner assistance and support

Live Virtual Classes

  • Online Classroom Flexi-Pass
  • Lifetime access 
  • Practice lab and projects with integrated Azure labs
  • Access to Microsoft official content aligned to examination

One on One Training

  • Customized learning delivery model (self-paced and/or instructor-led)
  • Flexible pricing options
  • Enterprise grade learning management system (LMS)
  • Enterprise dashboards for individuals and teams
  • 24x7 learner assistance and support

Exam & Certification

No Exam Required.

You will be required to complete a project, which will be assessed by our certified instructors. On successful completion of the project, you will be awarded a training certificate.


You will execute all your Spark and Scala course assignments and case studies in the Cloud LAB environment provided by Edureka, which you access through your browser. If you have any questions, Edureka’s support team is available 24x7 for prompt assistance.

CloudLab is a cloud-based Spark and Hadoop environment that Edureka offers with the Spark Training Course, where you can run all the in-class demos and work on real-life Spark case studies. This not only saves you the trouble of installing and maintaining Spark and Scala on a virtual machine, but also gives you experience with a real Big Data and Spark production cluster. You can access the Spark Training CloudLab through your browser, which requires only a minimal hardware configuration. If you get stuck at any step, our support team is ready to assist 24x7.

You don’t have to worry about the system requirements as you will be executing your practicals on a Cloud LAB which is a pre-configured environment. This environment already contains all the necessary tools and services required for wissenhive's Spark Training.