Introduction to Big Data with Spark and Hadoop

This course is part of multiple programs.

Instructors: Aije Egwaikhide

70,107 already enrolled

7 modules

Gain insight into a topic and learn the fundamentals.

4.4

(463 reviews)

Intermediate level

Recommended experience

Flexible schedule

2 weeks at 10 hours a week

Learn at your own pace

92%

Most learners liked this course

What you'll learn

Explain the impact of big data, including use cases, tools, and processing methods.
Describe Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce.
Apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.
Use Spark's RDDs and data sets, optimize Spark SQL using Catalyst and Tungsten, and use Spark's development and runtime environment options.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

14 assignments

Taught in English

Build your subject-matter expertise

This course is available as part of

When you enroll in this course, you'll also be asked to select a specific program.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 7 modules in this course

This self-paced IBM course will teach you all about big data! You will become familiar with the characteristics of big data and its application in big data analytics. You will also gain hands-on experience with big data processing tools like Apache Hadoop and Apache Spark.

Bernard Marr defines big data as the digital trace that we are generating in this digital era. You will start the course by understanding what big data is and exploring how insights from big data can be harnessed for a variety of use cases. You'll also explore how big data uses technologies like parallel processing, scaling, and data parallelism.

Next, you will learn about Hadoop, an open-source framework for the distributed processing of large data sets, and its ecosystem. You will discover important applications that go hand in hand with Hadoop, like Hadoop Distributed File System (HDFS), MapReduce, and HBase. You will become familiar with Hive, data warehouse software that provides an SQL-like interface to efficiently query and manipulate large data sets.

You'll then gain insights into Apache Spark, an open-source processing engine that provides users with new ways to store and use big data. In this course, you will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform, going into the components that make up Apache Spark. You'll learn about DataFrames, perform basic DataFrame operations, and work with Spark SQL. You'll also explore how Spark processes and monitors the requests your application submits and how you can track work using the Spark Application UI.

This course has several hands-on labs to help you apply and practice the concepts you learn. You will complete Hadoop and Spark labs using various tools and technologies, including Docker, Kubernetes, Python, and Jupyter Notebooks.

In this module, you'll begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You'll explore the impact of Big Data on everyday personal tasks and business transactions with Big Data Use Cases. You'll also learn how Big Data uses parallel processing, scaling, and data parallelism. Going further, you'll explore commonly used Big Data tools and explain the role of open-source in Big Data. Finally, you'll go beyond the hype and explore additional Big Data viewpoints.

What's included

8 videos, 1 reading, 2 assignments, 2 plugins

8 videos (total 47 minutes)

Course Introduction (5 minutes)
What is Big Data? (7 minutes)
Impact of Big Data (5 minutes)
Parallel Processing, Scaling, and Data Parallelism (7 minutes)
Big Data Tools and Ecosystem (4 minutes)
Open Source and Big Data (6 minutes)
Beyond the Hype (4 minutes)
Big Data Use Cases (5 minutes)

1 reading (total 2 minutes)

Summary and Highlights: Introduction to Big Data (2 minutes)

2 assignments (total 41 minutes)

Practice Quiz: Introduction to Big Data (14 minutes)
Graded Quiz: What Is Big Data? (27 minutes)

2 plugins (total 27 minutes)

Introduction to Emerging Big Data Technologies (15 minutes)
Module 1 Glossary: What Is Big Data? (12 minutes)
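
As a rough illustration of the data parallelism covered in this module, the sketch below splits a list into partitions, applies the same function to each partition concurrently, and then combines the partial results. It is not taken from the course materials: the function names are hypothetical, and local threads stand in for the cluster nodes that a real big data tool would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous partitions of nearly equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partial_sum_of_squares(partition):
    """The work done independently on each partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    """Apply the same function to every partition, then merge the results."""
    partitions = split_into_partitions(data, num_partitions)
    # Threads are an assumption for illustration; a cluster would use nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(partial_sum_of_squares, partitions))
    return sum(partial_results)
```

Calling `parallel_sum_of_squares(list(range(10)), num_partitions=3)` processes three partitions independently and adds up the three partial sums. Scaling out, in this picture, means adding partitions and workers rather than making one worker faster.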

In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications, including Hadoop Distributed File System (HDFS), MapReduce, Hive, and HBase. You'll also gain practical skills in hands-on labs when you query the data added using Hive, launch a single-node Hadoop cluster using Docker, and run MapReduce jobs.

What's included

6 videos, 1 reading, 2 assignments, 3 app items, 2 plugins

6 videos (total 37 minutes)

Introduction to Hadoop (7 minutes)
Intro to MapReduce (5 minutes)
Hadoop Ecosystem (4 minutes)
HDFS (8 minutes)
HIVE (5 minutes)
HBASE (5 minutes)

1 reading (total 2 minutes)

Summary and Highlights: Introduction to Hadoop (2 minutes)

2 assignments (total 36 minutes)

Practice Quiz: Introduction to Hadoop (12 minutes)
Graded Quiz: Introduction to Hadoop Ecosystem (24 minutes)

3 app items (total 60 minutes)

Hands-on Lab: Getting Started with Hive (20 minutes)
Hands-on Lab: Hadoop MapReduce (20 minutes)
Hands-on Lab: Hadoop Cluster (Optional) (20 minutes)

2 plugins (total 30 minutes)

Cheat Sheet: Introduction to the Hadoop Ecosystem (15 minutes)
Module 2 Glossary: Introduction to the Hadoop Ecosystem (15 minutes)
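
The MapReduce model named in this module can be sketched in plain Python as a word count. This is an illustration only, not Hadoop code and not taken from the course labs: the function names are hypothetical, and the map, shuffle, and reduce phases all run in a single process here, whereas Hadoop distributes them across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, `word_count(["big data", "Big Data tools"])` counts "big" and "data" twice each and "tools" once. Because each line is mapped independently and each key is reduced independently, both phases can be spread over many machines.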

In this module, you'll turn your attention to the popular Apache Spark platform, where you will explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and lambda functions. You'll also explore Resilient Distributed Datasets (RDDs), parallel programming, resilience in Apache Spark, and relate RDDs and parallel programming with Apache Spark. Then, you'll dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data signals the need for working with queries, including structured queries using SQL. You'll also learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with Spark SQL.

What's included

5 videos, 1 reading, 2 assignments, 2 app items, 2 plugins

5 videos (total 24 minutes)

Why use Apache Spark? (5 minutes)
Functional Programming Basics (5 minutes)
Parallel Programming using Resilient Distributed Datasets (5 minutes)
Scale out / Data Parallelism in Apache Spark (3 minutes)
Dataframes and SparkSQL (4 minutes)

1 reading (total 2 minutes)

Summary and Highlights: Introduction to Apache Spark (2 minutes)

2 assignments (total 31 minutes)

Practice Quiz: Introduction to Apache Spark (10 minutes)
Graded Quiz: Apache Spark (21 minutes)

2 app items (total 75 minutes)

Practice Lab: Getting Started with Pyspark and Pandas (60 minutes)
Hands-on Lab: Getting Started with Spark using Python (15 minutes)

2 plugins (total 30 minutes)

Cheat Sheet: Apache Spark (15 minutes)
Module 3 Glossary: Apache Spark (15 minutes)
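
The functional style and lambda functions introduced in this module can be tried in plain Python before touching Spark. The sketch below is an illustration, not course material: it chains the built-in `map` and `filter` with `functools.reduce`, and the function name `sum_of_even_squares` is hypothetical. Spark's RDD API offers operations with the same names, although they run across a cluster rather than in one process.

```python
from functools import reduce


def sum_of_even_squares(numbers):
    """Square every number, keep the even squares, and add them up."""
    # map and filter return lazy iterators: nothing is computed yet.
    squares = map(lambda x: x * x, numbers)
    even_squares = filter(lambda x: x % 2 == 0, squares)
    # reduce consumes the iterators and folds the values into one result.
    return reduce(lambda acc, x: acc + x, even_squares, 0)
```

None of the lambdas modifies shared state, which is what makes this style a good fit for parallel execution: each element can be processed independently of the others.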

In this module, you'll learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. You'll explore Apache Spark SQL optimization and learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Finally, you'll fortify your skills with a guided hands-on lab to create a table view and apply data aggregation techniques.

What's included

5 videos, 1 reading, 2 assignments, 2 app items, 4 plugins

5 videos (total 25 minutes)

RDDs in Parallel Programming and Spark (5 minutes)
Data-frames and Datasets (4 minutes)
Catalyst and Tungsten (5 minutes)
ETL with DataFrames (6 minutes)
Real-world usage of SparkSQL (4 minutes)

1 reading (total 2 minutes)

Summary and Highlights: Introduction to DataFrames and Spark SQL (2 minutes)

2 assignments (total 31 minutes)

Practice Quiz: Introduction to DataFrames & Spark SQL (10 minutes)
Graded Quiz: DataFrames and Spark SQL (21 minutes)

2 app items (total 30 minutes)

Hands-on Lab: Introduction to DataFrames (15 minutes)
Hands-on Lab: Introduction to SparkSQL (15 minutes)

4 plugins (total 60 minutes)

Reading: User-Defined Schema (UDS) for DSL and SQL (10 minutes)
Reading: Common Transformations and Optimization Techniques in Spark (20 minutes)
Cheat Sheet: DataFrames and Spark SQL (15 minutes)
Module 4 Glossary: DataFrames and Spark SQL (15 minutes)
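
The distinction between RDD transformations and actions in this module comes down to lazy evaluation: a transformation only records what should happen, and an action triggers the work. The toy class below is a sketch of that idea in plain Python. `ToyRDD` is a hypothetical name, it is not Spark's API or implementation, and it runs in one process with no partitioning or fault tolerance.

```python
class ToyRDD:
    """A single-process stand-in that separates transformations from actions."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded transformations, not yet run

    # Transformations: return a new ToyRDD with one more recorded step.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _evaluate(self):
        """Build a lazy pipeline that replays the recorded steps in order."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded steps and return a concrete result.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())
```

With this sketch, `ToyRDD(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)` does no computation at all until `collect()` or `count()` is called. Deferring work in this way is also what gives an engine room to inspect and optimize the whole chain of steps before running it.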

In this module, you'll explore how Spark processes the requests that your application submits and learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Cluster Managers, their components, and benefits. You'll also know how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark instance. Next, you'll learn about Apache Spark application submission, including the use of Spark's unified interface, "spark-submit," and learn about options and dependencies. You'll also describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. You'll also look at recommended practices for Spark's static and dynamic configuration options and perform hands-on labs to use Apache Spark on IBM Cloud and run Spark on Kubernetes.

What's included

6 videos, 2 readings, 3 assignments, 2 app items, 4 plugins

6 videos (total 32 minutes)

Apache Spark Architecture (5 minutes)
Overview of Apache Spark Cluster Modes (6 minutes)
How to Run an Apache Spark Application (6 minutes)
Using Apache Spark on IBM Cloud (4 minutes)
Setting Apache Spark Configuration (5 minutes)
Running Spark on Kubernetes (4 minutes)

2 readings (total 4 minutes)

Summary and Highlights: Spark Architecture (2 minutes)
Summary and Highlights: Spark Runtime Environments (2 minutes)

3 assignments (total 33 minutes)

Practice Quiz: Spark Architecture (6 minutes)
Practice Quiz: Spark Runtime Environments (6 minutes)
Graded Quiz: Development and Runtime Environment Options (21 minutes)

2 app items (total 80 minutes)

Hands-on Lab: Submit Apache Spark Applications (60 minutes)
Hands-on Lab: Apache Spark on Kubernetes (20 minutes)

4 plugins (total 40 minutes)

Spark Environments - Overview and Options (5 minutes)
How to Set Up Your Own Spark Environments (Optional) (5 minutes)
Cheat Sheet: Development and Runtime Environment Options (15 minutes)
Module 5 Glossary: Development and Runtime Environment Options (15 minutes)

Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting to the Apache Spark user interface web server and using the same UI web server to manage application processes. You'll also identify common Apache Spark application issues and learn about debugging issues using the application UI and locating related log files. Further, you'll discover and gain real-world knowledge about how Spark manages memory and processor resources using the hands-on lab.

What's included

5 videos, 1 reading, 2 assignments, 1 app item, 3 plugins

5 videos (total 30 minutes)

The Apache Spark User Interface (5 minutes)
Monitoring Application Progress (7 minutes)
Debugging Apache Spark Application Issues (5 minutes)
Understanding Memory Resources (5 minutes)
Understanding Processor Resources (5 minutes)

1 reading (total 2 minutes)

Summary and Highlights: Introduction to Monitoring and Tuning (2 minutes)

2 assignments (total 31 minutes)

Practice Quiz: Introduction to Monitoring and Tuning (10 minutes)
Graded Quiz: Monitoring and Tuning (21 minutes)

1 app item (total 30 minutes)

Hands-on Lab: Monitoring and Performance Tuning (30 minutes)

3 plugins (total 35 minutes)

[Optional] Batch Data Ingestion Methods (5 minutes)
Cheat Sheet: Monitoring and Tuning (15 minutes)
Module 6 Glossary: Monitoring and Tuning (15 minutes)

In this module, you'll perform a practice lab where you'll explore two critical aspects of data processing using Spark: working with Resilient Distributed Datasets (RDDs) and constructing DataFrames from JSON data. You will also apply various transformations and actions on both RDDs and DataFrames to gain insights and manipulate the data effectively. Further, you'll apply your knowledge in a final project where you will create a DataFrame by loading data from a CSV file and applying transformations and actions using Spark SQL. Finally, you'll be assessed based on your learning from the course.

What's included

3 readings, 1 assignment, 2 app items, 2 plugins

3 readings (total 5 minutes)

Instructions for the Final Assessment (1 minute)
Congratulations and Next Steps (2 minutes)
Thanks from the Course Team (2 minutes)

1 assignment (total 100 minutes)

Final Assessment (100 minutes)

2 app items (total 120 minutes)

Practice Project: Data Processing Using Spark (60 minutes)
Final Project: Data Analysis using Spark (60 minutes)

2 plugins (total 35 minutes)

Final Project Overview (15 minutes)
Glossary: Introduction to Big Data with Spark and Hadoop (20 minutes)

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructors

Instructor ratings

4.3 (117 ratings)

Aije Egwaikhide

IBM

6 courses, 757,770 learners

Romeo Kienzler

IBM

10 courses, 796,941 learners

Rav Ahuja

IBM

56 courses, 4,433,427 learners

Offered by

IBM

Learner reviews

4.4

463 reviews

5 stars
66.30%
4 stars
19.22%
3 stars
7.99%
2 stars
3.02%
1 star
3.45%

Showing 3 of 463

Reviewed on Jan 16, 2024

Great program to explore more about AI and Big Data

Reviewed on May 2, 2022

hands on lab and quizzes at the end of each session was very helpful

Reviewed on Jan 18, 2025

I have learned a lot from this course, and hopefully it would be helping me throughout my career ahead.
