Data Engineering on AWS

| Field | Description |
| --- | --- |
| Purpose | To equip data professionals with the architectural and technical skills required to design, implement, and secure modern data solutions (including data lakes, warehouses, and complex pipelines) at scale on AWS. |
| Audience | Professionals interested in the end-to-end lifecycle of data, from ingestion and transformation to storage and consumption. |
| Role | Data Engineers, Data Architects, Backend Developers, and Data Scientists looking to operationalize data workflows. |
| Domain | Data Engineering / Big Data / Analytics. |
| Skill Level | Intermediate. |
| Style | A balanced mix of theory and hands-on labs covering batch and streaming architectures, orchestration, and performance tuning. |
| Duration | 3 Days. |
| Related Technologies | Amazon S3 (Data Lakes), Amazon Redshift Serverless, AWS Glue, Amazon Kinesis, Open Table Formats, and SQL. |
Course Description
Data Engineering on AWS is a comprehensive 3-day deep dive into the practices and solutions required to manage data at scale. Participants explore the foundational roles of data engineering and learn to build production-ready environments. The course covers the implementation of data lakes and Amazon Redshift Serverless warehouses, as well as the creation of both batch and streaming data pipelines. Beyond just building, the curriculum emphasizes optimization and security, ensuring that data solutions are cost-effective, compliant, and performant.
Who is this course for?
This course is intended for technical individuals responsible for the plumbing of data-driven organizations. It is ideal for:
Data Engineers who need to move beyond simple ETL to complex cloud-native architectures.
Data Architects designing scalable data environments for analytics and machine learning.
Software Developers tasked with integrating application data into centralized data lakes or warehouses.
Course Objectives
Foundational Strategy: Understand data personas, discovery, and the orchestration of AWS services for data movement.
Data Lake Implementation: Design and secure data lakes using S3, incorporating open table formats and transformation workflows.
Data Warehousing: Set up and optimize Amazon Redshift Serverless, including query tuning and automated orchestration.
Batch Pipelines: Build comprehensive batch processing pipelines that cover cataloging, integration, and secure serving.
Streaming Solutions: Architect real-time streaming pipelines, focusing on ingestion, storage, and live analysis with security and compliance.
Operational Excellence: Apply CI/CD, Infrastructure as Code (IaC), and cost optimization practices to data engineering projects.
Prerequisites
Programming: Working knowledge of Python and libraries like NumPy and Pandas.
AI/ML Basics: Familiarity with supervised/unsupervised learning and basic algorithms (regression, classification).
Cloud & Data: A basic understanding of cloud computing and the AWS platform; familiarity with SQL and relational databases is highly recommended.
Section 1: Data Engineering Roles and Key Concepts
Role of a Data Engineer
Key functions of a Data Engineer
Data Personas
Data Discovery
AWS Data Services
Section 2: AWS Data Engineering Tools and Services
Orchestration and Automation
Data Engineering Security
Monitoring
Continuous Integration and Continuous Delivery
Infrastructure as Code (see the CDK sketch after this list)
AWS Serverless Application Model
Networking Considerations
Cost Optimization Tools
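To make the Infrastructure as Code topic concrete, here is a minimal AWS CDK sketch in Python, one IaC option alongside CloudFormation and the AWS Serverless Application Model that the course also covers. The stack, bucket, and database names are illustrative assumptions, not course lab code.

```python
# Minimal IaC sketch with AWS CDK (Python): one versioned S3 bucket for
# raw data and a Glue Data Catalog database. Names are placeholders.
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3, aws_glue as glue
from constructs import Construct

class DataPlatformStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Raw zone of the data lake; versioning guards against accidental overwrites.
        s3.Bucket(self, "RawDataBucket",
                  versioned=True,
                  encryption=s3.BucketEncryption.S3_MANAGED)

        # Catalog database that crawlers and ETL jobs will populate.
        glue.CfnDatabase(self, "AnalyticsDatabase",
                         catalog_id=self.account,
                         database_input=glue.CfnDatabase.DatabaseInputProperty(
                             name="analytics_db"))

app = cdk.App()
DataPlatformStack(app, "DataPlatformStack")
app.synth()
```

Running `cdk deploy` against this app creates both resources; the same pattern scales to the full pipeline stack covered later in the course.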
Section 3: Designing and Implementing Data Lakes
Hands-on lab: Setting up a Data Lake on AWS
Data lake introduction
Data lake storage
Ingest data into a data lake (an ingest-and-catalog sketch follows this list)
Catalog data
Transform data
Serve data for consumption
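As a concrete illustration of the ingest and catalog steps above, the following boto3 sketch lands a file in a raw S3 zone and runs a Glue crawler over it. The bucket, role ARN, and crawler name are placeholders, not the lab's resources.

```python
# Hypothetical ingest-and-catalog flow: upload a local extract to the
# data lake's raw zone, then let a Glue crawler register its schema.
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# 1. Ingest: land a local CSV extract in the raw zone of the lake.
s3.upload_file("orders.csv", "my-data-lake-raw", "sales/orders/orders.csv")

# 2. Catalog: a crawler infers the schema and creates a Data Catalog table.
glue.create_crawler(
    Name="sales-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-raw/sales/orders/"}]},
)
glue.start_crawler(Name="sales-orders-crawler")
```

In practice the crawler runs on a schedule or is triggered by the pipeline; `start_crawler` here just kicks off a one-time run.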
Section 4: Optimizing and Securing a Data Lake Solution
Open Table Formats
Security using AWS Lake Formation
Setting permissions with Lake Formation (sketch below)
Security and governance
Troubleshooting
Hands-on lab: Automating Data Lake Creation using AWS Lake Formation Blueprints
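The Lake Formation permission model in this section can be sketched in a few lines of boto3. The principal ARN and table names below are placeholders, and real deployments typically layer LF-tag-based access on top of direct grants like this one.

```python
# A minimal sketch of a table-level Lake Formation grant: give an
# analyst role SELECT on one cataloged table. Lake Formation enforces
# this instead of broad S3 bucket policies. All names are placeholders.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier":
               "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "analytics_db",
                        "Name": "orders"}},
    Permissions=["SELECT"],
)
```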
Section 5: Data Warehouse Architecture and Design Principles
Hands-on Lab: Setting up a Data Warehouse using Amazon Redshift Serverless
Introduction to data warehouses
Amazon Redshift Overview
Ingesting data into Redshift (see the COPY sketch after this list)
Processing data
Serving data for consumption
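As a hedged sketch of serverless ingestion, the Redshift Data API call below issues a COPY from S3 into a table. The workgroup, database, table, and IAM role are assumed names; COPY from S3 is the usual bulk-load path the course demonstrates.

```python
# Load data into Amazon Redshift Serverless via the Redshift Data API.
# WorkgroupName targets a serverless workgroup; all names are placeholders.
import boto3

rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="""
        COPY sales.orders
        FROM 's3://my-data-lake-raw/sales/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV IGNOREHEADER 1;
    """,
)
print("statement id:", resp["Id"])  # poll describe_statement with this id
```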
Section 6: Performance Optimization Techniques for Data Warehouses
Monitoring and optimization options
Data optimization in Amazon Redshift
Query optimization in Amazon Redshift (EXPLAIN sketch below)
Orchestration options
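One routine tuning step is inspecting a query plan before touching distribution or sort keys. The sketch below runs EXPLAIN through the Data API and prints the plan; the workgroup name and query are illustrative.

```python
# Inspect a Redshift query plan via the Data API before tuning keys.
import time
import boto3

rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    WorkgroupName="analytics-wg",  # placeholder serverless workgroup
    Database="dev",
    Sql="EXPLAIN SELECT region, SUM(amount) FROM sales.orders GROUP BY region;",
)

# Poll until the statement finishes, then print each plan row.
status = None
while status not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
    status = rsd.describe_statement(Id=resp["Id"])["Status"]

if status == "FINISHED":
    for row in rsd.get_statement_result(Id=resp["Id"])["Records"]:
        print(row[0]["stringValue"])
```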
Section 7: Security and Access Control for Data Warehouses
Hands-on lab: Managing Access Control in Redshift
Authentication and access control in Amazon Redshift (see the RBAC sketch after this list)
Data security in Amazon Redshift
Auditing and compliance in Amazon Redshift
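Redshift's native role-based access control maps naturally onto the lab above. The sketch below creates a role, grants it read access to one schema, and attaches it to a user; all names are placeholders, and each statement is submitted asynchronously (a real script would poll `describe_statement` between steps).

```python
# Role-based access control in Redshift, issued through the Data API.
# Role, schema, and user names are illustrative placeholders.
import boto3

rsd = boto3.client("redshift-data")

for sql in (
    "CREATE ROLE analyst;",
    "GRANT USAGE ON SCHEMA sales TO ROLE analyst;",
    "GRANT SELECT ON ALL TABLES IN SCHEMA sales TO ROLE analyst;",
    "GRANT ROLE analyst TO alice;",
):
    rsd.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)
```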
Section 8: Designing Batch Data Pipelines
Introduction to batch data pipelines
Designing a batch data pipeline
AWS services for batch data processing
Section 9: Implementing Strategies for Batch Data Pipelines
Hands-on lab: A Day in the Life of a Data Engineer
Elements of a batch data pipeline
Processing and transforming data (a Glue job skeleton follows this list)
Integrating and cataloging your data
Serving data for consumption
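A typical processing and transformation step in such a pipeline is an AWS Glue PySpark job. The skeleton below, with assumed database, table, and bucket names, reads a cataloged table, trims it to the columns downstream consumers need, and writes partitioned Parquet back to the lake.

```python
# Skeleton of an AWS Glue PySpark job: read a cataloged raw table,
# clean it, and serve it as partitioned Parquet. Names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext.getOrCreate())
job = Job(glue_ctx)
job.init(args["JOB_NAME"], args)

# Read the raw table registered by the crawler.
orders = glue_ctx.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="orders")

# Keep only the needed columns and drop rows with no amount.
cleaned = orders.select_fields(
    ["order_id", "region", "amount", "order_date"]
).filter(lambda row: row["amount"] is not None)

# Serve: write partitioned Parquet to the curated zone for efficient queries.
glue_ctx.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-curated/sales/orders/",
                        "partitionKeys": ["region"]},
    format="parquet",
)
job.commit()
```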
Section 10: Optimizing, Orchestrating, and Securing Batch Data Pipelines
Hands-on lab: Orchestrating Data Processing in Spark using AWS Step Functions
Optimizing the batch data pipeline
Orchestrating the batch data pipeline (see the Step Functions sketch below)
Securing the batch data pipeline
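The orchestration pattern from the lab reduces to starting a state machine execution with the run's parameters. The boto3 sketch below assumes a state machine named `batch-pipeline` already exists, for example one that sequences the Glue/Spark steps above; the ARN and input payload are placeholders.

```python
# Kick off a Step Functions state machine that orchestrates the batch
# pipeline. The state machine ARN and input fields are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions")

execution = sfn.start_execution(
    stateMachineArn=("arn:aws:states:us-east-1:123456789012:"
                     "stateMachine:batch-pipeline"),
    input=json.dumps({"run_date": "2024-01-15",
                      "source": "s3://my-data-lake-raw/sales/orders/"}),
)
print("started:", execution["executionArn"])
```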
Section 11: Streaming Data Architecture Patterns
Hands-on lab: Streaming Analytics with Amazon Managed Service for Apache Flink
Introduction to streaming data pipelines
Ingesting data from stream sources (producer sketch after this list)
Streaming data ingestion services
Storing streaming data
Processing streaming data
Analyzing streaming data with AWS services
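On the ingestion side, a minimal Kinesis producer looks like the sketch below. The stream name and event shape are assumptions for illustration; in the lab's architecture, an Amazon Managed Service for Apache Flink application would consume and analyze these records downstream.

```python
# Minimal Kinesis producer: write one JSON event to a data stream.
# Stream name and event fields are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"order_id": "o-123", "region": "eu-west-1", "amount": 42.5}
kinesis.put_record(
    StreamName="orders-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["order_id"],  # controls shard assignment
)
```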
Section 12: Optimizing and Securing Streaming Solutions
Hands-on lab: Access Control with Amazon Managed Streaming for Apache Kafka
Optimizing a streaming data solution
Securing a streaming data pipeline (encryption sketch below)
Compliance considerations
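As one concrete hardening step (an illustration, not the lab's MSK walkthrough), the sketch below enables server-side encryption on a Kinesis stream with a customer-managed KMS key; the stream name and key ARN are placeholders.

```python
# Enable server-side encryption at rest on a Kinesis data stream using
# a customer-managed KMS key. All identifiers are placeholders.
import boto3

kinesis = boto3.client("kinesis")

kinesis.start_stream_encryption(
    StreamName="orders-stream",
    EncryptionType="KMS",
    KeyId="arn:aws:kms:us-east-1:123456789012:key/"
          "11111111-2222-3333-4444-555555555555",
)
```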

