ETL Process Overview

The Conventional Route vs. the Extract, Transform, Load (ETL) approach

The conventional route adds time delays and severely hinders the process: it loads the server with multiple redundant jobs, and the end reports arrive stale and no longer viable because a single report can take days or even weeks to develop.
ETL, on the other hand, handles all the data with ease (even when a source is a Big Data resource). It applies the common transformations once, before appending the data to the data warehouse; everyone then interacts with the data warehouse directly, with a guarantee that the data is in good shape and form.

The ETL process starts by extracting the data from a data source such as: flat files, APIs (e.g., REST APIs), cloud services (e.g., Hadoop HDFS or Amazon S3), SaaS platforms (e.g., Shopify or SAP), real-time "continuous ingestion" sources (e.g., sensors or Apache Kafka), or even logs and ERP exports.
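As a minimal sketch of the Extract step, assuming a local CSV export and a hypothetical REST endpoint (the file name and URL below are placeholders, not real sources), extraction can look like this in Python:

```python
import csv
import json
from urllib.request import urlopen

def extract_csv(path: str) -> list[dict]:
    """Extract rows from a flat file, e.g. a CRM or ERP export."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def extract_rest(url: str) -> list[dict]:
    """Extract records from a REST API that returns a JSON array."""
    with urlopen(url) as resp:
        return json.load(resp)

# Hypothetical sources -- replace with your real file path and endpoint.
customers = extract_csv("customers.csv")
orders = extract_rest("https://api.example.com/orders")
```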

Here is an example of how to implement ETL. It shows that we can apply different parts of ETL at different layers, according to the system's needs and the data architecture (a runnable sketch follows the table):
| Category | Type | Description and why it's used |
| --- | --- | --- |
| Extraction Technique | File Parsing | We handle files coming out of CRM & ERP systems |
| Extraction Type | Full Extraction | We have no background on the database, so extracting the data fully fits well |
| Extraction Method | Pull Extraction | We go and pull the data from the source, whether manually or code-automated |
| Transformation | All | We don't know how the data was collected or how many errors it contains, so we assume the worst-case scenario to obtain the best results |
| Load Method | Batch Processing | Since we chose Full Extraction & File Parsing, it makes sense to do it in one go |
| Load Method | Full Load (Truncate & Insert) | It is consistent with the choice of Full Extraction |
| Slowly Changing Dimensions | SCD 1 | Overwrite the data every time we process it |
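Putting the table's choices together, here is a minimal sketch of the whole pipeline: File Parsing + Full Extraction, a worst-case-scenario cleanup, and a batch Truncate & Insert load. SQLite stands in for the data warehouse; the file name, table name, and columns are assumptions for illustration:

```python
import csv
import sqlite3

def etl_full_load(csv_path: str, db_path: str) -> None:
    """Full Extraction via file parsing, then a batch Truncate & Insert."""
    # Extract: parse the whole file on every run (Full Extraction).
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Transform: assume the worst case -- trim unwanted spaces everywhere.
    for row in rows:
        for key, value in row.items():
            row[key] = value.strip() if isinstance(value, str) else value

    # Load: truncate the target table, then insert everything in one batch.
    # The CSV is assumed to contain at least "id" and "name" columns.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT, name TEXT)")
    con.execute("DELETE FROM customers")  # SQLite's stand-in for TRUNCATE
    con.executemany(
        "INSERT INTO customers (id, name) VALUES (:id, :name)", rows
    )
    con.commit()
    con.close()
```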

Extraction Methods and Types
| Extraction Type | Description |
| --- | --- |
| Full Extraction | Every run, we extract the whole table after deleting the previously loaded one |
| Incremental Extraction | We take only the new or changed data (e.g., by date) since the last extraction |

| Extraction Method | Description |
| --- | --- |
| Pull Extraction | The ETL tool queries the source system to extract the data |
| Push Extraction | The source system pushes the data to the ETL system |
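As a sketch of Incremental Extraction combined with Pull Extraction, assuming a source table `source_orders` with an `updated_at` column (both names are hypothetical), the ETL tool queries only what changed since the last run:

```python
import sqlite3

def extract_incremental(con: sqlite3.Connection, last_run: str) -> list[tuple]:
    """Pull only the rows that changed since the previous extraction."""
    cur = con.execute(
        "SELECT id, name, updated_at FROM source_orders "
        "WHERE updated_at > ?",
        (last_run,),
    )
    return cur.fetchall()

# In practice, persist last_run as a "watermark" after each successful run,
# so the next extraction starts exactly where this one stopped.
```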
Extraction Techniques
| Technique | Description |
| --- | --- |
| Manual Data Extraction | Accessing the source and extracting the data by hand |
| Database Querying | Using SQL or other query languages to extract data from databases |
| File Parsing | Reading and parsing data from structured files (CSV, XML, JSON, etc.) |
| API Calls | Calling APIs to extract the data |
| Event-Based Streaming (e.g., Kafka) | Capturing data changes as events in real time |
| Web Scraping | Extracting data from websites programmatically |
| CDC (Change Data Capture) | Identifying and tracking data that has changed |
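To make the File Parsing row concrete, here is a small standard-library sketch that reads the same kind of records from JSON and XML exports. The file layouts are assumptions: a JSON array of objects, and an XML file with one `<record>` element per row:

```python
import json
import xml.etree.ElementTree as ET

def parse_json(path: str) -> list[dict]:
    """Parse a structured JSON export into a list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def parse_xml(path: str) -> list[dict]:
    """Parse an XML export with one <record> element per row."""
    root = ET.parse(path).getroot()
    return [
        {field.tag: field.text for field in record}
        for record in root.findall("record")
    ]
```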
Transformation Operations
| Operation | Sub-operations | Description |
| --- | --- | --- |
| Data Enrichment | | Adding value to existing data with additional information |
| Data Integration | | Combining data from different sources into one data model |
| Derived Columns | | Creating new data columns from existing ones |
| Data Normalization & Standardization | | Mapping data (e.g., codes) to standard formats and structures that are easier to analyze and understand |
| Business Rules & Logic | | Applying business-specific rules to transform data |
| Data Aggregations | | Summarizing data into useful metrics |
| Data Cleansing | Remove Duplicates, Data Filtering, Handling Missing Data, Handling Invalid Values, Handling Unwanted Spaces, Outlier Detection, Data Type Casting | Improving data quality by fixing issues |
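The cleansing sub-operations map almost one-to-one onto pandas calls. A minimal sketch, assuming pandas is installed and a frame with hypothetical `name`, `status`, and `amount` columns:

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few of the cleansing sub-operations listed above."""
    df = df.drop_duplicates()                    # Remove Duplicates
    df = df[df["status"] != "invalid"].copy()    # Data Filtering / Invalid Values
    df["name"] = df["name"].str.strip()          # Handling Unwanted Spaces
    df["amount"] = df["amount"].fillna(0)        # Handling Missing Data
    df["amount"] = df["amount"].astype(float)    # Data Type Casting
    return df
```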
|
Load Methods and Processing Types
| Processing Type | Description |
| --- | --- |
| Batch Processing | Loading the data warehouse in one big chunk of data on a schedule (e.g., once or twice a day) |
| Continuous Processing | A change in the source is processed as soon as possible through all layers of the warehouse, which makes it a real-time data warehouse |

| Loading Method | Variant | Description |
| --- | --- | --- |
| Full Load | Truncate & Insert | Delete all records in the table, then reinsert the new data |
| Full Load | Update & Insert ("Upsert") | Update existing records and insert the ones that are not found |
| Full Load | Drop, Create, Insert | Drop the table itself, create a new table, then insert the data into it |
| Incremental Load | Update & Insert ("Upsert") | Update existing records and insert the ones that are not found |
| Incremental Load | Append | For logs we need both the previous and the new data, so we simply add to the existing data |
| Incremental Load | Merge | Same as Upsert, but records that no longer exist in the source are also deleted |
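As a sketch of the two most common loading methods, using SQLite (3.24+ for the upsert syntax) and a hypothetical `dim_customer` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")

def truncate_and_insert(rows: list[tuple]) -> None:
    """Full Load: wipe the table, then reload everything in one batch."""
    con.execute("DELETE FROM dim_customer")  # SQLite's stand-in for TRUNCATE
    con.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

def upsert(rows: list[tuple]) -> None:
    """Incremental Load: update matching ids, insert the ones not found."""
    con.executemany(
        "INSERT INTO dim_customer VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        rows,
    )

truncate_and_insert([(1, "Alice"), (2, "Bob")])
upsert([(2, "Bobby"), (3, "Carol")])  # updates id 2, inserts id 3
```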
|
Slowly Changing Dimensions (SCD)
| SCD Type | Description |
| --- | --- |
| SCD 0: No Historization | The original data is never changed |
| SCD 1: Overwrite | Data is overwritten with new values without keeping history |
| SCD 2: Historization | Creates new records for changed data while preserving history |
| SCD 3 | Keeps limited history using separate columns for previous values |
| SCD 4 | Uses separate history tables to track all changes |
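Finally, a minimal sketch of SCD 2, the historization pattern: the current record's validity window is closed and a new current record is inserted. The table and column names are assumptions for illustration:

```python
import sqlite3
from datetime import date

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE dim_customer_scd2 ("
    "  id INTEGER, name TEXT,"
    "  valid_from TEXT, valid_to TEXT, is_current INTEGER)"
)

def scd2_apply(cust_id: int, new_name: str) -> None:
    """SCD 2: close the current row, then insert a new current row."""
    today = date.today().isoformat()
    # Close the existing current record, if any.
    con.execute(
        "UPDATE dim_customer_scd2 SET valid_to = ?, is_current = 0 "
        "WHERE id = ? AND is_current = 1",
        (today, cust_id),
    )
    # Insert the new version as the current record.
    con.execute(
        "INSERT INTO dim_customer_scd2 VALUES (?, ?, ?, '9999-12-31', 1)",
        (cust_id, new_name, today),
    )
```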