ETL Process Overview

Here is the conventional route vs. the Extract, Transform, Load (ETL) approach.

The old conventional route adds time delays and hinders the process considerably: it loads the server with multiple redundant jobs and leaves the end reports stale and no longer viable, since a single report can take days or even weeks to develop.

ETL, on the other hand, handles all the data with ease (even when a source is a big data system). It applies the common transformations before appending the data to the data warehouse, and then everyone interacts with the data warehouse directly, which guarantees the data is in good shape and form.


The ETL process begins by extracting the data from sources such as: flat files, APIs (e.g., REST APIs), cloud services (e.g., Hadoop HDFS or Amazon S3), SaaS platforms (e.g., Shopify or SAP), real-time "continuous ingestion" sources (e.g., sensors or Apache Kafka), or even logs and ERP exports.

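To make the extract step concrete, here is a minimal sketch in Python using only the standard library, covering two of the source types above (a flat file and a REST API). The file name and URL are hypothetical placeholders, not real sources:

```python
# Minimal extraction sketch. The file name and endpoint below are
# hypothetical examples -- replace them with your real sources.
import csv
import json
import urllib.request

def extract_flat_file(path):
    """Extract rows from a CSV flat file (e.g., a CRM/ERP export)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def extract_rest_api(url):
    """Extract records from a REST API that returns JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    crm_rows = extract_flat_file("crm_customers.csv")              # hypothetical file
    api_rows = extract_rest_api("https://example.com/api/orders")  # hypothetical endpoint
    print(f"Extracted {len(crm_rows)} CRM rows and {len(api_rows)} API rows")
```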

Here is an example of how to implement ETL. It shows that we can use different parts of ETL at different layers, according to the system's needs and the data architecture. (A small code sketch of this exact combination follows the table.)


| Category | Type | Description and why it's used |
| --- | --- | --- |
| Extraction Technique | File Parsing | We handle files that come out of CRM & ERP systems |
| Extraction Type | Full Extraction | We have no background on the database, so extracting the data fully fits well |
| Extraction Method | Pull Extraction | We go and pull the data from the source, whether manually or code-automated |
| Transformation | All | We don't know how the data was collected or how many errors it contains, so we assume the worst-case scenario to obtain the best results |
| Load Method | Batch Processing | It makes sense since we chose full extraction & file parsing, so we do it in one go |
| Load Method | Full Load (Truncate & Insert) | It aligns with the choice of full extraction |
| Slowly Changing Dimensions | SCD 1 | Overwrite the data every time we process it |
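Here is a minimal end-to-end sketch of that exact combination (file parsing, full extraction, batch processing, truncate & insert, SCD 1) using Python's standard library; the file, table, and column names are assumptions for illustration:

```python
# End-to-end batch sketch: file parsing + full extraction + truncate-and-insert
# full load (SCD 1: the target is simply overwritten on every run).
# File, table, and column names are hypothetical.
import csv
import sqlite3

def run_batch(csv_path, db_path="warehouse.db"):
    # Extract: full extraction via file parsing
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Transform: assume the worst case -- trim spaces, drop rows missing the key
    clean = [
        {"customer_id": r["customer_id"].strip(),
         "name": (r.get("name") or "").strip()}
        for r in rows
        if r.get("customer_id")
    ]

    # Load: truncate & insert (full load, SCD 1 overwrite)
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS dim_customer (customer_id TEXT, name TEXT)")
    con.execute("DELETE FROM dim_customer")  # truncate
    con.executemany(
        "INSERT INTO dim_customer (customer_id, name) VALUES (:customer_id, :name)",
        clean,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    run_batch("crm_customers.csv")  # hypothetical export file
```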


Extraction Methods and Types

| Extraction Type | Description |
| --- | --- |
| Full Extraction | Every time, we extract the whole table after deleting the previous load |
| Incremental Extraction | We take only the new or changed data (e.g., by date since the last extraction) |

| Extraction Method | Description |
| --- | --- |
| Pull Extraction | The ETL tool queries the source system to extract data |
| Push Extraction | The source system pushes data to the ETL system |
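The difference between the two extraction types is easiest to see in code. Below is a minimal sketch of full vs. incremental pull extraction against a SQL database, assuming a hypothetical `orders` table with an `updated_at` column and a small text file as the watermark store:

```python
# Full vs. incremental pull extraction. Table/column names and the
# watermark file are assumptions for illustration.
import sqlite3
from pathlib import Path

WATERMARK = Path("last_extracted_at.txt")  # hypothetical watermark store

def full_extract(con):
    # Full extraction: re-read the whole table on every run
    return con.execute("SELECT * FROM orders").fetchall()

def incremental_extract(con):
    # Incremental extraction: only rows changed since the last run
    since = WATERMARK.read_text().strip() if WATERMARK.exists() else "1970-01-01"
    rows = con.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (since,)
    ).fetchall()
    new_mark = con.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    if new_mark:
        WATERMARK.write_text(new_mark)  # advance the watermark for the next run
    return rows

if __name__ == "__main__":
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
    con.execute("INSERT INTO orders VALUES (1, '2024-05-01')")
    print(incremental_extract(con))  # first run: everything
    print(incremental_extract(con))  # second run: nothing new
```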

Extraction Techniques

| Technique | Description |
| --- | --- |
| Manual Data Extraction | Accessing the source and extracting the data by hand |
| Database Querying | Using SQL or other query languages to extract data from databases |
| File Parsing | Reading and parsing data from structured files (CSV, XML, JSON, etc.) |
| API Calls | Using API calls to extract the data |
| Event-Based Streaming (e.g., Kafka) | Capturing data changes as events in real time |
| Web Scraping | Extracting data from websites programmatically |
| CDC (Change Data Capture) | Identifying and tracking data that has changed |
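For the event-based streaming technique, here is a minimal consumer sketch using the third-party kafka-python package (`pip install kafka-python`); the topic name and broker address are assumptions, and a running Kafka broker is required:

```python
# Event-based extraction sketch with Kafka. The topic name and broker
# address are hypothetical -- point them at your own cluster.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                          # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:               # each source change arrives as an event
    record = message.value
    print(record)                      # hand off to the transform step here
```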

Transformation Operations

| Operation | Sub-operations | Description |
| --- | --- | --- |
| Data Enrichment | | Adding value to existing data with additional information |
| Data Integration | | Combining data from different sources into one data model |
| Derived Columns | | Creating new columns from existing ones |
| Data Normalization & Standardization | | Mapping coded values to standard formats and structures that are easier to analyze and understand |
| Business Rules & Logic | | Applying business-specific rules to transform data |
| Data Aggregations | | Summarizing data into useful metrics |
| Data Cleansing | Remove duplicates, data filtering, handling missing data, handling invalid values, handling unwanted spaces, outlier detection, data type casting | Improving data quality by fixing issues |
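Most of the cleansing sub-operations map directly to one-liners in pandas. Here is a minimal sketch with made-up column names and sample data:

```python
# Data cleansing sketch covering several of the sub-operations above.
# Column names and values are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["1", "2", "2", None],
    "name": ["  Ann ", "Bob", "Bob", "Eve"],
    "age": ["34", "29", "29", "-5"],
})

df = df.drop_duplicates()                # remove duplicates
df = df[df["customer_id"].notna()]       # handle missing data
df["name"] = df["name"].str.strip()      # handle unwanted spaces
df["age"] = df["age"].astype(int)        # data type casting
df = df[df["age"].between(0, 120)]       # handle invalid values / filter outliers
print(df)
```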

Load Methods and Processing Types

| Processing Type | Description |
| --- | --- |
| Batch Processing | The data warehouse is loaded in one big chunk on a schedule (e.g., once or twice a day) |
| Continuous Processing | A change in the source is processed as soon as possible through all layers of the warehouse, making it a real-time data warehouse |

| Loading Method | Description |
| --- | --- |
| Full Load | Truncate & Insert: delete all records in the table and reinsert the new data. Update & Insert ("Upsert"): update existing records and insert new ones if found. Drop, Create, Insert: drop the table itself, create a new table, then insert the data into it |
| Incremental Load | Update & Insert ("Upsert"): update existing records and insert new ones if found. Append: for logs, where we need both the previous and the new data, we simply add to the old data. Merge: same as upsert, but records that disappeared from the source are also deleted |
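Here is a minimal sketch of three of these loading patterns (truncate & insert, upsert, append) using SQLite; the table and column names are assumptions, and the upsert syntax requires SQLite 3.24+:

```python
# Loading-method sketch. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE fact_log (event TEXT)")

rows = [(1, "Ann"), (2, "Bob")]

# Full load: truncate & insert
con.execute("DELETE FROM dim_customer")
con.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)

# Incremental load: upsert (update existing keys, insert new ones)
con.executemany(
    """INSERT INTO dim_customer (customer_id, name) VALUES (?, ?)
       ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name""",
    [(2, "Bobby"), (3, "Cara")],
)

# Incremental load: append (logs keep both old and new records)
con.execute("INSERT INTO fact_log VALUES ('batch finished')")

print(con.execute("SELECT * FROM dim_customer").fetchall())
# [(1, 'Ann'), (2, 'Bobby'), (3, 'Cara')]
```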

Slowly Changing Dimensions (SCD)

| SCD Type | Description |
| --- | --- |
| SCD 0: No Historization | Original data is never changed |
| SCD 1: Overwrite | Data is overwritten with new values without keeping history |
| SCD 2: Historization | Creates new records for changed data while preserving history |
| SCD 3: Limited History | Keeps limited history using separate columns for previous values |
| SCD 4: History Tables | Uses separate history tables to track all changes |
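To contrast SCD 2 with the SCD 1 overwrite used earlier, here is a minimal sketch: instead of overwriting, we close the current record and insert a new version with validity dates. The schema and names are assumptions:

```python
# SCD 2 sketch: preserve history by versioning rows. Schema is hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER, name TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER)""")

def scd2_update(con, customer_id, new_name, today):
    # Close the currently valid record for this key...
    con.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (today, customer_id),
    )
    # ...then insert the new version, preserving the old row as history
    con.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_name, today),
    )

scd2_update(con, 1, "Ann", "2024-01-01")
scd2_update(con, 1, "Ann Smith", "2024-06-01")  # name change -> new row
print(con.execute("SELECT * FROM dim_customer").fetchall())
```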