In this post we will learn how to implement Slowly Changing Dimension (SCD) Type-II
What is SCD Type 2 ?
Software Requirements :
1) Pentaho Data Integration(a.k.a Kettle ETL) - 5.4.0
2) PostgreSQL database
3) Sample data
i.e., keep the old location value along with new location value (i.e., keep track of history - how ? )
Example :
If 103 numbered employee moves to a new location let's say "new delhi" keep both the records(old location and new location records) in the dimension table
CASE 1 : emp_staging (raw data) table after 103 employee moved to a new location "new delhi"
CASE 2 : emp_staging (raw data) table after 103 employee moved again to a new location "USA"
NOTE : employee can be shifted to "N" number of times to "N" number of locations.
Whenever employee moves to a new location keep all the locations in the dimension table along with new location.
ETL 1st RUN : (On raw data)
table name : dim_employee
ETL 2nd RUN (Change-1): Employee 103 moved to "new delhi" from "chenni"
table name : dim_employee
3rd RUN(Change-2) : Assume the same 103 employee moved from "chenni" to "USA"
table name : dim_employee
Theory about SCD Types:
1) Type 1 :
http://www.kimballgroup.com/2008/08/slowly-changing-dimensions/
2) Type 2 :
http://www.kimballgroup.com/2008/09/slowly-changing-dimensions-part-2/
All 3 types.
3) http://www.1keydata.com/datawarehousing/slowly-changing-dimensions-type-3.html
2) Design : Dimension table design in data model (dim_employee)
3) ETL : Design transformation and load the dimension table
Dimension Lookup/Update
1) Provide all the db connections
2) Write the dimension table name (eg : dim_employee)
3) a) Surrogate key = primary key = emp_id_sk
b) version field holds the revision numbers for dimensions to keep track of history.
4) Keys
Stream field = emp_code and dimension field=emp_code
5) Fields
Download this example : Click here
Hope this helps some one :-)
Thanks for your time on this page :-)
What is SCD Type 2 ?
Software Requirements :
1) Pentaho Data Integration(a.k.a Kettle ETL) - 5.4.0
2) PostgreSQL database
3) Sample data
Problem Statement : (All the three points comes under)
1) Source data (Staging data environment) - Raw Data for employee staging table
Let's say table name is: emp_staging2) Problem scenario :
Track the location of employee when he moved to other place.(with the same emp_no )i.e., keep the old location value along with new location value (i.e., keep track of history - how ? )
Example :
If 103 numbered employee moves to a new location let's say "new delhi" keep both the records(old location and new location records) in the dimension table
CASE 1 : emp_staging (raw data) table after 103 employee moved to a new location "new delhi"
CASE 2 : emp_staging (raw data) table after 103 employee moved again to a new location "USA"
NOTE : employee can be shifted to "N" number of times to "N" number of locations.
Whenever employee moves to a new location keep all the locations in the dimension table along with new location.
3) Expected final output for dimension table
Let's say, the dimension table name is "dim_employee"ETL 1st RUN : (On raw data)
table name : dim_employee
ETL 2nd RUN (Change-1): Employee 103 moved to "new delhi" from "chenni"
table name : dim_employee
3rd RUN(Change-2) : Assume the same 103 employee moved from "chenni" to "USA"
table name : dim_employee
Theory about SCD Types:
1) Type 1 :
http://www.kimballgroup.com/2008/08/slowly-changing-dimensions/
2) Type 2 :
http://www.kimballgroup.com/2008/09/slowly-changing-dimensions-part-2/
All 3 types.
3) http://www.1keydata.com/datawarehousing/slowly-changing-dimensions-type-3.html
Technical Implementation of above example in Pentaho Data Integration (using Kettle ETL)
1) ETL : Prepare staging area (Sample data) in staging database. (table name : emp_staging)
2) Design : Dimension table design in data model (dim_employee)
3) ETL : load the dimension table (Run 1 , Run 2 and Run 3 , for this example).
1) ETL : Prepare staging area(Sample data) in staging database. (table name : emp_staging)
2) Design : Dimension table design in data model (dim_employee)
3) ETL : Design transformation and load the dimension table
Dimension Lookup/Update
1) Provide all the db connections
2) Write the dimension table name (eg : dim_employee)
3) a) Surrogate key = primary key = emp_id_sk
b) version field holds the revision numbers for dimensions to keep track of history.
4) Keys
Stream field = emp_code and dimension field=emp_code
5) Fields
Download this example : Click here
Hope this helps some one :-)
Thanks for your time on this page :-)
such a lovely post!!
ReplyDeleteMust have a look at Tips For Setting A Quick, Beautiful Outdoor Table