Pentaho Tools :

Pentaho C-Tools(CDE,CDF,CDA),Pentaho CE & EE Server,OLAP-Cubes,Analysis using Pivot4J, Saiku Analytics, Saiku Reporting, Ad-hoc Reporting using Interactive Reporting Tool,Dashboards,Reports using PRD, PDD,Data Integration using Kettle ETL,Data Mining usign WEKA,Integration of Servers with Databases,Mobile/iPad compatible Dashboards using Bootstrap Css,Drilldown dashboards,Interactive Dashboards

Thursday, 9 July 2015

Pentaho Data Integration Basics - 7 - (DWH) : Implementing SCD Type-II dimension table in data warehouse

In this post we will learn how to implement Slowly Changing Dimension (SCD) Type-II

What is SCD Type 2 ?

Software Requirements : 
1) Pentaho Data Integration(a.k.a Kettle ETL) - 5.4.0
2) PostgreSQL database
3) Sample data

Problem Statement :  (All the three points comes under)

1) Source data (Staging data environment)  - Raw Data for employee staging table

Let's say table name is: emp_staging

2) Problem scenario :

 Track the location of employee when he moved to other place.(with the same emp_no )
 i.e., keep the old location value along with new location value (i.e., keep track of history - how ? )

Example :
If 103 numbered employee moves to a new location let's say "new delhi" keep both the records(old location and new location records) in the dimension table

CASE 1 : emp_staging (raw data) table after 103 employee moved to a new location "new delhi"


CASE 2 : emp_staging (raw data) table after 103 employee moved again to a new location "USA"
NOTE : employee can be shifted to "N" number of times to "N" number of locations.
Whenever employee moves to a new location keep all the locations in the dimension table along with new location.

3) Expected final output for dimension table

 Let's say, the dimension table name is "dim_employee"

ETL  1st RUN : (On raw data)
table name : dim_employee

ETL 2nd RUN (Change-1): Employee 103 moved to "new delhi" from "chenni"
table name : dim_employee

3rd RUN(Change-2) : Assume the same 103 employee moved from "chenni" to "USA"
 table name : dim_employee

Theory about SCD Types: 
1) Type 1 :
2) Type 2 :

All 3 types.

Technical Implementation of above example in Pentaho Data Integration (using Kettle ETL)

1)  ETL :  Prepare staging area (Sample data) in staging database.  (table name : emp_staging)
2)  Design : Dimension table design in data model (dim_employee)
3)  ETL : load the dimension table (Run 1 , Run 2 and Run 3 , for this example). 

1)  ETL : Prepare staging area(Sample data) in staging database. (table name : emp_staging)

2)  Design : Dimension table design in data model (dim_employee)

3)  ETL : Design transformation and load the dimension table 

Dimension Lookup/Update

1) Provide all the db connections
2) Write the dimension table name (eg : dim_employee)
3) a) Surrogate key = primary key = emp_id_sk
    b) version field holds the revision numbers for dimensions to keep track of history.
4) Keys
     Stream field = emp_code and dimension field=emp_code
5) Fields

Download this example :  Click here

Hope this helps some one :-)

Thanks for your time on this page :-)