Hi,
This post walks you through the basics of incrementally loading data from an OLTP (transactional) database table into an OLAP (data warehouse) dimension table.
Software used for this example :
1) Pentaho Data Integration CE 5.4 (or 6.0 CE)
2) PostgreSQL CE
Source data:
create database OLTP;
create table customer(c_id int not null primary key,c_name text,c_location text);
insert into customer values(1,'sadakar','hyderabad');
insert into customer values(2,'hasini','mncl');
insert into customer values(3,'anji','banglore');
insert into customer values(4,'venkat','chenni');
insert into customer values(5,'anjali','karimnagar');
select * from customer;
c_id c_name c_location
----------------------
1;"sadakar";"hyderabad"
2;"hasini";"mncl"
3;"anji";"banglore"
4;"venkat";"chenni"
5;"anjali";"karimnagar"
Scenario :
Let's say a new record is inserted into the customer table above. When the ETL transformation/job is run, only the newly inserted record should be loaded into the customer dimension table in the warehouse.
i.e.,
if 6;"soujanya";"banglore" is inserted into the table above, then only this record should be loaded into the customer dimension table in the warehouse.
Target :
create database OLAP;
CREATE TABLE dim_customer
(
customer_id INTEGER NOT NULL PRIMARY KEY
, customer_name TEXT
, customer_location TEXT
)
;
NOTE :
You can create the above dimension table manually in the database, or generate the DDL script from the SQL button of the Table Output step.
Steps :
1) Create two database connections (one for the OLTP database and the other for the OLAP database) in Kettle.
2) Drag and drop a "Table Input" step and write the below query against the OLTP connection.
SELECT * FROM customer
3) Drag and drop "Table Output" step and give all the properties in the window.
database name : OLTP , table name : dim_customer
Check Specify database fields
Table field Stream field
=========================
customer_id c_id
customer_name c_name
customer_location c_location
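With "Specify database fields" checked, Table Output builds the INSERT statement from this mapping itself; for every incoming row it conceptually executes something like the statement below (the ? placeholders are filled with the c_id, c_name and c_location values from the stream):
INSERT INTO dim_customer (customer_id, customer_name, customer_location)
VALUES (?, ?, ?);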
4) ETL 1st run : save the transformation and run it.
Observe the Step Metrics tab in the "Execution Results" pane (5 rows are read and 5 rows are written).
dim_customer table output after 1st ETL run
customer_id customer_name customer_location
====================================
1;"sadakar";"hyderabad"
2;"hasini";"mncl"
3;"anji";"banglore"
4;"venkat";"chenni"
5;"anjali";"karimnagar"
5) Now, drag and drop another Table Input step to find the maximum customer id from the "dim_customer" table, using the query below.
SELECT COALESCE(max(customer_id),0) FROM dim_customer
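The COALESCE is what makes the very first run work: on an empty dim_customer table, MAX(customer_id) returns NULL, and comparing c_id against NULL would match nothing. COALESCE substitutes 0, so the parameterized source query (added in step 7 below) pulls every row on the initial load:
-- on an empty dim_customer table
SELECT MAX(customer_id) FROM dim_customer;                -- NULL
SELECT COALESCE(MAX(customer_id), 0) FROM dim_customer;   -- 0, so "c_id > 0" matches all source rows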
6) Connect this step to the source Table Input step as shown below.
7) Edit the source Table Input step: in "Insert data from step", select the max-id Table Input step added above.
Also, importantly, change the SQL from
SELECT * FROM customer
to
SELECT * FROM customer WHERE c_id > ?
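Kettle fills the ? with the single value streamed in from the max-id step. On the second run, dim_customer already holds ids 1 to 5, so the query that actually executes against the OLTP database is effectively:
SELECT * FROM customer WHERE c_id > 5;   -- no rows yet, so nothing is written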
8) ETL 2nd run : save the transformation and run it.
In this case no new record has been inserted into the OLTP table, so no new records are loaded into the dimension table. You can verify this in the Step Metrics tab: the source Table Input reads 0 rows and the Table Output writes 0 rows.
dim_customer table output after the 2nd ETL run
customer_id customer_name customer_location
====================================
1;"sadakar";"hyderabad"
2;"hasini";"mncl"
3;"anji";"banglore"
4;"venkat";"chenni"
5;"anjali";"karimnagar"
9) Now, insert a few new records into the OLTP customer table, say:
insert into customer values(6,'soujanya','banglore');
insert into customer values(7,'soujanya','banglore');
insert into customer values(8,'kumar','bellampally');
insert into customer values(9,'srinivas','somagudem');
insert into customer values(10,'srinandan','warangal');
SELECT * FROM customer;
c_id c_name c_location
====================
1;"sadakar";"hyderabad"
2;"hasini";"mncl"
3;"anji";"banglore"
4;"venkat";"chenni"
5;"anjali";"karimnagar"
6;"soujanya";"banglore"
7;"soujanya";"banglore"
8;"kumar";"bellampally"
9;"srinivas";"somagudem"
10;"srinandan";"warangal"
i.e., rows 6 to 10 are the newly inserted rows in the OLTP customer table.
Now, when the ETL runs, only the rows with c_id from 6 onwards should be loaded.
10) ETL 3rd run : load the newly inserted OLTP rows into the OLAP dim_customer table.
Observe that the source Table Input step reads the 5 records that were not yet in the dim_customer table, and only these 5 new rows are loaded.
dim_customer table after 3rd ETL run
customer_id customer_name customer_location
=====================================
1;"sadakar";"hyderabad"
2;"hasini";"mncl"
3;"anji";"banglore"
4;"venkat";"chenni"
5;"anjali";"karimnagar"
6;"soujanya";"banglore"
7;"soujanya";"banglore"
8;"kumar";"bellampally"
9;"srinivas";"somagudem"
10;"srinandan";"warangal"
NOTE :
The same procedure can be applied to dates.
# If the OLTP table has a CREATED_ON column, find max(CREATED_ON) instead of the max ID.
# You can also use the combination of ID and CREATED_ON for incremental loading.
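As a rough sketch, assuming the OLTP customer table also had a CREATED_ON timestamp column (it does not in the schema used above, so the column name is only illustrative), the two queries would look like:
-- max-value lookup on the target (CREATED_ON is a hypothetical column)
SELECT COALESCE(MAX(created_on), TIMESTAMP '1900-01-01') FROM dim_customer;
-- source query; the ? is fed from the step above
SELECT * FROM customer WHERE created_on > ?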
Download the sample .ktr file : Click Me!
References :
Table Input 1 (Select MAX(Insert Timestamp) from DestinationTable ) ->
Table Input 2 ( Select * from SourceTable where Insert Timestamp > ? ; Insert Data from Table Input 1) ->
Table Output (Insert Rows into DestinationTable)
1) Table Input: Select MAX(ID) from LocalTable
2) Table Input: Select * from RemoteTable where ID>? (Insert Data from 1)
3) Table Output: All Fields into LocalTable
http://forums.pentaho.com/showthread.php?170386-Incremental-load
http://forums.pentaho.com/showthread.php?190987-Incremental-Load-in-Pentaho
http://forums.pentaho.com/showthread.php?93240-Loads-incremental-data-from-a-table-of-one-database-to-table-of-another
I hope this helps someone! Please do share this post if you found it useful.
Cheers..!!!