Pentaho Tools :

Pentaho C-Tools(CDE,CDF,CDA),Pentaho CE & EE Server,OLAP-Cubes,Analysis using Pivot4J, Saiku Analytics, Saiku Reporting, Ad-hoc Reporting using Interactive Reporting Tool,Dashboards,Reports using PRD, PDD,Data Integration using Kettle ETL,Data Mining usign WEKA,Integration of Servers with Databases,Mobile/iPad compatible Dashboards using Bootstrap Css,Drilldown dashboards,Interactive Dashboards

Tuesday 17 March 2015

Pentaho Data Integration(CE): Basics-4 : Variables in Pentaho Data Integration (a.k.a Kettle ETL)

In this post you will find the basic understanding of variables in pentaho kettle ETL.

Ways to define variables in Kettle

  • Variables can be used throughout Pentaho Data Integration, including in transformation steps and job entries.
1) Variable can be defined as parameters(Named parameters)
2) Variables can be defined in Kettle properties file with key value paired syntax
3) Also in ,Previous job or transformation (if  transformation is going to be called from a job)
4) Variables section of the "Execute transformation window" or "Execute job" window when click on Run button.

How to use variables ?

    • Grabbing variables using the Get Variable step  OR
    • By specifying meta-data strings like 
      • ${VARIABLE} - Windows Style (OR)
      •  %%VARIABLE%% - Linux Style

 Refer below links for more information on variables 
1) http://wiki.pentaho.com/display/COM/Using+Variables+in+Kettle
2) http://wiki.pentaho.com/display/EAI/.07+Variables

In addition to above links, below information will be helpful to understand.

Variables in kettle.properties file with an example:

1) Define variables as key value pairs in kettle.properties file.
2) Location of the file in windows is : C:\Users\sadakar.p.SADAKAR_HYD\.kettle & the file name is kettle.properties
 Example : 

3) For example, You want to provide database connection details through variables. Define the configuration details as shown in above image in kettle.properties file.
Connecting to foodmart database of postgresql server
i.e., HOST=localhost
       DATABASE=foodmart
     USER_NAME=postgres

4) Now take a new transformation & get "Table Input" step from Input section category to the canvas
5) Double click to get it's properties and provide the above created variables in the respective placeholders.
6) Use Ctrl+Space button to get the list of all variables. Find the variables coming from kettle.properties and provide them in the text boxes.
6) Test the connection & see the magic..

Variables as parameters :

1) You can also use variables as parameters.
2) Example : select all the employees belonging to a particular country.. i.e., every time you pass a country dynamically you have to the results from employee table.
3) Take a new transformation and save it.. Press control+T to get the properties of transformation and define a parameter. For example : Define "VAR_COUNTRY" as a parameter and provide default value as "Mexico"
4) Now take a "Table input component" and give all the connections of postgres foodmart database and test it.
5) In query area write this select statement :
SELECT * FROMcustomer WHERE country like '${VAR_COUNTRY}'

NOTE : You should check "Replace Variables in Script"

6) Complete the transformation by adding a dummy step to the input step and have a preview.
7) select the dummy step to view it's output in "Execution Results" step. You will find the only "Mexico results "



8) If you pass other parameter value like "USA" when you run the transformation, you will all the result set for "USA" country. 

Variables section of "Execute transformation window" or "Execute job window" when click on Run button. 

Take the same example of above scenario with a small change.. 
Remove parameter from transformation settings and save it. Execute the transformation by clicking "Run button". In "Execute a transformation" window you can find the parameter used in the query. 
Provide the value for that parameter and test the output. 

the scenario is shown in below image.

Tuesday 3 March 2015

Pentaho Data Integration(CE): Basics-5 : Scheduling Community Kettle ETL Jobs in Winodws 7 or 8

In this post we will see the basics of how to schedule a Job in Windows environment using "Windows Task scheduler".

Prior to scheduling jobs or transformations either in linux or windows one should know how to execute jobs/transformations from command line by passing arguments or parameters or variables.

1) Design & develop transformation or job.
2) Make transformation or job dynamic using parameters or variables.
3) Execute the transformation or job from command line by passing parameter or variables.
4) Prepare a list of commands in a batch file.
5) Schedule the .bat file in windows using "Task Scheduler" Tool.


Transformations are used to process the rows of data flow whereas jobs are used for high level flow control hence we will be designing a transformation to process some data & then use this transformation in job.
Finally we will be seeing how to schedule job in Windows.
 
 Design & develop transformation or job

A Simple scenario : select table content from database and store in a file. Each run should fetch different table content by passing table name as a parameter.

Transformation Design :  Image-1
 
 Job Design : Image-2

Transformation :
1) Click Crtrl+N to create a new transformation and save it as "Parameters.ktr"  and then press "Ctrl+T" to get its settings.
2) Define a parameter "TABLE_NAME" under parameters section  and provide description & default value.
3) Take a "Table input" component and connect to foodmart database by providing all the details.
4) Write a simple query  : For example : SELECT * FROM ${TABLE_NAME}
5) Check "Replace Variables in Script"
6) Add dummy & Text files as shown in image-1
7) Execute it from GUI and command prompt by passing different values for table names.
For example on the command prompt: 

D:\Softwares Archive Installed\Pentaho\pdi-ce-5.2.0.0-209\data-integration>Pan.bat /file:C:\Users\sadakar.p\Desktop\Parameters.ktr "-param:TABLE_NAME=sales_fact
_1997"
 


8)

9) Repeat step-8 & step-9 with different value (i.e., by passing different table name) on the command prompt.

Job
1) Ctrl+Alt+N to create a new job and save it as "Parameters_job.kjb" as shown in image 2.
2) Ctrl+J to get job settings and define parameter as "TABLE_NAME".
3) Connect START job entry -> Transformation  entry -> success entry as shown in image-2
4) Double click on "transformation" entry & provide the path of the transformation then go to parameters tab and click on "Get parameters" button.
5) Now, save the job & run from GUI and command prompt.
6) For example : 
D:\Softwares Archive Installed\Pentaho\pdi-ce-5.2.0.0-209\data-integration>Kitchen.bat /file:C:\Users\sadakar.p\Desktop\Parameters_job.kjb "-param:TABLE_NAME=sa
les_fact_1997"



7) Now you can conclude that the jobs and transformations are running fine.

Preparing a batch file :

d:
cd /d "D:\Softwares Archive Installed\Pentaho\pdi-ce-5.2.0.0-209\data-integration\"
call Kitchen.bat /file:E:\Explore\Kettle\2_WindowsScheduling\Parameters_job.kjb "-param:TABLE_NAME=region" -logfile=E:\Explore\Kettle\2_WindowsScheduling\log.txt
exit

 


Windows Scheduling :
1) Go to "Task Scheduler" in windows (simply type it in search box) and click on "Create Basic Task".
2)


3)

4)

5)

6)

7) Next click on finish

8)

At the time of execution it'll pop a window saying that the execution has started.


In this way we can schedule Kettle community jobs in windows 7 or 8 .

:-) comments & suggestions are most welcome to improve this article :-)

Thank you for spending your time on this page.

Monday 2 March 2015

Pentaho Data Integration : Basics-3 : Parameters in Pentaho Kettle ETL

In this post we will see the basics of parameters in Pentaho Data Integration(a.k.a Kettle ETL)

Software Setup for this tutorial is :

Kettle ETL - Spoon : 5.2.0.0 CE


Quick navigation steps : 

   Parameters
    1. How to define ?
      • In transformation ?
      • In Job ?
    2.  Passing values to parameters ?
      • From GUI of transformation or job.
        • How to assign values to parameters ?
        • How to access parameters ? Syntax.
      • From Command Line
    3. Example 
      1. Parameters usage in a Transformation (Explained GUI and Command line in detail). 
      2. Parameters usage in a Job.

Parameters:

  • Parameters are local variables to the transformation or to the job.
  • How to Define ?

    •  In Transformation :  
    •  Right-click on the transformation workspace and select Transformation settings from the context menu (or just press Ctrl-T), then click on the Parameters tab.
                               Image-1: Setting parameters in Transformation 
           
    • In Job: 
    •  Right-click on the Job workspace and select Job settings from the context menu (or just press Ctrl-J), then click on the Parameters tab.
                               Image-2: Setting parameters in Job 
                              

Note:

  • If the name of the parameter(named parameter) and  variable is same then parameter executes first  
  • B'z parameters are local to the transformation or job whereas variables are like global to transformations or jobs.
  • Parameters are variables(but local to transformation or  job)

Passing Values to Parameters : 

There are two ways to pass values for parameters

1) FROM GUI  of transformation [Image1 & Image 2 gives GUI way of passing values from parameters]

  • How to assign values to parameters ?

    • It offers setting of default value & description for the parameter
                      Parameter                                  Default Value                            Description
                       INPUT_FILE                           E:\Test\HOST\file1.txt                     Input File 1
                       MASTER_HOST                192.168.1.3                            Master host name

  • How to Access parameters defined & assigned in transformations or jobs? 
    • Using ${Parameter_Name}  
      • Example :
        ${ MASTER_HOST }

 2) FROM command line 


Pan is for 
Transformations
Kitchen is for
Jobs
NOTE :
Windows requires you to use quotes around the parameter otherwise the equals sign is
 treated as a space by the command interpreter

Listing parameters on command line for Transformation 
Win
>pan.bat /file:<directory>\<transformation name>  -listparam
>pan.bat /file:”D:\tmp\foo.ktr” -listparam
Linux
sh.pan.sh -file=<directory>/<transformation name> -listparam
user@host:$ sh pan.sh -file=/tmp/foo.ktr -listparam

OUTPUT
Parameter: MASTER_HOST=, default=localhost : The master slave server hostname to connect to
Parameter: MASTER_PORT=, default=8080 : The master slave server HTTP control port

Listing parameters on command line for Job
Win
>Kitchen.bat /file:<directory>\<transformation name>  -listparam
>Kitchen.bat /file:”D:\tmp\foo.kjb” -listparam
Linux
sh.Kichen.sh -file=<directory>\<transformation name> -listparam
user@host:$ sh Kitchen.sh -file=/tmp/foo.kjb -listparam

OUTPUT
Parameter: MASTER_HOST=, default=localhost : The master slave server hostname to connect to
Parameter: MASTER_PORT=, default=8080 : The master slave server HTTP control port


Passing parameter values from command line for Transformation 
Win
c:\> pan.bat -file:”D:\tmp\foo.ktr” “-param:MASTER_HOST=192.168.3.7 
 -param:MASTER_PORT=5432
Linux
c:\> pan.sh -file=/tmp/foo.ktr -param:MASTER_HOST=192.168.3.7 -param:MASTER_PORT=5432

Passing parameter values from command line for Transformation 
Win
c:\> Kitchen.bat -file:D:\tmp\foo.ktr -param:MASTER_HOST=192.168.3.7 
 -param:MASTER_PORT=5432
Linux
user@host:$ sh Kitchen.sh -file:/tmp/foo.ktr -param:MASTER_HOST=192.168.3.7 -param:MASTER_PORT=5432

NOTE :

  •   In above syntax's tested only for "Windows" machine and assuming the linux commands will work. 
  • Use double quotes when there is a space in the folder names.

 Example :1) Parameters usage in a Transformation


Problem statement :
Connect to "foodmart" database of postgresql server by passing HOST and PORT as parameters.

First we will be seeing the usage of parameters in a transformation. 
1. Defining parameters
a) Ctrl+N to design a new transformation and save it then Press Ctrl+T to get transformation properties to define parameters.
b) Give the parameter names ( eg : DEV_DB_HOST and DEV_DB_PORT)
c) Drag and drop Table input step from "Input" category.
BEST PRACTICE :
a) Always give parameter or variable names in CAPITAL letters.
b) By default Kettle converts all variables in to capital letters 

2 Accessing parameters in "Table input" step as variables(b'z parameters are local variables to.
a) Double click the "Table input step" and click on New button to connect postgresql foodmart database.
b) In place of "HOST Name" and  "Database Name" provide parameters with
     ${DEV_DB_HOST} and ${DEV_DB_PORT}
c) Test the connection by clicking "Test" button.
d) Write your queryin "SQL" area.

 HINTS:
1) Once the parameters creation is done, those parameters will be available in the drop down of variable selection.
2) You can use Ctrl+Space button to select from the drop down instead typing manually.

 3)Executing from GUI
a) Save your transformation and click on "Run" button which will open "Execute a transformation" window.
b) In window you can find "Parameters" section and it'll display the already set parameters list with it's default value.
c) You can specify the parameters values here if you want to pass any other. If not the default values will be assigned. 
d) Check the log of transformation for success information or for any error messages in "Execution Results" tab.  

4) Command Line - Listing parameters
a) Refer the syntax's and examples explained in this post (at top).
b) Let us see what are the parameters we are taking in the transformation from command prompt.
c) We use "-listparam" to list all the parameters that is taking by transformation with it's name, default value and description.
d) Example of this transformation : 
 i) Navigate to the Pentaho Data integration installation folder and find Pan.bat file. (Pan is a transformation tool to run it from command line).
ii) Now issue this command : >Pan.bat /file:E:\ParametersDemo.ktr -listparam
iii) Check the output on the command which looks like below
Parameter:DEV_DB_HOST=,default=localhost:Postgres sql installed on this host
Parameter: DEV_DB_PORT=,default=5432:Postgresql port number

NOTE :
If there is any space in your folder names , enclose the file path in double quotes.
Example : >Pan.bat /file:”E:\tmp folder\foo.ktr” -listparam  ( there is a space b/w tmp and folder)
5) Command Line - Executing transformations from command line by passing parameters
a) Once the list parameters test is completed let's pass the parameters from command line.
b) Command : 
>Pan.bat /file:E:\ParametersDemo.ktr  "-param:DEV_DB_HOST=localhost"  "-param:DEV_DB_PORT=8080" 
c) NOTE : Enclose parameters in double quotes for windows command line execution. 

Example : 2 : Parameters usage in a Job ( Pass parameters from job to transformation)

GUI Execution : 

1) In this example we'll see how to pass parameters from a job to transformation. 
2) Press Ctrl+Alt+N to create a new job and save it 
3) Drag start, tansforamtion, success job entries and connect them as shown in below image.

4) Press Ctrl+J to open job settings and go to Parameters tab and give parameter names 
5) Now, double click the transformation entry to locate the transformation(the one we developed in example-1)
NOTE : Remove the parameters in transformation in its setting tab - This is optional. 

6) Keep on the same window opened and go to "parameters section"  and provide parameters that need to pass down to the job. 


NOTE : 
1) You can create N number of parameters in job but can send limited (As per your requirement). 
2) Will update on "Stream column name" and value  in soon. 

7) Now save your job and run the the job. 

Command Line Execution : 

Listing parameters

Passing parameters from command line to the job.

Download Section : 

References :

http://wiki.pentaho.com/display/EAI/Named+Parameters