Pentaho Tools :

Pentaho C-Tools(CDE,CDF,CDA),Pentaho CE & EE Server,OLAP-Cubes,Analysis using Pivot4J, Saiku Analytics, Saiku Reporting, Ad-hoc Reporting using Interactive Reporting Tool,Dashboards,Reports using PRD, PDD,Data Integration using Kettle ETL,Data Mining usign WEKA,Integration of Servers with Databases,Mobile/iPad compatible Dashboards using Bootstrap Css,Drilldown dashboards,Interactive Dashboards

Wednesday, 23 December 2015

Tip : Difference between Text File Input and CSV File Input steps in Pentaho Data Integration(aka Kettle ETL)

Hi Folks,

A question makes us go in-depth of subject, isn't it ? Recently, One of my community folks asked a question on the differences between Text File Input & CSV file input ? Which one would you prefer to use if you are given a CSV file ?  
Hmmm..!!! Was trying to explain some bla stuff but later in the evening had a quick investigation on finding the differences. (An example speaks more than theory when understanding the concepts).

Here we go. ! Pentaho WIKI has detailed information about these two steps at

Text File Input : http://wiki.pentaho.com/display/EAI/Text+File+Input
CSV File Input : http://wiki.pentaho.com/display/EAI/CSV+Input

Here is my quick understanding on the conceptual things which you may find useful.

1) "Text File" input steps can read variety of files such as Text, CSV, Fixed length whereas CSV file  
     can read only read .csv extension files.

2) "Text File" input step is useful to add file information such as File Name, Extension, File URI(File directory and etc) which you can't do with CSV file.

3) What if you given multiple CSV files to read at a time ? Text file input step is capable of reading multiple files whereas CSV file only takes 1 file.

4)  CSV file is a subsidiary component of Text file input step.

5) There are few advantages of CSV file over text file.
*  Use CSV file input step if you are given single file to process at a time (for instance lookup
       files).
* When you use CSV file  input step it will faster the processing and hence increases the 
     performance.
*  It has NIO (Non blocking Input Output / Native System calls) - which is nothing but size of
    read buffer.
*  It represents the amount of bytes that is read in one time from disk. 
*  Parallel running: If you configure this step to run in multiple copies or in clustered mode, and you 
    enable parallel running, each copy will read a separate block of a single file allowing you to
   distribute the file reading to several threads or even several slave nodes in a clustered
   transformation

* Lazy conversion: If you will be reading many fields from the file and many of those fields will not be manipulate, but merely passed through the transformation to land in some other text file or a database, lazy conversion can prevent Kettle from performing unnecessary work on those fields such as converting them into objects such as strings, dates, or numbers.

Example : 




Download Example here : 
Click me .!!!
Change Files path to your system path.

More Information at NIO is at : 
https://en.wikipedia.org/wiki/Non-blocking_I/O_%28Java%29
 

How to create a sample Fixed length file using Excel ? 
create a fixed length text file from excel
https://www.google.co.in/?gws_rd=ssl#q=create+a+fixed+length+text+file+from+excel

 References :
http://stackoverflow.com/questions/21803856/pentaho-data-integration-csv-input-add-filename-column