I am migrating data through Pentaho. A problem occurs when the number of rows is more than 4 lakhs (400,000): the transaction fails partway through. How can we migrate large data sets with the Pentaho ETL tool?
As basic debugging, do the following:
If your output is a text file or an Excel file, check the sizes of your string/text columns. By default the Text file output step takes the maximum string length, and when you start writing it can throw heap errors. So reduce the sizes and re-run the .ktr files.
If the output is a Table output step, again check the column data types and the maximum column sizes defined in your output table.
Kindly share the error logs if you think something else is going on. :)
I have a SQL notebook that transforms data and inserts it into another table.
I have a situation where I'm trying to change the stored block size in Blob Storage; I want fewer, bigger files. I have tried changing a lot of parameters.
I found the following behaviour.
When I run the notebook, the command creates files of almost 10 MB each.
If I create the table internally in Databricks and run another command:
create external_table as
select * from internal_table
the files are almost 40 MB each...
So my questions are:
Is there a way to set the minimum block size for external Databricks tables?
Are there best practices for transforming data in a SQL notebook? For example, transforming all the data and storing it locally, and only after that moving it to the external source?
Thanks!
Spark doesn't have a straightforward way to control the size of output files. One common method is to call repartition or coalesce with the desired number of files. To use this to control output file size, you need an idea of how many files you want to create; e.g. to create 10 MB files from 100 MB of output data, you could call repartition(10) before the write command.
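Since you are in a SQL notebook, a minimal sketch of the same idea in Spark SQL (assuming Spark 2.4+/Databricks, and reusing the internal_table/external_table names from your question) is the REPARTITION hint, which is the SQL equivalent of calling repartition(10) on a DataFrame before the write:

-- asks Spark to shuffle the result into 10 partitions, which typically yields ~10 output files
CREATE TABLE external_table AS
SELECT /*+ REPARTITION(10) */ *
FROM internal_table;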
It sounds like you are using Databricks, in which case you can use the OPTIMIZE command for Delta tables. Delta's OPTIMIZE will take your underlying files and compact them for you into approximately 1GB files, which is an optimal size for the JVM in large data use cases.
https://docs.databricks.com/spark/latest/spark-sql/language-manual/optimize.html
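A minimal sketch, assuming external_table (name taken from the question) is stored as a Delta table:

-- compacts the table's small underlying files into larger ones (roughly 1 GB target)
OPTIMIZE external_table;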
I have a set of files in an Azure Data Lake Store folder. Is there any simple PowerShell command to get the count of records in a file? I would like to do this without using the Get-AzureRmDataLakeStoreItemContent command on the file item, as the files are gigabytes in size. Using this command on big files gives the error below.
Error:
Get-AzureRmDataLakeStoreItemContent : The remaining data to preview is greater than 1048576 bytes. Please specify a
length or use the Force parameter to preview the entire file. The length of the file that would have been previewed:
749319688
Azure Data Lake operates at the file/folder level. The concept of a record really depends on how an application interprets it. For instance, in one case a file may hold CSV lines, in another a set of JSON objects, and in some cases binary data. Therefore, there is no way at the file system level to get the count of records.
The best way to get this information is to submit a job such as a USQL job in Azure Data Lake Analytics. The script will be really simple: An EXTRACT statement followed by a COUNT aggregation and an OUTPUT statement.
If you prefer Spark or Hadoop, here is a Stack Overflow question that discusses that: Finding total number of lines in hdfs distributed file using command line
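If a Spark cluster (e.g. Azure Databricks or HDInsight) is available, one hedged sketch is Spark SQL's direct file query; the store name and path below are hypothetical, and this counts lines, which matches the record count only for line-delimited formats such as CSV:

-- the text source treats each line of the file as one row
SELECT COUNT(*) AS line_count
FROM text.`adl://mystore.azuredatalakestore.net/folder/myfile.csv`;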
I have a transformation that is successfully writing the first row to the log file.
However the same transformation is not writing the first row to a text file.
The text file remains blank.
Does anyone know why this may be?
Edited: only focusing on the 'applications to run' and 'set pm variable' transformations, as the other transformations are replications of 'set pm variable' but for different fields.
It looks like your Set Variables step is distributing its rows over the two follow-up steps in a round-robin way, which is the default setting in PDI.
Right-click the Set Variables step and under Data Movement, select Copy. That will send all rows to BOTH steps. You should see a documents icon on the hops then.
I am using Cloudera's Hue. In the file browser, I upload a .csv file with about 3,000 rows (my file is small, under 400 KB).
After uploading the file I go to the Data Browser, create a table and import the data into it.
When I go to Hive and perform a simple query (say SELECT * FROM table) I only see results for 99 rows. The original .csv has many more rows than that.
When I do other queries I notice that several rows of data are missing although they show in the preview in the Hue File Browser.
I have tried with other files and they also get truncated sometimes at 65 rows or 165 rows.
I also removed all the commas from the .csv data before uploading the file.
I finally solved this. There were several issues that appeared to cause the truncation.
The main one was that the column types automatically set when importing the data were inferred from the first lines. So when the data type changed from TINYINT to INT, values got truncated or changed to NULL. To solve this, perform EDA and set the data types explicitly before creating the table.
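As a hedged illustration of that fix in Hive DDL (the table and column names below are made up, not from the original data), declare the numeric columns with the wider type up front instead of letting the import wizard infer TINYINT from the first rows:

-- explicit types prevent values larger than an inferred TINYINT from becoming NULL
CREATE TABLE my_table (
  id INT,
  amount INT,
  description STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';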
Other issues were that the memory I had assigned to the virtual machine slowed down the preview process, and that the csv contained commas. You can give the VM more memory or change the csv to tab-separated.
I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in some folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) picks a file based on the output of a SQL query. If the query output is abc.xlsx, the Microsoft Excel Input step should pick up abc.xlsx for further processing. How do I achieve this? Would really appreciate your help. Thanks.
Transformations in Kettle run asynchronously, so you're probably going to need a job for this.
Files to create
Create a transformation that performs the SQL query you're looking for and populates a variable based on the result
Create a transformation that pulls data from the Excel file, using the populated variable as the filename
Create a job that executes the first transformation, then steps into the second transformation
Jobs run sequentially, so it will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and retrieved in different transformations because of their asynchronous nature. This is the reason for the second transformation; the job won't step into the second transformation until the first one is done running (therefore, not until the variable is populated).
This is all assuming you only want to run the transformation once, expecting a single result from the query. If you want to loop it, pulling data from a set, then setup is a little bit different.
The Excel input step has an 'accept filenames from previous step' option. You can have a Table input step build the full path of the file you want to read (or you can somehow build it later, knowing the base directory and the short filename), pass the filename to the Excel input, tick that box, and specify the step and the field you want to use for the filename.
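A minimal sketch of such a Table input query (the control table, column, and base directory here are hypothetical; only the abc.xlsx filename comes from the question, and the concatenation syntax depends on your database):

-- builds the full path that the Excel input step will receive as its filename
SELECT CONCAT('/data/excel/', file_name) AS filename
FROM etl_control
WHERE file_name = 'abc.xlsx';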