How to generate logs in Pentaho capturing, for each file, whether the SFTP put succeeded, for a job which loops to put dynamic files onto SFTP

I have a Pentaho job which has 10 IDs as input in the first transformation. In the next job I have a SQL query which needs to loop through each ID from the input.
So I am using Copy rows to result in the first transformation and Get rows from result in the next job, and I selected 'Execute for every input row' in the job properties of the second job to loop through every ID from the first transformation. In the next steps, the data from the SQL query in the second job is stored as files on the local machine, dynamically, in different folders based on a specific field from the telesis query.
Next, I need to send these files from the different folders on the local machine to the SFTP server through the 'SFTP Put' step.
Now I want to track logs with the following columns:
1. Number of files loaded onto the local machine for each ID from the input.
2. Number of files successfully loaded to SFTP from the local machine, to check whether all files loaded onto the local machine were sent to SFTP.
3. If a file was not sent to SFTP for any reason, the name of the file which failed to load to SFTP.
Thanks in advance.
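For illustration only, here is a standalone Python sketch (not a Pentaho step) of the per-ID log row described above; the folder layout and the set of successfully-put file names are placeholder assumptions about what the job would provide:

```python
import os

def build_log_row(input_id: str, local_folder: str, files_put_to_sftp: set[str]) -> dict:
    # Files that were written to the local folder for this ID.
    local_files = [
        f for f in os.listdir(local_folder)
        if os.path.isfile(os.path.join(local_folder, f))
    ]
    # Files that never made it to the SFTP server.
    failed = [f for f in local_files if f not in files_put_to_sftp]
    return {
        "id": input_id,
        "files_loaded_local": len(local_files),               # column 1
        "files_loaded_sftp": len(local_files) - len(failed),  # column 2
        "failed_files": ", ".join(failed),                    # column 3
    }
```

In Pentaho itself, the same three values could then be written to a log table or log file with an output step after the SFTP put.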

Related

Can I copy data table folders in QuestDb to another instance?

I am running QuestDB on a production server which constantly writes data to a table, 24x7. The table is partitioned daily.
I want to copy the data to another instance and update it there incrementally, since data from the old days never changes. Sometimes the copy works, but sometimes the data gets corrupted, reading from the second instance fails, and I have to retry copying all of the table data, which is huge and takes a lot of time.
Is there a way to backup/restore QuestDB without interrupting continuous data ingestion?
QuestDB appends data in the following sequence:
1. Append to column files inside the partition directory
2. Append to symbol files inside the root table directory
3. Mark the transaction as committed in the _txn file
There is no ordering between 1 and 2, but 3 always happens last. To incrementally copy data to another box you should copy in the opposite manner:
1. Copy the _txn file first
2. Copy the root symbol files
3. Copy the partition directories
Do this while your slave QuestDB server is down; then on start the table should have data up to the point when you started copying the _txn file.
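As a rough illustration of that copy order (not official QuestDB tooling), here is a Python sketch; the source and destination paths, and the assumption that everything in the table directory other than its root files is a partition directory, are placeholders:

```python
import shutil
from pathlib import Path

SRC = Path("/var/lib/questdb/db/my_table")   # placeholder source table directory
DST = Path("/backup/questdb/db/my_table")    # placeholder destination

def copy_table(src: Path, dst: Path) -> None:
    dst.mkdir(parents=True, exist_ok=True)

    # 1. _txn file first, so the restored table never claims more data
    #    than the files copied afterwards actually contain.
    shutil.copy2(src / "_txn", dst / "_txn")

    # 2. Symbol (and other) files sitting directly in the root table directory.
    for f in src.iterdir():
        if f.is_file() and f.name != "_txn":
            shutil.copy2(f, dst / f.name)

    # 3. Partition directories last.
    for d in src.iterdir():
        if d.is_dir():
            shutil.copytree(d, dst / d.name, dirs_exist_ok=True)

copy_table(SRC, DST)
```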

Snowflake COPY INTO Command return

I have a question about the Snowflake COPY INTO command; I searched but did not find my answer.
Suppose I want to push data from Snowflake to an S3 bucket using the COPY INTO command in my code. How will I know that the file is ready or the command is completed, so that I can read the file from the S3 location?
You can do the following things to check whether your COPY INTO was successful or at least to retrieve some useful information about your command:
Set DETAILED_OUTPUT = TRUE and check the result (this means you get information about every single unloaded file as output; if set to FALSE you only receive information about the whole unload process)
Query your stage by using the syntax that can be found here https://docs.snowflake.com/en/user-guide/querying-stage.html
Query the metadata of your staged data by using metadata$filename and metadata$file_row_number: https://docs.snowflake.com/en/user-guide/querying-metadata.html
Keep in mind that even a failed COPY-command can result in some unloaded files on your stage.
More information can also be found at https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html#validating-data-to-be-unloaded-from-a-query
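For example, here is a hedged sketch using the snowflake-connector-python package; the connection parameters, stage, and table names are placeholders, and the exact result columns returned with DETAILED_OUTPUT = TRUE should be checked against the docs linked above:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
try:
    cur = conn.cursor()
    # execute() blocks until the COPY INTO finishes, so once it returns
    # the unloaded files are already on the stage / in the S3 bucket.
    cur.execute("""
        COPY INTO @my_s3_stage/exports/
        FROM my_table
        FILE_FORMAT = (TYPE = CSV)
        DETAILED_OUTPUT = TRUE
    """)
    # With DETAILED_OUTPUT = TRUE, each result row describes one unloaded file.
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```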
It depends on how you're actually running this.
Any Snowflake interface will run synchronously, so the query will just spin until it's complete.
Any async call would need extra checks - the easiest one being the web interface (it will show the status of the query, and when it completes, the unload is complete).

Azure PowerShell command to get the count of records in an Azure Data Lake file

I have a set of files in an Azure Data Lake Store folder location. Is there a simple PowerShell command to get the count of records in a file? I would like to do this without using the Get-AzureRmDataLakeStoreItemContent command on the file item, as the files are gigabytes in size. Using this command on big files gives the error below.
Error:
Get-AzureRmDataLakeStoreItemContent : The remaining data to preview is greater than 1048576 bytes. Please specify a
length or use the Force parameter to preview the entire file. The length of the file that would have been previewed:
749319688
Azure Data Lake operates at the file/folder level. The concept of a record really depends on how an application interprets it. For instance, in one case a file may contain CSV lines, in another a set of JSON objects, and in some cases files contain binary data. Therefore, there is no way at the file-system level to get the count of records.
The best way to get this information is to submit a job, such as a U-SQL job in Azure Data Lake Analytics. The script will be really simple: an EXTRACT statement followed by a COUNT aggregation and an OUTPUT statement.
If you prefer Spark or Hadoop, here is a StackOverflow question that discusses that: Finding total number of lines in hdfs distributed file using command line
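Along the Spark route mentioned above, here is a minimal PySpark sketch of the same count-as-a-job idea; the adl:// path is a placeholder and assumes the cluster is already configured to read the Data Lake Store:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("record-count").getOrCreate()

# Treat each line of the file as one record; switch to spark.read.csv/json
# if "record" means something else for your data.
path = "adl://myadls.azuredatalakestore.net/folder/bigfile.csv"
print("record count:", spark.read.text(path).count())
```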

SSIS package which runs biweekly but there is no reverse out plan if it fails

Step 1 - This is an X job that creates the (b) job.dat file
Step 2 - This is an SSIS package that splits the output .dat file into 4 different files to send to the destination
Step 3 - Moves the four files from the work area to another location where MOVEIT can pick them up
***Step two is not restartable
***There is no reversing out if any of the steps fails
Note: What if I add an exception handler, or should I add a conditional split... any other ideas?
Batch Persistence
One thing you can do for starters is to append the file name(s) with a timestamp that includes the date-time of the last record processed (if timestamps do not apply, you can use an incrementing primary-key value). The batch identifiers could also be stored in a database. If your SSIS package can smartly name the files in chronological sequence, then the third step can safely ignore files that it has already processed. Actually, you could do that at each step. This would give you the ability to start the whole process from scratch, if you must do it that way.
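Here is a small standalone Python sketch of that naming idea (not SSIS itself); the "job_" prefix, timestamp format, and high-water-mark handling are illustrative assumptions:

```python
from datetime import datetime
from pathlib import Path

def batch_file_name(prefix: str, last_record_ts: datetime) -> str:
    # Name each output file after the last record it contains,
    # e.g. job_20240131T235959.dat
    return f"{prefix}_{last_record_ts:%Y%m%dT%H%M%S}.dat"

def unprocessed_files(folder: Path, high_water_mark: str) -> list[Path]:
    # A downstream step keeps the timestamp of the last file it handled and
    # skips anything at or before it; plain string comparison works because
    # the timestamp format sorts lexicographically.
    return sorted(
        f for f in folder.glob("job_*.dat")
        if f.stem.split("_", 1)[1] > high_water_mark
    )
```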
Ignorant and Hassle Free Dumping
Another suggestion would be to dump all data each day. If the files do not get super large, then just dump all the data for whatever it is you are dumping. This way each step would not have to maintain state, and the process could start/stop at any time.

How to see output in Amazon EMR/S3?

I am new to Amazon services and tried to run my application on Amazon EMR.
For that I followed these steps:
1) Created the Hive script, which contains a CREATE TABLE statement, a LOAD DATA statement loading a file into the table, and a SELECT * query.
2) Created the S3 bucket and loaded the objects into it: the Hive script and the file to load into the table.
3) Then created the job flow (using the Sample Hive Program), giving the input, output, and script paths (like s3n://bucketname/script.q, s3n://bucketname/input.txt, s3n://bucketname/out/). I didn't create the out directory; I think it will get created automatically.
4) The job flow then started to run, and after some time I saw the states STARTING, BOOTSTRAPPING, RUNNING, and SHUT DOWN.
5) During the SHUT DOWN state, it got terminated automatically, showing a FAILED status for SHUT DOWN.
Then, on S3, I didn't see the out directory. How do I see the output? I saw directories like daemons, nodes, etc.
And also, how do I see the data from HDFS in Amazon EMR?
The output path that you specified in step 3 should contain your results (From your description, it is s3n://bucketname/out/)
If it doesn't, something went wrong with your Hive script. If your Hive job failed, you will find information about the failure/exception in the jobtracker log. The jobtracker log exists under <s3 log location>/daemons/<master instance name>/hadoop-hadoop-jobtracker-<some Amazon internal IP>.log
Only one file in your logs directory will have its S3 key in the above format. This file will contain any exceptions that may have happened. You probably want to concentrate on the bottom end of the file.
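For example, here is a hedged boto3 sketch that lists the daemons prefix and picks out the jobtracker log; the bucket name and log prefix are placeholders for your own log location:

```python
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucketname", Prefix="logs/daemons/")

for obj in resp.get("Contents", []):
    key = obj["Key"]
    # The jobtracker log is the only key under this prefix matching the
    # hadoop-hadoop-jobtracker-<internal IP>.log pattern described above.
    if "jobtracker" in key and key.endswith(".log"):
        print("jobtracker log:", key)
```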