Why BigtableIO writes records one by one after GroupBy/Combine DoFn? - bigtable

Is someone aware of how the bundles are working within BigtableIO? Everything looks fine until one is using GroupBy or Combine DoFn. At this point, the pipeline would change the pane of our PCollection element from PaneInfo.NO_FIRING to PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0} and then BigtableIO will output the following log INFO o.a.b.sdk.io.gcp.bigtable.BigtableIO - Wrote 1 records. Is the logging causing a performance issue when one have millions records to output or is it the fact that BigtableIO is opening and closing a writer for each record?

BigtableIO sends multiple records in a batch RPC. However, that assumes there there are multiple records sent in the "bundle". Bundle sizes are dependent on a combination of the step before hand, and the Dataflow framework. The problems you're seeing don't seem to be related to BigtableIO directly.
FWIW, here's the code for logging the number of records that occurs in the finishBundle() method.

Related

How to set KOFAX KTM Server global variable value which will be initialized in Batch open, updated in SeparateCurrentPage & used in BatchClose?

I am trying to count a specific barcode value from Project.Document_SeparateCurrentPage and use it in BatchClose to compare if the count is greater than 1 and if it is >1 then send the batch to a specific queue with specific priority. I used a global variable in KTM Project Script to hold the count value which was initialized to 0 in Batch open. It worked fine until unit testing. But our automation team found that out of 20 similar batches, few batches were sent to the queue where the batch should go only if the count satisfies the greater than one condition, though they used only one barcode.
I googled and found that KTM Server script events do not allow to use shared information in different processes(https://docshield.kofax.com/KTM/en_US/6.4.0-uuxag78yhr/help/SCRIPT/ScriptDocumentation/c_ServerScriptEvents.html). Then I tried to use a batch field to hold the barcode count but unable to update its value from Project.Document_SeparateCurrentPage function using pXRootFolder.Fields.ItemByName("BatchFieldName").Text = "GreaterThanOne". The logs show that the batch reads the first page three times and then errors out.
Any links would help. Thanks in advance.
As you mentioned, the different phases of batch/document processing can execute in different processes, so global variables initialized in one event won’t necessarily be available in others. Ideally you should only use global variables if their content can be set from Application_InitializeScript or Application_InitializeBatch, because these events occur in each separate process. As you’ve found out, you shouldn’t use a global variable for your use case, because Document_SeparateCurrentPage and Batch_Close for one batch may occur in different processes, just as the same process will likely execute those events for multiple batches.
Also, you cannot set batch fields from document level events for a related reason: any number of separate processes could be processing documents of a batch in parallel, so batch level data is read-only to document events. It is a bit unintuitive, but separation is a document level event even though it seems like it is acting on the whole batch. (The three times you saw is just an error retry mechanism.)
If it meets your needs, the simplest answer might be to use a barcode locator as part of normal extraction (not just separation), and assign to a field if needed. While you cannot set batch fields from document events, you can read document data from batch events. So instead of trying to track something like a count over the course of document events, just make sure whatever data you need is saved at a document level. Then in a Batch_Close you can iterate the documents and count/calculate whatever you need. (In your case maybe the number of locator alternatives for the barcode locator, across each document.)

Pentaho : Executing a Job multiple times with different Params - Params are not sent over to Job

I have a Pentaho "Job" that should be executed multiple times, each time with a different param value.
I indeed have a similar set up as we see in this link https://help.pentaho.com/Documentation/8.1/Products/Data_Integration/Transformation_Step_Reference/Job_Executor
The Job does get executed as many times as the number of rows, although the different parameter values are not passed onto the Job. If I add a write to Log in the job it does not print the different values.
Figured it was a defect in 8.1.x. Upgraded to 8.2.x and it is resolved.

Combine the Output of 3 Transformation in Pentaho

I'm executing 3 transformations in parallel. the o/p of three transformation contains same column names.
I've added output of all transformation to common dummy step in job and also added WaitForSql step to wait until all 3 transformations have completed execution, and also added unique step in next transformation to remove duplicate records.
All works proper till WaitForSQL, but when next transformation gets rows from result and performs Unique step I get duplicate records also when I perform Unique step.
Has anyone solution for this issue, plz reply.....
You have to sort your resulting stream after the dummy step before removing the duplicate rows. The sort will also make sure that all 3 streams are completed before sorting.
I didn't know you could use the dummy step to combine stream results. I always used the append streams-step for that.
Several points:
You cannot simply combine the outputs of multiple transformations at the job level. You wil need another transformation to read the data using the Get rows from result; jobs don't know about data streams, they only know about tasks (job entries) and exit status.
Be careful with "Launch next entries in parallel" at the job level. Lets say you have 2 transformations, trans1 and trans2, launched in parallel, followed by a dummy step. The dummy will be called TWICE, once after trans1 finishes and another when trans2 finishes. A job hop is not a data stream, it's a workflow. If you want to run transformations in parallel and later go back to a single workflow you need a subjob that calls the transformations and doens't have a Success job entry. That way, the subjob only finishes after the 2nd transformation finishes and only then it goes to the dummy step in the parent.
Why do you need those transformations running inside a job? If they have the same column structure, why don't you call them as sub transformations inside a transformation, and not a job? Steps in a transformation are always launched in parallel, so if you're parallelizing things for performance, a transformation is the way to do it, not a job. A job is meant to run multiple tasks in sequential order, one after the other, with workflow control depending on the result of the previous step.
If you want the Output from three of them into a single file, then you could run the same 3 instances in a single transformation with the append to target option ticked in your output step

SSRS Data-Driven Subscription [based on static Subscription table] Not Picking Up Changes Made to Subscription Table

I have a .RDL report which I designed in BIDS and have deployed to my report server. The report asks for three parameters before viewing report: Year, Month and Customer ID. The report works great and does exactly what it is supposed to.
While I used to run each report individually because there were 2-3 customers, now there are 30+ customers who receive the report, so I wanted to switch to a more automated fulfillment method to get the reports generated. After doing some research it appears that a using Report Manager to create a "Data Driven Subscription" (DDS) using the "Windows File Share" option gives me the capabilities I need.
As part of creating the DDS, I created a table called [Subscription] which is a table containing one row for each customer receiving the report and has the following columns:
Year
Month
CustomerID
FileName
FileLocation
Overwrite
Format
...so through using the DDS Wizard in Report Manager, I was able to successfully set up a Data Driven Subscription (which is linked to various columns in the [Subscription] table) which creates a new report for each customer in the [Subscription] table, saves [and overwrites, if necessary] it in a location of my choosing as a PDF (specified in [Subscription].[FileLocation], or the FileLocation column of my table for each row), and runs every minute (I plan on changing frequency to once a week, eventually).
This works flawlessly, giving me a new set of 30 reports in the directory of my choosing, with each report having a name I assigned in the FileName column of my table. Exactly what I was looking for.
HERE'S THE PROBLEM: When I update the FileLocation or FileName (or anything, really) in the [Subscription] table - it doesn't pick up the changes right away. Sometimes it doesn't even pick it up at all (for example I updated the [ReportName] column for one customer from Report_711622 to SpecialReport_711622, so that the output file for that customer should be named SpecialReport_711622 while all of the other reports should be called Report_XXXXX [no Special prefix]. But the file name of report for Customer 711622 remains the same!
It's almost like the job only see's what it needs to do once a day, and then does not go back and reference the [Subscription] table until I leave for the night, then when I come back in the morning it picks up the change.
Since I am about to scale this process out to a large customer-base using a different report, I need to be able to make edits to the [Subscription] table and have them get picked up by the Data Driven Subscription immediately (and if not immediately, at least a fixed interval of time that I can adjust, so that I can know 100% when the change will get picked up).
Does anyone know what's causing my lag? How do I change it so that updates to the Subscription table get picked up regularly? I'm also having issues with creating new DDS on other reports (following the exact process outlined above) - I've created the subscriptions, for every minute, and it says they are running and the number of outputs match the number of customers with 0 errors, but there are no files in the drive I specified (or anywhere else I've looked, for that matter).
Any help would be greatly appreciated!
I think the answer lies in the mechanism SSRS uses. There are a few places "lag" can occur.
The subscription is in fact an SQL Agent job which creates a record in the Event table. This table is a queue that SSRS checks to do scheduled tasks.
There is a small amount of time between the moment the subscription creates the Event record and the moment SQL reads it and starts creating the dataset for your DDS. The creation of the DDS dataset takes some time, too. In this time, the subscription will be in the Pending state. If you change anything in the data during this time, The subscription will still use the old data as report parameters. So obviously you will not notice your change until the next scheduled run.
Which brings me to the following: if a subscription is still being run and the next schedule kicks in (chances are, because yours runs every minute), the engine will not execute it, but wait for the next subscription schedule, and so on. So that's another possibility of lag - and cause of missing reports for a certain schedule minute. The subscription processes reports sequentially, one row from your DDS recordset at a time. Again, this takes some time. You can also see that in the subscription window when it says: # of # processed.
I suggest you look at the Event table in the database ReportServer during an execution. Also the ExecutionHistory views (there are 3) may be interesting. A scheduled run shows up as a RequestType = 1 and generates one record for each report. You can see the exact timing and parameters of each report that is run in the subscription. You may be able to extract the data you need to resolve your other issues.
EDIT: Here is a more elaborate guide to DDS data and events
http://blogs.msdn.com/b/deanka/archive/2009/01/13/diagnosing-and-troubleshooting-subscriptions.aspx
http://blogs.msdn.com/b/deanka/archive/2010/02/16/troubleshooting-subscriptions-part-ii-using-the-report-services-trace-log-file.aspx
Could this "Double-Hop" problem be the source of my issues? I'm so stuck on this one!
The Double-Hop Problem - MSDN Knowledgecast

Process Each Row in Kettle ONE AT A TIME?

I was wondering if it is possible to work on a per row basis in the kettle?
I am trying to implement a reporting scheme which consists of a table, where the requests get queued for processing and then the Pentaho job that picks up the records on that table.
my job currently has 3 transformations in it,
1st is to get records from the queued requests table
2nd is to analyze the values on each record and come up with multiple results based on that record. for example, a user would request to have records of movies of the horror genre. then it should spit out the horror movies
3rd is to further retrieve the information about the movies such as the year, director and etc, which is to be outputted to an excel file.
this is the idea, but it's a bit challenging doing it in Pentaho as it does stuff all at the same. is there a way that I can make my job work on records one by one?
EDIT.
Just to add, I have been trying to extend the implementation of the Pentaho cookbook sample but if I compare to my design, its like step 2 and step 3 only.
I can't seem to make the table input step work one at a time.
i just made it act like the implementation in the cookbook, i did adjustments on it. instead of using two transformations to gather all the necessary fields, i just retrieved all the information that i need in 1 transformation.
then after that i copied those information to the next steps, then some queries to complete the information and it is now working.
passing parameters between transformations is a bit confusing, there are parameters to be set on the transformation itself and also on the job where the transformations lay so i kinda went guessing for some time just to make it work.