Failed Activity not running - azure-data-factory-2

I have an Azure Data Factory v2 Pipeline with a copy data activity. If the activity fails a Lookup activity should be run. Unfortunately the Lookup never runs. Why doesn't it run on failure of the copy data activity? How do I get this to work?
I'm expecting the "Set load of file to failed" activity to run because the "Load Zipped File to Import Destination" activity failed. In fact, the output shows its status as "Failed", but no other activity runs.
Later I updated the Copy activity to skip incompatible rows, which caused the Copy data activity to succeed. The expected number of rows no longer matches the total number of rows loaded, so the If Condition activity takes the failure route. Why does the Lookup run when only the If Condition triggers the failure activity, but not when the Copy data activity does?

Activity dependencies are a logical AND. The lookup activity Set load of file to failed will only execute if both the Copy data activity and the If condition fail. It's not one or the other - it's both. I blogged about this here.
It's common to redesign this as:
A. Use multiple failure activities. Instead of having the single "Set load of file to failed" activity at the end, copy that activity and link the Copy data activity to the new copy on failure.
B. Create a parent pipeline and use an Execute Pipeline activity. Then add a single failure dependency from the Execute Pipeline activity to the "Set load of file to failed" activity.
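To make the AND behaviour concrete, here is a minimal Python sketch. It is only a toy model of the dependency check, not Data Factory code. The function name, the status strings, and the assumption that the If Condition ends up "Skipped" when the copy fails are illustrative.

```python
from typing import Dict, List, Tuple

# (upstream activity name, required status). Names and statuses are illustrative.
Dependency = Tuple[str, str]


def should_run(depends_on: List[Dependency], statuses: Dict[str, str]) -> bool:
    """An activity runs only if EVERY dependency condition is met (logical AND)."""
    return all(statuses.get(activity) == required for activity, required in depends_on)


# Assumed run: the copy fails, so the If Condition that depends on its success is skipped.
statuses = {"Copy data": "Failed", "If Condition": "Skipped"}

# Original design: one failure activity wired to both upstream activities.
shared_failure_handler = [("Copy data", "Failed"), ("If Condition", "Failed")]

# Option A: a dedicated copy of the failure activity that depends on Copy data alone.
copy_failure_handler = [("Copy data", "Failed")]

print(should_run(shared_failure_handler, statuses))  # False
print(should_run(copy_failure_handler, statuses))    # True
```

In this model the shared handler never fires when only the copy fails, while the dedicated handler from option A does.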

Related

Azure Data Factory Copy Activity - Does not fail, despite being unsuccessful

Overview
I have a Data Factory copy activity in a pipeline that moves tabular data (.csv) from a file in Azure Data Lake into a SQL database-landing schema.
The table definitions will be refreshed each load, in principle, and so there should not be errors. However, I want to log errors in the case that the table exists and the file cannot be mapped to the destination.
I anticipate the activity failing when it cannot match the column to the destination table definition. Strangely my process succeeds despite not loading the file into the database.
Testing
I added a column in my source file.
I did not update the destination database table definition.
I truncated the destination table.
I unset fault tolerance.
Deleted all copy activity logs (nascent development environment, so this is OK).
Ran the pipeline.
Checked the Data Factory pipeline output: 100% success rating!?
Queried the destination table (empty).
Opened each log file and looked for relevant information (nothing, just headers per copy activity call).
Copy activity settings

Snowflake COPY INTO Command return

I have a question about the Snowflake COPY INTO command. I searched but did not find an answer.
Suppose I want to push data from Snowflake to an S3 bucket using the Snowflake COPY INTO command in my code. How will I know when the file is ready or the command has completed, so that I can read the file from the S3 location?
You can do the following things to check whether your COPY INTO was successful or at least to retrieve some useful information about your command:
Set DETAILED_OUTPUT = TRUE and check the result (you then get information about every single unloaded file as output; if set to FALSE, you only receive information about the unload process as a whole)
Query your stage by using the syntax that can be found here https://docs.snowflake.com/en/user-guide/querying-stage.html
Query the metadata of your staged data by using metadata$filename and metadata$file_row_number: https://docs.snowflake.com/en/user-guide/querying-metadata.html
Keep in mind that even a failed COPY-command can result in some unloaded files on your stage.
More information can also be found at https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html#validating-data-to-be-unloaded-from-a-query
It depends on how you're actually running this.
Any Snowflake interface will run synchronously, so the query will just spin until it's complete.
Any async call would need extra checks, the easiest one being the web interface (it shows the status of the query, and when the query completes the unload is complete).
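A rough Python sketch of the synchronous route combined with DETAILED_OUTPUT = TRUE could look like the following. The helper names are hypothetical, the cursor is any DB-API style object with execute() and fetchall(), and treating the first output column as the file name is an assumption to check against your own results.

```python
from typing import Any, List, Sequence


def build_unload_sql(stage_path: str, select_sql: str) -> str:
    """Build a COPY INTO <location> statement that reports every unloaded file."""
    return (
        f"COPY INTO {stage_path} "
        f"FROM ({select_sql}) "
        "DETAILED_OUTPUT = TRUE"
    )


def unload_and_list_files(cursor: Any, stage_path: str, select_sql: str) -> List[str]:
    """Run the unload synchronously and return the names of the unloaded files."""
    # execute() blocks until the statement has finished, so the files are
    # complete by the time the result rows are fetched.
    cursor.execute(build_unload_sql(stage_path, select_sql))
    rows: Sequence[Sequence[Any]] = cursor.fetchall()
    # Assumption: with DETAILED_OUTPUT = TRUE, column 0 of each row is the file name.
    return [str(row[0]) for row in rows]
```

With the Snowflake Python connector, the cursor would come from connection.cursor(). Once unload_and_list_files returns without raising, the listed files should be safe to read from the S3 location.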

Running SSIS Solution/Package deletes components out of the Data Flow Task

I'm working on a package to import data from a raw text file to a table in SQL Server. My package contains:
1) An Execute Process Task that runs a batch file to compile .txt files
2) An Execute SQL Task that Truncates the table I want to import
3) A Data Flow Task that takes the data from the raw text file and puts it in the table in SQL Server
I was able to run each step individually and they worked as expected - however, when I run the solution from inside SSIS itself, it gives me the "success" message but nothing actually happens. Even worse, the components of the data flow task are now missing.
Has anyone experienced this who found a work around?
Sorry for the lack of specifics! I actually figured it out. Let me clarify my second paragraph:
The batch portion and the Execute SQL Task work perfectly when I disable the Data Flow Task! However, upon enabling the Data Flow Task, the package would "run" but not complete the Data Flow Task and would delete the Data Flow Task's components completely. Within the data flow task I had:
1) Flat File Source
2) Conditional split that ignored rows in the first column if the value was "".
3) OLE DB destination table
What I found is that changing the Conditional Split from specifically ignoring rows with "" to making the criteria based on value length, rather than looking for that symbol, worked out and didn't completely delete out components in the data flow task.
TL;DR: For whatever reason, the solution I built didn't like the conditional split criteria being based on the "" character. When I removed that, the solution worked perfectly.

Getting the JOB_ID variable in Pentaho Data Integration

When you log a job in Pentaho Data Integration, one of the fields is ID_JOB, described as "the batch id- a unique number increased by one for each run of a job."
Can I get this ID? I can see it in my logging tables, but I want to set up a transformation to get it. I think there might be a runtime variable that holds an ID for the running job.
I've tried using the Get Variables and Get System Info transformation steps to no avail. I am a new Kettle user.
You have batch_ids of the current transformation and of the parent job available on the Get System Info step. On PDI 5.0 they come before the "command line arguments", but order changes with each version, so you may have to look it up.
You need to create the variable yourself to house the parent job batch ID. The way to do this is to add another transformation as the first step in your job that sets the variable and makes it available to all the other subsequent transformations and job steps that you'll call from the job. Steps:
1) As you have probably already done, enable logging on the job
JOB SETTINGS -> SETTINGS -> CHECK: PASS BATCH ID
JOB SETTINGS -> LOG -> ENABLE LOGGING, DEFINE DATABASE LOG TABLE, ENABLE: ID_JOB FIELD
2) Add a new transformation, called "Set Variable", as the first step after the start of your job
3) Create a variable that will be accessible to all your other transformations and that contains the value of the current job's batch ID
3a) ADD A GET SYSTEM INFO STEP. GIVE A NAME TO YOUR FIELD - "parentJobBatchID" AND TYPE OF "parent job batch ID"
3b) ADD A SET VARIABLES STEP AFTER THE GET SYSTEM INFO STEP. DRAW A HOP FROM THE GET SYSTEM INFO STEP TO THE SET VARIABLES STEP AS ITS MAIN OUTPUT
3c) IN THE SET VARIABLES STEP SET FIELDNAME: "parentJobBatchID", SET A VARIABLE NAME - "myJobBatchID", VARIABLE SCOPE TYPE "Valid in the Java Virtual Machine", LEAVE DEFAULT VALUE EMPTY
And that's it. After that, you can go back to your job and add subsequent transformations and steps and they will all be able to access the variable you defined by substituting ${myJobBatchID} or whatever you chose to name it.
IT IS IMPORTANT THAT THE SET VARIABLES STEP IS THE ONLY THING THAT HAPPENS IN THE "Set Variable" TRANSFORMATION AND THAT ANYTHING ELSE THAT NEEDS TO ACCESS THAT VARIABLE IS ADDED ONLY TO OTHER TRANSFORMATIONS CALLED BY THE JOB. This is because transformations in Pentaho are multi-threaded and you cannot guarantee that the Set Variables step will happen before other activities in that transformation. The parent job, however, executes sequentially, so you can be assured that once you establish the variable containing the parent job batch ID in the first transformation of the job, all other transformations and job steps will be able to use that variable.
You can test that it worked before you add other functionality by adding a "Write To Log" step after the "Set Variable" transformation that writes the variable ${myJobBatchID} to the log for you to view and confirm it is working.

What happens when a BigQuery upload job fails after loading a portion of the JSON file?

As the title says, what happens when I start a BigQuery upload job and, let's say, the job fails after loading 50% of the rows in the JSON file? Does BigQuery roll back the entire load job, or am I left with 50% of the data loaded?
I am appending data daily into a single table, and keeping it duplicate-free is very important. We are using the HTTP REST API.
BigQuery appends data atomically. You will never get half of the data in the table if the load fails. If the job completes successfully, all of the data will show up at once.
There are two additional tricks you can use to prevent duplicates:
Specify a job id for the load job. Imagine you pull your network cable mid way through starting the job... how do you know whether it succeeded? Specifying a job id lets you look up the job later if the job creation request fails.
Perform your loads to a temporary table, and specify WRITE_TRUNCATE as the writeDisposition. This means that you can run import jobs idempotently to the temporary table, and if you don't know whether a job succeeded, just run another one, and it will overwrite the data. Once you have a load job that completes successfully, run a table copy job with writeDisposition set to WRITE_APPEND to append the new data to your main table.
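Since you are on the REST API, the two tricks can be sketched as the request bodies you would send to jobs.insert. This is an illustration only: the helper names, project, dataset, table names, and job IDs are hypothetical, and sourceUris assumes the JSON file sits in Cloud Storage rather than being uploaded with the request.

```python
from typing import Any, Dict, List


def table_ref(project: str, dataset: str, table: str) -> Dict[str, str]:
    """Table reference in the shape the REST API expects."""
    return {"projectId": project, "datasetId": dataset, "tableId": table}


def load_job_body(project: str, job_id: str, source_uris: List[str],
                  staging: Dict[str, str]) -> Dict[str, Any]:
    """Load into the staging table; WRITE_TRUNCATE makes re-runs overwrite."""
    return {
        # Caller-chosen job ID, so the job can be looked up after a lost response.
        "jobReference": {"projectId": project, "jobId": job_id},
        "configuration": {
            "load": {
                "sourceUris": source_uris,
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": staging,
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    }


def copy_job_body(project: str, job_id: str, staging: Dict[str, str],
                  main: Dict[str, str]) -> Dict[str, Any]:
    """Append the staged rows to the main table once the load has succeeded."""
    return {
        "jobReference": {"projectId": project, "jobId": job_id},
        "configuration": {
            "copy": {
                "sourceTable": staging,
                "destinationTable": main,
                "writeDisposition": "WRITE_APPEND",
            }
        },
    }


# Hypothetical names, for illustration only.
staging = table_ref("my-project", "my_dataset", "daily_staging")
main = table_ref("my-project", "my_dataset", "events")
load_body = load_job_body("my-project", "daily_load_20240101",
                          ["gs://my-bucket/2024-01-01.json"], staging)
copy_body = copy_job_body("my-project", "daily_copy_20240101", staging, main)
```

Deriving the job ID from the load date, as in this sketch, is one way to ensure a retry of the same day's copy reuses the same ID instead of creating a second append job.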