Find details of missing records in ADF Pipeline

I am building a very basic ADF pipeline with a Copy Data activity loading data from Salesforce to Salesforce. After I debug the pipeline, a few records are skipped during the load. How can I find the details of these skipped records?
I am new to ADF. Any inputs will be helpful.
Thanks in Advance.

Have you checked this doc? https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-fault-tolerance
You can enable logging if you want to track the skipped rows.
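For reference, here is a minimal sketch of what the Copy activity's typeProperties could look like with skip-and-log enabled, based on that doc. It is shown as a Python dict mirroring the activity JSON; the linked service name and log path are placeholders, not values from the original question.

```python
# Sketch of Copy activity typeProperties that skip incompatible rows and log
# them to a storage account (placeholder names; see the fault-tolerance doc).
copy_type_properties = {
    "enableSkipIncompatibleRow": True,  # skip bad rows instead of failing the run
    "logSettings": {
        "enableCopyActivityLog": True,
        "copyActivityLogSettings": {
            "logLevel": "Warning",       # "Warning" records the skipped rows
            "enableReliableLogging": False,
        },
        "logLocationSettings": {
            "linkedServiceName": {
                "referenceName": "MyBlobStorageLinkedService",  # placeholder
                "type": "LinkedServiceReference",
            },
            "path": "copy-activity-logs",  # container/folder where log files land
        },
    },
}
```

After the run, the skipped records and the reason they were skipped can be inspected in the log files written to that path.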

Related

Can we automate ETL in Azure?

I am currently working on a very interesting ETL project using Azure to transform my data manually. However, transforming data manually can be exhausting and lengthy once I start having several source files to process. My pipeline is working fine for now because I have only a few files to transform, but what if I have thousands of Excel files?
What I want to achieve is to extend the project: extract the Excel files that arrive by email using a Logic App, then apply ETL directly on top of them. Is there any way I can automate ETL in Azure? Can I do ETL without manually modifying the pipeline for a different type of data? How can I make my pipeline flexible enough to handle data transformation for various types of source data?
Thank you in advance for your help.
Can I do ETL without modifying the pipeline for a different type of data manually?
According to your description, I suppose you already know that the ADF connector is supported in Logic Apps. You can execute an ADF pipeline from a Logic App flow and even pass parameters into the pipeline.
Normally, the source and sink services are fixed in a single copy activity, but you can define a dynamic file path in the datasets, so you don't need to create multiple copy activities.
If the data types are different, you can pass a parameter from the Logic App into ADF and then, before the data transfer, use a Switch activity to route the copy into different branches.
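To make those two ideas concrete - a dynamic file path in the dataset and a Switch on a parameter passed in from the Logic App - here is a rough sketch expressed as Python dicts mirroring the ADF JSON. All names (SourceFileDataset, sourceFileName, dataType, CopyExcel, CopyCsv) and the container are hypothetical.

```python
# Dataset with a parameterized file name, so one copy activity can handle many files.
blob_dataset = {
    "name": "SourceFileDataset",                              # hypothetical name
    "properties": {
        "type": "DelimitedText",
        "parameters": {"sourceFileName": {"type": "string"}},
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "incoming",                      # placeholder container
                "fileName": "@dataset().sourceFileName",      # resolved at run time
            }
        },
    },
}

# Switch activity routing on a pipeline parameter supplied by the Logic App.
switch_activity = {
    "name": "RouteByDataType",                                # hypothetical name
    "type": "Switch",
    "typeProperties": {
        "on": "@pipeline().parameters.dataType",
        "cases": [
            {"value": "excel", "activities": [{"name": "CopyExcel", "type": "Copy"}]},
            {"value": "csv",   "activities": [{"name": "CopyCsv", "type": "Copy"}]},
        ],
        "defaultActivities": [],
    },
}
```

The Copy activities inside each case would still need their own source, sink, and dataset references; they are left empty here to keep the sketch short.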

Hot folder - How to check the status of ingested files into Hybris?

In our current production system, we have several files that are processed by the Hybris hot folder from an external system on a daily/hourly basis. What is the best way to check the status of each file that is being processed by the hot folder? Is there any OOTB dashboard functionality available for the hot folder, or is it a custom development?
So far, I have been checking the Backoffice cronjob logs, but that is a very cumbersome process: monitoring logs, finding the unique cronjob ID, and so on. Are there any better approaches?
I'm looking for something similar to Jenkins job status.
Appreciate your inputs.
There is a workaround. Please check this link:
https://help.sap.com/viewer/d0224eca81e249cb821f2cdf45a82ace/1808/en-US/b8004ccfcbc048faa9558ae40ea7b188.html?q=CronJobProgressTracker
First, you need to implement the CronJobProgressTracker class in your current cronjob. Then you can see the progress of the cronjob in either the HAC or the Backoffice:
HAC: execute a flexible search.
Backoffice: add a setting for the CronJobHistory menu, then just click the refresh button to see the latest state of progress.
As far as I know, it is not possible to track file progress state in the OOTB hot folder. You could also write custom code in your upload process, but I would need to know your hot folder XML context to give more concrete hints.
The hot folder ingests a file in a series of steps specified by the beans in hot-folder-spring.xml. Add loggers in each of the beans, e.g. batchFilesHeader and batchExternalTaxConverterMapping; you can then see the status in the console logs.

import.io stuck at Test your connector

I have created a connector using the import.io Windows application.
I am able to successfully test my connector using example queries. I want to extract the data returned from this connector into a dataset, but I am stuck at the "Test your connector" step.
Here is the screenshot:
The import.io Connector tool requires multiple queries to ensure it captures the right template. This increases the accuracy of collecting the right dataset.
It has taken me up to 5 queries before seeing "I'm done creating tests."

Call a pipeline from a pipeline in Amazon Data Pipeline

My team at work is currently looking for a replacement for a rather expensive ETL tool that, at this point, we are using as a glorified scheduler. We have improved on the integrations offered by the ETL tool using our own Python code, so I really just need its scheduling ability. One option we are looking at is Data Pipeline, which I am currently piloting.
My problem is this: imagine we have two datasets to load - products and sales. Each of these datasets requires a number of steps to load (get the source data, call a Python script to transform, load to Redshift). However, products need to be loaded before sales runs, as we need product cost, etc. to calculate margin. Is it possible to have a "master" pipeline in Data Pipeline that calls products first, waits for its successful completion, and then calls sales? If so, how? I'm open to other product suggestions as well if Data Pipeline is not well suited to this type of workflow. Appreciate the help.
I think I can relate to this use case. Anyhow, Data Pipeline does not do this kind of dependency management on its own. It can, however, be simulated using file preconditions.
In this example, your child pipelines may depend on a file being present (as a precondition) before starting. A Master pipeline would create trigger files based on some logic executed in its activities. A child pipeline may create other trigger files that will start a subsequent pipeline downstream.
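As a rough sketch of that trigger-file approach (all IDs, the S3 path, the script, and the resource reference are made up), the child pipeline's first activity could wait on an S3KeyExists precondition, shown here as Python dicts in the pipeline-definition style:

```python
# The master pipeline's last activity writes this S3 object as a trigger;
# the child pipeline's first activity waits on it via a precondition.
trigger_precondition = {
    "id": "ProductLoadedPrecondition",
    "type": "S3KeyExists",
    "s3Key": "s3://my-etl-bucket/triggers/product_done",  # placeholder trigger object
}

child_first_activity = {
    "id": "LoadSales",
    "type": "ShellCommandActivity",
    "command": "python load_sales.py",                     # placeholder script
    "precondition": {"ref": "ProductLoadedPrecondition"},
    "runsOn": {"ref": "MyEc2Resource"},                    # placeholder EC2 resource
}
```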
Another solution is to use the Simple Workflow (SWF) service. That has the features you are looking for, but it would need custom coding using the Flow SDK.
This is a basic use case of Data Pipeline and should definitely be possible. You can use the graphical pipeline editor to create this pipeline. Breaking down the problem:
There are two datasets:
Product
Sales
Steps to load these datasets:
Get source data: say, from S3. For this, use an S3DataNode.
Call a Python script to transform: use a ShellCommandActivity with staging. Data Pipeline does data staging implicitly for S3DataNodes attached to a ShellCommandActivity; you can access them via the special environment variables provided: Details (a small Python sketch of such a script follows this list).
Load the output to Redshift: use RedshiftDatabase.
You will need to add the above components for each dataset you need to work with (product and sales in this case). For easy management, you can run these on an EC2 instance.
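For what it's worth, here is a minimal sketch of such a transform script, assuming staging is enabled with one input and one output S3DataNode attached to the ShellCommandActivity. The file names and the pass-through "transformation" are placeholders.

```python
# Transform script invoked by a ShellCommandActivity with staging enabled.
import csv
import os

input_dir = os.environ["INPUT1_STAGING_DIR"]    # local copy of the input S3DataNode
output_dir = os.environ["OUTPUT1_STAGING_DIR"]  # files written here are copied back to S3

with open(os.path.join(input_dir, "product.csv"), newline="") as src, \
     open(os.path.join(output_dir, "product_transformed.csv"), "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow(row)  # placeholder: apply the real transformation here
```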
Condition: 'product' needs to be loaded before 'sales' runs
Add a dependsOn relationship: add this field on the ShellCommandActivity of Sales so that it refers to the ShellCommandActivity of Product. See the dependsOn field in the documentation; it says: 'One or more references to other Activities that must reach the FINISHED state before this activity will start'. A rough sketch of this wiring follows below.
Tip: in most cases, you would not want the next day's execution to start while the previous day's execution is still active, i.e. RUNNING. To avoid such a scenario, use the 'maxActiveInstances' field and set it to '1'.
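A minimal sketch of that dependsOn wiring (IDs, commands, and the EC2 resource are placeholders; the S3DataNodes, RedshiftDatabase, and Schedule objects are omitted):

```python
# Product transform runs first.
transform_product = {
    "id": "TransformProduct",
    "type": "ShellCommandActivity",
    "command": "python transform_product.py",   # placeholder transform script
    "runsOn": {"ref": "MyEc2Resource"},          # placeholder EC2 resource
}

# Sales transform waits until the Product activity reaches FINISHED.
transform_sales = {
    "id": "TransformSales",
    "type": "ShellCommandActivity",
    "command": "python transform_sales.py",      # placeholder transform script
    "runsOn": {"ref": "MyEc2Resource"},
    "dependsOn": {"ref": "TransformProduct"},    # the dependency described above
    "maxActiveInstances": 1,                     # avoid overlapping daily runs (see tip)
}
```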

Google BigQuery, unable to load data into shared datasets

I created a project on Google BigQuery and enabled billing.
I went on to create a few datasets that were shared with my team members (Can EDIT permissions).
However, my teammates are unable to load data into the respective datasets shared with them. Whenever they try, it says billing is not enabled for this project.
I am able to load data into the datasets, but my team is not.
It's been more than 24 hours.
Thanks in advance.
Note that in order to load data, they need to run a load job, and that load job needs to be run in a project. Perhaps billing is not enabled on the project they are using?
You can give your team members read access to the project (or greater) to allow them to run jobs in your own billing-enabled project.
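As an illustration only (not from the original answer), a load job submitted with the google-cloud-bigquery Python client runs in, and is billed to, whichever project the client is created with, even when the destination table lives in a shared dataset. The project ID, dataset, table, and GCS URI below are placeholders.

```python
from google.cloud import bigquery

# The job runs (and is billed) in this project, so it must have billing enabled.
client = bigquery.Client(project="billing-enabled-project")   # placeholder project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",                                # placeholder source file
    "billing-enabled-project.shared_dataset.my_table",        # placeholder destination
    job_config=job_config,
)
load_job.result()  # wait for completion; raises if the load failed
```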
You can share BigQuery access at the project level and at the dataset level.
See https://developers.google.com/bigquery/access-control.
I assume you are sharing at the dataset level. Can you try sharing the project instead with your team members? (here: https://cloud.google.com/console/project)
Please report back!