Transferring Storage Accounts Table into Data Lake using Data Factory - azure-storage

I am trying to use Data Factory to transfer a table from a Storage Account into Data Lake. Microsoft claims that one can "store files of arbitrary sizes and formats into Data Lake". I use the online wizard and try to create a pipeline. The pipeline gets created, but I then always get an error saying:
Copy activity encountered a user error: ErrorCode=UserErrorTabularCopyBehaviorNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=CopyBehavior property is not supported if the source is tabular data source.,Source=Microsoft.DataTransfer.ClientLibrary,'.
Any suggestions on what I can do to be able to use Data Factory to transfer data from a Storage Account table into Data Lake?
Thanks.

Your scenario is supported by ADF. As for the error you hit, there is a known defect where, in some cases, the Copy Wizard mis-generates a "CopyBehavior" property that is not applicable to tabular sources. We are fixing that now.
As a workaround, go to the Azure portal -> Author and deploy -> select that pipeline -> find the "CopyBehavior": "MergeFiles" line under AzureDataLakeStoreSink and remove it -> then deploy and rerun the activity.
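For reference, the sink section of the generated pipeline JSON looks roughly like the sketch below (property names and exact casing may differ depending on the wizard version); the workaround is simply to delete the CopyBehavior line so the sink only declares its type:

```json
"sink": {
    "type": "AzureDataLakeStoreSink",
    "CopyBehavior": "MergeFiles"
}
```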
If you happened to author a run-once pipeline, please re-author it as a scheduled one, since run-once pipelines are hard to update through JSON.
Thanks,
Linda

Related

BigQuery - Data transfer "Detected that no changes will be made to the destination table"

I use a script to generate files from an API and store them on Google Cloud Storage. Following this documentation, https://cloud.google.com/bigquery/docs/cloud-storage-transfer?hl=en_US#limitations, I created a BigQuery table with the corresponding schema in advance and then created a Data Transfer with the corresponding configuration.
When I run the Data Transfer the following error shows up in the logs:
Detected that no changes will be made to the destination table
I've updated some of the files, added files, deleted files, etc., and every time I get the same message. I also have other Data Transfers that work just fine with the same BigQuery instance and Cloud Storage bucket.
The only related issue I found on SO, Not able to update Big query table with Transfer from a Storage file, says you need to wait 1 hour, but even after a day I get the same error.
Any idea as to what triggers BigQuery to determine that changes have been made (or not)?

Google Data Fusion Salesforce to Bigquery Pipeline, automatic way of managing schema updates in Salesforce

Hey, I am trying to create some batch jobs that read from a couple of Salesforce objects and push them to BQ. Every time the batch process runs, it truncates the table in BQ and pushes all the data in the SF object back into BQ. Is it possible for Google Data Fusion to automatically detect changes to an object in Salesforce (like adding a new column or changing the data type of a column) so that they are registered and pushed to BQ via Google Data Fusion?
For the SF side of the puzzle you could look into https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_describeGlobal.htm and the If-Modified-Since header, which tells you whether the definition of the table(s) changed. That URL covers all tables in the org; alternatively, you can run table-specific metadata describe calls with https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_sobject_describe.htm
But I can't tell you how to use it in your job.
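For illustration only, a minimal Node sketch of such a check could look like the following; the instance URL, API version, and token handling are assumptions, not part of the answer above:

```javascript
// Rough sketch: ask Salesforce whether any sObject definitions have changed
// since the last run, using the describeGlobal endpoint with If-Modified-Since.
// INSTANCE_URL, the API version (v57.0) and the token handling are placeholders.
// Requires Node 18+ for the built-in fetch.
const INSTANCE_URL = 'https://yourInstance.my.salesforce.com';
const ACCESS_TOKEN = process.env.SF_ACCESS_TOKEN;

async function schemaChangedSince(lastRunDate) {
  const res = await fetch(`${INSTANCE_URL}/services/data/v57.0/sobjects/`, {
    headers: {
      Authorization: `Bearer ${ACCESS_TOKEN}`,
      // Salesforce answers 304 Not Modified if no object definitions
      // changed after this date (RFC 1123 format, as toUTCString produces).
      'If-Modified-Since': lastRunDate.toUTCString(),
    },
  });
  return res.status !== 304; // 304 = nothing changed; 200 = something did
}

// Example: only trigger the truncate-and-reload into BQ when something changed.
schemaChangedSince(new Date(Date.now() - 24 * 60 * 60 * 1000)).then((changed) => {
  console.log(changed ? 'Object definitions changed - rebuild schema' : 'No changes');
});
```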
You can use #eyescream's answer as the condition or trigger for the update to BigQuery. You may push changes to BigQuery using Data Fusion's pre-built Salesforce Streaming Source plugin, which, as mentioned in this documentation,
tracks updates in Salesforce sObjects. Examples of sObjects are
opportunities, contacts, accounts, leads, any custom object, etc.
You may use this approach to automatically track changes and push them to BigQuery. You can also find the whole Salesforce Streaming Source configuration reference in this documentation, which is also linked from Google's official documentation.
However, if you want a more dynamic approach for your overall use case, you may also build your own integration between Salesforce and BigQuery. In that approach you will need to write your own code, in which you can use #eyescream's answer as the primary condition/trigger and then automatically push the update to your BigQuery schema.

Error when creating scheduled query on Bigquery "Error creating scheduled query: er"

I just started a new project on Google Cloud and set up some BigQuery datasets and tables. I now want to set up some scheduled queries. I have already enabled the BigQuery Data Transfer API. My query is valid (it's just SELECT * FROM table). I can't find anything about this error online.
UPDATE: I've experimented a bit and it seems to be an organization-wide issue. All projects, new and old, within my organization get this same error when trying to schedule a query. I tried a project in a different organization and did not have the issue. What could be causing this error for ALL projects in an organization?
UPDATE 2: When querying a table that is not empty, the error changes to "Error creating scheduled query: Yn" instead of "Error creating scheduled query: er" (which appears when the scheduled query would have queried an empty table).
I faced the same issue as you, and basically I just needed to run the query once before creating the scheduled query... and that did the trick.
From the BQ FAQs:
"Scheduled queries use features of BigQuery Data Transfer Service. Verify that you have completed all actions required in Enabling BigQuery Data Transfer Service."
Basically, what this means is that you need to enable the Data Transfer API in your project, AND give the user who creates the scheduled query a BigQuery Admin role in order to have the right permissions to access the transfer service.
If done right, you should get a popup when creating the scheduled query asking you to confirm that the Data Transfer Service has access to your user account (if you block popups you might not see this message and get stuck).
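For completeness, a hedged sketch of those two prerequisites from the command line; the project ID and user email are placeholders:

```
# Enable the BigQuery Data Transfer Service API for the project
gcloud services enable bigquerydatatransfer.googleapis.com --project=MY_PROJECT_ID

# Grant the user who creates the scheduled query the BigQuery Admin role
gcloud projects add-iam-policy-binding MY_PROJECT_ID \
  --member="user:someone@example.com" \
  --role="roles/bigquery.admin"
```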
If this error only occurs in your organisation, I believe it might be caused by an organisation policy on Google Cloud. I would encourage you to double-check whether there is any org policy causing this error. If that's not the case, open a support ticket with GCP.
What worked for me was signing in through Incognito Mode with just my account and attempting to save the scheduled query. I have multiple Google accounts signed in at one time, and for whatever reason BigQuery throws this generic error after authorization is successful and BigQuery is granted the access it requested.
You need to make sure that you are creating the query under the targeted project, not in any other project, because otherwise it won't appear.
Also, you need to enable the API, as mentioned in one of the above answers.
This eventually worked for me when I ran it in an incognito window.

How to save API output data to a Dataset in Azure Data Factory

I'm currently working on a project in Azure Data Factory which involves collecting data from a dataset, using that data to make API calls, and then posting the output of the calls to another dataset.
In this way I wish to end up with a dataset containing the various data that the API calls return to me.
My current difficulty is that I do not know how to make the "Web" activity (which I use to make the API call) save its output to my dataset.
I have tried numerous different solutions found online, but none of them seem to work. I am not sure if the official documentation is outdated or if I'm misunderstanding parts of it. Below I've listed links to the solutions I've tried without success:
Copy data from a REST source
Copy data from an HTTP source
(among others, including similar posts to mine.)
The current flow in my pipeline is that a "Lookup" activity collects a list of values named "User_ID". These user IDs are fed into a ForEach loop, which makes an API call with the "Web" activity for each USER_ID. This is the point in the pipeline where I wish to add an activity that can post each of these Web activity outputs into my new dataset.
I've tried to use the "Copy data" activity, but all it seems to do is copy data straight from one dataset to another, without letting me manipulate the data (which I wish to do with my API call).
Does anyone have a solution to how this is done?
Thanks a lot in advance.
Not sure why you could not achieve this by following Copy data from a REST endpoint. I tested the steps below and they work fine. I used the schema mapping feature of the 'Copy data' activity.
For example, I used the sample API http://dummy.restapiexample.com/api/v1/employees as the source and, for my testing, CosmosDB as the sink. Of course you can choose any other dataset as per your requirement.
1. Create a 'Linked Service' for the REST API. For simplicity I do not use authentication for this API; of course, you have that option if required.
2. Create a 'Linked Service' for the target data store. In my case, it is CosmosDB.
3. Create a Dataset for the REST API and link it to the linked service created in #1.
4. Create a Dataset for the data store (in my case CosmosDB) and link it to the linked service created in #2.
5. In the pipeline, add a 'Copy data' activity with the REST dataset created in #3 as the source and the dataset created in #4 as the sink. In my case I also had to add a schema mapping to select the employees array from the API output and map it to the fields in my data store.
And voila, that's it. When I run the pipeline, it calls the REST API and saves the output in my DB with my desired mapping.
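As a rough sketch only, the resulting 'Copy data' activity JSON has roughly this shape; the dataset names, the Cosmos DB sink type, the collection reference, and the field paths are placeholders that depend on your connectors and schema (the exact mapping path syntax varies by connector version), not values from the answer above:

```json
{
  "name": "CopyEmployeesFromRest",
  "type": "Copy",
  "inputs": [ { "referenceName": "RestEmployeesDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "CosmosEmployeesDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "RestSource" },
    "sink": { "type": "CosmosDbSqlApiSink" },
    "translator": {
      "type": "TabularTranslator",
      "collectionReference": "$['employees']",
      "mappings": [
        { "source": { "path": "['employee_name']" }, "sink": { "name": "name" } },
        { "source": { "path": "['employee_salary']" }, "sink": { "name": "salary" } }
      ]
    }
  }
}
```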

How to transfer data from SQL Server to mongodb (using mongoose schema for validation)

Goal
We have a MEAN stack application that implements a strict mongoose schema. The MEAN stack app needs to be seeded with data that originates from a SQL Server database. The app should function as expected as long as the seeded data complies with the mongoose schema.
Problem
Currently, the data transfer job is being done through the mongo CLI, which does not perform validation. Issues that have come up include Date objects being saved as strings, keys that our schema requires going missing, entire documents missing, etc. The dev team has lost hours of development time debugging the app and discovering these data issues.
Solution we are looking for
How can we validate data so it:
Throws errors
Fails and halts the transfer
Or gives some other indication that the data is not clean
Disclaimer
I was not part of the data transfer process so I don't have more detail on the specifics of that process.
This is a general problem of what you might call "batch import", "extract-transform-load (ETL)", or "data store migration", disconnected from any particular tech. I'd approach it by:
1. Export the data into some portable format (e.g. CSV or JSON).
2. Push the data into the new system through the same validation logic that will handle new data on an ongoing basis.
It's often necessary to modify that logic a bit. For example, maybe your API will autogenerate timestamps during normal operation, but for a data import you want to set them explicitly from the old data source. A more complicated situation is when there are constraints across your models/entities that need to be suspended until all the data is present.
Typically, you write your import script or system so that it generates a summary of how many records were processed, which ones failed, and why. Then you fix the issues and run it again on the remaining records. Repeat until you're happy.
P.S. It's a good idea to version control your import script.
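A minimal sketch of the "push through your validation logic" step, under stated assumptions: the SQL Server data has already been exported as a JSON array, and the app's strict mongoose schema is exposed as a User model; the file name, connection string, and model are illustrative, not taken from the question.

```javascript
// import-users.js - rough sketch: run every exported record through mongoose
// schema validation and halt the transfer if anything is not clean.
// Assumes ./models/user exports the app's strict mongoose User model and
// users-export.json holds the data exported from SQL Server (both placeholders).
const fs = require('fs');
const mongoose = require('mongoose');
const User = require('./models/user');

async function run() {
  await mongoose.connect('mongodb://localhost:27017/myapp'); // placeholder URI

  const records = JSON.parse(fs.readFileSync('users-export.json', 'utf8'));
  const failures = [];

  records.forEach((record, i) => {
    const err = new User(record).validateSync(); // schema validation, no save
    if (err) failures.push({ row: i, message: err.message });
  });

  if (failures.length > 0) {
    // Fail and halt the transfer, reporting what was not clean.
    console.error(`Validation failed for ${failures.length} of ${records.length} records:`);
    failures.forEach((f) => console.error(`  row ${f.row}: ${f.message}`));
    process.exit(1);
  }

  // Only write once everything validates; insertMany re-runs validation by default.
  await User.insertMany(records);
  await mongoose.disconnect();
  console.log(`Imported ${records.length} records.`);
}

run().catch((e) => { console.error(e); process.exit(1); });
```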
Export to CSV and write a small script using Node; that will solve your problem. You can use the fast-csv npm package.
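If you go the CSV route, a hedged sketch of the parsing half with fast-csv might look like this (the file name and the downstream validateAndImport helper are assumptions); each parsed row can then be run through the same mongoose validation shown above. Note that CSV values arrive as strings, so mongoose casting (or an explicit transform step) has to turn them back into Dates and Numbers, which is exactly the class of issue described in the question.

```javascript
// Rough sketch: stream a CSV export and collect rows with fast-csv,
// then hand them to the same mongoose validation/insert logic as above.
const fs = require('fs');
const csv = require('fast-csv');

function loadCsv(path) {
  return new Promise((resolve, reject) => {
    const rows = [];
    fs.createReadStream(path)
      .pipe(csv.parse({ headers: true })) // first line of the export is the header row
      .on('error', reject)
      .on('data', (row) => rows.push(row))
      .on('end', () => resolve(rows));
  });
}

// Example usage (validateAndImport is the hypothetical validation step above):
// loadCsv('users-export.csv').then(validateAndImport);
```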