Currently I'm working with Collibra and Mule to export data as reports.
My requirement is to export data from Collibra to an external file.
For this I'm using the Table View Config with Mule and Collibra Connect.
Using exportCSV I am able to get the date in the format dd/mm/yyyy, but I need the timestamp along with the date.
Please help me.
The DGC Connector can be used to import the CSV that resulted from converting the external data.
The important part here is the Table View Config, which specifies how DGC has to interpret the CSV data and map it to DGC concepts. You have to configure the Table View Config as follows:
Asset (Term) ID should be the unique identifier of the DGC Assets. You can get that ID from the mapping information.
The default operation has to be UPDATE, to cope with the scenarios described earlier. You already created the asset, so you do not have to CREATE anything anymore.
Mule DataWeave can be used to convert the date from dd/mm/yyyy to the corresponding timestamp.
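For example, here is a minimal sketch of that conversion (shown in Python rather than DataWeave, assuming the exported value is a plain dd/mm/yyyy string and that midnight is an acceptable default for the missing time component):

```python
from datetime import datetime

# Hypothetical helper: parse a dd/mm/yyyy value from the exported CSV
# and emit the same date with a timestamp appended. Midnight is
# assumed because the source value carries no time component.
def to_timestamp(date_str: str) -> str:
    parsed = datetime.strptime(date_str, "%d/%m/%Y")
    return parsed.strftime("%d/%m/%Y %H:%M:%S")

print(to_timestamp("25/12/2019"))  # 25/12/2019 00:00:00
```

In DataWeave itself, a similar coercion is available via format-annotated casts such as `as Date {format: "dd/MM/yyyy"}` followed by an output format on the resulting value.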
I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while writing the data to the staging area, the values are changed to something like 0:08:00:00.0000000 and 0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas to why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I get the same error; a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it for an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
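To illustrate what that cast needs to handle, here is a hypothetical Python sketch that normalizes the malformed staged value back to a plain HH:MM:SS time (assuming the extra leading component is a day count and the trailing digits are fractional seconds):

```python
from datetime import time

# Normalize a staged value like "0:08:00:00.0000000" to "08:00:00".
# Assumption: the first colon-separated field is a day component and
# the digits after the dot are fractional seconds, both of which
# Snowflake's TIME parser rejects.
def normalize_time(raw: str) -> str:
    value = raw.split(".")[0]      # drop fractional seconds
    parts = value.split(":")
    if len(parts) == 4:            # strip the leading day component
        parts = parts[1:]
    hh, mm, ss = (int(p) for p in parts)
    return time(hh, mm, ss).strftime("%H:%M:%S")

print(normalize_time("0:08:00:00.0000000"))  # 08:00:00
```

Doing the equivalent with CAST / CONVERT in the source query keeps the fix entirely inside the pipeline, with no extra activity.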
Recommended solution:
Use the Copy data activity to insert your data into BlobStorage / ADLS (the built-in copy does this anyway), preferably in the Parquet file format and a self-designed structure (see Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data into a table from the files there; you can use a regular query or write a stored procedure and call it.
Thanks to this, you will have more control over what is happening, and you will build a Data Lake solution for your organization.
My own solution is pretty close to the accepted answer, but I still believe there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file created by the direct-to-Snowflake copy, I ended up writing a plain file to blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it into two actions:
One action takes the data from Azure SQL and saves it as a plain text file in blob storage.
The second action reads the file and loads it into Snowflake.
This works, and it is supposed to be essentially the same thing the direct copy to Snowflake does, hence the bug assumption.
I have a Data Factory V2 pipeline consisting of 'get metadata' and 'forEach' activities that reads a list of files on a file share (on-prem) and logs it in a database table. Currently, I'm only able to read file name, but would like to also retrieve the date modified and/or date created property of each file. Any help, please?
Thank you
According to the MS documentation, both File system and SFTP support the lastModified property, but we can only get the lastModified of one file or folder at a time.
I'm using File system to do the test. The process is basically the same as in the previous post; we need to add a Get Metadata activity inside the ForEach activity.
These are my local files.
First, I created a table for logging.
create table Copy_Logs (
Copy_File_Name varchar(max),
Last_modified datetime
)
In ADF, I'm using Child Items at the Get Metadata1 activity to get the file list of the folder.
Then add the dynamic content @activity('Get Metadata1').output.childItems at the ForEach1 activity.
Inside the ForEach1 activity, I use Last modified at the Get Metadata2 activity.
In the dataset of the Get Metadata2 activity, I key in @item().name as follows.
I use the CopyFiles_To_Azure activity to copy local files to Azure Data Lake Storage Gen2.
I key in @item().name at the source dataset of the CopyFiles_To_Azure activity.
At the Create_Logs activity, I'm using the following SQL to get the info we need.
select '@{item().name}' as Copy_File_Name, '@{activity('Get Metadata2').output.lastModified}' as Last_modified
In the end, sink to the sql table we created previously. The result is as follows.
One way I can think of is to add a new Get Metadata activity inside the ForEach loop, use a parameterized dataset, and pass the file name as the parameter. The animation below should help; I tested the same.
HTH.
Problem
I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.
I'm explicitly defining the schema for the load job (below) and set header rows to skip = 1.
Data
$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf
Schema
id:INTEGER,
shipped:DATE,
name:STRING
Error
BigQuery produces the following error:
Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17
Questions
I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.
However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.
Is there a way to specify the expected date format(s)?
Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?
According to the listed limitations:
When you load CSV or JSON data, values in DATE columns must use
the dash (-) separator and the date must be in the following
format: YYYY-MM-DD (year-month-day).
So this leaves us with 2 options:
Option 1: ETL
Place new CSV files in Google Cloud Storage
That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
Edit the date column in all the CSV files
Save the edited files back to Google Cloud Storage
Load the modified CSV files into Google BigQuery
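The "edit the date column" step could be sketched in Python as follows, assuming the m/d/Y layout from the sample above (the column name shipped comes from the question; file handling is reduced to in-memory strings for brevity):

```python
import csv
import io
from datetime import datetime

# Rewrite the "shipped" column from m/d/Y to the YYYY-MM-DD format
# that BigQuery requires for DATE columns. Column names are taken
# from the sample CSV in the question.
def reformat_dates(csv_text: str) -> str:
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        parsed = datetime.strptime(row["shipped"], "%m/%d/%Y")
        row["shipped"] = parsed.strftime("%Y-%m-%d")
        writer.writerow(row)
    return out.getvalue()

sample = "id,shipped,name\n0,1/10/2019,ryan\n1,2/1/2019,blah\n"
print(reformat_dates(sample))
```

In a Cloud Function or Composer task, the same logic would read from and write back to Google Cloud Storage instead of strings.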
Option 2: ELT
Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
Create a BigQuery view that transforms the shipped field from a string to a recognised date format. Use SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name
Use that view for your analysis
I'm not sure, from your description, if this is a once-off job or recurring. If it's once-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort, and would only be worth it for recurring jobs.
I am trying to import data into a table in Oracle from a CSV file using SQL Loader. However, I want to add two additional attributes namely date of upload and the file path from which the data is being imported. I can add the date using SYSDATE, Is there a similar method of obtaining the file path?
The trouble with using SYSDATE is that it will not be the same for all rows. This can make it difficult if you do more than one load in a day and need to back out a particular load. Consider using a batch_id also using the method in this post: Insert rows with batch id using sqlldr
I suspect it could be adapted to use SYSDATE as well so it would be the same for all rows. Give it a try and let us know. At any rate, using a batch_id from a sequence would make working through problems much easier should you need to delete based on a batch_id.
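As for the file path itself: SQL*Loader has no built-in variable for the input file name, so a common approach is to hard-code it (or template it from the calling script) with the CONSTANT keyword. A sketch, with hypothetical table and path names:

```
-- Hypothetical control file: the path appears twice because it must
-- be supplied literally; a wrapper script would typically generate
-- this file per load.
LOAD DATA
INFILE '/data/incoming/sales_20240101.csv'
APPEND INTO TABLE sales_staging
FIELDS TERMINATED BY ','
( sale_id,
  amount,
  upload_date  SYSDATE,
  source_file  CONSTANT '/data/incoming/sales_20240101.csv'
)
```

Because CONSTANT stamps the same literal on every row, it pairs naturally with the batch_id approach above.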
Does the API support importing a CSV to a new table when there is a TIMESTAMP field?
If I manually (using the BigQuery web interface) upload a CSV file containing timestamp data, and specify the field to be a TIMESTAMP via the schema, it works just fine. The data is loaded. The timestamp data is interpreted as timestamp data and imported into the timestamp field just fine.
However, when I use the API to do the same thing with the same file, I get this error:
"Illegal CSV schema type: TIMESTAMP"
More specifically, I'm using Google Apps Script to connect to the BigQuery API, but the response seems to be coming from the BigQuery API itself, which suggests this is not a feature of the API.
I know I can import as STRING, then convert to TIMESTAMP in my queries, but I was hoping to ultimately end up with a table schema with a timestamp field... populated from a CSV file... using the API... preferably through Apps Script for simplicity.
It looks like TIMESTAMP is missing from the 'inline' schema parser. The fix should be in next week's build. In the meantime, if you pass the schema via the 'schema' field rather than the schemaInline field, it should work for you.
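For reference, a hypothetical sketch (as a Python dict) of a load configuration that passes the schema through the structured schema field instead of schemaInline; the project, dataset, table, and field names are all illustrative:

```python
import json

# Load-job configuration passing the schema as a structured object
# ("schema") rather than a flat text definition ("schemaInline").
# All identifiers below are placeholders.
load_config = {
    "configuration": {
        "load": {
            "schema": {
                "fields": [
                    {"name": "event_time", "type": "TIMESTAMP"},
                    {"name": "label", "type": "STRING"},
                ]
            },
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
        }
    }
}

print(json.dumps(load_config["configuration"]["load"]["schema"]))
```

From Apps Script the same structure would be passed as the job resource when inserting the load job.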