Migration from Oracle to CDS using ADF

I am trying to migrate data from Oracle to an entity in Common Data Service (CDS) through an Azure Data Factory Copy activity. As CDS uses a GUID as its primary key and the Oracle table doesn't have one, my pipeline always fails.
I tried to create an additional column in the source dataset with the value #guid(), however it throws an error that the column must be of type GUID.
I also tried:
select REGEXP_REPLACE(SYS_GUID(), '(.{8})(.{4})(.{4})(.{4})(.{12})', '\1-\2-\3-\4-\5') MSSQL_GUID, c.* from table_name c;
but the GUID comes through as a string in the mapping.
How do we automatically generate a GUID in this scenario?

Could you please try updating your additional column (#guid()) data type from "type": "String" to "type": "Guid" by editing the JSON payload of your pipeline (look for the {} symbol at the top right corner of the pipeline designer)?
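For illustration, the relevant entry in the copy activity's translator section might look roughly like the sketch below (the column names are placeholders, and the exact structure should be verified against your own pipeline JSON):
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "MSSQL_GUID", "type": "Guid" },
            "sink": { "name": "new_entityid" }
        }
    ]
}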
Update:
After further analysis in collaboration with the product team, this (type conversion) has been identified as an unsupported feature for the Dynamics sink: the UX disables type conversion for the Dynamics sink and has not supported it since the release of the type conversion feature.
The product team has opened a work item as a feature improvement for type conversion with the Dynamics sink. The ETA for this feature is mid-September (note: this is a tentative date), and the product team is actively working on it. I will closely monitor the work item and update this post as soon as I have additional information.
As a workaround, please try splitting the pipeline into two copy activities: Oracle -> CSV and CSV -> Dynamics. In the first copy, add an additional column to write an empty GUID column into the CSV file. In the second copy, change the type of the GUID column in the CSV to Guid and do the copy.
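As a rough sketch (the source type and column name below are assumptions to adapt to your own datasets), the first copy activity's source could add the empty GUID column like this:
"source": {
    "type": "OracleSource",
    "additionalColumns": [
        { "name": "guidcol", "value": "" }
    ]
}
In the second copy (CSV -> Dynamics), the type of guidcol is then set to Guid in the mapping, as described above.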
Please let us know how it goes.

Related

How to update/insert a new record with an updated value from a staging table in Azure Data Explorer

I have a requirement where data is ingested from Azure IoT Hub. Sample incoming data:
{
    "message": {
        "deviceId": "abc-123",
        "timestamp": "2022-05-08T00:00:00+00:00",
        "kWh": 234.2
    }
}
I have the same column mapping in the Azure Data Explorer table. kWh always comes in as a cumulative value, not the delta between two timestamps. Now I need another table that holds the difference between the last inserted kWh value and the current kWh.
It would be a great help if anyone has a suggestion or solution here.
I'm able to calculate the difference on the fly using prev(), but I need to update the table while inserting the data into it.
As far as I know, there is no way to perform data manipulation on the fly while ingesting Azure IoT data into Azure Data Explorer through JSON mapping. However, I found a couple of approaches you can take to get the calculations you need. Both approaches involve creating a secondary table to store the calculated data.
Approach 1
This is the closest approach I found to on-the-fly data manipulation. For this to work, you would need to create a function that calculates the difference of the kWh field for the latest entry. Once the function is created, you can bind it to the secondary (target) table using an update policy so that it triggers for every new entry on your source table.
Refer to the following resource, Ingest JSON records, which explains with an example how to create a function and bind it to the target table.
Note that you would have to create your own custom function that calculates the difference in kWh.
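As a minimal sketch (table, column, and function names here are assumptions), the delta function and the update policy binding could look like the following. Note that inside an update policy the query only sees the newly ingested records, so prev() works only within a single ingestion batch; a production version would need to look up the last stored reading per device:
.create-or-alter function CalcKwhDelta() {
    SourceTable
    | serialize
    | extend kWhDelta = kWh - prev(kWh)
    | project deviceId, timestamp, kWh, kWhDelta
}
.alter table DeltaTable policy update
@'[{"IsEnabled": true, "Source": "SourceTable", "Query": "CalcKwhDelta()", "IsTransactional": false}]'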
Approach 2
If you do not need real-time data manipulation and your business can tolerate a 1-minute delay, you can create a query similar to the one below, which calculates the temperature difference from the source table (jsondata in my scenario) and writes it to the target table (jsondiffdata):
.set-or-append jsondiffdata <| jsondata | serialize
| extend temperature = temperature - prev(temperature,1), humidity, timesent
Refer to the following resource for more information on how to Ingest from query. You can use Microsoft Power Automate to schedule this query to trigger every minute.
Please be cautious if you decide to go with the second approach, as it uses serialization, which may prevent query parallelism in many scenarios. Please review this resource on window functions and identify a query approach that is better optimized for your business needs.

SQL Server: big tables or store data in an XML field

I have a .NET solution with a big form with many fields the customer needs to fill in, like a multi-step form to collect all the data we need.
So I was wondering which is better (from a performance and design standpoint): a traditional big table with many fields, or storing the data in a single field of XML type.
Example of one "TraditionalTable":
RecordId | CustomerId | Data 1     | Data 2 ... | Data N
1        | 120        | 01/01/1980 | abcd ...   | 123
2        | 20         | 04/02/2004 | fgh ...    | 230
3        | 10         | 05/01/1995 | xyz ...    | 135
Example of one "DataWithXMLField":
RecordId | CustomerId | FormData
1        | 120        | <data><customerdetails><borndate>01/01/1980</borndate></customerdetails><financialinfo>...
I've done many systems like this and prefer to keep the data as XML (often it's a serialized object). I find this to be efficient at runtime and at design time. (See item below about binary attachments).
The following are some suggestions based on what I've done in the past. Obviously it's not a one-size-fits-all hammer...
Often data is "collected" by a user and "approved" by an administrator. While collecting the data, it's stored as XML. When approved, the XML is shredded and placed into "normal" relational tables/fields.
Often this data has been collected through multiple pages. Storing as XML allows collecting data in a way that is logical to the user but doesn't fit the final data structure very well.
If a form is abandoned (not completed or canceled) it's easy to delete a single row.
Things to keep in mind (a rough table sketch pulling these together follows the list):
Some data is related to workflow and is separate from the data being collected. For example, a field for "Form Status" may go from "In Progress" to "Submitted" to "Approved". This type of data should be kept as regular columns.
Store Binary Data separately. If your form includes submitting binary data (like uploading a PDF) I like to generate a GUID on the front end. Store that GUID in the XML and then save the binary data separately using the GUID. Possibly on disk or in a separate "attachments" table.
Define a column for a "version number" of the XML. This way you can programmatically identify what is in the XML. This will help in the future when you need to make changes to the XML.
Define a column for a "Summary" that is a short, human-friendly version of the XML. For example, if your XML contains information for registering for summer camps, your "XML Summary" might contain the text: "SMITH, JOHN, Camp White Pine 2021". This text is calculated on the front end. It can then be used for displaying rows of data without having to poke into the XML. For example, an administrative page may exist that lists applications requiring approval.
Define a column to indicate whether the XML meets all your requirements. You don't want to validate XML in the database (it's often hard, and likely duplicates the UI). Your business layer can apply business rules (validation) to the XML (or classes) and store an indicator in the database that all business rules are met.
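Pulling the points above together, a minimal sketch of such a table might look like this (all names and sizes are illustrative, not a prescription):
CREATE TABLE dbo.FormSubmission
(
    RecordId    int IDENTITY(1,1) PRIMARY KEY,
    CustomerId  int           NOT NULL,
    FormStatus  varchar(20)   NOT NULL,            -- workflow state ("In Progress", "Submitted", "Approved") kept as a regular column
    XmlVersion  int           NOT NULL,            -- version number of the XML layout
    XmlSummary  nvarchar(200) NULL,                -- short human-friendly summary calculated on the front end
    IsValid     bit           NOT NULL DEFAULT 0,  -- set by the business layer once all business rules pass
    FormData    xml           NULL                 -- the collected form data (often a serialized object)
);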

Data Lake Analytics: Custom Outputter to write to different files?

I am trying to write a custom outputter for U-SQL that writes rows to individual files based on the data in one column.
For example, if the column has a date "2016-01-01", it writes that row to a file with that name, and the next row to a file named after the value in the same column.
I am aiming to do this by using the Data Lake Store SDK within the outputter, which creates a client and uses the SDK functions to write to individual files.
Is this a viable and possible solution?
I have seen that the function to be overridden for outputters is
public override void Output (IRow row, IUnstructuredWriter output)
in which the IUnstructuredWriter is cast to a StreamWriter (I saw one such example). So I assume this IUnstructuredWriter is passed to the function by the U-SQL script, which doesn't leave me any control over what is passed here; it also remains constant for all rows and can't change.
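For context, here is a minimal sketch of such an outputter's code-behind (column names are hypothetical); it illustrates the point above: every call receives the same writer, wrapping the single stream allocated for the OUTPUT statement's target file:
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class SingleFileOutputter : IOutputter
{
    // Called once per row; 'output' always wraps the one stream U-SQL allocated
    // for the OUTPUT statement, so there is no supported way to open other files here.
    public override void Output(IRow row, IUnstructuredWriter output)
    {
        var line = string.Format("{0},{1}\n",
            row.Get<string>("date"),     // hypothetical column
            row.Get<string>("value"));   // hypothetical column
        var bytes = Encoding.UTF8.GetBytes(line);
        output.BaseStream.Write(bytes, 0, bytes.Length);
    }
}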
This is currently not possible, but we are working on this functionality in response to this frequent customer request. For now, please add your vote to the request here: https://feedback.azure.com/forums/327234-data-lake/suggestions/10550388-support-dynamic-output-file-names-in-adla
UPDATE (Spring 2018): This feature is now in private preview. Please contact us via email (usql at microsoft dot com) if you want to try it out.

How to implement a key lookup for a generated keys table in Pentaho Kettle

I just started to use Pentaho Kettle for integration. Seems great so far, quite intuitive compared to Talend, which I was also investigating.
I am trying to migrate some customers without their keys. So I have their email addresses.
The customer may already exist in the database, so what I need to do is:
If the customer exists, add its id to the imported field and continue.
But if the customer doesn't exist I need to get the next Hibernate key from the table Hibernate_Sequences and set it as the id.
But I don't want to always allocate a key, so I want to conditionally execute a step to allocate the next key.
So what I want to do is execute the DB procedure in the flow (which allocates the next key and returns it) only if there is no value in id from the "lookup id" step.
Is this possible?
Just posting my updated flow: the answer was to use a Filter rows component, which splits the data on true/false. I really had trouble getting the id out of the database stored procedure because of a bug, so I had to use a decimal and then convert it back to an integer (which I also couldn't figure out how to do, so I used a JavaScript component).
Yes it is. As per the official documentation (I kept only the relevant part), "Lookup values are added as new fields onto the stream". So you just need to add a "Filter rows" step from the Flow section and check the "id" field, which should have been added in the "Existing Id Lookup" step.

Redgate SQL Data Generator -> Reviewing sqlgen project -> What does "Same as mapped data" mean?

I am reviewing a coworker's sqlgen project and I am unable to figure out what this means in the table generation settings.
Specify number of rows by: "Same as mapped data"
My coworker has this selected on each table. I just need to know what is meant by this; I have looked through the documentation and been unable to find a definition for it.
I am on version 2 at the moment. Probably not the best question, but I need an answer: he is gone for a long period of time and our data is not working correctly with this tool.
The "Same as mapped data" option is only available when you're using an existing table or view as a data source - it just means that the generator will insert all the rows from the source table or view. The other options are:
Numeric value - a set number of rows
Proportion of table - a proportion of the source table/view
Generation time - as much data as the tool can generate in a set time
There's a little more about using an existing table/view as a data source here on the website, but it doesn't have much else useful in it.