Azure data factory - multiple primary keys in source excel to be inserted in SQL database [closed] - azure-sql-database

I am working on a pipeline where I am using Excel as the source. The data has a primary key, say Id, which repeats multiple times in the Excel file.
Now, when I insert it into a SQL database, it fails with the error:
java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object 'dbo.xyz'. The duplicate key value is XXXX.
How can I take care of this scenario using mapping data flows in ADF?
I am already using a mapping data flow here to handle the other transformations.
Example of such data coming from the Excel source:
| ID | Name         | PhoneNo  |
|----|--------------|----------|
| 1  | John Doe     | 11110000 |
| 1  | John Doe     | 88881111 |
| 2  | Harry Potter | 88999000 |
| 2  | Harry Potter | 00001112 |
| 3  | abc xyz      | 77771111 |
I need to save the top 1 ID and Name (and there are more columns) in one table, while ID and Phone No will be saved in another.

You can use the aggregate transformation to remove duplicate values from the source.
Source:
Add the sample Excel source with duplicate values in the ID and Name columns.
Aggregate transformation:
Under the group by property, add the list of columns by which duplicate rows are identified (here, ID and Name).
Under the aggregates property, add the aggregate column. Here, we take the first value of the PhoneNo column from each group of duplicate rows.
Expression: first(PhoneNo)
Aggregate output:
Sink1:
Connect the aggregate output to sink1 to pass the ID and Name columns to the first table.
Sink2:
Add another sink after the aggregate transformation to pass ID and PhoneNo to the second table.
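For reference, the deduplication the aggregate transformation performs is equivalent to the following T-SQL, shown only to illustrate the logic; SourceExcel is a hypothetical staging table holding the Excel rows, and MIN() stands in for the data flow first() function:
-- Keep one row per ID/Name pair, picking one phone number per group.
SELECT ID, Name, MIN(PhoneNo) AS PhoneNo
FROM SourceExcel
GROUP BY ID, Name;
Sink1 then receives the ID and Name columns, and sink2 receives ID and PhoneNo.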

Related

Insert distinct values from another table's column into a new table in BigQuery

I am trying to write a BigQuery query to get the distinct values from one table, say ABC:
select distinct name from dataset.abc
which returns:
1
2
3
4
Now I want to insert these values into another table, say XYZ, which also contains a name column and another column, company.
Note: the company column can be duplicated, as I want all 4 rows against every company to be inserted into table XYZ.
I need a BigQuery query to do this dynamically, instead of updating every dataset manually each time.
P.S. Sorry if my question is not up to standards; this is the first time I am posting on Stack Overflow.
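Something like this sketch is what I am after, if it works; dataset.companies is a made-up table holding the company list, and the column names are my assumptions:
-- Insert each distinct name once per company (all table names are assumptions).
INSERT INTO dataset.xyz (name, company)
SELECT DISTINCT abc.name, c.company
FROM dataset.abc AS abc
CROSS JOIN (SELECT DISTINCT company FROM dataset.companies) AS c;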

BigQuery Create Table Query from Google Sheet with Variable item string field into Repeated Field

I hope I explain this adequately.
I have a series of Google Sheets with data from an Airtable database. Several of the fields are stringified arrays with recordIds pointing to another table.
These fields can have between 0 and n comma-separated values.
I run a create/overwrite table SELECT statement to create native BigQuery tables for reporting. This works great.
Now I need to add the recordIds to a Repeated field.
I've manually written to a repeated field using:
INSERT INTO `robotic-vista-339622.Insurly_dataset.zzPOLICYTEST` (policyID, locations, carrier)
VALUES ('12334556',[STRUCT('recordId1'),STRUCT('recordId2')], 'name of policy');
However, I need to know how to do this using a SELECT statement rather than an INSERT. I also need to know how to handle an unknown number of recordIds retrieved from Airtable: one record could have none and another could have 10 or more.
Any given sheet will look like the following, where "locations" contains the recordIds I want to add to a repeated field.
SHEETNAME: POLICIES
|policyId |carrier | locations |
|-----------|-----------|---------------------------------|
|recrTkk |Workman's | |
|rec45Yui |Workman's |recL45x32,recQz70,recPrjE3x |
|recQb17y |ABC Co. |rec5yUlt,recIrW34 |
In the above, the first row/record has no location Ids, and then there are three and two on the subsequent rows/records.
Any help is appreciated.
Thanks.
I'm unsure if answering my own question is the correct way to show that it was solved... but here is what it took.
I created a native table in BigQuery; the field for locations is a STRING with mode REPEATED.
Then I just run an overwrite-table SELECT statement:
SELECT recordId, Name, Amount, SPLIT(locations) AS locations FROM `projectid.datasetid.googlesheetsdatatable`;
Tested, and I can run linked queries on the locations with UNNEST.
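For example, a downstream query over the repeated field can then be written like this; the table name is a stand-in for whatever table the overwrite SELECT above writes to:
-- Flatten the repeated locations field, one output row per recordId/location pair.
SELECT recordId, Name, location
FROM `projectid.datasetid.destinationtable`,
UNNEST(locations) AS location;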

Excluding data pairs from a query based on a table?

I have a massive and messy database of facilities where there are many duplicates. Addresses have been entered in such a haphazard way that I will be making many queries to identify possible duplicates. My objective is for each query to identify the possible duplicates, and then a person actually goes through the list and marks each pairing as either "not a duplicate" or "possible duplicate."
When someone marks a facility pair as not a duplicate, I want to record that pair in a table so that when one of the queries would otherwise return that pairing, it is instead excluded. I am at a loss for how to do this. I'm currently using MS Access for SQL queries, and I have rudimentary Visual Basic knowledge.
Sample of how it should work
Query 1 is run to find duplicates based on city and company name. It brings back that facilities 1 and 2, 3 and 4, and 5 and 6 are possible duplicates. The first two pairings are duplicates I need to go fix, but 5 and 6 are indeed separate facilities. I click to record that facilities 5 and 6 are not duplicates, which records the pair in a table. When query 1 is run again, it does not return 5 and 6 as possible duplicates.
For reference, the address duplicates look something like this, which is why there need to be multiple queries:
Frank's Garage, 123 2nd St
Frank's Garage LLC, LLC, 123 Second st
Frank's Garage and muffler, 123 2nd Street
Frank's, 12 2nd st
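To illustrate what I am imagining, something like this sketch; Facilities, NotDuplicates, and all field names are made up, and pairs would be stored with the lower ID in FacilityA so the comparison below matches:
-- Return candidate duplicate pairs, skipping pairs already marked as not duplicates.
SELECT f1.FacilityID, f2.FacilityID
FROM Facilities AS f1, Facilities AS f2
WHERE f1.City = f2.City
AND f1.CompanyName = f2.CompanyName
AND f1.FacilityID < f2.FacilityID
AND NOT EXISTS
    (SELECT 1 FROM NotDuplicates AS nd
     WHERE nd.FacilityA = f1.FacilityID AND nd.FacilityB = f2.FacilityID);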
The only way I know to fix this is to create a master table of company names and associate that table's PK with records in the original table. It will be a difficult and tedious process to review records, eliminate duplicates from the master, and associate the remaining PK of each duplicate group with the original records (as you have discovered).
Create a master table of DISTINCT company and address data from the original table. Include an AutoNumber field to generate a key. Join the tables on the company/address fields and UPDATE a field in the original table with this key. Have another field in the original table to receive a replacement foreign key.
Have a number field (ReplacementPK) in the master table. Sort and review the records and enter the key you want to retain for each company/address duplicate group. Build a query joining the tables on the original key fields, and update the NewFK field in the original table with the selected ReplacementPK from the master.
When all looks good:
Delete company and address and original FK fields from original table.
Delete records from master where PK does not match ReplacementPK.
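As a sketch, the first two steps look like this in Access SQL; OriginalTable, Master, and the field names are all hypothetical, and you would add the AutoNumber ID and the ReplacementPK field to Master after creating it:
-- Step 1: build the master list of distinct company/address rows.
SELECT DISTINCT CompanyName, Address INTO Master
FROM OriginalTable;
-- Step 2: stamp each original row with its master key.
UPDATE OriginalTable INNER JOIN Master
ON OriginalTable.CompanyName = Master.CompanyName
AND OriginalTable.Address = Master.Address
SET OriginalTable.MasterFK = Master.ID;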

SQL data sorting by column [duplicate]

This question already has answers here:
How does sql server sort your data?
(4 answers)
I am facing an issue I can't handle yet.
Here's the deal: I am working on a program which should monitor employees' working hours. So far, I have created a SQL Server table called TablicaSQL with 4 columns:
Id, Ime (Name), Datum (date), BrojSati (WorkingHours)
It saves data according to the time of saving.
Example: if I enter today that Kristijan worked 4 hours on 2017-11-03, and tomorrow I save that Kristijan worked 4 hours on 2017-11-01, the rows will show in the order they were saved, which in this case puts 2017-11-03 first.
So my question is: how can I sort my data according to the column Datum (date), NOT by the date of saving?
Also, I am not looking for a query which says something like this:
SELECT *
FROM..
ORDER BY...ASC/DESC
I need some kind of "permanently ASC/DESC" query.
Here is the screenshot of my table
There isn't a permanent order on a database table. Tables are unordered data sets: the data isn't ordered by its date of creation, it is just returned in the order it is read from storage, and that can change if the database engine's optimizer finds a better way to read the data (multiple partitions, clusters, etc.).
If you want the data returned in a specific order, you MUST include ORDER BY.
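For the table in the question, that simply means adding the clause to every query that reads the data:
-- Sort by the work date, not by insertion order.
SELECT Id, Ime, Datum, BrojSati
FROM TablicaSQL
ORDER BY Datum ASC;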

Insert one of a pair of values into SQL table based on matching value of other

I'm having a hard time searching for this, and if this question has already been asked and answered I will gladly use that information.
I have a table that is basically complete, but I need to change the values of one column using data I have in a text file. The text file has paired values: one matches an existing value in another column of the table, and the other is the new value I need to use to replace the old value in the table.
So for example I have stuff like this in the table:
| Name       | Address     |
|------------|-------------|
| John Smith | 123 Main St |
| Jim Brown  | 123 Main St |
| Bob Jones  | 123 Main St |
And in the text file I have:
John Smith;123 Real Address
Jim Brown;456 Another Real Address
Bob Jones;789 Yet Another Address
I want to be able to match on the name and insert each matching address into the table. Can I do this in one big query? Perhaps an update with a join to a selected set of values, like this:
UPDATE MyTable
SET Name = (SELECT **all my values here**)
WHERE Name = **something**
Or maybe it would be possible to export the entire table, merge the values in the text file with a script, and then reinsert the new table values? I can't figure out a convenient way to do this, though.
Use SSIS or any tool you prefer to import the text file into a separate table, and then UPDATE your existing table with a JOIN to the new table.
There are already plenty of questions on the site that answer the question of how to UPDATE a table with a JOIN to another table. Here is a good one:
How can I do an UPDATE statement with JOIN in SQL?
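A minimal sketch in SQL Server syntax, assuming the text file has been imported into a hypothetical staging table NewAddresses(Name, Address):
-- Overwrite each address with the matching value from the staging table.
UPDATE t
SET t.Address = n.Address
FROM MyTable AS t
INNER JOIN NewAddresses AS n
ON n.Name = t.Name;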