I am looking for a way to merge three separate datasets (.csv format) into one in Azure Synapse and then store the result as a new .csv in Azure Blob Storage. I am using the Union data flow based on this tutorial: https://www.youtube.com/watch?v=vFCNbHqWct8
Generally speaking, the extraction and saving of the new file works. However, when merging the files I receive three times the number of rows as in the source datasets. Each source dataset has 36 entries, and CustomerID ranges from 1-36 in each dataset.
Dataset 1 has 2 columns: CustomerID, loyalty_level
Dataset 2 has 3 columns: CustomerID, name, email
Dataset 3 has 2 columns: CustomerID, salestotal
When I run it, I get a dataset with 108 rows instead of the expected 36. Where is my mistake? Am I approaching the process incorrectly?
You are getting 108 rows because the union transformation stacks the 3 separate datasets on top of each other into 1. If you watch the video on the union transformation documentation page, it describes this behavior.
To get your desired results you need to use the join transformation. With CustomerID as your join condition, this will combine the datasets side by side, keeping your row count at 36.
One thing to watch out for is the type of join you choose. If you have customers in one file that are not in another, you can drop records. This post describes the different types of joins very well; I suggest you get a firm understanding of them.
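The difference between the two transformations is easy to reproduce outside Synapse. Below is a minimal sketch using SQLite in Python; the table and column names come from the question, and the two-row samples stand in for the 36-row files. A union stacks the sources vertically (the 108-row symptom), while a join widens each row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ds1 (CustomerID INTEGER, loyalty_level TEXT)")
cur.execute("CREATE TABLE ds2 (CustomerID INTEGER, name TEXT, email TEXT)")
cur.execute("CREATE TABLE ds3 (CustomerID INTEGER, salestotal REAL)")
cur.executemany("INSERT INTO ds1 VALUES (?, ?)", [(1, "gold"), (2, "silver")])
cur.executemany("INSERT INTO ds2 VALUES (?, ?, ?)",
                [(1, "Ann", "ann@example.com"), (2, "Bob", "bob@example.com")])
cur.executemany("INSERT INTO ds3 VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

# A union stacks rows: 3 sources x 2 rows = 6 rows (3x the source count).
union_rows = cur.execute(
    "SELECT CustomerID FROM ds1 UNION ALL "
    "SELECT CustomerID FROM ds2 UNION ALL "
    "SELECT CustomerID FROM ds3").fetchall()
print(len(union_rows))  # 6

# A join widens rows instead: one row per CustomerID with all columns.
joined = cur.execute(
    "SELECT d1.CustomerID, d1.loyalty_level, d2.name, d2.email, d3.salestotal "
    "FROM ds1 d1 "
    "JOIN ds2 d2 ON d2.CustomerID = d1.CustomerID "
    "JOIN ds3 d3 ON d3.CustomerID = d1.CustomerID").fetchall()
print(len(joined))  # 2
```

The same arithmetic explains the question: 3 files x 36 rows unioned gives 108 rows, while an inner join on CustomerID gives 36.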
Related
I have two data sources that are loaded into Azure Synapse. Both raw data sources contain an 'Apple' table.
I merge these into a single 'Apple' table in my Enriched data store.
SELECT * FROM datasource1.apple JOIN datasource2.apple on datasource1.apple.id = datasource2.apple.id
However, both data sources also contain a one-to-many relation, AppleColours.
Please could someone help me understand the correct approach to creating a single AppleColours table in my enriched zone?
You need data from both sources when you want to merge them. JOIN (INNER JOIN) will bring only the apple.id values that exist in both datasource1 and datasource2.
If you need apples that exist in only one of the sources as well, you should try a FULL OUTER JOIN.
For the AppleColours one-to-many relation there are 2 methods:
Method 1: put the colour directly in the Apple table; in this case there is no need for a separate AppleColours table.
Apple
ID| Color
1 | red
2 | green
To get data into the Color column, make another JOIN, this time between the Apple table and AppleColours on the colour ID.
Method 2: create a separate AppleColours table with ID and colour. Into this table, import both AppleColours tables from the two datasources using a UNION. Then add a column in the Apple table named AppleColorId which holds the ids from AppleColours.
If you want an Apple table that has all the data and doesn't need any joins to determine the apple colour, use method 1.
If you want a 'slim' Apple table with minimal data inside, use method 2. In this case, to get the apple colour you have to make an extra JOIN (INNER JOIN) to the AppleColours table.
Maybe include a subquery doing a UNION (so you get only one copy of each colour). But your problem will still be that, since each table has its own relationship with colours and you are joining both, the same item can give you two different colours. My proposal: make a switch to choose only one; if the first is null, choose the second, and if the second is also null, a default value (some colour code). Other options are to use the lower id (because it was created earlier) or the higher id (because it was the last).
Something like that
SELECT a1.*, a2.*, Q.Name, Q.Value
FROM datasource1.apple a1
JOIN datasource2.apple a2 ON a1.id = a2.id
JOIN
  (SELECT ColourID, Name, Value FROM datasource1.AppleColours
   UNION
   SELECT ColourID, Name, Value FROM datasource2.AppleColours) Q
  ON Q.ColourID = COALESCE(a1.ColourID, a2.ColourID, {DefaultColor})
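The UNION-plus-COALESCE pattern can be sanity-checked locally. The sketch below uses SQLite with made-up table names (ds1_apple, ds2_apple, ds1_colours, ds2_colours) standing in for the two datasources, and 0 standing in for the {DefaultColor} placeholder:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical stand-ins for datasource1.apple / datasource2.apple etc.
cur.execute("CREATE TABLE ds1_apple (id INTEGER, ColourID INTEGER)")
cur.execute("CREATE TABLE ds2_apple (id INTEGER, ColourID INTEGER)")
cur.execute("CREATE TABLE ds1_colours (ColourID INTEGER, Name TEXT, Value TEXT)")
cur.execute("CREATE TABLE ds2_colours (ColourID INTEGER, Name TEXT, Value TEXT)")
cur.executemany("INSERT INTO ds1_apple VALUES (?, ?)", [(1, 10), (2, None)])
cur.executemany("INSERT INTO ds2_apple VALUES (?, ?)", [(1, 10), (2, 20)])
cur.execute("INSERT INTO ds1_colours VALUES (10, 'red', '#f00')")
cur.execute("INSERT INTO ds2_colours VALUES (20, 'green', '#0f0')")

# UNION de-duplicates the colour rows; COALESCE falls back to the second
# source's ColourID (and then to a default, 0 here) when the first is NULL.
rows = cur.execute("""
    SELECT a1.id, q.Name, q.Value
    FROM ds1_apple a1
    JOIN ds2_apple a2 ON a2.id = a1.id
    JOIN (SELECT ColourID, Name, Value FROM ds1_colours
          UNION
          SELECT ColourID, Name, Value FROM ds2_colours) q
      ON q.ColourID = COALESCE(a1.ColourID, a2.ColourID, 0)
    ORDER BY a1.id
""").fetchall()
print(rows)  # [(1, 'red', '#f00'), (2, 'green', '#0f0')]
```

Apple 2 has no colour in the first source, so the COALESCE picks up the second source's ColourID, which is exactly the "if first is null, choose second" switch described above.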
Are the two data sources supposed to represent slices of the same real population?
That is, if a full join of datasource1.apple with datasource2.apple on apple.id is logically consistent, then a full join of AppleColours between the 2 datasources should be logically correct as well.
The one-to-many then preserves the information from the two datasets and remains correctly one-to-many. If this join produces any relationship cardinality violations, those weren't the right cardinalities to begin with.
(btw, should be a full join)
I have a table called "parts" which stores information on electrical connectors, including contacts, backshells, etc. All parts that belong to an assembly have a value in a column called "assemblyID". There is also a column called "partDefID"; in this column, connectors have a value of 2 and contacts a value of 3. I need to get rows for all connectors that have a unique assemblyID. It's easy to get rows that represent connectors just by selecting rows with a partDefID of 2, but this will return multiple connector rows that may be part of the same assembly. I need only one connector row per assemblyID. How can I do this?
What I am trying to get is just ONE of the connector rows per assembly; any one of them would be fine, as they are all part of the same assembly.
[update]
It seems my question was not well formed, and the use of images is frowned upon. Inserting a text version of my table looked really horrible, though! I'll try to do better. Yes, I'm a newbie at both SQL and this website.
If you want just one "connector" row per assembly ID, you can filter with a subquery. Assuming that PartRefID is a unique key:
select *
from parts as p
where p.[PartRefID] = (
    select max(p1.[PartRefID])
    from parts as p1
    where p1.[AssemblyID] = p.[AssemblyID] and p1.[PartDefID] = 2
)
I don't know if this is what you are looking for
SELECT assemblyid, COUNT(partdefid)
FROM parts
WHERE partdefid = 2
GROUP BY partdefid, assemblyid
HAVING COUNT(partdefid) = 1
I have two unrelated tables (Table A and Table B) that I would like to join to create a unique list of pairings of the two. So, each row in Table A will pair with each row in Table B creating a list of unique pairings between the two tables.
My ideas of what can be done:
I can either do this in the query (SQL) by creating one dataset and having two fields outputted (each row equaling a unique pairing).
Or by creating two different datasets (one for each table) and have a data region embedded within a different data region; each data region pulling from a different dataset (of the two created for each table).
I have tried implementing the second method but it would not allow me to select a different dataset for the embedded data region from the parent data region.
The first method I have not tried but do not understand how or even if it is possible through the SQL language.
Any help or guidance in this matter would be greatly appreciated!
The first is called a cross join:
select t1.*, t2.*
from t1 cross join t2;
Whether you should do this in the application or in the database is open to question. It depends on the size of the tables and the bandwidth to the database -- there is an overhead to pulling rows from a database.
If each table has 2 rows, this is a non-issue. If each table has 100 rows, then you would be pulling 10,000 rows from the database and it might be faster to pull 2*100 rows and do the looping in the application.
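To see the row counts concretely, here is a small sketch of the cross join using SQLite (t1 and t2 are hypothetical single-column tables with 3 and 4 rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (a TEXT)")
cur.execute("CREATE TABLE t2 (b INTEGER)")
cur.executemany("INSERT INTO t1 VALUES (?)", [("x",), ("y",), ("z",)])
cur.executemany("INSERT INTO t2 VALUES (?)", [(1,), (2,), (3,), (4,)])

# Every row of t1 paired with every row of t2: 3 * 4 = 12 unique pairings.
pairs = cur.execute("SELECT t1.a, t2.b FROM t1 CROSS JOIN t2").fetchall()
print(len(pairs))  # 12
```

The multiplicative growth is exactly why the bandwidth question above matters: the result size is the product of the two table sizes, not their sum.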
I'm planning to build a new ads system and we are considering using Google BigQuery.
I'll quickly describe my data flow:
Each user will be able to create multiple ads (1 user, N ads).
I would like to store the ad impressions, and I thought of 2 options.
Option 1: create a single table for impressions, for example a table named Impressions with fields (userid, adsid, datetime, metadata fields...).
In this option, all my impressions will be stored in a single table.
Main pros: I'll be able to run big data queries quite easily.
Main cons: the table will be huge, and with multiple queries I'll end up paying too much :(
Option 2 is to create a table per ad.
For example, ad id 1 will create Impression_1 with fields (datetime, metadata fields).
Pros: queries are cheaper, and each table is smaller.
Cons: to do a big data query I'll sometimes have to create a union, and things will get complex.
I wonder what your thoughts are regarding this?
In BigQuery it's easy to do this, because you can create a table per day, and you have the possibility to query only those tables.
And you have Table wildcard functions, which are a cost-effective way to query data from a specific set of tables. When you use a table wildcard function, BigQuery only accesses and charges you for tables that match the wildcard. Table wildcard functions are specified in the query's FROM clause.
Assuming you have some tables like:
mydata.people20140325
mydata.people20140326
mydata.people20140327
You can query like:
SELECT
  name
FROM
  (TABLE_DATE_RANGE(mydata.people,
                    TIMESTAMP('2014-03-25'),
                    TIMESTAMP('2014-03-27')))
WHERE
  age >= 35
Also there are Table Decorators:
Table decorators support relative and absolute <time> values. Relative values are indicated by a negative number, and absolute values are indicated by a positive number.
To get a snapshot of the table at one hour ago:
SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle#-3600000]
There is also TABLE_QUERY, which you can use for more complex queries.
I have two Access DB sources, both with the same columns, representing data from different time periods. The files have two identity columns, UPC and StoreNbr. The resulting table in the DB being inserted into has the two identity columns plus the data columns from each file "concatenated" into one table, as seen below:
File 1 Columns:
UPC StoreNbr data1 data2 data3
File 2 Columns:
UPC StoreNbr data1 data2 data3
DB Table Columns:
UPC StoreNbr data1(File 1) data2(File 1) data3(File 1) data1(File 2) data2(File 2) data3(File 2)
I am new to SSIS and have been confronted with the task of merging these two sources into one table to insert into the final DB table. Can I join the two tables on the identity columns and then insert the data as one result set? FYI, this was originally imported as one file reflecting the layout of the DB table, but the client had the bright idea to split it into two files. Any direction is very much appreciated, thanks.
It should look something like this.
The sources must be sorted by the join keys, in your case UPC and StoreNbr.
In the merge join editor you can select which columns from the different files that will continue on the flow. You can also give them an alias in order to differentiate two similarly named columns.
After that you can just dump it all back into your DB. Cheers!
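The shape of the Merge Join output can be mimicked with plain SQL to check it before building the package. Here is a sketch using SQLite, with made-up sample data and `_file1`/`_file2` suffixes playing the role of the aliases you would set in the Merge Join editor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for t in ("file1", "file2"):
    cur.execute(f"CREATE TABLE {t} (UPC TEXT, StoreNbr INTEGER, "
                "data1 REAL, data2 REAL, data3 REAL)")
cur.execute("INSERT INTO file1 VALUES ('0001', 7, 1.0, 2.0, 3.0)")
cur.execute("INSERT INTO file2 VALUES ('0001', 7, 4.0, 5.0, 6.0)")

# Join on both identity columns; alias the data columns to tell them apart.
row = cur.execute("""
    SELECT f1.UPC, f1.StoreNbr,
           f1.data1 AS data1_file1, f1.data2 AS data2_file1, f1.data3 AS data3_file1,
           f2.data1 AS data1_file2, f2.data2 AS data2_file2, f2.data3 AS data3_file2
    FROM file1 f1
    JOIN file2 f2 ON f2.UPC = f1.UPC AND f2.StoreNbr = f1.StoreNbr
""").fetchone()
print(row)  # ('0001', 7, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
```

One row per (UPC, StoreNbr) pair, with the three data columns from each file side by side, matching the DB table layout in the question.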
Depending on whether an item can exist in one Access source and not in the other, an alternative to TsSkTo's implementation would be to route it as:
[Access Source 1]
|
[Lookup Transformation to Access Source 2]
|
[OLE DB Destination]