SSIS 2005: Can Merge Join accommodate one-to-many joins?

I have a Data Flow Task that does some script component tasks, sorts, then does a Merge Join. I'd like to have the Merge Join do the join as a 1-many. If I do an Inner Join, I get too few records:
If I do a Left Outer Join, I get WAY too many records:
I'm looking for the Goldilocks version of 'Just Right' (which would be 39240 records).

You can add a Conditional Split after the Left Outer Join version of your Merge Join, with a condition that catches the non-matching rows, such as
ISNULL(tmpAddressColumn)
and send the matching rows (the default output) to your destination.
If you still don't get the correct number, you'll need to check the merge join conditions and check if there are duplicate IDs in each source.
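As a rough illustration of the two checks above, here is a minimal Python sketch that uses the standard sqlite3 module as a stand-in for the SSIS data flow. The table names (man, addr), the columns and the sample rows are hypothetical; only the idea (a left join, dropping rows whose right side is NULL, and looking for duplicate join keys) comes from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE man (id INTEGER, name TEXT);          -- hypothetical left input
    CREATE TABLE addr (id INTEGER, address TEXT);      -- hypothetical right input
    INSERT INTO man VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO addr VALUES (1, '1 Main St'), (1, '9 Oak Ave'), (2, '5 Elm Rd');
""")

# Left outer join: every left row is kept; unmatched rows get NULL on the right.
left_join_rows = conn.execute("""
    SELECT m.id, m.name, a.address
    FROM man m LEFT JOIN addr a ON a.id = m.id
    ORDER BY m.id, a.address
""").fetchall()

# Rough equivalent of the Conditional Split: drop rows where the right side is NULL.
matched_rows = [row for row in left_join_rows if row[2] is not None]

# Join keys that occur more than once in an input multiply the output rows.
duplicate_keys = conn.execute("""
    SELECT id, COUNT(*) FROM addr GROUP BY id HAVING COUNT(*) > 1
""").fetchall()

print(len(left_join_rows), len(matched_rows), duplicate_keys)
```

In this toy data the duplicate key 1 in addr is what pushes the join output above the number of rows in man, which is the kind of inflation the duplicate-ID check is meant to reveal.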

The number of rows shouldn't be what you're using to gauge if you're using the correct options for the Merge Join. The resulting data set should be the driving factor. Do the results look correct in the tmpManAddress table?
For development you might want to push the output of the script components to tables so you can see what data you're starting with. This will allow you to work out which type of join, and on which columns, give you the results you want.

Related

Left join vs EXCEPT

I am trying to create a stored procedure that calculates the difference between last week's version of a large table and this week's version (the current data).
Both LEFT JOIN and EXCEPT will eventually give the same results. However, I would like to know whether there is a preferred approach in terms of performance.
LEFT JOIN and EXCEPT do not produce the same results.
EXCEPT is a set operator that eliminates duplicates. LEFT JOIN is a type of join that can actually produce duplicates. It is not unusual in SQL for two different constructs to produce the same result set for a given set of input data.
I would suggest that you use the one that best fits your use-case. If both work, test which one is faster and use that.
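To make the difference concrete, here is a small sketch using Python's sqlite3 module. The tables this_week and last_week and their rows are made up for illustration; the point is only that EXCEPT collapses duplicate rows while a LEFT JOIN written as an anti-join keeps them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE this_week (id INTEGER, val TEXT);   -- hypothetical current data
    CREATE TABLE last_week (id INTEGER, val TEXT);   -- hypothetical previous snapshot
    INSERT INTO this_week VALUES (1, 'a'), (2, 'b'), (2, 'b'), (3, 'c');
    INSERT INTO last_week VALUES (1, 'a');
""")

# EXCEPT is a set operator: the duplicated (2, 'b') row comes back once.
except_rows = conn.execute("""
    SELECT id, val FROM this_week
    EXCEPT
    SELECT id, val FROM last_week
    ORDER BY id
""").fetchall()

# LEFT JOIN used as an anti-join: unmatched rows keep their duplicates.
left_join_rows = conn.execute("""
    SELECT t.id, t.val
    FROM this_week t
    LEFT JOIN last_week l ON l.id = t.id AND l.val = t.val
    WHERE l.id IS NULL
    ORDER BY t.id
""").fetchall()

print(except_rows)      # two distinct rows
print(left_join_rows)   # three rows, (2, 'b') appears twice
```

If the tables have a unique key and no duplicate rows, the two queries would agree on this kind of data, which is why they are easy to mistake for equivalents.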

Is there any way to perform an outer join in SSIS without using the Merge Join transformation?

I need to do many left joins to create my fact table, which has more than 150 million records. When I do the outer joins using the Merge Join and Sort transformations, it takes many hours to load the data, so I need help doing this without the Merge Join transformation.
The fastest way to do this is to load the data directly into staging tables on your destination database server, and then run a stored procedure that does the joins to load from the staging tables to the fact table. If the staging tables are indexed on the join keys, that will be the fastest solution.
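The staging-table pattern can be sketched roughly as follows, again with Python's sqlite3 module standing in for the destination server. All table, column and index names here are hypothetical, and SQLite has no stored procedures, so a single INSERT ... SELECT plays the role of the procedure body.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical staging tables, bulk-loaded by the data flow with no joins.
    CREATE TABLE stg_sales (sale_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE stg_customer (customer_id INTEGER, region TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER, amount REAL, region TEXT);

    INSERT INTO stg_sales VALUES (1, 10, 9.5), (2, 11, 20.0), (3, 99, 1.0);
    INSERT INTO stg_customer VALUES (10, 'North'), (11, 'South');

    -- Index the join key so the database can look up matches efficiently.
    CREATE INDEX ix_stg_customer_id ON stg_customer (customer_id);

    -- Stand-in for the stored procedure body: one set-based left join.
    INSERT INTO fact_sales (sale_id, amount, region)
    SELECT s.sale_id, s.amount, c.region
    FROM stg_sales s
    LEFT JOIN stg_customer c ON c.customer_id = s.customer_id;
""")

fact_rows = conn.execute(
    "SELECT sale_id, amount, region FROM fact_sales ORDER BY sale_id"
).fetchall()
print(fact_rows)
```

The sale with no matching customer still reaches the fact table with a NULL region, which is the outer-join behaviour the Merge Join was providing.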
In the SSIS dataflow, you can use the Lookup transformation instead of the Merge Join to do the same outer join, but it's even slower than the merge join, so if performance is what you're after, it's not a good solution.

Does COUNT() produce the underlying table it needs to count?

My boss wants me to do a join on three tables, let's call them tableA, tableB, tableC, which have respectively 74M, 3M and 75M rows.
In case it's useful, the query looks like this :
SELECT A.*,
C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
When trying the query on Dataiku Data Science Studio + Vertica, it seems to create temporary data to produce the output, which fills up the 1T of space on the server, bloating it.
My boss doesn't know much about SQL, so he doesn't understand that in the worst-case scenario the query can produce a table with 74M*3M*75M ≈ 1.7*10^22 rows, which may be the problem here (I'm brand new and don't know the data yet, so I can't tell whether the query is likely to produce that many rows).
Therefore I would like to know if I have a way of knowing beforehand how many rows will be produced : if I did a COUNT(), such as this, for instance :
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
Does the underlying engine produce the whole dataset and then count it? (That would mean I can't count it beforehand, at least not that way.)
Or is it possible that COUNT() gives me a result without building the dataset, by working it out some other way?
(NB: I am currently testing it, but the count has been running for 35 minutes now.)
Vertica is a columnar database. Any query you do only needs to look at the columns required to resolve output, joins, predicates, etc.
Vertica also is able to query against encoded data in many cases, avoiding full materialization until it is actually needed.
Counts like that can be very fast in Vertica. You don't really need to jump through hoops; Vertica will only include columns that are actually used. The optimizer won't try to reconstitute the entire row, only the columns it needs.
What's probably happening here is that you have hash joins with rebroadcasting. If your underlying projections do not line up and your sorts are different and you are joining multiple large tables together, just the join itself can be expensive because it has to load it all into hash and do a lot of network rebroadcasting of the data to get the joins to happen on the initiator node.
I would consider running Database Designer (DBD) using these queries as input, especially if these are common query patterns. If you haven't run DBD at all yet and are not using custom projections, then your default projections will likely not perform well and will cause the situation I mention above.
You can run EXPLAIN on the query to see what's going on.
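For a feel of how the join's row count relates to the worst-case bound, here is a toy sketch in Python with sqlite3. It reuses the table and column names from the question, but the rows (and the payload column) are invented, and SQLite is only a stand-in: it says nothing about how Vertica plans or executes the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (field_join_AB INTEGER, payload TEXT);
    CREATE TABLE tableB (field_join_AB INTEGER, field_join_BC INTEGER);
    CREATE TABLE tableC (field_join_BC INTEGER, needed_field TEXT);
    INSERT INTO tableA VALUES (1, 'a1'), (1, 'a2'), (2, 'a3');
    INSERT INTO tableB VALUES (1, 10), (2, 20);
    INSERT INTO tableC VALUES (10, 'x'), (10, 'y'), (30, 'z');
""")

# COUNT(*) over the join only needs the join columns, not A.* or needed_field.
join_count = conn.execute("""
    SELECT COUNT(*)
    FROM tableA A
    INNER JOIN tableB B ON A.field_join_AB = B.field_join_AB
    INNER JOIN tableC C ON B.field_join_BC = C.field_join_BC
""").fetchone()[0]

# Worst case is the product of the table sizes (every row matches every row).
worst_case_toy = 3 * 2 * 3
worst_case_real = 74_000_000 * 3_000_000 * 75_000_000  # about 1.7e22

print(join_count, worst_case_toy, worst_case_real)
```

The actual count depends on how many rows share each join key, so it can sit far below the product of the table sizes, as it does in this toy data.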

Joining multiple tables to achieve result set in a specific style

I have 3 tables in the same database, with a couple of columns in common and the rest non-matching. I need to show them together in such a way that the user can tell which source table each row came from (refer to the diagrams below). I want to know whether I can achieve this in the database itself, before passing the result on to my report UI or code-behind.
I have tried achieving this using OUTER JOIN, FULL OUTER JOIN.

JOIN SQL query over subsequent tables

I have a question about how to properly use JOIN in SQL queries.
Imagine that I have 3 tables. I want to make a RIGHT JOIN between two of them. That is, I want to show all the records from the left table and just those records from the right table where the join condition matches. Once I have this, I want to make another JOIN (inner or whatever) between the table that was on the right (which is now the left table) and the third table (which is the right table), so that I have the 3 tables connected. My problem is that I get this error message from Access:
The SQL statement could not be executed because it contains ambiguous
outer joins. To force one of the joins to be performed first, create a
separate query that performs the first join and then include that
query in your SQL statement.
So, Access is forcing me to use two separate queries, but I don't want to use two. I think this must be possible in just one. Am I right? Do you know if there is a method for this?
Thank you all.
Can you try this?
Put the inner join first.
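One way to read "put the inner join first" is to resolve the inner join as its own unit and then outer-join the result, all in one statement. The sketch below shows that shape with Python's sqlite3 module; the tables t1, t2, t3 and their columns are hypothetical, and this is generic SQL run in SQLite, not Access syntax, so the exact form Access accepts may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER, t1_id INTEGER, t3_id INTEGER);
    CREATE TABLE t3 (id INTEGER, label TEXT);
    INSERT INTO t1 VALUES (1, 'one'), (2, 'two');
    INSERT INTO t2 VALUES (100, 1, 7), (101, 2, 8);
    INSERT INTO t3 VALUES (7, 'seven');
""")

# The inner join between t2 and t3 is evaluated first, as a derived table,
# and only then outer-joined to t1, so the join order is unambiguous.
rows = conn.execute("""
    SELECT t1.name, inner_first.label
    FROM t1
    LEFT JOIN (
        SELECT t2.t1_id, t3.label
        FROM t2 INNER JOIN t3 ON t3.id = t2.t3_id
    ) AS inner_first ON inner_first.t1_id = t1.id
    ORDER BY t1.id
""").fetchall()

print(rows)
```

Every t1 row is kept, and the label is NULL wherever the inner join between t2 and t3 produced no match.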