What are the uses for Cross Join? - sql

A cross join performs a cartesian product on the tuples of the two sets.
SELECT *
FROM Table1
CROSS JOIN Table2
Which circumstances render such an SQL operation particularly useful?

If you have a "grid" that you want to populate completely, like size and color information for a particular article of clothing:
select
size,
color
from
sizes CROSS JOIN colors
Maybe you want a table that contains a row for every minute in the day, and you want to use it to verify that a procedure has executed each minute, so you might cross two tables:
select
hour,
minute
from
hours CROSS JOIN minutes
Or you have a set of standard report specs that you want to apply to every month in the year:
select
specId,
month
from
reports CROSS JOIN months
The problem with maintaining these as views is that in most cases, you don't want a complete product, particularly with respect to clothes. You can add MINUS logic to the query to remove certain combinations that you don't carry, but you might find it easier to populate a table some other way and not use a Cartesian product.
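For example, a sketch of that MINUS approach (EXCEPT in most non-Oracle databases; the discontinued_combos table is an invented name):
select size, color
from sizes CROSS JOIN colors
MINUS
select size, color
from discontinued_combos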
Also, you might end up trying the cross join on tables that have perhaps a few more rows than you thought, or perhaps your WHERE clause was partially or completely missing. In that case, your DBA will notify you promptly of the omission. Usually he or she will not be happy.

Generate data for testing.
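A minimal sketch of the idea, with invented seed-table names - crossing small tables mass-produces test rows:
select f.first_name, l.last_name, d.dept_name
from first_names f
CROSS JOIN last_names l
CROSS JOIN departments d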

You're typically not going to want a full Cartesian product for most database queries. The whole power of relational databases is that you can apply whatever restrictions you're interested in, so that you avoid pulling unnecessary rows from the db.
I suppose one contrived example where you might want that is if you have a table of employees and a table of jobs that need doing and want to see all possible assignments of one employee to one job.
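A minimal sketch of that, assuming employees and jobs tables with simple name columns:
select e.employee_name, j.job_name
from employees e
CROSS JOIN jobs j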

The key is "show me all possible combinations". I've used these in conjunction with other calculated fields an then sorted/filtered those.
For example, say you are building an arbitrage (trading) application. You have sellers offering products at a price and buyers asking for products at a cost. You do a cross join on the product key (to match up the potential buyers and sellers), calculate the spread between cost and price, then sort desc. on this to give you (the middleman) the most profitable trades to execute. Almost always you'll have other bounding filter criteria of course.
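A hedged sketch of that idea; the sellers/buyers tables and their columns are invented for illustration:
select s.product_key,
b.cost - s.price as spread -- spread between the buyer's cost and the seller's price
from sellers s
join buyers b on b.product_key = s.product_key -- the "cross join on the product key"
where b.cost > s.price -- a typical bounding filter: profitable trades only
order by spread desc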

Ok, this probably won't answer the question, but, if it's true (and I'm not even sure of that), it's a fun bit of history.
In the early days of Oracle, one of the developers realized that he needed to duplicate every row in a table (for example, it's possible it was a table of events and he needed to change it into separate "start event" and "end event" entries). He realized that if he had a table with just two rows, he could do a cross join, selecting just the columns in the first table, and get exactly what he needed. So he created a simple table, which he naturally enough called "DUAL".
Later, he needed to do something which could only be done via a select from a table, even though the action itself had nothing to do with the table (perhaps he forgot his watch and wanted to read the time via SELECT SYSDATE FROM...). He realized that he still had his DUAL table lying around, and used that. After a while, he tired of seeing the time printed twice, so he eventually deleted one of the rows.
Others at Oracle started using his table, and eventually, it was decided to include it in the standard Oracle installation.
Which explains why a table whose only significance is that it has one row has a name which means "two".

Take something like a digits table, which has ten rows for the digits 0-9. You can cross join that table with itself a few times to get a result with however many rows you need, and each row will be numbered appropriately. This has a number of uses. For example, you can combine it with a dateadd() function to get a set for every day in a given year.
Note: this post is old now. Today I'd use generate_series() or a recursive CTE to do this job instead.
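A minimal sketch of the digits trick described above, assuming a digits table with a single digit column holding 0 through 9:
select d1.digit + 10 * d2.digit + 100 * d3.digit as n -- 1000 rows, numbered 0-999
from digits d1
CROSS JOIN digits d2
CROSS JOIN digits d3
order by n
-- e.g. wrap n in dateadd(day, n, '2024-01-01') and filter to get every day of a year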

This is an interesting way to use a cross join to create a crosstab report. I found it in Joe Celko's SQL For Smarties, and have used it several times. It does take a little setup, but has been worth the time invested.

You can use CROSS JOIN to:
generate data for testing purposes
combine all properties - e.g. you need all possible combinations of blood groups (A, B, AB, O) with Rh factors (+/-), etc.:
-- tune it for your purposes ;) - I'm not an expert in this area ;)
CREATE TABLE BL_GRP_01 (GR_1 text);
CREATE TABLE RH_VAL_01 (RH_VAL text);
INSERT INTO BL_GRP_01 VALUES ('A'), ('B'), ('AB'), ('O');
INSERT INTO RH_VAL_01 VALUES ('+'), ('-');
SELECT CONCAT(x.GR_1, y.RH_val)
FROM BL_GRP_01 x
CROSS JOIN RH_VAL_01 y
ORDER BY CONCAT(x.GR_1, y.RH_VAL);
join 2 tables that have no common id, and then group the result using max(), etc., to find the best possible combination (a rough sketch follows below)
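A rough sketch of that last item; both tables and their score columns are invented:
select max(a.score + b.score) as best_combination
from candidates_a a
CROSS JOIN candidates_b b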

Imagine you had a series of queries you want to issue over a specific combination of items and dates (prices, availability, etc..). You could load the items and dates into separate temp tables and have your queries cross join the tables. This may be more convenient than the alternative of enumerating the items and dates in IN clauses, especially since some databases limit the number of elements in an IN clause.
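A sketch of that approach, assuming SQL Server-style temp tables and an invented prices table:
create table #items (item_id int);
create table #dates (as_of date);
insert into #items values (101), (102), (103);
insert into #dates values ('2024-01-31'), ('2024-02-29');
select p.item_id, p.as_of, p.price
from #items i
cross join #dates d
join prices p on p.item_id = i.item_id and p.as_of = d.as_of -- one query instead of long IN lists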

Related

Adding a SUM statement increases run time way too much, is there a better method?

I have a table with invoice payments, which can be partial or full. I sum the payments per invoice and compare this calculated field to the total amount of the invoice. The calculation appears twice in the query, once in the SELECT statement and again in the WHERE clause. Even if I remove one so it's only in either the WHERE or the SELECT, it takes more than an hour to run. If I remove the SUM entirely, it takes 10 seconds to run.
Is there a better method to get the sum? Should I use an index view? A temp table? Note that an invoice number is unique only to a vendor, not unique in general. The initial FROM is a view, if this makes a difference.
select distinct
transdate,
invoicedate,
PAY.OrderAccount,
v.VendorName,
invoiceamountmst,
(select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] PAY1 where PAY.INVOICEID=PAY1.INVOICEID and PAY.OrderAccount=PAY1.OrderAccount) as "InvoiceSUM",
settleamountcur,
Currencycodeinvoice,
PAY.Description,
Voucher
from VIEW_INVOICE_PAYMENT PAY
inner join INVOICE on INVOICE_DOC_NO =invoiceid
JOIN VENDOR V on PAY.OrderAccount=v.VendorAccount
where TRANSDATE is not null
and (select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] PAY1 where PAY.INVOICEID=PAY1.INVOICEID and PAY.OrderAccount=PAY1.OrderAccount)=total_cost_on_invoice
In this answer, when I refer to 'that select', I'm referring to the sub-query in the middle: select sum(PAY1.settleamountcur) ...
Note that the aliasing in 'that select' looks a little strange, e.g., select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] AX1. Where does the PAY1 alias come from? I may have missed something. If that's a typo in your code, it could be doing bad things (if it even runs). Assuming it's not, however...
For your broader problem, I believe that it will be running that select statement once for every row returned by your overall query. Indeed, it may be running it more often, depending on where the filtering happens in the execution plan.
Note I'm assuming SQL Server in this answer - but it should apply to other databases as well.
A couple of options:
1) Instead of referring to the view, bring its underlying tables into your current query and modify the query accordingly.
2) Remove the aggregation from the sub-query and instead do it over the whole data set, e.g., GROUP BY the relevant fields and SUM across them. This can be combined with option 1.
3) Put the sub-query in a CTE, or as a sub-query within the FROM clause. This may make the engine treat it as a single table rather than running it many times (or it may not).
4) (Sometimes my preferred option for large tables) Get the relevant data from the view into a temporary table first, e.g.,
SELECT INVOICEId, OrderAccount, SUM(settleamountcur) AS total_settleamountcur
INTO #Temp
FROM [VIEW_INVOICE_PAYMENT]
GROUP BY INVOICEId, OrderAccount
-- Add any where/having clauses you can to filter
-- Consider creating temp table first with primary key, making joins easier for SQL Server
Then use the #Temp table instead of that select sub-query.
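A hedged sketch of what the rewritten main query might look like with that temp table (same columns as the original; untested):
select distinct
transdate,
invoicedate,
PAY.OrderAccount,
v.VendorName,
invoiceamountmst,
t.total_settleamountcur as "InvoiceSUM",
settleamountcur,
Currencycodeinvoice,
PAY.Description,
Voucher
from VIEW_INVOICE_PAYMENT PAY
inner join INVOICE on INVOICE_DOC_NO = invoiceid
join VENDOR v on PAY.OrderAccount = v.VendorAccount
join #Temp t on t.INVOICEID = PAY.INVOICEID and t.OrderAccount = PAY.OrderAccount
where TRANSDATE is not null
and t.total_settleamountcur = total_cost_on_invoice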

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. The idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls, one to issue the simple subquery and return just the RRNs (row IDs), and the second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.
Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there are no keys defined in the table itself (a practice that is 20 years out of date), but in that case there's usually a logical file with a unique key defined that is used by the application as the de facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de-facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) create new LF using the name of the existing PF
3) repoint existing LF to new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.
I find it hard to believe that querying a table of a mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different, I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.
Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time, but in this case it's as likely to be simple copying of intermediate result rows as anything.

SQL Query with multiple possible joins (or condition in join)

I have a problem where I have to try to find people who have old accounts with an outstanding balance, but who have created a new account. I need to match them by comparing SSNs. The problem is that we have primary and additional contacts, so 2 potential SSNs per account. I need to match them even if they were primary at first, but now are secondary, etc.
Here was my first attempt; I'm just counting for now to get the joins and conditions down, and I'll select the actual data later. Basically the personal table is joined once to active accounts, and another copy to delinquent accounts. The two references to the personal table are then compared based on the 4 possible ways SSNs could be related.
select count(*)
from personal pa
join consumer c
on c.cust_nbr = pa.cust_nbr
and c.per_acct = pa.acct
join personal pu
on pu.ssn = pa.ssn
or pu.ssn = pa.addl_ssn
or pu.addl_ssn = pa.ssn
or pu.addl_ssn = pa.addl_ssn
join uncol_acct u
on u.cust_nbr = pu.cust_nbr
and u.per_acct = pu.acct
where u.curr_bal > 0
This works, but it takes 20 minutes to run. I found this question Is having an 'OR' in an INNER JOIN condition a bad idea? so I tried re-writing it as 4 queries (one per ssn combination) and unioning them. This took 30 minutes to run.
Is there a better way to do this, or is it just a really inefficient process no matter how you do it?
Update: After playing with some options here, and some other experimenting I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view it takes a really long time to decrypt and then compare.
If you run separate joins and then UNION ALL them, you might have problems: what if the same record pair fulfills at least two conditions? You will have duplicates in your result then. (A plain UNION removes such duplicates, but pays for that with an extra de-duplication step.)
I believe your first approach is feasible, but do not forget that you are joining four tables. If the number of rows is A, B, C, D in the respective tables, then the RDBMS will have to check a maximum of A * B * C * D records. If you have many records in your database, then this will take a lot of time.
Of course, you can optimize your query by adding indexes to some columns and that would be a good idea if they are not indexed already. But do not forget that if you add an index to a column, then the RDBMS will be quicker to read from there, but slower to write there. If your operations are mostly reads (select), then you should index your columns, but not blindly, study indexing a bit before you start doing it.
Also, if you are joining four tables, personal, consumer, personal (again) and uncol_acct, then you might do something like this:
Write a query that contains two subqueries, named t1 and t2. The first subquery joins personal and consumer. The second subquery joins the second occurrence of personal with uncol_acct, and the where clause goes inside this second subquery. Your main query then joins t1 and t2. This way you optimise, as your main query only has to consider the pairing of valid t1 and t2 rows.
Also, if your where clause is outside, as in your example query, then the four-way join may be executed and only after that will the where be taken into consideration. This is why the where clause should go inside the second sub-query, so it filters rows before the main join. You can even nest a further subquery inside the second one to evaluate the where condition early if it is rarely fulfilled. A sketch of this shape follows.
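A hedged sketch of that shape, reusing the tables and join conditions from the question (untested):
select count(*)
from (
select pa.ssn, pa.addl_ssn
from personal pa
join consumer c
on c.cust_nbr = pa.cust_nbr
and c.per_acct = pa.acct
) t1
join (
select pu.ssn, pu.addl_ssn
from personal pu
join uncol_acct u
on u.cust_nbr = pu.cust_nbr
and u.per_acct = pu.acct
where u.curr_bal > 0 -- the filter runs inside the sub-query
) t2
on t1.ssn = t2.ssn
or t1.ssn = t2.addl_ssn
or t1.addl_ssn = t2.ssn
or t1.addl_ssn = t2.addl_ssn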
Cheers!

union with join to common table

I've got a query that presently looks something like this (I'll simplify it to get to the point):
select shipment.f1, type1detail.f2
from shipment
join type1detail using (shipmentid)
where shipment.otherinfo='foo'
union
select shipment.f1, type2detail.f2
from shipment
join type2detail using (shipmentid)
where shipment.otherinfo='foo'
That is, we have two types of detail that can be on a shipment, and I need to find all the records for both types for all the shipments meeting a given condition. (Imagine that the two detail tables have some fields with matching data. In real life there are several more joins in there; as I say I'm trying to simplify the query to get to the question I'm trying to ask today.)
But the "where" clause requires a full-file scan, so it's pretty inefficient to do this full file scan twice. Logically, the database engine should be able to find the shipment records that meet the condition once, and then find all the records from type1detail and type2detail with shipmentid matching those values.
But how do I say this in SQL?
select shipment.f1, detail.f2
from shipment
join
(select shipmentid, f2
from type1detail
union
select shipmentid, f2
from type2detail
) d using (shipmentid)
where shipment.otherinfo='foo'
would require a full file scan of type1detail, unioned with a full file scan of type2detail, and then a join of this to shipment, which would be worse than reading shipment twice (as each of the detail tables is bigger than shipment).
This seems like a straightforward thing to want to do, but I just can't think of way that I could express it.
I'm using Postgres if you know a solution particular to that engine, but I was hoping for a generic SQL solution.
I would use a temp table to hold the shipmentids that match your "otherinfo" where clause. Then use that temp table in your joins (not replacing any existing join, but adding another one) so that the join criteria use only shipmentids and no where clause is needed (unless you are also filtering on columns not in the shipment table; the point is you still scan that table only once). A sketch follows.
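A sketch in Postgres syntax, keeping the simplified table names from the question:
create temporary table matching_shipments as
select shipmentid
from shipment
where otherinfo = 'foo'; -- the full scan happens once, here
select shipment.f1, type1detail.f2
from matching_shipments
join shipment using (shipmentid)
join type1detail using (shipmentid)
union
select shipment.f1, type2detail.f2
from matching_shipments
join shipment using (shipmentid)
join type2detail using (shipmentid)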
If you run this query more than occasionally, you're probably better off indexing the otherinfo column to avoid the full scans altogether. I don't work much with Postgres, but it looks like there is at least one full text indexing option already.

MySQL - Selecting data from multiple tables all with same structure but different data

Ok, here is my dilemma: I have a database set up with about 5 tables, all with the exact same data structure. The data is separated in this manner for localization purposes and to split up a total of about 4.5 million records.
A majority of the time only one table is needed and all is well. However, sometimes data is needed from 2 or more of the tables and it needs to be sorted by a user defined column. This is where I am having problems.
data columns:
id, band_name, song_name, album_name, genre
MySQL statement:
SELECT * from us_music, de_music where `genre` = 'punk'
MySQL spits out this error:
#1052 - Column 'genre' in where clause is ambiguous
Obviously, I am doing this wrong. Anyone care to shed some light on this for me?
I think you're looking for the UNION clause, a la
(SELECT * from us_music where `genre` = 'punk')
UNION
(SELECT * from de_music where `genre` = 'punk')
It sounds like you'd be happier with a single table. The fact that the five tables share the same schema, and sometimes need to be presented as if they came from one table, points to putting it all in one table.
Add a new column which can be used to distinguish among the five languages (I'm assuming it's language that is different among the tables since you said it was for localization). Don't worry about having 4.5 million records. Any real database can handle that size no problem. Add the correct indexes, and you'll have no trouble dealing with them as a single table.
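A sketch of that consolidated design; the locale column, its values, and the index name are assumptions:
CREATE TABLE music (
id INT PRIMARY KEY,
band_name VARCHAR(255),
song_name VARCHAR(255),
album_name VARCHAR(255),
genre VARCHAR(64),
locale CHAR(2) -- e.g. 'us', 'de'; replaces the per-country tables
);
CREATE INDEX idx_music_genre_locale ON music (genre, locale);
SELECT * FROM music WHERE genre = 'punk' AND locale IN ('us', 'de');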
Any of the above answers are valid; an alternative is to qualify the ambiguous column with its table name - eg:
SELECT * from us_music, de_music where `us_music`.`genre` = 'punk' AND `de_music`.`genre` = 'punk'
The column is ambiguous because it appears in both tables, so you would need to specify the where (or sort) field fully, such as us_music.genre or de_music.genre. But you'd usually specify two tables only if you were then going to join them together in some fashion. The structure you're dealing with is occasionally referred to as a partitioned table, although it's usually done to separate the dataset into distinct files as well, rather than to just split the dataset arbitrarily. If you're in charge of the database structure and there's no good reason to partition the data, then I'd build one big table with an extra "origin" field that contains a country code; but you're probably doing it for a legitimate performance reason.
Either use a UNION to combine the tables you're interested in (http://dev.mysql.com/doc/refman/5.0/en/union.html) or use the MERGE storage engine (http://dev.mysql.com/doc/refman/5.1/en/merge-storage-engine.html).
Your original attempt to span both tables creates an implicit JOIN. This is frowned upon by most experienced SQL programmers because it separates the tables to be combined from the condition describing how to combine them.
The UNION is a good solution for the tables as they are, but there should be no reason they can't be put into the one table with decent indexing. I've seen adding the correct index to a large table increase query speed by three orders of magnitude.
The UNION statement can cost a great deal of time on huge data sets. It is good to perform the select in 2 steps:
select the ids first,
then select from the main table using those ids.