I don't even know how to properly name this question. I know how to join the same table, but this situation is slightly different. I'll try to make everything as simple as possible. Here are 2 tables:
ingredients (contains ingredient IDs and names)
ingredient_id|ingredient_name
1 |Water
2 |Salt
3 |Fancy Sauce
4 |Spices
5 |Pepper
6 |Chili
ingredients_to_ingredients (contains optional sub-ingredients for each ingredient)
ingredient_id|mapped_ingredient_id
3 |1
3 |2
3 |4
4 |5
4 |6
I need to get all the sub-ingredients (if any) for specified ingredient. So if I want to get sub-ingredients of #3 (Fancy Sauce), I use:
SELECT * FROM ingredients AS main
LEFT JOIN ingredients_to_ingredients AS sub
ON main.ingredient_id=sub.ingredient_id
LEFT JOIN ingredients
ON sub.mapped_ingredient_id=ingredients.ingredient_id
WHERE main.ingredient_id=3
And get the list:
Water
Salt
Spices
Easy. But as you see, Spices contains other sub-ingredients (Pepper and Chili), and I need to list them as well.
Some will say I can add 2nd sub-query, but there's a catch: there's no fixed number. In other words, one ingredient might not have sub-ingredients at all, while another might have lots of sub-sub-sub-ingredients (which would require dozens of sub-queries).
How do I write a query which keeps selecting mapped sub-ingredients as long as any ingredient has sub-ingredients? Now, I just use lots of repeated sub-queries (like dozens of them, just to be sure every possible sub-sub-sub ingredient is included), but query looks ugly and I believe it's not the best way to do it.
Any suggestions how to modify the query above?
P.S. I'm sure someone will say using sub-sub-sub ingredients mapped to themselves is a bad design. Well, I can't say to restaurant's chief - "don't use sub-ingredients in your ingredients because it's bad design from database point of view". Hope you got the idea why the design is as it is and can't be changed.
WITH RECURSIVE
cte AS ( SELECT *
FROM ingredients_to_ingredients
WHERE ingredient_id = 3
UNION ALL
SELECT ingredients_to_ingredients.*
FROM ingredients_to_ingredients
JOIN cte ON cte.mapped_ingredient_id = ingredients_to_ingredients.ingredient_id )
SELECT ingredients.ingredient_name
FROM cte
JOIN ingredients ON cte.mapped_ingredient_id = ingredients.ingredient_id;
https://dbfiddle.uk/?rdbms=mariadb_10.4&fiddle=ccfbd2838775fc0f69775ef86ab29093
Related
Recently stumbled on this situation. Doing both queries might be "light" in my situation, I just want to know when it comes to big dataset on what is better. Better in overall (performance, speed, etc etc).
Currently I do single queries of 2 1:N (has-many) relationship and reduce/transform the data in the application.
It looks like this transformed/reduced:
[
'field' => 'value',
'hasMany-1' => [],
'hasMany-2' => []
]
I'm actually somehow tempted to just do separate queries as it eliminates the pain of reducing it if I had more than 2 hasMany queries and is more quite readable but code currently works so I'll maybe just do it next time.
Is the compromise worth it? Again, in my situation it might be very "light" as I only have few rows (< 100) and structure is not complex as it is on early stage yet.
But asked in case I stumble upon this next time and when dataset grows larger.
** EDIT **
So the has-many relationship I'm talking about are: A customer has-many phones and pets.
My current query returns me this result (simplified):
customer_id | pet_name | phone
1 | john | 1234
1 | john | 5678
2 | jane | 1357
2 | jane | 2468
2 | joe | 1357
2 | joe | 2468
I think my query is fine. It seems logical for some rows to repeat because the other field has different value.
In general, you should issue a single query and let the optimizer do the work for you. At the very least, this saves multiple round-trips to the database and query compilation.
There are cases where multiple queries can have better performance, but I think it is better to start with a single query.
You have a particular issue regarding joins along multiple many-to-many dimensions. There is no need to do the joins "generally" and then "reduce" the results. There are more efficient methods.
I would suggest that you ask another question. Provide sample data, desired results, and an explanation of the logic you are attempting. You may be able to learn a more efficient way to write a single query.
You did not describe your table structures so i assume few things.
If you want to have pets and phones as one row do:
select c.customer_id, c.name,
array_to_string((array_agg(p.pet_name)),',') pet_names
array_to_string((array_agg(ph.phone)),',') phones
from customer c, pet p, phones ph
where p.customer_id=1
and p.customer_id=c.customer_id
and ph.customer_id=c.customer_id
group by c.customer_id, c.name
If you want to have row per pet_name with all possible phone numbers:
select c.customer_id, c.name,
p.pet_name
array_to_string((array_agg(ph.phone)),',') phones
from customer c, pet p, phones ph
where p.customer_id=1
and p.customer_id=c.customer_id
and ph.customer_id=c.customer_id
group by c.customer_id, c.name, p.pet_name
If we talk about performance it will be faster to do 2 queries to pets and phones separetly by customer_id. But until you have milions of rows it is not so important.
Of course you should have indexes on customer_id.
Being new with SQL and SSRS and can do many things already, but I think I must be missing some basics and therefore bang my head on the wall all the time.
A report that is almost working, needs to have more results in it, based on conditions.
My working query so far is like this:
SELECT projects.project_number, project_phases.project_phase_id, project_phases.project_phase_number, project_phases.project_phase_header, project_phase_expensegroups.projectphase_expense_total, invoicerows.invoicerow_total
FROM projects INNER JOIN
project_phases ON projects.project_id = project_phases.project_id
LEFT OUTER JOIN
project_phase_expensegroups ON project_phases.project_phase_id = project_phase_expensegroups.project_phase_id
LEFT OUTER JOIN
invoicerows ON project_phases.project_phase_id = invoicerows.project_phase_id
WHERE ( projects.project_number = #iProjectNumber )
AND
( project_phase_expensegroups.projectphase_expense_total >0 )
The parameter is for selectionlist that is used to choose a project to the report.
How to have also records that have
( project_phase_expensegroups.projectphase_expense_total ) with value 0 but there might be invoices for that project phase?
Tried already to add another condition like this:
WHERE ( projects.project_number = #iProjectNumber )
AND
( project_phase_expensegroups.projectphase_expense_total > 0 )
OR
( invoicerows.invoicerow_total > 0 )
but while it gives some results - also the one with projectphase_expense_total with value 0, but the report is total mess.
So my question is: what am I doing wrong here?
There is a core problem with your query in that you are left joining to two tables, implying that rows may not exist, but then putting conditions on those tables, which will eliminate NULLs. That means your query is internally inconsistent as is.
The next problem is that you're joining two tables to project_phases that both may have multiple rows. Since these data are not related to each other (as proven by the fact that you have no join condition between project_phase_expensegroups and invoicerows, your query is not going to work correctly. For example, given a list of people, a list of those people's favorite foods, and a list of their favorite colors like so:
People
Person
------
Joe
Mary
FavoriteFoods
Person Food
------ ---------
Joe Broccoli
Joe Bananas
Mary Chocolate
Mary Cake
FavoriteColors
Person Color
------ ----------
Joe Red
Joe Blue
Mary Periwinkle
Mary Fuchsia
When you join these with links between Person <-> Food and Person <-> Color, you'll get a result like this:
Person Food Color
------ --------- ----------
Joe Broccoli Red
Joe Bananas Red
Joe Broccoli Blue
Joe Bananas Blue
Mary Chocolate Periwinkle
Mary Chocolate Fuchsia
Mary Cake Periwinkle
Mary Cake Fuchsia
This is essentially a cross-join, also known as a Cartesian product, between the Foods and the Colors, because they have a many-to-one relationship with each person, but no relationship with each other.
There are a few ways to deal with this in the report.
Create ExpenseGroup and InvoiceRow subreports, that are called from the main report by a combination of project_id and project_phase_id parameters.
Summarize one or the other set of data into a single value. For example, you could sum the invoice rows. Or, you could concatenate the expense groups into a single string separated by commas.
Some notes:
Please, please format your query before posting it in a question. It is almost impossible to read when not formatted. It seems pretty clear that you're using a GUI to create the query, but do us the favor of not having to format it ourselves just to help you
While formatting, please use aliases, Don't use full table names. It just makes the query that much harder to understand.
You need an extra parentheses in your where clause in order to get the logic right.
WHERE ( projects.project_number = #iProjectNumber )
AND (
(project_phase_expensegroups.projectphase_expense_total > 0)
OR
(invoicerows.invoicerow_total > 0)
)
Also, you're using a column in your WHERE clause from a table that is left joined without checking for NULLs. That basically makes it a (slow) inner join. If you want to include rows that don't match from that table you also need to check for NULL. Any other comparison besides IS NULL will always be false for NULL values. See this page for more information about SQL's three value predicate logic: http://www.firstsql.com/idefend3.htm
To keep your LEFT JOINs working as you intended you would need to do this:
WHERE ( projects.project_number = #iProjectNumber )
AND (
project_phase_expensegroups.projectphase_expense_total > 0
OR project_phase_expensegroups.project_phase_id IS NULL
OR invoicerows.invoicerow_total > 0
OR invoicerows.project_phase_id IS NULL
)
I found the solution and it was kind easy after all. I changed the only the second LEFT OUTER JOIN to INNER JOIN and left away condition where the query got only results over zero. Also I used SELECT DISTINCT
Now my report is working perfectly.
I'm working on exercise 17 in the Teach Yourself SQL program GalaXQL (based on SQLite). I've got three tables:
Stars that contains starid;
Planets that contains planetid and starid;
Moons that contains moonid and planetid.
I want to return the starid associated with the greatest number of planets and moons combined.
I've got a query that will return the starid, planetid and total planets + moons.
How do I change this query so it only returns the single starid corresponding to the max(total) and not a table? This is my query so far:
select
stars.starid as sid,
planets.planetid as pid,
(count(moons.moonid)+count(planets.planetid)) as total
from stars, planets, moons
where planets.planetid=moons.planetid and stars.starid=planets.starid
group by stars.starid
Let's visualize a system that might be represented by this database structure, and see if we can't translate your question into working SQL.
I drew you a galaxy:
To distinguish stars and planets from moons, I've used capital Roman numerals for starid values and lower-case Roman numerals for moonid values. And since everyone knows that astronomers have nothing to do on those long nights in the observatory but drink, I put an unexplained gap in the middle of your planetid values. Gaps like these will occur when using so-called "surrogate" IDs, because their values hold no meaning; they are simply unique identifiers for rows.
If you'd like to follow along, here's the galaxy naively loaded into SQL Fiddle (if you get a popup about switching to WebSQL, you may need to hit "cancel" and stick with SQL.js for this example to work).
Let's see, what was it you wanted again?
I want to return the starid associated with the greatest number of planets and moons combined
Awesome. Rephrased, the question is: Which star is associated with the greatest number of orbiting bodies?
Star (I) has 1 planet with 3 moons;
Star (II) has 1 planet with 1 moon and 1 planet with 2 moons;
Star (III) has 1 planet with 1 moon and 2 planets with no moons.
All we're doing here is counting the different entities associated with each star. With a total of 5 orbiting bodies, star (II) is the winner! So the final result we expect from a working query is:
| starid |
|--------|
| 2 |
I intentionally drew this awesome galaxy such that the "winning" star doesn't have the most planets and isn't associated with the planet that has the most moons. If those astronomers weren't all three sheets to the wind, I might have gotten an extra moon out of planet (1) as well, so that our winning star isn't tied for most moons total. It'll be convenient for us in this demonstration if star (II) only answers the question we're asking and not any other questions with potentially similar queries, to reduce our chances of arriving at the right answer via the wrong query.
Lost in translation
The first thing I want to do is introduce you to the explicit JOIN syntax. This is going to be your very close friend. You will always JOIN your tables, no matter what some silly tutorial says. Trust in my far sillier advice instead (and optionally, read Explicit vs implicit SQL joins).
The explicit JOIN syntax shows how we're requiring our tables to relate to each other and reserves the WHERE clause for the sole purpose of filtering rows from the result set. There are a few different types, but what we're going to start with is a plain old INNER JOIN. This is essentially what your original query performed and it implies that all you want to see in your result set is the data that overlaps in all three tables. Check out a skeleton of your original query:
SELECT ... FROM stars, planets, moons
WHERE planets.planetid = moons.planetid
AND planets.starid = stars.starid;
Given those conditions, what happens to an orphaned planet somewhere off in space that isn't associated with a star (i.e., its starid is NULL)? Since an orphaned planet has no overlap with the stars table, an INNER JOIN wouldn't include it in the result set.
In SQL any equality or inequality comparison with NULL gives a result of NULL—even NULL = NULL isn't true! Now your query has a problem, because the other condition is that planets.planetid = moons.planetid. If there's a planet for which no corresponding moon exists, that turns into planets.planetid = NULL and the planet will not appear in your query result. That's no good! Lonely planets must be counted!
The OUTER limits
Fortunately there's a JOIN for you: An OUTER JOIN, which will ensure that at least one of the tables always shows up in our result set. They come in LEFT and RIGHT flavors to indicate which table gets special treatment, relative to the position of the JOIN keyword. What joins does SQLite support? confirms that the INNER and OUTER keywords are optional, so we can use LEFT JOIN, noting that:
stars and planets are linked by a common starid;
planets and moons are linked by a common planetid;
stars and moons are linked indirectly by the above two links;
we always want to count all the planets and all the moons.
SELECT
*
FROM
stars
LEFT JOIN
planets ON stars.starid = planets.starid
LEFT JOIN
moons ON planets.planetid = moons.planetid;
Notice that instead of having a big bag o' tables and a WHERE clause, you now have one ON clause for each JOIN. As you find yourself working with more tables, this is going to be far easier to read; and because this is standard syntax, it's relatively portable between SQL databases.
Lost in space
Our new query basically grabs everything in our database. But does this correspond to everything in our galaxy? Actually, there's some redundancy here, because two of our ID fields (starid and planetid) exist in more than one table. This is only one of many reasons to avoid the SELECT * catch-all syntax in actual use cases. We only really need the three ID fields, and I'm going to throw in two more tricks while we're at it:
Aliases! You can give your tables more convenient names by using the table_name AS alias syntax. This can be really convenient when you have to refer to many different columns in a multi-table query and you don't want to type out the full table names each time.
Grab starid from the planets table and leave stars out of the JOIN entirely! Having stars LEFT JOIN planets ON stars.starid = planets.starid means that the starid field is going to be the same, no matter which table we get it from—as long as the star has any planets. If we were counting stars, we'd need this table, but we're counting planets and moons; moons by definition orbit planets, so a star with no planets also has no moons and can be ignored. (This is an assumption; check your data to make sure it's justified! Maybe your astronomers are more drunk than usual!)
SELECT
p.starid, -- This could be S.starid, if we kept using `stars`
p.planetid,
m.moonid
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid;
Result:
| starid | planetid | moonid |
|--------|----------|--------|
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 1 | 3 |
| 2 | 2 | 6 |
| 2 | 3 | 4 |
| 2 | 3 | 5 |
| 3 | 7 | |
| 3 | 8 | 7 |
| 3 | 9 | |
Mathematical!
Now our task is to decide which star is the winner, and for that we have to do some simple calculation. Let's count moons first; since they have no "children" and only one "parent" each, they're easy to aggregate:
SELECT
p.starid,
p.planetid,
COUNT(m.moonid) AS moon_count
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid, p.planetid;
Result:
| starid | planetid | moon_count |
|--------|----------|------------|
| 1 | 1 | 3 |
| 2 | 2 | 1 |
| 2 | 3 | 2 |
| 3 | 7 | 0 |
| 3 | 8 | 1 |
| 3 | 9 | 0 |
(Note: Usually we like to use COUNT(*) because it's simple to type and to read, but it would get us into trouble here! Since two of our rows have a NULL value for the moonid, we have to use COUNT(moonid) to avoid counting moons that don't exist.)
So far, so good—I see six planets, we know which star each belongs to, and the right number of moons are shown for each planet. Next step, counting the planets. You might think this requires a subquery in order to also add up the moon_count column for each planet but it's actually simpler than that; if we GROUP BY the star, our moon_count will switch from counting "moons per planet, per star" to "moons per star" which is just fine:
SELECT
p.starid,
COUNT(p.planetid) AS planet_count,
COUNT(m.moonid) AS moon_count
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;
Result:
| starid | planet_count | moon_count |
|--------|--------------|------------|
| 1 | 3 | 3 |
| 2 | 3 | 3 |
| 3 | 3 | 1 |
Now we've run into trouble. The moon_count is correct, but you should see right away that the planet_count is wrong. Why is this? Look back at the ungrouped query result and notice that there are nine rows, with three rows for each starid, and each row has a non-null value for planetid. That's what we asked the database to count with this query, when what we really meant to ask was how many different planets are there? Planet (1) appears three times with star (I) but it's the same planet each time. The fix is to stick the DISTINCT keyword inside the COUNT() function call. At the same time, we can add the two columns together:
SELECT
p.starid,
COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;
Result:
| starid | total_bodies |
|--------|--------------|
| 1 | 4 |
| 2 | 5 |
| 3 | 4 |
And the winner is...
Counting the orbiting bodies around each star in the drawing, we can see that the total_bodies column is correct. But you didn't ask for all this information; you just want to know who won. Well, there are a bunch of ways to get there, and depending on the size and makeup of your galaxy (database), some may be more efficient than others. One approach is to ORDER BY the total_bodies expression so that the "winner" appears at the top, LIMIT 1 so that we don't see the losers, and select only the starid column (see it on SQL Fiddle).
The problem with that approach is that it hides ties. What if we gave the losing stars in our galaxy each an extra planet or moon? Now we've got a three way tie—everyone's a winner! But who shows up first when we ORDER BY a value that's always the same? In the SQL standard, this is undefined; there's no telling who will come out on top. You might run the same query twice on the same data and get two different results!
For this reason, you might prefer to ask which stars have the greatest number of orbital bodies, instead of specifying in your question that you know there is only one value. This is a more typically set-based approach and it's not a bad idea to get used to set-based thinking when working with relational databases. Until you execute the query, you don't know the size of the result set; if you're going to assume there's not a tie for first place, you have to justify that assumption somehow. (Since astronomers regularly find new moons and planets, I'd have a hard time justifying this one!)
The way I'd prefer to write this query is with something called a Common Table Expression (CTE). These are supported in recent versions of SQLite and in many other databases but last I checked GalaXQL was using an older version of the SQLite engine that doesn't include this feature. CTEs let you refer to a subquery multiple times using an alias, rather than having to write it out in full each time. A solution using CTEs could look like this:
WITH body_counts AS
(SELECT
p.starid,
COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
SELECT
starid
FROM
body_counts
WHERE
total_bodies = (SELECT MAX(total_bodies) FROM body_counts);
Result:
| STARID |
|--------|
| 2 |
Check out this query in action on SQLFiddle. To confirm that this query can show more than one row in the case of a tie, try changing MAX() on the last line to MIN().
Just for you
Doing this without CTEs is ugly but it can be done if the table size is manageable. Looking at the query above, our CTE is aliased as body_counts and we refer to it twice—in the FROM clause and in the WHERE clause. We can replace both of those references with the statement that we used to define body_counts (removing the id column once in the second subquery, where it's not used):
SELECT
starid
FROM
(SELECT
p.starid,
COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
WHERE
total_bodies = (SELECT MAX(total_bodies) FROM
(SELECT
COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
);
This is the tie-friendly approach that should work for you in GalaXQL. See it working here in SQLFiddle.
Now that you've seen both, isn't the CTE version easier to understand? MySQL, which didn't support CTEs until the 2018 release of version 8.0, would additionally demand aliases for our subqueries. Fortunately, SQLite does not, because in this case it's just extra verbiage to add to an already over-complicated query.
Well, that was fun—are you sorry you asked? ;)
(P.S., if you were wondering what's up with planet number nine: giant space potato chips tend to have very eccentric orbits.)
Maybe something like this is what you want?
select
stars.starid as sid,
(count(distinct moons.moonid)+count(distinct planets.planetid)) as total
from stars
left join planets on stars.starid=planets.starid
left join moons on planets.planetid=moons.planetid
group by stars.starid
order by 2 desc
limit 1
Sample SQL Fiddle
I have a couple of tables, which I have simplified to the extreme for the purpose of this example.
Table 1, 'Units'
ID UnitName PartNumber
1 UnitX 1
2 UnitX 1
3 UnitX 2
4 UnitX 3
5 UnitX 3
6 UnitY 1
7 UnitY 2
8 UnitY 3
Table 2, 'Parts'
ID PartName Quantity
1 Screw 2
2 Wire 1
3 Ducttape 1
I would like to query on these tables which of these units would be Possible to build, AND if so, which one could be built first ideally to make efficient use of these parts.
Now the question is: can this be done using SQL, or is a background application required/more efficient?
So in this example, it is easy, because only one unit (unit Y) can be built.. But I guess you get the idea. (I'm not looking for a shake and bake answer, just your thoughts on this.)
Thanks
As you present it, it is efficient to use sql. As you described PartNumber column of table Units is a foreign key on ID column of Parts table, so a simple outer join or selecting units that the PartNumber are "NOT IN" the Parts table would give you the units that can not be build.
However if your db schema consists of many non normalised tables, or is very complex without indexes, other "bad" things etc
it could be examined whether specific application code is faster. But i really doubt it for the particular case, the query seems trivial.
I am working through a group by problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different (but the same domain of values) for each of the variables to be summed. In pseudo-pseudo code, this is my issue: For each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
More details on the database, which is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber|emp1990 |emp1991|emp1992|... |emp2011|
a | 12 |32 |31 |... | 35 |
b | |2 |3 |... | 5 |
c | 1 |1 | |... | |
d | 40 |86 |104 |... | 350 |
...
I would ultimately like to have a table that is structured like this:
county |emp1990|emp1991|emp1992|...|emp2011|sales1990|sales1991|sales1992|sales2011|...
A
B
C
...
My main challenge right now is this: How can I sum employment (or sales) by county by year as in the example table above, given that county as a grouping variable changes sometimes by the year and specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in postgres.
As I understand your question this sounds relatively straight forward. While I normally prefer normalized data to work with, I don't see that normalizing things beforehand will buy you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or information schema to generate a list of columns to sum up. the rest is a straight group by and join process as far as I can tell.
if the variable changes by name, the best thing to do in my experience is to put together a location view based on that union and join against it. This lets you hide the complexity from your main queries and as long as you don't also join the underlying tables should perform quite well.