Partitioning join to limit records in SQL - sql

I have 2 tables:
- first one containing spatial data - geometry of circles
- second contains geometries of lines.
I want to find all lines which are inside each circle. I have a query which can do that, however there are millions of records so it is unusably slow.
There is a column in both tables which is area_id and essentially all circles are assigned to particular area and all lines as well, so if I can do the intersect of the circles only with the lines in the matching area this will reduce the load a lot. The problem is I can't think of solution e.g. using windowing function. The query I am using is:
Select ct.AREA_ID, ct.Circle_descr, lt.Line_descr from circles_table as ct
JOIN lines_table as lt
ON
circles_table.Circle_location.STIntersects(points_table.Point_location)=1
*using a where clause at the end makes no difference as it is essentially part of the slow join...
+---------------+----------------------+--------------------------+
| AREA_ID (int) | Circle_descr(varchar) | Circle_location(geometry)|
+---------------+----------------------+--------------------------+
+---------------+---------------------+-------------------------+
| AREA_ID (int) | Line_descr(varchar) | Line_location(geometry) |
+---------------+---------------------+-------------------------+

Add an additional join criterion to partition the rows by area_id before comparing them. Something like
Select ct.AREA_ID, ct.Circle_descr, lt.Line_descr
from circles_table as ct
JOIN lines_table as lt
ON ct.Circle_location.STIntersects(lt.Point_location)=1
AND ct.area_id = lt.area_id

Related

SQL i have data without matching and I want to show them all with 0 in cell or blank cell

I have two tables as shown here:
Rate | soldsale
------ | ----------
keyrate | keysoldsale
rate | keyrate
tva | keyproduct
| soldsale
I wrote this SQL query:
select
rate.keyrate, rate.rate, soldsale.keysoldsale, soldsale.keyrate,
soldsale.keyproduct,soldsale.soldsale
from
soldsale, rate
where
soldsale.keyrate = rate.keyrate and keyproduct = 20
It works well but it didn't show all record of table RATE.
For example table rate don't have a sold sale on product it does not shown I want it shown and show it blank or with 0 in cell
Use a left join here and move the restriction in the WHERE clause to the ON clause of the join.
SELECT
r.keyrate,
r.rate,
s.keysoldsale,
s.keyrate,
s.keyproduct,
s.soldsale
FROM soldsale s
LEFT JOIN rate r
ON s.keyrate = r.keyrate AND s.keyproduct = 20;
Note that the explicit join I used above is the preferred way of writing queries these days. It also makes it easier to see that we are doing a left join.

How do you 'join' multiple SQL data sets side by side (that don't link to each other)?

How would I go about joining results from multiple SQL queries so that they are side by side (but unrelated)?
The reason I am thinking of this is so that I can run 1 query in Google Big Query and it will return 1 single table which I can import into Excel and do some charts.
e.g. Query 1 looks at dataset TableA and returns:
**Metric:** Sales
**Value:** 3,402
And then Query 2 looks at dataset TableB and returns:
**Name:** John
**DOB:** 13 March
They would both use different tables and different filters, etc.
What would I do to make it look like:
---Sales----------John----
---3,402-------13 March----
Or alternatively:
-----Sales--------3,402-----
-----John-------13 March----
Or is there a totally different way to do this?
I can see the use case for the above, I've used something similar to create a single table from multiple tables with different metrics to query in Data Studio so that filters apply to all data in the dataset for example. However in that case, the data did share some dimensions that made it worthwhile doing.
If you are going to put those together with no relationship between the tables, I'd have 4 columns with TYPE describing the data in that row to make for easier filtering.
Type | Sales | Name | DOB
Use UNION ALL to put the rows together so you have something like
"Sales" | 3402 | null | null
"Customer Details" | null | John | 13 March
However, like the others said, make sure you have a good reason to do that otherwise you're just creating a bigger table to query for no reason.

Using SELECT...GROUP BY...HAVING in SQLite

I'm working on exercise 17 in the Teach Yourself SQL program GalaXQL (based on SQLite). I've got three tables:
Stars that contains starid;
Planets that contains planetid and starid;
Moons that contains moonid and planetid.
I want to return the starid associated with the greatest number of planets and moons combined.
I've got a query that will return the starid, planetid and total planets + moons.
How do I change this query so it only returns the single starid corresponding to the max(total) and not a table? This is my query so far:
select
stars.starid as sid,
planets.planetid as pid,
(count(moons.moonid)+count(planets.planetid)) as total
from stars, planets, moons
where planets.planetid=moons.planetid and stars.starid=planets.starid
group by stars.starid
Let's visualize a system that might be represented by this database structure, and see if we can't translate your question into working SQL.
I drew you a galaxy:
To distinguish stars and planets from moons, I've used capital Roman numerals for starid values and lower-case Roman numerals for moonid values. And since everyone knows that astronomers have nothing to do on those long nights in the observatory but drink, I put an unexplained gap in the middle of your planetid values. Gaps like these will occur when using so-called "surrogate" IDs, because their values hold no meaning; they are simply unique identifiers for rows.
If you'd like to follow along, here's the galaxy naively loaded into SQL Fiddle (if you get a popup about switching to WebSQL, you may need to hit "cancel" and stick with SQL.js for this example to work).
Let's see, what was it you wanted again?
I want to return the starid associated with the greatest number of planets and moons combined
Awesome. Rephrased, the question is: Which star is associated with the greatest number of orbiting bodies?
Star (I) has 1 planet with 3 moons;
Star (II) has 1 planet with 1 moon and 1 planet with 2 moons;
Star (III) has 1 planet with 1 moon and 2 planets with no moons.
All we're doing here is counting the different entities associated with each star. With a total of 5 orbiting bodies, star (II) is the winner! So the final result we expect from a working query is:
| starid |
|--------|
| 2 |
I intentionally drew this awesome galaxy such that the "winning" star doesn't have the most planets and isn't associated with the planet that has the most moons. If those astronomers weren't all three sheets to the wind, I might have gotten an extra moon out of planet (1) as well, so that our winning star isn't tied for most moons total. It'll be convenient for us in this demonstration if star (II) only answers the question we're asking and not any other questions with potentially similar queries, to reduce our chances of arriving at the right answer via the wrong query.
Lost in translation
The first thing I want to do is introduce you to the explicit JOIN syntax. This is going to be your very close friend. You will always JOIN your tables, no matter what some silly tutorial says. Trust in my far sillier advice instead (and optionally, read Explicit vs implicit SQL joins).
The explicit JOIN syntax shows how we're requiring our tables to relate to each other and reserves the WHERE clause for the sole purpose of filtering rows from the result set. There are a few different types, but what we're going to start with is a plain old INNER JOIN. This is essentially what your original query performed and it implies that all you want to see in your result set is the data that overlaps in all three tables. Check out a skeleton of your original query:
SELECT ... FROM stars, planets, moons
WHERE planets.planetid = moons.planetid
AND planets.starid = stars.starid;
Given those conditions, what happens to an orphaned planet somewhere off in space that isn't associated with a star (i.e., its starid is NULL)? Since an orphaned planet has no overlap with the stars table, an INNER JOIN wouldn't include it in the result set.
In SQL any equality or inequality comparison with NULL gives a result of NULL—even NULL = NULL isn't true! Now your query has a problem, because the other condition is that planets.planetid = moons.planetid. If there's a planet for which no corresponding moon exists, that turns into planets.planetid = NULL and the planet will not appear in your query result. That's no good! Lonely planets must be counted!
The OUTER limits
Fortunately there's a JOIN for you: An OUTER JOIN, which will ensure that at least one of the tables always shows up in our result set. They come in LEFT and RIGHT flavors to indicate which table gets special treatment, relative to the position of the JOIN keyword. What joins does SQLite support? confirms that the INNER and OUTER keywords are optional, so we can use LEFT JOIN, noting that:
stars and planets are linked by a common starid;
planets and moons are linked by a common planetid;
stars and moons are linked indirectly by the above two links;
we always want to count all the planets and all the moons.
SELECT
*
FROM
stars
LEFT JOIN
planets ON stars.starid = planets.starid
LEFT JOIN
moons ON planets.planetid = moons.planetid;
Notice that instead of having a big bag o' tables and a WHERE clause, you now have one ON clause for each JOIN. As you find yourself working with more tables, this is going to be far easier to read; and because this is standard syntax, it's relatively portable between SQL databases.
Lost in space
Our new query basically grabs everything in our database. But does this correspond to everything in our galaxy? Actually, there's some redundancy here, because two of our ID fields (starid and planetid) exist in more than one table. This is only one of many reasons to avoid the SELECT * catch-all syntax in actual use cases. We only really need the three ID fields, and I'm going to throw in two more tricks while we're at it:
Aliases! You can give your tables more convenient names by using the table_name AS alias syntax. This can be really convenient when you have to refer to many different columns in a multi-table query and you don't want to type out the full table names each time.
Grab starid from the planets table and leave stars out of the JOIN entirely! Having stars LEFT JOIN planets ON stars.starid = planets.starid means that the starid field is going to be the same, no matter which table we get it from—as long as the star has any planets. If we were counting stars, we'd need this table, but we're counting planets and moons; moons by definition orbit planets, so a star with no planets also has no moons and can be ignored. (This is an assumption; check your data to make sure it's justified! Maybe your astronomers are more drunk than usual!)
SELECT
p.starid, -- This could be S.starid, if we kept using `stars`
p.planetid,
m.moonid
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid;
Result:
| starid | planetid | moonid |
|--------|----------|--------|
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 1 | 3 |
| 2 | 2 | 6 |
| 2 | 3 | 4 |
| 2 | 3 | 5 |
| 3 | 7 | |
| 3 | 8 | 7 |
| 3 | 9 | |
Mathematical!
Now our task is to decide which star is the winner, and for that we have to do some simple calculation. Let's count moons first; since they have no "children" and only one "parent" each, they're easy to aggregate:
SELECT
p.starid,
p.planetid,
COUNT(m.moonid) AS moon_count
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid, p.planetid;
Result:
| starid | planetid | moon_count |
|--------|----------|------------|
| 1 | 1 | 3 |
| 2 | 2 | 1 |
| 2 | 3 | 2 |
| 3 | 7 | 0 |
| 3 | 8 | 1 |
| 3 | 9 | 0 |
(Note: Usually we like to use COUNT(*) because it's simple to type and to read, but it would get us into trouble here! Since two of our rows have a NULL value for the moonid, we have to use COUNT(moonid) to avoid counting moons that don't exist.)
So far, so good—I see six planets, we know which star each belongs to, and the right number of moons are shown for each planet. Next step, counting the planets. You might think this requires a subquery in order to also add up the moon_count column for each planet but it's actually simpler than that; if we GROUP BY the star, our moon_count will switch from counting "moons per planet, per star" to "moons per star" which is just fine:
SELECT
p.starid,
COUNT(p.planetid) AS planet_count,
COUNT(m.moonid) AS moon_count
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;
Result:
| starid | planet_count | moon_count |
|--------|--------------|------------|
| 1 | 3 | 3 |
| 2 | 3 | 3 |
| 3 | 3 | 1 |
Now we've run into trouble. The moon_count is correct, but you should see right away that the planet_count is wrong. Why is this? Look back at the ungrouped query result and notice that there are nine rows, with three rows for each starid, and each row has a non-null value for planetid. That's what we asked the database to count with this query, when what we really meant to ask was how many different planets are there? Planet (1) appears three times with star (I) but it's the same planet each time. The fix is to stick the DISTINCT keyword inside the COUNT() function call. At the same time, we can add the two columns together:
SELECT
p.starid,
COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid;
Result:
| starid | total_bodies |
|--------|--------------|
| 1 | 4 |
| 2 | 5 |
| 3 | 4 |
And the winner is...
Counting the orbiting bodies around each star in the drawing, we can see that the total_bodies column is correct. But you didn't ask for all this information; you just want to know who won. Well, there are a bunch of ways to get there, and depending on the size and makeup of your galaxy (database), some may be more efficient than others. One approach is to ORDER BY the total_bodies expression so that the "winner" appears at the top, LIMIT 1 so that we don't see the losers, and select only the starid column (see it on SQL Fiddle).
The problem with that approach is that it hides ties. What if we gave the losing stars in our galaxy each an extra planet or moon? Now we've got a three way tie—everyone's a winner! But who shows up first when we ORDER BY a value that's always the same? In the SQL standard, this is undefined; there's no telling who will come out on top. You might run the same query twice on the same data and get two different results!
For this reason, you might prefer to ask which stars have the greatest number of orbital bodies, instead of specifying in your question that you know there is only one value. This is a more typically set-based approach and it's not a bad idea to get used to set-based thinking when working with relational databases. Until you execute the query, you don't know the size of the result set; if you're going to assume there's not a tie for first place, you have to justify that assumption somehow. (Since astronomers regularly find new moons and planets, I'd have a hard time justifying this one!)
The way I'd prefer to write this query is with something called a Common Table Expression (CTE). These are supported in recent versions of SQLite and in many other databases but last I checked GalaXQL was using an older version of the SQLite engine that doesn't include this feature. CTEs let you refer to a subquery multiple times using an alias, rather than having to write it out in full each time. A solution using CTEs could look like this:
WITH body_counts AS
(SELECT
p.starid,
COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
SELECT
starid
FROM
body_counts
WHERE
total_bodies = (SELECT MAX(total_bodies) FROM body_counts);
Result:
| STARID |
|--------|
| 2 |
Check out this query in action on SQLFiddle. To confirm that this query can show more than one row in the case of a tie, try changing MAX() on the last line to MIN().
Just for you
Doing this without CTEs is ugly but it can be done if the table size is manageable. Looking at the query above, our CTE is aliased as body_counts and we refer to it twice—in the FROM clause and in the WHERE clause. We can replace both of those references with the statement that we used to define body_counts (removing the id column once in the second subquery, where it's not used):
SELECT
starid
FROM
(SELECT
p.starid,
COUNT(DISTINCT p.planetid) + COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
WHERE
total_bodies = (SELECT MAX(total_bodies) FROM
(SELECT
COUNT(DISTINCT p.planetid)+ COUNT(m.moonid) AS total_bodies
FROM
planets AS p
LEFT JOIN
moons AS m ON p.planetid = m.planetid
GROUP BY p.starid)
);
This is the tie-friendly approach that should work for you in GalaXQL. See it working here in SQLFiddle.
Now that you've seen both, isn't the CTE version easier to understand? MySQL, which didn't support CTEs until the 2018 release of version 8.0, would additionally demand aliases for our subqueries. Fortunately, SQLite does not, because in this case it's just extra verbiage to add to an already over-complicated query.
Well, that was fun—are you sorry you asked? ;)
(P.S., if you were wondering what's up with planet number nine: giant space potato chips tend to have very eccentric orbits.)
Maybe something like this is what you want?
select
stars.starid as sid,
(count(distinct moons.moonid)+count(distinct planets.planetid)) as total
from stars
left join planets on stars.starid=planets.starid
left join moons on planets.planetid=moons.planetid
group by stars.starid
order by 2 desc
limit 1
Sample SQL Fiddle

SQL JOINing on max value, even if it is 0

I have two tables that look roughly like this:
Airports
uniqueID | Name
0001 | Dallas
Runways
uniqueID | AirportID | Length
000101 | 0001 | 8000
I'm doing a join that looks like this:
SELECT Airports.Name, Runways.Length FROM Airports, Runways
WHERE Airports.uniqueID==Runways.AirportID
Obviously, each runway has exactly one airport, and each airport has 1..n runways.
For an airport with multiple runways, this gives me several rows, one for each runway at that airport.
I want a result set that contains ONLY the row for the longest runway, i.e. MAX(Length).
Sometimes, the Length is 0 for several runways in the database, because the source data is missing. In that case I only want one row with the Length = 0 obviously.
I've tried the approach laid out here: Inner Join table with respect to a maximum value but that's actually not helpful because that's like searching for the longest runway of all, not for the longest at one particular airport.
This seems to simple to be what you want but it seems to meet all the cases you've described...
SELECT A.Name, Max(R.Length)
FROM Airports A
INNER JOIN Runways R
on A.uniqueID=R.AirportID
Group by A.Name
This should give you the max runway for each airport.
If you need additional data elements then use the above as a inline view (Subquery within the joins) to limit the results sets to just those airports and their max runway.

SQL magic - query shouldn't take 15 hours, but it does

Ok, so i have one really monstrous MySQL table (900k records, 180 MB total), and i want to extract from subgroups records with higher date_updated and calculate weighted average in each group. The calculation runs for ~15 hours, and i have a strong feeling i'm doing it wrong.
First, monstrous table layout:
category
element_id
date_updated
value
weight
source_prefix
source_name
Only key here is on element_id (BTREE, ~8k unique elements).
And calculation process:
Make hash for each group and subgroup.
CREATE TEMPORARY TABLE `temp1` (INDEX ( `ds_hash` ))
SELECT `category`,
`element_id`,
`source_prefix`,
`source_name`,
`date_updated`,
`value`,
`weight`,
MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28'
I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, i presume.
Find maximum date for each subgroup
CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
SELECT MAX(`date_updated`) AS `maxdate` , `subcat_hash`
FROM `temp1`
GROUP BY `subcat_hash`;
Join temp1 with temp2 to find weighted average values for categories
CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
SELECT `temp1`.`element_id`,
`temp1`.`category`,
`temp1`.`source_prefix`,
`temp1`.`source_name`,
`temp1`.`date_updated`,
AVG(`temp1`.`value`) AS `avg_value`,
SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
AND `temp1`.`date_updated` = `temp2`.`maxdate`
GROUP BY `temp1`.`cat_hash`;
(now that i looked through it and wrote it all down, it seems to me that i should use INNER JOIN in that last query (to avoid 900k*900k temp table)).
Still, is there a normal way to do so?
UPD: some picture for reference:
removed dead ImageShack link
UPD: EXPLAIN for proposed solution:
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | cur | ALL | NULL | NULL | NULL | NULL | 893085 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | next | ref | prefix | prefix | 1074 | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id | 1 | 100.00 | Using where |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
Using hashses is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join; this certainly doesn't look like one of them, with a 900k rows table with some aggregates.
Based on your comment, this query might do what you are looking for:
SELECT cur.source_prefix,
cur.source_name,
cur.category,
cur.element_id,
MAX(cur.date_updated) AS DateUpdated,
AVG(cur.value) AS AvgValue,
SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
ON next.date_updated < '2009-05-01'
AND next.source_prefix = cur.source_prefix
AND next.source_name = cur.source_name
AND next.element_id = cur.element_id
AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name,
cur.category, cur.element_id
The GROUP BY performs the calculations per source+category+element.
The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE statement filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
There are many ways of filtering out old entries, but this one tends to perform resonably well.
Ok, so 900K rows isn't a massive table, it's reasonably big but and your queries really shouldn't be taking that long.
First things first, which of the 3 statements above is taking the most time?
The first problem I see is with your first query. Your WHERE clause doesn't include an indexed column. So this means that it has to do a full table scan on the entire table.
Create an index on the "data_updated" column, then run the query again and see what that does for you.
If you don't need the hash's and are only using them to avail of the dark magic then remove them completely.
Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.
Edit: My SQL is a little rusty, but are you joining twice in the third SQL staement? Maybe it won't make a difference but shouldn't it be :
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
or
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;