COUNT in a query with multiple JOINS and a GROUP BY CLAUSE - sql

I am working on a database that contains 3 tables:
A list of companies
A table of the products they sell
A table of prices they offered on each date
I'm running a query like this from PHP to generate a list of the companies offering the lowest prices on a certain product type on a certain date.
SELECT
a.name AS company,
c.id,
MIN(c.price) AS apy
FROM `companies` a
JOIN `company_products` b ON b.company_id = a.id
JOIN `product_prices` c ON c.product_id = b.id
WHERE
b.type = "%s"
AND c.date = "%s"
GROUP BY a.id
ORDER BY c.price ASC
LIMIT %d, %d
This gets me the data I need, but in order to implement a pager in PHP I need to know how many companies offering that product on that day there are in total. The LIMIT means that I only see the first few...
I tried changing the SELECT clause to SELECT COUNT(a.id) or SELECT COUNT(DISTINCT(a.id)), but neither of those seems to give me what I want. I tried removing the GROUP BY and ORDER BY in my count query, but that didn't work either. Any ideas?

Looks to me like you should GROUP BY a.id, c.id -- grouping by a.id only means you'll typically have several c.ids per a.id, and you're just getting a "random-ish" one of them. This seems like a question of basic correctness. Once you have fixed that, an initial SELECT COUNT(*) FROM etc etc should then definitely give you the number of rows the following query will return, so you can prepare your pager accordingly.
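One way to get the pager total is to wrap the grouped query, without its ORDER BY and LIMIT, in a subquery and count its rows. The sketch below is only an illustration: it uses Python's sqlite3 rather than PHP and MySQL, `?` placeholders instead of `%s`, and invented sample rows. The table and column names follow the question.

```python
import sqlite3

# In-memory database with made-up sample data (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE company_products (id INTEGER PRIMARY KEY, company_id INTEGER, type TEXT);
    CREATE TABLE product_prices (id INTEGER PRIMARY KEY, product_id INTEGER, date TEXT, price REAL);

    INSERT INTO companies VALUES (1, 'Acme'), (2, 'Bolt'), (3, 'Crest');
    -- Acme sells two matching products, Bolt one, Crest none of this type.
    INSERT INTO company_products VALUES (1, 1, 'cd'), (2, 1, 'cd'), (3, 2, 'cd'), (4, 3, 'loan');
    INSERT INTO product_prices VALUES
        (1, 1, '2010-01-01', 1.5),
        (2, 2, '2010-01-01', 1.2),
        (3, 3, '2010-01-01', 1.7),
        (4, 4, '2010-01-01', 9.9);
""")

params = ("cd", "2010-01-01")

# Page query: one row per company with its lowest price, first page of size 1.
page = conn.execute("""
    SELECT a.name AS company, MIN(c.price) AS apy
    FROM companies a
    JOIN company_products b ON b.company_id = a.id
    JOIN product_prices c ON c.product_id = b.id
    WHERE b.type = ? AND c.date = ?
    GROUP BY a.id
    ORDER BY apy ASC
    LIMIT 1 OFFSET 0
""", params).fetchall()

# Count query: same joins, filters and grouping, wrapped in a subquery.
# Each group becomes one row, so COUNT(*) is the number of rows to page over.
total = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT a.id
        FROM companies a
        JOIN company_products b ON b.company_id = a.id
        JOIN product_prices c ON c.product_id = b.id
        WHERE b.type = ? AND c.date = ?
        GROUP BY a.id
    ) AS grouped
""", params).fetchone()[0]

print(page, total)
```

Counting the rows of the wrapped query keeps the total in step with whatever grouping the page query uses, so it still works if the GROUP BY is later changed.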

This website suggests MySQL has a special trick for this, at least as of version 4:
Luckily since MySQL 4.0.0 you can use SQL_CALC_FOUND_ROWS option in your query which will tell MySQL to count total number of rows disregarding LIMIT clause. You still need to execute a second query in order to retrieve row count, but it’s a simple query and not as complex as your query which retrieved the data.
Usage is pretty simple. In your main query you need to add the SQL_CALC_FOUND_ROWS option just after SELECT, and in the second query you need to use the FOUND_ROWS() function to get the total number of rows. Queries would look like this:
SELECT SQL_CALC_FOUND_ROWS name, email
FROM users
WHERE name LIKE 'a%'
LIMIT 10;
SELECT FOUND_ROWS();
The only limitation is that you must call second query immediately after the first one because SQL_CALC_FOUND_ROWS does not save number of rows anywhere.

Related

SQL subselect statement very slow on certain machines

I've got an sql statement where I get a list of all Ids from a table (Machines).
I then need the latest instance of a row in another table (Events) where the ids match, so I have been doing a subselect.
I need the latest instance of quite a few fields that match the id, so I have these subselects one after another within this single statement and end up with results similar to this...
This works and the results are spot on, it's just becoming very slow as the Events Table has millions of records. The Machine table would have on average 100 records.
Is there a better solution than subselects? Maybe doing inner joins or a stored procedure?
Help appreciated :)
You can use apply. You don't specify how "latest instance" is defined. Let me assume it is based on the time column:
Select a.id, b.*
from TableA a outer apply
(select top(1) b.Name, b.time, b.weight
from b
where b.id = a.id
order by b.time desc
) b;
Both APPLY and the correlated subquery need an ORDER BY to do what you intend.
APPLY is a lot like a correlated subquery in the FROM clause -- with two convenient enhancements. A lateral join -- technically what APPLY does -- can return multiple rows and multiple columns.
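OUTER APPLY is specific to SQL Server, so the sketch below shows the same "latest row per id" idea with a swapped-in technique that runs in SQLite: a LEFT JOIN whose ON clause uses a correlated subquery with ORDER BY and LIMIT 1. The Machines and Events table names come from the question. The column names, the index and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: only the table names come from the question.
    CREATE TABLE machines (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE events (machine_id INTEGER, name TEXT, time TEXT, weight REAL);
    -- An index on (machine_id, time) lets each per-machine lookup avoid a full scan.
    CREATE INDEX events_machine_time ON events (machine_id, time);

    INSERT INTO machines VALUES (1, 'press'), (2, 'lathe'), (3, 'idle');
    INSERT INTO events VALUES
        (1, 'start', '2020-01-01 08:00', 10.0),
        (1, 'stop',  '2020-01-01 17:00', 12.5),
        (2, 'start', '2020-01-02 09:00', 7.0);
""")

# For each machine, join to the single event row with the greatest time.
# The ORDER BY inside the subquery is what defines "latest".
latest = conn.execute("""
    SELECT m.id, e.name, e.time, e.weight
    FROM machines m
    LEFT JOIN events e
      ON e.rowid = (
          SELECT e2.rowid
          FROM events e2
          WHERE e2.machine_id = m.id
          ORDER BY e2.time DESC
          LIMIT 1
      )
    ORDER BY m.id
""").fetchall()

for row in latest:
    print(row)
```

Like OUTER APPLY, this returns several columns from one lookup per machine and keeps machines that have no events, with NULLs in the event columns.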

When to Use * in SQL Query Containing JOINs & Aggregations?

Question
The web_events table contains id, ..., channel, account_id
The accounts table contains id, ..., sales_rep_id
The sales_reps table contains id, name
Given the above tables, write an SQL query to determine the number of times a particular channel was used in the web_events table for each name in sales_reps. Your final table should have three columns - the name of the sales_reps, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
Answer
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
The COUNT(*) is confusing to me. I don't get how SQL figures out that COUNT(*) is COUNT(w.channel). Can anyone clarify?
I don't get how SQL figures out that COUNT(*) is COUNT(w.channel)
COUNT() is an aggregation function that counts the number of rows that match a condition. In fact, COUNT(<expression>) in general (or COUNT(column) in particular) counts the number of rows where the expression (or column) is not NULL.
In general, the following do exactly the same thing:
COUNT(*)
COUNT(1)
COUNT(<primary key used on inner join>)
In general, I prefer COUNT(*) because that is the SQL standard for this. I can accept COUNT(1) as a recognition that COUNT(*) is just feature bloat. However, I see no reason to use the third version, because it just requires excess typing.
More than that, I find that new users often get confused between these two constructs:
COUNT(w.channel)
COUNT(DISTINCT w.channel)
People learning SQL often think the first really does the second. For this reason, I recommend sticking with the simpler ways of counting rows. Then use COUNT(DISTINCT) when you really want to incur the overhead to count unique values (COUNT(DISTINCT) is more expensive than COUNT()).
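The differences between these forms are easiest to see on a few rows. The sketch below uses Python's sqlite3 with a cut-down web_events table. The sample rows, including one NULL channel, are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE web_events (id INTEGER PRIMARY KEY, channel TEXT, account_id INTEGER);
    -- Four rows: 'direct' twice, 'organic' once, and one row with a NULL channel.
    INSERT INTO web_events (channel, account_id) VALUES
        ('direct', 1), ('direct', 1), ('organic', 2), (NULL, 2);
""")

row = conn.execute("""
    SELECT COUNT(*),                 -- every row
           COUNT(1),                 -- every row (the constant 1 is never NULL)
           COUNT(id),                -- every row (a primary key is never NULL)
           COUNT(channel),           -- rows where channel IS NOT NULL
           COUNT(DISTINCT channel)   -- distinct non-NULL channel values
    FROM web_events
""").fetchone()

print(row)  # (4, 4, 4, 3, 2)
```

In the answer's query the GROUP BY already puts each (name, channel) pair in its own group, so COUNT(*) simply counts the joined rows in that group.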

How to combine many single result queries into a single query with multiple results

I currently have a table (post) with the following columns:
id, stock_code, posted_at
For a given stock code SC and time T1, I can retrieve the newest post after a certain time with something like
SELECT * FROM post WHERE stock_code = SC AND time > T1 ORDER BY time asc LIMIT 1; (not actually tested, but you get the gist)
However, I want to get that result for a set of multiple stocks (or even for every distinct stock code in the table). I could simply run this query multiple times, however that quickly becomes inefficient, and it would be best to combine into one SQL query, however I can't wrap my head around how to do that. I would like each row to be the newest post after a certain time for a given stock, and have one row for each stock. How do I go about doing this?
P.S. Using Postgres 9.4.8, and SqlAlchemy on the python side. Would be happy with just SQL, however if there is some SqlAlchemy magic to get to the same result that would be awesome.
Use distinct on:
SELECT DISTINCT ON (stock_code) p.*
FROM post p
WHERE p.stock_code = 'SC' AND p.time > T1
ORDER BY p.stock_code, time asc;
Obviously, with the WHERE clause, this will return one row. You can remove the p.stock_code = 'SC' and get one row per stock_code.
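DISTINCT ON is a PostgreSQL extension. For comparison, here is a sketch of a portable alternative (not the answer's method): join each post to the earliest posted_at after T1 for its stock_code. It runs in Python's sqlite3, uses the posted_at column from the question's table definition, and relies on invented sample rows and an invented cutoff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, stock_code TEXT, posted_at TEXT);
    INSERT INTO post VALUES
        (1, 'AAA', '2016-01-01'),
        (2, 'AAA', '2016-03-01'),
        (3, 'AAA', '2016-05-01'),
        (4, 'BBB', '2016-04-01'),
        (5, 'CCC', '2016-01-15');
""")

t1 = "2016-02-01"  # hypothetical cutoff standing in for T1

# Inner query: earliest posted_at after the cutoff, per stock_code.
# Outer join: fetch the full post row that carries that timestamp.
rows = conn.execute("""
    SELECT p.id, p.stock_code, p.posted_at
    FROM post p
    JOIN (
        SELECT stock_code, MIN(posted_at) AS first_at
        FROM post
        WHERE posted_at > ?
        GROUP BY stock_code
    ) f ON f.stock_code = p.stock_code AND f.first_at = p.posted_at
    ORDER BY p.stock_code
""", (t1,)).fetchall()

print(rows)  # [(2, 'AAA', '2016-03-01'), (4, 'BBB', '2016-04-01')]
```

Unlike DISTINCT ON, this form returns more than one row for a stock if two posts share the same earliest timestamp, and stocks with no post after the cutoff are omitted in both forms.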
Use UNION or UNION ALL to combine the results of many queries into one.

Specifying SELECT, then joining with another table

I just hit a wall with my SQL query fetching data from my MS SQL Server.
To simplify, say I have one table for sales and one table for customers. They each have a corresponding userId which I can use to join the tables.
I wish to first SELECT from the sales table where, say, price is equal to 10, and then join it on the userId in order to get access to the name, address etc. from the customer table.
In which order should I structure the query? Do I need some sort of subquery, or what do I do?
I have tried something like this
SELECT *
FROM Sales
WHERE price = 10
INNER JOIN Customers
ON Sales.userId = Customers.userId;
Needless to say this is very simplified and not my database schema, yet it explains my problem simply.
Any suggestions ? I am at a loss here.
A SELECT has a certain order of its components
In the simple form this is:
What do I select: column list
From where: table name and joined tables
Are there filters: WHERE
How to sort: ORDER BY
So: most likely it was enough to change your statement to
SELECT *
FROM Sales
INNER JOIN Customers ON Sales.userId = Customers.userId
WHERE price = 10;
The WHERE clause must follow the joins:
SELECT * FROM Sales
INNER JOIN Customers
ON Sales.userId = Customers.userId
WHERE price = 10
This is simply the way SQL syntax works. You seem to be trying to put the clauses in the order that you think they should be applied, but SQL is a declarative language, not a procedural one - you are defining what you want to occur, not how it will be done.
You could also write the same thing like this:
SELECT * FROM (
SELECT * FROM Sales WHERE price = 10
) AS filteredSales
INNER JOIN Customers
ON filteredSales.userId = Customers.userId
This may seem like it indicates a different order for the operations to occur, but it is logically identical to the first query, and in either case, the database engine may determine to do the join and filtering operations in either order, as long as the result is identical.
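A small check of that equivalence, as a sketch in Python's sqlite3 rather than SQL Server. The saleId and name columns and the sample rows are assumptions. Only Sales, Customers, userId and price come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (saleId INTEGER PRIMARY KEY, userId INTEGER, price INTEGER);
    CREATE TABLE Customers (userId INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Sales VALUES (1, 1, 10), (2, 1, 25), (3, 2, 10);
""")

# Form 1: join first, then filter in WHERE.
join_then_filter = conn.execute("""
    SELECT s.saleId, c.name
    FROM Sales s
    INNER JOIN Customers c ON s.userId = c.userId
    WHERE s.price = 10
    ORDER BY s.saleId
""").fetchall()

# Form 2: filter inside a derived table, then join.
filter_then_join = conn.execute("""
    SELECT fs.saleId, c.name
    FROM (SELECT * FROM Sales WHERE price = 10) AS fs
    INNER JOIN Customers c ON fs.userId = c.userId
    ORDER BY fs.saleId
""").fetchall()

print(join_then_filter == filter_then_join)  # True
```

Both forms describe the same result set, so the choice between them is mostly about readability.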
Sounds fine to me, did you run the query and check?
SELECT s.*, c.*
FROM Sales s
INNER JOIN Customers c
ON s.userId = c.userId
WHERE s.price = 10;

Does the order of columns in a SQL select matter?

My question is about a left join. I'm trying to count how many people are tracking a certain project
(there can be zero followers).
The only way I can get it to work is by adding
group by idproject
Is there a way to avoid this, so that I only select and the grouping
is applied implicitly?
SQL:
select `project_view`.`idproject` AS `idproject`,
count(`track`.`iduser`) AS `c`,`name`
from `project_view` left join `track` using(idproject)
I expected it to count NULL as zero, but the project doesn't appear at all. If I leave out the count, the project shows up with NULL where there are no followers.
If you have a WHERE clause to specify a certain project then you don't need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
WHERE idproject = 4
If you want a count for each project then you do need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
GROUP BY idproject
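Both cases can be tried on a small dataset. This is a sketch in Python's sqlite3 rather than MySQL, with invented sample rows, and it spells out the join condition with ON instead of USING so each column can be qualified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project_view (idproject INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE track (idproject INTEGER, iduser INTEGER);
    INSERT INTO project_view VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    -- Project 3 has no followers.
    INSERT INTO track VALUES (1, 10), (1, 11), (2, 10);
""")

# One count per project: the LEFT JOIN keeps project 3, and
# COUNT(t.iduser) skips its NULL iduser, giving 0 rather than NULL.
counts = conn.execute("""
    SELECT p.idproject, p.name, COUNT(t.iduser) AS c
    FROM project_view p
    LEFT JOIN track t ON t.idproject = p.idproject
    GROUP BY p.idproject, p.name
    ORDER BY p.idproject
""").fetchall()

# A single project selected in WHERE: the aggregate yields one row, no GROUP BY.
single = conn.execute("""
    SELECT COUNT(t.iduser)
    FROM project_view p
    LEFT JOIN track t ON t.idproject = p.idproject
    WHERE p.idproject = ?
""", (3,)).fetchone()[0]

print(counts, single)
```

Without GROUP BY and without a WHERE filter, the aggregate collapses all projects into one row, which is why individual projects seem to disappear.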
Yes, the order of selecting matters. For performance reasons you (typically) want your most limiting select first to narrow your data set. This makes every subsequent query operate on a smaller dataset.