How to run a subquery to split a table into two groups? - sql

I've got a table called spending (actually in BigQuery, though I don't think that's necessarily relevant for this question) that is about 2.9GB and 19 million rows.
The data structure is like this:
product,org,spend,to_include,proportion_overseas
----------------------------------
SK001,03V,"Yes",0.1
SK002,03V,2.4,"Yes",0.1
SK001,03T,66.1,"No",0.47
SK002,03T,87.1,"No",0.47
SK001,04C,16.1,"Yes",0
SK002,04C,27.1,"Yes",0
...
For info, it is slightly denormalised, in that to_include and proportion_overseas are actually properties of each organisation.
Now I want to work out, for each product:
the total amount that all organisations with no overseas spending spent on that product, and
the total amount that all organisations with non-zero overseas spending spent on that product.
I also want to include in this calculation only rows where to_include='Yes'.
I'm not sure what the best approach to do this is in SQL. I don't mind whether I end up with two tables, or one.
I know how to get all spending by code, for all relevant rows:
SELECT product, SUM(spend)
FROM spending
WHERE to_include='Yes'
GROUP BY product;
But what I don't know is how to split the rows into two groups: one group where proportion_overseas=0 and one group where proportion_overseas>0.
I don't think 'subquery' is the right term, so I don't really know what to Google for!

You can use conditional aggregation:
SELECT product, SUM(spend),
SUM(CASE WHEN proportion_overseas = 0 THEN spend ELSE 0 END) as not_overseas,
SUM(CASE WHEN proportion_overseas > 0 THEN spend ELSE 0 END) as overseas
FROM spending
WHERE to_include='Yes'
GROUP BY product;
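To try the conditional aggregation without a BigQuery project, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. The sample rows and the org codes A, B and C are made up for illustration, and the `total` alias is an addition; only the table and column names come from the question.

```python
import sqlite3

# In-memory database with a few hypothetical rows (values invented for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spending (product TEXT, org TEXT, spend REAL, "
    "to_include TEXT, proportion_overseas REAL)"
)
conn.executemany(
    "INSERT INTO spending VALUES (?, ?, ?, ?, ?)",
    [
        ("SK001", "A", 10.0, "Yes", 0.0),
        ("SK001", "B", 5.0, "Yes", 0.25),
        ("SK001", "C", 100.0, "No", 0.5),  # dropped by the WHERE clause
        ("SK002", "A", 20.0, "Yes", 0.0),
        ("SK002", "B", 8.0, "Yes", 0.25),
    ],
)

# Each CASE expression contributes spend to exactly one of the two buckets.
query = """
SELECT product,
       SUM(spend) AS total,
       SUM(CASE WHEN proportion_overseas = 0 THEN spend ELSE 0 END) AS not_overseas,
       SUM(CASE WHEN proportion_overseas > 0 THEN spend ELSE 0 END) AS overseas
FROM spending
WHERE to_include = 'Yes'
GROUP BY product
ORDER BY product
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every included row lands in one bucket or the other, so not_overseas and overseas add up to the total for each product, as long as proportion_overseas is never NULL or negative.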

Related

SQL multiple constrained counts in query

I am trying to get with 1 query multiple count results where each one is a subset of the previous one.
So my table would be called Recipe and has these columns:
recipe_num(Primary_key) decimal,
recipe_added date,
is_featured bit,
liked decimal
And what I want is to make a query that will return the amount of likes grouped by day for any particular month with
total recipes as total_recipes,
total recipes that were featured as featured_recipes,
total number of recipes that were featured and had more than 100 likes liked_recipes
So, as you can see, they are all counts, each being a subset of the previous one.
Ideally I don't want to run separate SELECT COUNTs that each query the whole table, but rather derive each count from the previous one.
I am not very good at using COUNT with WHERE, HAVING, etc., and I'm not exactly sure how to do it. So far I have the following, which I managed by digging around here.
select
recipe_added,
count(*) total_recipes,
count(case is_featured when 1 then 1 else null end) total_featured_recipes
from
RECIPES
group by
recipe_added
I am not exactly sure why I have to use CASE inside the COUNT, but I wasn't able to get it to work using WHERE. I would like to know if that is possible as well.
Thanks
With a CASE expression inside COUNT() you are doing conditional aggregation and this is exactly what you need for this requirement:
select recipe_added,
count(*) total_recipes,
count(case when is_featured = 1 then 1 end) total_featured_recipes,
count(case when is_featured = 1 and liked > 100 then 1 end) liked_recipes
from Recipes
group by recipe_added
There is no need for ELSE null because the default behavior of a CASE expression is to return null when no other branch returns a value.
If you want results for a specific month, say October 2020, you can add a WHERE clause before the GROUP BY:
where format(recipe_added, 'yyyyMM') = '202010'
This will work for SQL Server.
If you are using a different database then you can use a similar approach.
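As one possible "similar approach" for a database without FORMAT(), the month filter can be written as a date range. The sketch below uses SQLite from Python and assumes recipe_added is stored as ISO-8601 text (YYYY-MM-DD); the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Recipes (recipe_num INTEGER PRIMARY KEY, recipe_added TEXT, "
    "is_featured INTEGER, liked INTEGER)"
)
conn.executemany(
    "INSERT INTO Recipes VALUES (?, ?, ?, ?)",
    [
        (1, "2020-10-01", 1, 150),  # featured and liked > 100
        (2, "2020-10-01", 1, 20),   # featured only
        (3, "2020-10-01", 0, 500),  # not featured
        (4, "2020-10-02", 1, 101),
        (5, "2020-11-01", 1, 900),  # outside October, filtered out
    ],
)

# Half-open range: from the first day of the month up to, but excluding,
# the first day of the next month.
query = """
SELECT recipe_added,
       COUNT(*) AS total_recipes,
       COUNT(CASE WHEN is_featured = 1 THEN 1 END) AS total_featured_recipes,
       COUNT(CASE WHEN is_featured = 1 AND liked > 100 THEN 1 END) AS liked_recipes
FROM Recipes
WHERE recipe_added >= '2020-10-01' AND recipe_added < '2020-11-01'
GROUP BY recipe_added
ORDER BY recipe_added
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

A range comparison on the bare column also leaves the database free to use an index on recipe_added, which wrapping the column in a function usually prevents.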

How to retrieve data when I have following type of data in a database?

I am learning MySQL(Beginner). And I am trying to solve one question in which I got stuck.
From the above database ORDER_TABLE, I am trying to extract the ORDER_ID when the user inputs various fruits, stationery items and drinks. The names of the respective items (Apple, Book, Tea) will be provided to us. We need to check whether the database contains the items provided by the user; if it does, we need to return its ORDER_ID, and if not, a message saying "No such order yet".
How do we check and extract it? Do we need to use a loop here? If yes, suppose the quantity of items is provided to us; how would we use a loop then? I am totally confused. Beginners' tutorials do not teach this kind of question and its solution.
Or, can we break this data into several tables and then get the ORDER_ID?
One simple approach uses aggregation. For example, to find all orders which have (Apple, Book, Tea), we can try:
SELECT Order_ID
FROM yourTable
GROUP BY Order_ID
HAVING
COUNT(CASE WHEN Fruits = 'Apple' THEN 1 END) > 0 AND
COUNT(CASE WHEN Stationary = 'Book' THEN 1 END) > 0 AND
COUNT(CASE WHEN Drinks = 'Tea' THEN 1 END) > 0;
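No loop is needed on the SQL side. Here is a minimal sketch of the whole lookup, including the "No such order yet" message, using SQLite from Python. The table layout (Order_ID, Fruits, Stationary, Drinks, possibly several rows per order) is an assumption inferred from the query above, since the original table is not shown; the rows and the find_orders helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (Order_ID INTEGER, Fruits TEXT, Stationary TEXT, Drinks TEXT)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        (1, "Apple", "Book", "Tea"),
        (2, "Apple", "Pen", "Coffee"),
        (3, "Mango", "Book", "Tea"),
        (4, "Apple", "Pen", "Coffee"),   # order 4 spans two rows
        (4, "Banana", "Book", "Tea"),
    ],
)

# The user's items are passed as parameters instead of being pasted into the SQL.
query = """
SELECT Order_ID
FROM yourTable
GROUP BY Order_ID
HAVING COUNT(CASE WHEN Fruits = ? THEN 1 END) > 0
   AND COUNT(CASE WHEN Stationary = ? THEN 1 END) > 0
   AND COUNT(CASE WHEN Drinks = ? THEN 1 END) > 0
ORDER BY Order_ID
"""


def find_orders(fruit, stationery, drink):
    """Return the IDs of orders containing all three items (hypothetical helper)."""
    return [row[0] for row in conn.execute(query, (fruit, stationery, drink))]


matches = find_orders("Apple", "Book", "Tea")
print(matches if matches else "No such order yet")

missing = find_orders("Kiwi", "Book", "Tea")
print(missing if missing else "No such order yet")
```

Because the counts are taken per Order_ID, an order qualifies even when its items sit on different rows, as order 4 does here.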

Query for grouping of successful attempts when order matters

Let's say, for example, I have a db table Jumper for tracking high jumpers. It has three columns of interest: attempt_id, athlete, and result (a boolean for whether the jumper cleared the bar or not).
I want to write a query that will compare all athletes' performance across different attempts yielding a table with this information: attempt number, number of cleared attempts, total attempts. In other words, what is the chance that an athlete will clear the bar on x attempt.
What is the best way of writing this query? It is trickier than it would seem at first because you need to determine the attempt number for each athlete to be able to total the final totals.
I would prefer answers be written with Django ORM, but SQL will also be accepted.
Edit: To be clear, I need it to be grouped by attempt, not by athlete. So it would be all athletes' combined x attempt.
You could solve it using SQL:
SELECT t.attempt_id,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM Jumper t
GROUP BY t.attempt_id
EDIT: If attempt_id is just a sequence, and you want to use it to calculate the attempt number for each jumper, you could use this query instead:
SELECT t.attempt_number,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM (SELECT s.*,
ROW_NUMBER() OVER(PARTITION BY athlete
ORDER BY attempt_id) AS attempt_number
FROM Jumper s) t
GROUP BY t.attempt_number
This way, you group every first attempt from all athletes, every second attempt from all athletes, and so on...
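The ROW_NUMBER() version can be checked on a tiny data set. This is a sketch using SQLite from Python (window functions need SQLite 3.25 or newer); the athletes and results are made up, and result is stored as 0/1 because SQLite has no separate boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Jumper (attempt_id INTEGER PRIMARY KEY, athlete TEXT, result INTEGER)"
)
conn.executemany(
    "INSERT INTO Jumper VALUES (?, ?, ?)",
    [
        (1, "A", 0),  # A's 1st attempt: miss
        (2, "B", 1),  # B's 1st attempt: cleared
        (3, "A", 1),  # A's 2nd attempt: cleared
        (4, "B", 0),  # B's 2nd attempt: miss
        (5, "A", 1),  # A's 3rd attempt: cleared
    ],
)

# The inner query numbers each athlete's attempts; the outer query groups
# by that number across all athletes.
query = """
SELECT t.attempt_number,
       SUM(CASE WHEN t.result = 1 THEN 1 ELSE 0 END) AS cleared,
       COUNT(*) AS total
FROM (SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY athlete
                                ORDER BY attempt_id) AS attempt_number
      FROM Jumper s) t
GROUP BY t.attempt_number
ORDER BY t.attempt_number
"""
rows = conn.execute(query).fetchall()
for attempt_number, cleared, total in rows:
    print(attempt_number, cleared, total)
```

Dividing cleared by total for each row gives the share of athletes who cleared the bar on that attempt number.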

Counting Customers in different ways - Do I need a SubQuery?

Thanks for taking the time to read my question.
Imagine a situation where a customer is given a free gift if they open an account. Some ‘customers’ open an account to get the free gift, but never add any money to their account. Other ‘customers’ open accounts, get the free gift and also fund their accounts by adding money to it.
I need to compare the funded accounts to the overall count of all customers.
Things at first appeared quite easy …
SELECT Audits.AuditDate, Count(Audits.NickName) AS AllAccounts
FROM Audits
Group By Audits.AuditDate
This obviously gives me the count of all accounts on a daily basis.
To get the ‘Funded’ count, I do …
SELECT Audits.AuditDate, Count(Audits.NickName) AS Funded
FROM Audits
WHERE Audits.CurrGBP > 0
GROUP BY Audits.AuditDate;
This time I get the count of the ‘Funded’ accounts.
Now, this is where I get stuck … I want both counts from the same query so my results would be like this …
AuditDate (DD/MM/YYYY) AllAccounts Funded
01/01/2012 50 45
02/01/2012 60 50
03/01/2012 70 55
Something is telling me I need to use a Sub Query, but after googling a few pages, Sub Queries are still baffling to me.
May I ask for some help please? Can you show me how to write a Sub Query to give me the results I need?
Regards,
John.
PS - My Audits table has the following fields, Audit_ID, Audit_Date, NickName, CurrGBP and I am using MS Access 2010.
I can't remember if Access supports CASE, or if IIF is the way to go... but something like the following should work.
A sub query isn't really needed: you can get the results in one query by limiting what you count with a CASE or IIF.
Select A.AuditDate,
count (A.NickName) as AllAccounts,
sum(CASE when A.CurrGBP > 0 then 1 else 0 end) as Funded
FROM Audits A
GROUP BY A.AuditDate
Or, if IIF is needed:
Select A.AuditDate,
count (A.NickName) as AllAccounts,
sum(IIF(A.CurrGBP >0,1,0)) as Funded
FROM Audits A
GROUP BY A.AuditDate
EDIT, was missing some commas in selects.
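As a quick check that the single query returns both counts, here is a sketch using SQLite from Python. It is not Access, so it uses the CASE form; in Access the IIF form above would take its place. The nicknames, dates and balances are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Audits (AuditDate TEXT, NickName TEXT, CurrGBP REAL)")
conn.executemany(
    "INSERT INTO Audits VALUES (?, ?, ?)",
    [
        ("2012-01-01", "ann", 10.0),
        ("2012-01-01", "bob", 0.0),   # opened an account but never funded it
        ("2012-01-01", "cat", 5.0),
        ("2012-01-02", "dan", 0.0),
        ("2012-01-02", "eve", 0.0),
    ],
)

# One pass over the table: count every account, and add 1 only for funded ones.
query = """
SELECT A.AuditDate,
       COUNT(A.NickName) AS AllAccounts,
       SUM(CASE WHEN A.CurrGBP > 0 THEN 1 ELSE 0 END) AS Funded
FROM Audits A
GROUP BY A.AuditDate
ORDER BY A.AuditDate
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

A date with no funded accounts still appears, with Funded = 0, whereas the separate WHERE CurrGBP > 0 query from the question would omit that date entirely.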
try this:
SELECT AuditA.AuditDate,
Count(AuditA.NickName) AS AllAccounts,
iTable.iCount AS Funded
FROM Audits AS AuditA
INNER JOIN
(SELECT AuditB.AuditDate, Count(AuditB.NickName) AS iCount
FROM Audits AS AuditB
WHERE AuditB.CurrGBP > 0
GROUP BY AuditB.AuditDate) AS iTable
ON AuditA.AuditDate = iTable.AuditDate
GROUP BY AuditA.AuditDate, iTable.iCount

Which is faster: Sum(Case When) Or Group By/Count(*)?

I can write
Select
Sum(Case When Resposta.Tecla = 1 Then 1 Else 0 End) Valor1,
Sum(Case When Resposta.Tecla = 2 Then 1 Else 0 End) Valor2,
Sum(Case When Resposta.Tecla = 3 Then 1 Else 0 End) Valor3,
Sum(Case When Resposta.Tecla = 4 Then 1 Else 0 End) Valor4,
Sum(Case When Resposta.Tecla = 5 Then 1 Else 0 End) Valor5
From Resposta
Or
Select
Count(*)
From Resposta Group By Tecla
I tried this over a large number of rows and it seems to take the same time.
Can anyone confirm this?
I believe the Group By is better because there are no specific treatments.
It can be optimized by the database engine.
I think the results may depend on the database engine you use.
Maybe the one you are using optimizes the first query, understanding that it is like a GROUP BY!
You can try the "explain / explain plan" command to see how the engine is computing your queries, but with my Microsoft SQL Server 2008 I can only see a swap between 2 operations ("Compute scalar" and "aggregate").
I tried such queries on a database table:
SQL Server 2008
163000 rows in the table
12 categories (Valor1 -> Valor12)
The results are quite different:
Group By: 2 seconds
Case When: 6 seconds!
So my choice is "Group By".
Another benefit is that the query is simpler to write!
What the DB does internally with the second query is practically the same as what you explicitly tell it to do with the first. There should be no difference in the execution plan and thus in the time the query takes. Taking this into account, clearly using the second query is better:
it's much more flexible: when there are more values of Tecla you don't need to change your query
it's easier to understand: if you have a lot of values for Tecla, it'll be harder to read the first query and realize it just counts distinct values
it's smaller: you're sending less information to the DB server and it will probably parse the query faster, which is the only performance difference I see in these queries. This makes a difference, albeit a small one
Either one is going to have to read all rows from Resposta, so for any reasonably sized table, I'd expect the I/O cost to dominate - giving approximately the same overall runtime.
I'd generally use:
Select
Tecla,
Count(*)
From Resposta
Group By Tecla
If there's a reasonable chance that the range of Tecla values will change in the future.
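To see how the two forms relate, here is a small sketch using SQLite from Python that runs both against the same hypothetical Tecla values. It compares only the results, not the timings, which depend on the engine and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Resposta (Tecla INTEGER)")
# Made-up answers; note that nobody pressed key 4.
conn.executemany(
    "INSERT INTO Resposta VALUES (?)",
    [(1,), (1,), (2,), (3,), (3,), (3,), (5,)],
)

# Form 1: one row, one hard-coded column per expected Tecla value.
pivot = conn.execute("""
SELECT SUM(CASE WHEN Tecla = 1 THEN 1 ELSE 0 END) AS Valor1,
       SUM(CASE WHEN Tecla = 2 THEN 1 ELSE 0 END) AS Valor2,
       SUM(CASE WHEN Tecla = 3 THEN 1 ELSE 0 END) AS Valor3,
       SUM(CASE WHEN Tecla = 4 THEN 1 ELSE 0 END) AS Valor4,
       SUM(CASE WHEN Tecla = 5 THEN 1 ELSE 0 END) AS Valor5
FROM Resposta
""").fetchone()

# Form 2: one row per Tecla value that actually occurs in the table.
grouped = conn.execute("""
SELECT Tecla, COUNT(*)
FROM Resposta
GROUP BY Tecla
ORDER BY Tecla
""").fetchall()

print(pivot)
print(grouped)
```

The counts agree, but the shapes differ: the CASE form reports 0 for a value with no rows, while GROUP BY leaves that value out and picks up any new value without a change to the query.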
In my opinion, a GROUP BY statement will always be faster than SUM(CASE WHEN ...), because in your example the SUM version performs 5 different calculations, while with GROUP BY the DB will simply sort and count.
Imagine you have a bag of different coins and you need to know how many of each type of coin you have. You can do it in two ways:
The SUM(CASE WHEN ...) way would be to compare each coin with predefined sample coins and do the math for each sample (add 1 or 0);
The GROUP BY way would be to sort the coins by type and then count each group.
Which method would you prefer?
To compete fairly with count(*), your first SQL should probably be:
Select
Sum(Case When Resposta.Tecla >= 1 AND Resposta.Tecla <=5 Then 1 Else 0 End) Valor
From Resposta
And to answer your question, I'm not noticing any difference in speed between SUM CASE WHEN and COUNT. I'm querying over 250,000 rows in PostgreSQL.