Group by multiple criteria - sql

Given a table like
| userid | active | anonymous |
| 1 | t | f |
| 2 | f | f |
| 3 | f | t |
I need to get:
number of users
number of users with 'active' = true
number of users with 'active' = false
number of users with 'anonymous' = true
number of users with 'anonymous' = false
with single query.
So far, I have only come up with a solution using UNION:
SELECT count(*) FROM mytable
UNION ALL
SELECT count(*) FROM mytable where active
UNION ALL
SELECT count(*) FROM mytable where anonymous
So I can take the first number and find the non-active and non-anonymous counts by simple subtraction.
Is there any way to get rid of union and calculate number of records matching these simple conditions with some magic and efficient query in PostgreSQL 9?

You can use an aggregate function with a CASE to get the result in separate columns:
select
count(*) TotalUsers,
sum(case when active = 't' then 1 else 0 end) TotalActiveTrue,
sum(case when active = 'f' then 1 else 0 end) TotalActiveFalse,
sum(case when anonymous = 't' then 1 else 0 end) TotalAnonTrue,
sum(case when anonymous = 'f' then 1 else 0 end) TotalAnonFalse
from mytable;
See SQL Fiddle with Demo
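To check the conditional aggregation above against the question's three sample rows, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for PostgreSQL here, and it stores booleans as 0/1, so the comparisons use `= 1` and `= 0` instead of `'t'` and `'f'`. The snake_case column aliases are my own.

```python
import sqlite3

# In-memory copy of the question's table (assumption: booleans stored as 0/1).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (userid INTEGER, active INTEGER, anonymous INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 1, 0), (2, 0, 0), (3, 0, 1)],
)

# One scan of the table; each SUM(CASE ...) counts the rows matching one condition.
user_counts = conn.execute(
    """
    SELECT count(*)                                       AS total_users,
           sum(CASE WHEN active = 1 THEN 1 ELSE 0 END)    AS total_active_true,
           sum(CASE WHEN active = 0 THEN 1 ELSE 0 END)    AS total_active_false,
           sum(CASE WHEN anonymous = 1 THEN 1 ELSE 0 END) AS total_anon_true,
           sum(CASE WHEN anonymous = 0 THEN 1 ELSE 0 END) AS total_anon_false
    FROM mytable
    """
).fetchone()

print(user_counts)  # (3, 1, 2, 1, 2)
```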

Assuming your columns are boolean NOT NULL, this should be a bit faster:
SELECT total_ct
,active_ct
,(total_ct - active_ct) AS not_active_ct
,anon_ct
,(total_ct - anon_ct) AS not_anon_ct
FROM (
SELECT count(*) AS total_ct
,count(active OR NULL) AS active_ct
,count(anonymous OR NULL) AS anon_ct
FROM tbl
) sub;
Find a detailed explanation for the techniques used in this closely related answer:
Compute percents from SUM() in the same SELECT sql query
Indexes are hardly going to be of any use, since the whole table has to be read anyway. A covering index might be of help if your rows are bigger than in the example. Depends on the specifics of your actual table.
-> SQLfiddle comparing to #bluefeet's version with CASE statements for each value.
SQL server folks are not used to the proper boolean type of Postgres and tend to go the long way round.
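The `count(expr OR NULL)` trick relies on three-valued logic: `true OR NULL` is true, `false OR NULL` is NULL, and `count()` skips NULLs. A rough way to see it in action is the sketch below, which runs the same query shape through Python's sqlite3 module. This is an assumption-laden stand-in: SQLite has no real boolean type, so the columns are 0/1 integers, and it says nothing about the relative speed on PostgreSQL.

```python
import sqlite3

# Stand-in for the Postgres table; 0/1 integers play the role of boolean NOT NULL columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (userid INTEGER, active INTEGER NOT NULL, anonymous INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [(1, 1, 0), (2, 0, 0), (3, 0, 1)])

# Inner query counts only the "true" rows; the outer query derives the "false"
# counts by subtraction, which is valid because the columns cannot be NULL.
sub_counts = conn.execute(
    """
    SELECT total_ct,
           active_ct,
           (total_ct - active_ct) AS not_active_ct,
           anon_ct,
           (total_ct - anon_ct)   AS not_anon_ct
    FROM (
        SELECT count(*)                  AS total_ct,
               count(active OR NULL)     AS active_ct,
               count(anonymous OR NULL)  AS anon_ct
        FROM tbl
    ) sub
    """
).fetchone()

print(sub_counts)  # (3, 1, 2, 1, 2)
```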

Related

Selecting a group with or without certain conditions across many rows in SQL

I have data like this:
ID SomeVar
123 0
123 1
123 2
234 1
234 2
234 3
456 3
567 0
567 1
I'm trying to group by my ID to return all of the IDs that do not have a record with the value 0. That is, my selection would look like this:
ID
234
456
Is there an easy way to do this without creating a subset table with all records not containing 0 then joining it back to the full data set where the tables don't match?
I generally try to avoid subqueries, but you could use one for this case. Do the same group by, and check that the id isn't in a subquery of ids that have 0 for SomeVar. In this case, distinct will do the same and more efficiently, so I'll do that first:
SELECT DISTINCT ID
FROM [table_name]
WHERE ID NOT IN (
SELECT ID FROM [table_name] WHERE SomeVar = 0
);
And if you want to get other information by using a GROUP BY:
SELECT ID, max(SomeVar), count(*), sum(SomeVar)
FROM [table_name]
WHERE ID NOT IN (
SELECT ID FROM [table_name] WHERE SomeVar = 0
)
GROUP BY ID;
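A quick check of both NOT IN queries against the question's sample data, sketched with Python's sqlite3 module. The table name `records` is hypothetical (the answer only writes `[table_name]`), and ORDER BY is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "records" is a placeholder name for the question's table.
conn.execute("CREATE TABLE records (ID INTEGER, SomeVar INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(123, 0), (123, 1), (123, 2), (234, 1), (234, 2), (234, 3),
     (456, 3), (567, 0), (567, 1)],
)

# IDs that never appear with SomeVar = 0.
ids_without_zero = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT ID
        FROM records
        WHERE ID NOT IN (SELECT ID FROM records WHERE SomeVar = 0)
        ORDER BY ID
        """
    )
]

# Same filter, but with per-ID aggregates via GROUP BY.
per_id_stats = conn.execute(
    """
    SELECT ID, max(SomeVar), count(*), sum(SomeVar)
    FROM records
    WHERE ID NOT IN (SELECT ID FROM records WHERE SomeVar = 0)
    GROUP BY ID
    ORDER BY ID
    """
).fetchall()

print(ids_without_zero)  # [234, 456]
print(per_id_stats)      # [(234, 3, 3, 6), (456, 3, 1, 3)]
```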
You can use aggregation and having:
select id
from t
group by id
having min(somevar) > 0;
This assumes that somevar is never negative. If that is a possibility, then you can use the slightly more verbose:
select id
from t
group by id
having sum(case when somevar = 0 then 1 else 0 end) = 0;
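The difference between the two HAVING variants only shows up when negative values exist. The sketch below (Python's sqlite3 module as a stand-in database) uses the question's data plus one extra ID, 789, with values -1 and 2. That extra ID is not in the original question; it is added purely to illustrate the caveat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, somevar INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(123, 0), (123, 1), (123, 2), (234, 1), (234, 2), (234, 3),
     (456, 3), (567, 0), (567, 1),
     (789, -1), (789, 2)],  # hypothetical rows: no zero, but one negative value
)

# Variant 1: correct only if somevar is never negative.
min_version = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM t GROUP BY id HAVING min(somevar) > 0 ORDER BY id"
    )
]

# Variant 2: counts the zero rows explicitly, so negatives do not matter.
sum_version = [
    r[0]
    for r in conn.execute(
        """
        SELECT id FROM t
        GROUP BY id
        HAVING sum(CASE WHEN somevar = 0 THEN 1 ELSE 0 END) = 0
        ORDER BY id
        """
    )
]

print(min_version)  # [234, 456]       (789 is wrongly dropped)
print(sum_version)  # [234, 456, 789]
```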
Use a CASE expression with count or sum aggregation, and filter on the count using HAVING:
select ID
from
(
select ID, count(case when SomeVar=0 then 1 end) cnt
from mytable
group by ID having count(case when SomeVar=0 then 1 end) = 0
) s
;

Multiple Selects from Subquery

I have multiple queries that look like this:
select count(*) from (
SELECT * FROM TABLE1 t
JOIN TABLE2 e
USING (EVENT_ID)
) s1
WHERE
s1.SOURCE_ID = 1;
where the only difference is that s1.SOURCE_ID = (some other number). I would like to turn these into a single query that just selects from the subquery using a different SOURCE_ID for each column in the result, like this:
+----------------+----------------+----------------+
| source_1_count | source_2_count | source_3_count | ... so on
+----------------+----------------+----------------+
I am trying to avoid using the multiple queries as the join is on a very large table and takes some time, so I would rather do it once and query the result multiple times.
This is on a Snowflake data warehouse which I think uses something similar to PostgreSQL (also I'm fairly new to SQL so feel free to suggest a completely different solution as well).
Use conditional aggregation
SELECT sum(case when SOURCE_ID = 1 then 1 else 0 end) source_1_count,
sum(case when SOURCE_ID = 2 then 1 else 0 end) source_2_count...
FROM TABLE1 t
JOIN TABLE2 e
USING (EVENT_ID)
You would put the results in separate rows, using group by:
SELECT SOURCE_ID, COUNT(*)
FROM TABLE1 t JOIN
TABLE2 e
USING (EVENT_ID)
GROUP BY SOURCE_ID;
Putting the separate sources in columns is troublesome, unless you know the exact list of sources that you want in the result set.
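If the list of sources is not known up front, one option is to keep the GROUP BY query and turn its rows into named columns on the client side. The sketch below does that with Python's sqlite3 module. The table contents are made up, and so is the assumption that SOURCE_ID lives in TABLE1; the question does not say which table holds it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas and data; only EVENT_ID and SOURCE_ID come from the question.
conn.execute("CREATE TABLE TABLE1 (EVENT_ID INTEGER, SOURCE_ID INTEGER)")
conn.execute("CREATE TABLE TABLE2 (EVENT_ID INTEGER, PAYLOAD TEXT)")
conn.executemany(
    "INSERT INTO TABLE1 VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2), (13, 3), (14, 3), (15, 3)],
)
conn.executemany(
    "INSERT INTO TABLE2 VALUES (?, ?)",
    [(10, "a"), (11, "b"), (12, "c"), (13, "d"), (14, "e")],  # event 15 has no match
)

# The join runs once; each source comes back as its own row.
rows = conn.execute(
    """
    SELECT SOURCE_ID, COUNT(*)
    FROM TABLE1 t JOIN TABLE2 e USING (EVENT_ID)
    GROUP BY SOURCE_ID
    ORDER BY SOURCE_ID
    """
).fetchall()

# Pivot rows into "columns" in application code, whatever sources turn up.
source_counts = {f"source_{source_id}_count": n for source_id, n in rows}

print(source_counts)  # {'source_1_count': 2, 'source_2_count': 1, 'source_3_count': 2}
```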
EDIT:
If you know the exact list of sources, you can use conditional aggregation or pivot:
SELECT SUM(CASE WHEN SOURCE_ID = 1 THEN 1 ELSE 0 END) as source_id_1,
SUM(CASE WHEN SOURCE_ID = 2 THEN 1 ELSE 0 END) as source_id_2,
SUM(CASE WHEN SOURCE_ID = 3 THEN 1 ELSE 0 END) as source_id_3
FROM TABLE1 t JOIN
TABLE2 e
USING (EVENT_ID);
All the comments so far ignore the fact that you won't have the possible benefits of pruning the data during the scan, as there are no WHERE predicates. Join can also be slower than it needs to be because of that.
This is a possible improvement:
SELECT SUM(CASE WHEN SOURCE_ID = 1 THEN 1 ELSE 0 END) as source_id_1,
SUM(CASE WHEN SOURCE_ID = 2 THEN 1 ELSE 0 END) as source_id_2,
SUM(CASE WHEN SOURCE_ID = 3 THEN 1 ELSE 0 END) as source_id_3
FROM TABLE1 t JOIN
TABLE2 e
USING (EVENT_ID)
WHERE SOURCE_ID IN (1, 2, 3);
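The WHERE clause does not change the result of the conditional aggregation; it only gives the engine a chance to skip rows for other sources. A small sketch with Python's sqlite3 module shows that both forms return the same counts. The data is invented, SOURCE_ID is assumed to live in TABLE1, and SQLite will not reproduce Snowflake's pruning behaviour, so this checks correctness only, not speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data; source 4 exists only to show that the filter excludes it.
conn.execute("CREATE TABLE TABLE1 (EVENT_ID INTEGER, SOURCE_ID INTEGER)")
conn.execute("CREATE TABLE TABLE2 (EVENT_ID INTEGER)")
conn.executemany(
    "INSERT INTO TABLE1 VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2), (13, 3), (14, 3), (16, 4)],
)
conn.executemany("INSERT INTO TABLE2 VALUES (?)", [(10,), (11,), (12,), (13,), (14,), (16,)])

pivot_sql = """
    SELECT SUM(CASE WHEN SOURCE_ID = 1 THEN 1 ELSE 0 END) AS source_id_1,
           SUM(CASE WHEN SOURCE_ID = 2 THEN 1 ELSE 0 END) AS source_id_2,
           SUM(CASE WHEN SOURCE_ID = 3 THEN 1 ELSE 0 END) AS source_id_3
    FROM TABLE1 t JOIN TABLE2 e USING (EVENT_ID)
"""

# Same aggregation, with and without the row filter.
unfiltered = conn.execute(pivot_sql).fetchone()
filtered = conn.execute(pivot_sql + " WHERE SOURCE_ID IN (1, 2, 3)").fetchone()

print(unfiltered)  # (2, 1, 2)
print(filtered)    # (2, 1, 2)
```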

Impala SQL, return value if a string exists within a subset of values

I have a table where the id field (not a primary key) contains either 1 or null. Over the past several years, any given part could have been entered multiple times with one, or both of these possible options.
I'm trying to write a statement that will return some value if there is ever a 1 associated with the select statement. There are lots of semi-duplicate rows, some with 1 and some with null, but if there is ever a 1, I want to return true, and if there are only null values, I want to return false. I'm not sure how to code this though.
If this is the result of my SELECT part, id FROM table WHERE part = "ABC1234" statement:
part id
ABC1234 1
ABC1234 null
ABC1234 null
ABC1234 null
ABC1234 1
I want to write a statement that returns true, because 1 exists in at least one of these rows.
The closest I've come to this is by using a CASE statement, but I'm not quite there yet:
SELECT
a1.part part,
CASE WHEN a2.id is not null
THEN
'true'
ELSE
'false'
END AS id
from table.parts a1, table.ids a2 where a1.part = "ABC1234" and a1.key = a2.key;
I also tried the following case:
CASE WHEN exists
(SELECT id from table.ids where id = 1)
THEN
but I got the error subqueries are not supported in the select list
For the above SELECT statement, how do I return 1 single line that reads:
part id
ABC1234 true
You can use conditional aggregation to check if a part has at least one row with id=1.
SELECT part,'True' id
from parts
group by part
having count(case when id = 1 then 1 end) >= 1
To return false when the ids are all null, use:
select part, case when id_true>=1 then 'True'
when id_false>=1 and id_true=0 then 'False' end id
from (
SELECT part,
count(case when id = 1 then 1 end) id_true,
count(case when id is null then 1 end) id_false
from parts
group by part) t
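The second query can be tried out on the question's five rows with the sketch below, which uses Python's sqlite3 module rather than Impala. The part `XYZ0001` with only NULL ids is a made-up addition so that the 'False' branch is exercised too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [("ABC1234", 1), ("ABC1234", None), ("ABC1234", None),
     ("ABC1234", None), ("ABC1234", 1),
     ("XYZ0001", None), ("XYZ0001", None)],  # hypothetical part with NULL ids only
)

# Inner query counts the 1s and the NULLs per part; outer query maps them to a label.
part_flags = conn.execute(
    """
    SELECT part,
           CASE WHEN id_true >= 1 THEN 'True'
                WHEN id_false >= 1 AND id_true = 0 THEN 'False'
           END AS id
    FROM (
        SELECT part,
               count(CASE WHEN id = 1 THEN 1 END)     AS id_true,
               count(CASE WHEN id IS NULL THEN 1 END) AS id_false
        FROM parts
        GROUP BY part
    ) t
    ORDER BY part
    """
).fetchall()

print(part_flags)  # [('ABC1234', 'True'), ('XYZ0001', 'False')]
```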

select result set row to columns transformation

I've a table remarks with columns id, story_id, and like; like can be +1 or -1.
I want my select query to return the following columns story_id, total, n_like, n_dislike where total = n_like + n_dislike without sub queries.
I am currently doing a group by on like and selecting like as like_t, count(like) as total which is giving me an output like
-- like_t --+ --- total --
-1 | 2
1 | 6
and returning two rows in result set. But what I want is to get 1 row where n_like is 6 and n_dislike is 2 and total is 8
First, LIKE is a reserved word in PostgreSQL, so you have to double-quote it. Maybe a better name should be picked for this column.
CREATE TABLE testbed (id int4, story_id int4, "like" int2);
INSERT INTO testbed VALUES
(1,1,'+1'),(1,1,'+1'),(1,1,'+1'),
(1,1,'+1'),(1,1,'+1'),(1,1,'+1'),
(1,1,'-1'),(1,1,'-1');
SELECT
story_id,
sum(CASE WHEN "like" > 0 THEN abs("like") ELSE 0 END) AS n_like,
sum(CASE WHEN "like" < 0 THEN abs("like") ELSE 0 END) AS n_dislike,
count(story_id) AS total
-- for cases +2 / -3 in the "like" field, use following construct instead
-- sum(abs("like")) AS total
FROM testbed
GROUP BY story_id;
I used abs("like") for cases when you'll have +2 or -3 in your "like" column.
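As a sanity check, the query above can be run against the same eight rows (six likes, two dislikes) with Python's sqlite3 module standing in for PostgreSQL. SQLite also accepts the double-quoted "like" identifier. The distinct id values 1 to 8 are my own choice; the original INSERT reuses id 1 for every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE testbed (id INTEGER, story_id INTEGER, "like" INTEGER)')
conn.executemany(
    "INSERT INTO testbed VALUES (?, ?, ?)",
    [(i, 1, 1) for i in range(1, 7)] + [(7, 1, -1), (8, 1, -1)],
)

# One row per story: likes and dislikes are summed separately, total counts all rows.
story_totals = conn.execute(
    """
    SELECT story_id,
           sum(CASE WHEN "like" > 0 THEN abs("like") ELSE 0 END) AS n_like,
           sum(CASE WHEN "like" < 0 THEN abs("like") ELSE 0 END) AS n_dislike,
           count(story_id) AS total
    FROM testbed
    GROUP BY story_id
    """
).fetchall()

print(story_totals)  # [(1, 6, 2, 8)]
```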

How do I modify this query without increasing the number of rows returned?

I've got a sub-select in a query that looks something like this:
left outer join
(select distinct ID from OTHER_TABLE) as MYJOIN
on BASE_OBJECT.ID = MYJOIN.ID
It's pretty straightforward. Checks to see if a certain relation exists between the main object being queried for and the object represented by OTHER_TABLE by whether or not MYJOIN.ID is null on the row in question.
But now the requirements have changed a little. There's another column in OTHER_TABLE that can have a value of 1 or 0, and the query needs to know whether a relation exists for the primary object with a 1-value, and also whether one exists with a 0-value. The obvious solution is to put:
left outer join
(select distinct ID, TYPE_VALUE from OTHER_TABLE) as MYJOIN
on BASE_OBJECT.ID = MYJOIN.ID
But that would be wrong because if 0-type and 1-type objects both exist for the same ID, it will increase the number of rows returned by the query, which isn't acceptable. So what I need is some sort of subselect that will return 1 row for each distinct ID, with a "1-type exists" column and a "0-type exists" column. And I have no idea how to code that in SQL.
For example, for the following table,
ID | TYPE_VALUE
_________________
1 | 1
3 | 0
3 | 1
4 | 0
I'd like to see a result set like this:
ID | HAS_TYPE_0 | HAS_TYPE_1
______________________________
1 | 0 | 1
3 | 1 | 1
4 | 1 | 0
Anyone know how I could set up a query to do this? Hopefully with a minimum of ugly hacks?
In the general case, you would use EXISTS:
SELECT DISTINCT ID,
CASE WHEN EXISTS (
SELECT * FROM Table1 y
WHERE y.TYPE_VALUE = 0 AND ID = x.ID)
THEN 1
ELSE 0 END AS HAS_TYPE_0,
CASE WHEN EXISTS (
SELECT * FROM Table1 y
WHERE y.TYPE_VALUE = 1 AND ID = x.ID)
THEN 1
ELSE 0 END AS HAS_TYPE_1
FROM Table1 x;
If you have a very large number of elements in the table, this won't perform so great - those nested subselects are often a kiss of death when it comes to performance.
For your specific case, you could also use GROUP BY and MAX() and MIN() to speed things up:
SELECT
ID,
CASE WHEN MIN(TYPE_VALUE) = 0 THEN 1 ELSE 0 END AS HAS_TYPE_0,
CASE WHEN MAX(TYPE_VALUE) = 1 THEN 1 ELSE 0 END AS HAS_TYPE_1
FROM Table1
GROUP BY ID;
Instead of select distinct ID, TYPE_VALUE from OTHER_TABLE
use
select ID,
MAX(CASE WHEN TYPE_VALUE = 0 THEN 1 ELSE 0 END) as has_type_0,
MAX(CASE WHEN TYPE_VALUE = 1 THEN 1 ELSE 0 END) as has_type_1
from OTHER_TABLE
GROUP BY ID;
You can do the same using the PIVOT operator...
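To see that a grouped subselect keeps the row count of the outer query unchanged, here is a sketch using Python's sqlite3 module with the question's sample OTHER_TABLE. The BASE_OBJECT rows (IDs 1 to 4, where 2 has no related rows) are invented, and the COALESCE calls are my addition so that unmatched base rows show 0 instead of NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BASE_OBJECT (ID INTEGER)")
conn.execute("CREATE TABLE OTHER_TABLE (ID INTEGER, TYPE_VALUE INTEGER)")
conn.executemany("INSERT INTO BASE_OBJECT VALUES (?)", [(1,), (2,), (3,), (4,)])  # hypothetical
conn.executemany(
    "INSERT INTO OTHER_TABLE VALUES (?, ?)",
    [(1, 1), (3, 0), (3, 1), (4, 0)],
)

# The subselect yields exactly one row per ID, so the left join cannot multiply rows.
flags = conn.execute(
    """
    SELECT BASE_OBJECT.ID,
           COALESCE(MYJOIN.HAS_TYPE_0, 0) AS HAS_TYPE_0,
           COALESCE(MYJOIN.HAS_TYPE_1, 0) AS HAS_TYPE_1
    FROM BASE_OBJECT
    LEFT OUTER JOIN (
        SELECT ID,
               MAX(CASE WHEN TYPE_VALUE = 0 THEN 1 ELSE 0 END) AS HAS_TYPE_0,
               MAX(CASE WHEN TYPE_VALUE = 1 THEN 1 ELSE 0 END) AS HAS_TYPE_1
        FROM OTHER_TABLE
        GROUP BY ID
    ) AS MYJOIN
      ON BASE_OBJECT.ID = MYJOIN.ID
    ORDER BY BASE_OBJECT.ID
    """
).fetchall()

print(flags)  # [(1, 0, 1), (2, 0, 0), (3, 1, 1), (4, 1, 0)]
```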