SQL Select random rows partitioned by a column

I have a dataset that looks like this:
+---------+----+
| Country | id |
+---------+----+
| a       | 5  |
| a       | 1  |
| a       | 2  |
| b       | 1  |
| b       | 5  |
| b       | 4  |
| b       | 7  |
| c       | 5  |
| c       | 1  |
| c       | 2  |
+---------+----+
and I need a query that returns two random rows per country, where Country in ('a', 'c'):
+---------+----+
| Country | id |
+---------+----+
| a       | 2  |  -- two random rows from Country = 'a'
| a       | 1  |
| c       | 1  |
| c       | 5  |  -- two random rows from Country = 'c'
+---------+----+

This should work:
select Country, id
from (
    select Country,
           id,
           row_number() over (partition by Country order by rand()) as rn
    from table_name
) t
where Country in ('a', 'c') and rn <= 2
Replace rand() with random() if you're using Postgres, or with newid() if you're using SQL Server.
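For instance, a minimal Postgres sketch of the same approach (table_name as in the answer above; only the ordering function changes):
select Country, id
from (
    select Country,
           id,
           -- random() is the Postgres equivalent of rand()
           row_number() over (partition by Country order by random()) as rn
    from table_name
    where Country in ('a', 'c')  -- filtering inside the subquery also works here,
                                 -- since the filter column and the partition column match
) t
where rn <= 2;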

Related

SQL return only rows where value exists multiple times and other value is present

I have a table like this in MS SQL Server:
+------+------+
| ID | Cust |
+------+------+
| 1 | A |
| 1 | A |
| 1 | B |
| 1 | B |
| 2 | A |
| 2 | A |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 3 | B |
| 3 | C |
| 3 | C |
+------+------+
I don't know the values in column "Cust" and I want to return all rows where the value of "Cust" appears multiple times and where at least one of the "ID" values is "1".
Like this:
+------+------+
| ID | Cust |
+------+------+
| 1 | A |
| 1 | A |
| 1 | B |
| 1 | B |
| 2 | A |
| 2 | A |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 3 | B |
+------+------+
Any ideas? I can't figure it out.
You may use the COUNT window function as follows:
SELECT ID, Cust
FROM
(
    SELECT ID, Cust,
           COUNT(*) OVER (PARTITION BY Cust) AS cn,
           COUNT(CASE WHEN ID = 1 THEN 1 END) OVER (PARTITION BY Cust) AS cn2
    FROM table_name
) T
WHERE cn > 1 AND cn2 > 0
ORDER BY ID, Cust
COUNT(*) OVER (PARTITION BY Cust) checks whether the value of "Cust" appears multiple times.
COUNT(CASE WHEN ID = 1 THEN 1 END) OVER (PARTITION BY Cust) checks that at least one of the "ID" values for that "Cust" is "1".
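To try it yourself, here is a minimal setup using the question's data (the table name table_name matches the query above; column types are assumed):
CREATE TABLE table_name (ID int, Cust varchar(10));

INSERT INTO table_name (ID, Cust) VALUES
(1, 'A'), (1, 'A'), (1, 'B'), (1, 'B'),
(2, 'A'), (2, 'A'), (2, 'A'), (2, 'B'),
(3, 'A'), (3, 'B'), (3, 'B'), (3, 'C'), (3, 'C');
Against this data the query returns every "A" and "B" row and no "C" rows, since no Cust = 'C' row has ID = 1.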

SQL: Get row number which increases every time a value changes

I have the following table in Vertica:
+----------+----------+----------+
| column_1 | column_2 | column_3 |
+----------+----------+----------+
| a | 1 | 1 |
| a | 2 | 1 |
| a | 3 | 1 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 1 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 2 |
| c | 3 | 2 |
+----------+----------+----------+
The table is ordered by column_1 and column_3.
I would like to add a row number, which increases every time when column_1 or column_3 change their value. It would look something like this:
+----------+----------+----------+------------+
| column_1 | column_2 | column_3 | row_number |
+----------+----------+----------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 1 | 1 | 2 |
| b | 2 | 1 | 2 |
| b | 3 | 1 | 2 |
| c | 1 | 1 | 3 |
| c | 2 | 1 | 3 |
| c | 3 | 1 | 3 |
| c | 1 | 2 | 4 |
| c | 2 | 2 | 4 |
| c | 3 | 2 | 4 |
+----------+----------+----------+------------+
I tried using over (partition by ...), but I can't find the right syntax.
Vertica has the CONDITIONAL_CHANGE_EVENT() analytic function.
It starts at 0 and increments by 1 every time the expression passed as its first argument changes value.
Like so:
WITH indata(column_1, column_2, column_3, rn) AS (
              SELECT 'a',1,1,1
    UNION ALL SELECT 'a',2,1,1
    UNION ALL SELECT 'a',3,1,1
    UNION ALL SELECT 'b',1,1,2
    UNION ALL SELECT 'b',2,1,2
    UNION ALL SELECT 'b',3,1,2
    UNION ALL SELECT 'c',1,1,3
    UNION ALL SELECT 'c',2,1,3
    UNION ALL SELECT 'c',3,1,3
    UNION ALL SELECT 'c',1,2,4
    UNION ALL SELECT 'c',2,2,4
    UNION ALL SELECT 'c',3,2,4
)
SELECT
    *,
    CONDITIONAL_CHANGE_EVENT(
        column_1 || column_3::VARCHAR
    ) OVER w + 1 AS rownum
FROM indata
WINDOW w AS (ORDER BY column_1, column_3, column_2);
-- out column_1 | column_2 | column_3 | rn | rownum
-- out ----------+----------+----------+----+--------
-- out a | 1 | 1 | 1 | 1
-- out a | 2 | 1 | 1 | 1
-- out a | 3 | 1 | 1 | 1
-- out b | 1 | 1 | 2 | 2
-- out b | 2 | 1 | 2 | 2
-- out b | 3 | 1 | 2 | 2
-- out c | 1 | 1 | 3 | 3
-- out c | 2 | 1 | 3 | 3
-- out c | 3 | 1 | 3 | 3
-- out c | 1 | 2 | 4 | 4
-- out c | 2 | 2 | 4 | 4
-- out c | 3 | 2 | 4 | 4
In the absence of an ORDER BY, SQL data sets are unordered. To establish the order in your example, therefore, I've assumed the dataset can be sorted with ORDER BY column_1, column_3, column_2.
If that assumption doesn't hold, you MUST add additional columns by which the data can be deterministically sorted.
That gives the following query...
SELECT
    yourTable.*,
    DENSE_RANK() OVER (ORDER BY column_1, column_3) AS row_number
FROM yourTable
ORDER BY column_1, column_3, column_2
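As a sanity check, here is a sketch that reuses the indata CTE from the previous answer (its rn column is the expected number from the question); DENSE_RANK reproduces rn on every row:
WITH indata(column_1, column_2, column_3, rn) AS (
              SELECT 'a',1,1,1
    UNION ALL SELECT 'a',2,1,1
    UNION ALL SELECT 'a',3,1,1
    UNION ALL SELECT 'b',1,1,2
    UNION ALL SELECT 'b',2,1,2
    UNION ALL SELECT 'b',3,1,2
    UNION ALL SELECT 'c',1,1,3
    UNION ALL SELECT 'c',2,1,3
    UNION ALL SELECT 'c',3,1,3
    UNION ALL SELECT 'c',1,2,4
    UNION ALL SELECT 'c',2,2,4
    UNION ALL SELECT 'c',3,2,4
)
SELECT
    indata.*,
    DENSE_RANK() OVER (ORDER BY column_1, column_3) AS row_number
FROM indata
ORDER BY column_1, column_3, column_2;
-- row_number equals rn on all 12 rows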
This would also work and doesn't require sorting the whole table:
Find the distinct (column_1, column_3) pairs and give each pair a new index.
Merge that back onto the original table on column_1 and column_3.
select t1.*, t2.row_number
from your_table t1
join (
    select column_1, column_3,
           row_number() over (order by column_1, column_3) as row_number
    from (select distinct column_1, column_3 from your_table) foo
) t2
on t1.column_1 = t2.column_1 and t1.column_3 = t2.column_3;

Count NULL values by column in SQL

Suppose I have the following table:
| a | b | c |
|:-----|:----|:-----|
| 1 | a | NULL |
| NULL | b | NULL |
| 3 | c | NULL |
| 4 | d | 23 |
| NULL | e | 231 |
How can I count the number of NULL values by each column?
My final result would be:
| column_name | n_nulls |
|:---------------|:----------|
| a | 2 |
| b | 0 |
| c | 3 |
You can use union all; since count(col) skips NULLs, count(*) - count(col) is exactly the number of NULLs in each column:
select 'a' as column_name, count(*) - count(a) as n_nulls from t
union all
select 'b', count(*) - count(b) as n_nulls from t
union all
select 'c', count(*) - count(c) as n_nulls from t;
Redshift is a column-store database, so there probably is not a more efficient method.
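If a single wide row is acceptable instead of one row per column, a variant sketch that reads the table in one pass (same table t and columns as above):
select count(*) - count(a) as a_nulls,
       count(*) - count(b) as b_nulls,
       count(*) - count(c) as c_nulls
from t;
-- one row: a_nulls = 2, b_nulls = 0, c_nulls = 3 for the sample data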

Hive - over (partition by ...) with a column not in group by

Is it possible to do something like:
select
avg(count(distinct user_id))
over (partition by some_date) as average_users_per_day
from user_activity
group by user_type
(notably, the partition by column, some_date, is not in the group by columns)
The idea I'm going for is something like: the average users per day by user type.
I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.
Notes:
From reading this answer, my understanding (correct me if I'm wrong) is that the following query:
select
avg(count(distinct a)) over (partition by b)
from foo
group by b
can be expanded equivalently to:
select
    avg(count_distinct_a)
from (
    select
        b,
        count(distinct a) as count_distinct_a
    from foo
    group by b
) t
group by b
And from that, I can tweak it a bit to achieve what I want:
select
    user_type,
    avg(count_distinct_user_id) as average_users_per_day
from (
    select
        user_type,
        count(distinct user_id) as count_distinct_user_id
    from user_activity
    group by user_type, some_date
) t
group by user_type
(notably, the inner group by user_type, some_date differs from the outer group by user_type)
I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.
EDIT: Some sample data and desired output.
Source table:
+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1 | a | 1 |
| 1 | a | 2 |
| 2 | a | 1 |
| 3 | a | 2 |
| 3 | a | 2 |
| 4 | b | 2 |
| 5 | b | 1 |
| 5 | b | 3 |
| 5 | b | 3 |
| 6 | c | 1 |
| 7 | c | 1 |
| 8 | c | 4 |
| 9 | c | 2 |
| 9 | c | 3 |
| 9 | c | 4 |
+---------+-----------+-----------+
Sample intermediate table (for reasoning with):
+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a | 1 | 2 |
| a | 2 | 2 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 4 | 2 |
+-----------+-----------+---------------------+
The SQL for it is: select user_type, some_date, count(distinct user_id) from user_activity group by user_type, some_date.
Desired result:
+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a | 2 |
| b | 1 |
| c | 1.5 |
+-----------+---------------------+
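As a cross-check, those averages follow directly from the intermediate table: a = (2 + 2) / 2 = 2, b = (1 + 1 + 1) / 3 = 1, and c = (2 + 1 + 1 + 2) / 4 = 1.5.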

group by top two results based on order

I have been trying to get this to work with some row_number, group by, top, sort of things, but I am missing some fundamental concept. I have a table like so:
+-------+-------+-------+
| name | ord | f_id |
+-------+-------+-------+
| a | 1 | 2 |
| b | 5 | 2 |
| c | 6 | 2 |
| d | 2 | 1 |
| e | 4 | 1 |
| a | 2 | 3 |
| c | 50 | 4 |
+-------+-------+-------+
And my desired output would be:
+-------+---------+--------+-------+
| f_id | ord_n | ord | name |
+-------+---------+--------+-------+
| 2 | 1 | 1 | a |
| 2 | 2 | 5 | b |
| 1 | 1 | 2 | d |
| 1 | 2 | 4 | e |
| 3 | 1 | 2 | a |
| 4 | 1 | 50 | c |
+-------+---------+--------+-------+
Where data is ordered by the ord value, with at most two results per f_id. Should I be working on a stored procedure for this, or can I just do it with SQL? I have experimented with some select TOP subqueries, but nothing has even come close.
Here are some statements to create the test table:
create table help(name varchar(255),ord tinyint,f_id tinyint);
insert into help values
('a',1,2),
('b',5,2),
('c',6,2),
('d',2,1),
('e',4,1),
('a',2,3),
('c',50,4);
You may use the RANK (or DENSE_RANK) window function to do the filtering, with ROW_NUMBER supplying the per-group index:
select A.name, A.ord_n, A.ord, A.f_id
from (
    select
        RANK() OVER (partition by f_id ORDER BY ord asc) AS "Rank",
        ROW_NUMBER() OVER (partition by f_id ORDER BY ord asc) AS "ord_n",
        help.*
    from help
) A
where A.rank <= 2
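One caveat, as an aside: if two rows tie on ord within the same f_id, RANK assigns them the same value, so A.rank <= 2 can let more than two rows through. Filtering on the ROW_NUMBER instead guarantees at most two per f_id (a sketch against the same help table):
select A.name, A.ord_n, A.ord, A.f_id
from (
    select
        ROW_NUMBER() OVER (partition by f_id ORDER BY ord asc) AS ord_n,
        help.*
    from help
) A
where A.ord_n <= 2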