How do I create a frequency distribution? - sql

I'm trying to create a frequency distribution to show how many customers have transacted 1x, 2x, 3x, etc.
I have a table transactions with a column user_id. Each row indicates a transaction, and if a user_id shows up in multiple rows, that user has done multiple transactions.
Now I'd like to get a list that looks something like this:
Tra. | Freq.
0 | 345
1 | 543
2 | 45
3 | 20
4 | 0
5 | 3
Currently I have this, but it just shows a list of users and how many transactions they have had.
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
ORDER BY number_of_transactions DESC;
I did some digging and it was suggested that generate_series might help, but I'm stuck and don't know how to move forward.

Use the first result as input to an outer query where you apply the count again, but this time grouping on number_of_transactions:
SELECT number_of_transactions, COUNT(*) AS freq
FROM (
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
) A
GROUP BY number_of_transactions;
This would transform a result like:
user_id number_of_transactions
----------- ----------------------
1 2
2 1
3 2
4 4
to this:
number_of_transactions freq
---------------------- -----------
1 1
2 2
4 1
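As a rough, runnable sketch of the nested aggregation above, here is the same query run against an in-memory SQLite database from Python. The sample rows are made up to match the example result, and SQLite stands in for whatever database you actually use. The second query is an assumption on my part about how to also show empty buckets (such as the "4 | 0" row in the question): it uses a recursive CTE in place of generate_series. Users with zero transactions cannot be counted from the transactions table alone, so bucket 0 is left out.

```python
import sqlite3

# Sample rows matching the example above: user 1 has 2 transactions,
# user 2 has 1, user 3 has 2, user 4 has 4.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER)")
conn.executemany(
    "INSERT INTO transactions (user_id) VALUES (?)",
    [(1,), (1,), (2,), (3,), (3,), (4,), (4,), (4,), (4,)],
)

# Two-level aggregation: count per user, then count users per count.
FREQ_SQL = """
SELECT number_of_transactions, COUNT(*) AS freq
FROM (
    SELECT user_id, COUNT(user_id) AS number_of_transactions
    FROM transactions
    GROUP BY user_id
) AS a
GROUP BY number_of_transactions
ORDER BY number_of_transactions
"""

# Same idea, but a recursive CTE generates every bucket from 1 up to the
# maximum count, so empty buckets appear with a frequency of 0.
FREQ_WITH_GAPS_SQL = """
WITH RECURSIVE
per_user AS (
    SELECT user_id, COUNT(*) AS tx_count
    FROM transactions
    GROUP BY user_id
),
buckets(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM buckets
    WHERE n < (SELECT MAX(tx_count) FROM per_user)
)
SELECT b.n, COUNT(p.user_id) AS freq
FROM buckets AS b
LEFT JOIN per_user AS p ON p.tx_count = b.n
GROUP BY b.n
ORDER BY b.n
"""

freq = conn.execute(FREQ_SQL).fetchall()
freq_with_gaps = conn.execute(FREQ_WITH_GAPS_SQL).fetchall()
print(freq)
print(freq_with_gaps)
```

On PostgreSQL, generate_series(1, max) could probably play the role of the recursive buckets CTE.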


Count the number of transactions per month for an individual group by date Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another col called visit_date.
There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.
The data looks like below
Id visit_date Cal_month_nbr
---- ------ ------
1 01/01/2020 1
1 01/02/2020 1
1 01/01/2020 1
2 02/01/2020 2
1 02/01/2020 2
1 03/01/2020 3
3 03/01/2020 3
I want to know how many times each customer visits per month, based on their visit_date,
i.e. I want the output below:
id cal_month_nbr visit_per_month
--- --------- ----
1 1 2
1 2 1
1 3 1
2 2 1
3 3 1
and the average visit frequency per month for each id:
id Avg_freq_per_month
---- -------------
1 1.33
2 1
3 1
I tried with below query but it counts each item as one transaction
select avg(count_e) as num_visits_per_month,individual_id
from (
select r.individual_id, cal_month_nbr, count(*) as count_e
from ww_customer_dl_secure.cust_scan r
group by
r.individual_id, cal_month_nbr
order by count_e desc
) as t
group by individual_id
I would appreciate any help, guidance or suggestions
You can divide the total visits by the number of months:
select individual_id,
count(*) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
If you want the average number of days per month, then:
select individual_id,
count(distinct visit_date) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:
select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
from (select distinct individual_id, visit_date, cal_month_nbr
from ww_customer_dl_secure.cust_scan c
) iv
group by individual_id, cal_month_nbr
) ic
group by individual_id;
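To check the multi-level aggregation against the sample data from the question, here is a small sketch that runs it in an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, the table is shortened to cust_scan, and the ROUND call is added just to match the 1.33 shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cust_scan ("
    "individual_id INTEGER, visit_date TEXT, cal_month_nbr INTEGER)"
)
# One row per purchased item, as in the question's sample data.
conn.executemany(
    "INSERT INTO cust_scan VALUES (?, ?, ?)",
    [
        (1, "01/01/2020", 1),
        (1, "01/02/2020", 1),
        (1, "01/01/2020", 1),
        (2, "02/01/2020", 2),
        (1, "02/01/2020", 2),
        (1, "03/01/2020", 3),
        (3, "03/01/2020", 3),
    ],
)

# Visits per customer and month: several items bought on the same day
# count as one visit, hence COUNT(DISTINCT visit_date).
VISITS_SQL = """
SELECT individual_id, cal_month_nbr,
       COUNT(DISTINCT visit_date) AS visit_per_month
FROM cust_scan
GROUP BY individual_id, cal_month_nbr
ORDER BY individual_id, cal_month_nbr
"""

# Average visits per month, without COUNT(DISTINCT): de-duplicate first,
# then count per month, then average per customer.
AVG_SQL = """
SELECT individual_id, ROUND(AVG(num_visit_days), 2) AS avg_freq_per_month
FROM (
    SELECT individual_id, cal_month_nbr, COUNT(*) AS num_visit_days
    FROM (
        SELECT DISTINCT individual_id, visit_date, cal_month_nbr
        FROM cust_scan
    ) AS iv
    GROUP BY individual_id, cal_month_nbr
) AS ic
GROUP BY individual_id
ORDER BY individual_id
"""

visits = conn.execute(VISITS_SQL).fetchall()
averages = conn.execute(AVG_SQL).fetchall()
print(visits)
print(averages)
```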

Identifying Records Where a String Appears More Than Once

I have the following dataset:
ID Medication Dose
1 Aspirin 4
1 Tylenol 7
1 Aspirin 2
1 Ibuprofen 1
2 Aspirin 6
2 Aspirin 2
2 Ibuprofen 6
2 Tylenol 4
3 Tylenol 3
3 Tylenol 7
3 Tylenol 2
I would like to write code that identifies patients who have been administered a medication more than once. So for example, ID 1 had Aspirin twice, ID 2 had Aspirin twice and ID 3 had Tylenol three times.
I could be wrong but I think the easiest way to do this would be to concatenate each ID based on Medication using a code similar to the one below; but I'm not quite sure what to do after that - is it possible to count if a string appears twice within a cell?
SELECT ','+ST1.Medication AS [text()]
Order BY [ID]
), 1, 200000) [Result]
I would like the output to look like the following:
ID MEDICATION Aspirin2x Tylenol2x Ibuprofen2x
1 Aspirin, Tylenol , Aspirin YES NO NO
2 Ibuprofen, Aspirin, Aspirin YES NO NO
3 Tylenol, Tylenol ,Tylenol NO YES NO
For the first part of your question (identify patients that have had a particular medication more than once), you can do this using GROUP BY to group by the ID and medication, and then using COUNT to get how many times each medication was given to each patient. For example:
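SELECT ID, Medication, COUNT(*) AS amount
FROM your_table -- placeholder: the question does not name the table
GROUP BY ID, Medication
This will give you a list of all ID - Medication combinations that appear in the table and a count of how many times each combo appears. To limit these results to combinations that appear at least twice, add a condition on the count using HAVING. SQL Server does not accept the column alias in HAVING, so repeat the aggregate there:
SELECT ID, Medication, COUNT(*) AS amount
FROM your_table -- placeholder: the question does not name the table
GROUP BY ID, Medication
HAVING COUNT(*) >= 2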
The problem now is formatting the results in the way you want. What you will get from the query above is a list of all patient - medication combinations that came up in the table more than once, like this:
ID | Medication | Count
1 | Aspirin | 2
2 | Aspirin | 2
3 | Tylenol | 3
I'd suggest that you try to work with this format if possible because, as you have found, returning multiple values as a comma-delimited list like your Medication column requires some hacks (although SQL Server 2017 and later provide STRING_AGG for proper group concatenation). If you really need the Aspirin2x etc. columns, take a look at the PIVOT operation in SQL Server.
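For what it's worth, here is a sketch of both steps run against the sample data in an in-memory SQLite database from Python. The table name medications is made up, since the question does not give one. The second query uses conditional aggregation (SUM over CASE) as a portable stand-in for PIVOT to build the Aspirin2x-style columns. It is not the PIVOT syntax itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "medications" is a hypothetical table name.
conn.execute("CREATE TABLE medications (ID INTEGER, Medication TEXT, Dose INTEGER)")
conn.executemany(
    "INSERT INTO medications VALUES (?, ?, ?)",
    [
        (1, "Aspirin", 4), (1, "Tylenol", 7), (1, "Aspirin", 2), (1, "Ibuprofen", 1),
        (2, "Aspirin", 6), (2, "Aspirin", 2), (2, "Ibuprofen", 6), (2, "Tylenol", 4),
        (3, "Tylenol", 3), (3, "Tylenol", 7), (3, "Tylenol", 2),
    ],
)

# Patient/medication combinations that occur at least twice.
REPEATED_SQL = """
SELECT ID, Medication, COUNT(*) AS amount
FROM medications
GROUP BY ID, Medication
HAVING COUNT(*) >= 2
ORDER BY ID, Medication
"""

# One row per patient with a YES/NO flag per medication, built with
# conditional aggregation instead of PIVOT.
FLAGS_SQL = """
SELECT ID,
       CASE WHEN SUM(CASE WHEN Medication = 'Aspirin' THEN 1 ELSE 0 END) >= 2
            THEN 'YES' ELSE 'NO' END AS Aspirin2x,
       CASE WHEN SUM(CASE WHEN Medication = 'Tylenol' THEN 1 ELSE 0 END) >= 2
            THEN 'YES' ELSE 'NO' END AS Tylenol2x,
       CASE WHEN SUM(CASE WHEN Medication = 'Ibuprofen' THEN 1 ELSE 0 END) >= 2
            THEN 'YES' ELSE 'NO' END AS Ibuprofen2x
FROM medications
GROUP BY ID
ORDER BY ID
"""

repeated = conn.execute(REPEATED_SQL).fetchall()
flags = conn.execute(FLAGS_SQL).fetchall()
print(repeated)
print(flags)
```

The flag columns are hard-coded per medication, so this only fits when the list of medications is known up front.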

Count first occurring record per time period

In my table trips , I have two columns: created_at and user_id
Each user takes many different trips. My goal is to count, per year-month, the users whose very first trip fell in that month. I understand that in this case the min() function should be applied.
In a previous query, all unique users per year-month were aggregated:
SELECT to_char(created_at, 'YYYY-MM') as yyyymm, COUNT(DISTINCT user_id)
FROM trips
GROUP BY yyyymm
ORDER BY yyyymm;
Where in the above query should min() be integrated? In other words, instead of counting all unique user id's per month, I only need to count the first occurrence of unique user id per month.
The sample input would look like:
> routes
user_id created_at
1 1 2015-08-07 07:18:21
2 2 2015-05-06 20:43:52
3 3 2015-05-06 20:53:54
4 1 2015-03-30 20:09:07
5 2 2015-10-01 18:28:32
6 3 2015-08-07 07:29:29
7 1 2015-08-28 13:45:44
8 2 2015-08-07 07:37:31
9 3 2015-03-30 20:14:04
10 1 2015-08-07 07:08:50
And the output would be:
count Y-m
1 0 2015-01
2 0 2015-02
3 2 2015-03
4 0 2015-04
5 1 2015-05
Because the first occurrences of user_id 1 and 3 were in March and the first occurrence of user_id 2 was in May
You can do this with 2 levels of aggregation. Get the min time per user_id and then count.
SELECT to_char(first_time, 'YYYY-MM'),count(*)
from (
SELECT user_id,MIN(created_at) as first_time
FROM trips
GROUP BY user_id
) t
GROUP BY to_char(first_time, 'YYYY-MM')
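Here is a quick sketch of the two-level aggregation run on the sample rows from the question, using an in-memory SQLite database from Python. SQLite has no to_char, so strftime('%Y-%m', ...) is swapped in for it. Timestamps are stored as ISO-formatted text, which MIN compares correctly. Months with no first trips do not appear in the result; they would need a calendar table or a generated series joined in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (user_id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        (1, "2015-08-07 07:18:21"),
        (2, "2015-05-06 20:43:52"),
        (3, "2015-05-06 20:53:54"),
        (1, "2015-03-30 20:09:07"),
        (2, "2015-10-01 18:28:32"),
        (3, "2015-08-07 07:29:29"),
        (1, "2015-08-28 13:45:44"),
        (2, "2015-08-07 07:37:31"),
        (3, "2015-03-30 20:14:04"),
        (1, "2015-08-07 07:08:50"),
    ],
)

# Inner query: first trip per user. Outer query: users per first-trip month.
FIRST_TRIPS_SQL = """
SELECT strftime('%Y-%m', first_time) AS yyyymm, COUNT(*) AS first_trips
FROM (
    SELECT user_id, MIN(created_at) AS first_time
    FROM trips
    GROUP BY user_id
) AS t
GROUP BY yyyymm
ORDER BY yyyymm
"""

first_trips = conn.execute(FIRST_TRIPS_SQL).fetchall()
print(first_trips)
```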

Oracle SQL Help Data Totals

I am on Oracle 12c and need help with a simple query.
Here is the sample data of what I currently have:
Table Name: customer
Table DDL
create table customer(
customer_id varchar2(50),
name varchar2(50),
activation_dt date,
space_occupied number(38) -- Oracle NUMBER precision is limited to 38
);
Sample Table Data:
customer_id name activation_dt space_occupied
abc abc-001 2016-09-12 20
xyz xyz-001 2016-09-12 10
Sample Data Output
The query I am looking for will provide the following:
customer_id name activation_dt space_occupied
abc abc-001 2016-09-12 20
xyz xyz-001 2016-09-12 10
Total_Space null null 30
Here is a slightly hack-y approach to this, using the grouping function ROLLUP().
select coalesce(customer_id, 'Total Space') as customer_id
, name
, activation_dt
, sum(space_occupied) as space_occupied
from customer
group by ROLLUP(customer_id, name, activation_dt)
having grouping(customer_id) = 1
or (grouping(name) + grouping(customer_id)+ grouping(activation_dt)) = 0;
customer_id name activation_dt space_occupied
------------ ------------ --------- --------------
abc abc-001 12-SEP-16 20
xyz xyz-001 12-SEP-16 10
Total Space 30
ROLLUP() generates subtotals for each leading subset of the listed columns, plus a grand total; the verbose HAVING clause filters out the intermediate subtotals and retains only the detail rows and the grand total.
What you want is a bit unusual, because if customer_id is an integer you have to cast it to a string, etc. But if this is your requirement, it can be achieved this way.
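SELECT customer_id, name, activation_dt, space_occupied
FROM
(SELECT 1 AS seq,
customer_id, name, activation_dt, space_occupied
FROM customer
UNION ALL
SELECT 2 AS seq,
'Total_Space' AS customer_id,
NULL AS name,
NULL AS activation_dt,
sum(space_occupied) AS space_occupied
FROM customer
)
ORDER BY seq;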
Inner query:
First part of UNION ALL: I added a hardcoded 1 as seq alongside your result set from customer.
Second part of UNION ALL: I am just calculating sum(space_occupied) and hardcoding the other columns, including 2 as seq.
Outer query: selects the data columns and orders by seq, so Total_Space is returned last.
| abc | abc-001 | 12-SEP-16 | 20 |
| xyz | xyz-001 | 12-SEP-16 | 10 |
| Total_Space | null | null | 30 |
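A minimal sketch of this UNION ALL approach, run against the two sample rows in an in-memory SQLite database from Python. SQLite stands in for Oracle here, so activation_dt is stored as plain text and the types are simplified. The extra customer_id sort key is my addition, only there to make the order of the detail rows deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "customer_id TEXT, name TEXT, activation_dt TEXT, space_occupied INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        ("abc", "abc-001", "2016-09-12", 20),
        ("xyz", "xyz-001", "2016-09-12", 10),
    ],
)

# seq = 1 marks detail rows and seq = 2 marks the total row, so ordering
# by seq pushes the total to the end without exposing seq in the output.
TOTAL_SQL = """
SELECT customer_id, name, activation_dt, space_occupied
FROM (
    SELECT 1 AS seq, customer_id, name, activation_dt, space_occupied
    FROM customer
    UNION ALL
    SELECT 2 AS seq, 'Total_Space', NULL, NULL, SUM(space_occupied)
    FROM customer
) AS u
ORDER BY seq, customer_id
"""

with_total = conn.execute(TOTAL_SQL).fetchall()
for row in with_total:
    print(row)
```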
This seems like a great place to use GROUP BY GROUPING SETS; this is what they were designed for.
SELECT coalesce(Customer_Id,'Total_Space') as Customer_ID
, Name
, Activation_DT
, sum(Space_occupied) space_Occupied
FROM customer
GROUP BY GROUPING SETS ((Customer_ID, Name, Activation_DT, Space_Occupied)
, ());
The key thing here is that we are summing space_occupied. The two grouping sets tell the engine to keep each row in its original form and to add one record with space_occupied summed. Because the second set is the empty set (), that record contains only aggregated values, along with constants (the value hardcoded in coalesce for the total).
The power of this is that if you needed to group by other things as well, you could have multiple grouping sets. Imagine a material with a product division, group, and line, and a report with sales totals by division, group, and line. You could simply group by () to get the grand total, (product_division, product_group, line) to get a product line total, (product_division, product_group) to get a product group total, and (product_division) to get a product division total. Pretty powerful stuff for partial cube generation.

How to get the count of distinct values until a time period Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there a way to easily do this in a single query?
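I'm not 100% sure this will work on Impala,
but it should if you have a days table, or a way to create a derived table on the fly in Impala.
CREATE TABLE days ("DayC" int);
INSERT INTO days ("DayC")
VALUES (1), (2), (3);
-- or, instead of hard-coding the values:
-- INSERT INTO days ("DayC") SELECT DISTINCT "Day" FROM sales;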
You can use this query (demonstrated on SQL Fiddle with PostgreSQL):
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
| DayC | count |
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
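The same query without a separate days table, deriving the days inline:
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN (SELECT DISTINCT "Day" as "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"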
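To sanity-check the running distinct count on the sample data, here is a small sketch using an in-memory SQLite database from Python. SQLite stands in for Impala/PostgreSQL, and the column names are lower-cased for convenience. The cross join plus WHERE is written as a join on day <= day_c, which should be equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, id INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, 1234), (1, 5631), (1, 1234),
        (2, 1234), (2, 4456), (2, 5631),
        (3, 3482), (3, 3452), (3, 1234), (3, 5631), (3, 1234),
    ],
)

# Pair every cutoff day with all sales rows up to and including that day,
# then count the distinct customer IDs per cutoff day.
RUNNING_DISTINCT_SQL = """
SELECT d.day_c, COUNT(DISTINCT s.id) AS customers_so_far
FROM sales AS s
JOIN (SELECT DISTINCT day AS day_c FROM sales) AS d
  ON s.day <= d.day_c
GROUP BY d.day_c
ORDER BY d.day_c
"""

running = conn.execute(RUNNING_DISTINCT_SQL).fetchall()
print(running)
```

Note that the join grows roughly quadratically with the number of days, so on a large table it may be worth de-duplicating (day, id) pairs first.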
Try this one (note that it counts the distinct IDs within each day, not cumulatively up to that day):
select day, count(distinct(id)) from yourtable group by day