How to get the count of distinct values until a time period in Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there a way to easily do this in a single query?

I'm not 100% sure this will work on Impala, but it should if you have a days table or a way to create a derived table on the fly in Impala.
CREATE TABLE days ("DayC" int);
INSERT INTO days
("DayC")
VALUES (1), (2), (3);
OR
CREATE TABLE days AS
SELECT DISTINCT "Day"
FROM sales
You can then use this query (SqlFiddle demo in PostgreSQL):
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
OUTPUT
| DayC | count |
|------|-------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
UPDATED VERSION (no separate days table needed)
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN (SELECT DISTINCT "Day" as "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"
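As a rough check of the updated query, here is a minimal sketch that runs it against the sample data from the question. It uses Python's built-in sqlite3 as a stand-in engine (an assumption, since it was not tried on Impala), with lower-case unquoted column names.

```python
import sqlite3

# In-memory SQLite stands in for Impala; the table name "sales" follows the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, id INTEGER)")
rows = [(1, 1234), (1, 5631), (1, 1234),
        (2, 1234), (2, 4456), (2, 5631),
        (3, 3482), (3, 3452), (3, 1234), (3, 5631), (3, 1234)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Pair every distinct day with all rows on or before it,
# then count the distinct ids in each group.
query = """
SELECT t.dayc, COUNT(DISTINCT s.id) AS distinct_ids
FROM sales AS s
CROSS JOIN (SELECT DISTINCT day AS dayc FROM sales) AS t
WHERE s.day <= t.dayc
GROUP BY t.dayc
ORDER BY t.dayc
"""
running_counts = conn.execute(query).fetchall()
print(running_counts)  # [(1, 2), (2, 3), (3, 5)]
```

The join produces one copy of each row for every later day, so on a large table it may be worth testing how it scales before relying on it.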

try this one:
select day, count(distinct(id)) from yourtable group by day

Related

How to calculate average monthly number of some action in some period in Teradata SQL?

I have a table in Teradata SQL like below:
ID  | trans_date
----+-----------
123 | 2021-01-01
887 | 2021-01-15
123 | 2021-02-10
45  | 2021-03-11
789 | 2021-10-01
45  | 2021-09-02
And I need to calculate the average monthly number of transactions made by customers in the period between 2021-01-01 and 2021-09-01, so the client with "ID" = 789 will not be counted because he made his transaction later.
In the first month (01) there were 2 transactions
In the second month there was 1 transaction
In the third month there was 1 transaction
In the ninth month there was 1 transaction
So the result should be (2+1+1+1) / 4 = 1.25, shouldn't it?
How can I calculate it in Teradata SQL? Of course, this is only a sample of my data.
SELECT ID, AVG(txns) FROM
(SELECT ID, TRUNC(trans_date,'MON') as mth, COUNT(*) as txns
FROM mytable
-- WHERE condition matches the question but likely want to
-- use end date 2021-09-30 or use mth instead of trans_date
WHERE trans_date BETWEEN date'2021-01-01' and date'2021-09-01'
GROUP BY id, mth) mth_txn
GROUP BY id;
Your logic translated to SQL:
-- (2+1+1+1) / 4
-- CAST to FLOAT so the division is not truncated to an integer
SELECT id, CAST(COUNT(*) AS FLOAT) / COUNT(DISTINCT TRUNC(trans_date,'MON')) AS avg_tx
FROM mytable
WHERE trans_date BETWEEN date'2021-01-01' and date'2021-09-01'
GROUP BY id;
You should compare to Fred's answer to see which is more efficient on your data.
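To make the arithmetic from the question concrete, here is a minimal sketch that computes the overall figure (not per id) on the sample rows. It uses Python's built-in sqlite3 instead of Teradata, so strftime('%Y-%m', ...) stands in for TRUNC(trans_date,'MON'), and it assumes an end date of 2021-09-30 so that the 2021-09-02 transaction is included, as the comment in the first query suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, trans_date TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [
    (123, "2021-01-01"), (887, "2021-01-15"), (123, "2021-02-10"),
    (45, "2021-03-11"), (789, "2021-10-01"), (45, "2021-09-02"),
])

# Transactions in the period divided by the number of distinct months
# that had at least one transaction. Multiplying by 1.0 avoids integer division.
# The end date 2021-09-30 is an assumption, not the date given in the question.
query = """
SELECT COUNT(*) * 1.0 / COUNT(DISTINCT strftime('%Y-%m', trans_date))
FROM mytable
WHERE trans_date BETWEEN '2021-01-01' AND '2021-09-30'
"""
avg_per_month = conn.execute(query).fetchone()[0]
print(avg_per_month)  # 1.25
```

With the original end date of 2021-09-01 the September row drops out, leaving 4 transactions over 3 months.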

Vertica SQL for running count distinct and running conditional count

I'm trying to build a department level score table based on a deeper product url level score table.
Dates are not consecutive
Not all urls get score updates on the same day (they are independent of each other)
dist urls should be a running count distinct (cumulative count distinct)
dist urls and urls score >=30 are both distinct counts
What I have now is:
Date url Store Dept Page Score
10/1 a US A X 10
10/1 b US A X 30
10/1 c US A X 60
10/4 a US A X 20
10/4 d US A X 60
10/6 b US A X 22
10/9 a US A X 40
10/9 e US A X 10
What I want is:
Date Store Dept Page dist urls urls score >=30
10/1 US A X 3 2
10/4 US A X 4 3
10/6 US A X 4 2
10/9 US A X 5 2
I think dist urls can be computed with a window function, I'm just not sure how to write the query.
My current query is below, but it is wrong because the distinct count is not cumulative:
SELECT
bm.AnalysisDate,
su.SoID AS Store,
su.DptCaID AS DTID,
su.PageTypeID AS PTID,
COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
FROM csn_seo.tblBotifyMetrics bm
INNER JOIN csn_seo.tblSEOURLs su
ON bm.SeoURLID = su.ID
WHERE su.DptCaID IS NOT NULL
AND su.DptCaID <> 0
AND su.PageTypeID IS NOT NULL
AND su.PageTypeID <> -1
AND bm.iscompliant = 1
GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;
Please let me know if anyone has any idea.
Based on your question, you seem to want two levels of logic:
select date, store, dept,
sum(sum(start)) over (partition by dept, page order by date) as distinct_urls,
sum(sum(start_30)) over (partition by dept, page order by date) as distinct_urls_30
from ((select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
from t
group by store, dept, page, url
) union all
(select store, dept, page, url, min(date) as date, 0, 1
from t
where score >= 30
group by store, dept, page, url
)
) t
group by date, store, dept, page;
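The first-appearance idea behind this query can be tried on the sample data. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 rather than Vertica, takes 2019 as the year, covers only the dist urls column, and replaces the windowed sum with a join from each reporting date to the urls first seen on or before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (date TEXT, url TEXT, store TEXT, dept TEXT, page TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, 'US', 'A', 'X', ?)", [
    ("2019-10-01", "a", 10), ("2019-10-01", "b", 30), ("2019-10-01", "c", 60),
    ("2019-10-04", "a", 20), ("2019-10-04", "d", 60),
    ("2019-10-06", "b", 22),
    ("2019-10-09", "a", 40), ("2019-10-09", "e", 10),
])

# first_seen: the earliest date of each url within its store/dept/page.
# dates: every reporting date, including dates on which no new url appears.
query = """
WITH first_seen AS (
    SELECT store, dept, page, url, MIN(date) AS first_date
    FROM t
    GROUP BY store, dept, page, url
),
dates AS (
    SELECT DISTINCT date, store, dept, page FROM t
)
SELECT d.date, COUNT(f.url) AS dist_urls
FROM dates AS d
JOIN first_seen AS f
  ON f.store = d.store AND f.dept = d.dept AND f.page = d.page
 AND f.first_date <= d.date
GROUP BY d.date, d.store, d.dept, d.page
ORDER BY d.date
"""
dist_urls = conn.execute(query).fetchall()
print(dist_urls)
# [('2019-10-01', 3), ('2019-10-04', 4), ('2019-10-06', 4), ('2019-10-09', 5)]
```

Because each url is counted once, at its first date, the running total is a cumulative distinct count, and 10/6 still appears even though no new url shows up that day.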
I don't understand how your query is related to your question.
Try as I might, I don't get your output either:
But I think you can avoid UNION SELECTs - Does this do what you expect?
NULLS don't figure in COUNT DISTINCTs - and here you can combine an aggregate expression with an OLAP one ...
And Vertica has named windows to increase readability ....
WITH
input(Date,url,Store,Dept,Page,Score) AS (
SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
date
, store
, dept
, page
, SUM(COUNT(DISTINCT url) ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out date | store | dept | page | dist_urls | dist_urls_gt_30
-- out ------------+-------+------+------+-----------+-----------------
-- out 2019-10-01 | US | A | X | 3 | 2
-- out 2019-10-04 | US | A | X | 5 | 3
-- out 2019-10-06 | US | A | X | 6 | 3
-- out 2019-10-09 | US | A | X | 8 | 4
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms

Oracle SQL Help Data Totals

I am on Oracle 12c and need help with a simple query.
Here is the sample data of what I currently have:
Table Name: customer
Table DDL
create table customer(
customer_id varchar2(50),
name varchar2(50),
activation_dt date,
space_occupied number(50)
);
Sample Table Data:
customer_id name activation_dt space_occupied
abc abc-001 2016-09-12 20
xyz xyz-001 2016-09-12 10
Sample Data Output
The query I am looking for will provide the following:
customer_id name activation_dt space_occupied
abc abc-001 2016-09-12 20
xyz xyz-001 2016-09-12 10
Total_Space null null 30
Here is a slightly hack-y approach to this, using the grouping function ROLLUP().
select coalesce(customer_id, 'Total Space') as customer_id
, name
, activation_dt
, sum(space_occupied) as space_occupied
from customer
group by ROLLUP(customer_id, name, activation_dt)
having grouping(customer_id) = 1
or (grouping(name) + grouping(customer_id)+ grouping(activation_dt)) = 0;
CUSTOMER_ID NAME ACTIVATIO SPACE_OCCUPIED
------------ ------------ --------- --------------
abc abc-001 12-SEP-16 20
xyz xyz-001 12-SEP-16 10
Total Space 30
ROLLUP() generates subtotals for each leading prefix of the column list, plus a grand total; the verbose HAVING clause filters out the intermediate subtotals and retains only the detail rows and the grand total.
What you want is a bit unusual: if customer_id is an integer, you would have to cast it to a string, and so on. But if this is your requirement, it can be achieved this way.
SELECT customer_id,
name,
activation_dt,
space_occupied
FROM
(SELECT 1 AS seq,
customer_id,
name,
activation_dt,
space_occupied
FROM customer
UNION ALL
SELECT 2 AS seq,
'Total_Space' AS customer_id,
NULL AS name,
NULL AS activation_dt,
sum(space_occupied) AS space_occupied
FROM customer
)
ORDER BY seq
Explanation:
Inner query, first part of the UNION ALL: I added 1 as seq, a hardcoded value attached to every row of your result set from customer.
Inner query, second part of the UNION ALL: I am just calculating sum(space_occupied) and hardcoding the other columns, including 2 as seq.
Outer query: selects the data columns and orders by seq, so Total_Space is returned last.
Output
+-------------+---------+---------------+----------------+
| CUSTOMER_ID | NAME | ACTIVATION_DT | SPACE_OCCUPIED |
+-------------+---------+---------------+----------------+
| abc | abc-001 | 12-SEP-16 | 20 |
| xyz | xyz-001 | 12-SEP-16 | 10 |
| Total_Space | null | null | 30 |
+-------------+---------+---------------+----------------+
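The UNION ALL approach is portable enough to try outside Oracle. The following is a minimal sketch using Python's built-in sqlite3 (an assumption, not the asker's Oracle 12c setup), with the dates stored as text and an extra customer_id sort key added so the detail rows come back in a fixed order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id TEXT, name TEXT, "
    "activation_dt TEXT, space_occupied INTEGER)"
)
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    ("abc", "abc-001", "2016-09-12", 20),
    ("xyz", "xyz-001", "2016-09-12", 10),
])

# seq = 1 marks the detail rows and seq = 2 marks the total row,
# so ordering by seq places Total_Space last.
query = """
SELECT customer_id, name, activation_dt, space_occupied
FROM (
    SELECT 1 AS seq, customer_id, name, activation_dt, space_occupied
    FROM customer
    UNION ALL
    SELECT 2 AS seq, 'Total_Space', NULL, NULL, SUM(space_occupied)
    FROM customer
)
ORDER BY seq, customer_id
"""
report = conn.execute(query).fetchall()
for row in report:
    print(row)
```

The last row printed is ('Total_Space', None, None, 30); NULL comes back as None in Python.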
This seems like a great place to use GROUP BY GROUPING SETS; it looks like exactly what they were designed for.
SELECT coalesce(Customer_Id,'Total_Space') as Customer_ID
, Name
, Activation_DT
, sum(Space_occupied) space_Occupied
FROM customer
GROUP BY GROUPING SETS ((Customer_ID, Name, Activation_DT, Space_Occupied)
,())
The key thing here is that we are summing space_occupied. The two grouping sets tell the engine to keep each row in its original form and to add one record with space_occupied summed. Because that record is grouped by the empty set (), only aggregated values are returned for it, along with constants (the value hardcoded in coalesce for the total).
The power of this is that if you needed to group by other things as well, you could have multiple grouping sets. Imagine a material with a product division, group and line, and you want a report with sales totals by division, group and line. You could simply group by () to get the grand total, (product_division, product_group, line) to get a product line total, (product_division, product_group) to get a product group total, and (product_division) to get a product division total. Pretty powerful stuff for partial cube generation.
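To show what several grouping sets return, here is a small sketch in Python's built-in sqlite3. SQLite has no GROUPING SETS, so the sets are emulated with one GROUP BY per set combined by UNION ALL; the product_sales table, its columns and its rows are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, invented for this illustration.
conn.execute("CREATE TABLE product_sales (division TEXT, grp TEXT, amount INTEGER)")
conn.executemany("INSERT INTO product_sales VALUES (?, ?, ?)", [
    ("tools", "hand", 10), ("tools", "hand", 5),
    ("tools", "power", 20), ("toys", "games", 7),
])

# Emulates GROUPING SETS ((division, grp), (division), ()):
# one SELECT per grouping set, with NULL in the columns that are rolled up.
query = """
SELECT division, grp, SUM(amount) FROM product_sales GROUP BY division, grp
UNION ALL
SELECT division, NULL, SUM(amount) FROM product_sales GROUP BY division
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM product_sales
"""
totals = set(conn.execute(query).fetchall())
for row in sorted(totals, key=str):
    print(row)
```

On an engine that supports GROUPING SETS, a single GROUP BY clause is expected to produce the same rows without repeating the query for each set.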

SQL Server: how to divide the result of sum of total for every customer id

I have 4 tables like this (you can ignore table B because this problem did not use that table)
I want to show the sum of 'total' for each 'sales_id' from table 'sales_detail'
What I want (the result) is like this:
sales_id | total
S01 | 3
S02 | 2
S03 | 4
S04 | 1
S05 | 2
S05 | 3
I have tried with this query:
select
sum(total)
from
sales_detail
where
sales_id = any (select sales_id
from sales
where customer_id = any (select customer_id
from customer)
)
but the query returns a value of 15 because that is the sum over all of those rows.
I have tried to use "distinct" before sum,
and the result is [ 1, 2, 3 ] because those are the distinct values in those rows (not the sum for each sales_id).
It's all about subquery
You are just so far off track that a simple comment won't help. Your query only concerns one table, sales_detail. It has nothing to do with the other two.
And, it is just an aggregation query:
select sd.sales_id, sum(sd.total)
from sales_detail sd
group by sd.sales_id;
This is actually pretty close to what the question itself is asking.
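The difference between the single total and the grouped totals can be seen in a small sketch. It uses Python's built-in sqlite3 rather than SQL Server, and because the question's table contents are not shown, the sales_detail rows below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (sales_id TEXT, total INTEGER)")
# Hypothetical rows; the real table contents are not shown in the question.
conn.executemany("INSERT INTO sales_detail VALUES (?, ?)", [
    ("S01", 1), ("S01", 2), ("S02", 2), ("S03", 3), ("S03", 1),
])

# Without GROUP BY, SUM collapses the whole table into a single value.
overall = conn.execute("SELECT SUM(total) FROM sales_detail").fetchone()[0]

# With GROUP BY, SUM is evaluated once per sales_id.
per_sale = conn.execute("""
    SELECT sd.sales_id, SUM(sd.total)
    FROM sales_detail sd
    GROUP BY sd.sales_id
    ORDER BY sd.sales_id
""").fetchall()

print(overall)   # 9
print(per_sale)  # [('S01', 3), ('S02', 2), ('S03', 4)]
```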

How do I create a frequency distribution?

I'm trying to create a frequency distribution to show how many customers have transacted 1x, 2x, 3x, etc.
I have a table transactions with a column user_id. Each row indicates a transaction, and if a user_id shows up in multiple rows, that user has done multiple transactions.
Now I'd like to get a list that looks something like this:
Tra. | Freq.
0 | 345
1 | 543
2 | 45
3 | 20
4 | 0
5 | 3
etc
Currently I have this, but it just shows a list of users and how many transactions they have had.
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
ORDER BY number_of_transactions DESC;
I did some digging and it was suggested that generate_series might help, but I'm stuck and don't know how to move forward.
Use the first result as input to an outer query where you apply the count again, but this time grouping on number_of_transactions:
SELECT number_of_transactions, COUNT(*) AS freq
FROM (
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
) A
GROUP BY number_of_transactions;
This would transform a result like:
user_id number_of_transactions
----------- ----------------------
1 2
2 1
3 2
4 4
to this:
number_of_transactions freq
---------------------- -----------
1 1
2 2
4 1
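The grouped result skips counts that no user has (3 in this example), while the list in the question shows such rows with a frequency of 0. One possible way to fill the gaps is sketched below with Python's built-in sqlite3, where a recursive CTE plays the role that generate_series would play in PostgreSQL. The names per_user, buckets, txns and k are made up for the sketch, and the "0 transactions" row is left out because it would need a separate table of all users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER)")
# Same shape as the example above: user 1 has 2 rows, user 2 has 1,
# user 3 has 2 and user 4 has 4.
conn.executemany("INSERT INTO transactions VALUES (?)",
                 [(1,), (1,), (2,), (3,), (3,), (4,), (4,), (4,), (4,)])

# per_user: transactions per user.
# buckets: every integer from 1 up to the largest per-user count.
# The LEFT JOIN keeps buckets that no user falls into, with freq = 0.
query = """
WITH RECURSIVE per_user AS (
    SELECT user_id, COUNT(*) AS txns
    FROM transactions
    GROUP BY user_id
),
buckets(k) AS (
    SELECT 1
    UNION ALL
    SELECT k + 1 FROM buckets WHERE k < (SELECT MAX(txns) FROM per_user)
)
SELECT b.k AS number_of_transactions, COUNT(p.user_id) AS freq
FROM buckets AS b
LEFT JOIN per_user AS p ON p.txns = b.k
GROUP BY b.k
ORDER BY b.k
"""
distribution = conn.execute(query).fetchall()
print(distribution)  # [(1, 1), (2, 2), (3, 0), (4, 1)]
```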