I have a dataset:
timestamp event user
2020-04-28 20:07:55.503 log_in john
2020-04-28 20:08:01.996 log_out john
2020-04-28 20:08:02.470 log_in john
2020-04-28 20:08:03.996 log_out john
2020-04-28 20:08:05.729 log_failed john
2020-04-29 10:06:45.683 log_in mark
2020-04-29 10:08:58.299 password_change mark
2020-04-30 14:19:24.921 log_in jeff
2020-04-30 14:20:31.266 log_out jeff
2020-04-30 14:21:44.438 create_new_user jeff
2020-04-30 14:22:44.455 create_new_user jeff
How do I write a SQL query to count all unique events per day? The unclear part for me is the presence of hours in the timestamp. The desired result looks like this:
timestamp count
2020-04-28 3
2020-04-29 2
2020-04-30 3
I think the ClickHouse syntax is:
select distinct toDate(timestamp), event
from t;
EDIT:
If you want to count the events, use count(distinct):
select toDate(timestamp), count(distinct event)
from t
group by toDate(timestamp);
create table xx(timestamp DateTime64(3), event String, user String) Engine=Memory;
insert into xx values
('2020-04-28 20:07:55.503','log_in', 'john'),
('2020-04-28 20:08:01.996','log_out','john'),
('2020-04-28 20:08:02.470','log_in','john'),
('2020-04-28 20:08:03.996','log_out','john'),
('2020-04-28 20:08:05.729','log_failed','john'),
('2020-04-29 10:06:45.683','log_in','mark'),
('2020-04-29 10:08:58.299','password_change','mark'),
('2020-04-30 14:19:24.921','log_in','jeff'),
('2020-04-30 14:20:31.266','log_out','jeff'),
('2020-04-30 14:21:44.438','create_new_user','jeff'),
('2020-04-30 14:22:44.455','create_new_user','jeff');
SELECT
toDate(timestamp) AS d,
uniq(event)
FROM xx
GROUP BY d
┌──────────d─┬─uniq(event)─┐
│ 2020-04-28 │           3 │
│ 2020-04-29 │           2 │
│ 2020-04-30 │           3 │
└────────────┴─────────────┘
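Note that uniq() returns an approximate distinct count in ClickHouse. If an exact figure is needed, uniqExact() (or count(distinct event), as in the query above) can be used instead:
SELECT
    toDate(timestamp) AS d,
    uniqExact(event)
FROM xx
GROUP BY d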
Related
I'm new to SQL; I use pandas a lot, but my boss asks me to use SQL to replace lots of pandas code.
I have a table my_table:
case_id first_created last_paid submitted_time
3456 2021-01-27 2021-01-29 2021-01-26 21:34:36.566023+00:00
7891 2021-08-02 2021-09-16 2022-10-26 19:49:14.135585+00:00
1245 2021-09-13 None 2022-10-31 02:03:59.620348+00:00
9073 None None 2021-09-12 10:25:30.845687+00:00
6891 2021-08-03 2021-09-17 None
First I need to create 2 variables:
create_duration = first_created-submitted_time
paid_duration= last_paid-submitted_time
If submitted_time is None, just ignore that row; otherwise, if create_duration or paid_duration is negative, convert it to 0. The unit should be days.
The ideal output should be something similar to this:
case_id first_created last_paid submitted_time create_duration paid_duration
3456 2021-01-27 2021-01-29 2021-01-26 21:34:36.566023+00:00 1 3
7891 2021-08-02 2021-09-16 2022-10-26 19:49:14.135585+00:00 0 0
1245 2021-09-13 None 2022-10-31 02:03:59.620348+00:00 0 null
9073 None None 2021-09-12 10:25:30.845687+00:00 null null
6891 2021-08-03 2021-09-17 null null null
My code:
select * from my_table
first_created-submitted_time as create_duration
last_paid-submitted_time as paid_duration
I have to say I'm not good at SQL and I have no idea how to continue. Can anyone help?
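A minimal sketch of one way to do this, assuming PostgreSQL (date arithmetic differs between databases) and that first_created/last_paid are date columns while submitted_time is a timestamp:
select
    case_id,
    first_created,
    last_paid,
    submitted_time,
    -- date - date yields whole days; the CASE clamps negatives to 0 and lets nulls pass through
    case
        when first_created - submitted_time::date < 0 then 0
        else first_created - submitted_time::date
    end as create_duration,
    case
        when last_paid - submitted_time::date < 0 then 0
        else last_paid - submitted_time::date
    end as paid_duration
from my_table;
Adding where submitted_time is not null would drop rows without a submission time entirely, if that is what "ignore that row" means; as written, those rows simply get null durations, matching the ideal output above.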
I currently have the dataset below:
Group Start      End
A     2021-01-01 2021-04-05
A     2021-01-01 2021-06-05
A     2021-03-01 2021-06-05
B     2021-06-13 2021-08-05
B     2021-06-13 2021-09-05
B     2021-07-01 2021-09-05
C     2021-10-07 2021-10-17
C     2021-10-07 2021-11-15
C     2021-11-12 2021-11-15
I want the final dataset to look like the following. Essentially, I would like to remove all observations whose Start doesn't equal the minimum Start value, and I want to do this by group.
Group Start      End
A     2021-01-01 2021-04-05
A     2021-01-01 2021-06-05
B     2021-06-13 2021-08-05
B     2021-06-13 2021-09-05
C     2021-10-07 2021-10-17
C     2021-10-07 2021-11-15
I tried the following code, but I cannot use a min statement in a where clause. Any help would be appreciated.
Delete from #df1
where start != min(start)
If you want to remove all rows that don't have the earliest [Start] per [Group], you can join a subquery that finds the earliest day. You can add additional ON clauses if you need to match other rows as well.
DELETE o1
FROM observations o1
INNER JOIN (SELECT MIN([Start]) AS minstart, [Group] FROM observations GROUP BY [Group]) o2
    ON o1.[Group] = o2.[Group] AND o1.[Start] <> o2.minstart
SELECT *
FROM observations
Group | Start | End
:---- | :--------- | :---------
A | 2021-01-01 | 2021-04-05
A | 2021-01-01 | 2021-06-05
B | 2021-06-13 | 2021-08-05
B | 2021-06-13 | 2021-09-05
C | 2021-10-07 | 2021-10-17
C | 2021-10-07 | 2021-11-15
db<>fiddle here
You can try this
DELETE t FROM table_name t WHERE t.[Start] <> (SELECT MIN([Start]) FROM table_name t2 WHERE t2.[Group] = t.[Group])
Another alternative using a CTE:
with keepers as (
select [Group], min(Start) as mStart
from #df1 group by [Group]
)
delete src
from #df1 as src
where not exists (select * from keepers
where keepers.[Group] = src.[Group] and keepers.mStart = src.Start)
;
You should make an effort to avoid using reserved words as names, since that requires extra effort (such as the brackets above) when writing SQL against them.
I have the following dataset:
A B    C
1 John 2018-08-14
1 John 2018-08-20
1 John 2018-09-03
2 John 2018-11-13
2 John 2018-12-11
2 John 2018-12-12
1 John 2020-01-20
1 John 2020-01-21
3 John 2021-03-02
3 John 2021-03-03
1 John 2020-05-10
1 John 2020-05-12
And I would like to have the following result:
A B    C
1 John 2018-08-14
2 John 2018-11-13
1 John 2020-01-20
3 John 2021-03-02
1 John 2020-05-10
If I group by A, B, the 1st row and the third just merge together, which makes sense. How could I create another column so that I can still use a group by and get the result I want?
If you have other ideas than mine, please explain them!
I tried first, last, rank, and dense_rank without success.
Use lag(). Looks like B is a function of A in your data. So checking lag(A) will suffice.
select A,B,C
from (
select *, case when lag(A) over(order by C) = A then 0 else 1 end startFlag
from mytable
) t
where startFlag = 1
order by C
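If B were not fully determined by A, the same start-flag idea extends by comparing both columns (a sketch on the same assumed table and ordering):
select A, B, C
from (
    select *,
           case when lag(A) over(order by C) = A
                 and lag(B) over(order by C) = B
                then 0 else 1 end as startFlag
    from mytable
) t
where startFlag = 1
order by C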
I need to get deduped conversions for each unique user. The rule here is that I need a column where I only get the count of the first conversion made within a day. So I can trigger 10 conversions on 3/03/2019, but the 'Deduped' column will only count 1. The code should scale to TBs of data.
This is my original data in BigQuery:
Date User_ID Total_Conversions
3/3/19 1234 1
3/3/19 1234 1
3/3/19 1234 1
3/3/19 12 1
3/3/19 12 1
3/4/19 1234 1
3/4/19 1234 1
3/5/19 1 1
3/6/19 1 0
I want my final output to look like this:
Date User_ID Total_Conversions Deduped
3/3/19 1234 3 1
3/3/19 12 2 1
3/5/19 1 1 1
3/4/19 1234 2 1
3/6/19 1 0 0
I think you just need a basic GROUP BY query here:
SELECT
date,
User_ID,
SUM(Total_Conversions) AS Total_Conversions,
CASE WHEN SUM(Total_Conversions) > 0 THEN 1 ELSE 0 END AS Deduped
FROM yourTable
GROUP BY
date,
User_ID;
Demo
(Demo shown in MySQL just for illustrative purposes)
This assumes that logically the Deduped column is always one, for any number of conversions in that group, unless no conversions at all happened, in which case it becomes zero.
I have a table that contains a list of expiration dates for various companies. The table looks like the following:
ID CompanyID Expiration
--- ---------- ----------
1 1 2016-01-01
2 1 2015-01-01
3 2 2016-04-02
4 2 2015-04-02
5 3 2014-01-03
6 4 2015-04-09
7 5 2015-07-20
8 5 2016-05-01
I am trying to build a TSQL query that will return just the most recent record for every company (i.e. CompanyID). Such as:
ID CompanyID Expiration
--- ---------- ----------
1 1 2016-01-01
3 2 2016-04-02
5 3 2014-01-03
6 4 2015-04-09
8 5 2016-05-01
It looks like there is an exact correlation between ID and Expiration. If that is true, i.e. the later the Expiration, the higher the ID, then you could simply pull Max(ID) and Max(Expiration), which are 1:1, and group by CompanyID:
Select max(ID), CompanyID, max(Expiration) from Table group by CompanyID
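If ID and Expiration don't actually track each other (in the sample above, CompanyID 1's later expiration sits on the lower ID), a ROW_NUMBER() approach avoids relying on that assumption; a sketch, with ExpirationDates standing in for whatever the table is really called:
;WITH ranked AS (
    SELECT ID, CompanyID, Expiration,
           ROW_NUMBER() OVER (PARTITION BY CompanyID ORDER BY Expiration DESC) AS rn
    FROM ExpirationDates   -- hypothetical table name
)
SELECT ID, CompanyID, Expiration
FROM ranked
WHERE rn = 1;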