I have a SQL database of customer actions. A customer is identified by a UniqueId, and each action has a date and a time-of-action timestamp. A user can have more than one action on any one day, like so:
UniqueID | actionDate | actionTime
1        | 17-01-18   | 13:01
1        | 17-01-18   | 13:15
2        | 17-01-18   | 13:15
1        | 18-01-18   | 12:56
I want to understand multiple things from the database ideally in a single query.
The first is how many times each UniqueId has performed an action over a given time period (day, week, month). For the example above there would be a count of 2 for id 1 on 17-01-18, a count of 1 for 18-01-18, and, assuming those are the only actions that week, a count of 3 for id 1 for that week.
On days that have more than one action (17-01-18 in the above example) I want to understand the distribution of actions across the day and, more importantly, the number of actions that occurred within each one-hour window. In this case I'd want to see that 2 actions occurred between 13:00 and 14:00 for id 1, but the other 23 hours had 0 actions.
The end goal would be to have a time series that looks back over three months and be able to view monthly, weekly and importantly daily / intra-daily counts of actions for each unique ID.
Desired result may look something like this:
ID | M1W1D1H1 | M1W1D1H2 | -> | M1W1D1H13 | -> | M1W1D2H12
1  | 0        | 0        |    | 2         |    | 1
2  | 0        | 0        |    | 1         |    | 0
M = Month, W = Week, D = Day, H = Hour; the cell values are action counts.
So the above shows that in month 1, week 1, day 1, hour 1, id 1 had no actions. The first action was in M1W1D1H13, in which hour they had two actions. Their next action was on D2 of W1, M1. I could then aggregate up to get the respective weekly, daily and monthly actions. The result will be fairly sparse, with many 0-action cells.
Any help and guidance appreciated.
If I am understanding your question, you have an id with date and time details in a normalized data structure. You, however, want to denormalize this data so that you have only one line per id, aggregated at the conditions you desire.
To do this you could use a simple GROUP BY and nest your aggregations inside CASE statements qualifying them for the column range you desire. If you cannot hard-code your time slices and need them to be dynamic, that may be possible, but I would need more information about your requirements. You can also nest CASE statements inside CASE statements and use derived tables to enable more complex rules.
So, using your example...
SELECT
UniqueID
, sum(
case when actionDate between <someDate> and <someDate> then 1
end) as evnt_cnt_in_range01
, count(distinct(
case when actionDate between <someDate> and <someDate> then actionDate
end)) as uniq_dates_in_range01
, min(
case when actionDate between <someDate> and <someDate> then actionTime
end) as earliest_action_in_range01
, max(
case when actionDate between <someDate> and <someDate> then actionTime
end) as latest_action_in_range01
, sum( -- sum the 0/1 flags to get a count
case when actionDate between <someDate> and <someDate> then
CASE WHEN actionTime > '12:00' THEN 1 ELSE 0 END -- I flip caps to keep nests straight
end) as cnt_after_noon_action_range1
FROM <sometable>
group by 1
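If you also need the intra-day view, one approach is to first build a long-format rollup of one row per id per date per hour, then aggregate or pivot that up to day, week and month. A minimal sketch, reusing the <sometable>/<someDate> placeholders above and assuming your dialect supports EXTRACT on the time column:
SELECT
UniqueID
, actionDate
, EXTRACT(HOUR FROM actionTime) as actionHour
, count(*) as actionCount
FROM <sometable>
WHERE actionDate between <someDate> and <someDate> -- e.g. the last three months
group by 1,2,3
order by 1,2,3
Summing actionCount by actionDate then gives the daily totals, and the same idea rolls up to week and month.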
I'm new to SQL and have been trying to combine two queries that give me a count of unique uses by day of the week ('weekday' - with days of the week coded 1-7) and by user type ('member_casual' - member or casual user). I managed to use a case statement to combine them into one table, with the following query:
SELECT weekday,
CASE WHEN member_casual = 'member' THEN COUNT (*) END AS member_total,
CASE WHEN member_casual = 'casual' THEN COUNT (*) END AS casual_total
FROM
`case-study-319921.2020_2021_Trip_Data.2020_2021_Rides_Merged`
GROUP BY weekday, member_casual;
Resulting in a table that looks like this:
Row | weekday | member_total | casual_total
1   | 1       | null         | 330529
2   | 1       | 308760       | null
3   | 2       | null         | 188687
4   | 2       | 316228       | null
5   | 3       | 330656       | null
6   | 3       | null         | 174799
etc...
I can see that this likely has to do with the fact that I grouped by 'weekday' and 'member_casual'; however, I get errors if I remove 'member_casual' from the GROUP BY statement. I have tried to play around with a CASE/IF statement instead, but have yet to find a solution.
You want countif():
SELECT weekday,
COUNTIF(member_casual = 'member') AS member_total,
COUNTIF(member_casual = 'casual') AS casual_total
FROM `case-study-319921.2020_2021_Trip_Data.2020_2021_Rides_Merged`
GROUP BY weekday;
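If you ever need the same result outside BigQuery (COUNTIF is BigQuery-specific), the portable equivalent is a conditional SUM; a sketch:
SELECT weekday,
       SUM(CASE WHEN member_casual = 'member' THEN 1 ELSE 0 END) AS member_total,
       SUM(CASE WHEN member_casual = 'casual' THEN 1 ELSE 0 END) AS casual_total
FROM `case-study-319921.2020_2021_Trip_Data.2020_2021_Rides_Merged`
GROUP BY weekday;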
I have a table like this:
order_id | user_id | createdAt | transaction_amount
order_id is the id of the transaction, user_id is the user, createdAt is the date, and transaction_amount is the amount of each order.
In Tableau I want to find the users in the time range '2020-01-01' to '2020-01-31' with 2 conditions:
the users made a transaction before the last date in the range ('2020-01-31') and made more than 1 transaction in total,
and the users made at least 1 transaction in the date range ('2020-01-01' to '2020-01-31').
In MySQL those conditions can be described with this query:
HAVING SUM(createdAt <= '2020-01-31') > 1
AND SUM(createdAt BETWEEN '2020-01-01' AND '2020-01-31')
In Tableau I did this:
On the first filter (createdAt) I made a range of dates ('2020-01-01' to '2020-01-31').
On the second filter (createdAt copy) I made a range of everything before the last date of the range (< '2020-01-31').
On the filter CNTD(user_id) I set count distinct to at least 1.
This shows 2223 users, but when I check it in MySQL it shows 1801 users, and MySQL is the one I trust, since I've used MySQL before and I'm new to Tableau. So what did I miss here?
Edit: Take this example:
user1 does 2 transactions, 1 each in Dec-19 and Jan-20
user2 does 1 transaction in Jan-20
user3 does 2 transactions in Dec-19
user4 does 2 transactions in Feb-20
user5 does 2 transactions in Jan-20
Now say your date range is Jan-20. If you want users with at least two transactions before the end of the date range (31 Jan 2020), at least one of which is in Jan-2020, then user1 and user5 satisfy the condition.
In this case, proceed like this:
Step 1: Create two date-type parameters (Parameter 1 and Parameter 2, respectively, for the start and end of the date range).
Step 2: Create a calculated field condition with the calculation:
{FIXED [User]: SUM(
    IF [Trans Date] <= [Parameter 2] THEN 1 ELSE 0 END)} >= 2
AND
{FIXED [User]: SUM(
    IF [Trans Date] <= [Parameter 2] AND
       [Trans Date] >= [Parameter 1] THEN 1 ELSE 0 END)} >= 1
This will evaluate to TRUE whenever your given condition is satisfied.
Needless to say, [Trans Date] is your createdAt.
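For cross-checking the Tableau number against MySQL, the full query would look something like this sketch (the table name orders is an assumption; substitute your own):
SELECT COUNT(*) AS qualifying_users
FROM (
    SELECT user_id
    FROM orders                                   -- hypothetical table name
    GROUP BY user_id
    HAVING SUM(createdAt <= '2020-01-31') > 1
       AND SUM(createdAt BETWEEN '2020-01-01' AND '2020-01-31') >= 1
) t;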
I had a SQL test for a job I wanted but unfortunately I didn't get the job.
I was hoping someone could help me with the right answer to a question in the test,
so here is the question:
ETL Part
Our "Events" table (data source) is created in real time. The table has no updates, only appends.
event_id | event_type | time             | user_id | OS      | Country
1        | A          | 01/12/2018 15:39 | 1111    | iOS     | ES
2        | B          | 01/12/2018 10:43 | 2222    | iOS     | Ge
3        | C          | 02/12/2018 16:05 | 3333    | Android | IN
4        | A          | 02/12/2018 16:39 | 3333    | Android | IN
Presented below is the Fact_Events table, which is part of our DWH. This table aggregates the number of events at the hourly level. The ETL process runs every 30 min.
date       | hour  | event_type_A | event_type_B | event_type_C
01/12/2018 | 15:00 | 1            | 0            | 0
01/12/2018 | 10:00 | 0            | 1            | 0
02/12/2018 | 16:00 | 1            | 0            | 1
Please answer the following questions:
Define the steps to create the Fact_Events table.
For each step, provide the output table.
Write the query for each step.
What loading method would you use?
I really appreciate any help as I wish to learn for future job interviews.
Thanks in advance,
Ido.
/*** Here is my answer ***/
Please tell me if I am correct or if there is a better solution.
ETL Part
1.
Create a table for each event type; in this case we will need 3 tables.
Use UNION ALL to concatenate all the tables into one table.
2.
First step:
date       | hour  | event_type_A | event_type_B | event_type_C
01/12/2018 | 15:00 | 1            | 0            | 0
02/12/2018 | 16:00 | 1            | 0            | 0
Second step:
date       | hour  | event_type_A | event_type_B | event_type_C
01/12/2018 | 15:00 | 1            | 0            | 0
02/12/2018 | 16:00 | 1            | 0            | 0
01/12/2018 | 10:00 | 0            | 1            | 0
Third step:
date       | hour  | event_type_A | event_type_B | event_type_C
01/12/2018 | 15:00 | 1            | 0            | 0
02/12/2018 | 16:00 | 1            | 0            | 0
01/12/2018 | 10:00 | 0            | 1            | 0
02/12/2018 | 16:00 | 0            | 0            | 1
SELECT Date(time) as date,
Hour(time) as hour,
Count(event_type) as event_type_A,
0 as event_type_B,
0 as event_type_C
FROM Events
WHERE event_type = 'A'
Union All
SELECT Date(time) as date,
Hour(time) as hour,
0 as event_type_A,
Count(event_type) as event_type_B,
0 as event_type_C
FROM Events
WHERE event_type = 'B'
Union All
SELECT Date(time) as date,
Hour(time) as hour,
0 as event_type_A,
0 as event_type_B,
Count(event_type) as event_type_C
FROM Events
WHERE event_type = 'C'
I would use an incremental load:
the first time, we run the script over all the data and save the table;
from then on, we append only the new events that don't already exist in the saved table.
The loading query is way off. You need to group and pivot.
Should be something like this:
select Date(time) as date,
datepart(hour,time) as hour,
Sum(case when event_type='A' then 1 else 0 end) as event_type_A,
Sum(case when event_type='B' then 1 else 0 end) as event_type_B,
Sum(case when event_type='C' then 1 else 0 end) as event_type_C
from Events
group by Date(time), datepart(hour,time)
In my shop, I'm not heavily involved in hiring, but I do take a large part in determining if anyone is any good - and hence, you could say I play a big part in firing :(
I don't want people stuffing up my data - or giving inaccurate data to clients - so understanding the overall data flow (start to finish) and the final output are key.
As such, you'd need a much bigger answer to question 1, potentially including questions back.
We group data hourly, but load data every half hour. The answer has to work for these.
What do we want if there are no transactions in a given hour? No line, or a line with all 0's? My gut feel is the latter (as it's useful to know that no transactions occurred), but that is not guaranteed.
Then we'd like to see options, and an evaluation of them e.g.,
The first run each hour creates new data for the hour, and the second updates the data, or
Create the rows as a separate process, then the half-hour process simply updates those.
However, note that the options depend heavily on answers to the previous questions e.g.,
If all-zero rows are allowed, then you can create the rows then update them
But if all-zero rows are not allowed, then you need to do insert on the first round, then updates and inserts in the second round (as the row may not have been created in the first round).
After you have the strategy and have evaluated it, then yes - write the SQL. First make it accurate/correct and understandable/maintainable. Then do efficiency.
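To make the half-hourly load concrete, one possible shape for the "insert then update" strategy is a MERGE (a sketch in T-SQL, assuming a target named Fact_Events with the column names from the question; in a real load you would also restrict the source to the hours being reloaded):
MERGE Fact_Events AS tgt
USING (
    SELECT CAST([time] AS date) AS [date],
           DATEPART(hour, [time]) AS [hour],
           SUM(CASE WHEN event_type = 'A' THEN 1 ELSE 0 END) AS event_type_A,
           SUM(CASE WHEN event_type = 'B' THEN 1 ELSE 0 END) AS event_type_B,
           SUM(CASE WHEN event_type = 'C' THEN 1 ELSE 0 END) AS event_type_C
    FROM Events
    GROUP BY CAST([time] AS date), DATEPART(hour, [time])
) AS src
ON tgt.[date] = src.[date] AND tgt.[hour] = src.[hour]
WHEN MATCHED THEN
    UPDATE SET event_type_A = src.event_type_A,
               event_type_B = src.event_type_B,
               event_type_C = src.event_type_C
WHEN NOT MATCHED THEN
    INSERT ([date], [hour], event_type_A, event_type_B, event_type_C)
    VALUES (src.[date], src.[hour], src.event_type_A, src.event_type_B, src.event_type_C);
The first run of each hour inserts the new hourly row; the second run simply updates it with the fuller counts.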
I have an SQL table with the following structure:
Timestamp(DATETIME)|AuditEvent
---------|----------
T1|Login
T2|LogOff
T3|Login
T4|Execute
T5|LogOff
T6|Login
T7|Login
T8|Report
T9|LogOff
I want the T-SQL way to find out how long the user was logged into the system, i.e. the time between the login and logoff timestamps for each session on a given day.
Day (Date)|UserTime(In Hours) (Logoff Time - LogIn Time)
--------- | -------
Jun 12 | 2
Jun 12 | 3
Jun 13 | 5
I tried using two temporary tables and row numbers but could not get it to work, since the comparison is on time, i.e. finding the next LogOff event whose timestamp is greater than the current row's Login event.
You need to group the records. I would suggest counting logins or logoffs. Here is one approach to get the time for each "session":
select min(case when auditevent = 'login' then timestamp end) as login_time,
max(timestamp) as logoff_time
from (select t.*,
sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
from t
) t
group by grp;
You then have to do whatever you want to get the numbers per day. It is unclear what those counts are.
The subquery does a reverse count. It counts the number of "logoff" records that come on or after each record. For records in the same "session", this count is the same, and suitable for grouping.
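To get from there to the per-day hours shown in the question, a sketch (T-SQL, keeping the same grp trick and the table name t from above): take the login time of each session as its day and the difference to the last timestamp as the duration.
select cast(min(case when auditevent = 'login' then timestamp end) as date) as day,
       datediff(hour,
                min(case when auditevent = 'login' then timestamp end),
                max(timestamp)) as usertime_in_hours
from (select t.*,
             sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
      from t
     ) t
group by grp;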
I have to create a report which has AccountSegment as rows and a 2-week date range as column header. The column values will be a count of the number of records in the table having the associated segment/date range.
So the desired output looks something like this:
AcctSeg 4/9/12-4/20/12 4/23/12-5/4/12 5/7/12-5/18/12
Segment1 100 200 300
Segment2 110 220 330
Segment3 120 230 340
The following query does what I want, but just looks so inefficient and ugly. I was wondering if there is a better way to accomplish the same thing:
SELECT
AccountSegment = S.Segment_Name,
'4/9/2012 - 4/20/2012' = SUM(CASE WHEN date_start BETWEEN '2012-04-09' AND '2012-04-20' THEN 1 END),
'4/23/2012 - 5/4/2012' = SUM(CASE WHEN date_start BETWEEN '2012-04-23' AND '2012-05-04' THEN 1 END),
'5/7/2012 - 5/18/2012' = SUM(CASE WHEN date_start BETWEEN '2012-05-07' AND '2012-05-18' THEN 1 END),
'5/21/2012 - 6/1/2012' = SUM(CASE WHEN date_start BETWEEN '2012-05-21' AND '2012-06-01' THEN 1 END),
'6/4/2012 - 6/15/2012' = SUM(CASE WHEN date_start BETWEEN '2012-06-04' AND '2012-06-15' THEN 1 END),
'6/18/2012 - 6/29/2012' = SUM(CASE WHEN date_start BETWEEN '2012-06-18' AND '2012-06-29' THEN 1 END),
'7/2/2012 - 7/13/2012' = SUM(CASE WHEN date_start BETWEEN '2012-07-02' AND '2012-07-13' THEN 1 END),
'7/16/2012 - 7/27/2012' = SUM(CASE WHEN date_start BETWEEN '2012-07-16' AND '2012-07-27' THEN 1 END),
'7/30/2012 - 8/10/2012' = SUM(CASE WHEN date_start BETWEEN '2012-07-30' AND '2012-08-10' THEN 1 END)
FROM
dbo.calls C
JOIN dbo.accounts a ON C.parent_id = a.id
JOIN dbo.accounts_cstm a2 ON a2.id_c = A.id
JOIN dbo.Segmentation S ON a2.[2012_segmentation_c] = S.Segment_Num
WHERE
c.deleted = 0
GROUP BY
S.Segment_Name
ORDER BY
MIN(S.Sort_Order)
It looks fine, but I would suggest one performance improvement:
where c.deleted = 0 and
date_start between '2012-04-09' AND '2012-08-10'
This will limit the aggregation only to rows you need . . . unless you want everything listed with empty data.
I would be inclined to add else 0 to the case statements, so 0s appear instead of NULLs.
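Putting both suggestions together, a sketch of the filter plus the first bucket (the remaining date-range columns follow the same pattern):
SELECT
    AccountSegment = S.Segment_Name,
    '4/9/2012 - 4/20/2012' = SUM(CASE WHEN date_start BETWEEN '2012-04-09' AND '2012-04-20'
                                      THEN 1 ELSE 0 END)
    -- ... the other date-range columns, each with ELSE 0 ...
FROM dbo.calls C
JOIN dbo.accounts a ON C.parent_id = a.id
JOIN dbo.accounts_cstm a2 ON a2.id_c = a.id
JOIN dbo.Segmentation S ON a2.[2012_segmentation_c] = S.Segment_Num
WHERE c.deleted = 0
  AND date_start BETWEEN '2012-04-09' AND '2012-08-10'
GROUP BY S.Segment_Name
ORDER BY MIN(S.Sort_Order);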
@PaulStock, happy to do so.
This technique plays to the strengths of an RDBMS, which are data retrieval and set manipulation - leave iteration to other programming languages that are better optimised for it, like C#.
First of all you need an IndexTable. I keep mine in the master database, but if you do not have write access to it, by all means keep it in your own db.
It looks like this:
Id
0
1
2
...
n
Where n is a sufficiently large number for all your scenarios: 100,000 is good, 1,000,000 is better, 10,000,000 is better still. Column Id is clustered-indexed, of course.
I'm not going to relate it directly to your query because I don't really get it and I'm too lazy to work it out.
Instead I'll relate it to this table called Transactions, where we want to roll up all the transactions that happened on each day (or week or month etc):
Date                    | Amount
2012-18-12 04:58:56.453 | 10
2012-18-12 06:34:21.456 | 100
etc
The following query will roll up the data by day
SELECT i.Id, SUM(t.Amount) AS DailyTotal
FROM IndexTable i
INNER JOIN
Transactions t ON i.Id=DATEDIFF(DAY, 0, t.Date)
GROUP BY i.Id
The DATEDIFF function returns the number of dateparts between 2 dates, in this case the number of days between 1900-01-01 0:00:00.000 (DateTime = 0 in SQL Server) and the Date of the transaction (btw there have been 41,261 days since then - see why we need a big table)
All the transactions on the same day will have the same number. Changing to week or month or second (a very big number) is as easy as changing the datepart.
You can put in a start date later than this, of course, so long as it is earlier than the data you are interested in, but it makes little to no difference to performance.
I have used an INNER JOIN here, so if there are no transactions on a given day then we have no row; a LEFT JOIN will give these empty dates with NULL as the total (use an ISNULL statement if you want to get 0).
With the normalised data you can then PIVOT as desired to get the output you are looking for.
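A sketch of that LEFT JOIN / ISNULL variant, restricting the IndexTable to a window so you get one row per day and 0 on days with no transactions (the start date here is a placeholder):
SELECT i.Id,
       DATEADD(DAY, i.Id, 0) AS Day,                 -- turn the day number back into a date
       ISNULL(SUM(t.Amount), 0) AS DailyTotal
FROM IndexTable i
LEFT JOIN Transactions t ON i.Id = DATEDIFF(DAY, 0, t.Date)
WHERE i.Id BETWEEN DATEDIFF(DAY, 0, '2012-01-01') AND DATEDIFF(DAY, 0, GETDATE())
GROUP BY i.Id;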