I'm trying to model some data using SQL. The column I would like to generate is date_started; all the others are given.
date_started = the minimum date_created with the same id1 and id2 within a range of 2 hours
that does not already belong to another group of rows.
For example, for date_created = 2021-11-02 05:23:41.769,
date_started = 2021-11-02 05:23:41.769 itself,
because 2021-11-02 04:10:39.823 is within two hours of it but already belongs to the 2021-11-02 02:16:28.544 group.
id1 | id2 | date_created            | date_started
---- ----- ------------------------- -------------------------
1     2     2021-11-02 02:16:28.544   2021-11-02 02:16:28.544
1     2     2021-11-02 02:52:52.504   2021-11-02 02:16:28.544
1     2     2021-11-02 04:10:39.823   2021-11-02 02:16:28.544
1     2     2021-11-02 05:23:41.769   2021-11-02 05:23:41.769
1     2     2021-11-02 06:33:11.564   2021-11-02 05:23:41.769
1     2     2021-11-02 08:30:14.564   2021-11-02 08:30:14.564
It is a little unclear what you mean, as your description could be interpreted differently from what your example shows: should a new session start once a row is more than 2 hours after the first row of the current session, or only after 2 hours of no activity? Either way, I think looking into sessionization would be helpful here (there is lots of example code for it), as this is ultimately what you're trying to do.
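To make the first interpretation concrete (a new session starts once a row is more than 2 hours after the start of the current session), here is a minimal sketch using a recursive CTE, run in SQLite through Python. It walks the rows one at a time because each session start depends on the previous one. The table name `events` is an assumption, as is the choice of SQLite; other engines have their own recursive CTE and date arithmetic syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name `events`; the column names follow the question.
conn.execute("CREATE TABLE events (id1 INTEGER, id2 INTEGER, date_created TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (1, 2, ?)",
    [(ts,) for ts in (
        "2021-11-02 02:16:28.544",
        "2021-11-02 02:52:52.504",
        "2021-11-02 04:10:39.823",
        "2021-11-02 05:23:41.769",
        "2021-11-02 06:33:11.564",
        "2021-11-02 08:30:14.564",
    )],
)

QUERY = """
WITH RECURSIVE ordered AS (
    -- number the rows of each (id1, id2) pair in time order
    SELECT id1, id2, date_created,
           ROW_NUMBER() OVER (PARTITION BY id1, id2 ORDER BY date_created) AS rn
    FROM events
),
walk AS (
    -- the first row of each pair starts the first session
    SELECT id1, id2, date_created, rn, date_created AS date_started
    FROM ordered
    WHERE rn = 1
    UNION ALL
    -- a later row stays in the current session if it is within 2 hours
    -- of that session's start; otherwise it starts a new session
    SELECT o.id1, o.id2, o.date_created, o.rn,
           CASE WHEN julianday(o.date_created) - julianday(w.date_started) <= 2.0 / 24
                THEN w.date_started
                ELSE o.date_created
           END
    FROM walk AS w
    JOIN ordered AS o
      ON o.id1 = w.id1 AND o.id2 = w.id2 AND o.rn = w.rn + 1
)
SELECT date_created, date_started
FROM walk
ORDER BY id1, id2, date_created
"""

rows = conn.execute(QUERY).fetchall()
for date_created, date_started in rows:
    print(date_created, "->", date_started)
```

On the sample rows this reproduces the date_started column from the question. Under the other interpretation (2 hours of no activity), the comparison would be against the previous row rather than the session start.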
Related
I have a table that looks like the following:
Transaction ID | Timestamp       | User ID
--------------- ----------------- --------
1                2021-11-02 8:08   USER1
2                2021-11-02 8:10   USER2
3                2021-11-02 8:07   USER2
4                2021-11-02 8:15   USER1
5                2021-11-02 8:18   USER2
I want to create a new column that says, for a given transaction, how long it has been (in minutes) since that user's previous transaction. Essentially, subtract the user's previous timestamp from the current one. The output table would look like this:
Transaction ID | Timestamp       | User ID | Time Taken
--------------- ----------------- --------- -----------
1                2021-11-02 8:08   USER1     None
2                2021-11-02 8:10   USER2     3
3                2021-11-02 8:07   USER2     None
4                2021-11-02 8:15   USER1     7
5                2021-11-02 8:18   USER2     8
How can I do this with a query in SQLite3?
We can use LAG() along with the JULIANDAY() function here:
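SELECT
    TransactionID,
    Timestamp,
    UserID,
    -- ROUND before the cast so that floating-point error in the day
    -- difference cannot truncate e.g. 2.9999 minutes down to 2
    COALESCE(CAST(ROUND((JULIANDAY(Timestamp) -
        JULIANDAY(LAG(Timestamp) OVER (PARTITION BY UserID
                                       ORDER BY Timestamp))) * 1440) AS INTEGER), 'None') AS "TimeTaken"
FROM yourTable
ORDER BY Timestamp;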
Note that in order for the above to work, your text timestamps will have to be in a valid literal format. So instead of:
2021-11-02 8:08
you would need:
2021-11-02 08:08:00
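As a rough end-to-end illustration, the sketch below pads the timestamps in Python and then runs a LAG()/JULIANDAY() query of the same shape against an in-memory SQLite database. The table name `transactions` and the helper `normalize` are assumptions made for the example; the minutes are rounded before the cast to guard against floating-point truncation.

```python
import sqlite3
from datetime import datetime


def normalize(ts):
    """Turn '2021-11-02 8:08' into '2021-11-02 08:08:00'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%Y-%m-%d %H:%M:%S")


raw = [
    (1, "2021-11-02 8:08", "USER1"),
    (2, "2021-11-02 8:10", "USER2"),
    (3, "2021-11-02 8:07", "USER2"),
    (4, "2021-11-02 8:15", "USER1"),
    (5, "2021-11-02 8:18", "USER2"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the column names follow the query in the answer.
conn.execute(
    "CREATE TABLE transactions (TransactionID INTEGER, Timestamp TEXT, UserID TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(tid, normalize(ts), user) for tid, ts, user in raw],
)

QUERY = """
SELECT
    TransactionID,
    COALESCE(CAST(ROUND((JULIANDAY(Timestamp) -
        JULIANDAY(LAG(Timestamp) OVER (PARTITION BY UserID
                                       ORDER BY Timestamp))) * 1440) AS INTEGER),
             'None') AS TimeTaken
FROM transactions
ORDER BY TransactionID
"""

# Map each transaction ID to the minutes since that user's previous transaction.
result = dict(conn.execute(QUERY).fetchall())
for tid, minutes in result.items():
    print(tid, minutes)
```

Window functions such as LAG() need SQLite 3.25 or newer.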
I have a table that looks like this:
user | type     | quantity | order_id | purchase_date
----- ---------- ---------- ---------- --------------
john   travel     10         1          2022-01-10
john   travel     15         2          2022-01-15
john   books      4          3          2022-01-16
john   music      20         4          2022-02-01
john   travel     90         5          2022-02-15
john   clothing   200        6          2022-03-11
john   travel     70         7          2022-04-13
john   clothing   70         8          2022-05-01
john   travel     200        9          2022-06-15
john   tickets    10         10         2022-07-01
john   services   20         11         2022-07-15
john   services   90         12         2022-07-22
john   travel     10         13         2022-07-29
john   services   25         14         2022-08-01
john   clothing   3          15         2022-08-15
john   music      5          16         2022-08-17
john   music      40         18         2022-10-01
john   music      30         19         2022-11-05
john   services   2          20         2022-11-19
where I have many different users and multiple types, with purchases made daily.
I want to end up with a table in this format:
user | label           | month
----- ----------------- -----------
john   travel            2022-01-01
john   travel            2022-02-01
john   clothing          2022-03-01
john   travel-clothing   2022-04-01
john   travel-clothing   2022-05-01
john   travel-clothing   2022-06-01
john   travel            2022-07-01
john   travel            2022-08-01
john   services          2022-10-01
john   music             2022-11-01
where the label records the most popular type (based on % of quantity sold) for each user over the last 4 months (including the current month). For instance, for March 2022 john ordered 200/339 clothing (January through March inclusive), so his label is clothing. But for months where two types are almost even I'd want a double label, as for April (185 travel and 200 clothing out of 409). The rules are not set in stone yet, but they are something like: if two types are roughly even (e.g. each >40%), use both types in the label column; if three types are roughly even (e.g. around 30% each), use all three as the label; if one type is at 40% but the rest is made up of many small percentages, keep only the first type; and of course where one type is clearly a majority, use that. One other tricky bit is that there might be missing months for a user.
I think that for the rules I just need to compare the % of each type, but I don't know how to retrieve the type as the label afterwards. In general, I don't have the SQL/BigQuery logic very clear in my head. I have tried some things, but nothing that comes close to the target table.
Broken down into steps, I think I need 3 things:
1. group by user, type and month and get the partial and total counts (I have done this)
2. then retrieve the counts for the past 4 months (I have done something, but it's not exactly accurate yet)
3. compare the ratios and build the label column
I'm not very clear on the SQL/BigQuery logic here, so please advise me on the correct steps to achieve the above. I'm working in BigQuery, but general SQL logic will also help.
Consider the approach below. It looks a little messy and has room for optimization, but I hope it gives you some idea or direction for addressing your problem.
WITH aggregation AS (
  SELECT user, type, DATE_TRUNC(purchase_date, MONTH) AS month, month_no,
         SUM(quantity) AS net_qty,
         SUM(SUM(quantity)) OVER w1 AS rolling_qty
  FROM sample_table,
       UNNEST([EXTRACT(YEAR FROM purchase_date) * 12 + EXTRACT(MONTH FROM purchase_date)]) month_no
  GROUP BY 1, 2, 3, 4
  WINDOW w1 AS (
    PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  )
),
rolling AS (
  SELECT user, month, ARRAY_AGG(STRUCT(type, net_qty)) OVER w2 AS agg, rolling_qty
  FROM aggregation
  QUALIFY ROW_NUMBER() OVER (PARTITION BY user, month) = 1
  WINDOW w2 AS (PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW)
)
SELECT user, month, ARRAY_TO_STRING(ARRAY(
         SELECT type FROM (
           SELECT type, SUM(net_qty) / SUM(SUM(net_qty)) OVER () AS pct
           FROM r.agg GROUP BY 1
         ) QUALIFY IFNULL(FIRST_VALUE(pct) OVER (ORDER BY pct DESC) - pct, 0) < 0.10 -- set threshold to 0.1
       ), '-') AS label
FROM rolling r
ORDER BY month;
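To see what the 0.10 threshold does for a single user and a single 4-month window, here is a small sketch of the same comparison in plain Python: every type whose share is within the threshold of the top share goes into the label. The function name `label_for_window` is hypothetical, and ordering the label by descending share is an assumption (the question's sample output writes "travel-clothing").

```python
def label_for_window(qty_by_type, threshold=0.10):
    """Join every type whose share of the window total is within
    `threshold` of the largest share, highest share first."""
    total = sum(qty_by_type.values())
    if total == 0:
        return ""
    shares = {t: q / total for t, q in qty_by_type.items()}
    top = max(shares.values())
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    return "-".join(t for t, share in ranked if top - share < threshold)


# Quantities summed from the question's sample rows.
march = {"travel": 115, "books": 4, "music": 20, "clothing": 200}  # Jan to Mar, total 339
april = {"travel": 185, "books": 4, "music": 20, "clothing": 200}  # Jan to Apr, total 409

print(label_for_window(march))
print(label_for_window(april))
```

For March only clothing qualifies (about 59% against 34% for travel), while for April clothing and travel are within 10 points of each other, so both appear in the label.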
I have the SQL table below and would like to group its rows in sequence and assign an ID to each group.
Time                 | Line | Colour
--------------------- ------ -------
2021-11-02 3:00:00PM   1      Black
2021-11-02 3:00:01PM   1      White
2021-11-02 3:00:02PM   1      Red
2021-11-02 3:00:04PM   1      Red
2021-11-02 3:00:05PM   1      Black
2021-11-02 3:00:06PM   1      Black
2021-11-02 3:00:00PM   2      Black
2021-11-02 3:00:01PM   2      Black
2021-11-02 3:00:02PM   2      White
2021-11-02 3:00:03PM   2      White
2021-11-02 3:00:03PM   2      White
2021-11-02 3:00:03PM   2      Black
2021-11-02 3:00:03PM   2      Black
The result I am looking for is:
Time                 | Line | Colour | Qty | Group ID
--------------------- ------ -------- ----- ---------
2021-11-02 3:00:00PM   1      Black    1     1
2021-11-02 3:00:01PM   1      White    1     2
2021-11-02 3:00:02PM   1      Red      2     3
2021-11-02 3:00:04PM   1      Red      2     3
2021-11-02 3:00:05PM   1      Black    2     4
2021-11-02 3:00:06PM   1      Black    2     4
2021-11-02 3:00:00PM   2      Black    2     1
2021-11-02 3:00:01PM   2      Black    2     1
2021-11-02 3:00:02PM   2      White    3     2
2021-11-02 3:00:02PM   2      White    3     2
2021-11-02 3:00:03PM   2      White    3     2
2021-11-02 3:00:04PM   2      Black    2     3
2021-11-02 3:00:05PM   2      Black    2     3
Qty is the number of consecutive rows with the same colour on a line.
Group ID is a sequential ID that increments on each colour change within a line.
I couldn't figure it out, as the rows need to be processed in sequence by 'Time' within each 'Line', so I am unable to use a plain aggregate.
Here is how you can do it:
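SELECT *, COUNT(*) OVER (PARTITION BY Line, Colour, GroupId) AS Qty
FROM (
    SELECT *
         -- gaps-and-islands: the difference of the two rankings is constant
         -- within a run of one colour, but is only unique per (Line, Colour),
         -- so Colour is part of the COUNT partition as well
         , rank() OVER (PARTITION BY Line ORDER BY Insertdate)
         - rank() OVER (PARTITION BY Line, Colour ORDER BY Insertdate) AS GroupId
    FROM tablename
) t
ORDER BY Line, Insertdate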
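The rank-difference value works as a group key but is not numbered 1, 2, 3, ... per line. If sequential IDs as in the desired output are needed, a different technique is to flag each colour change with LAG() and take a running SUM() of the flags. The sketch below illustrates that alternative in SQLite via Python; it is not the query above. The table name `readings` and the `seq` column (an arrival-order tie-breaker, since several rows share a timestamp) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table `readings`; `seq` is an assumed tie-breaker column giving
# the arrival order within a line, because timestamps alone can repeat.
conn.execute("CREATE TABLE readings (seq INTEGER, line INTEGER, colour TEXT)")
line1 = ["Black", "White", "Red", "Red", "Black", "Black"]
line2 = ["Black", "Black", "White", "White", "White", "Black", "Black"]
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(i, 1, c) for i, c in enumerate(line1, start=1)]
    + [(i, 2, c) for i, c in enumerate(line2, start=1)],
)

QUERY = """
SELECT line, colour,
       COUNT(*) OVER (PARTITION BY line, group_id) AS qty,
       group_id
FROM (
    -- running total of change flags = sequential group number per line
    SELECT seq, line, colour,
           SUM(is_change) OVER (PARTITION BY line ORDER BY seq) AS group_id
    FROM (
        -- 1 when the colour differs from the previous row on the same line
        -- (the first row has no previous row and also gets 1)
        SELECT seq, line, colour,
               CASE WHEN colour = LAG(colour) OVER (PARTITION BY line ORDER BY seq)
                    THEN 0 ELSE 1 END AS is_change
        FROM readings
    )
)
ORDER BY line, seq
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each output row is (line, colour, qty, group_id), which corresponds to the Qty and Group ID columns in the question.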
Please note: this is not for an Access project as such, but a legacy application that uses an Access database for its back end.
Setup
Part of the application is a kind of Gantt chart, fixed to single day columns, where each row represents a single resource. Resources are booked out for a range of days and a booking is for a single resource, so they cannot overlap on a row. The range of dates that is in view is user selectable, open ended, and can be changed by various methods, including horizontal scrolling using mouse or keyboard.
Problem
I've been tasked with adding a row to the top of the chart to indicate overall resource usage for each day. Of course that's trivially easy to do by simply querying for each day in the range separately, but unfortunately that is proving to be an expensive process and therefore slows down horizontal scrolling a lot. So I'm looking for a way to do it more efficiently, hopefully with fewer database reads.
Here is a highly simplified example of the bookings table:
booking_ID | start_Date | end_Date | resource_ID
----------- -------------- ------------- -------------
1 2014-07-17 2014-07-20 21
2 2014-08-24 2014-08-29 4
3 2014-08-26 2014-09-02 21
4 2014-08-28 2014-09-04 19
Ideally, I would like a single query that returns each day within the specified range, along with a count of how many bookings there are on those days. So querying the data above for 20 days from 2014-07-17 would produce this:
check_Date | resources_Used
----------- ---------------
2014-07-17 1
2014-07-18 1
2014-07-19 1
2014-07-20 1
2014-07-21 0
2014-07-22 0
2014-07-23 0
2014-08-24 1
2014-08-25 1
2014-08-26 2
2014-08-27 2
2014-08-28 3
2014-08-29 3
2014-08-30 2
2014-08-31 2
2014-09-01 2
2014-09-02 2
2014-09-03 1
2014-09-04 1
2014-09-05 0
I can get a list of dates in the range by using a table of integers (starting at 0), with this:
SELECT CDATE('2014-07-17') + ID AS check_Date FROM Integers WHERE ID < 20
And I can get the count of resources used for a single day with something like this:
SELECT COUNT(*) AS resources_Used
FROM booking
WHERE start_Date <= CDATE('2014-09-04')
AND end_Date >= CDATE('2014-09-04')
But I can't figure out how (or if) I can tie them both together to get the desired results. Is this even possible?
Create a table called "calendar" and put a list of dates into it covering the necessary timeframe. It just needs one column called check_date with one row for each date. Use Excel, start at whatever date and just drag down, then import into the new table.
After your calendar table is set up you can run the following:
SELECT c.check_date, COUNT(b.resource_id) AS resources_used
FROM calendar AS c
LEFT JOIN bookings AS b
    ON (c.check_date >= b.start_date AND c.check_date <= b.end_date)
GROUP BY c.check_date
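To show how a calendar table joined to the bookings gives one row per day, including days with no bookings at all, here is a minimal sketch in SQLite via Python. It uses a LEFT JOIN from the calendar so that unmatched dates come back with a count of 0. SQLite stands in for Access here, so the date handling (ISO text dates compared as strings) is an assumption of the sketch and not Access syntax.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (booking_id INTEGER, start_date TEXT, "
    "end_date TEXT, resource_id INTEGER)"
)
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?, ?)",
    [
        (1, "2014-07-17", "2014-07-20", 21),
        (2, "2014-08-24", "2014-08-29", 4),
        (3, "2014-08-26", "2014-09-02", 21),
        (4, "2014-08-28", "2014-09-04", 19),
    ],
)

# One row per day from 2014-07-17 to 2014-09-05, stored as ISO text so that
# string comparison matches date order.
conn.execute("CREATE TABLE calendar (check_date TEXT PRIMARY KEY)")
first = date(2014, 7, 17)
conn.executemany(
    "INSERT INTO calendar VALUES (?)",
    [((first + timedelta(days=i)).isoformat(),) for i in range(51)],
)

QUERY = """
SELECT c.check_date, COUNT(b.booking_id) AS resources_used
FROM calendar AS c
LEFT JOIN bookings AS b
       ON c.check_date BETWEEN b.start_date AND b.end_date
WHERE c.check_date BETWEEN ? AND ?   -- the date range currently in view
GROUP BY c.check_date
ORDER BY c.check_date
"""

usage = dict(conn.execute(QUERY, ("2014-08-24", "2014-09-05")).fetchall())
for day, used in usage.items():
    print(day, used)
```

COUNT(b.booking_id) counts only matched bookings, so a date with no booking yields 0 instead of disappearing. The whole visible range comes back from a single query.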
Perhaps my title is misleading, but I am not sure how else to phrase this. I have two tables, tblL and tblDumpER. They are joined based on the field SubjectNumber. This is a one (tblL) to many (tblDumpER) relationship.
I need to write a query that will give me, for all my subjects, a value from tblDumpER associated with a date in tblL. This is to say:
SELECT tblL.SubjectNumber, tblDumpER.ER_Q1
FROM tblL
LEFT JOIN tblDumpER ON tblL.SubjectNumber=tblDumpER.SubjectNumber
WHERE tblL.RandDate=tblDumpER.ER_DATE And tblDumpER.ER_Q1 Is Not Null
This is straightforward enough. My problem is the value RandDate from tblL is different for every subject. However, it needs to be displayed as Day1 so I can have tblDumpER.ER_Q1 as Day1 for every subject. Then I need RandDate+1 As Day2, etc until I hit either null or Day84. The 'dumb' solution is to write 84 queries. This is obviously not practical. Any advice would be greatly appreciated!
I appreciate the responses so far but I don't think that I'm explaining this correctly so here is some example data:
SubjectNumber RandDate
1001 1/1/2013
1002 1/8/2013
1003 1/15/2013
SubjectNumber ER_DATE ER_Q1
1001 1/1/2013 5
1001 1/2/2013 6
1001 1/3/2013 2
1002 1/8/2013 1
1002 1/9/2013 10
1002 1/10/2013 8
1003 1/15/2013 7
1003 1/16/2013 4
1003 1/17/2013 3
Desired outcome:
(Where Day1=RandDate, Day2=RandDate+1, Day3=RandDate+2)
SubjectNumber Day1_ER_Q1 Day2_ER_Q1 Day3_ER_Q1
1001 5 6 2
1002 1 10 8
1003 7 4 3
This data is then going to be plotted on a graph with Day# on the X-axis and ER_Q1 on the Y-axis
I would do this in two steps:
1. Create a query that gets the MIN date for each SubjectNumber
2. Join this query to your existing query, so you can perform a DATEDIFF calculation on the MIN date and the date of the current record.
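The two steps can be sketched roughly as follows, here in SQLite via Python instead of Access. The sketch assumes RandDate from tblL is the reference date (as in the question) in place of a computed MIN date, stores dates as ISO text, and uses julianday() where Access would use DateDiff. It numbers each record relative to RandDate and then pivots the first three days with conditional aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblL (SubjectNumber INTEGER, RandDate TEXT)")
conn.execute(
    "CREATE TABLE tblDumpER (SubjectNumber INTEGER, ER_DATE TEXT, ER_Q1 INTEGER)"
)
# Sample data from the question, with dates rewritten as ISO text
# (an assumption of this SQLite sketch).
conn.executemany(
    "INSERT INTO tblL VALUES (?, ?)",
    [(1001, "2013-01-01"), (1002, "2013-01-08"), (1003, "2013-01-15")],
)
conn.executemany(
    "INSERT INTO tblDumpER VALUES (?, ?, ?)",
    [
        (1001, "2013-01-01", 5), (1001, "2013-01-02", 6), (1001, "2013-01-03", 2),
        (1002, "2013-01-08", 1), (1002, "2013-01-09", 10), (1002, "2013-01-10", 8),
        (1003, "2013-01-15", 7), (1003, "2013-01-16", 4), (1003, "2013-01-17", 3),
    ],
)

QUERY = """
WITH numbered AS (
    -- day_no = 1 on RandDate, 2 on the following day, and so on
    SELECT e.SubjectNumber, e.ER_Q1,
           CAST(julianday(e.ER_DATE) - julianday(l.RandDate) AS INTEGER) + 1 AS day_no
    FROM tblL AS l
    JOIN tblDumpER AS e ON e.SubjectNumber = l.SubjectNumber
    WHERE e.ER_Q1 IS NOT NULL
)
SELECT SubjectNumber,
       MAX(CASE WHEN day_no = 1 THEN ER_Q1 END) AS Day1_ER_Q1,
       MAX(CASE WHEN day_no = 2 THEN ER_Q1 END) AS Day2_ER_Q1,
       MAX(CASE WHEN day_no = 3 THEN ER_Q1 END) AS Day3_ER_Q1
FROM numbered
GROUP BY SubjectNumber
ORDER BY SubjectNumber
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

For plotting Day# against ER_Q1, the long form produced by `numbered` (SubjectNumber, day_no, ER_Q1) may already be enough, which would avoid writing out 84 pivot columns.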
I'm not entirely sure of what it is that you need, but perhaps a calendar table would be of help. Just create a local table that contains all of the days of the year in it, then use that table to JOIN your dates up?