Conditional lead/lag function PostgreSQL?

I have a table like this:
Name activity time
user1 A1 12:00
user1 E3 12:01
user1 A2 12:02
user2 A1 10:05
user2 A2 10:06
user2 A3 10:07
user2 M6 10:07
user2 B1 10:08
user3 A1 14:15
user3 B2 14:20
user3 D1 14:25
user3 D2 14:30
Now, I need a result like this:
Name activity next_activity
user1 A2 NULL
user2 A3 B1
user3 A1 B2
For every user, I would like to find the last activity from group A and the type of group B activity that took place next (an activity from group B always takes place after an activity from group A). Other types of activity are not interesting to me. I've tried to use the lead() function, but it hasn't worked.
How can I solve my problem?

Your definition:
activity from group B always takes place after activity from group A.
... logically implies that there is, per user, 0 or 1 B activity after 1 or more A activities. Never more than 1 B activity in sequence.
You can make it work with a single window function, DISTINCT ON and CASE, which should be the fastest way for few rows per user (also see below):
SELECT name
, CASE WHEN a2 LIKE 'B%' THEN a1 ELSE a2 END AS activity
, CASE WHEN a2 LIKE 'B%' THEN a2 END AS next_activity
FROM (
SELECT DISTINCT ON (name)
name
, lead(activity) OVER (PARTITION BY name ORDER BY time DESC) AS a1
, activity AS a2
FROM t
WHERE (activity LIKE 'A%' OR activity LIKE 'B%')
ORDER BY name, time DESC
) sub;
An SQL CASE expression defaults to NULL if no ELSE branch is added, so I kept that short.
Assuming time is defined NOT NULL. Else, you might want to add NULLS LAST. Why?
Sort by column ASC, but NULL values first?
(activity LIKE 'A%' OR activity LIKE 'B%') is more verbose than activity ~ '^[AB]', but typically faster in older versions of Postgres. About pattern matching:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Conditional window functions?
That's actually possible. You can combine the aggregate FILTER clause with the OVER clause of window functions. However:
The FILTER clause itself can only work with values from the current row.
More importantly, FILTER is not implemented for genuine window functions like lead() or lag() (up to Postgres 13) - only for aggregate functions.
If you try:
lead(activity) FILTER (WHERE activity LIKE 'A%') OVER () AS activity
Postgres will tell you:
FILTER is not implemented for non-aggregate window functions
About FILTER:
Aggregate columns with additional (distinct) filters
Referencing current row in FILTER clause of window function
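A minimal sketch of the combination that does work: an aggregate function used as a window function together with FILTER. It runs through Python's sqlite3 module as a stand-in for Postgres (an assumption made only so the example is runnable; a reasonably recent SQLite accepts the same FILTER ... OVER syntax for aggregates). The a_activities column name is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (name, activity, time).
ROWS = [
    ("user1", "A1", "12:00"), ("user1", "E3", "12:01"), ("user1", "A2", "12:02"),
    ("user2", "A1", "10:05"), ("user2", "A2", "10:06"), ("user2", "A3", "10:07"),
    ("user2", "M6", "10:07"), ("user2", "B1", "10:08"),
    ("user3", "A1", "14:15"), ("user3", "B2", "14:20"),
    ("user3", "D1", "14:25"), ("user3", "D2", "14:30"),
]

# count(*) is an aggregate, so FILTER may be combined with OVER.
# The filter only looks at values of the row being counted.
QUERY = """
SELECT DISTINCT name,
       count(*) FILTER (WHERE activity LIKE 'A%')
           OVER (PARTITION BY name) AS a_activities
FROM t
ORDER BY name
"""


def count_a_activities(rows):
    """Load the rows into an in-memory table t and count group A rows per user."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (name TEXT, activity TEXT, time TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


a_counts = count_a_activities(ROWS)
```

Replacing count(*) with lead(activity) in the same position is what Postgres rejects with the error quoted above.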
Performance
For few users with few rows per user, pretty much any query is fast, even without index.
For many users and few rows per user, the first query above should be fastest. See:
Select first row in each GROUP BY group?
For many rows per user, there are (potentially much) faster techniques, depending on details of your setup. See:
Optimize GROUP BY query to retrieve latest row per user

select distinct on(name) name,activity,next_activity
from (select name,activity,time
,lead(activity) over (partition by name order by time) as next_activity
from t
where left(activity,1) in ('A','B')
) t
where left(activity,1) = 'A'
order by name,time desc
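DISTINCT ON is specific to Postgres. As a rough sketch of the same idea in more portable SQL, the last A row per user can be picked with row_number() instead. The example below runs on SQLite through Python's sqlite3 module (a stand-in chosen only to make it runnable; substr() replaces left(), and the subquery aliases ab and ranked are made up).

```python
import sqlite3

# Sample rows from the question: (name, activity, time).
ROWS = [
    ("user1", "A1", "12:00"), ("user1", "E3", "12:01"), ("user1", "A2", "12:02"),
    ("user2", "A1", "10:05"), ("user2", "A2", "10:06"), ("user2", "A3", "10:07"),
    ("user2", "M6", "10:07"), ("user2", "B1", "10:08"),
    ("user3", "A1", "14:15"), ("user3", "B2", "14:20"),
    ("user3", "D1", "14:25"), ("user3", "D2", "14:30"),
]

QUERY = """
SELECT name, activity, next_activity
FROM (
    -- Step 2: keep only A rows and rank them, latest first.
    SELECT name, activity, next_activity,
           row_number() OVER (PARTITION BY name ORDER BY time DESC) AS rn
    FROM (
        -- Step 1: among A and B rows only, look one row ahead.
        SELECT name, activity, time,
               lead(activity) OVER (PARTITION BY name ORDER BY time) AS next_activity
        FROM t
        WHERE substr(activity, 1, 1) IN ('A', 'B')
    ) AS ab
    WHERE substr(activity, 1, 1) = 'A'
) AS ranked
WHERE rn = 1
ORDER BY name
"""


def last_a_with_next_b(rows):
    """Return (name, last A activity, following B activity or None) per user."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (name TEXT, activity TEXT, time TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


result = last_a_with_next_b(ROWS)
```

On the sample data this yields A2/NULL for user1, A3/B1 for user2 and A1/B2 for user3, which matches the result requested in the question.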

Related

sql query count rows per id to by selecting range between 2 min dates in different columns

temp
|id|received |changed |ur|context|
|33|2019-02-18|2019-11-18|
|33|2019-08-02|2019-09-18|
|33|2019-12-27|2019-12-18|
|18|2019-07-14|2019-10-18|
|50|2019-03-20|2019-05-26|
|50|2019-01-19|2019-06-26|
temp2
|id|min_received |min_changed |
|33|2019-02-18 |2019-09-18 |
|18|2019-04-14 |2019-09-18 |
|50|2019-01-11 |2019-05-25 |
The 'temp' table shows users who received a request for an activity. A user can make multiple requests, so the received column has multiple dates showing when the requests were received. The 'changed' column shows when the status was changed. There are also multiple values for it.
The other table, temp2, shows the min dates for received and changed. I need to count the total requests per user within the range of values in temp2.
The expected result should look like this. The third row of id 33 should not be selected because the received date is after the changed date.
|id|total_requests_sent|
|33|2 |
|18|1 |
|50|2 |
I tried creating two CTEs for the MIN date values and joining them with the original table.

I may be really over-simplifying your task, but wouldn't something like this work?
select
t.id, count (*) as total_requests_sent
from
temp t
join temp2 t2 on
t.id = t2.id
where
t.received between t2.min_received and t2.min_changed
group by
t.id
I believe the output will match your example on the use case you listed, but with a limited dataset it's hard to be sure.
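To check that claim against the sample rows, here is a small sketch that loads them into SQLite through Python's sqlite3 module (a stand-in database, not necessarily the asker's) and runs the same join. The table names requests and bounds are hypothetical replacements for temp and temp2, chosen to avoid any clash with SQLite's TEMP keyword. Dates are stored as ISO strings, which compare correctly as text.

```python
import sqlite3

# Rows of the question's temp table: (id, received, changed).
REQUESTS = [
    (33, "2019-02-18", "2019-11-18"),
    (33, "2019-08-02", "2019-09-18"),
    (33, "2019-12-27", "2019-12-18"),
    (18, "2019-07-14", "2019-10-18"),
    (50, "2019-03-20", "2019-05-26"),
    (50, "2019-01-19", "2019-06-26"),
]
# Rows of the question's temp2 table: (id, min_received, min_changed).
BOUNDS = [
    (33, "2019-02-18", "2019-09-18"),
    (18, "2019-04-14", "2019-09-18"),
    (50, "2019-01-11", "2019-05-25"),
]

QUERY = """
SELECT t.id, count(*) AS total_requests_sent
FROM requests t
JOIN bounds t2 ON t.id = t2.id
WHERE t.received BETWEEN t2.min_received AND t2.min_changed
GROUP BY t.id
ORDER BY t.id
"""


def count_requests_in_range(requests, bounds):
    """Count requests per id whose received date lies inside the id's range."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE requests (id INTEGER, received TEXT, changed TEXT)")
    con.execute("CREATE TABLE bounds (id INTEGER, min_received TEXT, min_changed TEXT)")
    con.executemany("INSERT INTO requests VALUES (?, ?, ?)", requests)
    con.executemany("INSERT INTO bounds VALUES (?, ?, ?)", bounds)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


totals = count_requests_in_range(REQUESTS, BOUNDS)
```

The counts come out as 1 for id 18 and 2 each for ids 33 and 50, the same totals as in the expected result.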

SQL index match to find duplicate data

I have the following table
Code Name Task
aa jones DC
ab dave DC
aca james IF
aca james DC
ab trevor IF
aa jones IF
ag francis DC
ag francis IF
af derek SF
af derek DC
This is a very big table, above is just a quick example.
So, I would like some help finding the code and name that have completed a IF or SF task and a DC task.
I would like it to show where one person has touched both of these tasks. The hierarchy of the tasks is; it comes in as either a SF or IF then someone will do that, then off the back of that we receive a DC task, and I want the ones where it has been completed by the same person, with the same reference number.
I am able to do this in excel with an INDEX MATCH function, but this takes up a tremendous amount of calculation time due to the size of the table.

One way to approach this is using group by with a having. This is a flexible way of expressing these types of conditions:
select code, name
from table t
group by code, name
having sum(case when task = 'DC' then 1 else 0 end) > 0 and
sum(case when task in ('IF', 'SF') then 1 else 0 end) > 0;
Each condition in the having clause counts the number of rows that meet the particular condition. The first, for instance, counts the rows that match 'DC' and takes only the code, name pairs that have at least one such match.
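A quick way to try the HAVING approach on the sample rows is sketched below, using SQLite through Python's sqlite3 module as a stand-in for whatever database holds the real table. The table name tasks is hypothetical (the word table itself is reserved in SQL).

```python
import sqlite3

# Sample rows from the question: (code, name, task).
TASKS = [
    ("aa", "jones", "DC"), ("ab", "dave", "DC"),
    ("aca", "james", "IF"), ("aca", "james", "DC"),
    ("ab", "trevor", "IF"), ("aa", "jones", "IF"),
    ("ag", "francis", "DC"), ("ag", "francis", "IF"),
    ("af", "derek", "SF"), ("af", "derek", "DC"),
]

QUERY = """
SELECT code, name
FROM tasks
GROUP BY code, name
HAVING sum(CASE WHEN task = 'DC' THEN 1 ELSE 0 END) > 0
   AND sum(CASE WHEN task IN ('IF', 'SF') THEN 1 ELSE 0 END) > 0
ORDER BY code, name
"""


def both_task_kinds(rows):
    """Return (code, name) pairs that have a DC row and an IF or SF row."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tasks (code TEXT, name TEXT, task TEXT)")
    con.executemany("INSERT INTO tasks VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


pairs = both_task_kinds(TASKS)
```

Code ab drops out because dave only has a DC row and trevor only has an IF row, so neither (code, name) pair satisfies both conditions.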

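SELECT code, name FROM YOUR_TABLE_NAME WHERE task IN ('DC', 'IF', 'SF') GROUP BY code, name HAVING COUNT(CASE WHEN task = 'DC' THEN 1 END) > 0 AND COUNT(CASE WHEN task IN ('IF', 'SF') THEN 1 END) > 0
Try this query. (A single row can never have task = 'DC' and task = 'IF' at the same time, so the two conditions have to be checked per group rather than per row.)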

Gordon Linoff's query can be simplified under the hypothesis that IF and SF are synonyms and cannot both be present for the same Code-Name pair, as the data provided by the OP suggests:
SELECT code, name
FROM table t
GROUP BY code, name
HAVING SUM(CASE WHEN task IN ('IF', 'SF', 'DC') THEN 1 ELSE 0 END) = 2;

select code,name from (select distinct code,name from table1 where task='SF' or task='IF') as temp1 inner join (select distinct code as code2,name as name2 from table1 where task='DC') as temp2 on code=code2 and name=name2;
I'm assuming that you have the table in table1. The code constructs two tables, temp1 and temp2. temp1 contains those codes and names which have been assigned SF or IF. temp2 contains those codes and names which have been assigned DC. Finally, I join the two tables together to find code-name pairs in both tables. This is faster than in Excel because the database engine probably temporarily indexes the columns being joined on.
Actually, you can do this in Excel. You sort the table by code and name, then enter the following formulas (assuming "Code" is in A1):
D2=if(and(A2=A1,B2=B1,D1),true,or(C2="IF",C2="SF"))
E2=if(and(A2=A1,B2=B1,E1),true,C2="DC")
Select these two cells, and double-click the fill-handle (the little square at the bottom right of the selection). Then, with the two columns selected, copy, and then "Paste Special..." > "Values". Then, filter (Alt-D-F-F) for the rows with values in columns D and E being both true. That is the result you want. Select these rows and copy to a new sheet if desired.
Alternatively, you can follow the SQL "group by" solution given by Gordon, so that you do not need to sort: Create two new columns like the above, but:
D1: "D"
E1: "E"
D2=if(or(C2="IF",C2="SF"),1,0)
E2=if(C2="DC",1,0)
Then, "Insert" > "PivotTable", drag "Code" and "Name" to be row labels. Drag "D" to be under Values, click on it, "Value Field Settings...", and then select "Max". Do the same for "E", and then the rows with 1 in both D and E will be the result you want.

Finding the maximum value of year difference

I have two tables here
BIODATA
ID NAME
1 A
2 B
YEAR
ID JOIN YEAR GRADUATE YEAR
1 1990 1991
2 1990 1993
I already use
select
NAME,
max(year(JOIN_YEAR) - year(GRADUATE_YEAR)) as MAX
from
DATA_DIRI
right join DATA_KARTU
ON BIODATA.ID = YEAR.ID;
but the result became:
+--------+------+
| NAME | MAX |
+--------+------+
| A | 3 |
+--------+------+
I have already tried a lot of different kinds of joins, but I still can't work out how to get the NAME to be "B". Can anyone help me? Thanks a lot in advance.

If you use an aggregate and a non-aggregate in the selection set at once, then the row used for the non-aggregate field is essentially picked at random.
Basically, how max works is this - it gathers all rows for each group by query (if there is no group by, all of them), calculates the max and puts that in the result.
But since you also put in a non-aggregate field, it needs a value for that - so what SQL does is just pick a random row. You might think 'well, why doesn't it pick the same row max did?' but what if you used avg or count? These have no row associated with it, so the best it can do is pick randomly. This is why this behaviour exists in general.
What you need to do is use a subquery. Something like select d1.id from data_diri d1 where d1.graduate_year - d1.join_year = (select max(d2.graduate_year - d2.join_year) from data_diri d2)
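A sketch of the subquery idea with the two tables from the question, run on SQLite through Python's sqlite3 module (a stand-in, since the question looks like MySQL). The table names biodata and years and the column names join_year and graduate_year are assumptions based on the table listing in the question; the years are stored as plain integers.

```python
import sqlite3

BIODATA = [(1, "A"), (2, "B")]          # (id, name)
YEARS = [(1, 1990, 1991), (2, 1990, 1993)]  # (id, join_year, graduate_year)

# The subquery finds the largest difference; the outer query then
# returns the row(s) that actually have that difference.
QUERY = """
SELECT b.name, y.graduate_year - y.join_year AS year_diff
FROM biodata b
JOIN years y ON b.id = y.id
WHERE y.graduate_year - y.join_year =
      (SELECT max(y2.graduate_year - y2.join_year) FROM years y2)
"""


def longest_study(biodata, years):
    """Return (name, difference) for the row(s) with the maximum difference."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE biodata (id INTEGER, name TEXT)")
    con.execute("CREATE TABLE years (id INTEGER, join_year INTEGER, graduate_year INTEGER)")
    con.executemany("INSERT INTO biodata VALUES (?, ?)", biodata)
    con.executemany("INSERT INTO years VALUES (?, ?, ?)", years)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


longest = longest_study(BIODATA, YEARS)
```

Because the name now comes from the same row as the maximum, the result is B with a difference of 3 rather than an arbitrary name.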

How to group by a column

Hi, I know how to use the GROUP BY clause in SQL. I am not sure how to explain this, so I'll draw some charts. Here is my original data:
Name Location
----------------------
user1 1
user1 9
user1 3
user2 1
user2 10
user3 97
Here is the output I need
Name Location
----------------------
user1 1
      9
      3
user2 1
      10
user3 97
Is this even possible?

The normal method for this is to handle it in the presentation layer, not the database layer.
Reasons:
The Name field is a property of that data row
If you leave the Name out, how do you know what Location goes with which name?
You are implicitly relying on the order of the data, which in SQL is a very bad practice (since there is no inherent ordering to the returned data)
Any solution will need to involve a cursor or a loop, which is not what SQL is optimized for - it likes working in SETS not on individual rows
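Handling this in the presentation layer can be as small as the sketch below. It assumes the rows arrive from the database already ordered by name (for example through an ORDER BY) as (name, location) tuples; the function name is made up for illustration.

```python
def blank_repeated_names(rows):
    """Replace a name with an empty string when it repeats the previous row's name.

    rows must already be ordered so that equal names are adjacent.
    """
    formatted = []
    previous_name = None
    for name, location in rows:
        shown = "" if name == previous_name else name
        formatted.append((shown, location))
        previous_name = name
    return formatted


rows = [
    ("user1", 1), ("user1", 9), ("user1", 3),
    ("user2", 1), ("user2", 10),
    ("user3", 97),
]
report = blank_repeated_names(rows)

# Print the report in two aligned columns.
for name, location in report:
    print(f"{name:<8}{location}")
```

The database keeps returning complete rows, and only the display code decides when a name is left blank.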

Hope this helps
SELECT A.FINAL_NAME, A.LOCATION
FROM (SELECT DISTINCT DECODE((LAG(YT.NAME, 1) OVER(ORDER BY YT.NAME)),
YT.NAME,
NULL,
YT.NAME) AS FINAL_NAME,
YT.NAME,
YT.LOCATION
FROM YOUR_TABLE_7 YT) A
As Jirka correctly pointed out, I was using the outer SELECT, DISTINCT and the raw Name column unnecessarily. My mistake was that, because I used DISTINCT, the result came back sorted like this:
1 1
2 user2 1
3 user3 97
4 user1 1
5 3
6 9
7 10
I wanted to avoid output like this, so I added the raw name and the outer SELECT.
However, removing the DISTINCT solves the problem, so this much is enough:
SELECT DECODE((LAG(YT.NAME, 1) OVER(ORDER BY YT.NAME)),
YT.NAME,
NULL,
YT.NAME) AS FINAL_NAME,
YT.LOCATION
FROM YOUR_TABLE_7 YT
Thanks Jirka
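DECODE is Oracle-specific. A rough equivalent with a standard CASE expression is sketched below on SQLite through Python's sqlite3 module (a stand-in so the example can run). It assumes an extra id column, which is not in the question, to make the row order deterministic; the table name your_table is hypothetical.

```python
import sqlite3

# (id, name, location); the id column is an assumption added for ordering.
ROWS = [
    (1, "user1", 1), (2, "user1", 9), (3, "user1", 3),
    (4, "user2", 1), (5, "user2", 10),
    (6, "user3", 97),
]

# lag() returns the name of the previous row; when it equals the current
# name, the name is suppressed. On the first row lag() is NULL, the
# comparison is not true, and the ELSE branch keeps the name.
QUERY = """
SELECT CASE WHEN lag(name) OVER (ORDER BY name, id) = name
            THEN NULL ELSE name END AS final_name,
       location
FROM your_table
ORDER BY name, id
"""


def suppress_repeats(rows):
    """Return (final_name, location) with repeated names replaced by None."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE your_table (id INTEGER, name TEXT, location INTEGER)")
    con.executemany("INSERT INTO your_table VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


suppressed = suppress_repeats(ROWS)
```

Using the same columns in the window's ORDER BY and in the final ORDER BY keeps the blanked rows directly under the row that shows the name.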

If you're using straight SQL*Plus to make your report (don't laugh, you can do some pretty cool stuff with it), you can do this with the BREAK command:
SQL> break on name
SQL> WITH q AS (
SELECT 'user1' NAME, 1 LOCATION FROM dual
UNION ALL
SELECT 'user1', 9 FROM dual
UNION ALL
SELECT 'user1', 3 FROM dual
UNION ALL
SELECT 'user2', 1 FROM dual
UNION ALL
SELECT 'user2', 10 FROM dual
UNION ALL
SELECT 'user3', 97 FROM dual
)
SELECT NAME,LOCATION
FROM q
ORDER BY name;
NAME    LOCATION
----- ----------
user1          1
               9
               3
user2          1
              10
user3         97
6 rows selected.
SQL>

I cannot but agree with the other commenters that this kind of problem does not look like it should ever be solved using SQL, but let us face it anyway.
SELECT
CASE WHEN MIN(preceding.id) IS NULL THEN main.name ELSE NULL END AS name,
main.location
FROM mytable main LEFT JOIN mytable preceding
ON main.name = preceding.name AND preceding.id < main.id
GROUP BY main.id, main.name, main.location, preceding.name
ORDER BY main.id
The GROUP BY clause is not responsible for the grouping job, at least not directly. In the first approximation, an outer join to the same table (the LEFT JOIN above) can be used to determine on which row a particular value occurs for the first time. This is what we are after. This assumes that there are some unique id values that make it possible to arbitrarily order all the records. (The ORDER BY clause does NOT do this; it orders the output, not the input of the whole computation, but it is still necessary to make sure that the output is presented correctly, because the remaining SQL does not imply any particular order of processing.)
As you can see, there is still a GROUP BY clause in the SQL, but with a perhaps unexpected purpose. Its job is to "undo" a side effect of the LEFT JOIN, which is duplication of all main records that have many "preceding" ( = successfully joined) records.
This is quite normal with GROUP BY. The typical effect of a GROUP BY clause is a reduction of the number of records; and impossibility to query or test columns NOT listed in the GROUP BY clause, except through aggregate functions like COUNT, MIN, MAX, or SUM. This is because these columns really represent "groups of values" due to the GROUP BY, not just specific values.
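The self-join idea can be tried out with the sketch below, which runs on SQLite through Python's sqlite3 module (a stand-in database). It assumes a unique id column on mytable, as the explanation above requires, and tests MIN(preceding.id) in the SELECT list to detect rows that have no earlier row with the same name.

```python
import sqlite3

# (id, name, location); id is the unique ordering column assumed above.
ROWS = [
    (1, "user1", 1), (2, "user1", 9), (3, "user1", 3),
    (4, "user2", 1), (5, "user2", 10),
    (6, "user3", 97),
]

QUERY = """
SELECT CASE WHEN MIN(preceding.id) IS NULL THEN main.name ELSE NULL END AS name,
       main.location
FROM mytable main
LEFT JOIN mytable preceding
       ON main.name = preceding.name AND preceding.id < main.id
GROUP BY main.id, main.name, main.location   -- collapses the joined duplicates
ORDER BY main.id
"""


def first_occurrence_only(rows):
    """Show the name only on the first row of each run of equal names."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (id INTEGER, name TEXT, location INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


first_only = first_occurrence_only(ROWS)
```

Row 3 joins to two preceding rows (ids 1 and 2), and the GROUP BY folds them back into a single output row, which is the "undo" effect described above.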

If you are using SQL*Plus, use the BREAK command. In this case, break on NAME.
If you are using another reporting tool, you may be able to compare the "name" field to the previous record and suppress printing when they are equal.

If you use GROUP BY, output rows are sorted according to the GROUP BY columns as if you had an ORDER BY for the same columns. To avoid the overhead of sorting that GROUP BY produces, add ORDER BY NULL:
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
Relying on implicit GROUP BY sorting in MySQL 5.6 is deprecated. To achieve a specific sort order of grouped results, it is preferable to use an explicit ORDER BY clause. GROUP BY sorting is a MySQL extension that may change in a future release; for example, to make it possible for the optimizer to order groupings in whatever manner it deems most efficient and to avoid the sorting overhead.
For full information - http://academy.comingweek.com/sql-groupby-clause/

SQL GROUP BY STATEMENT
The SQL GROUP BY clause is used together with the SELECT statement to arrange identical data into groups.
Syntax:
SELECT column_nm, aggregate_function(column_nm) FROM table_nm WHERE column_nm operator value GROUP BY column_nm;
Example:
To understand the GROUP BY clause, refer to the sample database. The table below shows the fields of the "orders" table:
|EMPORD_ID|employee1ID|customerID|shippers_ID|
The table below shows the fields of the "shipper" table:
| shippers_ID| shippers_Name |
The table below shows the fields of the "table_emp1" table:
| employee1ID| first1_nm | last1_nm |
Example:
To find the number of orders sent by each shipper:
SELECT shipper.shippers_Name, COUNT(orders.EMPORD_ID) AS No_of_orders FROM orders LEFT JOIN shipper ON orders.shippers_ID = shipper.shippers_ID GROUP BY shippers_Name;
| shippers_Name | No_of_orders |
Example:
To use the GROUP BY statement on more than one column:
SELECT shipper.shippers_Name, table_emp1.last1_nm, COUNT(orders.EMPORD_ID) AS No_of_orders FROM ((orders INNER JOIN shipper ON orders.shippers_ID=shipper.shippers_ID) INNER JOIN table_emp1 ON orders.employee1ID = table_emp1.employee1ID)
GROUP BY shippers_Name, last1_nm;
| shippers_Name | last1_nm |No_of_orders |
For more clarification, refer to my link:
http://academy.comingweek.com/sql-groupby-clause/

MySQL: Getting highest score for a user

I have the following table (highscores),
id gameid userid name score date
1 38 2345 A 100 2009-07-23 16:45:01
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
5 38 2345 A 50 2009-07-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
7 32 2345 A 100 2009-07-20 16:45:01
Now in the above structure, a user can play a game multiple times but I want to display the "Games Played" by a specific user. So in games played section I can't display multiple games. So the concept should be like if a user played a game 3 times then the game with highest score should be displayed out of all.
I want result data like:
id gameid userid name score date
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
I tried following query but its not giving me the correct result:
SELECT id,
gameid,
userid,
date,
MAX(score) AS score
FROM highscores
WHERE userid='2345'
GROUP BY gameid
Please tell me what will be the query for this?
Thanks

Requirement is a bit vague/confusing but would something like this satisfy the need ?
(purposely added various aggregates that may be of interest).
SELECT gameid,
MIN(date) AS FirstTime,
MAX(date) AS LastTime,
MAX(score) AS TOPscore,
COUNT(*) AS NbOfTimesPlayed
FROM highscores
WHERE userid='2345'
GROUP BY gameid
-- ORDER BY COUNT(*) DESC -- for ex. to have games played most at top
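To see what these aggregates return for the sample rows, here is a sketch that runs the query on SQLite through Python's sqlite3 module (a stand-in for MySQL, used only so the example is runnable). An ORDER BY gameid is added so the output order is predictable, and the user id is passed as a parameter.

```python
import sqlite3

# Sample rows from the question: (id, gameid, userid, name, score, date).
HIGHSCORES = [
    (1, 38, 2345, "A", 100, "2009-07-23 16:45:01"),
    (2, 39, 2345, "A", 500, "2009-07-20 16:45:01"),
    (3, 31, 2345, "A", 100, "2009-07-20 16:45:01"),
    (4, 38, 2345, "A", 200, "2009-10-20 16:45:01"),
    (5, 38, 2345, "A", 50, "2009-07-20 16:45:01"),
    (6, 32, 2345, "A", 120, "2009-07-20 16:45:01"),
    (7, 32, 2345, "A", 100, "2009-07-20 16:45:01"),
]

QUERY = """
SELECT gameid,
       MIN(date)  AS FirstTime,
       MAX(date)  AS LastTime,
       MAX(score) AS TOPscore,
       COUNT(*)   AS NbOfTimesPlayed
FROM highscores
WHERE userid = ?
GROUP BY gameid
ORDER BY gameid
"""


def game_summary(rows, userid):
    """Return one summary row per game played by the given user."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE highscores "
        "(id INTEGER, gameid INTEGER, userid INTEGER, name TEXT, score INTEGER, date TEXT)"
    )
    con.executemany("INSERT INTO highscores VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY, (userid,)).fetchall()
    con.close()
    return result


summary = game_summary(HIGHSCORES, 2345)
```

Every selected column is either the grouping column or an aggregate, which is why this query is valid while adding the bare id column would not be.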
Edit: New question about adding the id column to the SELECT list
The short answer is: "No, id cannot be added, not within this particular construct". (Read further to see why.) However, if the intent is to have the id of the game with the highest score, the query can be modified, using a sub-query, to achieve that.
As explained by Alex M on this page, all the column names referenced in the SELECT list which are not used in the context of an aggregate function (MAX, MIN, AVG, COUNT and the like) MUST be included in the GROUP BY clause. The reason for this rule of the SQL language is simply that in gathering the info for the results list, SQL may encounter multiple values for such a column (listed in SELECT but not GROUP BY) and would then not know how to deal with it; rather than doing anything, possibly useful but possibly silly as well, with these extra rows/values, the SQL standard dictates an error message, so that the user can modify the query and express his/her goals explicitly.
In our specific case, we could add the id in the SELECT and also add it to the GROUP BY list, but in doing so the grouping upon which the aggregation takes place would be different: the results list would include as many rows as we have id + gameid combinations, and the aggregate values for each of these rows would be based only on the records from the table where the id and the gameid have the corresponding values (assuming id is the PK of the table, we'd get a single row per aggregation, making the MAX() and such quite meaningless).
The way to include the id (and possibly other columns) corresponding to the game with the top score is with a sub-query. The idea is that the sub-query selects the game with the TOP score (within a given group by), and the main query SELECTs any column of these rows, even when the fields weren't (couldn't be) in the sub-query's group-by construct. BTW, do give credit on this page to rexem for showing this type of query first.
SELECT H.id,
H.gameid,
H.userid,
H.name,
H.score,
H.date
FROM highscores H
JOIN (
SELECT H2.gameid, H2.userid, MAX(H2.score) MaxScoreByGameUser
FROM highscores H2
GROUP BY H2.gameid, H2.userid
) AS M
ON M.gameid = H.gameid
AND M.userid = H.userid
AND M.MaxScoreByGameUser = H.score
WHERE H.userid='2345'
A few important remarks about the query above
Duplicates: if the user reached the same hi-score several times in one game, the query will produce that many rows for that game.
GROUP BY of the sub-query may need to change for different uses of the query. If rather than searching for the game's hi-score on a per user basis, we wanted the absolute hi-score, we would need to exclude userid from the GROUP BY (that's why I named the alias of the MAX with a long, explicit name)
The userid = '2345' may be added in the [now absent] WHERE clause of the sub-query, for efficiency purposes (unless MySQL's optimizer is very smart, currently all hi-scores for all game+user combinations get calculated, whereby we only need these for user '2345'); down side duplication; solution; variables.
There are several ways to deal with the issues mentioned above, but these seem to be out of scope for a [now rather lengthy] explanation about the GROUP BY constructs.
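One possible way to handle the duplicates issue, sketched here as an illustration and not as part of the original answer, is a window function: row_number() numbers the rows of each game from best to worst, and only row 1 is kept. Window functions need MySQL 8.0 or later; the sketch runs on SQLite through Python's sqlite3 module as a stand-in. Breaking ties by the latest date and then the highest id is an assumption about what "best" should mean.

```python
import sqlite3

# Sample rows from the question: (id, gameid, userid, name, score, date).
HIGHSCORES = [
    (1, 38, 2345, "A", 100, "2009-07-23 16:45:01"),
    (2, 39, 2345, "A", 500, "2009-07-20 16:45:01"),
    (3, 31, 2345, "A", 100, "2009-07-20 16:45:01"),
    (4, 38, 2345, "A", 200, "2009-10-20 16:45:01"),
    (5, 38, 2345, "A", 50, "2009-07-20 16:45:01"),
    (6, 32, 2345, "A", 120, "2009-07-20 16:45:01"),
    (7, 32, 2345, "A", 100, "2009-07-20 16:45:01"),
]

QUERY = """
SELECT id, gameid, userid, name, score, date
FROM (
    SELECT h.*,
           row_number() OVER (
               PARTITION BY gameid, userid
               ORDER BY score DESC, date DESC, id DESC   -- tie-break is an assumption
           ) AS rn
    FROM highscores h
    WHERE userid = ?
) ranked
WHERE rn = 1
ORDER BY id
"""


def top_score_rows(rows, userid):
    """Return exactly one row per game: the user's best score in that game."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE highscores "
        "(id INTEGER, gameid INTEGER, userid INTEGER, name TEXT, score INTEGER, date TEXT)"
    )
    con.executemany("INSERT INTO highscores VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY, (userid,)).fetchall()
    con.close()
    return result


best = top_score_rows(HIGHSCORES, 2345)
```

On the sample data this returns the rows with ids 2, 3, 4 and 6. Filtering on userid inside the sub-query also addresses the efficiency remark, since rows for other users are never ranked.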

Every field you have in your SELECT (when a GROUP BY clause is present) must be either one of the fields in the GROUP BY clause, or else a group function such as MAX, SUM, AVG, etc. In your code, userid is technically violating that but in a pretty harmless fashion (you could make your code technically SQL standard compliant with a GROUP BY gameid, userid); fields id and date are in more serious violation - there will be many ids and dates within one GROUP BY set, and you're not telling how to make a single value out of that set (MySQL picks a more-or-less random one; stricter SQL engines might more helpfully give you an error).
I know you want the id and date corresponding to the maximum score for a given grouping, but that's not explicit in your code. You'll need a subselect or a self-join to make it explicit!

Use:
SELECT t.id,
t.gameid,
t.userid,
t.name,
t.score,
t.date
FROM HIGHSCORES t
JOIN (SELECT hs.gameid,
hs.userid,
MAX(hs.score) AS max_score
FROM HIGHSCORES hs
GROUP BY hs.gameid, hs.userid) mhs ON mhs.gameid = t.gameid
AND mhs.userid = t.userid
AND mhs.max_score = t.score
WHERE t.userid = '2345'
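The join against the per-game maximum can be checked on the sample rows with the sketch below. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL, passes the user id as a parameter, and adds an ORDER BY t.id so the output order is predictable.

```python
import sqlite3

# Sample rows from the question: (id, gameid, userid, name, score, date).
HIGHSCORES = [
    (1, 38, 2345, "A", 100, "2009-07-23 16:45:01"),
    (2, 39, 2345, "A", 500, "2009-07-20 16:45:01"),
    (3, 31, 2345, "A", 100, "2009-07-20 16:45:01"),
    (4, 38, 2345, "A", 200, "2009-10-20 16:45:01"),
    (5, 38, 2345, "A", 50, "2009-07-20 16:45:01"),
    (6, 32, 2345, "A", 120, "2009-07-20 16:45:01"),
    (7, 32, 2345, "A", 100, "2009-07-20 16:45:01"),
]

QUERY = """
SELECT t.id, t.gameid, t.userid, t.name, t.score, t.date
FROM highscores t
JOIN (SELECT hs.gameid, hs.userid, MAX(hs.score) AS max_score
      FROM highscores hs
      GROUP BY hs.gameid, hs.userid) mhs
  ON mhs.gameid = t.gameid
 AND mhs.userid = t.userid
 AND mhs.max_score = t.score
WHERE t.userid = ?
ORDER BY t.id
"""


def highest_scores(rows, userid):
    """Return the full row(s) holding the user's maximum score for each game."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE highscores "
        "(id INTEGER, gameid INTEGER, userid INTEGER, name TEXT, score INTEGER, date TEXT)"
    )
    con.executemany("INSERT INTO highscores VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY, (userid,)).fetchall()
    con.close()
    return result


top_rows = highest_scores(HIGHSCORES, 2345)
```

The rows returned have ids 2, 3, 4 and 6, the same rows as in the desired result from the question. If the same top score occurs twice for one game, both rows come back, because the join matches on the score value.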
