PostgreSQL fill out nulls with previous value by category - sql

I am trying to fill in some nulls, where each null should take the previous available value for the same name (sorted by date).
So, from this table:
I need the query to output this:
Now, the idea is that Jane had no score on the second and third, so those rows should take the score from the most recent earlier date on which Jane had a score. The same goes for Jon. I am trying coalesce and range, but range is not implemented yet in Redshift. I also looked into other questions, but they don't fully apply to different categories. Any alternatives?
Thanks!

select day, name,
coalesce(score, (select t.score
from [your table] as t
where t.name = [your table].name and t.day < [your table].day and t.score is not null
order by t.day desc limit 1)) as score
from [your table]
The query straightforwardly implements the logic you described:
if score is not null, coalesce will return its value without executing the subquery
if score is null, the subquery will return the last available score for that name before the given date
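To sanity-check the correlated-subquery idea, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL/Redshift. The table name `data` and the sample rows are hypothetical, since the original tables were not preserved. The `score IS NOT NULL` filter is my assumption about the intent: without it, a run of consecutive nulls just picks up the previous null.

```python
import sqlite3

# Hypothetical sample rows (the tables from the question are not available).
rows = [
    ("2020-01-01", "Jane", 10),
    ("2020-01-02", "Jane", None),
    ("2020-01-03", "Jane", None),
    ("2020-01-04", "Jane", 7),
    ("2020-01-01", "Jon", 5),
    ("2020-01-02", "Jon", None),
    ("2020-01-03", "Jon", 8),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (day TEXT, name TEXT, score INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)


def fill(extra_filter):
    """Run the coalesce + correlated subquery, optionally with an extra filter."""
    sql = f"""
    SELECT d.day, d.name,
           COALESCE(d.score,
                    (SELECT t.score FROM data AS t
                     WHERE t.name = d.name AND t.day < d.day {extra_filter}
                     ORDER BY t.day DESC LIMIT 1)) AS score
    FROM data AS d
    ORDER BY d.name, d.day
    """
    return [row[2] for row in conn.execute(sql)]


with_filter = fill("AND t.score IS NOT NULL")
without_filter = fill("")
print(with_filter)     # every null replaced by the last known score
print(without_filter)  # Jane's second consecutive null stays null
```

On this sample, only the filtered version fills Jane's third day, because her second day is itself null.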

This is a "gaps and islands" problem, and the query can look like this:
SELECT
day,
name,
MAX(score) OVER (PARTITION BY name, group_id) AS score
FROM (
SELECT
*,
SUM(CASE WHEN score IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY name ORDER BY day) AS group_id
FROM data
) groups
ORDER BY name DESC, day
You can check a working demo here
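If you want to try the gaps-and-islands query locally, a rough sketch with Python's built-in sqlite3 module follows (this assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The sample rows are hypothetical, and the derived-table alias is shortened to `g`; the rest mirrors the query above.

```python
import sqlite3

# Hypothetical sample rows; the original input table was not preserved.
rows = [
    ("2020-01-01", "Jane", 10),
    ("2020-01-02", "Jane", None),
    ("2020-01-03", "Jane", None),
    ("2020-01-04", "Jane", 7),
    ("2020-01-01", "Jon", 5),
    ("2020-01-02", "Jon", None),
    ("2020-01-03", "Jon", 8),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (day TEXT, name TEXT, score INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

query = """
SELECT day, name,
       MAX(score) OVER (PARTITION BY name, group_id) AS score
FROM (
    SELECT *,
           -- running count of non-null scores per name:
           -- each non-null row starts a new group, nulls stay in the current one
           SUM(CASE WHEN score IS NULL THEN 0 ELSE 1 END)
               OVER (PARTITION BY name ORDER BY day) AS group_id
    FROM data
) AS g
ORDER BY name, day
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

Each group contains exactly one non-null score (the row that opened it), so MAX simply picks that value up for the nulls that follow.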

Related

How to find difference in date between each unique ID across multiple rows when not ordered? (PostgreSQL)

I have a table with id, order sequence and date, and I am trying to add two columns, one with a difference in date function, and another with a status function that is reliant on the value of the difference in date.
Table looks like this:
The issue is with finding the difference between dates within each unique id: for the first order sequence it should be null, and for any subsequent order sequence, say 3, it should be the 3rd date minus the 2nd date. This works with the code I have:
case
when ord_seq = 1 then null
else ord_date - lag(ord_date) over (order by id)
end as date_diff,
However, this only works when the table is already ordered. If I jumble up the order in which I insert the rows, the values come out different. I figured it might be because the lag function only takes the previous row's value, so if the previous row does not belong to the same id, or is not in chronological order, the dates won't subtract correctly.
My code looks like this at the moment:
select
id,
ord_seq,
ord_date,
case
when ord_seq = 1 then null
else ord_date - lag(ord_date) over (order by id)
end as date_diff,
case
when ord_seq = 1 then 'New'
when ord_date - lag(ord_date) over (order by id, ord_seq) between 1 and 200 then 'Retain'
when ord_date - lag(ord_date) over (order by id, ord_seq) > 200 then 'Reactivated'
end as status
from t1
order by id, ord_seq, ord_date
My db<>fiddle
Am I using the correct function here? How do I find the difference in date between one unique ID, regardless of the order of the table?
Any help would be much appreciated.
In case you want to see end table result (error is on id 'ddd', ord seq '2' and '3'):
Ordered Input:
Not Ordered Input:
When using this:
You are missing the partition by clause in your window definition. Here it is, working regardless of any table order:
select *,
ord_date - lag(ord_date) over (partition by id order by ord_seq) as date_diff
from t1;
Please note, however, that database tables have no natural order that you can rely upon and cannot be considered ordered, no matter in what sequence the records were inserted. You must explicitly specify an order by clause if you need a specific order.
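As a quick check that the partitioned lag does not depend on insertion order, here is a small sketch using Python's built-in sqlite3 module. The rows are hypothetical and deliberately inserted out of order. SQLite stores dates as text, so julianday() stands in here for PostgreSQL's direct date subtraction; that part is an adaptation, not the query above verbatim.

```python
import sqlite3

# Hypothetical orders, inserted in a jumbled order on purpose.
rows = [
    ("ddd", 3, "2021-09-01"),
    ("aaa", 1, "2021-01-01"),
    ("ddd", 1, "2021-01-10"),
    ("aaa", 2, "2021-01-31"),
    ("ddd", 2, "2021-02-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id TEXT, ord_seq INTEGER, ord_date TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)

query = """
SELECT id, ord_seq,
       -- lag() looks only at earlier rows of the same id, ordered by ord_seq
       CAST(julianday(ord_date)
            - julianday(LAG(ord_date) OVER (PARTITION BY id ORDER BY ord_seq))
            AS INTEGER) AS date_diff
FROM t1
ORDER BY id, ord_seq
"""
diffs = conn.execute(query).fetchall()
for row in diffs:
    print(row)
```

The first order of each id has no previous row in its partition, so lag returns null and no separate `when ord_seq = 1` branch is needed for date_diff.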

Selecting 1 column's value in a group after grouping by another column

How would I include the name of any one of the books that belong to that particular type in the below query?
select distinct
(select sum(ob.Balance)),
ob.BookType
from orders.OrderBooks ob
group by ob.BookType
In its current state it does what I need it to and groups books by BookType and sums their balances, as seen below.
However I need the name of any book that belongs to that BookType as part of the result.
If I select the BookName column and then group by it like below, it results in more unique entries and to an extent undoes the original grouping.
select distinct
(select sum(ob.Balance)),
ob.BookType,
ob.BookName
from orders.OrderBooks ob
group by ob.BookType, ob.BookName
;WITH x AS
(
SELECT
Balance = SUM(Balance) OVER (PARTITION BY BookType),
BookType,
BookName,
rn = ROW_NUMBER() OVER (PARTITION BY BookType ORDER BY BookName DESC)
FROM orders.OrderBooks
)
SELECT Balance, BookType, BookName
FROM x
WHERE rn = 1;
db<>fiddle
ORDER BY BookName DESC was dealer's choice. If you truly don't care which title shows up in the result, you can use any ordering you like. If you want the results to be random every time, you can use ORDER BY NEWID().
In general I like this flexibility better than the TOP (1) subquery approach, in addition to a single scan instead of an additional table access per row. But you can also do it a different way; just take min/max of the bookname, too:
SELECT Balance = SUM(Balance),
BookType,
BookName = MIN(BookName) -- or MAX()
FROM orders.OrderBooks
GROUP BY BookType;
You can see these give similar results in this db<>fiddle. Plan is simpler, too; most notably: no spools. However when you use an aggregate function against that column, it makes it harder to provide arbitrary/random results, and if you intend to add other columns pulled from the right row, you'll need to go back to the row_number solution.
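For anyone who wants to compare the two approaches side by side, here is a minimal sketch with Python's built-in sqlite3 module. The rows are hypothetical, the table name drops the schema prefix because SQLite has no schemas, and the T-SQL `alias = expression` form is rewritten with AS. The row_number version orders by BookName ascending here so that both queries pick the same title.

```python
import sqlite3

# Hypothetical books; not the data from the question.
rows = [
    ("Fiction", "Dune", 10.0),
    ("Fiction", "Emma", 5.0),
    ("Science", "Cosmos", 7.5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrderBooks (BookType TEXT, BookName TEXT, Balance REAL)")
conn.executemany("INSERT INTO OrderBooks VALUES (?, ?, ?)", rows)

# Plain aggregate: one row per BookType, MIN() picks an arbitrary-but-stable title.
agg = conn.execute("""
SELECT SUM(Balance) AS Balance, BookType, MIN(BookName) AS BookName
FROM OrderBooks
GROUP BY BookType
ORDER BY BookType
""").fetchall()

# Window version: total per BookType on every row, then keep the first row.
ranked = conn.execute("""
WITH x AS (
    SELECT SUM(Balance) OVER (PARTITION BY BookType) AS Balance,
           BookType,
           BookName,
           ROW_NUMBER() OVER (PARTITION BY BookType ORDER BY BookName) AS rn
    FROM OrderBooks
)
SELECT Balance, BookType, BookName
FROM x
WHERE rn = 1
ORDER BY BookType
""").fetchall()

print(agg)
print(ranked)
```

Both return one row per BookType with the summed balance; only the window version lets you pull further columns from the same chosen row.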
You can use a correlated subquery to get a single book name of that type. This assumes there's an ID field and you want to pull the most recent one:
select
Balance = sum(ob.Balance),
ob.BookType,
BookName = (SELECT TOP(1) ob2.BookName FROM orders.OrderBooks ob2 WHERE ob2.BookType = ob.BookType ORDER BY ob2.ID DESC)
from orders.OrderBooks ob
group by ob.BookType

how to get latest date column records when result should be filtered with unique column name in sql?

I have table as below:
I want to write a SQL query to get output as below:
The query should select all the records from the table, but when multiple records have the same Id column value, it should take only the one record with the latest Date.
E.g., here Rudolf (id 1211) is present three times in the input; in the output only the one Rudolf record with date 06-12-2010 is selected. The same applies to James.
I tried to write a query but it was not successful. So, please help me form this query in SQL.
Thanks in advance
You can partition your data by Id, order each partition by Date descending, and take the first row of each partition:
SELECT A.Id, A.Name, A.Place, A.Date FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Date DESC) AS rn
FROM [Table]
) A WHERE A.rn = 1
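Here is a small runnable sketch of the same row_number pattern, using Python's built-in sqlite3 module in place of SQL Server. The table name `Records`, the Place values, and the ISO-formatted dates are all hypothetical, since the original table was not preserved.

```python
import sqlite3

# Hypothetical rows: Rudolf appears three times, James twice, Anna once.
rows = [
    (1211, "Rudolf", "Berlin", "2010-10-27"),
    (1211, "Rudolf", "Berlin", "2010-12-06"),
    (1211, "Rudolf", "Berlin", "2010-11-27"),
    (1212, "James", "Paris", "2010-03-01"),
    (1212, "James", "Paris", "2010-05-15"),
    (1213, "Anna", "Rome", "2010-01-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Records (Id INTEGER, Name TEXT, Place TEXT, Date TEXT)")
conn.executemany("INSERT INTO Records VALUES (?, ?, ?, ?)", rows)

query = """
SELECT A.Id, A.Name, A.Place, A.Date
FROM (
    SELECT *,
           -- number the rows of each Id from newest to oldest
           ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Date DESC) AS rn
    FROM Records
) AS A
WHERE A.rn = 1
ORDER BY A.Id
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

ISO text dates sort correctly as strings; with a real date column the ordering works the same way.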
You can use WITH TIES:
select top 1 WITH TIES * from t
order by (row_number() over(partition by id order by date desc))
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=280b7412b5c0c04c208f2914b44c7ce3
As far as I can see from your example, duplicate rows differ only in Date. If that is the case, a simple GROUP BY with the MAX aggregate function will do the job for you.
SELECT Id, Name, Place, MAX(Date) AS Date
FROM [TABLE_NAME]
GROUP BY Id, Name, Place
Here is working example: http://sqlfiddle.com/#!18/7025e/2

How to count rows in SQL Server 2012?

I am trying to find whether a person (id = A3) is continuously active in a program for at least five months in a given year (2013). Any suggestion would be appreciated. My data looks as follows:
You simply use group by and a conditional expression:
select id,
(case when count(ActiveMonthYear) >= 5 then 'YES!' else 'NAW' end)
from table t
where ListOfTheMonths between '201301' and '201312'
group by id;
EDIT:
I suppose "continuously" doesn't just mean any five months. For that, there are various ways. I like the difference of row numbers approach:
select distinct id
from (select t.*,
(row_number() over (partition by id order by ListOfTheMonths) -
count(ActiveMonthYear) over (partition by id order by ListOfTheMonths)
) as grp
from table t
where ListOfTheMonths between '201301' and '201312'
) t
where ActiveMonthYear is not null
group by id, grp
having count(*) >= 5;
The difference in the subquery is constant for groups of consecutive active months. This is then used as the grouping key. The result is a list of all ids that meet this criterion. You can add a where for a particular id (do it in the subquery).
By the way, this is written using select distinct and group by. This is one of the rare cases where these two are appropriately used together. A single id could have two periods of five months in the same year. There is no reason to include that person twice in the result set.
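To see how the two queries differ, here is a rough sketch using Python's built-in sqlite3 module rather than SQL Server 2012. The table name `activity` and the rows are hypothetical, and I am assuming one row per id per month, with ActiveMonthYear left null for inactive months. A3 has five consecutive active months; A4 has seven active months but never more than four in a row.

```python
import sqlite3


def months(person, active):
    """One row per month of 2013; ActiveMonthYear is NULL when inactive."""
    return [
        (person, f"2013{m:02d}", f"2013{m:02d}" if m in active else None)
        for m in range(1, 13)
    ]


# Hypothetical data, not the table from the question.
rows = months("A3", {1, 2, 3, 4, 5}) + months("A4", {1, 2, 3, 5, 6, 7, 8})

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (id TEXT, ListOfTheMonths TEXT, ActiveMonthYear TEXT)"
)
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)

# Any five active months in the year.
any_five = conn.execute("""
SELECT id FROM activity
WHERE ListOfTheMonths BETWEEN '201301' AND '201312'
GROUP BY id
HAVING COUNT(ActiveMonthYear) >= 5
ORDER BY id
""").fetchall()

# Five consecutive active months: row number minus running count of active
# months stays constant while the person remains active.
consecutive = conn.execute("""
SELECT DISTINCT id
FROM (SELECT a.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY ListOfTheMonths)
             - COUNT(ActiveMonthYear) OVER (PARTITION BY id ORDER BY ListOfTheMonths)
               AS grp
      FROM activity AS a
      WHERE ListOfTheMonths BETWEEN '201301' AND '201312') AS s
WHERE ActiveMonthYear IS NOT NULL
GROUP BY id, grp
HAVING COUNT(*) >= 5
""").fetchall()

print(any_five)
print(consecutive)
```

The simple count accepts both ids, while the row-number difference keeps only the one with an unbroken five-month run.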

Is it possible to get a function result with columns which are not in the group by (SQL)?

I am trying to get the last registration date of a course, but I also want to know the id of that record. As MAX is an aggregate function, I must group by id, which I do not want, because the result is very different (one record per id instead of a single record).
What is the right way to write a query like this?
SELECT id, MAX(registration_date) AS registration_date
FROM courses;
Because it gives an error and I must do this to avoid it:
SELECT id, MAX(registration_date) AS registration_date
FROM courses
GROUP BY id;
And I do not want the result of the last one.
You could use the rank() window function for that:
SELECT id, registration_date
FROM (SELECT id, registration_date, RANK() OVER (ORDER BY registration_date DESC) AS rk
FROM courses) AS ranked
WHERE rk = 1
One method is to use a sub query like this:
select *
from [dbo].[Courses]
where registration_date =
(select max(registration_date)
from [dbo].[Courses])
However, with only a date to match on, this may return more than one record.
If possible, include more fields in the where clause to narrow it down.
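The tie caveat applies to both approaches, which a short sketch with Python's built-in sqlite3 module can illustrate. The rows are hypothetical, with two courses sharing the latest registration date. Both the rank() query and the max() subquery return both of them.

```python
import sqlite3

# Hypothetical courses; ids 2 and 3 share the latest registration date.
rows = [
    (1, "2020-01-01"),
    (2, "2020-03-01"),
    (3, "2020-03-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (id INTEGER, registration_date TEXT)")
conn.executemany("INSERT INTO courses VALUES (?, ?)", rows)

by_rank = conn.execute("""
SELECT id, registration_date
FROM (SELECT id, registration_date,
             -- rank() gives every row with the latest date the value 1
             RANK() OVER (ORDER BY registration_date DESC) AS rk
      FROM courses) AS ranked
WHERE rk = 1
ORDER BY id
""").fetchall()

by_max = conn.execute("""
SELECT id, registration_date
FROM courses
WHERE registration_date = (SELECT MAX(registration_date) FROM courses)
ORDER BY id
""").fetchall()

print(by_rank)
print(by_max)
```

If exactly one row is required, swapping rank() for row_number() with a tie-breaking column in the ORDER BY is one option.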