SQL/Postgres - collapse every N rows into 1 based on row position in group - sql

I have a set of ordered results from a Postgres table, where every group of 4 rows represents a set of related data. I want to process this set of results further, so that every group of 4 rows is collapsed into 1 row with aliased column names, where the value for each column is based on that row's position in the group. I'm close, but I can't quite get the query right (nor am I confident that I'm approaching this in the optimal manner). Here's the scenario:
I am collecting survey results - each survey has 4 questions, but each answer is stored in a separate row in the database. However, they are associated with each other by a submission event_id, and the results are guaranteed to be returned in a fixed order. A set of survey_results will look something like:
event_id | answer
----------------------------
a | 10
a | foo
a | 9
a | bar
b | 2
b | baz
b | 4
b | zip
What I would like to be able to do is query this result so that the final output comes out with each set of 4 results on their own line, with aliased column names.
event_id | score_1 | reason_1 | score_2 | reason_2
----------------------------------------------------------
a | 10 | foo | 9 | bar
b | 2 | baz | 4 | zip
The closest that I've been able to get is
SELECT survey_answers.event_id,
(SELECT survey_answers.answer FROM survey_answers FETCH NEXT 1 ROWS ONLY) AS score_1,
(SELECT survey_answers.answer FROM survey_answers OFFSET 1 ROWS FETCH NEXT 1 ROWS ONLY) AS reason_1,
(SELECT survey_answers.answer FROM survey_answers OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY) AS score_2,
(SELECT survey_answers.answer FROM survey_answers OFFSET 3 ROWS FETCH NEXT 1 ROWS ONLY) AS reason_2
FROM survey_answers
GROUP BY survey_answers.event_id
But this, understandably, returns the correct number of rows, but with the same values (other than event_id):
event_id | score_1 | reason_1 | score_2 | reason_2
----------------------------------------------------------
a | 10 | foo | 9 | bar
b | 10 | foo | 9 | bar
How can I structure my query so that it applies the OFFSET/FETCH behaviors every batch of 4 rows, or, maybe more accurately, within every unique set of event_ids?

demo: db<>fiddle
First of all, this looks like a very bad design:
There is no guaranteed order! Databases store rows in no particular order and may return them in any order unless you specify one. You really need an order column. In this small case it might work by accident.
You should have two columns, one for the score and one for the reason. Mixing types in one column is not a good idea.
Nevertheless, for this simple and short example this could be a solution (remember that this is not recommended for production tables):
WITH data AS (
SELECT
*,
row_number() OVER (PARTITION BY event_id) -- 1
FROM
survey_results
)
SELECT
event_id,
MAX(CASE WHEN row_number = 1 THEN answer END) AS score_1, -- 2
MAX(CASE WHEN row_number = 2 THEN answer END) AS reason_1,
MAX(CASE WHEN row_number = 3 THEN answer END) AS score_2,
MAX(CASE WHEN row_number = 4 THEN answer END) AS reason_2
FROM
data
GROUP BY event_id
1. The row_number() window function adds a row count within each event_id, in this case from 1 to 4. This can be used to identify the type of each answer (see the intermediate step in the fiddle). In production code you should use an order column to make the numbering deterministic; the window function would then look like PARTITION BY event_id ORDER BY order_column.
2. This is a simple pivot on event_id and the type id (row_number), which does exactly what you expect.

You need a column that specifies the ordering. In your case, that should probably be a serial column, which is guaranteed to be increasing for each insert. I would call such a column survey_result_id.
With such a column, you can do:
select event_id,
max(case when seqnum = 1 then answer end) as score_1,
max(case when seqnum = 2 then answer end) as reason_1,
max(case when seqnum = 3 then answer end) as score_2,
max(case when seqnum = 4 then answer end) as reason_2
from (select sr.*,
row_number() over (partition by event_id order by survey_result_id) as seqnum
from survey_results sr
) sr
group by event_id;
Without such a column, you cannot reliably do what you want, because SQL tables represent unordered sets.
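The approach above can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example is self-contained; it needs SQLite 3.25 or newer for window functions). The survey_result_id column is the hypothetical ordering column suggested above, and the sample rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey_results (
        survey_result_id INTEGER PRIMARY KEY,  -- hypothetical ordering column
        event_id TEXT,
        answer   TEXT
    );
    -- ids are assigned 1..8 in insertion order
    INSERT INTO survey_results (event_id, answer) VALUES
        ('a', '10'), ('a', 'foo'), ('a', '9'), ('a', 'bar'),
        ('b', '2'),  ('b', 'baz'), ('b', '4'), ('b', 'zip');
""")

# Number the rows inside each event_id, then pivot with conditional aggregation.
rows = conn.execute("""
    SELECT event_id,
           MAX(CASE WHEN seqnum = 1 THEN answer END) AS score_1,
           MAX(CASE WHEN seqnum = 2 THEN answer END) AS reason_1,
           MAX(CASE WHEN seqnum = 3 THEN answer END) AS score_2,
           MAX(CASE WHEN seqnum = 4 THEN answer END) AS reason_2
    FROM (SELECT sr.*,
                 ROW_NUMBER() OVER (PARTITION BY event_id
                                    ORDER BY survey_result_id) AS seqnum
          FROM survey_results sr) AS sr
    GROUP BY event_id
    ORDER BY event_id
""").fetchall()

for row in rows:
    print(row)
```

Each CASE expression yields a non-NULL value for exactly one row per group, so MAX simply picks that value.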

Related

Remove Duplicates Based Off of Two Columns in PostgreSQL

So let's say I have a table named Class with the following fields: userid, time, and score. The table looks like this:
+--------+------------+-------+
| userid | time | score |
+--------+------------+-------+
| 1 | 08-20-2018 | 75 |
| 1 | 10-25-2018 | 50 |
| 1 | 02-01-2019 | 88 |
| 2 | 04-23-2019 | 98 |<remove
| 2 | 04-23-2019 | 86 |
| 3 | 06-05-2019 | 71 |<remove
| 3 | 06-05-2019 | 71 |
+--------+------------+-------+
However, I would like to remove records where the userid and the time are the same (since it doesn't make sense for someone to give another score on the same day). This would also take care of the records where the userid, time, and score are the same. So in this table, rows 4 and 6 should be removed.
The following query gives me a list of the duplicated records:
select userid, time
FROM class
GROUP BY userid, time
HAVING count(*)>1;
However, how do I remove the duplicates while still keeping the userid, time, and score column in the outcome?
You can use the row_number() window function to assign a number to each record in the order of score for each userid and time and then select only the rows where this number is equal to one.
SELECT userid,
time,
score
FROM (SELECT userid,
time,
score,
row_number() OVER (PARTITION BY userid,
time
ORDER BY score) rn
FROM class) x
WHERE rn = 1;
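As a rough check of this query, here is a sketch that runs it through Python's built-in sqlite3 module (an assumption; SQLite 3.25+ is used here in place of PostgreSQL). The rows are those from the question, with the dates rewritten in ISO format so that they sort naturally as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE class (userid INTEGER, time TEXT, score INTEGER);
    INSERT INTO class VALUES
        (1, '2018-08-20', 75),
        (1, '2018-10-25', 50),
        (1, '2019-02-01', 88),
        (2, '2019-04-23', 98),
        (2, '2019-04-23', 86),
        (3, '2019-06-05', 71),
        (3, '2019-06-05', 71);
""")

# Keep only the first row (lowest score) of every (userid, time) pair.
deduped = conn.execute("""
    SELECT userid, time, score
    FROM (SELECT userid, time, score,
                 ROW_NUMBER() OVER (PARTITION BY userid, time
                                    ORDER BY score) AS rn
          FROM class) x
    WHERE rn = 1
    ORDER BY userid, time
""").fetchall()

for row in deduped:
    print(row)
```

Because the ordering is by score ascending, the pair (2, 2019-04-23) keeps the score 86; ordering by score DESC would keep 98 instead.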
First, you need some criterion to distinguish between two rows that have different scores (unless you want to randomly choose between the two). E.g., you could pick the highest score (like the SATs) or the lowest.
Assuming you want the highest score per day, you can do this:
SELECT distinct on (userid, time)
userid, time, score
from class
order by userid, time, score desc
Some key things: you have to have the same columns in your distinct on in the left-most positions in your order by but the magic is in the field that comes next in the order by - it’ll pick the first row among dupes of (userid, time) when ordered by score desc.
You have a real problem with your data model. This is easy enough to fix in a select query, as the other answers suggest (I would recommend distinct on for this).
For actually deleting the rows, you can use ctid (as mentioned in a comment). The approach is:
delete from t
where exists (select 1
from t t2
where t2.userid = t.userid and t2.time = t.time and
t2.ctid < t.ctid
);
That is, delete any row where there is a smaller ctid for the userid/time combination.
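The same delete pattern can be sketched outside Postgres. The example below is only an analogy: it uses Python's built-in sqlite3 module, and SQLite's rowid stands in for ctid (an assumption; the two are different mechanisms, but both give each stored row an identifier that can be compared). Dates are written in ISO format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE class (userid INTEGER, time TEXT, score INTEGER);
    INSERT INTO class VALUES
        (1, '2018-08-20', 75),
        (1, '2018-10-25', 50),
        (1, '2019-02-01', 88),
        (2, '2019-04-23', 98),
        (2, '2019-04-23', 86),
        (3, '2019-06-05', 71),
        (3, '2019-06-05', 71);
""")

# Delete every row that has a "smaller" row with the same userid and time.
# rowid plays the role that ctid plays in the Postgres version.
conn.execute("""
    DELETE FROM class
    WHERE EXISTS (SELECT 1
                  FROM class AS c2
                  WHERE c2.userid = class.userid
                    AND c2.time   = class.time
                    AND c2.rowid  < class.rowid)
""")

remaining = conn.execute(
    "SELECT userid, time, score FROM class ORDER BY rowid"
).fetchall()

for row in remaining:
    print(row)
```

Note that this keeps whichever duplicate was stored first (here the score 98), not the one chosen by score; if the surviving row matters, the comparison has to use a meaningful column instead.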

SQL to show 1 for unique and 0 for repeat

looking for a quick solution on SQL...
I used to have a clunky formula in excel: =IF(COUNTIF($C$2:C2,C2)>1,0,COUNTIF($C$2:C2,C2)) to print 1 for unique item and 0 for a repeat.
Then I moved to =1-(C1-C2), which kind of did the job, but not accurately. Now I'm looking for SQL that could do a similar job. Here is an example of the result I need:
NUMBER UNIQUE
6573455300000 1
6573455300000 0
6573455300000 0
6573455300000 0
6573411981080 1
6573411981080 0
6573411981080 0
6573411981080 0
Does anyone know any kind of code to achieve this?
using row_number():
select
col
, [first] = case when row_number() over (partition by col order by (select 1)) > 1 then 0 else 1 end
from t
rextester demo: http://rextester.com/FWA89661
returns:
+---------------+-------+
| col | first |
+---------------+-------+
| 6573411981080 | 1 |
| 6573411981080 | 0 |
| 6573411981080 | 0 |
| 6573411981080 | 0 |
| 6573455300000 | 1 |
| 6573455300000 | 0 |
| 6573455300000 | 0 |
| 6573455300000 | 0 |
+---------------+-------+
Use window functions. In your case, you seem to want the first row and mark that, so row_number() looks like the solution:
select t.*,
(case when row_number() over (partition by number order by ?) = 1
then 1 else 0
end) as flag
from t;
The ? is for the column that specifies the ordering (which is first). If you want just one row but don't care which, then you can use order by number or order by (select null).
UNIQUE is a SQL keyword (think "unique index"), so it is a bad name for a column. That is why I changed to the generic flag, although you might prefer first_row_flag or something like that.
SELECT
[number],
case when rown = 1 then 1 else 0 end as [unique]
FROM
(
SELECT
[number], row_number() OVER(partition by [number] order by [number]) as rown
FROM
t
) a
This doesn't strictly have to be done using a subquery, but that is unlikely to make any difference to the overall performance, so it's arranged like this to help you see what is going on. If you run just the inner subquery in isolation you'll see that the most important work is done by row_number. Essentially the data is partitioned into buckets based on the value of [number], something like a group by, but without suppressing repeated values. Within each partition, every occurrence of [number] is numbered with an incrementing counter, and when a different value of [number] is encountered the numbering restarts from 1.
The order by clause is just there because SQL Server demands one, and we don't know anything else about your table. If there is something else in your data that makes one of these occurrences a better candidate to be labelled with [unique]=1, find a way to sort that row into position 1. A typical use of this pattern is "latest record", in which case the order by part would be [datecolumn] DESC.
Once you have a per-number counter that resets itself, all we need to do is use a standard case / else expression to turn it into 1 when it's 1 and 0 otherwise, to match your desired result.
select t.Number,case when t.num=1 then t.num else 0 end [Unique] from(
select Number,row_number() over (partition by number order by number) num from MyTbl)t
order by t.Number
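For readers who want to run the row_number() idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module (an assumption; the answers above target SQL Server, and SQLite 3.25+ is used only as a convenient stand-in). The table name t and the column name first_row_flag are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (number INTEGER);
    INSERT INTO t VALUES
        (6573455300000), (6573455300000), (6573455300000), (6573455300000),
        (6573411981080), (6573411981080), (6573411981080), (6573411981080);
""")

# Flag the first row of every group of equal numbers with 1, the rest with 0.
flagged = conn.execute("""
    SELECT number,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY number
                                        ORDER BY number) = 1
                THEN 1 ELSE 0
           END AS first_row_flag
    FROM t
    ORDER BY number, first_row_flag DESC
""").fetchall()

for row in flagged:
    print(row)
```

Since all rows in a partition are identical here, it does not matter which one receives the 1; with more columns, the ORDER BY inside OVER decides.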

SQL Server, complex query

I have an Azure SQL Database table which is filled by importing XML-files.
The order of the files is random so I could get something like this:
ID | Name | DateFile | IsCorrection | Period | Other data
1 | Mr. A | March, 1 | false | 3 | Foo
20 | Mr. A | March, 1 | true | 2 | Foo
13 | Mr. A | Apr, 3 | true | 2 | Foo
4 | Mr. B | Feb, 1 | false | 2 | Foo
This table is joined with another table, which is also joined with a 3rd table.
I need to get the join of these 3 tables for the person with the newest data, based on Period, DateFile and Correction.
In my above example, Id=1 is the original data for Period 3, I need this record.
But in the same file was also a correction for Period 2 (Id=20) and in the file of April, the data was corrected again (Id=13).
So for Period 3, I need Id=1, for Period 2 I need Id=13 because it has the last corrected data and I need Id=4 because it is another person.
I would like to do this in a view, but using a stored procedure would not be a problem.
I have no idea how to solve this. Any pointers will be much appreciated.
EDIT:
My datamodel is of course much more complex than this sample. DateFile and Period are DateTime types in the table. Actually Period is two DateTime columns: StartPeriod and EndPeriod.
Well, looking at your data I believe we can disregard the IsCorrection column and just pick the latest row for each user/period.
Let's start by ordering the rows, placing the latest on top:
SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER by DateFile DESC), *
And from this result you select all with row number 1:
;with numberedRows as (
SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER by DateFile DESC) as rowIndex, *
FROM YourTable -- placeholder: the imported table from the question
)
select * from numberedRows where rowIndex=1
The PARTITION BY tells ROW_NUMBER() to reset the counter whenever it encounters a change in the columns Period and Name. The ORDER BY tells ROW_NUMBER() that we want the newest row to be number 1, followed by the older rows. We only need the latest row.
The WITH declares a "common table expression", which is a kind of subquery or temporary table.
Not knowing your exact data, I might recommend something wrong, but you should be able to join the last query with your other tables to get your desired result.
Something like:
;with numberedRows as (
SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name ORDER by DateFile DESC) as rowIndex, *
FROM YourTable -- placeholder: the imported table from the question
)
select * from numberedRows a
JOIN periods b on b.empId = a.Id
JOIN msg c on b.msgId = c.Id
where a.rowIndex=1
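The "latest row per person and period" step can be checked with a small sketch. It uses Python's built-in sqlite3 module instead of Azure SQL (an assumption; SQLite 3.25+). The table name imports is hypothetical, and the year in the dates is made up, because the question only gives day and month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 'imports' is a hypothetical name for the table from the question
    CREATE TABLE imports (
        Id INTEGER, Name TEXT, DateFile TEXT, IsCorrection INTEGER, Period INTEGER
    );
    -- the year 2023 is invented; only day and month come from the question
    INSERT INTO imports VALUES
        (1,  'Mr. A', '2023-03-01', 0, 3),
        (20, 'Mr. A', '2023-03-01', 1, 2),
        (13, 'Mr. A', '2023-04-03', 1, 2),
        (4,  'Mr. B', '2023-02-01', 0, 2);
""")

# Number rows per (Period, Name), newest file first, and keep number 1.
latest_ids = [
    r[0]
    for r in conn.execute("""
        WITH numberedRows AS (
            SELECT ROW_NUMBER() OVER (PARTITION BY Period, Name
                                      ORDER BY DateFile DESC) AS rowIndex,
                   *
            FROM imports
        )
        SELECT Id FROM numberedRows WHERE rowIndex = 1 ORDER BY Id
    """)
]

print(latest_ids)
```

With this data the query keeps Id 1 (period 3), Id 13 (the last correction for period 2) and Id 4 (the other person), which is the selection the question asks for.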

update column with incremental value

i have a table like this
userid | points | position
1 | 100 | NULL
2 | 89 | NULL
3 | 107 | NULL
I need a query to update the position column, ordering by points desc. Example result:
userid | points | position
1 | 100 | 2
2 | 89 | 3
3 | 107 | 1
I would not use physical columns that depend on values in other rows, otherwise you have to update the entire table every time one row changes. Use a view or other mechanism to calculate the position on the fly.
The query to calculate "position" would look something like:
SELECT
userid,
points,
RANK() OVER (ORDER BY points DESC) AS position
FROM {table_name}
However, if you have to make it an UPDATE then you could use something like
UPDATE a
SET a.position = b.position
FROM {table_name} a
INNER JOIN
(
SELECT
userid,
RANK() OVER (ORDER BY points DESC) AS position
FROM {table_name}
) b
ON a.userid = b.userid
but keep in mind that you will need to run the update every time the table is updated, so performance may be an issue if it's a decent size table that gets updated a lot.
Also consider using DENSE_RANK() instead of RANK() if you want the position to increase by exactly 1 each time the points value changes. RANK() will do what you want, though it leaves gaps in the number sequence whenever several userids are tied on points (if that can ever happen in your spec).
Refer to this answer for the difference between them.
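The difference only shows up when there is a tie, so here is a small sketch that adds one. It runs through Python's built-in sqlite3 module (an assumption; SQLite 3.25+). The table name users and the extra row for userid 4 are hypothetical additions to the data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 'users' is a hypothetical table name
    CREATE TABLE users (userid INTEGER, points INTEGER);
    INSERT INTO users VALUES
        (1, 100),
        (2, 89),
        (3, 107),
        (4, 100);  -- hypothetical extra row that ties with userid 1
""")

# Compare the two ranking functions side by side.
ranked = conn.execute("""
    SELECT userid,
           points,
           RANK()       OVER (ORDER BY points DESC) AS rank_pos,
           DENSE_RANK() OVER (ORDER BY points DESC) AS dense_pos
    FROM users
    ORDER BY points DESC, userid
""").fetchall()

for row in ranked:
    print(row)
```

After the tie at 100 points, RANK() jumps from 2 to 4 for userid 2, while DENSE_RANK() continues with 3.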
You can do something like this:
UPDATE t
SET position = t2.position
FROM table t
JOIN (
SELECT
userid,
points,
RANK() OVER (ORDER BY points DESC) AS position
FROM table) t2 ON t2.userid = t.userid

sql create columns from group by collection

I have a table in the following form
chain |branch
________|________|
a |UK
a |US
b |ISRAEL
b |UK
b |FRANCE
b |BELGIUM
c |NIGERIA
and i would like to create a new table in the following format
chain |branch_1|branch_2|branch_3|branch_4
________|________|________|________|________|
a | UK | US |--------|--------|
b | ISRAEL| UK | FRANCE |BELGIUM |
c | NIGERIA|--------|--------|--------|
For further clarification, imagine that you can do a group by (chain) where the aggregate function is the identity so that
group_1->(element1,element2,element3,..,elementM)
group_2->(element1,element2,element3,..,elementN)
...
group_X->(element1,element2,element3,..,elementZ)
so a new table will be created which will have
R+K columns where R are the number of columns that we group by (in our case that is the column 'chain' so R=1) and K is the max count of the groups (in our case that is four, corresponding to chain 'b')
I am sure that this must be a common question, so my apologies if this been answered before, but i could not find anything.
EDIT:
THIS IS NOT A PIVOT TABLE
A pivot table in that case would be
chain |UK |US |ISRAEL |FRANCE |BELGIUM |NIGERIA |
________|________|________|________|________|________|________|
____a___|____1___|____1___|____0___|____0___|____0___|____0___|
____b___|____1___|____0___|____1___|____1___|____1___|____0___|
____c___|____0___|____0___|____0___|____0___|____0___|____1___|
Thanks!
You can do this with conditional aggregation and row_number():
select chain,
max(case when seqnum = 1 then branch end) as branch_1,
max(case when seqnum = 2 then branch end) as branch_2,
max(case when seqnum = 3 then branch end) as branch_3,
max(case when seqnum = 4 then branch end) as branch_4
from (select t.*,
row_number() over (partition by chain order by branch) as seqnum
from table t
) t
group by chain;
Note: Your table doesn't have a column specifying the ordering of the rows. SQL tables represent unordered sets. Without such a column, there is no concept of one row being before or after another. So, this version orders by the branch name. You can order by whatever you like by changing the order by clause for row_number().
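To see what ordering by branch name means for the output, here is a sketch of the same query run through Python's built-in sqlite3 module (an assumption; SQLite 3.25+). The table name branches is hypothetical; the rows are those from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 'branches' is a hypothetical table name
    CREATE TABLE branches (chain TEXT, branch TEXT);
    INSERT INTO branches VALUES
        ('a', 'UK'), ('a', 'US'),
        ('b', 'ISRAEL'), ('b', 'UK'), ('b', 'FRANCE'), ('b', 'BELGIUM'),
        ('c', 'NIGERIA');
""")

# Number the branches of each chain alphabetically, then spread them
# over up to four columns; missing positions come back as NULL (None).
wide = conn.execute("""
    SELECT chain,
           MAX(CASE WHEN seqnum = 1 THEN branch END) AS branch_1,
           MAX(CASE WHEN seqnum = 2 THEN branch END) AS branch_2,
           MAX(CASE WHEN seqnum = 3 THEN branch END) AS branch_3,
           MAX(CASE WHEN seqnum = 4 THEN branch END) AS branch_4
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY chain
                                    ORDER BY branch) AS seqnum
          FROM branches t) t
    GROUP BY chain
    ORDER BY chain
""").fetchall()

for row in wide:
    print(row)
```

Chain b comes out as BELGIUM, FRANCE, ISRAEL, UK rather than in the order shown in the question, which illustrates why an explicit ordering column is needed to reproduce a particular sequence. The number of branch columns is fixed in the query, so a chain with more than four branches would lose the extras.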