I am looking to create a group indicator for a query in SQL (Oracle specifically). Basically, I am looking for duplicate entries on certain columns, and while I can find those, what I also want is some kind of indicator saying which group of duplicates each row belongs to.
Below is an example of what I am looking for (checking for duplicates on Name, Zip, Phone). The rows with Name = aaa are all in the same group, the bb rows are not, and the c rows are.
Is there even a way to do this? I was thinking of something with OVER (PARTITION BY ...), but I can't think of a way to increment the counter only once per group.
+----------+---------+-----------+------------+-----------+-----------+
| Name | Zip | Phone | Amount | Duplicate | Group |
+----------+---------+-----------+------------+-----------+-----------+
| aaa | 1234 | 5555555 | 500 | X | 1 |
| aaa | 1234 | 5555555 | 285 | X | 1 |
| bb | 545 | 6666666 | 358 | | 2 |
| bb | 686 | 7777777 | 898 | | 3 |
| aaa | 1234 | 5555555 | 550 | X | 1 |
| c | 5555 | 8888888 | 234 | X | 4 |
| c | 5555 | 8888888 | 999 | X | 4 |
| c | 5555 | 8888888 | 230 | X | 4 |
+----------+---------+-----------+------------+-----------+-----------+
It looks like you can just use
(CASE WHEN COUNT(*) OVER (partition by name, zip, phone) > 1
THEN 'X'
ELSE NULL
END) duplicate,
DENSE_RANK() OVER (ORDER BY name, zip, phone) group_rank
Rows that have the same name, zip, and phone will have the same group_rank.
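Put together, a complete query might look like this (a sketch assuming the data lives in a table named t; group is a reserved word in Oracle, hence the group_rank alias):
SELECT name, zip, phone, amount,
       -- flag rows that occur more than once on (name, zip, phone)
       (CASE WHEN COUNT(*) OVER (PARTITION BY name, zip, phone) > 1
             THEN 'X'
        END) AS duplicate,
       -- identical (name, zip, phone) triples get the same rank
       DENSE_RANK() OVER (ORDER BY name, zip, phone) AS group_rank
FROM t;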
I have the following data
+--------+-----------+--------+
| UserId | Timestamp | Rating |
+--------+-----------+--------+
| 1 | 1 | 1202 |
| 2 | 1 | 1198 |
| 1 | 2 | 1204 |
| 2 | 2 | 1196 |
| 1 | 3 | 1206 |
| 2 | 3 | 1194 |
| 1 | 4 | 1198 |
| 2 | 4 | 1202 |
+--------+-----------+--------+
I am trying to find the distribution of each user's Rating, based on their latest row in the table (latest is determined by Timestamp). On the path to that, I am trying to get a list of user IDs and Ratings which would look like the following
+--------+--------+
| UserId | Rating |
+--------+--------+
| 1 | 1198 |
| 2 | 1202 |
+--------+--------+
To get there, I sorted the list on UserId and Timestamp (desc), which gives the following.
+--------+-----------+--------+
| UserId | Timestamp | Rating |
+--------+-----------+--------+
| 1 | 4 | 1198 |
| 2 | 4 | 1202 |
| 1 | 3 | 1206 |
| 2 | 3 | 1194 |
| 1 | 2 | 1204 |
| 2 | 2 | 1196 |
| 1 | 1 | 1202 |
| 2 | 1 | 1198 |
+--------+-----------+--------+
So now I just need to take the top N rows, where N is the number of players. But I can't use a LIMIT clause here, since LIMIT needs a constant expression; I want to use count(id) as the input to LIMIT, which doesn't seem to work.
Any suggestions on how I can get the data I need?
Cheers!
Andy
This should work:
SELECT test.UserId, Rating FROM test
JOIN
(select UserId, MAX(Timestamp) Timestamp FROM test GROUP BY UserId) m
ON test.UserId = m.UserId AND test.Timestamp = m.Timestamp
If you can use window functions, then you can use the following:
SELECT UserId, Rating
FROM (
  SELECT UserId, Rating,
         ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY Timestamp DESC) AS row_num
  FROM test
) m
WHERE row_num = 1
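From there, the distribution the question ultimately asks for is just a GROUP BY over the latest-row result. A minimal sketch building on the window-function variant above:
SELECT Rating, COUNT(*) AS num_users
FROM (
  SELECT UserId, Rating,
         ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY Timestamp DESC) AS row_num
  FROM test
) m
WHERE row_num = 1
GROUP BY Rating;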
I am looking to join three tables together and fill forward null values on the resulting table.
Three tables:
Table 1 (raw.fb_historical_data) - this is the main table onto which I would like to join the other two. Each row of this table is related to one or more rows in the other two tables through a combination of the columns id, clk and timestamp (mkt_id and row_id in the other tables).
+---------------------+-----+-----+--------------+
| timestamp | clk | id | some_columns |
+---------------------+-----+-----+--------------+
| 2016-06-19 06:11:13 | 123 | 126 | a |
| 2016-06-19 06:16:13 | 124 | 127 | b |
| 2016-06-19 06:21:13 | 234 | 126 | c |
| 2016-06-19 06:41:13 | 456 | 127 | d |
| ... | ... | ... | ... |
+---------------------+-----+-----+--------------+
Table 2 (raw.fb_runner_changes) - this table essentially gives price changes for a wide range of different markets
+---------------------+--------+--------+-------+
| timestamp | row_id | mkt_id | price |
+---------------------+--------+--------+-------+
| 2016-06-19 06:11:13 | 123 | 126 | 1 |
| 2016-06-19 06:21:13 | 123 | 126 | 2 |
| 2016-06-19 06:41:13 | 123 | 126 | 3 |
| 2016-06-06 18:54:06 | 124 | 127 | 1 |
| 2016-06-06 18:56:06 | 124 | 127 | 2 |
| 2016-06-06 18:57:06 | 124 | 127 | 3 |
| ... | ... | ... | ... |
+---------------------+--------+--------+-------+
Table 3 (raw.fb_runners) - a table with extra information about market changes that I would like to join
+---------------------+--------+--------+---------------+
| timestamp | row_id | mkt_id | other_columns |
+---------------------+--------+--------+---------------+
| 2016-06-19 06:15:13 | 234 | 126 | ab |
| 2016-06-19 06:31:13 | 234 | 126 | cd |
| 2016-06-19 06:56:13 | 234 | 126 | ef |
| 2016-06-06 18:54:06 | 456 | 127 | gh |
| 2016-06-06 18:56:06 | 456 | 127 | jk |
| 2016-06-06 18:57:06 | 456 | 127 | lm |
| ... | ... | ... | ... |
+---------------------+--------+--------+---------------+
Essentially what I want to do is fill NULL information forward (ordered by timestamp) while grouping by market id.
So far, I have tried to join the tables together using
SELECT *
FROM raw.fb_historical_data AS h
LEFT JOIN raw.fb_runner_changes AS rc
ON rc.row_id = h.clk
AND rc.timestamp = h.timestamp
AND rc.mkt_id = h.id
LEFT JOIN raw.fb_runners AS r
ON r.row_id = h.clk
AND r.timestamp = h.timestamp
AND r.mkt_id = h.id
This has worked as intended, though now there are nulls in the resulting dataset which I'd like to fill in with the last available value for that market.
In some other SQL dialects, forward fill could be done using the window function last_value in combination with the instruction ignore nulls.
Since this is not supported in PostgreSQL, we are using a two-step workaround.
select ts, val, val_seq,
       min(val) over (partition by val_seq) as val_fill_fw
from (select ts, val,
             -- count(val) skips nulls, so the running count only grows on
             -- non-null rows, assigning each run of rows its own group id
             count(val) over (order by ts) as val_seq
      from t
     ) t
Output:
+----+----------+---------+-------------+
| ts | val | val_seq | val_fill_fw |
+----+----------+---------+-------------+
| 1 | (null) | 0 | (null) |
| 2 | (null) | 0 | (null) |
| 3 | hello | 1 | hello |
| 4 | (null) | 1 | hello |
| 5 | (null) | 1 | hello |
| 6 | darkness | 2 | darkness |
| 7 | my | 3 | my |
| 8 | (null) | 3 | my |
| 9 | old | 4 | old |
| 10 | (null) | 4 | old |
| 11 | (null) | 4 | old |
| 12 | (null) | 4 | old |
| 13 | friend | 5 | friend |
| 14 | (null) | 5 | friend |
+----+----------+---------+-------------+
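Applied to the question, the same trick only needs the running count partitioned per market. A sketch, assuming the joined result of the query above is available as joined and the column to fill forward is price:
select timestamp, mkt_id,
       min(price) over (partition by mkt_id, price_seq) as price_filled
from (select timestamp, mkt_id, price,
             count(price) over (partition by mkt_id order by timestamp) as price_seq
      from joined
     ) j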
This seems to correctly do a 'forward fill' in Postgres. However, I am a Postgres newbie, so I would appreciate feedback if it's wrong.
DROP TABLE IF EXISTS example;
create temporary table example(id int, str text, val integer);
insert into example values
  (1, 'a', null),
  (1, null, 1),
  (2, 'b', 2),
  (2, null, null);

select * from example;

-- note: lag(..., 1) only looks one row back, so this fills a single
-- consecutive null but not a longer run of nulls
select id,
       (case when str is null
             then lag(str, 1) over (order by id)
             else str
        end) as str,
       (case when val is null
             then lag(val, 1) over (order by id)
             else val
        end) as val
from example;
I have a table like this
+----+--------+------+------+
| id | state | num | pop |
+----+--------+------+------+
| 1 | ny | 1 | 100 |
| 1 | ny | 2 | 200 |
| 1 | ny | 3 | 600 |
| 1 | ny | 6 | 400 |
| 1 | ny | 7 | 300 |
| 1 | ny | 14 | 1000 |
| 2 | nj | 3 | 250 |
+----+--------+------+------+
I want output as below (the last column is the sum for each run):
+----+-------+-----+------+------+
| id | state | num | pop  | sum  |
+----+-------+-----+------+------+
| 1  | ny    | 1   | 100  | 900  |
| 1  | ny    | 2   | 200  | 900  |
| 1  | ny    | 3   | 600  | 900  |
| 1  | ny    | 6   | 400  | 700  |
| 1  | ny    | 7   | 300  | 700  |
| 1  | ny    | 14  | 1000 | 1000 |
| 2  | nj    | 3   | 250  | 250  |
+----+-------+-----+------+------+
So if the num column contains a consecutive sequence, we have to add up the pop column for those rows. In the first 3 rows the num column has 1, 2, 3, which is in sequence, so we add the pop values 100+200+600 = 900 and display that as a new column.
I tried the code below but I am not receiving the desired output:
select id, state,num, pop,
sum(pop) over (partition by id, state order by num )
from table
If you subtract a sequence (a row number) from num, the difference is constant within each run of consecutive values. Then you can use window functions:
select t.*,
sum(pop) over (partition by state, num - seqnum) as new_population
from (select t.*,
row_number() over (partition by state order by num) as seqnum
from t
) t;
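To see why this works, here are the intermediate values for the ny rows:
+-----+--------+--------------+
| num | seqnum | num - seqnum |
+-----+--------+--------------+
| 1   | 1      | 0            |
| 2   | 2      | 0            |
| 3   | 3      | 0            |
| 6   | 4      | 2            |
| 7   | 5      | 2            |
| 14  | 6      | 8            |
+-----+--------+--------------+
Each distinct value of num - seqnum marks one run of consecutive num values, so summing pop partitioned by it yields 900, 700, and 1000.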
I have a SQL query that returns data similar to this pseudo-table:
| Name | Id1 | Id2 | Guid |
|------+-----+-----+------|
| Joe | 1 | 1 | 1123 |
| Joe | 2 | 1 | 1123 |
| Joe | 3 | 1 | 1120 |
| Jeff | 1 | 1 | 1123 |
| Moe | 3 | 42 | 1120 |
I would like to display an additional column on the output, listing the total number of records that have matching GUIDs to a given row, like this:
| Name | Id1 | Id2 | Guid | # Matching |
+------+-----+-----+------+------------+
| Joe | 1 | 1 | 1123 | 3 |
| Joe | 2 | 1 | 1123 | 3 |
| Joe | 3 | 1 | 1120 | 2 |
| Jeff | 1 | 1 | 1123 | 3 |
| Moe | 3 | 42 | 1120 | 2 |
I was able to accomplish this by joining the query with itself and doing a count. However, the query is rather large and takes a while to complete. Is there any way I can accomplish this without joining the query with itself?
You want a window function:
select t.*, count(*) over (partition by guid) as num_matching
from t;
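Since the base query is large, one option is to wrap it in a CTE so it is evaluated once instead of self-joined. A sketch, with t standing in for the existing query:
WITH base AS (
  -- the existing large query goes here
  SELECT Name, Id1, Id2, Guid
  FROM t
)
SELECT base.*,
       COUNT(*) OVER (PARTITION BY Guid) AS num_matching
FROM base;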
For a table like the one below, I need to do an aggregation such that, for each unique value in one column, I find the count of occurrences of each discrete value in another column.
input table is:
id | model | datetime   | driver | distance
---|-------|------------|--------|---------
1 | S | 04/03/2009 | john | 399
2 | X | 04/03/2009 | juliet | 244
3 | 3 | 04/03/2009 | borat | 555
4 | 3 | 03/03/2009 | john | 300
5 | X | 03/03/2009 | juliet | 200
6 | X | 03/03/2009 | borat | 500
7 | S | 24/12/2008 | borat | 600
8 | X | 01/01/2009 | borat | 700
Output required
model | john | juliet | borat
------|------|--------|------
S | 1 | 0 | 1
X | 0 | 2 | 2
3 | 1 | 0 | 1
One potential way to do this is to group by model with an aggregation like SUM(CASE WHEN driver = 'value' THEN 1 ELSE 0 END) AS value for each discrete value of the driver column, as sketched below. But the challenge is that sometimes the number of discrete values is too large (around 50 in my case), or I do not even know all the possible discrete values up front. I was wondering if there is an alternative way to do this.
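For reference, the hard-coded version described above would look something like this, assuming only the three drivers from the sample data:
SELECT model,
       SUM(CASE WHEN driver = 'john'   THEN 1 ELSE 0 END) AS john,
       SUM(CASE WHEN driver = 'juliet' THEN 1 ELSE 0 END) AS juliet,
       SUM(CASE WHEN driver = 'borat'  THEN 1 ELSE 0 END) AS borat
FROM Table1
GROUP BY model;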
The aggregation part needs a little more work. Here are the details:
First, calculate all the (driver, model) combinations.
Then use a LEFT JOIN to find which combinations don't have data.
WITH "allDrivers" as (
SELECT DISTINCT "driver"
FROM Table1
),
"allModels" as (
SELECT DISTINCT "model"
FROM Table1
),
"source" as (
SELECT d."driver", m."model"
FROM "allDrivers" d
CROSS JOIN "allModels" m
)
SELECT s."model", s."driver", COUNT(t."datetime")
FROM "source" s
LEFT JOIN table1 t
ON s."model" = t."model"
AND s."driver" = t."driver"
GROUP BY s."model", s."driver"
OUTPUT
| model | driver | count |
|-------|--------|-------|
| 3 | borat | 1 |
| 3 | john | 1 |
| 3 | juliet | 0 |
| S | borat | 1 |
| S | john | 1 |
| S | juliet | 0 |
| X | borat | 2 |
| X | john | 0 |
| X | juliet | 2 |
Then you can do the dynamic pivot over that result.
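If the set of drivers were known up front, the pivot step could be plain conditional aggregation over the counts above; for an unknown set you would build the same column list with dynamic SQL from SELECT DISTINCT "driver". A sketch of the static version, reusing the CTEs from the answer:
WITH "allDrivers" as (SELECT DISTINCT "driver" FROM Table1),
     "allModels" as (SELECT DISTINCT "model" FROM Table1),
     "source" as (
       SELECT d."driver", m."model"
       FROM "allDrivers" d CROSS JOIN "allModels" m
     ),
     "counts" as (
       SELECT s."model", s."driver", COUNT(t."datetime") as cnt
       FROM "source" s
       LEFT JOIN table1 t
         ON s."model" = t."model" AND s."driver" = t."driver"
       GROUP BY s."model", s."driver"
     )
SELECT "model",
       MAX(CASE WHEN "driver" = 'john'   THEN cnt ELSE 0 END) as john,
       MAX(CASE WHEN "driver" = 'juliet' THEN cnt ELSE 0 END) as juliet,
       MAX(CASE WHEN "driver" = 'borat'  THEN cnt ELSE 0 END) as borat
FROM "counts"
GROUP BY "model";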