how to select a value based on unique id column - sql

Please help me. I have a table with 3 columns; when I select from it, I need to duplicate the values based on the id:
Id Days Values
1 5 7
1 NULL NULL
1 NULL NULL
2 7 25
2 NULL NULL
2 8 274
2 NULL NULL
I need a result like:
Id Days Values
1 5 7
1 5 7
1 5 7
2 7 25
2 7 25
2 8 274
2 8 274

Generate a set of data with the desired repeating values (B). Then join back to the base set (A) containing the number of records to repeat. This assumes that each ID will only have one record populated. If this is not the case, you will not get the desired results.
SELECT B.ID, B.mDays AS Days, B.mValues AS "Values"  -- "Values" is a reserved word in most dialects, hence the quoting
FROM YourTable A
INNER JOIN (SELECT ID, MAX(Days) AS mDays, MAX("Values") AS mValues
            FROM YourTable
            GROUP BY ID) B
  ON A.ID = B.ID
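You can see the caveat in action against the sample data (a minimal hypothetical setup; the table name YourTable is assumed):
CREATE TABLE YourTable (Id int, Days int, "Values" int);
INSERT INTO YourTable VALUES
  (1, 5, 7), (1, NULL, NULL), (1, NULL, NULL),
  (2, 7, 25), (2, NULL, NULL), (2, 8, 274), (2, NULL, NULL);
Here the join returns (1, 5, 7) three times as desired, but returns (2, 8, 274) four times, because ID 2 has two populated records and MAX() collapses them into one.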
And due to updates in the question: this will get you close, but without a way to define grouping within IDs I can't subdivide the records into 2 and 2.
SELECT B.ID, B.Days AS Days, B."Values" AS "Values"
FROM YourTable A
INNER JOIN (SELECT DISTINCT ID, Days, "Values"
            FROM YourTable
            WHERE Days IS NOT NULL) B  -- keep only the populated rows
  ON A.ID = B.ID
 AND A.Days IS NULL
This isn't close enough either, as we still don't know how to order the rows: it assumes an order within the table, which can't be trusted. We generate a row number for each row using the ROW_NUMBER() OVER syntax, grouping (PARTITION BY) on ID and Days with an order of ID, Days (which doesn't work because of the NULL values). We then join this data set back to a distinct set of values on ID and Days to get us close, but we still need some grouping logic beyond ID that handles the NULL records and the lack of order or grouping.
WITH CTE AS (
    SELECT ID, Days, "Values",
           ROW_NUMBER() OVER (PARTITION BY ID, Days ORDER BY ID, Days) AS RN
    FROM YourTable)
SELECT *
FROM (SELECT ID, Days, "Values", MAX(RN) AS mRN
      FROM CTE GROUP BY ID, Days, "Values") A
INNER JOIN CTE B
   ON A.ID = B.ID
  AND A.Days = B.Days
  AND A.mRN <= B.RN
ORDER BY B.RN


pick all positive least numbers from data set [duplicate]

I have the below data in a table
ID AMOUNT DAYS
1 10 1
1 20 2
1 30 3
1 1 4
2 34 1
2 234 2
2 234 3
2 34 4
3 3 1
3 3 2
3 23 3
3 20 4
I want the results below: all amounts that have the least days for each ID
ID AMOUNT DAYS
1 10 1
2 34 1
3 3 1
Please suggest a SQL query to produce this output.
For your example, you can simply do:
select t.*
from t
where t.days = 1;
If 1 is not fixed, then a correlated subquery is one method:
select t.*
from t
where t.days = (select min(t2.days) from t t2 where t2.id = t.id);
Another method is aggregation:
select t.id, min(t.days) as min_days,
min(t.amount) keep (dense_rank first order by t.days asc) as min_amount
from t
group by t.id;
Of course row_number()/rank() is another alternative.
With an index on (id, days) and a large table, one of the above methods may be faster in practice.
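For instance, an index like the following (the name is illustrative) would let the correlated subquery resolve each min(days) lookup from the index alone:
CREATE INDEX t_id_days_idx ON t (id, days);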
You can use the rank() function:
select ID, Amount, Days from
(
select rank() over (partition by ID order by days) as rn,
t.*
from tab t
)
where rn = 1;
First group by id to find the min days for each id, then join back to the table:
select t.*
from tablename t inner join (
select id, min(days) days
from tablename
group by id
) g on g.id = t.id and g.days = t.days
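All of the above can be tried against a table built from the sample data, e.g. (using the tablename of the last query):
CREATE TABLE tablename (id int, amount int, days int);
INSERT INTO tablename VALUES
  (1,10,1),(1,20,2),(1,30,3),(1,1,4),
  (2,34,1),(2,234,2),(2,234,3),(2,34,4),
  (3,3,1),(3,3,2),(3,23,3),(3,20,4);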

Query to group based on the sorted table result

Below is my table
a 1
a 2
a 1
b 1
a 2
a 2
b 3
b 2
a 1
My expected output is
a 4
b 1
a 4
b 5
a 1
I want them to be grouped if they are in sequence.
If your DBMS supports window functions, you can use the row_number difference to assign the same group to consecutive equal values in a column. After assigning the groups, it is easy to sum the values for each group.
select col1, sum(col2)
from (select t.*,
             row_number() over(order by someid)
             - row_number() over(partition by col1 order by someid) as grp
      from tablename t
     ) x
group by col1, grp
order by min(someid)  -- preserve the original sequence in the output
Replace tablename, col1, col2 and someid with the appropriate table and column names. someid should be the column to be ordered by.
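To try it, here is a hypothetical setup matching the sample data (someid is added as an explicit ordering column, since rows have no inherent order):
CREATE TABLE tablename (someid int, col1 varchar(1), col2 int);
INSERT INTO tablename VALUES
  (1,'a',1),(2,'a',2),(3,'a',1),(4,'b',1),(5,'a',2),
  (6,'a',2),(7,'b',3),(8,'b',2),(9,'a',1);
For the 'a' rows the two row numbers are (1,2,3,5,6,9) and (1,2,3,4,5,6), so grp becomes (0,0,0,1,1,3), splitting 'a' into the three runs seen in the expected output.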

SQL Running Total Grouped By Limit

I am trying to determine how to group records together based on the cumulative total of the Qty column, so that the group size doesn't exceed 50. The desired group is given in the Group column, with sample data below.
Is there a way to accomplish this in SQL (specifically SQL Server 2012)?
Thank you for any assistance.
ID Qty Group
1 10 1
2 20 1
3 30 2 <- 60 greater than 50 so new group
4 40 3
5 2 3
6 3 3
7 10 4
8 25 4
9 15 4
10 5 5
You can use a recursive CTE to achieve the goal.
If a single item exceeds Qty 50, a group is still assigned to it.
DECLARE @Data TABLE (ID int identity(1,1) primary key, Qty int)
INSERT @Data VALUES (10), (20), (30), (40), (2), (3), (10), (25), (15), (5)
;WITH cte AS
(
    SELECT ID, Qty, 1 AS [Group], Qty AS RunningTotal FROM @Data WHERE ID = 1
    UNION ALL
    SELECT data.ID, data.Qty,
        -- The group limits to 50 Qty
        CASE WHEN cte.RunningTotal + data.Qty > 50 THEN cte.[Group] + 1 ELSE cte.[Group] END,
        -- Reset the running total for each new group
        data.Qty + CASE WHEN cte.RunningTotal + data.Qty > 50 THEN 0 ELSE cte.RunningTotal END
    FROM @Data data INNER JOIN cte ON data.ID = cte.ID + 1
)
SELECT ID, Qty, [Group] FROM cte
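Note that the CTE recurses once per row, and SQL Server's default recursion limit is 100; for a table with more rows you would need to lift it with a query hint on the final SELECT:
SELECT ID, Qty, [Group] FROM cte
OPTION (MAXRECURSION 0)  -- 0 removes the default limit of 100 recursions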
The following query gives you most of what you want. One more self-join of the result would compute the group sizes:
select a.ID, G, sum(b.Qty) as Total
from (
    select max(ID) as ID, G
    from (
        select a.ID, sum(b.Qty) / 50 as G
        from T as a join T as b on a.ID >= b.ID
        group by a.ID
    ) as A
    group by G
) as a join T as b on a.ID >= b.ID
group by a.ID, G
ID G Total
---------- ---------- ----------
2 0 30
3 1 60
8 2 140
10 3 160
The two important tricks:
Use a self-join with an inequality to get running totals
Use integer division to calculate group numbers.
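Concretely, with the sample data the cumulative totals are 10, 30, 60, 100, 102, 105, 115, 140, 155, 160, and integer division by 50 turns them into group numbers 0, 0, 1, 2, 2, 2, 2, 2, 3, 3. That is why the result has four groups rather than the five in the desired output: integer division packs strictly by multiples of 50 instead of resetting the total at each group boundary.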
I discuss this and other techniques on my canonical SQL page.
You need to create a stored procedure for this.
If you store the Group column in your database, then you have to maintain it while inserting a new record, by fetching the max Group value and the sum of its Qty column; if instead you want the Group column computed in the SELECT statement, then you have to code the stored procedure accordingly.
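For illustration, a minimal cursor-based sketch of such a procedure body, assuming a table T(ID int, Qty int) holding the sample rows (the names T, grp_cur and @Result are placeholders):
DECLARE @ID int, @Qty int,
        @RunningTotal int = 0,
        @Group int = 1;
DECLARE @Result TABLE (ID int, Qty int, [Group] int);
DECLARE grp_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT ID, Qty FROM T ORDER BY ID;
OPEN grp_cur;
FETCH NEXT FROM grp_cur INTO @ID, @Qty;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Start a new group when adding this row would exceed the limit.
    IF @RunningTotal + @Qty > 50
    BEGIN
        SET @Group += 1;
        SET @RunningTotal = 0;
    END;
    SET @RunningTotal += @Qty;
    INSERT INTO @Result (ID, Qty, [Group]) VALUES (@ID, @Qty, @Group);
    FETCH NEXT FROM grp_cur INTO @ID, @Qty;
END;
CLOSE grp_cur;
DEALLOCATE grp_cur;
SELECT ID, Qty, [Group] FROM @Result;
This reproduces the Group column from the question row by row, at the cost of a cursor loop.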

Detect and delete gaps in time series

I have daily time series for different companies in my dataset and work with PostgreSQL. My goal is to exclude companies whose time series are too incomplete. Therefore I want to exclude all companies which have 3 or more consecutive missing values. Furthermore I want to exclude all companies which have more than 50% missing values between their first and final date in the dataset.
We can work with the following example data:
date company value
2012-01-01 A 5
2012-01-01 B 2
2012-01-02 A NULL
2012-01-02 B 2
2012-01-02 C 4
2012-01-03 A NULL
2012-01-03 B NULL
2012-01-03 C NULL
2012-01-04 A NULL
2012-01-04 B NULL
2012-01-04 C NULL
2012-01-05 A 8
2012-01-05 B 9
2012-01-05 C 3
2012-01-06 A 8
2012-01-06 B 9
2012-01-06 C NULL
So A has to be excluded because it has a gap of three consecutive missing values, and C because it has more than 50% missing values between its first and final date.
Combining other answers in this forum I made up the following code:
Add an autoincrement primary key to identify each row
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE;
ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq');
UPDATE test SET id = NEXTVAL('id_seq');
ALTER TABLE test ADD PRIMARY KEY (id);
Detect the gaps in the time series
CREATE TABLE to_del AS WITH count3 AS
( SELECT *,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
AS cnt FROM test)
SELECT company, id FROM count3 WHERE cnt >= 3;
Delete the gaps from mytable
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
This seems to detect and delete gaps of 3 or more consecutive missing values from the time series, but the approach is very cumbersome, and I can't figure out how to additionally exclude all companies with more than 50% missing values.
Can you think of a more effective solution than mine (I am just learning to work with PostgreSQL) that also manages to exclude companies with more than 50% missing values?
I would create only one query:
DELETE FROM mytable
WHERE company in (
SELECT Company
FROM (
SELECT Company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company)
/
COUNT(*)
OVER (PARTITION BY company) As p50
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
A composite index on the (company, value) columns can help this query run at maximum speed.
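For instance (the index name is hypothetical):
CREATE INDEX mytable_company_value_idx ON mytable (company, value);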
EDIT
The above query doesn't work
I've corrected it slightly, here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things have been changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id
- an explicit cast to numeric (because the integer division would otherwise truncate to 0):
OVER (PARTITION BY company)::numeric
SELECT company, cnt, p50
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
-- WHERE cnt >= 3 OR p50 > 0.5
and now the delete query should work:
DELETE FROM mytable
WHERE company in (
SELECT company
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
For the 50% criterion, you could select all the companies for which the number of distinct dates is lower than half the number of days between the min and max dates.
I have not tested this, but it should give you an idea. I used a CTE to make it easier to read.
WITH MinMax AS
(
    SELECT company,
           DATE_PART('day', AGE(MAX(date), MIN(date))) AS calendar_days,
           COUNT(DISTINCT date) AS days
    FROM mytable
    GROUP BY company
)
SELECT company FROM MinMax
WHERE (calendar_days / 2) > days
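To turn that into a delete, the same IN (...) pattern as above can be reused; a sketch (using plain date subtraction, which for date columns yields whole days directly and avoids AGE()'s month/day split):
WITH MinMax AS
(
    SELECT company,
           MAX(date) - MIN(date) AS calendar_days,
           COUNT(DISTINCT date) AS days
    FROM mytable
    GROUP BY company
)
DELETE FROM mytable
WHERE company IN (SELECT company FROM MinMax WHERE (calendar_days / 2) > days);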

Delete null values until first value is not null

I have daily timeseries for companies in my dataset and use PostgreSQL.
For every company, all rows with NULL in column3 shall be deleted until the first NOT NULL entry in that column for the company. After that, every missing value is filled in with the last observed NOT NULL value for the company.
You can imagine the following example data:
date company column3
1 2004-01-01 A 5
2 2004-01-01 B NULL
3 2004-01-01 C NULL
4 2004-01-02 A NULL
5 2004-01-02 B 7
6 2004-01-02 C NULL
7 2004-01-03 A 6
8 2004-01-03 B 7
9 2004-01-03 C 9
10 2004-01-04 A NULL
11 2004-01-04 B NULL
12 2004-01-04 C NULL
It would be great if I managed to write a query that delivers
date company column3
1 2004-01-01 A 5
2 2004-01-02 A 5
3 2004-01-02 B 7
4 2004-01-03 A 6
5 2004-01-03 B 7
6 2004-01-03 C 9
7 2004-01-04 A 6
8 2004-01-04 B 7
9 2004-01-04 C 9
I tried:
SELECT a.date, a.company,
       COALESCE(a.column3, (SELECT b.column3 FROM mytable b
                            WHERE b.company = a.company AND b.column3 IS NOT NULL
                            ORDER BY b.company = a.company DESC LIMIT 1))
FROM mytable a;
There are two problems with the code:
1. It does not delete the records with NULL values before the first NOT NULL value, but fills in all missing values.
2. It fills them in with the first observation in the column, and not with the last observation before the missing value.
I suggest two subquery levels with window functions instead of correlated subqueries:
SELECT *
FROM (
SELECT the_date, company, max(col3) OVER (PARTITION BY company, grp) AS col3
FROM (
SELECT *, count(col3) OVER (PARTITION BY company ORDER BY the_date) AS grp
FROM tbl
) sub1
) sub2
WHERE col3 IS NOT NULL
ORDER BY the_date, company;
Produces the requested result.
This assumes unique entries per (company, the_date). It should be much faster than correlated subqueries for tables with more than just a few rows, and an index (unique, to enforce that uniqueness) would help performance a lot:
CREATE INDEX tbl_company_date_idx ON tbl (company, the_date);
How?
The aggregate function count() ignores NULL values when counting. Used as an aggregate window function, it computes the running count of non-null values in the column according to the default window frame, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The count therefore stays "stuck" for rows with NULL values, so each NULL row falls into the same peer group (grp) as the last non-null value before it.
In the second window function, the single non-null value per group is easily extracted with max(). The group before the first non-null value retains NULL, which is eliminated in the final SELECT.
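Tracing company A from the sample data makes this concrete: its values by date are 5, NULL, 6, NULL, so the running count (grp) and the per-group max work out as follows:
the_date    col3   grp   max(col3) per (company, grp)
2004-01-01  5      1     5
2004-01-02  NULL   1     5
2004-01-03  6      2     6
2004-01-04  NULL   2     6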
See:
Retrieve last known value for each column of a row
Try:
SELECT *
FROM (
SELECT id,
date,
company,
case when column3 is not null
then column3
else (
SELECT column3
FROM mytable t1
WHERE t1.company = t.company
AND t1.date < t.date
AND t1.column3 IS NOT NULL
ORDER BY t1.date DESC LIMIT 1
)
end column3
FROM mytable T
) AS subq
WHERE column3 IS NOT NULL;
demo: http://sqlfiddle.com/#!15/0cdce/12