Padding tables with 0s in Redshift - SQL

I have a table of the form:
id | A | B | C
-----------------
1 | 1 | 0 | 1
1 | 2 | 1 | 0
2 | 1 | 4 | 0
I would like to pad this table with rows of 0s (excluding the id) such that each id has exactly 3 entries. So the result would be:
id | A | B | C
-----------------
1 | 0 | 0 | 0
1 | 1 | 0 | 1
1 | 2 | 1 | 0
2 | 0 | 0 | 0
2 | 0 | 0 | 0
2 | 1 | 4 | 0
This is because id 1 had two entries, so we added one row of 0s, and id 2 had one entry, so we added two rows of 0s.
Note: we can assume each id occurs no more than 3 times and that if an id occurs exactly 3 times, there is no need to add padding.
Is there an intelligent way of doing this with Amazon Redshift? I need this to scale to 30 days of padding and a few hundred columns.

If column A is always sequential, you can do:
select i.id, n.num,
       coalesce(t.b, 0) as b,
       coalesce(t.c, 0) as c
from (select distinct id from t) i cross join
     (select 1 as num union all select 2 union all select 3) n left join
     t on i.id = t.id and n.num = t.a;
You do need to list each column in the select to get the zeros.
If the above is not true, you can make it true with a CTE:
with t as (
    select t.*, row_number() over (partition by id order by id) as num
    from t
)
select i.id, coalesce(t.a, 0) as a,
       coalesce(t.b, 0) as b,
       coalesce(t.c, 0) as c
from (select distinct id from t) i cross join
     (select 1 as num union all select 2 union all select 3) n left join
     t on i.id = t.id and n.num = t.num;
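To scale the number list to the 30 days mentioned in the question, you don't need to write 30 UNION ALLs by hand. Here is a hedged sketch (assuming a Redshift release recent enough to support WITH RECURSIVE; if yours doesn't, keep the UNION ALL list):
with recursive nums(num) as (
    select 1
    union all
    select num + 1 from nums where num < 30   -- one row per day of padding
)
select i.id, n.num,
       coalesce(t.b, 0) as b,
       coalesce(t.c, 0) as c
from (select distinct id from t) i cross join
     nums n left join
     t on i.id = t.id and n.num = t.a;
Each of the few hundred columns still needs its own coalesce(..., 0) in the select list, as noted above.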

BigQuery: Joining 2 tables, one having repeated records and one with COUNT()

I want to join the tables after unnesting the arrays in Table 1, but the records are duplicated after the join because of the UNNEST.
Table 1:
| a | d.b | d.c |
-----------------
| 1 | 5   | 2   |
|   | 3   | 1   |
-----------------
| 2 | 2   | 1   |
-----------------
Table 2:
| a | c  | f  |
---------------
| 1 | 12 | 13 |
---------------
| 2 | 14 | 15 |
---------------
I want to join tables 1 and 2 on a, but I also need the output to be:
| a | d.b | d.c | f  | h  | Sum(count(a)) |
-------------------------------------------
| 1 | 5   | 2   | 13 | 12 | 1             |
|   | 3   | 1   |    |    |               |
-------------------------------------------
| 2 | 2   | 1   | 15 | 14 | 1             |
-------------------------------------------
a can be repeated in table 2, so I need to COUNT(a) and then select the sum after the join.
My problem: when I join, I need the nested and repeated records to stay the same as in the first table, but when I use aggregation to get the sum I can't GROUP BY a struct or an array. So I UNNEST the records first and then use ARRAY_AGG, but there is still an issue with the sum.
SELECT
    t1.a,
    t2.f,
    t2.h,
    ARRAY_AGG(DISTINCT t1.db) AS db,
    ARRAY_AGG(DISTINCT t1.dc) AS dc,
    SUM(t2.total) AS total
FROM (
    SELECT
        a,
        d.b AS db,
        d.c AS dc
    FROM
        `table1`,
        UNNEST(d) AS d
) AS t1
LEFT JOIN (
    SELECT
        a,
        f,
        h,
        COUNT(*) AS total
    FROM
        `table2`
    GROUP BY
        a, f, h
) AS t2
ON
    t1.a = t2.a
GROUP BY
    1, 2, 3
Note: the error is in the total column after the sum; it is much higher than expected. All other data are correct.
I guess your table 2 is not unique on column a.
Let's assume that table 2 looks like this:
| a | c   | f   |
-----------------
| 1 | 12  | 13  |
| 2 | 14  | 15  |
| 1 | 100 | 101 |
There are two rows where a is 1. Since the other column values differ, the grouping (GROUP BY a, f, h in the t2 subquery) does not collapse them, and COUNT(*) AS total is 1 for each row.
| a | c   | f   | total |
-------------------------
| 1 | 12  | 13  | 1     |
| 2 | 14  | 15  | 1     |
| 1 | 100 | 101 | 1     |
In the next step you join this result to your table 1. The rows of table 1 with value 1 in column a are duplicated, because table 2 has two entries for it. This is why the sum is too high.
Instead of unnesting the tables, I recommend the following approach:
-- Creating sample data as given:
with tbl_A as (
    select 1 a, [struct(5 as b, 2 as c), struct(3, 1)] d
    union all select 2, [struct(2, 1)]
    union all select null, [struct(50, 51)]
),
tbl_B as (
    select 1 as a, 12 b, 13 f
    union all select 2, 14, 15
    union all select 1, 100, 101
    union all select null, 500, 501
)
-- Query:
select *
from tbl_A A
left join (
    select a, array_agg(struct(b, f)) as B, count(1) as counts
    from tbl_B
    group by 1
) B
on ifnull(A.a, -9) = ifnull(B.a, -9)
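A hedged follow-up (my own sketch, not part of the original answer): because table 2 is collapsed to one row per a before the join, the counts are computed before any fan-out can happen, so you can safely unnest A.d afterwards to get the flat d.b / d.c layout from the question without inflating the count:
select A.a, d.b, d.c, B.counts
from tbl_A A
left join (
    select a, count(1) as counts
    from tbl_B
    group by 1
) B
on ifnull(A.a, -9) = ifnull(B.a, -9)
left join unnest(A.d) as d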

T-SQL: Best way to replace NULL with most recent non-null value?

Assume I have this table:
+----+-------+
| id | value |
+----+-------+
| 1 | 5 |
| 2 | 4 |
| 3 | 1 |
| 4 | NULL |
| 5 | NULL |
| 6 | 14 |
| 7 | NULL |
| 8 | 0 |
| 9 | 3 |
| 10 | NULL |
+----+-------+
I want to write a query that will replace any NULL value with the last value in the table that was not null in that column.
I want this result:
+----+-------+
| id | value |
+----+-------+
| 1 | 5 |
| 2 | 4 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 14 |
| 7 | 14 |
| 8 | 0 |
| 9 | 3 |
| 10 | 3 |
+----+-------+
If no previous value existed, then NULL is OK. Ideally, this should be able to work even with an ORDER BY. So for example, if I ORDER BY [id] DESC:
+----+-------+
| id | value |
+----+-------+
| 10 | NULL |
| 9 | 3 |
| 8 | 0 |
| 7 | 0 |
| 6 | 14 |
| 5 | 14 |
| 4 | 14 |
| 3 | 1 |
| 2 | 4 |
| 1 | 5 |
+----+-------+
Or even better if I ORDER BY [value] DESC:
+----+-------+
| id | value |
+----+-------+
| 6 | 14 |
| 1 | 5 |
| 2 | 4 |
| 9 | 3 |
| 3 | 1 |
| 8 | 0 |
| 4 | 0 |
| 5 | 0 |
| 7 | 0 |
| 10 | 0 |
+----+-------+
I think this might involve some kind of analytic function - somehow partitioning over the value column - but I'm not sure where to look.
You can use a running sum to set groups and use max to fill in the null values.
select id, max(value) over (partition by grp) as value
from (select id, value,
             sum(case when value is not null then 1 else 0 end) over (order by id) as grp
      from tbl
     ) t
Change the ordering in the over() clauses (for example to order by id desc, or order by value desc) to get the other results in the question.
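As a hedged sketch of that variation (my own illustration, using the same tbl), the ORDER BY [id] DESC result from the question would come from reversing the window ordering:
select id, max(value) over (partition by grp) as value
from (select id, value,
             sum(case when value is not null then 1 else 0 end) over (order by id desc) as grp
      from tbl
     ) t
order by id desc;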
The best way has been covered by Itzik Ben-Gan here: The Last non NULL Puzzle.
Below is a solution which, for 10 million rows, completes in around 20 seconds on my system:
SELECT
id,
value1,
CAST(
SUBSTRING(
MAX(CAST(id AS binary(4)) + CAST(value1 AS binary(4)))
OVER (ORDER BY id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS int) AS lastval
FROM dbo.T1;
This solution assumes your id column is indexed
You can also try using a correlated subquery:
select id,
       case when value is not null then value
            else (select top 1 value
                  from yourtable t2
                  where t2.id < t.id and t2.value is not null
                  order by t2.id desc)
       end as value
from yourtable t
Result :
id value
1 5
2 4
3 1
4 1
5 1
6 14
7 14
8 0
9 3
10 3
If the NULLs are scattered, I use a WHILE loop to fill them in.
However, if the NULLs come in longer consecutive runs, there are faster ways to do it.
So here's one approach:
First find a record that we want to update: it has a NULL value, and the prior record does not.
SELECT C.VALUE, N.ID
FROM YourTable C
INNER JOIN YourTable N
    ON C.ID + 1 = N.ID
WHERE C.VALUE IS NOT NULL
    AND N.VALUE IS NULL;
Use that to update (I'm a bit hazy on this syntax, but you get the idea):
UPDATE N
SET VALUE = C.VALUE
FROM YourTable C
INNER JOIN YourTable N
    ON C.ID + 1 = N.ID
WHERE C.VALUE IS NOT NULL
    AND N.VALUE IS NULL;
Now just keep doing it till you run out of rows:
-- This is needed to set @@ROWCOUNT to a non-zero value
SELECT 1;
WHILE @@ROWCOUNT <> 0
BEGIN
    UPDATE N
    SET VALUE = C.VALUE
    FROM YourTable C
    INNER JOIN YourTable N
        ON C.ID + 1 = N.ID
    WHERE C.VALUE IS NOT NULL
        AND N.VALUE IS NULL;
END
The other way is to use a similar query to get a range of ids to update. This works much faster if your NULLs usually fall on consecutive ids.
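As a hedged sketch of a single-pass, set-based alternative (my own illustration, assuming a table named YourTable with id and value columns as above), every NULL row can pull the nearest preceding non-NULL value in one UPDATE, which avoids looping over long NULL runs:
UPDATE n
SET value = prevrow.value
FROM YourTable AS n
CROSS APPLY (SELECT TOP (1) t.value
             FROM YourTable AS t
             WHERE t.id < n.id
               AND t.value IS NOT NULL
             ORDER BY t.id DESC) AS prevrow
WHERE n.value IS NULL;
This is not the range-based trick alluded to above, just one way to avoid the loop entirely.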
Here is one simple approach using OUTER APPLY:
CREATE TABLE #table(id INT, value INT)
INSERT INTO #table VALUES
(1,5),
(2,4),
(3,1),
(4,NULL),
(5,NULL),
(6,14),
(7,NULL),
(8,0),
(9,3),
(10,NULL)
SELECT t.id, ISNULL(t.value, t3.value) value
FROM #table t
OUTER APPLY(SELECT id FROM #table WHERE id = t.id AND VALUE IS NULL) t2
OUTER APPLY(SELECT TOP 1 value
FROM #table WHERE id <= t2.id AND VALUE IS NOT NULL ORDER BY id DESC) t3
OUTPUT:
id VALUE
---------
1 5
2 4
3 1
4 1
5 1
6 14
7 14
8 0
9 3
10 3
Using this sample data:
if object_id('tempdb..#t1') is not null drop table #t1;
create table #t1 (id int primary key, [value] int null);
insert #t1 values(1,5),(2,4),(3,1),(4,NULL),(5,NULL),(6,14),(7,NULL),(8,0),(9,3),(10,NULL);
I came up with:
with x(id, [value], grouper) as (
select *, row_number() over (order by id)-sum(iif([value] is null,1,0)) over (order by id)
from #t1)
select id, min([value]) over (partition by grouper)
from x;
I noticed, however, that Vamsi Prabhala beat me to it... My solution is identical to what he posted. (arghhhh!). So I thought I'd try a recursive solution. Here's a pretty efficient use of a recursive cte (provided that ID is indexed):
with sorted as (select *, seqid = row_number() over (order by id) from #t1),
firstRecord as (select top(1) * from #t1 order by id),
prev as
(
select t.id, t.[value], lastid = 1, lastvalue = null
from sorted t
where t.id = 1
union all
select t2.id, t2.[value], lastid+1, isnull(prev.[value],lastvalue)
from sorted t2
join prev on t2.id = prev.lastid+1
)
select id, [value]=isnull([value],lastvalue)--, *
from prev;
Normally I don't like recursive CTEs (rCTE for short), but in this case it offered an elegant solution and was faster than using the window aggregate functions (sum over, min over...). Comparing the execution plans, the rCTE gets it done with two index seeks, one of which is for just one row. Unlike the window aggregate solution, the rCTE does not require a sort. Running this with SET STATISTICS IO ON, the rCTE produces much less IO.
All this said, don't use either of these solutions. What TheGameiswar posted will perform the best by far; his solution on a properly indexed id column would be lightning fast.
The following UPDATE statement can be used; please test it before use:
update #table
set value = newvalue
from (
    select s.id, s.value,
           (select top 1 t.value
            from #table t
            where t.id <= s.id and t.value is not null
            order by t.id desc) as newvalue
    from #table s
) u
where #table.id = u.id and #table.value is null
Stop worrying... here's the answer for you :)
SELECT *
INTO #TempIsNotNull
FROM YourTable
WHERE value IS NOT NULL

SELECT *
INTO #TempIsNull
FROM YourTable
WHERE value IS NULL

UPDATE YourTable
SET YourTable.value = UpdateDtls.value
FROM YourTable
JOIN (
    SELECT OuterTab1.id,
           #TempIsNotNull.value
    FROM #TempIsNull OuterTab1
    CROSS JOIN #TempIsNotNull
    WHERE OuterTab1.id - #TempIsNotNull.id > 0
      AND (OuterTab1.id - #TempIsNotNull.id) = (SELECT TOP 1 OuterTab1.id - #TempIsNotNull.id
                                                FROM #TempIsNull InnerTab
                                                CROSS JOIN #TempIsNotNull
                                                WHERE OuterTab1.id - #TempIsNotNull.id > 0
                                                  AND OuterTab1.id = InnerTab.id
                                                ORDER BY (OuterTab1.id - #TempIsNotNull.id) ASC)
) AS UpdateDtls
    ON (YourTable.id = UpdateDtls.id)

Check the first value according to first date

I have two tables
guid | id | Status
-----| -----| ----------
1 | 123 | 0
2 | 456 | 3
3 | 789 | 0
The other table is
id | modified date | Status
------| --------------| ----------
1 | 26-08-2017 | 3
1 | 27-08-2017 | 0
1 | 01-09-2017 | 0
1 | 02-09-2017 | 0
2 | 26-08-2017 | 3
2 | 01-09-2017 | 0
2 | 02-09-2017 | 3
3 | 01-09-2017 | 0
3 | 02-09-2017 | 3
3 | 03-09-2017 | 0
Every time the status in the first table changes for an id, a modified date and status are also recorded in the second table. For example, for id 1 the status was changed 4 times.
I want to select, by joining both tables, the ids whose status value was 0 on their first modified date.
In this example it should return only id 3, because only id 3 has a status value of 0 on its first modified date, 01-09-2017. Ids 1 and 2 have value 3 on their first modified date.
Any help?
Try using the below (assuming the first table is A and the second table is B):
;with cte as (
    select a.id, b.Status,
           row_number() over (partition by a.id order by [modified date] asc) as row_num
    from A
    inner join B
        on a.id = b.id
)
select *
from cte
where status = 0 and row_num = 1
I think this will do what you're looking for.
WITH cte
AS (SELECT id
, ROW_NUMBER() OVER (PARTITION BY (id) ORDER BY [modified date]) RN
, Status
FROM SecondTable
)
SELECT *
FROM FirstTable
JOIN cte ON FirstTable.id = cte.id
AND RN = 1
WHERE cte.Status = 0
Just expand out the * and return what fields you need.

How to select an attribute based on string value within a group

Table name: Copies
+------------------------------------+
| group_id | my_id | stuff |
+------------------------------------+
| 900 | 1 | Y |
| 900 | 2 | N |
| 901 | 3 | Y |
| 901 | 4 | Y |
| 902 | 5 | N |
| 902 | 6 | N |
| 903 | 7 | N |
| 903 | 8 | Y |
---------------------------------------
The output should be:
+------------------------------------+
| group_id | my_id | stuff |
+------------------------------------+
| 900 | 1 | Y |
| 903 | 8 | Y |
--------------------------------------
Hello, I have a table where I have to discern a 'good' record within a group_id based on a positive (Y) value within the stuff field. I need the full record where only one value fits this criterion. If both stuff values are Y or both are N, then the group shouldn't be selected. It seems like this should be simple, but I am not sure how to proceed.
One option here is to use conditional aggregation over each group_id and retain a group if it has a mixture of yes and no answers.
WITH cte AS (
SELECT group_id
FROM Copies
GROUP BY group_id
HAVING SUM(CASE WHEN stuff = 'Y' THEN 1 ELSE 0 END) > 0 AND
SUM(CASE WHEN stuff = 'N' THEN 1 ELSE 0 END) > 0
)
SELECT c1.*
FROM Copies c1
INNER JOIN cte c2
ON c1.group_id = c2.group_id
WHERE c1.stuff = 'Y'
One advantage of this solution is that it will show all columns of matching records.
select group_id,
min(my_id)
keep (dense_rank first order by case stuff when 'Y' then 0 end) as my_id,
'Y' as stuff
from table_1
group by group_id
having min(stuff) != max(stuff)
with rows as(
select group_id, my_id, sum(case when stuff = 'Y' then 1 else 0 end) c
from copies
group by group_id, my_id)
select c.*
from copies c inner join rows r on (c.group_id = r.group_id and c.my_id = r.my_id)
where r.c = 1;
Try this:
SELECT C.*
FROM COPIES C,
COPIES C2
WHERE C.STUFF='Y'
AND C2.STUFF='N'
AND C.GROUP_ID=C2.GROUP_ID
Try this:
SELECT t1.*
FROM copies t1
JOIN (
SELECT group_id
FROM copies
GROUP BY group_id
HAVING COUNT(CASE WHEN stuff = 'Y' THEN 1 END) = 1 AND
COUNT(CASE WHEN stuff = 'N' THEN 1 END) = 1
) t2 ON t1.group_id = t2.group_id
WHERE t1.stuff = 'Y'
This works as long as group_id values appear in pairs.
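A hedged aside (my own sketch, not part of the answer above): if a group_id could ever have more than two rows, the same idea can be relaxed to require exactly one 'Y' and at least one 'N':
SELECT t1.*
FROM copies t1
JOIN (
    SELECT group_id
    FROM copies
    GROUP BY group_id
    HAVING COUNT(CASE WHEN stuff = 'Y' THEN 1 END) = 1 AND
           COUNT(CASE WHEN stuff = 'N' THEN 1 END) >= 1
) t2 ON t1.group_id = t2.group_id
WHERE t1.stuff = 'Y'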

Merge multiple rows in SQL with tie breaking on primary key

I have a table with data like the following
key | A | B | C
---------------------------
1 | x | 0 | 1
2 | x | 2 | 0
3 | x | NULL | 4
4 | y | 7 | 1
5 | y | 3 | NULL
6 | z | NULL | 4
And I want to merge the rows together based on column A, with the largest primary key being the 'tie breaker' between values that are not NULL.
Result
key | A | B | C
---------------------------
1 | x | 2 | 4
2 | y | 3 | 1
3 | z | NULL | 4
What would be the best way to achieve this assuming my data is actually 40 columns and 1 million rows with an unknown level of duplications?
Using ROW_NUMBER and conditional aggregation:
WITH cte AS(
SELECT *,
rnB = ROW_NUMBER() OVER(PARTITION BY A ORDER BY CASE WHEN B IS NULL THEN 0 ELSE 1 END DESC, [key] DESC),
rnC = ROW_NUMBER() OVER(PARTITION BY A ORDER BY CASE WHEN C IS NULL THEN 0 ELSE 1 END DESC, [key] DESC)
FROM tbl
)
SELECT
[key] = ROW_NUMBER() OVER(ORDER BY A),
A,
B = MAX(CASE WHEN rnB = 1 THEN B END),
C = MAX(CASE WHEN rnC = 1 THEN C END)
FROM cte
GROUP BY A
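A hedged note on scaling this (my own observation, not part of the original answer): with 40 columns, every nullable column needs its own ROW_NUMBER/MAX pair. For example, a hypothetical extra column D would extend the pattern like this:
WITH cte AS(
    SELECT *,
        rnB = ROW_NUMBER() OVER(PARTITION BY A ORDER BY CASE WHEN B IS NULL THEN 0 ELSE 1 END DESC, [key] DESC),
        rnC = ROW_NUMBER() OVER(PARTITION BY A ORDER BY CASE WHEN C IS NULL THEN 0 ELSE 1 END DESC, [key] DESC),
        rnD = ROW_NUMBER() OVER(PARTITION BY A ORDER BY CASE WHEN D IS NULL THEN 0 ELSE 1 END DESC, [key] DESC)
    FROM tbl
)
SELECT
    [key] = ROW_NUMBER() OVER(ORDER BY A),
    A,
    B = MAX(CASE WHEN rnB = 1 THEN B END),
    C = MAX(CASE WHEN rnC = 1 THEN C END),
    D = MAX(CASE WHEN rnD = 1 THEN D END)
FROM cte
GROUP BY A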