Filling in missing values with a median in postgres

How can I replace avg with a median calculation in this?
select *
, coalesce(val, avg(val) over (order by t rows between 3 preceding and 1 preceding)) as fixed
from (
values
(1, 10),
(2, NULL),
(3, 10),
(4, 15),
(5, 11),
(6, NULL),
(7, NULL),
(8, NULL),
(9, NULL)
) as test(t, val)
;
Is there a legal version of this?
percentile_cont(0.5) within group(order by val) over (order by t rows between 3 preceding and 1 preceding)

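Unfortunately, percentile_cont() is an ordered-set aggregate function, and PostgreSQL does not allow ordered-set aggregates to take an OVER clause, so there is no window-function equivalent.
One workaround is to use a correlated subquery to do the aggregate computation. The queries below assume a table test(id, val).
If the ids are consecutive (no gaps), then you can do: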
select
t.*,
coalesce(
t.val,
(
select percentile_cont(0.5) within group(order by t1.val)
from test t1
where t1.id between t.id - 3 and t.id - 1
)
) fixed
from test t
Otherwise (if the ids can have gaps), you need an additional level of nesting that picks the three preceding rows explicitly:
select
t.*,
coalesce(
t.val,
(
select percentile_cont(0.5) within group(order by t1.val)
from (select val from test t1 where t1.id < t.id order by t1.id desc limit 3) t1
)
) fixed
from test t
Demo on DB Fiddle - both queries yield:
id | val | fixed
-: | ---: | :----
1 | 10 | 10
2 | null | 10
3 | 10 | 10
4 | 15 | 15
5 | 11 | 11
6 | null | 11
7 | null | 13
8 | null | 11
9 | null | null
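As a cross-check of the output above, here is a minimal Python sketch (not part of the original answer) of what both queries compute: each NULL is replaced by the median of the non-NULL values among the three preceding rows. The function name fill_with_trailing_median and its window parameter are made up for illustration, and statistics.median stands in for percentile_cont(0.5); the two agree because both average the two middle values when the count is even.

```python
from statistics import median


def fill_with_trailing_median(values, window=3):
    """Replace each None with the median of the non-None values among the
    `window` preceding original values (hypothetical helper, for illustration)."""
    fixed = []
    for i, val in enumerate(values):
        if val is not None:
            fixed.append(val)
            continue
        # Look only at the original values, like the correlated subquery does;
        # previously filled values are not reused.
        previous = [v for v in values[max(0, i - window):i] if v is not None]
        fixed.append(median(previous) if previous else None)
    return fixed


vals = [10, None, 10, 15, 11, None, None, None, None]
fixed = fill_with_trailing_median(vals)
print(fixed)  # [10, 10, 10, 15, 11, 11, 13.0, 11, None]
```

Row 7 gets 13 because only 15 and 11 are non-NULL in its window, and row 9 stays NULL because all three preceding values are NULL.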

Related

T-SQL sequential updating with two columns

I have a table created by:
CREATE TABLE table1
(
id INT,
multiplier INT,
col1 DECIMAL(10,5)
)
INSERT INTO table1
VALUES (1, 2, 1.53), (2, 3, NULL), (3, 2, NULL),
(4, 2, NULL), (5, 3, NULL), (6, 1, NULL)
Which results in:
id multiplier col1
-----------------------
1 2 1.53000
2 3 NULL
3 2 NULL
4 2 NULL
5 3 NULL
6 1 NULL
I want to add a column col2 which is defined as multiplier * col1; however, each subsequent value of col1 should then take the previously calculated value of col2.
The resulting table should look like:
id multiplier col1 col2
---------------------------------------
1 2 1.53000 3.06000
2 3 3.06000 9.18000
3 2 9.18000 18.36000
4 2 18.36000 36.72000
5 3 36.72000 110.16000
6 1 110.16000 110.16000
Is this possible using T-SQL? I've tried a few different things such as joining id to id - 1 and have played around with a sequential update using UPDATE and setting variables but I can't get it to work.

A recursive CTE might be the best approach. Assuming your ids have no gaps:
with cte as (
select id, multiplier, convert(float, col1) as col1, convert(float, col1 * multiplier) as col2
from table1
where id = 1
union all
select t1.id, t1.multiplier, cte.col2 as col1, cte.col2 * t1.multiplier
from cte join
table1 t1
on t1.id = cte.id + 1
)
select *
from cte;
Here is a db<>fiddle.
Note that I converted the destination type to float, which is convenient for this sort of operation. You can convert back to decimal if you prefer that.

Basically, this would require an aggregate/window function that computes the product of column values. No such set function exists in SQL, though. We can work around this with arithmetic, since a product is the exponential of a sum of logarithms:
select
id,
multiplier,
coalesce(min(col1) over() * exp(sum(log(multiplier)) over(order by id rows between unbounded preceding and 1 preceding)), col1) col1,
min(col1) over() * exp(sum(log(multiplier)) over(order by id)) col2
from table1
Demo on DB Fiddle:
id | multiplier | col1 | col2
-: | ---------: | -----: | -----:
1 | 2 | 1.53 | 3.06
2 | 3 | 3.06 | 9.18
3 | 2 | 9.18 | 18.36
4 | 2 | 18.36 | 36.72
5 | 3 | 36.72 | 110.16
6 | 1 | 110.16 | 110.16
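This will fail if any multiplier is zero or negative, since log() is only defined for positive numbers.
If you wanted an update statement (assuming column col2 has already been added to table1):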
with cte as (
select col1, col2,
coalesce(min(col1) over() * exp(sum(log(multiplier)) over(order by id rows between unbounded preceding and 1 preceding)), col1) col1_new,
min(col1) over() * exp(sum(log(multiplier)) over(order by id)) col2_new
from table1
)
update cte set col1 = col1_new, col2 = col2_new
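To make the arithmetic behind both answers concrete, here is a small Python sketch (an illustration, not something from the original answers). It computes the running product directly, which is what the recursive CTE does row by row, and then via exp(sum(log(...))), which is what the window-function version does. The names start, running, col1, col2 and product_via_logs are chosen here for the example.

```python
import math
from itertools import accumulate
from operator import mul

multipliers = [2, 3, 2, 2, 3, 1]
start = 1.53  # col1 of the first row

# Direct running product: 2, 6, 12, 24, 72, 72
running = list(accumulate(multipliers, mul))
col2 = [start * p for p in running]
col1 = [start] + col2[:-1]  # each col1 is the previous row's col2


def product_via_logs(values):
    """Running product computed as exp(running sum of logs).
    Raises ValueError for zero or negative inputs, because math.log does."""
    return [math.exp(s) for s in accumulate(math.log(v) for v in values)]


col2_logs = [start * p for p in product_via_logs(multipliers)]

print([round(x, 2) for x in col2])
print([round(x, 2) for x in col2_logs])
```

Both lists round to the col2 column shown above. Calling product_via_logs([2, -3]) raises ValueError, which is the Python counterpart of the failure mode mentioned for non-positive multipliers.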

Unexpected behavior of window function first_value

I have 2 columns - order no, value. Table value constructor:
(1, null)
,(2, 5)
,(3, null)
,(4, null)
,(5, 2)
,(6, 1)
I need to get
(1, 5) -- i.e. first nonnull Value if I go from current row and order by OrderNo
,(2, 5)
,(3, 2) -- i.e. first nonnull Value if I go from current row and order by OrderNo
,(4, 2) -- analogous
,(5, 2)
,(6, 1)
This is the query that I think should work.
;with SourceTable as (
select *
from (values
(1, null)
,(2, 5)
,(3, null)
,(4, null)
,(5, 2)
,(6, 1)
) as T(OrderNo, Value)
)
select
*
,first_value(Value) over (
order by
case when Value is not null then 0 else 1 end
, OrderNo
rows between current row and unbounded following
) as X
from SourceTable
order by OrderNo
The issue is that it returns exactly the same result set as SourceTable, and I don't understand why. E.g., when the first row is processed (OrderNo = 1), I'd expect column X to return 5, because the frame should include all rows (current row and unbounded following) and it orders by Value (non-nulls first), then by OrderNo. So the first row in the frame should be OrderNo = 2. Obviously it doesn't work like that, but I don't get why.
I'd much appreciate it if someone could explain how the first frame is constructed. I need this for SQL Server and also PostgreSQL.
Many thanks

Although probably more expensive than two window functions, you can do this without a subquery using arrays:
with SourceTable as (
select *
from (values (1, null),
(2, 5),
(3, null),
(4, null),
(5, 2),
(6, 1)
) T(OrderNo, Value)
)
select st.*,
(array_remove(array_agg(value) over (order by orderno rows between current row and unbounded following), null))[1] as x
from SourceTable st
order by OrderNo;
Here is the db<>fiddle.
Or using a lateral join:
select st.*, st2.value
from SourceTable st left join lateral
(select st2.*
from SourceTable st2
where st2.value is not null and st2.orderno >= st.orderno
order by st2.orderno asc
limit 1
) st2
on 1=1
order by OrderNo;
With the right indexes on the source table, the lateral join might be the best solution from a performance perspective (I have been surprised by the performance of lateral joins under the right circumstances).

It's pretty easy to see why first_value doesn't work if you order the results by case when Value is not null then 0 else 1 end, orderno:
orderno | value | x
---------+-------+---
2 | 5 | 5
5 | 2 | 2
6 | 1 | 1
1 | |
3 | |
4 | |
(6 rows)
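The frame runs from the current row to the end of this ordering, so every row's frame starts with the row itself and first_value just returns the row's own value. For orderno=1, the frame contains only the rows for orderno 1, 3 and 4, none of which is non-null.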
Instead, we can arrange the orders into groups using count as a window function in a sub-query. We then use max as a window function over that group (this is arbitrary, min would work just as well) to get the one non-null value in that group:
with SourceTable as (
select *
from (values
(1, null)
,(2, 5)
,(3, null)
,(4, null)
,(5, 2)
,(6, 1)
) as T(OrderNo, Value)
)
select orderno, order_group, max(value) OVER (PARTITION BY order_group) FROM (
SELECT *,
count(value) OVER (ORDER BY orderno DESC) as order_group
from SourceTable
) as sub
order by orderno;
orderno | order_group | max
---------+-------------+-----
1 | 3 | 5
2 | 3 | 5
3 | 2 | 2
4 | 2 | 2
5 | 2 | 2
6 | 1 | 1
(6 rows)
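The grouping trick can be traced by hand with a short Python sketch (added for illustration; it is not from the original answer). Walking the rows from the highest orderno down, the number of non-null values seen so far plays the role of count(value) OVER (ORDER BY orderno DESC), and each group holds exactly one non-null value. The function name next_non_null is hypothetical.

```python
rows = [(1, None), (2, 5), (3, None), (4, None), (5, 2), (6, 1)]


def next_non_null(rows):
    """Return (orderno, order_group, value) tuples, where value is the first
    non-null value at or after orderno (hypothetical helper)."""
    result = {}
    group = 0          # running count of non-null values, scanning downwards
    group_value = {}   # the single non-null value belonging to each group
    for orderno, value in sorted(rows, key=lambda r: r[0], reverse=True):
        if value is not None:
            group += 1
            group_value[group] = value
        # Rows before the first non-null (from the end) have group 0 and no value.
        result[orderno] = (group, group_value.get(group))
    return [(orderno, *result[orderno]) for orderno in sorted(result)]


filled = next_non_null(rows)
for row in filled:
    print(row)
```

This prints (1, 3, 5), (2, 3, 5), (3, 2, 2), (4, 2, 2), (5, 2, 2), (6, 1, 1), matching the order_group and max columns above.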

Calculate percentage / aggregation based on a baseline row

I would like to calculate the productivity of a sales team compared to a specific team member.
Given this query:
with t1 (rep_id, place_id, sales_qty) as (values
(0, 1, 3),
(1, 1, 1),
(1, 2, 2),
(1, 3, 4),
(1, 4, 1),
(2, 2, 1),
(2, 3, 3)
)
select
rep_id,
count(distinct place_id) as qty_places,
sum(sales_qty) as qty,
sum(sales_qty) / count(place_id) as productivity
from
t1
group by
rep_id
result:
rep_id | qty_places | qty_sales | productivity
---------------------------------------------
0 | 1 | 6 | 6
1 | 4 | 22 | 5
2 | 2 | 9 | 4
I would like to have the productivity of the team based on the productivity of rep_id = 1, so I would like to have something like this:
rep_id | qty_places | qty_sales | productivity | productivity %
--------------------------------------------------------------
0 | 1 | 6 | 6 | 1.2
1 | 4 | 22 | 5 | 1 <- Baseline
2 | 2 | 9 | 4 | 0.8
How can I achieve that with SQL on PostgreSQL?

This should do the trick:
with t1 (rep_id, place_id, sales_qty) as (values
(0, 1, 3),
(1, 1, 1),
(1, 2, 2),
(1, 3, 4),
(1, 4, 1),
(2, 2, 1),
(2, 3, 3)
),
cte as (select
rep_id,
count(distinct place_id) as qty_places,
sum(sales_qty) as qty,
sum(sales_qty) / count(place_id) as productivity
from
t1
group by
rep_id)
select rep_id, qty_places, qty, productivity,
productivity::numeric/(select productivity::numeric from cte where rep_id = 1)
as productivity_percent from cte

We can try computing the rep_id = 1 figures in a separate CTE, and then cross join that to your current table:
WITH cte AS (
-- 1.0 * forces numeric division so the baseline is not truncated to an integer
SELECT 1.0 * SUM(CASE WHEN rep_id = 1 THEN sales_qty ELSE 0 END) /
COUNT(CASE WHEN rep_id = 1 THEN 1 END) AS baseline
FROM t1
)
SELECT
rep_id,
COUNT(DISTINCT place_id) AS qty_places,
SUM(sales_qty) AS qty,
SUM(sales_qty) / COUNT(place_id) AS productivity,
(1.0*SUM(sales_qty) / COUNT(place_id)) / t2.baseline AS productivity_pct
FROM t1
CROSS JOIN cte t2
GROUP BY
t1.rep_id, t2.baseline;

Simply use conditional aggregation. I would do this using a subquery:
select t.*,
productivity / max(productivity) filter (where rep_id = 1) over ()
from (select rep_id,
count(distinct place_id) as qty_places,
sum(sales_qty) as qty,
sum(sales_qty)::numeric / count(place_id) as productivity
from t1
group by rep_id
) t
Here is a db<>fiddle.
Note that you can actually express this without the subquery, but I think that just makes the query more complicated.
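For a quick sanity check of the baseline idea, here is a hedged Python sketch (not taken from any of the answers). It aggregates the seven sample rows from the CTE per rep_id, divides without truncation, and then divides every productivity by that of the baseline rep. The function name productivity_vs_baseline and the baseline_rep parameter are invented for the example. The numbers come from the sample rows themselves, so they differ from the figures in the result tables in the question.

```python
from collections import defaultdict

# (rep_id, place_id, sales_qty), copied from the sample CTE
sales = [
    (0, 1, 3),
    (1, 1, 1), (1, 2, 2), (1, 3, 4), (1, 4, 1),
    (2, 2, 1), (2, 3, 3),
]


def productivity_vs_baseline(sales, baseline_rep=1):
    """Map rep_id -> (productivity, productivity relative to baseline_rep)."""
    qty = defaultdict(int)
    row_count = defaultdict(int)  # mirrors count(place_id)
    for rep_id, _place_id, sales_qty in sales:
        qty[rep_id] += sales_qty
        row_count[rep_id] += 1
    # True division, the counterpart of casting to numeric before dividing
    productivity = {rep: qty[rep] / row_count[rep] for rep in qty}
    base = productivity[baseline_rep]
    return {rep: (p, p / base) for rep, p in sorted(productivity.items())}


report = productivity_vs_baseline(sales)
print(report)  # {0: (3.0, 1.5), 1: (2.0, 1.0), 2: (2.0, 1.0)}
```

The true division matters: with integer division, as in sum(sales_qty) / count(place_id) on integer columns, a productivity such as 7 / 2 would be truncated to 3 before the ratio is taken.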

Count Top 5 Elements spread over rows and columns

Using T-SQL for this table:
+-----+------+------+------+-----+
| No. | Col1 | Col2 | Col3 | Age |
+-----+------+------+------+-----+
| 1 | e | a | o | 5 |
| 2 | f | b | a | 34 |
| 3 | a | NULL | b | 22 |
| 4 | b | c | a | 55 |
| 5 | b | a | b | 19 |
+-----+------+------+------+-----+
I need to count the TOP 3 names (Ordered by TotalCount DESC) across all rows and columns, for 3 Age groups: 0-17, 18-49, 50-100. Also, how do I ignore the NULLS from my results?
If it's possible, how I can also UNION the results for all 3 age groups into one output table to get 9 results (TOP 3 x 3 Age groups)?
Output for only 1 Age Group: 18-49 would look like this:
+------+------------+
| Name | TotalCount |
+------+------------+
| b | 4 |
| a | 3 |
| f | 1 |
+------+------------+

You first need to unpivot your table and exclude the NULLs. Then do a simple COUNT(*):
WITH CteUnpivot(Name, Age) AS(
SELECT x.*
FROM tbl t
CROSS APPLY ( VALUES
(col1, Age),
(col2, Age),
(col3, Age)
) x(Name, Age)
WHERE x.Name IS NOT NULL
)
SELECT TOP 3
Name, COUNT(*) AS TotalCount
FROM CteUnpivot
WHERE Age BETWEEN 18 AND 49
GROUP BY Name
ORDER BY COUNT(*) DESC
If you want to get the TOP 3 for each age group:
WITH CteUnpivot(Name, Age) AS(
SELECT x.*
FROM tbl t
CROSS APPLY ( VALUES
(col1, Age),
(col2, Age),
(col3, Age)
) x(Name, Age)
WHERE x.Name IS NOT NULL
),
CteRn AS (
SELECT
AgeGroup =
CASE
WHEN Age BETWEEN 0 AND 17 THEN '0-17'
WHEN Age BETWEEN 18 AND 49 THEN '18-49'
WHEN Age BETWEEN 50 AND 100 THEN '50-100'
END,
Name,
COUNT(*) AS TotalCount
FROM CteUnpivot
GROUP BY
CASE
WHEN Age BETWEEN 0 AND 17 THEN '0-17'
WHEN Age BETWEEN 18 AND 49 THEN '18-49'
WHEN Age BETWEEN 50 AND 100 THEN '50-100'
END,
Name
)
SELECT
AgeGroup, Name, TotalCount
FROM(
SELECT *,
rn = ROW_NUMBER() OVER(PARTITION BY AgeGroup ORDER BY TotalCount DESC)
FROM CteRn
) t
WHERE rn <= 3;
The unpivot technique using CROSS APPLY and VALUES:
An Alternative (Better?) Method to UNPIVOT (SQL Spackle) by Dwain Camps

You can try the multiple-CTE SELECT statement below.
ROW_NUMBER() with a PARTITION BY clause is used to rank the records within each age group.
/*
CREATE TABLE tblAges(
[No] Int,
Col1 VarChar(10),
Col2 VarChar(10),
Col3 VarChar(10),
Age SmallInt
)
INSERT INTO tblAges VALUES
(1, 'e', 'a', 'o', 5),
(2, 'f', 'b', 'a', 34),
(3, 'a', NULL, 'b', 22),
(4, 'b', 'c', 'a', 55),
(5, 'b', 'a', 'b', 19);
*/
;with cte as (
select
col1 as col, Age
from tblAges
union all
select
col2, Age
from tblAges
union all
select
col3, Age
from tblAges
), cte2 as (
select
col,
case
when age < 18 then '0-17'
when age < 50 then '18-49'
else '50-100'
end as grup
from cte
where col is not null
), cte3 as (
select
grup,
col,
count(grup) cnt
from cte2
group by
grup,
col
)
select * from (
select
grup, col, cnt, ROW_NUMBER() over (partition by grup order by cnt desc) cnt_grp
from cte3
) t
where cnt_grp <= 3
order by grup, cnt desc
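The same unpivot, filter, group and rank steps can be checked outside the database with a few lines of Python (a sketch for verification only, not part of either answer). The helper names age_group and top_names are made up here; Counter.most_common(n) plays the role of ROW_NUMBER() partitioned by age group with rn <= 3.

```python
from collections import Counter, defaultdict

# (No, Col1, Col2, Col3, Age) from the sample table
rows = [
    (1, "e", "a", "o", 5),
    (2, "f", "b", "a", 34),
    (3, "a", None, "b", 22),
    (4, "b", "c", "a", 55),
    (5, "b", "a", "b", 19),
]


def age_group(age):
    """Bucket an age the same way as the CASE expression."""
    if age < 18:
        return "0-17"
    if age < 50:
        return "18-49"
    return "50-100"


def top_names(rows, n=3):
    """Return {age_group: [(name, count), ...]} with at most n names each."""
    counts = defaultdict(Counter)
    for _no, col1, col2, col3, age in rows:
        for name in (col1, col2, col3):  # unpivot the three columns
            if name is not None:         # ignore NULLs
                counts[age_group(age)][name] += 1
    return {group: counter.most_common(n) for group, counter in counts.items()}


top = top_names(rows)
print(top["18-49"])  # [('b', 4), ('a', 3), ('f', 1)]
```

The 18-49 result matches the expected output in the question. In the other two groups every name occurs once, so which three names come back depends on how ties are broken, in SQL as well as here.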

SQL : how to find leaf rows?

I have a self-referencing table myTable like:
ID | RefID
----------
1 | NULL
2 | 1
3 | 2
4 | NULL
5 | 2
6 | 5
7 | 5
8 | NULL
9 | 7
I need to get the leaf rows at any depth.
Based on the table above, the result must be:
ID | RefID
----------
3 | 2
4 | NULL
6 | 5
8 | NULL
9 | 7
Thank you.
PS: the depth may vary; this is a very small example.

Try:
SELECT id,
refid
FROM mytable t
WHERE NOT EXISTS (SELECT 1
FROM mytable
WHERE refid = t.id)

-- Note: this keeps only the deepest row(s) under each root, so a leaf at a
-- shallower level (id 6 in this sample data) is not returned.
DECLARE @t TABLE (id int NOT NULL, RefID int NULL);
INSERT @t VALUES (1, NULL), (2, 1), (3, 2), (5, NULL),
(6, 5), (4, NULL), (7, 5), (8, NULL), (9, 8), (10, 7);
WITH CTE AS
(
-- top level
SELECT id, RefID, id AS RootId, 0 AS CTELevel FROM @t WHERE RefID IS NULL
UNION ALL
SELECT T.id, T.RefID, RootId, CTELevel + 1 FROM @t T JOIN CTE ON T.RefID = CTE.id
), Leafs AS
(
SELECT
id, RefID, DENSE_RANK() OVER (PARTITION BY CTE.RootId ORDER BY CTELevel DESC) AS Rn
FROM CTE
)
SELECT
id, RefID
FROM
Leafs
WHERE
rn = 1

select t1.ID, t1.RefID
from myTable t1 left join myTable t2 on t1.ID = t2.RefID
where t2.RefID is null

Try this:
SELECT *
FROM
my_table
WHERE
id NOT IN
(
SELECT DISTINCT
refId
FROM
my_table
WHERE
refId IS NOT NULL
)
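The anti-join answers can be tried against the sample data with Python's built-in sqlite3 module (a sketch for experimentation; SQLite is an assumption here, the question does not name a database). It runs the NOT EXISTS form and also shows why the NOT IN form needs its refId IS NOT NULL filter: without it, the NULLs in the subquery make the predicate unknown for every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, refid INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, None), (5, 2),
     (6, 5), (7, 5), (8, None), (9, 7)],
)

# A leaf is a row that no other row references.
leaf_ids = [r[0] for r in conn.execute(
    "SELECT id FROM mytable t "
    "WHERE NOT EXISTS (SELECT 1 FROM mytable c WHERE c.refid = t.id) "
    "ORDER BY id"
)]

# NOT IN with the NULL filter gives the same rows.
not_in_filtered = [r[0] for r in conn.execute(
    "SELECT id FROM mytable "
    "WHERE id NOT IN (SELECT refid FROM mytable WHERE refid IS NOT NULL) "
    "ORDER BY id"
)]

# NOT IN without the filter: the subquery contains NULL, so no row qualifies.
not_in_unfiltered = [r[0] for r in conn.execute(
    "SELECT id FROM mytable WHERE id NOT IN (SELECT refid FROM mytable) ORDER BY id"
)]

print(leaf_ids)           # [3, 4, 6, 8, 9]
print(not_in_filtered)    # [3, 4, 6, 8, 9]
print(not_in_unfiltered)  # []
```

NOT EXISTS is not affected by NULLs in refid, which is one reason it is often the safer way to write this kind of query.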
