SQL Query - GROUP BY, PARTITION BY

This is my first post and I am new to SQL.
I have a table like this:
H Amount Count ID
h1 2 1 x
h2 3 2 x
h3 5 3 x
h1 3 3 x
h1 1 5 y
h2 3 2 x
h3 1 1 x
h3 2 3 y
h2 5 5 y
and I want SUM(Amount*Count) of each H group based on ID, divided by the total SUM(Amount*Count) in that H group,
i.e.
H value ID
h1 11/16 x value = (2*1+3*3)/(2*1+3*3+1*5)
h1 5/16 y value = (1*5)/(2*1+3*3+1*5)
h2 12/37 x
h2 25/37 y
h3 16/22 x
h3 6/22 y
My aim is to group by H and then, on EACH GROUP, compute Sum(Amount*Count) Over(partition by ID) / Sum(Amount*Count),
but I am not able to write such a query. Can you please help me?
And sorry about the formatting.
Thanks

Try this:
SELECT t1.h, t1.value1 / t2.value2 AS value, t1.id
FROM
  (SELECT h, id, SUM(Amount * Count) AS value1
   FROM tbl
   GROUP BY h, id) AS t1,
  (SELECT h, SUM(Amount * Count) AS value2
   FROM tbl
   GROUP BY h) AS t2
WHERE t1.h = t2.h

The easy answer is to use an inner query like so:
SELECT H, ID,
       SUM(Amount * Count) * 1.0
       / (SELECT SUM(Amount * Count) FROM tbl AS t2 WHERE t2.H = t1.H) AS value
FROM tbl AS t1
GROUP BY H, ID
This essentially refers to the same table as both t1 and t2 in two different queries: the outer query sums per (H, ID), and the correlated subquery sums per H.
However, the specific database management system you're using (MySQL, Microsoft SQL Server, SQLite, whatever) may have a built-in function to handle this sort of thing. You should look into what your DBMS offers (or tag your question here with a specific platform).
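For instance, on platforms with window functions (SQL Server, PostgreSQL, Oracle, MySQL 8+), a single pass can do it. This is only a sketch, assuming the table is named tbl:
SELECT DISTINCT H, ID,
       SUM(Amount * Count) OVER (PARTITION BY H, ID) * 1.0
       / SUM(Amount * Count) OVER (PARTITION BY H) AS value
FROM tbl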

What you want to do is get the dividend value (sum of amount*count per h -> id group) and join the divisor value (sum of amount*count per h group) in another subselect:
SELECT
    a.h, a.id, a.dividend / b.divisor AS value
FROM
(
    SELECT h, id, SUM(amount*count) AS dividend
    FROM tbl
    GROUP BY h, id
) a
INNER JOIN
(
    SELECT h, SUM(amount*count) AS divisor
    FROM tbl
    GROUP BY h
) b ON a.h = b.h
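One caveat worth hedging on: if amount and count are integer columns (an assumption about your schema), dividend / divisor is integer division on several systems and truncates to 0. Multiplying by 1.0 first keeps the fraction:
SELECT a.h, a.id, a.dividend * 1.0 / b.divisor AS value
FROM (SELECT h, id, SUM(amount*count) AS dividend FROM tbl GROUP BY h, id) a
INNER JOIN (SELECT h, SUM(amount*count) AS divisor FROM tbl GROUP BY h) b ON a.h = b.h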


Calculate correlation between two words

Let's say I have a table in Postgres that stores a column of strings like this.
animal
cat/dog/bird
dog/lion
bird/dog
dog/cat
cat/bird
What I want to do, is calculate how "correlated" any two animals are to each other in this column, and store that as its own table so that I can easily look up how often "cat" and "dog" show up together.
For example, "cat" shows up a total of 3 times in all of these strings. Of those instances, "dog" shows up in the same string 2 out of the three times. Therefore, the correlation from cat -> dog would be 66%, and the number of co-occurrence instances (we'll call this instance_count) would be 2.
According to the above logic, the resulting table from this example would look like this.
base_animal  correlated_animal  instance_count  correlation
cat          cat                3               100
cat          dog                2               66
cat          bird               2               66
cat          lion               0               0
dog          dog                4               100
dog          cat                2               50
dog          bird               2               50
dog          lion               1               25
bird         bird               3               100
bird         cat                2               66
bird         dog                2               66
bird         lion               0               0
lion         lion               1               100
lion         cat                0               0
lion         dog                1               100
lion         bird               0               0
I've come up with a working solution in Python, but I have no idea how to do this easily in Postgres. Anybody have any ideas?
Edit:
Based off Erwin's answer, here's the same idea, except this answer doesn't make a record for animal combinations that never intersect.
with flat as (
select t.id, a
from (select row_number() over () as id, animal from animals) t,
unnest(string_to_array(t.animal, '/')) a
), ct as (select a, count(*) as ct from flat group by 1)
select
f1.a as b_animal,
f2.a as c_animal,
count(*) as instance_count,
round(count(*) * 100.0 / ct.ct, 0) as correlation
from flat f1
join flat f2 using(id)
join ct on f1.a = ct.a
group by f1.a, f2.a, ct.ct
Won't get much simpler or faster than this:
WITH flat AS (
SELECT t.id, a
FROM (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
, ct AS (SELECT a, count(*) AS ct FROM flat GROUP BY 1)
SELECT a AS base_animal
, b AS corr_animal
, COALESCE(xc.ct, 0) AS instance_count
, COALESCE(round(xc.ct * 100.0 / x.ct), 0) AS correlation
FROM (
SELECT a.a, b.a AS b, a.ct
FROM ct a, ct b
) x
LEFT JOIN (
SELECT f1.a, f2.a AS b, count(*) AS ct
FROM flat f1
JOIN flat f2 USING (id)
GROUP BY 1,2
) xc USING (a,b)
ORDER BY a, instance_count DESC;
db<>fiddle here
Produces your desired result, except for ...
- added consistent sort order
- rounded correctly
This assumes distinct animals per row in the source data. Else it's unclear how to count the same animal in the same row exactly ...
Step-by-step
CTE flat attaches an arbitrary row number as unique id. (If you have a PRIMARY KEY, use that instead and skip the subquery t; a sketch of that variant follows below.) Then unnest animals to get one pet per row (& id).
CTE ct gets the list of distinct animals & their total count.
The outer SELECT builds the complete raster of animal pairs (a / b) in subquery x, plus total count for a. LEFT JOIN to the actual pair count in subquery xc. Two steps are needed to keep pairs that never met in the result. Finally, compute and round the "correlation" smartly. See:
Look for percentage of characters in a word/phrase within a block of text
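For illustration, here is the first step in isolation as a minimal sketch, assuming the source table is named animals (as in the question's edit) and that it already has a primary key column id:
WITH flat AS (
   SELECT t.id, a
   FROM   animals t, unnest(string_to_array(t.animal, '/')) a
   )
SELECT * FROM flat;  -- one row per (id, animal) pair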
Updated task
If you don't need pairs that never met, nor pairing with self, this could be your query:
-- updated task excluding pairs that never met and same pairing with self
WITH flat AS (
SELECT t.id, a, count(*) OVER (PARTITION BY a) AS ct
FROM (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
SELECT f1.a AS base_animal
, f1.ct AS base_count
, f2.a AS corr_animal
, count(*) AS instance_count
, round(count(*) * 100.0 / f1.ct) AS correlation
FROM flat f1
JOIN flat f2 USING (id)
JOIN (SELECT a, count(*) AS ct FROM flat GROUP BY 1) ct ON ct.a = f1.a
WHERE f1.a <> f2.a -- exclude pairing with self
GROUP BY f1.a, f1.ct, f2.a
ORDER BY f1.a, instance_count DESC;
db<>fiddle here
I added the total occurrence count of the base animal as base_count.
Most notably, I dropped the additional CTE ct, and get the base_count from the first CTE with a window function. That's about the same cost by itself, but we then don't need another join in the outer query, which should be cheaper overall.
You can still use that if you include pairs with self. Check the fiddle.
Oh, and we don't need COALESCE any more.
The idea is to split the data into rows (using unnest(string_to_array())) and then cross-join the result with itself to get all permutations.
with data1 as (
    select * from corr_tab
), data2 as (
    select distinct un as base_animal, x.correlated_animal
    from corr_tab, unnest(string_to_array(animal,'/')) un,
         (select distinct un as correlated_animal
          from corr_tab, unnest(string_to_array(animal,'/')) un) x
)
select base_animal, correlated_animal,
    (case when data2.base_animal = data2.correlated_animal
          then (select count(*) from data1 where substring(animal, data2.base_animal) is not NULL)
          else (select count(*) from data1 where substring(animal, data2.base_animal) is not NULL
                  and substring(animal, data2.correlated_animal) is not NULL)
     end) instance_count,
    (case when data2.base_animal = data2.correlated_animal
          then 100
          else ceil(
               (select count(*) from data1 where substring(animal, data2.base_animal) is not NULL
                  and substring(animal, data2.correlated_animal) is not NULL) * 100 /
               (select count(*) from data1 where substring(animal, data2.base_animal) is not NULL))
     end) correlation
from data2
order by base_animal
Refer to fiddle here.

SQL Query get common column with diff values in other columns

I am not very fluent with SQL. I'm just facing a little issue in writing the best and most efficient SQL query. I have a table with a composite key of columns A and B, as shown below.
A B C
1 1 4
1 2 5
1 3 3
2 2 4
2 1 5
3 1 4
So what I need is to find the rows where column C has both values 4 and 5 (4 and 5 are just examples) for a particular value of column A. Here, 4 and 5 are both present for two A values (1 and 2). For A value 3, 4 is present but 5 is not, hence we cannot take it.
My explanation is so confusing. I hope you get it.
After this, I need to keep only those where the B value for 4 (the first number) is less than the B value for 5 (the second number). In this case, for A=1, row 1 (A=1, B=1, C=4) has a B value less than row 2 (A=1, B=2, C=5), so we take it. For A=2, row 1 (A=2, B=2, C=4) has a B value greater than row 2 (A=2, B=1, C=5), hence we cannot take it.
I hope someone gets it and helps. Thanks.
Find rows containing both c=4 and c=5 for a given a, where ordering by b and ordering by c agree:
select a, b, c
from (
select tbl.*,
count(*) over(partition by a) cnt,
row_number() over (partition by a order by b) brn,
row_number() over (partition by a order by c) crn
from tbl
where c in (4, 5)
) t
where cnt = 2 and brn = crn;
EDIT
If the order of the parameters matters, the position of each parameter must be set explicitly, comparing the b ordering to the explicit parameter position:
with params(val, pos) as (
select 4,2 union all
select 5,1
)
select a, b, c
from (
select tbl.*,
count(*) over(partition by a) cnt,
row_number() over (partition by a order by b) brn,
p.pos
from tbl
join params p on tbl.c = p.val
) t
where cnt = (select count(*) from params) and brn = pos;
I assume you want the values of a where this is true. If so, you can use aggregation:
select a
from t
where c in (4, 5)
group by a
having count(distinct c) = 2;
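If you also need the second condition from the question (the b of the c=4 row must be less than the b of the c=5 row), one way to fold it into the same aggregation is a conditional MIN. This is only a sketch and assumes at most one row per (a, c) pair:
select a
from t
where c in (4, 5)
group by a
having count(distinct c) = 2
   and min(case when c = 4 then b end) < min(case when c = 5 then b end);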

Use alias query as a table

I'm trying to get exclusive max values from a query.
My first query (raw data) is something like that:
Material¦Fornecedor
X B
X B
X B
X C
X C
Y B
Y D
Y D
Firstly, I need to create the max-values query for the table above. For that, I need to count sequentially the same values of Material AND Fornecedor. I mean, I need to count until SQL finds a line that shows a different material and fornecedor.
After that, I'll get a result as shown below (max_line is the number of times it found a line with the same material and fornecedor):
max_line¦Material¦Fornecedor
3 X B
2 X C
1 Y B
2 Y D
In the end, I need to get the highest max_line for each Material. The result of the query that I need to construct, based on the table above, should be like this:
max_line¦Material¦Fornecedor
3 X B
2 Y D
My code, so far, is showed below:
select * from
(SELECT max(w2.line) as max_line, w2.Material, w2.[fornecedor] FROM
(SELECT w.Material, ROW_NUMBER() OVER(PARTITION BY w.Material, w.[fornecedor]
ORDER BY w.[fornecedor] DESC) as line, w.[fornecedor]
FROM [Database].[dbo].['Table1'] w) as w2
group by w2.Material, w2.[fornecedor]) as w1
inner join (SELECT w1.Material, MAX(w1.max_line) AS maximo FROM w1 GROUP BY w1.material) as w3
ON w1.Material = w3.Material AND w1.row = w3.maximo
I'm stuck on the inner join, since I can't alias a query and then reuse it in an inner join.
Could you, please, help me?
Thank you,
Use a window function to find the max row number then filter by it.
SELECT MAXROW, w1.Material, w1.[fornecedor]
FROM (
SELECT w2.Material, w2.[fornecedor]
, max([ROW]) over (partition by Material) MAXROW
FROM (
SELECT w.Material, w.[fornecedor]
, ROW_NUMBER() OVER (PARTITION BY w.Material, w.[fornecedor] ORDER BY w.[fornecedor] DESC) as [ROW]
FROM [Database].[dbo].['Table1'] w
) AS w2
) AS w1
WHERE w1.[ROW] = w1.MAXROW;
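An alternative sketch: since max_line is just the number of rows per (Material, fornecedor) pair, a plain GROUP BY count with ROW_NUMBER() to keep the top pair per Material should also work (assuming ties may be broken arbitrarily):
SELECT max_line, Material, [fornecedor]
FROM (
    SELECT COUNT(*) AS max_line, Material, [fornecedor],
           ROW_NUMBER() OVER (PARTITION BY Material ORDER BY COUNT(*) DESC) AS rn
    FROM [Database].[dbo].['Table1']
    GROUP BY Material, [fornecedor]
) t
WHERE rn = 1;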

SQL: Removing Duplicates in one column while retaining the row with highest value in another column

I am using Teradata and am stuck trying to write some code... I would like to remove the rows in which columnB has a duplicate value, based on the values in ColumnA - if anyone can help me that would be great!
I have a sequential number in columnA and would like to retain the row with the highest value in columnA.
e.g. in the table below I would like to retain rows 9, 7, 6 & 2, because although they have a duplicate in column 2, they have the highest columnA value for that letter.
Table name: DataTable
Column1 Column2 Column3 Column4 Column5
1 B X X X
2 A Y Y Y
3 E Z Z Z
4 B X X X
5 C Y Y Y
6 E Z Z Z
7 C X X X
8 B Y Y Y
9 B Z Z Z
If you just want to select the rows, you can do:
select t.*
from t
where t.columnA = (select max(t2.columnA) from t t2 where t2.columnB = t.columnB);
If you actually want to remove them, then one method is:
delete from t
where t.columnA < (select max(t2.columnA) from t t2 where t2.columnB = t.columnB);
If you want to return those rows using a SELECT there's no need for a Correlated Subquery, OLAP-functions usually perform better:
select *
from tab
qualify
row_number() over (partition by ColumnB order by columnA DESC) = 1
If you actually want to DELETE the other rows go for Gordon's query.

Populate a sql table with duplicate data except for one column

I have a SQL table:
Levels
LevelId Min Product
1 x 1
2 y 1
3 z 1
4 a 1
I need to duplicate the same data in the database, changing only the Product id from 1 to 2, 3, ..., 40.
For example:
LevelId Min Product
1 x 2
2 y 2
3 z 2
4 a 2
I could do something like
INSERT INTO dbo.Levels SELECT TOP 4 * FROM dbo.Levels
but that would just copy-paste the data.
Is there a way I can copy the data and paste it changing only the Product value?
You're most of the way there - you just need to take one more logical step:
INSERT INTO dbo.Levels (LevelID, Min, Product)
SELECT LevelID, Min, 2 FROM dbo.Levels WHERE Product = 1
...will duplicate your rows with a different product ID.
Also consider that WHERE Product = 1 is going to be more reliable than TOP 4. Once you have more than four rows in the table, you will not be able to guarantee that TOP 4 will return the same four rows unless you also add an ORDER BY to the select, however WHERE Product = ... will always return the same rows, and will continue to work even if you add an extra row with a product ID of 1 (where as you'd have to consider changing TOP 4 to TOP 5, and so on if extra rows are added).
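For instance, a sketch of the TOP 4 variant with the ORDER BY that the caveat above requires (assuming LevelID identifies the original four rows):
INSERT INTO dbo.Levels (LevelID, Min, Product)
SELECT TOP 4 LevelID, Min, 2
FROM dbo.Levels
ORDER BY LevelID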
You can generate the product ids and then load them in:
with cte as (
    select 2 as n
    union all
    select n + 1
    from cte
    where n < 40
)
INSERT INTO dbo.Levels([Min], Product)
SELECT l.[Min], cte.n as Product
FROM dbo.Levels l cross join
     cte
where l.Product = 1;
This assumes that LevelId is an identity column that auto-increments on insert. If not:
with cte as (
    select 2 as n
    union all
    select n + 1
    from cte
    where n < 40
)
INSERT INTO dbo.Levels(LevelId, [Min], Product)
SELECT l.LevelId + (cte.n - 1) * 4, l.[Min], cte.n as Product
FROM dbo.Levels l cross join
     cte
where l.Product = 1;
INSERT INTO dbo.Levels (LevelId, Min, Product)
SELECT TOP 4
LevelId,
Min,
2
FROM dbo.Levels
You can include expressions in the SELECT statement, either hard-coded values or something like Product + 1 or anything else.
I expect you probably wouldn't want to insert the LevelId though, but left that there to match your sample. If you don't want that just remove it from the INSERT and SELECT sections.
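For instance, just a sketch of that idea (assuming the original rows have Product = 1 and LevelId is generated automatically, so it is left out):
INSERT INTO dbo.Levels (Min, Product)
SELECT Min, Product + 1
FROM dbo.Levels
WHERE Product = 1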
You could use a CROSS JOIN against a numbers table, for example.
WITH
L0 AS(SELECT 1 AS C UNION ALL SELECT 1 AS O), -- 2 rows
L1 AS(SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B), -- 4 rows
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N FROM L1)
SELECT
lvl.[LevelID],
lvl.[Min],
num.[N]
FROM dbo.[Levels] lvl
CROSS JOIN Nums num
This would duplicate 4 times.
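To cover the full 2..40 range from the question, the same doubling trick can be extended and filtered. A hedged sketch, assuming the original rows have Product = 1 and that LevelId is generated automatically so it can be omitted from the insert:
WITH
L0 AS (SELECT 1 AS C UNION ALL SELECT 1 AS C),               -- 2 rows
L1 AS (SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B),       -- 4 rows
L2 AS (SELECT 1 AS C FROM L1 AS A CROSS JOIN L1 AS B),       -- 16 rows
L3 AS (SELECT 1 AS C FROM L2 AS A CROSS JOIN L2 AS B),       -- 256 rows
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM L3)
INSERT INTO dbo.Levels ([Min], Product)
SELECT lvl.[Min], num.N
FROM dbo.Levels lvl
CROSS JOIN Nums num
WHERE lvl.Product = 1
  AND num.N BETWEEN 2 AND 40;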