How to query partitions that you get from using window functions? - sql

I have a table that has the following structure
------------------------------------
|Company_ID| Company_Name| Join_Key|
------------------------------------
|     1    | ACompany     | AC      |
|     2    | BCompany     | BC      |
While this table doesn't have many columns, there are somewhere around 4 million rows.
I want to run some string distance calculations on these company names. I have the following query:
select a.Company_Name as Name1,
b.Company_Name as Name2,
Fuzzy_Match(a.Company_Name, b.Company_Name, 'JaccardDistance') as Jaccard --this is a custom function
from [Companies] a, [Companies] b
Something like this would work on a smaller database, but mine is so large that there is no way to get through all of the combinations in a reasonable amount of time. So I thought about partitioning the data with a window function.
select Company_Name,
       ROW_NUMBER() over(partition by Join_Key order by Join_Key asc) as row_num,
       Join_Key
from [Companies]
This gives me a list of the companies, numbered and partitioned by their Join_Key, but what I'm not sure of is how to combine the two.
How can I perform a cross join and calculate the string similarity measures for each partition so that I'm only comparing companies that both have 'AC' as their join key?
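For reference, the kind of restricted comparison being asked about would look roughly like the sketch below: replace the unconstrained cross join with a self-join on Join_Key, and use Company_ID to skip self-pairs and mirrored duplicates. This is only an untested sketch reusing the table and the custom Fuzzy_Match function from the question, not a verified answer.
select a.Company_Name as Name1,
       b.Company_Name as Name2,
       Fuzzy_Match(a.Company_Name, b.Company_Name, 'JaccardDistance') as Jaccard
from [Companies] a
join [Companies] b
    on  b.Join_Key = a.Join_Key        -- only compare companies within the same partition, e.g. both 'AC'
    and b.Company_ID > a.Company_ID    -- skip self-pairs and mirrored pairs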

Related

How to aggregate json fields when using GROUP BY clause in postgres?

I have the following table structure in my Postgres DB (v12.0)
id | pieces | item_id | material_detail
---|--------|---------|-----------------
1  | 10     | 2       | [{"material_id":1,"pieces":10},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
2  | 20     | 2       | [{"material_id":1,"pieces":40}]
3  | 30     | 3       | [{"material_id":1,"pieces":20},{"material_id":3,"pieces":30}]
I am using a GROUP BY query for these records, like below:
SELECT SUM(PIECES) FROM detail_table GROUP BY item_id HAVING item_id =2
With that I get the total pieces as 30. But how could I get the total pieces from material_detail, grouped by material_id?
I want a result something like this:
pieces | material_detail
-------| ------------------
30 | [{"material_id":1,"pieces":50},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
As I am from MySQL background, I don't know how to achieve this with JSON fields in Postgres.
Note: material_detail column is of JSONB type.
You are aggregating on two different levels. I can't think of a solution that wouldn't need two separate aggregation steps. Additionally, to aggregate the material information, all arrays for the item_id have to be unnested first, before the actual pieces value can be aggregated for each material_id. That result then has to be aggregated back into a JSON array.
with pieces as (
    -- the basic aggregation for the "detail pieces"
    select dt.item_id, sum(dt.pieces) as pieces
    from detail_table dt
    where dt.item_id = 2
    group by dt.item_id
), details as (
    -- normalize the material information and aggregate the pieces per material_id
    select dt.item_id,
           (m.detail -> 'material_id')::int as material_id,
           sum((m.detail -> 'pieces')::int) as pieces
    from detail_table dt
      cross join lateral jsonb_array_elements(dt.material_detail) as m(detail)
    where dt.item_id in (select item_id from pieces) --<< don't aggregate too much
    group by dt.item_id, material_id
), material as (
    -- now de-normalize the material aggregation back into a single JSON array
    -- for each item_id
    select item_id, jsonb_agg(to_jsonb(d) - 'item_id') as material_detail
    from details d
    group by item_id
)
-- join both results together
select p.item_id, p.pieces, m.material_detail
from pieces p
  join material m on m.item_id = p.item_id
;

Get specific row from each group

My question is very similar to this, except I want to be able to filter by some criteria.
I have a table "DOCUMENT" which looks something like this:
|ID|CONFIG_ID|STATE     |MAJOR_REV|MODIFIED_ON|ELEMENT_ID|
+--+---------+----------+---------+-----------+----------+
| 1|1234     |Published |    2    |2019-04-03 |  98762   |
| 2|1234     |Draft     |    1    |2019-01-02 |  98762   |
| 3|5678     |Draft     |    3    |2019-01-02 |  24244   |
| 4|5678     |Published |    2    |2017-10-04 |  24244   |
| 5|5678     |Draft     |    1    |2015-05-04 |  24244   |
There are actually a few more columns, but I'm trying to keep this simple.
For each CONFIG_ID, I would like to select the latest (MAX(MAJOR_REV) or MAX(MODIFIED_ON)) - but I might want to filter by additional criteria, such as state (e.g., the latest published revision of a document) and/or date (the latest revision, published or not, as of a specific date; or: all documents where a revision was published/modified within a specific date interval).
To make things more interesting, there are some other tables I want to join in.
Here's what I have so far:
SELECT
allDocs.ID,
d.CONFIG_ID,
d.[STATE],
d.MAJOR_REV,
d.MODIFIED_ON,
d.ELEMENT_ID,
f.ID FILE_ID,
f.[FILENAME],
et.COLUMN1,
e.COLUMN2
FROM DOCUMENT allDocs -- Get all document revisions
CROSS APPLY ( -- Then for each config ID, only look at the latest revision
SELECT TOP 1
ID,
MODIFIED_ON,
CONFIG_ID,
MAJOR_REV,
ELEMENT_ID,
[STATE]
FROM DOCUMENT
WHERE CONFIG_ID=allDocs.CONFIG_ID
ORDER BY MAJOR_REV desc
) as d
LEFT OUTER JOIN ELEMENT e ON e.ID = d.ELEMENT_ID
LEFT OUTER JOIN ELEMENT_TYPE et ON e.ELEMENT_TYPE_ID=et.ID
LEFT OUTER JOIN TREE t ON t.NODE_ID = d.ELEMENT_ID
OUTER APPLY ( -- This is another optional 1:1 relation, but it's wrongfully implemented as m:n
SELECT TOP 1
FILE_ID
FROM DOCUMENT_FILE_RELATION
WHERE DOCUMENT_ID=d.ID
ORDER BY MODIFIED_ON DESC
) as df -- There should never be more than 1, but we're using TOP 1 just in case, to avoid duplicates
LEFT OUTER JOIN [FILE] f on f.ID=df.FILE_ID
WHERE
allDocs.CONFIG_ID = '5678' -- Just for testing purposes
and d.state ='Released' -- One possible filter criterion, there may be others
It looks like the results are correct, but multiple identical rows are returned.
My guess is that for documents with 4 revisions, the same values are found 4 times and returned.
A simple SELECT DISTINCT would solve this, but I'd prefer to fix my query.
This would be a classic row_number & partition by question I think.
;with rows as
(
    select <your-columns>,
           row_number() over (partition by config_id order by <whatever you want>) as rn
    from document
    join <anything else>
    where <whatever>
)
select * from rows where rn = 1
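Applied to the DOCUMENT table from the question, that pattern might look like the sketch below. It assumes the highest MAJOR_REV is the latest revision and uses the published-only filter purely as an example; the extra joins from the original query would go inside the CTE as well.
;with rows as
(
    select d.ID, d.CONFIG_ID, d.[STATE], d.MAJOR_REV, d.MODIFIED_ON, d.ELEMENT_ID,
           row_number() over (partition by d.CONFIG_ID order by d.MAJOR_REV desc) as rn
    from DOCUMENT d
    where d.[STATE] = 'Published'   -- optional filter criteria go here
)
select *
from rows
where rn = 1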

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row, as this is what I was instructed to do. (This column is going to be removed by a DBA after I do this.)
However, when I went to insert this data back into the original table I got a pk constraint error. (Shocking, I know.)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large - the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows, but when I use his solution 600 rows are returned. I checked the data and in fact there were duplicate pk's. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
       a.eec_planning_unit_cde,
       '11' as area,
       create_dte,
       create_by_emp_no,
       modify_dte,
       modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
    select emp_no, eec_planning_unit_cde
    from tempdb.guest.temp_part_time_evaluator
    group by emp_no, eec_planning_unit_cde
) a
    ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement, 400 rows are returned. If I run the whole query, 600 rows are returned. Isn't an inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
       eec_planning_unit_cde,
       '11' as area,
       create_dte,
       create_by_emp_no,
       modify_dte,
       modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator
where (emp_no, eec_planning_unit_cde) IN
(
    select emp_no, eec_planning_unit_cde
    from tempdb.guest.temp_part_time_evaluator
    group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for PostgreSQL. Doesn't T-SQL have something similar? Trying to google parentheses isn't working too well.
Overview of Questions:
Why doesn't an inner join return the intersection of two sets? From googling, this is what I thought it was supposed to do.
Is there another way to achieve the same method that I was trying in attempt 2 in T-SQL?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A SELECT DISTINCT is based on all selected columns, so it does not guarantee that the first two are distinct on their own:
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
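With the column names from the question, that would be something along these lines (just a sketch; MAX is an arbitrary way to pick one value per non-key column, since the question says any of the dates will do):
select emp_no,
       eec_planning_unit_cde,
       '11' as area,
       max(create_dte)       as create_dte,
       max(create_by_emp_no) as create_by_emp_no,
       max(modify_dte)       as modify_dte,
       max(modify_by_emp_no) as modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde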
You could TRY this:
SELECT a.emp_no,
       a.eec_planning_unit_cde,
       '11' as area,
       b.create_dte,
       b.create_by_emp_no,
       b.modify_dte,
       b.modify_by_emp_no
FROM
(
    SELECT emp_no, eec_planning_unit_cde
    FROM tempdb.guest.temp_part_time_evaluator
    GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
    ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That would give you a distinct on those fields, but if there are differences in the data between columns you might have to try a more brute-force approach.
SELECT a.emp_no,
       a.eec_planning_unit_cde,
       a.area,
       a.create_dte,
       a.create_by_emp_no,
       a.modify_dte,
       a.modify_by_emp_no
FROM
(
    SELECT ROW_NUMBER() OVER(PARTITION BY emp_no, eec_planning_unit_cde ORDER BY emp_no, eec_planning_unit_cde) rownumber,
           emp_no,
           eec_planning_unit_cde,
           '11' as area,
           create_dte,
           create_by_emp_no,
           modify_dte,
           modify_by_emp_no
    FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1
I'll reply one by one:
Why doesn't an inner join return the intersection of two sets? From googling, this is what I thought it was supposed to do.
An inner join doesn't do an intersection. Let's suppose these tables:
T1        T2
n  s      n  s
1  A      2  X
2  B      2  Y
2  C
3  D
If you join both tables on the numeric column you don't get the intersection (2 rows). You get:
select t1.*
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And, your second query approach:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in T-SQL?
Yes, this subquery:
select *
from t1
where exists (
    select 1
    from t2
    where t2.n = t1.n
);
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Yes, use @JTC's second query.

Select a row used for GROUP BY

I have this table:
id | owner | asset | rate
-------------------------
1  | 1     | 3     | 1
2  | 1     | 4     | 2
3  | 2     | 3     | 3
4  | 2     | 5     | 4
And I'm using
SELECT asset, max(rate)
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
ORDER BY max(rate) DESC
to get the intersection of assets for the specified owners, with the best rate.
I also need the id of the row used for max(rate), but I can't find a way to include it in the SELECT. Any ideas?
Edit:
I need
Find all assets that belong to both owners (1 and 2)
From the same asset I need only the one with the best rate (3)
I also need other columns (owner) that belong to the specific asset with the best rate
I expect the following output:
id | asset | rate
-------------------------
3  | 3     | 3
Oops, all 3s, but basically I need the id of the 3rd row to query the same table again, so the resulting output (after the second query) will be:
id | owner | asset | rate
-------------------------
3  | 2     | 3     | 3
Let's say it's Postgres, but I'd prefer a reasonably cross-DBMS solution.
Edit 2:
Guys, I know how to do this with JOINs. Sorry for the misleading question, but I need to know how to get the extra field from the existing query. I already have the needed assets and rates selected; I just need one extra field along with max(rate) and the given conditions, if possible.
Another solution that might or might not be faster than a self join (depending on the DBMS' optimizer)
SELECT id,
asset,
rate,
asset_count
FROM (
SELECT id,
asset,
rate,
rank() over (partition by asset order by rate desc) as rank_rate,
count(asset) over (partition by asset) as asset_count
FROM test
WHERE owner IN (1, 2)
) t
WHERE rank_rate = 1
ORDER BY rate DESC
You are dealing with two questions and trying to solve them as if they are one. With a subquery, you can better refine by filtering the list in the proper order first (max(rate)), but as soon as you group, you lose this. As such, I would set up two queries (the same procedure, if you are using procedures, but two queries) and ask the questions separately. Unless ... you need some of the information in a single grid of output.
I guess the better direction to head is to have you show how you want the output to look. Once you bake the input and the output, the middle of the Oreo is easier to fill.
SELECT b.id, b.asset, b.rate
from
(
SELECT asset, max(rate) maxrate
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
) a, test b
WHERE a.asset = b.asset
AND a.maxrate = b.rate
ORDER BY b.rate DESC
You don't specify what type of database you're running on, but if you have analytical functions available you can do this:
select id, asset, max_rate
from (
select ID, asset, max(rate) over (partition by asset) max_rate,
row_number() over (partition by asset order by rate desc) row_num
from test
where owner in (1,2)
) q
where row_num = 1
I'm not sure how to add in the "having count(asset) > 1" in this way though.
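One possible way to fold that condition in (an untested sketch, not part of the original answer) is a second window function that counts the rows per asset and is then filtered in the outer query:
select id, asset, max_rate
from (
    select id, asset,
           max(rate) over (partition by asset) max_rate,
           row_number() over (partition by asset order by rate desc) row_num,
           count(*) over (partition by asset) asset_count   -- mirrors count(asset) per group
    from test
    where owner in (1, 2)
) q
where row_num = 1
  and asset_count > 1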
This first searches for rows with the maximum rate per asset. Then it takes the highest id per asset, and selects that:
select *
from test
inner join
(
    select max(id) as MaxIdWithMaxRate
    from test
    inner join
    (
        select asset, max(rate) as MaxRate
        from test
        group by asset
    ) filter
        on filter.asset = test.asset
        and filter.MaxRate = test.rate
    group by test.asset
) filter2
    on filter2.MaxIdWithMaxRate = test.id
If multiple assets share the maximum rate, this will display the one with the highest id.

How to Select and Order By columns not in Group By SQL statement - Oracle

I have the following statement:
SELECT
IMPORTID,Region,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,RefObligor
There are some extra columns in the Positions table that I want as output for "display data" but don't want in the GROUP BY statement.
These are Site and Desk.
Final output would have the following columns:
IMPORTID,Region,Site,Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
Ideally I'd want the data sorted like:
Order BY
IMPORTID,Region,Site,Desk,RefObligor
How to achieve this?
It does not make sense to include columns that are not part of the GROUP BY clause: if you have MIN(X) and MAX(Y) in the SELECT clause, which row should the other (non-grouped) columns come from?
If your Oracle version is recent enough, you can use SUM(...) OVER (...) to show the grouped SUM against every data row.
SELECT
IMPORTID,Site,Desk,Region,RefObligor,
SUM(NOTIONAL) OVER(PARTITION BY IMPORTID, Region,RefObligor) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
Order BY
IMPORTID,Region,Site,Desk,RefObligor
Alternatively, you need to make an aggregate out of the Site, Desk columns
SELECT
IMPORTID,Region,Min(Site) Site, Min(Desk) Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,Min(Site),Min(Desk),RefObligor
I believe this is
select
IMPORTID,
Region,
Site,
Desk,
RefObligor,
Sum(Sum(Notional)) over (partition by IMPORTID, Region, RefObligor)
from
Positions
group by
IMPORTID, Region, Site, Desk, RefObligor
order by
IMPORTID, Region, RefObligor, Site, Desk;
... but it's hard to tell without further information and/or test data.
A great blog post that covers this dilemma in detail is here:
http://bernardoamc.github.io/sql/2015/05/04/group-by-non-aggregate-columns/
Here are some snippets of it:
Given:
CREATE TABLE games (
game_id serial PRIMARY KEY,
name VARCHAR,
price BIGINT,
released_at DATE,
publisher TEXT
);
INSERT INTO games (name, price, released_at, publisher) VALUES
('Metal Slug Defense', 30, '2015-05-01', 'SNK Playmore'),
('Project Druid', 20, '2015-05-01', 'shortcircuit'),
('Chroma Squad', 40, '2015-04-30', 'Behold Studios'),
('Soul Locus', 30, '2015-04-30', 'Fat Loot Games'),
('Subterrain', 40, '2015-04-30', 'Pixellore');
SELECT * FROM games;
 game_id |        name        | price | released_at |   publisher
---------+--------------------+-------+-------------+----------------
       1 | Metal Slug Defense |    30 | 2015-05-01  | SNK Playmore
       2 | Project Druid      |    20 | 2015-05-01  | shortcircuit
       3 | Chroma Squad       |    40 | 2015-04-30  | Behold Studios
       4 | Soul Locus         |    30 | 2015-04-30  | Fat Loot Games
       5 | Subterrain         |    40 | 2015-04-30  | Pixellore
(5 rows)
Trying to get something like this:
SELECT released_at, name, publisher, MAX(price) as most_expensive
FROM games
GROUP BY released_at;
But name and publisher can't simply be added, because they are ambiguous when aggregating...
Let’s make this clear:
Selecting the MAX(price) does not select the entire row.
The database can’t know and when it can’t give the right answer every
time for a given query it should give us an error, and that’s what it
does!
Ok… Ok… It’s not so simple, what can we do?
Use an inner join to get the additional columns
SELECT g1.name, g1.publisher, g1.price, g1.released_at
FROM games AS g1
INNER JOIN (
SELECT released_at, MAX(price) as price
FROM games
GROUP BY released_at
) AS g2
ON g2.released_at = g1.released_at AND g2.price = g1.price;
Or use a left outer join to get the additional columns, and then filter by the NULL of the duplicate column...
SELECT g1.name, g1.publisher, g1.price, g2.price, g1.released_at
FROM games AS g1
LEFT OUTER JOIN games AS g2
ON g1.released_at = g2.released_at AND g1.price < g2.price
WHERE g2.price IS NULL;
Hope that helps.