SQL query to identify patterns in Parent-Child relationship - sql

I am working with an Oracle database in a financial institution. It has a Credit Facility table and a Loan table, in a parent-child relationship (1-many).
As part of a project, they added a new field called Type Code to both of these tables (using some complex logic involving the values of a bunch of other fields). I assumed that in the vast majority of cases the parent Credit Facility and all of the children Loans would be assigned the same Type Code. But it turns out that there are hundreds of thousands of cases where the Credit Facility and the Loan have different Type Codes, and all those cases have to be "handled" somehow. I was able to make a query to generate a list of all the Credit Facilities and related Loans where 1 or more Loans have a different Type Code from its parent, and the result was 600K records.
Results look like this (Simplified)
Now I want to break it down into patterns, hopefully using 1 field that I can group by. The field should have values like below:
The Pattern field should always generate the same value based on the parent's Type Code and the unique values of the children. I don't care how many child Loans there are of each type or what order they come in.
Any ideas for how to generate this PATTERN field in a SQL query? I could also do it in Excel in a pinch, but not even sure how to do that, short of writing VBA code, which is my last resort.
Thank you!

The following will produce the results you want, assuming you rename your TYPE fields so they're distinct from one another - one should probably be FACILITY_TYPE and the other should apparently be LOAN_TYPE:
WITH cteDistinct_facility_loan_types
AS (SELECT DISTINCT FACILITY_NUM,
LOAN_TYPE
FROM YOUR_TABLE),
cteFacility_loan_types
AS (SELECT FACILITY_NUM,
LISTAGG(LOAN_TYPE, ',')
WITHIN GROUP (ORDER BY LOAN_TYPE) AS FACILITY_LOAN_TYPES
FROM cteDistinct_facility_loan_types
GROUP BY FACILITY_NUM)
SELECT t.*,
t.FACILITY_TYPE || '-' || flt.FACILITY_LOAN_TYPES AS PATTERN
FROM YOUR_TABLE t
INNER JOIN cteFacility_loan_types flt
ON flt.FACILITY_NUM = t.FACILITY_NUM
ORDER BY t.FACILITY_NUM
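To get from a per-row PATTERN to the breakdown the question asks for, one option is to compute the pattern once per facility and then count facilities per pattern. The following is only a sketch: the inline sample rows are made up, FACILITY_TYPE and LOAN_TYPE are the renamed columns assumed above, and the column-list form of WITH needs Oracle 11gR2 or later.

```sql
-- Hypothetical sample data standing in for YOUR_TABLE.
WITH your_table (facility_num, facility_type, loan_num, loan_type) AS (
  SELECT 1, 'A', 101, 'B' FROM dual UNION ALL
  SELECT 1, 'A', 102, 'A' FROM dual UNION ALL
  SELECT 1, 'A', 103, 'B' FROM dual UNION ALL
  SELECT 2, 'A', 201, 'A' FROM dual UNION ALL
  SELECT 2, 'A', 202, 'B' FROM dual UNION ALL
  SELECT 3, 'C', 301, 'D' FROM dual
),
-- One row per facility and distinct loan type.
distinct_types AS (
  SELECT DISTINCT facility_num, facility_type, loan_type
  FROM your_table
),
-- One pattern per facility; sorting by loan_type makes the string
-- independent of loan order and of duplicate loan types.
facility_patterns AS (
  SELECT facility_num,
         facility_type || '-' ||
           LISTAGG(loan_type, ',') WITHIN GROUP (ORDER BY loan_type) AS pattern
  FROM distinct_types
  GROUP BY facility_num, facility_type
)
SELECT pattern, COUNT(*) AS facility_count
FROM facility_patterns
GROUP BY pattern
ORDER BY facility_count DESC, pattern;
```

With these sample rows, facilities 1 and 2 both collapse to the pattern A-A,B (count 2) even though their loans come in a different order, and facility 3 gives C-D (count 1).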

If you wanted to look at the patterns for the facilities, then you can use listagg() as an analytic function:
select t.*,
(t.facility_type || '-' ||
listagg(t.loan_type, ',') within group (order by loan_type) over (partition by facility_num)
) as pattern
from t;
These results are a bit different from what you have. They start with the facility type and they include all loans, including duplicates.
If you want the distinct loan_types, then use a subquery:
select t.*,
(t.facility_type || '-' ||
listagg(case when seqnum = 1 then t.loan_type end, ',') within group
(order by loan_type) over
(partition by facility_num)
) as pattern
from (select t.*,
row_number() over (partition by facility_num, loan_type order by loan_num) as seqnum
from t
) t

Related

How to select 1 row per id?

I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
This solution looks neater, but I can't get it to work (still returns multiple rows per single order_id). What am I doing wrong?
If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row and that is usually not desirable. You can tweak this using keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
Also, saying that you
can't get it to work (still returns multiple rows per single order_id)
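is kind of difficult to believe. The way you put it, it can't be true: a query that aggregates every other column with MAX and groups by order_id alone returns exactly one row per order_id.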
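The "mixed values" effect described in the answers above is easy to see on two rows. This is a minimal sketch in Oracle syntax; the table name t, the column names and the two sample rows are invented for illustration.

```sql
-- Hypothetical duplicate rows for one order: the newer row has the
-- alphabetically smaller address.
WITH t (order_id, ship_address, created_at) AS (
  SELECT 1, 'Zeta Street 5',  DATE '2020-01-01' FROM dual UNION ALL
  SELECT 1, 'Alpha Street 5', DATE '2021-06-01' FROM dual
)
SELECT order_id,
       -- Plain MAX picks the greatest string, here from the older row.
       MAX(ship_address) AS max_address,
       -- KEEP restricts MAX to the row(s) with the latest created_at.
       MAX(ship_address) KEEP (DENSE_RANK FIRST ORDER BY created_at DESC) AS latest_address,
       MAX(created_at) AS latest_created_at
FROM t
GROUP BY order_id;
```

The single result row pairs max_address = 'Zeta Street 5' with latest_created_at = 2021-06-01, values that never occurred together in the data, while latest_address is 'Alpha Street 5'.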

SQL - select specific results

I have a table containing 2 columns for example. First column has unique values and the second column duplicates. Is there any way for me to select the first unique value only from the first column in relation to the second column?
For example: The results should get: Apple, Tire, and Fork only since they are the first results of the second column (category)
Details | Category
Apple | Fruits
Banana | Fruits
Tire | Car
Engine | Car
Fork | Silverware
Spoon | Silverware
Knife | Silverware
Usually we can use windowing functions like ROW_NUMBER() to simplify these types of queries. However, your requested record set does not have a natural sort order that would produce the output you are expecting.
The following is a simple solution that uses ROW_NUMBER(); it sorts by Details, so it will not return exactly what you requested:
SELECT Category, Details
FROM
(
SELECT Category, Details, row_number() over (partition by category order by details) as rn
FROM SpecificResults
) as numberedRecords
WHERE rn = 1;
Results:
Category | Details
Car | Engine
Fruits | Apple
Silverware | Fork
You requested an output of: Apple, Tire, and Fork
The next query might produce the expected output because we do not specify the sort. For the same reason the output is non-deterministic: we cannot guarantee it, and the result might change as database internals change over time, or even between immediately repeated runs of the same query.
There are many discussions on non-deterministic queries in SQL; have a read through this thread on SO: The order of a SQL Select statement without Order By clause
SELECT Category, details.Details
FROM SpecificResults byCategory
CROSS APPLY (
SELECT TOP 1 Details
FROM SpecificResults lookup
WHERE lookup.Category = byCategory.Category
--ORDER BY Details
) as details
GROUP BY Category, details.Details;
Results in:
Category | Details
Car | Tire
Fruits | Apple
Silverware | Fork
I have setup a SQL Fiddle for you to explore this further: http://sqlfiddle.com/#!18/68530/12
Real World Solution
In the real world, your dataset will have a primary key, and in many cases that key is an incrementing value. If not, there may be other columns that could be used to determine a sort order that matches your expected results.
Assuming that your dataset has an integer column called Id and that column is an Identity column, then a simple change to the original query using ROW_NUMBER() will achieve the desired result:
SELECT Category, Details
FROM
(
SELECT Category, Details, row_number() over (partition by category order by Id) as rn
FROM OrderedResults
) as numberedRecords
WHERE rn = 1;
I have updated the SQL Fiddle with this variation: http://sqlfiddle.com/#!18/3f7bd/2
If there is a Created date or some other Timestamp or DateTime based column in your recordset then you could consider those as candidates for your ORDER BY clause.
SQL tables represent unordered sets. There is no "first" value unless a column specifies the ordering. If you have such a column, then you can use row_number():
select t.*
from (select t.*,
row_number() over (partition by category order by <ordering col>) as seqnum
from t
) t
where seqnum = 1;
If you don't have such a column, then you simply cannot ask such a question in a relational database. The data doesn't support the question.
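As a sketch of the ordering-column point, the query below attaches a hypothetical id column (numbering the rows in the order the question lists them) to the sample data and uses it as the ordering column. The id values are an assumption, not part of the original table. The VALUES table constructor is used only to inline the rows; it works in SQL Server and PostgreSQL, and other databases need a different way to supply them.

```sql
SELECT category, details
FROM (
  SELECT v.*,
         -- Number the rows inside each category by the assumed id.
         ROW_NUMBER() OVER (PARTITION BY category ORDER BY id) AS seqnum
  FROM (VALUES
          (1, 'Apple',  'Fruits'),
          (2, 'Banana', 'Fruits'),
          (3, 'Tire',   'Car'),
          (4, 'Engine', 'Car'),
          (5, 'Fork',   'Silverware'),
          (6, 'Spoon',  'Silverware'),
          (7, 'Knife',  'Silverware')
       ) AS v (id, details, category)
) numbered
WHERE seqnum = 1   -- keep only the first row of each category
ORDER BY id;
```

With the ids assigned this way the query returns Apple, Tire and Fork, and it does so deterministically because the ordering is now stated in the data.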
If I understand it correctly, try this -
select category, details from (select t.*, row_number() over (partition by category order by details) as rn from tablename t) numbered where rn = 1

How to get the most frequent value SQL

I have a table Orders(id_trip, id_order), table Trip(id_hotel, id_bus, id_type_of_trip) and table Hotel(id_hotel, name).
I would like to get name of the most frequent hotel in table Orders.
SELECT hotel.name from Orders
JOIN Trip
on Orders.id_trip = Trip.id_hotel
JOIN hotel
on trip.id_hotel = hotel.id_hotel
FROM (SELECT hotel.name, rank() over (order by cnt desc) rnk
FROM (SELECT hotel.name, count(*) cnt
FROM Orders
GROUP BY hotel.name))
WHERE rnk = 1;
The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm
For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30 - the number of the department with the most employees. (30 is the department "name" or number, it is NOT the number of employees in that department!)
In your case:
select stats_mode(h.name) from (the rest of your query)
Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution - a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
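If all tied values are needed, one possible approach is to rank the per-hotel counts and keep every row with rank 1. The sketch below uses Oracle syntax and an invented hotel_orders row source (one row per order, carrying the hotel name) in place of the real joins.

```sql
-- Hypothetical data: Astoria and Plaza are tied with two orders each.
WITH hotel_orders (name) AS (
  SELECT 'Astoria' FROM dual UNION ALL
  SELECT 'Astoria' FROM dual UNION ALL
  SELECT 'Plaza'   FROM dual UNION ALL
  SELECT 'Plaza'   FROM dual UNION ALL
  SELECT 'Ritz'    FROM dual
)
SELECT name, cnt
FROM (
  SELECT name,
         COUNT(*) AS cnt,
         -- RANK gives every hotel with the top count the value 1.
         RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk
  FROM hotel_orders
  GROUP BY name
)
WHERE rnk = 1
ORDER BY name;
```

Here both Astoria and Plaza are returned with cnt = 2, whereas STATS_MODE(name) would return only one of them.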
Use FIRST for a single result:
SELECT MAX(t.name) KEEP (DENSE_RANK FIRST ORDER BY cnt DESC)
FROM (
SELECT hotel.name, COUNT(*) cnt
FROM orders
JOIN trip USING (id_trip)
JOIN hotel USING (id_hotel)
GROUP BY hotel.name
) t
Here is one method:
select name
from (select h.name,
row_number() over (order by count(*) desc) as seqnum -- use `rank()` if you want duplicates
from orders o join
trip t
on o.id_trip = t.id_trip join -- this seems like the right join condition
hotel h
on t.id_hotel = h.id_hotel
group by h.name -- count(*) is computed per hotel
) oth
where seqnum = 1;
** Getting the most recent statistical mode out of a data sample **
I know it's more than a year, but here's my answer. I came across this question hoping to find a simpler solution than what I know, but alas, nope.
I had a similar situation where I needed to get the mode from a data sample, with the requirement to get the mode of the most recently inserted value if there were multiple modes.
In such a case neither the STATS_MODE nor the LAST aggregate functions would do (as they would tend to return the first mode found, not necessarily the mode with the most recent entries.)
In my case it was easy to use the ROWNUM pseudo-column because the tables in question were performance metric tables that only experienced inserts (not updates)
In this oversimplified example, I'm using ROWNUM - it could easily be changed to a timestamp or sequence field if you have one.
SELECT VALUE
FROM
(SELECT VALUE ,
COUNT( * ) CNT,
MAX( R ) R
FROM
( SELECT VALUE, ROWNUM R FROM FOO
)
GROUP BY VALUE
ORDER BY CNT DESC,
R DESC
)
WHERE
(
ROWNUM < 2
);
That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)
Then sort so that the ones with largest counts come first, and for those with the same count, the one with the largest ROWNUM (indicating most recent insertion in my case).
Then skim off the top row.
Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.
If this doesn't work for your specific case, you'll have to create your own custom aggregator.
Now, if you don't care which mode Oracle is going to pick (your bizness case just requires a mode and that's it), then STATS_MODE will do fine.

PostgreSQL: get the max values from a query

I need to get the max values from a list of values obtained from a query.
Basically, the problem is this:
I have 2 tables:
Lawyer: id (PK), surname, name
Case: id (PK), id_Client, date, id_Lawyer (FK)
I need to get the Lawyer with the largest number of cases (there is no problem with that), but if more than one lawyer has the largest number of cases, I should list all of them.
Any help on this would be appreciated.
SELECT l.*, cases
FROM (
SELECT "id_Lawyer", count(*) AS cases, rank() OVER (ORDER BY count(*) DESC) AS rnk
FROM "Case"
GROUP BY 1
) c
JOIN "Lawyer" l ON l.id = c."id_Lawyer"
WHERE c.rnk = 1;
Basics for the technique (like @FuzzyTree provided):
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?
You only need a single subquery level since you can run window functions over aggregate functions:
Get the distinct sum of a joined table column
Best way to get result count before LIMIT was applied
Aside: It's better to use legal, lower case, unquoted identifiers in Postgres. Never use a reserved word like Case, that can lead to very confusing errors.

Find row number in a sort based on row id, then find its neighbours

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number supported in Firebird 3.0)? I am by no way an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?
Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
To find the "row" number of a row with a given id:
select rownum
from (select id, row_number() over (order by name, id) as rownum
from t
) numbered
where id = <id> -- filter after numbering; filtering first would always give 1
This assumes the id's are unique.
To solve the second problem:
select t.*
from (select id, row_number() over (order by name, id) as rownum
from t
) t join
(select rownum
from (select id, row_number() over (order by name, id) as rownum
from t
) numbered
where id = <id> -- filter only after all rows have been numbered
) tid
on t.rownum between tid.rownum - 5 and tid.rownum + 5
order by t.rownum
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column when a row is inserted. If your records are never deleted, this can serve as your counter, simplifying your queries.