Efficient approach to get two-dimensional datau using - sql

For the sake of example, let's say I have the following models:
teams
each team has an arbitrary amount of fans
In SQL, this means you end up with the following tables:
team: identifier, name
fan: identifier, name
team_fan: team_identifier, fan_identifier
I am looking for an approach to retrieve:
all teams, and
for each team, the first 5 fans of which his/her name starts with an 'A'.
What is an efficient approach to do this?
In my current naive approach, I do <# teams> + 1 queries, which is troublesome:
First: SELECT * FROM team
Then, for each team with identifier X:
SELECT *
FROM fan
INNER JOIN team_fan
ON fan.identifier = team_fan.fan_identifier AND team_fan.team_identifier = X
WHERE fan.name LIKE 'A%'
ORDER BY fan.name LIMIT 5
There should be a better way to do this.
I could first retrieve all teams, as I do now, and then do something like:
SELECT *
FROM fan
WHERE fan.name LIKE 'A%'
AND fan.identifier IN (
SELECT fan_identifier
FROM team_fan
WHERE team_identifier IN (<all team identifiers from first query>))
ORDER BY fan.name
However, this approach ignores the requirement that I need the first 5 fans for each team with his/her name starting with an 'A'. Just adding LIMIT 5 to the query above is not correct.
Also, with this approach, if I have a large amount of teams, I am sending the corresponding team identifiers back to the database in the second query (for the IN (<all team identifiers from first query>)), which might kill performance?
I am developing against PostgreSQL, Java, Spring and plain JDBC.

You need a three table join
SELECT team.*, fan.*
FROM team
JOIN team_fan
ON team.team_identifier = team_fan.team_identifier
JOIN fan
ON fan.fan_identifier = team_fan.fan_identifier
Now to filter you need to do this.
with cte as (
SELECT team.*, fan.*,
row_number() over (partition by team.team_identifier
order by fan.name) as rn
FROM team
JOIN team_fan
ON team.team_identifier = team_fan.team_identifier
JOIN fan
ON fan.fan_identifier = team_fan.fan_identifier
WHERE fan.name LIKE 'A%'
)
SELECT *
FROM cte
WHERE rn <= 5

Usually, RDBMSes have their own hacks around standard SQL that allows you to have a number in a count over some condition of grouping/ordering.
Postgres is no exception, it got ROW_NUMBER() function.
What you need is to partition your row numbers properly, order them by alphabet and restrict the query to row numbers < 6.

Related

Oracle - Join multiple columns trying different combinations

I'll try to explain my problem:
I need to find the most efficient way to join two table on 4 columns, but data is really crappy so there could be cases where I can join only on 3 or 2 columns because the fourth and/or third were stored badly (with spaces, zeros, dashes,...)
I should try to achieve something like this:
select * from table a
join table b
on a.key1=b.key1
and a.key2=b.key2
or a.key3=b.key3
or a.key4=b.key4```
I already performed some data quality but the number of records is really high (table a is 300k records and table b is about 25M records).
I know that the example I provided is not efficient and it would be better making separate joins and then "union" them, but I'm asking you if there could be some better way to do it.
Thanks in advance
You haven't explained your problem very well, so let's create an example:
There is a table of clients and a table of orders. Both are not related via keys, because both are imported from different systems. Your task is now to find the client per order.
Both tables contain the client's last name, first name, city, and a client number. However, these columns are optional in the order table (but either last name or client number are always given). And sometimes a first name or city may be abbreviated or misspelled (e.g. J./James, NY/New York, Cris/Chris).
So, if the order contains a client number, we have a match and are done. Otherwise the last name must match. In the latter case we look at first name and city, too. Do both match? Only one? Neither?
We use RANK to rank the clients per order and pick the best matches. Some orders will end up with exactly one match, others will have ties and we must examine the data manually then (the worst case being no client number and no last name match because of a misspelled name).
select *
from
(
select
o.*,
c.*,
rank() over
(
partition by o.order_number
order by
case
when c.client_number = o.client_number then 1
when c.last_name = o.last_name and c.first_name = o.first_name and c.city = o.city then 2
when c.last_name = o.last_name and (c.first_name = o.first_name or c.city = o.city) then 3
when c.last_name = o.last_name then 4
else 5
end
) as rnk
from orders o
left join clients c on c.client_number = o.client_number or c.last_name = o.last_name
) ranked
where rnk = 1
order by order_number;
I hope this gets you an idea how to write such a query and you will be able to adapt this concept to your case.

SQL Join query brings multiple results

I have 2 tables. One lists all the goals scored in the English Premier League and who scored it and the other, the squad numbers of each player in the league.
I want to do a join so that the table sums the total number of goals by player name, and then looks up the squad number of that player.
Table A [goal_scorer]
[]1
Table B [squads]
[]2
I have the SQL query below:
SELECT goal_scorer.*,sum(goal_scorer.number),squads.squad_number
FROM goal_scorer
Inner join squads on goal_scorer.name=squads.player
group by goal_scorer.name
The issue I have is that in the result, the sum of 'number' is too high and seems to include duplicate rows. For example, Aaron Lennon has scored 33 times, not 264 as shown below.
Maybe you want something like this?
SELECT goal_scorer.*, s.total, squads.squad_number
FROM goal_scorer
LEFT JOIN (
SELECT name, sum(number) as total
FROM goal_scorer
GROUP BY name
) s on s.name = goal_scorer.name
JOIN squads on goal_scorer.name=squads.player
There are other ways to do it, but here I'm using a sub-query to get the total by player. NB: Most modern SQL platforms support windowing functions to do this too.
Also, probably don't need the left on the sub-query (since we know there will always be at least one name), but I put it in case your actual use case is more complicated.
Can you try this if you are using sql-server?
select *
from squads
outer apply(
selecr sum(goal_scorer.number) as score
from goal_scorer where goal_scorer.name=squads.player
)x

Efficient approach for fetching many-to-many relations

I have the following tables:
team: identifier, name
fan: identifier, name
team_fan: team_identifier, fan_identifier
In other words, there is a many-to-many relation between team and fan.
I want to fetch all teams for which a certain condition is met; and for each selected team, I want to fetch all its fans. So, in my application, I want to have the following data-structures:
Team A
Fan F1
Fan F2
Team B
Fan F1
Fan F3
Team C
Fan F2
Fan F3
Fan F4
I already came up with the following solutions:
[0] default, typical approach
The default, typical approach is the inner join:
select team.name, fan.name
from team
inner join team_fan
on team.identifier = team_fan.team_identifier
inner join fan
on team_fan.fan_identifier = fan.identifier
where ... (team conditions)
This provides all the required information to construct the data-structures as demonstrated above.
There a lot of teams and fans can belong to multiple teams. The query above might not be a good idea, because teams and fans are duplicated in the result. All these duplicates need to be transmitted over the wire.
In the alternatives below, I am doing the JOIN in the application. The alternatives below might be slower, but I don't know yet. I want to compare and learn from this.
[1] very naive approach
First, we select all teams:
select name from team where ...
Then, for each team with identifier X, we select its fans:
select name
from fan
where exists(select 1 from team_fan where team_identifier = X)
This is a bad solution, because the number of required queries is 1 + number of teams. Also, a fan belonging to multiple teams is fetched multiple times. We can do better.
[2] top-down approach
First, we select all teams. While doing this, we also collect in an array all fans belonging to the team:
select name, array(select identifier
from fan
where exists(select 1 from team_fan where fan.identifier = team_fan.fan_identifier and team.identifier = team_fan.team_identifier)) as fans
from team
where ...
Then, in our application, we construct the union of all fan identifiers. Given this set of fan identifiers, we can select all fans:
select name from fan where identifier in(...)
Now, I have enough information to replicate the JOIN in my application and construct the data-structures as demonstrated above.
This seems like a better solution. The number of queries is always 2. Also, each team and each fan is only fetched once.
[3] bottom-up approach
I called the previous solution top-down because we are adding an array of children (fan) to the parent (team). In this approach, we do the inverse: we are adding an array of parents (team) to the child (fan).
So, first, let's just select all teams:
select name from team where ...
Next, in our application, we construct the union of all team identifiers. Given this set of team identifier, we can select all fans:
select name, array(select team_fan.team_identifier from team_fan where fan_identifier = fan.identifier and team_identifier in(...))
from fan
where exists(select 1 from team_fan where fan_identifier = fan.identifier and team_identifier in(...));
Now, I have enough information to replicate the JOIN in my application and construct the data-structures as demonstrated above.
This seems also like a valid solution. Also in this case, the number of queries is always 2. Also, each team and each fan is only fetched once.
My question
So, back to my question: I want to fetch all teams for which a certain condition is met; and for each selected team, I want to fetch all its fans.
Currently, I am unsure if approach 2 is better than approach 3 (or vice versa), or even, if there are better approaches for this. Any insights are welcome.
Do a simple join
Select
t.identitfier team_identifier
,t.name team_name
,f.identitfier fan_identifier
,f.name fan_name
From team t
inner join team_fan tf
on t.identifier=tf.team_identifier
/* and --(team condition can be put here) */
inner join fan f on tf.fan_identifier=f.identifier
/*where ... --(or team condition can be put here)*/
I recommend modifying option 2 and removing the fan table altogether.
Assuming there are fewer teams than fans this approach will return fewer rows to your application and will likely be more efficient as the array function will not need to execute on as many rows as the alternative.
SELECT
name,
array(
SELECT DISTINCT
fan_identifier
FROM team_fan
WHERE team.identifier = team_fan.team_identifier
) as fans
FROM team
WHERE ...

How to get the most frequent value SQL

I have a table Orders(id_trip, id_order), table Trip(id_hotel, id_bus, id_type_of_trip) and table Hotel(id_hotel, name).
I would like to get name of the most frequent hotel in table Orders.
SELECT hotel.name from Orders
JOIN Trip
on Orders.id_trip = Trip.id_hotel
JOIN hotel
on trip.id_hotel = hotel.id_hotel
FROM (SELECT hotel.name, rank() over (order by cnt desc) rnk
FROM (SELECT hotel.name, count(*) cnt
FROM Orders
GROUP BY hotel.name))
WHERE rnk = 1;
The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm
For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30 - the number of the department with the most employees. (30 is the department "name" or number, it is NOT the number of employees in that department!)
In your case:
select stats_mode(h.name) from (the rest of your query)
Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution - a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
Use FIRST for a single result:
SELECT MAX(hotel.name) KEEP (DENSE_RANK FIRST ORDER BY cnt DESC)
FROM (
SELECT hotel.name, COUNT(*) cnt
FROM orders
JOIN trip USING (id_trip)
JOIN hotel USING (id_hotel)
GROUP BY hotel.name
) t
Here is one method:
select name
from (select h.name,
row_number() over (order by count(*) desc) as seqnum -- use `rank()` if you want duplicates
from orders o join
trip t
on o.id_trip = t.id_trip join -- this seems like the right join condition
hotels h
on t.id_hotel = h.id_hotel
) oth
where seqnum = 1;
** Getting the most recent statistical mode out of a data sample **
I know it's more than a year, but here's my answer. I came across this question hoping to find a simpler solution than what I know, but alas, nope.
I had a similar situation where I needed to get the mode from a data sample, with the requirement to get the mode of the most recently inserted value if there were multiple modes.
In such a case neither the STATS_MODE nor the LAST aggregate functions would do (as they would tend to return the first mode found, not necessarily the mode with the most recent entries.)
In my case it was easy to use the ROWNUM pseudo-column because the tables in question were performance metric tables that only experienced inserts (not updates)
In this oversimplified example, I'm using ROWNUM - it could easily be changed to a timestamp or sequence field if you have one.
SELECT VALUE
FROM
(SELECT VALUE ,
COUNT( * ) CNT,
MAX( R ) R
FROM
( SELECT ID, ROWNUM R FROM FOO
)
GROUP BY ID
ORDER BY CNT DESC,
R DESC
)
WHERE
(
ROWNUM < 2
);
That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)
Then sort so that the ones with largest counts come first, and for those with the same count, the one with the largest ROWNUM (indicating most recent insertion in my case).
Then skim off the top row.
Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.
If this doesn't work for your specific case, you'll have to create your own custom aggregator.
Now, if you don't care which mode Oracle is going to pick (your bizness case just requires a mode and that's it, then STATS_MODE will do fine.

SQL Database SELECT question

Need some help with an homework assignment on SQL
Problem
Find out who (first name and last name) has played the most games in the chess tournament with an ID = 41
Background information
I got a table called Games, which contains information...
game ID
tournament ID
start_time
end_time
white_pieces_player_id
black_pieces_player_id
white_result
black_result
...about all the separate chess games that have taken place in three different tournaments ....
(tournaments having ID's of 41,42 and 47)
...and the first and last names of the players are stored in a table called People....
person ID (same ID which comes up in the table 'Games' as white_pieces_player_id and
black_pieces_player_id)
first_name
last_name
...how to make a SELECT statement in SQL that would give me the answer?
sounds like you need to limit by tournamentID in your where clause, join with the people table on white_pieces_player_id and black_pieces_player_id, and use the max function on the count of white_result = win union black_result = win.
interesting problem.
what do you have so far?
hmm... responding to your comment
SELECT isik.eesnimi
FROM partii JOIN isik ON partii.valge=isik.id
WHERE turniir='41'
group by isik.eesnimi
having count(*)>4
consider using the max() function instead of the having count(*)> number
you can add the last name to the select clause if you also add it to the group by clause
sry, I only speak American. What language is this code in?
I would aggregate a join to that table to a derived table like this:
SELECT a.last_name, a.first_name, CNT(b.gamecount) totalcount
FROM players a
JOIN (select cnt(*) gamecount, a.playerid
FROM games
WHERE a.tournamentid = 47
AND (white_player_id = a.playerid OR black_player_id = a.playerid)
GROUP BY playerid
) b
ON b.playerid = a.playerid
GROUP BY last_name, first_name
ORDER BY totalcount
something like this so that you are getting both counts for their black/white play and then joining and aggregating on that.
Then, if you only want the top one, just select the TOP 1