PostgreSQL - get most frequent value from many columns - sql

I have a table hobbies:
+++++++++++++++++++++++++++++++
+ hobby_1 | hobby_2 | hobby_3 +
+---------+---------+---------+
+ music | soccer | [null] +
+ movies | music | cars +
+ cats | dogs | music +
+++++++++++++++++++++++++++++++
I want to get the most frequently used value. The answer would be music.
I know the query to get the most frequent value for one column:
SELECT hobby_1, COUNT(*) FROM hobbies
GROUP BY hobby_1
ORDER BY count(*) DESC;
But how do I get the most frequent value when combining all columns?

You need to unpivot the data. Here is one method:
select h.hobby, count(*)
from ((select hobby_1 as hobby from hobbies) union all
(select hobby_2 as hobby from hobbies) union all
(select hobby_3 as hobby from hobbies)
) h
group by h.hobby
order by count(*) desc;
However, you should really fix your data structure. Having multiple columns only distinguished by a number is usually a sign of a problem with the data structure. You should have a table with one row for each hobby.
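To check the unpivot query against the sample data, here is a minimal sketch that runs it in an in-memory SQLite database from Python. SQLite stands in for PostgreSQL here, and the NULL filter and LIMIT 1 are additions of mine to return only the single top value; they are not part of the query above.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hobbies (hobby_1 TEXT, hobby_2 TEXT, hobby_3 TEXT)")
conn.executemany(
    "INSERT INTO hobbies VALUES (?, ?, ?)",
    [
        ("music", "soccer", None),
        ("movies", "music", "cars"),
        ("cats", "dogs", "music"),
    ],
)

# Unpivot the three columns into one, then count each value.
most_frequent = conn.execute(
    """
    SELECT h.hobby, COUNT(*) AS cnt
    FROM (SELECT hobby_1 AS hobby FROM hobbies
          UNION ALL
          SELECT hobby_2 FROM hobbies
          UNION ALL
          SELECT hobby_3 FROM hobbies) AS h
    WHERE h.hobby IS NOT NULL   -- added: ignore empty slots
    GROUP BY h.hobby
    ORDER BY cnt DESC
    LIMIT 1                     -- added: keep only the top value
    """
).fetchone()

print(most_frequent)  # ('music', 3)
```

If two hobbies tie for first place, LIMIT 1 picks one of them arbitrarily, so drop it if you need all top values.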

Related

How can I delete completely duplicate rows from a query, without having a unique value for it?

I'm having an issue getting information from an MS Access database table. I need a count per name, but rows with a duplicated code must not be taken into account, which means that I first need to remove all duplicate rows.
Here's an example to illustrate what I need:
Code | Name
12 | George
20 | John
12 | George
33 | John
I first need to discard both rows that share the same code, and then count the remaining rows per name. For example, this is the result that I'm expecting:
Name | Count
John | 2
I already have a query that does that for me, but it takes around 1 hour to return around 5000 rows, and I need something more efficient. My query:
select name, count(*) from Table
where name = '" + input_name + "'
and code in (select code from Table group by code
having count(code) = 1)
group by name
order by count(name) desc;
I would appreciate any suggestion.
Rather than using in, I might suggest filtering the original dataset in a subquery, e.g.:
select u.name, count(*)
from (select t.code, t.name from yourtable t group by t.code, t.name having count(*) = 1) u
group by u.name
Here, change yourtable to the name of your table.
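As a quick sanity check, this sketch runs the suggested query on the four sample rows. It uses SQLite from Python rather than MS Access, so treat it as an illustration of the logic only; the ORDER BY is carried over from the question.

```python
import sqlite3

# SQLite stands in for MS Access here (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (code INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [(12, "George"), (20, "John"), (12, "George"), (33, "John")],
)

# Inner query keeps only (code, name) pairs that occur exactly once;
# outer query counts what is left per name.
counts = conn.execute(
    """
    SELECT u.name, COUNT(*) AS cnt
    FROM (SELECT t.code, t.name
          FROM yourtable t
          GROUP BY t.code, t.name
          HAVING COUNT(*) = 1) AS u
    GROUP BY u.name
    ORDER BY cnt DESC
    """
).fetchall()

print(counts)  # [('John', 2)]
```

Both George rows share code 12, so they drop out in the inner query, which matches the expected result in the question.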

Best practice for joining 2 tables using LIKE operator or better approach

I have 2 tables that have to be processed once a day in data warehouse.
MessageTable
Id integer primary key
Message varchar(max)
Example:
Id | Message
1 | Hi! This is the first message.
2 | the last message.
PartTable
PartId integer primary key
Words varchar(100)
Example:
PartId | Words
1 | This
2 | message, first
3 | last
Table 1 contains messages to be compared with Table 2 in order to know which parts each message belongs to.
So the above example should return this:
Id | MessageId | PartId
1 | 1 | 1
2 | 1 | 2
3 | 2 | 3
Because message (id 1) contains the keyword "This" as well as "message" and "first", it belongs to parts 1 and 2.
When the keywords in a part are separated by commas, all the keywords need to be found in the message, irrespective of the order.
The stored procedure I roughly wrote for this process looks like this:
INSERT INTO ResultTable(MessageId, PartId)
SELECT m.Id as MessageId, p.PartId as PartId
FROM MessageTable m, PartTable p
WHERE
(SELECT COUNT(VALUE) FROM STRING_SPLIT(p.Words, ',') WHERE CHARINDEX(CONCAT(' ', VALUE, ' '), m.Message) > 0) = (SELECT COUNT(VALUE) FROM STRING_SPLIT(p.Words, ','))
This SQL statement seems to work, although I haven't tested it thoroughly. But it doesn't look like good practice.
Should I use a more relational approach for PartTable, like below? Then all the word rows for a part would have to be found in a message to determine that the message belongs to the part.
Id | PartId | Word
1 | 1 | This
2 | 2 | message
3 | 2 | first
4 | 3 | last
I can create this table using STRING_SPLIT on PartTable or PartTable can be refactored. But I don't see the way to join this table with MessageTable. Also I am expecting there would be a lot of rows in MessageTable.
Can anyone give me any help on this?
Thanks,
Hmmmm . . . You can combine all parts and messages and split the parts into words. A where clause can be used for filtering, so only matches are included. A final aggregation and counting returns the message/part pairs where all words match:
select m.id, pt.partid
from message m cross join
parttable pt cross apply
string_split(pt.words, ',') s
where m.message like '%' + s.value + '%'
group by m.id, pt.partid
having count(*) = (select count(*)
from parttable pt2 cross apply
string_split(pt2.words, ',') s
where pt2.partid = pt.partid
);
This is not efficient and it is very hard to optimize in SQL Server given your data structure.
A better structure for the parttable would be an improvement for the query:
select m.id, ptn.partid
from message m join
(select ptn.*, count(*) over (partition by partid) as cnt
from parttablenormalized ptn
) ptn
on m.message like '%' + ptn.word + '%'
group by m.id, ptn.partid, cnt
having count(*) = cnt;
However performance might not change much. You would need to denormalize message as well for a speedier query.
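To illustrate the "all words of a part must match" logic on the normalized structure, here is a small sketch using SQLite from Python. The table name parttablenormalized and its columns follow the query above; instr() replaces the LIKE concatenation, and a correlated count replaces the window function, so this shows the idea rather than the exact SQL Server query.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE message (id INTEGER PRIMARY KEY, message TEXT);
    CREATE TABLE parttablenormalized (partid INTEGER, word TEXT);
    INSERT INTO message VALUES
        (1, 'Hi! This is the first message.'),
        (2, 'the last message.');
    INSERT INTO parttablenormalized VALUES
        (1, 'This'), (2, 'message'), (2, 'first'), (3, 'last');
    """
)

# Join every message to every word it contains, then keep only the
# (message, part) pairs where the number of matched words equals the
# total number of words defined for that part.
pairs = conn.execute(
    """
    SELECT m.id, ptn.partid
    FROM message m
    JOIN parttablenormalized ptn
      ON instr(m.message, ptn.word) > 0      -- case-sensitive substring match
    GROUP BY m.id, ptn.partid
    HAVING COUNT(*) = (SELECT COUNT(*)
                       FROM parttablenormalized p2
                       WHERE p2.partid = ptn.partid)
    ORDER BY m.id, ptn.partid
    """
).fetchall()

print(pairs)  # [(1, 1), (1, 2), (2, 3)]
```

Message 2 contains "message" but not "first", so it matches only one of the two words of part 2 and is filtered out by the HAVING clause.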

Search for occurrences of string in one table field in the field of another

Let's say I want to find mentions of names listed in one table within another. So for instance I have this table:
ID | Name
----+-----------------------
1 | PersonA
2 | PersonB
3 | PersonC
4 | PersonD
Now I want to search a field in another table for these persons' names and produce a count for each. Here's what I've tried, to no avail:
select
Name,
sum(
select
count(*)
from Posts
where Posts.Body like '%[^N]' + [Name] + '%'
) as [Count]
from NamesTable
order by Name;
I am using Data Explorer here on SE, so whatever syntax will work there is what I need. I'm not sure how to get this working or if this is even the best approach.
Your query is very close. You just don't need the sum() in the outer query:
select Name,
(select count(*)
from Posts
where Posts.Body like '%[^N]' + [Name] + '%'
) as [Count]
from NamesTable
order by Name;
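Here is a small runnable sketch of the same correlated-subquery pattern, using SQLite from Python. The post bodies are made up for illustration, and the T-SQL '[^N]' character class is dropped because SQLite's LIKE does not support it; string concatenation uses || instead of +.

```python
import sqlite3

# SQLite stands in for the Data Explorer's SQL Server (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE NamesTable (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Posts (Body TEXT);
    INSERT INTO NamesTable VALUES (1, 'PersonA'), (2, 'PersonB'), (3, 'PersonC');
    -- Hypothetical post bodies, not taken from the question.
    INSERT INTO Posts VALUES
        ('PersonA met PersonB'), ('PersonA again'), ('nobody here');
    """
)

# One scalar subquery per name: count the posts whose body mentions it.
mention_counts = conn.execute(
    """
    SELECT n.Name,
           (SELECT COUNT(*)
            FROM Posts p
            WHERE p.Body LIKE '%' || n.Name || '%') AS cnt
    FROM NamesTable n
    ORDER BY n.Name
    """
).fetchall()

print(mention_counts)  # [('PersonA', 2), ('PersonB', 1), ('PersonC', 0)]
```

Names with no mentions still show up with a count of 0, which an inner join between the two tables would not give you.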

Select a row used for GROUP BY

I have this table:
id | owner | asset | rate
-------------------------
1 | 1 | 3 | 1
2 | 1 | 4 | 2
3 | 2 | 3 | 3
4 | 2 | 5 | 4
And I'm using
SELECT asset, max(rate)
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
ORDER BY max(rate) DESC
to get the intersection of assets for the specified owners with the best rate.
I also need the id of the row used for max(rate), but I can't find a way to include it in the SELECT. Any ideas?
Edit:
I need to:
Find all assets that belong to both owners (1 and 2)
For each such asset, keep only the row with the best rate (3)
Get the other columns (owner) of that row with the best rate
I expect the following output:
id | asset | rate
-------------------------
3 | 3 | 3
Oops, all 3s, but basically I need the id of the 3rd row to query the same table again, so the resulting output (after the second query) will be:
id | owner | asset | rate
-------------------------
3 | 2 | 3 | 3
Let's say it's Postgres, but I'd prefer a reasonably cross-DBMS solution.
Edit 2:
Guys, I know how to do this with JOINs. Sorry for the misleading question, but I need to know how to get the extra column from the existing query. I already have the needed assets and rates selected; I just need one extra field alongside max(rate) under the given conditions, if that's possible.
Another solution that might or might not be faster than a self join (depending on the DBMS' optimizer):
SELECT id,
asset,
rate,
asset_count
FROM (
SELECT id,
asset,
rate,
rank() over (partition by asset order by rate desc) as rank_rate,
count(asset) over (partition by null) as asset_count
FROM test
WHERE owner IN (1, 2)
) t
WHERE rank_rate = 1
ORDER BY rate DESC
You are dealing with two questions and trying to solve them as if they were one. With a subquery, you can refine by filtering the list in the proper order first (max(rate)), but as soon as you group, you lose this. As such, I would set up two queries (same procedure, if you are using procedures, but two queries) and ask the questions separately. Unless ... you need some of the information in a single grid when output.
I guess the better direction to head is to have you show how you want the output to look. Once you bake the input and the output, the middle of the oreo is easier to fill.
SELECT b.id, b.asset, b.rate
from
(
SELECT asset, max(rate) maxrate
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
) a, test b
WHERE a.asset = b.asset
AND a.maxrate = b.rate
ORDER BY b.rate DESC
You don't specify what type of database you're running on, but if you have analytical functions available you can do this:
select id, asset, max_rate
from (
select ID, asset, max(rate) over (partition by asset) max_rate,
row_number() over (partition by asset order by rate desc) row_num
from test
where owner in (1,2)
) q
where row_num = 1
I'm not sure how to add in the "having count(asset) > 1" in this way though.
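One possible way to bring the "having count(asset) > 1" condition into the window-function version is to add a windowed COUNT next to the row number and filter on both in the outer query. The sketch below tries this on the sample data with SQLite from Python (window functions need SQLite 3.25 or newer). The asset_rows alias is mine, and the check assumes each owner holds a given asset at most once.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for Postgres in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE test (id INTEGER PRIMARY KEY, owner INTEGER,
                       asset INTEGER, rate INTEGER);
    INSERT INTO test VALUES (1, 1, 3, 1), (2, 1, 4, 2), (3, 2, 3, 3), (4, 2, 5, 4);
    """
)

best_rows = conn.execute(
    """
    SELECT id, owner, asset, rate
    FROM (
        SELECT id, owner, asset, rate,
               ROW_NUMBER() OVER (PARTITION BY asset ORDER BY rate DESC) AS row_num,
               COUNT(*)     OVER (PARTITION BY asset)                    AS asset_rows
        FROM test
        WHERE owner IN (1, 2)
    ) q
    WHERE row_num = 1        -- best rate per asset
      AND asset_rows > 1     -- asset appears for more than one of the owners
    ORDER BY rate DESC
    """
).fetchall()

print(best_rows)  # [(3, 2, 3, 3)]
```

Because the whole row is still available in the inner query, the id and owner come along without a second lookup.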
This first searches for rows with the maximum rate per asset. Then it takes the highest id per asset, and selects that:
select *
from test
inner join
(
select max(id) as MaxIdWithMaxRate
from test
inner join
(
select asset
, max(rate) as MaxRate
from test
group by
asset
) filter
on filter.asset = test.asset
and filter.MaxRate = test.rate
group by
test.asset
) filter2
on filter2.MaxIdWithMaxRate = test.id
If multiple rows for an asset share the maximum rate, this will display the one with the highest id.

How to Select and Order By columns not in Group By SQL statement - Oracle

I have the following statement:
SELECT
IMPORTID,Region,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,RefObligor
There are some extra columns in the Positions table that I want in the output as "display data", but I don't want them in the GROUP BY clause.
These are Site and Desk.
Final output would have the following columns:
IMPORTID,Region,Site,Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
Ideally I'd want the data sorted like:
Order BY
IMPORTID,Region,Site,Desk,RefObligor
How to achieve this?
It does not make sense to include columns that are not part of the GROUP BY clause. Consider if you have a MIN(X), MAX(Y) in the SELECT clause, which row should other columns (not grouped) come from?
If your Oracle version is recent enough, you can use SUM - OVER() to show the SUM (grouped) against every data row.
SELECT
IMPORTID,Site,Desk,Region,RefObligor,
SUM(NOTIONAL) OVER(PARTITION BY IMPORTID, Region,RefObligor) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
Order BY
IMPORTID,Region,Site,Desk,RefObligor
Alternatively, you need to make an aggregate out of the Site, Desk columns
SELECT
IMPORTID,Region,Min(Site) Site, Min(Desk) Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,Min(Site),Min(Desk),RefObligor
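To see what the SUM ... OVER variant returns, here is a minimal sketch on made-up rows, run with SQLite from Python instead of Oracle (window functions need SQLite 3.25 or newer). The sample data is hypothetical, and the filter uses the IMPORTID column with a positional parameter in place of the :importID bind variable.

```python
import sqlite3

# SQLite stands in for Oracle; the rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Positions (IMPORTID INTEGER, Region TEXT, Site TEXT,
                            Desk TEXT, RefObligor TEXT, NOTIONAL INTEGER);
    INSERT INTO Positions VALUES
        (1, 'EU', 'London', 'Rates',  'Acme',   100),
        (1, 'EU', 'Paris',  'Credit', 'Acme',    50),
        (1, 'US', 'NYC',    'Rates',  'Globex',  70);
    """
)

# Every detail row is kept; the sum is computed per
# (IMPORTID, Region, RefObligor) and repeated on each row of that group.
windowed = conn.execute(
    """
    SELECT IMPORTID, Region, Site, Desk, RefObligor,
           SUM(NOTIONAL) OVER (PARTITION BY IMPORTID, Region, RefObligor)
               AS SUM_NOTIONAL
    FROM Positions
    WHERE IMPORTID = ?
    ORDER BY IMPORTID, Region, Site, Desk, RefObligor
    """,
    (1,),
).fetchall()

for row in windowed:
    print(row)
# (1, 'EU', 'London', 'Rates', 'Acme', 150)
# (1, 'EU', 'Paris', 'Credit', 'Acme', 150)
# (1, 'US', 'NYC', 'Rates', 'Globex', 70)
```

The two Acme rows both carry the group total of 150, whereas the Min(Site), Min(Desk) variant would collapse them into a single row.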
I believe this is what you are after:
select
IMPORTID,
Region,
Site,
Desk,
RefObligor,
Sum(Sum(Notional)) over (partition by IMPORTID, Region, RefObligor)
from
Positions
group by
IMPORTID, Region, Site, Desk, RefObligor
order by
IMPORTID, Region, RefObligor, Site, Desk;
... but it's hard to tell without further information and/or test data.
A great blog post that covers this dilemma in detail is here:
http://bernardoamc.github.io/sql/2015/05/04/group-by-non-aggregate-columns/
Here are some snippets of it:
Given:
CREATE TABLE games (
game_id serial PRIMARY KEY,
name VARCHAR,
price BIGINT,
released_at DATE,
publisher TEXT
);
INSERT INTO games (name, price, released_at, publisher) VALUES
('Metal Slug Defense', 30, '2015-05-01', 'SNK Playmore'),
('Project Druid', 20, '2015-05-01', 'shortcircuit'),
('Chroma Squad', 40, '2015-04-30', 'Behold Studios'),
('Soul Locus', 30, '2015-04-30', 'Fat Loot Games'),
('Subterrain', 40, '2015-04-30', 'Pixellore');
SELECT * FROM games;
game_id | name | price | released_at | publisher
---------+--------------------+-------+-------------+----------------
1 | Metal Slug Defense | 30 | 2015-05-01 | SNK Playmore
2 | Project Druid | 20 | 2015-05-01 | shortcircuit
3 | Chroma Squad | 40 | 2015-04-30 | Behold Studios
4 | Soul Locus | 30 | 2015-04-30 | Fat Loot Games
5 | Subterrain | 40 | 2015-04-30 | Pixellore
(5 rows)
Trying to get something like this:
SELECT released_at, name, publisher, MAX(price) as most_expensive
FROM games
GROUP BY released_at;
But name and publisher are not added due to being ambiguous when aggregating...
Let’s make this clear:
Selecting the MAX(price) does not select the entire row.
The database can’t know and when it can’t give the right answer every
time for a given query it should give us an error, and that’s what it
does!
Ok… Ok… It’s not so simple, what can we do?
Use an inner join to get the additional columns
SELECT g1.name, g1.publisher, g1.price, g1.released_at
FROM games AS g1
INNER JOIN (
SELECT released_at, MAX(price) as price
FROM games
GROUP BY released_at
) AS g2
ON g2.released_at = g1.released_at AND g2.price = g1.price;
Or use a left outer join to get the additional columns, and then filter by the NULL of a duplicate column...
SELECT g1.name, g1.publisher, g1.price, g2.price, g1.released_at
FROM games AS g1
LEFT OUTER JOIN games AS g2
ON g1.released_at = g2.released_at AND g1.price < g2.price
WHERE g2.price IS NULL;
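For anyone who wants to try both variants, here is a small sketch that loads the games rows from above into SQLite from Python and runs the inner-join and the left-outer-join queries side by side. SQLite stands in for PostgreSQL, so serial becomes INTEGER PRIMARY KEY and the date is stored as text; the ORDER BY is added only to make the output stable.

```python
import sqlite3

# SQLite stands in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE games (game_id INTEGER PRIMARY KEY, name TEXT, price INTEGER,
                        released_at TEXT, publisher TEXT);
    INSERT INTO games (name, price, released_at, publisher) VALUES
        ('Metal Slug Defense', 30, '2015-05-01', 'SNK Playmore'),
        ('Project Druid',      20, '2015-05-01', 'shortcircuit'),
        ('Chroma Squad',       40, '2015-04-30', 'Behold Studios'),
        ('Soul Locus',         30, '2015-04-30', 'Fat Loot Games'),
        ('Subterrain',         40, '2015-04-30', 'Pixellore');
    """
)

# Variant 1: join back to the per-date maximum price.
via_inner_join = conn.execute(
    """
    SELECT g1.name, g1.price, g1.released_at
    FROM games AS g1
    INNER JOIN (SELECT released_at, MAX(price) AS price
                FROM games
                GROUP BY released_at) AS g2
      ON g2.released_at = g1.released_at AND g2.price = g1.price
    ORDER BY g1.name
    """
).fetchall()

# Variant 2: keep rows for which no pricier game exists on the same date.
via_left_join = conn.execute(
    """
    SELECT g1.name, g1.price, g1.released_at
    FROM games AS g1
    LEFT OUTER JOIN games AS g2
      ON g1.released_at = g2.released_at AND g1.price < g2.price
    WHERE g2.price IS NULL
    ORDER BY g1.name
    """
).fetchall()

print(via_inner_join)
# [('Chroma Squad', 40, '2015-04-30'), ('Metal Slug Defense', 30, '2015-05-01'),
#  ('Subterrain', 40, '2015-04-30')]
print(via_inner_join == via_left_join)  # True
```

Note that both variants return two rows for 2015-04-30, because Chroma Squad and Subterrain tie at the maximum price; neither query breaks ties on its own.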
Hope that helps.