Efficient way of getting group ID without sorting - sql

Imagine I have a denormalized table like so:
CREATE TABLE Persons
(
Id int identity primary key,
FirstName nvarchar(100),
CountryName nvarchar(100)
)
INSERT INTO Persons
VALUES ('Mark', 'Germany'),
('Chris', 'France'),
('Grace', 'Italy'),
('Antonio', 'Italy'),
('Francis', 'France'),
('Amanda', 'Italy');
I need to construct a query that returns the name of each person, and a unique ID for their country. The IDs do not necessarily have to be contiguous; more importantly, they do not have to be in any order. What is the most efficient way of achieving this?
The simplest solution appears to be DENSE_RANK:
SELECT FirstName,
CountryName,
DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons
-- FirstName CountryName CountryId
-- Chris France 1
-- Francis France 1
-- Mark Germany 2
-- Amanda Italy 3
-- Grace Italy 3
-- Antonio Italy 3
However, this incurs a sort on my CountryName column, which is a wasteful performance hog. I came up with this alternative, which uses ROW_NUMBER with the well-known trick for suppressing its sort:
SELECT P.FirstName,
P.CountryName,
C.CountryId
FROM Persons P
JOIN (
SELECT CountryName,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
FROM Persons
GROUP BY CountryName
) C
ON C.CountryName = P.CountryName
-- FirstName CountryName CountryId
-- Mark Germany 2
-- Chris France 1
-- Grace Italy 3
-- Antonio Italy 3
-- Francis France 1
-- Amanda Italy 3
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)? Are there factors that might make a difference either way (such as an index on CountryName)? Is there a more elegant way of expressing it?

Why would you think that an aggregation would be cheaper than a window function? I ask, because I have some experience with both, and don't have a strong opinion on the matter. If pressed, I would guess the window function is faster, because it does not have to aggregate all the data and then join the result back in.
The two queries will have very different execution paths. The right way to see which performs better is to try it out. Run both queries on large enough samples of data in your environment.
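One way to run that comparison in SQL Server is to switch on I/O and timing statistics and execute both candidates back to back. This is only a sketch against the Persons table from the question; the figures mean little unless the table holds a realistic volume of data.

```sql
SET STATISTICS IO ON;    -- report logical/physical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time per statement

-- Candidate 1: window function over the whole table
SELECT FirstName,
       CountryName,
       DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons;

-- Candidate 2: aggregate first, then join the IDs back
SELECT P.FirstName,
       P.CountryName,
       C.CountryId
FROM Persons AS P
JOIN (
    SELECT CountryName,
           ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
    FROM Persons
    GROUP BY CountryName
) AS C
    ON C.CountryName = P.CountryName;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Read the reads and timings from the Messages output and compare them alongside the actual execution plans of the two statements.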
By the way, I don't think there is a right answer, because performance depends on several factors:
Which columns are indexed?
How large is the data? Does it fit in memory?
How many different countries are there?
If you are concerned about performance, and just want a unique number, you could consider using checksum() instead. This does run the risk of collisions. That risk is very, very small for 200 or so countries. Plus you can test for it and do something about it if it does occur. The query would be:
SELECT FirstName, CountryName, CheckSum(CountryName) AS CountryId
FROM Persons;
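Testing for collisions can be done with a grouping query. The sketch below assumes the same Persons table; any row it returns is a checksum value shared by more than one distinct country name.

```sql
-- Lists every checksum value that maps to more than one country name.
SELECT CHECKSUM(CountryName)       AS CountryId,
       COUNT(DISTINCT CountryName) AS DistinctNames
FROM Persons
GROUP BY CHECKSUM(CountryName)
HAVING COUNT(DISTINCT CountryName) > 1;
```

An empty result means the checksum is safe to use as an ID for the current data; it would need re-checking whenever new countries are added.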

Your second query would most probably avoid sorting as it would use a hash match aggregate to build the inner query, then use a hash match join to map the ID to the actual records.
This does indeed avoid the sort, but it has to scan the original table twice.
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)?
Not necessarily. If you created a clustered index on CountryName, sorting would be a non-issue and everything would be done in a single pass.
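As a sketch of that setup: the primary key in the question is clustered by default, so it has to be declared nonclustered before CountryName can carry the clustered index. The index name below is made up for illustration, and whether the sort actually disappears should be confirmed in the execution plan.

```sql
CREATE TABLE Persons
(
    Id int IDENTITY PRIMARY KEY NONCLUSTERED,  -- keep the key, but not as the clustered index
    FirstName nvarchar(100),
    CountryName nvarchar(100)
);

-- Hypothetical index name; rows are now stored in CountryName order.
CREATE CLUSTERED INDEX IX_Persons_CountryName
    ON Persons (CountryName);

-- The ORDER BY of the window function can then be satisfied by an ordered scan.
SELECT FirstName,
       CountryName,
       DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons;
```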
Is there a more elegant way of expressing it?
A "correct" plan would be doing the hashing and hash lookups in one go.
Each record, as it's read, would have to be matched against the hash table. On a match, the stored ID would be returned; on a miss, the new country would be added to the hash table and assigned a new ID, and that newly assigned ID would be returned.
But I can't think of a way to make SQL Server use such a plan in a single query.
Update:
If you have lots of records, few countries and, most importantly, a non-clustered index on CountryName, you could emulate loose scan to build a list of countries:
CREATE TABLE #country
(
id INT NOT NULL IDENTITY PRIMARY KEY,
countryName VARCHAR(MAX)
)
;
WITH country AS
(
SELECT TOP 1
countryName
FROM persons
ORDER BY
countryName
UNION ALL
SELECT (
SELECT countryName
FROM (
SELECT countryName,
ROW_NUMBER() OVER (ORDER BY countryName) rn
FROM persons
WHERE countryName > country.countryName
) q
WHERE rn = 1
)
FROM country
WHERE countryName IS NOT NULL
)
INSERT
INTO #country (countryName)
SELECT countryName
FROM country
WHERE countryName IS NOT NULL
OPTION (MAXRECURSION 0)
SELECT p.firstName, c.id
FROM persons p
JOIN #country c
ON c.countryName = p.countryName

GROUP BY can also use a sort operator in the background (grouping is then based on "sort and compare", like IComparable in C#).

Related

Need a little SQL help - Getting number of items in common

Imagine I have a table like such
UserID Name Hobbies
00001 Jim Baseball, Hockey, Astronomy
00002 Jack Baseball, Football, Video Games
00003 Jill Astronomy, Shopping, Soccer
00004 Jane Hockey, Astronomy, Video Games
00005 Jacob Football, Basketball, Video Games
Now, what I want to do is get a count of hobbies in common. So, let's say I plug in 00001 into a textbox or query string or whatever. I want to see something like:
Name Hobbies
Jack You have (1) hobby in common
Jill You have (1) hobby in common
Jane You have (2) hobbies in common
Jacob You have (0) hobbies in common
How would I write the code for that? I'm stumped. I'm thinking it's got to do with string matching, but I have no idea how to do that.
The first choice is to fix your data structure. Comma-delimited lists are bad, bad, bad. A separate table storing one row per person and per hobby is good, good, good.
If you are stuck with someone else's bad decisions, there is a little recourse. First Google "sql server split" and get your favorite string splitting function.
Then, you can do:
with t as (
select t.*, s.val as hobby
from table t cross apply
dbo.split(t.Hobbies, ', ') as s(val) -- Note, some `split()` implementations also have a `pos` value
)
select t.Name, count(tuser.userId) as NumInCommon
from t left join
t tuser
on t.hobby = tuser.hobby and tuser.userId = '00001'
where t.userId <> '00001' -- leave the user being compared out of the result
group by t.userId, t.Name;
It is not worth constructing the full sentence in SQL, unless you really want to. Use SQL primarily to get the data you want. (Formatting in SQL can be useful sometimes, but it is really more for the application code.)
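On SQL Server 2016 and later (database compatibility level 130 or higher), the built-in STRING_SPLIT function can stand in for a hand-written splitter. The following is a sketch under that assumption; the table name Users is hypothetical, since the question does not name its table.

```sql
WITH t AS (
    SELECT u.UserID,
           u.Name,
           LTRIM(s.value) AS hobby          -- STRING_SPLIT takes a one-character separator,
    FROM Users AS u                         -- so trim the space left after each comma
    CROSS APPLY STRING_SPLIT(u.Hobbies, ',') AS s
)
SELECT t.UserID,
       t.Name,
       COUNT(me.UserID) AS NumInCommon      -- counts only hobbies matched by user 00001
FROM t
LEFT JOIN t AS me
    ON me.hobby = t.hobby
   AND me.UserID = '00001'
WHERE t.UserID <> '00001'
GROUP BY t.UserID, t.Name;
```

The LEFT JOIN keeps users with no shared hobbies in the result, with a count of 0.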
create table #temp_hobbies
(hobby_id int
,hobby varchar(50))
insert into #temp_hobbies values
(1, 'football')
,(2,'baseball')
create table #temp_people
(user_ids int,
name varchar(50),
hobby_ids int)
insert into #temp_people values
(01,'Adam',1)
,(01,'Adam',2)
,(02,'Dave',1)
,(03,'Matt',2)
select count(distinct hobby) , count(distinct name)
from #temp_hobbies a
inner join #temp_people b on a.hobby_id = b.hobby_ids
This is only part of your solution: you still need to add a query that computes, for each user, how many hobbies they share with the other user.
But, as the other answers suggest, try separating the hobbies into a separate table and use int keys for the joins. SQL Server processes ints faster than varchars, especially if you need to do this for thousands of records.
First of all, please NORMALIZE your data. The same hobbies are repeated in many rows, which makes the table tedious to search and hard to maintain.
You can keep all your USERS data in one table, as below (the column types are assumed here):
CREATE TABLE USERS ( USERID VARCHAR2(5) PRIMARY KEY, NAME VARCHAR2(50) );
You can keep all your HOBBIES in another table:
CREATE TABLE HOBBIES ( HOBBYID NUMBER PRIMARY KEY, HOBBYNAME VARCHAR2(50) );
A third table maps USERS to HOBBIES:
CREATE TABLE USERS_HOBBIES ( USERID VARCHAR2(5), HOBBYID NUMBER, PRIMARY KEY (USERID, HOBBYID) );
Once the tables are normalized as above, you can get the desired result with a query like this:
-- One row per user other than 00001; users sharing no hobby get a count of 0.
-- Assumes each (USERID, HOBBYID) pair appears only once in USERS_HOBBIES.
SELECT u.NAME, COUNT(mine.HOBBYID) AS Hobbies
FROM USERS u
LEFT JOIN USERS_HOBBIES uh ON uh.USERID = u.USERID
LEFT JOIN USERS_HOBBIES mine
ON mine.HOBBYID = uh.HOBBYID AND mine.USERID = '00001'
WHERE u.USERID <> '00001'
GROUP BY u.USERID, u.NAME
P.S : The above query syntax is in ORACLE

User to location mapping with country state and city in the same table

I have a user table that has among others the fields CityId, StateId, CountryId. I was wondering if it was a good idea to store them[City, State, Country] in separate tables and put their respective ids in the User table or put all the three entities in one table.
While the former is conventional, I am concerned about the extra tables to join and so would want to store all these three different location types in one table like so
RowId - unique row id
LocationType - 1 for City, 2 for state, etc
ActualLocation - Can be a city name if the locationType is 1 and so on..
RowId LocationType ActualLocation
1 1 Waltham
2 1 Yokohama
3 2 Delaware
4 2 Wyoming
5 3 US
6 3 Japan
The problem is that I am only able to get the city name for all three fields using a join like this:
select L.ActualLocation as CityName,
L.ActualLocation as StateName,
L.ActualLocation as CountryName
from UserTable U,
AllLocations L
WHERE
(L.ID = U.City and L.LocationType= 1)
AND
(L.ID = U.State and L.LocationType = 2)
What worked best for us was to have a country table (a totally separate table, which can store other country-related information), a state table (ditto), and then the city table, with IDs pointing to the other tables.
CREATE TABLE Country (CountryID int, Name varchar(50))
CREATE TABLE State (StateID int, CountryID int, Name varchar(50))
CREATE TABLE City (CityID int, StateID int, Name varchar(50))
This way you can enforce referential integrity using standard database functions and add additional information about each entity without having a bunch of blank columns or 'special' values.
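With this hierarchy a user row only needs to reference a city, because the state and country follow from it. A rough sketch of the lookup, assuming the user table is called UserTable as in the question and has UserID and CityId columns (the UserID name is a guess):

```sql
SELECT u.UserID,
       ci.Name AS CityName,
       s.Name  AS StateName,
       co.Name AS CountryName
FROM UserTable AS u
JOIN City    AS ci ON ci.CityID    = u.CityId      -- user -> city
JOIN State   AS s  ON s.StateID    = ci.StateID    -- city -> state
JOIN Country AS co ON co.CountryID = s.CountryID;  -- state -> country
```

These are three joins on integer keys, the same number the single-table design needs, but each one can be backed by an ordinary foreign key.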
You actually need to select from your location table three times - so you will still have the joins:
select L1.ActualLocation as CityName,
L2.ActualLocation as StateName,
L3.ActualLocation as CountryName
from UserTable U,
AllLocations L1,
AllLocations L2,
AllLocations L3
WHERE
(L1.ID = U.City and L1.LocationType= 1)
AND
(L2.ID = U.State and L2.LocationType = 2)
AND
(L3.ID = U.Country and L3.LocationType = 3)
HOWEVER
Depending on what you want to do with this, you might want to think about the model. You probably want a separate table that can hold both "Springfield, Missouri" and "Springfield, Illinois". Depending on how "well" you want to manage this data, you would need to manage the states and countries as separate, inter-related reference data (see, for example, ISO 3166 part 2). That is most likely overkill for you, though, and it might be easiest just to store the text of the location with the user: not "pure" modeling, but much simpler for simple needs. Just pulling the "word" out into a separate table doesn't really give you much other than complex queries.

Optimize sub query

Suppose there are three columns: ename, city, salary. There are millions of rows in this table, named emp.
ename city salary
ak newyork $5000
bk abcd $4000
ck Delhi $4000
....................
...................
Maverick newyork $8000
I want to retrieve all employees having the same city name as Maverick.
select * from emp where
city = (select city from emp where ename= 'maverick' )
I know it will work, but for performance reasons, this query is not good because there are two where clauses present in this query.
I need a query having better performance than above query.
Oracle is probably going to do a good job getting the optimal execution plan for this query:
select *
from emp
where city = (select city from emp where ename= 'maverick' ) ;
What would help the query are two indexes:
create index idx_emp_ename_city on emp(ename, city);
create index idx_emp_city on emp(city);
The first would be used for the subquery. The second to look up all the matching rows. Without indexes, Oracle is going to have to read the table at least once (I think at least twice) and that is going to affect performance on such a large table.
This would give you the same output but I doubt it will perform any better.
You could compare the plans though.
select x.*
from emp x
join (select city from emp where ename = 'maverick') y
on x.city = y.city
You can also add 2 indexes, one on the ENAME column, and a separate one on the CITY column.
create index emp_idx_ename on emp(ename);
create index emp_idx_city on emp(city);
The first index will speed up the inline view whose results are being joined to, because it is searching the table on employee.
The second index will speed up the parent query, because it is searching the table for a given city.
You could create a composite index on emp(ename, city), as others have suggested, since you're selecting only the city column where the ename is X. That allows the query in the inline view to use only the index and not the table, which I didn't initially think of. It may provide an additional boost, more or less, depending on the size of the table, although the index will also be larger.
To make sure the indexes will immediately use updated statistics related to that table, I would also run the following after you create the above indexes, so that your query will immediately start using them:
analyze table emp compute statistics;
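In current Oracle releases the DBMS_STATS package is the documented way to gather optimizer statistics, and ANALYZE is kept mainly for backward compatibility. A minimal equivalent, assuming the emp table lives in the schema you are connected as:

```sql
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
        ownname => USER,    -- assumption: emp is in the current schema
        tabname => 'EMP',
        cascade => TRUE     -- also gather statistics for the table's indexes
    );
END;
/
```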
You could use a WITH statement... Users have suggested many solutions to you.
WITH new_city_tab AS (
SELECT city AS ncity
FROM emp WHERE ename='Maverick'
GROUP BY city)
SELECT *
FROM emp e,
new_city_tab c
WHERE E.city = c.ncity;
Sometimes the desire to narrow a query down further only adds complexity. It just isn't possible to optimize this query further.
You could opt to add an index to create better performance. The index should come on city and ename.
Try this to create these indexes:
create index emp_city -- for the outer where clause
on emp
( city
);
create index emp_ename_city -- for the sub query
on emp
( ename
, city
);

What is the simplest way to sub-query a variable number of rows into fields of the parent query?

What is the simplest way to sub-query a variable number of rows into fields of the parent query?
PeopleTBL
NameID int - unique
Name varchar
Data: 1,joe
2,frank
3,sam
HobbyTBL
HobbyID int - unique
HobbyName varchar
Data: 1,skiing
2,swimming
HobbiesTBL
NameID int
HobbyID int
Data: 1,1
2,1
2,2
The app defines 0-2 Hobbies per NameID.
What is the simplest way to query the Hobbies into fields retrieved with "Select * from PeopleTBL"?
Result desired based on above data:
NameID Name Hobby1 Hobby2
1 joe skiing
2 frank skiing swimming
3 sam
I'm not sure if I understand correctly, but if you want to fetch all the hobbies for a person in one row, the following query might be useful (MySQL):
SELECT NameID, Name, GROUP_CONCAT(HobbyName) AS Hobbies
FROM PeopleTBL
LEFT JOIN HobbiesTBL USING (NameID) -- LEFT JOIN keeps people without hobbies
LEFT JOIN HobbyTBL USING (HobbyID)
GROUP BY NameID, Name
The Hobbies column will contain all hobbies of a person, separated by commas.
See documentation for GROUP_CONCAT for details.
I don't know which engine you are using, so I've provided an example for MySQL (I don't know which other SQL engines support this).
Select P.NameId, P.Name
, Min( Case When H2.HobbyId = 1 Then H.HobbyName End ) As Hobby1
, Min( Case When H2.HobbyId = 2 Then H.HobbyName End ) As Hobby2
From PeopleTbl As P
Left Join HobbiesTbl As H2
On H2.NameId = P.NameId
Left Join HobbyTbl As H
On H.HobbyId = H2.HobbyId
Group By P.NameId, P.Name
What you are seeking is called a crosstab query. As long as the columns are static, you can use the above solution. However, if you want to build the columns dynamically, you need to build the SQL statement in middle-tier code or use a reporting tool.

Mysql, reshape data from long / tall to wide

I have data in a mysql table in long / tall format (described below) and want to convert it to wide format. Can I do this using just sql?
Easiest to explain with an example. Suppose you have information on (country, key, value) for M countries, N keys (e.g. keys can be income, political leader, area, continent, etc.)
Long format has 3 columns: country, key, value
- M*N rows.
e.g.
'USA', 'President', 'Obama'
...
'USA', 'Currency', 'Dollar'
Wide format has N=16 columns: country, key1, ..., keyN
- M rows
example:
country, President, ... , Currency
'USA', 'Obama', ... , 'Dollar'
Is there a way in SQL to create a new table with the data in the wide format?
select distinct key from table;
// this will get me all the keys.
1) How do I then create the table using these key elements?
2) How do I then fill in the table values?
I'm pretty sure I can do this with any scripting language (I like python), but wanted to know if there is an easy way to do this in mysql. Many statistical packages like R and STATA have this command built in because it is often used.
======
To be more clear, here is the desired input output for a simple case:
Input:
country attrName attrValue key (these are column names)
US President Obama 2
US Currency Dollar 3
China President Hu 4
China Currency Yuan 5
Output
country President Currency newPkey
US Obama Dollar 1
China Hu Yuan 2
Cross-tabs or pivot tables are the answer. From there you can INSERT INTO ... SELECT ... or create a VIEW from the single SELECT.
Something like:
SELECT country,
MAX( IF( key='President', value, NULL ) ) AS President,
MAX( IF( key='Currency', value, NULL ) ) AS Currency,
...
FROM table
GROUP BY country;
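Applied to the column names from the question's clarified example (country, attrName, attrValue), the same idea might look like the sketch below. The table names long_table and wide_table and the column sizes are placeholders, since the question names neither table.

```sql
-- Target table with its own auto-increment key, as the desired output shows.
CREATE TABLE wide_table (
    newPkey   INT AUTO_INCREMENT PRIMARY KEY,
    country   VARCHAR(100),
    President VARCHAR(100),
    Currency  VARCHAR(100)
);

-- One output row per country; each CASE picks out the value of one attribute.
INSERT INTO wide_table (country, President, Currency)
SELECT country,
       MAX(CASE WHEN attrName = 'President' THEN attrValue END),
       MAX(CASE WHEN attrName = 'Currency'  THEN attrValue END)
FROM long_table
GROUP BY country;
```

A country that lacks one of the attributes still gets a row, with NULL in that column.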
If you were using SQL Server, this would be easy using PIVOT. As far as I am aware, this is not implemented in MySQL, so if you want to do this (and I'd advise against it) you'll probably have to generate the SQL dynamically, and that's messy.
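For readers who do need the dynamic route, a common MySQL pattern is to assemble the column list with GROUP_CONCAT and run it as a prepared statement. This is a sketch only: long_table is a placeholder name, the columns are those of the question's clarified example, and attribute names containing backticks would break the generated SQL.

```sql
SET @cols = NULL;

-- Build one "MAX(CASE ...) AS `name`" expression per distinct attribute.
SELECT GROUP_CONCAT(DISTINCT
           CONCAT('MAX(CASE WHEN attrName = ', QUOTE(attrName),
                  ' THEN attrValue END) AS `', attrName, '`')
       )
INTO @cols
FROM long_table;

SET @sql = CONCAT('SELECT country, ', @cols,
                  ' FROM long_table GROUP BY country');

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
```

With many attributes the generated list can exceed the default group_concat_max_len, in which case that session variable has to be raised first.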
I think I found the solution, which uses VIEWS and INSERT INTO (as suggested by e4c5).
You have to get your list of AttrNames/Keys yourself, but MYSQL does the other heavy lifting.
For the simple test case above, create newtable with the appropriate columns (don't forget to have an auto-increment primary key as well). Then
CREATE VIEW a
AS SELECT country, attrValue
FROM long_table -- placeholder name for your long-format table
WHERE attrName="President";
CREATE VIEW b
AS SELECT country, attrValue
FROM long_table -- same placeholder table
WHERE attrName="Currency";
INSERT INTO newtable(country, President, Currency)
SELECT a.country, a.attrValue, b.attrValue
FROM a
INNER JOIN b ON a.country=b.country;
If you have more attrNames, then create one view for each one and then adjust the last statement accordingly.
INSERT INTO newtable(country, President, Currency, Capital, Population)
SELECT a.country, a.attrValue, b.attrValue, c.attrValue, d.attrValue
FROM a
INNER JOIN b ON a.country=b.country
INNER JOIN c ON a.country=c.country
INNER JOIN d ON a.country=d.country;
Some more tips
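Use LEFT JOIN ... USING (country) and you don't have to spell out the ON clause; it also keeps countries that are missing an attribute. (NATURAL LEFT JOIN would join on attrValue as well, because every view has that column.)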