Imagine I have a table like this:
UserID Name Hobbies
00001 Jim Baseball, Hockey, Astronomy
00002 Jack Baseball, Football, Video Games
00003 Jill Astronomy, Shopping, Soccer
00004 Jane Hockey, Astronomy, Video Games
00005 Jacob Football, Basketball, Video Games
Now, what I want to do is get a count of hobbies in common. So, let's say I plug 00001 into a textbox or query string or whatever. I want to see something like:
Name Hobbies
Jack You have (1) hobby in common
Jill You have (1) hobby in common
Jane You have (2) hobbies in common
Jacob You have (0) hobbies in common
How would I write the code for that? I'm stumped. I'm thinking it's got to do with string matching, but I have no idea how to do that.
The first choice is to fix your data structure. Comma-delimited lists are bad, bad, bad. A separate table storing one row per person and per hobby is good, good, good.
If you are stuck with someone else's bad decisions, there is a little recourse. First Google "sql server split" and get your favorite string splitting function.
Then, you can do:
with t as (
    select t.*, s.val as hobby
    from table t cross apply
         dbo.split(t.Hobbies, ', ') as s(val) -- Note, some `split()` implementations also have a `pos` value
)
select t.Name, count(tuser.userId) as NumInCommon
from t left join
     t tuser
     on t.hobby = tuser.hobby and tuser.userId = '00001'
where t.userId <> '00001' -- leave the chosen user out of their own results
group by t.userId, t.Name;
It is not worth constructing the full sentence in SQL, unless you really want to. Use SQL primarily to get the data you want. (Formatting in SQL can be useful sometimes, but it is really more for the application code.)
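As a rough illustration of doing the matching and the sentence formatting in application code rather than in SQL, here is a minimal Python sketch. It assumes the rows have already been fetched as (UserID, Name, Hobbies) tuples; the helper names `parse_hobbies`, `hobbies_in_common` and `describe` are made up for this example.

```python
def parse_hobbies(csv_text):
    """Split a comma-delimited hobby list into a set of trimmed names."""
    return {h.strip() for h in csv_text.split(",") if h.strip()}

# Rows copied from the question's table, with "Astronomy" spelled consistently.
users = [
    ("00001", "Jim", "Baseball, Hockey, Astronomy"),
    ("00002", "Jack", "Baseball, Football, Video Games"),
    ("00003", "Jill", "Astronomy, Shopping, Soccer"),
    ("00004", "Jane", "Hockey, Astronomy, Video Games"),
    ("00005", "Jacob", "Football, Basketball, Video Games"),
]

def hobbies_in_common(rows, user_id):
    """Return (name, shared_count) for every user except user_id."""
    mine = next(parse_hobbies(h) for uid, _, h in rows if uid == user_id)
    return [(name, len(mine & parse_hobbies(h)))
            for uid, name, h in rows if uid != user_id]

def describe(count):
    """Format the sentence in application code, not in SQL."""
    noun = "hobby" if count == 1 else "hobbies"
    return f"You have ({count}) {noun} in common"

common = hobbies_in_common(users, "00001")
for name, count in common:
    print(name, describe(count))
```

Set intersection sidesteps substring matching, so a hobby such as "Ball" would not accidentally match "Baseball".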
create table #temp_hobbies
(hobby_id int
,hobby varchar(50))
insert into #temp_hobbies values
(1, 'football')
,(2,'baseball')
create table #temp_people
(user_ids int,
name varchar(50),
hobby_ids int)
insert into #temp_people values
(01,'Adam',1)
,(01,'Adam',2)
,(02,'Dave',1)
,(03,'Matt',2)
select count(distinct hobby) , count(distinct name)
from #temp_hobbies a
inner join #temp_people b on a.hobby_id = b.hobby_ids
This is only part of your solution; you still need to add a query that computes, for each user, how many hobbies they share with another user.
But as the other answers suggest, try separating the hobbies into a separate table and use int keys for the joins. SQL Server processes ints faster than varchars, especially if you need to do this for thousands of records.
First of all, please NORMALIZE your data. You can see a lot of repeated hobbies across the rows, and the current layout is tedious to search and hard to maintain.
You can keep all your USERS data in one table (the column types here are illustrative):
CREATE TABLE USERS ( UserID VARCHAR2(5) PRIMARY KEY, NAME VARCHAR2(50) );
You can keep all your HOBBIES in another table:
CREATE TABLE HOBBIES ( HOBBYID NUMBER PRIMARY KEY, HOBBYNAME VARCHAR2(50) );
A third table maps USERS to HOBBIES:
CREATE TABLE USERS_HOBBIES ( USERID VARCHAR2(5) REFERENCES USERS, HOBBYID NUMBER REFERENCES HOBBIES );
Once the tables are normalized as above, you can get the desired result with a query like this:
SELECT u.NAME, COUNT(*) AS Hobbies
FROM USERS u
INNER JOIN USERS_HOBBIES uh ON u.UserID = uh.USERID
INNER JOIN HOBBIES h ON uh.HOBBYID = h.HOBBYID
WHERE h.HOBBYID IN (SELECT HOBBYID FROM USERS_HOBBIES WHERE USERID = '00001') -- hobbies of the chosen user
  AND u.UserID <> '00001' -- compare against everyone else, not the chosen user
GROUP BY u.NAME
-- Users with no hobby in common are left out by the inner joins.
P.S.: The query above uses Oracle syntax.
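For anyone who wants to try the normalized layout without an Oracle instance, here is a small sketch using Python's built-in sqlite3 module. The table and column names follow this answer; the column types, the constraints and the LEFT JOIN formulation (chosen so that users with zero shared hobbies still appear) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE USERS (UserID TEXT PRIMARY KEY, NAME TEXT NOT NULL);
CREATE TABLE HOBBIES (HOBBYID INTEGER PRIMARY KEY, HOBBYNAME TEXT NOT NULL UNIQUE);
CREATE TABLE USERS_HOBBIES (
    USERID  TEXT    NOT NULL REFERENCES USERS (UserID),
    HOBBYID INTEGER NOT NULL REFERENCES HOBBIES (HOBBYID),
    PRIMARY KEY (USERID, HOBBYID)
);
""")

# Sample data from the question, already split into one hobby per entry.
source = {
    "00001": ("Jim", ["Baseball", "Hockey", "Astronomy"]),
    "00002": ("Jack", ["Baseball", "Football", "Video Games"]),
    "00003": ("Jill", ["Astronomy", "Shopping", "Soccer"]),
    "00004": ("Jane", ["Hockey", "Astronomy", "Video Games"]),
    "00005": ("Jacob", ["Football", "Basketball", "Video Games"]),
}
for user_id, (name, hobbies) in source.items():
    conn.execute("INSERT INTO USERS VALUES (?, ?)", (user_id, name))
    for hobby in hobbies:
        # Each hobby name is stored once; later users reuse the existing row.
        conn.execute("INSERT OR IGNORE INTO HOBBIES (HOBBYNAME) VALUES (?)", (hobby,))
        conn.execute(
            "INSERT INTO USERS_HOBBIES "
            "SELECT ?, HOBBYID FROM HOBBIES WHERE HOBBYNAME = ?",
            (user_id, hobby),
        )

# LEFT JOINs keep users who share nothing; COUNT ignores the resulting NULLs.
query = """
SELECT u.NAME, COUNT(mine.HOBBYID) AS NumInCommon
FROM USERS u
LEFT JOIN USERS_HOBBIES uh   ON uh.USERID = u.UserID
LEFT JOIN USERS_HOBBIES mine ON mine.HOBBYID = uh.HOBBYID AND mine.USERID = ?
WHERE u.UserID <> ?
GROUP BY u.UserID, u.NAME
ORDER BY u.UserID
"""
in_common = conn.execute(query, ("00001", "00001")).fetchall()
print(in_common)
```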
I have a user table that has among others the fields CityId, StateId, CountryId. I was wondering if it was a good idea to store them[City, State, Country] in separate tables and put their respective ids in the User table or put all the three entities in one table.
While the former is conventional, I am concerned about the extra tables to join and so would want to store all these three different location types in one table like so
RowId - unique row id
LocationType - 1 for City, 2 for state, etc
ActualLocation - Can be a city name if the locationType is 1 and so on..
RowId LocationType ActualLocation
1 1 Waltham
2 1 Yokohama
3 2 Delaware
4 2 Wyoming
5 3 US
6 3 Japan
The problem is that I am only able to get the city name for all three fields using a join like this:
select L.ActualLocation as CityName,
L.ActualLocation as StateName,
L.ActualLocation as CountryName
from UserTable U,
AllLocations L
WHERE
(L.ID = U.City and L.LocationType= 1)
AND
(L.ID = U.State and L.LocationType = 2)
What worked best for us was to have a country table (a totally separate table, which can store other country-related information), a state table (ditto), and then the city table with IDs pointing to the other tables.
CREATE TABLE Country (CountryID int, Name varchar(50))
CREATE TABLE State (StateID int, CountryID int, Name varchar(50))
CREATE TABLE City (CityID int, StateID int, Name varchar(50))
This way you can enforce referential integrity using standard database functions and add additional information about each entity without having a bunch of blank columns or 'special' values.
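To make the referential-integrity point concrete, here is a hedged sketch using Python's sqlite3 module. The three tables mirror the ones above; the PRIMARY KEY and REFERENCES clauses, the sample rows (including 'Massachusetts') and the variable names are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Country (CountryID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE State (
    StateID   INTEGER PRIMARY KEY,
    CountryID INTEGER NOT NULL REFERENCES Country (CountryID),
    Name      TEXT NOT NULL
);
CREATE TABLE City (
    CityID  INTEGER PRIMARY KEY,
    StateID INTEGER NOT NULL REFERENCES State (StateID),
    Name    TEXT NOT NULL
);
INSERT INTO Country VALUES (1, 'US');
INSERT INTO State   VALUES (10, 1, 'Massachusetts');
INSERT INTO City    VALUES (100, 10, 'Waltham');
""")

# A city row is enough to reach its state and country through the keys.
full_location = conn.execute("""
SELECT ci.Name, st.Name, co.Name
FROM City ci
JOIN State st   ON st.StateID = ci.StateID
JOIN Country co ON co.CountryID = st.CountryID
""").fetchone()

# A city pointing at a state that does not exist is rejected by the database.
try:
    conn.execute("INSERT INTO City VALUES (101, 999, 'Nowhere')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(full_location, orphan_rejected)
```

With this layout a user row would only need a city ID, since the state and country follow from it.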
You actually need to select from your location table three times - so you will still have the joins:
select L1.ActualLocation as CityName,
L2.ActualLocation as StateName,
L3.ActualLocation as CountryName
from UserTable U,
AllLocations L1,
AllLocations L2,
AllLocations L3
WHERE
(L1.ID = U.City and L1.LocationType= 1)
AND
(L2.ID = U.State and L2.LocationType = 2)
AND
(L3.ID = U.Country and L3.LocationType = 3)
HOWEVER
Depending on what you want to do with this, you might want to think about the model. You probably want a separate table that can tell "Springfield, Missouri" apart from "Springfield, Illinois". Depending on how "well" you want to manage this data, you would need to manage the states and countries as separate, inter-related reference data (see, for example, ISO 3166-2). That is most likely overkill for you, though, and it might be easiest just to store the text of the location with the user. That is not "pure" modeling, but it is much simpler for simple needs: just pulling the "word" out into a separate table doesn't really give you much other than complex queries.
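For reference, the three-alias join from earlier in this answer can be exercised with Python's sqlite3 module. This is only a sketch: the location rows come from the question, while the `UserId` column, the single sample user and the use of explicit JOIN syntax are assumptions made for the example (the key column is called `ID` to match the queries).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AllLocations (
    ID             INTEGER PRIMARY KEY,
    LocationType   INTEGER NOT NULL,  -- 1 = city, 2 = state, 3 = country
    ActualLocation TEXT NOT NULL
);
INSERT INTO AllLocations VALUES
    (1, 1, 'Waltham'), (2, 1, 'Yokohama'),
    (3, 2, 'Delaware'), (4, 2, 'Wyoming'),
    (5, 3, 'US'), (6, 3, 'Japan');
CREATE TABLE UserTable (UserId INTEGER PRIMARY KEY, City INTEGER, State INTEGER, Country INTEGER);
-- Hypothetical user; the ids are arbitrary picks from the sample rows.
INSERT INTO UserTable VALUES (1, 1, 3, 5);
""")

# One alias of the shared table per column that has to be resolved.
location = conn.execute("""
SELECT L1.ActualLocation AS CityName,
       L2.ActualLocation AS StateName,
       L3.ActualLocation AS CountryName
FROM UserTable U
JOIN AllLocations L1 ON L1.ID = U.City    AND L1.LocationType = 1
JOIN AllLocations L2 ON L2.ID = U.State   AND L2.LocationType = 2
JOIN AllLocations L3 ON L3.ID = U.Country AND L3.LocationType = 3
""").fetchone()
print(location)
```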
Imagine I have a denormalized table like so:
CREATE TABLE Persons
(
Id int identity primary key,
FirstName nvarchar(100),
CountryName nvarchar(100)
)
INSERT INTO Persons
VALUES ('Mark', 'Germany'),
('Chris', 'France'),
('Grace', 'Italy'),
('Antonio', 'Italy'),
('Francis', 'France'),
('Amanda', 'Italy');
I need to construct a query that returns the name of each person, and a unique ID for their country. The IDs do not necessarily have to be contiguous; more importantly, they do not have to be in any order. What is the most efficient way of achieving this?
The simplest solution appears to be DENSE_RANK:
SELECT FirstName,
CountryName,
DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons
-- FirstName CountryName CountryId
-- Chris France 1
-- Francis France 1
-- Mark Germany 2
-- Amanda Italy 3
-- Grace Italy 3
-- Antonio Italy 3
However, this incurs a sort on my CountryName column, which is a wasteful performance hog. I came up with this alternative, which uses ROW_NUMBER with the well-known trick for suppressing its sort:
SELECT P.FirstName,
P.CountryName,
C.CountryId
FROM Persons P
JOIN (
SELECT CountryName,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
FROM Persons
GROUP BY CountryName
) C
ON C.CountryName = P.CountryName
-- FirstName CountryName CountryId
-- Mark Germany 2
-- Chris France 1
-- Grace Italy 3
-- Antonio Italy 3
-- Francis France 1
-- Amanda Italy 3
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)? Are there factors that might make a difference either way (such as an index on CountryName)? Is there a more elegant way of expressing it?
Why would you think that an aggregation would be cheaper than a window function? I ask, because I have some experience with both, and don't have a strong opinion on the matter. If pressed, I would guess the window function is faster, because it does not have to aggregate all the data and then join the result back in.
The two queries will have very different execution paths. The right way to see which performs better is to try it out. Run both queries on large enough samples of data in your environment.
By the way, I don't think there is a right answer, because performance depends on several factors:
Which columns are indexed?
How large is the data? Does it fit in memory?
How many different countries are there?
If you are concerned about performance, and just want a unique number, you could consider using checksum() instead. This does run the risk of collisions. That risk is very, very small for 200 or so countries. Plus you can test for it and do something about it if it does occur. The query would be:
SELECT FirstName, CountryName, CheckSum(CountryName) AS CountryId
FROM Persons;
Your second query would most probably avoid sorting as it would use a hash match aggregate to build the inner query, then use a hash match join to map the ID to the actual records.
This does not sort indeed, but has to scan the original table twice.
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)?
Not necessarily. If you created a clustered index on CountryName, sorting would be a non-issue and everything would be done in a single pass.
Is there a more elegant way of expressing it?
A "correct" plan would be doing the hashing and hash lookups in one go.
Each record, as it's read, would have to be matched against the hash table. On a match, the stored ID would be returned; on a miss, the new country would be added into the hash table, assigned with new ID and that newly assigned ID would be returned.
But I can't think of a way to make SQL Server use such a plan in a single query.
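Outside the database, the single-pass plan described above is easy to sketch. The following Python snippet is only an illustration of the idea (a hash table that hands out a new ID on each miss), not something SQL Server can be told to do; the function name `assign_country_ids` is made up.

```python
def assign_country_ids(rows):
    """Assign an ID to each distinct country in one pass over the rows."""
    ids = {}      # hash table: country name -> assigned ID
    result = []
    for first_name, country in rows:
        # On a hit the stored ID is returned; on a miss the country is
        # added with the next free ID, and that ID is returned.
        country_id = ids.setdefault(country, len(ids) + 1)
        result.append((first_name, country, country_id))
    return result

persons = [
    ("Mark", "Germany"), ("Chris", "France"), ("Grace", "Italy"),
    ("Antonio", "Italy"), ("Francis", "France"), ("Amanda", "Italy"),
]
for row in assign_country_ids(persons):
    print(row)
```

The IDs follow the order in which countries are first seen, so they are unique but not sorted, which is all the question asks for.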
Update:
If you have lots of records, few countries and, most importantly, a non-clustered index on CountryName, you could emulate loose scan to build a list of countries:
DECLARE @country TABLE -- table variables are declared with @, not #
(
    id INT NOT NULL IDENTITY PRIMARY KEY,
    countryName VARCHAR(MAX)
)
;
WITH country AS
(
    -- Anchor: the first country in index order
    SELECT MIN(countryName) AS countryName
    FROM persons
    UNION ALL
    -- Recursive step: seek the next country after the current one
    SELECT (
        SELECT countryName
        FROM (
            SELECT countryName,
                   ROW_NUMBER() OVER (ORDER BY countryName) rn
            FROM persons
            WHERE countryName > country.countryName
        ) q
        WHERE rn = 1
    )
    FROM country
    WHERE countryName IS NOT NULL
)
INSERT
INTO @country (countryName)
SELECT countryName
FROM country
WHERE countryName IS NOT NULL
OPTION (MAXRECURSION 0)
SELECT p.firstName, c.id
FROM persons p
JOIN @country c
    ON c.countryName = p.countryName
GROUP BY can also use a sort operator in the background (grouping can be based on 'sort and compare', like IComparable in C#).
Let's say I have a parent and child database, and the child keeps a sort of running transcript of things that happen to the parent:
create table patient (
fullname text not null,
admission_number integer primary key
);
create table history (
note text not null,
doctor text not null,
admission_number integer references patient (admission_number)
);
(Just an example, I'm not doing a medical application).
history is going to have many records for the same admission_number:
admission_number doctor note
------------------------------------
3456 Johnson Took blood pressure
7828 Johnson EKG 120, temp 99.2
3456 Nichols Drew blood
9001 Damien Discharged patient
7828 Damien Discharged patient with Rx
So, my question is, how would I build a query that let me do and/or/not searches of the note field for patient records, like, for example, if I wanted to find every patient whose history contained "blood pressure" and "discharged".
Right now I've been doing a select on history that groups by admission_number, combining all the notes with a group_concat(note) and doing my search in the having, thus:
select * from history
group by admission_number
having group_concat(note) like '%blood pressure%'
and group_concat(note) like '%discharged%';
This works, but it makes certain elaborations very complicated. For example, I'd like to be able to ask things like "every patient whose history contains 'blood pressure' and whose history with Dr. Damien says 'discharged'", and building qualifications like this on top of my basic query is very messy.
Is there any better way of phrasing my basic query?
This is similar to your EXISTS method, but computes the subqueries differently.
This might or might not be faster, depending on how your tables and indexes are organized, and on the queries' selectivity.
SELECT *
FROM patient
WHERE admission_number IN (SELECT admission_number
FROM history
WHERE note LIKE '%blood pressure%')
AND admission_number IN (SELECT admission_number
FROM history
WHERE note LIKE '%discharged%'
AND doctor = 'Damien')
Alternatively, you could use a compound subquery (computing the intersection once is likely to be faster than executing IN twice for every record):
SELECT *
FROM patient
WHERE admission_number IN (SELECT admission_number
FROM history
WHERE note LIKE '%blood pressure%'
INTERSECT
SELECT admission_number
FROM history
WHERE note LIKE '%discharged%'
AND doctor = 'Damien')
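Both forms can be compared on the question's sample rows with Python's sqlite3 module. This is a sketch under a few assumptions: the patient names are borrowed from another answer in this thread, and the search terms are switched to "EKG" plus a discharge by Damien, because no patient in the sample history has both a "blood pressure" note and a discharge note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (fullname TEXT NOT NULL, admission_number INTEGER PRIMARY KEY);
CREATE TABLE history (
    note TEXT NOT NULL,
    doctor TEXT NOT NULL,
    admission_number INTEGER REFERENCES patient (admission_number)
);
INSERT INTO patient VALUES ('Bob', 3456), ('Mary', 7828), ('Lucy', 9001);
INSERT INTO history (admission_number, doctor, note) VALUES
    (3456, 'Johnson', 'Took blood pressure'),
    (7828, 'Johnson', 'EKG 120, temp 99.2'),
    (3456, 'Nichols', 'Drew blood'),
    (9001, 'Damien',  'Discharged patient'),
    (7828, 'Damien',  'Discharged patient with Rx');
""")

two_in = """
SELECT fullname FROM patient
WHERE admission_number IN (SELECT admission_number FROM history WHERE note LIKE ?)
  AND admission_number IN (SELECT admission_number FROM history
                           WHERE note LIKE ? AND doctor = ?)
ORDER BY fullname
"""
with_intersect = """
SELECT fullname FROM patient
WHERE admission_number IN (SELECT admission_number FROM history WHERE note LIKE ?
                           INTERSECT
                           SELECT admission_number FROM history
                           WHERE note LIKE ? AND doctor = ?)
ORDER BY fullname
"""
# SQLite's LIKE ignores ASCII case by default, so 'discharged' matches 'Discharged'.
params = ("%EKG%", "%discharged%", "Damien")
rows_in = conn.execute(two_in, params).fetchall()
rows_intersect = conn.execute(with_intersect, params).fetchall()
print(rows_in, rows_intersect)
```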
Why don't you use a JOIN operation?
e.g.
considering, the patient table contains the following data:
INSERT INTO patient VALUES('Bob', 3456);
INSERT INTO patient VALUES('Mary', 7828);
INSERT INTO patient VALUES('Lucy', 9001);
Running the query:
SELECT DISTINCT p.fullname, p.admission_number FROM patient p
INNER JOIN history h ON p.admission_number = h.admission_number
WHERE note LIKE '%blood pressure%' OR note LIKE '%Discharged%';
gets you:
fullname = Bob
admission_number = 3456
fullname = Lucy
admission_number = 9001
fullname = Mary
admission_number = 7828
And running the following query:
SELECT DISTINCT p.fullname, p.admission_number FROM patient p
INNER JOIN history h ON p.admission_number = h.admission_number
WHERE note LIKE '%blood pressure%';
gets you:
fullname = Bob
admission_number = 3456
I have something -- using EXISTS to construct these is a bit cleaner:
select * from patient where
exists (
    select 1 from history where
    history.admission_number = patient.admission_number
    AND
    history.note LIKE '%blood pressure%'
)
AND
exists (
    select 1 from history where
    history.admission_number = patient.admission_number
    AND
    history.note LIKE '%discharged%'
    AND
    history.doctor = 'Damien'
);
That's much better, now I can construct really fine-grained predicates.
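Because each EXISTS block is self-contained, the predicates can be assembled mechanically. Here is one possible sketch in Python with sqlite3; the helper names `exists_clause` and `find_patients` are invented for this example, and the patient names are the ones used in another answer in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (fullname TEXT NOT NULL, admission_number INTEGER PRIMARY KEY);
CREATE TABLE history (note TEXT NOT NULL, doctor TEXT NOT NULL,
                      admission_number INTEGER REFERENCES patient (admission_number));
INSERT INTO patient VALUES ('Bob', 3456), ('Mary', 7828), ('Lucy', 9001);
INSERT INTO history (admission_number, doctor, note) VALUES
    (3456, 'Johnson', 'Took blood pressure'),
    (7828, 'Johnson', 'EKG 120, temp 99.2'),
    (3456, 'Nichols', 'Drew blood'),
    (9001, 'Damien',  'Discharged patient'),
    (7828, 'Damien',  'Discharged patient with Rx');
""")

def exists_clause(note_pattern, doctor=None, negate=False):
    """Build one (sql, params) pair testing a patient's history."""
    sql = ("EXISTS (SELECT 1 FROM history h"
           " WHERE h.admission_number = p.admission_number"
           " AND h.note LIKE ?")
    params = [note_pattern]
    if doctor is not None:
        sql += " AND h.doctor = ?"
        params.append(doctor)
    sql += ")"
    return ("NOT " + sql if negate else sql), params

def find_patients(conn, clauses, joiner="AND"):
    """Combine the clauses with AND or OR and return matching names."""
    if joiner not in ("AND", "OR"):
        raise ValueError("joiner must be 'AND' or 'OR'")
    where = f" {joiner} ".join(sql for sql, _ in clauses)
    params = [value for _, values in clauses for value in values]
    query = f"SELECT p.fullname FROM patient p WHERE {where} ORDER BY p.fullname"
    return [row[0] for row in conn.execute(query, params)]

# "Had an EKG" and "discharged by Damien"
print(find_patients(conn, [exists_clause("%EKG%"),
                           exists_clause("%discharged%", doctor="Damien")]))
# "Blood was mentioned" and "never discharged"
print(find_patients(conn, [exists_clause("%blood%"),
                           exists_clause("%discharged%", negate=True)]))
```

All search values travel as bound parameters; only the fixed clause text is concatenated.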
What is the simplest way to sub-query a variable number of rows into fields of the parent query?
PeopleTBL
NameID int - unique
Name varchar
Data: 1,joe
2,frank
3,sam
HobbyTBL
HobbyID int - unique
HobbyName varchar
Data: 1,skiing
2,swimming
HobbiesTBL
NameID int
HobbyID int
Data: 1,1
2,1
2,2
The app defines 0-2 Hobbies per NameID.
What is the simplest way to query the Hobbies into fields retrieved with "Select * from PeopleTBL"?
Result desired based on above data:
NameID Name Hobby1 Hobby2
1 joe skiing
2 frank skiing swimming
3 sam
I'm not sure if I understand correctly, but if you want to fetch all the hobbies for a person in one row, the following query might be useful (MySQL):
SELECT NameID, Name, GROUP_CONCAT(HobbyName) AS Hobbies
FROM PeopleTBL
LEFT JOIN HobbiesTBL USING (NameID) -- LEFT JOIN keeps people without hobbies
LEFT JOIN HobbyTBL USING (HobbyID)
GROUP BY NameID, Name
The Hobbies column will contain all hobbies of a person, separated by commas.
See documentation for GROUP_CONCAT for details.
I don't know which engine you are using, so I've provided an example for MySQL (I don't know which other SQL engines support this).
Select P.NameId, P.Name
, Min( Case When H2.HobbyId = 1 Then H.HobbyName End ) As Hobby1
, Min( Case When H2.HobbyId = 2 Then H.HobbyName End ) As Hobby2
From PeopleTbl As P
Left Join HobbiesTbl As H2 -- left joins keep people with no hobbies
On H2.NameId = P.NameId
Left Join HobbyTbl As H
On H.HobbyId = H2.HobbyId
Group By P.NameId, P.Name
What you are seeking is called a crosstab query. As long as the columns are static, you can use the above solution. However, if you want to dynamically build the columns, you need to build the SQL statement in middle-tier code or use a reporting tool.
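To give an idea of what building the statement in middle-tier code can look like, here is a sketch in Python with sqlite3. It numbers each person's hobbies with a correlated COUNT and generates one `HobbyN` column per slot; the function name `crosstab_sql` and this numbering approach are assumptions of the example, not the only way to do it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PeopleTBL (NameID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE HobbyTBL (HobbyID INTEGER PRIMARY KEY, HobbyName TEXT);
CREATE TABLE HobbiesTBL (NameID INTEGER, HobbyID INTEGER);
INSERT INTO PeopleTBL VALUES (1, 'joe'), (2, 'frank'), (3, 'sam');
INSERT INTO HobbyTBL VALUES (1, 'skiing'), (2, 'swimming');
INSERT INTO HobbiesTBL VALUES (1, 1), (2, 1), (2, 2);
""")

def crosstab_sql(slots):
    """Generate the crosstab statement for the given number of hobby columns."""
    # Column names come from integers produced here, never from user input.
    cols = ",\n       ".join(
        f"MAX(CASE WHEN r.rn = {i} THEN r.HobbyName END) AS Hobby{i}"
        for i in range(1, slots + 1)
    )
    return f"""
SELECT p.NameID, p.Name,
       {cols}
FROM PeopleTBL p
LEFT JOIN (
    SELECT hs.NameID, h.HobbyName,
           -- position of this hobby among the person's hobbies
           (SELECT COUNT(*) FROM HobbiesTBL x
             WHERE x.NameID = hs.NameID AND x.HobbyID <= hs.HobbyID) AS rn
    FROM HobbiesTBL hs
    JOIN HobbyTBL h ON h.HobbyID = hs.HobbyID
) r ON r.NameID = p.NameID
GROUP BY p.NameID, p.Name
ORDER BY p.NameID"""

# Ask the data how many columns are needed (at least one).
most = conn.execute(
    "SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM HobbiesTBL GROUP BY NameID)"
).fetchone()[0]
slots = most or 1
table = conn.execute(crosstab_sql(slots)).fetchall()
for row in table:
    print(row)
```

Missing hobbies come back as NULL (None in Python), which the application can render as blanks.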
This is a dumbed-down version of the real table data, so it may look a bit silly.
Table 1 (users):
id INT
username TEXT
favourite_food TEXT
food_pref_id INT
Table 2 (food_preferences):
id INT
food_type TEXT
The logic is as follows:
Let's say I have this in my food preference table:
1, 'VEGETARIAN'
and this in the users table:
1, 'John', NULL, 1
2, 'Pete', 'Curry', 1
In which case John defaults to be a vegetarian, but Pete should show up as a person who enjoys curry.
Question, is there any way to combine the query into one select statement, so that it would get the default from the preferences table if the favourite_food column is NULL?
I can obviously do this in application logic, but would be nice just to offload this to SQL, if possible.
DB is SQLite3...
You could use COALESCE(X,Y,...) to select the first item that isn't NULL.
If you combine this with an inner join, you should be able to do what you want.
It should go something like this:
SELECT u.id AS id,
       u.username AS username,
       COALESCE(u.favourite_food, p.food_type) AS favourite_food, -- column is spelled favourite_food in the schema
       u.food_pref_id AS food_pref_id
FROM users AS u INNER JOIN food_preferences AS p
     ON u.food_pref_id = p.id
I don't have a SQLite database handy to test on, however, so the syntax might not be 100% correct, but it's the gist of it.
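For what it's worth, the idea can be checked against SQLite through Python's built-in sqlite3 module. The sketch below uses the question's sample rows; the column types are assumptions, and it uses a LEFT JOIN instead of the inner join so that a user without a matching preference row would still be listed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE food_preferences (id INTEGER PRIMARY KEY, food_type TEXT);
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT,
    favourite_food TEXT,
    food_pref_id INTEGER REFERENCES food_preferences (id)
);
INSERT INTO food_preferences VALUES (1, 'VEGETARIAN');
INSERT INTO users VALUES (1, 'John', NULL, 1), (2, 'Pete', 'Curry', 1);
""")

# COALESCE returns its first non-NULL argument, so the preference
# only shows up when favourite_food is NULL.
foods = conn.execute("""
SELECT u.username,
       COALESCE(u.favourite_food, p.food_type) AS favourite_food
FROM users AS u
LEFT JOIN food_preferences AS p ON u.food_pref_id = p.id
ORDER BY u.id
""").fetchall()
print(foods)
```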