MySQL, reshape data from long / tall to wide - sql

I have data in a MySQL table in long / tall format (described below) and want to convert it to wide format. Can I do this using just SQL?
Easiest to explain with an example. Suppose you have information on (country, key, value) for M countries, N keys (e.g. keys can be income, political leader, area, continent, etc.)
Long format has 3 columns: country, key, value
- M*N rows.
e.g.
'USA', 'President', 'Obama'
...
'USA', 'Currency', 'Dollar'
Wide format has N+1 columns (N=16): country, key1, ..., keyN
- M rows
example:
country, President, ... , Currency
'USA', 'Obama', ... , 'Dollar'
Is there a way in SQL to create a new table with the data in the wide format?
select distinct key from table;
// this will get me all the keys.
1) How do I then create the table using these key elements?
2) How do I then fill in the table values?
I'm pretty sure I can do this with any scripting language (I like python), but wanted to know if there is an easy way to do this in mysql. Many statistical packages like R and STATA have this command built in because it is often used.
======
To be more clear, here is the desired input output for a simple case:
Input:
country attrName attrValue key (these are column names)
US President Obama 2
US Currency Dollar 3
China President Hu 4
China Currency Yuan 5
Output
country President Currency newPkey
US Obama Dollar 1
China Hu Yuan 2

Cross-tabs or pivot tables are the answer. From there you can use INSERT INTO ... SELECT ... to fill a new table, or create a VIEW from the single SELECT.
Something like:
SELECT country,
       MAX( IF( `key`='President', value, NULL ) ) AS President,
       MAX( IF( `key`='Currency', value, NULL ) ) AS Currency,
       ...
FROM `table` -- `key` and `table` are reserved words in MySQL, so they need backticks
GROUP BY country;
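As a rough, runnable illustration of the conditional-aggregation query above, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table name long_data is an assumption, the column names come from the question's test case, and CASE is used in place of MySQL's IF() because SQLite does not provide IF().

```python
import sqlite3

# In-memory stand-in for the MySQL table; "long_data" is a made-up name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE long_data (country TEXT, attrName TEXT, attrValue TEXT)")
conn.executemany(
    "INSERT INTO long_data VALUES (?, ?, ?)",
    [
        ("US", "President", "Obama"),
        ("US", "Currency", "Dollar"),
        ("China", "President", "Hu"),
        ("China", "Currency", "Yuan"),
    ],
)

# One MAX(CASE ...) per output column; each group (country) keeps the single
# non-NULL value that matches the attribute name.
pivot_sql = """
    SELECT country,
           MAX(CASE WHEN attrName = 'President' THEN attrValue END) AS President,
           MAX(CASE WHEN attrName = 'Currency'  THEN attrValue END) AS Currency
    FROM long_data
    GROUP BY country
    ORDER BY country
"""
wide_rows = conn.execute(pivot_sql).fetchall()
print(wide_rows)  # [('China', 'Hu', 'Yuan'), ('US', 'Obama', 'Dollar')]
```

The same SELECT can be placed behind an INSERT INTO ... or a CREATE VIEW to materialize the wide table.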

If you were using SQL Server, this would be easy using PIVOT. As far as I am aware, this is not implemented in MySQL, so if you want to do this (and I'd advise against it) you'll probably have to generate the SQL dynamically, and that's messy.
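Generating the SQL dynamically is less messy when a scripting language does the string building. The following is only a sketch under assumptions: build_pivot_sql is a hypothetical helper, long_data is a made-up table name, and sqlite3 stands in for MySQL. The table and column names passed in are assumed to be trusted; only the key values read from the data are escaped.

```python
import sqlite3

def build_pivot_sql(conn, table, group_col, key_col, value_col):
    """Return a pivot SELECT with one MAX(CASE ...) column per distinct key.

    Hypothetical helper: identifiers are interpolated as-is (trusted input),
    key values are escaped before being embedded as literals and aliases.
    """
    keys = [row[0] for row in conn.execute(
        "SELECT DISTINCT {k} FROM {t} ORDER BY {k}".format(k=key_col, t=table))]
    columns = []
    for key in keys:
        literal = key.replace("'", "''")   # escape for a string literal
        alias = key.replace('"', '""')     # escape for a quoted identifier
        columns.append(
            "MAX(CASE WHEN {k} = '{lit}' THEN {v} END) AS \"{alias}\"".format(
                k=key_col, lit=literal, v=value_col, alias=alias))
    return "SELECT {g}, {cols} FROM {t} GROUP BY {g} ORDER BY {g}".format(
        g=group_col, cols=", ".join(columns), t=table)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE long_data (country TEXT, attrName TEXT, attrValue TEXT)")
conn.executemany(
    "INSERT INTO long_data VALUES (?, ?, ?)",
    [("US", "President", "Obama"), ("US", "Currency", "Dollar"),
     ("China", "President", "Hu"), ("China", "Currency", "Yuan")],
)

cursor = conn.execute(
    build_pivot_sql(conn, "long_data", "country", "attrName", "attrValue"))
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(header)  # ['country', 'Currency', 'President']
print(rows)    # [('China', 'Yuan', 'Hu'), ('US', 'Dollar', 'Obama')]
```

The columns come out in alphabetical key order here because the helper sorts the distinct keys; any other ordering rule would work equally well.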

I think I found the solution, which uses VIEWS and INSERT INTO (as suggested by e4c5).
You have to get your list of attrNames/keys yourself, but MySQL does the other heavy lifting.
For the simple test case above, create newtable with the appropriate columns (don't forget to have an auto-increment primary key as well). Then
CREATE VIEW a
AS SELECT country, attrValue
FROM long_table -- long_table is a placeholder for your long-format source table
WHERE attrName="President";
CREATE VIEW b
AS SELECT country, attrValue
FROM long_table
WHERE attrName="Currency";
INSERT INTO newtable(country, President, Currency)
SELECT a.country, a.attrValue, b.attrValue
FROM a
INNER JOIN b ON a.country=b.country;
If you have more attrNames, then create one view for each one and then adjust the last statement accordingly.
INSERT INTO newtable(country, President, Currency, Capital, Population)
SELECT a.country, a.attrValue, b.attrValue, c.attrValue, d.attrValue
FROM a
INNER JOIN b ON a.country=b.country
INNER JOIN c ON a.country=c.country
INNER JOIN d ON a.country=d.country;
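Some more tips:
Use LEFT JOIN instead of INNER JOIN if some countries lack an attribute; otherwise those countries are dropped from the result. NATURAL LEFT JOIN lets you omit the ON clause, but it joins on every column name the views share, so each view would first have to alias attrValue to a distinct name.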

Related

Need a little SQL help - Getting number of items in common

Imagine I have a table like such
UserID Name Hobbies
00001 Jim Baseball, Hockey, Astronomy
00002 Jack Baseball, Football, Video Games
00003 Jill Astronomy, Shopping, Soccer
00004 Jane Hockey, Astronomy, Video Games
00005 Jacob Football, Basketball, Video Games
Now, what I want to do is get a count of hobbies in common. So, let's say I plug in 00001 into a textbox or query string or whatever. I want to see something like:
Name Hobbies
Jack You have (1) hobby in common
Jill You have (1) hobby in common
Jane You have (2) hobbies in common
Jacob You have (0) hobbies in common
How would I write the code for that? I'm stumped. I'm thinking it's got to do with string matching, but I have no idea how to do that.
The first choice is to fix your data structure. Comma-delimited lists are bad, bad, bad. A separate table storing one row per person and per hobby is good, good, good.
If you are stuck with someone else's bad decisions, there is some recourse. First Google "sql server split" and get your favorite string splitting function.
Then, you can do:
with t as (
select t.*, s.val as hobby
from table t cross apply
dbo.split(t.Hobbies, ', ') as s(val) -- Note, some `split()` implementations also have a `pos` value
)
select t.Name, count(tuser.userId) as NumInCommon
from t left join
t tuser
on t.hobby = tuser.hobby and tuser.userId = '00001'
group by t.userId, t.Name;
It is not worth constructing the full sentence in SQL, unless you really want to. Use SQL primarily to get the data you want. (Formatting in SQL can be useful sometimes, but it is really more for the application code.)
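To illustrate leaving the formatting to application code, here is a minimal Python sketch that splits the comma-delimited lists and builds the sentence outside SQL. The function names and the in-memory users dictionary are assumptions standing in for rows fetched from the table.

```python
def parse_hobbies(csv_text):
    """Split a comma-delimited hobby list into a set of normalized names."""
    return {part.strip().lower() for part in csv_text.split(",") if part.strip()}

def common_hobby_report(users, user_id):
    """Return (name, sentence) pairs for every user other than user_id."""
    mine = parse_hobbies(users[user_id][1])
    report = []
    for other_id, (name, hobbies) in users.items():
        if other_id == user_id:
            continue
        shared = len(mine & parse_hobbies(hobbies))
        noun = "hobby" if shared == 1 else "hobbies"
        report.append((name, "You have ({}) {} in common".format(shared, noun)))
    return report

# Sample rows from the question, keyed by UserID.
users = {
    "00001": ("Jim", "Baseball, Hockey, Astronomy"),
    "00002": ("Jack", "Baseball, Football, Video Games"),
    "00003": ("Jill", "Astronomy, Shopping, Soccer"),
    "00004": ("Jane", "Hockey, Astronomy, Video Games"),
    "00005": ("Jacob", "Football, Basketball, Video Games"),
}

report = common_hobby_report(users, "00001")
for name, sentence in report:
    print(name, sentence)
```

Normalizing case and whitespace before comparing is a design choice here; it avoids missing a match because of a stray space or capital letter in the stored list.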
create table #temp_hobbies
(hobby_id int
,hobby varchar(50))
insert into #temp_hobbies values
(1, 'football')
,(2,'baseball')
create table #temp_people
(user_ids int,
name varchar(50),
hobby_ids int)
insert into #temp_people values
(01,'Adam',1)
,(01,'Adam',2)
,(02,'Dave',1)
,(03,'Matt',2)
select count(distinct hobby) , count(distinct name)
from #temp_hobbies a
inner join #temp_people b on a.hobby_id = b.hobby_ids
This is only part of your solution: you still need to add a query that computes, for each user, how many hobbies they share with the others.
But as the other answers suggest, try separating the hobbies into a separate table and use int keys for the joins. SQL Server generally processes ints faster than varchars, especially if you need to do this for thousands of records.
First of all, please NORMALIZE your data. Each row contains a lot of repeated hobbies, which makes the table tedious to search and hard to maintain.
You can keep all your USERS data in one table, as below:
CREATE TABLE USERS ( UserID , NAME ); --> USERID being PRIMARY KEY
You can keep all your HOBBIES in another table, as below:
CREATE TABLE HOBBIES ( HOBBYID, HOBBYNAME); --> HOBBYID being PRIMARY KEY
A third table maps USERS to HOBBIES, as below:
CREATE TABLE USERS_HOBBIES ( USERID , HOBBYID );
Once the tables are normalized as above, you can get the desired result with a query like this:
-- For every user other than '00001', count the hobbies shared with '00001'
SELECT u.NAME, COUNT(uh.HOBBYID) AS Hobbies
FROM USERS u
LEFT JOIN USERS_HOBBIES uh
ON uh.USERID = u.UserID
AND uh.HOBBYID IN (SELECT HOBBYID FROM USERS_HOBBIES WHERE USERID = '00001')
WHERE u.UserID <> '00001'
GROUP BY u.NAME
P.S.: The above query is written in Oracle syntax.
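For a quick check of the normalized design, the sketch below builds the three tables in an in-memory SQLite database (standing in for Oracle or SQL Server) and counts shared hobbies per user. The lowercase table and column names and the column types are assumptions; the answers above leave the types unspecified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE hobbies (hobby_id INTEGER PRIMARY KEY, hobby_name TEXT NOT NULL UNIQUE);
    CREATE TABLE users_hobbies (
        user_id  TEXT    NOT NULL REFERENCES users(user_id),
        hobby_id INTEGER NOT NULL REFERENCES hobbies(hobby_id),
        PRIMARY KEY (user_id, hobby_id)
    );
""")

data = [
    ("00001", "Jim", ["Baseball", "Hockey", "Astronomy"]),
    ("00002", "Jack", ["Baseball", "Football", "Video Games"]),
    ("00003", "Jill", ["Astronomy", "Shopping", "Soccer"]),
    ("00004", "Jane", ["Hockey", "Astronomy", "Video Games"]),
    ("00005", "Jacob", ["Football", "Basketball", "Video Games"]),
]
for user_id, name, hobby_names in data:
    conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
    for hobby in hobby_names:
        # Store each hobby name once, then link the user to its integer id.
        conn.execute("INSERT OR IGNORE INTO hobbies (hobby_name) VALUES (?)", (hobby,))
        conn.execute(
            "INSERT INTO users_hobbies "
            "SELECT ?, hobby_id FROM hobbies WHERE hobby_name = ?",
            (user_id, hobby))

target = "00001"
# LEFT JOINs keep users with nothing in common; COUNT ignores the NULLs.
common = conn.execute("""
    SELECT u.name, COUNT(mine.hobby_id) AS num_in_common
    FROM users u
    LEFT JOIN users_hobbies uh   ON uh.user_id = u.user_id
    LEFT JOIN users_hobbies mine ON mine.hobby_id = uh.hobby_id
                                AND mine.user_id = ?
    WHERE u.user_id <> ?
    GROUP BY u.user_id, u.name
    ORDER BY u.user_id
""", (target, target)).fetchall()
print(common)  # [('Jack', 1), ('Jill', 1), ('Jane', 2), ('Jacob', 0)]
```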

Efficient way of getting group ID without sorting

Imagine I have a denormalized table like so:
CREATE TABLE Persons
(
Id int identity primary key,
FirstName nvarchar(100),
CountryName nvarchar(100)
)
INSERT INTO Persons
VALUES ('Mark', 'Germany'),
('Chris', 'France'),
('Grace', 'Italy'),
('Antonio', 'Italy'),
('Francis', 'France'),
('Amanda', 'Italy');
I need to construct a query that returns the name of each person, and a unique ID for their country. The IDs do not necessarily have to be contiguous; more importantly, they do not have to be in any order. What is the most efficient way of achieving this?
The simplest solution appears to be DENSE_RANK:
SELECT FirstName,
CountryName,
DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons
-- FirstName CountryName CountryId
-- Chris France 1
-- Francis France 1
-- Mark Germany 2
-- Amanda Italy 3
-- Grace Italy 3
-- Antonio Italy 3
However, this incurs a sort on my CountryName column, which is a wasteful performance hog. I came up with this alternative, which uses ROW_NUMBER with the well-known trick for suppressing its sort:
SELECT P.FirstName,
P.CountryName,
C.CountryId
FROM Persons P
JOIN (
SELECT CountryName,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
FROM Persons
GROUP BY CountryName
) C
ON C.CountryName = P.CountryName
-- FirstName CountryName CountryId
-- Mark Germany 2
-- Chris France 1
-- Grace Italy 3
-- Antonio Italy 3
-- Francis France 1
-- Amanda Italy 3
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)? Are there factors that might make a difference either way (such as an index on CountryName)? Is there a more elegant way of expressing it?
Why would you think that an aggregation would be cheaper than a window function? I ask, because I have some experience with both, and don't have a strong opinion on the matter. If pressed, I would guess the window function is faster, because it does not have to aggregate all the data and then join the result back in.
The two queries will have very different execution paths. The right way to see which performs better is to try it out. Run both queries on large enough samples of data in your environment.
By the way, I don't think there is a right answer, because performance depends on several factors:
Which columns are indexed?
How large is the data? Does it fit in memory?
How many different countries are there?
If you are concerned about performance, and just want a unique number, you could consider using checksum() instead. This does run the risk of collisions. That risk is very, very small for 200 or so countries. Plus you can test for it and do something about it if it does occur. The query would be:
SELECT FirstName, CountryName, CheckSum(CountryName) AS CountryId
FROM Persons;
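The "test for it and do something about it" step can be sketched in a few lines. This is only an analogy: it uses Python's zlib.crc32 rather than SQL Server's CHECKSUM (a different algorithm), and country_ids is a hypothetical helper name.

```python
import zlib

def country_ids(names):
    """Map each name to a hash-based id, raising if two names collide."""
    owner_of = {}   # id -> the name that first produced it
    ids = {}
    for name in names:
        cid = zlib.crc32(name.encode("utf-8"))
        first_owner = owner_of.setdefault(cid, name)
        if first_owner != name:
            # A real system would fall back to another scheme here.
            raise ValueError("collision between %r and %r" % (first_owner, name))
        ids[name] = cid
    return ids

ids = country_ids(["Germany", "France", "Italy", "Italy", "France", "Italy"])
print(ids)
```

Because the id depends only on the name, no sort and no lookup table is needed, which is the appeal of the approach; the price is the explicit collision check.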
Your second query would most probably avoid sorting as it would use a hash match aggregate to build the inner query, then use a hash match join to map the ID to the actual records.
This does not sort indeed, but has to scan the original table twice.
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)?
Not necessarily. If you created a clustered index on CountryName, sorting would be a non-issue and everything would be done in a single pass.
Is there a more elegant way of expressing it?
A "correct" plan would be doing the hashing and hash lookups in one go.
Each record, as it's read, would have to be matched against the hash table. On a match, the stored ID would be returned; on a miss, the new country would be added to the hash table and assigned a new ID, and that newly assigned ID would be returned.
But I can't think of a way to make SQL Server use such a plan in a single query.
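Outside the database, the single-pass hashing described above is easy to express. The following Python sketch (assign_group_ids is a made-up name) shows the idea with a dictionary playing the role of the hash table; it is an illustration of the plan, not something SQL Server can be told to run.

```python
def assign_group_ids(rows):
    """Single pass: look each country up in a hash table, adding it on a miss."""
    ids = {}
    result = []
    for first_name, country in rows:
        # On a hit the stored id is returned; on a miss the next id is stored.
        country_id = ids.setdefault(country, len(ids) + 1)
        result.append((first_name, country, country_id))
    return result

persons = [
    ("Mark", "Germany"), ("Chris", "France"), ("Grace", "Italy"),
    ("Antonio", "Italy"), ("Francis", "France"), ("Amanda", "Italy"),
]
grouped = assign_group_ids(persons)
for row in grouped:
    print(row)
```

The ids follow first-seen order (Germany 1, France 2, Italy 3), which is fine because the question does not require any particular ordering.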
Update:
If you have lots of records, few countries and, most importantly, a non-clustered index on CountryName, you could emulate loose scan to build a list of countries:
DECLARE @country TABLE
(
id INT NOT NULL IDENTITY PRIMARY KEY,
countryName VARCHAR(MAX)
)
;
WITH country AS
(
SELECT TOP 1
countryName
FROM persons
ORDER BY
countryName
UNION ALL
SELECT (
SELECT countryName
FROM (
SELECT countryName,
ROW_NUMBER() OVER (ORDER BY countryName) rn
FROM persons
WHERE countryName > country.countryName
) q
WHERE rn = 1
)
FROM country
WHERE countryName IS NOT NULL
)
INSERT
INTO @country (countryName)
SELECT countryName
FROM country
WHERE countryName IS NOT NULL
OPTION (MAXRECURSION 0);
SELECT p.firstName, c.id
FROM persons p
JOIN @country c
ON c.countryName = p.countryName
GROUP BY may also use a sort operator behind the scenes (a stream aggregate groups by "sort and compare", much like IComparable in C#).

Include a string in Select Query

I'm wondering can we do this query below?
SELECT America, England, DISTINCT (country) FROM tb_country
which will (my intention is to) display :
America
England
(List of distinct country field in tb_country)
So the point is to display (for example) America and England even if the DISTINCT country field returns nothing. Basically I need this query to populate a select dropdown with some sticky values the user can pick, while allowing them to add a new country as they wish.
It also goes without saying that if a row in tb_country has the value America or England, it should not show up as a duplicate in the query result. So if tb_country has the values:
Germany
England
Holland
The query will only output :
America
England
Germany
Holland
You need to use a UNION:
SELECT 'America' AS country
UNION
SELECT 'England' AS country
UNION
SELECT DISTINCT(c.country) AS country
FROM TB_COUNTRY c
UNION will remove duplicates; UNION ALL will not (but is faster for it).
The data type must match for each ordinal position in the SELECT clause. Meaning, if the first column in the first query were INT, the first column for all the unioned statements afterwards need to be INT as well or NULL.
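A small sketch of the UNION behavior, run against an in-memory SQLite database as a stand-in for the real server. The sample rows are the ones from the question; the ORDER BY is an addition so that the result order is well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_country (country TEXT)")
conn.executemany(
    "INSERT INTO tb_country VALUES (?)",
    [("Germany",), ("England",), ("Holland",)],
)

# UNION removes duplicates, so 'England' appears only once.
options = [row[0] for row in conn.execute("""
    SELECT 'America' AS country
    UNION
    SELECT 'England'
    UNION
    SELECT country FROM tb_country
    ORDER BY country
""")]
print(options)  # ['America', 'England', 'Germany', 'Holland']

# UNION ALL keeps every row, duplicates included.
all_rows = conn.execute("""
    SELECT 'America' AS country
    UNION ALL
    SELECT 'England'
    UNION ALL
    SELECT country FROM tb_country
""").fetchall()
print(len(all_rows))  # 5
```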
Why not add a weight column to tb_country and use an ORDER BY clause?
Perform once:
update tb_country set weight = 1 where country = 'England';
update tb_country set weight = 1 where country = 'America';
Then use it:
select country from tb_country group by country order by max(weight) desc;
Another way is to use an extra country table with two columns (country, weight) and an outer join.
Personally, I would prefer a country table with a UNIQUE constraint on the country field, referenced through a foreign key.

What's the simplest way to sub-query a variable number of rows into fields of the parent query?

What's the simplest way to sub-query a variable number of rows into fields of the parent query?
PeopleTBL
NameID int - unique
Name varchar
Data: 1,joe
2,frank
3,sam
HobbyTBL
HobbyID int - unique
HobbyName varchar
Data: 1,skiing
2,swimming
HobbiesTBL
NameID int
HobbyID int
Data: 1,1
2,1
2,2
The app defines 0-2 Hobbies per NameID.
What's the simplest way to query the Hobbies into fields retrieved with "Select * from PeopleTBL"?
Result desired based on above data:
NameID Name Hobby1 Hobby2
1 joe skiing
2 frank skiing swimming
3 sam
I'm not sure if I understand correctly, but if you want to fetch all the hobbies for a person in one row, the following query might be useful (MySQL):
SELECT NameID, Name, GROUP_CONCAT(HobbyName) AS Hobbies
FROM PeopleTBL
LEFT JOIN HobbiesTBL USING (NameID) -- LEFT JOIN keeps people without hobbies
LEFT JOIN HobbyTBL USING (HobbyID)
GROUP BY NameID, Name;
Hobbies column will contain all hobbies of a person separated by ,.
See documentation for GROUP_CONCAT for details.
I don't know which engine you are using, so I've provided an example for MySQL (I don't know which other SQL engines support this).
Select P.NameId, P.Name
, Min( Case When H2.HobbyId = 1 Then H.HobbyName End ) As Hobby1
, Min( Case When H2.HobbyId = 2 Then H.HobbyName End ) As Hobby2
From HobbyTbl As H
Join HobbiesTbl As H2
On H2.HobbyId = H.HobbyId
Join PeopleTbl As P
On P.NameId = H2.NameId
Group By P.NameId, P.Name
What you are seeking is called a crosstab query. As long as the columns are static, you can use the above solution. However, if you want to build the columns dynamically, you need to build the SQL statement in middle-tier code or use a reporting tool.
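One middle-tier option is to fetch the joined rows and fill the Hobby1/Hobby2 slots in application code. The sketch below does that in Python against an in-memory SQLite copy of the three tables; MAX_HOBBIES reflects the stated limit of 0-2 hobbies per person, and the slot-filling logic is an assumption about how the columns should be populated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PeopleTBL  (NameID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE HobbyTBL   (HobbyID INTEGER PRIMARY KEY, HobbyName TEXT);
    CREATE TABLE HobbiesTBL (NameID INTEGER, HobbyID INTEGER);
    INSERT INTO PeopleTBL  VALUES (1, 'joe'), (2, 'frank'), (3, 'sam');
    INSERT INTO HobbyTBL   VALUES (1, 'skiing'), (2, 'swimming');
    INSERT INTO HobbiesTBL VALUES (1, 1), (2, 1), (2, 2);
""")

MAX_HOBBIES = 2  # the app defines 0-2 hobbies per NameID

# LEFT JOINs keep people without hobbies (HobbyName comes back as NULL).
rows = conn.execute("""
    SELECT p.NameID, p.Name, h.HobbyName
    FROM PeopleTBL p
    LEFT JOIN HobbiesTBL ph ON ph.NameID = p.NameID
    LEFT JOIN HobbyTBL h    ON h.HobbyID = ph.HobbyID
    ORDER BY p.NameID, h.HobbyID
""").fetchall()

# Collect each person's hobbies, then pad the list out to MAX_HOBBIES slots.
people = {}
for name_id, name, hobby in rows:
    slots = people.setdefault((name_id, name), [])
    if hobby is not None:
        slots.append(hobby)

wide = [
    (name_id, name, *(slots + [None] * MAX_HOBBIES)[:MAX_HOBBIES])
    for (name_id, name), slots in people.items()
]
print(wide)
# [(1, 'joe', 'skiing', None), (2, 'frank', 'skiing', 'swimming'), (3, 'sam', None, None)]
```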

SQL select replace integer with string

The goal is to replace an integer value that is returned in a SQL query with the char value that the number represents. For example:
A table attribute labeled ‘Sport’ is defined as an integer value between 1-4. 1 = Basketball, 2 = Hockey, etc. Below is the database table and then the desired output.
Database Table:
Player Team Sport
--------------------------
Bob Blue 1
Roy Red 3
Sarah Pink 4
Desired Outputs:
Player Team Sport
------------------------------
Bob Blue Basketball
Roy Red Soccer
Sarah Pink Kickball
What is best practice to translate these integer values for String values? Use SQL to translate the values prior to passing to program? Use scripting language to change the value within the program? Change database design?
The database should hold the values and you should perform a join to another table which has that data in it.
So you should have a table which has say a list of people
ID Name FavSport
1 Alex 4
2 Gnats 2
And then another table which has a list of the sports
ID Sport
1 Basketball
2 Football
3 Soccer
4 Kickball
Then you would do a join between these tables
select people.name, sports.sport
from people, sports
where people.favsport = sports.ID
which would give you back
Name Sport
Alex Kickball
Gnats Football
You could also use a case statement eg. just using the people table from above you could write something like
select name,
case
when favsport = 1 then 'Basketball'
when favsport = 2 then 'Football'
when favsport = 3 then 'Soccer'
else 'Kickball'
end as "Sport"
from people
But that is certainly not best practice.
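To compare the two approaches side by side, the sketch below loads the people and sports tables from this answer into an in-memory SQLite database and runs both the join and the CASE version. The column types are assumptions; the explicit JOIN ... ON is equivalent to the comma-style join shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sports (ID INTEGER PRIMARY KEY, sport TEXT NOT NULL);
    CREATE TABLE people (ID INTEGER PRIMARY KEY, name TEXT, favsport INTEGER
                         REFERENCES sports(ID));
    INSERT INTO sports VALUES (1, 'Basketball'), (2, 'Football'),
                              (3, 'Soccer'), (4, 'Kickball');
    INSERT INTO people VALUES (1, 'Alex', 4), (2, 'Gnats', 2);
""")

# Lookup-table join: the id-to-name mapping lives in the data.
joined = conn.execute("""
    SELECT people.name, sports.sport
    FROM people
    JOIN sports ON people.favsport = sports.ID
    ORDER BY people.ID
""").fetchall()

# CASE expression: the same mapping is hard-coded into the query text.
cased = conn.execute("""
    SELECT name,
           CASE favsport
               WHEN 1 THEN 'Basketball'
               WHEN 2 THEN 'Football'
               WHEN 3 THEN 'Soccer'
               ELSE 'Kickball'
           END AS sport
    FROM people
    ORDER BY ID
""").fetchall()

print(joined)  # [('Alex', 'Kickball'), ('Gnats', 'Football')]
print(joined == cased)  # True
```

Both return the same rows here, but adding a fifth sport only needs an INSERT in the join version, whereas every CASE expression would have to be edited.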
MySQL has a CASE statement. The following works in SQL Server:
SELECT
CASE MyColumnName
WHEN 1 THEN 'First'
WHEN 2 THEN 'Second'
WHEN 3 THEN 'Third'
ELSE 'Other'
END
In Oracle you can use the DECODE function, which provides a solution when the design of the database is beyond your control.
Directly from the oracle documentation:
Example: This example decodes the value warehouse_id. If warehouse_id is 1, then the function returns 'Southlake'; if warehouse_id is 2, then it returns 'San Francisco'; and so forth. If warehouse_id is not 1, 2, 3, or 4, then the function returns 'Non domestic'.
SELECT product_id,
DECODE (warehouse_id, 1, 'Southlake',
2, 'San Francisco',
3, 'New Jersey',
4, 'Seattle',
'Non domestic') "Location"
FROM inventories
WHERE product_id < 1775
ORDER BY product_id, "Location";
The CASE expression could help. However, it may be even faster to have a small table with an int primary key and a name string such as
1 baseball
2 football
etc, and JOIN it appropriately in the query.
Do you think it would be helpful to store these relationships between integers and strings in the database itself? As long as you have to store these relationships, it makes sense to store it close to your data (in the database) instead of in your code where it can get lost. If you use this solution, this would make the integer a foreign key to values in another table. You store integers in another table, say sports, with sport_id and sport, and join them as part of your query.
Instead of a plain SELECT * FROM my_table, you would select from my_table and add the appropriate join. If not every row in your main table has a corresponding sport, you could use a left join; otherwise selecting from both tables and using = in the where clause is probably sufficient.
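The difference between the inner join and the left join shows up as soon as one row has no matching sport. In this sketch the table names my_table and sports follow the text above, while the column names and the extra player 'Pat' with an unknown sport id are assumptions added for illustration; SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sports (sport_id INTEGER PRIMARY KEY, sport TEXT NOT NULL);
    CREATE TABLE my_table (player TEXT, team TEXT, sport_id INTEGER);
    INSERT INTO sports VALUES (1, 'Basketball'), (3, 'Soccer'), (4, 'Kickball');
    INSERT INTO my_table VALUES ('Bob', 'Blue', 1), ('Roy', 'Red', 3),
                                ('Sarah', 'Pink', 4),
                                ('Pat', 'Green', 9);  -- hypothetical row, no such sport
""")

# Inner join: rows without a matching sport disappear.
inner_rows = conn.execute("""
    SELECT m.player, m.team, s.sport
    FROM my_table m
    JOIN sports s ON s.sport_id = m.sport_id
    ORDER BY m.rowid
""").fetchall()

# Left join: every row of my_table is kept; a missing sport comes back as NULL.
left_rows = conn.execute("""
    SELECT m.player, m.team, s.sport
    FROM my_table m
    LEFT JOIN sports s ON s.sport_id = m.sport_id
    ORDER BY m.rowid
""").fetchall()

print(inner_rows)
print(left_rows)
```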
Definitely have the DB hold the string values. I am not a DB expert by any means, but I would recommend that you create a table that holds the strings and their corresponding integer values. From there, you can define a relationship between the two tables and then do a JOIN in the select to pull the string version of the integer.
tblSport Columns
------------
SportID int (PK, eg. 12)
SportName varchar (eg. "Tennis")
tblFriend Columns
------------
FriendID int (PK)
FriendName (eg. "Joe")
LikesSportID (eg. 12)
In this example, you can get the following result from the query below:
SELECT FriendName, SportName
FROM tblFriend
INNER JOIN tblSport
ON tblFriend.LikesSportID = tblSport.SportID
Man, it's late - I hope I got that right. By the way, you should read up on the different types of joins - this is the simplest example of one.