Normalizing a table, from one to the other - sql

I'm trying to normalize a mysql database....
I currently have a table that contains 11 columns for "categories". The first column is a user_id and the other 10 are category_id_1 - category_id_10. Some rows may only contain a category_id up to category_id_1 and the rest might be NULL.
I then have a table that has 2 columns, user_id and category_id...
What is the best way to transfer all of the data into separate rows in table 2 without adding a row for columns that are NULL in table 1?
thanks!

You can create a single query to do all the work, it just takes a bit of copy and pasting, and adjusting the column name:
INSERT INTO table2
SELECT * FROM (
SELECT user_id, category_id_1 AS category_id FROM table1
UNION ALL
SELECT user_id, category_id_2 FROM table1
UNION ALL
SELECT user_id, category_id_3 FROM table1
) AS T
WHERE category_id IS NOT NULL;
Since you only have to do this 10 times, and you can throw the code away when you are finished, I would think that this is the easiest way.

One table for users:
users(id, name, username, etc)
One for categories:
categories(id, category_name)
One to link the two, including any extra information you might want on that join.
categories_users(user_id, category_id)
-- or with extra information --
categories_users(user_id, category_id, date_created, notes)
To transfer the data across to the link table would be a case of writing a series of SQL INSERT statements. There's probably some awesome way to do it in one go, but since there's only 11 categories, just copy-and-paste IMO:
INSERT INTO categories_users
SELECT user_id, 1
FROM old_categories
WHERE category_1 IS NOT NULL

Related

Split one large, denormalized table into a normalized database

I have a large (5 million row, 300+ column) csv file I need to import into a staging table in SQL Server, then run a script to split each row up and insert data into the relevant tables in a normalized db. The format of the source table looks something like this:
(fName, lName, licenseNumber1, licenseIssuer1, licenseNumber2, licenseIssuer2..., specialtyName1, specialtyState1, specialtyName2, specialtyState2..., identifier1, identifier2...)
There are 50 licenseNumber/licenseIssuer columns, 15 specialtyName/specialtyState columns, and 15 identifier columns. There is always at least one of each of those, but the remaining 49 or 14 could be null. The first identifier is unique, but is not used as the primary key of the Person in our schema.
My database schema looks like this
People(ID int Identity(1,1))
Names(ID int, personID int, lName varchar, fName varchar)
Licenses(ID int, personID int, number varchar, issuer varchar)
Specialties(ID int, personID int, name varchar, state varchar)
Identifiers(ID int, personID int, value)
The database will already be populated with some People before adding the new ones from the csv.
What is the best way to approach this?
I have tried iterating over the staging table one row at a time with select top 1:
WHILE EXISTS (Select top 1 * from staging)
BEGIN
INSERT INTO People Default Values
SET #LastInsertedID = SCOPE_IDENTITY() -- might use the output clause to get this instead
INSERT INTO Names (personID, lName, fName)
SELECT top 1 #LastInsertedID, lName, fName from staging
INSERT INTO Licenses(personID, number, issuer)
SELECT top 1 #LastInsertedID, licenseNumber1, licenseIssuer1 from staging
IF (select top 1 licenseNumber2 from staging) is not null
BEGIN
INSERT INTO Licenses(personID, number, issuer)
SELECT top 1 #LastInsertedID, licenseNumber2, licenseIssuer2 from staging
END
-- Repeat the above 49 times, etc...
DELETE top 1 from staging
END
One problem with this approach is that it is prohibitively slow, so I refactored it to use a cursor. This works and is significantly faster, but has me declaring 300+ variables for Fetch INTO.
Is there a set-based approach that would work here? That would be preferable, as I understand that cursors are frowned upon, but I'm not sure how to get the identity from the INSERT into the People table for use as a foreign key in the others without going row-by-row from the staging table.
Also, how could I avoid copy and pasting the insert into the Licenses table? With a cursor approach I could try:
FETCH INTO ...#LicenseNumber1, #LicenseIssuer1, #LicenseNumber2, #LicenseIssuer2...
INSERT INTO #LicenseTemp (number, issuer) Values
(#LicenseNumber1, #LicenseIssuer1),
(#LicenseNumber2, #LicenseIssuer2),
... Repeat 48 more times...
.
.
.
INSERT INTO Licenses(personID, number, issuer)
SELECT #LastInsertedID, number, issuer
FROM #LicenseTEMP
WHERE number is not null
There still seems to be some redundant copy and pasting there, though.
To summarize the questions, I'm looking for idiomatic approaches to:
Break up one large staging table into a set of normalized tables, retrieving the Primary Key/identity from one table and using it as the foreign key in the others
Insert multiple rows into the normalized tables that come from many repeated columns in the staging table with less boilerplate/copy and paste (Licenses and Specialties above)
Short of discreet answers, I'd also be very happy with pointers towards resources and references that could assist me in figuring this out.
Ok, I'm not an SQL Server expert, but here's the "strategy" I would suggest.
Calculate the personId on the staging table
As #Shnugo suggested before me, calculating the personId in the staging table will ease the next steps
Use a sequence for the personID
From SQL Server 2012 you can define sequences. If you use it for every person insert, you'll never risk an overlapping of IDs. If you have (as it seems) personId that were loaded before the sequence you can create the sequence with the first free personID as starting value
Create a numbers table
Create an utility table keeping numbers from 1 to n (you need n to be at least 50.. you can look at this question for some implementations)
Use set logic to do the insert
I'd avoid cursor and row-by-row logic: you are right that it is better to limit the number of accesses to the table, but I'd say that you should strive to limit it to one access for target table.
You could proceed like these:
People:
INSERT INTO People (personID)
SELECT personId from staging;
Names:
INSERT INTO Names (personID, lName, fName)
SELECT personId, lName, fName from staging;
Licenses:
here we'll need the Number table
INSERT INTO Licenses (personId, number, issuer)
SELECT * FROM (
SELECT personId,
case nbrs.n
when 1 then licenseNumber1
when 2 then licenseNumber2
...
when 50 then licenseNumber50
end as licenseNumber,
case nbrs.n
when 1 then licenseIssuer1
when 2 then licenseIssuer2
...
when 50 then licenseIssuer50
end as licenseIssuer
from staging
cross join
(select n from numbers where n>=1 and n<=50) nbrs
) WHERE licenseNumber is not null;
Specialties:
INSERT INTO Specialties(personId, name, state)
SELECT * FROM (
SELECT personId,
case nbrs.n
when 1 then specialtyName1
when 2 then specialtyName2
...
when 15 then specialtyName15
end as specialtyName,
case nbrs.n
when 1 then specialtyState1
when 2 then specialtyState2
...
when 15 then specialtyState15
end as specialtyState
from staging
cross join
(select n from numbers where n>=1 and n<=15) nbrs
) WHERE specialtyName is not null;
Identifiers:
INSERT INTO Identifiers(personId, value)
SELECT * FROM (
SELECT personId,
case nbrs.n
when 1 then identifier1
when 2 then identifier2
...
when 15 then identifier15
end as value
from staging
cross join
(select n from numbers where n>=1 and n<=15) nbrs
) WHERE value is not null;
Hope it helps.
You say: but the staging table could be modified
I would
add a PersonID INT NOT NULL column and fill it with DENSE_RANK() OVER(ORDER BY fname,lname)
add an index to this PersonID
use this ID in combination with GROUP BY to fill your People table
do the same with your names table
And then use this ID for a set-based insert into your three side tables
Do it like this
SELECT AllTogether.PersonID, AllTogether.TheValue
FROM
(
SELECT PersonID,SomeValue1 AS TheValue FROM StagingTable
UNION ALL SELECT PersonID,SomeValue2 FROM StagingTable
UNION ALL ...
) AS AllTogether
WHERE AllTogether.TheValue IS NOT NULL
UPDATE
You say: might cause a conflict with IDs that already exist in the People table
You did not tell anything about existing People...
Is there any sure and unique mark to identify them? Use a simple
UPDATE StagingTable SET PersonID=xyz WHERE ...
to set existing PersonIDs into your staging table and then use something like
UPDATE StagingTable
SET PersonID=DENSE RANK() OVER(...) + MaxExistingID
WHERE PersonID IS NULL
to set new IDs for PersonIDs still being NULL.

how to not insert common data in a column in sqlite?

I have a table named category and the table looks like
cat_id cat_name
1 Science
2 Arts
and another table named item which looks like
item_id item_name cat_id
1 physics 1
2 literature 2
3 chemistry 1
please mark that cat_id is the foreign key here of item table.
now I want that If I put math as item_name under Arts category then it will insert successfully but I want this to happen in such a way that if I want to put same data again then it wont insert. please mark also that I have cat_name and item_name only then my query fetches the category_id using cat_name from category table and inserts into the item table like this way
insert into item (item_name,cat_id) select 'math',category.cat_id from category where category.cat_name = 'Arts'
but if I run this query again it inserts the same item math again, but I want to stop this to happen, what should I do?
Reyjohn, can't you define the column as unique? That will prevent duplicate values.
I don't use sqlite but my impression is it uses similar syntax to MySQL, but without some of the stricter checks.
As per this answer:
sqlite - How to get INSERT OR IGNORE to work
You could use INSERT OR IGNORE statement.
SQLite doesn't have "INSERT OR UPDATE" which would be perfect for your case.
But you can emulate it with "INSERT OR REPLACE" - which is supported by SQLite - together with a "UNION". I developed this technique because I wanted to issue one single query and let the SQLite engine do everything.
INSERT OR REPLACE INTO item (item_id, item_name, cat_id)
SELECT item_id, item_name, cat_id FROM (
SELECT item_id, item_name, 'new_category' FROM item
WHERE item_name='math'
UNION
SELECT NULL, 'math', 'new_category'
) ORDER BY item_id DESC LIMIT 1;
So you are basically making one "SELECT WHERE item_name='math'" to get the existing record (if it exists) and then you concatenate this result (UNION) with a SELECT that generates a new record with a NULL item_id.
The trick then is the "ORDER BY item_id DESC LIMIT 1" at the end, where the actual result for the INSERT statement is exactly 1 record: the one that existed, with same item_id and a 'new_category' or a new one with a NULL item_id, which forces SQLite to create one for you.
You can also check if record exists before (with a SELECT) and decide if you need an INSERT or an UPDATE. But this leads to 2 queries and additional code on the client side. Thats why I developed the technique above.

SQL unpivot & insert

Sorry for the lack of info -- SQL Server 2008.
I'm struggling to get a couple of column values from table A into a new row in table B for each row in A where a column isn't null.
Table A's structure is as:
UserID | ClientUserID | ClientSessionID | [and a load of other irrelevant columns)
Table B:
UserID | Name | Value
I want to create rows in table B for each non-null ClientUserID or ClientSessionID in A - using the column name as B's "Name", and column value as "B's Value".
I'm struggling to write my "unpivot" statement - just getting the syntax correct! I'm trying to follow along with some samples but can't
Here's my SQL query so far - any further help would be appreciated (just getting this SELECT is frustrating me, let alone doing the insert!)
SELECT UserID, ClientUserID, ClientSessionID FROM websiteuser WHERE ClientSessionID IS NOT null
This gives me the rows that I need to perform actions upon -- but I just can't get the syntax correct for UNPIVOTing this data and turning it into my insert.
You can unpivot records in this fashion by using UNION to get each new row:
INSERT INTO TableB (UserID, Name, Value)
SELECT UserID, 'ClientUserID' AS Name, ClientUserID AS Value
FROM TableA
WHERE ClientUserID IS NOT NULL
UNION ALL
SELECT UserID, 'ClientSessionID' AS Name, ClientSessionID AS Value
FROM TableA
WHERE ClientSessionID IS NOT NULL
I am using UNION ALL in this case as UNION implies a DISTINCT operation across the entire set, which should normally be unnecessary when pivoting unique records.
If your ClientUserID and ClientSessionID columns are not the same datatype, you may have to cast one or both to the same.

oracle unique constraint

I'm trying to insert distinct values from one table into another. My target table has a primary key studentid and when I perform distinct id from source to target the load is successful. When I'm trying to load a bunch of columns from source to target including student_id, I'm getting an error unique constraint violated. There is only one constraint on target which is the primary key on studentid.
my query looks like this (just an example)
insert into target(studentid, age, schoolyear)
select distinct id, age, 2012 from source
Why does the above query returns an error where as the below query works perfectly fine
insert into target(studentid)
select distinct id from source
help me troubleshoot this.
Thanks for your time.
In your first query you are selecting for distinct combination of three columns ie,
select distinct id, age, 2012 from source
Not the distinct id alone. In such case there are possibility for duplicate id's.
For example, Your above query is valid for this
id age
1 23
1 24
1 25
2 23
3 23
But in your second query you are selecting only distinct id's
select distinct id from source
So this will return like,
id
1
2
3
In this case there is no way for duplicates and your insert into target will not
fail.
If you really want to do bulk insert with constrain on target then go for
any aggregate functions
select id, max(age), max(2012) group by id from source
Or if you dont want to loose any records from source to target then remove your constraint on target and insert it.
Hope this helps

Select from multiple tables with different columns

Say I got this SQL schema.
Table Job:
id,title, type, is_enabled
Table JobFileCopy:
job_id,from_path,to_path
Table JobFileDelete:
job_id, file_path
Table JobStartProcess:
job_id, file_path, arguments, working_directory
There are many other tables with varying number of columns and they all got foreign key job_id which is linked to id in table Job.
My questions:
Is this the right approach? I don't have requirement to delete anything at anytime. I will require to select and insert mostly.
Secondly, what is the best approach to get the list of jobs with relevant details from all the different tables in a single database hit? e.g I would like to select top 20 jobs with details, their details can be in any table (depends on column type in table Job) which I don't know until runtime.
select (case when type = 'type1' then (select field from table1) else (select field from table2) end) as a from table;
Could it be a solution for you?