Mending bad BAD database design once data is in the system - sql

I know that that is not a question... erm anyway HERE is the question.
I have inherited a database that has one table in it, which looks much like this. Its aim is to record which species are found in the various (200-odd) countries.
ID
Species
Afghanistan
Albania
Algeria
American Samoa
Andorra
Angola
....
Western Sahara
Yemen
Zambia
Zimbabwe
A sample of the data would be something like this
id Species Afghanistan Albania American Samoa
1 SP1 null null null
2 SP2 1 1 null
3 SP3 null null 1
It seems to me this is a typical many to many situation and I want 3 tables.
Species, Country, and SpeciesFoundInCountry
The link table (SpeciesFoundInCountry) would have foreign keys in both the species and Country tables.
(It is hard to draw the diagram!)
Species
SpeciesID SpeciesName
Country
CountryID CountryName
SpeciesFoundInCountry
CountryID SpeciesID
Is there a magic way I can generate an insert statement that will get the CountryID from the new Country table based on the column name and the SpeciesID where there is a 1 in the original mega table?
I can do it for one Country (this is a select to show what I want out)
SELECT Species.ID, Country.CountryID
FROM Country, Species
WHERE (((Species.Afghanistan)=1)) AND (((Country.Country)="Afghanistan"));
(the mega table is called species)
But using this strategy I would need to do the query for each column in the original table.
Is there a way of doing this in sql?
I guess I can OR a load of my where clauses together and write a script to make the sql, seems inelegant though!
Any thoughts (or clarification required)?

I would use a script to generate all the individual queries, since this is a one-off import process.
Some programs such as Excel are good at mixing different dimensions of data (comparing column names to data inside rows) but relational databases rarely are.
However, you might find that some systems (such as Microsoft Access, surprisingly) have convenient tools which you can use to normalise the data. Personally I'd find it quicker to write the script but your relative skills with Access and scripting might be different to mine.

Why do you want to do it in SQL? Just write a little script that does the conversion.

When I run into these I write a script to do the conversion rather than trying to do it in SQL. It is typically much faster and easier for me. Pick any language you are comfortable with.
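A minimal sketch of such a one-off script in Python, for illustration only. It assumes the data has been copied into SQLite (standing in for Access here), that the wide table is called species_wide, and that every column other than id and Species is a country; those names are assumptions, not part of the original database.

```python
import sqlite3

def normalise(conn, wide_table="species_wide", fixed_cols=("id", "Species")):
    """Split a one-column-per-country table into Species, Country and a link table."""
    cur = conn.cursor()
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    cols = [row[1] for row in cur.execute(f'PRAGMA table_info("{wide_table}")')]
    countries = [c for c in cols if c not in fixed_cols]
    cur.executescript("""
        CREATE TABLE Species (SpeciesID INTEGER PRIMARY KEY, SpeciesName TEXT);
        CREATE TABLE Country (CountryID INTEGER PRIMARY KEY, CountryName TEXT UNIQUE);
        CREATE TABLE SpeciesFoundInCountry (
            CountryID INTEGER REFERENCES Country(CountryID),
            SpeciesID INTEGER REFERENCES Species(SpeciesID),
            PRIMARY KEY (CountryID, SpeciesID));
    """)
    cur.execute(f'INSERT INTO Species SELECT id, Species FROM "{wide_table}"')
    cur.executemany("INSERT INTO Country (CountryName) VALUES (?)",
                    [(c,) for c in countries])
    for country in countries:
        # Column names cannot be bound as parameters, so the column is quoted
        # and interpolated; the country name itself is passed as a bound value.
        cur.execute(
            f'INSERT INTO SpeciesFoundInCountry (CountryID, SpeciesID) '
            f'SELECT c.CountryID, w.id FROM "{wide_table}" w, Country c '
            f'WHERE w."{country}" = 1 AND c.CountryName = ?', (country,))
    conn.commit()

# Sample data taken from the question
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species_wide (id INTEGER, Species TEXT,
        "Afghanistan" INTEGER, "Albania" INTEGER, "American Samoa" INTEGER);
    INSERT INTO species_wide VALUES (1, 'SP1', NULL, NULL, NULL),
        (2, 'SP2', 1, 1, NULL), (3, 'SP3', NULL, NULL, 1);
""")
normalise(conn)
links = conn.execute("""
    SELECT s.SpeciesName, c.CountryName
    FROM SpeciesFoundInCountry l
    JOIN Species s ON s.SpeciesID = l.SpeciesID
    JOIN Country c ON c.CountryID = l.CountryID
    ORDER BY s.SpeciesName, c.CountryName""").fetchall()
print(links)
```

The loop runs one INSERT per country column, which is exactly the "query for each column" the question wanted to avoid writing by hand.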

If this was SQL Server, you'd use the Unpivot commands, but looking at the tag you assigned it's for access - am I right?
Although there is a pivoting command in access, there is no reverse statement.
Looks like it can be done with a complex join. Check this interesting article for a lowdown on how to unpivot in a select command.

You're probably going to want to create replacement tables in place. The script sort of depends on the scripting language you have available to you, but you should be able to create the country ID table simply by listing the columns of the table you have now. Once you've done that, you can do some string substitutions to go through all of the unique country names and insert into the speciesFoundInCountry table where the given country column is not null.

You could probably get clever and query the system tables for the column names, and then build a dynamic query string to execute, but honestly that will probably be uglier than a quick script to generate the SQL statements for you.
Hopefully you don't have too much dynamic SQL code that accesses the old tables buried in your codebase. That could be the really hard part.

In SQL Server this will generate your custom select you demonstrate. You can extrapolate to an insert
select
'SELECT Species.ID, Country.CountryID FROM Country, Species WHERE (((Species.' +
c.name +
')=1)) AND (((Country.Country)="' +
c.name +
'"))'
from syscolumns c
inner join sysobjects o
on o.id = c.id
where o.name = 'old_table_name'

As with the others I would most likely just do it as a one time quick fix in whatever manner works for you.
With these types of conversions, they are one off items, quick fixes, and the code doesn't have to be elegant, it just has to work. For these types of things I have done it many ways.

If this is SQL Server, you can use the sys.columns table to find all of the columns of the original table. Then you can use dynamic SQL and the UNPIVOT command (the reverse of PIVOT) to turn the country columns into rows. Look those up online for syntax.

I would definitely agree with your suggestion of writing a small script to produce your SQL with a query for every column.
In fact your script could have already been finished in the time you've spent thinking about this magical query (that you would use only one time and then throw away, so what's the use in making it all magicy and perfect)
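For what it's worth, such a generator might look roughly like the Python sketch below. The function name make_inserts and the table and column names are assumptions taken from the question's proposed design; the square brackets are Access-style quoting for column names that contain spaces.

```python
def make_inserts(countries, wide_table="Species", link_table="SpeciesFoundInCountry"):
    """Return one INSERT ... SELECT statement per country column."""
    statements = []
    for name in countries:
        # Double any single quote so the name is a valid SQL string literal.
        literal = name.replace("'", "''")
        statements.append(
            f"INSERT INTO {link_table} (SpeciesID, CountryID) "
            f"SELECT s.ID, c.CountryID FROM {wide_table} AS s, Country AS c "
            f"WHERE s.[{name}] = 1 AND c.CountryName = '{literal}';")
    return statements

sql = make_inserts(["Afghanistan", "American Samoa", "Cote d'Ivoire"])
print("\n".join(sql))
```

The output is plain text, so it can be reviewed before any of it is run against the database.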

Sorry, but the bloody posting parser removed the whitespace and formatting on my post. It makes it a lot harder to read.

#stomp:
Above the box where you type the answer, there are several buttons. The one that is 101010 is a code sample. You select all your text that is code, and then click that button. Then it doesn't get messed with much.
cout>>"I don't know C"
cout>>"Hello World"

I would use a Union query, very roughly:
Dim db As Database
Dim tdf As TableDef
Set db = CurrentDb
Set tdf = db.TableDefs("SO")
strSQL = "SELECT ID, Species, """ & tdf.Fields(2).Name _
& """ AS Country, [" & tdf.Fields(2).Name & "] AS CountryValue FROM SO "
For i = 3 To tdf.Fields.Count - 1
strSQL = strSQL & vbCrLf & "UNION SELECT ID, Species, """ & tdf.Fields(i).Name _
& """ AS Country, [" & tdf.Fields(i).Name & "] AS CountryValue FROM SO "
Next
db.CreateQueryDef "UnionSO", strSQL
You would then have a view that could be appended to your new design.

When I read the title 'bad BAD database design', I was curious to find out how bad it is. You didn't disappoint me :)
As others mentioned, a script would be the easiest way. This can be accomplished by writing about 15 lines of code in PHP.
SELECT * FROM ugly_table;
while(row)
foreach(row as field => value)
if(value == 1)
SELECT country_id from country_table WHERE country_name = field;
if(field == 'Species')
SELECT species_id from species_table WHERE species_name = value;
INSERT INTO better_table (...)
Obviously this is pseudo code and will not work as it is. You can also populate the countries and species table on the fly by adding insert statements here.

Sorry, I've done very little Access programming but I can offer some guidance which should help.
First lets walk through the problem.
It is assumed that you will typically need to generate multiple rows in SpeciesFoundInCountry for every row in the original table. In other words, species tend to be in more than one country. This is actually easy to do with a Cartesian product, a join with no join criteria.
To do a Cartesian product you will need to create the Country table. The table should have the country_id from 1 to N (N being the number of unique countries, 200 or so) and country name. To make life easy just use the numbers 1 to N in column order. That would make Afghanistan 1 and Albania 2 ... Zimbabwe N. You should be able to use the system tables to do this.
Next create a table or view from the original table which contains the species and a string with a 0 or 1 for each country. You will need to convert the null / not null values to a text 0 or 1 and concatenate all of the values into a single string. A description of the table and a text editor with regular expressions should make this easy. Experiment first with a single column and, once that's working, edit the create view/insert with all of the columns.
Next join the two tables together with no join criteria. This will give you a record for every species in every country, you're almost there.
Now all you have to do is filter out the records which are not valid; they will have a zero in the corresponding location in the string. Since the country table's country_id column holds the substring position, all you need to do is keep the records where that character is 1.
where substring(new_column, country_id, 1) = '1'
You will still need to create the species table and join to that
where a.species_name = b.species_name
a and b are table aliases.
Hope this helps
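The steps above can be sketched end to end as follows. This is only an illustration under assumptions: it uses SQLite through Python rather than Access, three country columns instead of 200, and the names old_table, species_flags and flags are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_table (id INTEGER, Species TEXT,
        Afghanistan INTEGER, Albania INTEGER, Algeria INTEGER);
    INSERT INTO old_table VALUES (1, 'SP1', NULL, NULL, NULL),
        (2, 'SP2', 1, 1, NULL), (3, 'SP3', NULL, NULL, 1);

    -- country_id doubles as the position of the country in the flag string
    CREATE TABLE country (country_id INTEGER PRIMARY KEY, country_name TEXT);
    INSERT INTO country VALUES (1, 'Afghanistan'), (2, 'Albania'), (3, 'Algeria');

    -- one '0'/'1' character per country, in column order
    CREATE VIEW species_flags AS
    SELECT id AS species_id,
           CASE WHEN Afghanistan IS NULL THEN '0' ELSE '1' END ||
           CASE WHEN Albania     IS NULL THEN '0' ELSE '1' END ||
           CASE WHEN Algeria     IS NULL THEN '0' ELSE '1' END AS flags
    FROM old_table;
""")

# Cartesian product of species and countries, keeping only the pairs whose
# character at position country_id is '1'.
pairs = conn.execute("""
    SELECT f.species_id, c.country_id
    FROM species_flags f CROSS JOIN country c
    WHERE substr(f.flags, c.country_id, 1) = '1'
    ORDER BY f.species_id, c.country_id""").fetchall()
print(pairs)
```

The resulting (species_id, country_id) pairs are the rows that would go into SpeciesFoundInCountry.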

OBTW,
If you have queries that already run against the old table you will need to create a view which replicates the old tables using the new tables. You will need to do a group by to denormalize the tables.
Tell your users that the old table/view will not be supported in the future and all new queries or updates to older queries will have to use the new tables.
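A rough sketch of such a compatibility view, again using SQLite from Python as a stand-in. The view name species_legacy and the helper legacy_view_sql are hypothetical; the GROUP BY with MAX(CASE ...) is one common way to denormalize, not necessarily the only one.

```python
import sqlite3

def legacy_view_sql(country_names):
    """Build a CREATE VIEW statement with one flag column per country."""
    cols = []
    for name in country_names:
        literal = name.replace("'", "''")
        # 1 when the species is linked to this country, NULL otherwise
        cols.append(f"MAX(CASE WHEN c.CountryName = '{literal}' THEN 1 END) AS \"{name}\"")
    return ("CREATE VIEW species_legacy AS\n"
            "SELECT s.SpeciesID AS ID, s.SpeciesName AS Species,\n  "
            + ",\n  ".join(cols) + "\n"
            "FROM Species s\n"
            "LEFT JOIN SpeciesFoundInCountry l ON l.SpeciesID = s.SpeciesID\n"
            "LEFT JOIN Country c ON c.CountryID = l.CountryID\n"
            "GROUP BY s.SpeciesID, s.SpeciesName")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Species (SpeciesID INTEGER PRIMARY KEY, SpeciesName TEXT);
    CREATE TABLE Country (CountryID INTEGER PRIMARY KEY, CountryName TEXT);
    CREATE TABLE SpeciesFoundInCountry (CountryID INTEGER, SpeciesID INTEGER);
    INSERT INTO Species VALUES (1, 'SP1'), (2, 'SP2');
    INSERT INTO Country VALUES (1, 'Afghanistan'), (2, 'Albania');
    INSERT INTO SpeciesFoundInCountry VALUES (1, 2), (2, 2);
""")
names = [r[0] for r in conn.execute("SELECT CountryName FROM Country ORDER BY CountryID")]
conn.execute(legacy_view_sql(names))
rows = conn.execute("SELECT * FROM species_legacy ORDER BY ID").fetchall()
print(rows)
```

Old read-only queries can then point at the view while new work targets the normalized tables.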

If I ever have to create a truckload of similar SQL statements and execute all of them, I often find Excel is very handy. Take your original query. Put your country list in column A and your SQL statement in column B, formatted as text (in quotes) with cell references inserted where the country appears in the SQL
e.g. ="INSERT INTO new_table SELECT ... (species." & A1 & ")= ... ));"
then just copy the formula down to create 200 different SQL statements, copy/paste the column to your editor and hit F5. You can of course do this with as many variables as you want.

When I've been faced with similar problems, I've found it convenient to generate a script that generates SQL scripts. Here's the sample you gave, abstracted to use %PAR1% in place of Afghanistan.
SELECT Species.ID, Country.CountryID
FROM Country, Species
WHERE (((Species.%PAR1%)=1)) AND (((Country.Country)="%PAR1%"))
UNION
Also the key word union has been added as a way to combine all the selects.
Next, you need a list of countries, generated from your existing data:
Afghanistan
Albania
.
.
.
Next you need a script that can iterate through the country list, and for each iteration,
produce an output that substitutes Afghanistan for %PAR1% on the first iteration, Albania for the second iteration and so on. The algorithm is just like mail-merge in a word processor. It's a little work to write this script. But, once you have it, you can use it in dozens of one-off projects like this one.
Finally, you need to manually change the last "UNION" back to a semicolon.
If you can get Access to perform this giant union, you can get the data you want in the form you want, and insert it into your new table.
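The mail-merge step could be sketched in Python like this. The merge function is a hypothetical helper; the template is the one above except that the column reference is wrapped in square brackets, an addition made here so that names such as American Samoa survive. Joining the pieces with UNION, rather than appending UNION to each one, also removes the need to fix the last one by hand.

```python
TEMPLATE = ('SELECT Species.ID, Country.CountryID FROM Country, Species '
            'WHERE (((Species.[%PAR1%])=1)) AND (((Country.Country)="%PAR1%"))')

def merge(template, values, placeholder="%PAR1%"):
    """Substitute each value into the template and combine the results with UNION."""
    parts = [template.replace(placeholder, v) for v in values]
    return "\nUNION\n".join(parts) + ";"

query = merge(TEMPLATE, ["Afghanistan", "Albania", "American Samoa"])
print(query)
```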

I would make it a three step process with a slight temporary modification to your SpeciesFoundInCountry table. I would add a column to that table to store the Country name. Then the steps would be as follows.
1) Create/Run a script that walks columns in the source table and creates a record in SpeciesFoundInCountry for each column that has a true value. This record would contain the country name.
2) Run a SQL statement that updates the SpeciesFoundInCountry.CountryID field by joining to the Country table on Country Name.
3) Cleanup the SpeciesFoundInCountry table by removing the CountryName column.
Here is a little MS Access VB/VBA pseudo code to give you the gist
Public Sub CreateRelationshipRecords()
Dim rstSource as DAO.Recordset
Dim rstDestination as DAO.Recordset
Dim fld as DAO.Field
dim strSQL as String
Dim lngSpeciesID as Long
strSQL = "SELECT * FROM [ORIGINALTABLE]"
Set rstSource = CurrentDB.OpenRecordset(strSQL)
set rstDestination = CurrentDB.OpenRecordset("SpeciesFoundInCountry")
rstSource.MoveFirst
' Step through each record in the original table
Do Until rstSource.EOF
lngSpeciesID = rstSource!ID
' Now step through the fields(columns). If the field
' value is one (1), then create a relationship record
' using the field name as the Country Name
For Each fld in rstSource.Fields
' Skip the ID and Species columns: they are not countries, and an ID
' of 1 would otherwise be mistaken for a country flag
If fld.Name <> "ID" And fld.Name <> "Species" Then
If fld.Value = 1 then
with rstDestination
.AddNew
.Fields("CountryID").Value = Null
.Fields("CountryName").Value = fld.Name
.Fields("SpeciesID").Value = lngSpeciesID
.Update
End With
End If
End If
Next fld
rstSource.MoveNext
Loop
' Clean up
rstSource.Close
Set rstSource = nothing
....
End Sub
After this you could run a simple SQL statement to update the CountryID values in the SpeciesFoundInCountry table.
UPDATE SpeciesFoundInCountry INNER JOIN Country ON SpeciesFoundInCountry.CountryName = Country.CountryName SET SpeciesFoundInCountry.CountryID = Country.CountryID;
Finally, all you have to do is cleanup the SpeciesFoundInCountry table by removing the CountryName column.
SIDE NOTE: I have found it useful to have country tables that also include the ISO abbreviations (country codes). Occasionally they are used as foreign keys in other tables so that a join to the Country table does not have to be included in queries.
For more info: http://en.wikipedia.org/wiki/Iso_country_codes

This is (hopefully) a one-off exercise, so an inelegant solution might not be as bad as it sounds.
The problem (as, I'm sure you're only too aware!) is that at some point in your query you've got to list all those columns. :( The question is, what is the most elegant way to do this? Below is my attempt. It looks unwieldy because there are so many columns, but it might be what you're after, or at least it might point you in the right direction.
Possible SQL Solution:
/* if you have N countries */
CREATE TABLE Country
(id int,
name varchar(50))
INSERT Country
SELECT 1, 'Afghanistan'
UNION SELECT 2, 'Albania'
UNION SELECT 3, 'Algeria'
UNION SELECT 4, 'American Samoa'
UNION SELECT 5, 'Andorra'
UNION SELECT 6, 'Angola'
...
UNION SELECT N-3, 'Western Sahara'
UNION SELECT N-2, 'Yemen'
UNION SELECT N-1, 'Zambia'
UNION SELECT N, 'Zimbabwe'
CREATE TABLE #tmp
([key] varchar(N),
country_id int)
/* "key" field needs to be as long as N; it is bracketed because KEY is a reserved word */
INSERT #tmp
SELECT '1________ ... _', 1 /* Afghanistan */
/* '1' followed by underscores to make the length = N */
UNION SELECT '_1_______ ... ___', 2 /* Albania */
UNION SELECT '__1______ ... ___', 3 /* Algeria */
...
UNION SELECT '________ ... _1_', N-1 /* Zambia */
UNION SELECT '________ ... __1', N /* Zimbabwe */
CREATE TABLE new_table
(species_id int,
country_id int)
INSERT new_table
SELECT s.id, t.country_id
FROM species s ,
#tmp t
WHERE isnull( cast(s.Afghanistan as char(1)), ' ' ) +
isnull( cast(s.Albania as char(1)), ' ' ) +
... +
isnull( cast(s.Zambia as char(1)), ' ' ) +
isnull( cast(s.Zimbabwe as char(1)), ' ' ) like t.[key]
My Suggestion
Personally, I would not do this. I would do a quick and dirty solution like the one to which you allude, except that I would hard-code the country ids (because you're only going to do this once, right? And you can do it right after you create the country table, so you know what all the IDs are):
INSERT new_table SELECT Species.ID, 1 FROM Species WHERE Species.Afghanistan = 1
INSERT new_table SELECT Species.ID, 2 FROM Species WHERE Species.Albania= 1
...
INSERT new_table SELECT Species.ID, 999 FROM Species WHERE Species.Zambia= 1
INSERT new_table SELECT Species.ID, 1000 FROM Species WHERE Species.Zimbabwe= 1

Related

Access Append Query compare with table

I am currently rebuilding a messy Access database and I encountered the following problem:
I've got a table of facilities which contains a column called district. That column contains a number linked to another table which just contains the numbers and names of districts. I added a lookup column with the name of the district displayed.
I now want to set the new column for every row depending on the data in the old column.
Facilities
NAME|..|DISTRICT_OLD
A |..| 1
B |..| 2
C |..| 1
...
DISTRICTS
ID|NAME
1 |EAST
2 |WEST
...
I would like something like the following:
Facilities
NAME|..|DISTRICT_OLD|DISTRICT
A |..| 1|EAST
B |..| 2|WEST
C |..| 1|EAST
...
The District Field (lookup) gets its Data like follows SELECT [DISTRICTS].ID, [DISTRICTS].NAME FROM DISTRICTS ORDER BY [NAME];
(Thanks to Gordon Linoff) I could get the query but I do now struggle with the insert. I can get the Data I want:
SELECT [DISTRICTS].NAME FROM Facilities INNER JOIN DISTRICTS ON Facilities.DISTRICT_OLD = [DISTRICTS].ID;
If I try to INSERT INTO Facilities(DISTRICT) it reports a type error.
How can I modify the data to be compatible with a lookup column?
I guess I need to select the ID as well, which isn't a problem, but then the error says too many columns.
I hope I haven't mistaken any names; my Access isn't running in English.
Can you help me?
Fabian
Lookup columns are number (long integer)
With a relational database you only need the single column containing the ID (you always look up the district name with a query), so:
INSERT INTO Facilities(DISTRICT) SELECT 4
where 4 is the ID of the record in the lookup table that you want, or better still:
INSERT INTO Facilities(DISTRICT)
SELECT ID FROM DISTRICTS
WHERE DISTRICTS.NAME = "Name you want the ID for"

SQL using where contains to return rows based on the content of another table

I need some help:
I have a table called Countries, which has a column named Town and a column named Country.
Then I have table named Stores, which has several columns (it is a very badly set up table) but the ones that are important are the columns named Address1 and Address2.
I want to return all of the rows in Stores where Address1 and Address2 contains the towns in the Countries table.
I have a feeling this is a simple solution but I just can't see it.
It would help if maybe you could use WHERE CONTAINS but in your parameters search in another table's column?
e.g.
SELECT *
FROM Stores
WHERE CONTAINS (Address1, 'Select Towns from Countries')
but obviously that is not possible, is there a simple solution for this?
You're close
SELECT * FROM Stores s
WHERE EXISTS (
SELECT * FROM Countries
WHERE CONTAINS(s.Address1, Town) OR CONTAINS(s.Address2, Town)
)
This would be my first attempt:
select * from stores s
where
exists
(
select 1 from countries c
where s.Address1 + ' ' + s.Address2 like '%'+c.Town+'%'
)
Edit: Ooops just saw that you want the 'CONTAINS' clause. Then take Paul's solution
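As a small runnable illustration of the EXISTS plus LIKE approach, here is a sketch using SQLite from Python. The sample rows are invented, and SQLite's || operator is used for concatenation where SQL Server would use +; it does not exercise full-text CONTAINS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Countries (Town TEXT, Country TEXT);
    CREATE TABLE Stores (Name TEXT, Address1 TEXT, Address2 TEXT);
    INSERT INTO Countries VALUES ('Leeds', 'UK'), ('Lyon', 'France');
    INSERT INTO Stores VALUES
        ('A', '1 High St', 'Leeds'),
        ('B', '2 Rue X, Lyon', NULL),
        ('C', '3 Main St', 'Springfield');
""")

# Testing each address column separately keeps a NULL in one column from
# hiding a match in the other.
matches = conn.execute("""
    SELECT s.Name FROM Stores s
    WHERE EXISTS (
        SELECT 1 FROM Countries c
        WHERE s.Address1 LIKE '%' || c.Town || '%'
           OR s.Address2 LIKE '%' || c.Town || '%')
    ORDER BY s.Name""").fetchall()
print(matches)
```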

How to efficiently write DISTINCT query in Django with table having foreign keys

I want to show distinct cities of Users in the front end dropdown. For that, i make a db query which fetches distinct city_name from table City but only those cities where users are present.
Something like below works for a small User table, but takes a very long time if the User table is of size 10 million. The distinct cities of these users are still ~100 though.
class City(models.Model):
city_code = models.IntegerField(unique=True)
city_name = models.CharField(max_length=256)
class User(models.Model):
city = models.ForeignKey('City', to_field='city_code')
Now i try to search distinct city names as:
User.objects.filter().values_list('city__city_name').distinct()
which translates to this on PostgreSQL:
SELECT DISTINCT "city"."city_name"
FROM "user"
LEFT OUTER JOIN "city"
ON ("user"."city_id" = "city"."city_code");
Time: 9760.302 ms
That clearly showed that PostgreSQL was not making use of index on 'user'.'city_id'. I also read about a workaround solution here which involved writing a custom SQL query which somehow utilizes index.
I tried to find the distinct 'user'.'city_id' values using that workaround query, and that actually turned out to be pretty fast.
WITH
RECURSIVE t(n) AS
(SELECT min(city_id)
FROM "user"
UNION
SELECT
(SELECT city_id
FROM "user"
WHERE city_id > n order by city_id limit 1)
FROM t
WHERE n is not null)
SELECT n
FROM t;
Time: 79.056 ms
But now i am finding it hard to incorporate this in my Django code. I still think it is a kind of hack adding custom query in the code for this. But a bigger concern for me is that the column name can be totally dynamic, and i can not hardcode these column names (eg. city_id, etc.) in the code.
#original_fields could be a list from input, like ['area_code__district_code__name']
dataset_klass.objects.filter().values_list(*original_fields).distinct()
Using the custom query would need at least splitting the field name with '__' as the delimiter and processing the first part. But it looks like a bad hack to me.
How can i improve this?
PS. The City User example is just shown to explain the scenario. The syntax might not be correct.
I finally arrived at this workaround.
from django.db import connection, transaction
original_field = 'city__city_name'
dataset_name = 'user'
dataset_klass = eval(camelize(dataset_name))
split_arr = original_field.split("__", 1)
# If a foreign key relation is present
if len(split_arr) > 1:
    parent_field = dataset_klass._meta.get_field_by_name(split_arr[0])[0]
    cursor = connection.cursor()
    # This query will run fast only if parent_field is indexed (city_id)
    cursor.execute('WITH RECURSIVE t(n) AS ( select min({0}) from {1} '
                   'union select (select {0} from {1} where {0} > n'
                   ' order by {0} limit 1) from t where n is not null) '
                   'select n from t;'.format(parent_field.get_attname_column()[1], dataset_name))
    # Create a list of all distinct city_id's
    distinct_values = [single[0] for single in cursor.fetchall()]
    # Create a dict of foreign key field to the above list
    # to get the actual city_name's using _meta information
    filter_dict = {parent_field.rel.field_name + '__in': distinct_values}
    values = parent_field.rel.to.objects.filter(**filter_dict).values_list(split_arr[1])
else:
    values = dataset_klass.objects.filter().values_list(original_field).distinct()
This uses the index on city_id in the user table and runs pretty fast.
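To see the recursive query in isolation, here is a sketch that runs the same pattern against an in-memory SQLite database instead of PostgreSQL and Django. The sample rows and the index name are invented, and timings on SQLite say nothing about PostgreSQL. Note that the recursion emits one trailing NULL when it runs out of values, which the final SELECT filters out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (city_code INTEGER UNIQUE, city_name TEXT);
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, city_id INTEGER);
    CREATE INDEX user_city_idx ON "user"(city_id);
    INSERT INTO city VALUES (10, 'Pune'), (20, 'Delhi'), (30, 'Agra');
""")
conn.executemany('INSERT INTO "user" (city_id) VALUES (?)',
                 [(10,)] * 500 + [(20,)] * 500)

# Each recursion step jumps to the next larger city_id instead of reading every row.
LOOSE_SCAN = """
WITH RECURSIVE t(n) AS (
    SELECT min(city_id) FROM "user"
    UNION
    SELECT (SELECT city_id FROM "user" WHERE city_id > n
            ORDER BY city_id LIMIT 1)
    FROM t WHERE n IS NOT NULL)
SELECT n FROM t WHERE n IS NOT NULL
"""
city_ids = sorted(row[0] for row in conn.execute(LOOSE_SCAN))

# Look the names up in the small city table with bound parameters.
placeholders = ",".join("?" * len(city_ids))
names = sorted(row[0] for row in conn.execute(
    f"SELECT city_name FROM city WHERE city_code IN ({placeholders})", city_ids))
print(city_ids, names)
```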

How can I retrieve similar data from two separate tables simultaneously?

Disclaimer: my SQL skills are basic, to say the least.
Let's say I have two similar data types in different tables of the same database.
The first table is called hardback and the fields are as follows:
hbID | hbTitle | hbPublisherID | hbPublishDate
The second table is called paperback and its fields hold similar data but the fields are named differently:
pbID | pbTitle | pbPublisherID | pbPublishDate
I need to retrieve the 10 most recent hardback and paperback books, where the publisher ID is 7.
This is what I have so far:
SELECT TOP 10
hbID, hbTitle, hbPublisherID, hbPublishDate AS pDate
bpID, pbTitle, bpPublisherID, pbPublishDate AS pDate
FROM hardback CROSS JOIN paperback
WHERE (hbPublisherID = 7) OR (pbPublisherID = 7)
ORDER BY pDate DESC
This returns seven columns per row, at least three of which may or may not be for the wrong publisher. Possibly four, depending on the contents of pDate, which is almost certainly going to be a problem if the other six columns are for the correct publisher!
In an effort to release an earlier version of this software, I ran two separate queries fetching 10 records each, then sorted them by date and discarded the bottom ten, but I just know there must be a more elegant way to do it!
Any suggestions?
Aside: I was reviewing what I'd written here, when my Mac suddenly experienced a kernel panic. Restarted, reopened my tabs and everything I'd typed was still here! Stack Exchange sites are awesome :)
The easiest way is probably a UNION:
SELECT TOP 10 * FROM
(SELECT hbID, hbTitle, hbPublisherID as PublisherID, hbPublishDate as pDate
FROM hardback
UNION
SELECT pbID, pbTitle, pbPublisherID, pbPublishDate
FROM paperback
) books
WHERE PublisherID = 7
ORDER BY pDate DESC
If you could have two copies of the same title (1 paperback, 1 hardcover), change the UNION to a UNION ALL; UNION alone discards duplicates. You could also add a column that indicates what book type it is by adding a pseudo-column to each select (after the publish date, for instance):
hbPublishDate as pDate, 'H' as Covertype
You'll have to add the same new column to the paperback half of the query, using 'P' instead. Note that on the second query you don't have to specify column names; the result set takes the names from the first one. All column data types in the two queries have to match as well: you can't UNION a date column in the first with a numeric column in the second without converting the two columns to the same datatype in the query.
Here's a sample script for creating two tables and doing the select above. It works just fine in SQL Server Management Studio. Just remember to drop the two tables (using DROP Table tablename) when you're done.
use tempdb;
create table Paperback (pbID Integer Identity,
pbTitle nvarchar(30), pbPublisherID Integer, pbPubDate Date);
create table Hardback (hbID Integer Identity,
hbTitle nvarchar(30), hbPublisherID Integer, hbPubDate Date);
insert into Paperback (pbTitle, pbPublisherID, pbPubDate)
values ('Test title 1', 1, GETDATE());
insert into Hardback (hbTitle, hbPublisherID, hbPubDate)
values ('Test title 1', 1, GETDATE());
select * from (
select pbID, pbTitle, pbPublisherID, pbPubDate, 'P' as Covertype
from Paperback
union all
select hbID, hbTitle, hbPublisherID, hbPubDate,'H'
from Hardback) books
order by CoverType;
/* You'd drop the two tables here with
DROP table Paperback;
DROP table HardBack;
*/
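For readers without SQL Server to hand, the same idea can be tried in SQLite through Python. This is only a sketch: the sample rows are invented, LIMIT 10 plays the role of TOP 10, and dates are stored as ISO text so that they sort correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hardback (hbID INTEGER PRIMARY KEY, hbTitle TEXT,
                           hbPublisherID INTEGER, hbPublishDate TEXT);
    CREATE TABLE paperback (pbID INTEGER PRIMARY KEY, pbTitle TEXT,
                            pbPublisherID INTEGER, pbPublishDate TEXT);
""")
for day in range(1, 9):
    # Eight January hardbacks, all from publisher 7
    conn.execute("INSERT INTO hardback (hbTitle, hbPublisherID, hbPublishDate) "
                 "VALUES (?, ?, ?)", (f"HB {day}", 7, f"2020-01-{day:02d}"))
    # Eight February paperbacks; only the odd days are from publisher 7
    conn.execute("INSERT INTO paperback (pbTitle, pbPublisherID, pbPublishDate) "
                 "VALUES (?, ?, ?)",
                 (f"PB {day}", 7 if day % 2 else 3, f"2020-02-{day:02d}"))

recent = conn.execute("""
    SELECT cover, title, pDate FROM (
        SELECT 'H' AS cover, hbTitle AS title,
               hbPublisherID AS publisherID, hbPublishDate AS pDate
        FROM hardback
        UNION ALL
        SELECT 'P', pbTitle, pbPublisherID, pbPublishDate
        FROM paperback) books
    WHERE publisherID = 7
    ORDER BY pDate DESC
    LIMIT 10""").fetchall()
for row in recent:
    print(row)
```

Filtering and sorting happen once, over the combined rows, so no second pass in application code is needed.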
I think it would clearly be better to make only one table, with a reference to another table that holds the category of the entry (hardback or paperback). This is my first suggestion.
By the way, what is your programming language?

Get words from sentence - SQL

Suppose I have a description column that contains
Column Description
------------------
I live in USA
I work as engineer
I have another table containing the list of countries. Since USA (a country name) is mentioned in the first row, I need that row.
In the second case there is no country name, so I don't need that row.
Can you please clarify
You may want to try something like the following:
SELECT cd.*
FROM column_description cd
JOIN countries c ON (INSTR(cd.description, c.country_name) > 0);
If you are using SQL Server, you should be able to use the CHARINDEX() function instead of INSTR(), which is available for MySQL and Oracle. You can also use LIKE as other answers have suggested.
Test case:
CREATE TABLE column_description (description varchar(100));
CREATE TABLE countries (country_name varchar(100));
INSERT INTO column_description VALUES ('I live in USA');
INSERT INTO column_description VALUES ('I work as engineer');
INSERT INTO countries VALUES ('USA');
Result:
+---------------+
| description |
+---------------+
| I live in USA |
+---------------+
1 row in set (0.01 sec)
This is a really bad idea, to join on arbitrary text like this. It will be very slow and may not even work, but give it a shot:
select t1.description, c.*
from myTable t1
left join countries c on t1.description like CONCAT('%',c.countryCode,'%')
It's not entirely clear from your post, but I think you are asking to return all the rows in the table whose descriptions contain a country name? If that's the case you can just use the SQL LIKE operator, like the following.
select
d.column_description
from
description_table d
where exists (
select 1 from country c
where d.column_description like concat('%', c.country_name, '%'))
If not I think your only other choice is Dans post.
Enjoy !
All the suggestions so far seem to match partial words e.g. 'I AM USAIN BOLT' would match the country 'USA'. The question implies that matching should be done on whole words.
If the text consisted entirely of alphanumeric characters and each word was separated by a space character, you could use something like this
Descriptions AS D1
LEFT OUTER JOIN Countries AS C1
ON ' ' + D1.description + ' '
LIKE '%' + ' ' + country_name + ' ' + '%'
However, 'sentence' implies punctuation e.g. the above would fail to match 'I work in USA, Goa and Iran.' You need to delimit words before you can start matching them. Happily, there are already solutions to this problem e.g. full text search in SQL Server and the like. Why reinvent the wheel?
Another problem is that a single country can go by many names, e.g. my country can legitimately be referred to as 'Britain', 'UK', 'GB' (according to my stackoverflow profile), 'England' (if you ask my kids) and 'The United Kingdom of Great Britain and Northern Ireland' (the latter is what it says on my passport, and no, it won't fit in your NVARCHAR(50) column ;) to name but a few.
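If the matching can be done outside the database, whole-word matching is easy to sketch with a regular expression. The function below is a hypothetical helper, not something from the answers above; it handles punctuation around a word but does nothing about the many-names problem.

```python
import re

def mentioned_countries(description, names):
    """Return the names that occur in the description as whole words."""
    found = []
    for name in names:
        # \b anchors the match at word boundaries, so 'USA' does not match 'USAIN'
        if re.search(r"\b" + re.escape(name) + r"\b", description):
            found.append(name)
    return found

print(mentioned_countries("I AM USAIN BOLT", ["USA"]))
print(mentioned_countries("I work in USA, Goa and Iran.", ["USA", "Iran", "UK"]))
```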