SQLite: what's the right way to make a table holding data for several tables?

I need a table Attachments that stores data for a number of other tables such as Notes and Projects (+many more) with the following properties:
any other table can have many attachments
I frequently need to find all attachments for specific entries of another table (by their primary key in that table)
I've seen in other answers to similar questions that it's best to create the attachments table and then tables like NotesAttachments, ProjectsAttachments, etc. with the Attachment, Notes and Projects IDs as foreign keys. But that looks like complex overengineering to me.
What about directly storing the table Name itself as TEXT column in Attachments and use that name to look for the attachments of one table whenever I need them? So basically the plan is to query for (TableName, ForeignID) to obtain all attachments with integer id ForeignID in table TableName.
Is that problematic, and if so, why?

The attachments table now has a primary key consisting of two columns; this implies that all lookups search those two columns, and that you need a two-column index for these searches to be efficient.
But multi-column keys are a common feature in SQL databases, and are perfectly fine.
You might optimize the TableName column to store a short value, such as a single character, or a number. But the difference is probably not noticeable unless you have a really large amount of data.
(If you could ensure that the Notes/Projects/etc. use primary keys that are unique over all these tables, then you might be able to avoid storing the table name.)
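A minimal sketch in SQLite (the AttachmentID and Data columns are assumptions; since one entry can have many attachments, the two lookup columns are indexed rather than used as the primary key on their own):
CREATE TABLE Attachments (
    AttachmentID INTEGER PRIMARY KEY,  -- per-attachment id, since one entry can have many attachments
    TableName    TEXT    NOT NULL,     -- e.g. 'Notes' or 'Projects'
    ForeignID    INTEGER NOT NULL,     -- primary key value in the referenced table
    Data         BLOB
);
-- the two-column index that makes (TableName, ForeignID) lookups efficient
CREATE INDEX Attachments_Owner ON Attachments (TableName, ForeignID);
-- all attachments for the Notes row with id 42
SELECT AttachmentID, Data FROM Attachments WHERE TableName = 'Notes' AND ForeignID = 42;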

Related

SQL - What is best to do when multiple tables have the same columns

I have different tables in my schema with different columns, but I want to store data about when the table was modified or when the data was stored, so I added some columns to specify that.
I realized that I had to add the same "modification_date" and "modification_time" columns to all my tables, so I thought about making a new table called DATA_INFO so I won't need to do so, but every table has a different PRIMARY KEY and I don't know which one to add as FOREIGN KEY to the DATA_INFO table.
I don't know if I have to add all of them, or whether there is another way to do what I need.
It's better to have the same "modification_datetime" column in all tables, rather than trying to keep that data in a central table.
That's what we have done at every shop I've worked in.
I want to emphasize that a separate table is not reasonable for this purpose. The lack of an obvious foreign key is a hint.
Unlike Tab Allerman's, the tables that I create are much less likely to be updated, so I have three additional columns on most tables:
CreatedBy -- the user who created the row
CreatedAt -- when the row was created
CreatedOn -- the system where the table was created
The most important point is that this information can -- in many databases -- be implemented using default values rather than triggers. That is a big advantage of working within a single row. The fewer triggers, the better.
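A sketch of the default-value approach in SQL Server syntax (the Notes table and its Body column are just placeholders; the three audit columns and their defaults are the point):
CREATE TABLE Notes (
    NoteID    INT IDENTITY PRIMARY KEY,
    Body      NVARCHAR(MAX),
    CreatedBy SYSNAME   NOT NULL DEFAULT SUSER_SNAME(),    -- the user who created the row
    CreatedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(), -- when the row was created
    CreatedOn SYSNAME   NOT NULL DEFAULT @@SERVERNAME      -- the system where the row was created
);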

Current primary key is ineffective at preventing duplicates. Does this sound like a good way to rearchitect my tables?

Every so often, I update our research recruitment database with those who responded to our Craigslist ad. Each respondent is given a unique respondentID, which is the primary key.
Sometimes, people respond to these Craigslist ads multiple times. I think we may have duplicate people in our database, which is bad.
I would like to change the primary key of all my recruitment tables from respondentID to Email, which will prevent duplicates and make it easier to look up information. There are probably duplicate email records in my database already, and I need to clean this up if so.
Here's the current architecture for my three recruitment tables:
demographic - contains columns like RespondentID (PK), Email (I want this to be PK), Phone, etc
genre - contains columns like RespondentID (PK), Horror, etc
platform - contains columns like RespondentID (PK), TV, etc.
I want to join all three tables together at some point so we can get a better understanding of someone.
Here are my questions:
How can I eliminate duplicate respondents already in my database? (I can tell if they are duplicates because they will have the same Email value.)
Given my current architecture, how can I transition my database to have Email as the primary key without messing up my data?
After transitioning to a new architecture, what is the process I can use to delete duplicates in my Craigslist ad spreadsheet before I append them to Demo, Genre, and Platform tables?
Here are my ideas about solutions:
Create backup tables. Join the three tables and export the big table to Excel. In Excel, use Data Filtering and Conditional Formatting to find the duplicate entries, and delete them by hand. Unfortunately, I have 20,000 records, which will crash Excel. :( The chief issue is that I don't know how to remove duplicate entries within a table using SQL. (Also, if I have two entries by bobdole#republican.com, one entry should remain.) Can you come up with a smarter solution involving SQL and Access?
After each Email record is unique, I will create new tables with each using Email as the primary key.
When I want to remove duplicates within the data I'd like to import, I should be able to easily do it within Excel. Next, I will use this SQL command to deduplicate between the current database and the incoming data:
DELETE * FROM newParticipantsList
WHERE Email IN (SELECT Email FROM Demo)
I'm going to try to duplicate my current architecture in a small test table in Access and see if I can figure it out. Overall, I don't have much experience with joining tables and removing data in SQL, so it's a little scary.
Maybe I'm just being thick, but why don't you just create a new Identity column in the existing table? You can always remove those records you deem duplicates, but the Identity column is guaranteed to be unique under all circumstances.
It will be up to you to make sure that any new records inserted into the table are not duplicates, by checking the Email column.
To remove duplicates from the demographic table you could do something like:
WITH RecordsToKeep AS (
    SELECT MIN(RespondentID) AS RespondentID
    FROM demographic
    GROUP BY Email
)
DELETE demographic
FROM demographic
LEFT JOIN RecordsToKeep ON RecordsToKeep.RespondentID = demographic.RespondentID
WHERE RecordsToKeep.RespondentID IS NULL
This will keep the first record for each email address and delete the rest. You will need to remap the genre and platform tables before you delete the source rows.
In terms of what to do in the future, you could get SQL to do all the de-duplicating for you by importing the data into a staging table and then inserting only the distinct records whose email address isn't already in the demographic table.
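A sketch of that staging-table approach in SQL Server syntax (the staging table name and the column list are assumptions):
INSERT INTO demographic (RespondentID, Email, Phone)
SELECT s.RespondentID, s.Email, s.Phone
FROM staging AS s
JOIN (SELECT Email, MIN(RespondentID) AS RespondentID
      FROM staging
      GROUP BY Email) AS firsts
  ON firsts.RespondentID = s.RespondentID          -- one row per email within the import
WHERE NOT EXISTS (SELECT 1 FROM demographic AS d
                  WHERE d.Email = s.Email);        -- skip emails already in the database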
There is no reason to change the Email Address to be the primary key. Strings aren't great primary keys for a number of reasons. The problem you have isn't with duplicate keys; the problem is how you are inserting the data.

A single table that represents multiple tables

I have a problem finding a way to represent multiple hash tables in a single table.
Say I have 3 tables with the format:
Table1(Table1_PK1,Table1_PK2,Table1_PK3,Table1_Hash)
Table2(Table2_PK1,Table2_PK2,Table2_Hash)
Table3(Table3_Pk1,Table3_PK2,Table3_PK3,Table3_PK4,Table3_PK5,Table3_Hash)
Table1_PK1,Table1_PK2,Table1_PK3... are columns and they might have different datatypes (VARCHAR, INT or DATETIME ...).
My question is whether there is a way to create a single table (with a fixed number of columns) that can represent all of these 3 tables (there may be more in practice).
I am trying to do this for my database tool. Each table is actually a table which contains primary keys and the hash data associated with them.
Since you're apparently building a database tool, not a database, it might make more sense to do this in application code rather than in a database table.
In a different answer, you commented
I am still looking for a dynamic way to do it without knowing how many primary keys a table can have.
A table can have only one primary key. That primary key can consist of more than one column, though. (You already knew this; you were just using the wrong words, which might confuse others.)
A table can also have an arbitrary number of other keys, which will be either declared (as NOT NULL UNIQUE) or "undeclared" (by creating an index that guarantees uniqueness over a set of columns).
You can look all that stuff up at run time in one or both of two ways (the PostgreSQL documentation covers both):
System tables, sometimes called system catalogs
information_schema views
As far as I know, all modern SQL platforms implement at least one of these interfaces. The information_schema views are covered in the SQL standards, but there seems to be some room for interpretation. They don't look quite the same on all platforms.
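For example, in PostgreSQL the information_schema route to list a table's primary key columns looks roughly like this:
SELECT kcu.column_name, kcu.ordinal_position
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON kcu.constraint_name = tc.constraint_name
 AND kcu.table_schema    = tc.table_schema
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_schema    = 'public'
  AND tc.table_name      = 'table1'   -- whichever table you are inspecting
ORDER BY kcu.ordinal_position;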
Why combine the 3 tables into one? It would be really bad DB design. But here's a way to do it:
The one table will have a column for each of the 3 tables' columns you want in the final table. I am making the assumption that TableX_Hash is the same type, so that remains as one unique column:
Table_All_in_One (
Table1_PK1,
Table1_PK2,
Table1_PK3,
# space just for clarity of grouping
Table2_PK1,
Table2_PK2,
Table3_PK1,
Table3_PK2,
Table3_PK3,
Table3_PK4,
Table3_PK5,
TableX_Hash # Assuming all the _Hash'es are the same type+length,
# otherwise, add Table1_Hash, Table2_Hash, Table3_Hash
# This can be your new primary key
)
The Primary Keys (PKx) are required to be non-NULL only in their own tables. For this table, they have to allow nulls. The idea is that each row of this new table will only hold the data for one of the tables. The other columns will be empty for that row. If you want to associate the row of one table with another, you can add that to the same row or add FK_Table1_Hash, FK_Table2_Hash and FK_Table3_Hash columns which will refer to the TableX_Hash value of a record.
PS: I wonder if what you are really looking for is a View and not this really bad all-in-one table.
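Such a view might look like this (a sketch in MySQL syntax, reusing the column names from the question); it presents all three tables through one fixed set of columns without physically merging them:
CREATE VIEW all_hashes AS
SELECT 'Table1' AS source_table,
       CONCAT_WS('|', Table1_PK1, Table1_PK2, Table1_PK3) AS source_pk,
       Table1_Hash AS hash_value
FROM Table1
UNION ALL
SELECT 'Table2', CONCAT_WS('|', Table2_PK1, Table2_PK2), Table2_Hash
FROM Table2
UNION ALL
SELECT 'Table3',
       CONCAT_WS('|', Table3_PK1, Table3_PK2, Table3_PK3, Table3_PK4, Table3_PK5),
       Table3_Hash
FROM Table3;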
Edit: Combining them into one "without knowing how many primary keys a table can have", as per your comment:
Store all the _PKs concatenated into one column:
Table_All_in_One (
New_PK,
TableX_Hash,
Table1_PKx, # Concatenated PKs of Table1
Table2_PKx, # Concatenated PKs of Table2, etc.
...,
# OR just one
TableX_PKs, # concatenate all the PK's into one VARCHAR field
# Add a pipe `|` between them optionally.
Table_Num # If using just one, then you'll need to store the table number
)
You will not be able to conveniently pick records based on part of their composite primary key. It will always have to be TableX_PKs = CONCAT_WS('|', Table1_PK1, Table1_PK2, ...). So your only dependency is the number of PK columns in the original table.
In order to model a bunch of tables this way you will need 3 tables. The first, called a factor or entity table, contains the names of the tables you wish to set up this way. A factor_detail table contains all the columns and their associated properties for those tables. A third table, factor_detail_value, stores things like the lookup values for lookup tables. I'm trying to learn more about this myself because we are using this technique at work. You generate SQL on the fly for any table encoded this way, and store the data in a repository pertinent to the data itself. This way, if a table changes and you need to add a column or change a datatype, you can add a row to the factor_detail table without requiring a database shutdown in production. In most businesses a four-hour shutdown to make a table change can cost thousands of dollars. If you are dealing with insurance, for example, each additional state that you sell insurance in has different requirements for being able to sell it, and that will result in table changes. We reduced our table count from over 700 tables in this manner, and we can make changes without a database shutdown, avoiding the loss of revenue.
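A rough sketch of those three metadata tables (all names, columns, and datatypes here are assumptions, since the exact layout isn't spelled out above):
CREATE TABLE factor (
    factor_id   INT PRIMARY KEY,
    table_name  VARCHAR(128) NOT NULL          -- the logical table being described
);
CREATE TABLE factor_detail (
    factor_detail_id INT PRIMARY KEY,
    factor_id        INT NOT NULL REFERENCES factor (factor_id),
    column_name      VARCHAR(128) NOT NULL,
    data_type        VARCHAR(30)  NOT NULL     -- e.g. 'VARCHAR', 'INT', 'DATETIME'
);
CREATE TABLE factor_detail_value (
    factor_detail_id INT NOT NULL REFERENCES factor_detail (factor_detail_id),
    value            VARCHAR(255) NOT NULL     -- e.g. permitted lookup values
);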

Normalization Help

I am refactoring an old Oracle 10g schema to try to introduce some normalization. In one of the larger tables, there is a text field that has, at most, 10-15 possible values. In my mind, it seems that this field is an example of unnecessary data duplication and should be extracted to a separate table.
After examining the data, I cannot find one relevant piece of information that could be associated with that text value. Basically, if I pulled that value out and put it into its own table, it would be the only field in that table. It exists today as more of a 'flag' field. Should I create a two-column table with a surrogate key, keep it as it is, or do something entirely different? Am I doing more harm than good by trying to minimize data duplication on this field?
You might save some space by extracting the column to a separate table. This is called a lookup table. It can give you a couple of other benefits:
You can declare a foreign key constraint to the lookup table, so you can rely on the column in the main table never having any value other than the 10-15 values you want.
It's easy to query for a concise list of all permitted values, by querying the lookup table. This can be faster than using SELECT DISTINCT on the main table's column. It also returns values that are permitted, but not currently used in the main table.
If you change a value in the lookup table, it automatically applies to all rows in the main table that reference it.
However, creating a lookup table with one column is not strictly normalization. You're just replacing one value with another. The attribute in the main table either already satisfies a normal form or it doesn't.
Using surrogate keys (vs. natural keys) also has nothing to do with normalization. A lot of people make this mistake.
However, if you move other attributes into the lookup table, attributes that depend only on the lookup value and therefore would create repeating groups (violating 3NF) in the main table if you left them there, then that would be normalization.
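As a sketch (generic SQL; the table and column names here are made up for illustration), a lookup table plus a foreign key gives you both the constraint and a home for attributes that depend only on the looked-up value:
CREATE TABLE status_lookup (
    status_code  VARCHAR(20) PRIMARY KEY,   -- the 10-15 permitted values
    description  VARCHAR(200)               -- an attribute that depends only on the code
);
-- the large existing table now references the lookup table
ALTER TABLE big_table
    ADD CONSTRAINT fk_big_table_status
    FOREIGN KEY (status_code) REFERENCES status_lookup (status_code);
-- concise list of all permitted values, whether or not they are currently used
SELECT status_code, description FROM status_lookup;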
If you want normalization, break it out.
I think of these types of data in DBs as the equivalent of enums in C, C++, and C#. Mostly you put them in the table as documentation.
I often have an ID, Name, Description, and auditing columns for them (e.g. modified by, modified date, created date, created by, active). The description field is rarely used.
Example (some might say there are more than just 2)
Gender
ID Name Audit Columns...
1 Male
2 Female
Then in your contacts you would have a GenderID column which would link to this one.
Of course you don't "need" the table. You could have external documentation somewhere that says 1=Male, 2=Female -- but I think these tables serve to document a system.
If it's really a free-entry text field that's not re-used somewhere else in the database, and there's just a single field without repeated instances, I'd probably go ahead and leave it as it is. If you're determined to break it out I'd create a 'validation' table with a surrogate key and the text value, then put the surrogate key in the base table.
Share and enjoy.
Are these 10-15 values actually meaningful, or are they really just flags? If they're meaningful pieces of text and it seems wasteful to replicate them, then sure create a lookup table. But if they're just arbitrary flag values, then your new table will be nothing more than a mapping from one arbitrary value to another, and not terribly helpful.
A completely separate question is whether all or most of the rows in your big table even have a value for this column. If not, then indeed you have a good opportunity for normalization and can create a separate table linking the primary key from your base table with the flag value.
Edit: One thing. If there's some chance that one of these "flag" values is likely to be wholesale replaced with another value at some point in the future, that would be another good reason to create a table.

SQL Server select primary key from table where the key contains multiple columns

I am working on a legacy database. I am not able to change the schema. :( In a couple of tables the primary key uses multiple columns.
In the app I read the data from each row into a table; the user then updates the data and I write the data back into the table.
Currently I concatenate the various PK columns and store them as a unique id for when I put the data back into the table.
Now I was wondering if there is a more efficient way to do that. Coming from a MySQL background I am not aware of any, but thought SQL Server 2005 may have a function like
SELECT PRIMARYKEY() as pk, ... FROM table WHERE ...
the above would select the key that the database engine uses as the primary key for the given record
I searched and couldn't find anything. It's probably just me being fussy, but I don't like the concatenation trick.
DC
In SQL Server, there is no equivalent of PRIMARYKEY() that I would be aware of, really. You can consult the system catalog views to find out which columns make up the primary key, but you can't just simply select the primary key value(s) with a function call.
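For completeness, the catalog-view lookup looks roughly like this (SQL Server; 'dbo.YourTable' is a placeholder):
SELECT c.name AS column_name, ic.key_ordinal
FROM sys.key_constraints AS kc
JOIN sys.index_columns   AS ic
  ON ic.object_id = kc.parent_object_id
 AND ic.index_id  = kc.unique_index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id
 AND c.column_id = ic.column_id
WHERE kc.type = 'PK'
  AND kc.parent_object_id = OBJECT_ID('dbo.YourTable')
ORDER BY ic.key_ordinal;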
I would agree with StarShip3000 - what do you concatenate your PK values for? While I don't think a compound primary key made up of several columns is necessarily a very good idea, if it's a legacy system and you can't change it, I wouldn't bother concatenating the PK values on read, and then having to split them apart again when you write your data back. Just leave the structure as it is - compound keys aren't generally recommended, but they are indeed supported, no problem.
"Currently I concatenate the various PK columns and store them as a unique id for when I put the data back into the table."
Can't you just store the pk as two columns in the target table and use that to join back to the two columns on the source table?
What benefit is concatenating giving you here?
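For example, writing the edits back by joining on both key columns directly might look like this (SQL Server syntax; all table and column names here are hypothetical):
UPDATE t
SET    t.SomeValue = w.SomeValue
FROM   dbo.LegacyTable AS t
JOIN   dbo.WorkingCopy AS w
  ON   w.KeyPart1 = t.KeyPart1
 AND   w.KeyPart2 = t.KeyPart2;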