How to manage initial data in PostgreSQL? - sql

I have a project with a PostgreSQL database. I'm handling migrations with Flyway. Now I have some initial data that I want to add to the database when the application starts. It's data that should always be there from the beginning. How could I handle this data initialization properly?
I've been thinking about using Flyway's repeatable migrations, which are re-run whenever the checksum of the SQL file changes. The problem is that I would then need to build the file out of plain SQL INSERT statements, and what happens if the row already exists? Ideally, I would want to specify the data in the SQL and have the migration insert it into the table only if it isn't there yet. But it should compare every field, not just the primary key, because if I change something in one row, I want that change to reach the database. Of course I could always drop the whole contents of the table and then run the migration, but isn't that a little cumbersome in the long run? After every little edit I'd have to drop the table and re-run the migration... I just wonder if there is a better way to handle the initial data?

You can specify the primary key value with INSERT or COPY by including the column like any other. With INSERT, you can add an ON CONFLICT DO UPDATE clause to apply any changes to rows that already exist. If you're on 9.4 or below, ON CONFLICT isn't available, so you're stuck with a DELETE followed by a plain INSERT or COPY, although knowing the primary keys means you don't have to delete the entire table.
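For example, a repeatable migration can be written as an upsert, so that re-running it inserts missing rows and updates changed ones. A minimal sketch, assuming a hypothetical reference table countries keyed on code (the names are placeholders, not from the question):

    -- Re-runnable seed data: insert new rows, update rows whose values changed
    INSERT INTO countries (code, name)
    VALUES ('FI', 'Finland'),
           ('SE', 'Sweden')
    ON CONFLICT (code) DO UPDATE
        SET name = EXCLUDED.name;

Note that the conflict target has to be the primary key or another unique constraint, so a row whose key itself changes still needs to be handled by hand.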

Related

BEFORE and AFTER triggers, FOR EACH ROW and FOR EACH STATEMENT triggers

I'm new to sql and in particular to postgresql, and I'm studying it for university, but I'm having trouble understanding when I should use AFTER TRIGGERS instead of BEFORE TRIGGERS and when I should make my trigger a FOR EACH ROW TRIGGER or a FOR EACH STATEMENT TRIGGER.
From what I understood, every time the constraint involves a count, a sum, an average, or depends on a property of the whole table, I should use an AFTER trigger with FOR EACH STATEMENT, but I'm not sure and honestly I'm pretty confused.
Do you have any tips for when I should use each type of trigger, or how to understand when I should choose one over the others?
Thank you!
You use a BEFORE trigger FOR EACH ROW if you want to modify the data before they get written to the database.
You use an AFTER trigger if you need the data modifications to be already done, for example if you want to insert a row that references these data via a foreign key constraint.
You use a FOR EACH ROW trigger if you need to deal with each row on its own, and FOR EACH STATEMENT if the actual rows processed don't concern you (e.g., you want to write an audit log entry for the statement) or you want to access the modified data as a whole (e.g., throw an error if someone tries to delete more than 10 rows with a single SQL statement).
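As a rough illustration in PostgreSQL (the users and audit_log tables and the email column are invented for the example; EXECUTE FUNCTION is the PostgreSQL 11+ spelling, older versions use EXECUTE PROCEDURE):

    -- BEFORE ... FOR EACH ROW: rewrite the row before it is stored
    CREATE FUNCTION lowercase_email() RETURNS trigger AS $$
    BEGIN
        NEW.email := lower(NEW.email);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_lowercase_email
        BEFORE INSERT OR UPDATE ON users
        FOR EACH ROW EXECUTE FUNCTION lowercase_email();

    -- AFTER ... FOR EACH STATEMENT: one audit entry per statement, however many rows it touched
    CREATE FUNCTION audit_users() RETURNS trigger AS $$
    BEGIN
        INSERT INTO audit_log (table_name, operation, logged_at)
        VALUES (TG_TABLE_NAME, TG_OP, now());
        RETURN NULL;  -- the return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_audit
        AFTER INSERT OR UPDATE OR DELETE ON users
        FOR EACH STATEMENT EXECUTE FUNCTION audit_users();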

How to trace row level dependency?

In case I need to change the PK of a single row from 1 to 10, for example, is there any way to trace every proc, view and function that might reference the old value?
I mean, a simple select in a proc like "select * from table where FK = 1" would break, and I'd have to look for every reference to the old value in every proc and view and change it to 10 to get the system to work.
Is there any automatic way of doing this? I use SQL SERVER.
I suspect that the only way to do this correctly involves querying the database metadata to identify all the places that use your PK as a FK, in a proc, or in a view. This is likely to be complex, fragile, and prone to error.
This is one of the (many) reasons to avoid having the PK be anything other than a system-derived, meaningless value that is not open to manipulation by (even) the creator/administrator. Also, under what circumstances would you have a PK hard-coded in a proc or function? That again is a potential source of fragility in your system.
If a PK is created that is incorrect (by whatever criteria) or needs to be changed, create a new record and copy the existing values into it. While this does not answer your question directly, your routines that delete or modify values in the table need to know how and where the key is used, so a routine that copies a row should be able to reuse that information.
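As a sketch of what that metadata search could look like in SQL Server (the table and column names are invented, and a plain text search over module definitions cannot prove that a hard-coded literal really is your PK):

    -- Foreign keys that reference the table whose PK value you want to change
    SELECT fk.name AS constraint_name,
           OBJECT_NAME(fk.parent_object_id) AS referencing_table
    FROM sys.foreign_keys AS fk
    WHERE fk.referenced_object_id = OBJECT_ID('dbo.Customers');

    -- Procs, views and functions whose definition mentions the key column
    SELECT OBJECT_NAME(m.object_id) AS module_name
    FROM sys.sql_modules AS m
    WHERE m.definition LIKE '%CustomerId%';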

Using Trigger to get ID on Insert - SQL 2005

I have a table (table_a) that, upon insert, needs to retrieve the next available id from the available_id field in another table (table_b) to use as the primary key in table_a, and then increment the available_id field in table_b by 1. While doing this via stored procedures is easy, I need to be able to have this occur on any insert into the table.
I know I need to use triggers, but I am unsure how to code this. Any advice?
Basically this is my dilemma:
I need to ensure two different tables have unique IDs throughout. What would be the best way to do this without using GUIDs? (Some of this code cannot be controlled on our end and requires ints as IDs.)
My advice is DON'T! Use an identity field instead.
In the first place, inserts can affect multiple records, so a trigger that does this properly would have to account for that, which makes it rather tricky to write. It would have to be an INSTEAD OF trigger, which is also tricky, as you wouldn't have one of the required values (I assume your ID field is required) in the initial insert. In the second place, two inserts going on at the same time could try to pick the same number, or one could lock out the second connection for a good while if you are doing a large import of data in one connection.
You could use an Oracle-style sequence, described here, calling it either via a trigger or from your application (providing the resulting value to your insert routine):
http://www.sqlteam.com/article/custom-auto-generated-sequences-with-sql-server
He mentions these issues to consider:
• What if two processes attempt to add a row to the table at the exact same time? Can you ensure that the same value is not generated for both processes?
• There can be overhead querying the existing data each time you'd like to insert new data.
• Unless this is implemented as a trigger, this means that all inserts to your data must always go through the same stored procedure that calculates these sequences. This means that bulk imports, or moving data from production to testing and so on, might not be possible or might be very inefficient.
• If it is implemented as a trigger, will it work for a set-based multi-row INSERT statement? If so, how efficient will it be? This function wouldn't work if called for each row in a single set-based INSERT -- each NextCustomerNumber() returned would be the same value.
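A minimal sketch of that custom-sequence pattern for SQL Server 2005 (table, procedure and column names are invented, and error handling is omitted):

    -- A single row holds the counter shared by both tables
    CREATE TABLE dbo.IdSequence (NextId int NOT NULL);
    INSERT INTO dbo.IdSequence (NextId) VALUES (0);
    GO

    CREATE PROCEDURE dbo.GetNextId
        @NextId int OUTPUT
    AS
    BEGIN
        -- Read and increment in a single UPDATE so two concurrent callers
        -- cannot be handed the same value
        UPDATE dbo.IdSequence
        SET @NextId = NextId = NextId + 1;
    END
    GO

Calling dbo.GetNextId from the insert routine (or from an INSTEAD OF trigger) hands both tables values from the same counter, which is what keeps the IDs unique across tables, but the caveats listed above still apply.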

When to Create, When to Modify a Table?

I wanted to know what I should consider when deciding whether to create a new table or modify an existing table in a SQL database. I use both MySQL and SQLite.
-Edit- I always thought that if I can put a column into a table where it makes sense and can be used by every row, then I would always modify the existing table. However, at work, if it's for a different 'release' we put it in a different table.
You can modify existing tables, as long as:
1. you are keeping the database normalized
2. you are not breaking code that uses the table
You can create new tables even if 1. and 2. are true, for the following reasons:
• performance reasons
• clarity in your schema logic
Not sure if I'm understanding your question correctly, but one thing I always try to consider is the impact on existing data.
Taking the case of an application which relies on a database...
When you update the application (including database schema updates), it is important to ensure that any existing, in-use databases will either be backwards compatible with the application, or that there is a way to migrate and update the existing database.
Generally, if the data is in a one-to-one relationship with the existing data in the table, the table's row size is not too large already, and there aren't too many records in the table, then I usually alter the table to add the new column.
However, suppose I want to add a column with a default value to a table where it doesn't exist. Adding it to a table with 50 million records might not be a speedy process, and it might lock up the table on production when we move the change up. In that case, putting it into a separate table and adding the records to it may work out better. In general, I wouldn't do this unless my testing has shown that adding and populating the column would take an unacceptably long time. I prefer to keep the record together where possible.
Same thing with the overall record size. SQL Server has a limit on the number of bytes that can be in a record; it will let you create a structure that is potentially larger than that, but it will not allow you to put more than the byte limit into a specific record. Further, narrower tables tend to be faster to access because of how they are stored. Frequently, people will create a table in a one-to-one relationship (we call them extended tables in our structure) for additional columns that are not as frequently used. If the fields from both tables will be used frequently, they often still create two tables but add a view that picks out all the columns needed.
And of course if the data is in a one to many relationship, you need a related table not just a new column.
Incidentally, you should always do an ALTER TABLE through a script rather than the SSMS GUI, as it is more efficient and easier to move to prod.
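As a small illustration of the two approaches in SQL Server flavour (all names are invented for the example; the DEFAULT backfill on a wide table with 50 million rows is the step that can lock things up during deployment on older versions):

    -- Option 1: add the column in place, from a script rather than the designer
    ALTER TABLE dbo.Orders
        ADD PreferredCarrier varchar(50) NOT NULL
            CONSTRAINT DF_Orders_PreferredCarrier DEFAULT ('Standard');
    GO

    -- Option 2: keep the rarely used column in a one-to-one "extended" table
    CREATE TABLE dbo.OrdersExtended (
        OrderId          int NOT NULL PRIMARY KEY
                         REFERENCES dbo.Orders (OrderId),
        PreferredCarrier varchar(50) NOT NULL DEFAULT ('Standard')
    );
    GO

    -- A view that stitches the two back together when both sets of columns are needed
    CREATE VIEW dbo.OrdersWithCarrier
    AS
    SELECT o.OrderId, o.OrderDate, e.PreferredCarrier
    FROM dbo.Orders AS o
    LEFT JOIN dbo.OrdersExtended AS e ON e.OrderId = o.OrderId;
    GO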

Skipping primary key conflicts with SQL copy

I have a large collection of raw data (around 300 million rows) with about 10% duplicated data. I need to get the data into a database. For the sake of performance I'm trying to use SQL COPY. The problem is that when I commit the data, primary key exceptions prevent any of the data from being processed. Can I change the behavior of primary keys such that conflicting data is simply ignored, or replaced? I don't really care either way -- I just need one unique copy of each row.
I think your best bet would be to drop the constraint, load the data, then clean it up and reapply the constraint.
That's what I was considering doing, but I was worried about the performance of getting rid of 30 million randomly placed rows in a 300-million-row table. The duplicate data also has a spatial relationship, which is why I wanted to try to fix the problem while loading the data rather than after I have it all loaded.
Use a select statement to select exactly the data you want to insert, without the duplicates.
Use that as the basis of a CREATE TABLE XYZ AS SELECT * FROM (query-just-non-dupes).
You might check out the ASKTOM ideas on how to select the non-duplicate rows.
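One way to put those pieces together, assuming PostgreSQL's COPY and invented table/column names (a sketch, not a tested recipe): load the file into an unconstrained staging table first, then move a single copy of each key into the real table.

    -- Staging table with no primary key, so COPY never hits a conflict
    CREATE TABLE readings_staging (LIKE readings INCLUDING DEFAULTS);
    COPY readings_staging FROM '/path/to/raw_data.csv' WITH (FORMAT csv);

    -- Keep one row per key; DISTINCT ON picks an arbitrary row among the duplicates
    INSERT INTO readings
    SELECT DISTINCT ON (reading_id) *
    FROM readings_staging
    ORDER BY reading_id;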