My company uses Liquibase to keep track of database changes. Every day around 100 new changesets are added. From what I understand, for already executed changesets Liquibase recomputes the checksum and compares it with the checksum stored in the DATABASECHANGELOG table to see whether it has changed, and reports a checksum error if it has.
So after a few months, when I have a large number of changesets already executed, if I add a new changeset, doesn't this process of computing and comparing checksums for all the already executed changesets make the execution of the new changeset slower or cause other performance-related issues?
I've never stumbled across this kind of performance issue with Liquibase.
But I guess your question raises a couple more questions:
what do you consider to be "slower"?
when does performance start to become an issue, and is it really an issue?
maybe something's wrong with your application's architecture?
Anyway, comparing checksums against the DATABASECHANGELOG table shouldn't take a lot of time - a couple of seconds at most, even if you have lots and lots of changeSets.
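If you want to see what that comparison actually works against, the stored checksums are just rows in DATABASECHANGELOG; a quick sketch (the column names are the ones Liquibase creates itself):

    -- One row per executed changeSet; MD5SUM holds the stored checksum that the
    -- freshly computed one is compared against.
    SELECT COUNT(*) AS executed_changesets
    FROM DATABASECHANGELOG;

    SELECT ID, AUTHOR, FILENAME, MD5SUM, DATEEXECUTED
    FROM DATABASECHANGELOG
    ORDER BY ORDEREXECUTED DESC;   -- most recently applied changeSets first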
According to the Liquibase documentation:
Other times, the problem is that liquibase update is taking too long. Liquibase tries to be as efficient as possible when comparing the contents of the DATABASECHANGELOG table with the current changelog file, and even if there are thousands of already ran changeSets, an “update” command should take just seconds to run.
But if those few seconds really are an issue, then consider reading this article:
Trimming ChangeLog Files
It might be super obvious, but no one bothered clarifying what or who actually creates/writes the changesets for Liquibase. I've read more than a dozen articles about changesets in Liquibase, and while I now understand how it works, I still wonder: are these changesets generated somewhere by Liquibase? Or are users supposed to write them by hand?
And do we agree that the DATABASECHANGELOG table is populated by a liquibase update reading the already existing changesets? Not the other way around?
And do we also agree that Liquibase doesn't track schema changes itself; it just computes the desired state of a DB from the changesets?
Thanks
Edit: I asked many questions, but ultimately I'm just looking for an answer to the title and to properly understand how Liquibase works.
You write the changesets. And since you can write changesets in SQL, it's just you writing the database scripts your application needs.
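For example, a minimal SQL-formatted changelog is just a .sql file with a few Liquibase comment headers (the author, id, and table below are placeholders):

    --liquibase formatted sql

    --changeset jane.doe:create-customer-table
    CREATE TABLE customer (
        id   BIGINT       NOT NULL PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );
    --rollback DROP TABLE customer;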
Yes, the DATABASECHANGELOG table is the audit log that gets written during a liquibase update and shows which changesets have been executed.
I would recommend taking the fundamentals course provided for free at Liquibase University, which covers these very basic concepts. Without it, it will be much harder to be successful with Liquibase. In my experience, you can pretty much finish the course in one sitting, or maybe an hour a day for a few days.
We have a C# application that receives a file each day with ~35,000,000 rows. It opens the file, parses each record individually, formats some of the fields and then inserts one record at a time into a table. It's really slow, which is expected, but I've been asked to optimize it.
I have been instructed that any optimizations must be contained to SQL only, i.e., there can be no changes to the process or the C# code. I'm trying to come up with ideas on how I can speed up this process while being limited to SQL modifications only. I have a couple of ideas I want to try, but I'd also like feedback from anyone who has found themselves in this situation before.
Ideas:
1. Create a clustered index on the table so each insert always occurs at the tail end of the table. The records in the file are ordered by date/time and the current table has no clustered index, so this seems like a valid approach.
2. Somehow reduce the logging overhead. This data is volatile in nature, so losing the ability to roll back is not a big deal. Even if the process blew up halfway through, they would just restart it.
3. Change the isolation level. Perhaps there is an isolation level that is better suited to sequential single-record inserts.
4. Reduce connection time. The C# app opens/closes a connection for each insert. We can't change the C# code, though, so perhaps there is a trick to reducing the overhead/time of making a connection.
I appreciate anyone taking the time to read my post and throw out any ideas they feel would be worth it.
Thanks,
Dean
I would suggest the following -- if possible.
Load the data into a staging table.
Do the transformations in SQL.
Bulk insert the data into the final table.
The second suggestion would be:
Modify the C# code to write the data into a file.
Bulk insert the file, either into a staging table or the final table.
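A rough T-SQL sketch of that flow, assuming SQL Server; every table, column, and file name below is made up:

    -- 1. Staging table with raw text columns and no indexes, so the load is cheap.
    CREATE TABLE dbo.DailyFeed_Staging (
        RawDate   varchar(30)  NULL,
        RawAmount varchar(30)  NULL,
        RawText   varchar(500) NULL
    );

    -- 2. Bulk load the file straight into staging (path and delimiters are placeholders).
    BULK INSERT dbo.DailyFeed_Staging
    FROM 'D:\feeds\daily_file.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    -- 3. Do the field formatting in one set-based statement into the final table.
    INSERT INTO dbo.DailyFeed_Final (FeedDate, Amount, Description)
    SELECT TRY_CONVERT(datetime2, RawDate),
           TRY_CONVERT(decimal(18, 2), RawAmount),
           LTRIM(RTRIM(RawText))
    FROM dbo.DailyFeed_Staging;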
Unfortunately, your problem is 35 million round trips from C# to the database. I doubt there is any database optimization that can fix that performance problem. In other words, you need to change the C# code to fix the performance issue. Anything else is probably just a waste of your time.
You can minimize logging either by using simple recovery or writing to a temporary table. Either of those might help. However, consider the second option, because it would be a minor change to the C# code and could result in big improvements.
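On the logging point, a hedged sketch of what those two options look like on SQL Server (the database and table names are made up; note that simple recovery mainly caps log growth rather than making single-row inserts cheaper):

    -- Option 1: simple recovery, so the log is truncated at checkpoints instead
    -- of growing for the whole load (point-in-time restore is given up).
    ALTER DATABASE FeedDb SET RECOVERY SIMPLE;

    -- Option 2: insert into a throwaway scratch table. A #temp table won't work
    -- here because the app opens a new connection per insert, so a plain table
    -- that is truncated before each run stands in for it.
    CREATE TABLE dbo.DailyFeed_Scratch (
        FeedDate    datetime2      NULL,
        Amount      decimal(18, 2) NULL,
        Description varchar(500)   NULL
    );
    TRUNCATE TABLE dbo.DailyFeed_Scratch;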
Or, if you have to make the best of a really bad situation:
Run the C# code and database on the same server. Be sure it has lots of processors.
Attach lots of SSD or memory for the database (if you are not already using it).
Load the data into table spaces that are only on SSD or in memory.
Copy the data from the local database to the remote one.
I am using liquibase 3.2.0 on ORCID, and finding it really useful.
We now have over 200 changeSets on top of the original schema.
These run many times during unit tests because we are using an in memory database (hsqldb).
I would like to 'reset' liquibase by making a new install.xml from the current schema, so that we do not have to run all the changeSets every time.
However, the production database (postgres) has a databasechangelog table with all the old changeSets, so it will try to apply the new install.xml.
How can I start again from a new install.xml without causing problems for production?
Will
Restarting a changeLog from scratch is the same as adding Liquibase to an existing project, which is discussed in the documentation here.
I generally recommend against resetting your changeLog, however, because normally the costs outweigh any benefits in performance. Your 200-changeSet changelog has been fully tested and you know it is correct, whereas something regenerated manually or with generateChangeLog can easily have minor differences that can cause problems.
For existing databases, the startup cost of parsing the changelog file and comparing it to the contents of databasechangelog is very low, regardless of the number of changeSets.
For a new database, especially in-memory databases, DDL operations are generally very fast and the speed of going through 200 changeSets to build up your database will probably not be a lot different than building it up in 50 changeSets.
If there are performance differences, what I've generally seen is that a few isolated changeSets are the problem, such as creating an index, then dropping it, then creating it again. I would recommend looking for any changeSets that may be a problem and carefully removing or combining them, rather than doing a wholesale redo of the changelog.
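If you want some data on where the time actually goes, the DATABASECHANGELOG table itself gives a rough picture. It has no duration column, but within a single update run the gap between consecutive DATEEXECUTED values approximates how long each changeSet took; a sketch for PostgreSQL:

    SELECT id,
           author,
           dateexecuted,
           dateexecuted - LAG(dateexecuted) OVER (ORDER BY orderexecuted) AS approx_duration
    FROM databasechangelog
    ORDER BY approx_duration DESC NULLS LAST;

Anything that stands out near the top is a better candidate for cleanup than the changelog as a whole.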
I'm trying to understand how PostgreSQL builds an index concurrently without taking a write lock.
Could someone describe the steps PostgreSQL performs to do this while data is continuously being written to the table?
The relevant detail is in the source code comments. See the comments on validate_index in src/backend/catalog/index.c around line 2607:
We do a concurrent index build by first inserting the catalog entry for the index via index_create(), marking it not indisready and not indisvalid. Then we commit our transaction and start a new one, then we wait for all transactions that could have been modifying the table to terminate.
.... and lots, lots more. Basically "it's complicated". I'll attempt to explain it, but I haven't read the code in detail and I don't know this part of the codebase, so the only correct explanation is the comments and source code.
My understanding is that it does an initial build based on an MVCC snapshot of the table state, committing it when it's done. It then waits until all transactions can see the (broken) index, at which point they'll all be updating it when they change things in the table. It then compares what was visible when it built the index to what is visible now and updates the index to reflect the differences between the snapshots. It then waits to make sure there are no transactions that could see the index while it was in an invalid state, marks the index valid, and commits again.
The whole process relies heavily on MVCC snapshots and visibility. It's also considerably more expensive in terms of I/O, CPU and RAM than a regular index build is.
validate_index is called by DefineIndex in src/backend/commands/indexcmds.c, which contains details about the overall process.
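From the user's side the whole dance is triggered by a single statement, and the indisvalid / indisready flags mentioned in those comments are visible in the catalogs. A small sketch (table, column, and index names are placeholders):

    -- The non-blocking variant; note it cannot run inside a transaction block.
    CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);

    -- If a concurrent build fails or is cancelled, the index is left behind
    -- marked invalid and must be dropped or rebuilt. The flags from the
    -- comments above can be inspected in pg_index:
    SELECT c.relname AS index_name,
           i.indisready,
           i.indisvalid
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    WHERE NOT i.indisvalid;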
I am planning to use log4net in a new web project. In my experience, I've seen how big the log table can get, and I've also noticed that errors and exceptions are repeated. For instance, I just queried a log table that has more than 132,000 records, and using DISTINCT I found that only about 2,500 records (~2%) are unique; the others (~98%) are just duplicates. So I came up with this idea to improve logging.
Have a couple of new columns, counter and updated_dt, that are updated every time the same record would otherwise be inserted again.
If we want to track the user that caused the exception, we need to create a user_log or log_user table to map the N-N relationship.
Creating this model may make the system slow and inefficient if it has to compare all that long text... Here is the trick: we should also have a binary(16) or binary(32) hash column that hashes the message and the exception, and configure an index on it. We can use HASHBYTES to help us.
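A rough sketch of the shape I have in mind (SQL Server; all names here are just placeholders, and the hash is stored as a plain column so it can be indexed):

    CREATE TABLE dbo.AppLog (
        LogId     bigint IDENTITY(1,1) PRIMARY KEY,
        LogDate   datetime2     NOT NULL,
        UpdatedDt datetime2     NOT NULL,
        Counter   int           NOT NULL DEFAULT (1),
        Message   nvarchar(max) NULL,
        Exception nvarchar(max) NULL,
        -- 32 bytes for SHA2_256 (16 would fit MD5); filled in at insert time.
        MsgHash   binary(32)    NOT NULL
    );

    -- The narrow binary column is what gets indexed, not the long text.
    CREATE INDEX IX_AppLog_MsgHash ON dbo.AppLog (MsgHash);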
I am not a DB expert, but I think that would be the fastest way to locate a similar record. And because hashing doesn't guarantee uniqueness, it only helps to locate those similar records much faster; we then compare by message or exception directly to make sure they really are the same.
This is a theoretical/practical solution, but will it work, or will it just bring more complexity? What aspects am I leaving out, and what other considerations do I need to take into account? A trigger would do the job of inserting or updating, but is a trigger the best way to do it?
I wouldn't be too concerned with a log table of 132,000 records, to be honest; I have seen millions, if not billions, of records in a log table. If you are logging 132,000 records every few minutes, then you might want to tone it down a bit.
I think the idea is interesting, but here are my major concerns:
1. You could actually hurt the performance of your application by doing this. The log4net ADO.NET appender is synchronous, which means that if you make your INSERT any more complicated than it needs to be (i.e., checking whether the data already exists, calculating hash codes, etc.), you will block the thread that is doing the logging. That's not good! You could fix this by writing to some sort of staging table and doing the work out of band with a job or something, but now you've created a bunch of moving parts for something that could be much simpler.
2. Time could probably be better spent doing other things. Storage is cheap, developer hours aren't, and logs don't need to be extremely fast to access, so a denormalized model should be fine.
Thoughts?
Yes you can do that. It is a good idea and it will work. Watch out for concurrency issues when inserting from multiple threads or processes. You probably need to investigate locking in detail. You should look into locking hints (in your case UPDLOCK, HOLDLOCK, ROWLOCK) and the MERGE statement. They can be used to maintain the dimension table.
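For example, a hedged sketch of that upsert, assuming a log table shaped like the one sketched in the question (the dbo.AppLog name, its columns, and the procedure are hypothetical):

    -- Upsert that the appender's command text (or a trigger) could call instead
    -- of a plain INSERT.
    CREATE PROCEDURE dbo.UpsertAppLog
        @Message   nvarchar(max),
        @Exception nvarchar(max)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- SHA2_256 gives the 32-byte value stored in MsgHash. Note that before
        -- SQL Server 2016, HASHBYTES input was limited to 8000 bytes.
        DECLARE @hash binary(32) =
            HASHBYTES('SHA2_256', CONCAT(@Message, N'|', @Exception));

        -- UPDLOCK/HOLDLOCK/ROWLOCK keep two concurrent writers of the same
        -- message from both taking the NOT MATCHED branch.
        MERGE dbo.AppLog WITH (UPDLOCK, HOLDLOCK, ROWLOCK) AS t
        USING (SELECT @hash AS MsgHash) AS s
            ON t.MsgHash = s.MsgHash
           AND t.Message = @Message   -- the hash narrows the search; the text compare confirms
        WHEN MATCHED THEN
            UPDATE SET Counter   = t.Counter + 1,
                       UpdatedDt = SYSUTCDATETIME()
        WHEN NOT MATCHED THEN
            INSERT (LogDate, UpdatedDt, Counter, Message, Exception, MsgHash)
            VALUES (SYSUTCDATETIME(), SYSUTCDATETIME(), 1, @Message, @Exception, @hash);
    END;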
As an alternative you could log to a file and compress it. Typical compression algorithms are very good at eliminating this type of exact redundancy.