I was wondering why my CouchDB database was growing to fast so I wrote a little test script. This script changes an attributed of a CouchDB document 1200 times and takes the size of the database after each change. After performing these 1200 writing steps the database is doing a compaction step and the db size is measured again. In the end the script plots the databases size against the revision numbers. The benchmarking is run twice:
The first time the default number of document revision (=1000) is used (_revs_limit).
The second time the number of document revisions is set to 1.
The first run produces the following plot
The second run produces this plot
For me this is quite an unexpected behavior. In the first run I would have expected a linear growth as every change produces a new revision. When the 1000 revisions are reached the size value should be constant as the older revisions are discarded. After the compaction the size should fall significantly.
In the second run the first revision should result in certain database size that is then keeps during the following writing steps as every new revision leads to the deletion of the previous one.
I could understand if there is a little bit of overhead needed to manage the changes but this growth behavior seems weird to me. Can anybody explain this phenomenon or correct my assumptions that lead to the wrong expectations?
First off, CouchDB saves some information even for deleted revisions (just the ID and revision identifier), because it needs this for replication purposes.
Second, inserting documents one at a time is suboptimal because of the way the data is saved on disk (see WikiPedia), this could explain the superlinear growth in the first graph.
Related
I am trying to understand the concept Full Table vs Incremental Table vs Delta table and in principle in order to simplify (faster loading process) the daily ETL loads is it a good practice to use Incremental table
FULL TABLE
INCREMENTAL TABLE
DELTA TABLE
i have read some where
Using incremental loads to move data can shorten the run times of your ETL processes and reduce the risk when something goes wrong
can some one please help me understanding the concept ?
full, as its name says, loads everything, the whole source data file
incremental - or delta (those are synonyms, not two different types) - mean that you load only data which you haven't loaded yet. It usually means that time of the last loading has been recorded. The next loading session loads data created after the last successful loading timestamp
As of
"shortening run times": obviously, if you don't have to load everything but just what's missing, it takes less time
"reducing the risk": you don't mess up with data already loaded, it stays in the database. If something goes wrong, it goes wrong with current loading session so you can discard changes you've made and start over
Well you did not provide the reference for your quote, but in my experience it is only 50% correct.
I'd read it:
Using incremental loads to move data can shorten the run times of your ETL processes but increase the risk that something goes wrong.
The problem is in the error accumulation. If you get corrupt or incomplete data in full load you through them away on the next load and there is a good chance that the new load is valid.
On contrary with delta load the errors remain and can accumulate within the time.
Therefor is a good practice while implementing a delta load is to perform periodical check (daily, monthly etc.) that the complete snashot in the source and target are identical.
My rule of thumb is - choose delta load only if the full load is not feasible (i.e. for transaction tables and large dimensions).
I have created an ETL process with Pentaho that selects data from a table in a Database and load this into another database.
The main problem that I have to make front is that for 1.500.000 rows it takes 6 hours. The full table is 15.000.000 and I have to load 5 tables like that.
Can anyone explain how is supposed to load a large size of data with pentaho?
Thank you.
I never had problem with volume with Pentaho PDI. Check the following in order.
Can you check the problem is really coming from Pentaho: what happens if you drop the query in SQL-Developer or Toad or SQL-IDE-Fancy-JDBC-Compilant.
In principle, PDI is meant to import data with a SELECT * FROM ... WHERE ... and do all the rest in the transformation. I have a set of transformation here which take hours to execute because they do complex queries. The problem is not due to PDI but complexity of the query. The solutions is to export the GROUP BY and SELECT FROM (SELECT...) into PDI steps, which can start before the query result is finished. The result is like 4 hours to 56 seconds. No joke.
What is your memory size? It is defined in the spoon.bat / spoon.sh.
Near the end you have a line which looks like PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m" "-XX:MaxPermSize=256m". The important parameter is -Xmx.... If it is -Xmx256K, your jvm has only 256KB of RAM to work with.
Change it to 1/2 or 3/4 of the available memory, in order to leave room for the other processes.
Is the output step the bottleneck? Check by disabling it and watch you clock during the run.
If it is long , increase the commit size and allow batch inserts.
Disable all the index and constraints and restore them when loaded. You have nice SQL script executor steps to automate that, but check first manually then in a job, otherwise the reset index may trigger before to load begins.
You have also to check that you do not lock your self: as PDI launches the steps alltogether, you may have truncates which are waiting on another truncate to unlock. If you are not in an never ending block, it may take quite while before to db is able to cascade everything.
There's no fixed answer covering all possible performance issues. You'll need to identify the bottlenecks and solve them in your environment.
If you look at the Metrics tab while running the job in Spoon, you can often see at which step the rows/s rate drops. It will be the one with the full input buffer and empty output buffer.
To get some idea of the maximum performance of the job, you can test each component individually.
Connect the Table Input to a dummy step only and see how many rows/s it reaches.
Define a Generate Rows step with all the fields that go to your destination and some representative data and connect it to the Table Output step. Again, check the rows/s to see the destination database's throughput.
Start connecting more steps/transformations to your Table Input and see where performance goes down.
Once you know your bottlenecks, you'll need to figure out the solutions. Bulk load steps often help the output rate. If network lag is holding you back, you might want to dump data to compressed files first and copy those locally. If your Table input has joins or where clauses, make sure the source database has the correct indexes to use, or change your query.
I'm trying to update a modest dataset of 60k records with a value which takes a little time to compute. From a small trial run of 6k records in the production environment, it took 4 minutes to complete, so the full execution should take around 40 minutes.
However this trial run showed that there were SQL timeouts occurring on user requests when accessing data in related tables (but not necessarily on the actual rows which were being updated).
My question is, is there a way of running non-urgent queries as a background operation in the SQL server without causing timeouts or table locking for extensive periods of time? The data within the column which is being updated during this period is not essential to have the new value returned; aka if a request happened to come in for this row, returning the old value would be perfectly acceptable rather than locking the set until the update is complete (I'm not sure the ins and outs of how this works, obviously I do want to prevent data corruption; could be a way of queuing any additional changes in the background)
This is possibly a situation where the NOLOCK hint is appropriate. You can read about SQL Server isolation levels in the documentation. And Googling "SQL Server NOLOCK" will give you plenty of material on why you should not over-use the construct.
I might also investigate whether you need a SQL query to compute values. A single query that takes 4 minutes on 6k records . . . well, that is a long time. You might want to consider reading the data into an application (say, using Python, R, or whatever) and doing the data manipulation there. It may also be possible to speed up the query processing itself.
I receive about 8 huge delimited flatfiles to be loaded into an SQL server (2012)table once every week. Total number of rows in all the files would be about 150 million and each file has different number of rows. I have a simple SSIS package which loads data from flatfiles(using foreach container) into a history table. And then a select query runs on this history table to select current weeks data and loads into a staging table.
We ran into problems as history table grew very large(8 billion rows). So I decided to back up the data in history table and truncate. Before truncation the package execution time ranged from 15hrs to 63 hrs in that order.We hoped after truncation it should go back to 15hrs or less.But to my surprise even after 20+ hours the package is still running. The worst part is that it is still loading the history table. Latest count is around 120 million. It still has to load the staging data and it might take just as long.
Neither history table nor staging tables have any indexes, which is why select query on the history table used to take most of the execution time. But loading from all the flatfiles to history table was always under 3 hrs.
I hope i'm making sense. Can someone help me understand what could be the reason behind this unusual execution time for this week? Thanks.
Note: The biggest file(8GB) was read at flatfile source in 3 minutes. So I'm thinking source is not the bottle neck here.
There's no good reason, IMHO, why that server should take that long to load that much data. Are you saying that the process which used to take 3 hours, now takes 60+? Is it the first (data-load) or the second (history-table) portion that has suddenly become slow? Or, both at once?
I think the first thing that I would do is to "trust, but verify" that there are no indexes at play here. The second thing I'd look at is the storage allocation for this tablespace ... is it running out of room, such that the SQL server is having to do a bunch of extra calesthenics to obtain and to maintain storage? How does this process COMMIT? After every row? Can you prove that the package definition has not changed in the slightest, recently?
Obviously, "150 million rows" is not a lot of data, these days; neither is 8GB. If you were "simply" moving those rows into an un-indexed table, "3 hours" would be a generous expectation. Obviously, the only credible root-cause of this kind of behavior is that the disk-I/O load has increased dramatically, and I am healthily suspicious that "excessive COMMITs" might well be part of the cause: re-writing instead of "lazy-writing," re-reading instead of caching.
I have a large table (~170 million rows, 2 nvarchar and 7 int columns) in SQL Server 2005 that is constantly being inserted into. Everything works ok with it from a performance perspective, but every once in a while I have to update a set of rows in the table which causes problems. It works fine if I update a small set of data, but if I have to update a set of 40,000 records or so it takes around 3 minutes and blocks on the table which causes problems since the inserts start failing.
If I just run a select to get back the data that needs to be updated I get back the 40k records in about 2 seconds. It's just the updates that take forever. This is reflected in the execution plan for the update where the clustered index update takes up 90% of the cost and the index seek and top operator to get the rows take up 10% of the cost. The column I'm updating is not part of any index key, so it's not like it reorganizing anything.
Does anyone have any ideas on how this could be sped up? My thought now is to write a service that will just see when these updates have to happen, pull back the records that have to be updated, and then loop through and update them one by one. This will satisfy my business needs but it's another module to maintain and I would love if I could fix this from just a DBA side of things.
Thanks for any thoughts!
Actually it might reorganise pages if you update the nvarchar columns.
Depending on what the update does to these columns they might cause the record to grow bigger than the space reserved for it before the update.
(See explanation now nvarchar is stored at http://www.databasejournal.com/features/mssql/physical-database-design-consideration.html.)
So say a record has a string of 20 characters saved in the nvarchar - this takes 20*2+2(2 for the pointer) bytes in space. This is written at the initial insert into your table (based on the index structure). SQL Server will only use as much space as your nvarchar really takes.
Now comes the update and inserts a string of 40 characters. And oops, the space for the record within your leaf structure of your index is suddenly too small. So off goes the record to a different physical place with a pointer in the old place pointing to the actual place of the updated record.
This then causes your index to go stale and because the whole physical structure requires changing you see a lot of index work going on behind the scenes. Very likely causing an exclusive table lock escalation.
Not sure how best to deal with this. Personally if possible I take an exclusive table lock, drop the index, do the updates, reindex. Because your updates sometimes cause the index to go stale this might be the fastest option. However this requires a maintenance window.
You should batch up your update into several updates (say 10000 at a time, TEST!) rather than one large one of 40k rows.
This way you will avoid a table lock, SQL Server will only take out 5000 locks (page or row) before esclating to a table lock and even this is not very predictable (memory pressure etc). Smaller updates made in this fasion will at least avoid concurrency issues you are experiencing.
You can batch the updates using a service or firehose cursor.
Read this for more info:
http://msdn.microsoft.com/en-us/library/ms184286.aspx
Hope this helps
Robert
The mos brute-force (and simplest) way is to have a basic service, as you mentioned. That has the advantage of being able to scale with the load on the server and/or the data load.
For example, if you have a set of updates that must happen ASAP, then you could turn up the batch size. Conversely, for less important updates, you could have the update "server" slow down if each update is taking "too long" to relieve some of the pressure on the DB.
This sort of "heartbeat" process is rather common in systems and can be very powerful in the right situations.
Its wired that your analyzer is saying it take time to update the clustered Index . Did the size of the data change when you update ? Seems like the varchar is driving the data to be re-organized which might need updates to index pointers(As KMB as already pointed out) . In that case you might want to increase the % free sizes on the data and the index pages so that the data and the index pages can grow without relinking/reallocation . Since update is an IO intensive operation ( unlike read , which can be buffered ) the performance also depends on several factors
1) Are your tables partitioned by data 2) Does the entire table lies in the same SAN disk ( Or is the SAN striped well ?) 3) How verbose is the transaction logging . Can the buffer size of the transaction loggin increased to support larger writes to the log to suport massive inserts ?
Its also important which API/Language are you using? e.g JDBC support a batch update feature which makes the updates a little bit efficient if you are doing multiple updates .