This is a pretty simple problem. Inserting data into the table normally works fine, except that occasionally the insert query takes a few seconds. (I am not trying to bulk insert data.) So I set up a simulation of the insert process to find out why the insert query occasionally takes more than 2 seconds to run. Joshua suggested that the index file may be getting adjusted; I removed the id (primary key) field, but the delay still happens.
I have a MyISAM table: daniel_test_insert (this table starts completely empty):
create table if not exists daniel_test_insert (
id int unsigned auto_increment not null,
value_str varchar(255) not null default '',
value_int int unsigned default 0 not null,
primary key (id)
)
I insert data into it, and sometimes an insert query takes > 2 seconds to run. There are no reads on this table, only writes, in serial, by a single-threaded program.
I ran the exact same query 100,000 times to find out why the query occasionally takes a long time. So far, it appears to be a random occurrence.
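A timing loop of this kind can be sketched roughly as follows. This is not the original test program: the names time_inserts and make_demo_connection and the threshold argument are made up for illustration, and sqlite3 stands in for MySQL only so that the example runs anywhere. With a MySQL driver the same loop should apply, using that driver's connection object and placeholder style.

```python
import sqlite3
import time


def time_inserts(conn, sql, params, runs, threshold):
    """Run the same INSERT `runs` times, committing after each one.

    Returns a list of (run_index, seconds) for every run slower than `threshold`.
    """
    slow = []
    cur = conn.cursor()
    for i in range(runs):
        start = time.perf_counter()
        cur.execute(sql, params)
        conn.commit()
        elapsed = time.perf_counter() - start
        if elapsed > threshold:
            slow.append((i, elapsed))
    return slow


def make_demo_connection():
    """Assumption: an in-memory SQLite table standing in for the MyISAM table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daniel_test_insert ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " value_str TEXT NOT NULL DEFAULT '',"
        " value_int INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


# SQLite uses "?" placeholders; MySQL drivers typically use a different style.
INSERT_SQL = "INSERT INTO daniel_test_insert (value_int, value_str) VALUES (?, ?)"
INSERT_PARAMS = (12345, "afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f")
```

Calling time_inserts(make_demo_connection(), INSERT_SQL, INSERT_PARAMS, runs=100000, threshold=2.0) returns only the runs that took longer than 2 seconds, which can then be lined up against profiling output or system metrics.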
This query for example took 4.194 seconds (a very long time for an insert):
Query: INSERT INTO daniel_test_insert SET value_int=12345, value_str='afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f' - ran for 4.194 seconds
status | duration | cpu_user | cpu_system | context_voluntary | context_involuntary | page_faults_minor
starting | 0.000042 | 0.000000 | 0.000000 | 0 | 0 | 0
checking permissions | 0.000024 | 0.000000 | 0.000000 | 0 | 0 | 0
Opening tables | 0.000024 | 0.001000 | 0.000000 | 0 | 0 | 0
System lock | 0.000022 | 0.000000 | 0.000000 | 0 | 0 | 0
Table lock | 0.000020 | 0.000000 | 0.000000 | 0 | 0 | 0
init | 0.000029 | 0.000000 | 0.000000 | 1 | 0 | 0
update | 4.067331 | 12.151152 | 5.298194 | 204894 | 18806 | 477995
end | 0.000094 | 0.000000 | 0.000000 | 8 | 0 | 0
query end | 0.000033 | 0.000000 | 0.000000 | 1 | 0 | 0
freeing items | 0.000030 | 0.000000 | 0.000000 | 1 | 0 | 0
closing tables | 0.125736 | 0.278958 | 0.072989 | 4294 | 604 | 2301
logging slow query | 0.000099 | 0.000000 | 0.000000 | 1 | 0 | 0
logging slow query | 0.000102 | 0.000000 | 0.000000 | 7 | 0 | 0
cleaning up | 0.000035 | 0.000000 | 0.000000 | 7 | 0 | 0
(This is an abbreviated version of the SHOW PROFILE output; I threw out the columns that were all zero.)
The update step shows an incredible number of context switches and minor page faults. Opened_tables increases by about 1 per 10 seconds on this database (so it is not running out of table_cache space).
Stats:
MySQL 5.0.89
Hardware: 32 GB of RAM / 8 cores @ 2.66GHz; RAID 10 SCSI hard disks (SCSI II???)
I have had the hard drives and raid controller queried: No errors are being reported.
CPUs are about 50% idle.
iostat -x 5 reports less than 10% utilization for the hard disks.
top reports a load average of about 10 over 1 minute (normal for our db machine).
Swap space has 156k used (32 gigs of ram)
I'm at a loss to find out what is causing this performance lag. This does NOT happen on our low-load slaves, only on our high-load master. It also happens with MEMORY and InnoDB tables. Does anyone have any suggestions? (This is a production system, so nothing exotic!)
I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single table INSERT/UPDATE/REPLACE statements --- not on any SELECTs. No load, locking, or thread build up is evident.
I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.
Also Ruled Out
Server load -- no correlation with high load
Engine -- happens with InnoDB/MyISAM/Memory
MySQL Query Cache -- happens whether it's on or off
Log rotations -- no correlation in events
The only other observation I have at this point is derived from the fact that I'm running the same db on multiple machines. I have a read-heavy application, so I'm using an environment with replication, and most of the load is on the slaves. I've noticed that even though there is minimal load on the master, the phenomenon occurs more often there. Even though I see no locking issues, maybe it's InnoDB/MySQL having trouble with (thread) concurrency? Recall that the updates on the slave will be single threaded.
MySQL Version 5.1.48
Update
I think I have a lead for the problem in my case. On some of my servers, I noticed this phenomenon more than on the others. Looking at what was different between the servers, and tweaking things around, I was led to the MySQL InnoDB system variable innodb_flush_log_at_trx_commit.
I found the doc a bit awkward to read, but innodb_flush_log_at_trx_commit can take the values 1, 2, or 0:
For 1, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk for every commit.
For 2, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk approximately every 1-2 seconds.
For 0, the log buffer is flushed to the log file every second, and the log file is flushed to disk every second.
Effectively, in the order (1,2,0), as reported and documented, you're supposed to get increasing performance in exchange for increased risk.
Having said that, I found that the servers with innodb_flush_log_at_trx_commit=0 were performing worse (i.e. having 10-100 times more "long updates") than the servers with innodb_flush_log_at_trx_commit=2. Moreover, things immediately improved on the bad instances when I switched it to 2 (note you can change it on the fly).
So, my question is, what is yours set to? Note that I'm not blaming this parameter, but rather highlighting that its context is related to this issue.
I had this problem using InnoDB tables (and InnoDB indexes are even slower to rewrite than MyISAM ones).
I suppose you are doing multiple other queries on some other tables, so the problem would be that MySQL has to handle disk writes in files that get larger and needs to allocate additional space to those files.
If you use MYISAM tables I strongly suggest using
LOAD DATA INFILE 'file-on-disk' INTO TABLE `tablename`
command; MYISAM is sensationally fast with this (even with primary keys) and the file can be formatted as csv and you can specify the column names (or you can put NULL as the value for the autoincrement field).
View MYSQL doc here.
The first tip I would give you is to disable the autocommit functionality and then commit manually.
LOCK TABLES a WRITE;
... DO INSERTS HERE
UNLOCK TABLES;
This benefits performance because the index buffer is flushed to disk only once, after all INSERT statements have completed. Normally, there would be as many index buffer flushes as there are INSERT statements.
But probably the best you can do, if that is possible in your application, is a bulk insert with one single statement.
This is done via vector binding and it's the fastest way you can go.
Instead of:
"INSERT INTO tableName values()"
do:
"INSERT INTO tableName values(),(),(),().......(n)"
But consider this option only if parameter vector binding is possible with the MySQL driver you're using.
Otherwise I would lean toward the first option and lock the table for every 1000 inserts. Don't lock it for 100k inserts, because you'll get a buffer overflow.
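As a rough illustration of the multi-row form in chunks of 1000, here is a small Python sketch. The helper names chunked and multi_row_insert are hypothetical, and the %s placeholder style is an assumption that matches common MySQL drivers for Python but should be checked against the driver actually in use. Table and column names are pasted into the SQL text, so they must come from trusted code, never from user input.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def multi_row_insert(table, columns, rows):
    """Build one INSERT ... VALUES (...), (...) statement for `rows`.

    Returns (sql, params) where params is the flattened list of row values,
    ready to be handed to a DB-API cursor.execute(sql, params) call.
    """
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    params = [value for row in rows for value in row]
    return sql, params
```

A caller would loop over chunked(all_rows, 1000), build one statement per chunk, and execute and commit each chunk in turn, so that no single statement grows without bound.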
Can you create one more table with 400 (not null) columns and run your test again? If the number of slow inserts becomes higher, this could indicate that MySQL is wasting time writing your records. (I don't know how it works, but it may be allocating more blocks, or moving something to avoid fragmentation; I really don't know.)
We upgraded to MySQL 5.1 and during this event the Query cache became an issue with a lot of "Freeing items?" thread states. We then removed the query cache.
Either the upgrade to MySQL 5.1 or removing the query cache resolved this issue.
FYI, to future readers.
-daniel
We hit exactly the same issue and reported here:
http://bugs.mysql.com/bug.php?id=62381
We are using 5.1.52 and don't have a solution yet. We may need to turn QC off to avoid this perf hit.
If you are doing multiple inserts at once in a for loop, then please take a break after every iteration using PHP's sleep("time in seconds") function.
Read this on MyISAM performance:
http://adminlinux.blogspot.com/2010/05/mysql-allocating-memory-for-caches.html
Search for:
'The MyISAM key block size The key block size is important' (minus the single quotes), this could be what's going on. I think they fixed some of these types of issues with 5.1
Can you check the stats on the disk subsystem? Is the I/O saturated? This sounds like internal DB work going on, flushing stuff to disk/log.
To check if your disk is behaving badly, and if you're in Windows, you can create a batch cmd file that creates 10,000 files:
@echo OFF
FOR /L %%G IN (1, 1, 10000) DO TIME /T > out%%G.txt
Save it in a temp dir, as something like test.cmd.
Enable command extensions by running CMD with the /E:ON parameter:
CMD.exe /E:ON
Then run your batch and see whether the time between the first and the last out file differs by seconds or by minutes.
On Unix/Linux you can write a similar shell script.
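A portable version of the same check could look something like this in Python. It is only a sketch: the function name time_file_creation is made up, and the fsync call is an added assumption (the batch file above does not force writes to disk) so that each timing includes the actual write.

```python
import os
import time


def time_file_creation(directory, count=10000):
    """Create `count` small files in `directory`.

    Returns a list with the seconds each file took to create and sync.
    """
    durations = []
    for i in range(count):
        start = time.perf_counter()
        path = os.path.join(directory, "out%d.txt" % i)
        with open(path, "w") as fh:
            fh.write(time.strftime("%H:%M:%S\n"))
            fh.flush()
            os.fsync(fh.fileno())  # assumption: force the write to disk
        durations.append(time.perf_counter() - start)
    return durations
```

Run it against a temp directory on the same volume as the MySQL data and compare sum(durations) with max(durations): one or two files taking seconds while the rest take milliseconds would point at the disk subsystem rather than at MySQL.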
By any chance is there an SSD drive in the server? Some SSD drives suffer from 'stutter', which could cause your symptom.
In any case, I would try to find out if the delay is occurring in MySQL or in the disk subsystem.
What OS is your server, and what file system is the MySQL data on?
Related
Assume you have a set as follows:
+-------+-------+
| PK | myBin |
+-------+-------+
| "1" | 24 |
+-------+-------+
1 row in set (1 secs)
How do I get the LUT (last update time) metadata for PK=1 for bin myBin?
NOTE: I'm looking for bin level LUT and not row level
Bin-level LUTs are not exposed to clients for backward compatibility reasons, so you cannot query them from the client. Also, bin-level LUTs are not always maintained. They are maintained only under certain XDR configurations.
A workaround is to write an additional bin with the timestamp of the update along with the regular bin update. If you only have a few bins for which you need to know the update timestamp, this is a reasonable workaround with some overhead.
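The workaround can be sketched on the client side like this. Everything here is illustrative: the helper name with_update_timestamp and the "_lut" suffix for the companion bin are assumptions, not anything the database defines, and the returned dict still has to be passed to whatever write call your client provides.

```python
import time


def with_update_timestamp(bins, tracked, now_ms=None):
    """Return a copy of `bins` with a '<name>_lut' companion bin added
    for every tracked bin that is part of this write.

    `now_ms` is the timestamp in milliseconds; it defaults to the current time.
    """
    stamp = int(time.time() * 1000) if now_ms is None else now_ms
    out = dict(bins)
    for name in tracked:
        if name in bins:
            out[name + "_lut"] = stamp  # hypothetical companion bin name
    return out
```

Writing both bins in the same record update keeps the value and its timestamp consistent with each other; reading myBin_lut back then gives the last time myBin was written through this code path.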
I have an application that receives messages from a database via the write ahead logs and every row looks something like this
| id | prospect_id | school_id | something | something else |
|----|-------------|------------|-----------|----------------|
| 1 | 5 | 10 | who | cares |
| 2 | 5 | 11 | what | this |
| 3 | 6 | 10 | is | blah |
Eventually, I will need to query the database for the mapping between prospect_id and school name. The query results are in the 10000s. The schools table has a name column that I can query in a simple join. However, I want to store this information somewhere on my server that would be easily accessible by my application. I need this information to be:
stored locally so I can access it quickly
capable of being updated once a day asynchronously
independent of the application, so that when it's deployed or restarted, the data doesn't need to be queried again.
What can be done? What are some options?
EDIT
Is pickle a good idea? https://datascience.blog.wzb.eu/2016/08/12/a-tip-for-the-impatient-simple-caching-with-python-pickle-and-decorators/
What are limitations of pickle? The results of the sql query might be in the 10000s
The drawback of using pickle is that it is a python specific protocol. If you intend for other programming languages to read this file, then the tooling might not exist to read it and you would be better storing it in something like a JSON or XML file. If you will only be reading it with python then pickle is fine.
Here are a few options you have:
Load the data from SQL into a global value when the application starts up (the SQL data can be stored locally; it doesn't have to be on an external system).
Use pickle to serialize the data to a file and deserialize it when needed.
Load the data into Redis, an in-memory caching system.
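The pickle option could look roughly like the sketch below. The names load_mapping and MAX_AGE_SECONDS are made up, and fetch stands for whatever function runs your SQL join and returns the mapping. The sketch refreshes lazily when the file is older than a day, which is one possible reading of "updated once a day"; a scheduled job that rewrites the file would work just as well.

```python
import os
import pickle
import tempfile
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumption: refresh once a day


def load_mapping(path, fetch, max_age=MAX_AGE_SECONDS, now=None):
    """Return the cached mapping from `path`, rebuilding it with `fetch()`
    when the file is missing, unreadable, or older than `max_age` seconds."""
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(path) < max_age:
            with open(path, "rb") as fh:
                return pickle.load(fh)
    except (OSError, EOFError, pickle.UnpicklingError):
        pass  # missing or damaged cache file: fall through and rebuild

    mapping = fetch()
    # Write to a temp file first, then rename, so that a concurrent reader
    # never sees a half-written cache file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(mapping, fh, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)
    return mapping
```

Because the file lives outside the application process, a restart or redeploy just reads it back instead of hitting the database again. Only unpickle files your own code wrote, since loading a pickle from an untrusted source can execute arbitrary code.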
I have a saved query in BigQuery but the result is too big to export as CSV. I don't have permission to export to a new table, so is there a way to run the query from the bq CLI and export from there?
From the CLI you can't directly access your saved queries, as it's a UI-only feature as of now, but as explained here, there is a feature request for that.
If you just want to run it once to get the results you can copy the query from the UI and just paste it when using bq.
Using the docs example query you can try the following with a public dataset:
QUERY="SELECT word, SUM(word_count) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
bq query "$QUERY" > results.csv
The output of cat results.csv should be:
+---------------+-------+
| word | count |
+---------------+-------+
| dispraisingly | 1 |
| praising | 8 |
| Praising | 4 |
| raising | 5 |
| dispraising | 2 |
| raisins | 1 |
+---------------+-------+
Just replace the QUERY variable with your saved query.
Also, take into account if you are using Standard or Legacy SQL with the --use_legacy_sql flag.
Reference docs here.
Despite what you may have understood from the official documentation, you can get large query results from bq query, but there are multiple details you have to be aware of.
To start, here's an example. I got all of the rows of the public table usa_names.usa_1910_2013 from the public dataset bigquery-public-data by using the following commands:
total_rows=$(bq query --use_legacy_sql=false --format=csv "SELECT COUNT(*) AS total_rows FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" | xargs | awk '{print $2}');
bq query --use_legacy_sql=false --max_rows=$((total_rows + 1)) --format=csv "SELECT * FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" > output.csv
The result of this command was a CSV file with 5552454 lines, with the first two containing header information. The number of rows in this table is 5552452, so it checks out.
Here's where the caveats come into play:
Regardless of what the documentation might seem to say when it comes to query download limits specifically, those limits seem to only apply to the Web UI, meaning bq is exempt from them;
At first, I was using the Cloud Shell to run this bq command, but the number of rows was so big that streaming the result set into it killed the Cloud Shell instance! I had to use a Compute instance with at least the resources of an n1-standard-4 (4vCPUs, 16GiB RAM), and even with all of this RAM, the query took me 10 minutes to finish (note that the query itself runs server-side, it's just a problem of buffering the results);
I'm manually copy-pasting the query itself, as there doesn't seem to be a way to reference saved queries directly from bq;
You don't have to use Standard SQL, but you have to specify max_rows, because otherwise it'll only return you 100 rows (100 is the current default value of this argument);
You'll still be facing the usual quotas & limits associated with BigQuery, so you might want to run this as a batch job or not, it's up to you. Also, don't forget that the maximum response size for a query is 128 MiB, so you might need to break the query into multiple bq query commands in order to not hit this size limit. If you want a public table that's big enough to hit this limitation during queries, try the samples.wikipedia one from bigquery-public-data dataset.
I think that's about it! Just make sure you're running these commands on a beefy machine and after a few tries it should give you the result you want!
P.S.: There's currently a feature request to increase the size of CSVs you can download from the Web UI. You can find it here.
I am completely desperate about a performance difference and I have absolutely no clue WHY there is one.
Overview
VMware Workstation v11 on my local computer. I gave the VM just 2 cores and 4GB memory.
Hyper-V Server 2012 R2 with two 6-core Xeons (older ones) and 64GB memory. Only this VM is running, with all of the hardware assigned to it.
According to a CPU benchmark I ran in each VM, the VM within Hyper-V should be about 5x faster than my local one.
I stripped my code down to just this one operation which I set in a WHILE-loop to simulate parallel queries - normally this is done by a webserver.
Code
DECLARE @cnt INT = 1
WHILE @cnt <= 1000
BEGIN
    BEGIN TRANSACTION Trans
    UPDATE [Test].[dbo].[NumberTable]
    SET Number = Number + 1
    OUTPUT deleted.*
    COMMIT TRANSACTION Trans
    SET @cnt = @cnt + 1;
END
When I execute this in SSMS it needs:
VMware Workstation: 43s
Hyper-V Server: 59s
...so the Hyper-V VM is about 1.4x slower, although the system is at least 4x faster.
Some facts
the DB is the same - backed up and restored
the table has just 1 row and 13 fields
the table has 3 indexes, none of them is "Number"
logged in user is 'SA'
OS is identical
SQL Server version is identical (same iso)
installed SQL Server features are the same
to be sure Hyper-V is not the bottleneck I also installed a VMware ESXi v6 on another server with even less power - the result is nearly identical to the Hyper-V-machine
settings in SSMS should be identical - checked it twice
execution plan is identical on each machine - just execution time is different
the more loops I choose, the bigger the relative time difference
ADDED: when I comment out the OUTPUT line to suppress the rendering of the rows (and each value), my VMware Workstation VM finishes in under 1s while the Hyper-V VM needs 5s. When I increase the loop count to 2000, the VMware Workstation VM again needs under 1s, whereas the Hyper-V VM needs 10s!
When running the full code from a local webserver the difference is about 0.8s versus about 9s! ...no, I have not forgotten the '0.'!!
Can you give me a hint as to what the hell is going on, or what else I can check?
EDIT
I tested the code above without the OUTPUT-line and with 10,000 passes. The client statistics on both systems look identical, except the time statistics:
VMware Workstation:
+-----------------------------+------+------+-----------+
| Time statistics             | (1)  | (2)  | (3)       |
+-----------------------------+------+------+-----------+
| Client processing time      | 2328 | 1084 | 1706.0000 |
| Total execution time        | 2343 | 1098 | 1720.5000 |
| Wait time on server replies | 15   | 14   | 14.5000   |
+-----------------------------+------+------+-----------+
Hyper-V:
+-----------------------------+-------+------+------------+
| Time statistics             | (1)   | (2)  | (3)        |
+-----------------------------+-------+------+------------+
| Client processing time      | 55500 | 1250 | 28375.0000 |
| Total execution time        | 55718 | 1328 | 28523.0000 |
| Wait time on server replies | 218   | 78   | 148.0000   |
+-----------------------------+-------+------+------------+
(1) : 10,000 passes without OUTPUT
(2) : 1,000 passes with OUTPUT
(3) : mean
EDIT (for HLGEM)
I compared both execution plans and indeed there are two differences:
fast system:
<QueryPlan DegreeOfParallelism="1" CachedPlanSize="24" CompileTime="0" CompileCPU="0" CompileMemory="176">
<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="104842" EstimatedPagesCached="26210" EstimatedAvailableDegreeOfParallelism="2" />
slow system:
<QueryPlan DegreeOfParallelism="1" CachedPlanSize="24" CompileTime="1" CompileCPU="1" CompileMemory="176">
<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="524272" EstimatedPagesCached="655341" EstimatedAvailableDegreeOfParallelism="10" />
Did you check hardware fully?
It looks as though the OUTPUT clause spends some time showing the data to you.
https://msdn.microsoft.com/en-us/library/ms177564%28v=sql.120%29.aspx
Time differences depend on many things. A local server may be faster because you are not sending data through a full network pipeline. Other work happening concurrently on each server may affect speed.
Typically in dev there is little or no other workload and things can be faster than on prod, where there are thousands of users trying to do things at the same time. This is why load testing is important if you have a large system.
You don't mention indexing but that too can be different on different servers (even when it is supposed to be the same!). So at least check that.
Look at the execution plans see if you can find the difference. Outdated statistics can also result in a less than optimal execution plan too.
Does one of the servers run applications other than the database? That could be limiting the amount of memory the server has available for the database to use.
Honestly, this is a huge topic and there are many, many things you should be checking. If you are doing this kind of analysis, I would suggest you buy a performance tuning book and read through it to figure out what things can affect this. This is not something that can easily be answered by a question on the Internet; you need to get some in-depth knowledge.
Query speed has little to do with CPU/memory speed, especially queries that update data.
Query speed is mainly limited by disk I/O speed, which is at least 1000 times slower than CPU/RAM speed. Making queries faster is ultimately about avoiding unnecessary disk I/O, but your query must read and write every row.
The VM box (probably) uses a virtual drive that is mapped to a file on disk and there is probably some effort required to keep the two aligned, possibly even asynchronously, while other processes are running and contending with the drive.
Maybe your workstation has less contention or a simpler virtual file system etc.
Let me first start by stating that in the last two weeks I have received ENORMOUS help from just about all of you (ok ok not all... but I think perhaps two dozen people commented, and almost all of these comments were helpful). This is really amazing and I think it shows that the stackoverflow team really did something GREAT altogether. So thanks to all!
Now as some of you know, I am working at a campus right now and I have to use a Windows machine. (I am the only one who has to use Windows here... :( )
I managed to set up (ok, the IT department did that for me) and populate a Postgres database (this I did on my own) with about 400 MB of data. That is perhaps not so much for most of you heavy Postgres users, but I was more used to SQLite databases for personal use, which rarely exceeded 2 MB.
Anyway, sorry for being so chatty - now the queries from that database work nicely. I use ruby to do queries actually.
The entries in the Postgres database are interconnected, in as far as they are like "pointers" - they have one field that points to another field.
Example:
entry 3667 points to entry 35785 which points to entry 15566. So it is quite simple.
The main entry is 1, so the end of all these queries is 1. So, from any other number, we can reach 1 in the end as the last result.
I am using ruby to make as many individual queries to the database until the last result returned is 1. This can take up to 10 individual queries. I do this by logging into psql with my password and data, and then performing the SQL query via -c. This probably is not ideal, it takes a little time to do these logins and queries, and ideally I would have to login only once, perform ALL queries in Postgres, then exit with a result (all these entries as result).
Now here comes my question:
- Is there a way to make conditional queries all inside of Postgres?
I know how to do it in a shell script and in ruby but I do not know if this is available in postgresql at all.
I would need to make the query, in literal english, like so:
"Please give me all the entries that point to the parent entry, until the last found entry is eventually 1, then return all of these entries."
I already "solved" it by using ruby to make several queries until 1 is eventually returned, but this strikes me as fairly inelegant and possibly not effective.
Any information is very much appreciated - thanks!
Edit (argh, I fail at pasting...):
Example dataset, the table would be like this:
 id | parent
----+--------
  1 |      1
  2 | 131567
  6 | 335928
  7 |      6
  9 |      1
 10 | 135621
 11 |      9
I hope that works; I tried to narrow it down to a minimal example.
For instance, id 11 points to id 9, and id 9 points to id 1.
It would be great if one could use SQL to return:
11 -> 9 -> 1
Without example table definitions it is hard to be specific, but what you're asking for sounds like a tree structure, which can be traversed with recursive queries: http://www.postgresql.org/docs/8.4/static/queries-with.html .
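As a rough sketch of what such a recursive query could look like for the example table in the question: the table name entries and the helper names below are assumptions, and the demo runs against an in-memory SQLite database through Python only so that it is self-contained. The WITH RECURSIVE statement itself is plain SQL that PostgreSQL 8.4 and later should accept, but the ? placeholder is SQLite's and would need to be swapped for your PostgreSQL driver's placeholder style.

```python
import sqlite3

# The example rows from the question: (id, parent).
ROWS = [(1, 1), (2, 131567), (6, 335928), (7, 6), (9, 1), (10, 135621), (11, 9)]

CHAIN_SQL = """
WITH RECURSIVE chain(id, parent, depth) AS (
    -- starting entry
    SELECT id, parent, 0 FROM entries WHERE id = ?
    UNION ALL
    -- follow the parent pointer, stopping at an entry that points to itself
    SELECT e.id, e.parent, c.depth + 1
    FROM entries AS e
    JOIN chain AS c ON e.id = c.parent
    WHERE c.id <> c.parent
)
SELECT id FROM chain ORDER BY depth
"""


def build_demo_db():
    """Assumption: a throwaway SQLite table mirroring the example dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, parent INTEGER NOT NULL)")
    conn.executemany("INSERT INTO entries VALUES (?, ?)", ROWS)
    return conn


def ancestor_chain(conn, start_id):
    """Return the ids visited when following parent pointers from `start_id`."""
    return [row[0] for row in conn.execute(CHAIN_SQL, (start_id,))]
```

With the example data, ancestor_chain(build_demo_db(), 11) follows 11 -> 9 -> 1 in a single query. The WHERE c.id <> c.parent condition matters because entry 1 points to itself; without it the recursion would never terminate.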