Assume you have a set as follows:
+-------+-------+
| PK | myBin |
+-------+-------+
| "1" | 24 |
+-------+-------+
1 row in set (1 secs)
How do I get the LUT (last-update-time) metadata for PK=1 for bin myBin?
NOTE: I'm looking for bin level LUT and not row level
Bin-level LUTs are not exposed to clients, for backward-compatibility reasons, so you cannot query them from the client. Also, bin-level LUTs are not always maintained; they are maintained only under certain XDR configurations.
A workaround is to write an additional bin holding the timestamp of the update along with the regular bin update. If you need the update timestamp for only a few bins, this is a reasonable workaround with some overhead.
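The workaround can be sketched on the client side as follows. This is only an illustration, not an Aerospike API: the helper name `with_update_timestamps` and the `_lut` suffix are hypothetical, and the resulting dictionary would be handed to whatever write call your client library provides.

```python
import time


def with_update_timestamps(bins, tracked, now_ms=None):
    """Return a copy of `bins` with a companion timestamp bin for each
    tracked bin that is part of this write.

    The "<name>_lut" naming is an assumption made for this sketch.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # milliseconds since the epoch
    out = dict(bins)
    for name in tracked:
        if name in bins:  # only stamp bins that are actually being written
            out[name + "_lut"] = now_ms
    return out


# Fixed timestamp so the example is reproducible.
record_bins = with_update_timestamps({"myBin": 24}, tracked={"myBin"},
                                     now_ms=1700000000000)
print(record_bins)
```

Reading `myBin_lut` back alongside `myBin` then gives the per-bin update time, at the cost of one extra bin per tracked bin.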
I have an application that receives messages from a database via the write ahead logs and every row looks something like this
| id | prospect_id | school_id | something | something else |
|----|-------------|------------|-----------|----------------|
| 1 | 5 | 10 | who | cares |
| 2 | 5 | 11 | what | this |
| 3 | 6 | 10 | is | blah |
Eventually, I will need to query the database for the mapping between prospect_id and school name. The query results number in the tens of thousands. The schools table has a name column that I can query with a simple join. However, I want to store this information somewhere on my server where it is easily accessible to my application. I need this information to be:
stored locally so I can access it quickly
capable of being updated once a day asynchronously
independent of the application, so that when it's deployed or restarted, the data doesn't need to be queried again.
What can be done? What are some options?
EDIT
Is pickle a good idea? https://datascience.blog.wzb.eu/2016/08/12/a-tip-for-the-impatient-simple-caching-with-python-pickle-and-decorators/
What are the limitations of pickle? The results of the SQL query might be in the tens of thousands.
The drawback of using pickle is that it is a Python-specific protocol. If you intend for other programming languages to read this file, the tooling to read it might not exist, and you would be better off storing the data in something like a JSON or XML file. If you will only be reading it with Python, then pickle is fine.
Here are a few options you have:
Load the data from SQL when the application is started up (the SQL data can be stored locally, doesn't have to be on an external system) in a global value.
Use pickle to serialize and deserialize the data to and from a file when needed.
Load the data into Redis, an in-memory caching system.
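As a rough sketch of the pickle option, the cache below is reloaded from disk when it is fresh and rebuilt otherwise. The function names, the file name and the one-day `max_age_seconds` default are assumptions for illustration, and `fetch_from_database` is a stand-in for the real join. Only unpickle files your own application wrote, since loading an untrusted pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile
import time


def load_mapping(cache_path, fetch, max_age_seconds=24 * 60 * 60):
    """Return the cached mapping, calling fetch() when the file is missing or stale."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
    except OSError:
        age = None  # no cache file yet
    if age is not None and age < max_age_seconds:
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)

    mapping = fetch()
    # Write to a temporary file and rename it, so a reader never sees
    # a half-written cache file.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(mapping, fh, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, cache_path)
    return mapping


def fetch_from_database():
    # Hypothetical stand-in for the prospect_id -> school names query.
    return {5: ["School A", "School B"], 6: ["School A"]}


cache_file = os.path.join(tempfile.mkdtemp(), "prospect_schools.pkl")
first = load_mapping(cache_file, fetch_from_database)  # runs the "query"
second = load_mapping(cache_file, lambda: {})          # served from the file
print(first == second)
```

Because the file lives outside the application process, a restart or redeploy simply reads it again, and a daily job can refresh it independently.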
We are currently facing performance issues when an ORDER BY clause is part of the query.
Current Specs:
We are running two Geode servers with a capacity of 20 GB (max heap size) each. Geode has around 3.1 million records and the table has 1.48 million.
Query:
query --query="SELECT DISTINCT cashFlowId,upstreamSystem,upstreamSystemTxnDate,valueDate,amount,status FROM WHERE AND account IN SET ('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual' ORDER BY amount DESC LIMIT 100"
The above query returns its output in 13-15 seconds after 2-3 runs.
Actual Result Set: 666553
No of Records in the table: 1.49 million
What have we tried/observed so far?
We found that the index (type: range) is being picked correctly.
No improvement even after allocating more memory to the JVM.
Verified that the IN operator has no impact on query performance. We tried the same query using the OR operator.
On removing the ORDER BY clause, the query completes in 2 seconds. We figured that sorting is eating most of the time.
Could you please guide us or shed some light on improving the query performance?
Server Metrics:
Category | Metric | Value
--------- | --------------------- | ------------
cluster | totalHeapSize | 47135
cache | totalRegionEntryCount | 3100429
Like Urizen said, check the number of GC's going on but there is more. Here is the code and it looks fairly tight: Geode Order By Comparator. There is another factor related to the nature of distributed sort order that has little to do with Geode as a product. Each node does its ordering but when the results get returned from each node, those results need to be merged with the results from other nodes. In other words, given a set of {2,4,3,1,6,5}, node 1 can sort {2,5,6} and node 2 sorts {1,3,4} but the controlling node needs to do a merge for you to get {1,2,3,4,5,6}. I suspect that there's some of that going on as well. This has nothing to do with Geode per se but just distributed order by's. In database performance optimization theory, the database is the worst place to do an order by.
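The merge step described above can be illustrated in a few lines. This is a generic sketch of merging pre-sorted partial results, not Geode's actual implementation; the two lists simply stand in for what each node returns.

```python
import heapq
import itertools

# Each node sorts only its own share of {2,4,3,1,6,5}.
node_1 = sorted([2, 6, 5])   # [2, 5, 6]
node_2 = sorted([4, 3, 1])   # [1, 3, 4]

# The controlling node merges the already-sorted partial results.
merged = list(heapq.merge(node_1, node_2))
print(merged)

# With ORDER BY ... DESC LIMIT n, the merge can stop after n items.
node_1_desc = sorted(node_1, reverse=True)
node_2_desc = sorted(node_2, reverse=True)
top_3 = list(itertools.islice(
    heapq.merge(node_1_desc, node_2_desc, reverse=True), 3))
print(top_3)
```

Even with a small LIMIT, each node still has to sort all of its matching rows before the merge can start, which fits the observation that sorting dominates the run time.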
I'm wondering here if the better way to do this is to return 2 answer sets: 1) your answer set that you want but unsorted, and 2) a small KV collection of items where K is amount and V is the key. Then on the client you do a sort of the small KV collection and iterate over the KV collection reading your larger answer set in that order.
If you didn't want to write a function to do that, you could do one additional query up front to select amount, key FROM ..., wrap that in a sorted collection and then do your full unsorted query. This should be really quick since your 2 seconds is partially being consumed by network on such a large answer set.
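A minimal sketch of the client-side approach, assuming the two result sets have already been fetched into plain dictionaries. The names `amount_by_key` and `rows_by_key` and the sample values are made up for illustration; the small collection is keyed by entry key rather than by amount here, since several rows may share the same amount.

```python
import heapq


def top_by_amount(amount_by_key, rows_by_key, limit=100):
    """Return up to `limit` rows ordered by amount, largest first."""
    # nlargest avoids fully sorting the collection when limit is small.
    top_keys = heapq.nlargest(limit, amount_by_key, key=amount_by_key.get)
    return [rows_by_key[k] for k in top_keys]


# Small (key -> amount) collection from the cheap up-front query.
amount_by_key = {"cf1": 250.0, "cf2": 900.0, "cf3": 10.0}
# Larger unsorted answer set, keyed the same way.
rows_by_key = {
    "cf1": {"cashFlowId": "cf1", "amount": 250.0, "status": "Booked"},
    "cf2": {"cashFlowId": "cf2", "amount": 900.0, "status": "Booked"},
    "cf3": {"cashFlowId": "cf3", "amount": 10.0, "status": "Booked"},
}

top_rows = top_by_amount(amount_by_key, rows_by_key, limit=2)
print([row["cashFlowId"] for row in top_rows])
```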
Jason may have some technical insights but removing the load from the server may be the answer if you have large answer sets like you do.
I am completely desperate about a performance difference, and I have absolutely no clue WHY there is one.
Overview
VMware Workstation v11 on my local computer. I gave the VM just 2 cores and 4GB memory.
Hyper-V Server 2012 R2 with two 6-core-Xeon's (older ones) and 64GB memory. Just this VM is running with full hardware associated.
According to a CPU benchmark I ran in each VM, the VM within Hyper-V should be about 5x faster than my local one.
I stripped my code down to just this one operation which I set in a WHILE-loop to simulate parallel queries - normally this is done by a webserver.
Code
DECLARE @cnt INT = 1
WHILE @cnt <= 1000
BEGIN
    BEGIN TRANSACTION Trans
    UPDATE [Test].[dbo].[NumberTable]
    SET Number = Number + 1
    OUTPUT deleted.*
    COMMIT TRANSACTION Trans
    SET @cnt = @cnt + 1;
END
When I execute this in SSMS it needs:
VMware Workstation: 43s
Hyper-V Server: 59s
...which is about 2x slower although the system is at least 4x faster.
Some facts
the DB is the same - backed up and restored
the table has just 1 row and 13 fields
the table has 3 indexes, none of them is "Number"
logged in user is 'SA'
OS is identical
SQL Server version is identical (same iso)
installed SQL Server features are the same
to be sure Hyper-V is not the bottleneck I also installed a VMware ESXi v6 on another server with even less power - the result is nearly identical to the Hyper-V-machine
settings in SSMS should be identical - checked it twice
execution plan is identical on each machine - just execution time is different
the more loops I choose, the bigger the relative time difference is
ADDED: When I comment out the OUTPUT line to suppress the drawing of the line (and each value), my VMware Workstation does it in under 1s while the Hyper-V needs 5s. When I increase the loop count to 2000, my VMware Workstation again needs under 1s, whereas the Hyper-V version needs 10s!
When running the full code from a local webserver the difference is about 0.8s versus about 9s! ...no, I have not forgotten the '0.'!!
Can you give me a hint as to what the hell is going on, or what else I can check?
EDIT
I tested the code above without the OUTPUT-line and with 10,000 passes. The client statistics on both systems look identical, except the time statistics:
VMware Workstation:
+-------------------------------+------+--+------+--+-----------+
| Time statistics | (1) | | (2) | | (3) |
+-------------------------------+------+--+------+--+-----------+
| Client processing time | 2328 | | 1084 | | 1706.0000 |
| Total execution time | 2343 | | 1098 | | 1720.5000 |
| Wait time on server replies | 15 | | 14 | | 14.5000 |
+-------------------------------+------+--+------+--+-----------+
Hyper-V:
+-------------------------------+-------+--+------+--+------------+
| Time statistics | (1) | | (2) | | (3) |
+-------------------------------+-------+--+------+--+------------+
| Client processing time | 55500 | | 1250 | | 28375.0000 |
| Total execution time | 55718 | | 1328 | | 28523.0000 |
| Wait time on server replies | 218 | | 78 | | 148.0000 |
+-------------------------------+-------+--+------+--+------------+
(1) : 10,000 passes without OUTPUT
(2) : 1,000 passes with OUTPUT
(3) : mean
EDIT (for HLGEM)
I compared both execution plans and indeed there are two differences:
fast system:
<QueryPlan DegreeOfParallelism="1" CachedPlanSize="24" CompileTime="0" CompileCPU="0" CompileMemory="176">
<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="104842" EstimatedPagesCached="26210" EstimatedAvailableDegreeOfParallelism="2" />
slow system:
<QueryPlan DegreeOfParallelism="1" CachedPlanSize="24" CompileTime="1" CompileCPU="1" CompileMemory="176">
<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="524272" EstimatedPagesCached="655341" EstimatedAvailableDegreeOfParallelism="10" />
Did you check hardware fully?
It looks as if the OUTPUT clause spends some time returning the data to you.
https://msdn.microsoft.com/en-us/library/ms177564%28v=sql.120%29.aspx
Time differences depend on many things. A local server may be faster because you are not sending data through a full network pipeline. Other work happening concurrently on each server may affect speed.
Typically in dev there is little or no other workload, and things can be faster than on Prod, where there are thousands of users trying to do things at the same time. This is why load testing is important if you have a large system.
You don't mention indexing but that too can be different on different servers (even when it is supposed to be the same!). So at least check that.
Look at the execution plans and see if you can find the difference. Outdated statistics can also result in a less than optimal execution plan.
Does one of the servers run applications other than the database? That could be limiting the amount of memory the server has available for the database to use.
Honestly, this is a huge topic and there are many, many things you should be checking. If you are doing this kind of analysis, I would suggest you buy a performance tuning book and read through it to figure out what can affect this. This is not something that can easily be answered by a question on the Internet; you need to get some in-depth knowledge.
Query speed has little to do with CPU/memory speed, especially queries that update data.
Query speed is mainly limited by disk I/O speed, which is at least 1000 times slower than CPU/RAM speed. Making queries faster is ultimately about avoiding unnecessary disk I/O, but your query must read and write every row.
The VM box (probably) uses a virtual drive that is mapped to a file on disk and there is probably some effort required to keep the two aligned, possibly even asynchronously, while other processes are running and contending with the drive.
Maybe your workstation has less contention or a simpler virtual file system etc.
Let me first start by stating that in the last two weeks I have received ENORMOUS help from just about all of you (ok ok not all... but I think perhaps two dozen people commented, and almost all of these comments were helpful). This is really amazing and I think it shows that the stackoverflow team really did something GREAT altogether. So thanks to all!
Now as some of you know, I am working at a campus right now and I have to use a windows machine. (I am the only one who has to use windows here... :( )
Now I managed to set up (ok, the IT department did that for me) and populate a Postgres database (this I did on my own) with about 400 MB of data. That is perhaps not so much for most of you heavy Postgres users, but I was more used to SQLite databases for personal use, which rarely ever exceeded 2 MB.
Anyway, sorry for being so chatty - now the queries from that database work nicely. I use Ruby to do the queries, actually.
The entries in the Postgres database are interconnected, in as far as they are like "pointers" - they have one field that points to another field.
Example:
entry 3667 points to entry 35785 which points to entry 15566. So it is quite simple.
The main entry is 1, so the end of all these queries is 1. So, from any other number, we can reach 1 in the end as the last result.
I am using ruby to make as many individual queries to the database until the last result returned is 1. This can take up to 10 individual queries. I do this by logging into psql with my password and data, and then performing the SQL query via -c. This probably is not ideal, it takes a little time to do these logins and queries, and ideally I would have to login only once, perform ALL queries in Postgres, then exit with a result (all these entries as result).
Now here comes my question:
- Is there a way to make conditional queries all inside of Postgres?
I know how to do it in a shell script and in ruby but I do not know if this is available in postgresql at all.
I would need to make the query, in plain English, like so:
"Please give me all the entries that point to the parent entry, until the last found entry is eventually 1, then return all of these entries."
I already "solved" it by using Ruby to make several queries until 1 is eventually returned, but this strikes me as fairly inelegant and possibly inefficient.
Any information is very much appreciated - thanks!
Edit (argh, I fail at pasting...):
Example dataset, the table would be like this:
 id | parent
----+--------
  1 |      1
  2 | 131567
  6 | 335928
  7 |      6
  9 |      1
 10 | 135621
 11 |      9
I hope that works, I tried to narrow it down solely on example.
For instance, id 11 points to id 9, and id 9 points to id 1.
It would be great if one could use SQL to return:
11 -> 9 -> 1
Without example table definitions it is hard to be specific, but what you're asking for vaguely reminds me of a tree structure, which can be traversed with recursive queries: http://www.postgresql.org/docs/8.4/static/queries-with.html .
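A sketch of such a recursive query against the sample data from the question. To keep the example self-contained it runs on Python's built-in SQLite instead of Postgres; the table name `entries` is an assumption, and with a Postgres driver the parameter placeholder would differ, while the `WITH RECURSIVE` statement itself should carry over (PostgreSQL 8.4 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, parent INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO entries (id, parent) VALUES (?, ?)",
    [(1, 1), (2, 131567), (6, 335928), (7, 6), (9, 1), (10, 135621), (11, 9)],
)

CHAIN_SQL = """
WITH RECURSIVE chain(id, parent, depth) AS (
    SELECT id, parent, 0 FROM entries WHERE id = ?
    UNION ALL
    SELECT e.id, e.parent, c.depth + 1
    FROM entries AS e
    JOIN chain AS c ON e.id = c.parent
    WHERE c.id <> c.parent   -- stop at the root, which points to itself
)
SELECT id FROM chain ORDER BY depth
"""


def path_to_root(connection, start_id):
    """Follow parent pointers from start_id in a single query."""
    return [row[0] for row in connection.execute(CHAIN_SQL, (start_id,))]


print(" -> ".join(str(i) for i in path_to_root(conn, 11)))
```

The whole chain comes back from one statement over one connection, which replaces the repeated psql logins.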
This is a pretty simple problem. Inserting data into the table normally works fine, except for a few times where the insert query takes a few seconds. (I am not trying to bulk insert data.) So I set up a simulation of the insert process to find out why the insert query occasionally takes more than 2 seconds to run. Joshua suggested that the index file may be being adjusted; I removed the id (primary key field), but the delay still happens.
I have a MyISAM table: daniel_test_insert (this table starts completely empty):
create table if not exists daniel_test_insert (
id int unsigned auto_increment not null,
value_str varchar(255) not null default '',
value_int int unsigned default 0 not null,
primary key (id)
)
I insert data into it, and sometimes an insert query takes > 2 seconds to run. There are no reads on this table - only writes, in serial, by a single-threaded program.
I ran the exact same query 100,000 times to find out why the query occasionally takes a long time. So far, it appears to be a random occurrence.
This query for example took 4.194 seconds (a very long time for an insert):
Query: INSERT INTO daniel_test_insert SET value_int=12345, value_str='afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f' - ran for 4.194 seconds
status | duration | cpu_user | cpu_system | context_voluntary | context_involuntary | page_faults_minor
starting | 0.000042 | 0.000000 | 0.000000 | 0 | 0 | 0
checking permissions | 0.000024 | 0.000000 | 0.000000 | 0 | 0 | 0
Opening tables | 0.000024 | 0.001000 | 0.000000 | 0 | 0 | 0
System lock | 0.000022 | 0.000000 | 0.000000 | 0 | 0 | 0
Table lock | 0.000020 | 0.000000 | 0.000000 | 0 | 0 | 0
init | 0.000029 | 0.000000 | 0.000000 | 1 | 0 | 0
update | 4.067331 | 12.151152 | 5.298194 | 204894 | 18806 | 477995
end | 0.000094 | 0.000000 | 0.000000 | 8 | 0 | 0
query end | 0.000033 | 0.000000 | 0.000000 | 1 | 0 | 0
freeing items | 0.000030 | 0.000000 | 0.000000 | 1 | 0 | 0
closing tables | 0.125736 | 0.278958 | 0.072989 | 4294 | 604 | 2301
logging slow query | 0.000099 | 0.000000 | 0.000000 | 1 | 0 | 0
logging slow query | 0.000102 | 0.000000 | 0.000000 | 7 | 0 | 0
cleaning up | 0.000035 | 0.000000 | 0.000000 | 7 | 0 | 0
(This is an abbreviated version of the SHOW PROFILE command, I threw out the columns that were all zero.)
Now the update has an incredible number of context switches and minor page faults. Opened_Tables increases about 1 per 10 seconds on this database (not running out of table_cache space)
Stats:
MySQL 5.0.89
Hardware: 32 Gigs of ram / 8 cores @ 2.66GHz; raid 10 SCSI harddisks (SCSI II???)
I have had the hard drives and raid controller queried: No errors are being reported.
CPUs are about 50% idle.
iostat -x 5 (reports less than 10% utilization for harddisks)
top reports a load average of about 10 for 1 minute (normal for our db machine)
Swap space has 156k used (32 gigs of ram)
I'm at a loss to find out what is causing this performance lag. This does NOT happen on our low-load slaves, only on our high load master. This also happens with memory and innodb tables. Does anyone have any suggestions? (This is a production system, so nothing exotic!)
I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single table INSERT/UPDATE/REPLACE statements --- not on any SELECTs. No load, locking, or thread build up is evident.
I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.
Also Ruled Out
Server load -- no correlation with high load
Engine -- happens with InnoDB/MyISAM/Memory
MySQL Query Cache -- happens whether it's on or off
Log rotations -- no correlation in events
The only other observation I have at this point is derived from the fact I'm running the same db on multiple machines. I have a heavy read application so I'm using an environment with replication -- most of the load is on the slaves. I've noticed that even though there is minimal load on the master, the phenomenon occurs more there. Even though I see no locking issues, maybe it's Innodb/Mysql having trouble with (thread) concurrency? Recall that the updates on the slave will be single threaded.
MySQL Version 5.1.48
Update
I think I have a lead on the problem in my case. On some of my servers, I noticed this phenomenon more than on the others. Looking at what was different between the servers, and tweaking things around, I was led to the MySQL InnoDB system variable innodb_flush_log_at_trx_commit.
I found the doc a bit awkward to read, but innodb_flush_log_at_trx_commit can take the values of 1,2,0:
For 1, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk for every commit.
For 2, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk approximately every 1-2 seconds.
For 0, the log buffer is flushed to the log file every second, and the log file is flushed to disk every second.
Effectively, in the order (1,2,0), as reported and documented, you're supposed to get increasing performance in exchange for increased risk.
Having said that, I found that the servers with innodb_flush_log_at_trx_commit=0 were performing worse (i.e. having 10-100 times more "long updates") than the servers with innodb_flush_log_at_trx_commit=2. Moreover, things immediately improved on the bad instances when I switched it to 2 (note you can change it on the fly).
So, my question is, what is yours set to? Note that I'm not blaming this parameter, but rather highlighting that its context is related to this issue.
I had this problem using INNODB tables. (and INNODB indexes are even slower to rewrite than MYISAM)
I suppose you are doing multiple other queries on some other tables, so the problem would be that MySQL has to handle disk writes in files that get larger and needs to allocate additional space to those files.
If you use MYISAM tables I strongly suggest using
LOAD DATA INFILE 'file-on-disk' INTO TABLE `tablename`
command; MYISAM is sensationally fast with this (even with primary keys) and the file can be formatted as csv and you can specify the column names (or you can put NULL as the value for the autoincrement field).
View MYSQL doc here.
The first tip I would give you is to disable the autocommit functionality and then commit manually.
LOCK TABLES a WRITE;
... DO INSERTS HERE
UNLOCK TABLES;
This benefits performance because the index buffer is flushed to disk only once, after all INSERT statements have completed. Normally, there would be as many index buffer flushes as there are INSERT statements.
But probably the best you can do, if it is possible in your application, is a bulk insert with one single statement.
This is done via vector binding, and it's the fastest way you can go.
Instead of:
"INSERT INTO tableName values()"
do:
"INSERT INTO tableName values(),(),(),().......(n)"
But consider this option only if parameter vector binding is possible with the MySQL driver you're using.
Otherwise I would lean towards the first possibility and LOCK the table for every 1000 inserts. Don't lock it for 100k inserts, because you'll get a buffer overflow.
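A sketch of building such multi-row statements in batches of 1000 rows. The helper names are hypothetical, the `%s` placeholder style is an assumption about the driver (it is the style used by common Python MySQL drivers), and the actual `cursor.execute` call is left as a comment because it needs a live connection.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_multi_row_insert(table, columns, row_count):
    """Build one INSERT statement with `row_count` value lists.

    Table and column names are interpolated directly, so they must be
    trusted identifiers; the values themselves are bound as parameters.
    """
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    return "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([one_row] * row_count))


rows = [("text %d" % i, i) for i in range(2500)]
batch_sizes = []
for batch in chunked(rows, 1000):
    sql = build_multi_row_insert(
        "daniel_test_insert", ["value_str", "value_int"], len(batch))
    params = [value for row in batch for value in row]  # flattened values
    # cursor.execute(sql, params)  # with a real driver cursor
    batch_sizes.append(len(batch))

print(batch_sizes)
print(build_multi_row_insert("tableName", ["a", "b"], 2))
```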
Can you create one more table with 400 (not null) columns and run your test again? If the number of slow inserts becomes higher, this could indicate that MySQL is spending time writing your records. (I don't know how it works internally, but it may be allocating more blocks, or moving something to avoid fragmentation... I really don't know.)
We upgraded to MySQL 5.1 and during this event the Query cache became an issue with a lot of "Freeing items?" thread states. We then removed the query cache.
Either the upgrade to MySQL 5.1 or removing the query cache resolved this issue.
FYI, to future readers.
-daniel
We hit exactly the same issue and reported here:
http://bugs.mysql.com/bug.php?id=62381
We are using 5.1.52 and don't have solution yet. We may need to turn QC off to avoid this perf hit.
If you are doing multiple insertions at once in a for loop, then please pause after every iteration using PHP's sleep("time in seconds") function.
Read this on Myisam Performance:
http://adminlinux.blogspot.com/2010/05/mysql-allocating-memory-for-caches.html
Search for:
'The MyISAM key block size The key block size is important' (minus the single quotes), this could be what's going on. I think they fixed some of these types of issues with 5.1
Can you check the stats on the disk subsystem? Is the I/O saturated? This sounds like internal DB work going on, flushing stuff to disk/log.
To check if your disk is behaving badly, and if you're in Windows, you can create a batch cmd file that creates 10,000 files:
@echo OFF
FOR /L %%G IN (1, 1, 10000) DO TIME /T > out%%G.txt
save it in a temp dir, like test.cmd
Enable command extensions running CMD with the /E:ON parameter
CMD.exe /E:ON
Then run your batch and see if the time between the first and the last out file differs by seconds or minutes.
On Unix/Linux you can write a similar shell script.
By any chance is there an SSD drive in the server? Some SSD drives suffer from 'stutter', which could cause your symptom.
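A portable alternative is a small Python sketch of the same test. The function name and the file count are illustrative, and the `os.fsync` call is an addition not present in the batch file: it forces each file to disk so that the disk, rather than the OS cache, is being measured.

```python
import os
import tempfile
import time


def time_file_creation(directory, count=10000):
    """Create `count` small files in `directory`; return elapsed seconds."""
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, "out%d.txt" % i)
        with open(path, "w") as fh:
            fh.write(time.strftime("%H:%M:%S\n"))
            fh.flush()
            os.fsync(fh.fileno())  # push the write through to the disk
    return time.perf_counter() - start


# Small count here so the example finishes quickly; raise it for a real test,
# and point `directory` at the volume that holds the MySQL data.
with tempfile.TemporaryDirectory() as tmp:
    elapsed = time_file_creation(tmp, count=200)
    created = len(os.listdir(tmp))

print("created %d files in %.3f seconds" % (created, elapsed))
```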
In any case, I would try to find out if the delay is occurring in MySQL or in the disk subsystem.
What OS is your server, and what file system is the MySQL data on?