Does Impala cache some data after queries?

I am new to Impala and am running some test cases on it.
I found that similar SQL statements run much faster the second time I execute them.
For example:
table1 = 4B rows
table2 = 50M rows
1st query: select * from table1 where id in (select id from table2 where xxx < 10000)
(20 seconds)
2nd query: select * from table1 where id in (select id from table2 where xxx < 9999)
(10 seconds)
3rd query: select * from table1 where id in (select id from table2 where xxx < 100)
(1 second)
I guess Impala does some special caching. Could anyone tell me the reason?
Thanks.

Impala benefits from the OS cache and, additionally, from HDFS caching.
An excerpt from Using HDFS Caching with Impala:
"The Linux OS cache [...] only keeps the most recently used data in memory. Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when using data from the Linux OS cache."
This may explain the difference in execution time between your 1st and 2nd queries. However, the reason that your 3rd query is much faster than the first two is probably not (only) caching but the fact that it only queries about 1/100th of the data (assuming a uniform distribution of xxx).
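If you want data pinned in the HDFS cache explicitly rather than relying on the OS cache, Impala exposes DDL for it. A minimal sketch, assuming an HDFS cache pool named 'cache_pool' has already been created by an administrator:
ALTER TABLE table1 SET CACHED IN 'cache_pool';
-- and to release the cache directive later:
ALTER TABLE table1 SET UNCACHED;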

When I posted a question about this on a user group mailing list, Cloudera's Jean-Daniel Cryans explained that this could be due to the OS page cache. The OS caches data read from disk in its page cache the first time it is accessed. Queries that hit tables that have not been read before cause data to be read from disk, while tables that are already loaded into the page cache are much faster to query because they are fetched from RAM. We observed the same thing with native Impala queries. We ran a few tests, restarting Impala and/or Kudu, and could reproduce this by resetting the page cache. For this reason, we measured separately the execution time of the very first run after a page cache reset and the execution time of 3 consecutive runs after the first one. Tests were repeated 3 times, recycling the cache for each loop.
In other words, Impala does not cache query results; the OS simply caches the files from disk that Impala has to read to process your query.

Related

The fastest way to extract all records from Oracle

I have an Oracle table containing 900 million records. This table is partitioned into 24 partitions and has indexes.
I tried using a parallel hint and set the fetch buffer to 100000:
select /*+ parallel(8) */
* from table
It takes 30 minutes to get 100 million records.
My question is:
Is there any faster way to get all 900 million records (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use the indexes and split my query into, say, 10 queries?
The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, since you can then read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
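If you don't know the partition names up front, you can list them from the data dictionary first (a sketch, assuming the table lives in your own schema):
select partition_name
from user_tab_partitions
where table_name = 'TABLE'
order by partition_position;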
Not really an answer but too long for a comment.
Too many variables can impact this to give properly informed advice, so the following are just some general hints.
Is this over a network, or local on the server? If the database is on a remote server, then you are paying a heavy network price. I would suggest (if possible) running the extract on the server itself using the BEQUEATH protocol to avoid using the network. Once the file(s) are complete, it will be quicker to compress and transfer them to the destination than to transfer the data directly from the database to a local file via JDBC row processing.
With JDBC, remember to set the cursor fetch size to reduce round-tripping - setFetchSize. The default value is tiny (10 for the Oracle driver); try something like 1000 and see how that helps.
As for the query: you are writing to a file, so even though Oracle might process the query in parallel, your write-to-file process probably doesn't, making it a bottleneck.
My approach would be to write the Java program to operate off a range of values passed as command-line parameters, and to experiment to find which range size and number of concurrent instances give optimal performance. Each range will likely fall within discrete partitions, so you will benefit from partition pruning (assuming the range value is an indexed column, ideally the partition key).
Roughly speaking, I would start with a range of 5 million rows and run a number of concurrent instances equal to the number of CPU cores minus 2; this is not a scientifically derived number, just one I tend to use as a first stab to see what happens.
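Each instance would then run a query of roughly this shape (a sketch; :lo and :hi are hypothetical bind variables supplied from the command line, and id stands in for the range/partition key column):
select * from table where id >= :lo and id < :hi;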

Simultaneous queries on the same table / view affect performance

If I had a view (or table) which contained millions of rows and I executed these two queries from different sessions, would one query be adversely affected by another? (Please note no DML will be going on)
e.g. Select * from t1 where sex = 'M'; (Returns 20 columns and 10,000 rows)
select sex from t1 where rownum < 2;
What about if I had multiple sessions executing query 1? Would they all be equally slow until one of them had been cached (provided it was large enough)?
I am currently seeing degraded performance for the quicker queries when similar queries run together in a load-balancing test; however, when executed separately (even when the result hasn't been cached) I get 'normal' response times.
If the tables are not being modified and the queries are using base tables, then it would be surprising if these two queries were interfering with each other. In particular, the second query:
select sex
from t1
where rownum < 2;
This should simply fetch one row and run very fast.
The first can take advantage of an index on t1(sex).
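For example (a sketch, with a hypothetical index name):
create index t1_sex_idx on t1(sex);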
If t1 is really a view, then Oracle probably has to do all the processing for the view. Twice, once for each query. If the view is complex, then this would put a load on the server and slow everything down.
Have you looked at what is happening in the buffer cache, in particular V$DB_CACHE_ADVICE, for the buffer hit/miss ratio? Are there any candidates (in the underlying tables) for adding to the KEEP buffer pool to avoid IO? To be fair, it can take a while to monitor this and understand the full picture before deciding what action to take, but it is worth looking at. More information here: https://docs.oracle.com/database/121/TGDBA/tune_buffer_cache.htm#TGDBA555 .
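As a starting point, something like this shows the predicted effect of different cache sizes (a sketch; requires the DB_CACHE_ADVICE parameter to be enabled):
select size_for_estimate, buffers_for_estimate, estd_physical_read_factor
from v$db_cache_advice
where name = 'DEFAULT'
order by size_for_estimate;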

BigQuery's query is extremely slow

I have a table with 1.6 billion rows. I have been running a query that groups by a field with over 5 million unique values, sorts by the sum of another integer column in descending order, and finally returns only the top 10. After more than an hour, the query is still stuck in the running state.
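For reference, the query has roughly this shape (a sketch; the table and column names are placeholders, not from the original job):
SELECT group_field, SUM(int_field) AS total
FROM big_table
GROUP BY group_field
ORDER BY total DESC
LIMIT 10;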
I created this big table using "bq cp -a". Originally, those source tables were themselves "bq cp" copies of 1000 smaller tables, and each of those was loaded from over 12 compressed CSV load files.
I searched related questions and found that "Google BigQuery is running queries slowly" mentions slowness caused by fragmentation from many small ingestions. Would my approach to data ingestion count as ingesting "too-small data bits", causing fragmentation?
Is it possible that 5 million unique values is too many, and that this is the root cause of the slow response?
We had a latency spike yesterday and a smaller one today. Can you give the project id + job ids of the query jobs that took longer than you expected?

Performance of queries using count(*) on tables with many rows (300 million+)

I understand there are limitations to using sqlite, but I'd like to know if it should be able to handle this scenario.
My table has over 300 million records and the db is about 12 GB. The data import utility that comes with sqlite is nice and fast. But then I added an index on a string column in this table, and it ran all night to complete the operation. I haven't compared this to other dbs, but it seemed quite slow to me.
Now that my index is added, I'm wanting to look for duplicates in the data. So I'm trying to run a "having count > 0" query and it seems to be taking hours as well. My query looks like:
select col1, count(*)
from table1
group by col1
having count(*) > 1
I would assume this query would use my index on col1, but the slow execution makes me wonder whether it is actually being used.
Would SQL Server perhaps handle this kind of thing better?
SQLite's COUNT() isn't optimized - it does a full table scan even when an index exists. Run EXPLAIN QUERY PLAN to verify and you'll see:
EXPLAIN QUERY PLAN SELECT COUNT(FIELD_NAME) FROM TABLE_NAME;
I get something like this:
0|0|0|SCAN TABLE TABLE_NAME (~1000000 rows)
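For the duplicate search itself, an index on col1 should let SQLite satisfy the GROUP BY with an ordered index scan instead of a sort. A sketch, with a hypothetical index name; the plan line shown is what you would hope to see:
CREATE INDEX idx_table1_col1 ON table1(col1);
EXPLAIN QUERY PLAN
SELECT col1, COUNT(*) FROM table1 GROUP BY col1 HAVING COUNT(*) > 1;
-- e.g.: 0|0|0|SCAN TABLE table1 USING COVERING INDEX idx_table1_col1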
But then I added an index to a string column in this table, and it ran all night to complete this operation. I haven't compared this to other db's, but seemed quite slow to me.
I hate to tell you, but what does your server look like? Not arguing, but that is a potentially very resource-intensive operation that may require a lot of IO, and normal computers or cheap web servers with a slow hard disc are not suited for significant database work. I work on database projects of hundreds of gigabytes, and my smallest "large data" server has 2 SSDs and 8 Velociraptors for data and log. The largest one has 3 storage nodes with a total of 1000 GB of SSDs - simply because IO is what a db server lives and breathes on.
So I'm trying to run a "having count > 0" query and it seems to be taking hours as well
How much RAM? Enough to fit it all in memory, or a low-memory virtual server where the missing memory turns into bad IO? How much memory can / does SQLite use? How is temp storage set up? In memory? SQL Server would possibly use a lot of memory / tempdb space for this type of check.
Increase the SQLite cache via PRAGMA cache_size=<number of pages>. The memory used is <number of pages> times <size of page> (which can be set via PRAGMA page_size=<size of page>).
By setting those values to 16000 and 32768 respectively (about 512 MB), I was able to get one program's bulk load down from 20 minutes to 2 minutes. (Although I think that if the disk on that system hadn't been so slow, this might not have had as much effect.)
You might not have this extra memory available on lesser embedded platforms, so I don't recommend increasing it that much there, but for desktop- or laptop-class systems it can help greatly.
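Concretely, those settings look like this (a sketch; note that page_size only takes effect on a new database or after a VACUUM):
PRAGMA page_size = 32768;   -- bytes per page
PRAGMA cache_size = 16000;  -- pages to cache; 16000 * 32768 bytes is about 512 MB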

Same query, different DB's, same DB structure, same DB server, different execution plans

This one has me beat!
One of our clients has a single SQL Server instance with 2 DBs, one test and one production. Both DBs are identical: table structure, indexes, etc. When a certain query (structure below) is run on the test DB, rows are returned within 30 seconds. On the production DB, rows take 1.5 hours to return!! Activity Monitor shows the query running with no wait type. All statistics are up to date. When the execution plans are compared, the main difference is a nested loop join (prod) vs a hash join (test). We can add an index on the production DB to improve performance, achieving around the same execution time as on the test server - but remember that the test server doesn't need this additional index!
Unfortunately the client is not technical, hence the use of views joining to views (cringe). The issue itself is resolved in terms of performance (by adding the additional index on the production DB). But we still cannot work out why the test DB performs fine without the additional index.
Can anyone shed some light? We're stumped!
query structure:
SELECT field1,
field2,
field3,
DATEDIFF(mi, field5, field5) as 'Time'
FROM MainView t
JOIN SecondaryView i ON t.NonIndexedColumn = i.NonIndexedColumn
WHERE date >= '2010/07/05'
AND date < '2011/09/27'
AND anotherDate IS NULL
AND Code LIKE 'Abc%'
AND Desc LIKE('%ABC%')
It could be because of different data distribution on a disk or "table fragmentation".
If data was imported, for example, using
INSERT INTO test.table
SELECT * FROM prod.table
statements, then the data in the test database doesn't have as many gaps as in production. But this depends on how the data is being maintained in production.
Check the sys.dm_db_index_physical_stats function.
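For example, to compare fragmentation of the main table across the two databases (a sketch; the database, schema, and table names are placeholders):
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
    DB_ID('ProdDb'), OBJECT_ID('ProdDb.dbo.MainTable'), NULL, NULL, 'LIMITED');
Run the same query against the test database and compare avg_fragmentation_in_percent.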