I am interested in performance testing my query in Redshift.
I would like to prevent the query from using any cached results from prior runs. In other words, I would like the query to run from scratch. Is it possible to disable cached results only for the execution of my query?
I do not want to disable cached results for the entire database or for all queries.
SET enable_result_cache_for_session TO OFF;
From enable_result_cache_for_session - Amazon Redshift:
Specifies whether to use query results caching. If enable_result_cache_for_session is on, Amazon Redshift checks for a valid, cached copy of the query results when a query is submitted. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn’t execute the query. If enable_result_cache_for_session is off, Amazon Redshift ignores the results cache and executes all queries when they are submitted.
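A typical benchmarking session then looks something like this (a sketch; the sales table and its columns are made-up stand-ins for the query under test):
SET enable_result_cache_for_session TO OFF;
-- runs from scratch: with the session setting off, Redshift skips the result cache
SELECT customer_id, SUM(revenue) AS total_revenue
FROM sales
GROUP BY customer_id;
-- optionally re-enable caching for the rest of the session
SET enable_result_cache_for_session TO ON;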
Ran across this during a benchmark today and wanted to add an alternative. The benchmark tool I was using has a setup and a teardown, but they don't run in the same session/transaction, so the enable_result_cache_for_session setting had no effect. I had to get a little clever.
From the Redshift documentation:
Amazon Redshift uses cached results for a new query when all of the following are true:
The user submitting the query has access permission to the objects used in the query.
The table or views in the query haven't been modified.
The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.
The query doesn't reference Amazon Redshift Spectrum external tables.
Configuration parameters that might affect query results are unchanged.
The query syntactically matches the cached query.
In my case, I just added a GETDATE() column to the query to force it to not use the result cache on each run.
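Something along these lines (a sketch; the sales table is just a stand-in for my real benchmarked query):
-- GETDATE() must be evaluated on every run, so Redshift skips the result cache entirely
SELECT customer_id, SUM(revenue) AS total_revenue, GETDATE() AS cache_buster
FROM sales
GROUP BY customer_id;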
There is Hive 2.1.1 over MR, a table test_table stored as sequencefile, and the following ad-hoc query:
select t.*
from test_table t
where t.test_column = 100
Although this query can be executed without starting MR (as a fetch task), sometimes scanning the HDFS files takes longer than triggering a single map job would.
When I want to enforce MR execution, I make the query more complex, e.g. by adding distinct (see the sketch after this list). The significant drawbacks of this approach are:
Query results may differ from the original query's
It puts pointless computation load on the cluster
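For reference, the workaround I mean looks like this (the distinct incurs a ReduceSink operator, which is what forces the MapReduce stage):
select distinct t.*
from test_table t
where t.test_column = 100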
Is there a recommended way to force MR execution when using Hive-on-MR?
The Hive executor decides whether to run a map task or a fetch task based on the following settings (defaults shown in parentheses):
hive.fetch.task.conversion ("more") — the strategy for converting MR tasks into fetch tasks
hive.fetch.task.conversion.threshold (1 GB) — max size of input data that can be fed to a fetch task
hive.fetch.task.aggr (false) — when set to true, queries like select count(*) from src also can be executed in a fetch task
This suggests the following two options:
set hive.fetch.task.conversion.threshold to a lower value, e.g. 512 MB
set hive.fetch.task.conversion to "none"
For some reason lowering the threshold did not change anything in my case, so I stuck with the second option, which seems fine for ad-hoc queries.
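For reference, this is how the two options look when set per session (a sketch; the threshold is specified in bytes, and the query is the ad-hoc query from above):
-- option 1: lower the input-size threshold for fetch-task conversion (512 MB in bytes)
set hive.fetch.task.conversion.threshold=536870912;
-- option 2 (use instead of option 1): disable fetch-task conversion entirely
set hive.fetch.task.conversion=none;
select t.*
from test_table t
where t.test_column = 100;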
More details on these settings can be found in the Cloudera forum and the Hive wiki.
Just add set hive.execution.engine=mr; before your query and it will force Hive to use MR.
I am using Spark SQL on Databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a few columns (20+).
The time it takes to run the metastore validation checks scales linearly with the number of columns included in my query. Is there any way to skip this step? Or pre-compute the checks? Or at least make the metastore check once per table rather than once per column?
A small example: when I run the code below, even before calling display or collect, the metastore check happens once:
new_table = table.withColumn("new_col1", F.col("col1"))
and when I run the code below, the metastore check happens multiple times, and therefore takes longer:
new_table = (table
    .withColumn("new_col1", F.col("col1"))
    .withColumn("new_col2", F.col("col2"))
    .withColumn("new_col3", F.col("col3"))
    .withColumn("new_col4", F.col("col4"))
    .withColumn("new_col5", F.col("col5"))
)
The metastore checks it's doing look like this in the driver node:
20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: xxx
What the user sees in the Databricks UI is:
Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...
I would be interested to know if anyone can confirm that this is just the way it works (a metastore check per column), and whether I simply have to plan for the overhead of these checks.
I am surprised by this behavior as it does not fit with the Spark processing model and I cannot replicate it in Scala. It is possible that it is somehow specific to PySpark but I doubt that since PySpark is just an API for creating Spark plans.
What is happening is that after every withColumn(...) the plan is analyzed. If the plan is large, this can take a while. There is a simple optimization, however: replace multiple withColumn(...) calls for independent columns with a single df.select(F.col("*"), F.col("col2").alias("new_col2"), ...). In this case only one analysis is performed.
In some cases of extremely large plans, we've saved 10+ minutes of analysis for a single notebook cell.
I can create a materialised view in RDS (PostgreSQL) to keep track of the 'latest' data output from a SQL query, and then visualise this in QuickSight. This process is also very 'quick', as it doesn't involve calling additional AWS services or re-processing all the data again (through the SQL query). My assumption about how this works is that it runs the SQL once, then re-runs it without processing the whole dataset again, so that if you structure the query correctly you can end up with a 'real-time running total' metric, for example.
The issue is that creating materialised views (every 5 seconds) for hundreds of queries, and storing them all in a database, is not scalable. Imagine a DB with 1 TB of data: creating an incremental/materialised view seems much less painful than using other AWS services, but eventually it won't be optimal for processing time, cost, etc.
I have explored various AWS services, none of which seem to solve this problem.
I tried using AWS Glue. You would need to create one script per query and output it to a DB. The lag between reading and writing the incremental data is larger than with a materialised view: you can process the data incrementally, but appending it to the current 'total' metric is a separate process.
I explored using AWS Kinesis followed by a Lambda to run a SQL query on the 'new' data in the stream and store the value in S3 or RDS. Again, this adds latency and doesn't work as well as a materialised view.
I read that AWS Redshift does not have materialised views, so I stuck with RDS (PostgreSQL).
Any thoughts?
[A similar issue: incremental SQL query - except I want to avoid running the SQL on "all" data to avoid massive processing costs.]
Edit (example):
table1 has schema (datetime, customer_id, revenue)
I run this query: select customer_id, sum(revenue) from table1 group by customer_id.
This scans the whole table to come up with a metric per customer_id.
table1 then gets updated with new data as the datetime progresses, e.g. 1 hour of extra data.
If I run the query again, it scans all the data again.
A more efficient way is to compute the query only on the new data and append the result (see the sketch after this example).
Also, I want the query to run whenever the data changes, rather than on a schedule, so that my front-end dashboards basically 'auto update' without the customer doing much.
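To illustrate, the kind of incremental update I have in mind might look like this in PostgreSQL (a sketch; revenue_totals, which needs a unique key on customer_id, and watermark, a one-row table holding the last processed datetime, are hypothetical helper tables):
-- aggregate only the rows that arrived since the last run and fold them into the running totals
insert into revenue_totals (customer_id, total_revenue)
select customer_id, sum(revenue)
from table1
where datetime > (select last_processed from watermark)
group by customer_id
on conflict (customer_id)
do update set total_revenue = revenue_totals.total_revenue + excluded.total_revenue;
-- advance the watermark so the next run only sees newer rows
update watermark set last_processed = (select max(datetime) from table1);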
I am using apache-hive-1.2.2 on Hadoop 2.6.0. When I run a Hive query with a where clause, it returns results immediately without launching any MapReduce job. I'm not sure what is happening. The table has over 100k records.
I am quoting this from the Hive documentation:
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced, not having any subquery, and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views, or joins.
Any sort of aggregation, like max, min, or count, is going to require a MapReduce job. So it depends on the query and the data set you have.
select * from tablename;
A query like this just reads raw data from the files in HDFS, so it doesn't need MapReduce and is much faster.
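For contrast, an aggregate over the same table does launch a MapReduce job (assuming the default hive.fetch.task.aggr=false):
select count(*) from tablename;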
This is due to the property "hive.fetch.task.conversion". The default value is "more" (as of Hive 2.1.0), which results in Hive trying to go straight at the data by launching a single fetch task instead of a MapReduce job wherever possible.
This behaviour, however, might not be desirable if you have a huge table (say 500 GB+), as it launches a single thread instead of the multiple threads you get with a MapReduce job.
You can set this property to "minimal" or "none" in hive-site.xml to bypass the behaviour.
I am running into a serious "Resources exceeded during query execution" issue when querying a large Google BigQuery table (105M records) with GROUP EACH BY and ORDER BY.
Here is the sample query (which uses the public Wikipedia data set):
SELECT Id,Title,Count(*) FROM [publicdata:samples.wikipedia] Group EACH by Id, title Order by Id, Title Desc
How can I solve this without adding a LIMIT keyword?
Using ORDER BY on big data databases is not an ordinary operation, and at some point it exceeds the limits of the available resources. You should consider sharding your query or running the ORDER BY over your exported data.
As I explained to you today in your other question, adding allowLargeResults will allow you to return a large response, but you can't specify a top-level ORDER BY, TOP, or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
One option here that you may try is sharding your query.
where ABS(HASH(Id) % 4) = 0
You can play with the above parameters to achieve smaller result sets and then combine them.
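For example, the sample query above can be split into four shards by varying the remainder from 0 to 3 and running the query once per shard (a sketch in legacy SQL; each shard produces a smaller result that you then combine):
SELECT Id, Title, COUNT(*)
FROM [publicdata:samples.wikipedia]
WHERE ABS(HASH(Id) % 4) = 0
GROUP EACH BY Id, Title
ORDER BY Id, Title DESC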
Also read Chapter 9 - Understanding Query Execution; it explains how sharding works internally.
You should also read Launch Checklist for BigQuery
I've gone through the same problem and fixed it by following these steps:
Run the query without ORDER BY and save the result in a dataset table.
Export the content of that table to a bucket in GCS using a wildcard (BUCKETNAME/FILENAME*.csv).
Download the files to a folder on your machine.
Install XAMPP (you may get a UAC warning on Windows) and adjust its settings afterwards if needed.
Start Apache and MySQL in your XAMPP control panel.
Install HeidiSQL and establish the connection to your MySQL server (installed with XAMPP).
Create a database and a table with the required fields.
Go to Tools > Import CSV file, configure accordingly and import.
Once all the data is imported, run the ORDER BY and export the table.
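As a rough sketch of that last step in MySQL (wikipedia_counts and its columns are hypothetical names for the imported data):
SELECT Id, Title, cnt
FROM wikipedia_counts
ORDER BY Id, Title DESC;
The sorted result can then be exported from HeidiSQL.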