I have used the default degree of parallelism for performance tuning and got the best results with it, but I suspect it will have an impact when some other job accesses the same table at the same time.
Sample code below:
select /*+ FULL(customer) PARALLEL(customer, default) */ customer_name from customer;
The number of servers available is 8. How does this default degree of parallelism work? Will it matter if some other job runs a query against the same table at the same time? Before moving this query to production, I would like to know whether this will have an impact. Thanks!
From documentation:
PARALLEL (DEFAULT):
The optimizer calculates a degree of parallelism equal to the number
of CPUs available on all participating instances times the value of
the PARALLEL_THREADS_PER_CPU initialization parameter.
The maximum degree of parallelism is limited by the number of CPUs in the system. The formula used to calculate the limit is:
PARALLEL_THREADS_PER_CPU * CPU_COUNT * the number of instances available
(by default, all the open instances on the cluster, but this can be constrained using PARALLEL_INSTANCE_GROUP or a service specification). This is the default.
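If it helps, here is a quick way to see the inputs to that formula on your own instance (a minimal read-only check; assumes you have access to V$PARAMETER):
select name, value
from   v$parameter
where  name in ('cpu_count', 'parallel_threads_per_cpu');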
Related: what are the factors for deciding the DOP in a PARALLEL hint in Oracle?
e.g.
SELECT /*+ PARALLEL(employees 3) */ e.last_name, d.department_name
FROM employees e, departments d
WHERE e.department_id=d.department_id;
How do I decide that 3 is the correct DOP?
Most of the systems I maintain are neither OLTP nor data warehouse, but something in between. So sometimes it is necessary to use parallelism, but you do not want to use all resources (all CPU cores) for a single report. Theoretically you can limit DOP via profiles, but resource limiting is a problematic component in Oracle. Also, adaptive DOP in Oracle decides by default based on the number of free CPU resources, so such a query usually ends up with a huge number of parallel workers, which totally kills disk IO.
Oracle session worker processes are single-threaded; none of them will use more than a single CPU core. So the DOP should always correspond to the number of CPU cores and to whether the query is actually CPU bound; a fixed DOP, as sketched below, keeps this predictable.
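A minimal sketch of capping the DOP explicitly (the value 4 is purely illustrative; pick something below your core count):
-- Statement-level cap: a fixed DOP instead of DEFAULT/adaptive DOP
SELECT /*+ PARALLEL(e 4) */ e.last_name, d.department_name
FROM employees e, departments d
WHERE e.department_id = d.department_id;
-- Session-level cap, if individual statements cannot be changed
ALTER SESSION FORCE PARALLEL QUERY PARALLEL 4;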
I have an Oracle table containing 900 million records. The table is partitioned into 24 partitions and has indexes.
I tried using a hint and set fetch_buffer to 100000:
select /*+ parallel(8) */
* from table
It takes 30 minutes to get 100 million records.
My question is:
Is there any faster way to get all 900 million rows (all the data in the table)? Should I use the partitions and run 24 sequential queries? Or should I use the indexes and split my query into, say, 10 queries?
The network is almost certainly the bottleneck here. Oracle parallelism only impacts the way the database retrieves the data, but data is still sent to the client with a single thread.
Assuming a single thread doesn't already saturate your network, you'll probably want to build a concurrent retrieval solution. It helps that the table is already partitioned, since you can then read large chunks of data without re-reading anything.
I'm not sure how to do this in Scala, but you want to run multiple queries like this at the same time, to use all the client and network resources possible:
select * from table partition (p1);
select * from table partition (p2);
...
Not really an answer but too long for a comment.
Too many variables impact this to give properly informed advice, so the following are just some general hints.
Is this over a network or local on the server? If the database is on a remote server then you are paying a heavy network price. I would suggest (if possible) running the extract on the server using the BEQUEATH protocol to avoid using the network. Once the file(s) are complete, it will be quicker to compress and transfer them to the destination than to transfer the data directly from the database to a local file via JDBC row processing.
With JDBC, remember to set the cursor fetch size to reduce round-tripping - setFetchSize. The default value is tiny (10, I think); try something like 1000 to see how that helps.
As for the query: you are writing to a file, so even though Oracle might process the query in parallel, your write-to-file process probably doesn't, and it becomes the bottleneck.
My approach would be to write the Java program to operate on a range of values passed as command-line parameters, and experiment to find which range size and number of concurrent instances of the Java program give optimal performance. The ranges will likely fall within discrete partitions, so you will benefit from partition pruning (assuming the range value is an indexed column, ideally the partition key).
Roughly speaking, I would start with a range of 5m and run a number of concurrent instances matching the number of CPU cores minus 2; this is not a scientifically derived number, just one I tend to use as a first stab to see what happens.
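For illustration only, the per-instance query would have roughly this shape (table and column names are hypothetical; order_id stands in for whatever your range/partition key is):
select *
from   big_table
where  order_id between :range_start and :range_end;
If order_id is the partition key, each instance should touch only the partitions covering its range.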
I want to know whether there is any limit when querying the data I have already loaded into BigQuery.
For example, if I want to query BigQuery from a web application or a web service, what is my limit on selects, updates, and deletes?
The documentation tells me this:
Concurrent rate limit for interactive queries under on-demand pricing: 50 concurrent queries. Queries that return cached results, or queries configured using the dryRun property, do not count against this limit.
Daily query size limit: unlimited by default, but you may specify limits using custom quotas.
But I cannot tell whether I have a limit on the number of queries per day, and if so, what that limit is.
There is a limit to the number of slots you can allocate for queries at a particular time.
Some nuggets:
Slot: represents one unit of computational capacity.
Query: Uses as many slots as required so the query runs optimally (Currently: MAX 50 slots for On Demand Price) [A]
Project: The slots used per project is based on the number of queries that run at the same time (Currently: MAX 2000 slots for On Demand Price)
[A] This is all under the hood without user intervention. BigQuery makes an assessment of the query to calculate the number of slots required.
So if you do the math, worst case, if all your queries use 50 slots, you will not see any side effect until you have more than 40 queries running concurrently. Even in those situations, the queries will just sit in the queue for a while and start running after some of the others have finished.
Slots become more of a worry when you are time-sensitive about getting your data and the queries run in an environment where:
A lot of queries are running at the same time.
Most of those concurrently running queries usually take a long time to execute even on an otherwise idle system.
The best way to understand whether these limits will impact you is by monitoring the current activity within your project. BigQuery advises you to monitor your slots with Stackdriver; a query-based check is sketched below.
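A hedged sketch of such a check (it assumes the INFORMATION_SCHEMA.JOBS_BY_PROJECT view is available to you; the region qualifier and the one-day window are just examples):
-- Approximate average slot usage of recent queries in this project
SELECT
  job_id,
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY avg_slots DESC
LIMIT 20;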
Update: BigQuery addresses the problem of query prioritization in one of their blog posts - Truth 5: BigQuery supports query prioritization.
I am currently working on improving the efficiency of a few queries. After looking at the query plan, I found that SQL Server does not use parallelism when a TOP clause is in place, increasing the query time from 1-2 s to several minutes.
The query in question uses a view with various joins and unions. I'm looking for a general answer/understanding as to why this is happening - Google has so far failed me.
Thanks
As you may be aware,
Generally, SQL Server processes queries in parallel in the following cases:
When the number of CPUs is greater than the number of active connections.
When the estimated cost for the serial execution of a query is higher than the query plan threshold (The estimated cost refers to the elapsed time in seconds required to execute the query serially.)
However, certain types of statements cannot be processed in parallel unless they contain particular clauses.
For example, UPDATE, INSERT, and DELETE are not normally processed in parallel even if the related query meets the criteria.
But if the UPDATE or DELETE statements contain a WHERE clause, or an INSERT statement contains a SELECT clause, WHERE and SELECT can be executed in parallel. Changes are applied serially to the database in these cases.
To configure parallel processing, simply do the following:
In the Server Properties dialog box, go to the Advanced page.
By default, the Max Degree Of Parallelism setting has a value of 0, which means that the maximum number of processors used for parallel processing is controlled automatically. Essentially, SQL Server uses the actual number of available processors, depending on the workload. To limit the number of processors used for parallel processing to a set amount (up to the maximum supported by SQL Server), change the Max Degree Of Parallelism setting to a value greater than 1. A value of 1 tells SQL Server not to use parallel processing.
Large, complex queries usually can benefit from parallel execution. However, SQL Server performs parallel processing only when the estimated number of seconds required to run a serial plan for the same query is higher than the value set in the cost threshold for parallelism. Set the cost estimate threshold using the Cost Threshold For Parallelism box on the Advanced page of the Server Properties dialog box. You can use any value from 0 through 32,767. On a single CPU, the cost threshold is ignored.
Click OK. These changes are applied immediately. You do not need to restart the server.
You can use the stored procedure sp_configure to configure parallel processing. The Transact-SQL commands are:
exec sp_configure "max degree of parallelism", <integer value>
exec sp_configure "cost threshold for parallelism", <integer value>
Quoted from Technet article Configure Parallel Processing in SQL Server 2008
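Putting the quoted steps together, a minimal illustrative run (the values 4 and 25 are arbitrary; both settings are advanced options, hence the first step):
exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'max degree of parallelism', 4;
exec sp_configure 'cost threshold for parallelism', 25;
reconfigure;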
TOP automatically places the query into serial (non-parallel) mode. This is a restriction and cannot be overcome. Try using a ranking function filtered on rank value = 1 as a possible alternative to the TOP function; a sketch follows.
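A hedged sketch of that workaround (the view and ordering column are hypothetical): rank the rows with a window function and filter on the rank instead of using TOP.
select *
from (
    select v.*,
           row_number() over (order by v.created_date desc) as rn
    from   dbo.SomeView v
) ranked
where ranked.rn = 1;
Whether this actually gets a parallel plan depends on the rest of the query, so check the plan after the change.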
How can I configure the maximum memory that a query (a SELECT) can use in SQL Server 2008?
I know there is a way to set the minimum value but how about the max value? I would like to use this because I have many processes in parallel. I know about the MAXDOP option but this is for processors.
Update:
What I am actually trying to do is run a data load continuously. The load is in ETL form (extract, transform, and load). While the data is loading I want to run some queries (SELECTs), all of them expensive (containing GROUP BY). The most important process for me is the data load. I get an average speed of 10000 rows/sec, and when I run the queries in parallel it drops to 4000 rows/sec or even lower. I know a few more details should be provided, but this is part of a more complex product I work on and I cannot detail it further. Another thing I can guarantee is that my load speed does not drop because of lock problems, because I monitored for those and removed them.
There isn't any way of setting a maximum memory at a per query level that I can think of.
If you are on Enterprise Edition you can use resource governor to set a maximum amount of memory that a particular workload group can consume which might help.
In SQL Server 2008 you can use Resource Governor to achieve this. There you can set request_max_memory_grant_percent to limit the memory (this is a percentage relative to the pool size specified by the pool's max_memory_percent value). This setting is not query specific; it is session specific. A sketch is below.
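A minimal, illustrative Resource Governor setup (pool/group names and percentages are hypothetical; Enterprise Edition only, and a classifier function is still needed to route the reporting sessions into the group):
CREATE RESOURCE POOL ReportPool WITH (MAX_MEMORY_PERCENT = 40);
CREATE WORKLOAD GROUP ReportGroup
    WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)
    USING ReportPool;
ALTER RESOURCE GOVERNOR RECONFIGURE;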
In addition to Martin's answer
If your queries are all the same or similar, working on the same data, then they will be sharing memory anyway.
Example:
A busy web site with 100 concurrent connections running 6 different parametrised queries between them on broadly the same range of data.
6 execution plans
100 user contexts
one buffer pool with assorted flags and counters to show usage of each data page
If you have 100 different queries or they are not parametrised then fix the code.
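For clarity, 'parametrised' here means something like the following (a hedged sketch; the table, column, and parameter names are made up), so that all 100 connections reuse the same execution plan:
exec sp_executesql
     N'select OrderId, Total from dbo.Orders where CustomerId = @cust',
     N'@cust int',
     @cust = 42;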
Memory per query is something I've never thought about or cared about since the last millennium.