Is there a limit to queries using BigQuery's library and API? - sql

I want to know whether there is any limit on querying data that is already loaded in BigQuery.
For example, if I want to read BigQuery data from a web application or a web service, what is my limit on selects, updates and deletes?
The documentation tells me this:
Concurrent rate limit for interactive queries under on-demand pricing: 50 concurrent queries. Queries that return cached results, or queries configured using the dryRun property, do not count against this limit.
Daily query size limit: unlimited by default, but you may specify limits using custom quotas.
But I cannot tell whether there is a limit on the number of queries per day, and if so, what that limit is.

There is a limit to the number of slots you can allocate for queries at a particular time.
Some nuggets:
Slot: represents one unit of computational capacity.
Query: Uses as many slots as required so the query runs optimally (Currently: MAX 50 slots for On Demand Price) [A]
Project: The slots used per project is based on the number of queries that run at the same time (Currently: MAX 2000 slots for On Demand Price)
[A] This is all under the hood without user intervention. BigQuery makes an assessment of the query to calculate the number of slots required.
So if you do the math, worst case, if all your queries use 50 slots, you will not find any side effect until you have more than 40 queries running concurrently. Even in those situations, the queries will just be in the queue for a while and will start running after some queries are done executing.
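The arithmetic above can be written out as a small sketch. The two limits are the figures quoted in this answer (they may well have changed since), and the function name is made up for illustration.

```python
# Figures quoted in the answer above; treat them as assumptions, not current quotas.
MAX_SLOTS_PER_QUERY = 50
MAX_SLOTS_PER_PROJECT = 2000


def max_concurrent_without_queueing(project_slots: int, slots_per_query: int) -> int:
    """Worst case: every query takes the same (maximum) number of slots."""
    if slots_per_query <= 0:
        raise ValueError("slots_per_query must be positive")
    return project_slots // slots_per_query


worst_case = max_concurrent_without_queueing(MAX_SLOTS_PER_PROJECT, MAX_SLOTS_PER_QUERY)
print(worst_case)  # 40
```

Queries that need fewer slots push this number up, so 40 is a floor for when queueing can start, not a hard cap on concurrency.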
Slots matter more when you need your results on time and your queries run in an environment where:
A lot of queries are running at the same time.
Most of those concurrent queries already take a long time to execute even when nothing else is running.
The best way to understand whether these limits will impact you is to monitor the current activity within your project. BigQuery advises you to monitor your slots with Stackdriver.
Update: BigQuery addresses the problem of query prioritization in one of their blog posts - Truth 5: BigQuery supports query prioritization

Related

Google Bigtable under usage "performance"

I have seen the warnings about not using Google Bigtable for small data sets.
Does this mean that a workload of 100 QPS could run slower (total time; not per query) than a workload of 8000 QPS?
I understand that 100 QPS is going to be incredibly inefficient on Bigtable, but could it be as drastic as 100 inserts taking 15 seconds to complete, whereas 8000 inserts could run in 1 second?
Just looking for a "in theory; from time to time; yes" vs "probably relatively unlikely" type answer to be a rough guide for how I structure my performance test cycles.
Thanks
There's a flat startup cost to running any Cloud Bigtable operation. That startup cost is generally less than 1 second. I would expect 100 operations to take less time than 8000 operations. When I see extreme slowness, I usually suspect network latency or some other unique condition.
We're having issues running small workloads on our developer Bigtable instance (2.5 TB): one instance instead of 3.
We have a key set up on user id, with around 100 rows per user id. Total records in the database are a few million. We are querying Bigtable and seeing 1.4 seconds of latency when fetching the rows associated with a single user id key. The total number of records returned is less than 100, and we're seeing well over a second of latency. It seems to me that giant workloads are the only way to use this data store. We're looking at other NoSQL alternatives like Redis.

Getting table-specific costs via SQL query

Is there a query I can run to determine how much queries against each table are costing us? For instance, the result of this query would at least include something like:
dataset.table1 236TB processed
dataset.table2 56GB processed
dataset.table3 24kB processed
etc
Also is there a way to know what specific queries are costing us the most?
Thanks!
Let's talk first about data and respective data-points to do such a query!
Take a look at Job Resources
Here are a few useful properties:
configuration.query.query - BigQuery SQL query to execute.
statistics.query.referencedTables - Referenced tables for the job.
statistics.query.totalBytesBilled - Total bytes billed for the job.
statistics.query.totalBytesProcessed - Total bytes processed for the job.
statistics.query.billingTier - Billing tier for the job.
Having the above data points lets you write a relatively simple query to answer your cost-per-query and cost-per-table questions!
So, now: how do you make this data available?
You can collect your jobs using the Jobs.list API, then loop through all available jobs and retrieve their stats via the Jobs.get API, dumping the retrieved data into a BigQuery table. Then you can enjoy the analysis!
Or you can use BigQuery's audit logs to track access and cost details (as described in the docs) and export them back to BigQuery for analysis.
The former option (Jobs.list and then Jobs.get in a loop) lets you get your job info even if you don't have audit logs enabled yet, because the Jobs.get API returns information about a specific job for six months after its creation, so there is plenty of data for analysis!
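Once the job records have been collected, the aggregation itself is simple. Below is a minimal sketch over plain dictionaries shaped like the Job resource fields listed above; the sample jobs, the function names and the `dataset.table` key format are made up for illustration. Note that a job reading several tables is counted in full for each of them, since the job statistics do not split bytes per table.

```python
from collections import defaultdict


def bytes_billed_by_table(jobs):
    """Sum totalBytesBilled over every table each job references.

    A job that reads several tables is counted in full for each one,
    so the per-table totals can add up to more than the real bill.
    """
    totals = defaultdict(int)
    for job in jobs:
        stats = job.get("statistics", {}).get("query", {})
        billed = int(stats.get("totalBytesBilled", 0))  # int64 fields arrive as strings
        for ref in stats.get("referencedTables", []):
            totals[f'{ref["datasetId"]}.{ref["tableId"]}'] += billed
    return dict(totals)


def most_expensive_queries(jobs, top=3):
    """Return (query text, bytes billed) pairs, most expensive first."""
    rows = [
        (
            job["configuration"]["query"]["query"],
            int(job["statistics"]["query"].get("totalBytesBilled", 0)),
        )
        for job in jobs
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


# Hypothetical job records, trimmed to the fields used above.
sample_jobs = [
    {
        "configuration": {"query": {"query": "SELECT a FROM dataset.table1"}},
        "statistics": {"query": {
            "totalBytesBilled": "3000",
            "referencedTables": [{"datasetId": "dataset", "tableId": "table1"}],
        }},
    },
    {
        "configuration": {"query": {"query": "SELECT * FROM dataset.table1 JOIN dataset.table2 USING (id)"}},
        "statistics": {"query": {
            "totalBytesBilled": "500",
            "referencedTables": [
                {"datasetId": "dataset", "tableId": "table1"},
                {"datasetId": "dataset", "tableId": "table2"},
            ],
        }},
    },
]

print(bytes_billed_by_table(sample_jobs))
print(most_expensive_queries(sample_jobs, top=1))
```

The same grouping could equally be done in SQL once the job records are loaded into a BigQuery table; the Python version is only meant to show the shape of the calculation.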
As far as I understand, it is currently not possible to get bytes processed per table.
It would be a great feature: it would let you identify and optimize costs, and better understand the effectiveness of partitioning and clustering changes. Currently you can only get the total bytes processed for a query and see which tables it referenced. There is no query, easy or otherwise, that lets you analyze this cost at the table level, which is more granular than the query level.

BigQuery execution time and scaling

I created a test dataset of roughly 450GB in BigQuery and I am getting execution speed of ~9 seconds to query the largest table (10bn rows) when running from WebUI. I just wanted to check if this is a 'normal' expected result and whether it would get worse with larger size (i.e. 100bn rows+) and if the queries become more complex. I am aware of table partitioning/etc. but I just want to get a sense of what is 'normal' expected speed without first getting into optimization, since the above seems like 'smallish' size for what BQ is meant for.
The above result is achieved on a simple query like this:
select ColumnA from DataSet.Table order by ColumnB desc limit 100
So the result returned to the client is very small. ColumnA is structured as UUIDs represented in String format and ColumnB is integer.
It's almost impossible to say if this is "normal" or not. BigQuery is a multitenancy architecture/infrastructure. That means we all share the same resources (i.e. compute power) in the cluster when running queries. Therefore, query times are never deterministic in BigQuery i.e. they can vary depending on the number of concurrent queries executing from users at any given time. That said however, you can get reserved slots for a flat rate price. Although, you'd need to be spending quite a lot of money to justify that.
You can improve execution times by removing compute/shuffle/memory-intensive steps such as ORDER BY. Obviously, the complexity of the query will also have an impact on query times.
On some of our projects we can smash through 3TB-5TB with a relatively complex query in about 15s-20s. Sometimes it is quicker, sometimes it is slower. We also run queries over much smaller datasets that can take the same amount of time. This is because of what I wrote at the beginning: BigQuery query times are not deterministic.
Finally, BigQuery will cache results, so if you issue the same query multiple times over the same dataset it will be returned from the cache i.e. much quicker!

Real time queries in MongoDB for different criteria and processing the result

New to MongoDB. Is MongoDB efficient for real-time queries where the criteria values change on every query? There will also be some aggregation of the result set before the response is sent back to the user. As an example, my use case needs to produce data in the following format after processing a collection for different criteria values.
Service Total Improved
A 1000 500
B 2000 700
.. .. ..
I see MongoDB has Aggregation, which processes records and returns computed results. Should I be using aggregation instead for efficiency? If aggregation is the way to go, I guess I would do that every time my source data changes. Also, is this what Mongo Hadoop is used for? Am I on the right track in my understanding? Thanks in advance.
Your question is too general, IMHO.
Speed depends on the size of your data, on the kind of query, on whether you have put an index on your key, etc.
Changing values in your queries is not critical, AFAIK.
For example I work on a MongoDB with 3 million docs and can do some queries in a couple of seconds, some in a couple of minutes. A simple map reduce over all 3 M docs takes about 25 min on that box.
I have not tried the aggregation API yet, which seems to be a successor/alternative to map / reduce runs.
I did not know about the MongoDB / Hadoop integration. It seems to keep MongoDB as an easy-to-use storage unit, which feeds data to a Hadoop cluster and gets results from it, using the more advanced map reduce framework from Hadoop (more phases, better use of a cluster of Hadoop nodes).
I would follow MongoDB's guidelines for counting stuff.
See MongoDB's documentation page for pre-aggregated reports.
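To illustrate the pre-aggregation idea in general terms: instead of scanning the collection at query time, counters are bumped on every write, so the report is a cheap lookup. The sketch below uses a plain in-memory dict as a stand-in for a MongoDB collection (in MongoDB this would typically be an upsert with `$inc`); the field names `total` and `improved` are assumptions taken from the example table in the question, and the function names are hypothetical.

```python
from collections import defaultdict


def new_report():
    """One counter document per service, created on first use."""
    return defaultdict(lambda: {"total": 0, "improved": 0})


def record_event(report, service, improved):
    """Update the counters at write time instead of aggregating at read time."""
    row = report[service]
    row["total"] += 1
    if improved:
        row["improved"] += 1


def report_rows(report):
    """Return (service, total, improved) rows sorted by service name."""
    return [(name, c["total"], c["improved"]) for name, c in sorted(report.items())]


report = new_report()
for service, improved in [("A", True), ("A", False), ("B", True)]:
    record_event(report, service, improved)

print(report_rows(report))  # [('A', 2, 1), ('B', 1, 1)]
```

This only works directly when the criteria are known in advance (for example, per service and per day); for truly ad hoc criteria the counters would have to be keyed on every dimension you want to filter by.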
Hadoop is good for batch processing, which you probably don't need for these counting use cases?
See this list for other typical Hadoop use cases: link.
And here's a resource for typical Mongo+Hadoop use cases: link.

max memory per query

How can I configure the maximum memory that a query (select query) can use in SQL Server 2008?
I know there is a way to set the minimum value but how about the max value? I would like to use this because I have many processes in parallel. I know about the MAXDOP option but this is for processors.
Update:
What I am actually trying to do is run a data load continuously. This data load is in ETL form (extract, transform and load). While the data is being loaded I want to run some queries (select). All of them are expensive queries (containing group by). The most important process for me is the data load. I get an average speed of 10000 rows/sec, and when I run the queries in parallel it drops to 4000 rows/sec and even lower. I know that more details should be provided, but this is a complex product that I work on and I cannot go into more detail. Another thing I can guarantee is that my load speed does not drop due to lock problems, because I monitored and removed them.
There isn't any way of setting a maximum memory at a per query level that I can think of.
If you are on Enterprise Edition you can use resource governor to set a maximum amount of memory that a particular workload group can consume which might help.
In SQL 2008 you can use resource governor to achieve this. There you can set request_max_memory_grant_percent to cap the memory (this is a percentage relative to the pool size specified by the pool's max_memory_percent value). This setting is not query specific, it is session specific.
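To get a feel for how the two percentages combine as described in this answer, here is a rough back-of-the-envelope sketch. It only multiplies the percentages together; the real grant also depends on how SQL Server sizes memory internally, so treat the result as an approximate ceiling. The function name and the sample figures are made up.

```python
def approx_max_grant_mb(server_memory_mb, pool_max_memory_percent,
                        request_max_memory_grant_percent):
    """Rough ceiling for one request's memory grant, in MB.

    Assumes the grant percent applies to the pool size, and the pool size
    is max_memory_percent of the memory available to the server.
    """
    for pct in (pool_max_memory_percent, request_max_memory_grant_percent):
        if not 0 <= pct <= 100:
            raise ValueError("percentages must be between 0 and 100")
    pool_mb = server_memory_mb * pool_max_memory_percent / 100
    return pool_mb * request_max_memory_grant_percent / 100


# Hypothetical server: 16 GB, pool capped at 50%, requests capped at 25% of the pool.
print(approx_max_grant_mb(16384, 50, 25))  # 2048.0
```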
In addition to Martin's answer
If your queries are all the same or similar, working on the same data, then they will be sharing memory anyway.
Example:
A busy web site with 100 concurrent connections running 6 different parametrised queries between them on broadly the same range of data.
6 execution plans
100 user contexts
one buffer pool with assorted flags and counters to show usage of each data page
If you have 100 different queries or they are not parametrised then fix the code.
Memory per query is something I've never thought or cared about since last millennium.