I am just starting with BigQuery. My database is small (one table, about 10K rows) and my queries are simple counts and GROUP BYs.
It takes an average of 3-4 seconds per request, but sometimes it jumps to 10 or even 15 seconds.
I am querying from an Amazon Linux server in Ireland using the bq tool.
Is it possible to get results faster (under 1 second) so that I can serve my web pages faster?
1) BigQuery is a highly scalable database before being a "super fast" database. It is designed to process huge amounts of data by distributing the processing across many machines, using a technology named Dremel. Because it is designed around many machines and parallel processing, you should expect excellent scalability with good performance.
2) BigQuery is an asset when you want to analyze billions of rows.
For example, analyzing all the Wikipedia revisions in 5-10 seconds isn't bad, is it? But a much smaller table takes about the same time, even if it has only 10K rows.
3) Below this size, you'll be better off using more traditional data storage solutions such as Cloud SQL or the App Engine Datastore. If you want to keep SQL capability, Cloud SQL is the best bet.
Sybase IQ is often installed on a single machine and it doesn't use Dremel. As a result, it is going to be faster than BigQuery in many scenarios, as designed.
4) Certainly the performance differs from a dedicated environment. You can get a dedicated environment for $20K a month.
That's the expected behaviour. In BigQuery you are using a shared infrastructure, so depending on the load at that moment you will get better or worse response times. In fact, batch queries (those that don't need interactivity) are encouraged and rewarded by not counting toward your quota.
You typically don't use BigQuery as your main database to show data in your web application. Depending on what you want to do, BigQuery can be a Big Data storage and you should have another intermediate store where you could store computed results to display to your users. Or maybe in your use case you don't really need BigQuery and there is a better solution.
In any case, you are not going to be able to avoid a wait of a few seconds (even if you go Premium, you get more guarantees about the service, but in no case a service fast enough to be your main backend for a web app).
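The "intermediate store" idea above can be sketched as a small time-based cache that sits between the web page and the slow backend. This is only an illustration under assumptions: `TTLCache`, `run_slow_query`, the key name and the 300-second lifetime are all made up for the example, and `run_slow_query` merely stands in for a real BigQuery call.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Keep computed results in memory for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be checked without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh: skip the slow backend entirely
        value = compute()  # missing or expired: pay the multi-second cost once
        self._entries[key] = (now, value)
        return value


def run_slow_query() -> dict:
    # Hypothetical stand-in for a BigQuery request that takes several seconds.
    return {"visits": 10000}


cache = TTLCache(ttl_seconds=300)
first = cache.get_or_compute("visits_by_day", run_slow_query)
second = cache.get_or_compute("visits_by_day", run_slow_query)  # served from memory
```

With something like this in place, only the first page view in each five-minute window waits for the backend. A production setup would more likely use a shared store such as a database table or an external cache, so that all web servers see the same results.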
Related
I have a Singlestore (previously MemSQL) cloud database set up.
My software is running in the background, constantly writing to a table.
When I try to query this table, it takes 10+ seconds. When the software is shut off, the query takes milliseconds.
What would be the reason for this? And is there anything that can be done to mitigate against this?
From a high level, cluster resources are much more heavily utilized while the background software is constantly writing to the table. The same resources that handle the constant writes are concurrently trying to serve the query, so it makes sense that it's faster when there is no writing.
One "knob to turn" for database ingest performance is the partition count. You can try creating a test DB with more partitions than the current DB (say, 2x more). Then try querying the test DB, both while the background software is running and while it is not, and compare the results with the DB that has fewer partitions.
For general guidance on troubleshooting query performance, see this section of the docs: https://docs.singlestore.com/managed-service/en/query-data/query-procedures/troubleshooting-poorly-performing-queries.html
If you're an active customer, you can file a support ticket for the issue to get some additional analysis of the backend workings.
members,
Currently we synchronise sales data into BigQuery, and it allows us to make fast, detailed, practically real-time reports of all kinds of stats that we otherwise would not have available. We want to have a website that is able to use these reports and present this information to website users.
Some specs:
Users use the data as 'read-only'
We want to do the analysis 'on request', so as soon as a user opens the page, we would query BigQuery and the user would see their stats depending on the query
The stats could change because of external sources, but often the result will be the same, so I expect that BigQuery would cache the query
The average query processes about 100 MB of data, and it takes >2 seconds for the whole backend to respond (user request, query, returned result set), so performance is a concern for us
Why I doubt:
BigQuery might not be advised for this
Could it get 'out of hand'?
The dataset will grow bigger, but we will need to keep using all historical data in any case
It would be an option to aggregate data into another database for the main calls, but that would not give me a 'realtime' experience.
I would love to hear your thoughts.
Given your requirements, you can consider BigQuery as an option. BigQuery is fully managed and supports analytics over petabyte-scale data, so it will be able to handle large amounts of data. BigQuery is specifically designed for OLAP workloads, so analysis can be performed on request. BigQuery also caches query results, so a repeated query can fetch its results quickly.
If your dataset is very large and keeps growing, you can create partitioned tables to store and manage your data and to query the tables more easily. If your data volume gets out of hand, BigQuery, being a fully managed service, will handle that load automatically. Historical data can be stored and accessed, but for that you should check the table's expiration time setting and choose the storage options that suit your requirements.
If I have a BigQuery dataset with data that I would like to make available to 1000 people (where each of these people would only be allowed to view their subset of the data, and is OK to view a 24hr stale version of their data), how can I do this without exceeding the 50 concurrent queries limit?
In the BigQuery documentation there's mention of 50 concurrent queries being permitted which give on-the-spot accurate data, which I would surpass if I needed them to all be able to view on-the-spot accurate data - which I don't.
In the documentation there is mention of Batch jobs being permitted and saving of results into destination tables which I'm hoping would somehow allow a reliable solution for my scenario, but am having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and whether or not someone querying results that exist in those destination tables is in itself counting towards the 50 concurrent users limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboarding/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break it up into multiple flat files (or just read it into memory on your frontend). Each user hits your frontend, you determine which part of the output to serve up to them and respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.
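As a rough sketch of the "break it up into multiple flat files" step, the snippet below splits a daily batch result into one JSON file per user and reads a single user's slice back. Everything here is an assumption for illustration: the `user_id` field, the sample rows, the function names and the local directory that stands in for a GCS bucket.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def split_by_user(rows: Iterable[dict], user_field: str = "user_id") -> Dict[str, List[dict]]:
    """Group the rows of the daily batch output by the user who may see them."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[str(row[user_field])].append(row)
    return dict(grouped)


def write_user_files(grouped: Dict[str, List[dict]], out_dir: Path) -> None:
    """Write one flat JSON file per user (a local directory stands in for GCS)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for user_id, user_rows in grouped.items():
        # In real use, validate user_id before building a path from it.
        (out_dir / f"{user_id}.json").write_text(json.dumps(user_rows))


def load_user_rows(user_id: str, out_dir: Path) -> List[dict]:
    """What the frontend does per request: read only that user's file."""
    path = out_dir / f"{user_id}.json"
    if not path.exists():
        return []
    return json.loads(path.read_text())


# Made-up rows standing in for the once-a-day query result.
rows = [
    {"user_id": "alice", "day": "2024-01-01", "orders": 3},
    {"user_id": "bob", "day": "2024-01-01", "orders": 1},
    {"user_id": "alice", "day": "2024-01-02", "orders": 5},
]
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    write_user_files(split_by_user(rows), out_dir)
    served = load_user_rows("alice", out_dir)    # two rows for alice
    missing = load_user_rows("carol", out_dir)   # no file, so an empty list
```

Because the frontend only ever reads these files, the 1000 users never issue BigQuery queries themselves. Only the daily batch job does.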
These may be a few basic questions.
When I load data into BQ tables, where exactly is the data stored (assuming billing is already enabled)? If it is a data center, what is its capacity? Does our data co-exist with other users' data?
When we fire queries, how are they processed? What is the default compute engine used for this?
How can we increase query processing capacity?
Thanks
CP
BigQuery datacenter capacity is practically unlimited. If you plan to upload petabytes in a very short time frame you might need to contact support first just to make sure, but for normal big loads everything should be fine.
BigQuery doesn't use Compute Engine, but a series of very large clusters where all queries run. That's the secret to a low cost per query, without ongoing per-hour costs like other alternatives.
BigQuery increases the number of CPUs involved in your query elastically as the query needs. You don't need to manage storage nor processing capacity.
Currently, as a single user, it takes 260 ms for a certain query to run from start to finish.
What will happen if 1000 queries are sent at the same time? Should I expect the same query to take ~4 minutes (260 ms × 1000)?
It is not possible to make predictions without any knowledge of the situation. There will be a number of factors which affect this time:
Resources available to the server (if it is able to hold data in memory, things run quicker than if disk is being accessed)
What is involved in the query (e.g. a repeated query will usually execute quicker the second time around, assuming the underlying data has not changed)
What other bottlenecks are in the system (e.g. if the webserver and database server are on the same system, the two processes will be fighting for the available resources under heavy load)
The only way to properly answer this question is to perform load testing on your application.
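For reference, 260 ms × 1000 = 260 s, or about 4.3 minutes, and that figure would only apply if the server ran every query strictly one after another. A minimal load-test sketch that measures what actually happens under concurrency is shown below. `fake_query` is a hypothetical placeholder that just sleeps, and the request and concurrency numbers are arbitrary. Swap in the real database call to get meaningful numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_load_test(task: Callable[[], None], requests: int, concurrency: int) -> List[float]:
    """Call `task` `requests` times on `concurrency` threads; return per-call latencies in seconds."""

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        task()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(requests)))


def fake_query() -> None:
    # Hypothetical stand-in for the real query; replace with an actual database call.
    time.sleep(0.01)


wall_start = time.perf_counter()
latencies = sorted(run_load_test(fake_query, requests=50, concurrency=10))
wall_time = time.perf_counter() - wall_start

median = latencies[len(latencies) // 2]
print(f"median={median:.3f}s max={latencies[-1]:.3f}s wall={wall_time:.3f}s")
```

Comparing the median and maximum latency at increasing concurrency levels shows where the server starts to queue requests, which is the information a simple multiplication cannot give.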