How to improve frequent BigQuery reads? - google-bigquery

I'm using BigQuery for Java to do small reads on a table with about ~5GB of data. The queries I run follow the most standard SQL, like SELECT foo FROM my-table WHERE bar=$1, where the result will be at most 1 row. I need to do this at a high frequency, so performance is a big concern. How do I optimize for this?
I thought about pulling the entire data set periodically since it's only 5GB, but then again 5GB sounds like a lot to be constantly keeping in memory.
Running this query in the BigQuery console shows something like Query complete (0.6 sec elapsed, 4.2 GB processed). That's fast for 4.2 GB, but not fast enough. Again, I need to read from it very frequently but rarely (maybe once a day or once a week) write to it.
Maybe tell the server to cache the processed data somehow?

You don't have control over the cache layer in BigQuery; that is something the service does automatically for you. Unfortunately, the typical cache lifetime is 24 hours, and the cached results are best-effort and may be invalidated sooner (official docs).
A query completing in 0.6s seems to be good for BQ. I'm afraid that if you are looking for something faster, BigQuery may not be the data warehouse for your use case.

BigQuery is built for analytical processing and not to interact with individual rows. The best practice would be as you mentioned to hold a copy of it in a place that allows quicker and more efficient reading of individual rows (like a MySQL database).
However, you can still vastly optimize the amount of data scanned in your query by clustering the table on the field that you're filtering on.
https://cloud.google.com/bigquery/docs/creating-clustered-tables
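For illustration, a rough sketch with the google-cloud-bigquery Java client could look like the following. The dataset, table, and column names (my_dataset.my_table, foo, bar) are placeholders based on the question, and the CREATE TABLE ... CLUSTER BY ... AS SELECT statement assumes your project supports clustering without partitioning (recent BigQuery versions do):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;
import com.google.cloud.bigquery.TableResult;

public class ClusteredLookup {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // One-time DDL: rebuild the table clustered on the filter column so point
    // lookups scan only the blocks containing the matching "bar" values.
    bigquery.query(QueryJobConfiguration.of(
        "CREATE TABLE my_dataset.my_table_clustered "
            + "CLUSTER BY bar "
            + "AS SELECT * FROM my_dataset.my_table"));

    // The frequent read, as a parameterized query against the clustered table.
    QueryJobConfiguration lookup = QueryJobConfiguration
        .newBuilder("SELECT foo FROM my_dataset.my_table_clustered WHERE bar = @bar")
        .addNamedParameter("bar", QueryParameterValue.string("some-key"))
        .build();
    TableResult result = bigquery.query(lookup);
    result.iterateAll().forEach(row -> System.out.println(row.get("foo").getValue()));
  }
}
```

With clustering, the bytes billed and scanned for such a point lookup should drop well below the full 4.2 GB, although each request is still a full BigQuery query job.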

Related

Occasional slow performance copying data across projects in bigquery

I have encountered very slow performance when copying data from one project to another project located in the same data location in BigQuery. It took up to 2 minutes to move only about 100,000 records, compared to other copy operations we have done on BigQuery with hundreds of millions of records that took only a matter of a few seconds. I would like to find out why this unusually slow copy of such a small data set occurred. Did anyone come across a similar issue and have any idea what could be the cause?
Thanks.
The cause of the slow copy could be the way your source table was created; for example, it may have been built up by several import jobs, which can cause this kind of fragmentation.
So the difference in time is not due to the amount of data stored in your table, but to the way the data is fragmented inside it.
Although the running time is very reasonable, if you want to speed it up further, you can try to coalesce/merge your table. One way of doing this is to export the table to Google Cloud Storage and re-import it back (overwrite, not append). This should reduce the fragmentation and help if you want to optimize your operations and gain a few seconds.
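If you go the export/re-import route, a hedged sketch with the BigQuery Java client might look like this; the dataset, table, and bucket names are placeholders, and Avro is just one of the supported export formats:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class DefragmentTable {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId tableId = TableId.of("my_dataset", "my_table");
    String gcsUri = "gs://my-bucket/defrag/my_table-*.avro";

    // 1. Export the fragmented table to Cloud Storage.
    Table table = bigquery.getTable(tableId);
    Job extract = table.extract("AVRO", gcsUri).waitFor();
    if (extract == null || extract.getStatus().getError() != null) {
      throw new IllegalStateException("Extract failed");
    }

    // 2. Load it back with WRITE_TRUNCATE (overwrite, not append), which
    //    rewrites the table's storage in a compact, non-fragmented form.
    LoadJobConfiguration load = LoadJobConfiguration.newBuilder(tableId, gcsUri)
        .setFormatOptions(FormatOptions.avro())
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
        .build();
    Job loadJob = bigquery.create(JobInfo.of(load)).waitFor();
    if (loadJob == null || loadJob.getStatus().getError() != null) {
      throw new IllegalStateException("Load failed");
    }
  }
}
```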
A running time of a few minutes is considered internally as absolutely normal for a table copy job, so this does not classify as a BigQuery deficiency.
Refer to the official documentation. If you want to know more about fragmentation in BigQuery, I recommend the O'Reilly book "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale".
I hope you find the above pieces of information useful.

Rapidly changing large data processing advice

My team has the following dilemma that we need some architectural/resource advice on:
Note: Our data is semi-structured
Overall Task:
We have semi-large data that we process during the day
this "process" gets executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use a SQL DB to store all the data and retrieve/update it as the process requires
Unsolved problems and desired goals:
since these processes are user-triggered, we never know when to scale up/down, which causes high spikes; Azure doesn't make it easy to autoscale based on demand without a data warehouse, which we want to stay away from because of the lack of aggregates and various other "buggy" issues
because of constant I/O to the DB, we hit 100% of DTU when one process begins (we are using an Azure P1 DB), which of course will force us to grow even larger if multiple processes start at the same time (which is very likely)
while we understand this kind of cost comes with high-compute tasks, we think there is a better way to go about this (the SQL is about 99% optimized, so there is not much left to do there)
We are looking for some tool that can:
Process a large number of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (i know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have a ton of experience with HDP, so any pros/cons here will certainly be useful (does the use case fit the tool?)
ArangoDB
seems promising; it seems fast and has all the aggregations we need.
Azure Data Warehouse
we ran into too many various issues; it just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
It's hard to try them all and compare which one fits best, as we have a fully functional system and would have to make adjustments for whichever way we go.
Hence, any opinions beforehand are welcome before we pull the trigger.

Web application receiving millions of requests and generating millions of row inserts per 30 seconds in SQL Server 2008

I am currently addressing a situation where our web application receives at least a million requests per 30 seconds. These requests lead to 3-5 million row inserts across 5 tables, which is a pretty heavy load to handle. Currently we are using multi-threading to handle this situation (which is a bit faster, but we are unable to get better CPU throughput). However, the load will definitely increase in the future and we will have to account for that too. Six months from now we are looking at double the load we currently receive, and I am looking for a possible new solution that is scalable and easy enough to accommodate any further increase in load.
Currently, multi-threading is making the whole debugging scenario quite complicated, and sometimes we have problems tracing issues.
FYI, we are already utilizing the SQL Bulk Insert/Copy that is mentioned in this previous post:
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also, the solution should make better use of threads and processes, and I do not want my threads/processes to have to wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert; however, most of them will lead to some SQL operation. The application performs different types of transactions, and these lead to a lot of bulk SQL operations. I am more concerned with inserts and updates.
These operations need not be real-time; there can be a bit of lag. However, processing them in real time would be very helpful.
I think your problem is more about getting better CPU throughput, which will lead to better performance. So I would probably look at something like asynchronous processing, where a thread never sits idle; you will probably have to maintain a queue in the form of a linked list or any other data structure that suits your programming model.
The way this would work is that your threads try to perform a given job immediately, and if anything stops them from doing so, they push that job into the queue; the pushed items are then processed according to how the container/queue stores them.
In your case, since you are already using bulk SQL operations, you should be good to go with this strategy.
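Although the question asks for concepts rather than code, a minimal sketch may make the idea concrete. The table name, payload shape, and batch size below are hypothetical; the point is that request threads only enqueue work, while a dedicated worker drains the queue and writes whole batches:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncInsertWorker implements Runnable {
  // Hypothetical payload for one pending insert.
  public record PendingInsert(long requestId, String payload) {}

  private final BlockingQueue<PendingInsert> queue = new LinkedBlockingQueue<>();
  private final String jdbcUrl;

  public AsyncInsertWorker(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }

  // Request threads never block on the database: they just enqueue the work.
  public void submit(PendingInsert insert) { queue.add(insert); }

  @Override
  public void run() {
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO requests_log (request_id, payload) VALUES (?, ?)")) {
      List<PendingInsert> batch = new ArrayList<>();
      while (!Thread.currentThread().isInterrupted()) {
        // Block until at least one item arrives, then drain whatever else is
        // queued, so each round trip to the database carries a whole batch.
        batch.add(queue.take());
        queue.drainTo(batch, 5_000);
        for (PendingInsert item : batch) {
          ps.setLong(1, item.requestId());
          ps.setString(2, item.payload());
          ps.addBatch();
        }
        ps.executeBatch();
        batch.clear();
      }
    } catch (Exception e) {
      Thread.currentThread().interrupt();
    }
  }
}
```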
Let me know if this helps you.
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition of the data by client, geography, or some other factor?
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high-performance design, including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need your inserts/updates to be real-time, you might consider having two databases: one for reads and one for writes, similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the insert/update DB. You would then create a publication process that moves data over to the read database at certain time intervals. When I have seen this in the past, the data is usually moved over on a nightly basis, when few people are using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.
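As an illustration only, a hand-rolled publication step could be as simple as the following JDBC sketch; SSIS or replication would normally do this job, and the connection strings, table, and column names here are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.LocalDate;

public class NightlyPublish {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection strings for the write-optimized and read-optimized DBs.
    try (Connection writeDb = DriverManager.getConnection("jdbc:sqlserver://write-host;databaseName=oltp");
         Connection readDb = DriverManager.getConnection("jdbc:sqlserver://read-host;databaseName=reporting")) {
      readDb.setAutoCommit(false);
      Timestamp since = Timestamp.valueOf(LocalDate.now().minusDays(1).atStartOfDay());

      try (PreparedStatement src = writeDb.prepareStatement(
               "SELECT id, payload, updated_at FROM orders WHERE updated_at >= ?");
           PreparedStatement dst = readDb.prepareStatement(
               "INSERT INTO orders_read (id, payload, updated_at) VALUES (?, ?, ?)")) {
        src.setTimestamp(1, since);
        try (ResultSet rs = src.executeQuery()) {
          while (rs.next()) {
            dst.setLong(1, rs.getLong("id"));
            dst.setString(2, rs.getString("payload"));
            dst.setTimestamp(3, rs.getTimestamp("updated_at"));
            dst.addBatch();          // accumulate rows, write them in few round trips
          }
        }
        dst.executeBatch();
        readDb.commit();             // the copied rows become visible in one commit
      }
    }
  }
}
```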

SQL DB performance and repeated queries at short intervals

If a query is constantly sent to a database at short intervals, say every 5 seconds, could the number of reads generated cause problems in terms of performance or availability? If the database is Oracle are there any tricks that can be used to avoid a performance hit? If the queries are coming from an application is there a way to reduce any impact through software design?
Unless your query is very intensive or horribly written, it won't cause any noticeable issues running once every few seconds. That's not very often for queries that are generally measured in milliseconds.
You may still want to optimize it though, simply because there are better ways to do it. In Oracle and ADO.NET you can use an OracleDependency for the command that ran the query the first time and then subscribe to its OnChange event which will get called automatically whenever the underlying data would cause the query results to change.
It depends on the query. I assume the reason you want to execute it periodically is that the data being returned will change frequently. If that's the case, then application-level caching is obviously not an option.
Past that, is this query "big" in terms of the number of rows returned, tables joined, data aggregated / calculated? If so, it could be a problem if:
You are querying more often than the query takes to execute. If you are calling it once a second, but it takes 2 seconds to run, that's going to become a problem.
If the query is touching a lot of data and you have a lot of other queries accessing the same tables, you could run into lock escalation issues.
As with most performance questions, the only real answer is to test. In this case test with realistic data in the DB and run this query concurrent with the other query load you expect on the system.
Along the lines of Samuel's suggestion, Oracle provides facilities in JDBC to do database change notification so that your application can subscribe to changes in the underlying data rather than re-running the query every few seconds. If the data is changing less frequently than you're running the query, this can be a major performance benefit.
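A rough sketch of the JDBC route, assuming the Oracle JDBC driver (ojdbc) is on the classpath and treating the connection details, table, and columns as placeholders, might look like this:

```java
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import oracle.jdbc.OracleConnection;
import oracle.jdbc.OracleStatement;
import oracle.jdbc.dcn.DatabaseChangeEvent;
import oracle.jdbc.dcn.DatabaseChangeListener;
import oracle.jdbc.dcn.DatabaseChangeRegistration;

public class ChangeNotificationExample {
  public static void main(String[] args) throws Exception {
    OracleConnection conn = (OracleConnection) DriverManager.getConnection(
        "jdbc:oracle:thin:@//db-host:1521/ORCL", "app_user", "secret");

    // Register for database change notification instead of polling every few seconds.
    Properties props = new Properties();
    props.setProperty(OracleConnection.DCN_NOTIFY_ROWIDS, "true");
    DatabaseChangeRegistration dcr = conn.registerDatabaseChangeNotification(props);
    dcr.addListener(new DatabaseChangeListener() {
      @Override
      public void onDatabaseChangeNotification(DatabaseChangeEvent event) {
        // Only now re-run the query (or refresh the application-side cache).
        System.out.println("Underlying data changed: " + event);
      }
    });

    // Associate the query with the registration so its result set is "watched".
    try (Statement stmt = conn.createStatement()) {
      ((OracleStatement) stmt).setDatabaseChangeRegistration(dcr);
      try (ResultSet rs = stmt.executeQuery("SELECT id, status FROM orders")) {
        while (rs.next()) { /* populate the initial cache */ }
      }
    }
  }
}
```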
Another option would be to use Oracle TimesTen as an in memory cache of the data on the middle tier machine(s). That will reduce the network round-trips and it will go through a very optimized retrieval path.
Finally, I'd take a look at using the query result cache to have Oracle cache the results.
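For completeness, using the result cache from JDBC is just a hint in the query text; this sketch assumes the server result cache is enabled (Oracle 11g or later) and uses placeholder table and column names:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ResultCacheQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:oracle:thin:@//db-host:1521/ORCL", "app_user", "secret");
         // The RESULT_CACHE hint asks Oracle to serve repeat executions from the
         // server result cache until the underlying rows change.
         PreparedStatement ps = conn.prepareStatement(
             "SELECT /*+ RESULT_CACHE */ status FROM orders WHERE id = ?")) {
      ps.setLong(1, 42L);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) { System.out.println(rs.getString("status")); }
      }
    }
  }
}
```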

real-time data warehouse for web access logs

We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that:
(1) the data can be fetched and presented to the user in real time (<2 seconds),
(2) data can be aggregated on a per-hour and per-day basis, and
(3) a large amount of data can still be stored in the warehouse.
Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds when we have more than 8 million rows.
Is it possible to get real-time query performance from a "simple" data warehouse like this,
and still have it store a lot of data (it would be nice to never have to throw away any data)?
Are there ways to aggregate the data into higher-level (e.g. hourly or daily) summary tables?
I have a feeling that this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this:
dimensions:
client (ip-address)
server
url
facts:
timestamp (in seconds)
bytes transmitted
Seth's answer above is very reasonable, and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology that you might want to investigate would be MongoDB. It is a document store database that has a few features that make it potentially a great fit for this use case.
Namely, the capped collections (do a search for mongodb capped collections for more info)
And the fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
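As a hedged sketch with the MongoDB Java driver (the collection names, size limit, and fields below are placeholders), the two features mentioned above would be used roughly like this:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.CreateCollectionOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class MongoAnalyticsSketch {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoDatabase db = client.getDatabase("weblogs");

      // Capped collection: a fixed-size, insert-order buffer for raw hits;
      // old documents age out automatically once the size limit is reached.
      db.createCollection("raw_hits",
          new CreateCollectionOptions().capped(true).sizeInBytes(512L * 1024 * 1024));
      db.getCollection("raw_hits").insertOne(
          new Document("url", "/index.html").append("bytes", 2048));

      // Fast atomic increment for counters such as page views.
      db.getCollection("pageviews").updateOne(
          Filters.eq("url", "/index.html"),
          Updates.inc("views", 1),
          new UpdateOptions().upsert(true));
    }
  }
}
```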
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
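A rough sketch of that layout, executed here through JDBC with placeholder table and column names, could look like this; note that all underlying tables of a MERGE table must be identical MyISAM tables:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LogTableSetup {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost:3306/weblogs", "app", "secret");
         Statement stmt = conn.createStatement()) {

      // One MyISAM table per month keeps each physical table a manageable size.
      stmt.execute("CREATE TABLE access_log_2010_01 ("
          + "  ts DATETIME NOT NULL,"
          + "  client VARCHAR(45), url VARCHAR(255), bytes INT"
          + ") ENGINE=MyISAM");
      stmt.execute("CREATE TABLE access_log_2010_02 LIKE access_log_2010_01");

      // A MERGE table exposes the identical MyISAM tables as one logical table;
      // new rows go to the last table in the UNION list.
      stmt.execute("CREATE TABLE access_log ("
          + "  ts DATETIME NOT NULL,"
          + "  client VARCHAR(45), url VARCHAR(255), bytes INT"
          + ") ENGINE=MERGE UNION=(access_log_2010_01, access_log_2010_02)"
          + "  INSERT_METHOD=LAST");
    }
  }
}
```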
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)
Also, look into partitioning, especially if your queries mostly access the latest data; you could -- for example -- set up weekly partitions of ~5.5M rows.
If aggregating per-day and per-hour, consider having date and time dimensions -- you did not list them, so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as the fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.
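To make that concrete, a hedged sketch (MySQL syntax executed through JDBC, with hypothetical table, column, and partition names) might look like this; the query filters on the surrogate date key rather than DATE(ts) or HOUR(ts), so partition pruning can apply:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PartitionedFactTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost:3306/weblogs", "app", "secret");
         Statement stmt = conn.createStatement()) {

      // Fact table keyed by surrogate date/time keys and range-partitioned by week.
      stmt.execute("CREATE TABLE fact_access ("
          + "  date_key INT NOT NULL, time_key INT NOT NULL,"
          + "  client_key INT, server_key INT, url_key INT, bytes INT"
          + ") PARTITION BY RANGE (date_key) ("
          + "  PARTITION p2010w01 VALUES LESS THAN (20100111),"
          + "  PARTITION p2010w02 VALUES LESS THAN (20100118),"
          + "  PARTITION pmax VALUES LESS THAN MAXVALUE)");

      // Filtering on the surrogate key lets the optimizer prune to the one or
      // two relevant weekly partitions instead of scanning the whole table.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT date_key, SUM(bytes) FROM fact_access"
              + " WHERE date_key BETWEEN 20100104 AND 20100110"
              + " GROUP BY date_key")) {
        while (rs.next()) {
          System.out.println(rs.getInt(1) + " -> " + rs.getLong(2));
        }
      }
    }
  }
}
```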
This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1 second response time (from database), over a second from web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data - especially for time-series graphing over days, months, etc. Aggregate data can either be created periodically on your database through an asynchronous process, or, in cases like this, it typically works best if the ETL process that transforms your data creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
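As a small, hypothetical illustration (table and column names invented, reusing the fact table from the sketch above), the daily aggregate could be rebuilt with a single group-by over the fact table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DailyAggregateBuild {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost:3306/weblogs", "app", "secret");
         Statement stmt = conn.createStatement()) {
      // Rebuild the daily aggregate as a plain GROUP BY over the detail fact table.
      stmt.execute("DROP TABLE IF EXISTS agg_access_daily");
      stmt.execute("CREATE TABLE agg_access_daily AS"
          + " SELECT date_key, server_key, url_key,"
          + "        COUNT(*) AS hits, SUM(bytes) AS total_bytes"
          + " FROM fact_access"
          + " GROUP BY date_key, server_key, url_key");
    }
  }
}
```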
As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database - if it were me I'd try Postgresql rather than MySQL. The reason is primarily optimizer maturity: postgresql can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, go bottom up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like db2, oracle, sql server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.