I just received this from Google Support, and it surprised me as I didn't know there was a congestion issue - do other people have this experience?
"To fasten up your query, I would recommend that you try to run your query in other time like midnight."
BigQuery is nocturnal, so it runs better in the dark. There are fewer predators around, so BigQuery can be free to express itself and cavort across the prairies near the Google Datacenters.
Other techniques to "enfasten" the queries involve running them from the ley lines of power, which are described in the Alchemical diaries of Hermes Trismegistus. Unfortunately, I am not permitted to share their location, and may be putting myself at risk of excommunication from a number of secret societies by just mentioning their existence.
Finally, if you name your tables with the suffix __Turbo, BigQuery will run them in turbo mode, which means they run on 486/66 processors instead of the default Z80 datapath.
Edited to add:
In a non-snarky answer: if you do not have reserved BigQuery capacity (i.e. fixed-price reservations), you may experience lower throughput at certain times. BigQuery uses a shared pool of resources, so if lots of other customers are running queries at the same time, there may not be enough to give every query all the resources it would need to run at full speed.
That said, BigQuery uses a very large pool of resources, and we (currently) run at a utilization rate where every user gets nearly all of the resources they need nearly all of the time.
If you are seeing your queries slow down by 20% at certain times of the day, this might not be surprising. If you see queries take 2 or 3 times as long as they usually do, there is probably something else going on.
Related
My team has the following dilemma, and we need some architecture/resources advice:
Note: Our data is semi-structured
Overall Task:
We have a semi-large dataset that we process during the day
this "process" gets executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for the UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use a SQL database to store all the data and retrieve/update it as the process requires
Unsolved problems and desired goals:
since these processes are user-triggered, we never know when to scale up/down, which causes high spikes; Azure doesn't make it easy to autoscale based on demand without a data warehouse, which we want to stay away from because of the lack of aggregates and various other "buggy" issues
because of constant I/O to the db, we hit 100% of DTU as soon as one process begins (we are using an Azure P1 DB), which will of course force us to grow even larger if multiple processes start at the same time (which is very likely)
while we understand that high-compute tasks come at a cost, we think there is a better way to go about this (the SQL is about 99% optimized, so there is not much left to do there)
We are looking for some tool that can:
Process a large number of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (i know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have a ton of experience with HDP, so any pros/cons here will certainly be useful (does the use case fit the tool?)
ArangoDB
Seems promising; it looks fast and has all the aggregations we need.
Azure Data Warehouse
too many issues we ran into; it just didn't work for us.
Any GPU-accelerated compute or other high-end ideas are also welcome.
It's hard to try them all and compare which one fits best, as we have a fully functional system and will have to make adjustments for whichever way we go.
Hence, any beforehand opinions are welcome before we pull the trigger.
For the Google BigQuery infrastructure folks: we've been running a set of short-running interactive queries for many months now, averaging about 5 seconds to complete. Starting Friday 2/19, these response times have been rising steadily (the SQL has not changed, and we're dealing with a steady stream of data that we're querying using a sliding window).
Is this a global BigQuery issue you are aware of?
edit: more granular response times:
There is good news and bad news; the good news is that the query took only 0.5 seconds to execute. The bad news is that it took 191 seconds to find the files where the data was stored.
We have a couple of performance regressions that cause high tail latency for resolving paths. Tables (like yours) where the data is stored in many paths will see worse performance.
This performance issue is exacerbated by the fact that you're using time-range decorators, which means that our efforts to optimize the file layout don't work as well.
We are starting the roll-out of a fix to the underlying performance problem this afternoon; it will likely take at least a week for it to take effect everywhere. I'll update this answer once it is complete (if I forget, please remind me).
In the meantime, you may get faster results by removing the time-range decorators from your queries. You are already filtering by time, so the queries should still be correct. Of course, this may mean that the queries cost a bit more to run.
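For concreteness, here is roughly what that change looks like in legacy BigQuery SQL; the table and column names below are made up for illustration, not taken from the reporter's job:

    -- Before: a time-range decorator (last hour) plus an explicit time filter.
    SELECT event_id, event_time
    FROM [mydataset.events@-3600000-]
    WHERE event_time > DATE_ADD(CURRENT_TIMESTAMP(), -1, 'HOUR');

    -- After: drop the decorator and rely on the existing time filter.
    SELECT event_id, event_time
    FROM [mydataset.events]
    WHERE event_time > DATE_ADD(CURRENT_TIMESTAMP(), -1, 'HOUR');

The decorated version scans less data, which is why dropping it can cost a bit more, but it avoids the path-resolution slow path described above.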
My Rails application always reaches the threshold of the disk I/O rate set by my VPS at Linode. It's set at 3000 (I upped it from 2000), and every hour or so I get a notification that it has reached 4000-5000+.
What are the methods that I can use to minimize the disk I/O rate? I mostly use Sphinx (the Thinking Sphinx plugin) and latitude/longitude distance searches.
What are the methods to avoid?
I'm using Rails 2.3.11 and MySQL.
Thanks.
Did you check whether your server is swapping itself to death? What does "top" say?
Your Linode may have limited RAM, and it is very likely that it is swapping like crazy to keep things running.
If you see red in the IO graph, that is swapping activity! You need to upgrade your Linode to more RAM, or limit the number/size of processes which are running. You should also add approximately 2x the RAM size as swap space (a swap partition).
http://tinypic.com/view.php?pic=2s0b8t2&s=7
Your question is too vague to answer concisely, but this is generally a sign of one of a few things:
Your data set is too large because of historical data that you could prune. Delete what is no longer relevant.
Your tables are not indexed properly and you are hitting a lot of table scans. Check each of your slow queries with EXPLAIN (see the sketch after this list).
Your data structure is not optimized for the way you are using it, and you are doing too many joins. Some tactical de-normalization would help here. Make sure all your JOIN queries are strictly necessary.
You are retrieving more data than is required to service the request. It is, sadly, all too common that people load enormous TEXT or BLOB columns from a user table when displaying only a list of user names. Load only what you need.
You're being hit by some kind of automated scraper or spider robot that's systematically downloading your entire site, page by page. You may want to alter your robots.txt if this is an issue, or start blocking troublesome IPs.
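To illustrate the indexing point, here is a minimal MySQL sketch; the query, table, and column names are hypothetical:

    -- Ask MySQL how it executes one of the slow queries from your log.
    EXPLAIN SELECT id, name
    FROM users
    WHERE last_seen_at > '2011-06-01';

    -- If the plan shows type: ALL (a full table scan) and no usable key,
    -- an index on the filtered column is usually the first fix to try.
    CREATE INDEX index_users_on_last_seen_at ON users (last_seen_at);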
Is it going high and staying high for a long time, or is it just spiking temporarily?
There aren't going to be specific methods to avoid (other than not writing to disk).
You could try using a profiler in production, like NewRelic, to get more insight into your performance. A profiler will highlight the actions that are taking a long time, and when you examine the specific algorithm you're using in that action, you might discover what's inefficient about it.
I am currently addressing a situation where our web application receives at least a million requests every 30 seconds. These requests lead to 3-5 million row inserts across 5 tables. This is a pretty heavy load to handle. Currently we are using multithreading to handle this situation (which is a bit faster, but we are unable to get better CPU throughput). However, the load will definitely increase in the future, and we will have to account for that too. Six months from now we are looking at double the load we currently receive, and I am looking for a possible new solution that is scalable and easy enough to accommodate any further increase in load.
Currently, multithreading makes the whole debugging scenario quite complicated, and we sometimes have problems tracing issues.
FYI, we are already utilizing the SQL Bulk Insert/Copy that is mentioned in this previous post:
Sql server 2008 - performance tuning features for insert large amount of data
However, I am looking for a more capable solution (and I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also, the solution should make better use of threads and processes, and I do not want my threads/processes to sit waiting to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert; however, most of them will lead to some SQL operation. The application performs different types of transactions, and these lead to a lot of bulk SQL operations. I am more concerned with inserts and updates.
These operations need not be real-time; there can be a bit of lag. However, processing them in real time would be very helpful.
I think your problem is more about getting better CPU throughput, which will lead to better performance. So I would probably look at something like asynchronous processing, where a thread never sits idle; you would maintain a queue in the form of a linked list or any other data structure that suits your programming model.
The way this would work: your threads try to perform a given job immediately, and if anything stops them from doing it, they push that job onto the queue; the pushed items are then processed in whatever order the container/queue stores them.
In your case, since you are already using bulk SQL operations, you should be good to go with this strategy (a rough sketch of a database-side variant is below).
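If you want to keep the queue out of application memory, one variant (not the only way to do this) is a staging table that request threads append to and a background worker drains in bulk. This is only a sketch; every table and column name is made up, and it assumes SQL Server:

    -- Hypothetical staging table that request threads append to instead of
    -- writing to the destination tables directly.
    CREATE TABLE dbo.PendingWork (
        WorkId     bigint IDENTITY(1,1) PRIMARY KEY,
        Payload    nvarchar(max) NOT NULL,
        EnqueuedAt datetime NOT NULL DEFAULT GETDATE()
    );

    -- Hypothetical destination table, just so the sketch is self-contained.
    CREATE TABLE dbo.ProcessedWork (
        Payload     nvarchar(max) NOT NULL,
        ProcessedAt datetime NOT NULL
    );

    -- A background worker drains the queue in batches; READPAST lets several
    -- workers run concurrently without blocking on each other's locked rows.
    DELETE TOP (10000)
    FROM dbo.PendingWork WITH (READPAST)
    OUTPUT deleted.Payload, GETDATE()
    INTO dbo.ProcessedWork (Payload, ProcessedAt);

The request threads then only ever do a small, cheap insert, and the expensive work happens as a handful of bulk statements, which matches the bulk operations you are already doing.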
lemme know if this helps you.
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition to the data by client or geography or some other factor?
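If the answer to any of those is yes, table partitioning may help spread the insert load. A rough SQL Server sketch, with purely illustrative names and boundary dates:

    -- Partition a hypothetical Orders table by month so inserts for different
    -- date ranges land in different partitions.
    CREATE PARTITION FUNCTION pfOrdersByMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

    CREATE PARTITION SCHEME psOrdersByMonth
    AS PARTITION pfOrdersByMonth ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Orders (
        OrderId   bigint        IDENTITY(1,1) NOT NULL,
        ClientId  int           NOT NULL,
        CreatedAt datetime      NOT NULL,
        Payload   nvarchar(max) NULL
    ) ON psOrdersByMonth (CreatedAt);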
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high-performance design, including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need your inserts/updates to be real-time, you might consider having two databases: one for reads and one for writes, similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance.
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that would move data over to the read database at certain time intervals. When I have seen this in the past, the data is usually moved over on a nightly basis, when few people will be using the site. There are a number of options for moving the data over, but I would start by looking at SSIS (a rough sketch of the idea follows below).
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.
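A minimal T-SQL sketch of that publication step, assuming two databases called WriteDb and ReadDb, a hypothetical Orders table, and a LastModified column acting as the watermark (an SSIS package would typically wrap logic like this):

    -- Push the last day's changes from the write database to the read database.
    MERGE ReadDb.dbo.Orders AS tgt
    USING (
        SELECT OrderId, ClientId, Total, LastModified
        FROM WriteDb.dbo.Orders
        WHERE LastModified >= DATEADD(DAY, -1, GETDATE())
    ) AS src
    ON tgt.OrderId = src.OrderId
    WHEN MATCHED THEN
        UPDATE SET tgt.ClientId = src.ClientId,
                   tgt.Total = src.Total,
                   tgt.LastModified = src.LastModified
    WHEN NOT MATCHED THEN
        INSERT (OrderId, ClientId, Total, LastModified)
        VALUES (src.OrderId, src.ClientId, src.Total, src.LastModified);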
I'm sure this repeats everywhere. You can 'feel' that the network is slow, or the machine is slow, or something. But the server/chassis logs are not showing anything, so IT doesn't believe you. What do you do?
Your regressions are taking twice the time... but that's not enough.
Okay, you transfer 100 GB using dd etc., but... that's not enough.
Okay, you get the server placed in a different chassis for 2 weeks, and it works fine... but... that's not enough...
So HOW do you get IT to replace the chassis?
More specifically:
Is there any suite I can run on two setups (supposed to be identical) that can show up differences in network/CPU/disk access, which IT will believe?
Computers don't age and slow down the same way we do. If your server is getting slower -- actually slower, not just feeling slower because every other computer you use is getting faster -- then there is a reason, and it is possible that you may be able to fix it. I'd try cleaning up some disk space, de-fragmenting the disk, and checking what other processes are running (perhaps someone's added more apps to the system and you're just not getting as many cycles).
If your app uses a database, you may want to analyze your query performance and see if some indices are in order. Queries that perform well when you have little data can start taking a long time as the amount of data grows if they have to use table scans. As a former "IT" guy, I'd also be reluctant to throw hardware at a problem because someone tells me the system is slowing down. I'd want to know what has changed and see if I could get the system running the way it should be. If the app has simply outgrown the hardware -- after you've made suitable optimizations -- then upgrading is a reasonable choice.
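If that database happens to be SQL Server, for example, one quick way to see whether indices are in order is the missing-index DMVs; this is purely illustrative, and other engines have their own equivalents (e.g. EXPLAIN output):

    -- Indexes SQL Server wishes it had, roughly ranked by estimated benefit.
    SELECT TOP 10
        mid.statement AS table_name,
        mid.equality_columns,
        mid.inequality_columns,
        mid.included_columns,
        migs.user_seeks,
        migs.user_seeks * migs.avg_total_user_cost * migs.avg_user_impact AS rough_benefit
    FROM sys.dm_db_missing_index_details AS mid
    JOIN sys.dm_db_missing_index_groups AS mig
        ON mig.index_handle = mid.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS migs
        ON migs.group_handle = mig.index_group_handle
    ORDER BY rough_benefit DESC;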
Run a standard benchmark suite. See if it pinpoints memory, CPU, bus, or disk when compared to a "working" similar computer.
See http://en.wikipedia.org/wiki/Benchmark_(computing)#Common_benchmarks for some tips.
The only way to prove something is to do a stringent audit.
Traditionally, we keep the system constant between two different sets while altering the variable we are interested in. In this case the variable is the hardware that your code is running on. So in simple terms, you should audit the running of your software on two different sets of hardware, one being the hardware you are unhappy about, and see the difference.
Now if you are to do this properly, which I am sure you will, you first need to come up with a null hypothesis, something like:
"The slowness of the application is
unrelated to the specific hardware we
are using"
And now you set about disproving that hypothesis in favour of an alternative hypothesis. Once you have collected enough results, you can apply statistical analyses to them to decide whether any differences are statistically significant. There are analyses to find out how much data you need, and tests to compare the two sets and decide whether the differences are random or not (non-random differences would disprove your null hypothesis). The type of tests you do will mostly depend on your data, but clever people have made checklists to help us decide.
It sounds like your main problem is being listened to by IT, but raw technical data may not be persuasive to the right people. Getting backing from the business may help you, and that means talking about money.
Luckily, both platforms already contain a common piece of software - the application itself - designed to make or save money for someone. Why not measure how quickly it can do that, e.g. how long does it take to process an order?
By measuring how long your application spends dealing with each sub-task or data source, you can get a rough idea of which piece of the underlying hardware is under-performing. Writing to a local database, or handling a data structure larger than RAM, will impact the disk; making network calls will impact the network hardware; CPU-bound calculations will impact the CPU.
This data will never be as precise as a benchmark, and it may require expensive coding, but it's easier to translate what it finds into money terms. Log4j's NDC and MDC features, and Spring's AOP, might be good enabling tools for you.
Run perfmon.msc from Start / Run in Windows 2000 through to Vista. Then just add counters for CPU, disk, etc.
For SQL queries, you should capture the actual queries and then run them manually to see if they are slow.
For instance, if using SQL Server, run the profiler from Tools > SQL Server Profiler. Then perform some operations in your program and look at the capture for any suspicious database calls. Copy and paste one of the queries into a new query window in Management Studio and run it.
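Once you have a captured query, wrapping it like this in Management Studio shows elapsed time and logical/physical reads in the Messages tab; the query below is a made-up placeholder:

    -- Show timing and I/O statistics for the captured query.
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    SELECT OrderId, Total
    FROM dbo.Orders
    WHERE CustomerId = 12345;   -- parameter value copied from the trace

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;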
For networking you should try artificially limiting your network speed to see how it affects your code (e.g. Traffic Shaper XP is a simple freeware limiter).