Data not showing up intermittently on the OpenTSDB UI - Bigtable

We are running some high-volume tests by pushing metrics to OpenTSDB (2.3.0) backed by Bigtable, and a curious problem surfaces from time to time. For some metrics, an hour of data stops showing up in the web UI when we run a query. The span of "missing" data is very clear-cut and aligns exactly with hour boundaries (UTC). After a while, rerunning the same query, the data shows up. There does not seem to be any pattern we can deduce here, other than the hour span. Any pointers on what to look for and how to debug this?

How long do you have to wait before the data shows up? Is it always the most recent hour that is missing?
Have you tried using the OpenTSDB CLI when this is happening and issuing a scan to see if the data is available that way?
http://opentsdb.net/docs/build/html/user_guide/cli/scan.html
You could also check via an HBase shell scan to see if you can get the raw data that way (here's information on how it's stored in HBase):
http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html
If you can verify the data is there then it seems likely to be a web UI problem. If not, the next likely culprit is something getting backed up in the write pipeline.
I am not aware of any particular issue in the Google Cloud Bigtable backend layer that would cause this behavior, but I believe some folks have encountered issues with OpenTSDB compactions during periods of high load that result in degraded performance.
It's worth checking in the Google Cloud Console to see if there are any outliers in the latency, CPU, or throughput graphs that correlate with the times at which you experience the issue.

Related

Bigquery Cache Not Working

I noticed that BigQuery no longer caches the same query, even though I have chosen to use cached results in the GUI (both Alpha and Classic). I didn't edit the query at all; I just kept clicking the run query button, and every time the GUI executed the query without using cached results.
It happens with my PHP script as well. Before, it was able to use the cache and came back with results very quickly, but now it executes the query every time, even when the same query was executed minutes ago. I can confirm the behaviour in the logs.
I am wondering if anything changed in the last few weeks, or whether some kind of account-level setting controls this, because it was working fine for me.
As per the official docs here, the cache is disabled when:
...any of the tables referenced by the query have recently received
streaming inserts...
Even if you are streaming to one partition, and then querying to another, this will invalidate caching functionality for the whole table. There is this feature request opened where it is requested to be able to hit cache when doing streaming inserts to one partition but querying a different partition.
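A toy model may make the invalidation rule concrete. This is not the BigQuery API, just a sketch of the behavior described above, under the assumption that the cache key effectively includes the table's state and that a streaming insert to any partition changes that state:

```python
# Toy model of whole-table cache invalidation (illustrative names only,
# not the BigQuery API): any streaming insert bumps the table's "version",
# invalidating cached results for every query referencing the table --
# even if the insert went to a different partition than the one queried.

class Table:
    def __init__(self):
        self.version = 0            # changes on ANY write, to any partition
        self.partitions = {}

    def streaming_insert(self, partition, rows):
        self.partitions.setdefault(partition, []).extend(rows)
        self.version += 1           # whole-table invalidation

class QueryCache:
    def __init__(self):
        self._cache = {}

    def run(self, sql, table):
        key = (sql, table.version)  # cache key includes the table version
        if key in self._cache:
            return self._cache[key], True    # cache hit
        result = f"executed: {sql}"          # pretend to execute the query
        self._cache[key] = result
        return result, False                 # cache miss

table = Table()
cache = QueryCache()

_, hit1 = cache.run("SELECT * FROM t WHERE day = '2018-01-01'", table)
_, hit2 = cache.run("SELECT * FROM t WHERE day = '2018-01-01'", table)
table.streaming_insert("2018-01-02", ["row"])   # different partition!
_, hit3 = cache.run("SELECT * FROM t WHERE day = '2018-01-01'", table)

print(hit1, hit2, hit3)  # False True False
```

The third run misses even though the insert touched a different partition, which is exactly the behavior the feature request asks to change.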
EDIT***:
After some investigation I've found out that, some months ago, there was an issue that allowed queries to hit the cache even while streaming inserts were being made. This was not expected behavior, and therefore it got fixed in May. I guess this is the change you have experienced and what you are talking about.
The docs have not changed in this regard, and they aren't/weren't incorrect. It's the previous behavior that was incorrect.

UI testing fails with error "Failed to get snapshot within 15.0s"

I have a table view with a huge number of cells. I tried to tap a particular cell in this table view, but the test ends with this error:
Failed to get snapshot within 15.0s
I assume that the system takes a snapshot of the whole table view before accessing its elements. Because of the huge number of cells, the time allowed for taking the snapshot was not enough (15 seconds is probably the system default).
I manually set the sleep/wait time (I used 60 seconds). Still, I cannot access the cells after 60 seconds!
A strange thing I found is, before accessing the cell, I print the object in Debugger like this:
po XCUIApplication().cells.debugDescription
It shows an error like
Failed to get snapshot within 15.0s
error: Execution was interrupted, reason: internal ObjC exception breakpoint(-3)..
The process has been returned to the state before expression evaluation.
Again if I print the same object using
po XCUIApplication().cells.debugDescription
Now it will print all cells in the tableview in Debugger view.
No idea why this happens. Has anyone faced a similar kind of issue? Help needed!
I assume that the system takes a snapshot of the whole table view before accessing its elements.
Your assumption is correct but there is even more to the story. The UI test requests a snapshot from the application. The application takes this snapshot and then sends the snapshot to the test where the test finally evaluates the query. For really large snapshots (like your table view) that means that:
The snapshot takes a long time for the application to generate and
the snapshot takes a long time to send back to the test for query evaluation.
I'm at WWDC 2017 right now and there is a lot of good news about testing - specifically some things that address your exact problem. I'll outline it here but you should go watch WWDC 2017 Session 409 and skip to timestamp 17:10.
The first improvement is remote queries. This is where the test transmits the query to the application, the application evaluates that query remotely from the test, and only the result of that query is transmitted back. Expect a modest improvement from this enhancement: ~20% faster and ~30% less memory.
The second improvement is query analysis. This enhancement reduces the size of snapshots by using a minimum attribute set when taking them, which means full snapshots of the view are no longer taken by default when evaluating a query. For example, if you're querying to tap a button, the snapshot is limited to buttons within the view. This means that writing less ambiguous queries is even more important; i.e., if you want to tap a navigation bar button, specify it in the query, like app.navigationBars.buttons["A button"]. You'll see even more performance improvement from this enhancement: ~50% faster and ~35% less memory.
The last and most notable (and dangerous) improvement is what they're calling the First Match API. This comes with some trade-offs/risks but offers the largest performance gain. It adds a new .firstMatch property that returns the first match for an XCUIElement query. Ambiguous matches that would normally result in test failures do not occur when using .firstMatch, so you run the risk of evaluating or performing an action on an XCUIElement that you didn't intend to. Expect a performance increase of ~10x, with no memory spiking at all.
So, to answer your question: update to Xcode 9, macOS High Sierra, and iOS 11. Use .firstMatch where you can, with highly specific queries, and your issue of timing out on snapshots should be solved. In fact, the timeout issue you're experiencing might already be solved by the general improvements you'll get from remote queries and query analysis, without you having to use .firstMatch at all!

BigQuery Retrieval Times Slow

BigQuery is fast at processing large sets of data; however, retrieving large results from BigQuery is not fast at all.
For example, I ran a query that returned 211,136 rows over three HTTP requests, taking just over 12 seconds in total.
The query itself was returned from cache, so no time was spent executing the query. The host server is Amazon m4.xlarge running in US-East (Virginia).
In production I've seen this process take ~90 seconds when returning ~1M rows. Obviously some of this could be down to network traffic... but it seems too slow for that to be the only cause (those 211,136 rows were only ~1.7MB).
Has anyone else encountered such slow speed when having results returned, and have found a resolution?
Update: Reran the test on a VM inside Google Cloud with very similar results, ruling out network issues between Google and AWS.
Our SLO on this API is 32 seconds, and a call taking 12 seconds is normal. 90 seconds sounds too long; it must be hitting some of our system's tail latency.
I understand that it is embarrassingly slow. There are multiple reasons for it, and we are working on improving the latency of this API. By the end of Q1 next year, we should be able to roll out a change that would cut tabledata.list time in half (by upgrading the API frontend to our new One Platform technology). If we have more resources, we will also make jobs.getQueryResults faster.
Concurrent Requests using TableData.List
It's not great, but there is a resolution.
Make a query, and set the max rows to 1000. If there is no page token simply return the results.
If there is a page token then disregard the results*, and use the TableData.List API. However, rather than simply sending one request at a time, send a request for every 10,000 records* in the result. To do this one can use the 'MaxResults' and 'StartIndex' fields. (Note that even these smaller pages may be broken into multiple requests*, so paging logic is still needed.)
This concurrency (and smaller pages) leads to significant reductions in retrieval times. Not as good as BigQ simply streaming all results, but enough to start realizing the gains from using BigQ.
Potential pitfalls: Keep an eye on the request count, as with larger result sets there could be 100 req/s throttling. It's also worth noting that there's no guarantee of ordering, so using the StartIndex field for pseudo-paging may not always return correct results*.
* Anything with a single asterisk is still an educated guess, not confirmed as true/best practice.
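The windowing and concurrency described above can be sketched as follows. This is illustrative only: fetch_window is a stand-in for a real tabledata.list call (startIndex/maxResults), and the worker count and page size are assumptions, not confirmed best practice:

```python
from concurrent.futures import ThreadPoolExecutor

PAGE = 10_000  # rows per concurrent window (the size suggested above)

def windows(total_rows, page=PAGE):
    """Split a result set into (start_index, max_results) windows
    for parallel tabledata.list calls."""
    return [(start, min(page, total_rows - start))
            for start in range(0, total_rows, page)]

def fetch_window(start, max_results):
    # Placeholder for a real tabledata.list request with
    # startIndex=start and maxResults=max_results. A real response
    # may still include a pageToken, so per-window paging logic is
    # needed; here we just fabricate the rows.
    return list(range(start, start + max_results))

def fetch_all(total_rows):
    """Fetch all windows concurrently and flatten in order."""
    with ThreadPoolExecutor(max_workers=10) as pool:
        parts = pool.map(lambda w: fetch_window(*w), windows(total_rows))
    return [row for part in parts for row in part]

rows = fetch_all(211_136)
print(len(rows))  # 211136
```

Because executor.map preserves input order, the flattened result keeps row order even though the requests run concurrently; whether that ordering holds for real StartIndex-based paging is, as noted above, not guaranteed.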

Google BigQuery: Slow streaming inserts performance

We are using BigQuery as event logging platform.
The problem we faced was very slow insertAll post requests (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/insertAll).
It does not matter where they are fired - from server or client side.
Minimum is 900 ms, average is 1500 ms, of which nearly 1000 ms is connection time.
Even if there is 1 request per second (so no throttling here).
We use Google Analytics measurement protocol and timings from the same machines are 50-150ms.
The solution described in BigQuery streaming 'insertAll' performance with PHP suggested using queues, but that seems to be overkill because we send no more than 10 requests per second.
The question is if 1500ms is normal for streaming inserts and if not, how to make them faster.
Additional information:
If we send malformed JSON, response arrives in 50-100ms.
Since streaming has a limited payload size (see the Quota policy), it's easier to talk about times, as the payload is limited in the same way for both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non-descriptive, and they are so vague that they don't help you; you just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option for these is exponential backoff with retry; even the support told us to do so. Which personally doesn't make me happy.
Also the failure rate fits the 99.9% uptime we have in the SLA, so there is no reason for objection.
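For what it's worth, the backoff-with-retry that support recommends is straightforward to sketch. The helper below is illustrative and not BigQuery-specific; do_insert stands in for the real insertAll call, and the attempt limits and delays are assumed values:

```python
import random
import time

def insert_with_backoff(do_insert, max_attempts=6, base=0.5, cap=32.0):
    """Retry a flaky call with exponential backoff and jitter.
    `do_insert` is a stand-in for the real streaming insert; it should
    raise on 'Backend error' / 'Connection error' / timeout responses."""
    for attempt in range(max_attempts):
        try:
            return do_insert()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: log for later replay
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered wait

# Demo with a fake insert that fails twice ('Backend error'), then succeeds.
attempts = {"n": 0}
def flaky_insert():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("Backend error")
    return "ok"

print(insert_with_backoff(flaky_insert, base=0.01))  # ok
```

Note the 'timeout' case above needs extra care: since only some rows fail, a real retry should resend just the failed rows, which this sketch does not attempt.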
There's something to keep in mind in regards to the SLA: it's a very strictly defined structure; the details are here. The 99.9% is uptime and does not translate directly into a failure rate. What this means is that if BQ has a 30-minute downtime one month, and you do 10,000 inserts within that period but no inserts at other times of the month, the numbers will be skewed. This is why we suggest an exponential backoff algorithm. The SLA is explicitly based on uptime and not error rate, but logically the two correlate closely if you do streaming inserts throughout the month at different times with a backoff-and-retry setup. Technically, you should experience on average about 1 in 1000 failed inserts if you are doing inserts throughout the month and have set up a proper retry mechanism.
You can check out this chart about your project health:
https://console.developers.google.com/project/YOUR-APP-ID/apiui/apiview/bigquery?tabId=usage&duration=P1D
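The uptime-versus-error-rate skew is easy to see with quick arithmetic (the numbers mirror the 30-minute-outage example above):

```python
# 99.9% uptime allows ~43 minutes of downtime in a 30-day month.
# If ALL of a project's inserts happen to land inside a 30-minute
# outage, the observed failure rate is 100% even though the uptime
# SLA was met.

month_minutes = 30 * 24 * 60               # 43200 minutes in a 30-day month
downtime_fraction = 1 - 0.999              # 0.1% allowed by 99.9% uptime
allowed_downtime = month_minutes * downtime_fraction  # ~43.2 minutes

inserts_during_outage = 10_000
total_inserts = 10_000                     # no inserts the rest of the month
failure_rate = inserts_during_outage / total_inserts

print(round(allowed_downtime, 1))  # 43.2
print(failure_rate)                # 1.0

# Spread the same inserts uniformly across the month instead, and the
# expected failure rate tracks the downtime fraction (~0.1%, i.e. the
# "about 1 in 1000" figure above):
print(round(downtime_fraction, 4))  # 0.001
```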
It happens that my response is on the other linked article, and I proposed queues because they made our exponential backoff with retry very easy, and working with queues is simple. We use Beanstalkd.
In my experience, any request to BigQuery will take long. We've tried using it as a database for performance data but are eventually moving away due to slow response times. As far as I can see, BQ is built for handling big requests within a 1-10 second response time; these are the requests BQ categorizes as interactive. BQ doesn't get faster by doing less. We stream quite a few records to BQ but always make sure we batch them up (per table), and run all requests asynchronously (or, if you have to, in another thread).
PS: I can confirm what Pentium10 says about failures in BQ. Make sure you retry the stuff that fails, and if it fails again, log it to a file for retrying another time.
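The batch-per-table approach this answer mentions can be sketched as below. MAX_BATCH and the event shape are illustrative assumptions, not BigQuery limits:

```python
from collections import defaultdict

MAX_BATCH = 500   # rows per insertAll call (an assumed batch size)

def batch_rows(events):
    """Group buffered (table, row) events by destination table, then
    chunk each group so one streaming request carries many rows
    instead of one row per request."""
    by_table = defaultdict(list)
    for table, row in events:
        by_table[table].append(row)
    batches = []
    for table, rows in by_table.items():
        for i in range(0, len(rows), MAX_BATCH):
            batches.append((table, rows[i:i + MAX_BATCH]))
    return batches

events = ([("logs", {"id": i}) for i in range(1200)] +
          [("metrics", {"id": i}) for i in range(300)])
batches = batch_rows(events)
print(len(batches))  # 4 -- three batches for "logs", one for "metrics"
```

Since each request carries the same ~1-1.5 s overhead regardless of row count, amortizing it over hundreds of rows per call is where the win comes from.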

Bigquery Stream Benchmark

BigQuery has officially become our device log data repository and live monitoring/analysis/diagnostics base. As a next step, we need to measure and monitor data streaming performance. Are there any relevant benchmarks you use for BigQuery live streaming? Which relevant ones can I refer to?
Since streaming has a limited payload size (see the Quota policy), it's easier to talk about times and other side effects.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout'
some other error messages are non-descriptive, and they are so vague that they don't help you; you just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option for these is exponential backoff with retry; even the support told us to do so. Which personally doesn't make me happy.
UPDATE
Someone requested new stats in the comments, so I posted 2017 numbers. It's still the same; there was some heavy data reorganization on our side (you can see the spike), but essentially it's unchanged: around 2 seconds if you use the maximum streaming insert payload.