Slow BigQuery Response Time - google-bigquery

I don't really consider this a duplicate of this or this. It might be slightly related anyway. But I wrote a simple query on my data, SELECT * FROM [opendata.openQueryData] LIMIT 1000, against a table that is not even 100,000 rows in total, and it is taking forever. Meanwhile, a similar query on sample data, SELECT * FROM [publicdata:samples.shakespeare] LIMIT 1000, took just 1.2s. Is there anything I have to do to achieve this speed on my own data?
Edit
I also just noticed that the row count for my table is not displayed, unlike the samples dataset and some other tables I have used before. Could this be the reason for the slow query response?

Related

Grafana displaying incomplete data from query results from BigQuery

I'm connected to BigQuery from Grafana, and I'm getting wildly different results from queries compared to the BigQuery console and other query tools I've connected to BigQuery. Simple things like select * from table yield very different results. Grafana is returning 1400 records from a select * on a table with 4 million records. Anyone seen this before or have any idea what is going on?
It seems like, in the Query options, the default value for Max data points is limiting the number of results to 1369. Set it to a higher number and the query will return that many rows (it is used as the LIMIT value in the query sent to the BigQuery API).
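For example (a sketch with a hypothetical table name), with Max data points left at 1369, the query Grafana effectively sends to BigQuery becomes:
-- my_dataset.my_table is a placeholder; Grafana appends its Max data points value as the LIMIT
SELECT * FROM my_dataset.my_table LIMIT 1369
Raising Max data points above the table's row count lets the full 4 million records come back, at the cost of pulling all of them into Grafana.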

How to efficiently sample hive table and run a query?

I want to sample 100 rows from a big_big_table (millions and millions of rows), and run some query on these 100 rows. Mainly for testing purposes.
The way I wrote it runs for a really long time, as if it reads the whole big_big_table and only then takes the LIMIT 100:
WITH sample_table AS (
    SELECT *
    FROM big_big_table
    LIMIT 100
)
SELECT name
FROM sample_table
ORDER BY name;
Question: What's the correct/fast way of doing this?
Check hive.fetch.task.* configuration properties
set hive.fetch.task.conversion=more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1073741824; -- 1 GB
Set these properties before your query and, if you are lucky, it will run without MapReduce. Also consider limiting the query to a single partition.
It may not work depending on the storage type/SerDe and file sizes. If the files are small or splittable and the table is native, it may run fast without MapReduce being started.
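A rough sketch that combines both suggestions (dt is a hypothetical partition column; adjust to your table's layout):
set hive.fetch.task.conversion=more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1073741824; -- 1 GB

-- Reading a single partition keeps the scan small enough for a fetch task,
-- so the LIMIT 100 can be served without launching MapReduce
SELECT name
FROM big_big_table
WHERE dt = '2024-01-01'
LIMIT 100;
As far as I understand, the fetch-task conversion only covers simple SELECT/FILTER/LIMIT queries, so keep the ORDER BY out of this step and sort the 100 sampled rows in an outer query or on the client.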

Check the execution time of a query accurate to the microsecond

I have a query in SQL Server 2019 that does a SELECT on the primary key fields of a table. This table has about 6 million rows of data in it. I want to know exactly how fast my query is, down to the microsecond (or at least to the nearest 100 microseconds). My query is faster than a millisecond, but all I can find in SQL Server are query measurements accurate to the millisecond.
What I've tried so far:
SET STATISTICS TIME ON
This only shows milliseconds
Wrapping my query like so:
DECLARE @Start DATETIME2, @End DATETIME2;
SELECT @Start = SYSDATETIME();
SELECT TOP 1 b.COL_NAME FROM BLAH b WHERE b.key = 0x1234;
SELECT @End = SYSDATETIME();
SELECT DATEDIFF(MICROSECOND, @Start, @End);
This shows that no time has elapsed at all. But this isn't accurate, because if I add WAITFOR DELAY '00:00:00.001', which should add a measurable millisecond of delay, it still shows 0 for the DATEDIFF. Only if I wait for 2 milliseconds do I see it show up in the DATEDIFF.
Looking up the execution plan and getting the total_worker_time from the sys.dm_exec_query_stats table.
Here I see 600 microseconds; however, the Microsoft docs seem to indicate that this number cannot be trusted:
total_worker_time ... Total amount of CPU time, reported in microseconds (but only accurate to milliseconds)
I've run out of ideas and could use some help. Does anyone know how I can accurately measure my query in microseconds? Would extended events help here? Is there another performance monitoring tool I could use? Thank you.
This is too long for a comment.
In general, you don't look for performance measurements measured in microseconds. There is just too much variation, based on what else is happening in the database, on the server, and in the network.
Instead, you set up a loop and run the query thousands -- or even millions -- of times and then average over the executions. There are further nuances, such as clearing caches if you want to be sure that the query is using cold caches.
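A minimal sketch of that approach, reusing the query from the question (the iteration count and the variable type for COL_NAME are assumptions):
DECLARE @i INT = 0, @dummy SQL_VARIANT, @Start DATETIME2 = SYSDATETIME();

WHILE @i < 100000
BEGIN
    -- Assign into a variable so 100,000 result sets are not streamed to the client
    SELECT TOP 1 @dummy = b.COL_NAME FROM BLAH b WHERE b.key = 0x1234;
    SET @i += 1;
END;

-- Average elapsed time per execution, in microseconds (loop overhead is included)
SELECT DATEDIFF(MICROSECOND, @Start, SYSDATETIME()) / 100000.0 AS avg_microseconds;
Note that every iteration after the first hits warm caches; discard the first run (or clear the caches between runs) if cold-cache timings are what you care about.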

Google BigQuery results

I am getting only part of the result set from the BigQuery API.
Earlier, I solved the issue of 100,000 records per result by using iterators.
However, now I'm stuck on another obstacle.
If I select more than 6-7 columns, I do not get the complete result set.
However, if I select a single column, I get the complete result.
Can there also be a size limit for results in the BigQuery API?
There are some limits for query jobs.
In particular:
Maximum response size — 128 MB compressed
Of course, the size is unlimited when writing large query results to a destination table (and then reading from there).
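A sketch of that workaround in BigQuery standard SQL (the dataset and table names are placeholders):
-- Materialize the wide result into a destination table instead of
-- returning it directly, sidestepping the 128 MB response limit
CREATE TABLE my_dataset.wide_results AS
SELECT col_a, col_b, col_c, col_d, col_e, col_f, col_g
FROM my_dataset.source_table;
With the legacy API, the equivalent is to set allowLargeResults together with a destinationTable on the query job configuration. Either way, you can then read the stored table back with the same iterator-based paging you already use.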

tableUnavailable dependent upon size of search

I'm experiencing something rather strange with some queries that I'm performing in BigQuery.
Firstly, I'm using an externally backed table (csv.gz) with about 35 columns. The total data in the location is around 5 GB, with an average file size of 350 MB. The reason I'm doing this is that I continually add data to and remove data from the table on a rolling basis, to give me a view of the last 7 days of our activity.
When querying, if I perform something simple like:
select * from X limit 10
everything works fine. It continues to work fine if you increase the limit up to 1 million rows. As soon as you up the limit to ten million:
select * from X limit 10000000
I end up with a tableUnavailable error "Something went wrong with the table you queried. Contact the table owner for assistance. (error code: tableUnavailable)"
Now, according to any literature on this, this usually results from using some externally owned table (I'm not). I can't find any other enlightening information for this error code.
Basically, if I do anything slightly complex on the data, I get the same result. There's a column called eventType that has maybe a couple hundred different values in the entire dataset. If I perform the following:
select eventType, count(1) from X group by eventType
I get the same error.
I'm getting the feeling that this might be related to limits on external tables? Can anybody clarify or shed any light on this?
Thanks in advance!
Doug