Strange behavior on BigQuery dataset location

I noticed strange behavior on Google Compute Engine when using BigQuery and VM instances.
I have a Java process that streams data into BigQuery.
I expected better performance from choosing the same region for the BigQuery dataset and the VM instances, but my tests showed unexpected behavior.
CASE1: VM on us-central1-a AND dataset location US
Average BigQuery insert response time: 150 milliseconds
CASE2: VM on europe-west1-c AND dataset location US
Average BigQuery insert response time: 700 milliseconds
CASE3: VM on us-central1-a AND dataset location EU
Average BigQuery insert response time: 1200 milliseconds
CASE4: VM on europe-west1-c AND dataset location EU
Average BigQuery insert response time: 1700 milliseconds
I can understand the drop in performance in CASE2 and CASE3, but what about CASE4?
The test shows that if the BigQuery dataset location is "EU", performance decreases even when the VM is in europe-west1-c.
My conclusion is: never use BigQuery in the EU (except, of course, when there are requirements on the location of the data)!
Is there anything wrong with my reasoning?

Thanks for reporting.
Looks like the latency mentioned in the post includes both tables.get() + tabledata.insertAll(). The latency difference is mostly caused by tables.get().
We are aware that calling metadata-related APIs (e.g. tables.get) is slower from the EU than from the US. It is caused by some existing infrastructure limitations, and unfortunately there is no short-term fix for it. But we are actively working on some backend changes to minimize this latency difference in the long term.
A few things you might consider to mitigate this:
Pre-create your tables ahead of time, so there is no need to check table existence every time before an insertAll.
If it is a daily table, maybe try a partitioned table? Then you only need to create the table once. https://cloud.google.com/bigquery/docs/partitioned-tables https://cloud.google.com/bigquery/docs/querying-partitioned-tables
If newly created tables have the same schema as a base table, try streaming to a template table. https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables
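The first suggestion can be sketched as follows. This is only an illustration of checking table existence once per table instead of before every insert. The class name and the two callables are hypothetical stand-ins for a tables.get() call and a tabledata.insertAll() call, not part of any BigQuery client library.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


class CachedTableInserter:
    """Checks table existence once per table instead of before every insert."""

    def __init__(
        self,
        table_exists: Callable[[str], bool],
        insert_rows: Callable[[str, List[Row]], None],
    ) -> None:
        # Both callables are hypothetical stand-ins: table_exists for a
        # tables.get() call, insert_rows for a tabledata.insertAll() call.
        self._table_exists = table_exists
        self._insert_rows = insert_rows
        self._known_tables: Set[str] = set()
        self.metadata_calls = 0  # counts how often the slow metadata path runs

    def insert(self, table_id: str, rows: List[Row]) -> None:
        if table_id not in self._known_tables:
            self.metadata_calls += 1
            if not self._table_exists(table_id):
                raise LookupError(f"table {table_id!r} should be pre-created")
            self._known_tables.add(table_id)
        self._insert_rows(table_id, rows)
```

With this shape, only the first insert into a given table pays for the metadata lookup; later inserts go straight to the streaming call.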

Related

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, as I have mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but I am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure if the Min/Avg/Max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries that I tried used up my free TB of data processing on BigQuery.)
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used in the report, check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension                        Value
origin                           "https://example.com"
effective_connection_type.name   4G
form_factor.name                 "phone"
first_paint.histogram.start      1000
first_paint.histogram.end        1200
first_paint.histogram.density    0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a “4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
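The cumulative sum described above can be sketched in a few lines. This is a minimal illustration, not code from the CrUX documentation: the `Bin` type and the function name are made up, and only the 1000-1200 bin (density 0.123) comes from the sample record; the other bins are invented to make the example complete.

```python
from typing import Iterable, NamedTuple


class Bin(NamedTuple):
    start: int      # lower bound of the bin, in milliseconds
    end: int        # upper bound of the bin, in milliseconds
    density: float  # fraction of page loads that fell in this bin


def cumulative_density(bins: Iterable[Bin], max_end: int) -> float:
    """Sum the density of every bin whose end is <= max_end."""
    return sum(b.density for b in bins if b.end <= max_end)


# Only the 1000-1200 bin (0.123) comes from the sample record;
# the other two bins are invented for illustration.
first_paint = [
    Bin(0, 1000, 0.500),
    Bin(1000, 1200, 0.123),
    Bin(1200, 1400, 0.100),
]
share_below_1200 = cumulative_density(first_paint, 1200)
```

With these made-up bins, the bin ending at 1400 is excluded and the two bins ending at or below 1200 are added together.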
For the metrics, the initial link has a section called methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source tables per country and per site rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this one relevant:
SELECT
  SUM(bin.density) AS density
FROM
  `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  bin.start < 1000 AND
  origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
For estimating query cost, see the guide on estimating query costs in the official Google BigQuery documentation. Because of their nature, queries against these tables process a lot of data, so filter as much as possible.

How to interpret query process GB in Bigquery?

I am using a free trial of Google BigQuery. This is the query that I am using:
select * from `test`.events where subject_id = 124 and id = 256064 and time >= '2166-01-15T14:00:00' and time <='2166-01-15T14:15:00' and id_1 in (3655,223762,223761,678,211,220045,8368,8441,225310,8555,8440)
This query is expected to return at most 300 records and not more than that.
However, I see a message like the one below.
But the table on which this query operates is really huge. Does this indicate the table size? However, I ran this query multiple times a day.
As a result, I got the error below:
Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
How long do I have to wait for this error to go away? Is the daily limit 1 TB? If yes, then I did not even use close to 400 GB.
How to view my daily usage?
If I can edit the quota, can you let me know which option I should be editing?
Can you help me with the above questions?
According to the official documentation
"BigQuery charges for queries by using one metric: the number of bytes processed (also referred to as bytes read)", regardless of how large the output size is. What this means is that if you do a count(*) on a 1TB table, you will supposedly be charged $5, even though the final output is very minimal.
Note that due to storage optimizations that BigQuery performs internally, the bytes processed might not equal the raw size of the table when you created it.
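To make the pricing arithmetic concrete, here is a rough sketch. It uses the $5 per TB figure quoted above and assumes 1 TB is counted as 2^40 bytes. The function names and the quota helper are hypothetical, and actual billing rules (minimums, rounding) are not modeled.

```python
BYTES_PER_TB = 2 ** 40  # assumption: 1 TB is counted as 2^40 bytes


def estimate_query_cost(bytes_processed: int, usd_per_tb: float = 5.0) -> float:
    """Cost depends only on bytes processed, not on the size of the result."""
    return bytes_processed / BYTES_PER_TB * usd_per_tb


def runs_within_quota(bytes_per_run: int, quota_bytes: int) -> int:
    """How many times a query can run before a byte quota is used up."""
    return quota_bytes // bytes_per_run


# A count(*) that scans a full 1 TB table is charged as 1 TB processed.
full_scan_cost = estimate_query_cost(BYTES_PER_TB)
```

The same reasoning explains the quota error: a query that returns 300 rows but scans a large table on every run counts the full scan against the quota each time.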
For the error you're seeing, go to "IAM & admin" and then "Quotas" in the Google Cloud Console, where you can search for quotas specific to the BigQuery service.
Hope this helps!
Flavien

How to use BigQuery Slots

Hi there.
Recently, I tried to run a query in the BigQuery web UI using "group by" over some tables (the table names follow the pattern xxx_mst_yyyymmdd). The rows will be over 10 million. Unfortunately, the query failed with this error:
Query Failed
Error: Resources exceeded during query execution.
I made some improvements to my query, so the error may not happen this time. But as my data grows, the error will appear again in the future. So I checked the latest BigQuery release, and there may be two ways to solve this:
1. After 2016/01/01, BigQuery will change the query pricing tiers to accommodate "High Compute Tiers", so that the "resourcesExceeded" error will not happen again.
2. BigQuery Slots.
I checked some Google documents and didn't find out how to use BigQuery Slots. Is there any sample or use case for BigQuery Slots? Or do I have to contact the BigQuery team to enable the feature?
I hope someone can help me answer this question. Thanks very much!
A couple of points:
I'm surprised that a GROUP BY with a cardinality of 10M failed with resources exceeded. Can you provide a job id of the failed query so we can investigate? You mention that you're concerned about hitting these errors more often as your data size increases; you should likely be able to increase your data size by a few more orders of magnitude without seeing this; likely you've encountered either a bug or something was strange with either your query or your data.
"High Compute Tiers" won't necessarily get rid of resourcesExceeded. For the most part, resourcesExceeded means that BigQuery ran into memory limitations; high compute tiers only address CPU usage. (and note, they haven't been enabled yet).
BigQuery slots enable you to process data faster and with more reliable performance. For the most part, they also wouldn't help prevent resourcesExceeded errors.
There is currently (as of Nov 5) a bug where you may need to provide an EACH keyword with a GROUP BY. Recent changes should enable BigQuery to automatically select the execution strategy, so EACH shouldn't be needed, but there are a couple of cases where it doesn't pick the right one. When in doubt, add an EACH to your JOIN and GROUP BY operations.
To get your project eligible for using slots you need to contact support.

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (expected to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery using the command line/console. You'd probably save some money on instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, in my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M per minute. At 31M rows of input, that would take ~5 minutes of flat-out writes alone. When you add back the discrete processing time per element and then the synchronization time (read from GCS->dispatch->...) of the graph, this looks about right.
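The back-of-the-envelope estimate above can be written out as a small calculation. It uses the 100,000 rows per second per table limit from this answer and the 31.3M-record day from the question; the function name is made up for illustration.

```python
def min_write_seconds(rows: int, rows_per_second: int = 100_000) -> float:
    """Lower bound on write time when a single table caps streaming throughput."""
    return rows / rows_per_second


# 31.3M rows is the one-day input size mentioned in the question.
seconds = min_write_seconds(31_300_000)
minutes = seconds / 60
print(f"{seconds:.0f} s, about {minutes:.1f} min of pure writes")
```

Because this floor depends only on the per-table limit, adding workers cannot push the job below it, which is consistent with the run times flattening out at 13-14 minutes.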
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the meantime, while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. Might be a fun experiment. :-)
Make sense?

How to create inhouse funnel analytics?

I want to create in-house funnel analysis infrastructure.
All the user activity feed information would be written to a database / DW of choice and then, when I dynamically define a funnel I want to be able to select the count of sessions for each stage in the funnel.
I can't find an example of creating such a thing anywhere. Some people say I should use Hadoop and MapReduce for this but I couldn't find any examples online.
Your MapReduce is pretty simple:
Mapper reads a session row from the log file; its output is (stage-id, 1).
Set number of Reducers to be equal to the number of stages.
Reducer sums values for each stage. Like in wordcount example (which is a "Hello World" for Hadoop - https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0).
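The mapper and reducer described above can be sketched in plain Python, without Hadoop, to show the logic. The row layout "session_id,stage_id" is an assumption for illustration. Like the wordcount example, this counts log rows per stage, so a session that logs the same stage twice is counted twice.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def mapper(row: str) -> Iterator[Tuple[str, int]]:
    """Emit (stage_id, 1) for one log row."""
    # Assumed row layout: "session_id,stage_id"
    _session_id, stage_id = row.strip().split(",")
    yield stage_id, 1


def reducer(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the emitted values for each stage."""
    totals: Dict[str, int] = defaultdict(int)
    for stage_id, count in pairs:
        totals[stage_id] += count
    return dict(totals)


def count_stages(rows: Iterable[str]) -> Dict[str, int]:
    """Run the map phase over all rows, then reduce."""
    return reducer(pair for row in rows for pair in mapper(row))
```

In a real cluster the framework groups the mapper output by key and hands each stage to its own reducer; here a single dictionary plays that role.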
You will have to set up a Hadoop cluster (or use Elastic Map Reduce on Amazon).
To define the funnel dynamically you can use the DistributedCache feature of Hadoop. To see results you will have to wait for MapReduce to finish (minimum dozens of seconds; or minutes in case of Amazon's Elastic MapReduce; the time depends on the amount of data and the size of your cluster).
Another solution that may give you results faster - use a database: select stage, count(distinct session_id) from mylogs group by stage;
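The database approach can be tried locally with an in-memory SQLite database. This is only a small sketch: the table name mylogs and the columns session_id and stage follow the query above, while the function name and the choice of SQLite are assumptions for illustration.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def sessions_per_stage(events: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct sessions per funnel stage from (session_id, stage) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE mylogs (session_id TEXT, stage TEXT)")
        conn.executemany("INSERT INTO mylogs VALUES (?, ?)", events)
        cursor = conn.execute(
            "SELECT stage, COUNT(DISTINCT session_id) FROM mylogs GROUP BY stage"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()
```

Unlike a plain row count, COUNT(DISTINCT session_id) counts a session once per stage even if it logged that stage several times.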
If you have too much data to quickly execute that query (it does a full table scan; HDD transfer rate is about 50-150MB/sec - the math is simple) - then you can use a distributed analytic database that runs over HDFS (distributed file system of Hadoop).
In this case your options are (I list here open-source projects only):
Apache Hive (based on MapReduce of Hadoop, but if you convert your data to Hive's ORC format - you will get results much faster).
Cloudera's Impala - not based on MapReduce, can return your results in seconds. For fastest results convert your data to Parquet format.
Shark/Spark - in-memory distributed database.