Big Query External table - Query performance degrades with increased number of files in the Source URI - google-bigquery

I have an external big query table created to read "Parquet" files from a GCS bucket.
The folder layout in the GCS bucket is as follows:
gs://mybucket/root/year=2022/model=abc/
gs://mybucket/root/year=2022/model=.../
gs://mybucket/root/year=2021/model=abc/
gs://mybucket/root/year=2021/model=.../
The layout is organized in such a way that it follows hive partitioning layout as explained the big query documentation. The columns "year" and "model" are seen as partition columns in the external table.
**External Data Configuration**
Source URI(s)- gs://mybucket/root/*
Source format - PARQUET
Hive Partitioning Mode - CUSTOM
Hive Partitioning Source URI Prefix - gs://mybucket/root/{year:INTEGER}/{model:STRING}
Hive Partitioning Column(s)- year, model
Problem: When I run queries on the external table as given below, I have observed that every query runs for an initial 2-3 minutes before the actual run happens. Big Query console shows "Query pending" during this time and as soon as it turns "Query Running" the output gets displayed with minimal slot time consumption (Slot time shows in 1-2 seconds.)
Select * from myTable Where year = 2022 and model = 'abc'
The underlying file count will vary and increases for every year and model. For years with more parquet files the initial time sometimes is around 4-5 minutes.
My understanding as per the documentation is that , if the partition columns are present in the query, some sort of partition pruning happens and I expect the query to be responsive immediately as per the documentation.
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs#partition_pruning
But the observations made by me is contrary to this. If the source URIs are restricted to 1 year, the table reads the data from one year, the query initial time (where it remains "Query pending" on console) is reduced to 1-2 minute (or even less)
Source URI(s)- gs://mybucket/root/year=2022/*
Question: Is this the expected behavior ? because as volume of files increase in the GCS bucket, the query takes even longer to run (esp. the initial time, and the actual run time doesn't change much), though in the where clause we have the year and model partition columns applied.

This is likely expected behavior. Before partition pruning can happen objects in GCS need to be listed which is likely where the time is being taken. We are working on improvements in this area. The fact that slot time is so low is a good indicator that pruning is in fact happening (since most files are not being read there isn't a lot of slot time to consume).

Related

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the Crux dataset in big query for last 10 days to extract data for data studio report. Though I consider myself good at SQL, as I have mostly worked with oracle and SQL server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi, explored the queries on his github repo but still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics but I am not sure if the Min/Avg/Max lcp (in milliseconds) for a time period (month for example) could be extracted from this table. What other queries could I possibly try which requires less data processing. (Some of the queries that I tried expired my free TB of data processing on big query).
Any suggestion, advise solution, queries are more than welcome since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used on the report you can check on the main documentation for the chrome ux report specially on the last part with data format which shows the dimensions and how its interpreted as show below:
Dimension
origin "https://example.com"
effective_connection_type.name 4G
form_factor.name "phone"
first_paint.histogram.start 1000
first_paint.histogram.end 1200
first_paint.histogram.density 0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
For the metrics, in the initial link there is a section called methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source table per country and per site and not the summary as the data you are looking for can be obtained there. In the Bigquery part of the documentation you will find samples of how to query those tables. I find this relatable:
SELECT
SUM(bin.density) AS density
FROM
`chrome-ux-report.chrome_ux_report.201710`,
UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
bin.start < 1000 AND
origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
About query estimation cost, you can see estimating query cost guide on google official bigquery documentation. But using this tables due to its nature consumes a lot of processing so filter it as much as possible.

How to efficiently filter a dataframe from an S3 bucket

I want to pull a specified number of days from an S3 bucket that is partitioned by year/month/day/hour. This bucket has new files added everyday and will grow to be rather large. I want to do spark.read.parquet(<path>).filter(<condition>), however when I ran this it took significantly longer (1.5 hr) than specifying the paths (.5 hr). I dont understand why it takes longer, should I be adding a .partitionBy() when reading from the bucket? or is it because of the volume of data in the bucket that has to be filtered?
That problem that you are facing is regarding the partition discovery. If you point to the path where your parquet files are with the spark.read.parquet("s3://my_bucket/my_folder") spark will trigger a task in the task manager called
Listing leaf files and directories for <number> paths
This is a partition discovery method. Why that happens? When you call with the path Spark has no place to find where the partitions are and how many partitions are there.
In my case if I run a count like this:
spark.read.parquet("s3://my_bucket/my_folder/").filter('date === "2020-10-10").count()
It will trigger the listing that will take 19 Seconds for around 1700 folders. Plus the 7 seconds to count, it has a total of 26 seconds.
To solve this overhead time you should use a Meta Store. AWS provide a great solution with AWS Glue, to be used just like the Hive Metastore in a Hadoop environment.
With Glue you can store the Table metadata and all the partitions. Instead of you giving the Parquet path you will point to the table just like that:
spark.table("my_db.my_table").filter('date === "2020-10-10").count()
For the same data, with the same filter. The list files doesn't exist and the whole process of counting took only 9 Seconds.
In your case that you partitionate by Year, Month, Day and Hour. We are talking about 8760 folders per year.
I would recommend you take a look at this link and this link
This will show how you can use Glue as your Hive Metastore. That will help a lot to improve the speed of Partition query.

segmentGranularity in Druid indexing task; exact meaning & implication during indexing

I still don't quite get this "segmentGranularity" in Druid. This page is quite ambiguous: http://druid.io/docs/latest/design/segments.html . It goes on mentioning segmentGranularity but it talks more about intervals (in the first paragraph).
Anyway, at this point the volume of my data is not that big. That page mentioned 300mb-700mb is the "ideal" size of a segment. Actually I can fit a week of data into one segment. That's why I'm thinking of setting segmentGranularity to "week" in my indexing-task json:
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "week",
"queryGranularity" : "none",
"intervals" : ["2015-09-12/2015-09-13"]
},
However, I plan to do the batch indexing every one hour (and this will normally only (re)process data within that same day). So that's why I put only one interval, that spans one day, in the "intervals" field above.
My question: how would that work when the segmentGranularity is set to week (instead of day)? Will it rebuild the cube for the entire segment (of a week)? Which is something I don't want; I want only to rebuild the cube for the day.
Thanks,
Raka
Yes segment granularity period specifies for what duration data should be kept in a particular segment. If your segment is set to weekly than each segment would hold data of a particular week.
Now if you are going to run ingestion task every hour, than the entire segment gets re-build, if you have addition of data only for the day, than its generally better to keep your segment granularity to "day".
But you can very well keep the segment granularity to "week" if your data is small, it shouldn't matter whether druid rebuilds the segments.
Since your data set is small, you can look into tranquility server, which can ingest data on the fly without batch ingestion. It should do fine for your use case.

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery using command line/console. You'd probably save some dollars of instance's uptime. This is what I've been doing after running into issues with Dataflow/BigQuery interface. Also from my experience there is some overhead bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table OR 6M/per minute. At 31M rows of input that would take ~ 5 minutes of just flat out writes. When you add back the discrete processing time per element & then the synchronization time (read from GCS->dispatch->...) of the graph this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the mean time while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment. :-)
Make sense?

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always gives the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm made aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in Hbase sorted by insertion date, if you record somewhere timestamps of successful summary runs, and open the filter on inputs that are dated later than last successful summary -- you'll save some significant scanning time.
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability