How much storage capacity do my datasets and tables consume? - google-bigquery

I have multiple datasets, each with hundreds of tables, in Google BigQuery. I'd like to remove some old, legacy data, and I am looking for the most convenient way to see how much storage space each of my datasets and tables is occupying, so I can make an educated decision about which datasets/tables to remove.
I tried the bq command-line tool but couldn't find a way to display storage information for individual tables or for an entire dataset.

You can access metadata about the tables in a dataset by using the __TABLES__ meta-table. For example:
select * from [publicdata:samples.__TABLES__]
returns
project_id   dataset_id   table_id          creation_time   last_modified_time   row_count   size_bytes     type
publicdata   samples      github_nested     1348782587310   1348782587310        2541639     1694950811     1
publicdata   samples      github_timeline   1335915950690   1335915950690        6219749     3801936185     1
publicdata   samples      gsod              1335916040125   1440625349328        14420316    17290009238    1
publicdata   samples      natality          1335916045005   1440625330604        37826763    23562717384    1
publicdata   samples      shakespeare       1335916045099   1440625429551        164656      6432064        1
publicdata   samples      trigrams          1335916127449   1445684180324        68051509    277168458677   1
publicdata   samples      wikipedia         1335916132870   1445689914564        13797035    38324173849    1
More documentation here: https://cloud.google.com/bigquery/querying-data
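If you need totals per dataset rather than per table, you can aggregate the same meta-table. A minimal sketch, using the same legacy SQL syntax as above (substitute your own project and dataset for publicdata:samples):

-- Sum table sizes across the whole dataset, reported in GB
SELECT
  dataset_id,
  COUNT(*) AS table_count,
  ROUND(SUM(size_bytes)/POW(1024, 3)) AS total_gb
FROM [publicdata:samples.__TABLES__]
GROUP BY dataset_id

Run this once per dataset (or union several __TABLES__ references) to get a quick ranking of which datasets are worth cleaning up.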

Below is an example of how to combine the use of metadata (as in the answer by Mosha Pasumansky) with visualization (as in the recommendation by DoITInternational), all without leaving the BigQuery Web UI, but you will need the BigQuery Mate Chrome extension.
Assuming you have the extension installed, follow the steps below:
Step 1 - Run a query against the tables metadata in the publicdata:samples dataset
SELECT
table_id,
DATE(TIMESTAMP(creation_time/1000)) AS Created,
DATE(TIMESTAMP(last_modified_time/1000)) AS Modified,
row_count AS Rows,
ROUND(size_bytes/POW(1024, 3)) AS GB
FROM [publicdata:samples.__TABLES__]
Step 2 - Move to JSON View
Step 3 - Expand Result Panel by Clicking on + Button
This is for two reasons:
To bring up to 500 records at a time into the result panel (which should cover your case, as you mentioned you have hundreds of tables), versus the relatively limited number of rows at a time currently supported by the native UI
To free up more real estate for the chart
Step 4 - Close the query editor (optional) to free up even more real estate for the chart
Step 5 - Click Show Pivot to bring up the Pivot/Chart tool with the data from the result, and then design your pivot chart the way you like
It might not be the best way, but at least it allows you to do what you want here without leaving the web UI. In some cases it can be the preferred option, I think.

Rather than using the BigQuery API (the Tables: get method specifically) and looking at numBytes in the response, I can suggest using BQdu, the BigQuery Disk Usage web application. It will scan your project for datasets and tables and display a nice visualization of how much storage each table (or entire dataset) is consuming.
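If you would rather script this than use a web tool, the same numBytes information is exposed through the API client libraries. A minimal sketch using the google-cloud-bigquery Python client, assuming the library is installed and credentials are configured (the project ID is a placeholder):

# List every table in every dataset of a project along with its size.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

for dataset_item in client.list_datasets():
    total_bytes = 0
    for table_item in client.list_tables(dataset_item):
        # get_table() fetches full metadata, including num_bytes
        table = client.get_table(table_item.reference)
        size = table.num_bytes or 0
        total_bytes += size
        print(f"{table.dataset_id}.{table.table_id}: {size / 1024**3:.2f} GB")
    print(f"{dataset_item.dataset_id} total: {total_bytes / 1024**3:.2f} GB")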

Related

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, having mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but I'm still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure whether the min/avg/max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries that I tried used up my free TB of data processing on BigQuery.)
Any suggestions, advice, solutions, or queries are more than welcome, since the documentation about the structure of the dataset and how to query it is not very clear.
For details about the fields used in the report, you can check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension                       Value
origin                          "https://example.com"
effective_connection_type.name  "4G"
form_factor.name                "phone"
first_paint.histogram.start     1000
first_paint.histogram.end       1200
first_paint.histogram.density   0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
For the metrics, the initial link has a section called methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source tables (per country and per site) rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this one relevant:
SELECT
SUM(bin.density) AS density
FROM
`chrome-ux-report.chrome_ux_report.201710`,
UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
bin.start < 1000 AND
origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
Regarding query cost estimation, see the estimating query costs guide in the official BigQuery documentation. These tables, by their nature, consume a lot of processing, so filter them as much as possible.
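One caveat for the min/avg/max question: the CrUX tables only expose binned histogram densities, so exact minimum and maximum values cannot be recovered; at best you can approximate an average. A rough sketch, reusing the table reference and origin from the sample above and weighting each bin's midpoint by its density (the same pattern should apply to LCP via the largest_contentful_paint field in the months where it is present):

SELECT
  SUM(bin.density * (bin.start + bin.`end`) / 2) / SUM(bin.density) AS approx_avg_fcp_ms
FROM
  `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  origin = 'http://example.com'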

How can I optimize my varchar(max) column?

I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was advised to store the profile image in the database. This seemed OK and worked fine, but now that I'm dealing with real data and querying more rows, the data takes a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?
Returning that much data simply takes time. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data, if retrieval time is so important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
That said, you should not be using SELECT * unless you specifically want the image column. If you are using it in places where the image is not necessary, removing it can improve performance. If you want to make this safer for other users, you can add a view without the image column and encourage them to use the view.
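A minimal sketch of such a view, assuming hypothetical column names alongside the image column:

-- Expose everything except the image so routine queries stay small
-- (id, firstName, lastName and profileImage are illustrative names)
CREATE VIEW dbo.userProfileWithoutImage
AS
SELECT id, firstName, lastName
FROM dbo.userProfile;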
If it is still possible, take a step back: drop the idea of storing images in the table. Instead, save the path in the database and the image file on disk. This is the most efficient approach.
SELECT *
FROM userProfile
ORDER BY id
Do not use SELECT *, and why are you using ORDER BY? You can order the results in the UI code instead.
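If you go down the path-on-disk route suggested above, a minimal sketch (the new column name is illustrative):

-- Store only a file path; the image itself lives outside the database
ALTER TABLE dbo.userProfile ADD profileImagePath NVARCHAR(260) NULL;

-- Queries that need the picture now return a short string instead of megabytes of data
SELECT id, firstName, lastName, profileImagePath
FROM dbo.userProfile
ORDER BY id;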

Loading a spark DF from S3, multiple files. Which of these approaches is best?

I have an S3 bucket with partitioned data underlying Athena. Using Athena, I see there are 104 billion rows in my table. This is about 2 years of data.
Let's call it big_table.
Partitioning is by day, then by hour, so 07-12-2018-00, 01, 02 ... 24 for each day. The Athena field is partition_datetime.
In my use case I need the data from 1 month only, which is about 400 million rows.
So the question has arisen - load directly from:
1. files
spark.read.load(['s3://my_bucket/my_schema/my_table_directory/07-01-2018-00/file.snappy.parquet',
                 's3://my_bucket/my_schema/my_table_directory/07-01-2018-01/file.snappy.parquet',
                 ...
                 's3://my_bucket/my_schema/my_table_directory/07-31-2018-23/file.snappy.parquet'],
                format='parquet')
or 2. via pyspark using SQL
df = spark.read.parquet('s3://my_bucket/my_schema/my_table_directory')
df.createOrReplaceTempView('tmp')
df = spark.sql("select * from tmp where partition_datetime >= '07-01-2018-00' and partition_datetime < '08-01-2018-00'")
I think #1 is more efficient because we are only bringing in the data for the period in question.
Option 2 seems inefficient to me because all 104 billion rows (or, more accurately, all the partition_datetime fields) have to be traversed to satisfy the SELECT. I'm told that this really isn't an issue because of lazy execution and that there is never a df with all 104 billion rows. I still say that at some point each partition must be visited by the SELECT, and therefore option 1 is more efficient.
I am interested in other opinions on this. Please chime in.
What you are saying might be true, but it is not efficient as it will never scale. If you want data for three months, you cannot specify 90 lines of code in your load command. It is just not a good idea when it comes to big data. You can always perform operations on a dataset that big by using a spark standalone or a YARN cluster.
You could use wildcards in your path to load only files in a given range.
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-{01,02,03}-2018-*/')
or
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-*-2018-*/')
Thom, you are right. #1 is more efficient and the way to do it. However, you can build a list of the files to read and then ask Spark to read only those files, as sketched below.
This blog might be helpful for your situation.
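For completeness, a hedged sketch of building that list for July 2018 programmatically. It assumes the MM-DD-YYYY-HH partition layout from the question and an existing SparkSession named spark:

# Build the list of hourly partition paths for July 2018 and read only those
from datetime import date, timedelta

base = "s3://my_bucket/my_schema/my_table_directory"
day = date(2018, 7, 1)
paths = []
while day <= date(2018, 7, 31):
    for hour in range(24):
        paths.append(f"{base}/{day.strftime('%m-%d-%Y')}-{hour:02d}/")
    day += timedelta(days=1)

df = spark.read.parquet(*paths)  # only the listed partitions are scanned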

How to unnest Google Analytics custom dimension in Google Data Prep

Background story:
We use Google Analytics to track user behaviour on our website. The data is exported daily into BigQuery. Our implementation is quite complex and we use a lot of custom dimensions.
Requirements:
1. The data needs to be imported into our internal databases to enable better and more strategic insights.
2. The process needs to run without requiring human interaction
The problem:
Google Analytics data needs to be in a flat format so that we can import it into our database.
Question: How can I unnest custom dimensions data using Google Data Prep?
What it looks like:
----------------
customDimensions
----------------
[{"index":10,"value":"56483799"},{"index":16,"value":"·|·"},{"index":17,"value":"N/A"}]
What I need it to look like:
----------------------------------------------------------
customDimension10 | customDimension16 | customDimension17
----------------------------------------------------------
56483799 | ·|· | N/A
I know how to achieve this using a standard SQL query in the BigQuery interface, but I really want to have a Google Data Prep flow that does it automatically.
Define the flat format and create it in BigQuery first.
You could either:
1. create one big table and repeat several values using CROSS JOINs on all the arrays in the table, or
2. create multiple tables (per array) and use IDs to connect them, e.g.
- for session custom dimensions, concatenate fullvisitorid / visitstarttime
- for hits, concatenate fullvisitorid / visitstarttime / hitnumber
- for products, concatenate fullvisitorid / visitstarttime / hitnumber / productSku
The second option is a bit more effort, but you save storage because you're not repeating all the information everywhere.
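For the custom-dimension pivot itself, a hedged sketch in standard SQL against a GA export table. The project, dataset and date suffix are placeholders; the indexes match the question, and it assumes at most one value per index within a session:

SELECT
  CONCAT(fullVisitorId, '-', CAST(visitStartTime AS STRING)) AS session_id,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 10) AS customDimension10,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 16) AS customDimension16,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 17) AS customDimension17
FROM `my-project.my_ga_dataset.ga_sessions_20180701`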

Visualising data in Tableau when connected to BigQuery taking an eternity

I have a dataset that I've loaded into BigQuery, it consists of 3 separate tables with a common identifier in each of the files.
When I set up my project in Tableau I performed an inner join on two of the tables. I set the connection up as an extract and not live.
There's some geo info in my file: lats and longs. When I drag lat to the Rows shelf on my worksheet, it takes an eternity; it has currently been processing for 18 minutes and counting.
Is there some other way that I can take a random sample of my data for working on it rather than having to wait for each query to process? My data is not even that big, it's around 1M rows.
I've found Tableau to bog down quite a bit long before 1 million rows, and I suspect the join compounds the problem for you.
Aggregating as much as possible in BigQuery itself, before making the extract, is your friend. The random sample is a good idea, too. You could try:
SELECT
*
FROM
([subquery joining your tables])
WHERE RAND() < 0.05 # or whatever gives an acceptable sample size
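And if you go the aggregation route, a hedged sketch of pre-binning the coordinates before building the extract (the column names are illustrative, and the subquery placeholder is the same as above):

SELECT
  ROUND(lat, 2) AS lat_bin,
  ROUND(lng, 2) AS lng_bin,
  COUNT(*) AS points
FROM
  ([subquery joining your tables])
GROUP BY
  lat_bin, lng_bin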