Hello, I am using my personal GCP account to play around in BigQuery, and I am still within the free-tier range (a billing account is linked, but no fees have been incurred yet).
I created a table from the baseball.games_wide table in bigquery-public-data. The following is my simple CREATE TABLE query, partitioned on the timestamp column 'startTime'.
CREATE TABLE project.table
PARTITION BY DATE(startTime) AS
SELECT
  gameId, seasonId, DATE(startTime) AS game_date, startTime, year
FROM `bigquery-public-data.baseball.games_wide`
WHERE year = 2016
The table was created successfully, and in the query's execution details I can see a write phase, which indicates that something was written to the table. However, when I go to 'Preview' the table, there is no data to display, and the table size is 0 KB.
If I remove the second line (i.e., PARTITION BY DATE(startTime)) and recreate the table, the data is ingested and I am able to Preview it in the console. It seems the PARTITION BY clause is causing the problem, but I can't tell what is going wrong. Any ideas?
As you mentioned in the comments, this issue is resolved by creating a new dataset after the billing account is linked to the project.
You can follow this tutorial to create a billing account and link it to the project.
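For illustration, a minimal sketch of re-running the original CTAS in a dataset created after billing was linked (the dataset name new_dataset and table name games_2016 are placeholders, not names from the question):
-- Create a fresh dataset, then recreate the partitioned table inside it.
CREATE SCHEMA IF NOT EXISTS `project.new_dataset`;
CREATE TABLE `project.new_dataset.games_2016`
PARTITION BY DATE(startTime) AS
SELECT
  gameId, seasonId, DATE(startTime) AS game_date, startTime, year
FROM `bigquery-public-data.baseball.games_wide`
WHERE year = 2016;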
We are using the Google Ads transfer in BigQuery to ingest our Google Ads data. One thing I have noticed when querying the results is that all of the metrics are exactly 156x the values we would expect in the Google Ads UI (cost, clicks, etc.).
We have tested multiple transfers and each time we have the same issue. The transfer process seems pretty straightforward, but am I missing something? Has anyone else noticed a similar issue, or have any ideas of what to look at or adjust in the data transfer?
For which tables do you notice this behavior?
The dimension tables such as Customer, Campaign, and AdGroup are exported every day, and so they are partitioned by day.
This could be the cause of your duplication: if you aggregate across all partitions, every row is counted once per export day.
You only need the latest partition/day.
For example, this is how I get the latest account / customer data:
SELECT
-- the main reason I cast all the IDs to STRING is so the BI reporting tool treats them as dimension fields rather than metrics
CAST(customer_id AS STRING) AS account_id, --globally unique, see also: https://developers.google.com/google-ads/api/docs/concepts/api-structure
customer_descriptive_name,
customer_auto_tagging_enabled,
customer_currency_code,
customer_manager,
customer_test_account,
customer_time_zone,
_DATA_DATE AS date, -- source table is partitioned on date
_LATEST_DATE,
CASE WHEN _DATA_DATE = _LATEST_DATE THEN TRUE ELSE FALSE END is_most_recent_record
FROM
`YOURPROJECTID.google_ads.ads_Customer_YOURID`
WHERE
_DATA_DATE = _LATEST_DATE
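And, as a rough sketch of how to avoid the multiplication when joining stats to a dimension table, restrict both sides to their latest partition before aggregating. The table and column names below follow the transfer's usual ads_* naming pattern but are assumptions, so adjust them to your project:
SELECT
  s.segments_date,
  c.campaign_name,
  SUM(s.metrics_clicks) AS clicks,
  SUM(s.metrics_cost_micros) / 1000000 AS cost
FROM
  `YOURPROJECTID.google_ads.ads_CampaignBasicStats_YOURID` s
JOIN
  `YOURPROJECTID.google_ads.ads_Campaign_YOURID` c
ON
  s.campaign_id = c.campaign_id
  AND c._DATA_DATE = c._LATEST_DATE -- latest campaign snapshot only
WHERE
  s._DATA_DATE = s._LATEST_DATE -- latest stats snapshot only
GROUP BY
  1, 2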
Here is my situation:
My colleague has a dataset located in asia-northeast3 in his BigQuery project. He has already given me reader access to his dataset. I'm trying to extract some necessary data from one of his tables and save it into a new table under my dataset (location: us-central).
I wrote the following SQL to do this, but BigQuery reported an error:
Not found: Dataset my_project_id:dataset_in_us was not found in
location asia-northeast3
CREATE OR REPLACE TABLE `my_project_id.dataset_in_us.my_tablename` AS
SELECT
create_date
, totalid -- id for article.
, urlpath -- format like /article/xxxx
, article_title -- text article title
FROM `my_colleagues_project_id.dataset_in_asia_northeast3.tablename`
ORDER BY 1 DESC
;
I can't change my dataset's location or his. I need to join the data from his dataset with data from my dataset. How can I solve this?
After a day of trying and failing, I found an imperfect solution.
I copied my colleague's entire dataset from asia-northeast3 to us-central following this guide.
After that I can run my query on the copied dataset.
This solution is time (and money) consuming. I'm still trying to figure out if there is a way to only copy a single table, instead of an entire dataset, from one location to another.
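For reference, once the copy exists in the same location as my dataset, the original query runs against it without the location error. A rough sketch, where copied_dataset_us is a placeholder name for the copied dataset:
CREATE OR REPLACE TABLE `my_project_id.dataset_in_us.my_tablename` AS
SELECT
  create_date
  , totalid
  , urlpath
  , article_title
FROM `my_project_id.copied_dataset_us.tablename`
ORDER BY 1 DESC
;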
I am using a query to calculate daily retention on my Firebase Analytics data exported to BigQuery. It is working well and the numbers match the numbers in Firebase, but when I try to filter the query by a cohort of users, the numbers don't add up.
I want to compare the results of an A/B test from Firebase, and so I've looked at the user_property "firebase_exp_2" which is my A/B test, and I've split up the users in each group (0/1). The retention numbers do not match (at all) the numbers that I can see in my A/B test results in Firebase - actually they show the opposite pattern.
The query is adapted from here: https://github.com/sagishporer/big-query-queries-for-firebase/wiki/Query:-Daily-retention
All I've changed is adding the following under the "WHERE" clause:
WHERE
event_name = 'user_engagement' AND user_pseudo_id IN
(SELECT user_pseudo_id
FROM `analytics_XXX.events_*`,
UNNEST (user_properties) user_properties
WHERE user_properties.key = 'firebase_exp_2' AND user_properties.value.string_value='1')
Firebase says that there are 6,043 users in the Control group and 6,127 in the Variant A group, but my numbers are 5,632 and 5,730, and the retained users are around 1,000 users more than what Firebase reports.
What am I doing wrong?
The export to BigQuery happens on a daily basis and each imported table is named events_YYYYMMDD. Additionally, a table is imported for events received throughout the current day. This table is named events_intraday_YYYYMMDD.
The additions you made query from events_*, which is fine. The example uses events_201812*, though, which would ignore the intraday table. That would explain why your numbers are lower: you are missing users added to the A/B test during the current day.
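As a rough sketch of that fix (the December 2018 suffix values are only illustrative), widen the outer query's wildcard filter so it also picks up the intraday table:
SELECT
  event_name,
  user_pseudo_id
FROM
  `analytics_XXX.events_*`
WHERE
  event_name = 'user_engagement'
  -- match both the daily tables and the current day's intraday table
  AND (_TABLE_SUFFIX BETWEEN '20181201' AND '20181231'
    OR _TABLE_SUFFIX LIKE 'intraday_201812%')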
Over the past few weeks, I've written a pipeline that picks up all the clickstream data that is being broadcast from a website. The pipeline makes use of AWS in the following way: S3 > EC2 (for transforms) > Athena (scanning a clean, partitioned S3). New data comes into the pipeline every 24 hours and this works great: my clickstream data is easily queryable. However, I now need to add some additional columns, e.g. time spent on each page. This can be achieved by sorting by user ID and timestamp, then taking the difference between the timestamp column of row_n1 and row_n2. So my questions are:
1) How can I do this via an SQL query? I'm struggling to get it to work, but my thinking is that once I do, I can trigger this query every 24 hours to run on the new clickstream data coming into Athena.
2) Is this a reasonable way to add additional columns or new aggregate tables? For example, build a query that runs every 24 hours on new data and appends to a new table.
Ideally, I don't want to touch any of the source code that's been written to do the "core" ETL pipeline.
For reference, my table looks similar to the following (with the new column timeSpentOnPage):
| userID     | eventNum | Category | Time            | ...... | timeSpentOnPage |
| '103-1023' | '3'      | 'View'   | '12-10-2019...' | ...... | 3s              |
Thanks for any direction/advice that can be provided.
I'm not entirely sure what you are asking, and some example data and expected output would be helpful. For example, I don't quite understand what you mean by row_n1 and row_n2.
I'm going to guess that you mean something like calculating the difference between the timestamps of consecutive rows. That can be achieved by a query like
SELECT
userID,
timestamp - LAG(timestamp, 1) OVER (PARTITION BY userID ORDER BY timestamp) AS timeSpentOnPage
FROM events
The LAG window function returns the value from a previous row (1 in this case means the previous row) within the window defined by the PARTITION BY and ORDER BY clauses (in this case all rows with the same userID, sorted by timestamp). It's kind of like GROUP BY, but evaluated for each row, if that makes sense.
It wouldn't quite give you the time spent on each page, some page views would look like they were very long when in fact there was just not any activity between them (say someone browsed some, went to lunch, and browsed some more – the last page view before lunch would look like it spanned the whole lunch).
There is no way to do the equivalent of UPDATE in Athena. The closest thing is doing a "CTAS" (Create Table AS) to create a new table (which with some automation can be turned into creating new partitions for existing tables).
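As illustration only, here is a rough sketch of that CTAS approach using the LAG logic from above; the table name clickstream_with_duration, the S3 location, the events source table, and the dt partition column are all assumptions, not the asker's actual schema:
-- Sketch: CTAS writes a new, partitioned Parquet table to S3 with the derived column added.
CREATE TABLE clickstream_with_duration
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/clickstream_with_duration/',
  partitioned_by = ARRAY['dt']
) AS
SELECT
  userID,
  eventNum,
  Category,
  Time AS event_time, -- aliased because TIME is a reserved word in Athena DDL
  date_diff('second',
            LAG(Time) OVER (PARTITION BY userID ORDER BY Time),
            Time) AS timeSpentOnPage, -- seconds since the previous event for the same user
  date_format(Time, '%Y-%m-%d') AS dt -- partition column must come last
FROM events;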
If you provide some more information about your data I can revise this answer with other suggestions.
I know that BigQuery offers the first "1 TB of data processed" per month for free but I can't figure out where to look on my dashboard to see my monthly usage. I used to be able to "revert" to the old dashboard which had the info but for the past couple of weeks the "old dashboard" isn't accessible.
From the Google Cloud Console overview page for your project, click on the "details" section at the top right, next to the charge estimate:
You'll get an estimate of the charges for the current month for each service and item in the service, including BigQuery analysis:
If you want to track this usage, you can also export the data into CSV every day by going into the Billing settings and enabling the usage export feature. Do not worry about the fact that it only mentions Compute Engine; it actually works for other services as well.
You can also access the billing history directly by clicking on the billing account link:
You will get a detailed bill with the usage info:
Post GCP Console Redesign Answer
The GCP console was redesigned and now the other answer here no longer applies, but it is still possible to view your usage by going to IAM & Admin -> Quotas.
What you're looking for is "BigQuery API: Query usage per day". It doesn't seem possible to view your usage over 30 days, unfortunately, but you can see your current usage (per day) and your peak usage over the past 7 days. You can also set a daily quota. If you're just working infrequently or doing a lot in one day, you can set the quota to 1 TiB and prevent yourself from blowing your whole allocation in one day.
You can try sending feedback about these limitations, like I did, by clicking the question mark at the top right and then selecting "Send feedback".
Theo is correct that there is no way to view the number of bytes processed or billed since the start of the month (inside of the free tier) in the GCP Billing Console. However, you can extract the bytes processed and bytes billed data from logs in Cloud Logging and calculate the total bytes processed/billed since the start of the month inside of BigQuery.
Here are the steps to count total bytes billed in a month:
Under Cloud Logging, go to Logs Explorer (NOT the Legacy Logs Explorer) and run the following query in the query builder frame:
resource.type="bigquery_project" AND
protoPayload.metadata.jobChange.job.jobStats.queryStats.totalBilledBytes>1 AND
timestamp>="2021-04-01T00:00:00Z"
The timestamp clause is not actually necessary, but it will speed up the query. You can set timestamp >= <value> to any valid timestamp you want as long as it returns at least one result.
In the Query Results frame, click the "Action" button, and select "Create Sink".
In the window that opens, give your sink a name, click "Next", and in the "Select sink service" dropdown menu select "BigQuery dataset".
In the "Select BigQuery dataset" dropdown menu, either select an existing dataset where you would like to create your sink (which is a table containing logs) or if you prefer, choose "Create new BigQuery dataset.
Finally, you will likely want to check the box for Partition Table, since this will help you control costs whenever you query this sink. As of the time of this answer, however, Google limits partition tables to 4000 partitions, so you may find it is necessary to clear out old logs eventually.
Click "Create Sink" (there is no need for any inclusion or exclusion filters).
Run a query in BigQuery that produces bytes billed (i.e. a query that does not return a previously cached result). This is necessary to instantiate the sink. Moments after your query runs, you should see a table called <your_bigquery_dataset>.cloudaudit_googleapis_com_data_access
Enter the following Standard SQL query in the BigQuery query editor:
WITH
  bytes_table AS (
    SELECT
      JSON_VALUE(protopayload_auditlog.metadataJson,
        '$.jobChange.job.jobStats.createTime') AS date_time,
      JSON_VALUE(protopayload_auditlog.metadataJson,
        '$.jobChange.job.jobStats.queryStats.totalBilledBytes') AS billedbytes
    FROM
      `<your_project>.<your_bigquery_dataset>.cloudaudit_googleapis_com_data_access`
    WHERE
      EXTRACT(MONTH FROM timestamp) = 4
      AND EXTRACT(YEAR FROM timestamp) = 2021)
SELECT
  (SUM(CAST(billedbytes AS INT64)) / 1073741824) AS total_GB
FROM
  bytes_table;
You will want to change the month from 4 to whatever month you intend to query, and 2021 to whatever year you intend to query. Also, you may find it helpful to save this query as a view if you intend to rerun it periodically.
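For example, here is a sketch of such a view, using CURRENT_DATE() so the month and year do not need to be edited by hand (the view name monthly_billed_gb is a placeholder):
CREATE OR REPLACE VIEW `<your_project>.<your_bigquery_dataset>.monthly_billed_gb` AS
SELECT
  -- billed GB so far in the current calendar month, from the sink table above
  SUM(CAST(JSON_VALUE(protopayload_auditlog.metadataJson,
    '$.jobChange.job.jobStats.queryStats.totalBilledBytes') AS INT64)) / 1073741824 AS total_GB
FROM
  `<your_project>.<your_bigquery_dataset>.cloudaudit_googleapis_com_data_access`
WHERE
  EXTRACT(MONTH FROM timestamp) = EXTRACT(MONTH FROM CURRENT_DATE())
  AND EXTRACT(YEAR FROM timestamp) = EXTRACT(YEAR FROM CURRENT_DATE());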
Be advised that your sink does not contain your past BigQuery logs, only BigQuery logs produced after you created the sink. Therefore, in the first month, the number of GB returned by this query will not be an accurate count of your bytes billed in the month unless you happened to create the sink before running any queries in BigQuery during the current month.
Might be related to How can I monitor incurred BigQuery billings costs (jobs completed) by table/dataset in real-time?
If you are fine by using BigQuery itself to get that information (instead of using a UI), you can use something like this:
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
ROUND(SUM(total_bytes_processed) / gb_divisor,2) as bytes_processed_in_gb,
ROUND(SUM(IF(cache_hit != true, total_bytes_processed, 0)) * cost_factor,4) as cost_in_dollar,
user_email
FROM (
(SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
UNION ALL
(SELECT * FROM `other-project.region-us`.INFORMATION_SCHEMA.JOBS_BY_USER)
)
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
GROUP BY
user_email
Open in BigQuery UI
Explanation
Please consider the caveats I mentioned in my answer here.