Looks like data is no longer being published to the public NOAA forecast table in BigQuery's public dataset. Does anyone know why that is happening? I cannot find any info about the data being discontinued on either website.
project: bigquery-public-data
dataset: noaa_global_forecast_system
table: NOAA_GFS0P25
BigQuery SQL that you can use to test this out:
SELECT * FROM `bigquery-public-data.noaa_global_forecast_system.NOAA_GFS0P25` WHERE DATE(creation_time) >= "2022-04-11" LIMIT 100
New forecast data has not been inserted into the table since 2022-04-10. They have missed a day in the past, but we have not seen them miss multiple days in a row before. We would like to know whether we need to migrate to a new forecast source, but we cannot find any info on whether this dataset is being shut down or whether they are just having temporary technical difficulties.
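For reference, here is a quick sketch that reports the most recent load, using the same creation_time column as the query above:
SELECT MAX(creation_time) AS latest_creation_time
FROM `bigquery-public-data.noaa_global_forecast_system.NOAA_GFS0P25`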
Thanks for the heads up! This looks like a temporary technical issue, but we are working on getting this dataset back up and running.
This is my first post so apologies if something is posted incorrectly - please let me know and I will fix it.
I am trying to build a SQL query in BigQuery that creates a cohort analysis, so I can see how many customers have been retained over time by the month they joined (their cohort).
I work in insurance, so we have data on customers, when they joined, and any time they changed their policy (e.g., when they added a car to their coverage), but I do not have the data laid out as one row per month of premium. The data is as follows:
[Image: the data as it is vs. how I need the data]
Do you know how I could fill in the missing months?
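For context, a common pattern for this in BigQuery is to cross join each customer against a generated series of months. A minimal sketch, assuming a table with customer_id, join_date, and monthly_premium columns (all hypothetical names):
SELECT
  p.customer_id,
  DATE_TRUNC(p.join_date, MONTH) AS cohort_month, -- the month the customer joined
  month_start, -- one row per month from the join month through today
  p.monthly_premium
FROM
  `your_project.your_dataset.policies` AS p,
  UNNEST(GENERATE_DATE_ARRAY(
    DATE_TRUNC(p.join_date, MONTH),
    DATE_TRUNC(CURRENT_DATE(), MONTH),
    INTERVAL 1 MONTH)) AS month_start
Once every customer has a row for every month, the retention matrix is a GROUP BY cohort_month, month_start with a COUNT(DISTINCT customer_id).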
We are using the Google Ads transfer in BigQuery to ingest our Google Ads data. One thing I have noticed when querying the results is that all of the metrics are exactly 156x the values we would expect in the Google Ads UI (cost, clicks, etc.).
We have tested multiple transfers and hit the same issue each time. The transfer process seems pretty straightforward, but am I missing something? Has anyone else noticed a similar issue, or have any ideas of what to adjust in the data transfer?
For which tables do you notice this behavior?
The dimension tables such as Customer, Campaign, AdGroup are exported every day and so are partitioned by day.
This could be the cause of your duplication.
You only need the latest partition/day.
For example, this is how I get the latest account/customer data:
SELECT
  -- The main reason I cast all the IDs to STRING is so the BI reporting tool
  -- treats them as dimension fields rather than metrics.
  CAST(customer_id AS STRING) AS account_id, -- globally unique, see also: https://developers.google.com/google-ads/api/docs/concepts/api-structure
  customer_descriptive_name,
  customer_auto_tagging_enabled,
  customer_currency_code,
  customer_manager,
  customer_test_account,
  customer_time_zone,
  _DATA_DATE AS date, -- source table is partitioned on date
  _LATEST_DATE,
  CASE WHEN _DATA_DATE = _LATEST_DATE THEN TRUE ELSE FALSE END AS is_most_recent_record
FROM
  `YOURPROJECTID.google_ads.ads_Customer_YOURID`
WHERE
  _DATA_DATE = _LATEST_DATE
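As a sanity check on this explanation (a hypothetical query; substitute your own project and table IDs): if the inflation factor you see (156) matches the number of daily snapshots in the dimension table you join against, the duplication comes from joining against every partition instead of only the latest one.
SELECT
  COUNT(DISTINCT _DATA_DATE) AS snapshot_days
FROM
  `YOURPROJECTID.google_ads.ads_Campaign_YOURID`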
I am aware that Google Analytics can be linked to BigQuery using the BigQuery Linking feature in GA.
However, I experienced the drawback that the export is scheduled at an unpredictable time. This breaks my tables with dependencies on the GA data, which I build at 9 AM using dbt -- so if the GA data is updated after 9 AM, my table won't have today's GA data.
My questions are:
Is there a way to schedule the GA data update at a consistent time, the way a cron job does?
Or, if there is not, is there a way for dbt to run the job only after the GA data has been updated in BigQuery?
Unfortunately, Google provides no SLA on the BigQuery export from Google Analytics 3. If you have the option, the best solution would be to migrate to Google Analytics 4, which has a near-real-time export to BigQuery and appears to be much more robust. Find out more on the official Google support page.
I currently get around this by using event-based triggers that look at the metadata of a table, or check for the existence of a sharded table for yesterday, and then kick off the downstream jobs. I'm sure you could achieve something similar with dbt.
Here is some example SQL code which checks for the existence of yesterday's Google Analytics sharded table by returning the maximum timestamp:
SELECT
  MAX(CAST(PARSE_DATE('%Y%m%d', SUBSTR(table_id, 13)) AS TIMESTAMP)) AS max_date
FROM
  `my_ga_dataset.__TABLES__`
WHERE
  table_id LIKE '%ga_sessions_%'
  AND table_id NOT LIKE '%intraday%'
  AND PARSE_DATE('%Y%m%d', SUBSTR(table_id, 13)) >= CURRENT_DATE() - 9
This works for sharded tables. If you want to use table metadata to get the date/time of the last table update, you can use INFORMATION_SCHEMA:
https://cloud.google.com/bigquery/docs/information-schema-tables
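For example, a sketch using the INFORMATION_SCHEMA.PARTITIONS view, which exposes a last_modified_time column per partition (dataset and table names are placeholders; check that the view is available in your region):
SELECT
  table_name,
  MAX(last_modified_time) AS last_modified -- latest update across all partitions
FROM
  `my_ga_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  table_name = 'my_table'
GROUP BY
  table_name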
I'm working on a way to stream the status of some jobs that are running on an HPC resource (sort of like trying to create a dashboard to look at real-time flight status). I generate and push data every 60 seconds. Unfortunately, this way I end up with a lot of repeated data, as the status of each 'job' changes unpredictably. I need a way to keep only the latest data. I'm not an SQL pro and do this work in my free time, so any help will be appreciated!
Here is my query:
SELECT
  Job, Ref, Location, Queue, Description, Status, ElapTime, CAST(Time AS DATETIME) AS Time
INTO
  output_source
FROM
  input_source
Here is what my output looks like when I test the query:
[Image: query test result]
As you can see in the image, there are two sets of data with two different timestamps. I would like the query to return all the columns associated with only the last timestamp. How do I do this? Any ideas? Apologies if this is a repeated question; I have not found an answer that has helped me solve this problem.
Thanks for all your help!
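For reference, a common pattern for keeping only the newest row per job is to rank rows by time within each job and keep rank 1. A sketch, assuming your engine supports window functions (column names are taken from the query above):
SELECT
  Job, Ref, Location, Queue, Description, Status, ElapTime, Time
FROM (
  SELECT
    *,
    -- number the rows within each Job, newest first
    ROW_NUMBER() OVER (PARTITION BY Job ORDER BY Time DESC) AS rn
  FROM
    input_source
) AS latest
WHERE
  rn = 1 -- keep only the most recent row per Job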
I have 30 daily sharded tables in Big Query from Nov 1 to Nov 30, 2016.
Each of these tables follow the naming convention of "sample_datamart_YYYYMMDD".
Each of these daily tables have a field called timestampServer.
My goal is to advance the data by 24 hours at 00:00:00 UTC every day, so that the data is kept current without me having to copy the tables.
Is there any way to:
1) do a calculation on the field timestampServer so that it gets updated every 24 hours?
2) and at the same time rename the table from sample_datamart_20161130 to sample_datamart_20161201?
I've read the other posts, and I think those are more about aggregations in a 30-day window. My objective is not to do any aggregations. I just want to move the whole dataset forward by 24 hours so that when I search for the last 1 day, there will always be data there.
Does anyone know if Google Cloud Datasets: Update would be able to perform these tasks?
https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/update#try-it
Thanks very much for any guidance.
As for #2 - how to rename the table from sample_datamart_20161130 to sample_datamart_20161201?
This can be achieved by copying the table to a new table and then deleting the original table.
There is zero extra cost, as copy jobs are free of charge.
The table can be copied with the Jobs: Insert API using a copy configuration, and the original can then be deleted using the Tables: Delete API.
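For illustration, the same copy-then-delete can be expressed in SQL with the CREATE TABLE ... COPY DDL statement, which runs as a copy job (mydataset is a placeholder for your dataset):
-- copy the shard to the new date...
CREATE TABLE `mydataset.sample_datamart_20161201`
COPY `mydataset.sample_datamart_20161130`;
-- ...then remove the original shard
DROP TABLE `mydataset.sample_datamart_20161130`;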
Just wanted to note that the above answer only directly addresses your second question. But somehow I feel you could be going in the wrong direction. If you describe in more detail what you are trying to achieve (as opposed to how you think you will implement it), we might be able to provide better help. If you do go that way, I would recommend posting it as a separate question :o)