Google BigQuery - update SQL?

It seems you can run SELECT statements in BigQuery, but can you update only certain rows in a table through the API or from the web console?

Currently BigQuery only accepts SELECT statements in SQL; changes to data have to be made outside of SQL, via the API, web UI, or CLI (for example, by loading new data or rewriting a table).

BigQuery is a WORM (write once, read many) technology, append-only by design. It sounds like you may not have been aware of this, since there is no UPDATE or DELETE statement for individual rows.

To delete data, you could re-materialize the table without the desired rows:
SELECT *
FROM [mytable]
WHERE id NOT IN (SELECT id FROM [rows_to_delete])
To update data, you could follow a similar process:
SELECT * FROM (
SELECT *
FROM [mytable]
WHERE id NOT IN (SELECT id FROM [rows_to_update])
), (
SELECT *
FROM [rows_to_update]
)
Re-materializing a table in BigQuery is fast enough compared to native updates/deletes on other analytical databases, AFAIK.
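If it helps, here's a minimal sketch of running such a re-materialization from the CLI, assuming a hypothetical dataset mydataset holding the tables above; --replace overwrites the destination table with the query result:
bq query --use_legacy_sql=true \
  --destination_table=mydataset.mytable \
  --replace \
  'SELECT * FROM [mydataset.mytable] WHERE id NOT IN (SELECT id FROM [mydataset.rows_to_delete])'
The destination can be the table being read, since the result is computed before the destination is overwritten.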

Related

Redshift showing 0 rows for external table, though data is viewable in Athena

I created an external table in Redshift and then added some data to the specified S3 folder. I can view all the data perfectly in Athena, but I can't seem to query it from Redshift. What's weird is that select count(*) works, so that means it can find the data, but it can't actually show anything. I'm guessing it's some mis-configuration somewhere, but I'm not sure what.
Some stuff that may be relevant (I anonymized some of it):
create external schema spectrum_staging
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::############:role/RedshiftSpectrumRole'
create external database if not exists;
create external table spectrum_staging.errors(
id varchar(100),
error varchar(100))
stored as parquet
location 's3://mybucket/errors/';
My sample data is stored in s3://mybucket/errors/2018-08-27-errors.parquet
This query works:
db=# select count(*) from spectrum_staging.errors;
count
-------
11
(1 row)
This query does not:
db=# select * from spectrum_staging.errors;
id | error
----+-------
(0 rows)
Check your parquet file and make sure the column data types in the Spectrum table match up.
Then run SELECT pg_last_query_id(); after your query to get the query number, and look in the system tables STL_S3CLIENT and STL_S3CLIENT_ERROR for further details about the query execution.
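For example (a sketch; 12345 stands in for whatever id pg_last_query_id() actually returns):
select * from spectrum_staging.errors;  -- the query that returns 0 rows
select pg_last_query_id();              -- note the returned id, e.g. 12345
select * from stl_s3client_error where query = 12345;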
You don't need to define external tables when you have defined an external schema based on the Glue Data Catalog; Redshift Spectrum picks up all the tables that are in the catalog.
What's probably going on is that you somehow have two objects with the same name: in one case Redshift picks the table up from the data catalog, and in the other it tries to use the manually defined external table.
Check these tables from Redshift side to get a better view of what's there:
select * from SVV_EXTERNAL_SCHEMAS
select * from SVV_EXTERNAL_TABLES
select * from SVV_EXTERNAL_PARTITIONS
select * from SVV_EXTERNAL_COLUMNS
And these tables for queries that use the tables from external schema:
select * from SVL_S3QUERY_SUMMARY
select * from SVL_S3LOG order by eventtime desc
select * from SVL_S3QUERY where query = xyz
select * from SVL_S3PARTITION where query = xyz
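For the table in the question, a quick first check might be (a sketch using the names from the DDL above):
select columnname, external_type
from SVV_EXTERNAL_COLUMNS
where schemaname = 'spectrum_staging' and tablename = 'errors';
Compare the external_type values against the actual column types in the Parquet file.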
Was there ever a resolution for this? A year on, I have the same problem today.
Nothing stands out in terms of schema differences; an error does show up though:
select recordtime, file, process, errcode, linenum as line,
trim(error) as err
from stl_error order by recordtime desc;
/home/ec2-user/padb/src/sys/cg_util.cpp padbmaster 1 601 Compilation of segment failed: /rds/bin/padb.1.0.10480/data/exec/227/48844003/de67afa670209cb9cffcd4f6a61e1c32a5b3dccc/0
Not sure what this means.
I encountered a similar issue when creating an external table in Athena using the RegexSerDe row format. I was able to query this external table from Athena without any issues, but when querying it from Redshift the results were null.
I resolved it by converting the data to Parquet format, as Spectrum cannot handle the regex serialization.
See link below:
Redshift spectrum shows NULL values for all rows
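If you need to convert existing data, one option (a sketch, run from Athena, with hypothetical table names and output location) is a CTAS statement that rewrites the RegexSerDe table as Parquet:
CREATE TABLE errors_parquet
WITH (format = 'PARQUET', external_location = 's3://mybucket/errors_parquet/')
AS SELECT * FROM errors_regex;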

Select name from system table and select from this table

I need to dynamically obtain a table name from a system table and perform a select query on that table. Example:
SELECT "schema"+'.'+"table" FROM SVV_TABLE_INFO WHERE "table" LIKE '%blabla%'
It returns my_schema.the_main_blabla_table
And after I get this table name, I need to perform:
SELECT * FROM my_schema.the_main_blabla_table LIMIT 100
Is it possible to do this in a single query?
If you are talking about a select subquery after FROM, you can do that. You will get something like this:
SELECT * FROM
(
SELECT "schema"+'.'+"table" FROM SVV_TABLE_INFO WHERE "table" LIKE '%blabla%'
) AS t
LIMIT 100
Unfortunately, I can't test it on your data, but I'm very interested in the result because I have never done anything like this. If I got your question wrong, please tell me.
Amazon Redshift does not support the ability to take the output of a query and use it as part of another query.
Your application will need to query Redshift to obtain the relevant table name(s), then make another call to Redshift to query that table.
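In other words, the flow ends up as two round trips (a sketch using the names from the question):
-- round trip 1: the application fetches the table name
SELECT "schema" + '.' + "table" FROM SVV_TABLE_INFO WHERE "table" LIKE '%blabla%';
-- round trip 2: the application splices the returned string into a second query
SELECT * FROM my_schema.the_main_blabla_table LIMIT 100;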

SQL Server shortcut for select statement writing

Many times a day I have to write similar queries to get single record:
select t.*
from some_table t
where t.Id = 123456
Maybe there are some shortcuts for retrieving a single record? Like entering the id and table, and SQL Server generates the rest of the code automatically.
In SQL Server Management Studio, go to
Tools -> Options -> Environment -> Keyboard
There you will see the keyboard shortcuts; you can define your own as well as see the standard ones.
You can set a shortcut for a fully executable query like
select * from table where id = 20
but not for a fragment like the one below:
select * from
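One related trick, if I recall the SSMS behavior correctly: query shortcuts append whatever text is highlighted in the editor to the shortcut's statement (this is how the built-in Alt+F1 shortcut runs sp_help on a highlighted object name). So a sketch of a shortcut text like
select * from some_table t where t.Id =
would let you highlight an id such as 123456 and press the shortcut key to execute the combined query; some_table here is a placeholder for your actual table.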

BigQuery: how to convert this legacy SQL to standardSQL?

I have a data import pipeline into BigQuery tables (hourly tables named transactions_20170616_00, transactions_20170616_01, ..., plus more daily/weekly/... rollups) and want to use a single view that always points to the latest one. I found it hard to write one static standardSQL view that points to the latest table; my current solution is to update the view's content to SELECT * FROM project.dataset.transactions_201706.... after every successful import.
Then I read httparchive's latest view: it's exactly what I want, but in legacy SQL. My project uses standardSQL only, and I prefer standardSQL because it's the future. Does anyone know how to convert this legacy SQL to standardSQL? Then I won't need to constantly update my view.
https://bigquery.cloud.google.com/table/httparchive:runs.latest_requests?tab=details
SELECT *
FROM TABLE_QUERY(httparchive:runs,
"table_id IN (
SELECT table_id FROM [httparchive:runs.__TABLES__]
WHERE REGEXP_MATCH(table_id, '2.*requests$')
ORDER BY table_id DESC LIMIT 1)")
Following this guide, I'm trying to use
https://cloud.google.com/bigquery/docs/querying-wildcard-tables#the_table_query_function
#standardSQL
SELECT * FROM `httparchive.runs.*`
WHERE _TABLE_SUFFIX IN
( SELECT table_id
FROM httparchive.runs.__TABLES__
WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
ORDER BY table_id DESC
LIMIT 1)
but the query failed with
Query Failed
Error: Views cannot be queried through prefix. Matched views are: httparchive:runs.latest_pages, httparchive:runs.latest_pages_mobile, httparchive:runs.latest_requests, httparchive:runs.latest_requests_mobile
Job ID: bidder-1183:bquijob_1400109e_15cb1dc3c0c
I found the wildcard can only be used as the last character. In that case, why doesn't SELECT * FROM httparchive.runs.*_requests WHERE ... work?
Is this saying the wildcard tables feature in standardSQL isn't as flexible as TABLE_QUERY in legacySQL?
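Correct: in standardSQL the wildcard must be the trailing character of the table name, so *_requests isn't allowed. One workaround (a sketch, untested against the live dataset) is to anchor the prefix at 2 so that the latest_* views are never matched, and strip that prefix when comparing against _TABLE_SUFFIX:
#standardSQL
SELECT *
FROM `httparchive.runs.2*`
WHERE _TABLE_SUFFIX IN (
  SELECT SUBSTR(table_id, 2)
  FROM `httparchive.runs.__TABLES__`
  WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
  ORDER BY table_id DESC
  LIMIT 1)
Since _TABLE_SUFFIX is everything after the 2 prefix, SUBSTR(table_id, 2) makes the two sides comparable.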

BigQuery query creation without variables?

Coming from SQL Server and a little bit of MySQL, I'm not sure how to proceed on google's BigQuery web browser query tool.
There doesn't appear to be any way to create, use or Set/Declare variables. How are folks working around this? Or perhaps I have missed something obvious in the instructions or the nature of BigQuery? Java API?
It is now possible to declare and set variables using SQL. For more information, see the documentation, but here is an example:
-- Declare a variable to hold names as an array.
DECLARE top_names ARRAY<STRING>;
-- Build an array of the top 100 names from the year 2017.
SET top_names = (
SELECT ARRAY_AGG(name ORDER BY number DESC LIMIT 100)
FROM `bigquery-public-data`.usa_names.usa_1910_current
WHERE year = 2017
);
-- Which names appear as words in Shakespeare's plays?
SELECT
name AS shakespeare_name
FROM UNNEST(top_names) AS name
WHERE name IN (
SELECT word
FROM `bigquery-public-data`.samples.shakespeare
);
There is currently no way to set/declare variables in BigQuery. If you need variables, you'll need to cut and paste them where you need them. Feel free to file this as a feature request here.
It's not elegant, and it's a pain, but...
The way we handle it is with a python script that replaces a "variable placeholder" in our query and then sends the amended query via the API.
I have opened a feature request asking for "Dynamic SQL" capabilities.
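For illustration, such a template might look like this (a sketch; the {{start_date}} placeholder and the table name are made up, and the script simply string-replaces the placeholder before submitting the query):
-- query template stored by the application
SELECT *
FROM `myproject.mydataset.events`
WHERE event_date >= '{{start_date}}'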
If you want to avoid BQ scripting, you can sometimes use an idiom which utilizes WITH and CROSS JOIN.
In the example below:
the events table contains some timestamped events
the reports table contains occasional aggregate values of the events
the goal is to write a query that only generates incremental (non-duplicate) aggregate rows
This is achieved by
introducing a state temp table that looks at a target table for aggregate results
to determine parameters (params) for the actual query
the params are CROSS JOINed with the actual query
allowing the param row's columns to be used to constrain the query
this query will repeatably return the same results
until the results themselves are appended to the reports table
WITH state AS (
  SELECT
    -- what was the newest report's ending time?
    COALESCE(
      (SELECT MAX(report_end_ts) FROM `x.y.reports`),
      TIMESTAMP("2019-01-01")
    ) AS latest_report_ts,
    ...
),
params AS (
  SELECT
    -- look for events since end of last report
    latest_report_ts AS event_after_ts,
    -- and go until now
    CURRENT_TIMESTAMP() AS event_before_ts
  FROM state
)
SELECT
  MIN(event_ts) AS report_begin_ts,
  MAX(event_ts) AS report_end_ts,
  COUNT(1) AS event_count,
  SUM(errors) AS error_total
FROM `x.y.events`
CROSS JOIN params
WHERE event_ts > event_after_ts
  AND event_ts < event_before_ts
This approach is useful for BigQuery scheduled queries.