add table_id to the result from multiple tables in BigQuery - google-bigquery

Below is how my data is structured in BigQuery.
test
-> sales
-> monthly-2015
-> monthly-2016
-> ...
I want to combine the data from all tables whose names match monthly-*. Below is the SQL I wrote, based on examples I found.
Running it fails with the error "Scalar subquery produced more than one element". How can I fix it?
SELECT
*,
(
SELECT
table_id
FROM
`test.sales.__TABLES_SUMMARY__`
WHERE
table_id LIKE 'monthly-%')
FROM
`test.sales.monthly*`

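The scalar subquery fails because it returns one row for every table matching the pattern rather than a single value. To combine all the monthly tables and label each row with its source table, use the _TABLE_SUFFIX pseudo-column of the wildcard table instead: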
SELECT *, 'monthly-' || _TABLE_SUFFIX AS table_name
FROM `test.sales.monthly-*`
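To see why a subquery over the table list cannot be used as a scalar, while the wildcard suffix works per row, here is a small Python sketch (not BigQuery code). It uses the table names from the question; the helper `table_suffix` and the extra table `yearly-2016` are hypothetical and exist only for illustration.

```python
# Table names from the question, plus one made-up non-matching table.
TABLE_IDS = ["monthly-2015", "monthly-2016", "yearly-2016"]
PREFIX = "monthly-"  # the part of the wildcard that precedes the *


def table_suffix(table_id, prefix):
    """Return the part of the name the * matched, or None if not covered."""
    if not table_id.startswith(prefix):
        return None
    return table_id[len(prefix):]


# A lookup over the table list yields one value per matching table,
# so it cannot stand in for a single (scalar) value.
subquery_rows = [t for t in TABLE_IDS if t.startswith(PREFIX)]

# Prefix plus suffix rebuilds the source table name for each covered table.
rebuilt = [PREFIX + table_suffix(t, PREFIX) for t in subquery_rows]

print(len(subquery_rows))
print(rebuilt)
```

Each row read through a wildcard table carries its own suffix, which is why concatenating the prefix and the suffix gives the right table name row by row.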

Related

Filter data in Spark SQL in Databricks

When I run the command below in Databricks, I get output with three columns: name, id and age. But when I try to filter on name, I get an error saying that the Name column does not exist. What am I doing wrong?
%sql
SELECT inline(environment.details) FROM TableA
This correctly gives me a table with 3 columns.
Now I filter on Name like this:
%sql
SELECT inline(environment.details) FROM TableA where `Name` == "XYZ"
and I get an error saying that Name does not exist. What is wrong here? Also, can someone tell me how to export the resulting output?
Thanks
Filtering happens before you expand your array of structs: the WHERE clause is evaluated before inline() in the SELECT list produces the name column. You have two choices here:
Use a common table expression to explode first and then filter:
with exploded as (
SELECT inline(environment.details) FROM TableA
)
SELECT * from exploded where name = ....
Use the filter function to filter the data inside the array, with something like the following (not tested); note that it may require doing the filtering twice:
SELECT inline(filter(environment.details, x -> x.Name = 'XYZ'))
FROM TableA
WHERE array_size(filter(environment.details, x -> x.Name = 'XYZ')) > 0
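The two options above can be compared with a plain Python model of the data (this is not Spark code). The sample rows and both function names are made up for illustration; only the nesting `environment.details` and the columns name, id and age come from the question.

```python
# Hypothetical sample rows shaped like TableA in the question.
rows = [
    {"environment": {"details": [
        {"name": "XYZ", "id": 1, "age": 30},
        {"name": "ABC", "id": 2, "age": 41},
    ]}},
    {"environment": {"details": [
        {"name": "DEF", "id": 3, "age": 25},
    ]}},
]


def explode_then_filter(rows, name):
    """Option 1: expand every struct into its own row, then filter."""
    exploded = [d for r in rows for d in r["environment"]["details"]]
    return [d for d in exploded if d["name"] == name]


def filter_then_explode(rows, name):
    """Option 2: filter inside each array, skip rows left empty, then expand."""
    out = []
    for r in rows:
        kept = [d for d in r["environment"]["details"] if d["name"] == name]
        if kept:  # plays the role of the array_size(...) > 0 guard
            out.extend(kept)
    return out


print(explode_then_filter(rows, "XYZ"))
print(filter_then_explode(rows, "XYZ"))
```

Both functions return the same rows; the difference is only in whether the filter runs on the expanded rows or on the arrays before expansion.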

SHOW COLUMNS in a CTE returns an error - why?

I have a show columns query that works fine:
SHOW COLUMNS IN table
but it fails when trying to put it in a CTE, like this:
WITH columns_table AS (
SHOW COLUMNS IN table
)
SELECT * from columns_table
Any ideas why, and how to fix it?
Using RESULT_SCAN:
Returns the result set of a previous command (within 24 hours of when you executed the query) as if the result was a table. This is particularly useful if you want to process the output from any of the following:
SHOW or DESC[RIBE] command that you executed.
SHOW COLUMNS IN ...;
WITH columns_table AS (
SELECT *
FROM table(RESULT_SCAN(LAST_QUERY_ID()))
)
SELECT *
FROM columns_table;
A CTE requires a SELECT statement, so SHOW COLUMNS cannot be used inside one. As an alternative, use INFORMATION_SCHEMA to retrieve the metadata, like below:
WITH columns_table AS (
Select * from INTL_DB.INFORMATION_SCHEMA.COLUMNS where TABLE_NAME='CURRENCIES'
)
SELECT * from columns_table;

Using pgp_sym_decrypt with a select query

I want to select all data from a table in a PostgreSQL database (all data in the table is encrypted), but I can't get all of it. The query only works with a condition. This query works:
select pgp_sym_decrypt(name::bytea,'code'), pgp_sym_decrypt(surname::bytea,'code'), pgp_sym_decrypt(context::bytea,'code') from schema.table_name where id=1;
But I want to use this query:
select pgp_sym_decrypt(name::bytea,'code'), pgp_sym_decrypt(surname::bytea,'code'), pgp_sym_decrypt(context::bytea,'code') from schema.table_name;
How can I get all the data?

Hive: read table partitions defined in subselect

I have a Hive table which is partitioned by the partitionDate field.
I can read a partition of my choice with a simple query:
select * from myTable where partitionDate = '2000-01-01'
My task is to specify the partition of my choice dynamically. That is, I first want to read it from some other table, and only then run the select on myTable. And of course, I want partition pruning to be used.
I have written a query which looks like this:
select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate
The query works, but it looks like partitions are not used: it takes too long to run.
I tried another approach:
select * from myTable where partitionDate in (select reportDate from thatTable)
and again I see that the query runs too slowly.
Is there a way to implement this in Hive?
Update: the create table statement for myTable:
CREATE TABLE `myTable`(
`theDate` string,
')
PARTITIONED BY (
`partitionDate` string)
TBLPROPERTIES (
'DO_NOT_UPDATE_STATS'='true',
'STATS_GENERATED_VIA_STATS_TASK'='true',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='2',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...
'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',
'spark.sql.sources.schema.partCol.0'='partitionDate')
If you are running Hive on Tez execution engine, try
set hive.tez.dynamic.partition.pruning=true;
More details and related configuration are in the Jira issue HIVE-7826.
At the same time, try rewriting the query as a LEFT SEMI JOIN:
select *
from myTable t
left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate
If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344
Or this one: https://stackoverflow.com/a/53279839/2700344
Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins
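Another commonly used approach, not necessarily the one described in the linked answers, is to run two steps from a client script: first read the reportDate values, then put them into the second query as literals so that the partition filter is known when the query is compiled. A minimal Python sketch of the query-building step follows; the function name `build_partition_query` is hypothetical, and running the queries against Hive is left out.

```python
from datetime import datetime


def build_partition_query(report_dates):
    """Build a select on myTable with the partition values as literals.

    Each value must parse as a yyyy-mm-dd date; anything else raises
    ValueError, so arbitrary strings never reach the generated SQL.
    """
    normalized = sorted(
        {datetime.strptime(d, "%Y-%m-%d").date().isoformat() for d in report_dates}
    )
    if not normalized:
        raise ValueError("no partition dates given")
    literals = ", ".join(f"'{d}'" for d in normalized)
    return f"select * from myTable where partitionDate in ({literals})"


# Values as they might come back from: select distinct reportDate from thatTable
query = build_partition_query(["2000-01-02", "2000-01-01", "2000-01-01"])
print(query)
```

The generated statement has the same shape as the simple query from the question that filters on a literal partitionDate value, only with an IN list instead of a single value.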

select latest Table in a Big Query Dataset - Standard SQL syntax

I have a dataset containing multiple tables with similar names, e.g.
affilinet_4221_first_20180911_204956
affilinet_4221_first_20180911_160004
affilinet_4221_first_20180911_085559
affilinet_4221_first_20180910_201323
affilinet_4221_first_20180910_201042
affilinet_4221_first_20180910_080006
affilinet_4221_first_20180909_160707
This query identifies the latest table (according to the yyyymmdd_hhmmss naming convention) using the __TABLES_SUMMARY__ method:
SELECT max(table_id) as table_id FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
where table_id LIKE "affilinet_4221_first_%"
This query extracts all values from a specific table using the _TABLE_SUFFIX method:
SELECT * FROM `modemutti-8d8a6.feed_first.*`
WHERE _TABLE_SUFFIX = "affilinet_4221_first_20180911_204956"
This query combines __TABLES_SUMMARY__ (which returns affilinet_4221_first_20180911_204956) and _TABLE_SUFFIX
SELECT * FROM `modemutti-8d8a6.feed_first.*`
WHERE _TABLE_SUFFIX = (
SELECT max(table_id) FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
where table_id LIKE "affilinet_4221_first_%")
This query fails:
Error: Cannot read field 'modemio_cat_level' of type INT64 as STRING
Any idea why this is happening or how I could solve the issue?
------------EDIT------------
@Mikhail's solution works correctly but processes a huge amount of data (compare the explicit call with the suggested method). Another solution would have been:
SELECT * FROM `modemutti-8d8a6.feed_first.affilinet_4221_first_*` WHERE _TABLE_SUFFIX =
(
SELECT MAX(_TABLE_SUFFIX) FROM `modemutti-8d8a6.feed_first.affilinet_4221_first_*`
)
but this also processes a much bigger amount of data than the explicit query. Is there a way to achieve this through a view in the UI, or should I rather use the Python / Java SDK via the API?
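Try the query below. It narrows the wildcard to the affilinet_4221_first_ prefix, so tables elsewhere in the dataset with a different schema are no longer matched (presumably the cause of the type error), and it compares _TABLE_SUFFIX with only the part of the table name after that prefix: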
#standardSQL
SELECT * FROM `modemutti-8d8a6.feed_first.affilinet_4221_first_*`
WHERE _TABLE_SUFFIX = (
SELECT REPLACE(MAX(table_id), 'affilinet_4221_first_', '')
FROM `modemutti-8d8a6.feed_first.__TABLES_SUMMARY__`
WHERE table_id LIKE "affilinet_4221_first_%"
)
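On the question of doing this from the Python SDK instead: one option is to pick the latest table name in client code and then query that table explicitly, which is the cheap "explicit call" from the question. The sketch below covers only the name selection and query building; the helper names `latest_table_id` and `build_query` are hypothetical, and the table list is copied from the question rather than fetched from BigQuery.

```python
def latest_table_id(table_ids, prefix):
    """Return the greatest table id that starts with prefix.

    With a yyyymmdd_hhmmss suffix, string order matches chronological order.
    """
    matching = [t for t in table_ids if t.startswith(prefix)]
    if not matching:
        raise ValueError(f"no table starts with {prefix!r}")
    return max(matching)


def build_query(project, dataset, table_id):
    """Build an explicit single-table query (no wildcard)."""
    return f"SELECT * FROM `{project}.{dataset}.{table_id}`"


# Table names as listed in the question.
table_ids = [
    "affilinet_4221_first_20180911_204956",
    "affilinet_4221_first_20180911_160004",
    "affilinet_4221_first_20180911_085559",
    "affilinet_4221_first_20180910_201323",
    "affilinet_4221_first_20180910_201042",
    "affilinet_4221_first_20180910_080006",
    "affilinet_4221_first_20180909_160707",
]

latest = latest_table_id(table_ids, "affilinet_4221_first_")
sql = build_query("modemutti-8d8a6", "feed_first", latest)
print(sql)
```

With the google-cloud-bigquery client, the ids could come from `Client.list_tables` and the SQL could be passed to `Client.query`; that wiring is not shown here and has not been tested against this dataset.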