BigQuery: Querying multiple datasets and tables using Standard SQL

I have Google Analytics data that's spread across multiple BigQuery datasets, all using the same schema. I would like to query multiple tables in each of these datasets at the same time using BigQuery's new Standard SQL dialect. I know I can query multiple tables within a single dataset like so:
FROM `12345678`.`ga_sessions_2016*` s
WHERE s._TABLE_SUFFIX BETWEEN '0501' AND '0720'
What I can't figure out is how to query against not just 12345678 but also against 23456789 at the same time.

How about a simple UNION ALL with a SELECT wrapped around it? (I tested this with the new Standard SQL option and it worked as expected.)
SELECT
SUM(foo)
FROM (
SELECT
COUNT(*) AS foo
FROM
<YOUR_DATASET_1>.<YOUR_TABLE_1>
UNION ALL
SELECT
COUNT(*) AS foo
FROM
<YOUR_DATASET_2>.<YOUR_TABLE_2>)
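When there are many datasets, the UNION ALL branches can be generated rather than typed by hand. The following is a minimal sketch, assuming every dataset holds tables with the same prefix and schema; the function name and its parameters are hypothetical, and identifiers are interpolated directly, so it should only be fed trusted values.

```python
def build_cross_dataset_query(datasets, table_prefix, suffix_start, suffix_end):
    """Return Standard SQL that unions one wildcard table across several datasets.

    All names here are illustrative. Values are interpolated as-is,
    so pass only trusted dataset names and suffixes.
    """
    if not datasets:
        raise ValueError("at least one dataset is required")
    branches = []
    for dataset in datasets:
        branches.append(
            f"SELECT * FROM `{dataset}.{table_prefix}*`\n"
            f"WHERE _TABLE_SUFFIX BETWEEN '{suffix_start}' AND '{suffix_end}'"
        )
    # UNION ALL keeps every row; columns are matched by position, not by name.
    return "\nUNION ALL\n".join(branches)


query = build_cross_dataset_query(
    ["12345678", "23456789"], "ga_sessions_2016", "0501", "0720"
)
print(query)
```

The generated text can be pasted into the query editor or wrapped in an outer SELECT, as in the answer above.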

I believe a table wildcard function combined with a union will get you what you need quickly, provided the tables have the same schema. Note that this is legacy SQL, where a comma between tables or subqueries acts as UNION ALL.
select *
from
(select * from table_date_range([dataset1], date1, date2)),
(select * from table_date_range([dataset2], date3, date4)),
......

Related

How to use multiple count distinct in the same query with other columns in Druid SQL?

I'm trying to use three projections in the same query, like below, in a Druid environment:
select
__time,
count(distinct col1),
count(distinct case when (condition1 and condition2) then concat(col2, TIME_FORMAT(__time)) else 0 end)
from table
where condition3
GROUP BY __time
But instead I get an error saying: Unknown exception / Cannot build plan for query
It seems to work perfectly fine when I put just one count(distinct) in the query.
How can this be resolved?
As a workaround, you can do multiple subqueries and join them. Something like:
SELECT x.__time, x.delete_cnt, y.added_cnt
FROM
(
SELECT FLOOR(__time to HOUR) __time, count(distinct deleted) delete_cnt
FROM wikipedia
GROUP BY 1
)x
JOIN
(
SELECT FLOOR(__time to HOUR) __time, count( distinct added) added_cnt
FROM wikipedia
GROUP BY 1
)y ON x.__time = y.__time
As the Druid documentation points out:
COUNT(DISTINCT expr) Counts distinct values of expr, which can be string, numeric, or hyperUnique. By default this is approximate, using a variant of HyperLogLog. To get exact counts set "useApproximateCountDistinct" to "false". If you do this, expr must be string or numeric, since exact counts are not possible using hyperUnique columns. See also APPROX_COUNT_DISTINCT(expr). In exact mode, only one distinct count per query is permitted.
So this is a Druid limitation: you either need to disable exact mode, or else limit yourself to one distinct count per query.
On a side note, other databases typically do not have this limitation. Apache Druid is designed for high-performance real-time analytics, and as a result its implementation of SQL has some restrictions. Internally, Druid uses a JSON-based query language. The SQL interface is powered by a parser and planner based on Apache Calcite, which translates SQL into native Druid queries.
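The flag named in the documentation excerpt is a query context parameter. Below is a minimal sketch of a request body that keeps approximate mode on, so that two distinct counts can share one query. It assumes the query is sent to Druid's SQL HTTP endpoint as JSON with "query" and "context" keys; the helper name is hypothetical, and the exact payload shape should be checked against your Druid version.

```python
import json


def build_druid_sql_request(sql, approximate=True):
    """Serialize a Druid SQL request body (hypothetical helper).

    Assumes the SQL endpoint accepts {"query": ..., "context": {...}}.
    """
    body = {
        "query": sql,
        # Approximate mode allows more than one COUNT(DISTINCT) per query.
        "context": {"useApproximateCountDistinct": approximate},
    }
    return json.dumps(body)


sql = (
    "SELECT FLOOR(__time TO HOUR) AS hour_bucket, "
    "COUNT(DISTINCT added) AS added_cnt, "
    "COUNT(DISTINCT deleted) AS delete_cnt "
    "FROM wikipedia GROUP BY 1"
)
payload = build_druid_sql_request(sql)
print(payload)
```

If exact counts are required, the subquery-and-join workaround above remains the option to use.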

How to query many tables in one shot in Hive?

I have to query and then "union" many tables. I did it manually in Hive, but I'm wondering if there's a shorter way to do it.
We have tables for each month, so instead of doing this for a whole year:
create table t_2019 as
select * from
(select * from t_jan where...
union all
select * from t_feb where...
union all
select * from t_mar where...);
Does Hive (or any kind of SQL) allow looping through tables? I've seen FOR loop and WHILE examples in T-SQL, but they run individual queries. In this case I want to union the tables.
#t_list = ('t_jan', 't_feb', 't_mar'...etc)
Then, how do I query each table in #t_list and "union all" them? Each month has about 800k rows, so it's big, but Hive can handle it.
You can solve this problem with a partitioned Hive table instead of multiple tables.
Ex: table_whole pointing to hdfs path hdfs://path/to/whole/ with partitions on Year and Month
Now you can query to get data from all months in 2019 using
select * from table_whole where year = '2019'
If you need data from just one month, say January 2019, you can filter on that partition:
select * from table_whole where year = '2019' and month='JAN'

How to query multiple tables using wildcard for a particular partition in standard SQL of Big Query

I am trying to query multiple tables in BigQuery using a wildcard (the tables have suffixes _0 through _9).
This query for a specific table works:
SELECT
count(*)
FROM `maw_qa.rt_content_secondly_0`
where _PARTITIONTIME = timestamp('2017-01-24');
But this doesn't:
SELECT
count(*)
FROM `maw_qa.rt_content_secondly_*`
where _PARTITIONTIME = timestamp('2017-01-24');
Error:
Query Failed
Error: Unrecognized name: _PARTITIONTIME at [5:7]
I am using standard SQL. Legacy SQL does not even take wildcard * in the query.
What is the way to do this correctly?
It looks like table wildcards and _PARTITIONTIME do not work together in a query.
Try the query below. It is written in BigQuery Legacy SQL because that version is less verbose.
It assumes you have 4 tables; if you have more, you need to list all of them here.
SELECT COUNT(*)
FROM
[maw_qa.rt_content_secondly_0],
[maw_qa.rt_content_secondly_1],
[maw_qa.rt_content_secondly_2],
[maw_qa.rt_content_secondly_3]
WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24')
Something similar can of course be written in BigQuery Standard SQL, but it requires more typing with UNION ALL, etc.
In Standard SQL it can look like this:
SELECT COUNT(*) FROM (
SELECT * FROM `maw_qa.rt_content_secondly_0` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24') UNION ALL
SELECT * FROM `maw_qa.rt_content_secondly_1` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24') UNION ALL
SELECT * FROM `maw_qa.rt_content_secondly_2` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24') UNION ALL
SELECT * FROM `maw_qa.rt_content_secondly_3` WHERE _PARTITIONTIME = TIMESTAMP('2017-01-24')
)
When you query a partitioned table, you don't need to use the _* syntax, which is reserved for table wildcards (where you filter on _TABLE_SUFFIX). In your case, you should just do:
SELECT
count(*)
FROM `maw_qa.rt_content_secondly`
where _PARTITIONTIME = '2017-01-24';

T-SQL Query to SELECT rows with same values of several columns (Azure SQL Database)

I need help writing a T-SQL query on the table shown in the picture below. The table has ambiguous info about buildings: some of them appear more than once, which is wrong. I need to select only the rows that share the same street and building values, so that I can then delete the bad rows manually. So I want to select rows 1, 2, 4, 5 in the picture below. I use an Azure SQL Database, which has some limitations on T-SQL.
I'm pretty sure Azure supports subqueries and window functions. So, try this:
select t.*
from (select t.*, count(*) over (partition by street, building) as cnt
from table t
) t
where cnt > 1;
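The window-function approach can be tried locally before running it against the real table. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Azure SQL Database; the `buildings` table and its sample rows are invented for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buildings (id INTEGER, street TEXT, building TEXT)")
conn.executemany(
    "INSERT INTO buildings VALUES (?, ?, ?)",
    [
        (1, "Main St", "10"),
        (2, "Main St", "10"),  # same street and building as row 1
        (3, "Oak Ave", "5"),   # unique, should not be returned
        (4, "Elm Rd", "7"),
        (5, "Elm Rd", "7"),    # same street and building as row 4
    ],
)

# Count rows per (street, building) group and keep groups larger than one.
duplicates = conn.execute(
    """
    SELECT id, street, building
    FROM (
        SELECT b.*, COUNT(*) OVER (PARTITION BY street, building) AS cnt
        FROM buildings AS b
    ) AS counted
    WHERE cnt > 1
    ORDER BY id
    """
).fetchall()
print(duplicates)
```

With this sample data the query returns rows 1, 2, 4 and 5, and leaves the unique row 3 out.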

Support UNION function in BigQuery SQL

BigQuery does not seem to have support for UNION yet:
https://developers.google.com/bigquery/docs/query-reference
(I don't mean unioning tables together for the source. It has that.)
Is it coming soon?
If you want UNION so that you can combine query results, you can use subselects
in BigQuery:
SELECT foo, bar
FROM
(SELECT integer(id) AS foo, string(title) AS bar
FROM publicdata:samples.wikipedia limit 10),
(SELECT integer(year) AS foo, string(state) AS bar
FROM publicdata:samples.natality limit 10);
This is almost exactly equivalent to the SQL
SELECT id AS foo, title AS bar
FROM publicdata:samples.wikipedia limit 10
UNION ALL
SELECT year AS foo, state AS bar
FROM publicdata:samples.natality limit 10;
(Note that if you want SQL UNION rather than UNION ALL, this won't work.)
Alternately, you could run two queries and append the result.
BigQuery recently added support for Standard SQL, including the UNION operation.
When submitting a query through the web UI, just make sure to uncheck "Use Legacy SQL" under the SQL Version rubric.
In legacy SQL you can always do:
SELECT * FROM (query 1), (query 2);
It does the same thing as:
SELECT * FROM (query 1) UNION ALL SELECT * FROM (query 2);
Note that, if you're using standard SQL, the comma operator now means JOIN - you have to use the UNION syntax if you want a union:
In legacy SQL, the comma operator , has the non-standard meaning of UNION ALL when applied to tables. In standard SQL, the comma operator has the standard meaning of JOIN.
For example:
#standardSQL
SELECT
column_name,
count(*)
from
(SELECT * FROM me.table1 UNION ALL SELECT * FROM me.table2)
group by 1
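The difference between the two meanings of the comma is easy to check in any engine that follows the standard behaviour. A minimal sketch with Python's built-in sqlite3 module, using SQLite as a stand-in for BigQuery Standard SQL; the two tables and their rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (column_name TEXT);
    CREATE TABLE table2 (column_name TEXT);
    INSERT INTO table1 VALUES ('a'), ('b');
    INSERT INTO table2 VALUES ('b'), ('c'), ('d');
    """
)

# Comma between tables: a cross join, so every row pairs with every other row.
cross_join_rows = conn.execute(
    "SELECT COUNT(*) FROM table1, table2"
).fetchone()[0]

# UNION ALL: rows of both tables stacked on top of each other.
union_all_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT * FROM table1 UNION ALL SELECT * FROM table2)"
).fetchone()[0]

print(cross_join_rows, union_all_rows)  # 6 5
```

Two rows and three rows give six pairs under the comma join but only five rows under UNION ALL, which is why a legacy query that relied on the comma has to be rewritten when switching dialects.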
This helped me a lot when doing an INTERSECT with BigQuery's Standard SQL.
#standardSQL
WITH
a AS (
SELECT
*
FROM
table_a),
b AS (
SELECT
*
FROM
table_b)
SELECT
*
FROM
a INTERSECT DISTINCT
SELECT
*
FROM
b
I adapted this example from: https://gist.github.com/yancya/bf38d1b60edf972140492e3efd0955d0
Unions are indeed supported. An excerpt from the link that you posted:
Note: Unlike many other SQL-based systems, BigQuery uses the comma syntax to indicate table unions, not joins. This means you can run a query over several tables with compatible schemas as follows:
// Find suspicious activity over several days
SELECT FORMAT_UTC_USEC(event.timestamp_in_usec) AS time, request_url
FROM [applogs.events_20120501], [applogs.events_20120502], [applogs.events_20120503]
WHERE event.username = 'root' AND NOT event.source_ip.is_internal;