BigQuery - Create a table from results of a query that uses complex CTEs? - sql

I have a multi-CTE query over large underlying datasets that gets run very frequently. I could just create a table of the results of that query for people to use instead, and refresh it daily. But I'm lost on the syntax to create such a table.
CREATE OR REPLACE TABLE dataset.target_table
AS
with cte_one as (
    select stuff
    from big.table
),
...
cte_five as (
    select stuff
    from other_big.table
),
final as (
    select *
    from cte_five
    left join cte_x on cte_five.id = cte_x.id
)
SELECT *
FROM final
is basically what I have. This actually creates the target table, even with the right schema, but doesn't insert any rows... Any hints? Thanks

If you really want to do this in one step, you can just do SELECT INTO... (note that SELECT INTO is T-SQL syntax; BigQuery doesn't support it, and there the CREATE OR REPLACE TABLE ... AS statement you already have is the one-step form, in which case an empty result suggests the CTE query itself returned no rows)
with cte_one as (
    select stuff
    from big.table
),
...
cte_five as (
    select stuff
    from other_big.table
),
final as (
    select *
    from cte_five
    left join cte_x on cte_five.id = cte_x.id
)
SELECT *
INTO dataset.target_table
FROM final
That said, since this isn't just a one-off need, I recommend creating the landing table once initially and then scheduling a daily flush and fill (TRUNCATE + INSERT) to refresh the data. It gives you more explicit control over the data types and also lets you work with a persistent object rather than something rebuilt from scratch daily.
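A minimal sketch of that flush-and-fill pattern in BigQuery, assuming made-up column names (id, stuff) since the real schema isn't shown:
-- One-time setup: create the landing table with explicit types
-- (column names here are placeholders, not the real schema).
CREATE TABLE IF NOT EXISTS dataset.target_table (
    id    INT64,
    stuff STRING
);

-- Scheduled daily: flush...
TRUNCATE TABLE dataset.target_table;

-- ...then fill from the CTE query.
INSERT INTO dataset.target_table (id, stuff)
WITH cte_one AS (
    SELECT id, stuff
    FROM big.table
)
SELECT id, stuff
FROM cte_one;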

Related

Abbreviate a list in PostgreSQL

How can I abbreviate a list so that
WHERE id IN ('8893171511',
'8891227609',
'8884577292',
'886790275X',
.
.
.)
becomes
WHERE id IN (name of a group/list)
The list really would have to appear somewhere. From the point of view of your code being maintainable and reusable, you could represent the list in a CTE:
WITH id_list AS (
    SELECT '8893171511' AS id UNION ALL
    SELECT '8891227609' UNION ALL
    SELECT '8884577292' UNION ALL
    SELECT '886790275X'
)
SELECT *
FROM yourTable
WHERE id IN (SELECT id FROM id_list);
If you have a persistent need to do this, then maybe the CTE should become a bona fide table somewhere in your database.
Edit: Using the Horse's suggestion, we can tidy up the CTE to the following:
WITH id_list (id) AS (
    VALUES
        ('8893171511'),
        ('8891227609'),
        ('8884577292'),
        ('886790275X')
)
If the list is large, I would create a temporary table and store the list there.
That way you can ANALYZE the temporary table and get accurate estimates.
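A minimal sketch of that approach (table and column names are illustrative):
-- The temp table lives for the session.
CREATE TEMPORARY TABLE id_list (id text PRIMARY KEY);

INSERT INTO id_list (id) VALUES
    ('8893171511'),
    ('8891227609'),
    ('8884577292'),
    ('886790275X');

-- Give the planner accurate statistics for the new table.
ANALYZE id_list;

SELECT *
FROM yourTable
WHERE id IN (SELECT id FROM id_list);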
The temp table and CTE answers suggested will do.
Just wanted to bring another approach that will work if you use pgAdmin for querying (not sure about Workbench) and represent your data in a "stringy" way.
set setting.my_ids = '8893171511,8891227609';
select current_setting('setting.my_ids');
drop table if exists t;
create table t ( x text);
insert into t select 'some value';
insert into t select '8891227609';
select *
from t
where x = any( string_to_array(current_setting('setting.my_ids'), ',')::text[]);

Trying to create multiple temporary tables in a single query

I'd like to create 3-4 separate temporary tables within a single BigQuery query (all of the tables are based on different data sources) and then join them in various ways later on in the query.
I'm trying to do this by using multiple WITH statements, but it appears that you're only able to use one WITH statement in a query if you're not nesting them. Each time I've tried, I get an error saying that a 'SELECT' statement is expected.
Am I missing something? I'd prefer to do this all in one query if at all possible.
I don't know what you mean by "temporary tables", but I suspect you mean common table expressions (CTEs).
Of course you can have a query with multiple CTEs. You just need the right syntax:
with t1 as (
    select . . .
),
t2 as (
    select . . .
),
t3 as (
    select . . .
)
select *
from t1 cross join t2 cross join t3;
bq mk --table --expiration [INTEGER] --description "[DESCRIPTION]" \
    --label [KEY:VALUE, KEY:VALUE] [PROJECT_ID]:[DATASET].[TABLE]
An expiration will make your table effectively temporary. You can only create one table at a time with "bq mk", but you can use it in a script.
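For example, a scratch table that expires an hour (3600 seconds) after creation; the project, dataset, and table names here are made up:
bq mk --table --expiration 3600 --description "Scratch table" \
    myproject:mydataset.scratch_table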
You can use the DDL, but here too you can only create one table at a time.
{CREATE TABLE | CREATE TABLE IF NOT EXISTS | CREATE OR REPLACE TABLE}
table_name [( column_name column_schema[, ...] )]
[PARTITION BY partition_expression] [OPTIONS(options_clause[, ...])]
[AS query_statement]
If by "temporary table" you meant "subqueries" this is the syntax you have to use:
WITH
subQ1 AS (
SELECT * FROM Roster
WHERE SchoolID = 52),
subQ2 AS (
SELECT SchoolID FROM subQ1)
SELECT DISTINCT * FROM subQ2;
You should be able to use common table expressions without issue. However, if you have large queries with a significant number of CTEs/subqueries, you may run into resource limits in BigQuery, specifically around its ability to build an execution plan. Temporary tables have helped me in these scenarios, but there are likely better practices.
CREATE TEMP TABLE T1 AS (
    WITH
    X as (query),
    Y as (query),
    Z as (query)
    SELECT * FROM (combination_of_above_queries)
);

CREATE TEMP TABLE T2 AS (
    WITH
    A as (query),
    B as (query),
    C as (query)
    SELECT * FROM (combination_of_above_queries)
);

CREATE TABLE new_table AS (
    SELECT *
    FROM (combination of T1 and T2)
);
Apologies, I just happened to come across this question and wanted to share what helped me... hope it is helpful for you.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#temporary_tables

sql temporary tables in rstudio notebook's sql chunks?

I am trying to use temp tables in a SQL code chunk in RStudio.
An example: when I select one table and return it into an R object, things seem to be working:
```{sql , output.var="x", connection='db' }
SELECT count(*) n
FROM origindb
```
When I try anything with temp tables, it seems like the commands run but return an empty R data.frame:
```{sql , output.var="x", connection='db' }
SELECT count(*) n
INTO #whatever
FROM origindb
SELECT *
FROM #whatever
```
My impression is that the RStudio notebook SQL chunks are just set up to make one single query. So my temporary solution is to create the tables in a stored procedure in the database; then I can get the results I want with something simple. I would prefer to have a bit more flexibility in the SQL code chunks.
My db connection looks like this:
```{r,echo=F}
db <- DBI::dbConnect(odbc::odbc(),
driver = "SQL Server",
server = 'sql',
database = 'databasename')
```
Like this question, it will work if you put
set nocount on
at the top of your chunk. R seems to get confused when it's handed back the rowcount for the temp table.
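For example, adapting the chunk from the question (an untested sketch):
```{sql , output.var="x", connection='db' }
set nocount on
SELECT count(*) n
INTO #whatever
FROM origindb
SELECT *
FROM #whatever
```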
I accomplished my goal using CTEs. As long as you define your CTEs in the order that they will be used, it works. It is just like using temp tables, with one big exception: the CTEs are gone after the query finishes, whereas temp tables exist until your SPID is killed (typically via a disconnect).
WITH CTE_WHATEVER AS (
SELECT COUNT(*) n
FROM origindb
)
SELECT *
FROM CTE_WHATEVER
The same works for the multiple-temp-table case:
WITH CTE1 AS (
    SELECT
        STATE
        ,COUNTY
        ,COUNT(*) n
    FROM origindb
    GROUP BY
        STATE
        ,COUNTY
),
CTE2 AS (
    SELECT
        STATE
        ,AVG(n) AS COUNTY_AVG
    FROM CTE1
    GROUP BY
        STATE
)
SELECT *
FROM CTE2
WHERE COUNTY_AVG > 1000000
I hope this helps.
You could manage a transaction within the SQL chunk by defining BEGIN and COMMIT clauses. For example:
BEGIN ;
CREATE TABLE foo (id varchar) ;
COMMENT ON TABLE foo IS 'Foo';
COMMIT ;

SQL Server query with extremely large IN clause results in numerous queries in activity monitor

SQL Server 2014 database. Table with 200 million rows.
Very large query with HUGE IN clause.
I originally wrote this query for them, but they have grown the IN clause to over 700 entries. The CTE looks unnecessary because I have omitted all the select columns and their substring() transformations for simplicity.
The focus is on the IN clause. 700+ pairs of these.
WITH cte AS (
    SELECT *
    FROM [AODS-DB1B]
    WHERE Source + '-' + Target IN (
        'ACY-DTW',
        'ACY-ATL',
        'ACY-ORD',
        :
        : 700+ of these pairs
        :
        'HTS-PGD',
        'PIE-BMI',
        'PGD-HTS'
    )
)
SELECT *
FROM cte
ORDER BY Source, Target, YEAR, QUARTER
When running, this query shoots CPU to 100% for hours - not unexpectedly.
There are indexes on all columns involved.
Question 1: Is there a better or more efficient way to accomplish this query other than the huge IN clause? Would 700 UNION ALLs be better?
Question 2: When this query runs, it creates a Session_ID that contains 49 "threads" (49 processes that all have the same Session_ID). Every one of them is an instance of this query, with its "Command" being this query text.
21 of them SUSPENDED,
14 of them RUNNING, and
14 of them RUNNABLE.
This changes rapidly as the task is running.
WHAT the heck is going on there? Is this SQL Server breaking the query up into pieces to work on it?
I recommend you store your 700+ strings in a permanent table, as it is generally considered bad practice to hard-code that much data in a script. You can create the table like this:
CREATE TABLE dbo.Lookup (Source VARCHAR(250), Target VARCHAR(250))
CREATE INDEX IX_Lookup_Source_Target ON dbo.Lookup (Source, Target)
INSERT INTO dbo.Lookup (Source, Target)
SELECT 'ACY','DTW'
UNION
SELECT 'ACY','ATL'
.......
and then you can simply join on this table:
SELECT * FROM [AODS-DB1B] a
INNER JOIN dbo.Lookup lt ON lt.Source = a.Source
AND lt.Target=a.Target
ORDER BY Source, Target, YEAR, QUARTER
However, even better would be to normalise the AODS-DB1B table and store SourceId and TargetId INT values instead, with the VARCHAR values stored in Source and Target tables. You can then write a query that only performs integer comparisons rather than string comparisons and this should be much faster.
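A rough sketch of what that might look like (all names here are hypothetical):
-- A dimension table holding each airport code once.
CREATE TABLE dbo.Airport (AirportId INT IDENTITY PRIMARY KEY, Code VARCHAR(3) UNIQUE)

-- AODS-DB1B and the lookup table would then carry SourceId/TargetId
-- INT columns, so the join compares integers instead of strings:
SELECT a.*
FROM [AODS-DB1B] a
INNER JOIN dbo.Lookup lt
    ON lt.SourceId = a.SourceId
    AND lt.TargetId = a.TargetId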
Put all of your codes into a temporary table (or a permanent one if suitable).....
SELECT *
FROM [AODS-DB1B]
INNER JOIN NEW_TABLE ON Source + '-' + Target = NEW_TABLE.Code
WHERE
...
...
You can create a temp table with all those values and then JOIN to that table; it would make the process a lot faster.
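A minimal sketch of that (pair values taken from the question, temp table name made up):
CREATE TABLE #pairs (Code VARCHAR(7) PRIMARY KEY)
INSERT INTO #pairs (Code)
VALUES ('ACY-DTW'), ('ACY-ATL'), ('ACY-ORD') -- ...and the remaining pairs

SELECT a.*
FROM [AODS-DB1B] a
INNER JOIN #pairs p
    ON a.Source + '-' + a.Target = p.Code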
I like the answer from Jaco.
Have an index on (Source, Target).
It may be worth giving this a try:
where ( source = 'ACY' and target in ('DTW', 'ATL', 'ORD') )
or ( source = 'HTS' and target in ('PGD') )

Alternative SQL ways of looking up multiple items of known IDs?

Is there a better solution to the problem of looking up multiple known IDs in a table:
SELECT * FROM some_table WHERE id='1001' OR id='2002' OR id='3003' OR ...
I can have several hundred known items. Ideas?
SELECT * FROM some_table WHERE ID IN ('1001', '1002', '1003')
and if your known IDs are coming from another table
SELECT * FROM some_table WHERE ID IN (
SELECT KnownID FROM some_other_table WHERE someCondition
)
The first (naive) option:
SELECT * FROM some_table WHERE id IN ('1001', '2002', '3003' ... )
However, we should be able to do better. IN is very bad when you have a lot of items, and you mentioned hundreds of these ids. What creates them? Where do they come from? Can you write a query that returns this list? If so:
SELECT *
FROM some_table
INNER JOIN (your query here) filter ON some_table.id = filter.id
See Arrays and Lists in SQL Server 2005
ORs are notoriously slow in SQL.
Your question is short on specifics, but depending on your requirements and constraints I would build a look-up table with your IDs and use the EXISTS predicate:
select t.id from some_table t
where EXISTS (select * from lookup_table l where t.id = l.id)
For a fixed set of IDs you can do:
SELECT * FROM some_table WHERE id IN (1001, 2002, 3003);
For a set that changes each time, you might want to create a table to hold them and then query:
SELECT * FROM some_table WHERE id IN
(SELECT id FROM selected_ids WHERE key=123);
Another approach is to use collections - the syntax for this will depend on your DBMS.
Finally, there is always this "kludgy" approach:
SELECT * FROM some_table WHERE '|1001|2002|3003|' LIKE '%|' || id || '|%';
In Oracle, I always put the IDs into a TEMPORARY TABLE to perform massive SELECTs and DML operations:
CREATE GLOBAL TEMPORARY TABLE t_temp (id INT);

SELECT *
FROM mytable
WHERE mytable.id IN
(
    SELECT id
    FROM t_temp
);
You can fill the temporary table in a single client-server roundtrip using Oracle collection types.
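For example (a sketch; the collection type name is made up):
-- One-time setup: a schema-level collection type.
CREATE TYPE t_id_list AS TABLE OF NUMBER;

-- The client can bind the whole list as a single collection parameter;
-- the constructor below stands in for that bind.
INSERT INTO t_temp (id)
SELECT COLUMN_VALUE
FROM TABLE(t_id_list(1001, 2002, 3003));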
We have a similar issue in an application written for MS SQL Server 7. Although I dislike the solution used, we're not aware of anything better...
'Better' solutions exist in 2008 as far as I know, but we have Zero clients using that :)
We created a table valued user defined function that takes a comma delimited string of IDs, and returns a table of IDs. The SQL then reads reasonably well, and none of it is dynamic, but there is still the annoying double overhead:
1. Client concatenates the IDs into the string
2. SQL Server parses the string to create a table of IDs
There are lots of ways of turning '1,2,3,4,5' into a table of IDs, but the Stored Procedure which uses the function ends up looking like...
CREATE PROCEDURE my_road_to_hell @IDs AS VARCHAR(8000)
AS
BEGIN
    SELECT *
    FROM myTable
    INNER JOIN dbo.fn_split_list(@IDs) AS [IDs]
        ON [IDs].id = myTable.id
END
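For reference, dbo.fn_split_list itself isn't shown above; a minimal loop-based version might look like this (illustrative only):
CREATE FUNCTION dbo.fn_split_list (@list VARCHAR(8000))
RETURNS @ids TABLE (id INT)
AS
BEGIN
    DECLARE @pos INT
    WHILE LEN(@list) > 0
    BEGIN
        -- Find the next comma; treat end-of-string as the final delimiter.
        SET @pos = CHARINDEX(',', @list)
        IF @pos = 0 SET @pos = LEN(@list) + 1
        INSERT INTO @ids (id) VALUES (CAST(LEFT(@list, @pos - 1) AS INT))
        SET @list = SUBSTRING(@list, @pos + 1, 8000)
    END
    RETURN
END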
The fastest is to put the IDs in another table and JOIN:
SELECT some_table.*
FROM some_table INNER JOIN some_other_table ON some_table.id = some_other_table.id
where some_other_table would have just one field (ids) and all values would be unique
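The setup for that would just be (name illustrative):
CREATE TABLE some_other_table (id INT PRIMARY KEY);
The PRIMARY KEY gives you both the uniqueness and an index for the join.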