loop on tables in information schema to find their size and return a report - google-bigquery

I would like to find the number of records for a list of tables that start with a common prefix via a SQL query.
In BigQuery I am trying a combination of CONCAT() or FORMAT() along with INFORMATION_SCHEMA.TABLES in FOR or WHILE loops that, in my mind, should eventually be executed with EXECUTE IMMEDIATE, but I am not able to set up the final query.
This is what I have:
DECLARE query_string STRING;
FOR record IN (
  SELECT CONCAT(
    "SELECT COUNT(*) AS cnt_rows,",
    "\"", table_name, "\"", " AS t_name ",
    "FROM ", CONCAT("`", table_schema, ".", table_name, "`")
  ) AS MY_STMNT
  FROM `<<my-project-id>>.<<my-dataset>>.INFORMATION_SCHEMA.TABLES`
  WHERE table_name LIKE "my-table-prefix__YYYY-MM-DDT%"
) DO
  EXECUTE IMMEDIATE
  -- 1st attempt:
  -- FORMAT("""
  --   %s
  -- """, record.MY_STMNT);
  -- 2nd attempt (this doesn't even execute because of "Syntax error: Unexpected keyword SET at xxx"):
  -- SET query_string = query_string || record.MY_STMNT || " UNION ALL ";
END FOR;
This fails because the results come back separately, one result set per table. I would like all of them in one final result where each row is:
table name as t_name
number of rows as cnt_rows
How do I do that?

The solution is to use a string variable to accumulate SQL statements combined with UNION ALL.
The issue with my previous attempts was that I tried to EXECUTE IMMEDIATE at each iteration of the FOR loop, instead of doing it once after the entire loop.
DECLARE query_string STRING;
SET query_string = "";
FOR record IN (
  -- Dynamically build one SELECT per table found in INFORMATION_SCHEMA,
  -- returning the table name and the number of rows in that table.
  SELECT CONCAT(
    "SELECT ",
    "'", table_name, "'", " AS t_name, ",
    "COUNT(*) AS cnt_rows ",
    "FROM ", CONCAT("`", table_schema, ".", table_name, "`")
  ) AS MY_STMNT
  FROM `<<my-project-id>>.<<my-dataset>>.INFORMATION_SCHEMA.TABLES`
  WHERE table_name LIKE "my-table-prefix__YYYY-MM-DDT%"
) DO
  SET query_string = query_string || record.MY_STMNT || " UNION ALL ";
END FOR;
-- Append a dummy "last row" so the trailing "UNION ALL" has something to union with.
SET query_string = query_string || "SELECT 'x' AS t_name, 0 AS cnt_rows";
-- Rank the output from the biggest table to the smallest.
SET query_string = query_string || " ORDER BY cnt_rows DESC";
-- To inspect the generated SQL, just print it:
SELECT query_string;
-- Run the generated SQL:
EXECUTE IMMEDIATE query_string;
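To make the mechanics concrete: with two hypothetical matching tables t1 and t2, the printed query_string would read (line breaks added for readability):
SELECT 't1' AS t_name, COUNT(*) AS cnt_rows FROM `<<my-dataset>>.t1` UNION ALL
SELECT 't2' AS t_name, COUNT(*) AS cnt_rows FROM `<<my-dataset>>.t2` UNION ALL
SELECT 'x' AS t_name, 0 AS cnt_rows ORDER BY cnt_rows DESC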
Be careful though: if you chain many complex SELECT statements together with UNION ALL, BigQuery may throw an error like this:
Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.. at [1:1]
Whether this happens depends on the content and size of the SQL query you are building; the procedure above may well apply to your case without throwing errors when invoked at scale.
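As a side note, the loop isn't strictly required: BigQuery's STRING_AGG can assemble the whole statement in one shot, which also makes the dummy last row unnecessary. A minimal sketch, assuming the same placeholder project/dataset and table prefix as above:
DECLARE query_string STRING;
-- Aggregate one per-table SELECT into a single UNION ALL statement.
SET query_string = (
  SELECT STRING_AGG(
    FORMAT("SELECT '%s' AS t_name, COUNT(*) AS cnt_rows FROM `%s.%s`",
           table_name, table_schema, table_name),
    " UNION ALL ")
  FROM `<<my-project-id>>.<<my-dataset>>.INFORMATION_SCHEMA.TABLES`
  WHERE table_name LIKE "my-table-prefix__YYYY-MM-DDT%"
);
EXECUTE IMMEDIATE query_string || " ORDER BY cnt_rows DESC";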

Related

BigQuery overhead of Stored Procedures

We have a use case to create some sophisticated search predicates against Google BigQuery. We have built some search screens that use stored procedures that create dynamic SQL, which is then run using EXECUTE IMMEDIATE. Here is a completely synthetic example:
CREATE OR REPLACE PROCEDURE `somewhere`.author_name_search(
  IN firstNamePrefix STRING,
  IN lastNamePrefix STRING
)
BEGIN
  /* note: SELECT * is bad practice */
  DECLARE DYNAMIC_SQL STRING DEFAULT 'SELECT * FROM `somewhere_else.author`';
  DECLARE WHERE_CLAUSE STRING DEFAULT '';
  IF ASCII(firstNamePrefix) != 0 THEN
    SET WHERE_CLAUSE = WHERE_CLAUSE || " firstName like '" || firstNamePrefix || "%'";
  END IF;
  IF ASCII(lastNamePrefix) != 0 THEN
    IF ASCII(WHERE_CLAUSE) != 0 THEN
      SET WHERE_CLAUSE = WHERE_CLAUSE || " OR ";
    END IF;
    SET WHERE_CLAUSE = WHERE_CLAUSE || "lastName like '" || lastNamePrefix || "%'";
  END IF;
  /* if there is no optional input, this is a full table scan, which is VERY BAD;
     you need to partition and force all search inputs to return only some partitions */
  IF ASCII(WHERE_CLAUSE) != 0 THEN
    SET DYNAMIC_SQL = DYNAMIC_SQL || " WHERE " || WHERE_CLAUSE;
  END IF;
  SET DYNAMIC_SQL = DYNAMIC_SQL || " LIMIT 100000";
  EXECUTE IMMEDIATE DYNAMIC_SQL;
END;
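For reference, such a procedure is invoked with a plain CALL; the prefix values here are hypothetical:
-- 'Jo' and 'Sm' are made-up sample prefixes
CALL `somewhere`.author_name_search('Jo', 'Sm');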
Note that this is a really basic example with lots of performance issues. In reality the data is partitioned by date, require partition filter is set to true, the UI suggests a reasonable date range, and the UI selects only a subset of fields rather than all of them. Where things get complex is that we have a sort column, a sort order, and up to a dozen optional search input fields, which is why we are looking to use dynamic SQL with EXECUTE IMMEDIATE within a procedure. We are aware that BigQuery is not built for low-latency OLTP queries, so we optimize our datasets for these "interactive search" screens and set user expectations that the response times will be a few seconds depending on the complexity of the search they are running.
We ran some performance tests comparing the overhead of just querying the underlying table/view as opposed to querying it from within a proc. The simplest test proc is hardcoded to just return some fixed data from a view so that we can make a direct comparison, such as:
CREATE OR REPLACE PROCEDURE `somewhere`.author_name_search(
  IN firstNamePrefix STRING,
  IN lastNamePrefix STRING
)
BEGIN
  /* note: the logic here is not quite identical; in reality we hardcode the query
     to return a known result to compare with other test runs */
  SELECT * FROM `somewhere_else.author`
  WHERE firstName LIKE firstNamePrefix OR lastName LIKE lastNamePrefix;
END;
The real test proc is more or less hardcoded to pull some exact test rows from the underlying table. If we query the table directly, without a proc, everything is sub-second end-to-end. When we simply hardcode the same query statement into a minimal stored proc (as above), everything is a whole second slower. So using the proc seems to add a fixed overhead of more than 100% of the total application time compared with not going via a proc.
Is this expected, or is there something we can do to tune things so that procs are not a massive overhead?

How to Select * FROM all tables in a schema

I have 100s of tables with the same structure in the same schema. I would like to run a query to see all rows where the 'sqft' column is NULL:
SELECT * FROM table WHERE sqft = NULL
The tables I would like to iterate over all begin with the prefix 'tb_', e.g. 'tb_115_spooner_st'.
After trying numerous solutions posted on here I cannot properly iterate over all these tables with a single script.
This is what I am currently working with
do $$
declare
  rec record;
  query text;
begin
  for rec in select * from pg_tables where schemaname = 'public'
  loop
    query = format('SELECT * FROM %s WHERE sqft = NULL LIMIT 1', rec.tablename);
    --raise notice '%', query;
    execute query;
  end loop;
end
$$ language plpgsql;
I am quite new to writing more complex SQL commands like this and have trouble understanding what is going wrong. I know there needs to be a section where the prefix is a condition, but the code running right now just returns a 'DO' in the console. Any help is appreciated.
Consider using the INFORMATION_SCHEMA.COLUMNS view to find the tables you need to query.
SELECT
CONCAT('SELECT * FROM ', table_name,' WHERE sqft IS NULL;')
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
COLUMN_NAME = 'sqft'
This will get you a list of SQL statements you can copy and paste into a new terminal and just run as a normal query.
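Since the tables of interest share the 'tb_' prefix, the same idea can be narrowed and turned into a single UNION ALL query instead of a list of statements. A sketch of that variant (format('%I') quotes each table name safely, and the backslash in the LIKE pattern escapes the underscore wildcard):
SELECT string_agg(
         format('SELECT * FROM %I WHERE sqft IS NULL', table_name),
         ' UNION ALL ')
FROM information_schema.columns
WHERE column_name = 'sqft'
  AND table_name LIKE 'tb\_%';
The returned string can then be executed as one ordinary query, for example by feeding it to \gexec in psql.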

Create UNION ALL statements via a loop

Manually, I can select partitions in an inner query with the first code block below. Is there a more elegant way to do this via a loop? I'm showing 3 partitions here, but I have about 200, and because the partitions are based on a date column, the partition names will change when I run this query again at a future date.
SELECT *
FROM (
  SELECT * FROM RSS_ACQ.TRX_ARQ PARTITION("SYS_P211048") UNION ALL
  SELECT * FROM RSS_ACQ.TRX_ARQ PARTITION("SYS_P210329") UNION ALL
  SELECT * FROM RSS_ACQ.TRX_ARQ PARTITION("SYS_P176323")
) TRX_ARQ;
With this statement, I've created a loop that outputs the UNION ALL statements.
BEGIN
  FOR ALL_TAB_PARTITIONS IN
  (
    SELECT PARTITION_NAME
    FROM ALL_TAB_PARTITIONS
    WHERE TABLE_OWNER = 'TABLEOWNER'
      AND TABLE_NAME = 'TABLENAME'
      AND PARTITION_POSITION > 123
    ORDER BY partition_position DESC
  )
  LOOP
    DBMS_OUTPUT.PUT_LINE('SELECT * FROM RSS_ACQ.TRX_ARQ PARTITION(\"'
      || ALL_TAB_PARTITIONS.PARTITION_NAME || '\") UNION ALL');
  END LOOP;
END;
And in this block, I've attempted to use the loop inside the inner query. It's not yet formatted correctly and I'll need to avoid having UNION ALL for the very last partition.
SELECT *
FROM (
  BEGIN
    FOR ALL_TAB_PARTITIONS IN
    (
      SELECT PARTITION_NAME
      FROM ALL_TAB_PARTITIONS
      WHERE TABLE_OWNER = 'TABLEOWNER'
        AND TABLE_NAME = 'TABLENAME'
        AND PARTITION_POSITION > 123
      ORDER BY partition_position DESC
    )
    LOOP
      DBMS_OUTPUT.PUT_LINE('SELECT * FROM RSS_ACQ.TRX_ARQ PARTITION(\"'
        || ALL_TAB_PARTITIONS.PARTITION_NAME || '\") UNION ALL');
    END LOOP;
  END;
) TRX_ARQ;
Here are some of the errors (there were many more). They are syntax errors pointing at other parts of the query, so I suspect I have an issue with escaping the quotes.
Error starting at line : 99 in command -
END LOOP
Error report -
Unknown Command
Error starting at line : 100 in command -
END
Error report -
Unknown Command
Error starting at line : 101 in command -
)
Error report -
Unknown Command
Error starting at line : 102 in command -
) TABLENAME
Error report -
Unknown Command
This is a bit of a guess, but it's too long for a comment.
I am assuming your table is interval partitioned. In that case, getting all the data from partition positions > 123 is the same as getting all the rows with a higher date than the highest date in partition 123.
You can obtain that date from ALL_TAB_PARTITIONS and then use it to query the table. Like this:
WITH FUNCTION get_high_value RETURN DATE IS
  l_high_val_expr ALL_TAB_PARTITIONS.HIGH_VALUE%TYPE;
  l_high_value    DATE;
BEGIN
  SELECT high_value
  INTO l_high_val_expr
  FROM all_tab_partitions
  WHERE table_owner = 'RSS_ACQ'
    AND table_name = 'TRX_ARQ'
    AND partition_position = 123;
  -- HIGH_VALUE holds the partition boundary as a text expression,
  -- so evaluate it to get an actual DATE.
  EXECUTE IMMEDIATE 'SELECT ' || l_high_val_expr || ' FROM DUAL' INTO l_high_value;
  RETURN l_high_value;
END;
SELECT * FROM rss_acq.trx_arq
-- Replace "partitioned_date_column" with the name of the column on which the
-- table is interval partitioned.
WHERE partitioned_date_column > get_high_value;
We can't execute an anonymous PL/SQL block in a SELECT statement.
What you need to do is spool the output of the ALL_TAB_PARTITIONS loop to a file (or a SQL worksheet if you're using an IDE like SQL Developer). This gives you a script you can run separately after editing it (you need to trim the UNION ALL from the final generated SELECT).
Probably there are more elegant ways of achieving the same thing, but the task seems sufficiently wrong that it doesn't strike me as being worth the effort. You want to query 200 partitions in a single statement. That is a brute force operation, and there isn't much to be gained from querying named partitions; in fact, producing a union of 200 separate queries may be more expensive than a single query. So why not try something like this?
select * from RSS_ACQ.TRX_ARQ
where partition_key_col >= date '2018-08-01' -- or whatever
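If you do still want to generate the UNION ALL script, LISTAGG can build the whole statement in one query, which sidesteps both the quote escaping (double quotes need no backslash inside an Oracle single-quoted literal) and the trailing UNION ALL. A sketch, using the same placeholder owner and table names as above:
SELECT LISTAGG(
         'SELECT * FROM RSS_ACQ.TRX_ARQ PARTITION("' || partition_name || '")',
         ' UNION ALL ')
       WITHIN GROUP (ORDER BY partition_position DESC) AS generated_sql
FROM all_tab_partitions
WHERE table_owner = 'TABLEOWNER'
  AND table_name = 'TABLENAME'
  AND partition_position > 123;
One caveat: LISTAGG returns VARCHAR2, so with roughly 200 partitions the result may exceed the 4000-byte limit, in which case spooling the loop output remains the fallback.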
"I think you are overlooking the 12c feature of using PL/SQL in the WITH clause"
That 12c feature is for functions not procedures, so it won't help the OP run their code. It would be possible to use a WITH clause function but that would require:
creating a type with the same projection as the target table
and a nested table type based on that type
a WITH clause function which assembles and executes a dynamic SQL statement
we can't use REF CURSORs in SQL so ...
the function has to execute the dynamic select INTO a local collection variable ...
then loop over the collection and PIPE ROW to output those rows ...
so the main query can call the function with a table() call
Can a WITH clause function be pipelined? I can't find anything in the documentation to say we can't (don't have access to 12c right now to test).

Export data in file in Postgres

I have one table with id, name and complex queries. Below is just a sample of that table:
ID  name       Query
1   advisor_1  "Select * from advisor"
2   student_1  "Select * from student where id = 12"
3   faculty_4  "Select * from student where id = 12"
I want to iterate over this table and save each query's result set to its own CSV file.
Is there any way I can do it automatically through an anonymous block?
I don't want to do this manually, as the table has lots of rows.
Can anyone please help?
Not being superuser means the export can't be done in a server-side DO block.
It could be done client-side in any programming language that can talk to the database, or assuming a psql-only environment, it's possible to generate a list of \copy statements with an SQL query.
As an example of the latter, assuming the unique output filenames are built from the ID column, something like this should work:
SELECT format('\copy (%s) TO ''file-%s.csv'' CSV', query, id)
FROM table_with_queries;
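For the first sample row above (ID 1), the generated line would be:
\copy (Select * from advisor) TO 'file-1.csv' CSV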
The result of this query should be put into a file in a format such that it can be directly included into psql, like this:
\pset format unaligned
\pset tuples_only on
-- \g with an argument treats it as an output file.
SELECT format('\copy (%s) TO ''file-%s.csv'' CSV', query, id)
FROM table_with_queries \g /tmp/commands.sql
\i /tmp/commands.sql
As a sidenote, that process cannot be managed with the \gexec meta-command introduced in PG 9.6, because \copy itself is a meta-command. \gexec iterates only on SQL queries, not on meta-commands. Otherwise the whole thing could be done by a single \gexec invocation.
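For contrast, a superuser could do the whole export with server-side COPY, which is a regular SQL command rather than a meta-command, so a single \gexec invocation does work in that case (the /tmp path here is hypothetical):
SELECT format('COPY (%s) TO ''/tmp/file-%s.csv'' CSV', query, id)
FROM table_with_queries \gexec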
If your problem is only the code (and permissions are not an issue), you may use an anonymous block like this:
DO $$
DECLARE
  rec RECORD;
BEGIN
  FOR rec IN SELECT id, query FROM table_name
  LOOP
    -- Server-side COPY writes the file on the database server.
    EXECUTE 'COPY (' || rec.query || ') TO ' || QUOTE_LITERAL('d:/csv' || rec.id || '.csv') || ' CSV';
  END LOOP;
END;
$$;
As for the permission problem: use a location on the server that you have write access to (or request one from your vendor).

Find tables, columns with specific value

I'm using Firebird 2.5.0. I know a value and need to find all tables, columns in which it occurs.
I created procedure:
CREATE OR ALTER PROCEDURE NEW_PROCEDURE (
  searching_value varchar(30))
returns (
  table_with_value varchar(100),
  column_with_value varchar(100))
as
declare variable all_tables varchar(50);
declare variable all_columns varchar(50);
declare variable all_values varchar(50);
begin
  FOR SELECT r.rdb$relation_name, f.rdb$field_name
      from rdb$relation_fields f
      join rdb$relations r on f.rdb$relation_name = r.rdb$relation_name
        and r.rdb$view_blr is null
        and (r.rdb$system_flag is null or r.rdb$system_flag = 0)
      order by 1, f.rdb$field_position
      INTO :all_tables, :all_columns
  DO
  BEGIN
    FOR SELECT all_columns FROM all_tables
        INTO :all_values
    DO
    BEGIN
      IF (searching_value = all_values) THEN
      BEGIN
        table_with_value = all_tables;
        column_with_value = all_columns;
        SUSPEND;
      END
    END
  END
END^
When I run it I get error message:
Undefined name.
Dynamic SQL Error.
SQL error code = -204.
Table unknown.
ALL_TABLES.
At line 21, column 13.
So in this select statement "SELECT all_columns FROM all_tables" it is not taking values from previous for select statement but just trying to find table all_tables. How to fix it?
The problem is that all_columns is considered to be a column name and all_tables a table name, rather than your variables, in:
SELECT all_columns FROM all_tables
You can't parametrize object names in a query like this. Also note that even if it had been possible to parametrize object names, you would have had to use :all_columns and :all_tables for disambiguation.
Instead you will need to create a dynamic SQL statement and execute that with EXECUTE STATEMENT (or more specifically: FOR EXECUTE STATEMENT).
In this case:
FOR EXECUTE STATEMENT 'SELECT "' || all_columns || '" FROM "' || all_tables || '"'
INTO :all_values
DO
BEGIN
/* .... */
END
I have quoted the object names to account for case-sensitive column and table names (or identifiers that are invalid unquoted). Be aware that constructing a query like this might leave you open to SQL injection if the values are obtained from a source other than the Firebird metadata tables.
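One practical caveat worth verifying on your version: the rdb$relation_name and rdb$field_name columns are CHAR, so the fetched values carry trailing spaces, and with quoted identifiers those spaces become part of the name. Trimming them first should avoid that:
FOR EXECUTE STATEMENT 'SELECT "' || TRIM(all_columns) || '" FROM "' || TRIM(all_tables) || '"'
INTO :all_values
DO
BEGIN
  /* .... */
END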