We have a dataset in BigQuery with more than 500,000 tables. When we run queries against this dataset using legacy SQL, it throws an error.
As per Jordan Tigani, it executes SELECT table_id FROM __TABLES_SUMMARY__ to get the relevant tables to query:
How do I use the TABLE_QUERY() function in BigQuery?
Do queries using _TABLE_SUFFIX (standard SQL) execute __TABLES_SUMMARY__ to get the relevant tables to query?
According to the documentation, _TABLE_SUFFIX is a pseudo column that contains the values matched by the table wildcard, and it is only available in standard SQL. Meanwhile, __TABLES_SUMMARY__ is a meta-table that contains information about the tables within a dataset, and it is available in both standard and legacy SQL. They are therefore two different concepts.
However, in standard SQL you can use INFORMATION_SCHEMA.TABLES to retrieve information about the tables within the chosen dataset, similarly to __TABLES_SUMMARY__. Here you can find examples of usage and also its limitations.
Below, I queried against a public dataset using both methods:
First, using INFORMATION_SCHEMA.TABLES.
SELECT * FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
And part of the output:
Secondly, using __TABLES_SUMMARY__.
SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES_SUMMARY__`
And part of the output table:
As you can see, each method's output has its own format, even though both retrieve metadata about the tables within a particular dataset.
NOTE: BigQuery queries are subject to quotas. These quotas apply in several situations, including the number of tables a single query can reference, which is 1,000 per query, as documented here.
No, querying using a wildcard table does not execute __TABLES_SUMMARY__. You can have more than 500k tables in the dataset, but the number of tables matching the prefix pattern must be less than 500k. For other limitations on wildcard tables, refer to the documentation.
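For illustration, a minimal sketch against the public NOAA GSOD dataset used earlier in this thread; the gsod* prefix matches one table per year, and a constant filter on _TABLE_SUFFIX prunes the match further:
SELECT COUNT(*) AS row_count
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '2015' AND '2017'  -- only the 2015-2017 shards are read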
Related
Does BigQuery provide any SQL commands for retrieving table cardinality?
For example, some RDBMS providers have SQL commands like:
show table_stats schemaname tablename
for getting table cardinality.
Also, what about column stats? Like the number of distinct values in a column, and MIN, MAX, etc.
I saw that the BigQuery console provides both table and column stats, but I wonder whether this info is accessible through SQL statements.
Thanks!
The features you would like to use are really a matter of the query language rather than of the tool or service itself.
To get stats about tables, I found the Getting table metadata doc, which explains how to get table metadata for tables and columns. Some of the information you will get when running the queries found in that doc:
For tables: the name of the dataset that contains the table, the default lifetime in days, and other TABLE_OPTIONS view results.
For columns: the name of the project that contains the dataset, the column's standard SQL data type, and whether the value is updatable, stored, or hidden. Find more results for the COLUMNS view; a sketch of such a query follows.
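As a hedged example of those metadata queries, run against the public NOAA GSOD dataset (the dataset and table names here are only for illustration):
SELECT table_name, column_name, data_type, is_nullable
FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'gsod2017'  -- restrict the listing to one table's columns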
To get stats about columns, you can use the COUNT(DISTINCT ...) function, which in legacy SQL retrieves a statistical approximation of the number of unique values in a column.
I found this Community blog, where they show different examples and ways to get unique values. It even explains how to increase the approximation threshold.
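For instance, a minimal sketch against the same NOAA GSOD dataset (the stn and temp column names come from that dataset's schema); in standard SQL, COUNT(DISTINCT ...) is exact, while APPROX_COUNT_DISTINCT gives the cheaper approximation:
SELECT
  COUNT(DISTINCT stn) AS distinct_stations,        -- exact in standard SQL
  APPROX_COUNT_DISTINCT(stn) AS approx_stations,   -- statistical approximation
  MIN(temp) AS min_temp,                           -- column minimum
  MAX(temp) AS max_temp                            -- column maximum
FROM `bigquery-public-data.noaa_gsod.gsod2017`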
EDIT
It seems that BigQuery does not offer a count of unique fields. However, you can always take a look at the Schema and Details tabs in the BigQuery UI, where the field names are shown, including the type and description.
Example from the Public Datasets:
Hope this is helpful.
Problem:
We use Entity Framework (6.21) as our ORM.
Our database is Azure SQL Database.
Because some of the parameterized queries (frequently used in our app) are slow for some inputs (for one input a query runs 60 seconds, for another it runs 0.4 seconds),
we started investigating those queries using Query Store and the Query Store explorer in MS SQL Management Studio (MSSMS -> Object Explorer -> Query Store).
We found out that Query Store stores two identical queries (same SQL text but different params - the params are not even stored) as different queries (with different query_id). By different query I mean a different row in the table sys.query_store_query.
I checked this by looking into the Query Store tables:
SELECT
    qStore.query_id,
    qStore.query_text_id,
    queryTextStore.query_sql_text,
    ROW_NUMBER() OVER (PARTITION BY query_sql_text ORDER BY query_sql_text ASC) AS rn
FROM
    sys.query_store_query qStore
INNER JOIN
    sys.query_store_query_text queryTextStore
        ON qStore.query_text_id = queryTextStore.query_text_id
I am not able to compare plans of those queries easily in MSSMS, because each query has its own associated plan.
Expected behaviour:
I would assume that each subsequent run of the same query with different parameters would result in either:
1/ re-use of the existing plan
or
2/ creation of another plan based on the passed parameter values...
Example:
The query would look like this (in reality the queries are much more complex, as they are generated by Entity Framework):
SELECT * FROM tbl WHERE a = @__plinq__
and its two subsequent runs (with different params) would result in two rows in sys.query_store_query.
Question:
How can I make Azure save queries with the same text as the same query? Or am I missing something, or is this expected behaviour?
Or more generally how to tune database queries if they are generated by Entity Framework?
How does SQL Server Query Store decide whether two queries are the same or different?
Edit1: Update
Based on @PeterB's comment (Adding a query hint when calling Table-Valued Function), we were able to solve our problem with slow queries for some parameter values (we added the "recompile" hint to the problematic queries).
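At the SQL level the fix amounts to the following (a sketch using the simplified query from the example above; in our app the hint is injected through Entity Framework as described in the linked answer):
SELECT * FROM tbl WHERE a = @__plinq__
OPTION (RECOMPILE)  -- compile a fresh plan for each execution's parameter values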
Based on @GrantFritchey's hint I checked context_settings, but there are still multiple rows in the query store table which have the same query_sql_text and the same context_settings_id but different query_id.
So we still wonder how SQL Server Query Store decides whether two queries are the same or different.
As for the different query entries, the key that Query Store uses for a query consists of:
query_text_id,
context_settings_id,
object_id,
batch_sql_handle,
query_parameterization_type
If any of these is different for a query, it will generate a new entry in the query table. Note that batch_sql_handle is only populated for queries referencing temp tables.
So you can check which of these values is different for the queries that you listed.
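For example, a hedged sketch of such a check (the literal in the WHERE clause is a placeholder for the text of the query you are investigating):
SELECT
    q.query_id,
    q.query_text_id,
    q.context_settings_id,
    q.object_id,
    q.batch_sql_handle,
    q.query_parameterization_type
FROM sys.query_store_query q
INNER JOIN sys.query_store_query_text qt
    ON q.query_text_id = qt.query_text_id
WHERE qt.query_sql_text = N'SELECT * FROM tbl WHERE a = @__plinq__'  -- your query text here
Any column that differs between the returned rows is the reason Query Store split them into separate entries.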
Currently there are no settings that control the way Query Store aggregates queries. The only way to make it treat them as the same is to change your workload so that the fields listed above match. Alternatively - and probably a better approach - you can write your own reporting queries that aggregate queries and their statistics according to your needs.
We are using Looker (a dashboard/reporting solution) to create persistent derived tables in BigQuery. These are normal tables as far as BigQuery is concerned, but the naming follows Looker's standard (it creates a hash based on DB + SQL etc.) and names the tables accordingly. These tables are regenerated from views on a daily schedule. The table names in BigQuery look like below.
table_id
LR_Z504ZN0UK2AQH8N2DOJDC_AGG__table1
LR_Z5321I8L284XXY1KII4TH_MART__table2
LR_Z53WLHYCZO32VK3FWRS2D_JND__table3
If I query the resulting table in BQ by its explicit name, the result is returned as expected.
select * from `looker_scratch.LR_Z53WLHYCZO32VK3FWRS2D_JND__table3`
Looker changes the hash value in the table name when the table is regenerated after a query/job change. Hence I wanted to create a view with a wildcard table query, to make changes in the table name transparent to the outside world.
But the below query always fails.
SELECT *
FROM `looker_scratch.LR_*`
where _table_suffix like '%JND__table3'
I either get a completely random schema with null values or errors such as:
Error: Cannot read field 'reportDate' of type DATE as TIMESTAMP_MICROS
There are no clashing table suffixes, and I have tried all sorts of regular expression checks (LOWER, CONTAINS, etc.).
Is this happening because the table names have hash values in them? I have run multiple tests on other datasets and there were absolutely no problems; we have been running wildcard table queries for a long time and have faced no issues whatsoever.
Please let me know your thoughts.
When you use a wildcard like below,
`looker_scratch.LR_*`
you are actually looking for ALL tables with this prefix, and then - when you apply the clause below -
LIKE '%JND__table3'
you further filter down to the tables with such a suffix.
So the trick here is that the very first (chronologically created) table defines the schema of your output.
To address your issue, verify whether there are more tables that match your query, and then look into the very first one (the one that was created first).
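To find that first table, one hedged option is the legacy __TABLES__ meta-table, which exposes a creation_time column (the looker_scratch dataset name is taken from the question):
SELECT table_id, TIMESTAMP_MILLIS(creation_time) AS created_at
FROM `looker_scratch.__TABLES__`
WHERE STARTS_WITH(table_id, 'LR_')  -- same prefix as the wildcard
ORDER BY creation_time ASC          -- the oldest match dictates the wildcard schema
LIMIT 10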
If I have 2,000 tables that I'd like to union together, can I do that using a wildcard query, like this?
Or does the 1,000-tables referenced per query limit still apply?
does the 1,000-tables referenced per query limit still apply?
Yes. It still applies!
BigQuery looks at how many tables are involved in the query (no matter what exact syntax/functionality is used). Whether you explicitly list all the needed tables or use a wildcard, in the end the same number of tables is involved - thus the same limitation applies.
Note: partitions in a partitioned table are not counted as separate tables.
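To illustrate (a hedged sketch; mydataset.events is a hypothetical ingestion-time-partitioned table): a query that scans a year's worth of partitions still counts as referencing a single table:
SELECT COUNT(*) AS row_count
FROM `mydataset.events`                                     -- hypothetical partitioned table
WHERE _PARTITIONDATE BETWEEN '2019-01-01' AND '2019-12-31'  -- many partitions, one table reference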
I am trying to find a way to determine whether an SQL SELECT query A is prone to return a subset of the results returned by another query B. Furthermore, this needs to be accomplished from the queries alone, without having access to the respective result sets.
For example, the query SELECT * from employee WHERE salary >= 1000 will return a subset of the results of query SELECT * from employee. I need to find an automated way to perform this validation for any two queries A and B, without accessing the database that stores the data.
If it is unfeasible to achieve this without the aid of an RDBMS, we can assume that I have access to a local but empty RDBMS, with the data stored somewhere else. In addition, this check must be done in code, either using an algorithm or a library. The language I am using is Java, but other languages will also do.
Many thanks in advance.
I don't know how deep you want to get into parsing queries, but basically you can say that there are two general ways of making a subset of a query, given that the source table and projection (select) stay the same (see the examples after this list):
using a where clause to add conditions on row values
using a having clause to add conditions on aggregated values
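For instance, a hedged illustration using the employee table from the question (the dept column is assumed for the grouping example); in each pair, A returns a subset of B:
SELECT * FROM employee                                -- B
SELECT * FROM employee WHERE salary >= 1000           -- A: extra row-value condition

SELECT dept, AVG(salary) FROM employee GROUP BY dept  -- B
SELECT dept, AVG(salary) FROM employee GROUP BY dept
HAVING AVG(salary) >= 1000                            -- A: extra aggregated-value condition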
So you can say that if you have two objects that represent queries and say they look something close to this:
{
'select': { ... },
'from': {},
'where': {},
'orderby': {}
}
and they have the same select, from, and orderby, but one has an extra condition in the where clause, then you have a subset.
One way you might be able to determine if a query is a subset of another is by examining their source tables. If you don't have access to the data itself, this can be tricky. This question references using Snowflake joins to generate database diagrams based on a query without having access to the data itself:
Generate table relationship diagram from existing schema (SQL Server)
If your query is 800 characters or less, the tool is free to use: https://snowflakejoins.com/index.html
I tested it out using the AdventureWorks database and these two queries:
SELECT * FROM HumanResources.Employee
SELECT * FROM HumanResources.Employee WHERE EmployeeID < 200
When I plugged both of them into the Snowflake Joins text editor, this is what was generated:
SnowflakeJoins DB Diagram example
Hope that helps.