Does BigQuery provide any SQL commands for retrieving table cardinality?
For example, some RDBMS providers have sql commands like:
show table_stats schemaname tablename
for getting table cardinality.
Also, what about column stats, like the number of distinct values in a column, MIN, MAX, etc.?
I saw that the BigQuery console provides both table and column stats, but I wonder whether this info is accessible through SQL statements.
Thanks!
The features you are looking for are exposed through the query language rather than through a dedicated command in the tool or service itself.
To get stats about a table: I found the Getting table metadata documentation, which explains how to get table metadata for tables and columns. This is some of the information you will get when running the queries found in that doc:
For tables: the name of the dataset that contains the table, the default lifetime in days, and the other TABLE_OPTIONS view results.
For columns: the name of the project that contains the dataset, the column's standard SQL data type, and whether the value is updatable, stored, or hidden. You can find more results for the COLUMNS view in the same doc.
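For example, here is a minimal sketch of those metadata queries run against a public dataset (the dataset and table names are only illustrative):

-- Table metadata for every table in the dataset
SELECT table_name, table_type, creation_time
FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`;

-- Column metadata for one table
SELECT column_name, data_type, is_nullable
FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'gsod2019';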
To get stats about columns: you can use the COUNT(DISTINCT ...) function, which in legacy SQL returns a statistical approximation of the number of unique values in a column.
I found this Community blog, where they show different examples and ways to get unique values. It even explains how to increase the approximation threshold.
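For a standard SQL alternative, a rough sketch of column-level stats could look like this (the project, dataset, table, and column names are placeholders; APPROX_COUNT_DISTINCT trades exactness for speed):

SELECT
  COUNT(*) AS row_count,                                 -- table cardinality
  COUNT(DISTINCT station_id) AS distinct_stations,       -- exact in standard SQL
  APPROX_COUNT_DISTINCT(station_id) AS approx_distinct,  -- cheaper approximation
  MIN(temperature) AS min_temperature,
  MAX(temperature) AS max_temperature
FROM `my_project.my_dataset.my_table`;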
EDIT
It seems that BigQuery does not offer a count of unique fields. However, you can always take a look at the Schema and Details tabs in your BigQuery UI, where the fields' names are shown, along with the type and description.
Example from the Public Datasets:
Hope this is helpful.
Related
I want to be able to get the main information regarding the various columns of tables located in Snowflake, like a df.describe() could do in Pandas:
column names,
data types,
min/max/average for numeric types,
and ideally unique values for string types
maybe other things that I'm missing
Granted, you could simply pull all the data into a local DataFrame then do the "describe" in Pandas, but this would be too costly for Snowflake tables counting millions of rows.
Is there a simple way to do this?
column names
data types
You could always query INFORMATION_SCHEMA:
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME ILIKE 'table_name';
Or
DESCRIBE TABLE TABLE_NAME;
min/max/average for numeric types,
and ideally unique values for string types
maybe other things that I'm missing
Automatic Contextual Statistics
Select columns, cells, rows, or ranges in the results table to view relevant information about the selected data in the inspector pane (to the right of the results table). Contextual statistics are automatically generated for all column types. The statistics are intended to help you make sense of your data at a glance.
...
Filled/empty meters
Histograms
Frequency distributions
Email domain distributions
Key distributions
There is no equivalent to df.describe().
The simplest way might be a query that replicates it. For example, you could compose a UDF that takes the result of GET_DDL() for the table as input and returns a query with the correct SQL (MIN/MAX/AVG, etc.) for each column.
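As a rough illustration, the generated SQL for one numeric and one string column might look like this (the table and column names are made up):

SELECT
  MIN(order_amount) AS order_amount_min,
  MAX(order_amount) AS order_amount_max,
  AVG(order_amount) AS order_amount_avg,
  COUNT(DISTINCT customer_name) AS customer_name_distinct
FROM my_db.my_schema.orders;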
If approximate answers are sufficient, one alternative would be to do what you described in a local DataFrame but implement a TABLESAMPLE clause to avoid loading all the data.
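For instance, something along these lines pulls roughly 1% of the rows instead of the whole table (again, the table name is a placeholder):

SELECT *
FROM my_db.my_schema.orders SAMPLE (1);  -- ~1% of rows, chosen at random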
If you pursue the query route, the good news is that it should be mostly metadata-only operations which are very fast.
I wanted to reach out to ask if there is a practical way of finding out a given table's structure/schema, e.g. the column names and example row data inserted into the table (like the head function in Python), if you only have the table name. I have access to several tables in my current role; however, the person who developed the tables left the team I am on. I was interested in examining the tables more closely via SQL Assistant in Teradata (these tables often contain hundreds of thousands of rows, hence there are issues with hitting CPU exception criteria errors).
I have tried the following select statement, but there is an issue of hitting internal CPU exception criteria limits.
SELECT TOP 10 * FROM dbc.table1
Thank you in advance for any tips/advice!
You can use one of these commands to get a table's structure details in Teradata:
SHOW TABLE Database_Name.Table_Name;
or
HELP TABLE Database_Name.Table_Name;
Both show the table's structure details.
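To also preview a few example rows (the head-like part of your question) without scanning the whole table, a sketch like this should stay cheap (the database and table names are placeholders):

SELECT * FROM Database_Name.Table_Name SAMPLE 10;  -- returns 10 arbitrary rows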
We have a dataset in BigQuery with more than 500,000 tables. When we run queries against this dataset using legacy SQL, it throws an error.
As per Jordan Tigani, it executes SELECT table_id FROM <dataset>.__TABLES_SUMMARY__ to get the relevant tables to query:
How do I use the TABLE_QUERY() function in BigQuery?
Do queries using _TABLE_SUFFIX (standard SQL) execute __TABLES_SUMMARY__ to get the relevant tables to query?
According to the documentation, _TABLE_SUFFIX is a pseudo column that contains the values matched by the table wildcard, and it is only available in standard SQL. Meanwhile, __TABLES_SUMMARY__ is a meta-table that contains information about the tables within a dataset, and it is available in both standard and legacy SQL. Therefore, they are two different concepts.
However, in standard SQL you can use INFORMATION_SCHEMA.TABLES to retrieve information about the tables within the chosen dataset, similarly to __TABLES_SUMMARY__. Here you can find examples of usage and also its limitations.
Below, I queried against a public dataset using both methods:
First, using INFORMATION_SCHEMA.TABLES.
SELECT * FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
And part of the output:
Secondly, using __TABLES_SUMMARY__.
SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES_SUMMARY__`
And part of the output table:
As you can see, each method's output has its own particular structure, even though both retrieve metadata about the tables within a given dataset.
NOTE: BigQuery queries are subject to quotas. These quotas apply in several situations, including the number of tables a single query can reference, which is 1,000 per query (see the quotas documentation).
No, querying using a wildcard table does not execute __TABLES_SUMMARY__. You can have more than 500k tables in the dataset, but it does require that the number of tables matching the prefix pattern be less than 500k. For other limitations on wildcard tables, you can refer to the documentation.
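For reference, here is a minimal sketch of a wildcard query against the same public dataset, where _TABLE_SUFFIX narrows the set of matched tables (the suffix range is just an example):

SELECT COUNT(*) AS row_count
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '2018' AND '2019';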
They say there are no stupid questions, but this might be an exception.
I understand that BigQuery, being a columnar database, does a full table scan for any query over a specific column.
I also understand that query results can be cached or a named table can be created with the results of a query.
However I also see tabledata.list() in the documentation, and I'm unsure of how this fits in with query costs. Once a table is created from a query, am I free to access that table without cost through the API?
Let's say, for example, I run a query that is grouped by UserID, and I want to then present the results of that query to individual users based on that ID. As far as I understand there are two obvious ways of getting out the appropriate row for doing so.
I can write another query over the destination table with a WHERE userID=xxx clause
I can use the tabledata.list() endpoint to get all the (potentially paginated) data and get the appropriate row myself in my code
So situation 1 would incur a query cost, and situation 2 would not? Am I getting this right?
The tabledata.list API is free, as it does not actually use the BigQuery engine at all, so you are right about both 1 and 2.
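For completeness, a sketch of situation 1 (the destination table and column names are hypothetical); situation 2 would fetch rows from the same destination table via tabledata.list without issuing any query:

-- Situation 1: a billed query over the destination table
SELECT *
FROM `my_project.my_dataset.user_aggregates`
WHERE userID = 'xxx';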
I have a 260-column table in SQL Server. When we run "SELECT COUNT(*) FROM table" it takes almost 5-6 to get the count. The table contains close to 90-100 million records with 260 columns, where more than 50% of the columns contain NULL. Apart from that, users can also build dynamic SQL queries against the table from the UI, so searching 90-100 million records takes time to return results. Is there a way to improve the find functionality on a SQL table where the filter criteria can be anything? Can anyone suggest the fastest way to get aggregate data on 25 GB of data? The UI should not hang or time out.
Investigate horizontal partitioning. This will really only help query performance if you can force users to put the partitioning key into the predicates.
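A minimal sketch of what that could look like, assuming a date column is the partitioning key (all object and column names here are made up):

-- Partition function and scheme on a hypothetical date column
CREATE PARTITION FUNCTION pf_ByYear (date)
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

CREATE PARTITION SCHEME ps_ByYear
    AS PARTITION pf_ByYear ALL TO ([PRIMARY]);

-- Rebuild the clustered index on the partition scheme
CREATE CLUSTERED INDEX cix_BigTable_EventDate
    ON dbo.BigTable (EventDate)
    ON ps_ByYear (EventDate);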
Try vertical partitioning, where you split one 260-column table into several tables with fewer columns. Put all the values which are commonly required together into one table. The queries will only reference the table(s) which contain columns required. This will give you more rows per page i.e. fewer pages per query.
You have a high fraction of NULLs. Sparse columns may help, but calculate your percentages as they can hurt if inappropriate. There's an SO question on this.
Filtered indexes and filtered statistics may be useful if the DB often runs similar queries.
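For example, a sparse column and a filtered index might be declared like this (table and column names are placeholders):

-- Store a mostly-NULL column as SPARSE to save space
ALTER TABLE dbo.BigTable ALTER COLUMN OptionalValue ADD SPARSE;

-- Filtered index that only covers the non-NULL rows
CREATE NONCLUSTERED INDEX ix_BigTable_OptionalValue
    ON dbo.BigTable (OptionalValue)
    WHERE OptionalValue IS NOT NULL;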
As others state in the comments, you need to analyse a few of the queries and see which indexes would help you the most. If your queries involve a lot of text searches, you could use the full-text search feature of SQL Server. Here you will find a nice reference with good examples.
Things that came to mind were:
[SQL Server 2012+] If you are using SQL Server 2012 or later, you can use the new Columnstore Indexes (see the sketch after this list).
[SQL Server 2005+] If you are filtering a text column, you can use Full-Text Search.
If you have some function that you apply frequently to some column (like SOUNDEX of a column, for example), you could create a PERSISTED computed column so you don't have to compute that value every time.
Use temp tables (indexed ones will be much better) to reduce the number of rows to work on.
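A rough sketch of the columnstore index and the persisted computed column ideas above (object names are made up):

-- Nonclustered columnstore index to speed up aggregations over a few columns
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_BigTable
    ON dbo.BigTable (CustomerId, EventDate, Amount);

-- Persisted computed column so SOUNDEX is calculated once, at write time
ALTER TABLE dbo.BigTable
    ADD LastNameSoundex AS SOUNDEX(LastName) PERSISTED;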
@Twelfth's comment is very good:
"I think you need to create an ETL process and start changing this into a fact table with dimensions."
Changing my comment into an answer...
You are moving from a transactional world, where these 90-100 million records are recorded, into a data warehousing scenario where you are now trying to slice, dice, and analyze the information you have. Not an easy solution, but odds are you're hitting the limits of what your current system can scale to.
In a past job, I had several (6) data fields belonging to each record that were pretty much free text and randomly populated depending on where the data was generated (they were search queries, and people were entering what they would basically enter in Google). With 6 fields like this, I created a dim_text table that took each entry in any of these 6 fields and replaced it with an integer. This left me a table with two columns, text_id and text. Any time a user was searching for a specific entry in any of these 6 columns, I would search my dim_text table, which was optimized (indexing) for this sort of query, to return an integer matching the query I wanted. I would then take the integer and search for all occurrences of that integer across the 6 fields instead. Searching one table highly optimized for this type of free-text search, and then querying the main table for instances of the integer, is far quicker than searching the 6 free-text fields directly.
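A condensed sketch of that pattern, with hypothetical table and column names:

-- Dimension table: one row per distinct text value
CREATE TABLE dbo.dim_text (
    text_id    int IDENTITY(1,1) PRIMARY KEY,
    text_value nvarchar(400) NOT NULL UNIQUE
);

-- Resolve the search term to an integer once...
DECLARE @id int = (SELECT text_id FROM dbo.dim_text WHERE text_value = 'some search term');

-- ...then probe the six integer columns on the big table
SELECT *
FROM dbo.BigTable
WHERE @id IN (text1_id, text2_id, text3_id, text4_id, text5_id, text6_id);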
I'd also create aggregate tables (reporting tables if you prefer the term) for your common aggregates. There are quite a few options here that your business setup will determine. For example, if each row is an item on a sales invoice and you need to show sales by date, it may be better to aggregate total sales by invoice and save that to a table; then, when a user wants totals by day, an aggregate is run on the aggregate of the invoices to determine the totals by day (so you've 'partially' aggregated the data in advance).
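For example, something like this pre-aggregates invoice lines to invoice totals, and daily totals are then computed from the much smaller aggregate table (names are illustrative):

-- One-time (or scheduled) aggregation of line items to invoice totals
SELECT invoice_id, invoice_date, SUM(line_amount) AS invoice_total
INTO dbo.agg_invoice
FROM dbo.InvoiceLines
GROUP BY invoice_id, invoice_date;

-- Daily totals come from the small aggregate table, not the 90-100M row fact table
SELECT invoice_date, SUM(invoice_total) AS daily_total
FROM dbo.agg_invoice
GROUP BY invoice_date
ORDER BY invoice_date;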
Hope that makes sense...I'm sure I'll need several edits here for clarity in my answer.