I understand that all the column statistics for a Hive table can be computed using the command:
ANALYZE TABLE Table1 COMPUTE STATISTICS;
Specific column-level stats can then be fetched through the command:
DESCRIBE FORMATTED Table1.Column1;
....
DESCRIBE FORMATTED Table1.ColumnN;
Is it possible to fetch all column stats using a single command?
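One note on the setup above: plain COMPUTE STATISTICS gathers only table-level statistics (row count, size, and so on). The column-level statistics that DESCRIBE FORMATTED Table1.Column1 reports are gathered with the FOR COLUMNS variant, which covers every column of the table in one statement:

```sql
-- Computes statistics for all columns of the table in a single pass
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;
```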
Why does BigQuery perform a full table scan for SELECT * when a WHERE clause is provided?
SELECT *
FROM `githubarchive.month.202012`
WHERE login='__ThisUserDoesNotExist__'
This query performs a full table scan, even though it really just needs to do a full scan of the login column to determine that there are no records to return. Interested in references to relevant sections of BQ docs as well as papers on query planning for columnar databases.
BigQuery uses columnar storage; without partitioning or clustering on the table, it will do a full table scan.
BigQuery is doing a full table scan here because the query asks for every column in the output.
You'll find that only the login column is scanned with the query below:
SELECT login
FROM `githubarchive.month.202012`
WHERE login='__ThisUserDoesNotExist__'
Check the public documentation here: https://cloud.google.com/bigquery/pricing#on_demand_pricing
BigQuery uses a columnar data structure. You're charged according to the total data processed in the columns you select, and the total data per column is calculated based on the types of data in the column. For more information about how your data size is calculated, see Data size calculation.
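Besides selecting only the columns you need, partitioning or clustering the table lets BigQuery prune data for the WHERE clause rather than reading every row of the scanned columns. A sketch, assuming you copy the public table into your own dataset (mydataset and the new table name are placeholders):

```sql
-- Hypothetical copy of the table, clustered on login so that
-- equality filters on login can skip non-matching blocks
CREATE TABLE mydataset.github_202012_clustered
CLUSTER BY login AS
SELECT * FROM `githubarchive.month.202012`;

-- This filter can now prune clustered blocks instead of reading the full column
SELECT login
FROM mydataset.github_202012_clustered
WHERE login = '__ThisUserDoesNotExist__';
```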
I was wondering if there is a way to disallow duplicates in BigQuery?
Based on this article, I can deduplicate a whole table or a partition of a table.
To deduplicate a whole table:
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;
To deduplicate a table based on partitions defined in a WHERE clause:
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
If there is no way to disallow duplicates then is this a reasonable approach to deduplicate a table?
BigQuery doesn't have a mechanism like the constraints found in a traditional DBMS. In other words, you can't set a primary key or anything like that, because BigQuery is focused not on transactions but on fast analysis and scalability. You should think of it as a data lake and not as a database with uniqueness guarantees.
If you have an existing table and need to deduplicate it, the approaches mentioned will work. If you need your table to have unique rows by default and want to programmatically insert unique rows without resorting to external resources, I can suggest a workaround:
First, insert your data into a temporary table.
Then, run a query over your temporary table and save the results into your actual table. This step can be done programmatically in a few different ways:
Using the approach you mentioned as a scheduled query
Using a bq command such as:
bq query --use_legacy_sql=false --destination_table=<dataset.actual_table> 'SELECT DISTINCT * FROM <dataset.temporary_table>'
This queries the distinct values in your temporary table and loads the results into the target table pointed to by the --destination_table flag. It's important to mention that this approach also works for partitioned tables.
Finally, drop the temporary table. Like the previous step, this can be done either with a scheduled query or a bq command.
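The last two steps can also be expressed purely in standard SQL DDL, which is convenient if you prefer scheduled queries over bq commands (dataset.actual_table and dataset.temporary_table are placeholder names):

```sql
-- Step 2: overwrite the actual table with the distinct rows of the temporary table
CREATE OR REPLACE TABLE dataset.actual_table AS
SELECT DISTINCT * FROM dataset.temporary_table;

-- Step 3: drop the temporary table
DROP TABLE dataset.temporary_table;
```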
I hope it helps.
I've recently moved to using AvroSerDe for my external tables in Hive.
SELECT col_name, COUNT(*)
FROM table
GROUP BY col_name;
The above query gives me a count, whereas the query below does not:
SELECT COUNT(*)
FROM table;
The reason is that Hive just looks at the table metadata and fetches the value from there. The statistics for the table were not updated in Hive, due to which COUNT(*) returns 0.
The statistics are written with zero data rows at table-creation time, and for any appends or changes to the data, Hive needs to update the statistics in the metadata.
Running the ANALYZE command gathers statistics and writes them into the Hive MetaStore:
ANALYZE TABLE table_name COMPUTE STATISTICS;
Visit the Apache Hive wiki for more details about the ANALYZE command.
Other methods to solve this issue:
Using a LIMIT or GROUP BY clause triggers a MapReduce job to count the number of rows and gives the correct value.
Setting fetch-task conversion to none forces Hive to run a MapReduce job to count the number of rows:
hive> set hive.fetch.task.conversion=none;
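Putting the pieces above together, a quick session sketch (table_name is a placeholder):

```sql
-- Refresh the stored statistics so the metadata fast path returns the right value
ANALYZE TABLE table_name COMPUTE STATISTICS;

-- Alternatively, bypass the metadata fast path for this session
-- and force an actual job to count the rows
SET hive.fetch.task.conversion=none;
SELECT COUNT(*) FROM table_name;
```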
Is there a way of getting a list of the partitions in a BigQuery date-partitioned table? Right now the best way I have found of doing this is using the _PARTITIONTIME meta-column, but this needs to scan all the rows in all the partitions. Is there an equivalent of a SHOW PARTITIONS call, or maybe something in the bq command-line tool?
To list partitions in a table, query the table's summary partition by using the partition decorator separator ($) followed by PARTITIONS_SUMMARY. For example, the following command retrieves the partition IDs for table1:
SELECT partition_id from [mydataset.table1$__PARTITIONS_SUMMARY__];
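The query above uses legacy SQL. In standard SQL, the INFORMATION_SCHEMA.PARTITIONS view exposes the same information without scanning the table data (replace mydataset with your dataset):

```sql
-- Lists the partition IDs of table1 from metadata only
SELECT partition_id
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'table1';
```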
I’m trying to copy a table’s schema to an empty table. It works for schemas with no nested records, but when I try to copy a schema with multiple nested records via this query:
SELECT * FROM [table] LIMIT 0
I get the following error:
Cannot output multiple independently repeated fields at the same time.
BigQuery will automatically flatten all results (see docs), which won't work when you have more than one nested record. In the BigQuery UI, click on Show Options:
Then select your destination table and make sure Allow Large Results is checked and Flatten Results is unchecked:
SELECT * FROM [table] LIMIT 0 with Allow Large Results checked and Flatten Results unchecked
The drawback of the above approach is that you can end up with quite a bill, as this way of copying a schema costs a scan of the whole original table.
Instead, I would programmatically get the table schema and then create a table with that schema.
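One way to do that without any scan at all, assuming current BigQuery standard SQL DDL is available to you, is CREATE TABLE ... LIKE, which copies only the schema of the source table (mydataset and the table names are placeholders):

```sql
-- Creates an empty table with the same schema as the source; no rows are read
CREATE TABLE mydataset.new_table
LIKE mydataset.source_table;
```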