When I run the query select count(*) from table; a count is displayed, but when I run select * from table; no data is displayed.
Can you please help? These are external tables. I went to the location of the tables and saw that there is no data present.
Run ANALYZE TABLE table_name COMPUTE STATISTICS; on your table.
This will give the correct result. Hive keeps the table's row count in its statistics in the metastore for fast retrieval. Hence, after the underlying data files are deleted, count(*) still returns the old stats until they are recomputed.
I am looking for a way to find the total number of partitions across BigQuery tables in all datasets of a project, so I can tell ahead of time whether any table is approaching the 4000-partition limit. Could someone please help me with the query?
Thanks
You can use the INFORMATION_SCHEMA.PARTITIONS metadata view to extract partition information for a whole schema/dataset.
It works as follows:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
In case you want to look at a specific table, you just need to include it in the WHERE clause:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'partitioned_table'
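To get at the original question (which tables are getting close to the limit), the same view can be aggregated per table. This is only a sketch: `project.schema` is the same placeholder as above, and as far as I know the view has to be queried one dataset at a time, so it would need to be repeated for each dataset in the project.

```sql
-- Count partitions per table in one dataset, largest first.
-- Non-partitioned tables show up with a NULL partition_id, so they are skipped.
SELECT
  table_name,
  COUNT(*) AS partition_count
FROM
  `project.schema.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  partition_id IS NOT NULL
GROUP BY
  table_name
ORDER BY
  partition_count DESC
```

Tables at the top of the result are the ones to watch against the 4000 threshold.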
I have a partitioned table, with partitions based on timestamp column, like this:
...
partition by date_trunc(update_date, month)
...
Now, how do I query partitions in this table?
Afaik, I can't use _PARTITIONDATE pseudo column, since this is not an ingestion-time partitioned table. Do I just filter my query using date_trunc(update_date, month)?
select * from my_project.my_dataset.information_schema.partitions
where table_name = 'partitioned_table'
Here I can get partition_ids, but I can't address them in my query either.
I'm not sure, but whenever you want to test whether a partition filter is working: type out a SELECT * statement without a filter and, without running it, check the estimated query size. Then add a filter on what you believe is the partition column, again without running it, and see if the estimated query size is reduced.
In this case try filtering on date_trunc(update_date, month) and see if it reduces the query size
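For comparison, a plain range filter on the partitioning column itself is also worth checking the same way, since BigQuery can generally prune partitions from constant comparisons on that column. A sketch, assuming update_date is a TIMESTAMP and using a made-up month:

```sql
-- Hypothetical example: read only the January 2023 partition.
-- Assumes update_date is a TIMESTAMP; use DATE literals if it is a DATE column.
SELECT *
FROM `my_project.my_dataset.partitioned_table`
WHERE update_date >= TIMESTAMP '2023-01-01'
  AND update_date <  TIMESTAMP '2023-02-01'
```

If the estimated query size drops compared to the unfiltered SELECT *, the filter is hitting the partitions.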
I've recently moved to using AvroSerDe for my External tables in Hive.
Select col_name,count(*)
from table
group by col_name;
The above query gives me a count, whereas the query below does not:
Select count(*)
from table;
The reason is that Hive just looks at the table metadata and fetches the values. For some reason, the statistics for the table were not updated in Hive, which is why count(*) returns 0.
The statistics are written with no data rows at the time of table creation, and for any data appends/changes Hive needs to update these statistics in the metadata.
Running the ANALYZE command gathers statistics and writes them into the Hive MetaStore.
ANALYZE TABLE table_name COMPUTE STATISTICS;
Visit the Apache Hive wiki for more details about the ANALYZE command.
Other methods to solve this issue:
Using a 'limit' or 'group by' clause triggers a MapReduce job to count the rows and gives the correct value.
Setting fetch task conversion to none forces Hive to run a MapReduce job to count the rows:
hive> set hive.fetch.task.conversion=none;
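Another session setting that may be worth trying, and which is not mentioned above, is hive.compute.query.using.stats. When it is true, Hive can answer simple aggregates such as count(*) straight from the metastore statistics. Turning it off for the session should make the count scan the data instead. A sketch, with table_name as a placeholder:

```sql
-- Assumption: stale stats are being used because this property is enabled.
-- Disable answering aggregates from metastore statistics for this session.
SET hive.compute.query.using.stats=false;

-- This count should now be computed from the actual data files.
SELECT COUNT(*) FROM table_name;
```

This only works around stale statistics; running ANALYZE is still what fixes them.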
I ran two queries to get count of records for two different dates from a Hive managed table partitioned on process date field.
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01' --returned 2 million
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02' --returned 3 million
But when I run the query below with a UNION ALL clause, the counts returned differ from those of the individual queries above.
SELECT '2018-01-01', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01'
UNION ALL
SELECT '2018-01-02', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02'
What can be the root cause for this difference?
One of our teammates helped us identify the issue.
When we run a single count() query, the query is not physically executed on the table; rather, the count is taken from statistics.
One remedy is to collect the stats on the table again; then count() on the single table will reflect the actual count.
Regards,
Anoop
I too faced a similar issue with count(*) returning incorrect count. I added the below to my code and the counts are consistent now.
For a non-partitioned table use:
ANALYZE TABLE your_table_name COMPUTE STATISTICS;
For a partitioned table, analyze the recently added partition by specifying the partition value:
ANALYZE TABLE your_table_name
PARTITION(your_partition_name=your_partition_value)
COMPUTE STATISTICS;
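To check what Hive currently has stored, the numRows value in the table (or partition) parameters can be inspected before and after running ANALYZE. A sketch using the same placeholder names as above:

```sql
-- Look for numRows under "Table Parameters" in the output.
DESCRIBE FORMATTED your_table_name;

-- Same check for a single partition (placeholder names, quoted as a string value here).
DESCRIBE FORMATTED your_table_name
PARTITION (your_partition_name='your_partition_value');
```

If numRows does not match what a full scan returns, the stats are stale.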
I’m trying to copy a table’s schema to an empty table. It works for schemas with no nested records, but when I try to copy a schema with multiple nested records via this query:
SELECT * FROM [table] LIMIT 0
I get the following error:
Cannot output multiple independently repeated fields at the same time.
BigQuery will automatically flatten all results (see docs), which won't work when you have more than one nested record. In the BigQuery UI, click on Show Options.
Then select your destination table and make sure Allow Large Results is checked and Flatten Results is unchecked.
The drawback of the above approach is that the user can end up with quite a bill, as this way of copying the schema costs a full scan of the original table.
Instead, I would get the table schema programmatically and then create the new table with this schema.
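If standard SQL is an option, a DDL statement may be a simpler route than going through the API. This is not the programmatic approach described above, just an alternative sketch; the dataset and table names are hypothetical.

```sql
-- Creates an empty table with the same schema as the source table,
-- nested and repeated fields included. No rows are copied.
-- mydataset, source_table and empty_copy are placeholder names.
CREATE TABLE mydataset.empty_copy
LIKE mydataset.source_table;
```

Because this is a metadata operation rather than a SELECT over the source, it should not incur the table scan mentioned above.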