count(*) on Avro table returns 0 - hive

I've recently moved to using AvroSerDe for my External tables in Hive.
Select col_name,count(*)
from table
group by col_name;
The above query gives me a count, whereas the query below does not:
Select count(*)
from table;

The reason is that Hive answers this query from the table metadata alone instead of reading the data. The statistics for the table were never updated, so count(*) returns 0.
The statistics are written with zero rows when the table is created, and Hive needs them updated in the metastore after any data is appended or changed.
Running the ANALYZE command gathers the statistics and writes them to the Hive metastore:
ANALYZE TABLE table_name COMPUTE STATISTICS;
Visit the Apache Hive wiki for more details about the ANALYZE command.
Other methods to solve this issue:
Using a 'limit' or 'group by' clause triggers a MapReduce job that counts the rows and returns the correct value.
Setting fetch task conversion to none forces Hive to run a MapReduce job to count the rows:
hive> set hive.fetch.task.conversion=none;
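
As a rough sketch of the steps above, the shell function below recomputes the statistics and then counts the rows in a single Hive session. It assumes the hive CLI is on the PATH; the function name refresh_and_count is made up for illustration. The hive.compute.query.using.stats setting is an addition that the answer above does not mention: it controls whether Hive may answer count(*) from metastore statistics alone.

```shell
# Hypothetical helper: recompute table statistics, then count the rows.
# Assumes the `hive` CLI is available; the table name is passed as $1.
# hive.compute.query.using.stats=false asks Hive not to answer count(*)
# purely from the statistics stored in the metastore.
refresh_and_count() {
  hive -S -e "
    ANALYZE TABLE $1 COMPUTE STATISTICS;
    set hive.compute.query.using.stats=false;
    SELECT COUNT(*) FROM $1;
  "
}
```

Usage would look like refresh_and_count my_avro_table, where my_avro_table is a placeholder name.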

Related

Fetch all Column Statistics using Single Query Hive

I understand that all the column statistics can be computed for a Hive table using the command:
ANALYZE TABLE Table1 COMPUTE STATISTICS;
Specific column-level stats can then be fetched with the command:
DESCRIBE FORMATTED Table1.Column1;
....
DESCRIBE FORMATTED Table1.Columnn;
Is it possible to fetch all column stats using a single command?

Apache hive - How to limit partitions in show command

Is there any way to limit the number of Hive partitions while listing the partitions in show command?
I have a Hive table with around 500 partitions, and I want only the latest partition. The show command lists all the partitions. I am using this partition to find out the location details. I do not have access to the metastore to query the details, and the partition location is where the actual data resides.
I tried set hive.limit.query.max.table.partition=1 but this does not affect the metastore query. So, is there any other way to limit the partitions listed?
Thank you,
Revathy.
Are you running from the command line?
If so, you can get the desired result with something like this:
hive -e "set hive.cli.print.header=false;show partitions table_name;" | tail -1
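
The one-liner above relies on the order in which Hive prints the partitions. A slightly more defensive sketch sorts the output first. It assumes the hive CLI is on the PATH and that the partition values sort lexicographically in the order you care about (for example dt=2017-12-07); latest_partition is a made-up name.

```shell
# Hypothetical helper: print the last partition of a table.
# Assumes the `hive` CLI is available and that partition specs such as
# dt=2017-12-07 sort lexicographically from oldest to newest.
latest_partition() {
  hive -S -e "show partitions $1;" | sort | tail -n 1
}
```

With the partition spec in hand, DESCRIBE FORMATTED table_name PARTITION (...) should print the partition's location without direct access to the metastore database.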
There is a "BAD" way to obtain what you want. You can treat the partition columns like any other columns and extract them with a SELECT ... LIMIT query:
SELECT DISTINCT partition_column
FROM partitioned_table
ORDER BY partition_column DESC
LIMIT 1;
The only way to filter SHOW PARTITIONS is with a PARTITION clause:
SHOW PARTITIONS partitioned_table PARTITION ( partitioned_column = "somevalue" );

BigQuery - Delete rows from Partitioned Table

I have a Day-Partitioned Table on BigQuery. When I try to delete some rows from the table using a query like:
DELETE FROM `MY_DATASET.partitioned_table` WHERE id = 2374180
I get the following error:
Error: DML statements are not yet supported over partitioned tables.
A quick Google search leads me to: https://cloud.google.com/bigquery/docs/loading-data-sql-dml where it also says: "DML statements that modify partitioned tables are not yet supported."
So for now, is there a workaround that we can use in deleting rows from a partitioned table?
DML has some known issues and limitations at this stage, such as:
DML statements cannot be used to modify tables with REQUIRED fields in their schema.
Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement. There is no support for multi-statement transactions.
The following combinations of DML statements are allowed to run concurrently on a table:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
Otherwise one of the DML statements will be aborted. For example, if two UPDATE statements execute simultaneously against the table then only one of them will succeed.
Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. If it is absent, the table can be modified using UPDATE or DELETE statements.
DML statements that modify partitioned tables are not yet supported.
Also be aware of the quota limits:
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
What you can do is copy the entire partition to a non-partitioned table and execute the DML statement there, then write the temp table back to the partition. Also, if you run into the limit on DML statements per day per table, you need to create a copy of the table and run the DML on the new table to avoid the limit.
You could delete partitions in partitioned tables using the command-line bq rm, like this:
bq rm 'mydataset.mytable$20160301'
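
The partition decorator is the table name followed by $ and the date as YYYYMMDD. A minimal sketch for building it from a date and dropping that partition, assuming the bq CLI is installed; both function names are made up, and -f skips the confirmation prompt, so use it with care:

```shell
# Hypothetical helper: turn a table name and a YYYY-MM-DD date into a
# partition decorator such as mydataset.mytable$20160301.
partition_decorator() {
  printf '%s$%s\n' "$1" "$(printf '%s' "$2" | tr -d '-')"
}

# Hypothetical helper: delete one day partition (assumes the `bq` CLI).
# -t marks the target as a table, -f skips the confirmation prompt.
drop_partition() {
  bq rm -f -t "$(partition_decorator "$1" "$2")"
}
```

For example, drop_partition mydataset.mytable 2016-03-01 would run the same command as the one shown above, with -f and -t added.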
I've already done this without a temporary table. Steps:
1) Prepare a query that selects all the rows from a particular partition that should be kept:
SELECT * FROM `your_data_set.tablename` WHERE
_PARTITIONTIME = timestamp('2017-12-07')
AND condition_to_keep_rows_which_shouldn't_be_deleted = 'condition'
If necessary, run this for other partitions.
2) Choose a destination table for the result of your query that points TO THE PARTICULAR PARTITION. You need to provide the table name like this:
tablename$20171207
3) Check the option "Overwrite table". It will overwrite only that particular partition.
4) Run the query. As a result, the redundant rows will be deleted from the targeted partition!
Remember that you may need to run this for other partitions if the rows to be deleted are spread across more than one partition.
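
The same steps can probably be scripted with the bq command-line tool instead of the UI. The sketch below assumes the bq CLI is installed; rewrite_partition is a made-up name, and the query passed in is whatever SELECT keeps the rows you want. Here --replace plays the role of the "Overwrite table" option and --destination_table names the partition through its decorator.

```shell
# Hypothetical helper: overwrite one day partition with only the rows to keep.
#   $1 = dataset.table
#   $2 = partition as YYYYMMDD
#   $3 = standard SQL query selecting the rows to keep
# Assumes the `bq` CLI; --replace overwrites the destination, which here is
# the single partition named by the $YYYYMMDD decorator.
rewrite_partition() {
  bq query --use_legacy_sql=false --replace \
    --destination_table="$1\$$2" "$3"
}
```

It would be called once per affected partition, for example with your_data_set.tablename, 20171207 and the query from step 1.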
Looks like, as of this writing, this is no longer a BigQuery limitation!
In standard SQL, a statement like the one above, over a partitioned table, will succeed, assuming the rows being deleted weren't recently (within the last 30 minutes) inserted via a streaming insert.
Current docs on DML: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language
Example Query that worked for me in the BQ UI:
DELETE
FROM dataset_name.partitioned_table_on_timestamp_column
WHERE
timestamp >= '2020-02-01' AND timestamp < '2020-06-01'
After the hamsters are done spinning, we get the BQ response:
This statement removed 101 rows from partitioned_table_on_timestamp_column

hive select count(*) data displaying, but no data visible

When I run the query select count(*) from table; a count is displayed, but when I run select * from table; no data is displayed.
Can you please help? These are external tables. I went to the location of the tables and I see there's no data present.
Run analyze table table_name COMPUTE STATISTICS; on your table.
This will give the correct result. Hive keeps the table's statistics (such as the row count) in the metastore for fast retrieval, so after the underlying data files are deleted it still returns the old stats until they are recomputed.
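
To confirm that stale statistics are the cause, one option is to compare the count Hive reports with the number of files actually present at the table location. This is only a sketch: it assumes the hdfs CLI is available, and file_count is a made-up name.

```shell
# Hypothetical helper: print the number of files under a table location.
# Assumes the `hdfs` CLI; `hdfs dfs -count` prints
# DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME, so field 2 is the file count.
file_count() {
  hdfs dfs -count "$1" | awk '{print $2}'
}
```

If this prints 0 for the table's location while count(*) is non-zero, the count is coming from old statistics rather than from data.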

Why Hive use file from other files under the partition table?

I have a simple table in my Hive. It has only one partition:
show partitions hive_test;
OK
pt=20130805000000
Time taken: 0.124 seconds
But when I execute a simple SQL query, Hive tries to find a data file under the folder 20130805000000. Why doesn't it just use the file 20130805000000?
sql:
SELECT buyer_id AS USER_ID from hive_test limit 1;
and this is the exception:
java.io.IOException: /group/myhive/test/hive/hive_test/pt=20130101000000/data doesn't exist!
at org.apache.hadoop.hdfs.DFSClient.listPathWithLocations(DFSClient.java:1045)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:352)
at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listLocatedStatus(ChRootedFileSystem.java:270)
at org.apache.hadoop.fs.viewfs.ViewFileSystem.listLocatedStatus(ViewFileSystem.java:851)
at org.apache.hadoop.hdfs.Yunti3FileSystem.listLocatedStatus(Yunti3FileSystem.java:349)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listLocatedStatus(SequenceFileInputFormat.java:49)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:242)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:261)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1238)
And my question is why does hive try to find file "/group/myhive/test/hive/hive_test/pt=20130101000000/data", but not "/group/myhive/test/hive/hive_test/pt=20130101000000/"?
You are getting this error because you created a partition on your Hive table but are not specifying the partition in the select statement.
In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried.
Please provide partition name in your select query or use your query like this:
select buyer_id AS USER_ID from hive_test where pt='20130805000000' limit 1;
Please see Link to know more about hive partition.
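
To see what Hive would read for a given partition, it can help to list the partition's sub-directory directly. The sketch below assumes the default layout described above (table directory, then column=value) and the hdfs CLI; a partition added with a custom location would not follow this layout. Both function names are made up.

```shell
# Hypothetical helper: build the default directory of one partition.
#   $1 = table directory, $2 = partition column, $3 = partition value
partition_dir() {
  printf '%s/%s=%s\n' "${1%/}" "$2" "$3"
}

# Hypothetical helper: list the files in that partition directory
# (assumes the `hdfs` CLI).
list_partition() {
  hdfs dfs -ls "$(partition_dir "$1" "$2" "$3")"
}
```

For the table in the question, list_partition /group/myhive/test/hive/hive_test pt 20130805000000 would show whether that directory contains files or further sub-directories.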