Hive query "COUNT" returns different results - hive

I have an external Hive table with HBase as the underlying storage. When I ran this query:
SELECT COUNT(*)
FROM myTable
WHERE row_key>='1'
AND row_key<='100';
It returned 0 results.
However, when I ran it with an additional condition:
SELECT COUNT(*)
FROM myTable
WHERE row_key>='1'
AND row_key<='100'
AND is_new = true;
It returned 80 results.
How is it possible that the query with fewer conditions returns 0 results?

Related

postgres jsonb_object_keys distinct or group by extremely slow

Database version is PostgreSQL 11.16
My table has 424,868 records with a JSON field. When I do:
SELECT jsonb_object_keys(raw_json) FROM table;
It returns results within a second. So, I need to remove duplicate keys, but when I do:
SELECT DISTINCT jsonb_object_keys(raw_json) FROM table;
My database CPU increases to 100% and it takes 15 minutes to get a result. I tried a solution with GROUP BY:
select array_agg(json_keys),id from (
select jsonb_object_keys(raw_json) as json_keys, id from table) a group by a.id
Same result.
For debugging I did this:
select count(*) from (SELECT jsonb_object_keys(raw_json) as k from table) test
and it returns 41,633,935 keys.
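One rewrite that is sometimes suggested for this pattern (only a sketch, not benchmarked against this table, and whether it helps depends on the planner) is to expand the keys in a subquery first and then group on the key text itself rather than on id:
-- Expand every key first, then deduplicate the key values themselves by grouping on them
-- (grouping by id, as above, yields one array of keys per id rather than a distinct key list).
SELECT k
FROM (
    SELECT jsonb_object_keys(raw_json) AS k
    FROM table
) t
GROUP BY k;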

DB2 - table contains data but a select statement on specific columns returns empty

The database is DB2.
A table contains data, and a "SELECT * FROM TABLE1" statement returns the data correctly.
However, statements like the following return an empty result:
SELECT COL1 FROM TABLE1
Also, SELECT COUNT(COL1) FROM TABLE1 returns zero, but SELECT COUNT(*) FROM TABLE1 returns the correct result.
The table contains 7 columns, and only 3 of them have this problem; all the other columns are fine.
This is the first time I have encountered such a situation in the past 10 years.

Hive count elements of the max partition column

I'm struggling with a query that may look simple but which is causing me a lot of trouble.
SELECT COUNT(*) FROM mytable WHERE partition_column IN (SELECT MAX(partition_column) FROM mytable)
mytable is a 2 TB external Hive table, partitioned by the column partition_column. This query takes 10 minutes to run.
When I run 2 separate queries:
SELECT MAX(partition_column) FROM mytable
> 2020-06-29
SELECT COUNT(*) FROM mytable where partition_column = '2020-06-29'
Both work fine and run very quickly.
Am I missing something?
Thank you
I'm on Hive 1.2.1 and Hadoop 2.7.3
It looks like the subquery is what takes a long time to process. Since you are filtering on the same column and table as the subquery, the reducer step takes a long time, which results in the slow-running query.
You could improve your query by introducing a CTE, which creates a temporary result set. Something like this:
WITH MY_CTE_SUBQUERY AS (
    SELECT MAX(partition_column) AS max_pc FROM mytable
)
SELECT COUNT(*)
FROM mytable
WHERE partition_column IN (SELECT max_pc FROM MY_CTE_SUBQUERY);
More on Hive CTEs in the official docs.

Why count doesn't return 0 on empty table

I need to count a table's rows, but I ran into some unusual behavior of count(*).
count(*) does not return a result when I use a multi-column select on an empty table, but it returns the expected result (0) if I remove the other columns from the select statement (single-column select).
In the code below you will find multiple tests to show you what I'm talking about.
The structure of the code below is:
1) Creation of a table
2) Multi-column select on empty table tests, which return unexpected results
3) Single-column select on empty table test, which returns the expected result
4) Multi-column select on filled table test, which returns the expected result
Question
Given these results, my question is:
Why does a multi-column select on an empty table not return 0, while a single-column select does?
Expected Results definition
By expected results I mean:
If a table is empty, count(*) returns 0.
If a table is not empty, count(*) returns the row count.
--CREATE TEST TABLE
CREATE TABLE #EMPTY_TABLE(
ID INT
)
DECLARE @ID INT
DECLARE @ROWS INT
--MULTI COLUMN SELECT WITH EMPTY TABLE
--assignment attempt (Multi-column SELECT)
SELECT @ID = ID, @ROWS = COUNT(*)
FROM #EMPTY_TABLE
--returns NULL instead of 0
SELECT @ROWS Test_01, ISNULL(@ROWS, 1) 'IS NULL'
--Set the variable to a random value, just to show that not even the assignment is happening
SET @ROWS = 29
--assignment attempt (Multi-column SELECT)
SELECT @ID = ID, @ROWS = COUNT(*)
FROM #EMPTY_TABLE
--returns 29 instead of 0
SELECT @ROWS Test_02
--SINGLE COLUMN SELECT WITH EMPTY TABLE
--assignment attempt (Single-column SELECT)
SELECT @ROWS = COUNT(*)
FROM #EMPTY_TABLE
--returns 0, the expected result
SELECT @ROWS Test_03
--MULTI COLUMN SELECT WITH FILLED TABLE
--insert a row
INSERT INTO #EMPTY_TABLE(ID)
SELECT 1
--assignment attempt
SELECT @ID = ID, @ROWS = COUNT(*)
FROM #EMPTY_TABLE
--returns 1
SELECT @ROWS Test_04
So I read up on the grouping mechanisms of Sybase and came to the conclusion that in your query you have a "Transact-SQL extended column" (see the docs on group by, under Usage -> Transact-SQL extensions to group by and having):
A select list that includes aggregates can include extended columns that are not arguments of aggregate functions and are not included in the group by clause. An extended column affects the display of final results, since additional rows are displayed.* (emphasis mine)
(Regarding the *: this last statement is actually wrong in your specific case, since the one result row turns into zero rows.)
Also, in the docs on group by under Usage -> How group by and having queries with aggregates work, you'll find:
The group by clause collects the remaining rows into one group for each unique value in the group by expression. Omitting group by creates a single group for the whole table. (emphasis mine)
So essentially:
Having a COUNT(*) makes the whole query an aggregate query, since COUNT is an aggregate function (causing an implicit GROUP BY NULL).
Adding ID to the SELECT clause then expands the single group (consisting of no rows) into its contained rows (none) and joins them with the aggregate result columns.
In your case the count is 0, but since you also query for the ID, a row is generated for every ID and the count is appended to it. However, since your table has no rows, there are no result rows whatsoever, and thus no assignments. (Some examples are in the linked docs; and since there is no ID, and an existing ID must be in the ID column of your result, ...)
To always get the count, you should probably only do SELECT @ROWS = COUNT(*) and select the IDs separately, as sketched below.
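A minimal sketch of that suggestion (my reading of the advice above, using the variable and table names from the question; it is not code from the original answer):
-- Take the count on its own; this always assigns, and yields 0 for an empty table.
SELECT @ROWS = COUNT(*)
FROM #EMPTY_TABLE
-- Fetch the ID in a separate statement; it assigns only when a row actually exists.
SELECT @ID = ID
FROM #EMPTY_TABLE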
If you are counting rows and trying to get an ID when there are no rows, you need to check whether one EXISTS.
Something like this:
SELECT COUNT(*),
(CASE WHEN EXISTS(SELECT ID FROM EMPTY_TABLE) THEN (SELECT ID FROM EMPTY_TABLE) ELSE 0 END) AS n_id
FROM EMPTY_TABLE
With more than one row you will get a subquery error.
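One way around that error (my own variation, not part of the original answer) is to collapse the scalar subquery to a single value, for example with MAX:
-- MAX(ID) returns at most one value, so the scalar subquery cannot return more
-- than one row; ISNULL supplies 0 when the table is empty.
SELECT COUNT(*),
       ISNULL((SELECT MAX(ID) FROM EMPTY_TABLE), 0) AS n_id
FROM EMPTY_TABLE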
This query:
SELECT @ID = ID, @ROWS = COUNT(*)
FROM #EMPTY_TABLE
The problem is that COUNT(*) makes this an aggregation query, but you also want to return ID. There is no GROUP BY.
I suspect your ultimate problem is that you are ignoring such errors.
This SQL Fiddle uses SQL Server (which is similar to Sybase). However, the failure is quite general and due to a query that would not work in almost any database.
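For illustration, a minimal SQL Server sketch of that point (my own example, not the contents of the linked fiddle):
-- Without GROUP BY, SQL Server rejects this with an error along the lines of:
-- "Column 'ID' is invalid in the select list because it is not contained in
-- either an aggregate function or the GROUP BY clause."
SELECT ID, COUNT(*)
FROM #EMPTY_TABLE
-- With GROUP BY the query is valid, but an empty table still produces zero
-- result rows, so a variable assignment based on it would never happen.
SELECT ID, COUNT(*)
FROM #EMPTY_TABLE
GROUP BY ID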

Fast approximate counting in Postgres

I'm querying my database (Postgres 8.4) with something like the following:
SELECT COUNT(*) FROM table WHERE indexed_varchar LIKE 'bar%';
The complexity of this is O(N) because Postgres has to count each row. Postgres 9.2 has index-only scans, but upgrading isn't an option, unfortunately.
However, getting an exact count of rows seems like overkill, because I only need to know which of the following three cases is true:
Query returns no rows.
Query returns one row.
Query returns two or more rows.
So I don't need to know that the query returns 10,421 rows, just that it returns more than two.
I know how to handle the first two cases:
SELECT EXISTS (SELECT 1 FROM table WHERE indexed_varchar LIKE 'bar%');
This returns true if one or more rows exist and false if none exist.
Any ideas on how to expand this to cover all three cases in an efficient manner?
-- LIMIT 2 stops the scan after two matching rows, so the count can only be 0, 1, or 2.
SELECT COUNT(*) FROM (
    SELECT * FROM table WHERE indexed_varchar LIKE 'bar%' LIMIT 2
) t;
Should be simple. You can use LIMIT to do what you want and return the count using a CASE statement.
SELECT CASE WHEN c = 2 THEN 'more than one' ELSE CAST(c AS TEXT) END
FROM
(
    SELECT COUNT(*) AS c
    FROM (
        SELECT 1 AS c FROM table WHERE indexed_varchar LIKE 'bar%' LIMIT 2
    ) t
) v