Incorrect results for Hive Count - hive

I am getting incorrect counts when doing a full count of the table compared to when a WHERE clause is used. Results are shown below:
hive> SELECT count(1) FROM Table_MAS MAS;
OK
11317322
hive> SELECT count(1) FROM Table_MAS where Col_A IS NOT NULL and Col_B is NOT NULL;
OK
552589106
I have already run ANALYZE and a repair on the table. It doesn't look like there is anything wrong.
Wanted to see if anyone else has faced a similar situation and if so how did you correct it?
Obviously, I expected the count with the WHERE clause to always be equal to or lower than the full count.

You should use:
select count(*) FROM Table_MAS MAS;
COUNT(*) counts the number of rows. COUNT(1) counts the rows for which the expression 1 is non-NULL, which is every row, so it should return the same number as COUNT(*). COUNT(column) counts only the non-NULL values in that column.

Related

What happens when you use DISTINCT * in COUNT() in SQL?

I've just learned about the COUNT() function, and how it is possible to get the number of rows in a column by passing * as the argument.
SELECT COUNT(*) FROM table;
I've also learned that we can get the number of distinct rows of a column in a table by using DISTINCT.
SELECT COUNT(DISTINCT column) FROM table;
I've noticed that the following returns nothing.
SELECT COUNT(DISTINCT *) FROM table;
Why is this?
I suppose the root of my issue is that I don't fully understand what the COUNT() function does with * as the argument. My resource says that the COUNT() function takes a column as an argument and counts how many non-NULL rows there are. So say we have a table with a column in which some rows are NULL and some are non-NULL. If COUNT(column) doesn't count the NULL rows, what happens differently in COUNT(*) so that all the rows are counted? And by extension, what happens during COUNT(DISTINCT *)?
This would be a syntax error in most databases. If it were allowed, it would probably be equivalent to:
select count(*)
from (select distinct * from t) t
However, NULL values might throw it off.
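To make the subquery form above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table t, its columns, and the sample rows are made up for illustration, and the NULL handling shown is SQLite's (DISTINCT treats two NULLs as equal); other databases may differ.

```python
import sqlite3


def count_distinct_rows(rows):
    """Count distinct whole rows using the SELECT DISTINCT * subquery form."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-column table, only for this illustration.
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM t) AS d"
    ).fetchone()
    conn.close()
    return n


# Two exact duplicates, two duplicates containing a NULL, and one unique row.
sample_rows = [(1, "a"), (1, "a"), (2, None), (2, None), (3, "c")]
print(count_distinct_rows(sample_rows))  # 3 in SQLite
```

In SQLite the two (2, NULL) rows collapse into one, so NULLs do not inflate the result here.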

COUNT(DISTINCT) and COUNT(*) + GROUP BY give different results

We're querying one of our data sets for unique IDs:
SELECT count(distinct id) FROM [MyTable] LIMIT 1
Another query ran a similar command
SELECT count(*) From ( select id FROM MyTable group by id) A ;
The first command is more efficient, but the output should be identical. However, they give different results. The first query returns a count that is higher by about 1.5%, on a dataset of over 100 million rows.
COUNT(DISTINCT field) is just an estimate. If you need exact results you can use EXACT_COUNT_DISTINCT(field).
This is explained in the query reference: https://cloud.google.com/bigquery/query-reference?hl=en#countdistinct
Check the definition of COUNT([DISTINCT] field [, n]):
It is a statistical approximation and is not guaranteed to be exact.
The second query returns an exact count, hence the difference.
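As a baseline for comparison, here is a small sketch in Python with sqlite3, an engine whose COUNT(DISTINCT) is exact. It does not reproduce the approximation described above; it only shows what the two query shapes return when both are exact. The table name my_table and the sample ids are assumptions for illustration.

```python
import sqlite3


def distinct_counts(ids):
    """Return (COUNT(DISTINCT id), COUNT(*) over GROUP BY id) for the given ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER)")
    conn.executemany("INSERT INTO my_table VALUES (?)", [(i,) for i in ids])
    (via_distinct,) = conn.execute(
        "SELECT COUNT(DISTINCT id) FROM my_table"
    ).fetchone()
    (via_group_by,) = conn.execute(
        "SELECT COUNT(*) FROM (SELECT id FROM my_table GROUP BY id) AS a"
    ).fetchone()
    conn.close()
    return via_distinct, via_group_by


print(distinct_counts([1, 1, 2, 3, 3, 3]))   # both forms agree: (3, 3)
print(distinct_counts([1, 1, None, None, 2]))  # (2, 3): GROUP BY keeps a NULL group
```

Even with exact counting, the two forms can differ by one when id contains NULLs, because COUNT(DISTINCT id) skips NULL while GROUP BY id produces a NULL group.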

"group by" needed in count(*) SQL statement?

The following statement works in my database:
select column_a, count(*) from my_schema.my_table group by 1;
but this one doesn't:
select column_a, count(*) from my_schema.my_table;
I get the error:
ERROR: column "my_table.column_a" must appear in the GROUP BY clause
or be used in an aggregate function
Helpful note: This thread: What does SQL clause "GROUP BY 1" mean? discusses the meaning of "group by 1".
Update:
The reason why I am confused is because I have often seen count(*) as follows:
select count(*) from my_schema.my_table
where there is no group by statement. Is COUNT always required to be followed by group by? Is the group by statement implicit in this case?
This error makes perfect sense. COUNT is an "aggregate" function. When you select a non-aggregated column next to it, you need to tell the database which field to group by, which is done with the GROUP BY clause.
The one which probably makes most sense in your case would be:
SELECT column_a, COUNT(*) FROM my_schema.my_table GROUP BY column_a;
If you select only COUNT(*), you are asking for the total number of rows instead of aggregating by another condition. Your question of whether GROUP BY is implicit in that case could be answered with "sort of": not specifying anything is a bit like asking to "group by nothing", which means you get one huge aggregate over the whole table.
As an example, executing:
SELECT COUNT(*) FROM table;
will show you the number of rows in that table, whereas:
SELECT col_a, COUNT(*) FROM table GROUP BY col_a;
will show you the number of rows per value of col_a. Something like:
col_a   | COUNT(*)
--------+---------
value1  | 100
value2  | 10
value3  | 123
You should also take into account that the * means to count everything, including NULLs! If you want to count a specific condition, you should use COUNT(expression)! See the docs about aggregate functions for more details on this topic.
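The two forms discussed above can be tried out with a short Python sqlite3 sketch. The table contents are invented for illustration. Note that SQLite will not reproduce the error from the question, since it tolerates a bare column next to an aggregate, so only the two valid queries are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-column table standing in for my_schema.my_table.
conn.execute("CREATE TABLE my_table (col_a TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("value1",), ("value1",), ("value2",), (None,)],
)

# No GROUP BY: the whole table is one group, so exactly one row comes back.
total = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]

# GROUP BY col_a: one row per distinct value of col_a (NULL forms its own group).
per_value = conn.execute(
    "SELECT col_a, COUNT(*) FROM my_table GROUP BY col_a ORDER BY col_a"
).fetchall()

print(total)      # 4
print(per_value)  # [(None, 1), ('value1', 2), ('value2', 1)]
```

The per-group counts always add up to the ungrouped total, which is one way to see that the ungrouped query is "one huge aggregate".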
If you don't use the GROUP BY clause at all, the aggregate collapses the whole table into a single row, so there is no single value of column_a that could be shown next to the count. That is why the second statement raises an error. By adding GROUP BY 1 you split the rows into one group per value of column_a, and each group gets its own count.
When you have a function like count, sum etc. you need to group the other columns. This would be equivalent to your query:
select column_a, count(*) from my_schema.my_table group by column_a;
When you use count(*) with no other column, you are counting all rows from SELECT * from the table. When you use count(*) alongside another column, you are counting the number of rows for each different value of that other column. So in this case you need to group the results, in order to show each value and its count only once.
group by 1 in this case refers to column_a which has the column position 1 in your query.
This is why it works on your server. However, it is not good practice in SQL.
You should mention the column name, because the position refers to the select list, and the order of the selected columns may change, which makes this code hard to maintain.
The best solution is:
select column_a, count(*) from my_schema.my_table group by column_a;

What does "select count(1) from table_name" on any database tables mean?

When we execute select count(*) from table_name it returns the number of rows.
What does count(1) do? What does 1 signify here? Is this the same as count(*) (as it gives the same result on execution)?
The parameter to the COUNT function is an expression that is to be evaluated for each row. The COUNT function returns the number of rows for which the expression evaluates to a non-null value. ( * is a special expression that is not evaluated, it simply returns the number of rows.)
There are two additional modifiers for the expression: ALL and DISTINCT. These determine whether duplicates are discarded. Since ALL is the default, your example is the same as count(ALL 1), which means that duplicates are retained.
Since the expression "1" evaluates to non-null for every row, and since you are not removing duplicates, COUNT(1) should always return the same number as COUNT(*).
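The rules above are easy to check with a small Python sqlite3 sketch. The table t and its rows are hypothetical. SQLite is used only because it ships with Python; the ALL keyword is left out because it is the default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: three rows, one of them with a NULL name.
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "abc"), (1, "abc"), (2, None)],
)

counts = conn.execute(
    """
    SELECT COUNT(*),             -- all rows
           COUNT(1),             -- rows where the constant 1 is non-NULL: all rows
           COUNT(name),          -- rows where name is non-NULL
           COUNT(DISTINCT name), -- distinct non-NULL names
           COUNT(DISTINCT 1)     -- distinct values of the constant: just one
    FROM t
    """
).fetchone()

print(counts)  # (3, 3, 2, 1, 1)
```

COUNT(*) and COUNT(1) agree, while COUNT(name) drops the NULL row and the DISTINCT variants also discard duplicates.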
Here is a link that will help answer your questions. In short:
count(*) is the correct way to write it and count(1) is OPTIMIZED TO BE count(*) internally -- since
a) count the rows where 1 is not null
is less efficient than
b) count the rows
Difference between count(*) and count(1) in oracle?
count(*) means it will count all records, i.e. every row,
BUT
count(1) means it will add one pseudo column with value 1 and return the count of all records.
This is similar to the difference between
SELECT * FROM table_name and SELECT 1 FROM table_name.
If you do
SELECT 1 FROM table_name
it will give you the number 1 for each row in the table. So yes, count(*) and count(1) will provide the same results, as will count(8). count(column_name) matches them only when the column contains no NULLs.
There is no difference.
COUNT(1) is basically just counting a constant value 1 column for each row. As other users here have said, it's the same as COUNT(0) or COUNT(42). Any non-NULL value will suffice.
http://asktom.oracle.com/pls/asktom/f?p=100:11:2603224624843292::::P11_QUESTION_ID:1156151916789
The Oracle optimizer apparently used to have bugs in it, which caused the count to be affected by which column you picked and whether it was in an index, so the COUNT(1) convention came into being.
SELECT COUNT(1) from <table name>
should do the exact same thing as
SELECT COUNT(*) from <table name>
There may have been, or may still be, some reasons why it would perform better than SELECT COUNT(*) on some databases, but I would consider that a bug in the DB.
SELECT COUNT(col_name) from <table name>
however has a different meaning, as it counts only the rows with a non-null value for the given column.
In Oracle, I believe these have exactly the same meaning.
You can test like this:
create table test1(
id number,
name varchar2(20)
);
insert into test1 values (1,'abc');
insert into test1 values (1,'abc');
select * from test1;
select count(*) from test1;
select count(1) from test1;
select count(ALL 1) from test1;
select count(DISTINCT 1) from test1;
Depending on who you ask, some people report that executing select count(1) from random_table; runs faster than select count(*) from random_table. Others claim they are exactly the same.
This link claims that the speed difference between the 2 is due to a FULL TABLE SCAN vs FAST FULL SCAN.

Not getting the correct count in SQL

I am totally new to SQL. I have a simple select query similar to this:
SELECT COUNT(col1) FROM table1
There are some 120 records in the table and shown on the GUI.
For some reason, this query always returns a number which is less than the actual count.
Can somebody please help me?
Try
select count(*) from table1
Edit: To explain further, count(*) gives you the rowcount for a table, including duplicates and nulls. count(isnull(col1,0)) will do the same thing, but slightly slower, since isnull must be evaluated for each row.
You might have some null values in col1 column. Aggregate functions ignore nulls.
try this
SELECT COUNT(ISNULL(col1,0)) FROM table1
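Here is a quick way to see the effect, sketched in Python with sqlite3. SQLite has no two-argument ISNULL, so the standard COALESCE is used in its place as an assumed equivalent. The rows in table1 are invented for illustration.

```python
import sqlite3


def null_aware_counts(values):
    """Return (COUNT(*), COUNT(col1), COUNT(COALESCE(col1, 0))) for the given values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (col1 INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?)", [(v,) for v in values])
    result = conn.execute(
        "SELECT COUNT(*), COUNT(col1), COUNT(COALESCE(col1, 0)) FROM table1"
    ).fetchone()
    conn.close()
    return result


# Five rows, two of which have NULL in col1.
print(null_aware_counts([10, None, 20, None, 30]))  # (5, 3, 5)
```

COUNT(col1) comes out lower than the row count by exactly the number of NULLs, while replacing NULL with a non-NULL value brings it back in line with COUNT(*).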
Slightly tangential, but there's also the useful
SELECT count(distinct cola) from table1
which gives you the number of distinct non-NULL values of cola in the table.
You are getting the correct count
As per https://learn.microsoft.com
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates an expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates an expression for each row in a group and returns the number of unique, non null values.
In your case you have passed the column name to COUNT, so you get the count of non-NULL records only. Your table data probably has NULL values in the given column (col1).
Hope this helps!