I have a table with two columns, act and dst, in Hive:
act dst
success Info
success High
failure medium
Blank low
failure Info
I want to count the occurrences of each value in the act column, like this:
act count
success 2
failure 2
Blank 1
If it is possible to get the counts for both columns, even better.
Use GROUP BY with count(*) as the aggregate in your SELECT query.
Try this query:
select act,count(*) from <table_name> group by act;
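The question also asks about counting both columns. A minimal sketch of one way to do that is to run one GROUP BY per column and stack the results with UNION ALL. The example below uses Python's built-in sqlite3 module as a stand-in for Hive so it can be run anywhere; the table name events and the function name value_counts are assumptions made for illustration, and the SQL itself should carry over to Hive with little or no change.

```python
import sqlite3

def value_counts(rows):
    """Count how often each value occurs in act and in dst."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table name; the question does not name its table.
    con.execute("CREATE TABLE events (act TEXT, dst TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?)", rows)
    # One GROUP BY per column; UNION ALL stacks both results and the
    # first output column records which source column a count belongs to.
    query = """
        SELECT 'act' AS col, act AS value, COUNT(*) AS n
        FROM events GROUP BY act
        UNION ALL
        SELECT 'dst' AS col, dst AS value, COUNT(*) AS n
        FROM events GROUP BY dst
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

# The sample data from the question.
SAMPLE = [
    ("success", "Info"),
    ("success", "High"),
    ("failure", "medium"),
    ("Blank", "low"),
    ("failure", "Info"),
]

counts = value_counts(SAMPLE)
for col, value, n in counts:
    print(col, value, n)
```

Each output row is (column name, value, count), so the act counts and the dst counts can be told apart in a single result.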
I am working on a Hive query that needs to count an enormous amount of data stored in a table like this one:
table[ |column_name |
| (...) |
|{"data":"number"}|
(...) ]
The performance of my query is important because of the amount of data I have to process.
Searching the web, I found three ways to do this. Does anyone know which is the fastest, or can anyone propose another method that is faster than these three?
SELECT count(*) FROM table WHERE column LIKE TRIM('{"data":"number"}')
SELECT count(*) FROM table WHERE get_json_object(column,'$.data')=="number"
SELECT count(*) FROM (
SELECT json_tuple(column,'number') FROM table) AS new_column
WHERE new_column.col1 = number
I want to find duplicate rows in one of my Hive tables, and I was given two approaches.
The first approach is to use the following two queries:
select count(*) from mytable; -- this gives the total row count
The second query, below, gives the count of distinct rows:
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my tables the first query gives a total row count of 3500 and the second query gives 2700. That suggests 3500 - 2700 = 800 rows are duplicates, but it doesn't tell me which rows are duplicated.
My second approach to finding duplicates is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
This query should list the rows that are duplicated and how many times each one occurs, but it returns zero rows, which would mean there are no duplicate rows in the table.
So I would like to know:
Is my first approach correct? If so, how do I find which rows are duplicated?
Why does the second approach not list the duplicated rows?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get a list of the duplicated rows.
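As a rough illustration of the GROUP BY ... HAVING pattern above, here is a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive; the two-column layout of mytable and the function name duplicated_rows are assumptions for the example, not the asker's real schema.

```python
import sqlite3

def duplicated_rows(rows):
    """Return (primary_key1, primary_key2, copies) for every repeated row."""
    con = sqlite3.connect(":memory:")
    # Hypothetical two-column layout; a real table would list every column.
    con.execute("CREATE TABLE mytable (primary_key1 INTEGER, primary_key2 TEXT)")
    con.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    query = """
        SELECT primary_key1, primary_key2, COUNT(*) AS copies
        FROM mytable
        GROUP BY primary_key1, primary_key2
        HAVING COUNT(*) > 1
        ORDER BY primary_key1, primary_key2
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

ROWS = [(1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c"), (3, "c")]
dups = duplicated_rows(ROWS)
print(dups)
```

Only the groups that occur more than once come back, each with the number of copies, so the unique row (2, "b") is filtered out by the HAVING clause.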
The analytic window function row_number() is quite useful here: it can flag duplicates based on the columns listed in the PARTITION BY clause. A simple inline view and an EXISTS clause will then pinpoint which records in the original table belong to those duplicate sets. In some databases (like Teradata) you can skip the inline view by using a QUALIFY clause.
SQL1 and SQL2 can be combined. For SQL2, if you want to handle NULLs rather than simply dismiss them, a coalesce and concatenation might be better in the distinct count:
SELECT count(1) , count(distinct concat(coalesce(keypart1 ,''), coalesce(keypart2 ,'')) )
FROM srcTable s
3) This finds every record that belongs to a duplicated key, not just the copies with seq > 1. It returns all the context data as well as the keys, which is useful when analyzing why you have duplicates.
select * from srcTable s
where exists
( select 1 from (
SELECT
keypart1,
keypart2,
row_number() over( partition by keypart1, keypart2 ) seq
FROM srcTable t
-- WHERE (whatever additional filtering you want)
) t
where seq > 1
AND t.keypart1 = s.keypart1
AND t.keypart2 = s.keypart2
)
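To make the behaviour of the row_number() plus EXISTS query concrete, here is a small sketch that runs the same shape of query against an in-memory SQLite database from Python. SQLite stands in for Hive, it needs a SQLite build with window-function support (3.25 or later), and the payload column and the function name rows_with_duplicate_keys are assumptions added for the example.

```python
import sqlite3

def rows_with_duplicate_keys(rows):
    """Return every full row whose (keypart1, keypart2) occurs more than once."""
    con = sqlite3.connect(":memory:")
    # payload is a hypothetical extra column standing in for "context data".
    con.execute(
        "CREATE TABLE srcTable (keypart1 INTEGER, keypart2 TEXT, payload TEXT)"
    )
    con.executemany("INSERT INTO srcTable VALUES (?, ?, ?)", rows)
    query = """
        SELECT s.keypart1, s.keypart2, s.payload
        FROM srcTable s
        WHERE EXISTS (
            SELECT 1 FROM (
                -- Number the rows inside each key group: 1, 2, 3, ...
                SELECT keypart1, keypart2,
                       ROW_NUMBER() OVER (PARTITION BY keypart1, keypart2) AS seq
                FROM srcTable
            ) t
            -- A seq above 1 exists only for keys that occur more than once.
            WHERE t.seq > 1
              AND t.keypart1 = s.keypart1
              AND t.keypart2 = s.keypart2
        )
        ORDER BY s.keypart1, s.keypart2, s.payload
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

ROWS = [
    (1, "a", "first"),
    (1, "a", "second"),
    (2, "b", "only"),
    (3, "c", "x"),
    (3, "c", "y"),
]
dup_rows = rows_with_duplicate_keys(ROWS)
print(dup_rows)
```

Unlike the GROUP BY ... HAVING form, this returns the complete rows for each duplicated key, including the first copy, which makes it easier to compare the copies side by side.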
Suppose you want to find duplicate rows based on a particular column, ID in this case. The query below returns all the IDs that are duplicated in the Hive table.
SELECT ID
FROM TABLE
GROUP BY ID
HAVING count(ID) > 1
I am new to SQL, and after writing some queries I wanted to understand how SQL processes queries "internally". I took this query from another post on Stack Overflow:
select name from contacts
group by name
having count(*) > 1
My question is: GROUP BY name merges all rows with the same name into one row, so how does count then know how many rows with the same name were merged? I am trying to split the processing of the query into steps to understand exactly how it works, but in this case it seems you cannot split it. Thanks in advance.
For the SQL query you show,
the execution sequence is as follows:
from contacts
This determines which table's data you are reading. Next would be the WHERE clause, but since you don't have one, processing moves on to the next step, which is
group by name
This groups all the rows with the same name together.
Side note: the SELECT has not run yet at this point, so each group still contains all of its rows, which is how HAVING can count the rows that share a name.
Next is your
having count(*) > 1
This keeps only the groups whose count is greater than 1. Lastly comes the SELECT:
select name
That is the execution sequence for your example.
And this is the full logical processing sequence of a SQL query:
1. FROM
2. ON
3. OUTER
4. WHERE
5. GROUP BY
6. CUBE | ROLLUP
7. HAVING
8. SELECT
9. DISTINCT
10. ORDER BY
11. TOP
Hope it helps.
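The FROM, GROUP BY, HAVING, SELECT order described above can be mimicked step by step in plain Python. This is only a conceptual sketch of the logical order, not how any particular database engine is implemented; the function name names_with_duplicates and the sample contacts are made up for illustration.

```python
from collections import defaultdict

def names_with_duplicates(contacts):
    """Mimic: select name from contacts group by name having count(*) > 1."""
    # 1. FROM: start with every row of the table.
    rows = list(contacts)

    # 2. GROUP BY name: rows are put into buckets, not thrown away,
    #    so each group still holds all of its member rows.
    groups = defaultdict(list)
    for row in rows:
        groups[row["name"]].append(row)

    # 3. HAVING count(*) > 1: evaluated once per group, while the
    #    member rows are still available to be counted.
    kept = {name: members for name, members in groups.items() if len(members) > 1}

    # 4. SELECT name: only now is each surviving group reduced to one output row.
    return sorted(kept)

CONTACTS = [
    {"name": "Ann", "phone": "111"},
    {"name": "Bob", "phone": "222"},
    {"name": "Ann", "phone": "333"},
    {"name": "Cid", "phone": "444"},
    {"name": "Bob", "phone": "555"},
]
repeated = names_with_duplicates(CONTACTS)
print(repeated)
```

The point the sketch tries to make is that grouping does not collapse rows immediately: the collapse to one row per group happens at the SELECT step, after HAVING has already counted the members.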
I have a table with ID, matchid, point1 and point2. I need to get the ID that has the maximum points, but the maximum has to be based on the sum of both columns (point1+point2). I have no idea how to get the max of a combination of two columns. I have tried queries such as:
SELECT MAX(column1+column2) FROM table
MAX(SUM(column1,column2)) FROM table
but nothing works. I am using MS Access.
This will return more than one answer if more than one sum=max:
SELECT ID FROM Table1
WHERE ([Field1]+[Field2])=(
SELECT Max([Field1]+[Field2]) AS Expr1
FROM Table1)
You can use a subquery e.g.
select id from table where point1+point2 = (select max(point1+point2) from table)
Note that this will return multiple rows if more than one record has the same maximum points.
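A quick way to check the subquery approach, including the tie behaviour mentioned above, is to run it on a few rows. The sketch below uses Python's built-in sqlite3 module rather than MS Access, and the table name scores and function name ids_with_max_total are assumptions for the example; Access syntax may differ in details such as bracketed field names.

```python
import sqlite3

def ids_with_max_total(rows):
    """Return the ids whose point1 + point2 equals the maximum total."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table name; columns follow the question.
    con.execute(
        "CREATE TABLE scores (id INTEGER, matchid INTEGER, point1 INTEGER, point2 INTEGER)"
    )
    con.executemany("INSERT INTO scores VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT id
        FROM scores
        WHERE point1 + point2 = (SELECT MAX(point1 + point2) FROM scores)
        ORDER BY id
    """
    result = [row[0] for row in con.execute(query)]
    con.close()
    return result

ROWS = [
    (1, 10, 3, 4),   # total 7
    (2, 10, 5, 5),   # total 10
    (3, 11, 6, 4),   # total 10, ties with id 2
    (4, 11, 1, 2),   # total 3
]
winners = ids_with_max_total(ROWS)
print(winners)
```

Both id 2 and id 3 come back because they share the highest total, which matches the note that ties produce more than one row.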
I'm using SQL Server 2005 and ASP.NET with C#.
I have Users table with
userId(int),
userGender(tinyint),
userAge(tinyint),
userCity(tinyint)
(simplified version of course)
I always need to select two users that fit the userId I pass to the query: users of the opposite gender, within an age range of -5 to +10 years, and from the same city.
The important point is that it must always be two, so I added a condition: if @@ROWCOUNT < 2, re-select without the age and city filters.
The problem is that when I run the query I sometimes get two result sets back, because I have to run the first SELECT against the table before I can check @@ROWCOUNT.
Will it be a problem to use the DataReader object to always read from the second result set? Is there any other way to check how many rows would be selected without performing a SELECT that returns results?
Can you simplify it by using SELECT TOP 2 ?
Update: I would perform both selects every time, union the results, and then select from them in a fixed order (using SELECT TOP 2), since the union may contain more than two rows. It's important that this final select picks rows in order of importance, i.e. it prefers rows from your first select.
Alternatively, have the reader logic read the next result-set if there is one and leave the SQL alone.
To avoid getting two separate result sets, you can do your first SELECT into a table variable and then do your @@ROWCOUNT check. If it is >= 2, just select from the table variable on its own; otherwise select the contents of the table variable UNION ALLed with the results of the second query.
Edit: There is a slight overhead to using table variables, so you'd need to weigh whether this is cheaper than Adam's suggestion of simply performing the 'UNION' as a matter of routine, by looking at the execution stats for both approaches:
SET STATISTICS IO ON
Would something along the following lines be of use...
SELECT *
FROM (SELECT 1 AS prio, *
FROM my_table M1 JOIN my_table M2
WHERE M1.userID = supplied_user_id AND
M1.userGender <> M2.userGender AND
M1.userAge - 5 <= M2.userAge AND
M1.userAge + 10 >= M2.userAge AND
M1.userCity = M2.userCity
LIMIT TO 2 ROWS
UNION
SELECT 2 AS prio, *
FROM my_table M1 JOIN my_table M2
WHERE M1.userID = supplied_user_id AND
M1.userGender <> M2.userGender
LIMIT TO 2 ROWS)
ORDER BY prio
LIMIT TO 2 ROWS;
I haven't tried it as I have no SQL Server and there may be dialect issues.
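The priority-and-union idea can be tried out without SQL Server. The sketch below is a loose adaptation that runs on an in-memory SQLite database from Python, so it uses LIMIT rather than TOP; the table name users, the function name pick_two_matches, and the GROUP BY used to stop the same user appearing under both priorities are assumptions added for this example rather than part of the answers above.

```python
import sqlite3

def pick_two_matches(users, user_id):
    """Prefer strict matches (prio 1); fall back to gender-only matches (prio 2)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE users (userId INTEGER, userGender INTEGER, "
        "userAge INTEGER, userCity INTEGER)"
    )
    con.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", users)
    query = """
        SELECT userId FROM (
            -- Strict candidates: opposite gender, age -5..+10, same city.
            SELECT 1 AS prio, m2.userId
            FROM users m1 JOIN users m2 ON m1.userGender <> m2.userGender
            WHERE m1.userId = :uid
              AND m2.userAge BETWEEN m1.userAge - 5 AND m1.userAge + 10
              AND m2.userCity = m1.userCity
            UNION ALL
            -- Fallback candidates: opposite gender only.
            SELECT 2 AS prio, m2.userId
            FROM users m1 JOIN users m2 ON m1.userGender <> m2.userGender
            WHERE m1.userId = :uid
        )
        -- A strict match also appears in the fallback set, so collapse
        -- each user to one row and keep its best (lowest) priority.
        GROUP BY userId
        ORDER BY MIN(prio), userId
        LIMIT 2
    """
    result = [row[0] for row in con.execute(query, {"uid": user_id})]
    con.close()
    return result

USERS = [
    (1, 0, 30, 1),  # the user we are matching for
    (2, 1, 28, 1),  # strict match
    (3, 1, 50, 1),  # opposite gender, same city, age out of range
    (4, 1, 29, 2),  # opposite gender, age in range, other city
    (5, 0, 30, 1),  # same gender, never a candidate
]
matches = pick_two_matches(USERS, 1)
print(matches)
```

With this data only user 2 passes the strict filters, so the second slot is filled from the fallback set, and a single result set is returned either way.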