HIVE - Getting ALL columns of the table with COUNT(*) with DISTINCT values

I have the table below, called Current_Table.
I want output in which:
The column personalemailtrim is DISTINCT
The column Occurrences is greater than 1
The rows are ordered by the column personalemailtrim
The query I have built so far is wrong on many levels: GROUP BY cannot be combined with DISTINCT, and using COUNT(*) with GROUP BY does not give me any results.
SELECT id,
personalemailtrim,
personworksatnumberofbsbs,
region,
district,
branch,
num,
countofapptsatbsb,
COUNT(personalemailtrim) occurrences
FROM Current_table
GROUP BY id,
personalemailtrim,
personworksatnumberofbsbs,
region,
district,
branch,
num,
countofapptsatbsb
HAVING COUNT(*) > 1
ORDER BY personalemailtrim
Any help is really appreciated. I have tried breaking the query down in several ways, but I am stuck.
To elaborate further, the expected output should look like the table below.
As you can see,
Occurrences are > 1
personalemailtrim is now DISTINCT

I think you want:
select t.*
from (select t.*,
row_number() over (partition by personalemailtrim order by id) as seqnum
from Current_table t
) t
where seqnum = 1 and occurrences > 1;
This assumes that occurrences is the same for each personalemailtrim, which is consistent with your data and with your question.
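If occurrences is not already stored as a column, it can be computed in the same subquery with COUNT(*) OVER. The sketch below is only an illustration: it runs the query through Python's sqlite3 module as a stand-in for Hive, and the three-column table and its sample rows are made up for the example.

```python
import sqlite3

# Hypothetical sample data; the real Current_table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE current_table (id INTEGER, personalemailtrim TEXT, branch TEXT)"
)
conn.executemany(
    "INSERT INTO current_table VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "North"),
        (2, "a@example.com", "South"),
        (3, "b@example.com", "East"),
        (4, "c@example.com", "West"),
        (5, "c@example.com", "West"),
        (6, "c@example.com", "North"),
    ],
)

# occurrences counts the rows per email; seqnum = 1 keeps one row per email.
query = """
SELECT id, personalemailtrim, branch, occurrences
FROM (SELECT t.*,
             COUNT(*) OVER (PARTITION BY personalemailtrim) AS occurrences,
             ROW_NUMBER() OVER (PARTITION BY personalemailtrim ORDER BY id) AS seqnum
      FROM current_table t
     ) t
WHERE seqnum = 1 AND occurrences > 1
ORDER BY personalemailtrim
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each email that appears more than once comes back exactly once, represented by its lowest id.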

Related

How do we find frequency of one column based off two other columns in SQL?

I'm relatively new to working with SQL and wasn't able to find any past threads to solve my question. I have three columns in a table, columns being name, customer, and location. I'd like to add an additional column determining which location is most frequent, based off name and customer (first two columns).
I have included a photo of an example where name-Jane customer-BEC in my created column would be "Texas" as that has 2 occurrences as opposed to one for California. Would there be anyway to implement this?

If you want 'Texas' on all four rows:
select t.Name, t.Customer, t.Location,
(select t2.location
from table1 t2
where t2.name = t.name
group by name, location
order by count(*) desc
fetch first 1 row only
) as most_frequent_location
from table1 t ;
You can also do this with analytic functions:
select t.Name, t.Customer, t.Location,
max(location) keep (dense_rank first order by location_count desc) over (partition by name) most_frequent_location
from (select t.*,
count(*) over (partition by name, customer, location) as location_count
from table1 t
) t;
Here is a db<>fiddle.
Both of these versions put 'Texas' in all four rows. However, each can be tweaked with minimal effort to put 'California' in the row for ARC.

In Oracle, you can use the aggregate function stats_mode() to compute the most frequently occurring value in a group.
Unfortunately it is not implemented as a window function. So one option uses an aggregate subquery, and then a join with the original table:
select t.*, s.top_location
from mytable t
inner join (
select name, customer, stats_mode(location) top_location
from mytable
group by name, customer
) s on s.name = t.name and s.customer = t.customer
You could also use a correlated subquery:
select
t.*,
(
select stats_mode(t1.location)
from mytable t1
where t1.name = t.name and t1.customer = t.customer
) top_location
from mytable t

This is more a question about understanding the concepts of a relational database. If you want that information, you would not put it in an additional column. It is data calculated from multiple columns, so why would you store it in the table itself? It is complex to code and would also be very expensive for the database (imagine all the rows whose value has to be recalculated if someone inserts a million rows).
Instead, you can do one of the following:
Calculate it at runtime, as shown in the other answers.
If you want to make it more persistent, you could embed that query in a view.
If you want to physically store the information, you could use a materialized view.
There is plenty of documentation on these three options in the official Oracle documentation.

Your first step is to construct a query that counts how often each location occurs for each name and customer, which is as simple as:
select Name, Customer, Location, count(*)
from table1
group by Name, Customer, Location
This isn't immediately useful, but the same logic can be used inside row_number(), which numbers the rows within each partition. In the query below, I'm ordering by count(*) in descending order so that the most frequent location gets the value 1.
Note that row_number() assigns 1 to only one row per partition, so ties are broken arbitrarily.
So, now we have:
select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1 tb_
group by Name, Customer, Location
The final step puts it all together:
select tab.*, tb_.Location most_freq_location
from table1 tab
inner join
(select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1
group by Name, Customer, Location) tb_
on tb_.Name = tab.Name
and tb_.Customer = tab.Customer
and freq_name_cust = 1
You can see how it all works in this Fiddle where I deliberately inserted rows with the same frequency for California and Texas for one of the customers for illustration purposes.
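The same three steps can be tried locally. The sketch below is an assumption-laden illustration rather than the answer's exact query: it uses Python's sqlite3 module in place of Oracle, computes the counts in an inner query before ranking them, adds location as a tie-breaker so the result is deterministic, and uses made-up sample rows modeled on the Jane/BEC example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, customer TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "California"),
        ("Jane", "ARC", "California"),
    ],
)

# Step 1 (innermost): count rows per name/customer/location.
# Step 2: rank the locations within each name/customer by that count.
# Step 3: join the top-ranked location back onto every original row.
query = """
SELECT tab.name, tab.customer, tab.location, freq.location AS most_freq_location
FROM table1 tab
JOIN (SELECT name, customer, location,
             ROW_NUMBER() OVER (PARTITION BY name, customer
                                ORDER BY cnt DESC, location) AS freq_rank
      FROM (SELECT name, customer, location, COUNT(*) AS cnt
            FROM table1
            GROUP BY name, customer, location) counts
     ) freq
  ON freq.name = tab.name
 AND freq.customer = tab.customer
 AND freq.freq_rank = 1
ORDER BY tab.customer, tab.location
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the partition is by name and customer, the ARC row gets 'California' while the three BEC rows get 'Texas'.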

How to limit duplicated rows

I would like help regarding an SQL query.
Looking around the site, I found several code snippets to return duplicate rows.
Here is the one I went with:
select unumber, name, localid
from table1
where unumber
in (select unumber from table1 group by unumber having count (*) > 1 )
order by unumber
which works fine, however, in the table I have other columns as well, like timestamp etc.
As such, when I run the query I indeed get the duplicate rows, however, I get the duplicates several times due to different timestamps for example.
Is there any way to limit the results to 'unique' duplicate rows only?
Hope this makes sense!
Thank you in advance!

For what you describe, you can just use select distinct:
select distinct unumber, name, localid
from table1
where unumber in (select unumber from table1 group by unumber having count (*) > 1 )
order by unumber;
However, I would be more likely to write this using window functions:
select unumber, name, localid
from (select t1.*,
count(*) over (partition by unumber) as cnt,
row_number() over (partition by unumber, name, localid order by unumber) as seqnum
from table1 t1
) t1
where cnt > 1 and seqnum = 1;
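To check that the two queries agree, here is a small sketch that runs both through Python's sqlite3 module. The database engine, the ts column and the sample rows are assumptions made for the illustration; the question does not say which database is in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (unumber INTEGER, name TEXT, localid TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (100, "Ann", "L1", "2020-01-01"),
        (100, "Ann", "L1", "2020-01-02"),  # same person, different timestamp
        (100, "Bob", "L2", "2020-01-03"),
        (200, "Cat", "L3", "2020-01-04"),  # unumber 200 is not duplicated
    ],
)

distinct_query = """
SELECT DISTINCT unumber, name, localid
FROM table1
WHERE unumber IN (SELECT unumber FROM table1 GROUP BY unumber HAVING COUNT(*) > 1)
ORDER BY unumber, name
"""

window_query = """
SELECT unumber, name, localid
FROM (SELECT t1.*,
             COUNT(*) OVER (PARTITION BY unumber) AS cnt,
             ROW_NUMBER() OVER (PARTITION BY unumber, name, localid
                                ORDER BY unumber) AS seqnum
      FROM table1 t1
     ) t1
WHERE cnt > 1 AND seqnum = 1
ORDER BY unumber, name
"""

distinct_rows = conn.execute(distinct_query).fetchall()
window_rows = conn.execute(window_query).fetchall()
print(distinct_rows)
print(window_rows)
```

Both versions collapse the two 'Ann' rows into one and drop unumber 200, which occurs only once.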

How to work with problems correlated subqueries that reference other tables, without using Join

I am working with the BigQuery public dataset bigquery-public-data.austin_crime.crime. My goal is to get three output columns:
the description (of the crime), the count of crimes with that description, and the top district for that particular description.
I am able to get the first two columns with this query.
select
a.description,
count(*) as district_count
from `bigquery-public-data.austin_crime.crime` a
group by description order by district_count desc
I was hoping to do everything in one query, so I tried adding the code below to get the third column, which shows the top district for that particular description (crime):
select
a.description,
count(*) as district_count,
(
select district from
( select
district, rank() over(order by COUNT(*) desc) as rank
FROM `bigquery-public-data.austin_crime.crime`
where description = a.description
group by district
) where rank = 1
) as top_District
from `bigquery-public-data.austin_crime.crime` a
group by description
order by district_count desc
The error I am getting is: "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."
I think I can do this with joins. Does anyone have a better solution, possibly one that does not use a join?

Below is for BigQuery Standard SQL
#standardSQL
SELECT description,
ANY_VALUE(district_count) AS district_count,
STRING_AGG(district ORDER BY cnt DESC LIMIT 1) AS top_district
FROM (
SELECT description, district,
COUNT(1) OVER(PARTITION BY description) AS district_count,
COUNT(1) OVER(PARTITION BY description, district) AS cnt
FROM `bigquery-public-data.austin_crime.crime`
)
GROUP BY description
-- ORDER BY district_count DESC

How to do a Postgresql group aggregation: 2 fields using one to select the other

I have a table - Data - of rows, simplified, like so:
Name,Amount,Last,Date
A,16,31,1-Jan-2014
A,27,38,1-Feb-2014
A,12,34,1-Mar-2014
B,8,37,1-Jan-2014
B,3,38,1-Feb-2014
B,17,39,1-Mar-2014
I wish to group them similar to:
select Name,sum(Amount),aggr(Last),max(Date) from Data group by Name
For aggr(Last) I want the value of 'Last' from the row that contains max(Date)
So the result I want would be 2 rows
Name,Amount,Last,Date
A,55,34,1-Mar-2014
B,28,39,1-Mar-2014
i.e. in both cases, the value of Last is the one from the row that contained 1-Mar-2014
The query I'm actually running is basically the same, but with many more sum() fields and millions of rows, so I'm guessing an aggregate function could avoid multiple extra requests for each group of incoming rows.

Instead, use row_number() and conditional aggregation:
select Name, sum(Amount),
max(case when seqnum = 1 then Last end) as Last,
max(date)
from (select d.*, row_number() over (partition by name order by date desc) as seqnum
from data d
) d
group by Name;
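A quick way to see this produce the two expected rows is to run it against the sample data. The sketch below uses Python's sqlite3 module in place of PostgreSQL, stores the dates in ISO format so that they sort correctly as text, and renames the Last and Date columns to last_val and obs_date. All three choices are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, amount INTEGER, last_val INTEGER, obs_date TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("A", 16, 31, "2014-01-01"),
        ("A", 27, 38, "2014-02-01"),
        ("A", 12, 34, "2014-03-01"),
        ("B", 8, 37, "2014-01-01"),
        ("B", 3, 38, "2014-02-01"),
        ("B", 17, 39, "2014-03-01"),
    ],
)

# seqnum = 1 marks the latest row per name; the CASE picks last_val from it.
query = """
SELECT name,
       SUM(amount) AS total_amount,
       MAX(CASE WHEN seqnum = 1 THEN last_val END) AS last_val,
       MAX(obs_date) AS max_date
FROM (SELECT d.*,
             ROW_NUMBER() OVER (PARTITION BY name ORDER BY obs_date DESC) AS seqnum
      FROM data d
     ) d
GROUP BY name
ORDER BY name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The table is read once; the CASE expression returns NULL for every row except the latest one in each group, so MAX simply passes that row's value through.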

How do I use ROW_NUMBER()?

I want to use ROW_NUMBER() to do two things:
1. Get max(ROW_NUMBER()), which I guess would also be the count of all rows.
I tried doing:
SELECT max(ROW_NUMBER() OVER(ORDER BY UserId)) FROM Users
but it didn't seem to work...
2. Get ROW_NUMBER() using a given piece of information, i.e. if I have a name and I want to know what row the name came from.
I assume it would be something similar to what I tried for #1:
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
but this didn't work either...
Any ideas?

For the first question, why not just use?
SELECT COUNT(*) FROM myTable
to get the count.
And for the second question, the primary key of the row is what should be used to identify a particular row. Don't try and use the row number for that.
If you returned Row_Number() in your main query,
SELECT ROW_NUMBER() OVER (Order by Id) AS RowNumber, Field1, Field2, Field3
FROM User
Then, when you want to go 5 rows back, you can take the current row number and use the following query to find the row at CurrentRow - 5:
SELECT us.Id
FROM (SELECT ROW_NUMBER() OVER (ORDER BY id) AS Row, Id
FROM User ) us
WHERE Row = CurrentRow - 5

Though I agree with others that you could use count() to get the total number of rows, here is how you can use row_number():
To get the total number of rows:
with temp as (
select row_number() over (order by id) as rownum
from table_name
)
select max(rownum) from temp
To get the row numbers where name is Matt:
with temp as (
select name, row_number() over (order by id) as rownum
from table_name
)
select rownum from temp where name like 'Matt'
You can further use min(rownum) or max(rownum) to get the first or last row for Matt respectively.
These were very simple implementations of row_number(). You can use it for more complex grouping. Check out my response on Advanced grouping without using a sub query

If you need to return the table's total row count, there is an alternative to the SELECT COUNT(*) statement.
Because SELECT COUNT(*) makes a full table scan to return the row count, it can take a very long time for a large table. In this case you can use the sysindexes system table instead. It has a ROWS column that contains the total row count for each table in your database. You can use the following select statement:
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
This can drastically reduce the time your query takes.

You can use this to get the first record that matches the WHERE clause:
SELECT TOP(1) * , ROW_NUMBER() OVER(ORDER BY UserId) AS rownum
FROM Users
WHERE UserName = 'Joe'
ORDER BY rownum ASC

ROW_NUMBER() returns a sequential number for each row, starting with 1. You can use it by writing:
ROW_NUMBER() OVER (ORDER BY Column_Name DESC) as ROW_NUMBER

This may not be directly related to the question, but I have found it useful when using ROW_NUMBER without a meaningful sort order:
SELECT *,
ROW_NUMBER() OVER (ORDER BY (SELECT 100)) AS Any_ID
FROM #Any_Table

select
Ml.Hid,
ml.blockid,
row_number() over (partition by ml.blockid order by Ml.Hid desc) as rownumber,
H.HNAME
from MIT_LeadBechmarkHamletwise ML
join [MT.HAMLE] h on ML.Hid=h.HID

SELECT num, UserName FROM
(SELECT UserName, ROW_NUMBER() OVER(ORDER BY UserId) AS num
From Users) AS numbered
WHERE UserName='Joe'

You can use ROW_NUMBER to limit a query result, for example to page through it.
Example:
SELECT * FROM (
select row_number() OVER (order by createtime desc) AS ROWINDEX,*
from TABLENAME ) TB
WHERE TB.ROWINDEX between 1 and 10
With the above query, I get page 1 of the results from TABLENAME.
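The paging idea generalizes to any page number. The sketch below is a hypothetical helper, not part of the answer: fetch_page, the items table and its integer createtime column are all made up for the illustration, and it runs on Python's sqlite3 module rather than SQL Server.

```python
import sqlite3


def fetch_page(conn, page, page_size):
    """Return the ids on the given 1-based page, newest first."""
    first = (page - 1) * page_size + 1  # row numbers start at 1, not 0
    last = page * page_size
    query = """
        SELECT id
        FROM (SELECT ROW_NUMBER() OVER (ORDER BY createtime DESC) AS rowindex, id
              FROM items) tb
        WHERE tb.rowindex BETWEEN ? AND ?
        ORDER BY tb.rowindex
    """
    return [row[0] for row in conn.execute(query, (first, last))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, createtime INTEGER)")
# 25 hypothetical rows; a larger createtime means a newer row.
conn.executemany("INSERT INTO items VALUES (?, ?)", [(i, i) for i in range(1, 26)])

page1 = fetch_page(conn, 1, 10)
page3 = fetch_page(conn, 3, 10)
print(page1)
print(page3)
```

The last page is simply shorter when the row count is not a multiple of the page size.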

If you absolutely want to use ROW_NUMBER for this (instead of count(*)) you can always use:
SELECT TOP 1 ROW_NUMBER() OVER (ORDER BY Id)
FROM USERS
ORDER BY ROW_NUMBER() OVER (ORDER BY Id) DESC

You need to create a virtual table with WITH ... AS (a common table expression), as shown in the query below.
Using this virtual table, you can perform CRUD operations with respect to row_number.
Query:
WITH numbered_users AS
(SELECT row_number() OVER(ORDER BY UserId) rn, * FROM Users)
SELECT * FROM numbered_users WHERE UserName='Joe'
You can use INSERT, UPDATE or DELETE in the last statement instead of SELECT.

The SQL Row_Number() function sorts the rows of a record set and assigns each one a sequence number. It is used to number rows, for example to identify the top 10 rows with the highest order amount, or to identify each customer's order with the highest amount.
If you want to sort the dataset and number the rows separately within categories, use Row_Number() with a Partition By clause. For example, to sort each customer's orders among themselves when the dataset contains all orders:
SELECT
SalesOrderNumber,
CustomerId,
SubTotal,
ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SubTotal DESC) rn
FROM Sales.SalesOrderHeader
But as I understand it, you want to calculate the number of rows grouped by a column. To illustrate, if you want to see the count of all orders for the related customer as a separate column beside the order info, you can use the COUNT() aggregate function with a Partition By clause.
For example,
SELECT
SalesOrderNumber,
CustomerId,
COUNT(*) OVER (PARTITION BY CustomerId) CustomerOrderCount
FROM Sales.SalesOrderHeader

This query:
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
will return one row for each row where UserName is 'Joe', and no rows if there is no UserName='Joe'.
The row_number field is assigned in order of UserId, starting at 1 and incrementing for each row that contains UserName='Joe'.
If it does not work for you, then either your WHERE clause has an issue or there is no UserId column in the table. Check the spelling of both fields, UserId and UserName.
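The numbering restarts at 1 because the WHERE clause is applied before the window function is evaluated. To get Joe's position among all users instead, number the rows in a subquery and filter afterwards. The sketch below contrasts the two; it runs on Python's sqlite3 module, and the four sample users are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userid INTEGER, username TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Ann"), (2, "Joe"), (3, "Bob"), (4, "Joe")],
)

# Filter first, then number: only the 'Joe' rows are numbered.
filter_then_number = conn.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY userid) AS num, userid
    FROM users
    WHERE username = 'Joe'
    ORDER BY userid
""").fetchall()

# Number first, then filter: 'Joe' keeps his position among all users.
number_then_filter = conn.execute("""
    SELECT num, userid
    FROM (SELECT userid, username,
                 ROW_NUMBER() OVER (ORDER BY userid) AS num
          FROM users) numbered
    WHERE username = 'Joe'
    ORDER BY num
""").fetchall()

print(filter_then_number)
print(number_then_filter)
```

The first query numbers the two 'Joe' rows 1 and 2; the second reports their positions in the full table, 2 and 4.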
