Hive sql: count and avg - sql

I've recently started learning Hive and I have a problem with a SQL query.
I have a JSON file with some information, and I want to get each country's share of the total number of records. An example makes it clearer:
country times
USA 1
USA 1
USA 1
ES 1
ES 1
ENG 1
FR 1
Then, with the following query:
select country, count(*) from data group by country;
I obtain:
country times
USA 3
ES 2
ENG 1
FR 1
The output I want is:
country avg
USA 0,42 (3/7)
ES 0,28 (2/7)
ENG 0,14 (1/7)
FR 0,14 (1/7)
I don't know how to obtain this output from the first table.
I tried:
select t1.country, avg(t1.tm)
from (
select country, count(*) as tm from data where country is not null group by country
) t1
group by t1.country;
but the output is wrong.
Thanks for your help! BR.

Divide each group's count by the total count to get the result. Use a subquery to find the total number of records in your table.
Try this:
select country, count(*) / (select cast(count(*) as float) from data)
from data
group by country;
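The divide-by-total idea can be checked on the question's sample data. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption, since Hive's subquery support varies by version), and the column alias `share` is a name made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table `data`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (country TEXT)")
sample = ["USA"] * 3 + ["ES"] * 2 + ["ENG", "FR"]
conn.executemany("INSERT INTO data (country) VALUES (?)", [(c,) for c in sample])

# Each group's count divided by the total row count.
# Multiplying by 1.0 forces floating-point division.
query = """
SELECT country,
       COUNT(*) * 1.0 / (SELECT COUNT(*) FROM data) AS share
FROM data
GROUP BY country
ORDER BY share DESC, country
"""
shares = dict(conn.execute(query).fetchall())

for country, share in shares.items():
    print(f"{country}: {share:.2f}")
```

On engines that support window functions, `count(*) / sum(count(*)) over ()` is a common alternative that avoids the scalar subquery.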

Related

Finding Percentage of Unmatching Records

There are two tables: hometown (each resident's hometown state) and residence (the places each resident has lived in over the past 10 years). I want to find the percentage of residents who have lived, or are living, outside their hometown state. A resident can live in multiple places, and state_of_residence can be duplicated; as long as there is a record showing that he/she lived in a state other than his/her hometown, the resident should be counted.
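hometown:
resident_id | hometown_state
------------+---------------
1           | ny
2           | ma
3           | ct
4           | pa
5           | vt
residence:
resident_id | state_of_residence
------------+-------------------
1           | ny
1           | ct
1           | ny
2           | ma
3           | ca
4           | wa
4           | tx
5           | vt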
The query should return 60%, since residents 1, 3, and 4 each have one or more states of residence other than their hometown. My current query doesn't deduplicate the states of residence, and putting DISTINCT inside a CASE expression returns a syntax error. Much appreciated!
SELECT ROUND((SUM(CASE WHEN r.state_of_residence != h.hometown_state
THEN 1 ELSE 0 END)/COUNT(DISTINCT h.resident_id))*100,10)
FROM hometown h INNER JOIN residence r
ON h.resident_id=r.resident_id;
You can use COUNT with DISTINCT around a CASE expression (conditional aggregation) instead of SUM, so that each resident is counted at most once:
SELECT COUNT(DISTINCT CASE WHEN r.state_of_residence <> h.hometown_state THEN h.resident_id END) * 100.0
/ COUNT(DISTINCT h.resident_id)
FROM hometown h
INNER JOIN residence r
ON h.resident_id = r.resident_id;
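To see the distinct conditional count at work on the question's data, here is a small sketch using Python's sqlite3 module (an assumption; the question does not name its database). It counts distinct resident IDs that have at least one non-hometown row and applies no GROUP BY, so a single percentage comes back. The alias `pct_moved` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hometown (resident_id INTEGER, hometown_state TEXT);
CREATE TABLE residence (resident_id INTEGER, state_of_residence TEXT);
INSERT INTO hometown VALUES (1,'ny'),(2,'ma'),(3,'ct'),(4,'pa'),(5,'vt');
INSERT INTO residence VALUES
  (1,'ny'),(1,'ct'),(1,'ny'),(2,'ma'),(3,'ca'),(4,'wa'),(4,'tx'),(5,'vt');
""")

# CASE yields the resident_id only for rows outside the hometown (NULL otherwise),
# and COUNT(DISTINCT ...) ignores NULLs, so each resident is counted at most once.
query = """
SELECT 100.0 * COUNT(DISTINCT CASE WHEN r.state_of_residence <> h.hometown_state
                                   THEN h.resident_id END)
       / COUNT(DISTINCT h.resident_id) AS pct_moved
FROM hometown h
INNER JOIN residence r ON h.resident_id = r.resident_id
"""
(pct_moved,) = conn.execute(query).fetchone()
print(pct_moved)
```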

Count results that have different column value related to same ID

I'm new to SQL and looking for help on how to best do this.
I have 2 tables with the following columns:
Investors: Round ID, Investor Name, Investor City, Investor Country
Rounds: Round ID, Company Name, Company City, Company Country
I joined them to get this result
Round ID | Investor Country | Company Country
---------+------------------+----------------
1        | US               | Spain
1        | UK               | Spain
1        | Spain            | Spain
2        | France           | Germany
2        | UK               | Germany
3        | UK               | Italy
3        | Italy            | Italy
I need to get the number of investors (per Round ID) whose country differs from the Company Country. So for Round 1 I would get 2, for Round 2 it's 0, and for Round 3 it's 1.
How could I do this?
Thank you for your help!
Just use conditional aggregation:
select round_id,
sum(case when investor_country <> company_country then 1 else 0 end) as cnt
from t
group by round_id;
Looking at your expected output, I think you want a count of 0 when a round has no record where the investor country equals the company country; otherwise you want the count of the records that differ.
You can use a condition as follows:
select round_id,
case when count(case when investor_country = company_country then 1 end) = 0
then 0
else count(case when investor_country <> company_country then 1 end)
end as cnt
from your_table t
group by round_id;
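The two readings of the question give different results for Round 2, which a quick run on the sample rows makes visible. This is a sketch using Python's sqlite3 module; the table name `investments` and the snake_case column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investments (round_id INTEGER, investor_country TEXT, company_country TEXT)"
)
conn.executemany(
    "INSERT INTO investments VALUES (?, ?, ?)",
    [
        (1, "US", "Spain"), (1, "UK", "Spain"), (1, "Spain", "Spain"),
        (2, "France", "Germany"), (2, "UK", "Germany"),
        (3, "UK", "Italy"), (3, "Italy", "Italy"),
    ],
)

# Reading 1: plain conditional aggregation, counting every differing investor.
plain = dict(conn.execute("""
SELECT round_id,
       SUM(CASE WHEN investor_country <> company_country THEN 1 ELSE 0 END) AS cnt
FROM investments
GROUP BY round_id
""").fetchall())

# Reading 2: return 0 for rounds with no investor from the company's country.
adjusted = dict(conn.execute("""
SELECT round_id,
       CASE WHEN COUNT(CASE WHEN investor_country = company_country THEN 1 END) = 0
            THEN 0
            ELSE COUNT(CASE WHEN investor_country <> company_country THEN 1 END)
       END AS cnt
FROM investments
GROUP BY round_id
""").fetchall())

print(plain)
print(adjusted)
```

Only the second reading reproduces the 2, 0, 1 that the question expects.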
If you need the count of differing rows:
SELECT
RoundId,
SUM(IIF(InvestorCountry != CompanyCountry,1,0)) AS Count
FROM
YOUR_TABLE_OR_VIEW
GROUP BY
RoundId
If you need the count of differing rows, but want zero when every row of the same round differs:
SELECT
t.RoundId,
IIF(t.TotalCount = t.DifferentCount, 0, t.DifferentCount) AS [Count]
FROM
(
SELECT
RoundId,
SUM(1) AS TotalCount,
SUM(IIF(InvestorCountry != CompanyCountry,1,0)) AS DifferentCount
FROM
YOUR_TABLE_OR_VIEW
GROUP BY
RoundId
) t

Retrieving most frequent value for each group in SQL Server

This is what I have:
AirlineName       | Departure_City | No_of_DepartureCity | Arrival_City | No_of_ArrivalCity
------------------+----------------+---------------------+--------------+------------------
Air Asia          | MY             | 2                   | JPN          | 2
Emirates Airlines | MY             | 2                   | JPN          | 2
Malaysia Airlines | MY             | 2                   | GER          | 2
Malaysia Airlines | MY             | 1                   | JPN          | 1
Air Asia          | MY             | 1                   | KOR          | 1
This is what I want:
AirlineName       | Departure_City | No_of_DepartureCity | Arrival_City | No_of_ArrivalCity
------------------+----------------+---------------------+--------------+------------------
Air Asia          | MY             | 2                   | JPN          | 2
Emirates Airlines | MY             | 2                   | JPN          | 2
Malaysia Airlines | MY             | 2                   | GER          | 2
I have already written a query to retrieve the most frequent data for Departure_City and Arrival_City, but I can't group the results so that only the most frequent row is shown for each AirlineName.
This is my query so far:
SELECT Airline.AirlineName, Flight_Schedule.Departure_City, COUNT(Flight_Schedule.Departure_City) AS No_of_DepartureCity, Flight_Schedule.Arrival_City, COUNT(Flight_Schedule.Arrival_City) AS No_of_ArrivalCity
FROM Airline
LEFT JOIN Aircraft ON Airline.AirlineID = Aircraft.AirlineID
LEFT JOIN Flight_Schedule ON Aircraft.AircraftID = Flight_Schedule.AircraftID
GROUP BY Airline.AirlineName, Flight_Schedule.Departure_City, Flight_Schedule.Arrival_City
ORDER BY COUNT(Flight_Schedule.Departure_City)DESC , COUNT(Flight_Schedule.Arrival_City) DESC
You can use RANK or DENSE_RANK, which keep every row tied for the highest count within an airline; use ROW_NUMBER instead if you want exactly one row per airline.
with CTE1 AS(
SELECT A.*,
RANK() OVER(PARTITION BY AirlineName ORDER BY No_of_ArrivalCity desc) as rn
FROM TABLE1 A)
SELECT * FROM CTE1 where rn = 1;
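As a rough check of the ranking approach, the sketch below loads the question's five result rows into a table and keeps the top-ranked row per airline. It uses Python's sqlite3 module rather than SQL Server, and assumes an SQLite build with window functions (3.25 or newer); the table name `route_counts` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE route_counts (
    AirlineName TEXT, Departure_City TEXT, No_of_DepartureCity INTEGER,
    Arrival_City TEXT, No_of_ArrivalCity INTEGER
)
""")
conn.executemany(
    "INSERT INTO route_counts VALUES (?, ?, ?, ?, ?)",
    [
        ("Air Asia", "MY", 2, "JPN", 2),
        ("Emirates Airlines", "MY", 2, "JPN", 2),
        ("Malaysia Airlines", "MY", 2, "GER", 2),
        ("Malaysia Airlines", "MY", 1, "JPN", 1),
        ("Air Asia", "MY", 1, "KOR", 1),
    ],
)

# RANK() restarts at 1 for each airline; rn = 1 keeps the highest count
# (and any rows tied with it).
query = """
WITH ranked AS (
    SELECT r.*,
           RANK() OVER (PARTITION BY AirlineName
                        ORDER BY No_of_ArrivalCity DESC) AS rn
    FROM route_counts r
)
SELECT AirlineName, Departure_City, Arrival_City, No_of_ArrivalCity
FROM ranked
WHERE rn = 1
ORDER BY AirlineName
"""
top_routes = conn.execute(query).fetchall()
for row in top_routes:
    print(row)
```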
Because you're grouping by several columns instead of just AirlineName, you get one row for every distinct combination of values across those columns.
To return each AirlineName and its frequency, try this:
SELECT Airline.AirlineName, COUNT(*) AS [COUNT]
FROM Airline
GROUP BY Airline.AirlineName
ORDER BY COUNT(*) DESC
If you need the additional columns then your code is already correct, because of how you are grouping it and the individual values contained within the columns.

SQLite percentages with small values

So I have this table of subscribers and the country each one is in.
UserID | Name | Country
-------+-------------------+------------
1 | Zaphod Beeblebrox | UK
2 | Arthur Dent | UK
3 | Gene Kelly | USA
4 | Nat King Cole | USA
I need to produce a list showing the percentage of users from each country. I also need all the smaller member countries (under 1%) to be collapsed into an "OTHERS" category.
I can accomplish a simple "top x" of members trivially with a
SELECT COUNTRY, COUNT(*) AS POPULATION FROM SUBSCRIBERS GROUP BY COUNTRY ORDER BY POPULATION DESC LIMIT 10
and can generate the percentages with PHP server-side code, but I don't quite know how to:
1. Do all of it in SQL, including the percentage calculations, directly in the result
2. Club all countries under 1% into a single OTHERS category.
So I need something like this:
Country | Population
--------+-----------
USA | 25.4%
Brazil | 12%
UK | 5%
OTHERS | 65%
Appreciate the help!
Here is a query for this. I used a subquery to count the total number of rows and then used that to get the percentage value for each country. The 'OTHERS' category is generated by a separate query. Rows are sorted by descending population, with the OTHERS row last.
SELECT * FROM
(SELECT country , ROUND((100.0*COUNT(*)/count_all),1) ||'%' AS population
FROM (SELECT count(*) count_all FROM subscribers) AS sq,
subscribers s
WHERE (SELECT 100*count(*)/count_all
FROM subscribers s2
WHERE s2.country = s.country) > 1
GROUP BY country
ORDER BY COUNT(*) DESC) -- sort on the number, since population is a string ending in '%'
UNION ALL
SELECT 'OTHERS', IFNULL(ROUND(100.0*COUNT(*)/count_all,1),0.0) ||'%' AS population
FROM (SELECT count(*) count_all FROM subscribers) AS sq,
subscribers s
WHERE (SELECT 100*count(*)/count_all
FROM subscribers s2
WHERE s2.country = s.country) <= 1
Ok I think I might have found a way to do this that's a hell of a lot quicker on execution speed:
SELECT territory,
Round(Sum(percentage), 3) AS Population
FROM (SELECT
Round((Count(*)*100.0)/(SELECT Count(*) FROM subscribers),3) AS Percentage,
CASE
WHEN ((Count(*)*100.0)/(SELECT Count(*) FROM subscribers)) > 2 THEN
country
ELSE 'Other'
END AS Territory
FROM subscribers
GROUP BY country
ORDER BY percentage DESC)
GROUP BY territory
ORDER BY population DESC;
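The group-then-regroup idea can be tried on a small made-up data set. The sketch below uses Python's sqlite3 module; the 200 subscriber rows are invented for illustration (they are not from the question), and the cut-off is the question's 1% rather than the 2% used in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (UserID INTEGER PRIMARY KEY, Country TEXT)")
# Invented sample: 150 USA, 48 UK, and two countries with one subscriber each.
countries = ["USA"] * 150 + ["UK"] * 48 + ["FR", "DE"]
conn.executemany("INSERT INTO subscribers (Country) VALUES (?)", [(c,) for c in countries])

# Inner query: percentage per country, relabelled 'OTHERS' when under 1%.
# Outer query: sum the percentages per label, listing OTHERS last.
query = """
SELECT territory, SUM(pct) AS population
FROM (
    SELECT CASE
               WHEN COUNT(*) * 100.0 / (SELECT COUNT(*) FROM subscribers) >= 1
               THEN Country
               ELSE 'OTHERS'
           END AS territory,
           COUNT(*) * 100.0 / (SELECT COUNT(*) FROM subscribers) AS pct
    FROM subscribers
    GROUP BY Country
)
GROUP BY territory
ORDER BY territory = 'OTHERS', population DESC
"""
result = conn.execute(query).fetchall()
for territory, population in result:
    print(f"{territory}: {population:.1f}%")
```

Keeping the percentage numeric until the final formatting step means the sort is numeric rather than alphabetical.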

Reconciliation Automation Query

I have one database, and from time to time I change part of a query as requirements change.
I want to record the results of these queries, both before and after the change, in one table, and show the queries whose results differ.
For Example,
Consider following table
emp_id country salary
---------------------
1 usa 1000
2 uk 2500
3 uk 1200
4 usa 3500
5 usa 4000
6 uk 1100
Now, here are my queries and their results.
Before Query:
select count(emp_id) as count,country from table where salary>2000 group by country;
Before Result:
count country
2 usa
1 uk
After Query:
select count(emp_id) as count,country from table where salary<2000 group by country;
After Query Result:
count country
2 uk
1 usa
The final table I want is:
column 1 | column 2 | column 3 | column 4
---------+----------+----------+---------
2        | usa      | 2        | uk
1        | uk       | 1        | usa
However, if the query results are the same, they shouldn't appear in this table.
Thanks in advance.
I believe that you can use the same approach as here.
select t1.*, t2.* -- if you need specific columns without rn, then you have to list them here
from
(
select t.*, row_number() over (order by count) rn
from
(
-- query #1 (no semicolon inside a subquery)
select count(emp_id) as count,country from table where salary>2000 group by country
) t
) t1
full join
(
select t.*, row_number() over (order by count) rn
from
(
-- query #2
select count(emp_id) as count,country from table where salary<2000 group by country
) t
) t2 on t1.rn = t2.rn;
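The row-number pairing can be tried on the question's sample table. The sketch below makes several assumptions: it uses Python's sqlite3 module (SQLite 3.25 or newer for window functions), it names the table `employees` because `table` is a reserved word, and it uses LEFT JOIN instead of FULL JOIN because older SQLite builds lack FULL JOIN. The two are equivalent here only because both queries return the same number of rows. The WHERE clause is an addition that drops pairs that are identical, as the question asks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, country TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "usa", 1000), (2, "uk", 2500), (3, "uk", 1200),
     (4, "usa", 3500), (5, "usa", 4000), (6, "uk", 1100)],
)

# Number the rows of each result set, pair them by row number,
# and keep only the pairs that differ (IS NOT is SQLite's NULL-safe <>).
query = """
WITH before_q AS (
    SELECT COUNT(emp_id) AS cnt, country FROM employees
    WHERE salary > 2000 GROUP BY country
),
after_q AS (
    SELECT COUNT(emp_id) AS cnt, country FROM employees
    WHERE salary < 2000 GROUP BY country
),
b AS (SELECT cnt, country, ROW_NUMBER() OVER (ORDER BY cnt) AS rn FROM before_q),
a AS (SELECT cnt, country, ROW_NUMBER() OVER (ORDER BY cnt) AS rn FROM after_q)
SELECT b.cnt, b.country, a.cnt, a.country
FROM b
LEFT JOIN a ON b.rn = a.rn
WHERE b.cnt IS NOT a.cnt OR b.country IS NOT a.country
ORDER BY b.rn
"""
diff_rows = conn.execute(query).fetchall()
for row in diff_rows:
    print(row)
```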