Using Hive DML to generate reports that summarise data

I need to use relevant Hive DML statements and summary functions to generate reports that summarise the data below.
year,town,taxi_co2,bus_co2
2013,luton,1,1
2013,manchester,3,2
2013,london,2,1
2014,luton,1,3
2014,london,3,1
2015,luton,4,1
2014,manchester,6,7
2016,london,2,2
2015,luton,4,1
2015,manchester,1,8
2014,london,3,1
2015,luton,3,1
2015,manchester,1,8
2015,london,3,1
2016,luton,6,5
2016,manchester,4,2
2016,london,3,2
2015,luton,4,1
2013,luton,1,2
2015,london,7,8
2013,manchester,3,2
2015,manchester,1,8
2015,london,7,8
The result I want is to filter on year 2013 only, then show the total CO2 per town plus a horizontal total across both columns.
town, total taxi co2, total bus co2, total (both taxi and bus)
luton, x, x, x
manchester, x, x, x
london, x, x, x
I have tried the HQL below, but I can't work out how to complete it or whether it is correct. It does not give me the desired result. :)
SELECT town,
sum(taxi_co2) AS Taxi,
sum(bus_co2) AS Bus
FROM <table>
WHERE year == '2013'
GROUP BY town;

SELECT town,
sum(taxi_co2) as Taxi,
sum(bus_co2) as Bus,
sum(taxi_co2)+sum(bus_co2) as Total
FROM <table>
WHERE year = '2013'
GROUP BY town;
If sum() for some town can be NULL, use NVL() to convert to 0:
nvl(sum(taxi_co2),0)+nvl(sum(bus_co2),0) as Total
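A quick way to sanity-check the answer's query is to run it against the 2013 rows from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so it is only an approximation: the table name emissions is made up, and COALESCE replaces NVL because SQLite has no NVL (Hive accepts COALESCE as well).

```python
import sqlite3

# Hypothetical table name; the 2013 rows are copied from the question.
rows = [
    (2013, "luton", 1, 1),
    (2013, "manchester", 3, 2),
    (2013, "london", 2, 1),
    (2013, "luton", 1, 2),
    (2013, "manchester", 3, 2),
    (2014, "luton", 1, 3),  # non-2013 row, must be filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emissions (year INTEGER, town TEXT, taxi_co2 INTEGER, bus_co2 INTEGER)"
)
conn.executemany("INSERT INTO emissions VALUES (?, ?, ?, ?)", rows)

# Per-town totals plus the horizontal total of both columns.
report = conn.execute(
    """
    SELECT town,
           SUM(taxi_co2) AS taxi,
           SUM(bus_co2)  AS bus,
           COALESCE(SUM(taxi_co2), 0) + COALESCE(SUM(bus_co2), 0) AS total
    FROM emissions
    WHERE year = 2013
    GROUP BY town
    ORDER BY town
    """
).fetchall()

for line in report:
    print(line)
```

With the sample data, luton totals 5, manchester 10 and london 3.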

Related

Select only the row with the max value, but the column with this info is a SUM()

I have the following query:
SELECT DISTINCT
CAB.CODPARC,
PAR.RAZAOSOCIAL,
BAI.NOMEBAI,
SUM(VLRNOTA) AS AMOUNT
FROM TGFCAB CAB, TGFPAR PAR, TSIBAI BAI
WHERE CAB.CODPARC = PAR.CODPARC
AND PAR.CODBAI = BAI.CODBAI
AND CAB.TIPMOV = 'V'
AND STATUSNOTA = 'L'
AND PAR.CODCID = 5358
GROUP BY
CAB.CODPARC,
PAR.RAZAOSOCIAL,
BAI.NOMEBAI
The result is omitted here; company names and neighborhoods are hidden for obvious reasons.
For those who don't understand Latin languages, the query currently returns the client code, company name, company neighborhood, and the total value of movements.
The WHERE clause restricts it to sales movements of companies from one particular city.
Note that in the SELECT statement, the column that returns the total sales value is a SUM().
My goal is to return only the company that has the maximum value in this column; if there is a tie, display all of the tied companies.
This is where I'm struggling, because I can't seem to find a simple solution. I tried to use
WHERE AMOUNT = MAX(AMOUNT)
but, as expected, it didn't work.
You tagged the question with the whole bunch of different databases; do you really use all of them?
Because, "PL/SQL" reads as "Oracle". If that's so, here's one option.
with temp as
-- this is your current query
(select columns,
sum(vlrnota) as amount
from ...
where ...
)
-- query that returns what you asked for
select *
from temp t
where t.amount = (select max(a.amount)
from temp a
);
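To see that this pattern keeps ties, here is a minimal sketch using Python's sqlite3 module rather than Oracle. The table sales and its columns client and value are hypothetical stand-ins for the joined tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (client TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 100), ("A", 50), ("B", 150), ("C", 40)],
)

# The CTE plays the role of the original grouped query; the outer query
# keeps every row whose total equals the overall maximum, so ties survive.
top_clients = conn.execute(
    """
    WITH temp AS (
        SELECT client, SUM(value) AS amount
        FROM sales
        GROUP BY client
    )
    SELECT client, amount
    FROM temp
    WHERE amount = (SELECT MAX(amount) FROM temp)
    ORDER BY client
    """
).fetchall()

print(top_clients)
```

Clients A and B both total 150, so both rows come back, while C is dropped.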
You should be able to achieve the same without a subquery by using a window function with OVER ():
WITH T AS (
SELECT
CAB.CODPARC,
PAR.RAZAOSOCIAL,
BAI.NOMEBAI,
SUM(VLRNOTA) AS AMOUNT,
MAX(SUM(VLRNOTA)) OVER () AS MAMOUNT
FROM TGFCAB CAB
JOIN TGFPAR PAR ON PAR.CODPARC = CAB.CODPARC
JOIN TSIBAI BAI ON BAI.CODBAI = PAR.CODBAI
WHERE CAB.TIPMOV = 'V'
AND STATUSNOTA = 'L'
AND PAR.CODCID = 5358
GROUP BY CAB.CODPARC, PAR.RAZAOSOCIAL, BAI.NOMEBAI
)
SELECT CODPARC, RAZAOSOCIAL, NOMEBAI, AMOUNT
FROM T
WHERE AMOUNT=MAMOUNT
Note it's usually (always) beneficial to join tables using clear explicit join syntax. This should be fine cross-platform between Oracle & SQL Server.

Sum of column returning all null values in PySpark SQL

I am new to Spark and this might be a straightforward problem.
I have a DataFrame named sql_left, built from a SQL query.
Here is a sample row, generated using sql_left.take(1):
[Row(REPORT_ID='2016-30-15/08/2019', Stats Area='2 Metropolitan', Suburb='GREENACRES', Postcode=5086, LGA Name='CITY OF PORT ADELAIDE ENFIELD', Total Units=3, Total Cas=0, Total Fats=0, Total SI=0, Total MI=0, Year=2016, Month='November', Day='Wednesday', Time='01:20 am', Area Speed=50, Position Type='Not Divided', Horizontal Align='Straight road', Vertical Align='Level', Other Feat='Not Applicable', Road Surface='Sealed', Moisture Cond='Dry', Weather Cond='Not Raining', DayNight='Night', Crash Type='Hit Parked Vehicle', Unit Resp=1, Entity Code='Driver Rider', CSEF Severity='1: PDO', Traffic Ctrls='No Control', DUI Involved=None, Drugs Involved=None, ACCLOC_X=1331135.04, ACCLOC_Y=1677256.22, UNIQUE_LOC=13311351677256, REPORT_ID='2016-30-15/08/2019', Unit No=2, No Of Cas=0, Veh Reg State='UNKNOWN', Unit Type='Motor Vehicle - Type Unknown', Veh Year='XXXX', Direction Of Travel='East', Sex=None, Age=None, Lic State=None, Licence Class=None, Licence Type=None, Towing='Unknown', Unit Movement='Parked', Number Occupants='000', Postcode=None, Rollover=None, Fire=None)]
Note: Age column has 'XXX','NUll' and other integer values as 023,034 etc.
The printSchema shows Age,Total Cas as integers.
I've tried the below code to first join two tables:
sql_left = spark.sql('''
SELECT *
FROM sql_crash c Left JOIN sql_units u ON c.REPORT_ID=u.REPORT_ID''')
sql_left.createOrReplaceTempView("mytable")
And below code to generate Total Cas:
sql_result = spark.sql('''select concat_ws(' ', Day, Month,Year,Time) as Date_Time,Age,"Licence Type","Unit Type",Sex,COALESCE(sum("Total Cas"),0) as Total_casualities from mytable where Suburb in ('ADELAIDE','ADELAIDE AIRPORT','NORTH ADELAIDE','PORT ADELAIDE') Group by Date_Time, Age,"Licence Type","Unit Type",Sex order by Total_casualities desc''')
sql_result.show(20,truncate=False)
The output I'm getting is below with sum as 0.
+--------------------------------+---+------------+---------+-------+-----------------+
|Date_Time |Age|Licence Type|Unit Type|Sex |Total_casualities|
+--------------------------------+---+------------+---------+-------+-----------------+
|Friday December 2016 02:45 pm |XXX|Licence Type|Unit Type|Unknown|0.0 |
|Saturday September 2017 06:35 pm|023|Licence Type|Unit Type|Male |0.0 |
+--------------------------------+---+------------+---------+-------+-----------------+
I've tried multiple options, however nothing worked out.
My main problem here is Total_casualities is returning 0.0 for all rows if I use COALESCE(sum("Total Cas"),0). If I don't use COALESCE, it is displaying values as NULL.
Help is much appreciated.
Instead of putting Total Cas in double quotes ("Total Cas"), wrap it in backticks:
`Total Cas`
Note: In Spark SQL, a column name that contains a space must be quoted with backticks. Text in double quotes is treated as a string literal, so sum("Total Cas") tries to sum a string and returns NULL. The same applies to the other columns (Licence Type, Unit Type): in double quotes they print the literal text instead of the column's value.
sql_result = spark.sql('''select concat_ws(' ', Day, Month,Year,Time) as Date_Time,Age,`Licence Type`,`Unit Type`,Sex,sum(`Total Cas`) as Total_casualities from mytable where Suburb in ('ADELAIDE','ADELAIDE AIRPORT','NORTH ADELAIDE','PORT ADELAIDE') Group by Date_Time, Age,`Licence Type`,`Unit Type`,Sex order by Total_casualities desc''')

Calculate the variance of the weights of all players [closed]

We are given data from nba where a description of each table is as follows:
coaches_season, each tuple of which describes the performance of one coach in one season;[cid, year, yr_order, year, season_win, season_loss, play_off_win, play_off_loss, tid]
teams, each tuple of which gives the basic information of a team; [tid, location, name, league]
players, each tuple of which gives the basic information of one player; [ilkid, firstname, lastname, position, first_season, last_season, h_feet, h_inches, weight, college, birthday]
player_rs, each tuple of which gives the detailed performance of one player in one regular season; [ilkid,tid, pts, asts, of, ftm, tpa,tpm, fgm,fga, fta, blk, turnover, stl, dreb, oreb, reb, minutes, gp, league, lastname, firstname, year]
player_rs_career, each tuple of which gives the detailed regular-season performance of one player in his career;[ilkid, firstname, lastname, fga, fgm, fta, ftm, tpa, tpm, pf, stl, oreb, minutes, gp, dreb, asts, turnover, blk, reb, league]
draft, each tuple of which shows the information of an NBA draft. [draft_year, firstname, lastname, draft_round, tid, selection, draft_from, ilkid, league]
I found many queries but am stuck with these 3 queries:
I) For each college, print the college name and average number of drafts (per season) they sent to NBA. However, only report those colleges that sent drafts in at least 3 seasons.
II) Calculate the variance of the weights of all players;
III) Print the first and last names of those who either scored more than 12000 points in their careers or played for more than 12 seasons.
To get good answers, please include the DDL for each table, including primary and foreign key definitions. Also include sample data as text (not an image) and the expected output from that data. In this case you may also want to define the column names, as not everyone is familiar with the acronyms used for the NBA.
With that said, I'll give it a stab. Note that since you included neither test data nor table definitions, the queries have NOT been tested.
-- I) For each college, print the college name and average number of drafts (per season) they sent to NBA.
-- However, only report those colleges that sent drafts in at least 3 seasons.
-- assumptions:
-- draft.draft_year is an integer calendar year, and one draft year counts as one season
-- draft.draft_from is the text name of the college
-- the average is taken over the seasons in which the college sent at least one draft
select draft_from, round(avg(drc),2) avg_drafts
from (
select draft_from, draft_year, count(*) drc
from draft
group by draft_from, draft_year
) t
group by draft_from
-- the inner query yields one row per college and season, so count(*) here is the number of seasons
having count(*) >= 3;
-- II) Calculate the variance of the weights of all players;
-- assumption: weight defined as float;
select var_samp(weight) from players;
OR
select var_pop(weight) from players;
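The two functions differ only in the divisor: var_samp divides the sum of squared deviations by n - 1, var_pop by n. As a rough illustration, Python's statistics module offers the same pair under the names variance and pvariance. The weights below are invented values, not data from the players table.

```python
from statistics import pvariance, variance

# Hypothetical player weights, for illustration only.
weights = [200, 210, 220, 230]

sample_var = variance(weights)       # divides by n - 1, like var_samp
population_var = pvariance(weights)  # divides by n, like var_pop

print(sample_var, population_var)
```

If the table holds every player you care about, the population variance is usually the one to report; if it is a sample of a larger group, use the sample variance.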
-- III)Print the first and last names of those who either scored more than 12000 points in their careers or played for more than 12 seasons.
-- assumption fgm => field goals made = 2 points each
-- ftm => free throws made = 1 point each
-- tpm => 3 point shot make = 3 points each
-- ilkid => PK in players and FK in player_rs_career
-- table player_rs_career does include last/current season
-- note player_rs_career does NOT contain year/season, unless hidden by undescribed column name
select distinct *
from (select p.firstname, p.lastname
, sum(ftm + (2*fgm) + (3*tpm)) over (partition by p.ilkid) points
, (coalesce (p.last_season, extract (year from now())::integer) - p.first_season + 1) seasons
from players p
join player_rs_career pc
on p.ilkid = pc.ilkid
) pp
where points > 12000
or seasons > 12;

Access SQL query - Percentage of Total calculation

In Access SQL, I am attempting what should seem like a simple task in attaining a percentage of total. There are 3 item stores (Sears, kmart & Mktpl) of which in any given week, I wish to calculate their respective percent of total based on balance of sales (all can be obtained using one table - tbl_BUChannelReporting).
For example week 5 dummy numbers - Sears 7000, kmart 2500, mktpl 2000
the following ratios would be returned: sears 61%, kmart 22%, mktpl 17%
I was originally trying to create a sub query and wasn't getting anywhere so I am essentially trying to sum sales on one of the item stores in week 5 divided by the sum of all 3 item store sales in week 5. The following is my query, which is giving me "cannot have aggregate function in expression" error:
SELECT FY, FW, Rept_Chnl, BU_NM, Order_Store, Item_Store, CDBL(
SUM(IIF([item_store]="sears", revenue, IIF([item_store]="kmart", revenue, IIF([item_store]="mktpl", revenue,0)))) /
(SUM(IIF([item_store]="sears",revenue,0)+SUM(IIF([item_store]="kmart",revenue,0)+SUM(IIF([item_store]="mktpl",revenue,0))))))
AS Ratios
FROM tbl_BUChannelReporting
WHERE FY = "2017"
AND FW = 5
GROUP BY FY, FW, Rept_Chnl, BU_NM, Order_Store, item_store
Thanks all in advance for taking the time. This is my 1st post here and I don't consider myself anything but a newbie anxious to learn from the best and see how this turns out.
Take care!
-D
Consider using two derived tables or saved aggregate queries: one that groups on Item_Store and the other that does not include Item_Store in order to sum the total stores' revenue. All other groupings (FY, FW, Rept_Chnl, BU_NM, Order_Store) remain in both and used to join the two. Then in outer query, calculate percentage ratio.
SELECT i.*, CDbl(i.Store_Revenue / a.Store_Revenue) As Ratios
FROM
(SELECT t.FY, t.FW, t.Rept_Chnl, t.BU_NM, t.Order_Store, t.Item_Store,
SUM(t.Revenue) As Store_Revenue
FROM tbl_BUChannelReporting t
WHERE t.FY = '2017' AND t.FW = 5
GROUP BY t.FY, t.FW, t.Rept_Chnl, t.BU_NM, t.Order_Store, t.Item_Store) As i
INNER JOIN
(SELECT t.FY, t.FW, t.Rept_Chnl, t.BU_NM, t.Order_Store,
SUM(t.Revenue) As Store_Revenue
FROM tbl_BUChannelReporting t
WHERE t.FY = '2017' AND t.FW = 5
GROUP BY t.FY, t.FW, t.Rept_Chnl, t.BU_NM, t.Order_Store) As a
ON i.FY = a.FY AND i.FW = a.FW AND i.Rept_Chnl = a.Rept_Chnl
AND i.BU_NM = a.BU_NM AND i.Order_Store = a.Order_Store
Or save each above SELECT statement as its own query and reference both below:
SELECT i.*, (i.Store_Revenue / a.Store_Revenue) As Ratios
FROM
Indiv_Item_StoreAggQ As i
INNER JOIN
All_Item_StoreAggQ As a
ON i.FY = a.FY AND i.FW = a.FW AND i.Rept_Chnl = a.Rept_Chnl
AND i.BU_NM = a.BU_NM AND i.Order_Store = a.Order_Store
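The two-derived-table idea can be tried out on the week 5 dummy numbers from the question. This is only a sketch in Python's sqlite3, not Access SQL: the table sales is a cut-down, hypothetical version of tbl_BUChannelReporting with just the week, store and revenue columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (fw INTEGER, item_store TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(5, "sears", 7000), (5, "kmart", 2500), (5, "mktpl", 2000), (6, "sears", 999)],
)

# i: revenue per item store; a: revenue across all item stores.
# Joining them on the shared grouping column gives each store its share.
ratios = conn.execute(
    """
    SELECT i.item_store, i.store_revenue / a.total_revenue AS ratio
    FROM (SELECT fw, item_store, SUM(revenue) AS store_revenue
          FROM sales WHERE fw = 5 GROUP BY fw, item_store) AS i
    INNER JOIN
         (SELECT fw, SUM(revenue) AS total_revenue
          FROM sales WHERE fw = 5 GROUP BY fw) AS a
      ON i.fw = a.fw
    ORDER BY ratio DESC
    """
).fetchall()

percentages = {store: round(ratio * 100) for store, ratio in ratios}
print(percentages)
```

Rounded to whole percentages, this reproduces the 61%, 22% and 17% split given in the question.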

sql queries to show the most popular record

I have four tables
Car (car_registration_no, class, type_code)
Rental_history (rent_date, car_registration_no, rent_location_code, return_location_code)
Type (type_code, make, model)
Location (location_code, branch_name)
I need a query to show the most popular car rented at each location.
I also need a query to show the total rentals at each location for the previous month.
My code so far is as follows, but I couldn't complete it:
SELECT car.class, car.type_code , type.make, type.model
FROM car , type, rental_history
where rental_history.car_registration_no = car.car_registration_no
and car.type_code = type.type_code
You will need to join the tables and calculate the numbers. Let's start off with an easier query to point you in the right direction.
This will show you how many times a "type_code" car has been rented per location (untested, may contain errors)
SELECT
count(rental_history.car_registration_no) as rental_num,
car.type_code,
rental_history.rent_location_code
FROM car
LEFT JOIN rental_history ON (rental_history.car_registration_no = car.car_registration_no)
GROUP BY car.type_code, rental_history.rent_location_code;
I'm using a left join here because there may be cars that have not been rented and won't have any history. Instead of not showing up, you will have a "0" for number of rentals.
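From those counts, one possible way to get the most popular car type per location is to keep only the rows whose count equals the maximum for that location. The sketch below uses Python's sqlite3 with invented registration numbers, type codes and location codes; it uses an inner join because unrented cars cannot be the most popular anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE car (car_registration_no TEXT, type_code TEXT);
    CREATE TABLE rental_history (car_registration_no TEXT, rent_location_code TEXT);
    INSERT INTO car VALUES ('R1', 'T1'), ('R2', 'T1'), ('R3', 'T2');
    INSERT INTO rental_history VALUES
        ('R1', 'L1'), ('R2', 'L1'), ('R3', 'L1'),
        ('R3', 'L2'), ('R3', 'L2'), ('R1', 'L2');
    """
)

# counts: rentals per (location, type). The outer query keeps the type(s)
# with the highest count in each location; ties would all be returned.
most_popular = conn.execute(
    """
    WITH counts AS (
        SELECT rh.rent_location_code AS location,
               car.type_code AS type_code,
               COUNT(*) AS rental_num
        FROM rental_history rh
        JOIN car ON car.car_registration_no = rh.car_registration_no
        GROUP BY rh.rent_location_code, car.type_code
    )
    SELECT c.location, c.type_code, c.rental_num
    FROM counts AS c
    WHERE c.rental_num = (SELECT MAX(c2.rental_num)
                          FROM counts AS c2
                          WHERE c2.location = c.location)
    ORDER BY c.location
    """
).fetchall()

print(most_popular)
```

Joining the result to the type table on type_code would then give the make and model.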
Edit:
For the second query it's actually very straightforward. You need to group by location, filter on date and use COUNT (again, untested):
SELECT
count(rental_history.car_registration_no) as rental_num,
rental_history.rent_location_code
FROM rental_history
WHERE rent_date >= '2012-03-01' AND rent_date < '2012-04-01'
GROUP BY rental_history.rent_location_code;
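The dates in that query are hard-coded. If the query is run from application code, the bounds of the previous month can be computed first and passed in as parameters. A minimal sketch in Python with sqlite3 follows; the helper previous_month_bounds and the sample rows are made up for illustration, and rent_date is assumed to be stored as ISO yyyy-mm-dd text.

```python
import sqlite3
from datetime import date


def previous_month_bounds(today):
    """Return ISO dates (start, end) with start <= rent_date < end for last month."""
    first_of_this_month = today.replace(day=1)
    if first_of_this_month.month == 1:
        # January rolls back to December of the previous year.
        first_of_prev = first_of_this_month.replace(
            year=first_of_this_month.year - 1, month=12
        )
    else:
        first_of_prev = first_of_this_month.replace(month=first_of_this_month.month - 1)
    return first_of_prev.isoformat(), first_of_this_month.isoformat()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rental_history "
    "(rent_date TEXT, car_registration_no TEXT, rent_location_code TEXT)"
)
conn.executemany(
    "INSERT INTO rental_history VALUES (?, ?, ?)",
    [
        ("2012-03-05", "R1", "L1"),
        ("2012-03-31", "R2", "L1"),
        ("2012-03-10", "R3", "L2"),
        ("2012-04-01", "R1", "L1"),  # current month, excluded
        ("2012-02-28", "R1", "L2"),  # two months back, excluded
    ],
)

start, end = previous_month_bounds(date(2012, 4, 15))
totals = conn.execute(
    """
    SELECT rent_location_code, COUNT(*) AS rental_num
    FROM rental_history
    WHERE rent_date >= ? AND rent_date < ?
    GROUP BY rent_location_code
    ORDER BY rent_location_code
    """,
    (start, end),
).fetchall()

print(start, end, totals)
```

Using a half-open range (>= start, < end) avoids having to know how many days the previous month had.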
Join all the tables and use COUNT.