Join with three tables in Hive - sql

I have to write Hive SQL.
I want to find the lowest price for each catalog. I want to get one row for each catalog with the catalog ID, the product ID, the price (lowest price), the image url, the large_ctgr_id, the small_ctgr_id, the large_ctgr_name, and the small_ctgr_name columns.
In the example below, we finally need two rows.
In addition, there are multiple products in a catalog, and each product is classified into a large category and a small category.
Eventually we have to join the three tables.
There is a similar issue with the link I asked.
(Finding the lowest price for each category using join sql in Hive) In this case, we joined only two tables, and the names of the columns changed slightly.
Help me. Thank you.
"catalog_product_match" table
catalog_id product_id
1001 500001
1001 500002
1002 500101
1002 500102
1002 500103
"product_info" table
prd_id price img_url large_ctgr_id small_ctgr_id
500001 29000 /app/url/img/500001.jpg 200 34543
500002 29500 /app/url/img/500002.jpg 201 2345234
500101 8100 /app/url/img/500101.jpg 1020 23252
500102 8100 /app/url/img/500102.jpg 42133 2349823
500103 8500 /app/url/img/500103.jpg 3435 342514
"category_info" table
l_ctgr_id s_ctgr_id l_ctgr_name s_ctgr_name
200 34543 computer notebook
200 3423984 computer tablet
201 2345234 phone smartphone
1020 23252 clothes top
42133 2349823 cup cup
3435 342514 pen ink

Related

Query to aggregate across multiple tables

I'm new to SQL, and I'm trying to create a database to manage a small inventory. This is the structure of the db:
DatabaseStructure
I need to create a query that returns the total inventory per material. So, the first step would be to look up for all the batches associated with the material. Second, look up for all the movements associated with each batch. Then, sum the quantity associated with each movement, but depending on the movement type (If it is a good receipt is addition (+), but if it is an inventory withdrawal is subtraction (-)).
Here is an example of the tables with sample data and the desired result.
Table Material
MaterialID
MaterialDescription
1
Bottle
2
Box
Table Batch
BatchID
MaterialID
VendorMaterial
VendorBatch
ExpirationDate
1000
1
2096027
00123456
12/12/2025
1001
1
2096027
00987654
11/11/2026
1002
2
102400
202400E
10/10/2023
Table Movement
MovementID
BatchID
MovementType
Quantity
CreatedBy
CreatedOnDate
1
1000
Good receipt
100
user1#email.com
4/10/2022
2
1000
Inventory withdrawal
20
user2#email.com
4/15/2022
3
1000
Inventory withdrawal
25
user3#email.com
4/17/2022
4
1001
Good receipt
100
user1#email.com
4/20/2022
5
1001
Inventory withdrawal
10
user4#email.com
4/26/2022
6
1002
Good receipt
50
user1#email.com
2/26/2022
Expected query result - total inventory per material:
MaterialDescription
TotalInventory
Bottle
145
Box
50
TotalInventory calculation: for Bottle there are two good receipts movements of 100 and three withdrawals of 20, 25 and 10. So, total inventory will be (100+100)-(20+25+10)=145.
Thanks for your help!
select
mat.MaterialDescription,
sum(
case mov.MovementType
when 'Good receipt' then 1
when 'Inventory withdrawal' then -1
else 0 /* don't know what to do for other MovementTypes */
end * mov.Quantity
) as TotalInventory
from
Material as mat
left join Batch as bat on bat.MaterialID = mat.MaterialID
left join Movement as mov on mov.BatchID = bat.BatchID
group by
mat.MaterialDescription
;

JOIN the same table on two columns

I use JOINs to replace country and product IDs in import and export data with actual country and products names stored in separate tables. In the data source table (data), there are two columns with country IDs, for origin and destination, both of which I am replacing with country names.
The code I have come up with refers to the country_names table twice – as country_names, and country_names2, – which doesn’t seem to be very elegant. I expected to be able to refer to the table just once, by a single name. I would be grateful if someone pointed me to a more elegant and maybe more efficient way to achieve the same result.
SELECT
country_names.name AS origin,
country_names2.name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM
OEC.year_origin_destination_hs92_6 AS data
JOIN
OEC.products_hs_92 AS product_names
ON
data.hs92 = product_names.hs92
JOIN
OEC.country_names AS country_names
ON
data.origin = country_names.id_3char
JOIN
OEC.country_names AS country_names2
ON
data.dest = country_names2.id_3char
WHERE
data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The table to convert product IDs to product names has 6K+ rows. Here is a small sample:
id hs92 name
63215 3215 Ink
2130110 130110 Lac
21002 1002 Rye
2100200 100200 Rye
52706 2706 Tar
20902 902 Tea
42203 2203 Beer
42302 2302 Bran
178703 8703 Cars
The table to convert country IDs to country names (which is the table I have to JOIN on twice) has 264 rows for all countries in the world. (id_3char is the column used.) Here is a sample:
id id_3char name
euchi chi Channel Islands
askhm khm Cambodia
eublx blx Belgium-Luxembourg
eublr blr Belarus
eumne mne Montenegro
euhun hun Hungary
asmng mng Mongolia
nabhs bhs Bahamas
afsen sen Senegal
And here is a sample of data from the import and export data table with a total of 205M rows that has the two columns origin and dest that I am making a join on:
year origin dest hs92 export_val import_val
2009 can isr 300410 2152838.47 3199.24
1995 chn jpn 590190 275748.65 554154.24
2000 deu gmb 100610 1573508.44 1327.0
2008 deu jpn 540822 10000.0 202062.43
2010 deu ukr 950390 1626012.04 159423.38
2006 esp prt 080530 2470699.19 125291.33
2006 grc ind 844859 8667.0 3182.0
2000 ltu deu 630399 6018.12 5061.96
2005 usa zaf 290219 2126216.52 34561.61
1997 ven ecu 281122 155347.73 1010.0
I think you already have it done such that it can be considered good enough to just use as is :o)
Meantime, If for some reason you really-really want to avoid two joins on that country table - what you can do is to materialize below select statement into let's say `OEC.origin_destination_pairs` table
SELECT
o.id_3char o_id_3char,
o.name o_name,
d.id_3char d_id_3char,
d.name d_name
FROM `OEC.country_names` o
CROSS JOIN `OEC.country_names` d
Then you can just join on that new table as below
SELECT
country_names.o_name AS origin,
country_names.d_name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM OEC.year_origin_destination_hs92_6 AS data
JOIN OEC.products_hs_92 AS product_names
ON data.hs92 = product_names.hs92
JOIN OEC.origin_destination_pairs AS country_names
ON data.origin = country_names.o_id_3char
AND data.dest = country_names2.d_id_3char
WHERE data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The motivation behind above is cost of storing and querying in your particular case
Your `OEC.country_names` table is just about 10KB in size
Each time you query it you pay as if it is 10MB (Charges are rounded to the nearest MB, with a minimum 10 MB data processed per table referenced by the query, and with a minimum 10 MB data processed per query.)
So, if you will materialize above mentioned table - it will still be less than 10MB so no difference in querying charges
Similar situation with storing that table - no visible changes in charges
You can check more about pricing here

Joining a table to two one-to-many relationship tables in SQL Server

Happy Friday folks,
I'm trying to write an SSRS report displaying data from three (actually about 12, but only three relevant) tables that have akward relationships and the SQL query behind the data is proving difficult.
There are three entities involved - a Purchase Order, a Sales Order, and a Delivery. The problem is the a Purchase Order can have many sales orders, and also many deliveries which are NOT linked to the sales orders...that would be too easy.
Both the Sales Order and Delivery tables can be linked to the Purchase Order table by foreign keys and an intermediate table each.
I need to basically list Purchase Orders, a list of sales orders and a list of deliveries next to them, with NULLs for any fields that aren't valid so that'll give the required output in SSRS/when read by a human, ie, for a purchase order with 2 sales orders and 4 delivery dates;
PO SO Delivery
1234 ABC 05/10
1234 DEF 09/10
1234 NULL 10/12
1234 NULL 14/12
The above (when grouped by PO) will tell the users there are two sales orders and four (unlinked) delivery dates.
Likewise if there are more SOs than deliveries, we need NULLs in the Delivery column;
PO SO Delivery
1234 ABC 03/08
1234 DEF NULL
1234 GHI NULL
1234 JKL NULL
Above would be the case with 4 SOs and one delivery date.
Using Left Outer joins alone gives too much duplication - in this case 8 rows, as it gives 4 delivery dates for each match on the sales order;
PO SO Delivery
1234 ABC 05/10
1234 ABC 09/10
1234 ABC 10/12
1234 ABC 14/12
1234 DEF 05/10
1234 DEF 09/10
1234 DEF 10/12
1234 DEF 14/12
It's fine that the PO column is duplicated as SSRS can visually group that - but the SO/Delivery fields can't be allowed to duplicate as this can't be got rid of in the report - if I group the column in SSRS by SO then it still spits out 4 delivery dates for each one.
The only situation our query works nice is when there is just one SO per PO. In that case the single PO and SO numbers are duplicated together for x deliveries and can both be neatly grouped in SSRS. Unfortunately this is a rare occurence in the data.
I've thought of trying to use some sort of windowing function or CROSS APPLY but both fall down as they will repeat for every PO number listed and end up spitting out too much data.
At the point of thinking this just isn't set-based enough to be doable in SQL, I know the data is horrible..
Any help much appreciated.
EDIT - basical sqlfiddle link to the table schemas. Omitted many columns which aren't relevant. http://sqlfiddle.com/#!2/5ba16
Example data...
Purchase Order
PO_Number Style
1001 Black work boots
1002 Green hat
1006 Red Scarf
Sales Order
Sales_order_number PO_number Qty Retailer
A100-21 1001 15 Walmart
A100-22 1001 29 Walmart
A200-31 1006 1000 Asda
Delivery
Delivery_ID Delivery_Date PO_number
1543285 10/05/2014 1001
1543286 12/05/2014 1001
1543287 17/05/2014 1001
1543288 21/05/2014 1002
If you assign row numbers to the elements in salesorders and deliveries, you can link on that.
Something like this
declare #salesorders table (po int, so varchar(10))
declare #deliveries table (po int, delivery date)
declare #purchaseorders table (po int)
insert #purchaseorders values (123),(456)
insert #salesorders values (123,'a'),(123,'b'),(456,'c')
insert #deliveries values (123,'2014-1-1'),(456,'2014-2-1'),(456,'2014-2-1')
select *
from
(
select numbers.number, p.po, so.so, d.delivery from #purchaseorders p
cross join (Select number from master..spt_values where type='p') numbers
left join (select *,ROW_NUMBER() over (partition by po order by so) sor from #salesorders ) so
on p.po = so.po and numbers.number = so.sor
left join (select * , ROW_NUMBER() over (partition by po order by delivery) dor from #deliveries) d
on p.po = d.po and numbers.number = d.dor
) v
where so is not null or delivery is not null
order by po,number

Storing a set of criteria in another table

I have a large table with sales data, useful data below:
RowID Date Customer Salesperson Product_Type Manufacturer Quantity Value
1 01-06-2004 James Ian Taps Tap Ltd 200 £850
2 02-06-2004 Apple Fran Hats Hats Inc 30 £350
3 04-06-2004 James Lawrence Pencils ABC Ltd 2000 £980
...
Many rows later...
...
185352 03-09-2012 Apple Ian Washers Tap Ltd 600 £80
I need to calculate a large set of targets from table containing values different types, target table is under my control and so far is like:
TargetID Year Month Salesperson Target_Type Quantity
1 2012 7 Ian 1 6000
2 2012 8 James 2 2000
3 2012 9 Ian 2 6500
At present I am working out target types using a view of the first table which has a lot of extra columns:
SELECT YEAR(Date)
, MONTH(Date)
, Salesperson
, Quantity
, CASE WHEN Manufacturer IN ('Tap Ltd','Hats Inc') AND Product_Type = 'Hats' THEN True ELSE False END AS IsType1
, CASE WHEN Manufacturer = 'Hats Inc' AND Product_Type IN ('Hats','Coats') THEN True ELSE False END AS IsType2
...
...
, CASE WHEN Manufacturer IN ('Tap Ltd','Hats Inc') AND Product_Type = 'Hats' THEN True ELSE False END AS IsType24
, CASE WHEN Manufacturer IN ('Tap Ltd','Hats Inc') AND Product_Type = 'Hats' THEN True ELSE False END AS IsType25
FROM SalesTable
WHERE [some stuff here]
This is horrible to read/debug and I hate it!!
I've tried a few different ways of simplifying this but have been unable to get it to work.
The closest I have come is to have a third table holding the definition of the types with the values for each field and the type number, this can be joined to the tables to give me the full values but I can't work out a way to cope with multiple values for each field.
Finally the question:
Is there a standard way this can be done or an easier/neater method other than one column for each type of target?
I know this is a complex problem so if anything is unclear please let me know.
Edit - What I need to get:
At the very end of the process I need to have targets displayed with actual sales:
Type Year Month Salesperson TargetQty ActualQty
2 2012 8 James 2000 2809
2 2012 9 Ian 6500 6251
Each row of the sales table could potentially satisfy 8 of the types.
Some more points:
I have 5 different columns that need to be defined against the targets (or set to NULL to include any value)
I have between 30 and 40 different types that need to be defined, several of the columns could contain as many as 10 different values
For point 2, if I am using a row for each permutation of values, 2 columns with 10 values each would give me 100 rows for each sales person for each month which is a lot but if this is the only way to define multiple values I will have to do this.
Sorry if this makes no sense!
If I am correct that the "Target_Type" field in the Target Table is based on the Manufacturer and the Product_Type, then you can create a TargetType table that looks like what's below and JOIN on Manufacturer and the Product_Type to get your Target_Type_Value:
ID Product_Type Manufacturer Target_Type_Value
1 Taps Tap Ltd 1
2 Hats Hats Inc 2
3 Coats Hats Inc 2
4 Hats Caps Inc 3
5 Pencils ABC Ltd 6
This should address the "multiple values for each field" problem by having a row for each possibility.

SELECT datafields with multiple groups and sums

I cant seem to group by multiple data fields and sum a particular grouped column.
I want to group Person to customer and then group customer to price and then sum price. The person with the highest combined sum(price) should be listed in ascending order.
Example:
table customer
-----------
customer | common_id
green 2
blue 2
orange 1
table invoice
----------
person | price | common_id
bob 2330 1
greg 360 2
greg 170 2
SELECT DISTINCT
min(person) As person,min(customer) AS customer, sum(price) as price
FROM invoice a LEFT JOIN customer b ON a.common_id = b.common_id
GROUP BY customer,price
ORDER BY person
The results I desire are:
**BOB:**
Orange, $2230
**GREG:**
green, $360
blue,$170
The colors are the customer, that GREG and Bob handle. Each color has a price.
There are two issues that I can see. One is a bit picky, and one is quite fundamental.
Presentation of data in SQL
SQL returns tabular data sets. It's not able to return sub-sets with headings, looking something a Pivot Table.
The means that this is not possible...
**BOB:**
Orange, $2230
**GREG:**
green, $360
blue, $170
But that this is possible...
Bob, Orange, $2230
Greg, Green, $360
Greg, Blue, $170
Relating data
I can visually see how you relate the data together...
table customer table invoice
-------------- -------------
customer | common_id person | price |common_id
green 2 greg 360 2
blue 2 greg 170 2
orange 1 bob 2330 1
But SQL doesn't have any implied ordering. Things can only be related if an expression can state that they are related. For example, the following is equally possible...
table customer table invoice
-------------- -------------
customer | common_id person | price |common_id
green 2 greg 170 2 \ These two have
blue 2 greg 360 2 / been swapped
orange 1 bob 2330 1
This means that you need rules (and likely additional fields) that explicitly state which customer record matches which invoice record, especially when there are multiples in both with the same common_id.
An example of a rule could be, the lowest price always matches with the first customer alphabetically. But then, what happens if you have three records in customer for common_id = 2, but only two records in invoice for common_id = 2? Or do the number of records always match, and do you enforce that?
Most likely you need an extra piece (or pieces) of information to know which records relate to each other.
you should group by using all your selected fields except sum then maybe the function group_concat (mysql) can help you in concatenating resulting rows of the group clause
Im not sure how you could possibly do this. Greg has 2 colors, AND 2 prices, how do you determine which goes with which?
Greg Blue 170 or Greg Blue 360 ???? or attaching the Green to either price?
I think the colors need to have unique identofiers, seperate from the person unique identofiers.
Just a thought.