How to merge rows in hive? - sql

I have a production table in Hive that receives incremental data (changed and new records) from an external source on a daily basis. The values for a row may be spread across different dates. For example, this is how the records in the table look on the first day:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| |
| 3| | b3|
+---+----+----+
On the second day, we get the following:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 4| a4| |
| 2| | b2 |
| 3| a3| |
+---+----+----+
which has a new record as well as changed records.
The result I want to achieve is a merge of rows based on the primary key (id in this case), producing this output:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2 |
| 3| a3| b3|
| 4| a4| |
+---+----+----+
The number of columns is large, typically in the range of 100-150. The aim is to provide the latest full view of all the data received so far. How can I do this within Hive itself?
(PS: it doesn't have to be sorted.)

This can be achieved using COALESCE and a full outer join.
SELECT COALESCE(a.id ,b.id) as id ,
COALESCE(a.col1 ,b.col1) as col1,
COALESCE(a.col2 ,b.col2) as col2
FROM tbl1 a
FULL OUTER JOIN table2 b
on a.id =b.id
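To try the COALESCE idea without a Hive cluster, here is a minimal sketch that runs the same kind of merge on an in-memory SQLite database from Python. The table names day1 and day2 are hypothetical, and blank cells are assumed to be NULL. Because older SQLite versions lack FULL OUTER JOIN, the sketch emulates it with a LEFT JOIN plus a UNION ALL of the rows that exist only in the second table. It also puts the newer table first inside COALESCE, on the assumption that a changed value should win over the old one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE day1 (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
CREATE TABLE day2 (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
INSERT INTO day1 VALUES (1, 'a1', 'b1'), (2, 'a2', NULL), (3, NULL, 'b3');
INSERT INTO day2 VALUES (4, 'a4', NULL), (2, NULL, 'b2'), (3, 'a3', NULL);
""")

# Emulated full outer join:
#   part 1: every row of day1, patched with non-NULL values from day2
#   part 2: rows that exist only in day2
merge_sql = """
SELECT a.id,
       COALESCE(b.col1, a.col1) AS col1,
       COALESCE(b.col2, a.col2) AS col2
FROM day1 a
LEFT JOIN day2 b ON a.id = b.id
UNION ALL
SELECT b.id, b.col1, b.col2
FROM day2 b
LEFT JOIN day1 a ON a.id = b.id
WHERE a.id IS NULL
ORDER BY 1
"""
merged = conn.execute(merge_sql).fetchall()
for row in merged:
    print(row)
```

With 100-150 columns, the COALESCE list would probably be generated from the table's column metadata rather than typed by hand.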

Related

how to dynamically join tables in bigquery to avoid duplication of common columns

I have 2 tables with a large number of columns (each has around 700-800 columns, which makes it infeasible to write out all the column names individually). The two tables have a few columns in common. I need to dynamically join them so that the common columns are not repeated and appear only once in the final table. For example:
TABLE 1:
+---------+--------+------+-------+
|firstname|lastname|upload|product|
+---------+--------+------+-------+
| alice| a| 100|apple |
| bob| b| 23|orange |
+---------+--------+------+-------+
TABLE 2:
+---------+--------+------+-------+
|firstname|lastname|books |active |
+---------+--------+------+-------+
| alice| a| 10 |yes |
| bob| b| 2 |no |
+---------+--------+------+-------+
FINAL TABLE:
+---------+--------+------+-------+-----+------+
|firstname|lastname|upload|product|books|active|
+---------+--------+------+-------+-----+------+
| alice| a| 100|apple | 10 | yes |
| bob| b| 23|orange | 2 | no |
+---------+--------+------+-------+-----+------+
Just to give you a direction to look into
select *
from table1
join table2
using(firstname, lastname)
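If applied to the sample data in your question, the output matches the FINAL TABLE above: with USING, the join columns (firstname, lastname) appear only once.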
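The effect of USING on SELECT * can be checked locally. The sketch below runs the same query on an in-memory SQLite database from Python rather than on BigQuery, so it only illustrates the behaviour; the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (firstname TEXT, lastname TEXT, upload INTEGER, product TEXT);
CREATE TABLE table2 (firstname TEXT, lastname TEXT, books INTEGER, active TEXT);
INSERT INTO table1 VALUES ('alice', 'a', 100, 'apple'), ('bob', 'b', 23, 'orange');
INSERT INTO table2 VALUES ('alice', 'a', 10, 'yes'), ('bob', 'b', 2, 'no');
""")

cur = conn.execute("""
SELECT *
FROM table1
JOIN table2 USING (firstname, lastname)
ORDER BY firstname
""")
# Column names of the result: the USING columns are listed only once.
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
print(columns)
for row in rows:
    print(row)
```

Only the key columns have to be named in the query; the remaining several hundred columns come along through the star.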

Grouping together values in SQL

I have a postgresql database with data having four columns. I am using sqlite3 to run these commands in python. Say the columns are named A, B, c, d.
Each event in the database is uniquely represented by combined values of A and B. As an example, consider the table below.
--------------------------------------------------
| A| B| c| d|
--------------------------------------------------
| 10| 3| c1| d1|
--------------------------------------------------
| 11| 4| c2| d2|
--------------------------------------------------
| 10| 3| c3| d3|
--------------------------------------------------
| 12| 3| c4| d4|
--------------------------------------------------
| 13| 1| c5| d5|
--------------------------------------------------
| 10| 3| c6| d6|
--------------------------------------------------
| 11| 4| c7| d7|
--------------------------------------------------
| 10| 1| c8| d8|
--------------------------------------------------
| 10| 2| c9| d9|
--------------------------------------------------
| 11| 3| c10| d10|
--------------------------------------------------
As you can see, the combination 10 and 3 appears three times in the table, meaning all those rows were part of the same event (where each event is uniquely represented by the combined values in A and B). What I want out of this is a list of all the details:
grouped according to column A
then grouped according to the corresponding value in column B.
For example, in this case, (c1, c3, c6), (d1, d3, d6), (c8), (d8), (c9), (d9) correspond to 10 in column A. The first two 'groups' came from 10 (from A) and the corresponding 3 (from B). The second two came from 10 (A) and 1 (B), and the last two came from 10 (A) and 2 (B). Everything else has to be grouped in the same manner.
I am a complete noob to SQL so I am really lost. After much googling I got to this part
SELECT GROUP_CONCAT(c, ',') c,
GROUP_CONCAT(d, ',') d
FROM table GROUP BY A, B
This gives me the details, but it groups A and B together, so the output has 10/3 and 10/2 in different rows. I want all the 10s (and all the other values in A) to be grouped together and then grouped by their value in B.
How can this be done? Thanks!
EDIT: The required answer in tabular format-
--------------------------------------------------
| (c1, c3, c6), (d1, d3, d6), (c8), (d8), (c9), (d9)
--------------------------------------------------
| (c2, c7), (d2, d7), (c10), (d10)
--------------------------------------------------
| (c4), (d4)
--------------------------------------------------
| (c5), (d5)
--------------------------------------------------
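One way to get a single row per value of A is to aggregate twice: first per (A, B) pair, then per A over the already concatenated groups. The sketch below assumes SQLite accessed through Python's sqlite3 module, as mentioned in the question, and uses a hypothetical table name, events. SQLite does not guarantee the order of items inside GROUP_CONCAT, so the groups may come out in a different order than in the desired table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (A INTEGER, B INTEGER, c TEXT, d TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(10, 3, "c1", "d1"), (11, 4, "c2", "d2"), (10, 3, "c3", "d3"),
     (12, 3, "c4", "d4"), (13, 1, "c5", "d5"), (10, 3, "c6", "d6"),
     (11, 4, "c7", "d7"), (10, 1, "c8", "d8"), (10, 2, "c9", "d9"),
     (11, 3, "c10", "d10")],
)

two_level_sql = """
SELECT A, GROUP_CONCAT(grp, ', ') AS groups_for_a
FROM (
    -- inner level: one "(c...), (d...)" string per (A, B) event
    SELECT A, B,
           '(' || GROUP_CONCAT(c, ', ') || '), (' || GROUP_CONCAT(d, ', ') || ')' AS grp
    FROM events
    GROUP BY A, B
)
GROUP BY A   -- outer level: one row per value of A
ORDER BY A
"""
result = dict(conn.execute(two_level_sql).fetchall())
for a_value, groups in result.items():
    print(a_value, "|", groups)
```

If the order inside each group matters, doing only the inner GROUP BY A, B in SQL and assembling the per-A list in Python may be easier to control.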

SQL query for finding the most frequent value of a grouped by value

I'm using SQLite browser and I'm trying to write a query that finds the most frequent value within each group defined by another column:
Table is called main
| |Place |Value|
| 1| London| 101|
| 2| London| 20|
| 3| London| 101|
| 4| London| 20|
| 5| London| 20|
| 6| London| 20|
| 7| London| 20|
| 8| London| 20|
| 9| France| 30|
| 10| France| 30|
| 11| France| 30|
| 12| France| 30|
The result I'm looking for is the most frequent value for each place:
| |Place |Most Frequent Value|
| 1| London| 20|
| 2| France| 30|
Or even better
| |Place |Most Frequent Value|Largest Percentage|2nd Largest Percentage|
| 1| London| 20| 0.75| 0.25|
| 2| France| 30| 1| 0.75|
You can group by place, then value, and order by frequency, e.g.
select place,value,count(value) as freq from cars group by place,value order by place, freq;
This will not give exactly the answer you want, but something close to it, like
London | 101 | 2
France | 30 | 4
London | 20 | 6
Now select place and value from this intermediate table and group by place, so that only one row per place is displayed.
select place,value from
(select place,value,count(value) as freq from cars group by place,value order by place, freq)
group by place;
This will produce the result like following:
France | 30
London | 20
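This works for sqlite, but only because of how it happens to pick the value column for each group: value is neither grouped nor aggregated in the outer query, and which row SQLite takes it from is not guaranteed. Other databases may reject the query or return the place and value with the lowest frequency. In those that accept it, you can try order by place, freq desc instead.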
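A variant that does not depend on row order is sketched below. It relies on a documented SQLite-specific rule: when a query contains a single MAX() aggregate, the other non-aggregated columns are taken from the row that holds the maximum. The sketch uses the cars table name from the queries above and runs on an in-memory database from Python; it is not portable to other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (place TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("London", 101)] * 2 + [("London", 20)] * 6 + [("France", 30)] * 4,
)

most_frequent_sql = """
SELECT place, value, MAX(freq) AS freq   -- value comes from the row with the max freq
FROM (
    SELECT place, value, COUNT(*) AS freq
    FROM cars
    GROUP BY place, value
)
GROUP BY place
ORDER BY place
"""
most_frequent = conn.execute(most_frequent_sql).fetchall()
for row in most_frequent:
    print(row)
```

If two values tie for the highest frequency in a place, this returns only one of them.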
The first part would be something like this.
http://sqlfiddle.com/#!7/ac182/8
with tbl1 as
(select a.place,a.value,count(a.value) as val_count
from table1 a
group by a.place,a.value
)
select t1.place,
t1.value as most_frequent_value
from tbl1 t1
inner join
(select place,max(val_count) as val_count from tbl1
group by place) t2
on t1.place=t2.place
and t1.val_count=t2.val_count
Here we derive tbl1, which gives the count of each place and value combination. We then join it with another derived table, t2, which finds the maximum count per place, and keep only the rows whose count equals that maximum.
I am not sure how you want the percentages in the second output, but if you understand this query, you can build on it to derive the required output. Play around with the sqlfiddle. All the best.
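For the percentage columns, one possible reading is "share of rows within each place". Under that assumption, the sketch below divides each per-value count by the total row count of its place. It uses the table1 name from the query above and runs on an in-memory SQLite database from Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (place TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("London", 101)] * 2 + [("London", 20)] * 6 + [("France", 30)] * 4,
)

share_sql = """
SELECT f.place,
       f.value,
       f.freq * 1.0 / t.total AS share   -- fraction of the place's rows with this value
FROM (SELECT place, value, COUNT(*) AS freq FROM table1 GROUP BY place, value) f
JOIN (SELECT place, COUNT(*) AS total FROM table1 GROUP BY place) t
  ON t.place = f.place
ORDER BY f.place, share DESC
"""
shares = conn.execute(share_sql).fetchall()
for row in shares:
    print(row)
```

The largest and second-largest shares per place are then the first and second rows of each place in this ordering.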
RANK
SQLite supports window functions such as RANK since version 3.25, so we can use the exact same syntax that works on PostgreSQL, similar to https://stackoverflow.com/a/12448971/895245
SELECT "city", "value", "cnt"
FROM (
SELECT
"city",
"value",
COUNT(*) AS "cnt",
RANK() OVER (
PARTITION BY "city"
ORDER BY COUNT(*) DESC
) AS "rnk"
FROM "Sales"
GROUP BY "city", "value"
) AS "sub"
WHERE "rnk" = 1
ORDER BY
"city" ASC,
"value" ASC
This would return all in case of tie. To return just one you could use ROW_NUMBER instead of RANK.
Tested on SQLite 3.34.0 and PostgreSQL 14.3. GitHub upstream.
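To see the difference between RANK and ROW_NUMBER on a tie, here is a small sketch using Python's sqlite3 module (it needs an SQLite build with window function support). The sample rows are made up so that London has a tie between 101 and 20. The helper top_values and the extra tie-break on "value" are illustrative assumptions, not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Sales" ("city" TEXT, "value" INTEGER)')
conn.executemany(
    'INSERT INTO "Sales" VALUES (?, ?)',
    [("London", 101)] * 2 + [("London", 20)] * 2 + [("France", 30)] * 3,
)

def top_values(window_fn, tie_break=""):
    """Return the rows ranked first per city using the given window function."""
    sql = f'''
    SELECT "city", "value"
    FROM (
        SELECT "city", "value",
               {window_fn}() OVER (
                   PARTITION BY "city"
                   ORDER BY COUNT(*) DESC{tie_break}
               ) AS "rnk"
        FROM "Sales"
        GROUP BY "city", "value"
    ) AS "sub"
    WHERE "rnk" = 1
    ORDER BY "city" ASC, "value" ASC
    '''
    return conn.execute(sql).fetchall()

with_ties = top_values("RANK")                         # keeps both tied values
one_each = top_values("ROW_NUMBER", ', "value" ASC')   # lowest value wins the tie
print(with_ties)
print(one_each)
```

Without an explicit tie-break in the window's ORDER BY, which of the tied rows ROW_NUMBER keeps is not defined.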

Finding the intersection of tables that use a many-to-many relationship in SQL

I need to get an intersection of two tables that uses two many-to-many tables to relate each other. Example tables as follows:
**Discount**
|DisId| Discount|Amount|
+-----+---------+------+
| 1| 2% Off| 0.02|
| 2| 10% Off| 0.10|
| 3| 25% Off| 0.25|
| 4| 2 for 1| 0.50|
| 5|Clearance| 0.75|
**DiscountRef**
|DisId|RefType|RefId|IsActive|
+-----+-------+-----+--------+
| 1|Product| 9004| 0|
| 2|Product| 9002| 0|
| 2| PCat| 3456| 1|
| 3| PCat| 7346| 1|
| 3| PCat| 4455| 1|
| 5|Product| 9004| 0|
**ProductCat**
|ProdId|CatId|
+------+-----+
| 9001| 3456|
| 9002| 3456|
| 9005| 3456|
| 9001| 7346|
| 9003| 7346|
| 9003| 4455|
| 9006| 4455|
**Product**
|ProdId| ProdName|ProdPrice|
+------+--------------+---------+
| 9001| 9" Nail| 0.50|
| 9002| 2"x4" Stud| 2.50|
| 9003| Claw Hammer| 5.99|
| 9004| Wood Glue| 1.20|
| 9005|6'x4' Dry Wall| 10.39|
| 9006| Screwdriver| 4.25|
With these tables I need to get the intersection of Product Categories if they are under the same Discount Id. The below table is what I need to get:
|DisId|ProdId|DisPrice|
+-----+------+--------+
| 2| 9001| 0.45|
| 2| 9002| 2.25|
| 2| 9005| 9.36|
| 3| 9003| 4.50|
I have tried a few different ways but can't seem to get to that table. The below SQL returns the discounts that have more than one category applied to them.
SELECT DR.DisId, PC.CatId
FROM DiscountRef DR
INNER JOIN (
SELECT DisId
FROM DiscountRef
GROUP BY DisId
HAVING COUNT(DisId) > 1
) SDR ON SDR.DisId = DR.DisId
INNER JOIN ProductCat PC ON PC.CatId = DR.RefId AND DR.RefType = 'PCat'
GROUP BY DR.DisId, PC.CatId
Table Returned:
|DisId|CatId|
+-----+-----+
| 3| 7346|
| 3| 4455|
Then, using the Product Category Ids with an intersect on the Product table, I get the correct set of product Ids.
SELECT P1.ProdId
FROM Product P1
INNER JOIN ProductCat PC1 ON PC1.ProdId = P1.ProdId AND PC1.CatId = 7346
INTERSECT
SELECT P2.ProdId
FROM Product P2
INNER JOIN ProductCat PC2 ON PC2.ProdId = P2.ProdId AND PC2.CatId = 4455
Also, a discount can have more than two categories (which narrows down the number of products), and sometimes more than one discount is active (discount data is omitted here, but a check will be done).
Any help on how I can get my desired table above?
EDIT: If a DisId has multiple rows in the DiscountRef table and they are of the PCat type, the discount applies to the products that are shared by all of those categories. For example, Claw Hammer is the only item that appears in both CatId 7346 AND CatId 4455.
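One possible approach to "products that are in every category of a discount" is relational division: count, per discount and product, how many of the discount's active PCat categories the product belongs to, and keep the pairs where that count equals the discount's total number of such categories. The sketch below tries this on an in-memory SQLite database from Python with the sample data. The price formula ProdPrice * (1 - Amount) is an assumption inferred from the desired table, and its rounding may differ slightly from the figures shown there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Discount (DisId INTEGER, Discount TEXT, Amount REAL);
CREATE TABLE DiscountRef (DisId INTEGER, RefType TEXT, RefId INTEGER, IsActive INTEGER);
CREATE TABLE ProductCat (ProdId INTEGER, CatId INTEGER);
CREATE TABLE Product (ProdId INTEGER, ProdName TEXT, ProdPrice REAL);
INSERT INTO Discount VALUES (1,'2% Off',0.02),(2,'10% Off',0.10),(3,'25% Off',0.25),
                            (4,'2 for 1',0.50),(5,'Clearance',0.75);
INSERT INTO DiscountRef VALUES (1,'Product',9004,0),(2,'Product',9002,0),(2,'PCat',3456,1),
                               (3,'PCat',7346,1),(3,'PCat',4455,1),(5,'Product',9004,0);
INSERT INTO ProductCat VALUES (9001,3456),(9002,3456),(9005,3456),(9001,7346),
                              (9003,7346),(9003,4455),(9006,4455);
INSERT INTO Product VALUES (9001,'9" Nail',0.50),(9002,'2"x4" Stud',2.50),
                           (9003,'Claw Hammer',5.99),(9004,'Wood Glue',1.20),
                           (9005,'6''x4'' Dry Wall',10.39),(9006,'Screwdriver',4.25);
""")

division_sql = """
SELECT dr.DisId, pc.ProdId,
       ROUND(p.ProdPrice * (1 - d.Amount), 2) AS DisPrice   -- assumed price formula
FROM DiscountRef dr
JOIN ProductCat pc ON pc.CatId = dr.RefId
JOIN Product p ON p.ProdId = pc.ProdId
JOIN Discount d ON d.DisId = dr.DisId
WHERE dr.RefType = 'PCat' AND dr.IsActive = 1
GROUP BY dr.DisId, pc.ProdId, p.ProdPrice, d.Amount
-- keep a product only if it matched every active PCat category of the discount
HAVING COUNT(DISTINCT dr.RefId) = (
    SELECT COUNT(DISTINCT x.RefId)
    FROM DiscountRef x
    WHERE x.DisId = dr.DisId AND x.RefType = 'PCat' AND x.IsActive = 1
)
ORDER BY dr.DisId, pc.ProdId
"""
discounted = conn.execute(division_sql).fetchall()
for row in discounted:
    print(row)
```

Because the required count is computed per discount, the same query covers discounts with one, two, or more categories without listing category Ids by hand.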

Retrieving rows that share multiple ID's in SQL

I am stuck on how to narrow down a selection of rows that are related by multiple ID's. Here is my problem with the data as follows:
|Widget | |Widget Category | |Part Category | |Part |
+---------+ +--------------------+ +--------------+ +-------------+
|Id|Name | |WidId|CatId|CatName | |PartId| CatId | |Id|Name |
+---------+ +-----+-----+--------+ +------+-------+ +--+----------+
| 1|item01| | 1| 1|Windows | | 1| 1| | 1|Glass |
| 2|item02| | 2| 1|Windows | | 1| 2| | 2|Door Frame|
| 3|item03| | 3| 1|Windows | | 2| 2| | 3|Wheel |
| 4|item04| | 1| 2|Door | | 4| 2| | 4|Handle |
| 5|item05| | 5| 2|Door |
| 6|item06| | 6| 3|Trunk |
One or more widgets can be in a Widget Category. Many widget categories can have many part Categories. Many Parts can be part of many part categories. I need to know what Parts are linked to what Widgets. So we know that Item01 has parts "Glass" and Item05 has Parts "Glass, Door Frame, and Handle".
Here is my SQL I have so far but I need it to be dynamic so it can run once a week on a stored procedure.
---- This gives me the Correct number of Widgets to Parts based on set of 2 category ID's as a quick and static hack
SELECT W.Id
FROM Widget W
INNER JOIN dbo.[WidgetCategory] WC1 ON WC1.WidId = W.Id
INNER JOIN dbo.[WidgetCategory] WC2 ON WC2.WidId = W.Id
WHERE WC1.CatId = 1 AND WC2.CatId = 2
GROUP BY W.Id
The reason for the above query is to get a table structure that is grouped by PartId's to WidgetId's as an intersection of the two related categories and all the widgets that are related to parts. The below table is what I am trying to get so that I can aggregate how many widgets are in a part (COUNT(WidId) GROUP BY PartId):
|WidId|PartId|WidgetName|
+-----+------+----------+
| 1| 1| Item01|
| 2| 1| Item02|
| 3| 1| Item03|
| 1| 2| Item01|
| 5| 2| Item05|
Updated question: How can I get this response from the tables above with only returning the intersection of the two categories?
|WidId|PartId|WidgetName|
+-----+------+----------+
| 1| 1| Item01|
| 1| 2| Item01|
Any help would be greatly appreciated! Sorry for the sloppiness, had to post quickly before I left for weekend.
EDIT: Sorry, about the ProductId, was left over from some SQL that I was using. Should be Widget Id. Added more clarity to the problem and added an addition problem I was having.
I think you need a query like this.
SELECT DISTINCT w.Id AS WidId, p.Id AS PartId, w.Name
FROM Widget w
JOIN WidgetCategory wc ON wc.WidId=w.Id
JOIN PartCategory pc ON pc.CatId=wc.CatId
JOIN Part p ON p.Id=pc.PartId
I don't see why you would need to join twice on the WidgetCategory table. What you need is to reach the Part table by joining the PartCategory table.
And why are you grouping? If you want all the parts, then you can't group, unless you use some specific SQL feature to concatenate all the parts in a single row. This may or may not be possible, depending on which database engine you are using.
I added the DISTINCT just in case there is more than one way to get from Widget X to Part Y; that is enough to remove duplicates. There is no need for a GROUP BY unless you need to COUNT or do something else with the aggregation.
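As a quick check of the join chain Widget, WidgetCategory, PartCategory, Part, the sketch below loads the sample data from the question into an in-memory SQLite database from Python and runs the DISTINCT query. Column names follow the tables in the question (Widget.Id, Part.Id, PartCategory.PartId); the table names without spaces and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Widget (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE WidgetCategory (WidId INTEGER, CatId INTEGER, CatName TEXT);
CREATE TABLE PartCategory (PartId INTEGER, CatId INTEGER);
CREATE TABLE Part (Id INTEGER PRIMARY KEY, Name TEXT);
INSERT INTO Widget VALUES (1,'item01'),(2,'item02'),(3,'item03'),
                          (4,'item04'),(5,'item05'),(6,'item06');
INSERT INTO WidgetCategory VALUES (1,1,'Windows'),(2,1,'Windows'),(3,1,'Windows'),
                                  (1,2,'Door'),(5,2,'Door'),(6,3,'Trunk');
INSERT INTO PartCategory VALUES (1,1),(1,2),(2,2),(4,2);
INSERT INTO Part VALUES (1,'Glass'),(2,'Door Frame'),(3,'Wheel'),(4,'Handle');
""")

widget_parts_sql = """
SELECT DISTINCT w.Id AS WidId, p.Id AS PartId, w.Name
FROM Widget w
JOIN WidgetCategory wc ON wc.WidId = w.Id
JOIN PartCategory pc ON pc.CatId = wc.CatId   -- shared category links widgets to parts
JOIN Part p ON p.Id = pc.PartId
ORDER BY WidId, PartId
"""
widget_parts = conn.execute(widget_parts_sql).fetchall()
for row in widget_parts:
    print(row)
```

Widget 1 reaches part 1 through both of its categories, which is the duplicate that DISTINCT removes.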