How to dynamically join tables in BigQuery to avoid duplication of common columns

I have 2 tables with a large number of columns (each has around 700-800 columns, which makes it infeasible to write out all the column names individually). Both tables have a few common columns. I need to dynamically combine the two tables so that the common columns are not repeated and appear only once in the final table. For example:
TABLE 1:
+---------+--------+------+-------+
|firstname|lastname|upload|product|
+---------+--------+------+-------+
| alice| a| 100|apple |
| bob| b| 23|orange |
+---------+--------+------+-------+
TABLE 2:
+---------+--------+------+-------+
|firstname|lastname|books |active |
+---------+--------+------+-------+
| alice| a| 10 |yes |
| bob| b| 2 |no |
+---------+--------+------+-------+
FINAL TABLE:
+---------+--------+------+-------+-----+------+
|firstname|lastname|upload|product|books|active|
+---------+--------+------+-------+-----+------+
| alice| a| 100|apple | 10 | yes |
| bob| b| 23|orange | 2 | no |
+---------+--------+------+-------+-----+------+

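Just to give you a direction to look into:
select *
from table1
join table2
using(firstname, lastname)
Applied to the sample data in your question, the output contains the common columns (firstname, lastname) only once, followed by the remaining columns of both tables.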
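The effect of USING on `select *` can be checked on the sample data with a small sketch. SQLite (through Python's `sqlite3` module) is used here only as a stand-in engine, which is an assumption of this example; the question itself is about BigQuery.

```python
import sqlite3

# SQLite is only a stand-in engine here; the question targets BigQuery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (firstname TEXT, lastname TEXT, upload INTEGER, product TEXT);
CREATE TABLE table2 (firstname TEXT, lastname TEXT, books INTEGER, active TEXT);
INSERT INTO table1 VALUES ('alice', 'a', 100, 'apple'), ('bob', 'b', 23, 'orange');
INSERT INTO table2 VALUES ('alice', 'a', 10, 'yes'), ('bob', 'b', 2, 'no');
""")

# USING merges each pair of join columns, so SELECT * lists them once.
cursor = conn.execute("""
SELECT *
FROM table1
JOIN table2 USING (firstname, lastname)
ORDER BY firstname
""")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

No column names are written out by hand, which is the point when each table has hundreds of columns.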

Related

How to merge rows in hive?

I have a production table in Hive which gets incremental data (changed records and new records) from an external source on a daily basis. The values of a row may be spread across different dates. For example, this is how the records in the table look on the first day:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| |
| 3| | b3|
+---+----+----+
On the second day, we get the following:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 4| a4| |
| 2| | b2 |
| 3| a3| |
+---+----+----+
which has a new record as well as changed records.
The result I want to achieve is a merge of rows based on the primary key (id in this case), producing this output:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2 |
| 3| a3| b3|
| 4| a4| |
+---+----+----+
The number of columns is pretty large, typically in the range of 100-150. The aim is to provide the latest full view of all the data received so far. How can I do this within Hive itself?
(PS: it doesn't have to be sorted.)
This can be achieved using COALESCE and a full outer join.
SELECT COALESCE(a.id, b.id) AS id,
COALESCE(a.col1, b.col1) AS col1,
COALESCE(a.col2, b.col2) AS col2
FROM table1 a
FULL OUTER JOIN table2 b
ON a.id = b.id
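A rough sketch of the COALESCE merge on the sample rows follows. Several things here are assumptions of the example, not part of the answer: SQLite stands in for Hive, the tables are named `table1` (first day) and `table2` (second day), empty cells are stored as NULL, and the full outer join is emulated with a LEFT JOIN plus UNION ALL because not every SQLite version supports FULL OUTER JOIN. In Hive the full outer join can be written directly.

```python
import sqlite3

# SQLite stands in for Hive; empty cells are assumed to be NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, col1 TEXT, col2 TEXT);  -- first day
CREATE TABLE table2 (id INTEGER, col1 TEXT, col2 TEXT);  -- second day
INSERT INTO table1 VALUES (1, 'a1', 'b1'), (2, 'a2', NULL), (3, NULL, 'b3');
INSERT INTO table2 VALUES (4, 'a4', NULL), (2, NULL, 'b2'), (3, 'a3', NULL);
""")

# Emulated full outer join: all rows of table1 (merged with matches from
# table2), plus the rows of table2 that have no match in table1.
query = """
SELECT a.id AS id,
       COALESCE(a.col1, b.col1) AS col1,
       COALESCE(a.col2, b.col2) AS col2
FROM table1 a
LEFT JOIN table2 b ON a.id = b.id
UNION ALL
SELECT b.id, b.col1, b.col2
FROM table2 b
WHERE NOT EXISTS (SELECT 1 FROM table1 a WHERE a.id = b.id)
"""
merged = sorted(conn.execute(query).fetchall())

for row in merged:
    print(row)
```

COALESCE returns its first non-NULL argument, so when both tables hold a value for the same cell, the argument order decides which one wins. To prefer the newer data, the newer table's column would go first.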

SQL: Merge localized version of a table to the main one

Imagine I have a main table like:
Table guys
|id| name|profession|
|--|------|----------|
| 1| John| developer|
| 2| Mike| boss|
| 3| Roger| fireman|
| 4| Bob| policeman|
I also have a localized version which is not complete (the boss is missing):
Table guys_bg
|id| name | profession|
|--|------|-----------|
| 1| Джон|разработчик|
| 3|Роджър| пожарникар|
| 4| Боб| полицай|
I want to prioritize guys_bg results while still showing all the guys (The boss is still a guy, right?).
This is the desired result:
|id| name | profession|
|--|------|-----------|
| 1| Джон|разработчик|
| 2| Mike| boss|
| 3|Роджър| пожарникар|
| 4| Боб| полицай|
Take into consideration that both tables may have a lot of (100+) columns so joining the tables and using CASE for every column will be very tedious.
What are my options?
Here is one way using union all:
select gb.*
from guys_bg gb
union all
select g.*
from guys g
where not exists (select 1 from guys_bg gb where gb.id = g.id);
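As a quick check, the UNION ALL query above can be run against the sample data. This is only a sketch: SQLite is used as a stand-in engine (an assumption, since the question does not name one), and the rows are sorted in Python for display.

```python
import sqlite3

# SQLite is an assumed stand-in engine for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guys (id INTEGER, name TEXT, profession TEXT);
CREATE TABLE guys_bg (id INTEGER, name TEXT, profession TEXT);
INSERT INTO guys VALUES
  (1, 'John', 'developer'), (2, 'Mike', 'boss'),
  (3, 'Roger', 'fireman'), (4, 'Bob', 'policeman');
INSERT INTO guys_bg VALUES
  (1, 'Джон', 'разработчик'), (3, 'Роджър', 'пожарникар'), (4, 'Боб', 'полицай');
""")

# Localized rows first, then only those main rows with no localized version.
query = """
select gb.*
from guys_bg gb
union all
select g.*
from guys g
where not exists (select 1 from guys_bg gb where gb.id = g.id)
"""
result = sorted(conn.execute(query).fetchall())

for row in result:
    print(row)
```

Because both branches use `*`, no column has to be listed, which fits the 100+ column case.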
You can also do it using a FULL JOIN.
SELECT
ISNULL(b.id,g.id) id
, ISNULL(b.name, g.name) name
, ISNULL(b.profession, g.profession) profession
FROM
guys g
FULL JOIN guys_bg b ON g.id = b.id

Translation of a SQL Query Into DAX to create a Calculated Column in PowerPivot

Hi, I am building a PowerPivot data model using a "Person" table which has the columns "Name" and "Amount".
Table - Person
|Name | Amount|
|Red | 10|
|Blue | 10|
|Red | 16|
|Blue | 82|
|Red | 82|
|Red | 54|
|Red | 61|
|Blue | 82|
|Blue | 82|
The expected output is:
| Name | Amount | Count(Specific_Amount) |
| Red |10 | 2 |
| Blue | 10 | 1 |
|Red | 16 | 1|
|Blue | 82 | 3|
|Red |82 | 1|
|Red | 54 | 1|
|Red | 61 | 1|
What I have tried so far is:
select Name, Amount, count(Amount) as CountOfAmountRepeated
from Person
group by Name, Amount
order by Amount;
I have imported my table "Person" into PowerPivot in Excel.
I want to create a calculated column in PowerPivot in Excel that holds the count of repeated Amount values. I was able to do this in SQL using the above query, but I want an equivalent DAX expression for creating the new column in PowerPivot.
Can someone translate this query into DAX, or suggest a tool that translates SQL into DAX, so that I can create a calculated column and use Power View to prepare a histogram of this data?
I tried googling but didn't find much help. Thanks in advance.
There are a lot of facets of your question that need to be addressed, but very simply (without consideration of any other requirements) the calculation is:
Count(Specific_Amount):=COUNTROWS('Person')
All you seem to be looking to do here is count the instances of each unique combination.
If you then created a pivot table, dragging [Name] and [Amount] into the rows and [Count(Specific_Amount)] into the values, you would have the answer you are looking for. To get the layout you want, you could change the layout to tabular form and remove the subtotals.
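For readers more familiar with SQL, the number such a measure shows in each pivot row (with Name and Amount on rows) corresponds to a row count per (Name, Amount) group. The sketch below illustrates that with the nine sample rows as listed in the question; it uses SQLite through Python as an assumed stand-in and is not DAX.

```python
import sqlite3

# SQLite via Python is an assumed stand-in; this illustrates the grouping only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Name TEXT, Amount INTEGER)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [("Red", 10), ("Blue", 10), ("Red", 16), ("Blue", 82), ("Red", 82),
     ("Red", 54), ("Red", 61), ("Blue", 82), ("Blue", 82)],
)

# One count per (Name, Amount) combination, like one pivot row each.
counts = {
    (name, amount): n
    for name, amount, n in conn.execute(
        "SELECT Name, Amount, COUNT(*) FROM Person GROUP BY Name, Amount"
    )
}

for key in sorted(counts):
    print(key, counts[key])
```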

Finding the intersection of tables that use a many-to-many relationship in SQL

I need to get an intersection of two tables that are related to each other through two many-to-many tables. Example tables as follows:
**Discount**
|DisId| Discount|Amount|
+-----+---------+------+
|    1|   2% Off|  0.02|
|    2|  10% Off|  0.10|
|    3|  25% Off|  0.25|
|    4|  2 for 1|  0.50|
|    5|Clearance|  0.75|
**DiscountRef**
|DisId|RefType|RefId|IsActive|
+-----+-------+-----+--------+
|    1|Product| 9004|       0|
|    2|Product| 9002|       0|
|    2|   PCat| 3456|       1|
|    3|   PCat| 7346|       1|
|    3|   PCat| 4455|       1|
|    5|Product| 9004|       0|
**ProductCat**
|ProdId|CatId|
+------+-----+
|  9001| 3456|
|  9002| 3456|
|  9005| 3456|
|  9001| 7346|
|  9003| 7346|
|  9003| 4455|
|  9006| 4455|
**Product**
|ProdId|      ProdName|ProdPrice|
+------+--------------+---------+
|  9001|       9" Nail|     0.50|
|  9002|    2"x4" Stud|     2.50|
|  9003|   Claw Hammer|     5.99|
|  9004|     Wood Glue|     1.20|
|  9005|6'x4' Dry Wall|    10.39|
|  9006|   Screwdriver|     4.25|
With these tables I need to get the intersection of product categories if they're under the same discount Id. The table below is what I need to get:
|DisId|ProdId|DisPrice|
+-----+------+--------+
| 2| 9001| 0.45|
| 2| 9002| 2.25|
| 2| 9005| 9.36|
| 3| 9003| 4.50|
I have tried a few different ways but can't seem to get to that table. The SQL below returns the discounts that have more than one category applied to them.
SELECT DR.DisId, PC.CatId
FROM DiscountRef DR
INNER JOIN (
SELECT DisId
FROM DiscountRef
GROUP BY DisId
HAVING COUNT(DisId) > 1
) SDR ON SDR.DisId = DR.DisId
INNER JOIN ProductCat PC ON PC.CatId = DR.RefId AND DR.RefType = 'PCat'
GROUP BY DR.DisId, PC.CatId
Table Returned:
|DisId|CatId|
+-----+-----+
| 3| 7346|
| 3| 4455|
Then, using the product category Ids with an INTERSECT over the Product table, I get the correct product Ids.
SELECT P1.ProdId
FROM Product P1
INNER JOIN ProductCat PC1 ON PC1.ProdId = P1.ProdId AND PC1.CatId = 7346
INTERSECT
SELECT P2.ProdId
FROM Product P2
INNER JOIN ProductCat PC2 ON PC2.ProdId = P2.ProdId AND PC2.CatId = 4455
Also, a discount can have more than two categories (which narrows down the number of products), and sometimes there's more than one discount active (discount data is omitted here, but a check will be done).
Any help on how I can get my desired table above?
EDIT: If there are multiple rows for a DisId in the DiscountRef table and they are of the PCat type, the discount applies to the products that are shared by all of those categories, like how Claw Hammer is the only item that appears in both CatId 7346 AND CatId 4455.
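One possible way to generalize the INTERSECT idea to any number of categories per discount is to group by discount and product and keep only the products that appear in every PCat category of that discount. The sketch below is not from the original post. It assumes SQLite as a stand-in engine, treats `IsActive = 1` as the filter for category references, and leaves out DisPrice because the rounding rule is not stated in the question.

```python
import sqlite3

# Assumptions: SQLite stand-in, IsActive = 1 filter, DisPrice omitted.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DiscountRef (DisId INTEGER, RefType TEXT, RefId INTEGER, IsActive INTEGER);
CREATE TABLE ProductCat (ProdId INTEGER, CatId INTEGER);
INSERT INTO DiscountRef VALUES
  (1, 'Product', 9004, 0), (2, 'Product', 9002, 0), (2, 'PCat', 3456, 1),
  (3, 'PCat', 7346, 1), (3, 'PCat', 4455, 1), (5, 'Product', 9004, 0);
INSERT INTO ProductCat VALUES
  (9001, 3456), (9002, 3456), (9005, 3456), (9001, 7346),
  (9003, 7346), (9003, 4455), (9006, 4455);
""")

# A product qualifies for a discount when the number of the discount's
# categories it belongs to equals the discount's total category count.
query = """
SELECT DR.DisId, PC.ProdId
FROM DiscountRef DR
INNER JOIN ProductCat PC ON PC.CatId = DR.RefId
WHERE DR.RefType = 'PCat' AND DR.IsActive = 1
GROUP BY DR.DisId, PC.ProdId
HAVING COUNT(DISTINCT PC.CatId) = (
    SELECT COUNT(DISTINCT D2.RefId)
    FROM DiscountRef D2
    WHERE D2.DisId = DR.DisId AND D2.RefType = 'PCat' AND D2.IsActive = 1
)
ORDER BY DR.DisId, PC.ProdId
"""
discounted = conn.execute(query).fetchall()

for row in discounted:
    print(row)
```

On the sample data this yields the same DisId and ProdId pairs as the desired table; joining Product and Discount to compute the price would be a further step.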

Retrieving rows that share multiple ID's in SQL

I am stuck on how to narrow down a selection of rows that are related by multiple ID's. Here is my problem with the data as follows:
Widget
+--+------+
|Id|Name  |
+--+------+
| 1|item01|
| 2|item02|
| 3|item03|
| 4|item04|
| 5|item05|
| 6|item06|
+--+------+
Widget Category
+-----+-----+-------+
|WidId|CatId|CatName|
+-----+-----+-------+
|    1|    1|Windows|
|    2|    1|Windows|
|    3|    1|Windows|
|    1|    2|Door   |
|    5|    2|Door   |
|    6|    3|Trunk  |
+-----+-----+-------+
Part Category
+------+-----+
|PartId|CatId|
+------+-----+
|     1|    1|
|     1|    2|
|     2|    2|
|     4|    2|
+------+-----+
Part
+--+----------+
|Id|Name      |
+--+----------+
| 1|Glass     |
| 2|Door Frame|
| 3|Wheel     |
| 4|Handle    |
+--+----------+
One or more widgets can be in a widget category. Many widget categories can have many part categories. Many parts can be part of many part categories. I need to know which parts are linked to which widgets. So we know that Item01 has the part "Glass" and Item05 has the parts "Glass, Door Frame, and Handle".
Here is the SQL I have so far, but I need it to be dynamic so it can run once a week in a stored procedure.
---- This gives me the correct number of widgets to parts based on a set of 2 category IDs, as a quick and static hack
SELECT W.Id
FROM Widget W
INNER JOIN dbo.[WidgetCategory] WC1 ON WC1.WidId = W.Id
INNER JOIN dbo.[WidgetCategory] WC2 ON WC2.WidId = W.Id
WHERE WC1.CatId = 1 AND WC2.CatId = 2
GROUP BY W.Id
The reason for the above query is to get a table that maps PartIds to WidgetIds, both as an intersection of the two related categories and for all the widgets that are related to parts. The table below is what I am trying to get, so that I can aggregate how many widgets are in a part (COUNT(WidId) GROUP BY PartId):
|WidId|PartId|WidgetName|
+-----+------+----------+
| 1| 1| Item01|
| 2| 1| Item02|
| 3| 1| Item03|
| 1| 2| Item01|
| 5| 2| Item05|
Updated question: How can I get this result from the tables above, returning only the intersection of the two categories?
|WidId|PartId|WidgetName|
+-----+------+----------+
| 1| 1| Item01|
| 1| 2| Item01|
Any help would be greatly appreciated! Sorry for the sloppiness, I had to post quickly before I left for the weekend.
EDIT: Sorry about the ProductId; it was left over from some SQL that I was using. It should be Widget Id. Added more clarity to the problem and added an additional problem I was having.
I think you need a query like this.
SELECT DISTINCT w.Id AS WidId, p.Id AS PartId, w.Name
FROM Widget w
JOIN WidgetCategory wc ON wc.WidId = w.Id
JOIN PartCategory pc ON pc.CatId = wc.CatId
JOIN Part p ON p.Id = pc.PartId
I don't see why you would need to join twice on the WidgetCategory table. What you need is to reach the Part table by joining the PartCategory table.
And why are you grouping? If you want all the parts, then you can't group, unless you use some specific SQL feature to concatenate all the parts in a single row. This may or may not be possible, depending on which database engine you are using.
I added the DISTINCT just in case there is more than one way to get from Widget X to Part Y; that is enough to remove duplicates. There is no need for a GROUP BY unless you need to COUNT or do something else with the aggregation.
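A minimal sketch of this join on the question's sample data, using the column names from the question's tables (Widget.Id, WidgetCategory.WidId and CatId, PartCategory.PartId and CatId, Part.Id). SQLite through Python is an assumed stand-in for the asker's database engine, and the table names without spaces are also an assumption.

```python
import sqlite3

# SQLite and the space-free table names are assumptions of this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Widget (Id INTEGER, Name TEXT);
CREATE TABLE WidgetCategory (WidId INTEGER, CatId INTEGER, CatName TEXT);
CREATE TABLE PartCategory (PartId INTEGER, CatId INTEGER);
CREATE TABLE Part (Id INTEGER, Name TEXT);
INSERT INTO Widget VALUES
  (1, 'item01'), (2, 'item02'), (3, 'item03'),
  (4, 'item04'), (5, 'item05'), (6, 'item06');
INSERT INTO WidgetCategory VALUES
  (1, 1, 'Windows'), (2, 1, 'Windows'), (3, 1, 'Windows'),
  (1, 2, 'Door'), (5, 2, 'Door'), (6, 3, 'Trunk');
INSERT INTO PartCategory VALUES (1, 1), (1, 2), (2, 2), (4, 2);
INSERT INTO Part VALUES (1, 'Glass'), (2, 'Door Frame'), (3, 'Wheel'), (4, 'Handle');
""")

# Walk Widget -> WidgetCategory -> PartCategory -> Part; DISTINCT removes
# pairs reached through more than one category.
query = """
SELECT DISTINCT w.Id AS WidId, p.Id AS PartId, w.Name
FROM Widget w
JOIN WidgetCategory wc ON wc.WidId = w.Id
JOIN PartCategory pc ON pc.CatId = wc.CatId
JOIN Part p ON p.Id = pc.PartId
ORDER BY w.Id, p.Id
"""
links = conn.execute(query).fetchall()

for row in links:
    print(row)
```

Widget 1 reaches part 1 through both categories, which is the kind of duplicate the DISTINCT is there to remove.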