IN operator in Apache Pig

Is there an equivalent of the IN operator for Apache Pig? I'm currently using Apache Pig 0.10.0.
I want to do something similar to this:
select count(distinct(o.order_id)),count(od.prod_id),count(od.prod_id)/count(distinct(o.order_id))
from orders o
inner join order_details od
on od.order_id=o.order_id
where o.order_id
in (
select *
from (select o.order_id
from orders o
inner join order_details od
on od.order_id = o.order_id
where (o.order_date between '2013-05-01' and '2013-05-31') and (od.prod_id = 1274348)
) as subq
);

This is probably the equivalent script in Pig. You can create as many interim relations as you like to get just the data you need before generating the counts. Note that I've treated the dates as timestamps; you could instead use the built-in ToDate UDF, which converts UNIX timestamps or chararray dates to native Pig DateTime values (a sketch of that variant follows the script).
-- Load in all of your data
-- Replace with actual paths
-- You may need to supply a delimiter value
o = LOAD 'orders' USING PigStorage() AS (
    order_date:long,
    order_id:chararray
);
od = LOAD 'order_details' USING PigStorage() AS (
    order_id:chararray,
    prod_id:chararray
);
-- Filter like WHERE in SQL
-- Replace 1000 and 2000 with actual timestamps
o_filtered = FILTER o BY order_date <= 2000 AND order_date >= 1000;
od_filtered = FILTER od BY prod_id == '1274348';
-- Inner join - only needed once in Pig
subq = JOIN o_filtered BY order_id, od_filtered BY order_id;
-- Drop fields not needed for final counts
subq_renamed = FOREACH subq GENERATE
    o_filtered::order_id AS order_id,
    od_filtered::prod_id AS prod_id;
-- To do the counts, need to group the data
subq_counts = FOREACH (GROUP subq_renamed ALL) {
    dist_order_id = DISTINCT subq_renamed.order_id;
    GENERATE
        COUNT(dist_order_id) AS dist_order_id_count,
        COUNT(subq_renamed.prod_id) AS prod_id_count;
};
-- Calculate the ratio count(od.prod_id)/count(distinct(o.order_id))
final_counts = FOREACH subq_counts GENERATE *,
(float)prod_id_count/dist_order_id_count AS prod_order_ratio;
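For reference, here is a minimal sketch of the ToDate variant, assuming order_date arrives as a chararray like '2013-05-01'. Note that ToDate and the DateTime type only arrived in Pig 0.11, so on 0.10.0 the numeric-timestamp approach above is the way to go. (Pig 0.12 later also added an IN operator, but it only accepts a literal list in a FILTER, not a subquery, so it wouldn't replace the join here anyway.)
-- Hypothetical variant using ToDate (requires Pig 0.11+)
o = LOAD 'orders' USING PigStorage() AS (
    order_date:chararray,
    order_id:chararray
);
-- Convert the chararray to a native DateTime
o_typed = FOREACH o GENERATE ToDate(order_date, 'yyyy-MM-dd') AS order_date, order_id;
-- DateTime values support the ordinary comparison operators
o_filtered = FILTER o_typed BY order_date >= ToDate('2013-05-01', 'yyyy-MM-dd')
                           AND order_date <= ToDate('2013-05-31', 'yyyy-MM-dd');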

Related

SQL query is loading for a long period, how could it be optimized?

This is the query:
SELECT
[Code]
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY [OrderNo], [ProductNo] ORDER BY [Quantity] DESC) AS [RowNumber],
SUBSTRING(P.[ProductNo], 1, 2) AS [Code]
FROM [LESMESPRD].[FlexNet_prd].[dbo].[ORDER_DETAIL] AS OD
INNER JOIN [LESMESPRD].[FlexNet_prd].[dbo].[WIP_COMPONENT] AS WC ON [WC].[WiporderNo] = OD.[OrderNo]
AND WC.[WipOrderType] = OD.[OrderType]
AND WC.[Active] = 1
INNER JOIN [LESMESPRD].[FlexNet_prd].[dbo].[COMPONENT] AS C ON C.[ID] = WC.[ComponentID]
INNER JOIN [LESMESPRD].[FlexNet_prd].[dbo].[PRODUCT] AS P ON P.[ID] = C.[ProductID]
WHERE SUBSTRING(P.[ProductNo], 1, 2) IN ('43', '72')
) AS OrderBrandComponents
WHERE [RowNumber] = 1
Execution time is 1 minute and 16 seconds; maybe you can help me optimize it somehow? This query is just a small piece of the code, but I found that exactly this part is slowing down the process.
I thought the problem might be in the sub-select where I get my row number, but selecting from these linked-server tables on its own executes in seconds, so I think the problem is with the functions. I hope this query can be optimized.
I believe the delay is because your query is not sargable, due to the SUBSTRING(P.[ProductNo], 1, 2). The engine cannot utilize an index on a function call. But by using the full column with LIKE, matching the first 2 characters plus a wildcard for anything after, you get the same records while remaining able to use an index.
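Side by side, the two predicate forms (the rewritten query below uses the second):
-- Non-sargable: the function call hides ProductNo from any index
WHERE SUBSTRING(P.ProductNo, 1, 2) IN ('43', '72')
-- Sargable: a leading-prefix LIKE can seek an index on ProductNo
WHERE P.ProductNo LIKE '43%' OR P.ProductNo LIKE '72%'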
Now, because you are looking for 2 specific product type codes (43 and 72), I reversed the query to START with that table and then find the orders the products were used in. This may help speed things up, especially if you have 100 orders with these products but 1000s of orders otherwise: you start with a smaller set to begin with.
Also, you don't need all the square brackets everywhere. Typically, those are only needed when a column name collides with a "reserved" keyword, such as naming a column "from" (an obvious keyword in a SQL statement), or with known data types, function names, etc.
Finally, indexes to help optimize this. I would ensure you have indexes on the following tables (a sketch of the definitions follows the list):
table           index
PRODUCT         ( ProductNo, ID )   -- specifically in this order
COMPONENT       ( ProductID, ID )
WIP_COMPONENT   ( ComponentID, Active, WipOrderNo, WipOrderType )
ORDER_DETAIL    ( OrderNo, OrderType )
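A minimal sketch of those definitions, with hypothetical index names; since the tables live behind the LESMESPRD linked server, they would have to be created on that server itself:
CREATE INDEX IX_PRODUCT_ProductNo ON dbo.PRODUCT (ProductNo, ID);
CREATE INDEX IX_COMPONENT_ProductID ON dbo.COMPONENT (ProductID, ID);
CREATE INDEX IX_WIP_COMPONENT_ComponentID ON dbo.WIP_COMPONENT (ComponentID, Active, WipOrderNo, WipOrderType);
CREATE INDEX IX_ORDER_DETAIL_OrderNo ON dbo.ORDER_DETAIL (OrderNo, OrderType);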
SELECT
Code
FROM
(SELECT
ROW_NUMBER() OVER
(PARTITION BY OrderNo, ProductNo
ORDER BY Quantity DESC) AS RowNumber,
SUBSTRING(P.ProductNo, 1, 2) Code
FROM
LESMESPRD.FlexNet_prd.dbo.PRODUCT P
JOIN LESMESPRD.FlexNet_prd.dbo.COMPONENT C
ON P.ID = C.ProductID
JOIN LESMESPRD.FlexNet_prd.dbo.WIP_COMPONENT WC
ON C.ID = WC.ComponentID
AND WC.Active = 1
JOIN LESMESPRD.FlexNet_prd.dbo.ORDER_DETAIL OD
ON WC.WiporderNo = OD.OrderNo
AND WC.WipOrderType = OD.OrderType
WHERE
P.ProductNo like '43%'
OR P.ProductNo like '72%' ) AS OrderBrandComponents
WHERE
OrderBrandComponents.RowNumber = 1

Converting Nested SQL to ORM in Django

I have a query like this:
SELECT
*,
(
SELECT
COALESCE(json_agg(product_attribute), '[]')
FROM
(
SELECT
*
FROM
optimus_productattribute as product_attribute
WHERE
product.id = product_attribute.product_id
)
AS product_attribute
)
AS product_attribute
FROM
optimus_product as product inner join optimus_store as store on product.store_id = store.id
and I want to convert it to the ORM.
I have tried JSONBAgg, but it says it can only be used with a single column:
Product.objects.filter(store_id=787).annotate(
    attributes=Coalesce(
        JSONBAgg(ProductAttribute.objects.filter(product=OuterRef('pk')).values_list('id', 'uuid')),
        []
    )
)
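One hedged way around the single-column limit (an untested sketch, assuming PostgreSQL and Django 4.0+): build one JSON object per attribute row with JSONObject, then aggregate the correlated rows with ArraySubquery instead of JSONBAgg:
from django.contrib.postgres.expressions import ArraySubquery
from django.db.models import F, OuterRef
from django.db.models.functions import JSONObject

# One JSON object per related ProductAttribute row
attrs = ProductAttribute.objects.filter(product=OuterRef('pk')).values(
    json=JSONObject(id=F('id'), uuid=F('uuid'))
)
# Correlated array aggregation; ARRAY(...) yields [] when there are no rows,
# so no Coalesce is needed
products = Product.objects.filter(store_id=787).annotate(attributes=ArraySubquery(attrs))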

Adding value of COUNT to an already existing value

I have a table for managing inventory movements and a table to manage stock.
I want to add, to the stock table, the count of movements in which each item appears between the given dates.
My update statement looks like this:
UPDATE inventory
SET quantity = quantity + 1
WHERE ItemID IN
( SELECT ItemID FROM movements
WHERE group = '3' AND store = '500'
AND DateMove BETWEEN 20201219 AND 20201223 )
AND StoreNumber = '500'
How can I change this query to add the number of times the ItemID appears in movements to the quantity in the inventory table?
I've been thinking that I could add a count(itemID) and a group by, give it an alias in the subquery, and use that alias after the +, but it doesn't seem to work.
Thanks for any help
Using an UPDATE, you can use a correlated subquery:
UPDATE inventory i
SET quantity = (i.quantity +
(SELECT COUNT(*)
FROM movements m
WHERE m.ItemId = i.ItemId AND
m.group = 3 AND m.store = i.store AND
m.DateMove BETWEEN 20201219 AND 20201223
)
)
WHERE i.store = 500 AND
EXISTS (SELECT 1
FROM movements m
WHERE m.ItemId = i.ItemId AND
m.group = 3 AND m.store = 500 AND
m.DateMove BETWEEN 20201219 AND 20201223
);
Note that I removed the single quotes around 500 and 3. These values look like numbers. Only use the single quotes if they are strings.
Oracle also allows you to update using a subquery under some circumstances, so this should work as well:
update (select i.*,
(SELECT COUNT(*)
FROM movements m
WHERE m.ItemId = i.ItemId AND
m.group = 3 AND m.store = i.store AND
m.DateMove BETWEEN 20201219 AND 20201223
) as inc_quantity
from inventory i
where store = 500
)
set quantity = quantity + inc_quantity
where quantity > 0;
You appear to want a MERGE statement and to COUNT the movements:
MERGE INTO inventory dst
USING (
SELECT ItemID,
store,
COUNT(*) AS num_movements
FROM movements
WHERE group = 3
AND store = 500
AND DateMove BETWEEN DATE '2020-12-19' AND DATE '2020-12-23'
GROUP BY
ItemId,
store
) src
ON ( src.ItemId = dst.ItemId AND dst.StoreNumber = src.store )
WHEN MATCHED THEN
UPDATE
SET quantity = dst.quantity + src.num_movements;
(Also, if values are numbers then use number literals and not string literals and, for dates, use date literals and not numbers.)
You need a correlated subquery. For brevity, I've omitted all other where conditions.
UPDATE inventory AS inv
SET quantity = quantity + (SELECT COUNT(*) FROM movements AS mov WHERE mov.itemID = inv.itemID);
I think it's better to create a view such as
CREATE OR REPLACE VIEW v_inventory AS
SELECT i.*,
SUM(CASE
WHEN i.StoreNumber = 500 AND m."group" = 3 AND m.store = 500 AND
m.DateMove BETWEEN date'2020-12-19' AND date'2020-12-23'
THEN
1
ELSE
0
END) OVER (PARTITION BY m.ItemID) AS mov_quantity
FROM inventory i
LEFT JOIN movements m
ON m.ItemID = i.ItemID
rather than applying DML, for the sake of good DB design: since the desired count can already be calculated on demand, physically storing it may lead to confusion later.

SQL Server - Code for Updating values using inner query self join

This may be quite basic for most, but I'd like to get some help.
Using SQL Server I have the following Orders table (Excel excerpt to simplify):
Please note there are multiple orders (OrderID). Some may have a "PrimaryOrder" value, meaning they are related to an existing prior order. Related orders receive the "PrimaryOrder" of the 1st related order, and an "OrderIndex" noting the order in which they came in.
Only the first order in each set has a Value. If an order's "PrimaryOrder" is NULL, it means it is a single order and I should simply ignore it.
What I need is, using a SQL Server UPDATE command, to give all related orders the same "Value" as their 1st related order's "Value".
Meaning: for each order that has "OrderIndex" > 1, update its Value field from NULL to its primary order's "Value".
If "OrderIndex" = 1 OR "PrimaryOrder" is NULL, ignore and don't update.
I tried some simple INNER JOINs but got lost.
I don't think it should be too complicated, but I might be overthinking it.
Thank you!
You can use a correlated subquery with an update statement:
update o
set o.value = (select top (1) o1.value
from Orders o1
where o1.primaryorder = o.primaryorder and
o1.value is not null and
o1.orderindex <= o.orderindex
order by o1.orderindex desc
)
from Orders o
where o.value is null;
Perhaps something like this:
UPDATE a
SET a.Value = b.Value
FROM #table a INNER JOIN #table b
    ON a.PrimaryOrder = b.OrderID AND a.OrderIndex > 1 AND a.Value IS NULL
UPDATE o
SET Value=MaxVals.MaxValue
FROM Orders o
INNER JOIN (
SELECT MAX(Value) AS MaxValue, PrimaryOrder
FROM Orders
WHERE PrimaryOrder IS NOT NULL
GROUP BY PrimaryOrder
) AS MaxVals ON MaxVals.PrimaryOrder=o.PrimaryOrder
WHERE o.Value IS NULL
Thank you all.
Managed to take something from all the above and this solved it:
UPDATE O
SET O.[Value] = B.[Value]
FROM Orders O INNER JOIN Orders B
ON O.PrimaryOrder = B.[PrimaryOrder] and O.OrderIndex > 1 and O.[Value] is NULL
AND B.[OrderIndex] = 1
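As a quick sanity check after running the update (a hypothetical query reusing the same column names), any related non-primary order still showing NULL was missed:
SELECT O.OrderID
FROM Orders O
WHERE O.PrimaryOrder IS NOT NULL
  AND O.OrderIndex > 1
  AND O.[Value] IS NULL;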

Inserting into Bigquery table with array of arrays

How to insert a record into a BigQuery table with nested arrays 2 levels deep.
The ORDER table has an array ORDER_DETAIL, which in turn has an array ORDER_DISCOUNTS.
The below is not working:
INSERT INTO ORDER (ORDER_ID, OrderDetail)
SELECT OH.ORDER_ID, ARRAY_AGG(struct(OD.line_id, OD.item_id, ARRAY_AGG(struct(ODIS.discounttype)) )
FROM ORDER_HEADER OH LEFT JOIN ORDER_DETAIL OD, ORDER_DISCOUNTS ODIS
ON OH.ORDER_ID = OD.ORDER_ID AND ODIS.ORDER_ID = OD.ORDER_ID and ODIS.LINE_ID = OD.LINE_ID
WHERE OH.ORDER_ID = 'ABCD'
I can't see the GROUP BYs in the sample query in the question. Reproducing here with public data to show how to make arrays of arrays in BigQuery:
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
JOIN UNNEST(['Andy_K%','Boys%','Catri%']) start
ON title LIKE start
WHERE DATE(datehour) = "2019-09-01"
AND wiki='en'
)
SELECT start, ARRAY_AGG(STRUCT(title, views) LIMIT 10) title_views
FROM (
SELECT start, title, ARRAY_AGG(STRUCT(datehour,views) LIMIT 3) views
FROM data
GROUP BY start, title
)
GROUP BY start
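Mapped back to the tables in the question, an untested sketch might look like the following (column names assumed from the snippet; note that ORDER is a reserved word in BigQuery, so the table name needs backticks). Aggregate discounts per line first, then aggregate lines per order:
INSERT INTO `ORDER` (ORDER_ID, OrderDetail)
SELECT OH.ORDER_ID,
       ARRAY_AGG(STRUCT(D.line_id, D.item_id, D.discounts)) AS OrderDetail
FROM ORDER_HEADER OH
JOIN (
  -- Inner level: one array of discounts per order line
  -- (a line with no discounts yields a single struct with a NULL discounttype)
  SELECT OD.ORDER_ID, OD.line_id, OD.item_id,
         ARRAY_AGG(STRUCT(ODIS.discounttype)) AS discounts
  FROM ORDER_DETAIL OD
  LEFT JOIN ORDER_DISCOUNTS ODIS
         ON ODIS.ORDER_ID = OD.ORDER_ID
        AND ODIS.LINE_ID = OD.LINE_ID
  GROUP BY OD.ORDER_ID, OD.line_id, OD.item_id
) D
  ON D.ORDER_ID = OH.ORDER_ID
WHERE OH.ORDER_ID = 'ABCD'
GROUP BY OH.ORDER_ID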