Suggest most optimized way using hive or pig - hive

Problem Statement
Assume there is one text file of logs. Below are the fields in the file.
Log File
userID
productID
action
Where Action would be one of these –
Browse, Click, AddToCart, Purchase, LogOut
Select users who performed AddToCart action but did not perform Purchase action.
('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')
Can anyone suggest how to get this information using Hive or Pig with optimised performance?

This can be done in a single table scan using sum() or analytic sum(), depending on the exact requirements. For example, what if a user added two products to the cart but purchased only one?
For User+Product:
select userID, productID
from
(
select
userID,
productID,
sum(case when action='AddToCart' then 1 else 0 end) addToCart_cnt,
sum(case when action='Purchase' then 1 else 0 end) Purchase_cnt
from table
group by userID, productID
)s
where addToCart_cnt>0 and Purchase_cnt=0
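To check the logic of the query above, here is a minimal sketch that runs the same conditional aggregation against an in-memory SQLite database from Python. SQLite only stands in for Hive here; the table name `logs`, the three-column layout, and the sample rows (including a user who adds two products but buys only one) are assumptions for illustration.

```python
import sqlite3

# Assumption: a three-column table named "logs"; SQLite stands in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (userID TEXT, productID TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("1001", "201", "Browse"),
        ("1001", "201", "AddToCart"),
        ("1001", "201", "Purchase"),
        ("1001", "203", "AddToCart"),  # second product, never purchased
        ("1002", "202", "Click"),
        ("1002", "202", "AddToCart"),
    ],
)

# One pass over the table: count both actions per user and product.
QUERY = """
SELECT userID, productID
FROM (
    SELECT userID,
           productID,
           SUM(CASE WHEN action = 'AddToCart' THEN 1 ELSE 0 END) AS addToCart_cnt,
           SUM(CASE WHEN action = 'Purchase'  THEN 1 ELSE 0 END) AS purchase_cnt
    FROM logs
    GROUP BY userID, productID
) s
WHERE addToCart_cnt > 0 AND purchase_cnt = 0
ORDER BY userID, productID
"""


def carts_without_purchase(connection):
    """Return (userID, productID) pairs that were added to cart but not purchased."""
    return connection.execute(QUERY).fetchall()


print(carts_without_purchase(conn))
```

Grouping by user and product keeps user 1001 in the result for product 203 even though that user purchased product 201.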

Hive: Use not in
select * from table
where action='AddToCart' and
userID not in (select distinct userID from table where action='Purchase')
Pig: Filter the ids by action, do a left outer join, and keep the rows where the joined id is null
-- Note: I am assuming the first 3 columns are int. You will have to figure out the loading without the quotes.
A = LOAD '\path\file.txt' USING PigStorage(',') AS (userID:int,b:int,c:int,action:chararray);
B = FILTER A BY (action == 'AddToCart');
C = FILTER A BY (action == 'Purchase');
D = JOIN B BY userID LEFT OUTER, C BY userID;
E = FILTER D BY C::userID is null;
DUMP E;
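The left-join-and-check-for-null pattern used above is an anti-join at the user level. A minimal sketch of the same idea, run on SQLite from Python purely for illustration (the table name `logs`, its columns, and the sample rows are assumptions):

```python
import sqlite3

# Assumption: hypothetical "logs" table; SQLite stands in for Hive/Pig.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (userID TEXT, productID TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("1001", "201", "AddToCart"),
        ("1001", "201", "Purchase"),
        ("1001", "203", "AddToCart"),
        ("1002", "202", "AddToCart"),
    ],
)

# Left join cart rows to the set of purchasing users, keep the unmatched ones.
ANTI_JOIN = """
SELECT DISTINCT b.userID
FROM logs b
LEFT JOIN (SELECT DISTINCT userID FROM logs WHERE action = 'Purchase') c
       ON b.userID = c.userID
WHERE b.action = 'AddToCart'
  AND c.userID IS NULL
ORDER BY b.userID
"""


def users_without_any_purchase(connection):
    """Return users who added to cart and never purchased anything at all."""
    return connection.execute(ANTI_JOIN).fetchall()


print(users_without_any_purchase(conn))
```

Note that this version works per user: user 1001 is dropped because of one purchase, even though product 203 was left in the cart.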

Related

SQL SELECT filtering out combinations where another column contains empty cells, then returning records based on max date

I have run into an issue I don't know how to solve. I'm working with a MS Access DB.
I have this data:
I want to write a SELECT statement, that gives the following result:
For each combination of Project and Invoice, I want to return the record containing the maximum date, conditional on all records for that combination of Project and Invoice being Signed (i.e. Signed or Date column not empty).
In my head, first I would sort the irrelevant records out, and then return the max date for the remaining records. I'm stuck on the first part.
Could anyone point me in the right direction?
Thanks,
Hulu
Start with an initial query which fetches the combinations of Project, Invoice, Date from the rows you want returned by your final query.
SELECT
y0.Project,
y0.Invoice,
Max(y0.Date) AS MaxOfDate
FROM YourTable AS y0
GROUP BY y0.Project, y0.Invoice
HAVING Sum(IIf(y0.Signed Is Null,1,0))=0;
The HAVING clause discards any Project/Invoice groups which include a row with a Null in the Signed column.
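A minimal sketch of that HAVING filter, run on SQLite from Python as a stand-in for Access. `CASE` replaces `IIf`, and the table name `invoices`, the column `InvDate`, and the rows are hypothetical:

```python
import sqlite3

# Assumption: hypothetical table and rows; SQLite stands in for MS Access.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (Project TEXT, Invoice INTEGER, Signed TEXT, InvDate TEXT)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("P1", 1, "X", "2022-01-01"),
        ("P1", 1, "X", "2022-01-05"),
        ("P1", 2, "X", "2022-01-02"),
        ("P1", 2, None, "2022-01-09"),  # unsigned row disqualifies group P1/2
    ],
)

# Count NULL Signed values per group and keep only groups where the count is 0.
QUERY = """
SELECT Project, Invoice, MAX(InvDate) AS MaxOfDate
FROM invoices
GROUP BY Project, Invoice
HAVING SUM(CASE WHEN Signed IS NULL THEN 1 ELSE 0 END) = 0
ORDER BY Project, Invoice
"""


def fully_signed_groups(connection):
    """Return the latest date for each Project/Invoice group with no unsigned rows."""
    return connection.execute(QUERY).fetchall()


print(fully_signed_groups(conn))
```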
If you save that query as qryTargetRows, you can then join it back to your original table to select the matching rows.
SELECT
y1.Project,
y1.Invoice,
y1.Desc,
y1.Value,
y1.Signed,
y1.Date
FROM
YourTable AS y1
INNER JOIN qryTargetRows AS sub
ON (y1.Project = sub.Project)
AND (y1.Invoice = sub.Invoice)
AND (y1.Date = sub.MaxOfDate);
Or you can do it without the saved query by directly including its SQL as a subquery.
SELECT
y1.Project,
y1.Invoice,
y1.Desc,
y1.Value,
y1.Signed,
y1.Date
FROM
YourTable AS y1
INNER JOIN
(
SELECT y0.Project, y0.Invoice, Max(y0.Date) AS MaxOfDate
FROM YourTable AS y0
GROUP BY y0.Project, y0.Invoice
HAVING Sum(IIf(y0.Signed Is Null,1,0))=0
) AS sub
ON (y1.Project = sub.Project)
AND (y1.Invoice = sub.Invoice)
AND (y1.Date = sub.MaxOfDate);
Write a SQL query, which should be possible in MS Access too, like this:
SELECT
Project,
Invoice,
MIN([Desc]) Descriptions,
SUM(Value) Value,
MIN(Signed) Signed,
MAX([Date]) "Date"
FROM data
WHERE Signed<>'' AND [Date]<>''
GROUP BY
Project,
Invoice
output:
Project  Invoice  Descriptions  Value  Signed  Date
-----------------------------------------------------
A        1        Ball          100    J.D.    2022-09-20
B        1        Sofa          300    J.D.    2022-09-22
B        2        Desk          100    J.D.    2022-09-23
Note: for invoice 1 on project A, you will see a value of 300, which is the total for that invoice (when grouping on Project='A' and Invoice=1).
Maybe I should have used DCONCAT (see: Concatenation in between records in Access Query ) for the Description, to include 'TV' in it. But I am unable to test that so I am only referring to this answer.
Try joining a second query:
Select *
From YourTable As T
Inner Join
(Select Project, Invoice, Max([Date]) As MaxDate
From YourTable
Group By Project, Invoice) As S
On T.Project = S.Project And T.Invoice = S.Invoice And T.Date = S.MaxDate

using correlated subquery in the case statement

I’m trying to use a correlated subquery in my sql code and I can't wrap my head around what I'm doing wrong. A brief description about the code and what I'm trying to do:
The code consists of a big query (aliased as A) whose result set is a list of customer IDs, offer IDs and the response status name ("SOLD", "SELLING", "IRRELEVANT", "NO ANSWER" etc.) of each customer to each offer. The customer IDs and the responses in the result set are non-unique, since more than one offer can be made to each customer, and a customer can have a different response to different offers.
The goal is to generate a list of distinct customer IDs and to mark each ID with a 0 or 1 flag:
if the ID has AT LEAST ONE offer whose status name is "SOLD" or "SELLING", the flag should be 1, otherwise 0. Since each customer has an array of different responses, what I'm trying to do is check whether "SOLD" or "SELLING" appears in this array for each customer ID, using a correlated subquery in the case statement and aliasing the big underlying query A as A1 this time:
select distinct
A.customer_ID,
case when 'SOLD' in (select distinct A1.response from A as A1
where A.customer_ID = A1.customer_ID) OR
'SELLING' in (select distinct A1.response from A as A1
where A.customer_ID = A1.customer_ID)
then 1 else 0 end as FLAG
FROM
(select …) A
What I get is an error saying there is no such object as A or A1.
Thanks in advance for the help!
You can use exists with a CTE:
with cte as (
<query here>
)
select c.*,
(case when exists (select 1
from cte c1
where c1.customer_ID = c.customer_ID and
c1.response in ('SOLD', 'SELLING')
)
then 1 else 0
end) as flag
from cte c;
You can also do aggregation:
select customer_id,
max(case when a.response in ('SOLD', 'SELLING') then 1 else 0 end) as flag
from < query here > a
group by customer_id;
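A minimal sketch of the aggregation approach, run on SQLite from Python for illustration. The table `offers` plays the role of the big underlying query, and its name, columns, and rows are assumptions:

```python
import sqlite3

# Assumption: "offers" stands in for the result of the big underlying query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offers (customer_id INTEGER, offer_id INTEGER, response TEXT)")
conn.executemany(
    "INSERT INTO offers VALUES (?, ?, ?)",
    [
        (1, 10, "SOLD"),
        (1, 11, "IRRELEVANT"),
        (2, 12, "NO ANSWER"),
        (3, 13, "SELLING"),
        (3, 14, "NO ANSWER"),
    ],
)

# MAX over a 0/1 expression is 1 as soon as one row in the group matches.
QUERY = """
SELECT customer_id,
       MAX(CASE WHEN response IN ('SOLD', 'SELLING') THEN 1 ELSE 0 END) AS flag
FROM offers
GROUP BY customer_id
ORDER BY customer_id
"""


def customer_flags(connection):
    """Return one (customer_id, flag) row per customer."""
    return connection.execute(QUERY).fetchall()


print(customer_flags(conn))
```

Because the grouping already yields one row per customer, no DISTINCT or correlated subquery is needed.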
The WITH statement suggested by Yogesh is a good option. If you have any performance issues with the WITH statement, you can create a volatile table and use its columns in your select statement.
create volatile table as (select response from where response in ('SOLD','SELLING')).
SELECT from customer table < and join volatile table >.
The only disadvantage here is that volatile tables cannot be accessed after you disconnect from the session.

SQL Server 2008 R2: update one occurrence of a group's NULL value and delete the rest

I have a table of orders which has multiple rows of orders missing a Type and I'm struggling to get the queries right. I'm pretty new to SQL so please bear with me.
I've illustrated an example in the picture below. I need help creating the query that will take the table on the left and UPDATE it to look like the table on the right.
The orders are sorted by group. Each group should have one instance of type OK (IF A NULL OR OK ALREADY EXISTS), and no instances of NULL. I would like to achieve this by updating one of the groups' orders with type NULL to have type OK and delete the rest of the respective group's NULL rows.
I've managed to get the rows that I want to keep by
Create a temporary table where I insert the orders and replace NULL types with EMPTY
From the temporary table, get the existing OK orders for groups that already have one OK order, else an EMPTY order that should be changed to OK.
I've done this with the following:
SELECT * FROM Orders
SELECT *
INTO #modified
FROM
(SELECT
Id, IdGroup,
CASE WHEN Type IS NULL
THEN 'EMPTY'
ELSE Type
END Type
FROM
Orders) AS XXX
SELECT MIN(x.Id) Id, x.IdGroup, x.Type
FROM #modified x
JOIN
(SELECT
IdGroup, MIN (Type) AS min_Type
FROM #modified a
WHERE Type = 'OK' OR Type = 'EMPTY'
GROUP BY IdGroup) y ON y.IdGroup = x.IdGroup AND y.min_Type = x.Type
GROUP BY x.IdGroup, x.Type
DROP TABLE #modified
The rest of the EMPTY orders should after this step be deleted, but I don't know how to proceed from here. Maybe this is a poor approach from the beginning and maybe it could be done even easier?
Well done for writing a question that shows some effort and clearly explains what you're after. That's a rare thing unfortunately!
This is how I would do it:
First backup the table (I like to put them into a different schema to keep things neat)
CREATE SCHEMA bak;
SELECT * INTO bak.Orders FROM dbo.Orders;
Now you can do a trial run on the bak table if you like.
Anyway...
Set all the NULL types to OK
UPDATE Orders SET Type = 'OK' WHERE Type IS NULL;
Now repeatedly delete redundant records. Find records with more than one OK and delete them:
DELETE Orders WHERE ID In
(
SELECT MIN(Id) Id
FROM Orders
WHERE Type = 'OK'
GROUP BY idGroup
HAVING COUNT(*) > 1
);
You'll need to run that one a few times until it affects zero records
Assuming there are no multiple OKs and each group has at least one Ok or NULL value, you can do:
select t.id, t.idGroup, t.Type
from lefttable t
where t.Type is not null and t.Type <> 'OK'
union all
select t.id, t.idGroup, 'OK'
from (select t.*, row_number() over (partition by idGroup order by coalesce(t.Type, 'ZZZ')) as seqnum
from lefttable t
where t.Type is null or t.Type = 'OK'
) t
where seqnum = 1;
Actually, this will work even if you do have multiple OKs, but it will keep only one of the rows.
The first subquery selects all rows that are neither OK nor NULL. The second chooses exactly one row from each group's OK or NULL rows and assigns it the type OK.
If you want to keep any OK ones in preference to a NULL, this will work. It creates a temp table with everything we need to work on (OK and NULL), and numbers them starting from one with each group, ordered so you list OK records before null ones. Then it makes sure all the first records are OK, and deletes all the rest
Create table #work (Id int, RowNo int)
--Get a list of all the rows we need to work on, and number them for each group
--(order by type desc puts OK before nulls)
Insert into #work (Id, RowNo)
Select Id, ROW_NUMBER() over (partition by IdGroup order by type desc) as RowNo
From Orders O
where (type is null OR type = 'OK');
-- Make sure the one we keep is OK, not null
Update O set type = 'OK'
from #Work W
inner join Orders O on O.Id = W.Id
Where W.RowNo = 1 and O.type IS NULL;
--Delete the remaining ones (any rowno > 1)
Delete O
from #Work W
inner join Orders O on O.Id = W.Id
Where W.RowNo > 1;
drop table #work;
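The same "keep one OK per group, prefer an existing OK over a NULL" idea can be sketched without window functions by picking one keeper Id per group with a conditional MIN. This is a swapped-in formulation rather than the ROW_NUMBER version above, run on SQLite from Python only to illustrate the logic; the sample rows are hypothetical:

```python
import sqlite3

# Assumption: hypothetical rows; SQLite stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Id INTEGER PRIMARY KEY, IdGroup INTEGER, Type TEXT)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        (1, 1, None), (2, 1, None), (3, 1, "A"),
        (4, 2, "OK"), (5, 2, None), (7, 2, None),
        (6, 3, "B"),
    ],
)


def keep_one_ok_per_group(connection):
    """Keep one OK row per group (an existing OK if present) and drop spare NULL/OK rows."""
    # Keeper per group: lowest Id among OK rows, else lowest Id among NULL rows.
    connection.execute("""
        CREATE TEMP TABLE keepers AS
        SELECT COALESCE(MIN(CASE WHEN Type = 'OK' THEN Id END), MIN(Id)) AS Id
        FROM Orders
        WHERE Type IS NULL OR Type = 'OK'
        GROUP BY IdGroup
    """)
    # Make sure the keeper is OK, not NULL.
    connection.execute(
        "UPDATE Orders SET Type = 'OK' "
        "WHERE Type IS NULL AND Id IN (SELECT Id FROM keepers)"
    )
    # Delete every other NULL/OK row.
    connection.execute(
        "DELETE FROM Orders "
        "WHERE (Type IS NULL OR Type = 'OK') AND Id NOT IN (SELECT Id FROM keepers)"
    )
    connection.execute("DROP TABLE keepers")
    return connection.execute("SELECT Id, IdGroup, Type FROM Orders ORDER BY Id").fetchall()


remaining = keep_one_ok_per_group(conn)
print(remaining)
```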
Can't you just delete the rows where Type equals null?
DELETE FROM Orders WHERE Type IS NULL

Nested SQL Queries with Self JOIN - How to filter rows OUT

I have an SQLite3 database with a table which I need to filter by several factors. One such factor is to filter out rows based on the content of other rows within the same table.
From what I've researched, a self JOIN is going to be required, but I am not sure how I would do that to filter the table by several factors.
Here is a sample table of the data:
Name Part # Status Amount
---------------------------------
Item 1 12345 New $100.00
Item 2 12345 New $15.00
Item 3 35864 Old $132.56
Item 4 12345 Old $15.00
What I need to do is find any Items that have the same Part #, one of them has an "Old" Status and the Amount is the same.
So, first we would get all rows with Part # "12345," and then check if any of the rows have an "Old" status with a matching Amount. In this example, we would have Item2 and Item4 as a result.
What now would need to be done is to return the REST of the rows within the table, that have a "New" Status, essentially discarding those two items.
Desired Output:
Name Part # Status Amount
---------------------------------
Item 1 12345 New $100.00
Removed all "Old" status rows and any "New" that had a matching "Part #" and "Amount" with an "Old" status. (I'm sorry, I know that's very confusing, hence my need for help).
I have looked into the following resources to try and figure this out on my own, but there are so many levels that I am getting confused.
Self-join of a subquery
ZenTut
Compare rows and columns of same table
The first two links dealt with comparing columns within the same table. The third one does seem to be a pretty similar question, but does not have a readable answer (for me, anyway).
I do Java development as well and it would be fairly simple to do this there, but I am hoping for a single SQL query (nested), if possible.
The "not exists" statement should do the trick:
select * from table t1
where t1.Status = 'New'
and not exists (select * from table t2
where t2.Status = 'Old'
and t2.Part = t1.Part
and t2.Amount = t1.Amount);
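Since the question is about SQLite3, the NOT EXISTS query can be tried directly with Python's built-in sqlite3 module. A minimal sketch using the sample rows from the question; the table name `items` and the column name `PartNum` (for "Part #") are assumptions, and Amount is stored as a plain number:

```python
import sqlite3

# Assumption: hypothetical table/column names; rows taken from the sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (Name TEXT, PartNum TEXT, Status TEXT, Amount REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        ("Item 1", "12345", "New", 100.00),
        ("Item 2", "12345", "New", 15.00),
        ("Item 3", "35864", "Old", 132.56),
        ("Item 4", "12345", "Old", 15.00),
    ],
)

# Keep "New" rows that have no "Old" row with the same part number and amount.
QUERY = """
SELECT t1.Name, t1.PartNum, t1.Status, t1.Amount
FROM items t1
WHERE t1.Status = 'New'
  AND NOT EXISTS (
      SELECT 1
      FROM items t2
      WHERE t2.Status = 'Old'
        AND t2.PartNum = t1.PartNum
        AND t2.Amount = t1.Amount
  )
ORDER BY t1.Name
"""


def unmatched_new_items(connection):
    """Return the New rows that are not cancelled out by a matching Old row."""
    return connection.execute(QUERY).fetchall()


print(unmatched_new_items(conn))
```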
This is a T-SQL answer. Hope it is translatable. If you have a big data set for matches you might change the NOT IN to NOT EXISTS.
select *
from table
where Name not in(
select t1.Name
from table t1
join table t2
on t1.PartNumber = t2.PartNumber
AND t1.Status='New'
AND t2.Status='Old'
and t1.Amount=t2.Amount)
and Status = 'New'
You could also inner join to a grouped subquery that finds the Part #/Amount combinations containing an Old status together with at least one other status:
select * from
my_table
INNER JOIN (
select
Part_#
, Amount
, count(distinct Status) as status_cnt
, sum(case when Status = 'Old' then 1 else 0 end) as old_cnt
from my_table
group by Part_#, Amount
having count(distinct Status) > 1
and sum(case when Status = 'Old' then 1 else 0 end) > 0
) t on t.Part_# = my_table.Part_#
and my_table.Status = 'New'
and my_table.Amount <> t.Amount
Tried to understand what you want best I could...
SELECT DISTINCT yt.PartNum, yt.Status, yt.Amount
FROM YourTable yt
JOIN YourTable yt2
ON yt2.PartNum = yt.PartNum
AND yt2.Status = 'Old'
AND yt2.Amount != yt.Amount
WHERE yt.Status = 'New'
This gives everything with a new status that has an old status with a different price.

PostgreSQL sum quantity of children items

I have a subscription service that delivers many items.
Subscribers add items to a delivery by creating a row in delivery_items.
Until recently subscribers could only add 1 of each item to a delivery. But now I have added a quantity column to my delivery_items table.
Given this schema, and an outdated query (on SQL Fiddle), how can I select the total amount of an item I will need for each day's deliveries?
This provided a table of days, and items being delivered that day but doesn't account for quantity:
SELECT
d.date,
sum((di.item_id = 1)::int) as "Bread",
sum((di.item_id = 2)::int) as "Eggs",
sum((di.item_id = 3)::int) as "Coffee"
FROM deliveries d
JOIN users u ON u.id = d.user_id
JOIN delivery_items di ON di.delivery_id = d.id
GROUP BY d.date
ORDER BY d.date
Ideally, my query would be agnostic to the specifics of the items, like the id/name.
Thanks
Edit to add schema:
deliveries (TABLE)
id int4(10)
date timestamp(29)
user_id int4(10)
delivery_items (TABLE)
delivery_id int4(10)
item_id int4(10)
quantity int4(10)
items (TABLE)
id int4(10)
name varchar(10)
users (TABLE)
id int4(10)
name varchar(10)
You don't need to JOIN your users table, because you're neither getting any data from it nor using it as your joining condition.
Here's your edited SQL Fiddle
A conditional sum() that adds di.quantity for the matching item gives the total quantity of each item needed for a particular date.
SELECT
d.date,
sum(CASE WHEN di.item_id = 1 THEN di.quantity ELSE 0 END) as "Bread",
sum(CASE WHEN di.item_id = 2 THEN di.quantity ELSE 0 END) as "Eggs",
sum(CASE WHEN di.item_id = 3 THEN di.quantity ELSE 0 END) as "Coffee"
FROM deliveries d
JOIN delivery_items di ON di.delivery_id = d.id
GROUP BY d.date
ORDER BY d.date
You could also look into crosstab(text, text) function. Result would be the same, but you can also specify query that produces the set of categories.
Though, if you want to get dynamic results when your items table has additional rows, you would need to wrap this up in a function and build the output columns and types definition, because:
The crosstab function is declared to return setof record, so the actual names and types of the output columns must be defined in the FROM clause of the calling SELECT statement.
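If one row per date and item is acceptable instead of one column per item, a plain GROUP BY that joins the items table stays agnostic to item ids and names. This is an alternative to the pivot, not the crosstab approach itself. A minimal sketch on SQLite from Python, standing in for PostgreSQL, with hypothetical rows:

```python
import sqlite3

# Assumption: hypothetical rows; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deliveries (id INTEGER, date TEXT, user_id INTEGER);
    CREATE TABLE delivery_items (delivery_id INTEGER, item_id INTEGER, quantity INTEGER);
    CREATE TABLE items (id INTEGER, name TEXT);
""")
conn.executemany(
    "INSERT INTO deliveries VALUES (?, ?, ?)",
    [(1, "2024-01-01", 1), (2, "2024-01-01", 2), (3, "2024-01-02", 1)],
)
conn.executemany(
    "INSERT INTO delivery_items VALUES (?, ?, ?)",
    [(1, 1, 2), (1, 2, 1), (2, 1, 3), (3, 3, 4)],
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "Bread"), (2, "Eggs"), (3, "Coffee")],
)

# Long format: one row per (date, item) with the summed quantity.
QUERY = """
SELECT d.date, i.name, SUM(di.quantity) AS total_quantity
FROM deliveries d
JOIN delivery_items di ON di.delivery_id = d.id
JOIN items i ON i.id = di.item_id
GROUP BY d.date, i.name
ORDER BY d.date, i.name
"""


def daily_item_totals(connection):
    """Return (date, item name, total quantity) rows for each day's deliveries."""
    return connection.execute(QUERY).fetchall()


print(daily_item_totals(conn))
```

New rows in the items table show up in the result without any change to the query; the pivot into columns can then be done in the application if needed.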