pig script loop through calculate averages - apache-pig

I have data that will be processed in Pig on AWS EMR. The columns are called model, year, units_sold, total_customers, and the data looks like this:
chevy 1900 1000 49
chevy 1901 73 92
chevy 1902 45 65
chevy 1903 300 75
ford 1900 35 12
ford 1901 777 32
ford 1902 932 484
ford 1903 33 15
What I am trying to do is calculate an average for every car model: the sum of units_sold divided by the sum of total_customers.
So the desired result would look like
chevy (1000+73+45+300) / (49+92+65+75) = 5.04
ford (35+777+932+33) / (12+32+484+15) = 3.27
In my script I have
A = *Step to load data*;
B = GROUP A by year;
C = results = FOREACH B GENERATE SUM(units_sold)/SUM(total_customers);
dump C;
This returns an incorrect result. How can I get results that look like this?
chevy 5.04
ford 3.27

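It looks like you need to group by car model, not year. Also, if units_sold and total_customers are integers, cast to float so the division is not truncated to an integer. After GROUP, the original fields live in the bag named A, so reference them as A.units_sold and A.total_customers, and emit group to keep the model name in the output. Try:
B = GROUP A BY model;
C = FOREACH B GENERATE group AS model, (float)SUM(A.units_sold)/(float)SUM(A.total_customers) AS ratio;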
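To sanity-check the numbers outside Pig, the same group-then-divide logic can be sketched in plain Python. This is only an illustration: the `rows` list and the `ratio_by_model` helper are hypothetical names, and the data is copied from the question.

```python
from collections import defaultdict

# (model, year, units_sold, total_customers), copied from the question
rows = [
    ("chevy", 1900, 1000, 49),
    ("chevy", 1901, 73, 92),
    ("chevy", 1902, 45, 65),
    ("chevy", 1903, 300, 75),
    ("ford", 1900, 35, 12),
    ("ford", 1901, 777, 32),
    ("ford", 1902, 932, 484),
    ("ford", 1903, 33, 15),
]

def ratio_by_model(rows):
    """Return sum(units_sold) / sum(total_customers) for each model."""
    units = defaultdict(int)
    customers = defaultdict(int)
    for model, _year, units_sold, total_customers in rows:
        units[model] += units_sold
        customers[model] += total_customers
    # True division keeps the fractional part, like the float cast in Pig
    return {model: units[model] / customers[model] for model in units}

ratios = ratio_by_model(rows)
for model, ratio in ratios.items():
    print(model, round(ratio, 2))
```

Note that chevy comes to 1418/281, which is about 5.046, so it rounds to 5.05; the 5.04 in the question is the truncated value.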

Finding Max Price and displaying multiple columns SQL

I have a table that looks like this:
customer_id item price cost
1 Shoe 120 36
1 Bag 180 50
1 Shirt 30 9
2 Shoe 150 40
3 Shirt 30 9
4 Shoe 120 36
5 Shorts 65 14
I am trying to find the most expensive item each customer bought along with the cost of item and the item name.
I'm able to do the first part:
SELECT customer_id, max(price)
FROM sales
GROUP BY customer_id;
Which gives me:
customer_id price
1 180
2 150
3 30
4 120
5 65
How do I get this output to also show the item and its cost? The output should look like this:
customer_id price item cost
1 180 Bag 50
2 150 Shoe 40
3 30 Shirt 9
4 120 Shoe 36
5 65 Shorts 14
I'm assuming it's a SELECT statement within a SELECT? I would appreciate the help, as I'm fairly new to SQL.
One method that usually has good performance is a correlated subquery:
select s.*
from sales s
where s.price = (select max(s2.price)
from sales s2
where s2.customer_id = s.customer_id
);
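The correlated subquery above can be tried end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: it uses an in-memory SQLite database rather than the asker's actual server, and the explicit column list and ORDER BY are added only to match the requested output layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer_id INTEGER, item TEXT, price INTEGER, cost INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "Shoe", 120, 36),
        (1, "Bag", 180, 50),
        (1, "Shirt", 30, 9),
        (2, "Shoe", 150, 40),
        (3, "Shirt", 30, 9),
        (4, "Shoe", 120, 36),
        (5, "Shorts", 65, 14),
    ],
)

# For each row, keep it only if its price equals that customer's maximum price
query = """
SELECT s.customer_id, s.price, s.item, s.cost
FROM sales s
WHERE s.price = (SELECT MAX(s2.price)
                 FROM sales s2
                 WHERE s2.customer_id = s.customer_id)
ORDER BY s.customer_id
"""
top_items = conn.execute(query).fetchall()
for row in top_items:
    print(row)
```

If a customer bought two items at the same maximum price, this query returns both rows for that customer.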

find sum of the count for the latest snapshot in pandas

I have the df below and want to calculate the sum for a group, using only the last snapshot of each id:
product desc id month_year count
car ford 1 2019-01 20
car ford 1 2019-02 20
car ford 1 2019-04 40
car ford 2 2019-04 30
car ford 2 2019-04 30
car ford 2 2019-04 60
My attempt so far, followed by the output I want:
df.groupby(["product", "desc"]. ?
product desc count_overall
car ford 100
That is, for id 1 take the latest count by month_year, which is 40, and similarly 60 for id 2, which makes the total 100.
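IIUC, you need the id as well to get the last value of count. (In pandas 2.0 and later, sum(level=...) is no longer available, so group by the index levels instead.)
s=df.groupby(["product", "desc","id"])['count'].last().groupby(level=[0,1]).sum().to_frame('count_overall').reset_index()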
Out[171]:
product desc count_overall
0 car ford 100
You can also use drop_duplicates given the data is sorted by date already:
(df.drop_duplicates(['product','desc','id'], keep='last')
.groupby(['product','desc'])['count'].sum()
)
Output:
product desc
car ford 100
Name: count, dtype: int64
IIUC, we can use sort_values with groupby and agg to get the last occurrence of count.
First, we convert your date into a proper datetime:
df['month_year'] = pd.to_datetime(df['month_year'],format='%Y-%m')
new_df = df.sort_values("month_year", kind="stable").groupby(["product", "desc", "id"]).agg(
date_max=("month_year", "max"), count=("count", "last")
)
print(new_df)
date_max count
product desc id
car ford 1 2019-04-01 40
2 2019-04-01 60
From here you can just do a simple sum:
print(new_df.groupby(level=[0,1]).sum())
count
product desc
car ford 100

SSAS MDX calculation

I need a calculated measure (SSAS MD) to calculate the percentage of count values.
I have tried the expression below but did not get the desired output. Let me know if I am missing anything. I want to calculate each age's percentage of its car's total count:
( [DimCar].[Car], [DimAge].[Age], [Measure].[Count])/
sum([DimCar].[Car].[All].children), ([DimAge].[Age].[All], [Meaures].[Count])}*100
Below are the sample data values in the cube:
Car Age Count
----- ----- -----
Benz 1 2
Camry 37
Honda 1 18
Honda 6 10
Expected output:
Car Age Count Percent TotalCount
----- ----- ----- ------ ----------
Benz 1 2 100% 2
Camry 37 100% 37
Honda 1 18 64.28% 28
Honda 6 10 35.71% 28
Formula to calculate the percentage:
18/28*100 =64.28%
10/28*100 =35.71%
Honda 1 18 64.28% 28
Honda 6 10 35.71% 28
with
Member [Measures].[Total Sales Count] as
    iif(isempty([Measures].[Sales]), NULL, sum([Model].[Modelname].[All].children, [Measures].[Sales]))
Member [Measures].[Total Sales%] as
    ([Measures].[Sales]/[Measures].[Total Sales Count]), FORMAT_STRING = "Percent"
select {[Measures].[Sales], [Measures].[Total Sales Count], [Measures].[Total Sales%]} on 0,
non empty {[Car].[Carname].[Carname]*[Model].[Modelname].[Modelname]} on 1
from [Cube]
Output :
Car Model Sales Total Sales Count Total Sales%
Benz New Model 2 2 100.00%
Camry Old Model 37 37 100.00%
Honda New Model 18 28 64.29%
Honda Top Model 10 28 35.71%
Instead of the "Age" attribute, I have added a "Model" dimension.
The code produces exactly the expected output.
My understanding is that for a particular car, for example Honda, you want to divide by the total for all Hondas irrespective of their Age, in this case 28. So for the Age 6 Honda you use 10/28, whereas for Benz, since all Benz are Age 1, you divide by 2.
Use the following code:
Round(
(
( [DimCar].[Car].currentmember, [DimAge].[Age].currentmember, [Measure].[Count])
/
([DimCar].[Car].currentmember,root([DimAge]),[Measure].[Count])
)*100
,2)
Below is a similar example on Adventure Works:
with member
measures.t as
(
( [Product].[Category].currentmember, [Delivery Date].[Calendar Year].currentmember, [Measures].[Internet Order Quantity])
/
([Product].[Category].currentmember,root([Delivery Date]),[Measures].[Internet Order Quantity])
)*100
select {[Measures].[Internet Order Quantity],measures.t}
on columns ,
non empty
([Product].[Category].[Category],[Delivery Date].[Calendar Year].[Calendar Year])
on rows
from [Adventure Works]
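The arithmetic behind the percent-of-car-total measure can be checked with a short Python sketch. It is not MDX and does not model cube semantics; the `cells` list and the `percent_of_car_total` helper are hypothetical names, with the data taken from the question's sample.

```python
from collections import defaultdict

# (car, age, count) cells from the question; Camry has no Age value
cells = [
    ("Benz", 1, 2),
    ("Camry", None, 37),
    ("Honda", 1, 18),
    ("Honda", 6, 10),
]

def percent_of_car_total(cells):
    """Return (car, age, count, percent, car_total) for every cell."""
    totals = defaultdict(int)
    for car, _age, count in cells:
        totals[car] += count  # total per car, irrespective of Age
    return [
        (car, age, count, round(count / totals[car] * 100, 2), totals[car])
        for car, age, count in cells
    ]

report = percent_of_car_total(cells)
for row in report:
    print(row)
```

With rounding to two decimals, the Honda Age 1 row comes out as 64.29 rather than the truncated 64.28 shown in the question.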

Proc Optmodel SAS Variable not unique

I am using proc optmodel to solve a problem in which several items must be priced the same within the same location (let's say they are different colors of same product and are not currently priced the same). I know that volume will increase/decrease depending on direction of price change, and I have some MIN/MAX constraints as well.
The problem I am running into is that the procedure is only reading one group of unique SKUs, I think because the SKUs repeat across locations. How can I get the procedure to optimize all unique combinations of SKU/LOCATION? I tried just changing the item numbers, which of course works, but is not practical for my business solution. Thanks.
data input_data;
input SKU DESC $ LOCATION $ OLD_PRICE MIN MAX LIFT OLD_UNITS;
cards;
111 black NY 12.99 10 15 1.3 100
222 white NY 13.45 11 15 .9 150
333 red NY 13.29 13 15 1.6 200
111 black DC 11.75 10 14 1.2 300
222 white DC 11.75 10 14 1.5 100
333 red DC 11.99 10 14 1.7 140
111 black LA 14.21 12 17 2.0 600
222 white LA 14.79 14 17 1.5 500
333 red LA 15.99 13 17 .3 200
444 orange LA 14.11 12 17 .6 300
;
run;
proc optmodel;
set<num> SKU;
string LOCATION{SKU};
string DESC{SKU};
set LOCATIONS = setof{i in SKU} LOCATION[i];
set SKUperLOCATION{gi in LOCATIONS} = {i in SKU: LOCATION[i] = gi};
number OLD_PRICE{SKU};
number MIN{SKU};
number MAX{SKU};
var NEW_PRICE{gi in LOCATIONs} >= max{i in SKUperLOCATION[gi]} MIN[i] <= min{i in SKUperLOCATION[gi]} MAX[i];
impvar NEW_PRICEbySKU{i in SKU} = NEW_PRICE[LOCATION[i]];
number LIFT{SKU};
number OLD_UNITS{SKU};
read data input_data into
SKU=[SKU]
DESC
LOCATION
OLD_PRICE
MIN
MAX
LIFT
OLD_UNITS;
max sales=sum{gi in LOCATIONs}
sum{i in SKUperLOCATION[gi]}
(NEW_PRICE[gi])*(1-(NEW_PRICE[gi]-OLD_PRICE[i])*LIFT[i]/OLD_PRICE[i])*OLD_UNITS[i];
expand;
solve;
create data results_FAM_maxsales
from [SKU]={SKU}
DESC
LOCATION
OLD_PRICE
NEW_PRICE=NEW_PRICEbySKU
MIN
MAX
LIFT
OLD_UNITS;
print NEW_PRICE sales;
quit;
One way would be to set your unique key to be SKU & Location. I haven't used OPTMODEL in a while, but something like this should work.
set<num,str> SKU_Loc;
num old_price{SKU_Loc};
<code>
read data input_data into SKU_Loc = [SKU Location];
<code>
Then change the rest of the code to reference the unique combination of SKU & location.
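Why the composite key matters can be illustrated with a plain Python dict. This is only an analogy, assuming that values indexed by a repeated key overwrite each other; it does not reproduce OPTMODEL's read semantics. The rows use a subset of the question's columns (SKU, DESC, LOCATION, OLD_PRICE).

```python
# (SKU, DESC, LOCATION, OLD_PRICE) rows from the question's input_data
rows = [
    (111, "black", "NY", 12.99),
    (222, "white", "NY", 13.45),
    (333, "red", "NY", 13.29),
    (111, "black", "DC", 11.75),
    (222, "white", "DC", 11.75),
    (333, "red", "DC", 11.99),
    (111, "black", "LA", 14.21),
    (222, "white", "LA", 14.79),
    (333, "red", "LA", 15.99),
    (444, "orange", "LA", 14.11),
]

by_sku = {}      # keyed by SKU only: repeated SKUs overwrite earlier rows
by_sku_loc = {}  # keyed by (SKU, LOCATION): every row keeps its own entry
for sku, _desc, location, old_price in rows:
    by_sku[sku] = old_price
    by_sku_loc[(sku, location)] = old_price

print(len(by_sku), "entries when keyed by SKU")
print(len(by_sku_loc), "entries when keyed by (SKU, LOCATION)")
```

Keyed by SKU alone, only one entry per SKU survives, while the (SKU, LOCATION) key keeps all ten rows distinct.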

Display %change using SQL Server

Happy New Year!
I was wondering if we could write a SQL query in SQL Server that displays the %change.
For example, I am running this simple query, with the output below:
Select location, item_sold
from product
Location Item_sold
VA 20
CA 57
DC 44
MA 75
FL 101
Now, I would like the output below, which takes the items sold at each location divided by the total:
Location Item_sold %Chg
VA 20 7%
CA 57 19%
DC 44 15%
MA 75 25%
FL 101 34%
TOTAL 297
In SQL Server 2005+, you can use the following:
SELECT location, item_sold, CAST(item_Sold AS FLOAT)/(SUM(Item_Sold) OVER()) [%Chg]
FROM product
For previous versions, try this:
SELECT A.location, A.item_sold,
CAST(A.item_Sold AS FLOAT)/(SELECT SUM(Item_Sold) Ammount FROM product) [%Chg]
FROM product A
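Both queries above return a fraction (for example 0.067) rather than a whole-number percentage. A minimal sketch of the scalar-subquery variant, scaled to a percentage and rounded, can be run with Python's built-in sqlite3 module. It assumes an in-memory SQLite database standing in for SQL Server; the `pct` alias and the ORDER BY rowid are additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (location TEXT, item_sold INTEGER)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("VA", 20), ("CA", 57), ("DC", 44), ("MA", 75), ("FL", 101)],
)

# Each row's share of the grand total, as a rounded whole-number percentage.
# Multiplying by 100.0 forces floating-point division.
query = """
SELECT a.location,
       a.item_sold,
       CAST(ROUND(100.0 * a.item_sold / (SELECT SUM(item_sold) FROM product)) AS INTEGER) AS pct
FROM product a
ORDER BY a.rowid
"""
shares = conn.execute(query).fetchall()
for location, item_sold, pct in shares:
    print(location, item_sold, f"{pct}%")
```

The TOTAL row from the desired output is not produced here; it would need a separate aggregate query or a UNION.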