Data Frame % column by grouping - pandas

I am working on a forecast accuracy report which measures the deviation between the actual and the previous projection. The measurement would be = 1 - ('Actual' - 'M-1') / 'Actual'.
The measure needs to be grouped at different granularities, say 'Product Category' / 'Line' / 'Product'. However, the df.groupby('Product Category').sum() function doesn't support the percentage calculation directly. Does anyone have an idea how this should be done? Thanks!
import pandas as pd

data = {
    "Product Category": ['Drink', 'Drink', 'Drink', 'Food', 'Food', 'Food'],
    "Line": ['Water', 'Water', 'Wine', 'Fruit', 'Fruit', 'Fruit'],
    "Product": ['A', 'B', 'C', 'D', 'E', 'F'],
    "Actual": [100, 50, 40, 20, 70, 50],
    "M-1": [120, 40, 10, 20, 80, 50],
}
df = pd.DataFrame(data)
df['M1 Gap'] = df['Actual'] - df['M-1']
df['Error_Per'] = 1 - df['M1 Gap'] / df['Actual']
Expected output would be:
[screenshot of expected output]

You can also create a custom function and apply it to every row of a pandas DataFrame as follows. Just note that I set the axis argument to 1 so that the custom function is applied to each row, i.e. across columns:
import pandas as pd

def func(row):
    row['M1 Gap'] = row['Actual'] - row['M-1']
    row['Error_Per'] = 1 - (row['M1 Gap'] / row['Actual'])
    return row

df.groupby('Product Category').sum().apply(func, axis=1)
                  Actual    M-1  M1 Gap  Error_Per
Product Category
Drink              190.0  170.0    20.0   0.894737
Food               140.0  150.0   -10.0   1.071429
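One caveat worth adding: on newer pandas versions, .sum() may also try to aggregate the string columns ('Line', 'Product'). If that happens, select the numeric columns first or pass numeric_only=True; a minimal sketch, reusing the df and func defined above:
df.groupby('Product Category')[['Actual', 'M-1', 'M1 Gap']].sum().apply(func, axis=1)
# or equivalently
df.groupby('Product Category').sum(numeric_only=True).apply(func, axis=1)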

You should group BEFORE calculating percentage:
data = {
    "Product Category": ['Drink', 'Drink', 'Drink', 'Food', 'Food', 'Food'],
    "Line": ['Water', 'Water', 'Wine', 'Fruit', 'Fruit', 'Fruit'],
    "Product": ['A', 'B', 'C', 'D', 'E', 'F'],
    "Actual": [100, 50, 40, 20, 70, 50],
    "M-1": [120, 40, 10, 20, 80, 50],
}
df = pd.DataFrame(data)
df['M1 Gap'] = df['Actual'] - df['M-1']
df_line = df.groupby('Line').sum()
df_line['Error_Per'] = df_line['M1 Gap'] / df_line['Actual']
print(df_line)
df_prod = df.groupby('Product Category').sum()
df_prod['Error_Per'] = df_prod['M1 Gap'] / df_prod['Actual']
print(df_prod)
Output:
       Actual  M-1  M1 Gap  Error_Per
Line
Fruit     140  150     -10  -0.071429
Water     150  160     -10  -0.066667
Wine       40   10      30   0.750000

                  Actual  M-1  M1 Gap  Error_Per
Product Category
Drink                190  170      20   0.105263
Food                 140  150     -10  -0.071429
Note: your expected outcome from the screenshot doesn't match the dictionary in your code (which is what I used).
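If you need the same measure at several granularities ('Product Category' / 'Line' / 'Product'), a minimal sketch, reusing the df from above and the 1 - gap/actual definition from the question, could simply loop over the grouping columns:
for level in ['Product Category', 'Line', 'Product']:
    g = df.groupby(level)[['Actual', 'M-1', 'M1 Gap']].sum()
    g['Error_Per'] = 1 - g['M1 Gap'] / g['Actual']
    print(g)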


Lambda with "not in" doesn't work for more than one word in Python

I would like to filter a dataframe with a lambda if-condition.
I have "product name" and "category1" columns, and if "product name" does not contain any of the words ("boxer", "boxers", "sock", "socks") I would like to change the "category1" column to "Other". But the code below changes all of them to "Other", even rows that do contain "sock", for example:
df = pd.DataFrame({
'product_name': ["blue shirt", " medium boxers", "red jackets ", "blue sock"],})
df["category1"]=df.apply(lambda x: "Other" if ("boxer","boxers","sock","socks" not in x["product_name"] ) else x["category1"], axis=1)
I expected the results below:
df = pd.DataFrame({
    'product_name': ["blue shirt", " medium boxers", "red jackets ", "blue sock"],
    'category1': ["Other", NaN, "Other", NaN],})
Thank you for your support
You could use str.contains:
import numpy as np

items = ("boxer", "boxers", "sock", "socks")
df["category1"] = np.where(df['product_name'].str.contains('|'.join(items)),
                           np.nan,   # value if True
                           'Other')  # value if False
output:
product_name category1
0 blue shirt Other
1 medium boxers nan
2 red jackets Other
3 blue sock nan
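For completeness, the original lambda changed every row because ("boxer","boxers","sock","socks" not in x["product_name"]) builds a 4-element tuple (only the last element is the membership test), and a non-empty tuple is always truthy. If you prefer the row-wise apply style, a minimal sketch that matches the output above (it assigns NaN rather than an existing 'category1' value, which the example frame doesn't have):
import numpy as np

items = ("boxer", "boxers", "sock", "socks")
df["category1"] = df.apply(
    lambda x: "Other" if not any(w in x["product_name"] for w in items) else np.nan,
    axis=1)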

Generating one NumPy array for each DataFrame row

I'm attempting to plot stock market trades against a plot of the particular stock using mplfinance.plot(). I keep a record of all my trades using jstock, which uses a CSV file:
"Code","Symbol","Date","Units","Purchase Price","Current Price","Purchase Value","Current Value","Gain/Loss Price","Gain/Loss Value","Gain/Loss %","Broker","Clearing Fee","Stamp Duty","Net Purchase Value","Net Gain/Loss Value","Net Gain/Loss %","Comment"
"ASO","Academy Sports and Outdoors, Inc.","Sep 13, 2021","25.0","45.85","46.62","1146.25","1165.5","0.769999999999996","19.25","1.6793893129770994","0.0","0.0","0.0","1146.25","19.25","1.6793893129770994",""
"ASO","Academy Sports and Outdoors, Inc.","Aug 26, 2021","15.0","41.3","46.62","619.5","699.3","5.32","79.79999999999995","12.881355932203384","0.0","0.0","0.0","619.5","79.79999999999995","12.881355932203384",""
"ASO","Academy Sports and Outdoors, Inc.","Jun 3, 2021","10.0","37.48","46.62","374.79999999999995","466.2","9.14","91.40000000000003","24.386339381003214","0.0","0.0","0.0","374.79999999999995","91.40000000000003","24.386339381003214",""
"RMBS","Rambus Inc.","Nov 24, 2021","2.0","26.99","26.99","53.98","53.98","0.0","0.0","0.0","0.0","0.0","0.0","53.98","0.0","0.0",""
I can get this data easily enough using
myportfolio = pd.read_csv(PORTFOLIO_LOCATION, parse_dates=[2])
But I need to create individual lists for each trade that match the day-by-day stock price:
Date,High,Low,Open,Close,Volume,Adj Close
2020-12-01,17.020000457763672,16.5,16.799999237060547,16.8799991607666,990900,16.8799991607666
2020-12-02,17.31999969482422,16.290000915527344,16.65999984741211,16.40999984741211,1200500,16.40999984741211
and I have a normal DataFrame containing this. So far this is what I have:
for i in myportfolio.groupby("Code"):
    (code, j) = i
    if code == "ASO":  # just testing it against one stock
        simp = pd.DataFrame(columns=["Date", "Units", "Price"],
                            data=j[["Date", "Units", "Purchase Price"]].values, index=j[["Date"]])
        df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
        # df.lookup(simp["Date"])
        df.insert(0, 'row_num', range(0, len(df)))
        k = df.loc[simp["Date"]]['row_num']
        trades = []
        for index, m in k.iteritems():
            t = np.zeros((df.shape[0], 1))
            t.fill(np.nan)
            t[m] = simp[index]["Price"]
            trades.append(t.to_list())
But I receive a KeyError: Timestamp('2021-09-17 00:00:00').
Any ideas on how to fix this?
Addendum 1:
import pandas as pd
trade_data = [['ASO', '5/5/21', 10], ['ASO', '5/6/21', 12], ['RBLX', '5/7/21', 15]]
trade_df = pd.DataFrame(trade_data, columns = ['Code', 'Date', 'Price'])
trade_df['Date'] = pd.to_datetime(trade_df['Date'])
trade_df
Code Date Price
0 ASO 2021-05-05 10
1 ASO 2021-05-07 12
2 RBLX 2021-05-07 15
aso_data = [['5/5/21', 12, 5, 10, 7], ['5/6/21', 15, 7, 13, 8], ['5/7/21', 17, 10, 15, 11]]
aso_df = pd.DataFrame(aso_data, columns = ['Date', 'High', 'Low', 'Open', 'Close'])
aso_df['Date'] = pd.to_datetime(aso_df['Date'])
aso_df
Date High Low Open Close
0 2021-05-05 12 5 10 7
1 2021-05-06 15 7 13 8
2 2021-05-07 17 10 15 11
So I want to create two NumPy arrays for ASO (one for each trade) and one for the RBLX trade. For ASO I should have two NumPy arrays that look like [10, NaN, NaN] and [NaN, NaN, 12].
You want a list of lists, right?
There is no need to loop:
df_list = df.values.tolist()
Just in case another novice such as myself surfs in with a similar problem:
import numpy as np
import pandas as pd
import mplfinance as mpf

for i in myportfolio.groupby(["Code"]):
    (code, j) = i
    if code == "ASO":  # just testing it against one stock
        df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
        df.insert(0, 'row_num', range(0, len(df)))
        k = df.loc[j["Date"]]['row_num']
        trades = []
        for index, m in j.iterrows():
            t = np.zeros((df.shape[0], 1))
            t.fill(np.nan)
            t[int(df.loc[m["Date"]]['row_num'])] = m["Purchase Price"]
            asplot = mpf.make_addplot(t, type="scatter", color='red', marker="D")
            trades.append(asplot)
        mpf.plot(df, type='candle', addplot=trades)
This produced an okay graph showing my entry points. Good luck!
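For the simplified data in Addendum 1, a minimal sketch of the same idea (one NaN-filled array per trade, assuming trade_df and aso_df exactly as defined above) could be:
import numpy as np

arrays = []
for _, trade in trade_df[trade_df["Code"] == "ASO"].iterrows():
    t = np.full(len(aso_df), np.nan)                     # one slot per daily price row
    mask = (aso_df["Date"] == trade["Date"]).to_numpy()  # True on the trade's date
    t[mask] = trade["Price"]
    arrays.append(t)
# arrays now holds one NaN-filled array per ASO trade, with the price on the trade date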

Optimization Python

I am trying to get the optimal solution for the following data:
D_name  Vial_size1  Vial_size2  Vial_size3  cost  units_needed
Act            120         400           0    $5           738
dug             80         200         400   $40           262

Data in Excel:

Vials        price  size
Vial size 1      5   120
Vial size 2      5   400
from pulp import *
import pandas as pd

prob = LpProblem("Dose_Vial", LpMinimize)

df = pd.read_excel(r'C:\Users\*****\Desktop\Vial.xls')
print(df)

# Create a list of the Vial_Size
Vial_Size = list(df['Vials'])
# Create a dictionary of units for all Vial_Size
size = dict(zip(Vial_Size, df['size']))
# Create a dictionary of price for all Vial_Size
Price = dict(zip(Vial_Size, df['Price']))
# print dictionaries
print(Vial_Size)
print(size)
print(Price)

vial_vars = LpVariable.dicts("Vials", size, lowBound=0, cat='Integer')
# start building the LP problem by adding the main objective function
prob += lpSum([Price[i] * vial_vars[i] * size[i] for i in size])
# adding constraints
prob += lpSum([size[f] * vial_vars[f] for f in size]) >= 738
# The status of the solution is printed to the screen
prob.solve()
print("Status:", LpStatus[prob.status])
# In case the problem is ill-formulated or there is not sufficient information,
# the solution may be infeasible or unbounded
for v in prob.variables():
    if v.varValue > 0:
        print(v.name, "=", format(round(v.varValue)))

obj = round(value(prob.objective))
print("The total cost of optimized vials: ${}".format(round(obj)))

Output:
Vials_Vial_Size_1 = 3
Vials_Vial_Size_2 = 1
The total cost of optimized vials: $3800
How can I set it up for 2 or more drugs and get the best optimal solution?
Here is an approach to solve the first part of the question, finding vial combinations that minimize the waste (I'm not sure what role the price plays?):
from pulp import *
import pandas as pd
import csv

drugs_dict = {"D_name": ['Act', 'dug'],
              "Vial_size1": [120, 80],
              "Vial_size2": [400, 200],
              "Vial_size3": [0, 400],
              "cost": [5, 40],
              "units_needed": [738, 262]}

df = pd.DataFrame(drugs_dict)

drugs = list(df['D_name'])

vial_1_size = dict(zip(drugs, drugs_dict["Vial_size1"]))
vial_2_size = dict(zip(drugs, drugs_dict["Vial_size2"]))
vial_3_size = dict(zip(drugs, drugs_dict["Vial_size3"]))
units_needed = dict(zip(drugs, drugs_dict["units_needed"]))

results = []
for drug in drugs:
    print(f"drug = {drug}")

    # setup minimum waste problem
    prob = LpProblem("Minimum Waste Problem", LpMinimize)

    # create decision variables
    vial_1_var = LpVariable("Vial_1", lowBound=0, cat='Integer')
    vial_2_var = LpVariable("Vial_2", lowBound=0, cat='Integer')
    vial_3_var = LpVariable("Vial_3", lowBound=0, cat='Integer')

    units = lpSum([vial_1_size[drug] * vial_1_var +
                   vial_2_size[drug] * vial_2_var +
                   vial_3_size[drug] * vial_3_var])

    # objective function
    prob += units

    # constraints
    prob += units >= units_needed[drug]

    prob.solve()
    print(f"units = {units.value()}")
    for v in prob.variables():
        if v.varValue > 0:
            print(v.name, "=", v.varValue)

    results.append([drug, units.value(), int(vial_1_var.value() or 0), int(vial_2_var.value() or 0), int(vial_3_var.value() or 0)])

with open('vial_results.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['drug', 'units', 'vial_1', 'vial_2', 'vial_3'])
    csv_writer.writerows(results)
Running gives:
drug = Act
units = 760.0
Vial_1 = 3.0
Vial_2 = 1.0
drug = dug
units = 280.0
Vial_1 = 1.0
Vial_2 = 1.0
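If the price is actually meant to drive the objective, one possible variant (purely an assumption on my part; the question only lists prices for two vial sizes, so the numbers below are placeholders) is to minimize the total price subject to the same units constraint, e.g. for a single drug:
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

prices = {"Vial_1": 5, "Vial_2": 5, "Vial_3": 5}      # placeholder per-vial prices
sizes = {"Vial_1": 120, "Vial_2": 400, "Vial_3": 0}   # units per vial for 'Act'
needed = 738

prob = LpProblem("Minimum_Cost_Problem", LpMinimize)
vials = {name: LpVariable(name, lowBound=0, cat='Integer') for name in sizes}
prob += lpSum(prices[n] * vials[n] for n in vials)            # objective: total price
prob += lpSum(sizes[n] * vials[n] for n in vials) >= needed   # cover the units needed
prob.solve()
print({n: vials[n].value() for n in vials})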

Matplotlib table with double headers

Hi, is it possible to make a matplotlib table have a "double header" like this
(mind the dashed lines)?
----------------------------------------------
|        |  Feb Total  |      YTD Total      |
----------------------------------------------
|        | 2014 | 2015 | 2014/2015| 2015/2016|
----------------------------------------------
| VVI-ID |  12  |  20  |    188   |    169   |
----------------------------------------------
| TDI-ID |  34  |  45  |    556   |    456   |
----------------------------------------------
You can do this by using other tables with no data as headers. That is, you create empty tables whose column labels will be the headers for your table. Let's consider this demo example. First, add tables header_0 and header_1. Second, adjust the bbox argument of the headers and the table to position all tables correctly. Since the tables overlap, the table with the data should be the last one.
import numpy as np
import matplotlib.pyplot as plt
data = [[ 66386, 174296,  75131, 577908,  32015],
        [ 58230, 381139,  78045,  99308, 160454],
        [ 89135,  80552, 152558, 497981, 603535],
        [ 78415,  81858, 150656, 193263,  69638],
        [139361, 331509, 343164, 781380,  52269]]
columns = ('Freeze', 'Wind', 'Flood', 'Quake', 'Hail')
rows = ['%d year' % x for x in (100, 50, 20, 10, 5)]
values = np.arange(0, 2500, 500)
value_increment = 1000
# Get some pastel shades for the colors
colors = plt.cm.BuPu(np.linspace(0, 0.5, len(rows)))
n_rows = len(data)
index = np.arange(len(columns)) + 0.3
bar_width = 0.4
# Initialize the vertical-offset for the stacked bar chart.
y_offset = np.array([0.0] * len(columns))
# Plot bars and create text labels for the table
cell_text = []
for row in range(n_rows):
    plt.bar(index, data[row], bar_width, bottom=y_offset, color=colors[row])
    y_offset = y_offset + data[row]
    cell_text.append(['%1.1f' % (x / 1000.0) for x in y_offset])
# Reverse colors and text labels to display the last value at the top.
colors = colors[::-1]
cell_text.reverse()
# Add headers and a table at the bottom of the axes
header_0 = plt.table(cellText=[[''] * 2],
                     colLabels=['Extra header 1', 'Extra header 2'],
                     loc='bottom',
                     bbox=[0, -0.1, 0.8, 0.1]
                     )
header_1 = plt.table(cellText=[['']],
                     colLabels=['Just Hail'],
                     loc='bottom',
                     bbox=[0.8, -0.1, 0.2, 0.1]
                     )
the_table = plt.table(cellText=cell_text,
                      rowLabels=rows,
                      rowColours=colors,
                      colLabels=columns,
                      loc='bottom',
                      bbox=[0, -0.35, 1.0, 0.3]
                      )
# Adjust layout to make room for the table:
plt.subplots_adjust(left=0.2, bottom=-0.2)
plt.ylabel("Loss in ${0}'s".format(value_increment))
plt.yticks(values * value_increment, ['%d' % val for val in values])
plt.xticks([])
plt.title('Loss by Disaster')
plt.show()
If the extra header is symmetric, i.e. each extra header spans an equal number of "normal" header columns, all you need to do is add an extra header table and correct the bbox of the data table, like this (the same example with one column deleted):
header = plt.table(cellText=[[''] * 2],
                   colLabels=['Extra header 1', 'Extra header 2'],
                   loc='bottom'
                   )
the_table = plt.table(cellText=cell_text,
                      rowLabels=rows,
                      rowColours=colors,
                      colLabels=columns,
                      loc='bottom',
                      bbox=[0, -0.35, 1.0, 0.3]
                      )

Defining a function to plot a graph from CSV data - Python pandas

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should allow the user to type in a value and then plot a bar graph from the DataFrame. Below:
def analysis_currency_pair():
    x = raw_input("what currency pair would you like to analyse? :")
    print type(x)
    global dataFrame
    df1 = dataFrame
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')
When I call the function, the code asks my question and prints the currency pair. However, it doesn't seem to pass x (the value input by the user) into the later half of the function, so no graph is produced.
Am I doing something wrong here?
This code works when we just put the value in directly, not within a function.
I am confused!
I think you need to rewrite your function with two parameters, x and df, which are passed to analysis_currency_pair:
import pandas as pd

df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
                   "amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
                   "a": pd.Series({1: 7, 2: 8, 3: 9})})
print df
#   a  amount currencyPair
#1  7     2.0       EURUSD
#2  8     2.0       EURGBP
#3  9     3.5       CADUSD

def analysis_currency_pair(x, df1):
    print type(x)
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')

# raw input is EURUSD or EURGBP or CADUSD
pair = raw_input("what currency pair would you like to analyse? :")
analysis_currency_pair(pair, df)
Or you can pass a string to analysis_currency_pair:
import pandas as pd

df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5],
                   "amount1": [5, 4, 3, 2, 1]})
print df
#   amount  amount1 currencyPair
#0       1        5       EURUSD
#1       2        4       EURGBP
#2       3        3       CADUSD
#3       4        2       EURUSD
#4       5        1       EURGBP

def analysis_currency_pair(x, df1):
    print type(x)
    # <type 'str'>
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    print df2
    #               amount
    # currencyPair
    # CADUSD            3
    # EURGBP            7
    # EURUSD            5
    df2 = df2.loc[x].plot(kind='bar')

analysis_currency_pair('CADUSD', df)
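The snippets above are Python 2 (raw_input and print statements). A minimal Python 3 sketch of the same idea, assuming the rest of the logic stays unchanged, might look like this:
import matplotlib.pyplot as plt
import pandas as pd

def analysis_currency_pair(x, df1):
    # group the amounts by currency pair and plot the bar(s) for the chosen pair
    df2 = df1[['currencyPair', 'amount']].groupby('currencyPair').sum()
    df2.loc[x].plot(kind='bar')
    plt.show()

df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5]})
pair = input("what currency pair would you like to analyse? :")  # input() replaces raw_input()
analysis_currency_pair(pair, df)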