Generating one NumPy array for each DataFrame row - pandas

I'm attempting to plot stock market trades against a plot of the particular stock using mplfinance.plot(). I keep a record of all my trades using jstock, which uses a CSV file:
"Code","Symbol","Date","Units","Purchase Price","Current Price","Purchase Value","Current Value","Gain/Loss Price","Gain/Loss Value","Gain/Loss %","Broker","Clearing Fee","Stamp Duty","Net Purchase Value","Net Gain/Loss Value","Net Gain/Loss %","Comment"
"ASO","Academy Sports and Outdoors, Inc.","Sep 13, 2021","25.0","45.85","46.62","1146.25","1165.5","0.769999999999996","19.25","1.6793893129770994","0.0","0.0","0.0","1146.25","19.25","1.6793893129770994",""
"ASO","Academy Sports and Outdoors, Inc.","Aug 26, 2021","15.0","41.3","46.62","619.5","699.3","5.32","79.79999999999995","12.881355932203384","0.0","0.0","0.0","619.5","79.79999999999995","12.881355932203384",""
"ASO","Academy Sports and Outdoors, Inc.","Jun 3, 2021","10.0","37.48","46.62","374.79999999999995","466.2","9.14","91.40000000000003","24.386339381003214","0.0","0.0","0.0","374.79999999999995","91.40000000000003","24.386339381003214",""
"RMBS","Rambus Inc.","Nov 24, 2021","2.0","26.99","26.99","53.98","53.98","0.0","0.0","0.0","0.0","0.0","0.0","53.98","0.0","0.0",""
I can get this data easily enough using
myportfolio = pd.read_csv(PORTFOLIO_LOCATION, parse_dates=[2])
But I need to create individual lists for each trade that match the day-by-day stock price:
Date,High,Low,Open,Close,Volume,Adj Close
2020-12-01,17.020000457763672,16.5,16.799999237060547,16.8799991607666,990900,16.8799991607666
2020-12-02,17.31999969482422,16.290000915527344,16.65999984741211,16.40999984741211,1200500,16.40999984741211
and I have a normal DataFrame containing this. So far this is what I have:
for i in myportfolio.groupby("Code"):
    (code, j) = i
    if code == "ASO":  # just testing it against one stock
        simp = pd.DataFrame(columns=["Date", "Units", "Price"],
                            data=j[["Date", "Units", "Purchase Price"]].values,
                            index=j[["Date"]])
        df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
        # df.lookup(simp["Date"])
        df.insert(0, 'row_num', range(0, len(df)))
        k = df.loc[simp["Date"]]['row_num']
        trades = []
        for index, m in k.iteritems():
            t = np.zeros((df.shape[0], 1))
            t.fill(np.nan)
            t[m] = simp[index]["Price"]
            trades.append(t.tolist())
But I receive a KeyError: Timestamp('2021-09-17 00:00:00')
Any ideas of how to fix this?
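A KeyError from df.loc like this usually means one of the trade dates is absent from the price index, e.g. a weekend, a market holiday, or a date outside the CSV's range. A quick check, reusing the simp and df from above:

# list the trade dates that have no matching row in the price data
missing = simp["Date"][~simp["Date"].isin(df.index)]
print(missing)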
Addendum 1:
import pandas as pd
trade_data = [['ASO', '5/5/21', 10], ['ASO', '5/6/21', 12], ['RBLX', '5/7/21', 15]]
trade_df = pd.DataFrame(trade_data, columns = ['Code', 'Date', 'Price'])
trade_df['Date'] = pd.to_datetime(trade_df['Date'])
trade_df
Code Date Price
0 ASO 2021-05-05 10
1 ASO 2021-05-06 12
2 RBLX 2021-05-07 15
aso_data = [['5/5/21', 12, 5, 10, 7], ['5/6/21', 15, 7, 13, 8], ['5/7/21', 17, 10, 15, 11]]
aso_df = pd.DataFrame(aso_data, columns = ['Date', 'High', 'Low', 'Open', 'Close'])
aso_df['Date'] = pd.to_datetime(aso_df['Date'])
aso_df
Date High Low Open Close
0 2021-05-05 12 5 10 7
1 2021-05-06 15 7 13 8
2 2021-05-07 17 10 15 11
So I want to create two NumPy arrays for ASO (one for each trade) and one for the RBLX trade. For ASO I should have two NumPy arrays that look like [10, NaN, NaN] and [NaN, 12, NaN].
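For this toy data, a minimal sketch of one way to build those arrays, assuming the trade_df and aso_df just defined: map each price date to its row position, then fill a NaN array per trade.

import numpy as np

# map each price date to its row position once
date_to_row = {d: i for i, d in enumerate(aso_df['Date'])}

arrays = []
for _, row in trade_df[trade_df['Code'] == 'ASO'].iterrows():
    t = np.full(len(aso_df), np.nan)            # NaN everywhere...
    t[date_to_row[row['Date']]] = row['Price']  # ...except the trade's row
    arrays.append(t)
# arrays -> [array([10., nan, nan]), array([nan, 12., nan])]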

You want a list of lists, right?
There is no need to loop:
df_list = df.values.tolist()
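For example, on the small aso_df above this gives one inner list per row:

df_list = aso_df.values.tolist()
# [[Timestamp('2021-05-05 00:00:00'), 12, 5, 10, 7], ...]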

Just in case another novice such as myself surfs in with a similar problem:
import numpy as np
import pandas as pd
import mplfinance as mpf

for i in myportfolio.groupby(["Code"]):
    (code, j) = i
    if code == "ASO":  # just testing it against one stock
        df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
        df.insert(0, 'row_num', range(0, len(df)))
        trades = []
        for index, m in j.iterrows():
            # one NaN-filled column per trade, with the purchase price on the trade date
            t = np.zeros((df.shape[0], 1))
            t.fill(np.nan)
            t[int(df.loc[m["Date"]]['row_num'])] = m["Purchase Price"]
            asplot = mpf.make_addplot(t, type="scatter", color='red', marker="D")
            trades.append(asplot)
        mpf.plot(df, type='candle', addplot=trades)
This produced an okay graph showing my entry points. Good luck!
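A slightly tidier variant of the same inner loop, as a sketch (same df, j, and trades as above): building each overlay as a pandas Series aligned to df's index avoids the row_num bookkeeping, since make_addplot also accepts a Series.

for _, m in j.iterrows():
    # NaN everywhere except the trade date, aligned to the price index
    t = pd.Series(np.nan, index=df.index)
    t[m["Date"]] = m["Purchase Price"]
    trades.append(mpf.make_addplot(t, type="scatter", color='red', marker="D"))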

Related

How can I add several columns within a dataframe (broadcasting)?

import numpy as np
import pandas as pd
data = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]
df = pd.DataFrame(data = data, index = ['A', 'B', 'C'], columns = ['Bulgary', 'Robbery', 'Car Theft'])
df
I get the following:
   Bulgary  Robbery  Car Theft
A       30       19          6
B       12       23         14
C        8       18         20
I would like to assign:
df['Total'] = df['Bulgary'] + df['Robbery'] + df['Car Theft']
But does this operation have to be done manually? I am looking for a function that can handle this conveniently.
#pseudocode
#df['Total'] = df.Some_Column_Adding_Function([0:3])
#df['Total'] == df['Bulgary'] + df['Robbery'] + df['Car Theft'] returns True
Similarly, how do I add across rows?
Use sum:
df['Total'] = df.sum(axis=1)
Or if you want subset of columns:
df['Total'] = df[df.columns[0:3]].sum(axis=1)
# or df['Total'] = df[['Bulgary', 'Robbery', 'Car Theft']].sum(axis=1)
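The question also asks how to add across rows; that is the same call on the other axis. A minimal sketch on the same df (axis=0 sums down the rows, giving one total per column):

col_totals = df[['Bulgary', 'Robbery', 'Car Theft']].sum(axis=0)
# Bulgary 50, Robbery 60, Car Theft 40
df.loc['Total'] = col_totals  # optionally append as a totals row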

How to show rows with data which are not equal?

I have two tables
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df1 = pd.DataFrame(np.array([[1, 2, 4], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
print(df1.equals(df2))
I want to compare them. I want the same result as I would get from df.compare(df1), or at least something close to it. I can't use the above function as my interpreter states that 'DataFrame' object has no attribute 'compare'.
First approach:
Let's compare value by value:
In [1183]: eq_df = df1.eq(df2)
In [1196]: eq_df
Out[1200]:
a b c
0 True True False
1 True True True
2 True True True
Then let's reduce it down to see which rows are equal across all columns:
from functools import reduce
In [1285]: eq_ser = reduce(np.logical_and, (eq_df[c] for c in eq_df.columns))
In [1288]: eq_ser
Out[1293]:
0 False
1 True
2 True
dtype: bool
Now we can print out the rows which are not equal
In [1310]: df1[~eq_ser]
Out[1315]:
a b c
0 1 2 4
In [1316]: df2[~eq_ser]
Out[1316]:
a b c
0 1 2 3
Second approach:
from collections import namedtuple
from typing import Tuple

def diff_dataframes(
    df1, df2, compare_cols=None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Given two dataframes and column(s) to compare, return three dataframes with rows:
    - common between the two dataframes
    - found only in the left dataframe
    - found only in the right dataframe
    """
    df1 = df1.fillna(pd.NA)
    df = df1.merge(df2.fillna(pd.NA), how="outer", on=compare_cols, indicator=True)
    df_both = df.loc[df["_merge"] == "both"].drop(columns="_merge")
    df_left = df.loc[df["_merge"] == "left_only"].drop(columns="_merge")
    df_right = df.loc[df["_merge"] == "right_only"].drop(columns="_merge")
    tup = namedtuple("df_diff", ["common", "left", "right"])
    return tup(df_both, df_left, df_right)
Usage:
In [1366]: b, l, r = diff_dataframes(df1, df2)
In [1371]: l
Out[1371]:
a b c
0 1 2 4
In [1372]: r
Out[1372]:
a b c
3 1 2 3
Third approach:
In [1440]: eq_ser = df1.eq(df2).sum(axis=1).eq(len(df1.columns))
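Both reductions above can be written more compactly with DataFrame.all, and on pandas >= 1.1 (the asker's version is evidently older) the built-in compare does this directly:

# equivalent to the reduce and sum-based versions:
# a row is equal only if every column matches
eq_ser = df1.eq(df2).all(axis=1)
df1[~eq_ser]  # the differing rows

# on pandas >= 1.1 there is a one-liner:
# df1.compare(df2)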

Sorted MultiIndex DataFrame indexing by index number

I have a MultiIndex DataFrame as follows:
header = pd.MultiIndex.from_product([['#'],
                                     ['TE', 'SS', 'M', 'MR']])
dat = [[100, 20, 21, 35], [100, 12, 5, 15]]
df = pd.DataFrame(dat, index=['JC', 'TTo'], columns=header)
df = df.stack()
df = df.sort_values('#', ascending=False).sort_index(level=0, sort_remaining=False)
And I want to get the following rows by index number rather than by name, that is, the third row of every level-0 index:
JC M 21
TTo SS 12
Of all that I have tried, this is the closest to what I am looking for:
df.loc[pd.IndexSlice[:, df.index[2]], '#']
But this doesn't work as intended either.
You can do the following:
df["idx"] = df[df.groupby(level=0).cumcount() == 2]
df.loc[df.idx == 2]
One line solution from Quang Hoang:
df[df.groupby(level=0).cumcount() == 2]
Another way using df.xs:
df.set_index(df.groupby(level=0).cumcount()+1,append=True).xs(3,level=2)
#
JC M 21
TTo SS 12
Try with groupby:
out = df.groupby(level=0).apply(lambda x: x.iloc[[2]])
Out[141]:
#
JC JC M 21
TTo TTo SS 12
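One more equivalent, as a sketch against the same stacked df: GroupBy.nth selects the n-th row of each level-0 group (0-based, matching the cumcount() == 2 filter above). Note that depending on the pandas version, nth either keeps the original MultiIndex or reindexes by the group keys:

out = df.groupby(level=0).nth(2)
# JC   M    21
# TTo  SS   12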

Slicing of a Pandas Series when index elements are not default (doesn't start with 0)

I created a Pandas Series in Python 3.7, providing 'data' and 'index', where the data is a list of lists with len(list) = 6, and the index list starts from 3 rather than from 0.
I want to slice the series.
import pandas as pd
li_a = [[1,2],[3,4],[5,6],[7,8],(9,10),(11,12)]
li_c = [3,4,5,6,7,8]
ser1 = pd.Series(data=li_a,index=li_c)
So ser1[3] outputs [1, 2], i.e. the first element of the Series.
I expected the output of ser1[3:] to be the entire Series, but the output was
6 [7, 8]
7 (9, 10)
8 (11, 12)
dtype: object
It works that way because you are slicing by row position, not by index label:
print(ser1[3:])
output:
6 [7, 8]
7 (9, 10)
8 (11, 12)
If you want to select rows starting from a specific index label, you need to use loc:
print(ser1.loc[3:])
output:
3 [1, 2]
4 [3, 4]
5 [5, 6]
6 [7, 8]
7 (9, 10)
8 (11, 12)
Edited: changed iloc to loc.
loc gets rows (or columns) with particular labels from the index.
Your full code (I have also changed your if __name__ line):
import numpy as np
import pandas as pd

def main():
    arr = np.arange(10, 16)
    index1 = np.arange(3, 9)
    ser1 = pd.Series(data=arr, index=index1)
    print(ser1)
    print(ser1.loc[3:])

if __name__ == "__main__":
    main()
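To make the label/position distinction explicit, a small contrast using the ser1 from the original question:

print(ser1.iloc[3:])  # by position: the last three rows, labels 6, 7, 8
print(ser1.loc[3:])   # by label: from label 3 onward, i.e. the whole Series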

Defining a function to plot a graph from CSV data - Python pandas

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should allow the user to type in a value, then plot a bar graph from the dataFrame. The code below:
def analysis_currency_pair():
    x = raw_input("what currency pair would you like to analysie ? :")
    print type(x)
    global dataFrame
    df1 = dataFrame
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')
When I call the function, it prints my prompt along with the currency pair entered. However, it doesn't seem to pass x (the value input by the user) into the latter half of the function, and so no graph is produced.
Am I doing something wrong here?
This code works when I put the value in directly rather than within a function.
I am confused!
I think you need to rewrite your function with two parameters, x and df1, which are passed to analysis_currency_pair:
import pandas as pd

df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
                   "amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
                   "a": pd.Series({1: 7, 2: 8, 3: 9})})
print df
#    a  amount currencyPair
# 1  7     2.0       EURUSD
# 2  8     2.0       EURGBP
# 3  9     3.5       CADUSD

def analysis_currency_pair(x, df1):
    print type(x)
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')

# raw input is EURUSD or EURGBP or CADUSD
pair = raw_input("what currency pair would you like to analysie ? :")
analysis_currency_pair(pair, df)
Or you can pass a string to analysis_currency_pair:
import pandas as pd

df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5],
                   "amount1": [5, 4, 3, 2, 1]})
print df
#    amount  amount1 currencyPair
# 0       1        5       EURUSD
# 1       2        4       EURGBP
# 2       3        3       CADUSD
# 3       4        2       EURUSD
# 4       5        1       EURGBP

def analysis_currency_pair(x, df1):
    print type(x)
    # <type 'str'>
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    print df2
    #               amount
    # currencyPair
    # CADUSD             3
    # EURGBP             7
    # EURUSD             5
    df2 = df2.loc[x].plot(kind='bar')

analysis_currency_pair('CADUSD', df)
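The answers above are Python 2; a minimal Python 3 sketch of the same approach, with print() and input() replacing the print statement and raw_input:

import matplotlib.pyplot as plt
import pandas as pd

def analysis_currency_pair(x, df1):
    # sum the traded amount per currency pair, then bar-plot the requested pair
    df2 = df1[['currencyPair', 'amount']].groupby('currencyPair').sum()
    df2.loc[x].plot(kind='bar')
    plt.show()

df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5]})
pair = input("what currency pair would you like to analyse? : ")
analysis_currency_pair(pair, df)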