How to Index Pandas Dataframe with a List of Slices - pandas

I have two data frames, ret and bins. I would like to take the index values from bins, create a range for every row in bins, and then use that list of ranges to select data from ret. This works when I pass slices typed in manually to np.r_, but it fails when I pass the same slices as a list saved in the variable a.
import numpy as np
import pandas as pd
ret = pd.DataFrame({'px': [.1, -.15, .30, -.20, .05]})
bins = pd.DataFrame({'t1': [3,4]}, index=[1,2])
a = []
for i, b in bins.iterrows():
    a.append(slice(i, b.t1))
print('a',a)
print('np.r_[a]',np.r_[a])
print('np.r[slice',np.r_[slice(1, 3, None) , slice(1, 4, None)])
print(ret.iloc[np.r_[slice(1, 3, None) , slice(1, 4, None)]]) # this WORKS
print(ret.iloc[a])  # this DOES NOT WORK
Here is the output:
a [slice(1, 3, None), slice(2, 4, None)]
np.r_[a] [slice(1, 3, None) slice(2, 4, None)]
np.r[slice [1 2 1 2 3]
px
1 -0.15
2 0.30
1 -0.15
2 0.30
3 -0.20
...
TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'

Answering my own question: slice objects turned out to be too cumbersome here. It is easier to collect the index labels for each range and flatten them into a single list. If anyone has other suggestions, please post them!
ret = pd.DataFrame({'px': [.1, -.15, .30, -.20, .05]})
bins = pd.DataFrame({'t1': [3,4]}, index=[1,2])
a = [ret[i:b.t1].index for i, b in bins.iterrows()]
out = [item for sublist in a for item in sublist]
print(ret.loc[out])
Output:
     px
1 -0.15
2  0.30
2  0.30
3 -0.20
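The original attempt fails because np.r_[a] receives the whole list as a single object, so the slices are never expanded (the printed np.r_[a] still contains slice objects). Indexing with np.r_[s1, s2] is the same as passing a tuple, so converting the list to a tuple should restore the behaviour of the manually typed version. A minimal sketch, assuming the same ret and bins as above (the names slices, positions and selected are illustrative):

```python
import numpy as np
import pandas as pd

ret = pd.DataFrame({'px': [.1, -.15, .30, -.20, .05]})
bins = pd.DataFrame({'t1': [3, 4]}, index=[1, 2])

# One slice per row of bins: from the index value up to (not including) t1.
slices = [slice(int(start), int(stop)) for start, stop in bins['t1'].items()]

# np.r_[s1, s2] is np.r_.__getitem__((s1, s2)), so pass a tuple, not a list.
positions = np.r_[tuple(slices)]

# positions is a flat integer array, which iloc accepts.
selected = ret.iloc[positions]
print(selected)
```

This selects by position, like the iloc call in the question; the flattening answer above selects by label with loc, which gives the same rows here because ret has a default integer index.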

Generating one NumPy array for each DataFrame row

I'm attempting to plot stock market trades against a plot of the particular stock using mplfinance.plot(). I keep a record of all my trades using jstock, which stores them in a CSV file:
"Code","Symbol","Date","Units","Purchase Price","Current Price","Purchase Value","Current Value","Gain/Loss Price","Gain/Loss Value","Gain/Loss %","Broker","Clearing Fee","Stamp Duty","Net Purchase Value","Net Gain/Loss Value","Net Gain/Loss %","Comment"
"ASO","Academy Sports and Outdoors, Inc.","Sep 13, 2021","25.0","45.85","46.62","1146.25","1165.5","0.769999999999996","19.25","1.6793893129770994","0.0","0.0","0.0","1146.25","19.25","1.6793893129770994",""
"ASO","Academy Sports and Outdoors, Inc.","Aug 26, 2021","15.0","41.3","46.62","619.5","699.3","5.32","79.79999999999995","12.881355932203384","0.0","0.0","0.0","619.5","79.79999999999995","12.881355932203384",""
"ASO","Academy Sports and Outdoors, Inc.","Jun 3, 2021","10.0","37.48","46.62","374.79999999999995","466.2","9.14","91.40000000000003","24.386339381003214","0.0","0.0","0.0","374.79999999999995","91.40000000000003","24.386339381003214",""
"RMBS","Rambus Inc.","Nov 24, 2021","2.0","26.99","26.99","53.98","53.98","0.0","0.0","0.0","0.0","0.0","0.0","53.98","0.0","0.0",""
I can get this data easily enough using
myportfolio = pd.read_csv(PORTFOLIO_LOCATION, parse_dates=[2])
But I need to create individual lists for each trade that match the day-by-day stock price:
Date,High,Low,Open,Close,Volume,Adj Close
2020-12-01,17.020000457763672,16.5,16.799999237060547,16.8799991607666,990900,16.8799991607666
2020-12-02,17.31999969482422,16.290000915527344,16.65999984741211,16.40999984741211,1200500,16.40999984741211
and I have a normal DataFrame containing this. So far this is what I have:
for i in myportfolio.groupby("Code"):
    (code, j) = i
    if code == "ASO": # just testing it against one stock
        simp = pd.DataFrame(columns=["Date", "Units", "Price"],
                            data=j[["Date", "Units", "Purchase Price"]].values, index=j[["Date"]])
        df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
        # df.lookup(simp["Date"])
        df.insert(0, 'row_num', range(0,len(df)))
        k = df.loc[simp["Date"]]['row_num']
        trades = []
        for index, m in k.iteritems():
            t = np.zeros((df.shape[0], 1))
            t.fill(np.nan)
            t[m] = simp[index]["Price"]
            trades.append(t.to_list())
But I receive a KeyError: Timestamp('2021-09-17 00:00:00')
Any ideas of how to fix this?
Addendum 1:
import pandas as pd
trade_data = [['ASO', '5/5/21', 10], ['ASO', '5/7/21', 12], ['RBLX', '5/7/21', 15]]
trade_df = pd.DataFrame(trade_data, columns = ['Code', 'Date', 'Price'])
trade_df['Date'] = pd.to_datetime(trade_df['Date'])
trade_df
Code Date Price
0 ASO 2021-05-05 10
1 ASO 2021-05-07 12
2 RBLX 2021-05-07 15
aso_data = [['5/5/21', 12, 5, 10, 7], ['5/6/21', 15, 7, 13, 8], ['5/7/21', 17, 10, 15, 11]]
aso_df = pd.DataFrame(aso_data, columns = ['Date', 'High', 'Low', 'Open', 'Close'])
aso_df['Date'] = pd.to_datetime(aso_df['Date'])
aso_df
Date High Low Open Close
0 2021-05-05 12 5 10 7
1 2021-05-06 15 7 13 8
2 2021-05-07 17 10 15 11
So I want to create two NumPy arrays for ASO (one for each trade) and one for the RBLX trade. For ASO I should have two NumPy arrays that look like [10, NaN, NaN] and [NaN, NaN, 12].
You want a list of lists, right?
There is no need to loop.
df_list = df.values.tolist()
Posting what worked for me, just in case another novice such as myself surfs in with a similar problem.
import numpy as np
import pandas as pd
import mplfinance as mpf
# myportfolio is the DataFrame read from the jstock CSV in the question.
# Group by the string "Code" (not ["Code"]) so that code is a plain string.
for i in myportfolio.groupby("Code"):
    (code, j) = i
    if code == "ASO": # just testing it against one stock
        df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
        df.insert(0, 'row_num', range(0,len(df)))
        k = df.loc[j["Date"]]['row_num']
        trades = []
        for index, m in j.iterrows():
            # one NaN-filled column per trade, with the price at the trade date's row
            t = np.zeros((df.shape[0], 1))
            t.fill(np.nan)
            t[int(df.loc[m["Date"]]['row_num'])] = m["Purchase Price"]
            asplot = mpf.make_addplot(t, type="scatter", color='red', marker="D")
            trades.append(asplot)
        mpf.plot(df, type='candle', addplot=trades)
This produced an okay graph showing my entry points. Good luck!
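For the Addendum data, one way to build the per-trade arrays without looking up each row individually is sketched below. This is only an illustration, not the poster's method: the helper name trade_arrays is made up, and skipping trades whose date is absent from the price data is an assumption about the desired behaviour. It relies on Index.get_indexer, which returns -1 for missing dates instead of raising a KeyError.

```python
import numpy as np
import pandas as pd

trade_df = pd.DataFrame({
    "Code": ["ASO", "ASO", "RBLX"],
    "Date": pd.to_datetime(["2021-05-05", "2021-05-07", "2021-05-07"]),
    "Price": [10, 12, 15],
})
# Dates of the daily price data (the index of the OHLC frame).
price_dates = pd.DatetimeIndex(["2021-05-05", "2021-05-06", "2021-05-07"])

def trade_arrays(trades, dates):
    """Return one NaN-filled array per trade, with the price at the trade date."""
    positions = dates.get_indexer(trades["Date"])  # -1 where the date is missing
    arrays = []
    for pos, price in zip(positions, trades["Price"]):
        if pos == -1:
            continue  # assumption: ignore trades on dates without price data
        arr = np.full(len(dates), np.nan)
        arr[pos] = price
        arrays.append(arr)
    return arrays

aso_arrays = trade_arrays(trade_df[trade_df["Code"] == "ASO"], price_dates)
rblx_arrays = trade_arrays(trade_df[trade_df["Code"] == "RBLX"], price_dates)
print(aso_arrays)
```

Each resulting array has the same length as the price data, so it can be passed to mpf.make_addplot in the same way as t in the loop above.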

Concatenating 2 dataframes vertically with empty row in middle

I have a DataFrame df1 with MultiIndex columns:
node     A1    A2
bkt      B1    B2
Month
1      0.15 -0.83
2      0.06 -0.12
bs.columns
MultiIndex([('A1', 'B1'),
            ('A2', 'B2')],
           names=['node', 'bkt'])
and another, similar MultiIndex DataFrame df2:
node     A1    A2
bkt      B1    B2
Month
1     -0.02 -0.15
2         0     0
3     -0.01 -0.01
4     -0.06 -0.11
I want to concat them vertically so that resulting dataframe df3 looks as following:
df3 = pd.concat([df1, df2], axis=0)
While concatenating, I want to insert two blank rows between df1 and df2. In addition, I want to add the two title strings Basis Mean and Basis P25 to df3, as shown below.
print(df3)
Basis Mean
node     A1    A2
bkt      B1    B2
Month
1      0.15 -0.83
2      0.06 -0.12


Basis P25
node     A1    A2
bkt      B1    B2
Month
1     -0.02 -0.15
2         0     0
3     -0.01 -0.01
4     -0.06 -0.11
I don't know whether there is any way of doing the above.
I don't think that that is an actual concatenation you are talking about.
The following could already do the trick:
print('Basis Mean')
print(df1.to_string())
print('\n')
print('Basis P25')
print(df2.to_string())
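The print-based approach above can be wrapped in a small helper that returns the whole report as one string, which also makes it easy to write to a text file. This is a sketch under assumptions: the name render_blocks and the n_blank parameter are invented for illustration, and df1 and df2 are rebuilt here from the values shown in the question.

```python
import pandas as pd

def render_blocks(blocks, n_blank=2):
    """Join (title, DataFrame) pairs into one string with blank lines between blocks."""
    parts = [f"{title}\n{frame.to_string()}" for title, frame in blocks]
    # n_blank empty lines need n_blank + 1 newline characters between blocks.
    return ("\n" * (n_blank + 1)).join(parts)

cols = pd.MultiIndex.from_tuples([("A1", "B1"), ("A2", "B2")], names=["node", "bkt"])
df1 = pd.DataFrame([[0.15, -0.83], [0.06, -0.12]],
                   index=pd.Index([1, 2], name="Month"), columns=cols)
df2 = pd.DataFrame([[-0.02, -0.15], [0, 0], [-0.01, -0.01], [-0.06, -0.11]],
                   index=pd.Index([1, 2, 3, 4], name="Month"), columns=cols)

report = render_blocks([("Basis Mean", df1), ("Basis P25", df2)])
print(report)
```

The two frames stay separate objects, so their dtypes and MultiIndex columns are untouched; only the printed layout is combined.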
This isn't usually how DataFrames are used, but perhaps you wish to put rows of empty strings between df1 and df2, along with rows containing your titles? DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat; each extra row has one value per column of df1:
title1 = pd.DataFrame([["Basis Mean", ""]], columns=df1.columns)
blank = pd.DataFrame([["", ""]] * 2, columns=df1.columns)  # two empty rows
title2 = pd.DataFrame([["Basis P25", ""]], columns=df1.columns)
df3 = pd.concat([title1, df1, blank, title2, df2], axis=0, ignore_index=True)
The author clarified in a comment that the goal is to make the result easy to write to an Excel file. This can be achieved using pd.ExcelWriter.
Below is an example of how to do it.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import pandas as pd
@dataclass
class SaveTask:
    df: pd.DataFrame
    header: Optional[str]
    extra_pd_settings: Optional[Dict[str, Any]] = None
def fill_xlsx(
    save_tasks: List[SaveTask],
    writer: pd.ExcelWriter,
    sheet_name: str = "Sheet1",
    n_rows_between_blocks: int = 2,
) -> None:
    current_row = 0
    for save_task in save_tasks:
        extra_pd_settings = save_task.extra_pd_settings or {}
        if "startrow" in extra_pd_settings:
            raise ValueError(
                "You should not use parameter 'startrow' in extra_pd_settings"
            )
        # The table starts one row below its title.
        save_task.df.to_excel(
            writer,
            sheet_name=sheet_name,
            startrow=current_row + 1,
            **extra_pd_settings
        )
        worksheet = writer.sheets[sheet_name]
        worksheet.write(current_row, 0, save_task.header)
        has_header = extra_pd_settings.get("header", True)
        # Advance past the title, the data rows, the column header and the gap.
        current_row += (
            1 + save_task.df.shape[0] + n_rows_between_blocks + int(has_header)
        )
if __name__ == "__main__":
    # INPUTS
    df1 = pd.DataFrame(
        {"hello": [1, 2, 3, 4], "world": [0.55, 1.12313, 23.12, 0.0]}
    )
    df2 = pd.DataFrame(
        {"foo": [3, 4]},
        index=pd.MultiIndex.from_tuples([("foo", "bar"), ("baz", "qux")]),
    )
    # Xlsx creation
    writer = pd.ExcelWriter("test.xlsx", engine="xlsxwriter")
    fill_xlsx(
        [
            SaveTask(
                df1,
                "Hello World Table",
                {"index": False, "float_format": "%.3f"},
            ),
            SaveTask(df2, "Foo Table with MultiIndex"),
        ],
        writer,
    )
    writer.close()  # writer.save() was removed in pandas 2.0
As a bonus, pd.ExcelWriter lets you save data to different sheets in the Excel file and choose their names right from Python code.

How do you strip out only the integers of a column in pandas?

I am trying to extract only the numeric values, which are the first one or two digits. Some values in the column are pure strings and others contain special characters.
[image: value counts of the Size column]
I have tried multiple methods:
breaks['_Size'] = breaks['Size'].fillna(0)
breaks[breaks['_Size'].astype(str).str.isdigit()]
breaks['_Size'] = breaks['_Size'].replace('\*','',regex=True).astype(float)
breaks['_Size'] = breaks['_Size'].str.extract('(\d+)').astype(int)
breaks['_Size'].map(lambda x: x.rstrip('aAbBcC'))
None of these work. The dtype is object. To be clear, I am attempting to make a new column with only the digits (as an int/float), and if I could convert the fractions to decimals, that would be a bonus.
This works for dividing fractions and also allows for extra numbers to be present in the string (it returns you just the first sequence of numbers):
In [60]: import pandas as pd
In [61]: import re
In [62]: df = pd.DataFrame([0, "6''", '7"', '8in', 'text', '3/4"', '1a3'], columns=['_Size'])
In [63]: df
Out[63]:
_Size
0 0
1 6''
2 7"
3 8in
4 text
5 3/4"
6 1a3
In [64]: def cleaning_function(row):
    ...:     row = str(row)
    ...:     fractions = re.findall(r'(\d+)/(\d+)', row)
    ...:     if fractions:
    ...:         return float(int(fractions[0][0])/int(fractions[0][1]))
    ...:     numbers = re.findall(r'[0-9]+', str(row))
    ...:     if numbers:
    ...:         return numbers[0]
    ...:     return 0
    ...:
In [65]: df._Size.apply(cleaning_function)
Out[65]:
0 0
1 6
2 7
3 8
4 0
5 0.75
6 1
Name: _Size, dtype: object
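Note that the result above has dtype object, because the function returns a mix of strings, ints and floats. If a numeric column is wanted, as the question asks, a variant that always returns a float could look like the sketch below. The function name to_number and the new column name _Size_num are illustrative, and treating non-numeric text as 0.0 simply follows the answer above.

```python
import re
import pandas as pd

def to_number(value):
    """Return the first fraction or integer found in value as a float, else 0.0."""
    text = str(value)
    fraction = re.search(r'(\d+)/(\d+)', text)
    if fraction and int(fraction.group(2)) != 0:
        return int(fraction.group(1)) / int(fraction.group(2))
    number = re.search(r'\d+', text)
    return float(number.group()) if number else 0.0

df = pd.DataFrame({'_Size': [0, "6''", '7"', '8in', 'text', '3/4"', '1a3']})
df['_Size_num'] = df['_Size'].apply(to_number)
print(df)
print(df['_Size_num'].dtype)
```

Because every branch returns a float, the new column is float64 and can be used directly in arithmetic or aggregation.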

numpy get top k elements from last dimension of ndarray

I have a multidimensional array, and I need to get the top k elements from each row of the last dimension.
>>> x = np.random.random_integers(0, 100, size=(2,1,1,5))
>>> x
array([[[[99, 39, 10, 18, 68]]],
[[[22, 3, 13, 56, 2]]]])
I'm trying to get:
array([[[[ 99., 68.]]],
[[[ 56., 22.]]]])
I can get the indices using the following, but I'm not sure how to slice out the values.
>>> k = 2
>>> parts = np.flip(-1 - np.arange(k), 0)
>>> indices = np.flip(
... np.argpartition(x, parts, axis=-1)[..., -k:],
... axis=-1)
>>> indices
array([[[[0, 4]]],
[[[3, 0]]]])
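This could solve your problem:
np.sort(x, axis=-1)[..., -2:]
Alternatively,
np.partition(x, -2, axis=-1)[..., -2:]
returns the 2 largest elements from each row without fully sorting it. Note that kth should be -k (here -2) rather than k: np.partition(x, 2)[..., -2:] is only guaranteed to be correct for particular lengths of the last axis, such as 5 here. E.g.,
x = np.random.randint(0, 101, size=(2,1,1,5))
print(x)
print(np.partition(x, -2, axis=-1)[..., -2:])
prints something like
[[[[79 34 90 80 56]]]
[[[78 11 24 20 42]]]]
[[[[80 90]]]
[[[42 78]]]]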
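To go from the indices computed in the question to the values themselves, np.take_along_axis can gather along the last axis. The sketch below combines both steps and returns the top k in descending order; the helper name top_k_last_axis is illustrative, and negating the values to sort in descending order assumes a signed or floating-point dtype.

```python
import numpy as np

def top_k_last_axis(x, k):
    """Return (values, indices) of the k largest entries along the last axis, largest first."""
    # Indices of the k largest entries, in no particular order.
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    vals = np.take_along_axis(x, idx, axis=-1)
    # Order just those k entries from largest to smallest.
    order = np.argsort(-vals, axis=-1)
    return (np.take_along_axis(vals, order, axis=-1),
            np.take_along_axis(idx, order, axis=-1))

x = np.array([[[[99, 39, 10, 18, 68]]],
              [[[22, 3, 13, 56, 2]]]])
values, indices = top_k_last_axis(x, 2)
print(values)
print(indices)
```

Only the k selected entries are sorted, so this stays cheaper than a full np.sort when the last axis is long and k is small.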

Defining a function to plot a graph from CSV data - Python pandas

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should let the user type in a value and then plot a bar graph from the DataFrame. The code is below:
def analysis_currency_pair():
    x = input("what currency pair would you like to analyse ? :")
    print(type(x))
    global dataFrame
    df1 = dataFrame
    df2 = df1[['currencyPair','amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind = 'bar')
When I call the function, the code prints my question and the type of the currency pair I entered. However, it doesn't seem to pass x (the value input by the user) into the latter half of the function, and so no graph is produced.
Am I doing something wrong here?
This code works when we just put the value in, and not within a function.
I am confused!
I think you need to rewrite your function with two parameters, x and df1, which are passed to analysis_currency_pair instead of relying on a global:
import pandas as pd
df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
                   "amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
                   "a": pd.Series({1: 7, 2: 8, 3: 9})})
print(df)
#  currencyPair  amount  a
#1       EURUSD     2.0  7
#2       EURGBP     2.0  8
#3       CADUSD     3.5  9
def analysis_currency_pair(x, df1):
    print(type(x))
    df2 = df1[['currencyPair','amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind = 'bar')
# user input is EURUSD or EURGBP or CADUSD
pair = input("what currency pair would you like to analyse ? :")
analysis_currency_pair(pair, df)
Or you can pass a string directly to the function analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": [ 'EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [ 1, 2, 3, 4, 5],
                   "amount1": [ 5, 4, 3, 2, 1]})
print(df)
#  currencyPair  amount  amount1
#0       EURUSD       1        5
#1       EURGBP       2        4
#2       CADUSD       3        3
#3       EURUSD       4        2
#4       EURGBP       5        1
def analysis_currency_pair(x, df1):
    print(type(x))
    #<class 'str'>
    df2 = df1[['currencyPair','amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    print(df2)
    #              amount
    #currencyPair
    #CADUSD             3
    #EURGBP             7
    #EURUSD             5
    df2 = df2.loc[x].plot(kind = 'bar')
analysis_currency_pair('CADUSD', df)
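It can also help to separate the lookup from the plotting, so that a mistyped currency pair produces a clear error instead of a missing graph. The following is a minimal sketch rather than part of the answer above: the function name amount_for_pair and the error message are made up for illustration, and the plotting call is left as a comment so that the example runs without matplotlib.

```python
import pandas as pd

def amount_for_pair(pair, frame):
    """Return the summed amount for one currency pair as a one-element Series."""
    totals = frame.groupby('currencyPair')['amount'].sum()
    if pair not in totals.index:
        raise KeyError(f"unknown currency pair: {pair!r}")
    # Selecting with a list keeps the pair as the index label of the result.
    return totals.loc[[pair]]

df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5],
                   "amount1": [5, 4, 3, 2, 1]})

result = amount_for_pair('EURUSD', df)
print(result)
# To draw the bar chart (requires matplotlib):
# result.plot(kind='bar')
```

When running as a script rather than in a notebook, the figure usually only appears after calling matplotlib.pyplot.show().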