Speeding up conversion of pandas to numpy - pandas

I've got a Pandas df of approximately 2.5m rows, with a multi-index of the form:
('assetCode', 'date') and approximately 60 columns.
I'm trying to convert this to a 3D numpy matrix:
assetCodes = X_calculated.index.get_level_values(0).unique().sort_values().to_numpy()
dates = X_calculated.index.get_level_values(1).unique().sort_values().to_numpy()
columns = X_calculated.columns.to_numpy()
myData = np.empty((assetCodes.size, dates.size, columns.size))
def updateMatrix(row):
idx = row.name
assetLabel = np.searchsorted(assetCodes, idx[0])
dateLabel = np.where(dates == idx[1])
myData[assetLabel][dateLabel] = row.to_numpy()
X_calculated.apply(updateMatrix, axis=1)
This operation takes a very long time. Is there a quicker way?

I think if you already have all the combinations of assetCode and date in your dataframe, you can do it whit reshape:
# example data
X_calculated = pd.DataFrame(np.arange(36).reshape(9, -1),
index=pd.MultiIndex.from_product([range(101,104),
range(111,114)],
names=('assetCode','date')),
columns=list('abcd'))
# get dimensions
nb_asset = X_calculated.index.get_level_values(0).nunique()
nb_dates = X_calculated.index.get_level_values(1).nunique()
nb_cols = len(X_calculated.columns)
# create myData
myData = X_calculated.sort_index().to_numpy().reshape(nb_asset, nb_dates, nb_cols)
print (myData) #same result than with your code
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
[[24 25 26 27]
[28 29 30 31]
[32 33 34 35]]]
If you have missing combinations, you can use reindex before with pd.MultiIndex.from_product with unique value in both index levels. No need to sort_index anymore I think as the new multiIndex is generated sorted
assetCodes = X_calculated.index.get_level_values(0).unique().sort_values()
dates = X_calculated.index.get_level_values(1).unique().sort_values()
myData = (X_calculated.reindex(pd.MultiIndex.from_product([assetCodes, dates]))
.to_numpy()
.reshape(len(assetCodes), len(dates), len(X_calculated.columns))
)

Related

How can I add several columns within a dataframe (broadcasting)?

import numpy as np
import pandas as pd
data = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]
df = pd.DataFrame(data = data, index = ['A', 'B', 'C'], columns = ['Bulgary', 'Robbery', 'Car Theft'])
df
I get the following:
Bulgary
Robbery
Car Theft
A
30
19
6
B
12
23
14
C
8
18
20
I would like to assign:
df['Total'] = df['Bulgary'] + df['Robbery'] + df['Car Theft']
But does this operation have to be done manually? I am looking for a function that can handle conveniently.
#pseudocode
#df['Total'] = df.Some_Column_Adding_Function([0:3])
#df['Total'] == df['Bulgary'] + df['Robbery'] + df['Car Theft'] returns True
Similarly, how do I add across rows?
Use sum:
df['Total'] = df.sum(axis=1)
Or if you want subset of columns:
df['Total'] = df[df.columns[0:3]].sum(axis=1)
# or df['Total'] = df[['Bulgary', 'Robbery', 'Car Theft']].sum(axis=1)

drop columns according to header value ()

I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
skiprows = 0,
header = [0,1,2,3],
index_col = 0,
parse_dates = True
)
I would like to remove the columns 01090BL, 01100MS. The idea, in the main program, is to have a list of the columns that i want to remove and then drop them. I have, consequently, done as follow:
2bremoved = ['01090BL', '01100MS']
dfr = dfr.drop(2bremoved, axis=1, inplace=True)
but I get the following error:
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
/usr/lib/python3/dist-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I have thus done the following:
aa = dfr.drop(2bremoved, axis=1, inplace=True,level = 0)
but I get an empty dataframe. What am I missing?
thanks
Don't use inplace=True when assigning the output, also a variable name cannot start with a digit in python:
to_remove = ['01090BL', '01100MS']
aa = dfr.drop(to_remove, axis=1, level=0)
Output:
name 00590BL 02200MS
lat 613297 616720
long 5185127 5181393
elv 1833 1499
1956-01-01 1 -2
1956-01-02 2 -1
1956-01-03 3 0
1956-01-04 4 1
1956-01-05 5 2

Generating one NumPy array for each DataFrame row

I'm attempting to plot stock market trades against a plot of the particular stock using mplfinance.plot(). I keep record of all my trades using jstock which uses as CSV file:
"Code","Symbol","Date","Units","Purchase Price","Current Price","Purchase Value","Current Value","Gain/Loss Price","Gain/Loss Value","Gain/Loss %","Broker","Clearing Fee","Stamp Duty","Net Purchase Value","Net Gain/Loss Value","Net Gain/Loss %","Comment"
"ASO","Academy Sports and Outdoors, Inc.","Sep 13, 2021","25.0","45.85","46.62","1146.25","1165.5","0.769999999999996","19.25","1.6793893129770994","0.0","0.0","0.0","1146.25","19.25","1.6793893129770994",""
"ASO","Academy Sports and Outdoors, Inc.","Aug 26, 2021","15.0","41.3","46.62","619.5","699.3","5.32","79.79999999999995","12.881355932203384","0.0","0.0","0.0","619.5","79.79999999999995","12.881355932203384",""
"ASO","Academy Sports and Outdoors, Inc.","Jun 3, 2021","10.0","37.48","46.62","374.79999999999995","466.2","9.14","91.40000000000003","24.386339381003214","0.0","0.0","0.0","374.79999999999995","91.40000000000003","24.386339381003214",""
"RMBS","Rambus Inc.","Nov 24, 2021","2.0","26.99","26.99","53.98","53.98","0.0","0.0","0.0","0.0","0.0","0.0","53.98","0.0","0.0",""
I can get this data easily enough using
myportfolio = pd.read_csv(PORTFOLIO_LOCATION, parse_dates=[2])
But I need to create individual lists for each trade that match the day-by-day stock price:
Date,High,Low,Open,Close,Volume,Adj Close
2020-12-01,17.020000457763672,16.5,16.799999237060547,16.8799991607666,990900,16.8799991607666
2020-12-02,17.31999969482422,16.290000915527344,16.65999984741211,16.40999984741211,1200500,16.40999984741211
and I have a normal DataFrame containing this. So far this is what I have:
for i in myportfolio.groupby("Code"):
(code, j) = i
if code == "ASO": # just testing it against one stock
simp = pd.DataFrame(columns=["Date", "Units", "Price"],
data=j[["Date", "Units", "Purchase Price"]].values, index=j[["Date"]])
df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
# df.lookup(simp["Date"])
df.insert(0, 'row_num', range(0,len(df)))
k = df.loc[simp["Date"]]['row_num']
trades = []
for index, m in k.iteritems():
t = np.zeros((df.shape[0], 1))
t.fill(np.nan)
t[m] = simp[index]["Price"]
trades.append(t.to_list())
But I receive a KeyError: Timestamp('2021-09-17 00:00:00')
Any ideas of how to fix this?
Addendum 1:
import pandas as pd
trade_data = [['ASO', '5/5/21', 10], ['ASO', '5/6/21', 12], ['RBLX', '5/7/21', 15]]
trade_df = pd.DataFrame(trade_data, columns = ['Code', 'Date', 'Price'])
trade_df['Date'] = pd.to_datetime(trade_df['Date'])
trade_df
Code Date Price
0 ASO 2021-05-05 10
1 ASO 2021-05-07 12
2 RBLX 2021-05-07 15
aso_data = [['5/5/21', 12, 5, 10, 7], ['5/6/21', 15, 7, 13, 8], ['5/7/21', 17, 10, 15, 11]]
aso_df = pd.DataFrame(aso_data, columns = ['Date', 'High', 'Low', 'Open', 'Close'])
aso_df['Date'] = pd.to_datetime(aso_df['Date'])
aso_df
Date High Low Open Close
0 2021-05-05 12 5 10 7
1 2021-05-06 15 7 13 8
2 2021-05-07 17 10 15 11
So I want to create two NumPy arrays for ASO {one for each trade) and one for the RBLX trade. For ASO I should have two NumPy arrays that looks like [10, Nan, Nan] and [NaN, NaN, 12].
Do you want a list of lists right?
There is no need to loop.
df_list = df.values.tolist()
just in case another novice such as myself surfs in with a similar problem.
for i in myportfolio.groupby(["Code"]):
(code, j) = i
if code == "ASO": # just testing it against one stock
df = pd.read_csv("ASO-2020-12-01-2021-12-01.csv", index_col=0, parse_dates=True)
df.insert(0, 'row_num', range(0,len(df)))
k = df.loc[j["Date"]]['row_num']
trades = []
for index, m in j.iterrows():
t = np.zeros((df.shape[0], 1))
t.fill(np.nan)
t[int(df.loc[m["Date"]]['row_num'])] = m["Purchase Price"]
asplot = mpf.make_addplot(t, type="scatter", color='red', marker="D")
trades.append(asplot)
mpf.plot(df, type='candle', addplot=trades)
produced an okay graph showing my entry points. good luck

Using tensorflow dataset with stratified sampling

Given a tensorflow dataset
Train_dataset = tf.data.Dataset.from_tensor_slices((Train_Image_Filenames,Train_Image_Labels))
Train_dataset = Train_dataset.map(Parse_JPEG_Augmented)
...
I would like to stratify my batches to deal with class imbalance. I found tf.contrib.training.stratified_sample and thought I could use it in the following way:
Train_dataset_iter = Train_dataset.make_one_shot_iterator()
Train_dataset_Image_Batch,Train_dataset_Label_Batch = Train_dataset_iter.get_next()
Train_Stratified_Images,Train_Stratified_Labels = tf.contrib.training.stratified_sample(Train_dataset_Image_Batch,Train_dataset_Label_Batch,[1/Classes]*Classes,Batch_Size)
But it gives the following error and I'm not sure that this would allow me to keep the performance benefits of tensorflow dataset as I may have then have to pass Train_Stratified_Images and Train_Stratified_Labels via feed_dict ?
File "/xxx/xxx/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/sampling_ops.py", line 192, in stratified_sample
with ops.name_scope(name, 'stratified_sample', list(tensors) + [labels]):
File "/xxx/xxx/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 459, in __iter__
"Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
What would be the "best practice" way of using dataset with stratified batches?
Here is below a simple example to demonstrate the usage of sample_from_datasets (thanks #Agade for the idea).
import math
import tensorflow as tf
import numpy as np
def print_dataset(name, dataset):
elems = np.array([v.numpy() for v in dataset])
print("Dataset {} contains {} elements :".format(name, len(elems)))
print(elems)
def combine_datasets_balanced(dataset_smaller, size_smaller, dataset_bigger, size_bigger, batch_size):
ds_smaller_repeated = dataset_smaller.repeat(count=int(math.ceil(size_bigger / size_smaller)))
# we repeat the smaller dataset so that the 2 datasets are about the same size
balanced_dataset = tf.data.experimental.sample_from_datasets([ds_smaller_repeated, dataset_bigger], weights=[0.5, 0.5])
# each element in the resulting dataset is randomly drawn (without replacement) from dataset even with proba 0.5 or from odd with proba 0.5
balanced_dataset = balanced_dataset.take(2 * size_bigger).batch(batch_size)
return balanced_dataset
N, M = 3, 10
even = tf.data.Dataset.range(0, 2 * N, 2).repeat(count=int(math.ceil(M / N)))
odd = tf.data.Dataset.range(1, 2 * M, 2)
even_odd = combine_datasets_balanced(even, N, odd, M, 2)
print_dataset("even", even)
print_dataset("odd", odd)
print_dataset("even_odd_all", even_odd)
Output :
Dataset even contains 12 elements : # 12 = 4 x N (because of .repeat)
[0 2 4 0 2 4 0 2 4 0 2 4]
Dataset odd contains 10 elements :
[ 1 3 5 7 9 11 13 15 17 19]
Dataset even_odd contains 10 elements : # 10 = 2 x M / 2 (2xM because of .take(2 * M) and /2 because of .batch(2))
[[ 0 2]
[ 1 4]
[ 0 2]
[ 3 4]
[ 0 2]
[ 4 0]
[ 5 2]
[ 7 4]
[ 0 9]
[ 2 11]]

numpy get top k elements from last dimension of ndarray

I have a multidimensional array, and I need to get the top k elements from each row of the last dimension.
>>> x = np.random.random_integers(0, 100, size=(2,1,1,5))
>>> x
array([[[[99, 39, 10, 18, 68]]],
[[[22, 3, 13, 56, 2]]]])
I'm trying to get:
array([[[[ 99., 68.]]],
[[[ 18., 99.]]]])
I can get the indices using the following, but I'm not sure how to slice out the values.
>>> k = 2
>>> parts = np.flip(-1 - np.arange(k), 0)
>>> indices = np.flip(
... np.argpartition(x, parts, axis=-1)[..., -k:],
... axis=-1)
>>> indices
array([[[[0, 4]]],
[[[3, 0]]]])
This could solve your problem.
np.sort(x, axis=len(x.shape)-1)[...,-2:]
np.partition(x, 2)[..., -2:]
returns 2 largest elements from each row. E.g.,
x = np.random.random_integers(0, 100, size=(2,1,1,5))
print(x)
print(np.partition(x, 2)[..., -2:])
prints something like
[[[[79 34 90 80 56]]]
[[[78 11 24 20 42]]]]
[[[[80 90]]]
[[[78 42]]]]