Virtual column with calculation in Vaex - vaex

I want to set a virtual column to a calculation using another column in Vaex. I need to use an if statement inside this calculation. In general I want to call
df['calculation_col'] = log(df['original_col']) if df['original_col'] == 0 else -4
I then try to run the count function in Vaex:
hist = df.count(
    binby='calculation_col',
    limits=limits,
    shape=binnum,
    delay=True
)
When I try to execute this code I get the error ValueError: zero-size array to reduction operation minimum which has no identity.
How can I use a conditional for a virtual column in Vaex?

Probably the most "vaex" way to do this would be to use where:
import vaex
df = vaex.example()
# The syntax is where(condition, if satisfied, else)
df['calculated_col'] = df.func.where(df['x'] < 10, 0, -4)
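Applied to the calculation from the question, the same pattern might look like the sketch below. Note that this is only a sketch: 'x' from vaex.example() stands in for the question's original_col, numpy ufuncs such as np.log can be used on vaex expressions, and the condition is taken as != 0 so that rows with zero get -4 instead of log(0):
import numpy as np
import vaex
df = vaex.example()
# where(condition, value_if_true, value_if_false); 'x' stands in for original_col
df['calculation_col'] = df.func.where(df['x'] != 0, np.log(df['x']), -4)
# note: np.log still yields nan for negative values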

It might be useful to use a mask for subsetting the relevant rows:
import vaex
df = vaex.example()
mask = df["id"] < 10
df["new_col"] = mask * df["x"] + ~mask * (-4)
print(df[['id', 'x', 'new_col']].head(4))
#  #    id    x           new_col
#  0     0     1.23187     1.23187
#  1    23    -0.163701   -4
#  2    32    -2.12026    -4
#  3     8     4.71559     4.71559
Kindly note that in the original script, taking np.log of zero would trigger a numpy warning and produce invalid (-inf) values, so using np.log1p might be more appropriate in that case.

Related

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for faster execution:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row: add(row, target_row), axis=1)
print(time.time()-start)
So I have a dataframe of size 30'000 and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often I would like to optimize it for performance.
Any help is very much appreciated.
I already read Optimization when using Pandas; are there further resources you can recommend? Thanks.
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is as below:
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
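One small caveat: np_arr += target_row modifies the converted array in place. If you want to keep that plain copy around for other calculations, a non-mutating variant of the same broadcasting idea might look like this sketch (self-contained, same random data as the question):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row = df.loc[5].to_numpy()

# broadcasting adds target_row to every row without touching the source array
check = df.to_numpy() + target_row
result = (np.sum(check == 4, axis=1) - np.sum(check == 3, axis=1)).tolist()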

Filter Pandas DataFrame Column Error: Wrong number of items passed 4, placement implies 1

I created a pandas DataFrame with some columns of NumPy arrays. I would like to filter one of the columns and assign the result to a new column.
df = pd.DataFrame({'Signal' : signalarr, 'Signal RMS with Peaks' : RMS_Calculator(signalarr)} , columns=['Signal','Signal RMS with Peaks'])
df['Signal CMA with Peaks'] = df['Signal'].expanding(2).mean()
df.loc[[0], ['Signal CMA with Peaks']] = df['Signal'][0]
df['Peaks'] = random_peak
#print(df[df['Signal'] >= 10])
#df['Signal Without Peaks'] = df[df['Signal'] >= 10] # error: Wrong number of items passed 4, placement implies 1
df['Signal Without Peaks'] = df['Signal'] >= 10 # I need the values, not the boolean.
df
I read this post Pandas - Filtering value by columns throws error (ValueError: Wrong number of items passed 3, placement implies 1) and tried the solution, but am still getting the error.
With filtering like this, df['Signal Without Peaks'] = df[df['Signal'] >= 10], I didn't get the error before. Any ideas where I went wrong?
Thanks!
Update: I created another dataframe earlier, and by filtering the values I got NaN values, which is the desired result for my application.
df = pd.DataFrame(signalarr, columns=['Signal'])
df['Signal Without Peaks'] = df[df['Signal'] <= 10]
By definition, all the columns of a dataframe have the same length (which is thus also called the length of the dataframe). That's why you can't add a filtered (thus shorter) column as a new column to the original dataframe.
Instead, you may want to assign the filtered values to a new name, e.g.:
df_without_peaks = df[df['Signal'] >= 10]
Edit: To fill in NaN values when the signal has a value less than 10, you can use np.where():
import numpy as np
df['Signal filtered'] = np.where(df['Signal'] >= 10, df['Signal'], np.nan)
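Pandas' own Series.where does the same thing with slightly less typing: it keeps the values where the condition holds and fills NaN elsewhere. A sketch using the question's column names:
df['Signal filtered'] = df['Signal'].where(df['Signal'] >= 10)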

How to convert working pandas code to a dask code?

I have two dates in YYYYMM format:
date1 = 203201
date2 = 201204
I have a dataframe [testdf] with 235 million rows which contains a date variable 'DATE_TO_COMPARE' that I need to compare with the above two dates for a filter.
I need to filter this dataframe as follows:
# Step 1: Create two date variables in the dataframe for comparison purposes
testdf['date1'] = pd.to_datetime(testdf['date1'], format = '%Y%m', errors='ignore')
testdf['date2'] = pd.to_datetime(testdf['date2'], format = '%Y%m', errors='ignore')
# Step 2: Apply the filter
testdf_filtered = testdf[(testdf['DATE_TO_COMPARE'] <= testdf['date1']) & \
(testdf['DATE_TO_COMPARE'] > testdf['date2'])]
Problem is the above operations take 70 years to execute on 235 million rows :--)
So I recently realized I have multiple cores on my PC, a sexy 5 cores lol. So, I did some research and read about, drumroll... DASK!
So here I am trying to daskize this code as follows:
# Daskize pandas dataframe
import dask as dd
ddata = dd.from_pandas(testdf, npartitions=5)
# Step 1: Create two date variables in the dataframe for comparison purposes
ddata['date1'] = pd.to_datetime(ddata['date1'], format = '%Y%m', errors='ignore')
ddata['date2'] = pd.to_datetime(ddata['date2'], format = '%Y%m', errors='ignore')
# Step 2: Apply the filter
ddata_filtered = ddata[(ddata['DATE_TO_COMPARE'] <= ddata['date1']) & \
(ddata['DATE_TO_COMPARE'] > ddata['date2'])]
# Re-Pandize Daskized dataframe
testdf_filtered = ddata_filtered.compute(scheduler='processes')
I obviously run into a host of errors in the dask code! Example:
TypeError: 'DataFrame' object does not support item assignment etc.
Any education/advice/example will be much appreciated. Thanks.
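For reference, one way the dask version might be written is sketched below (untested against the real data). The main assumptions: from_pandas and to_datetime live in dask.dataframe rather than the top-level dask module, and the new columns are created lazily with assign instead of item assignment (which is what triggers the TypeError above):
import dask.dataframe as dd

# testdf is the 235-million-row pandas dataframe from the question
ddf = dd.from_pandas(testdf, npartitions=5)
# Step 1: create the comparison columns lazily via assign
ddf = ddf.assign(
    date1=dd.to_datetime(ddf['date1'], format='%Y%m', errors='ignore'),
    date2=dd.to_datetime(ddf['date2'], format='%Y%m', errors='ignore'),
)
# Step 2: apply the filter
ddf_filtered = ddf[(ddf['DATE_TO_COMPARE'] <= ddf['date1']) &
                   (ddf['DATE_TO_COMPARE'] > ddf['date2'])]
# Re-pandize the dask dataframe
testdf_filtered = ddf_filtered.compute(scheduler='processes')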

Pandas Boolean selection updated?

I was used to getting a single number, which told me how many cases are TRUE for either of the conditions in the code. However, since I ran conda update all, I now get a series of values with either 0 or 1. I wonder what is now the simplest method in pandas to get this task done. I guess that this is a pandas update; I did a Google search but could not find that boolean indexing changed. What is the easiest way to get this sum of booleans (I know how to get it, but I cannot imagine that this extra step is required)?
import pandas as pd
import numpy as np
x = np.random.randint(10,size=10)
y = np.random.randint(10,size=10)
d ={}
d['x'] = x
d['y'] = y
df = pd.DataFrame(d)
sum([df['x']>=6] or [df['y']<=3])
You need to use vectorized or |:
(df.x.ge(6) | df.y.le(3)).sum()
# 9
Or: ((df.y <= 3) | (df.x >= 6)).sum(), sum((df.y <= 3) | (df.x >= 6)).
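As an aside, the original expression most likely never combined the two conditions at all: [df['x']>=6] or [df['y']<=3] wraps each Series in a one-element list, Python's or returns the first (non-empty, hence truthy) list unchanged, and sum() over that one-element list is just 0 + Series, which casts the booleans to the 0/1 values you saw while silently dropping the y condition. Reusing the df built in the question:
# `or` returns the first non-empty list, so the y condition is ignored;
# sum([bool_series]) is 0 + bool_series, i.e. the x condition as 0/1 integers
print(sum([df['x'] >= 6] or [df['y'] <= 3]))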

Selecting from pandas dataframe (or numpy ndarray?) by criterion

I find myself coding this sort of pattern a lot:
tmp = <some operation>
result = tmp[<boolean expression>]
del tmp
...where <boolean expression> is to be understood as a boolean expression involving tmp. (For the time being, tmp is always a pandas dataframe, but I suppose that the same pattern would show up if I were working with numpy ndarrays--not sure.)
For example:
tmp = df.xs('A')['II'] - df.xs('B')['II']
result = tmp[tmp < 0]
del tmp
As one can guess from the del tmp at the end, the only reason for creating tmp at all is so that I can use a boolean expression involving it inside an indexing expression applied to it.
I would love to eliminate the need for this (otherwise useless) intermediate, but I don't know of any efficient[1] way to do this. (Please, correct me if I'm wrong!)
As second best, I'd like to push off this pattern to some helper function. The problem is finding a decent way to pass the <boolean expression> to it. I can only think of indecent ones. E.g.:
def filterobj(obj, criterion):
    return obj[eval(criterion % 'obj')]
This actually works[2]:
filterobj(df.xs('A')['II'] - df.xs('B')['II'], '%s < 0')
# Int
# 0 -1.650107
# 2 -0.718555
# 3 -1.725498
# 4 -0.306617
# Name: II
...but using eval always leaves me feeling all yukky 'n' stuff... Please let me know if there's some other way.
[1] E.g., any approach I can think of involving the filter built-in is probably inefficient, since it would apply the criterion (some lambda function) by iterating, "in Python", over the pandas (or numpy) object...
[2] The definition of df used in the last expression above would be something like this:
import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = range(5)
ix = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
                               names=('Alpha', 'Int'))
c = ('I', 'II', 'III')
df = pd.DataFrame(np.random.randn(len(ix), len(c)), index=ix, columns=c)
Because of the way Python works, I think this one's going to be tough. I can only think of hacks which only get you part of the way there. Something like
def filterobj(obj, fn):
    return obj[fn(obj)]
filterobj(df.xs('A')['II'] - df.xs('B')['II'], lambda x: x < 0)
should work, unless I've missed something. Using lambdas this way is one of the usual tricks for delaying evaluation.
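As a side note, more recent pandas versions (0.18.1 and later) also accept a callable inside .loc / [], which gives the same delayed-evaluation effect without a helper function. Using the df defined in the question's footnote, a sketch:
result = (df.xs('A')['II'] - df.xs('B')['II']).loc[lambda s: s < 0]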
Thinking out loud: one could make a this object which isn't evaluated but just sticks around as an expression, something like
>>> this
this
>>> this < 3
this < 3
>>> df[this < 3]
Traceback (most recent call last):
  File "<ipython-input-34-d5f1e0baecf9>", line 1, in <module>
    df[this < 3]
  [...]
KeyError: u'no item named this < 3'
and then either special-case the treatment of this into pandas or still have a function like
def filterobj(obj, criterion):
    return obj[eval(str(criterion.subs({"this": "obj"})))]
(with enough work we could lose the eval, this is simply proof of concept) after which something like
>>> tmp = df["I"] + df["II"]
>>> tmp[tmp < 0]
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
Dtype: float64
>>> filterobj(df["I"] + df["II"], this < 0)
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
Dtype: float64
would work. I'm not sure any of this is worth the headache, though, Python simply isn't very conducive to this style.
This is as concise as I could get:
(df.xs('A')['II'] - df.xs('B')['II']).apply(lambda x: x if (x<0) else np.nan).dropna()
Int
0 -4.488312
1 -0.666710
2 -1.995535
Name: II