I have 2 dataframes, df1, df2.
df1 consists of 3 columns, start, end, id.
df2 consists of 4 columns, start, end, id, quantity.
Note that start < end always for both dataframes.
For df1, end - start is always 15 for each row, and the [start, end] intervals are non-overlapping and contiguous within each id, e.g.,
df1:
id start end
1 0 15
1 15 30
1 30 45
2 0 15
2 15 30
2 30 45
I need to create a fourth column, quantity_average, in df1, where quantity_average for each row is the weighted average of every df2.quantity whose id matches and whose [start, end] interval fully or partially overlaps that row's [start, end] interval.
The weight is defined as (min(df2.end, df1.end) - max(df2.start, df1.start)) / 15, i.e., proportional to the amount of overlap.
I will provide a full example. We will use the df1 above, and use
df2 =
id start end quantity
1 0 1.1 3.5
1 1.1 11.4 5.5
1 11.4 34 2.5
1 34 46 3
2 0 1.5 2.2
2 1.5 20 1.0
2 20 30 4.5
So, row by row of df1, the resulting quantity_average values are:
id=1, [0, 15]:  1.1 / 15 * 3.5 + (11.4 - 1.1) / 15 * 5.5 + (15 - 11.4) / 15 * 2.5 = 4.63333
id=1, [15, 30]: (30 - 15) / 15 * 2.5 = 2.5
id=1, [30, 45]: (34 - 30) / 15 * 2.5 + (45 - 34) / 15 * 3 = 2.86667
id=2, [0, 15]:  1.5 / 15 * 2.2 + (15 - 1.5) / 15 * 1.0 = 1.12
id=2, [15, 30]: (20 - 15) / 15 * 1.0 + (30 - 20) / 15 * 4.5 = 3.33333
id=2, [30, 45]: 0
I am wondering if there's a quick way to do this in pandas?
Here's one (not so simple) way to do it. It's fast in the sense that it uses vectorized functions, but its time and memory complexities are both O(len(df1) * len(df2)). Depending on the scale of your data sets, the memory requirements may overwhelm your computer's hardware.
The idea is to use numpy broadcasting to compare every row in df1 against every row in df2, searching for pairs that:
Have the same id
Have overlapping [start, end] intervals.
... then perform calculations over them:
import numpy as np

# Extract the columns as numpy arrays, selecting them by name so the
# result does not depend on the frames' column order.
# For the columns of df1, add a trailing axis to prepare for numpy
# broadcasting against the rows of df2.
id1, start1, end1 = [col[:, None] for col in df1[["id", "start", "end"]].to_numpy().T]
id2, start2, end2, quantity2 = df2[["id", "start", "end", "quantity"]].to_numpy().T
# Match each row in df1 to each row in df2
# `is_match` is a matrix where if cell (i, j) is True, row i of
# df1 matches row j of df2
is_match = (id1 == id2) & (start1 <= end2) & (start2 <= end1)
# `start` is a matrix where cell (i, j) is the maximum start time
# between row i of df1 and row j of df2
start = np.maximum(
np.tile(start1, len(df2)),
np.tile(start2, (len(df1), 1))
)
# Likewise, `end` is a matrix where cell (i, j) is the minimum end
# time between row i of df1 and row j of df2
end = np.minimum(
np.tile(end1, len(df2)),
np.tile(end2, (len(df1), 1))
)
# This assumes that every row in df1 has a duration of 15
df1["quantity_average"] = (is_match * (end - start) * quantity2).sum(axis=1) / 15
# This allows each row in df1 to have a different duration
df1["quantity_average"] = (is_match * (end - start) * quantity2).sum(axis=1) / (end1 - start1)[:, 0]
I need to compute a column using an IF, as shown in my code. It takes quite a while to run; is there a faster, cleaner way to do this?
For reference, the "coin" column holds pairs like "ETH_ARS", "DAI_USD" and so on; that's why I split it.
for i in range(merged.shape[0]):
    x = merged["coin"].iloc[i]
    if x.split("_")[1] == "ARS":
        merged["total"].iloc[i] = (
            merged["price"].iloc[i]
            * merged["amount"].iloc[i]
            / merged["valueUSD"].iloc[i]
        )
    else:
        merged["total"].iloc[i] = merged["price"].iloc[i] * merged["amount"].iloc[i]
You can vectorize your code. The trick here is to set valueUSD=1 when coin column ends with USD. After that the operation is the same for all rows: total = price * amount / valueUSD.
Set up an MRE:
data = {'coin': ['ETH_ARS', 'DAI_USD'],
'price': [10, 12],
'amount': [3, 4],
'valueUSD': [2, 7]}
df = pd.DataFrame(data)
print(df)
# Output:
coin price amount valueUSD
0 ETH_ARS 10 3 2
1 DAI_USD 12 4 7 # <- should be set to 1 for division
valueUSD = df['valueUSD'].mask(df['coin'].str.split('_').str[1].eq('USD'), other=1)
df['total'] = df['price'] * df['amount'] / valueUSD
print(df)
# Output:
coin price amount valueUSD total
0 ETH_ARS 10 3 2 15.0 # = 10 * 3 / 2
1 DAI_USD 12 4 7 48.0 # = 12 * 4 / 1 (7 -> 1)
To build that series, use mask to replace valueUSD with 1 (rather than NaN) wherever the coin ends with USD:
>>> valueUSD
0 2
1 1 # 7 -> 1
Name: valueUSD, dtype: int64
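The same idea also collapses into a single expression with numpy.where, dividing by 1 for the *_USD pairs (a sketch against the MRE above):
import numpy as np

df['total'] = df['price'] * df['amount'] / np.where(
    df['coin'].str.split('_').str[1].eq('USD'), 1, df['valueUSD']
)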
I have a dataframe df which consists of columns of countries and rows of dates. The index is of type "DateTime."
I would like to sort the df by each country's value on the last element in the series (e.g., the latest date) and then graph the "top N" countries by this latest value.
I thought if I sorted the transpose of the df and then slice it, I would have what I need. Hence, if N = 10, then I would select df[0:9].
However, when I attempt to select the last column, I get a 'KeyError' message referencing the selected column:
KeyError: '2021-03-28 00:00:00'.
I'm stumped....
df_T = df.transpose()
column_name = str(df_T.columns[-1])
df_T.sort_values(by = column_name, axis = 'columns', inplace = True)
#select the top 10 countries by latest value, eg
# plot df_T[0:9]
What I'm trying to do, example df:
A B C .... X Y Z
2021-03-29 10 20 5 .... 50 100 7
2021-03-28 9 19 4 .... 45 90 6
2021-03-27 8 15 2 .... 40 80 4
...
2021-01-03 0 0 0 .... 0 0 0
I want to select the columns (series) with the greatest N values as of the latest index value (e.g., the latest date).
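For what it's worth, one way to get there without transposing at all is to take the row for the latest date and use Series.nlargest to pick the top-N columns (a minimal sketch with made-up data; N is a placeholder):
import pandas as pd

# hypothetical frame shaped like the example above
df = pd.DataFrame(
    {"A": [10, 9, 8], "B": [20, 19, 15], "C": [5, 4, 2]},
    index=pd.to_datetime(["2021-03-29", "2021-03-28", "2021-03-27"]),
)

N = 2
latest = df.sort_index().iloc[-1]      # the row for the latest date
top_cols = latest.nlargest(N).index    # the N columns with the largest latest values
top_df = df[top_cols]                  # e.g. pass this to .plot()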
I am trying to filter dataframe groups in Pandas based on multiple (any) conditions, but I cannot seem to get to a fast Pandas 'native' one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random

import pandas as pd
n = 100
lst = range(0, n)
df = pd.DataFrame(
{'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
'C': random.choices(list(range(100)), k=2*n*n),
'D': random.choices(list(range(100)), k=2*n*n)
})
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to
group by A and B, and then
filter the groups down to those where any value is greater than 50 in column C and any value is greater than 50 in column D.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: (x.C > 50).any() & (x.D > 50).any())
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20).
But this solution takes quite long (for example, 4.58 s when n = 100) for large dataframes.
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
df_g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby([df.A, df.B])
df_C_bool = df_g.key_C.transform('any')
df_D_bool = df_g.key_D.transform('any')
df[df_C_bool & df_D_bool]
but arguably a bit more ugly. My questions are:
Is there a better "native" Pandas solution for this task? And
Is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact I only want to extract the groups, not the groups together with their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together:
mask = (df[['C','D']].gt(50)          # use .gt([50, 60]) instead if `C` and `D` need different thresholds
.all(axis=1) # check for both True on the rows
.groupby([df['A'],df['B']]) # normal groupby
.transform('max') # 'any' instead of 'max' also works
)
df.loc[mask]
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])
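If you'd rather have those groups as a small DataFrame than as a MultiIndex, the index can be converted (a sketch):
groups = mask[mask].index.to_frame(index=False)
# out:
#    A  B
# 0  0  1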
I have a column in a pandas DataFrame of random values ranging between 100 and 500.
I need to create a new column, 'deciles', out of it, like a ranking with 20 buckets in total, i.e., assign a rank number out of 20 based on the value:
10 to 20 is the first bucket, number 1
20 to 30 is the second bucket, number 2
import numpy as np
import pandas as pd

df = pd.DataFrame()
x = np.random.randint(100, 501, size=1000)  # column of 1000 rows with values ranging between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map(lambda x: int(x / 20))
df.head()
Use integer division by 10:
df = pd.DataFrame({
'credit_score':[4,15,24,55,77,81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print (df)
credit_score credit_decile_rank
0 4 0
1 15 1
2 24 2
3 55 5
4 77 7
5 81 8
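If you want exactly 20 equal-width buckets over the 100 to 500 range described in the question (rather than plain integer division), pd.cut with explicit edges is one option (a sketch against the question's credit_score column):
import pandas as pd

df['credit_decile_rank'] = pd.cut(
    df['credit_score'],
    bins=list(range(100, 501, 20)),   # edges 100, 120, ..., 500 -> 20 buckets of width 20
    labels=list(range(1, 21)),        # rank numbers 1..20
    include_lowest=True,              # so a value of exactly 100 lands in bucket 1
)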
I have a pandas dataframe (x) with two columns: sum and value. sum is the count of records that share the same value. For example:
sum value
2 3
4 1
means 2 records have value 3 and 4 records have value 1.
What I want to do is sort by value and then cut [1, 1, 1, 1, 3, 3] into 3 parts: [1, 1], [1, 1], [3, 3].
How can I cut the values into 3 parts so that each part has an equal number of records?
pandas.cut can't take the sum column into consideration.
I think you can use cumsum with double numpy.where:
import numpy as np

sumall = df['sum'].sum()
df = df.sort_values(by='value')
df['sum_sum'] = df['sum'].cumsum()
df['tag'] = np.where(df['sum_sum'] < sumall / 3, 0,
np.where(df['sum_sum'] < 2 * sumall / 3, 1, 2) )
print (df)
sum value sum_sum tag
1 4 1 4 2
0 2 3 6 2
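The same tagging generalizes to an arbitrary number of parts k by replacing the nested np.where with np.digitize against the interior cut points (a sketch; k is a placeholder):
k = 3                                         # number of parts
edges = np.linspace(0, sumall, k + 1)[1:-1]   # interior thresholds, here [sumall/3, 2*sumall/3]
df['tag'] = np.digitize(df['sum_sum'], edges)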
This works for me, but it's ugly:
total = df['sum'].sum()   # avoid shadowing the built-in sum()

def func(x):
    if x < total / 3:
        return 0
    elif x < 2 * total / 3:
        return 1
    return 2

df = df.sort_values(by='value')
df['sum_sum'] = np.cumsum(df['sum'].values)
df['tag'] = df['sum_sum'].apply(func)