Python Lambda Apply Function Multiple Conditions using OR - pandas

I've searched for this and cannot find a solution. I have two conditions, and a row should be counted when either one is met. In my dataset, I have used "apply" and a lambda function for a single condition (< or >). However, I now have a continuous data column where the count is based on either a low value OR a high value. I have tried variations of the code below but keep getting a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Let's say my data (dfdata) looks like this:
Site data month day year
A 4 1 1 2021
A 17 1 2 2021
A 8 1 3 2021
A 7 1 1 2022
A 0 1 2 2022
A 2 1 3 2022
B 3 1 1 2021
B 16 1 2 2021
B 9 1 3 2021
B 2 1 1 2022
B 18 1 2 2022
B 5 1 3 2022
I've used a for loop that should give the result below by evaluating the "data" column and counting the instances where the value is < 4 OR > 15. I think the "|" operator might do this, but I just get a True/False Series...
sites = ['A','B']
n = len(sites)
dft = pd.DataFrame();
for n in sites:
    dft.loc[:,n] = dfdata[dfdata['Site']==n].groupby(["month", "day"])["data"].apply(lambda x: (x < 4) or (x > 15).sum())
The expected result:
month day A B
1 1 0 2
1 2 2 2
1 3 1 0
Thanks for your help.

You don't have to use (and should avoid) loops in pandas. Aside from being slow, they also make your intention harder to read.
Here's one solution using pandas functions:
dft = (
dfdata.query("data < 4 or data > 15")
.groupby(["month", "day", "Site"])["data"]
.count()
.unstack(fill_value=0)
)
The query filters for rows whose data is < 4 or > 15. The rest is just counting them per group and reshaping the resulting dataframe.
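If you want to keep the | operator from the question rather than query, a minimal sketch of the same count using a boolean mask (column and frame names taken from the question):
mask = (dfdata["data"] < 4) | (dfdata["data"] > 15)

# summing a boolean mask per group counts the rows where either condition holds
dft = (
    mask.groupby([dfdata["month"], dfdata["day"], dfdata["Site"]])
        .sum()
        .unstack(fill_value=0)
)
Unlike the query version, this also keeps (month, day) groups where no row matched, showing 0 for them.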

Related

pandas reset_index of certain level removes entire level of multiindex

I have DataFrame like this:
performance
year month week
2015 1 2 4.170358
3 3.423766
4 -1.835888
5 8.157457
2 6 -3.276887
... ...
2018 7 30 -1.045241
31 -0.870845
8 31 0.950555
32 6.757876
33 -2.203334
I want to have week in range(0 or 1,n) where n = number of weeks in current year and month.
Well, the easy way, I thought, was to use
df.reset_index(level=2, drop=True)
But that was a mistake, as I realized later; in the best case I would get
performance
year month week
2015 1 0 4.170358
1 3.423766
2 -1.835888
3 8.157457
2 4 -3.276887
... ...
2018 7 n-4 -1.045241
n-3 -0.870845
8 n-2 0.950555
n-1 6.757876
n -2.203334
But after I did that, I got an unexpected behaviour
close
timestamp timestamp
2015 1 4.170358
1 3.423766
1 -1.835888
1 8.157457
2 -3.276887
... ...
2018 7 -1.045241
7 -0.870845
8 0.950555
8 6.757876
8 -2.203334
I lost the entire 2nd level of the index! Why? I thought it would be 0 to n for each 'cluster' (yes, it was a mistake, as I mentioned above)...
I solved my problem with something like this:
df.groupby(level = [0, 1]).apply(lambda x: x.reset_index(drop=True))
And got my desired form of DataFrame:
performance
year month
2015 1 0 4.170358
1 3.423766
2 -1.835888
3 8.157457
2 0 -3.276887
... ...
2018 7 3 -1.045241
4 -0.870845
8 0 0.950555
1 6.757876
2 -2.203334
But WHY? Why does reset_index on a certain level just drop it? That's the main question!
reset_index with drop=True adds a default index only when you are resetting the whole index. If you're resetting just a single level of a multi-level index, it will simply remove that level.
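A minimal sketch of both behaviours on a toy two-level index (not the asker's data):
import pandas as pd

idx = pd.MultiIndex.from_tuples([(2015, 2), (2015, 3), (2016, 5)], names=["year", "week"])
toy = pd.DataFrame({"performance": [1.0, 2.0, 3.0]}, index=idx)

# dropping a single level simply removes it; nothing is renumbered
toy.reset_index(level="week", drop=True)

# to renumber within each remaining group, reset inside a groupby
toy.groupby(level="year").apply(lambda x: x.reset_index(drop=True))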

Pandas Series Shift and Conditions return the truth value of a Series is ambiguous

I have a pandas Series df containing 10 values (all doubles).
My aim is to create a new Series as follows.
newSerie = 1 if df > df.shift(1) else 0
In other words newSerie outputs 1 if the current value of df is bigger than its previous value (it should output 0 otherwise).
However, I get :
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In addition, my aim afterwards is to concatenate df and newSerie into a DataFrame, but newSerie only has 9 values since we cannot compare the first value of df with shift(1). Hence I need the first value of newSerie to be an empty value in order to be able to concatenate.
How can I do that?
To give an example, imagine my input is only the Series df, with the desired output shown below:
You can use shift or diff:
# example dataframe:
data = pd.DataFrame({'df':[10,9,12,13,14,15,18,16,20,1]})
df
0 10
1 9
2 12
3 13
4 14
5 15
6 18
7 16
8 20
9 1
Using Series.shift:
data['NewSerie'] = data['df'].gt(data['df'].shift()).astype(int)
Or Series.diff
data['NewSerie'] = data['df'].diff().gt(0).astype(int)
Output
df NewSerie
0 10 0
1 9 0
2 12 1
3 13 1
4 14 1
5 15 1
6 18 1
7 16 0
8 20 1
9 1 0
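The asker also wanted the very first element left empty rather than 0; building on the same data frame, a hedged sketch using where (the column becomes float, since NaN cannot be stored in an integer column):
# NaN in the first row (no previous value to compare with), 0.0/1.0 everywhere else
data['NewSerie'] = (
    data['df'].gt(data['df'].shift())
              .astype(float)
              .where(data['df'].shift().notna())
)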

Compute element overlap based on another column, pandas

If I have a dataframe of the form:
tag element_id
1 12
1 13
1 15
2 12
2 13
2 19
3 12
3 15
3 22
how can I compute the overlaps of the tags in terms of the element_id ? The result I guess should be an overlap matrix of the form:
1 2 3
1 X 2 2
2 2 X 1
3 2 1 X
where I put X on the diagonal since the overlap of a tag with itself is not relevant and where the numbers in the matrix represent the total element_ids that the two tags share.
My attempts:
You can try to use a for loop like:
element_lst = []
for item in df.itertuples():
    element_lst += [item.element_id]
    element_tag = item.tag
# then intersect element_lst row by row.
# This is extremely costly for large datasets.
The second thing I was thinking about was to use df.groupby('tag') and try to somehow intersect on element_id, but it is not clear to me how I can do that with grouped data.
merge + crosstab
# Find element overlap, remove same tag matches
res = df.merge(df, on='element_id').query('tag_x != tag_y')
pd.crosstab(res.tag_x, res.tag_y)
Output:
tag_y 1 2 3
tag_x
1 0 2 2
2 2 0 1
3 2 1 0
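If you want the literal X on the diagonal as in the question, a small hedged follow-up on top of the crosstab result (this turns the counts into object dtype):
overlap = pd.crosstab(res.tag_x, res.tag_y).astype(object)
# mark self-overlap explicitly instead of leaving it at 0
for t in overlap.index:
    overlap.loc[t, t] = 'X'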

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort first, depending on your data; there is a sketch of that below.)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
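If "top 2" should mean the two largest values rather than simply the first two rows, a hedged sketch of the sort mentioned above, followed by head:
(
    df.sort_values(['id', 'value'], ascending=[True, False])
      .groupby('id')
      .head(2)
      .sort_index()   # optional: restore the original row order
)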
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
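If you need the whole rows rather than just the value column, one hedged sketch is to feed the inner level of that MultiIndex back into df.loc:
# the second index level holds the original row labels
top2_idx = df.groupby('id')['value'].nlargest(2).index.get_level_values(1)
df.loc[top2_idx]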
Sometimes sorting the whole data ahead of time is very time consuming.
We can group by first and take the top k per group instead:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same as the value we would give to nlargest: the number of rows to keep for each group.
The reset_index is optional.
This handles duplicated values
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

Assigning one column to another column between pandas DataFrames (like vector to vector assignment)

I have a super strange problem which I spent the last hour trying to solve, but with no success. It is even more strange since I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. The subset was saved as a different variable, x.
x is smaller than the df, but its index is in the same range as the df. I'm now trying to assign x back to the DataFrame replacing values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the different values in x to the right place in df, but instead of keeping the df.true_vpID values that are not in x, it is filling them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of filling the x values into the right places in df, df.true_vpID gets filled with only the first value of x! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale but it didn't work:
df = DataFrame({'a':ones(5),'b':range(5)})
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
z = Series([random() for i in range(5)], index = range(5))
0 0.812561
1 0.862109
2 0.031268
3 0.575634
4 0.760752
df.ix[z.index[[1,3]],'b'] = z[[1,3]]
a b
0 1 0.000000
1 1 0.812561
2 1 2.000000
3 1 0.575634
4 1 4.000000
I really tried it all, need some new suggestions...
Try using df.update(updated_df_or_series)
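A minimal hedged sketch of what update does, on a made-up frame rather than the asker's data (update aligns on the index and only overwrites rows that exist in the passed object):
import pandas as pd

df = pd.DataFrame({'true_vpID': [10.0, 20.0, 30.0, 40.0, 50.0]})
x = pd.Series([99.0, 77.0], index=[1, 3], name='true_vpID')

df.update(x)   # rows 1 and 3 are overwritten in place; the rest stay as they were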
Alternatively, as a simple example, you can modify a DataFrame by taking an index-based slice and modifying the resulting object.
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
df_2 = df_1.ix[3:5]
df_2.b = df_2.b + 2
df_2
a b
3 1 5
4 1 6
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 5
4 1 6
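Note that .ix was deprecated and has since been removed from pandas, so the example above won't run on current versions; a hedged rewrite of the same idea with .loc and an explicit copy (pushing the changes back with update instead of relying on the slice being a view):
import pandas as pd

df_1 = pd.DataFrame({'a': [1, 1, 1, 1, 1], 'b': [0, 1, 2, 3, 4]})

df_2 = df_1.loc[3:4].copy()   # label-based slice, copied so df_1 is not touched yet
df_2['b'] = df_2['b'] + 2

df_1.update(df_2)             # write the modified rows back into the original frame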