Pandas - given a sorted dataframe and a list of target values, how to retrieve rows next to these values in one go

Suppose I have a sorted dataframe and a list of target values, as below:
In [57]: df
Out[57]:
   value
0      1
1      2
2      3
3      4
4      5
5      6
In [58]: target_values=[1.5, 3.5, 5.5]
What I want is, for each target value, the first row whose value is >= that target. In the example above, the indices of those rows are [1, 3, 5].
I can achieve this with the following code:
In [60]: [df[df.value >= t].iloc[0] for t in target_values]
However, this scans the dataframe len(target_values) times. Is there a pandas function that can achieve the same result with just one scan?

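It's called searchsorted. You can use either the pandas method or the numpy one. Both assume the values are already sorted and, with the default side='left', return the first position whose value is >= each target.
pandas:
df.value.searchsorted(target_values)
array([1, 3, 5])
numpy:
df.value.values.searchsorted(target_values)
array([1, 3, 5])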
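To get the rows themselves rather than just their positions, the result of searchsorted can be passed to iloc. The following is a minimal sketch, assuming the value column is sorted in ascending order. The extra target 7.0 and the names positions, in_range and rows are illustrative additions that are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6]})
# 7.0 is added here only to show what happens when no row qualifies.
target_values = [1.5, 3.5, 5.5, 7.0]

# side="left" (the default) gives the first position whose value is >= the target.
positions = df["value"].searchsorted(target_values, side="left")

# A target larger than every value yields len(df), which is not a valid row position,
# so such targets are filtered out before indexing.
in_range = positions < len(df)
rows = df.iloc[positions[in_range]]

print(positions)
print(rows)
```

Targets that exceed every value are dropped here; depending on the use case, it may be preferable to raise an error or keep a placeholder instead.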

# Build a pairwise difference matrix (assumes numpy is imported as np)
pairwise_diff = df.values[:,None]-target_values
# Find the position of the smallest non-negative difference for each target value
np.ma.array(pairwise_diff,mask=(pairwise_diff<0)).argmin(0)
Out[178]: array([[1, 3, 5]], dtype=int64)


Pandas - find rows sharing two out of three common values, order-independent, and collect the value pairs

Given a dataframe, I am looking for rows where two out of three values are in common, regardless of the columns, hence order, in which they appear. I would like to then collect those common pairs.
Please note that:
a pair of values can appear in at most two rows
a value can appear only once in a row
I would like to know the most efficient/elegant way to solve this problem in numpy or pandas.
For example, taking as input the dataframe
d = {'col1': [1, 2, 5, 9], 'col2': [2, 7, 1, 2], 'col3': [3, 3, 2, 7]}
df = pd.DataFrame(data=d)
   col1  col2  col3
0     1     2     3
1     2     7     3
2     5     1     2
3     9     2     7
As a result I expect an array, a list, or something like
1 2
2 3
2 7
as the values (1,2), (2,3) and (2,7) are present in two rows (first and third, first and second, and second and fourth respectively).
I cannot find a concise solution.
At the moment I have sketched a numpy solution such as
def func(x):
    rows, columns = x.shape[0], x.shape[1]
    res = []
    # compare every pair of rows once
    for i in range(0,rows):
        for j in range(i+1, rows):
            aux = np.intersect1d(x[i,:], x[j,:])
            # keep the intersection only if the rows share at least two values
            if aux.size>1:
                res.append(aux)
    return res
which outputs
func(df.values)
Out: [array([2, 3]), array([1, 2]), array([2, 7])]
It looks rather cumbersome. How could I get this done with one of those cool numpy/pandas one-liners?
I would suggest using Python's built-in set operations to do most of the heavy lifting, and just applying them with pandas:
import itertools
import numpy as np
import pandas as pd
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
# turn each row into a set, list its 2-element combinations, and put one pair per row
pairs = df.apply(set, axis=1).apply(lambda x: set(itertools.combinations(x, 2))).explode()
# pairs that occur more than once are shared between two rows
out = set(pairs[pairs.duplicated()])
Output:
{(2, 3), (1, 2), (2, 7)}
Optionally, to get it in list[np.ndarray] format:
out = list(map(np.array, out))
A similar approach to that of @Chrysophylaxs, but in pure Python:
from itertools import combinations
from collections import Counter
c = Counter(s for x in df.to_numpy().tolist() for s in set(combinations(set(x), r=2)))
out = [k for k,v in c.items() if v>1]
# [(2, 3), (1, 2), (2, 7)]
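One caveat with both set-based answers: as far as I can tell, they rely on each row's set yielding its elements in a consistent order, so that a shared pair produces the same tuple in both rows. Set iteration order is an implementation detail, so a more defensive variant sorts each row's unique values first. The following is a minimal sketch of that idea; shared_pairs is a hypothetical helper name, not something from the answers above.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 5, 9], "col2": [2, 7, 1, 2], "col3": [3, 3, 2, 7]})


def shared_pairs(frame):
    """Return the value pairs that occur together in more than one row."""
    counts = Counter(
        pair
        for row in frame.to_numpy().tolist()
        # sorted(set(row)) makes every pair an ascending tuple, whatever the column order
        for pair in combinations(sorted(set(row)), 2)
    )
    return sorted(pair for pair, n in counts.items() if n > 1)


result = shared_pairs(df)
print(result)  # [(1, 2), (2, 3), (2, 7)]
```

This still loops over rows in Python, so it trades raw speed for predictable pair ordering.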
df=df.assign(col4=df.index)
def function1(ss:pd.Series):
    ss1=ss.value_counts().loc[lambda ss:ss>=2]
    return ss1.index.tolist() if ss1.size>=2 else None
df.merge(df,how='cross',suffixes=('','_2')).query("col4!=col4_2").filter(regex=r'col[^4]', axis=1)\
    .apply(function1,axis=1).dropna().drop_duplicates()
out
1    [2, 3]
2    [1, 2]
7    [2, 7]

Why does pandas.DataFrame.apply produce a Series instead of a DataFrame

I do not really understand why the following code returns a Series rather than a DataFrame.
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y =[]
    for i in range(0, len(x)):
        y.append(x[i]+2)
    return y
df_row = df.apply(plus_2, axis = 1) # Applied to each row
df_row
Whereas if I change to axis=0, it produces a DataFrame as expected:
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y =[]
    for i in range(0, len(x)):
        y.append(x[i]+2)
    return y
df_row = df.apply(plus_2, axis = 0) # Applied to each column
df_row
Here is the output:
In the first example, where you pass axis=1, the function is applied at row level.
This means that for each row, plus_2 returns y, a list of two elements. The list as a whole is treated as a single element, so the overall result is a pd.Series.
For your example, three lists are returned (two elements each), one list per row.
You can expand this result into two columns (each element of the list becomes a new column) by adding result_type="expand" to apply:
df_row = df.apply(lambda x: plus_2(x), axis=1, result_type="expand")
# output
   0   1
0  6  11
1  6  11
2  6  11
In the second approach you pass axis=0, so the function is applied at column level.
This means that plus_2 is called once per column, so twice in total: separately for column A and for column B. That is why a DataFrame is returned: your input is a DataFrame with columns A and B, plus_2 is applied to each column, and the results become columns A and B of the output.
For your example, two lists are returned (three elements each), one list per column.
So the main difference between axis=1 and axis=0 is this:
if you apply at row level, apply returns:
[6, 11]
[6, 11]
[6, 11]
if you apply at column level, apply returns:
[6, 6, 6]
[11, 11, 11]
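The three cases described above can be checked side by side. This is a minimal sketch rather than the asker's exact code: plus_2 is rewritten with positional access (x.iloc[i]) so that it behaves the same for rows and columns, and the names by_row, expanded and by_col are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])


def plus_2(x):
    # Positional access works whether x is a row (labels A, B) or a column (labels 0..2).
    return [x.iloc[i] + 2 for i in range(len(x))]


by_row = df.apply(plus_2, axis=1)                           # one list per row
expanded = df.apply(plus_2, axis=1, result_type="expand")   # lists spread into columns
by_col = df.apply(plus_2, axis=0)                           # one list per column

print(type(by_row).__name__)    # Series
print(type(expanded).__name__)  # DataFrame
print(type(by_col).__name__)    # DataFrame
```

With result_type="expand" the new columns are labelled 0 and 1, so the original names A and B would have to be restored by hand if they are needed.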

Pandas: how to retrieve values from a DataFrame given a list of (row, column) pairs?

tl;dr: I want to pass a series of positions on a DataFrame and receive a series of values, if possible with a DataFrame method.
I have a Dataframe with some columns and an index
import pandas as pd
df_a = pd.DataFrame(
    {'A':[0,1,3,7],
     'B':[2,3,4,5]}, index=[0,1,2,3])
I want to retrieve the values at specific (row, column) positions on the DataFrame
rows = [0, 2, 3]
cols = ['A','B','A']
df_a.loc[rows, cols] returns a 3x3 DataFrame:
   A  B  A
0  0  2  0
2  3  4  3
3  7  5  7
I want the series of values corresponding to the (row, col) values, a series of length 3
[0, 4, 7]
What is the best way to do this in pandas?
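Most certainly! You can use DataFrame.lookup to achieve exactly what you want (note that it was deprecated in pandas 1.2 and removed in pandas 2.0, so this only works on older versions):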
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.lookup.html
import pandas as pd
df_a = pd.DataFrame({'A':[0,1,3,7], 'B':[2,3,4,5]}, index=[0,1,2,3])
rows = [0, 2, 3]
cols = ['A','B','A']
values = df_a.lookup(rows, cols)
print(values)
array([0, 4, 7], dtype=int64)
Pandas does not support that kind of paired indexing directly; numpy does:
>>> df_a.to_numpy()[rows, df_a.columns.get_indexer(cols)]
array([0, 4, 7])
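The numpy approach can be made fully label-based by translating the row labels to positions as well, which matters when the index is not simply 0..n-1. This is a minimal sketch, assuming the index and column labels are unique (get_indexer requires that); row_pos and col_pos are illustrative names.

```python
import pandas as pd

df_a = pd.DataFrame({"A": [0, 1, 3, 7], "B": [2, 3, 4, 5]}, index=[0, 1, 2, 3])
rows = [0, 2, 3]
cols = ["A", "B", "A"]

# Translate labels into integer positions along each axis.
row_pos = df_a.index.get_indexer(rows)
col_pos = df_a.columns.get_indexer(cols)

# Paired integer indexing on the underlying array picks one value per (row, col) pair.
values = pd.Series(df_a.to_numpy()[row_pos, col_pos])
print(values.tolist())  # [0, 4, 7]
```

Labels that are missing from the index or columns come back from get_indexer as -1, which would silently select the last row or column, so it may be worth checking for -1 before indexing.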

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check whether value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a DataFrame with 1 million rows):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]
are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")
I think that if you create a DataFrame from the list column, you can compare it with DataFrame.eq and test whether at least one value per row matches with DataFrame.any. Note axis=0 in eq, so that each value is compared against its own row:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a small optimisation of your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()). This assumes numpy is imported as np:
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
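For the expand-and-compare idea above, the axis used in eq decides how the values line up with the expanded lists. The following is a minimal sketch, assuming the lists hold plain numbers; the third row with a shorter list is an illustrative addition, and expanded, row_ok and all_ok are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "value": [1, 2, 7],
        "accepted_values": [[1, 2, 3, 4], [5, 6, 7, 8], [7]],
    }
)

# One column per list position; shorter lists are padded with NaN.
expanded = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)

# axis=0 aligns the "value" Series with the rows, so each value is
# compared only against the entries of its own list.
row_ok = expanded.eq(df["value"], axis=0).any(axis=1)
all_ok = bool(row_ok.all())

print(row_ok.tolist())  # [True, False, True]
print(all_ok)           # False
```

Keeping the per-row result (row_ok) around makes it easy to report which rows failed before raising an AssertionError.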

Seaborn groupby pandas Series

I want to visualize my data as box plots that are grouped by another variable, as shown here in my terrible drawing:
To do this, I use a pandas Series to indicate which group each value belongs to:
import pandas as pd
import seaborn as sns
# example data for reproducibility
a = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ])
# converting second column to Series
a.ix[:,1] = pd.Series(a.ix[:,1])
# plotting with seaborn
sns.boxplot(a, groupby=a.ix[:,1])
And this is what I get:
However, I expected to get two box plots, each describing only the first column and grouped by the corresponding value in the second column (the column converted to a Series). Instead, the plot above shows each column separately, which is not what I want.
A column in a Dataframe is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
# example data for reproducibility
df = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ], columns=['a', 'b'])
# plotting with seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit: giving the columns labels makes it a bit clearer, in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and any other column. So if your DataFrame looks like this:
    a   b  grouper
0   2   5        1
1   4   9        2
2   5   3        1
3  10   6        2
4   9   7        2
5   3  11        1
and you want box plots for columns a and b, grouped by the column grouper, you should flatten the columns and change the groupby column to contain values like a1, a2, b1 etc.
Here is a crude way which I think should work, given the DataFrame shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the DataFrame. The flattening of the hierarchy after pivoting is especially hard to read, and I don't like it.
This is a new answer to an old question, because seaborn and pandas have changed across version updates. Because of these changes, Rutger's answer no longer works.
The most important changes came between seaborn v0.5.x and v0.6.0. I quote the changelog:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1], [4, 2], [5, 1],
                   [10, 2], [9, 2], [3, 1]
                  ], columns=['a', 'b'])
# plotting with seaborn, passing x and y as parameters
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                  ], columns=['a', 'b', 'grouper'])
# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join the two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
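As a possible alternative to concatenating the column name and the group into one label, the melted frame can keep grouper as its own column and leave the grouping to seaborn's hue parameter. The sketch below only performs the pandas reshaping so that it runs without a plotting backend; the plotting call is shown as a comment, and the default melt column names variable and value are used.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 5, 1], [4, 9, 2], [5, 3, 1], [10, 6, 2], [9, 7, 2], [3, 11, 1]],
    columns=["a", "b", "grouper"],
)

# Long form: one row per (original row, measured column) combination.
df_long = pd.melt(df, id_vars="grouper")

print(df_long.head())

# The long-form frame could then be plotted with, for example:
#   import seaborn as sns
#   sns.boxplot(data=df_long, x="variable", y="value", hue="grouper")
```

This draws one box per column and group, with the groups distinguished by color instead of by separate axis labels such as a1 and a2.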
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data:pd.DataFrame, col:str)->pd.DataFrame:
    '''This function takes a DataFrame, groups it by one column and returns
    a new DataFrame where the old column names are extended by the group item.
    '''
    # group the DataFrame passed in as data, not the global df
    grouper = data.groupby(col)
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                  ], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() does not take a groupby argument.
You will probably see:
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
A better idea is to group the data first and pass the grouped DataFrame to boxplot as data:
import seaborn as sns
grouDataFrame = nameDataFrame.groupby(['A'])['B'].agg('sum').reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here column B contains numeric values and the grouping is done on column A. The values of B are summed within each group, and the box plot is drawn from the result. Hope this helps.