Extract a subset of a pandas dataframe based on values (with repetition)?

Let's say I have the following dataframe:
import pandas as pd

elements = [1, 1, 1, 1, 1, 2, 3, 4, 5]
df = pd.DataFrame({'elements': elements})
print(df)
   elements
0         1
1         1
2         1
3         1
4         1
5         2
6         3
7         4
8         5
I have a list [1, 1, 2, 3] and I want a subset of the dataframe including those 4 elements, for example:
   elements
0         1
1         1
5         2
6         3
I have been able to deal with it by building a dict that counts the items' occurrences in the list, then assembling a new dataframe by appending subparts of the initial one.
Would you know some dataframe methods to help me find a more elegant solution?
After @jezrael's comment: I must add that I need to keep track of the initial index (in df).
We can see df (the first dataframe) as a repository of resources, and I need to track which rows/indices get allocated.
The use case is: among the elements in df, give me two 1s, one 2 and one 3. I would then persist the fact that rows 0 and 1 hold the 1s, row 5 the 2 and row 6 the 3.

If and only if your Series and list are sorted (otherwise, see below), then you can do:
L = [1, 1, 2, 3]
df[df.elements.apply(lambda x: x == L.pop(0) if x in L else False)]
   elements
0         1
1         1
5         2
6         3
list.pop(i) removes and returns the value at index i of the list. Because both df.elements and L are sorted, popping the first element of L (i == 0) always happens exactly when the lambda reaches the corresponding element in elements.
So at each iteration of the lambda over elements, L evolves as follows:
| element | L            | Output |
|---------|--------------|--------|
| 1       | [1, 1, 2, 3] | True   |
| 1       | [1, 2, 3]    | True   |
| 1       | [2, 3]       | False  |
| 1       | [2, 3]       | False  |
| 1       | [2, 3]       | False  |
| 2       | [2, 3]       | True   |
| 3       | [3]          | True   |
| 4       | []           | False  |
| 5       | []           | False  |
As you can see, your list is empty at the end, so if that's a problem you can copy it beforehand (a minimal sketch follows). Or note that you actually have that information in the new dataframe you just created!
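For instance, a minimal sketch of working on a throwaway copy so that the original list survives:
L = [1, 1, 2, 3]
L_work = list(L)   # shallow copy; the lambda consumes this one instead
subset = df[df.elements.apply(lambda x: x == L_work.pop(0) if x in L_work else False)]
print(L)           # [1, 1, 2, 3] -- untouched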
If df.elements is not sorted, create a sorted copy, apply the same lambda to it, and use the resulting boolean Series as a mask on the original dataframe (rows whose mask value is True are kept):
df
   elements
0         5
1         4
2         3
3         1
4         2
5         1
6         1
7         1
8         1
L = [1, 1, 2, 3]   # a fresh list, since the first example consumed the previous one
cp = df.elements.copy()
cp.sort_values(inplace=True)
tmp = df.loc[cp.apply(lambda x: x == L.pop(0) if x in L else False)]
print(tmp)
   elements
2         3
3         1
4         2
5         1
HTH

Extraction is possible with a merge on helper columns created by GroupBy.cumcount:
L = [1, 1, 2, 3]
df1 = pd.DataFrame({'elements': L})

df['g'] = df.groupby('elements')['elements'].cumcount()
df1['g'] = df1.groupby('elements')['elements'].cumcount()
print(df)
   elements  g
0         1  0
1         1  1
2         1  2
3         1  3
4         1  4
5         2  0
6         3  0
7         4  0
8         5  0

print(df1)
   elements  g
0         1  0
1         1  1
2         2  0
3         3  0

print(pd.merge(df, df1, on=['elements', 'g']))
   elements  g
0         1  0
1         1  1
2         2  0
3         3  0

print(pd.merge(df.reset_index(), df1, on=['elements', 'g'])
        .drop('g', axis=1)
        .set_index('index')
        .rename_axis(None))
   elements
0         1
1         1
5         2
6         3

Related

Pivoting and transposing using pandas dataframe

Suppose that I have a pandas dataframe like the one below:
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
The above would give me the following output:
print(df)
   fk ID  value  valID
0      1      3      1
1      1      3      2
2      2      4      1
3      2      5      2
or
| fk ID | value | valID |
|-------|-------|-------|
| 1     | 3     | 1     |
| 1     | 3     | 2     |
| 2     | 4     | 1     |
| 2     | 5     | 2     |
and I would like to transpose and pivot it in such a way that I get the following table, with the same order of column names:
| fk ID | value | valID | fk ID | value | valID |
|-------|-------|-------|-------|-------|-------|
| 1     | 3     | 1     | 1     | 3     | 2     |
| 2     | 4     | 1     | 2     | 5     | 2     |
The most straightforward solution I can think of is
df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})

# concatenate the rows (Series) of each 'fk ID' group side by side
def flatten_group(g):
    return pd.concat(row for _, row in g.iterrows())

res = df.groupby('fk ID', as_index=False).apply(flatten_group)
However, using DataFrame.iterrows is not ideal, and can be very slow if the size of each group is large.
Furthermore, the above solution doesn't work if the 'fk ID' groups have different sizes. To see that, we can add a third group to the DataFrame
>>> df2 = df.append({'fk ID': 3, 'value': 10, 'valID': 4},
...                 ignore_index=True)
>>> df2
   fk ID  value  valID
0      1      3      1
1      1      3      2
2      2      4      1
3      2      5      2
4      3     10      4
>>> df2.groupby('fk ID', as_index=False).apply(flatten_group)
0  fk ID     1
   value     3
   valID     1
   fk ID     1
   value     3
   valID     2
1  fk ID     2
   value     4
   valID     1
   fk ID     2
   value     5
   valID     2
2  fk ID     3
   value    10
   valID     4
dtype: int64
The result is not a DataFrame as one could expect, because pandas can't align the columns of the groups.
To solve this I suggest the following solution. It should work for any group size, and should be faster for large DataFrames.
import numpy as np

def flatten_group(g):
    # flatten each group's data into a single row
    flat_data = g.to_numpy().reshape(1, -1)
    return pd.DataFrame(flat_data)

# group the rows by 'fk ID'
groups = df.groupby('fk ID', group_keys=False)

# get the maximum group size
max_group_size = groups.size().max()

# construct the new columns by repeating the
# original columns 'max_group_size' times
new_cols = np.tile(df.columns, max_group_size)

# aggregate the flattened rows
res = groups.apply(flatten_group).reset_index(drop=True)

# update the columns
res.columns = new_cols
Output:
# df
>>> res
   fk ID  value  valID  fk ID  value  valID
0      1      3      1      1      3      2
1      2      4      1      2      5      2

# df2
>>> res
   fk ID  value  valID  fk ID  value  valID
0      1      3      1    1.0    3.0    2.0
1      2      4      1    2.0    5.0    2.0
2      3     10      4    NaN    NaN    NaN
You can cast df to a NumPy array, reshape it, and cast it back to a dataframe, then rename the columns (0..5). This also works if the values are strings rather than numbers.
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
nrows = 2
array = df.to_numpy().reshape((nrows, -1))
pd.DataFrame(array).rename(mapper=lambda x: df.columns[x % len(df.columns)], axis=1)
If your group sizes are guaranteed to be the same, you could merge your odd and even rows:
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
df_even = df[df.index % 2 == 0].reset_index(drop=True)
df_odd = df[df.index % 2 == 1].reset_index(drop=True)
df_odd.join(df_even, rsuffix='_2')
Yields
   fk ID  value  valID  fk ID_2  value_2  valID_2
0      1      3      2        1        3        1
1      2      5      2        2        4        1
I'd expect this to be pretty performant. It could also be generalized to any number of rows in each group (rather than assuming odd/even for two rows per group; see the sketch below), but as written it requires the same number of rows per fk ID.
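For reference, one way to generalize beyond two rows per group (a sketch, not part of the original answer) numbers the rows within each group with GroupBy.cumcount and then unstacks; note the columns come out grouped by name rather than interleaved, and uneven groups produce NaN:
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})

# position of each row within its 'fk ID' group: 0, 1, 0, 1
pos = df.groupby('fk ID').cumcount()
wide = (df.set_index(['fk ID', pos])
          .unstack()        # columns become a (name, position) MultiIndex
          .reset_index())
# flatten the MultiIndex into suffixed names such as 'value_0', 'value_1'
wide.columns = [name if name == 'fk ID' else f'{name}_{i}'
                for name, i in wide.columns]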

most efficient way to set dataframe column indexing to other columns

I have a large dataframe. One of its columns contains the names of other columns. I want to evaluate this column and set, in each row, the value of the referenced column:
| A | B | C | Column |
|---|---|---|--------|
| 1 | 3 | 4 | B      |
| 2 | 5 | 3 | A      |
| 3 | 5 | 9 | C      |
Desired output:
| A | B | C | Column |
|---|---|---|--------|
| 1 | 3 | 4 | 3      |
| 2 | 5 | 3 | 2      |
| 3 | 5 | 9 | 9      |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]

In [366]: df
Out[366]:
   A  B  C  Column
0  1  3  4       3
1  2  5  3       2
2  3  5  9       9
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0 onward, lookup is deprecated; you can just use a for loop:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
   A  B  C  Column
0  1  3  4       3
1  2  5  3       2
2  3  5  9       9
Since lookup is going to be deprecated, try the NumPy method with get_indexer:
df['new'] = df.values[df.index, df.columns.get_indexer(df.Column)]
df
Out[75]:
   A  B  C Column  new
0  1  3  4      B    3
1  2  5  3      A    2
2  3  5  9      C    9

Append new column to DF after sum?

I have a sample dataframe below:
sn  C1-1  C1-2  C1-3  H2-1  H2-2  K3-1  K3-2
 1     4     3     5     4     1     4     2
 2     2     2     0     2     0     1     2
 3     1     2     0     0     2     1     2
I would like to sum based on the prefixes C1, H2, K3 and output three new columns with the totals. The final result is this:
sn  total_c1  total_h2  total_k3
 1        12         5         6
 2         4         2         3
 3         3         2         3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
    idx = df.columns.str.startswith(k)
    for j in lst2:
        df[j] = df.iloc[:, idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting this error:
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total columns to produce my end result in the loop.
Any help will be appreciated (or a better suggestion, as opposed to a loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(
    df[cols].groupby(col_groups, axis=1)
            .sum()
            .add_prefix('total_')
)
Output:
   sn  total_C1  total_H2  total_K3
0   1        12         5         6
1   2         4         2         3
2   3         3         2         3
Let us try: split the column names, then groupby with axis=1:
out = df.groupby(df.columns.str.split('-').str[0], axis=1).sum().set_index('sn').add_prefix('Total_').reset_index()
Out[84]:
   sn  Total_C1  Total_H2  Total_K3
0   1        12         5         6
1   2         4         2         3
2   3         3         2         3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
   sn  total_C1  total_H2  total_K3
0   1        12         5         6
1   2         4         2         3
2   3         3         2         3
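One note beyond the original answers: DataFrame.groupby(axis=1) is deprecated in recent pandas (2.1+). A version-proof sketch of the same prefix sum transposes first, so the grouping happens along rows:
import pandas as pd

df = pd.DataFrame({'sn': [1, 2, 3],
                   'C1-1': [4, 2, 1], 'C1-2': [3, 2, 2], 'C1-3': [5, 0, 0],
                   'H2-1': [4, 2, 0], 'H2-2': [1, 0, 2],
                   'K3-1': [4, 1, 1], 'K3-2': [2, 2, 2]})

cols = df.columns[1:]
prefixes = cols.str.split('-').str[0]
# group the transposed value columns by prefix, sum, then transpose back
totals = df[cols].T.groupby(prefixes).sum().T.add_prefix('total_')
out = df[['sn']].join(totals)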

Pandas - how to get the minimum value for each row from values across several rows

I have a pandas dataframe in the following structure:
| index | a  | b  | c | d | e |
|-------|----|----|---|---|---|
| 0     | -1 | -2 | 5 | 3 | 1 |
How can I get the minimum value for each row using only the positive values in columns a-e?
For the example row above, the minimum should be 1 (the smallest of 5, 3 and 1), not -2.
You can loop over the rows and apply your condition to each one, for example:
import pandas as pd

df = pd.DataFrame([{"a": -2, "b": 2, "c": 5}, {"a": 3, "b": 0, "c": -1}])
#    a  b  c
# 0 -2  2  5
# 1  3  0 -1

def my_condition(li):
    li = [i for i in li if i >= 0]
    return min(li)

min_cel = []
for k, r in df.iterrows():
    li = r.to_dict().values()
    min_cel.append(my_condition(li))
df["min"] = min_cel
#    a  b  c  min
# 0 -2  2  5    2
# 1  3  0 -1    0
You can also write the same thing on one line:
df['min'] = df.apply(lambda row: min([i for i in row.to_dict().values() if i >= 0]), axis=1)
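Not in the original answer, but a fully vectorized alternative masks the negative values to NaN and lets min skip them, avoiding the per-row Python loop entirely:
import pandas as pd

df = pd.DataFrame([{"a": -2, "b": 2, "c": 5}, {"a": 3, "b": 0, "c": -1}])
# values below 0 become NaN; min(axis=1) ignores NaN by default
df["min"] = df[df >= 0].min(axis=1)
#    a  b  c  min
# 0 -2  2  5  2.0
# 1  3  0 -1  0.0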

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value first, then overwrite it through a boolean mask built with isin:
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follow:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider below example:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
   VALUE
0      1
1      1
2      2
3      2
4      3
5      3
6      4
7      4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
   VALUE  IND
0      1    1
1      1    1
2      2    0
3      2    0
4      3    0
5      3    0
6      4    1
7      4    1
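For completeness, a sketch (not from the answer above) showing that the original np.where attempt works once the Python or, which triggers the ambiguity error, is replaced by isin:
import numpy as np
import pandas as pd

df = pd.DataFrame({'VALUE': [1, 1, 2, 2, 3, 3, 4, 4]})
# isin returns a boolean Series, which np.where can consume elementwise
df['IND'] = np.where(df['VALUE'].isin([1, 4]), 1, 0)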