Filter pandas column based on frequency of occurrence

My df:
import pandas as pd

data = [
    {'Part': 'A', 'Value': 10, 'Delivery': 10},
    {'Part': 'B', 'Value': 12, 'Delivery': 8.5},
    {'Part': 'C', 'Value': 10, 'Delivery': 10.1},
    {'Part': 'D', 'Value': 10, 'Delivery': 10.3},
    {'Part': 'E', 'Value': 11, 'Delivery': 9.2},
    {'Part': 'F', 'Value': 15, 'Delivery': 7.3},
    {'Part': 'G', 'Value': 10, 'Delivery': 10.1},
    {'Part': 'H', 'Value': 12, 'Delivery': 8.1},
    {'Part': 'I', 'Value': 12, 'Delivery': 8.0},
    {'Part': 'J', 'Value': 10, 'Delivery': 10.2},
    {'Part': 'K', 'Value': 8, 'Delivery': 12.5}
]
df = pd.DataFrame(data)
I want to filter the given dataframe down to only the rows whose "Value" is the most frequently occurring one.
Expected output:
data = [
    {'Part': 'A', 'Value': 10, 'Delivery': 10},
    {'Part': 'C', 'Value': 10, 'Delivery': 10.1},
    {'Part': 'D', 'Value': 10, 'Delivery': 10.3},
    {'Part': 'G', 'Value': 10, 'Delivery': 10.1},
    {'Part': 'J', 'Value': 10, 'Delivery': 10.2}
]
df_output = pd.DataFrame(data)
Is there any way to do this?

Use boolean indexing with Series.mode and select the first value with Series.iat:
df1 = df[df['Value'].eq(df['Value'].mode().iat[0])]
Or compare with the first index value of the Series created by Series.value_counts, because by default its values are sorted by count in descending order:
df1 = df[df['Value'].eq(df['Value'].value_counts().index[0])]
print(df1)
  Part  Value  Delivery
0    A     10      10.0
2    C     10      10.1
3    D     10      10.3
6    G     10      10.1
9    J     10      10.2
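Note that if several values tie for most frequent, mode() returns all of them and iat[0] keeps only the first. A small variant (not part of the original answer) that keeps the rows for every tied value uses Series.isin:
# keep rows whose Value is among all modal values (handles ties)
df1 = df[df['Value'].isin(df['Value'].mode())]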

Related

How to convert this nested loop to numpy broadcast?

I want to rearrange my data (two even-length 1d arrays):
cs = [w x y z]
rs = [a b c d e f]
to make a result like this:
[[a b w x]
 [c d w x]
 [e f w x]
 [a b y z]
 [c d y z]
 [e f y z]]
This is what I have tried (it works):
ls = []
for c in range(0, len(cs), 2):
    for r in range(0, len(rs), 2):
        item = [rs[r], rs[r+1], cs[c], cs[c+1]]
        ls.append(item)
But I want to get the same result using reshaping/broadcasting or other numpy functions.
What is the idiomatic way to do this task in numpy?
You could tile the elements of rs, repeat the elements of cs and then arrange those as columns for a 2D array:
import numpy as np
cs = np.array(['w', 'x', 'y', 'z'])
rs = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
res = np.c_[np.tile(rs[::2], len(cs) // 2), np.tile(rs[1::2], len(cs) // 2),
            np.repeat(cs[::2], len(rs) // 2), np.repeat(cs[1::2], len(rs) // 2)]
Result:
array([['a', 'b', 'w', 'x'],
       ['c', 'd', 'w', 'x'],
       ['e', 'f', 'w', 'x'],
       ['a', 'b', 'y', 'z'],
       ['c', 'd', 'y', 'z'],
       ['e', 'f', 'y', 'z']], dtype='<U1')
An alternative:
np.c_[np.tile(rs.reshape(-1, 2), (len(cs) // 2, 1)),
      np.repeat(cs.reshape(-1, 2), len(rs) // 2, axis=0)]
An alternative to using tile/repeat, is to generate repeated row indices.
Make the two arrays - reshaped as they will be combined:
In [106]: rs = np.reshape(list('abcdef'), (3, 2))
In [107]: cs = np.reshape(list('wxyz'), (2, 2))
In [108]: rs
Out[108]:
array([['a', 'b'],
       ['c', 'd'],
       ['e', 'f']], dtype='<U1')
In [109]: cs
Out[109]:
array([['w', 'x'],
       ['y', 'z']], dtype='<U1')
Make 'meshgrid'-like indices (itertools.product could also be used; see the sketch below):
In [110]: IJ = np.indices((3, 2))
In [111]: IJ
Out[111]:
array([[[0, 0],
        [1, 1],
        [2, 2]],

       [[0, 1],
        [0, 1],
        [0, 1]]])
Reshaping with order='F' gives two 1d arrays:
In [112]: I, J = IJ.reshape(2, 6, order='F')
In [113]: I, J
Out[113]: (array([0, 1, 2, 0, 1, 2]), array([0, 0, 0, 1, 1, 1]))
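The same I, J pairs can be built with the itertools.product route mentioned above; a minimal sketch (pure Python, fine for small inputs):
from itertools import product

# product varies its last argument fastest, so the block index j
# (first argument) changes slowest, matching Out[113]
pairs = [(i, j) for j, i in product(range(len(cs)), range(len(rs)))]
I, J = map(np.array, zip(*pairs))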
Then just index the rs and cs and combine them with hstack:
In [114]: np.hstack((rs[I], cs[J]))
Out[114]:
array([['a', 'b', 'w', 'x'],
       ['c', 'd', 'w', 'x'],
       ['e', 'f', 'w', 'x'],
       ['a', 'b', 'y', 'z'],
       ['c', 'd', 'y', 'z'],
       ['e', 'f', 'y', 'z']], dtype='<U1')
Edit
Here's another way of looking at this - a bit more advanced. With sliding_window_view we can get a "block" view of that Out[114] result:
In [130]: np.lib.stride_tricks.sliding_window_view(_114, (3, 2))[::3, ::2, :, :]
Out[130]:
array([[[['a', 'b'],
         ['c', 'd'],
         ['e', 'f']],

        [['w', 'x'],
         ['w', 'x'],
         ['w', 'x']]],


       [[['a', 'b'],
         ['c', 'd'],
         ['e', 'f']],

        [['y', 'z'],
         ['y', 'z'],
         ['y', 'z']]]], dtype='<U1')
With a bit more reverse engineering, I find I can create Out[114] with:
In [147]: res = np.zeros((6, 4), 'U1')
In [148]: res1 = np.lib.stride_tricks.sliding_window_view(res, (3, 2), writeable=True)[::3, ::2, :, :]
In [149]: res1[:, 0, :, :] = rs
In [150]: res1[:, 1, :, :] = cs[:, None, :]
In [151]: res
Out[151]:
array([['a', 'b', 'w', 'x'],
       ['c', 'd', 'w', 'x'],
       ['e', 'f', 'w', 'x'],
       ['a', 'b', 'y', 'z'],
       ['c', 'd', 'y', 'z'],
       ['e', 'f', 'y', 'z']], dtype='<U1')
I can't say that either of these is superior, but they show there are various ways of "vectorizing" this kind of array layout.

Pandas Outerjoin New Rows

I have two dataframes df1 and df2.
import pandas as pd

df1 = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'Col3': [100, 120, 130, 200, 190, 210],
})
df2 = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'mas', 'apc', 'ywt'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'Col4': [120, 140, 120, 200, 190, 210],
})
I do an outer join on the two dataframes:
df = pd.merge(df1, df2[['Col1', 'Col4']], on= 'Col1', how='outer')
I get a new dataframe, but the Col2 entries coming from df2 are missing. I get:
df = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat', 'mas', 'apc', 'ywt'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses', 'NaN', 'NaN', 'NaN'],
    'Col3': [100, 120, 130, 200, 190, 210, 'NaN', 'NaN', 'NaN'],
    'Col4': [120, 140, 120, 'NaN', 'NaN', 'NaN', '200', '190', '210']})
But what I want is:
df = pd.DataFrame({
    'Col1': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat', 'mas', 'apc', 'ywt'],
    'Col2': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'Col3': [100, 120, 130, 200, 190, 210, 'NaN', 'NaN', 'NaN'],
    'Col4': [120, 140, 120, 'NaN', 'NaN', 'NaN', '200', '190', '210']})
I want to have the entries for Col2 from df2 appear as new rows in the merged dataframe.
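One way to get that output (a sketch, assuming Col2 has the same meaning in both frames) is to merge on both Col1 and Col2, so the rows that exist only in df2 keep their Col2 labels:
# merging on both keys preserves Col2 for rows unique to either frame
df = pd.merge(df1, df2, on=['Col1', 'Col2'], how='outer')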

numpy reshape/transpose 3D to wide 2D

Make example
import numpy as np

letters = np.array([
    np.array([
        np.array(['a', 'a', 'a'])
        , np.array(['b', 'b', 'b'])
        , np.array(['c', 'c', 'c'])
    ])
    , np.array([
        np.array(['d', 'd', 'd'])
        , np.array(['e', 'e', 'e'])
        , np.array(['f', 'f', 'f'])
    ])
    , np.array([
        np.array(['g', 'g', 'g'])
        , np.array(['h', 'h', 'h'])
        , np.array(['i', 'i', 'i'])
    ])
])
array([[['a', 'a', 'a'],
        ['b', 'b', 'b'],
        ['c', 'c', 'c']],

       [['d', 'd', 'd'],
        ['e', 'e', 'e'],
        ['f', 'f', 'f']],

       [['g', 'g', 'g'],
        ['h', 'h', 'h'],
        ['i', 'i', 'i']]], dtype='<U1')
Desired output
array([['a', 'a', 'a', 'd', 'd', 'd', 'g', 'g', 'g'],
       ['b', 'b', 'b', 'e', 'e', 'e', 'h', 'h', 'h'],
       ['c', 'c', 'c', 'f', 'f', 'f', 'i', 'i', 'i']], dtype='<U1')
See how the 2D arrays are now side-by-side?
For the sake of memory, I'd prefer to do this with transpose and reshape rather than stacking/concatenating into a new array.
Attempt
letters.reshape(
    letters.shape[2],
    letters.shape[0]*letters.shape[1]
)
array([['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
       ['d', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
       ['g', 'g', 'g', 'h', 'h', 'h', 'i', 'i', 'i']], dtype='<U1')
I think I need to transpose... before reshaping?
letters.transpose(
    1, 0, 2
).reshape(
    # where index represents dimension
    letters.shape[2],
    letters.shape[0]*letters.shape[1]
)
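That transpose is indeed the missing step: swapping the first two axes lines up the rows of each block before the flatten. A compact self-contained check (the repeat-based construction is just shorthand for the example array, not from the question):
import numpy as np

# shorthand reconstruction of the (3, 3, 3) example array above
letters = np.array(list('abcdefghi')).repeat(3).reshape(3, 3, 3)

# swap the first two axes, then flatten each row of blocks side by side
out = letters.transpose(1, 0, 2).reshape(letters.shape[1],
                                         letters.shape[0] * letters.shape[2])
One caveat on memory: transpose itself returns a view, but reshaping a non-contiguous view forces a copy, so this still allocates a new array.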

pandas same attribute comparison

I have the following dataframe:
import pandas as pd

df = pd.DataFrame([{'name': 'a', 'label': 'false', 'score': 10},
                   {'name': 'a', 'label': 'true', 'score': 8},
                   {'name': 'c', 'label': 'false', 'score': 10},
                   {'name': 'c', 'label': 'true', 'score': 4},
                   {'name': 'd', 'label': 'false', 'score': 10},
                   {'name': 'd', 'label': 'true', 'score': 6},
                   ])
I want to return the names whose "false" label score is at least double their "true" label score. In my example, it should return only the name "c".
First you can pivot the data, then look at the ratio and filter what you want:
new_df = df.pivot(index='name', columns='label', values='score')
new_df[new_df['false'].div(new_df['true']).gt(2)]
output:
label  false  true
name
c         10     4
If you only want the names, you can do:
new_df.index[new_df['false'].div(new_df['true']).gt(2)].values
which gives
array(['c'], dtype=object)
Update: since your data is the result of orig_df.groupby(...).count(), you could instead compute the fraction of "true" labels per name on the original frame (note the grouping key must be passed as a Series here, since 'name' is a column rather than the index):
orig_df['label'].eq('true').groupby(orig_df['name']).mean()
and look at the rows with values <= 1/3.
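A minimal sketch of that last step (orig_df is the hypothetical pre-aggregation frame, one row per observation):
# fraction of 'true' rows per name; false >= 2*true  <=>  fraction <= 1/3
frac_true = orig_df['label'].eq('true').groupby(orig_df['name']).mean()
frac_true.index[frac_true.le(1/3)]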

Pandas multi index dataframe to nested dictionary

Let's say I have the following dataframe
import pandas as pd

df = pd.DataFrame({0: {('A', 'a'): 1, ('A', 'b'): 6, ('B', 'a'): 2, ('B', 'b'): 7},
                   1: {('A', 'a'): 2, ('A', 'b'): 7, ('B', 'a'): 3, ('B', 'b'): 8},
                   2: {('A', 'a'): 3, ('A', 'b'): 8, ('B', 'a'): 4, ('B', 'b'): 9},
                   3: {('A', 'a'): 4, ('A', 'b'): 9, ('B', 'a'): 5, ('B', 'b'): 1},
                   4: {('A', 'a'): 5, ('A', 'b'): 1, ('B', 'a'): 6, ('B', 'b'): 2}})
which looks like this:
     0  1  2  3  4
A a  1  2  3  4  5
  b  6  7  8  9  1
B a  2  3  4  5  6
  b  7  8  9  1  2
When I convert this to a dictionary via to_dict (regardless of stacking, unstacking), I get a dictionary whose keys are tuples:
df.transpose().to_dict()
{('A', 'a'): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('A', 'b'): {0: 6, 1: 7, 2: 8, 3: 9, 4: 1},
 ('B', 'a'): {0: 2, 1: 3, 2: 4, 3: 5, 4: 6},
 ('B', 'b'): {0: 7, 1: 8, 2: 9, 3: 1, 4: 2}}
What I'd like instead is a nested dict like this:
{'A':{'a': {0: 1, 1:2, 2:3, 3:4, 4:5}, 'b':{0:6, 1:7, 2:8, 3:9,4:1}...
You can use a dictionary comprehension to iterate through the outer levels (values 'A' and 'B') and use the xs method to slice the frame by those levels.
{level: df.xs(level).to_dict('index') for level in df.index.levels[0]}
{'A': {'a': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
       'b': {0: 6, 1: 7, 2: 8, 3: 9, 4: 1}},
 'B': {'a': {0: 2, 1: 3, 2: 4, 3: 5, 4: 6},
       'b': {0: 7, 1: 8, 2: 9, 3: 1, 4: 2}}}
For n levels you could have something recursive like this:
def createDictFromPandas(df):
    # base case: a flat index - return a row-oriented dict, as above
    if df.index.nlevels == 1:
        return df.to_dict('index')
    dict_f = {}
    for level in df.index.levels[0]:
        # levels can retain values that no longer appear after slicing,
        # so check membership before recursing
        if level in df.index:
            dict_f[level] = createDictFromPandas(df.xs(level))
    return dict_f
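A quick sketch of it on a three-level index (a hypothetical extension of the question's frame):
idx = pd.MultiIndex.from_product([['X', 'Y'], ['A', 'B'], ['a', 'b']])
df3 = pd.DataFrame({0: range(8), 1: range(8, 16)}, index=idx)
createDictFromPandas(df3)
# {'X': {'A': {'a': {0: 0, 1: 8}, 'b': {0: 1, 1: 9}}, 'B': {...}}, 'Y': {...}}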