Getting a variable number of pandas rows based on a dictionary lookup - pandas

In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
'qty': np.random.randn(data_size)}).sort_values('symbol')
How can I get a dataframe with a variable number of rows per symbol, taken from the dictionary?
I tried [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)]. It raises a RuntimeWarning and looks very incorrect.

You can use concat with a list comprehension:
print (pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k,v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683

Adding another method using groupby + cumcount and df.query:
df.assign(v=df.groupby("symbol").cumcount()+1,k=df['symbol'].map(max_rows)).query("v<=k")
Or the same logic without assigning extra columns (thanks @jezrael):
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012
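A third option (a sketch on a tiny made-up frame, not the random df from the question): iterate over the groups and take head() per group, skipping symbols that are absent from the dictionary.

```python
import pandas as pd

max_rows = {'A': 3, 'B': 2, 'D': 4}
df = pd.DataFrame({'symbol': list('AAAABBBC'), 'qty': range(8)})

# head() per group, driven by the dict; symbols missing from
# max_rows (here 'C') are dropped entirely
result = pd.concat([g.head(max_rows[name])
                    for name, g in df.groupby('symbol')
                    if name in max_rows])
print(result)
```

This avoids the per-group boolean mask at the cost of an explicit Python-level loop over groups, which is fine for a modest number of symbols.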

Pandas groupby custom nlargest

When trying to solve my own question here I came upon an interesting problem. Suppose I have this dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(dict(group=np.random.choice(["a", "b", "c", "d"], size=100),
                       values=np.random.randint(0, 100, size=100)))
I want to select top values per each group, but I want to select the values according to some range. Let's say, top x to y values per each group. If any group has less than x values in it, give top(min((y-x), x)) values for that group.
In general, I am looking for a custom made alternative function which could be used with groupby objects to select not top n values, but instead top x to y range of values.
EDIT: nlargest() is a special case of the solution to my problem where x = 1 and y = n
Any further help, or guidance will be appreciated
Adding an example with this df and top(3, 6). For every group, output the 3rd through 6th largest values:
group value
a 190
b 166
a 163
a 106
b 86
a 77
b 70
b 69
c 67
b 54
b 52
a 50
c 24
a 20
a 11
As group c has just two members, it will output top(3)
group value
a 106
a 77
a 50
b 69
b 54
b 52
c 67
c 24
There are other ways of doing this and, depending on how large your dataframe is, you may want to search for groupby slicing or something similar. You may also want to check that my conditions are correct (<, <=, etc.).
x=3
y=6
# this gets the groups which don't meet the x minimum
df1 = df[df.groupby('group')['value'].transform('count')<x]
# this df takes those groups meeting the minimum and then shifts by x-1; does some cleanup and chooses nlargest
df2 = df[df.groupby('group')['value'].transform('count')>=x].copy()
df2['shifted'] = df2.groupby('group')['value'].shift(periods=-(x-1))
df2.drop('value', axis=1, inplace=True)
df2 = df2.groupby('group')['shifted'].nlargest(y-x).reset_index().rename(columns={'shifted':'value'}).drop('level_1', axis=1)
# putting it all together
df_final = pd.concat([df1, df2])
df_final
group value
8 c 67.0
12 c 24.0
0 a 106.0
1 a 77.0
2 a 50.0
3 b 70.0
4 b 69.0
5 b 54.0
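The same top(x, y) idea can also be sketched more directly (function name and the small example frame below are my own, not from the question): sort each group descending, slice positions x-1 to y-1, and fall back to the group's top x when it has fewer than x rows.

```python
import pandas as pd

def top_range(g, x=3, y=6):
    # take the x-th through (y-1)-th largest values; if the group has
    # fewer than x rows, fall back to its top x (the whole group here)
    g = g.sort_values('value', ascending=False)
    return g.iloc[x - 1:y - 1] if len(g) >= x else g.head(x)

df = pd.DataFrame({'group': list('aaaaaaabbbbbbcc'),
                   'value': [190, 163, 106, 77, 50, 20, 11,
                             166, 86, 70, 69, 54, 52, 67, 24]})
out = pd.concat([top_range(g) for _, g in df.groupby('group')])
print(out)
```

On the example data this reproduces the expected output above: a gives 106, 77, 50; b gives 70, 69, 54; c (only two members) gives 67, 24.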

Convert and replace a string value in a pandas df with its float type

I have a value in pandas df which is accidentally put as a string as follows:
df.iloc[5329]['values']
'72,5'
I want to convert this value to float and replace it in the df. I have tried the following ways:
df.iloc[5329]['values'] = float(72.5)
also,
df.iloc[5329]['values'] = 72.5
and,
df.iloc[5329]['values'] = df.iloc[5329]['values'].replace(',', '.')
It runs successfully with a warning, but when I check the df, it's still stored as '72,5'.
The entire df at that index is as follows:
df.iloc[5329]
value 36.25
values 72,5
values1 72.5
currency MYR
Receipt Kuching, Malaysia
Delivery Male, Maldives
How can I solve that?
iloc needs specific row, col positioning.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': np.random.choice(100, 3),
    'B': [15.2, '72,5', 3.7]
})
print(df)
df.info()
Output:
A B
0 84 15.2
1 92 72,5
2 56 3.7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null object
Update to value:
df.iloc[1,1] = 72.5
print(df)
Output:
A B
0 84 15.2
1 92 72.5
2 56 3.7
Make sure you don't use chained indexing (i.e. [][]) when doing assignment, since df.iloc[5329]['values'] can make a copy of the data, and the assignment then happens to the copy, not the original df. Instead, use a single label-based indexer (note loc here, since iloc only accepts integer positions):
df.loc[5329, 'values'] = 72.5
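If the whole column may contain comma-decimal strings rather than just one cell, a column-wide conversion sketch (using the toy mixed column B from above, rebuilt here so the snippet is self-contained):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.choice(100, 3),
                   'B': [15.2, '72,5', 3.7]})

# cast everything to str, swap the decimal comma for a dot,
# then convert the whole column to float in one pass
df['B'] = df['B'].astype(str).str.replace(',', '.', regex=False).astype(float)
print(df['B'])
```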

Create a dataframe from a series with a TimeSeriesIndex multiplied by another series

Let's say I have a series, ser1, with a TimeSeriesIndex of length x. I also have another series, ser2, of length y. How do I multiply these so that I get a dataframe of shape (x, y), where the index comes from ser1 and the columns are the index of ser2? I want every element of ser2 to be multiplied by the value of each element in ser1.
import pandas as pd
ser1 = pd.Series([100, 105, 110, 114, 89],index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'), name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
Perhaps this is more elegantly done with numpy.
Try this using np.outer with pandas DataFrame constructor:
pd.DataFrame(np.outer(ser1, test_ser2), index=ser1.index, columns=test_ser2.index)
Output:
a b c d e
2021-01-01 100 200 300 400 500
2021-01-02 105 210 315 420 525
2021-01-03 110 220 330 440 550
2021-01-04 114 228 342 456 570
2021-01-05 89 178 267 356 445
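Equivalently (a sketch of the same idea), plain NumPy broadcasting can stand in for np.outer: reshape ser1 into a column and let the multiplication broadcast to an (x, y) array.

```python
import pandas as pd
import numpy as np

ser1 = pd.Series([100, 105, 110, 114, 89],
                 index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'),
                 name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# (x, 1) * (y,) broadcasts to (x, y), exactly what np.outer produces
res = pd.DataFrame(ser1.to_numpy()[:, None] * test_ser2.to_numpy(),
                   index=ser1.index, columns=test_ser2.index)
print(res)
```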

Pandas to summarize (cum and count) columns

I have the dataframe below and want to summarize the columns (sum of Sales, sum of Spend, count of Transport).
I tried pivot_table:
import pandas as pd
from io import StringIO
csvfile = StringIO(
"""Staff Color Sales Spend Transport
Amelia Red 188 49 35
Elijah Yellow 326 18
James Blue 378 10
Benjamin Red 144 34 45
Isabella Red 269 10
Lucas Yellow 159 48
Mason Blue 496 48 20""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df = df.pivot_table(index=None,
                    values=['Sales', 'Spend', 'Transport'],
                    aggfunc={'Sales': 'sum', 'Spend': 'sum', 'Transport': 'count'})
print (df)
But the error says:
ValueError: No group keys passed!
And I also tried:
columns = ['sum of Sales','sum of Spend','count of Transport']
df_1 = pd.DataFrame(columns = columns)
df_1 = df_1.append({"sum of Sales":df.Sales.sum(),
"sum of Spend":df.Spend.sum(),
"count of Transport":df.Transport.count()}, ignore_index=True)
print (df_1)
It looks OK, but I wonder whether there's a better way to achieve it. Thank you.
In your case, just do:
df.agg({'Sales':'sum','Spend':'sum','Transport':'count'}).to_frame().T
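If you also want the column labels from your second attempt, a small follow-up sketch (the label names are taken from the question; the frame is rebuilt inline so the snippet runs on its own):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Sales': [188, 326, 378, 144, 269, 159, 496],
                   'Spend': [49, 18, 10, 34, 10, 48, 48],
                   'Transport': [35, np.nan, np.nan, 45, np.nan, np.nan, 20]})

# agg with a dict returns a Series; to_frame().T turns it into a one-row frame
out = df.agg({'Sales': 'sum', 'Spend': 'sum', 'Transport': 'count'}).to_frame().T
out.columns = ['sum of Sales', 'sum of Spend', 'count of Transport']
print(out)
```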

Extract dictionary value from a list contained in Pandas dataframe column

I'm trying to extract values from dictionaries contained within lists in a Pandas dataframe column. The objective is to split the id key into multiple columns. Sample data looks like:
Column_Header
[{'id': '498', 'relTypeId': '2'}, {'id': '499', 'relTypeId': '3'}]
[{'id': '499', 'relTypeId': '3'}, {'id': '500', 'relTypeId': '4'}, {'id': '501', 'relTypeId': '5'}]
I have tried as below
list(map(lambda x: x["id"], df["Column_Header"]))
But I'm getting the following error:
"list indices must be integers or slices, not str". The desired output is:
col1|col2|col3
498 |499 |
499 |500 |501
Can someone please help?
We can explode first, then create an additional key with cumcount, and pivot:
s=df.Column_Header.explode().str['id']
s=pd.crosstab(index=s.index,columns=s.groupby(level=0).cumcount(),values=s,aggfunc='sum')
Out[133]:
col_0 0 1 2
row_0
0 498 499 NaN
1 499 500 501
Use a nested list comprehension that selects the id key from each dictionary if performance is important:
df = pd.DataFrame([[y['id'] for y in x] for x in df['Column_Header']], index=df.index)
print (df)
0 1 2
0 498 499 None
1 499 500 501
If missing values are possible, use:
L = [[y['id'] for y in x] if isinstance(x, list) else [None] for x in df['Column_Header']]
df = pd.DataFrame(L, index=df.index)
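A quick end-to-end sketch of that NaN-safe variant, on made-up sample rows (including one missing value, to exercise the isinstance guard):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Column_Header': [
    [{'id': '498', 'relTypeId': '2'}, {'id': '499', 'relTypeId': '3'}],
    np.nan,
    [{'id': '499', 'relTypeId': '3'}, {'id': '500', 'relTypeId': '4'},
     {'id': '501', 'relTypeId': '5'}],
]})

# rows that are not lists (e.g. NaN) collapse to a single-None row;
# shorter rows are padded with None by the DataFrame constructor
L = [[y['id'] for y in x] if isinstance(x, list) else [None]
     for x in df['Column_Header']]
out = pd.DataFrame(L, index=df.index)
print(out)
```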