Generate list of values summing to 1 - within groupby? - pandas

In the spirit of Generating a list of random numbers, summing to 1 from several years ago: is there a way to apply the array returned by np.random.dirichlet to each group of a DataFrame groupby?
For example, I can loop through the unique values of the letter column and apply one at a time:
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]
for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]
print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
but there's got to be a better way than iterating over the unique values and filtering the dataframe for each one. This example is small, but I'll have potentially tens of thousands of groupings of varying sizes of roughly 50-100 rows each, and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending each to a second dataframe, and finally merging the results, though that seems even more convoluted than this. I have not found a way to apply a group-sized array directly to the groupby, but I think something along those lines would do.
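A rough sketch of that per-group/concat alternative (using the df above) would be:
parts = []
for letter, grp in df.groupby('letter'):
    grp = grp.copy()
    grp['prop_of_grp'] = np.random.dirichlet(np.ones(len(grp)))
    parts.append(grp)
df = pd.concat(parts)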
Thoughts? Suggestions? Solutions?

IIUC, do a transform():
def dirichlet(x, size=1):
    return np.random.dirichlet(np.ones(len(x)), size=size)[0]

df['prop_of_grp'] = df.groupby('letter')['value'].transform(dirichlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
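If reproducibility matters across all those groups, a seeded Generator can be swapped in; a minimal sketch (the seed value here is an arbitrary choice for illustration):
rng = np.random.default_rng(42)  # arbitrary seed, for illustration

df['prop_of_grp'] = df.groupby('letter')['value'].transform(
    lambda x: rng.dirichlet(np.ones(len(x))))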

Related

Add/subtract value of a column to the entire column of the dataframe pandas

I have a DataFrame like this, where for column2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value at row 1 column 3. How do I efficiently execute this?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i, j) in zip(range(0, 5999), range(1, len(df.columns))):
    if j == 1:
        df2.values[i, j] = df.values[i, j] + df.values[0, 1]
    elif j > 1:
        df2.iloc[i, j] = df.iloc[i, j] - df.iloc[0, j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (the column names of the dataframe), and the second is the actual values of the first row of the dataframe.
We can convert it to a list to see its values more clearly:
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, using broadcasting, df - df.iloc[0] subtracts each of these values from every entry in the column it came from.
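To confirm, here is what the broadcast subtraction produces on this example:
df2 = df - df.iloc[0]  # row 0 is subtracted column-wise from every row
print(df2)
    0   1   2   3   4
0   0   0   0   0   0
1   5   5   5   5   5
2  10  10  10  10  10
3  15  15  15  15  15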

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that flags the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
To match the desired output exactly, reset the index:
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select the n rows with the smallest values in a specific column (see the sketch below)
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
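For example, to take the two rows with the smallest B per group, GroupBy.head does the same job without apply (a sketch on the question's data):
df.sort_values('B').groupby('A').head(2).reset_index(drop=True)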
Here is an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First, we get the min values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A', suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it is probably because column B contains NaN values, which make idxmin return NaN labels. In my case that was the problem, and adding dropna() fixed it:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep every row where B equals its group's minimum (unlike idxmin, this keeps all ties):
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

Check if list cell contains value

Having a dataframe like this:
month transactions_ids
0 1 [0, 5, 1]
1 2 [7, 4]
2 3 [8, 10, 9, 11]
3 6 [2]
4 9 [3]
For a given transaction_id, I would like to get the month when it took place. Notice that a transaction_id can only be related to one single month.
So for example, given transaction_id = 4, the month would be 2.
I know this can be done with a loop that checks, month by month, whether the transactions_ids list contains the given transaction_id, but I'm wondering if there is a more efficient way.
Cheers
The best way in my opinion is to explode your data frame and avoid having python lists in your cells.
df = df.explode('transactions_ids')
which outputs
month transactions_ids
0 1 0
0 1 5
0 1 1
1 2 7
1 2 4
2 3 8
2 3 10
2 3 9
2 3 11
3 6 2
4 9 3
Then, simply
id_to_find = 1 # example
df.loc[df.transactions_ids == id_to_find, 'month']
P.S.: be aware of the duplicate indexes that explode outputs. In general it is better to follow with explode(...).reset_index(drop=True) to avoid unwanted behavior.
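If many ids need to be looked up, one option (a sketch reusing the exploded df above) is to build a flat lookup Series once, indexed by transaction id:
lookup = df.set_index('transactions_ids')['month']
lookup[4]  # -> 2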
You can use pandas string methods to find the id in the "list" (it's really just a string as far as pandas is concerned when read in using StringIO):
import pandas as pd
from io import StringIO
data = StringIO("""
month transactions_ids
1 [0,5,1]
2 [7,4]
3 [8,10,9,11]
6 [2]
9 [3]
""")
df = pd.read_csv(data, delim_whitespace=True)
df.loc[df['transactions_ids'].str.contains('4'), 'month']
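Note that a bare '4' is a substring match, so it would also hit ids like 14 or 40; a word-boundary regex avoids that:
df.loc[df['transactions_ids'].str.contains(r'\b4\b'), 'month']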
If your transactions_ids are real lists, you can use map to check for membership:
df['transactions_ids'].map(lambda x: 3 in x)
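To go from that boolean mask to the month itself:
df.loc[df['transactions_ids'].map(lambda x: 3 in x), 'month']  # -> 9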

Pandas sort groupby groups by arbitrary condition on its contents

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case, the maximum of column B within each group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a groupby object.
EDIT:
I ended up using (almost) the solution suggested by @jezrael:
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby, otherwise the groups would be sorted lexicographically (A contains strings).
If several groups can share the same max value, use GroupBy.transform with 'max' to create a helper column, then sort with DataFrame.sort_values:
df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the max values are always unique, use Series.argsort instead:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
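One caveat: argsort's default quicksort is not guaranteed stable, and within a group every transformed value ties, so the original row order inside a group may not be preserved. Passing a stable sort kind guards against that:
df = df.iloc[s.argsort(kind='mergesort')]  # mergesort is stable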

Pandas dropping columns by index drops all columns with same name

Consider the following dataframe, which has columns with the same name (apparently this does happen; currently I have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
>>> df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index, only the column with the respective index would be gone, but apparently this is not the case.
>>> df.drop(df.columns[-1], 1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I chose misleading values for the first column; fixed now
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
a
0 10
1 11
2 12
3 13
4 14
This indexes all rows and uses the column mask generated by duplicated, inverting the mask with ~.
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.1) you should use either of the following instead:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to @DavideFiocco for alerting me.
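As an aside, duplicated takes a keep parameter, so the same mask can keep the last occurrence of each name instead (a sketch on the same df):
df.loc[:, ~df.columns.duplicated(keep='last')]  # keeps the second 'a' column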