I am having difficulty making selections from a MultiIndex in pandas 0.14.1 (I know it is an old version, but my choice is limited).
I need to make selections based on index labels.
For a single-level index, selection with repetitions works fine:
import pandas as pd
idx = pd.IndexSlice

(pd.DataFrame
 .from_records({'A': [1, 2, 3], 'B': [11, 12, 13]})
 .set_index('A')
).loc[idx[1, 1, 1, 2, 1], :]
B
A
1 11
1 11
1 11
2 12
1 11
For a multilevel index, selection works differently, returning only the unique values:
(pd.DataFrame
.from_records({'A' : [1,2,3], 'B' : [11,12,13], 'C' : [21,22,23]})
.set_index(['A', 'B'])
).loc[idx[[1,1,1,2,1], :], :]
C
A B
1 11 21
2 12 22
QUESTION: Is there any way to use a MultiIndex but preserve the selection behaviour of a single-level index? The expected output is the same as with the single-level index: 5 rows returned, not 2.
Best I could come up with. Be warned, this will blow up for many reasons and I fully expect you to come back with "But this didn't work on my real data."
Option 1
This will always work, but may not always be what you expected.
pd.concat([df.xs(i, drop_level=False) for i in [1, 1, 1, 2, 1]])
C
A B
1 11 21
11 21
11 21
2 12 22
1 11 21
Option 2
This will break if your first level values aren't unique on their own.
df.iloc[df.index.get_level_values(0).searchsorted([1, 1, 1, 2, 1])]
C
A B
1 11 21
11 21
11 21
2 12 22
1 11 21
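If your first-level values are not unique on their own (or the index is not sorted), a more defensive variant is to collect the row positions for each requested label and index positionally. This is only a sketch under those assumptions; the helper name pos_by_label is mine, not from the original answer.
import numpy as np

labels = [1, 1, 1, 2, 1]
# map each requested first-level label to all of its row positions
pos_by_label = {lab: np.flatnonzero(df.index.get_level_values(0) == lab)
                for lab in set(labels)}
# concatenate the positions in the requested order, keeping repetitions
df.iloc[np.concatenate([pos_by_label[lab] for lab in labels])]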
I wish to extract the list of values from one column in pandas and then use those values to create additional columns based on the number of values within each list.
My dataframe:
a = pd.DataFrame({"test":["","","","",[1,2,3,4,5,6,6],"","",[11,12,13,14,15,16,17]]})
Current output:
test
0
1
2
3
4 [1, 2, 3, 4, 5, 6, 6]
5
6
7 [11, 12, 13, 14, 15, 16, 17]
Expected output:
  example_1 example_2 example_3 example_4 example_5 example_6 example_7
0
1
2
3
4         1         2         3         4         5         6         6
5
6
7        11        12        13        14        15        16        17
Let's say we expect each list to have 7 values; this covers most of my current cases, so if I can set that limit it will work well. Thank you.
This should be what you're looking for. I replaced the NaN values with blank cells, but you can change that of course.
a = pd.DataFrame({"test":["","","","",[1,2,3,4,5,6,6],"","",[11,12,13,14,15,16,17]]})
ab = a.test.apply(pd.Series).fillna("")
ab.columns = ['example_' + str(i) for i in range(1, 8)]
Output:
Edit: using .add_prefix() as the other answer uses is prettier than setting the column names manually with a list comprehension.
Here's a one-liner:
(a.test.apply(pd.Series)
   .fillna("")
   .set_axis(range(1, a.test.str.len().max() + 1), axis=1)
   .add_prefix("example_")
)
The set_axis is just to make the columns 1-indexed; if you don't mind 0-indexing, you can leave it out.
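For reference, a minimal sketch of that 0-indexed variant (assuming the same a as above); the columns then come out as example_0 through example_6:
a.test.apply(pd.Series).fillna("").add_prefix("example_")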
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
from pandas import DataFrame

data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
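For what it's worth, I believe the shorter GroupBy.head form selects the same rows, since head keeps the first n rows of each group in the already-sorted order; a sketch with the same df:
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)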
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series result into the original dataframe:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
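Note that drop_duplicates keeps only one row per value of A even when the minimum of B is tied. If you want to control which tied row survives, you can sort on a secondary column first; a sketch (the choice of C as tie-breaker is only illustrative):
df.sort_values(['B', 'C']).drop_duplicates('A')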
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group-wise minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
There are some similarly named questions, but they do not reflect the use case I am facing. I have a dataframe with groups and values, and I want to select values sliced by their order within each group (confusing, maybe; an example will explain better).
This is my data:
group value
a 20
a 16
a 14
a 13
a 12
b 19
b 17
b 16
b 14
b 13
b 12
b 12
b 11
I want to group by group and slice [a:b] with nlargest logic; in other words, if a = 2 and b = 7, I want the 3rd, 4th, 5th, 6th and 7th largest values per group. I could not find any question here on this use case, nor could I find anything in the pandas-dev GitHub.
If there are fewer than b elements in any of the groups, then b = len(of that group) should be applied. If there are two or more elements with the same value, they should all be selected if they are within the [a:b] slice.
My desired result looks like this:
group value
a 14
a 13
a 12
b 16
b 14
b 13
b 12
b 12
Here, group a has 5 elements, which is less than b in the example, and because of that the 3rd to 5th largest elements are returned. In group b the 6th and 7th largest values are the same, so they are both returned.
The closest question to mine is this question about slicing, but it does not use nlargest logic; it just slices the groups.
If you could guide me on this, I would appreciate it!
You could try the following:
import pandas as pd

gbg = df.groupby('group')
a = 2
b = 7
res = (gbg['value']
       .agg(lambda x: x.tolist()[a:b])  # .agg "aggregates" each group; here it builds the [a:b] slice per group
       .to_frame()                      # convert the result from pd.Series to pd.DataFrame
       .explode('value')                # write the list values back out as rows
       .reset_index())                  # restore the 'group' column
The intermediate result after .agg():
group
a [14, 13, 12]
b [16, 14, 13, 12, 12]
Name: value, dtype: object
And the full result:
group value
0 a 14
1 a 13
2 a 12
3 b 16
4 b 14
5 b 13
6 b 12
7 b 12
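If the values were not already sorted descending within each group (they are in the example data), a variant of the same idea that ranks explicitly with nlargest might look like this; a sketch, not part of the original answer:
res = (df.groupby('group')['value']
         .apply(lambda s: s.nlargest(b).iloc[a:])  # top-b values, then drop the first a
         .reset_index(level=0)                     # restore 'group' as a column
         .reset_index(drop=True))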
By sorting the dataframe first and then using the slice method, this approach gives me the result I expected.
df.sort_values(["group", "value"], ascending = False).groupby("group").slice(2, 7)
Output is
group value
a 14
a 13
a 12
b 16
b 14
b 13
b 12
b 12
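If GroupBy.slice is not available in your pandas version, GroupBy.nth accepts a slice in recent pandas (1.4+, if I recall correctly), so something like this sketch should give the same rows:
(df.sort_values(["group", "value"], ascending=[True, False])
   .groupby("group")
   .nth(slice(2, 7)))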
In the spirit of "Generating a list of random numbers, summing to 1" from several years ago, is there a way to apply the NumPy array result of np.random.dirichlet against a groupby on the dataframe?
For example, I can loop through the unique values of the letter column and apply one at a time:
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]

for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]

print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
but there's got to be a better way than iterating the unique values and filtering the dataframe for each. This is small but I'll have potentially tens of thousands of groupings of varying sizes of ~50-100 rows each, and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending to a second dataframe and finally merging the results, though that seems more convoluted than this. I have not found a solution where I can apply an array of groupby size to the groupby but I think something along those lines would do.
Thoughts? Suggestions? Solutions?
IIUC, do a transform():
def dirichlet(x, size=1):
    return np.random.dirichlet(np.ones(len(x)), size=size)[0]

df['prop_of_grp'] = df.groupby('letter')['value'].transform(dirichlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
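If reproducibility matters, the same transform idea works with a seeded NumPy Generator instead of the legacy np.random functions; a sketch (the seed value is arbitrary):
rng = np.random.default_rng(42)  # arbitrary seed, for repeatable draws

def dirichlet_props(x):
    # one Dirichlet draw sized to the group, so each group's proportions sum to 1
    return rng.dirichlet(np.ones(len(x)))

df['prop_of_grp'] = df.groupby('letter')['value'].transform(dirichlet_props)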
Consider the following dataframe, which has columns with the same name (apparently this does happen; currently I have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
>>> df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index, only the column with the respective index would be gone, but apparently this is not the case.
>>> df.drop(df.columns[-1],1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I chose misleading values for the first column; this is fixed now
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
a
0  10
1  11
2  12
3  13
4  14
So this indexes all rows and then uses the column mask generated by duplicated, inverting the mask with ~.
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.1), you should do either of the following:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to @DavideFiocco for alerting me.
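If you would rather keep the last occurrence of each duplicated column name instead of the first, duplicated takes a keep argument; a quick sketch:
df.loc[:, ~df.columns.duplicated(keep='last')]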