Drop duplicates and combine IDs where duplication exists [duplicate] - pandas

I tried to use groupby to group rows with multiple values.
col val
A Cat
A Tiger
B Ball
B Bat
import pandas as pd
df = pd.read_csv("Inputfile.txt", sep='\t')
group = df.groupby(['col'])['val'].sum()
I got
A CatTiger
B BallBat
I want to introduce a delimiter, so that my output looks like
A Cat-Tiger
B Ball-Bat
I tried,
group = df.groupby(['col'])['val'].sum().apply(lambda x: '-'.join(x))
this yielded,
A C-a-t-T-i-g-e-r
B B-a-l-l-B-a-t
What is the issue here?
Thanks,
AP

Alternatively, you can do it this way:
In [48]: df.groupby('col')['val'].agg('-'.join)
Out[48]:
col
A Cat-Tiger
B Ball-Bat
Name: val, dtype: object
UPDATE: answering the question from the comment:
In [2]: df
Out[2]:
col val
0 A Cat
1 A Tiger
2 A Panda
3 B Ball
4 B Bat
5 B Mouse
6 B Egg
In [3]: df.groupby('col')['val'].agg('-'.join)
Out[3]:
col
A Cat-Tiger-Panda
B Ball-Bat-Mouse-Egg
Name: val, dtype: object
Finally, to convert the index (or MultiIndex) back into columns:
df1 = df.groupby('col')['val'].agg('-'.join).reset_index(name='new')
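For completeness, a minimal runnable sketch of the whole thing; the inline toy data below is an assumption standing in for Inputfile.txt:
import pandas as pd

# toy data standing in for the tab-separated input file (assumption)
df = pd.DataFrame({'col': ['A', 'A', 'B', 'B'],
                   'val': ['Cat', 'Tiger', 'Ball', 'Bat']})

# join each group's strings with '-' and turn the group key back into a column
df1 = df.groupby('col')['val'].agg('-'.join).reset_index(name='new')
print(df1)
#   col        new
# 0   A  Cat-Tiger
# 1   B   Ball-Bat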

Just try:
group = df.groupby(['col'])['val'].apply(lambda x: '-'.join(x))
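The reason the original attempt produced single characters: .sum() had already concatenated each group into one string, and calling '-'.join on a string iterates over its characters. A quick plain-Python illustration:
# joining a single string iterates its characters
'-'.join('CatTiger')          # 'C-a-t-T-i-g-e-r'
# joining a list of strings is what you actually want
'-'.join(['Cat', 'Tiger'])    # 'Cat-Tiger'
Skipping .sum() and applying the join directly to each group, as above, avoids the problem.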

Related

How to sort range values in pandas dataframe?

Below is part of a range column in a dataframe that I work with. I have tried to sort it using df.sort_values(['column']) but it doesn't work. I would appreciate advice.
You can simplify the solution by sorting on the value before the dash, converted to integers via the key parameter:
f = lambda x: x.str.split('-').str[0].str.replace(',', '', regex=True).astype(int)
df = df.sort_values('column', key=f, ignore_index=True)
print (df)
column
0 1,000-1,999
1 2,000-2,949
2 3,000-3,999
3 4,000-4,999
4 5,000-7,499
5 10,000-14,999
6 15,000-19,999
7 20,000-24,999
8 25,000-29,999
9 30,000-39,999
10 40,000-49,999
11 103,000-124,999
12 125,000-149,999
13 150,000-199,999
14 200,000-249,999
15 250,000-299,999
16 300,000-499,999
Another idea is to use the first integers for sorting:
f = lambda x: x.str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values('column', key=f, ignore_index=True)
If you need to sort by both values split on -, it is possible with:
f = lambda x: x.str.replace(',', '', regex=True).str.extract(r'(\d+)-(\d+)', expand=True).astype(int).apply(tuple, axis=1)
df = df.sort_values('column', key=f, ignore_index=True)
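Putting it together as a runnable sketch; the unsorted input order below is an assumption, since the question only shows the desired result (key= in sort_values needs pandas >= 1.1):
import pandas as pd

# hypothetical unsorted input; only the sorted output is shown in the question
df = pd.DataFrame({'column': ['10,000-14,999', '1,000-1,999',
                              '300,000-499,999', '2,000-2,949']})

# sort by the integer value of the lower bound of each range
f = lambda x: x.str.split('-').str[0].str.replace(',', '', regex=True).astype(int)
df = df.sort_values('column', key=f, ignore_index=True)
print(df)
#             column
# 0      1,000-1,999
# 1      2,000-2,949
# 2    10,000-14,999
# 3  300,000-499,999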

pandas return auxiliary column from groupby and max

I have a pandas DataFrame with 3 columns, A, B, and V.
I want a DataFrame with A as the index and one column, which contains the B for the maximum V
I can easily create a df with A and the maximum V using groupby, and then perform some machinations to extract the corresponding B, but that seems like the wrong idea.
I've been playing with combinations of groupby and agg with no joy.
Sample Data:
A,B,V
MHQ,Q,0.5192
MMO,Q,0.4461
MTR,Q,0.5385
MVM,Q,0.351
NCR,Q,0.0704
MHQ,E,0.5435
MMO,E,0.4533
MTR,E,-0.6716
MVM,E,0.3684
NCR,E,-0.0278
MHQ,U,0.2712
MMO,U,0.1923
MTR,U,0.3833
MVM,U,0.1355
NCR,U,0.1058
A = [1,1,1,2,2,2,3,3,3,4,4,4]
B = [1,2,3,4,5,6,7,8,9,10,11,12]
V = [21,22,23,24,25,26,27,28,29,30,31,32]
df = pd.DataFrame({'A': A, 'B': B, 'V': V})
res = df.groupby('A').apply(
    lambda x: x[x['V'] == x['V'].max()]).set_index('A')['B'].to_frame()
res
B
A
1 3
2 6
3 9
4 12
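As a hedged alternative to the apply-based approach above, idxmax can look up the row label of each group's maximum V directly; with the question's own sample data it would look roughly like this:
import pandas as pd

# the question's sample data
df = pd.DataFrame({
    'A': ['MHQ', 'MMO', 'MTR', 'MVM', 'NCR'] * 3,
    'B': ['Q'] * 5 + ['E'] * 5 + ['U'] * 5,
    'V': [0.5192, 0.4461, 0.5385, 0.351, 0.0704,
          0.5435, 0.4533, -0.6716, 0.3684, -0.0278,
          0.2712, 0.1923, 0.3833, 0.1355, 0.1058],
})

# idxmax returns the row label of the maximum V per group; loc pulls those rows
res = df.loc[df.groupby('A')['V'].idxmax()].set_index('A')['B'].to_frame()
print(res)
#      B
# A
# MHQ  E
# MMO  E
# MTR  Q
# MVM  E
# NCR  U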

is it possible to do the equivalent of SQL nested requests in pandas dataframe?

Is it possible to do in a pandas DataFrame the equivalent of this SQL code?
delete * from tableA where id in (select id from tableB)
Don't know the exact structure of your DataFrames, but something like this should do it:
# Setup dummy data:
import pandas as pd
tableA = pd.DataFrame(data={"id":[1, 2, 3]})
tableB = pd.DataFrame(data={"id":[3, 4, 5]})
# Solution:
tableA = tableA[~tableA["id"].isin(tableB["id"])]
Yes, there is an equivalent. You can try this:
df2.drop(df2.loc[df2['id'].isin(df1['id'])].index)
For example:
df1 = pd.DataFrame({'id': [1,2,3,4], 'value': [2,3,4,5]})
df2 = pd.DataFrame({'id': [1,2,3,4, 5,6,7], 'col': [2,3,4,5,6,7,8]})
print(df2.drop(df2.loc[df2['id'].isin(df1['id'])].index))
output:
id col
4 5 6
5 6 7
6 7 8
I just took a random example DataFrame. This example drops values from df2 (which you can think of as tableA) using df1 (as tableB).
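A third option, if you want to stay close to the relational phrasing of the SQL, is an anti-join via merge with indicator=True; a sketch reusing the dummy tables from the first answer:
import pandas as pd

tableA = pd.DataFrame(data={"id": [1, 2, 3]})
tableB = pd.DataFrame(data={"id": [3, 4, 5]})

# left merge with an indicator column, then keep the rows that exist only in tableA
merged = tableA.merge(tableB[['id']], on='id', how='left', indicator=True)
tableA = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(tableA)
#    id
# 0   1
# 1   2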

Getting the specific index number of every group

In this sample dataframe df:
import pandas as pd
import numpy as np
i = ['dog', 'cat', 'elephant'] * 3
df = pd.DataFrame(np.random.randn(9, 4), index=i,
                  columns=list('ABCD')).sort_index()
What is the quickest way to get the 2nd row of each animal as a dataframe?
You're looking for nth. If an animal has only a single row, no result will be returned.
pandas.core.groupby.GroupBy.nth(n, dropna=None)
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
df.groupby(level=0).nth(1)
A B C D
cat -2.189615 -0.527398 0.786284 1.442453
dog 2.190704 0.607252 0.071074 -1.622508
elephant -2.536345 0.228888 0.716221 0.472490
You can group the data by index and take the element at position 1 (the second row) of each group:
new_df = df.groupby(level=0).apply(lambda x: x.iloc[1, :])
A B C D
cat 0.089608 -1.181394 -0.149988 -1.634295
dog 0.002782 1.620430 0.622397 0.058401
elephant 1.022441 -2.185710 0.854900 0.979411
If you expect any group with a single row in your dataframe, you can build in that condition:
new_df = df.groupby(level=0).apply(lambda x: x.iloc[1, :] if len(x) > 1 else None).dropna()
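If you'd rather avoid apply entirely, a hedged alternative is cumcount, which numbers the rows inside each group so you can filter positionally; groups with fewer than two rows simply drop out:
# cumcount numbers rows within each group starting at 0, so == 1 selects the second row
second_rows = df[df.groupby(level=0).cumcount() == 1]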

Map values from a dataframe

I have a dataframe with the correspondence between two values:
And another list with only one of the variables:
l = ['a','b','c']
I want to make the mapping like:
df[l[0]]
and get 1
df[l[1]]
and get 2
As if it were a dictionary. How can I do that?
Are you looking for something like this?
df.loc[df['key']==l[0], 'value']
returns
0 1
1 1
Another way would be to set the index of the df to key:
df.set_index('key', inplace=True)
df.loc[l[0]].values[0]
Another way is to map by a Series or by a dict, but unique key values are necessary; drop_duplicates helps:
df = pd.DataFrame({'key': list('aabcc'),
                   'value': [1, 1, 2, 3, 3]})
s = df.drop_duplicates('key').set_index('key')['value']
print (s)
key
a 1
b 2
c 3
Name: value, dtype: int64
d = df.drop_duplicates('key').set_index('key')['value'].to_dict()
print (d)
{'c': 3, 'b': 2, 'a': 1}
l = ['a','b','c']
print (s[l[0]])
1
print (d[l[1]])
2
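Once you have the deduplicated Series s or the dict d from above, you can also map the whole list (or an entire column) in one go; a small sketch:
# map every element of l at once using the dict, or reindex the Series by l
print(pd.Series(l).map(d).tolist())   # [1, 2, 3]
print(s.reindex(l).tolist())          # [1, 2, 3]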