I tried to use groupby to group rows with multiple values.
col val
A Cat
A Tiger
B Ball
B Bat
import pandas as pd
df = pd.read_csv("Inputfile.txt", sep='\t')
group = df.groupby(['col'])['val'].sum()
I got
A CatTiger
B BallBat
I want to introduce a delimiter, so that my output looks like
A Cat-Tiger
B Ball-Bat
I tried,
group = df.groupby(['col'])['val'].sum().apply(lambda x: '-'.join(x))
this yielded,
A C-a-t-T-i-g-e-r
B B-a-l-l-B-a-t
What is the issue here?
Thanks,
AP
Alternatively, you can do it this way:
In [48]: df.groupby('col')['val'].agg('-'.join)
Out[48]:
col
A Cat-Tiger
B Ball-Bat
Name: val, dtype: object
UPDATE: answering question from the comment:
In [2]: df
Out[2]:
col val
0 A Cat
1 A Tiger
2 A Panda
3 B Ball
4 B Bat
5 B Mouse
6 B Egg
In [3]: df.groupby('col')['val'].agg('-'.join)
Out[3]:
col
A Cat-Tiger-Panda
B Ball-Bat-Mouse-Egg
Name: val, dtype: object
Finally, to convert the index (or MultiIndex) back into columns:
df1 = df.groupby('col')['val'].agg('-'.join).reset_index(name='new')
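For reference, a minimal sketch (using the sample data from the question) of what reset_index(name='new') produces; the column name 'new' is just illustrative:
import pandas as pd

df = pd.DataFrame({'col': ['A', 'A', 'B', 'B'],
                   'val': ['Cat', 'Tiger', 'Ball', 'Bat']})

# join the strings per group, then turn the group index back into a column
df1 = df.groupby('col')['val'].agg('-'.join).reset_index(name='new')
print(df1)
#   col        new
# 0   A  Cat-Tiger
# 1   B   Ball-Bat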
Just try the join without the .sum(): the issue is that .sum() has already concatenated the strings within each group, so the subsequent join iterates over the characters of 'CatTiger' and 'BallBat'. Apply the join directly to each group instead:
group = df.groupby(['col'])['val'].apply(lambda x: '-'.join(x))
I am wondering if there is a way to use .loc to filter one DataFrame down to the columns whose names also appear in another DataFrame. I know you can usually use it to check whether a value is == to something, but what about the actual column names themselves?
For example:
df1.columns = [0, 1, 2, 3]
df2.columns = [2, 4, 6]
Is there a way to display only the df2 values whose column names also appear in df1, without hardcoding it as something like df2.loc[:, [2]]?
IIUC, you can use df2.columns.intersection to keep only the columns that are also present in df1:
>>> df1
A B D F
0 0.431332 0.663717 0.922112 0.562524
1 0.467159 0.549023 0.139306 0.168273
>>> df2
A B C D E F
0 0.451493 0.916861 0.257252 0.600656 0.354882 0.109236
1 0.676851 0.585368 0.467432 0.594848 0.962177 0.714365
>>> df2[df2.columns.intersection(df1.columns)]
A B D F
0 0.451493 0.916861 0.600656 0.109236
1 0.676851 0.585368 0.594848 0.714365
One solution:
df3 = df2[[c for c in df2.columns if c in df1]]
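If it helps, here is a small self-contained sketch of both approaches; the column names and values are made up for illustration:
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 3]], columns=['A', 'B', 'D', 'F'])
df2 = pd.DataFrame([[10, 11, 12, 13, 14, 15]], columns=['A', 'B', 'C', 'D', 'E', 'F'])

# keep only the df2 columns that also exist in df1
print(df2[df2.columns.intersection(df1.columns)])
print(df2[[c for c in df2.columns if c in df1]])
# both print just the A, B, D and F columns of df2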
I have a pandas DataFrame with 3 columns, A, B, and V.
I want a DataFrame with A as the index and one column, which contains the B corresponding to the maximum V.
I can easily create a df with A and the maximum V using groupby, and then perform some machinations to extract the corresponding B, but that seems like the wrong idea.
I've been playing with combinations of groupby and agg with no joy.
Sample Data:
A,B,V
MHQ,Q,0.5192
MMO,Q,0.4461
MTR,Q,0.5385
MVM,Q,0.351
NCR,Q,0.0704
MHQ,E,0.5435
MMO,E,0.4533
MTR,E,-0.6716
MVM,E,0.3684
NCR,E,-0.0278
MHQ,U,0.2712
MMO,U,0.1923
MTR,U,0.3833
MVM,U,0.1355
NCR,U,0.1058
A = [1,1,1,2,2,2,3,3,3,4,4,4]
B = [1,2,3,4,5,6,7,8,9,10,11,12]
V = [21,22,23,24,25,26,27,28,29,30,31,32]
df = pd.DataFrame({'A': A, 'B': B, 'V': V})
res = df.groupby('A').apply(
    lambda x: x[x['V'] == x['V'].max()]).set_index('A')['B'].to_frame()
res
B
A
1 3
2 6
3 9
4 12
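A possibly simpler sketch uses idxmax on the same df built above (this assumes the maximum V is unique within each group; with ties, only the first maximal row is kept):
# index labels of the max-V row in each A group, then look up B and index by A
res = df.loc[df.groupby('A')['V'].idxmax(), ['A', 'B']].set_index('A')
# gives B = 3, 6, 9, 12 for A = 1, 2, 3, 4, matching the result above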
I have a PySpark DataFrame. How can I select the values of one column where another column has a specific value? Suppose I have n columns; for 2 columns I have
A. B.
a b
a c
d f
I want all of column B where column A is 'a', so:
A. B.
a b
a c
It's just a simple filter:
df2 = df.filter("A = 'a'")
which comes in many flavours, such as
df2 = df.filter(df.A == 'a')
df2 = df.filter(df['A'] == 'a')
or
import pyspark.sql.functions as F
df2 = df.filter(F.col('A') == 'a')
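For completeness, a minimal self-contained sketch with the data from the question (the SparkSession setup is assumed; adjust to your environment):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a', 'b'), ('a', 'c'), ('d', 'f')], ['A', 'B'])

# keep only the rows where column A equals 'a'
df.filter(F.col('A') == 'a').show()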
Suppose I have the following data that I want to conduct groupby on:
Key Prod Val
A a 1
A b 0
B a 1
B b 1
B d 1
C a 0
C b 0
I want to group the table so I have a single row for each key (A, B and C), and a list containing the Prod values corresponding to that key. But an element should only be in the list if there's an indicator of 1 in Val for that row. If Val is 0 for the entire subset of a key, then the key should just get a None value. Here's the result I'm looking for using the same example above:
Key List
A [a]
B [a, b, d]
C None
What's the most efficient way to perform this in pandas?
Let's try:
df.query('Val == 1').groupby('Key')['Prod'].agg(lambda x: list(x)).reindex(df.Key.unique())
Output:
Key
A [a]
B [a, b, d]
C NaN
dtype: object
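If you need the literal None from the question rather than NaN (an assumption about the desired output), one sketch building on the same idea, assuming df holds the question's data with columns Key, Prod and Val:
out = (df.query('Val == 1')
         .groupby('Key')['Prod'].agg(list)
         .reindex(df['Key'].unique())
         .rename('List')
         .reset_index())
# reindex leaves NaN for keys with no 1s; swap those for None
out['List'] = [x if isinstance(x, list) else None for x in out['List']]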
I think just making a new dataframe would be easiest:
df2 = pd.DataFrame(columns=['list'], index=df1['Key'].unique())
# start every key with an empty list (.at can store a list in a single cell)
for i, row in df2.iterrows():
    df2.at[i, 'list'] = []
# append each Prod whose Val is 1 to its key's list
for i, row in df1.iterrows():
    key = df1.loc[i, 'Key']
    if df1.loc[i, 'Val'] == 1:
        df2.at[key, 'list'].append(df1.loc[i, 'Prod'])
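To make that runnable, df1 can be built from the sample data in the question (column names as given there):
import pandas as pd

df1 = pd.DataFrame({'Key':  ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
                    'Prod': ['a', 'b', 'a', 'b', 'd', 'a', 'b'],
                    'Val':  [1, 0, 1, 1, 1, 0, 0]})
After running the loops above, df2.loc['A', 'list'] is ['a'], df2.loc['B', 'list'] is ['a', 'b', 'd'], and df2.loc['C', 'list'] is [] (note that keys with no 1s end up with an empty list here rather than None).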