Sort data in Pandas dataframe alphabetically - pandas

I have a dataframe where I need to sort the contents of one column (comma separated) alphabetically:
ID Data
1 Mo,Ab,ZZz
2 Ab,Ma,Bt
3 Xe,Aa
4 Xe,Re,Fi,Ab
Output:
ID Data
1 Ab,Mo,ZZz
2 Ab,Bt,Ma
3 Aa,Xe
4 Ab,Fi,Re,Xe
I have tried:
df.sort_values(by='Data')
But this does not work

You can split, sort, and then join back:
df['Data'] = df['Data'].apply(lambda x: ','.join(sorted(x.split(','))))
Or use a list comprehension as an alternative:
df['Data'] = [','.join(sorted(x.split(','))) for x in df['Data']]
print (df)
ID Data
0 1 Ab,Mo,ZZz
1 2 Ab,Bt,Ma
2 3 Aa,Xe
3 4 Ab,Fi,Re,Xe
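As a self-contained sketch of the split/sort/join approach above, the snippet below rebuilds the sample frame and wraps the logic in a small helper. The helper name sort_items and its case_insensitive flag are illustrative assumptions, not part of pandas. The flag is there because Python's default string ordering places uppercase letters before lowercase ones, which may not be what "alphabetically" means for mixed-case data.

```python
import pandas as pd

def sort_items(text, sep=",", case_insensitive=False):
    """Sort the sep-separated items of one string (hypothetical helper)."""
    # str.casefold makes the comparison ignore case; None keeps the default order.
    key = str.casefold if case_insensitive else None
    return sep.join(sorted(text.split(sep), key=key))

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Data": ["Mo,Ab,ZZz", "Ab,Ma,Bt", "Xe,Aa", "Xe,Re,Fi,Ab"],
})

# Apply the helper to every cell of the column.
df["Data"] = df["Data"].apply(sort_items)
print(df)
```

For the sample data both orderings agree; they only differ once a row mixes upper- and lowercase initials.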

IIUC, you can use str.get_dummies (note that this also drops duplicate items within a row):
s=df.Data.str.get_dummies(',')
df['n']=s.dot(s.columns+',').str[:-1]
df
Out[216]:
ID Data n
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe

IIUC you can use a list comprehension:
[','.join(sorted(i.split(','))) for i in df['Data']]
#['Ab,Mo,ZZz', 'Ab,Bt,Ma', 'Aa,Xe', 'Ab,Fi,Re,Xe']

Using explode and sort_values:
df["Sorted_Data"] = (
df["Data"].str.split(",").explode().sort_values().groupby(level=0).agg(','.join)
)
print(df)
ID Data Sorted_Data
0 1 Mo,Ab,ZZz Ab,Mo,ZZz
1 2 Ab,Ma,Bt Ab,Bt,Ma
2 3 Xe,Aa Aa,Xe
3 4 Xe,Re,Fi,Ab Ab,Fi,Re,Xe

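Using row iteration (write back with df.at, because assigning to the row returned by iterrows is not guaranteed to modify the DataFrame):
for index, row in df.iterrows():
    df.at[index, 'Data'] = ','.join(sorted(row['Data'].split(',')))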
In [29]: df
Out[29]:
Data
0 Ab,Mo,ZZz
1 Ab,Bt,Ma
2 Aa,Xe
3 Ab,Fi,Re,Xe

Related

Multimatch join in pandas

I am looking for joining two data frame on one column and if there is a multi match then append the results to another column.
NB. I am using a different example, as yours is not reproducible.
You can convert to uppercase with str.upper, then split, explode, and map the values, and finally use groupby.agg to join them back into a string:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = (df1['name']
.str.upper().str.split(',')
.explode()
.map(mapper)
.groupby(level=0).agg(','.join)
)
Or, with a list comprehension:
mapper = df2.set_index('name')['ID'].astype(str)
# uppercase each name first so that 'b' and 'a' match the keys in mapper
df1['ID'] = [','.join([mapper[x] for x in s.upper().split(',') if x in mapper])
             for s in df1['name']]
output:
name ID
0 A 1
1 b 2
2 A,B 1,2
3 C,a 3,1
4 D 4
Used input:
# df1
name
0 A
1 b
2 A,B
3 C,a
4 D
# df2
name ID
0 A 1
1 B 2
2 C 3
3 D 4
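For reference, here is a runnable sketch of the explode/map approach on the input shown above. The extra dropna() step is an assumption added for robustness: without it, a name that has no match in df2 becomes NaN and ','.join raises a TypeError.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["A", "b", "A,B", "C,a", "D"]})
df2 = pd.DataFrame({"name": ["A", "B", "C", "D"], "ID": [1, 2, 3, 4]})

# Lookup table: name -> ID, as strings so they can be joined later.
mapper = df2.set_index("name")["ID"].astype(str)

df1["ID"] = (
    df1["name"]
    .str.upper()           # make the match case-insensitive
    .str.split(",")        # one list of names per row
    .explode()             # one name per row, original index repeated
    .map(mapper)           # name -> ID (NaN when there is no match)
    .dropna()              # assumption: silently skip unmatched names
    .groupby(level=0)      # regroup by the original row
    .agg(",".join)
)
print(df1)
```

Rows where no name matches at all end up as NaN in the ID column, because they drop out before the groupby.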

I want to remove specific rows and restart the values from 1

I have a dataframe that looks like this:
Time Value
1 5
2 3
3 3
4 2
5 1
I want to remove the first two rows and then restart time from 1. The dataframe should then look like:
Time Value
1 3
2 2
3 1
I attach the code:
file0 = pd.read_excel(r'C:......xlsx')
df = file0.loc[(file0['Time']>2) & (file0['Time']<11)]
df = df.reset_index()
Now what I get is:
index Time Value
0 3 3
1 4 2
2 5 1
Thank you!
You can use the .loc[] accessor and the reset_index() method:
df = df.loc[2:].reset_index(drop=True)
Finally, use a list comprehension to renumber Time:
df['Time'] = [x for x in range(1, len(df) + 1)]
Now if you print df, you will get your desired output:
Time Value
0 1 3
1 2 2
2 3 1
You can use df.loc to extract a subset of the dataframe, reset the index, and then overwrite the Time column from the new index:
df = df.loc[2:].reset_index(drop=True)
df['Time'] = df.index + 1
print(df)
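A self-contained sketch of the same idea, using the sample data from the question. It uses iloc instead of loc as a deliberate variation: iloc slices by position, so it still drops the first two rows when the index is not the default 0..n-1. The variable name trimmed is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Time": [1, 2, 3, 4, 5], "Value": [5, 3, 3, 2, 1]})

# Drop the first two rows by position and discard the old index.
trimmed = df.iloc[2:].reset_index(drop=True)

# Restart Time from 1.
trimmed["Time"] = range(1, len(trimmed) + 1)
print(trimmed)
```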
There are two ways to do that.
First:
df[2:].assign(Time=df.Time.values[:-2])
which returns your desired output:
Time Value
1 3
2 2
3 1
Second:
df = df.set_index('Time')
df['Value'] = df['Value'].shift(-2)
df.dropna()
This returns your output too, but converts the numbers to float64:
Time Value
1 3.0
2 2.0
3 1.0

most efficient way to set dataframe column indexing to other columns

I have a large Dataframe. One of my columns contains the name of others. I want to eval this colum and set in each row the value of the referenced column:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| B |
|2|5|3| A |
|3|5|9| C |
Desired output:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| 3 |
|2|5|3| 2 |
|3|5|9| 9 |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]
In [366]: df
Out[366]:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0 on, lookup is deprecated; you can use a list comprehension instead:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
Since lookup is going to be deprecated, try the NumPy approach with get_indexer:
df['new'] = df.values[df.index,df.columns.get_indexer(df.Column)]
df
Out[75]:
A B C Column new
0 1 3 4 B 3
1 2 5 3 A 2
2 3 5 9 C 9
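The NumPy indexing answers above pass df.index as the row positions, which only works when the index is the default 0..n-1. A minimal sketch that avoids that assumption is shown below; the names value_cols, row_pos and col_pos are illustrative. It also restricts the array to the numeric columns so the result keeps an integer dtype instead of object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [3, 5, 5],
    "C": [4, 3, 9],
    "Column": ["B", "A", "C"],
})

value_cols = ["A", "B", "C"]                 # columns that may be referenced
values = df[value_cols].to_numpy()           # 2-D array of the numeric data

# Position of each referenced column; get_indexer returns -1 for unknown names.
col_pos = pd.Index(value_cols).get_indexer(df["Column"])
row_pos = np.arange(len(df))                 # positional row numbers

# Fancy indexing picks one (row, column) cell per row.
df["new"] = values[row_pos, col_pos]
print(df)
```

Because -1 is a valid NumPy index (the last column), it may be worth checking (col_pos >= 0).all() before indexing if the Column values are not guaranteed to be valid.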

Dataframe merge by row

I have two pd df and I want to merge df2 to each row of df1 based on the ID in df1. The final df should look like in df3.
How do I do it? I tried merge, join and concat and didn't get what I wanted.
df1
ID Division
1 10
2 2
3 4
... ...
df2
Product type Level
1 0
1 1
1 2
2 0
2 1
2 2
2 3
df3
ID Product type Level Division
1 1 0 10
1 1 1 10
1 1 2 10
1 2 0 10
1 2 1 10
1 2 2 10
1 2 3 10
and repeat for ID 2 and ......
It looks like you want the Cartesian product of the two dataframes. The following approach should achieve that:
(df1.assign(key=1)
.merge(df2.assign(key=1))
.drop('key', axis=1))
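If your pandas version is 1.2 or newer, merge also accepts how="cross", which should give the same Cartesian product without the temporary key column. The sketch below assumes the sample data from the question (the three dots in df1 are left out) and reorders the columns to match df3.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Division": [10, 2, 4]})
df2 = pd.DataFrame({
    "Product type": [1, 1, 1, 2, 2, 2, 2],
    "Level": [0, 1, 2, 0, 1, 2, 3],
})

# Every row of df1 is paired with every row of df2.
df3 = df1.merge(df2, how="cross")

# Reorder the columns to match the desired layout.
df3 = df3[["ID", "Product type", "Level", "Division"]]
print(df3.head(7))
```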
Consider this option:
set the index in both DataFrames to 0,
perform an outer join (on the indices, so the result is just the Cartesian product),
reset the index.
The code to do it is:
df1.index = [0] * df1.index.size
df2.index = [0] * df2.index.size
result = df1.join(df2, how='outer').reset_index(drop=True)

pandas dataframe filter by sequence of values in a specific column

I have a dataframe
A B C
1 2 3
2 3 4
3 8 7
I want to take only the rows where there is a sequence of 3,4 in column C (in this scenario, the first two rows).
What will be the best way to do so?
You can use rolling for a general solution that works with any pattern:
pat = np.asarray([3,4])
N = len(pat)
mask= (df['C'].rolling(window=N , min_periods=N)
.apply(lambda x: (x==pat).all(), raw=True)
.mask(lambda x: x == 0)
.bfill(limit=N-1)
.fillna(0)
.astype(bool))
df = df[mask]
print (df)
A B C
0 1 2 3
1 2 3 4
Explanation:
use rolling.apply to test each window against the pattern
replace 0s with NaNs using mask
use bfill with limit to propagate each match back to the preceding N-1 rows of its window
replace the remaining NaNs with 0 using fillna
finally, cast to bool with astype
Use shift on column C:
In [1085]: s = df['C'].eq(3) & df['C'].shift(-1).eq(4)
In [1086]: df[s | s.shift(fill_value=False)]
Out[1086]:
A B C
0 1 2 3
1 2 3 4
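For comparison, the same filtering can be sketched with a plain loop over window start positions, which may be easier to read than the rolling chain when the pattern is short. The function name rows_matching_sequence is hypothetical; it keeps every row that belongs to at least one complete occurrence of the pattern in the given column.

```python
import numpy as np
import pandas as pd

def rows_matching_sequence(df, column, pattern):
    """Keep rows that are part of a full occurrence of pattern (hypothetical helper)."""
    values = df[column].to_numpy()
    pattern = np.asarray(pattern)
    n = len(pattern)
    keep = np.zeros(len(values), dtype=bool)
    # Slide a window of length n over the column and mark full matches.
    for start in range(len(values) - n + 1):
        if (values[start:start + n] == pattern).all():
            keep[start:start + n] = True
    return df[keep]

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})
result = rows_matching_sequence(df, "C", [3, 4])
print(result)
```

The loop runs in Python, so for very long columns the rolling or shift versions above are likely to be faster.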