Splitting a pandas object

I have a column in a dataframe with values such as 45+2, 98+3, 90+5. I want to split the values so that I only keep 45, 98, 90, i.e. drop the + symbol and everything that follows it. The problem is that pandas holds this data as object dtype, which makes string manipulation awkward. Any suggestions?

Use Series.str.split and select the first element of each resulting list with the .str indexer:
df = pd.DataFrame({'col':['45+2','98+3','90+5']})
df['new'] = df['col'].str.split('+').str[0]
print(df)

    col new
0  45+2  45
1  98+3  98
2  90+5  90
Or use Series.str.extract to pull the first run of digits from each value:
df['new'] = df['col'].str.extract(r'(\d+)')
print(df)

    col new
0  45+2  45
1  98+3  98
2  90+5  90

You can use a lambda function for this.
df1 = pd.DataFrame(data=['45+2','98+3','90+5'], columns=['col'])
print(df1)

    col
0  45+2
1  98+3
2  90+5

Delete the unwanted part of each string in the "col" column:
df1['col'] = df1['col'].map(lambda x: x.split('+')[0])
print(df1)

  col
0  45
1  98
2  90
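Note that both approaches still leave the column as strings (object dtype), since the split pieces are themselves strings. If arithmetic is needed afterwards, one option is converting with pd.to_numeric, assuming every value splits cleanly to digits; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col': ['45+2', '98+3', '90+5']})
# split off the '+' suffix, then convert the string result to integers
df['new'] = pd.to_numeric(df['col'].str.split('+').str[0])
print(df)
print(df['new'].dtype)  # now a numeric dtype, not object
```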

Related

pandas add one column to many others

I want to add the values of one column to several others:
import pandas as pd
df= pd.DataFrame(data={"a":[1,2],"b":[102,4], "c":[4,5]})
# what I intended to do
df[["a","b"]] = df[["a","b"]] + df[["c"]]
Expected result:
df["a"] = df["a"] + df["c"]
df["b"] = df["b"] + df["c"]
You can assume a list of columns is available (["a", "b"]). Is there a non-loop / non-line-by-line way of doing this? There must be...
Use DataFrame.add with axis=0, selecting column c with a single pair of brackets so you get a Series:
df[["a","b"]] = df[["a","b"]].add(df["c"], axis=0)
print(df)

   a    b  c
0  5  106  4
1  7    9  5
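The reason the original df[["a","b"]] + df[["c"]] attempt fails is column alignment: adding two DataFrames matches columns by label, and "c" shares no label with "a" or "b", so every cell becomes NaN. A small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [102, 4], "c": [4, 5]})

# DataFrame + DataFrame aligns on column labels -> all NaN here,
# because {"a", "b"} and {"c"} have no labels in common
misaligned = df[["a", "b"]] + df[["c"]]
print(misaligned.isna().all().all())  # True

# DataFrame.add with a Series and axis=0 broadcasts row-wise instead
aligned = df[["a", "b"]].add(df["c"], axis=0)
print(aligned)
```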

panda expand columns with list into multiple columns

I want to expand / cast a column that contains lists into multiple columns:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
# I want:
pd.DataFrame({'a':[1,2], 'b1':[11,33], 'b2':[22,44]})
Pass the column through .tolist() to the DataFrame constructor, then join the result back to the other column(s):
df = pd.concat([df.drop(columns='b'),
                pd.DataFrame(df['b'].tolist(), index=df.index).add_prefix('b')],
               axis=1)

   a  b0  b1
0  1  11  22
1  2  33  44
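Note that add_prefix('b') numbers the new columns from the constructor's default 0-based RangeIndex, so they come out as b0/b1 rather than the b1/b2 asked for. If 1-based names are wanted, one option (assuming every list has the same length) is to build the names explicitly:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [[11, 22], [33, 44]]})

expanded = pd.DataFrame(df['b'].tolist(), index=df.index)
# shift the default 0-based column labels to 1-based names b1, b2, ...
expanded.columns = [f'b{i + 1}' for i in expanded.columns]

out = pd.concat([df.drop(columns='b'), expanded], axis=1)
print(out)
```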
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df["b1"] = df["b"].apply(lambda cell: cell[0])
df["b2"] = df["b"].apply(lambda cell: cell[1])
df[["a", "b1", "b2"]]
You can use .tolist() on your "b" column to expand it out, then just assign it back to the dataframe and get rid of your original "b" column:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df[["b1", "b2"]] = df["b"].tolist()
df = df.drop("b", axis=1) # alternatively: del df["b"]
print(df)

   a  b1  b2
0  1  11  22
1  2  33  44

Create new nested column within dataframe

I have the following
df1 = pd.DataFrame({'data': [1,2,3]})
df2 = pd.DataFrame({'data': [4,5,6]})
df = pd.concat([df1,df2], keys=['hello','world'], axis=1)
What is the "proper" way of creating a new nested column (say, df['world']['data']*2) within the hello column? I have tried df['hello']['new_col'] = df['world']['data']*2, but this does not seem to work.
Use tuples to select from and set into the MultiIndex:
df[('hello','new_col')] = df[('world','data')]*2
print(df)

  hello world   hello
   data  data new_col
0     1     4       8
1     2     5      10
2     3     6      12
Selecting like df['world']['data'] is not recommended because of chained indexing; see the pandas documentation on returning a view versus a copy.
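As the output shows, the new column is appended at the end, so the two hello columns are no longer adjacent (and the column index is no longer lexsorted). Sorting the columns regroups them; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'data': [1, 2, 3]})
df2 = pd.DataFrame({'data': [4, 5, 6]})
df = pd.concat([df1, df2], keys=['hello', 'world'], axis=1)

df[('hello', 'new_col')] = df[('world', 'data')] * 2
# sort the column MultiIndex so both 'hello' columns sit together
df = df.sort_index(axis=1)
print(df.columns.tolist())
```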

selecting rows with min and max values in pandas dataframe

My df:
df=pd.DataFrame({'A':['Adam','Adam','Adam','Adam'],'B':[24,90,67,12]})
I want to select only the rows holding the min and max value of B for each name in this df.
I can do that using this code:
df_max=df[df['B']==(df.groupby(['A'])['B'].transform(max))]
df_min=df[df['B']==(df.groupby(['A'])['B'].transform(min))]
df=pd.concat([df_max,df_min])
Is there any way to do this in one line? I would prefer not to create two additional df's and concat them at the end.
Thanks
Use GroupBy.agg with DataFrameGroupBy.idxmax and DataFrameGroupBy.idxmin, reshape with DataFrame.melt, and select the rows with DataFrame.loc:
df1 = df.loc[df.groupby('A')['B'].agg(['idxmax','idxmin']).melt()['value']].drop_duplicates()
Or DataFrame.stack:
df2 = df.loc[df.groupby('A')['B'].agg(['idxmax','idxmin']).stack()].drop_duplicates()
print(df2)

      A   B
1  Adam  90
3  Adam  12
A solution using groupby, apply and loc to select only the min and max values of column 'B':
ddf = df.groupby('A').apply(lambda x : x.loc[(x['B'] == x['B'].min()) | (x['B'] == x['B'].max())]).reset_index(drop=True)
The result is:

      A   B
0  Adam  90
1  Adam  12
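If the goal is literally a single expression with no intermediate frames, a boolean mask built with GroupBy.transform is another option: keep a row when its B equals its group's min or max. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Adam', 'Adam', 'Adam', 'Adam'],
                   'B': [24, 90, 67, 12]})

# per-group boolean mask: True where the value is the group's min or max
mask = df.groupby('A')['B'].transform(lambda s: (s == s.min()) | (s == s.max()))
print(df[mask])
```

Unlike the idxmax/idxmin approach, this keeps all tied rows when several rows share the extreme value, which may or may not be what you want.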

Check if a value in one column in one dataframe is within the range between values in two columns in another dataframe

I have two data frames
df1 = pd.DataFrame({'chr':[1,1],'pos':[100, 200]})
df2 = pd.DataFrame({'chr':[1,1,2],'start':[90,110,90],'stop':[110,120,110]})
I want to make a new dataframe with info from both dataframes if:
the value in df1['chr'] is the same as df2['chr']
and
the value in df1['pos'] is between the values in df2['start'] and df2['stop']
From the dataframes above the result should be:

chr  pos  start  stop
  1  100     90   110
Thank you for any help!
You can try this :
df = df1.merge(df2,on='chr',how='left')
df.loc[(df['pos'] >= df['start']) & (df['pos'] <= df['stop'])]
You can use df.merge() followed by series.between():
m=df1.merge(df2,on='chr',how='left')
m.loc[m.pos.between(m.start,m.stop)]
   chr  pos  start  stop
0    1  100     90   110
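Since only the matching rows are kept anyway, the how='left' can be dropped (inner is the merge default), and the filter can be chained with DataFrame.query, which supports chained comparisons. A sketch of the whole pipeline in one expression:

```python
import pandas as pd

df1 = pd.DataFrame({'chr': [1, 1], 'pos': [100, 200]})
df2 = pd.DataFrame({'chr': [1, 1, 2],
                    'start': [90, 110, 90],
                    'stop': [110, 120, 110]})

# inner merge on 'chr', then keep rows where pos falls inside [start, stop]
out = (df1.merge(df2, on='chr')           # how='inner' is the default
          .query('start <= pos <= stop'))
print(out)
```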