Splitting a pandas object

I have a column in a dataframe with values such as 45+2, 98+3, 90+5. I want to split the values so that I only keep 45, 98, 90, i.e. drop the + symbol and everything that follows it. The problem is that pandas holds this data as object dtype, which makes string manipulation awkward. Any suggestions?

Use Series.str.split and select the first element of each resulting list with the .str indexer:
df = pd.DataFrame({'col':['45+2','98+3','90+5']})
df['new'] = df['col'].str.split('+').str[0]
print(df)

    col new
0  45+2  45
1  98+3  98
2  90+5  90
Or use Series.str.extract to pull the first run of digits from each value:
df['new'] = df['col'].str.extract(r'(\d+)')
print(df)

    col new
0  45+2  45
1  98+3  98
2  90+5  90

You can use a lambda function for this.
df1 = pd.DataFrame(data=['45+2','98+3','90+5'], columns=['col'])
print(df1)

    col
0  45+2
1  98+3
2  90+5

Delete the unwanted part of each string in the "col" column:
df1['col'] = df1['col'].map(lambda x: x.split('+')[0])
print(df1)

  col
0  45
1  98
2  90
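Note that both approaches still leave the column as strings (object dtype), since the split pieces are themselves strings. If arithmetic is needed afterwards, one option is converting with pd.to_numeric, assuming every value splits cleanly to digits; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col': ['45+2', '98+3', '90+5']})
# split off the '+' suffix, then convert the string result to integers
df['new'] = pd.to_numeric(df['col'].str.split('+').str[0])
print(df)
print(df['new'].dtype)  # now a numeric dtype, not object
```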

Related

pandas add one column to many others

I want to add the values of one column to several others:
import pandas as pd
df= pd.DataFrame(data={"a":[1,2],"b":[102,4], "c":[4,5]})
# what I intended to do
df[["a","b"]] = df[["a","b"]] + df[["c"]]
Expected result:
df["a"] = df["a"] + df["c"]
df["b"] = df["b"] + df["c"]
You can assume a list of columns is available (["a", "b"]). Is there a non-loop / non-line-by-line way of doing this? There must be...
Use DataFrame.add with axis=0, selecting column c with a single pair of brackets so you get a Series:
df[["a","b"]] = df[["a","b"]].add(df["c"], axis=0)
print(df)

   a    b  c
0  5  106  4
1  7    9  5
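The reason the original df[["a","b"]] + df[["c"]] attempt fails is column alignment: adding two DataFrames matches columns by label, and "c" shares no label with "a" or "b", so every cell becomes NaN. A small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [102, 4], "c": [4, 5]})

# DataFrame + DataFrame aligns on column labels -> all NaN here,
# because {"a", "b"} and {"c"} have no labels in common
misaligned = df[["a", "b"]] + df[["c"]]
print(misaligned.isna().all().all())  # True

# DataFrame.add with a Series and axis=0 broadcasts row-wise instead
aligned = df[["a", "b"]].add(df["c"], axis=0)
print(aligned)
```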

panda expand columns with list into multiple columns

I want to expand / cast a column that contains lists into multiple columns:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
# I want:
pd.DataFrame({'a':[1,2], 'b1':[11,33], 'b2':[22,44]})
Pass the column through .tolist() to the DataFrame constructor, then join the result back to the other column(s):
df = pd.concat([df.drop(columns='b'),
                pd.DataFrame(df['b'].tolist(), index=df.index).add_prefix('b')],
               axis=1)

   a  b0  b1
0  1  11  22
1  2  33  44
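Note that add_prefix('b') numbers the new columns from the constructor's default 0-based RangeIndex, so they come out as b0/b1 rather than the b1/b2 asked for. If 1-based names are wanted, one option (assuming every list has the same length) is to build the names explicitly:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [[11, 22], [33, 44]]})

expanded = pd.DataFrame(df['b'].tolist(), index=df.index)
# shift the default 0-based column labels to 1-based names b1, b2, ...
expanded.columns = [f'b{i + 1}' for i in expanded.columns]

out = pd.concat([df.drop(columns='b'), expanded], axis=1)
print(out)
```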
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df["b1"] = df["b"].apply(lambda cell: cell[0])
df["b2"] = df["b"].apply(lambda cell: cell[1])
df[["a", "b1", "b2"]]
You can use .tolist() on your "b" column to expand it out, then just assign it back to the dataframe and get rid of your original "b" column:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df[["b1", "b2"]] = df["b"].tolist()
df = df.drop("b", axis=1) # alternatively: del df["b"]
print(df)

   a  b1  b2
0  1  11  22
1  2  33  44

Create new nested column within dataframe

I have the following
df1 = pd.DataFrame({'data': [1,2,3]})
df2 = pd.DataFrame({'data': [4,5,6]})
df = pd.concat([df1,df2], keys=['hello','world'], axis=1)
What is the "proper" way of creating a new nested column (say, df['world']['data']*2) within the hello column? I have tried df['hello']['new_col'] = df['world']['data']*2, but this does not seem to work.
Use tuples to select from and set into the MultiIndex:
df[('hello','new_col')] = df[('world','data')]*2
print(df)

  hello world   hello
   data  data new_col
0     1     4       8
1     2     5      10
2     3     6      12
Selecting like df['world']['data'] is not recommended because of chained indexing; see the pandas documentation on returning a view versus a copy.
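As the output shows, the new column is appended at the end, so the two hello columns are no longer adjacent (and the column index is no longer lexsorted). Sorting the columns regroups them; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'data': [1, 2, 3]})
df2 = pd.DataFrame({'data': [4, 5, 6]})
df = pd.concat([df1, df2], keys=['hello', 'world'], axis=1)

df[('hello', 'new_col')] = df[('world', 'data')] * 2
# sort the column MultiIndex so both 'hello' columns sit together
df = df.sort_index(axis=1)
print(df.columns.tolist())
```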

selecting rows with min and max values in pandas dataframe

My df:
df=pd.DataFrame({'A':['Adam','Adam','Adam','Adam'],'B':[24,90,67,12]})
I want to select only the rows holding the min and max value of B for each name in this df.
I can do that using this code:
df_max=df[df['B']==(df.groupby(['A'])['B'].transform(max))]
df_min=df[df['B']==(df.groupby(['A'])['B'].transform(min))]
df=pd.concat([df_max,df_min])
Is there any way to do this in one line? I would prefer not to create two additional df's and concat them at the end.
Thanks
Use GroupBy.agg with DataFrameGroupBy.idxmax and DataFrameGroupBy.idxmin, reshape with DataFrame.melt, and select the rows with DataFrame.loc:
df1 = df.loc[df.groupby('A')['B'].agg(['idxmax','idxmin']).melt()['value']].drop_duplicates()
Or DataFrame.stack:
df2 = df.loc[df.groupby('A')['B'].agg(['idxmax','idxmin']).stack()].drop_duplicates()
print(df2)

      A   B
1  Adam  90
3  Adam  12
A solution using groupby, apply and loc to select only the min and max values of column 'B':
ddf = df.groupby('A').apply(lambda x : x.loc[(x['B'] == x['B'].min()) | (x['B'] == x['B'].max())]).reset_index(drop=True)
The result is:

      A   B
0  Adam  90
1  Adam  12
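If the goal is literally a single expression with no intermediate frames, a boolean mask built with GroupBy.transform is another option: keep a row when its B equals its group's min or max. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Adam', 'Adam', 'Adam', 'Adam'],
                   'B': [24, 90, 67, 12]})

# per-group boolean mask: True where the value is the group's min or max
mask = df.groupby('A')['B'].transform(lambda s: (s == s.min()) | (s == s.max()))
print(df[mask])
```

Unlike the idxmax/idxmin approach, this keeps all tied rows when several rows share the extreme value, which may or may not be what you want.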

Check if a value in one column in one dataframe is within the range between values in two columns in another dataframe

I have two data frames
df1 = pd.DataFrame({'chr':[1,1],'pos':[100, 200]})
df2 = pd.DataFrame({'chr':[1,1,2],'start':[90,110,90],'stop':[110,120,110]})
I want to make a new dataframe with info from both dataframes if:
the value in df1['chr'] is the same as df2['chr']
and
the value in df1['pos'] is between the values in df2['start'] and df2['stop']
From the dataframes above the result should be:

chr  pos  start  stop
  1  100     90   110
Thank you for any help!
You can try this :
df = df1.merge(df2,on='chr',how='left')
df.loc[(df['pos'] >= df['start']) & (df['pos'] <= df['stop'])]
You can use df.merge() followed by series.between():
m=df1.merge(df2,on='chr',how='left')
m.loc[m.pos.between(m.start,m.stop)]
   chr  pos  start  stop
0    1  100     90   110
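Since only the matching rows are kept anyway, the how='left' can be dropped (inner is the merge default), and the filter can be chained with DataFrame.query, which supports chained comparisons. A sketch of the whole pipeline in one expression:

```python
import pandas as pd

df1 = pd.DataFrame({'chr': [1, 1], 'pos': [100, 200]})
df2 = pd.DataFrame({'chr': [1, 1, 2],
                    'start': [90, 110, 90],
                    'stop': [110, 120, 110]})

# inner merge on 'chr', then keep rows where pos falls inside [start, stop]
out = (df1.merge(df2, on='chr')           # how='inner' is the default
          .query('start <= pos <= stop'))
print(out)
```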