How to add a new column (not replace) - pandas

import pandas as pd
test = [
    [14, 12, 1, 13, 15],
    [11, 21, 1, 19, 32],
    [48, 16, 1, 16, 12],
    [22, 24, 1, 18, 41],
]
df = pd.DataFrame(test)
x = [1, 2, 3, 4]
df['new'] = pd.DataFrame(x)
In this example, df gets a new column 'new'.
What I want instead is to create a new DataFrame (df1) that includes the column 'new' (six columns), while df stays unchanged (still five columns).
How do I do that?

You can create the new DataFrame with .assign:
import pandas as pd
df = pd.DataFrame(test)
df1 = df.assign(new=x)
print(df)
    0   1  2   3   4
0  14  12  1  13  15
1  11  21  1  19  32
2  48  16  1  16  12
3  22  24  1  18  41
print(df1)
    0   1  2   3   4  new
0  14  12  1  13  15    1
1  11  21  1  19  32    2
2  48  16  1  16  12    3
3  22  24  1  18  41    4
.assign returns a new object, so you can modify it without affecting the original. The other alternative would be
df1 = df.copy()  # new object; modifications do not affect `df`
df1['new'] = x
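If the new column is computed from existing ones, .assign also accepts callables that receive the DataFrame; later keywords may even refer to columns created by earlier ones. A minimal sketch reusing the df and x above (the 'doubled' column is purely illustrative):
df1 = df.assign(new=x,
                doubled=lambda d: d['new'] * 2)  # 'doubled' sees the freshly created 'new'
print(df)   # still five columns
print(df1)  # seven columns: the original five plus 'new' and 'doubled'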

Another alternative: df.insert adds a column in place at a given position; here 'e' is the new column name and np.random.randint fills it with random integers. Note the array length must match the number of rows (4 here, not 5):
import numpy as np
df.insert(len(df.columns), 'e', np.random.randint(0, 5, len(df)))
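Since insert modifies the frame in place, which is the opposite of what the question asks for, copy first if the original must stay intact (a small sketch reusing df from above):
df1 = df.copy()  # modifications below do not touch df
df1.insert(len(df1.columns), 'e', np.random.randint(0, 5, len(df1)))
print(df.columns)   # unchanged
print(df1.columns)  # now includes 'e'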

Related

pandas dataframe and how to find an element using row and column

Is there a way to find an element in a pandas DataFrame by using its row and column positions? For example, if we have a list, L = [0,3,2,3,2,4,30,7], we can use L[2] and get the value 2 in return.
Use .iloc
df = pd.DataFrame({'L':[0,3,2,3,2,4,30,7], 'M':[10,23,22,73,72,14,130,17]})
    L    M
0   0   10
1   3   23
2   2   22
3   3   73
4   2   72
5   4   14
6  30  130
7   7   17
df.iloc[2]['L']    # chained lookup: positional row, then column label
df.iloc[2:3, 0:1]  # slice form; note this returns a 1x1 DataFrame, not a bare scalar
df.iat[2, 0]       # fastest single-element access
2
df.iloc[6]['M']
df.iloc[6:7, 1:2]
df.iat[6, 1]
130
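.iloc and .iat are purely positional; their label-based counterparts are .loc and .at. Here they give the same results only because the default RangeIndex makes row labels equal to positions (a quick sketch):
df.loc[2, 'L']  # row label 2, column label 'L'
df.at[2, 'L']   # fast scalar access by label
Both return 2.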

Create a new pandas DataFrame Column with a groupby

I have a dataframe and I'd like to group by a column value and then do a calculation to create a new column. Below is the set up data:
import pandas as pd
df = pd.DataFrame({
    'Red': [1,2,3,4,5,6,7,8,9,10],
    'Groups': ['A','B','A','A','B','C','B','C','B','C'],
    'Blue': [10,20,30,40,50,60,70,80,90,100]
})
df.groupby('Groups').apply(print)
What I want to do is create a 'TOTAL' column in the original dataframe. If a row is the first record of its group, 'TOTAL' gets a zero; otherwise it gets ['Blue'] at the current index minus ['Red'] at index-1.
I tried to do this in a function below but it does not work.
def funct(group):
    count = 0
    lst = []
    for info in group:
        if count == 0:
            lst.append(0)
            count += 1
        else:
            num = group.iloc[count]['Blue'] - group.iloc[count-1]['Red']
            lst.append(num)
            count += 1
    group['Total'] = lst
    return group
df = df.join(df.groupby('Groups').apply(funct))
The code works for the first group but then errors out.
The desired outcome is:
df_final = pd.DataFrame({
    'Red': [1,2,3,4,5,6,7,8,9,10],
    'Groups': ['A','B','A','A','B','C','B','C','B','C'],
    'Blue': [10,20,30,40,50,60,70,80,90,100],
    'Total': [0,0,29,37,48,0,65,74,83,92]
})
df_final
df_final.groupby('Groups').apply(print)
Thank you for the help!
For each group, calculate the difference between Blue and shifted Red (Red at previous index):
df['Total'] = (df.groupby('Groups')
                 .apply(lambda g: g.Blue - g.Red.shift().fillna(g.Blue))
                 .reset_index(level=0, drop=True))
df
   Red Groups  Blue  Total
0    1      A    10    0.0
1    2      B    20    0.0
2    3      A    30   29.0
3    4      A    40   37.0
4    5      B    50   48.0
5    6      C    60    0.0
6    7      B    70   65.0
7    8      C    80   74.0
8    9      B    90   83.0
9   10      C   100   92.0
Or, as @anky commented, you can avoid apply by shifting the Red column within each group first:
df['Total'] = (df.Blue - df.Red.groupby(df.Groups).shift()).fillna(0, downcast='infer')
df
   Red Groups  Blue  Total
0    1      A    10      0
1    2      B    20      0
2    3      A    30     29
3    4      A    40     37
4    5      B    50     48
5    6      C    60      0
6    7      B    70     65
7    8      C    80     74
8    9      B    90     83
9   10      C   100     92
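To see why the shift trick works, it helps to inspect the intermediate result: groupby(...).shift() shifts within each group, so the first row of every group becomes NaN, and fillna(0) turns exactly those rows into the required zeros:
print(df.Red.groupby(df.Groups).shift())
0    NaN
1    NaN
2    1.0
3    3.0
4    2.0
5    NaN
6    5.0
7    6.0
8    7.0
9    8.0
Name: Red, dtype: float64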

Convert Datatype to Integer and drop rows with non integer type values

I have a dataframe column whose curren
...
Use pd.to_numeric with the errors='coerce' parameter, then drop the rows that failed to parse:
df['Sno'] = pd.to_numeric(df['Sno'], errors='coerce')
df = df[df['Sno'].notna()].astype({'Sno': int})
Output:
>>> df
  Sno  test
0  12     5
1  14     5
2  15     7
3  16     8
4  17     9
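A self-contained version of the same idea; the input values below are made up, since the original question body is cut off:
import pandas as pd

df = pd.DataFrame({'Sno': ['12', '14', 'abc', '15', '16', '17'],
                   'test': [5, 5, 6, 7, 8, 9]})
df['Sno'] = pd.to_numeric(df['Sno'], errors='coerce')  # non-numeric 'abc' becomes NaN
df = df[df['Sno'].notna()].astype({'Sno': int}).reset_index(drop=True)
print(df)  # reproduces the output above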

Update dataframe column with values from another dataframe by index

I have two DataFrames.
One of them contains: item id, name, quantity and price.
Another: item id, name and quantity.
The problem is to update name and quantity in the first DataFrame using the second DataFrame, matching on item id. Also, the first DataFrame does not contain every item id, so only the rows of the second DataFrame whose id appears in the first should be taken into account.
DataFrame 1
In [1]: df1
Out[1]:
   id name  quantity  price
0  10    X        10     15
1  11    Y        30     20
2  12    Z        20     15
3  13    X        15     10
4  14    X        12     15
DataFrame 2
In [2]: df2
Out[2]:
   id name  quantity
0  10    A         3
1  12    B         3
2  13    C         6
I've tried to use apply to iterate through the rows and modify column values by condition, like this:
def modify(row):
    row['name'] = df2[df2['id'] == row['id']]['name'].get_values()[0]
    row['quantity'] = df2[df2['id'] == row['id']]['quantity'].get_values()[0]
df1.apply(modify, axis=1)
But it has no effect; DataFrame 1 is still the same.
I am expecting something like this first:
In [1]: df1
Out[1]:
id name quantity price
0 10 A 3 15
1 11 Y 30 20
2 12 B 3 15
3 13 C 6 10
4 14 X 12 15
After that I want to drop the rows, which were not modified to get:
In [1]: df1
Out[1]:
id name quantity price
0 10 A 3 15
1 12 B 3 15
2 13 C 6 10
Using update:
df1 = df1.set_index('id')
df1.update(df2.set_index('id'))
df1 = df1.reset_index()
Out[740]:
   id name  quantity  price
0  10    A       3.0     15
1  11    Y      30.0     20
2  12    B       3.0     15
3  13    C       6.0     10
4  14    X      12.0     15
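Note that update upcast quantity to float (alignment introduces NaN internally). If the integer dtype matters, cast it back afterwards:
df1['quantity'] = df1['quantity'].astype(int)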
Alternatively, with merge:
new_df = df1.merge(df2, on='id')
new_df = new_df.drop(['name_x', 'quantity_x'], axis=1)
new_df.columns = ['id', 'price', 'name', 'quantity']
Output:
   id  price name  quantity
0  10     15    A         3
1  12     15    B         3
2  13     10    C         6
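A slightly tidier variant of the merge approach, starting again from the original df1 and df2: drop the stale columns from df1 first, so no suffixes or renaming are needed, and the inner join drops the unmatched ids automatically:
new_df = df1.drop(columns=['name', 'quantity']).merge(df2, on='id')
print(new_df)  # same result as above: ids 10, 12, 13 with updated name and quantity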

How can I select Pandas.DataFrame by elements' length

How can I select from a pandas DataFrame by the string length of its elements?
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 4).astype(str))
df.apply(lambda x: len(x[1]))
0 19
1 19
2 18
3 20
dtype: int64
Here we see there are three different lengths. I've been searching for an operation like df[len(df) == 19]; is that possible?
You could take advantage of the vectorized string operations available under .str, instead of using apply:
>>> df.applymap(len)
    0   1   2   3
0  19  18  18  21
1  20  18  19  18
2  18  19  20  18
3  19  19  18  18
>>> df[1].str.len()
0 18
1 18
2 19
3 19
Name: 1, dtype: int64
>>> df.loc[df[1].str.len() == 19]
0 1 2 3
2 0.2630843312551179 -0.4811731811687397 -0.04493981407412525 -0.378866831599991
3 -0.5116348949042413 0.07649572869385729 0.8899251802216082 0.5802762385702874
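If the condition should hold across every column rather than a single one, the per-element lengths can be combined row-wise; a minimal sketch in the spirit of the applymap output above:
mask = df.applymap(len).eq(19).all(axis=1)  # rows where all four strings have length 19
df[mask]
Swap .all for .any to keep rows where at least one element has that length.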
Here is a simple example to show what is going on:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 4).astype(str))
lengths = df.apply(lambda x: len(x[0]))  # length of the first element in each column
mask = lengths < 15
print(df)
print(lengths)
print(mask)
print(df[mask])
Results in:
0 1 2 3
0 0.315649003654 -1.20005871043 -0.0973557747322 -0.0727740019505
1 -0.270800223158 -2.96509489589 0.822922470677 1.56021584947
2 -2.36245475786 0.0763821870378 1.0540009757 -0.452842084388
3 -1.03486927366 -0.269946751202 0.0611709385483 0.0114964425747
0 14
1 14
2 16
3 16
dtype: int64
0 True
1 True
2 False
3 False
dtype: bool
0 1 2 3
0 0.315649003654 -1.20005871043 -0.0973557747322 -0.0727740019505
1 -0.270800223158 -2.96509489589 0.822922470677 1.56021584947