I have an existing data frame with known columns, and I want to insert rows, filling in the data for each column one at a time.
I first created an empty data frame with a few columns:
df = pd.DataFrame(columns=['col1', 'col2', 'col3'])
df.to_csv('test.csv', sep='|', index=False)
test.csv
col1|col2|col3
Then I added a row, inserting data for each column one at a time:
list = ['col1', 'col2', 'col3']
turn = 2
df = pd.read_csv('test.csv', sep='|')
while turn:
    for each in list:
        df[each] = turn
    turn -= 1
Expected output in test.csv:
col1|col2|col3
2 |2 |2
1 |1 |1
But I am unable to get the expected output; instead, I'm getting this:
col1|col2|col3
Kindly let me know where I'm making a mistake; I would really appreciate any sort of help.
You can use df.append() to append a row:
import pandas as pd
df = pd.DataFrame(columns=['col1', 'col2', 'col3'])
turn = 2
while turn:
    new_row = {'col1': turn, 'col2': turn, 'col3': turn}
    df = df.append(new_row, ignore_index=True)
    turn -= 1
Out[11]:
col1 col2 col3
0 2 2 2
1 1 1 1
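Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions you'd use pd.concat instead. A minimal sketch of the same loop:
import pandas as pd
df = pd.DataFrame(columns=['col1', 'col2', 'col3'])
rows = []
turn = 2
while turn:
    # collect plain dicts first; concatenating once at the end is much
    # cheaper than growing the frame row by row
    rows.append({'col1': turn, 'col2': turn, 'col3': turn})
    turn -= 1
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)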
To modify your while loop, do:
turn = 2
while turn:
    for each in list:
        df.loc[len(df.dropna()), each] = turn
    turn -= 1
>>> df
col1 col2 col3
0 2 2 2
1 1 1 1
The reason your version doesn't work is that you're assigning to the whole column, not to a specific row's value.
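To see the difference, here's a minimal demonstration (with made-up values):
import pandas as pd
df = pd.DataFrame({'col1': [10, 20], 'col2': [30, 40]})
df['col1'] = 2           # broadcasts: every row of col1 becomes 2
df.loc[0, 'col2'] = 99   # targets one cell: only row 0 of col2 changes
print(df)
#    col1  col2
# 0     2    99
# 1     2    40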
Related
This is something I always struggle with, and it's very much a beginner question. Essentially, I want to locate and apply changes to a column based on a filter on another column.
Example input:
import pandas as pd
cols = ['col1', 'col2']
data = [
    [1, 1],
    [1, 1],
    [2, 1],
    [1, 1],
]
df = pd.DataFrame(data=data, columns=cols)
# NOTE: In practice, I will be applying a more complex function
df['col2'] = df.loc[df['col1'] == 1, 'col2'].apply(lambda x: x+1)
Returned output:
col1 col2
0 1 2.0
1 1 2.0
2 2 NaN
3 1 2.0
Expected output:
col1 col2
0 1 2
1 1 2
2 2 1
3 1 2
What's happening:
Records that do not meet the filtering condition are being set to null because of my apply / lambda routine
What I request:
The correct locate/filter-and-apply approach. I can achieve the expected frame using update; however, I want to use loc and apply.
By doing df['col2'] = ..., you're setting all the values of col2. But, since you're only calling apply on some of the values, the values that aren't included get set to NaN. To fix that, save your mask and reuse it:
mask = df['col1'] == 1
df.loc[mask, 'col2'] = df.loc[mask, 'col2'].apply(lambda x: x+1)
Output:
>>> df
col1 col2
0 1 2
1 1 2
2 2 1
3 1 2
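If the transformation is vectorizable, you can also drop apply entirely. As a sketch, Series.mask rewrites only the values where the condition holds and passes the rest through unchanged:
mask = df['col1'] == 1
# add 1 only where mask is True; other values are left as-is
df['col2'] = df['col2'].mask(mask, df['col2'] + 1)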
I have a dataframe df as:
Col1 Col2
A -5
A 3
B -2
B 15
I need to get the following:
Col1 Col2
A -5
B 15
The decision for each group in Col1 is made by selecting the value with the maximum absolute value from Col2. I am not sure how to proceed with this.
Use GroupBy.idxmax on the absolute values to get the index of each group's maximum, then select those rows with DataFrame.loc:
df = df.loc[df['Col2'].abs().groupby(df['Col1']).idxmax()]
# alternative: reassign the column with its absolute values first
df = df.loc[df.assign(Col2=df['Col2'].abs()).groupby('Col1')['Col2'].idxmax()]
print(df)
Col1 Col2
0 A -5
3 B 15
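An idxmax-free alternative, as a sketch: order the rows by absolute value and keep the last (largest) row per group:
df = (df.reindex(df['Col2'].abs().sort_values().index)  # order rows by |Col2|
        .drop_duplicates('Col1', keep='last')           # keep the largest per group
        .sort_index())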
I'm trying to substring a column based on the length of another column, but the result is NaN. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['abcdefghi','xyz'], ['abcdefghi', 'z']], columns=['col1', 'col2'])
df.col1.str[:df.col2.str.len()]
0 NaN
1 NaN
Name: col1, dtype: float64
Here is what I am expecting:
0 'abc'
1 'a'
I don't think string indexing can take a Series. I would use a list comprehension:
df['extract'] = [r.col1[:len(r.col2)] for _,r in df.iterrows()]
Or
df['extract'] = [s1[:len(s2)] for s1,s2 in zip(df.col1, df.col2)]
Output:
col1 col2 extract
0 abcdefghi xyz abc
1 abcdefghi z a
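Of the two, the zip version is usually noticeably faster: iterrows constructs a Series object for every row, while zip just walks the two columns directly.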
Using numpy and converting the array to a pd.Series:
import numpy as np
def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])
df["new_str"] = pd.Series(
    [slicer(0, i)(c) for i, c in zip(df["col2"].apply(len), df["col1"].values)]
)
print(df)
col1 col2 new_str
0 abcdefghi xyz abc
1 abcdefghi z a
Here is a solution using lambda:
df['new'] = df.apply(lambda row: row['col1'][0:len(row['col2'])], axis=1)
Result:
col1 col2 new
0 abcdefghi xyz abc
1 abcdefghi z a
I have 3 data frames:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
From the multiplication, using pandas and numpy, I want the following output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
The conditions are:
The value of the new column will be:
# not actual code, just the formula
df1["w_ave"][1] = df3["w_ave"]["v"] + df1["a"][1]*df2["a"]["q"] + df1["b"][1]*df2["b"]["q"] + df1["c"][1]*df2["c"]["q"]
so output["w_ave"][1] = 2 + (1*1) + (5*2) + (1*3), where
df3["w_ave"]["v"] = 2
df1["a"][1] = 1, df2["a"]["q"] = 1;
df1["b"][1] = 5, df2["b"]["q"] = 2;
df1["c"][1] = 1, df2["c"]["q"] = 3;
Which means:
- a new column will be added to df1, named after the corresponding column of df3.
- for each row of df1, the values of a, b, c will be multiplied by the same-named values from df2 and summed together with the corresponding value from df3.
- only the columns of df1 whose names match columns of df2 are multiplied; non-matching columns, like df1["k"], are not.
- however, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this, and it was tough to explain as well. My attempts are very naive; I know this one will not work, but I have added it anyway:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2 = df2.set_index("name")
df3 = df3.set_index("type")
df1["w_ave"] = (df3.loc["v", "w_ave"]
                + df1["a"].mul(df2.loc["q", "a"])
                + df1["b"].mul(df2.loc["q", "b"])
                + df1["c"].mul(df2.loc["q", "c"]))
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
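If you also need vac and yak, the same idea generalizes. A sketch, assuming df2 and df3 are indexed as above, and reading from your sample output that the df1["a"] == 0 rule zeroes only w_ave:
import numpy as np
weights = df2.loc["q", ["a", "b", "c"]].to_numpy()   # multipliers from df2's "q" row
dot = df1[["a", "b", "c"]].to_numpy() @ weights      # row-wise weighted sum
for col in df3.columns:                              # w_ave, vac, yak
    df1[col] = df3.loc["v", col] + dot
# the sample output suggests w_ave is zeroed wherever df1["a"] == 0
df1.loc[df1["a"].eq(0), "w_ave"] = 0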
I have a pandas data frame with a two-level grouping based on 'col0' and 'col1'. All I want to do is drop all of a group's rows if a specified value in another column is repeated within the group or does not exist in it (i.e. keep only the groups where the specified value occurs exactly once). For example:
The original data frame:
df = pd.DataFrame({'col0': ['A','A','A','A','A','B','B','B','B','B','B','B','c'],
                   'col1': [1,1,2,2,2,1,1,1,1,2,2,2,1],
                   'col2': [1,2,1,2,3,1,2,1,2,2,2,2,1]})
I need to keep the rows for the groups where col2 equals 1 exactly once, i.e. (['A',1], ['A',2], ['c',1]) in this original DF.
The desired dataframe:
    col0  col1  col2
0      A     1     1
1      A     1     2
2      A     2     1
3      A     2     2
4      A     2     3
12     c     1     1
I tried this step:
df.groupby(['col0','col1']).apply(lambda x: (x['col2']==1).sum()==1)
where the result is
col0  col1
A     1         True
      2         True
B     1        False
      2        False
c     1         True
dtype: bool
How do I create the desired df based on this bool?
You can do this as below:
import numpy as np
m = (df.groupby(['col0','col1'])['col2']
       .transform(lambda x: np.where(x.eq(1).sum()==1, x, np.nan))
       .dropna().index)
df.loc[m]
Or:
df[df.groupby(['col0','col1'])['col2'].transform(lambda x: x.eq(1).sum()==1)]
col0 col1 col2
0 A 1 1
1 A 1 2
2 A 2 1
3 A 2 2
4 A 2 3
12 c 1 1
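Since the requirement is to keep or drop whole groups, GroupBy.filter maps onto it directly; a sketch equivalent to the boolean-transform version above:
df.groupby(['col0', 'col1']).filter(lambda g: g['col2'].eq(1).sum() == 1)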