I have three columns C1,C2,C3 in panda dataframe. My aim is to replace C1_i by C2_j whenever C3_i=C1_j. These are all strings. I was trying where but failed. What is a good way to do this avoiding for loop?
If my data frame is
df=pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['d','e','f'], 'c3': ['c', 'z', 'b']})
Then I want c3 to be replaced by ['f','z','e']
I tried this, which takes very long time.
for i in range(0,len(df)):
for j in range(0,len(df)):
if (df.iloc[i]['c1']==df.iloc[j]['c3']):
df.iloc[j]['c3']=accounts.iloc[i]['c2']
Use map by Series created by set_index:
df['c3'] = df['c3'].map(df.set_index('c1')['c2']).fillna(df['c3'])
Alternative solution with update:
df['c3'].update(df['c3'].map(df.set_index('c1')['c2']))
print (df)
c1 c2 c3
0 a d f
1 b e z
2 c f e
Example data:
dataframe = pd.DataFrame({'a':['10','4','3','40','5'], 'b':['5','4','3','2','1'], 'c':['s','d','f','g','h']})
Output:
a b c
0 10 5 s
1 4 4 d
2 3 3 f
3 40 2 g
4 5 1 h
Code:
def replace(df):
if len(dataframe[dataframe.b==df.a]) != 0:
df['a'] = dataframe[dataframe.b==df.a].c.values[0]
return df
dataframe = dataframe.apply(replace, 1)
Output:
a b c
0 1 5 0
1 2 4 0
2 0 3 0
3 4 2 0
4 5 1 0
Is it what you want?
Related
I have a large Dataframe. One of my columns contains the name of others. I want to eval this colum and set in each row the value of the referenced column:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| B |
|2|5|3| A |
|3|5|9| C |
Desired output:
|A|B|C|Column|
|:|:|:|:-----|
|1|3|4| 3 |
|2|5|3| 2 |
|3|5|9| 9 |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]
In [366]: df
Out[366]:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0+, lookup is decprecated, you can just use a for loop:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
Since lookup is going to decprecated try numpy method with get_indexer
df['new'] = df.values[df.index,df.columns.get_indexer(df.Column)]
df
Out[75]:
A B C Column new
0 1 3 4 B 3
1 2 5 3 A 2
2 3 5 9 C 9
I have the following DataFrame (result of the method unstack):
df = pd.DataFrame(np.arange(12).reshape(2, -1),
columns=pd.CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c']))
df looks like this:
a b c a b c
0 0 1 2 3 4 5
1 6 7 8 9 10 11
When I try to df.reset_index() I get the following error:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
To bypass this problem I want to convert the column's index from categorical to a normal one. What is the most straightforward way to do it? Maybe you have an idea of how to reset the index without index conversion. I have the following idea:
df.columns = list(df.columns)
Most general is converting columns to list:
df.columns = df.columns.tolist()
Or if possible, convert them to strings:
df.columns = df.columns.astype(str)
df = df.reset_index()
print (df)
index a b c a b c
0 0 0 1 2 3 4 5
1 1 6 7 8 9 10 11
Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
see docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
using loc = 0 will insert at the beginning
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column, with your provided name already exists in the dataframe, it will raise a ValueError.
You can pass an optional parameter allow_duplicates with True value to create a new column with already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line ; however, this looks a bit ugly. Maybe some cleaner proposal may come...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this(only one line).
You can do that after you added the 'n' column into your df as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if you have words in your columns names instead of letters. It should include two brackets around your column names.
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can have the following 4-line routine whenever you want to create a new column and insert into a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]
Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like a new dataframe produced with the result of multiplying each row of the second dataframe against each row of the first dataframe, where it multiplies the value of Name1 against the value in the column of dataframe1 which matches the Name4 value of dataframe2.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First I changed the data in df1 at this point so this new example will turn out better.
Okay so from those two dataframes the data frame I'd like to create would come out like this if the multiplication when through for the first four rows of df2. You can see that Col2 and Col3 are unchanged, but depending on the letter of Col4, Col1 was multiplied with the corresponding factor from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1 but the elements from c are only multiplied with the first element in r the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
repeated_rows = pd.concat([row]*len(df1), axis=1, ignore_index=True).transpose()
factor = row['col1']
label = row['col4']
first_column = df1[label] * factor
repeated_rows['col1'] = first_column
return repeated_rows
pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
I have 3 dataframes say A, B and C with a common column 'com_col' in all three dataframes. I want to create a new column called 'com_col_occurrences' in B which should be calculated as below. For each value in 'com_col in dataframe B, check whether the value is available in A or not. If it is available then return the number of times the value has occurred in A. If it is not, then check in C whether it is available or not and if it is then return the number of times it has repeated. Please tell me how to write a function for this in Pandas. Please find below the sample code which demonstrates the problem.
import pandas as pd
#Given dataframes
df1 = pd.DataFrame({'comm_col': ['A', 'B', 'B', 'A']})
df2 = pd.DataFrame({'comm_col': ['A', 'B', 'C', 'D', 'E']})
df3 = pd.DataFrame({'comm_col':['A', 'A', 'D', 'E']})
# The value 'A' from df2 occurs in df1 twice. Hence the output is 2.
#Similarly for 'B' the output is 2. 'C' doesn't occur in any of the
#dataframes. Hence the output is 0
# 'D' and 'E' occur don't occur in df1 but occur in df3 once. Hence
#the output for 'D' and 'E' should be 1
#Output should be as shown below
df2['comm_col_occurrences'] = [2, 2, 0, 1, 1]
Output:
**df1**
comm_col
0 A
1 B
2 B
3 A
**df3**
comm_col
0 A
1 A
2 D
3 E
**df2**
comm_col
0 A
1 B
2 C
3 D
4 E
**Output**
comm_col comm_col_occurrences
0 A 2
1 B 2
2 C 0
3 D 1
4 E 1
Thanks in advance
You need:
result = pd.DataFrame({
'df1':df1['comm_col'].value_counts(),
'df2':df2['comm_col'].value_counts(),
'df3':df3['comm_col'].value_counts()
})
result['comm_col_occurrences'] = np.nan
result.loc[result['df1'].notnull(), 'comm_col_occurrences'] = result['df1']
result.loc[result['df3'].notnull(), 'comm_col_occurrences'] = result['df3']
result['comm_col'] = result['comm_col'].fillna(0)
result = result.drop(['df1', 'df2', 'df3'], axis=1)
Output:
comm_col comm_col_occurrences
0 A 2.0
1 B 2.0
2 C 0.0
3 D 1.0
4 E 1.0