I have a dataframe. On each row, I would like to test whether the number in column B is the number contained in the string in column A, and store the result in a new column C.
df = pd.DataFrame({'A': ["me 123", "me-123", "1234", "me 12", "123 me"],
'B': [123, 123, 123, 123, 6]})
I can do that using extract:
df['C'] = df.A.str.extract(r'(\d+)', expand=False).astype(int).eq(df.B,0).astype(int)
A B C
0 me 123 123 1
1 me-123 123 1
2 1234 123 0
3 me 12 123 0
4 123 me 6 0
However, if one of the A values does not contain a number:
df = pd.DataFrame({'A': ["me 123", "me-123", "1234", "me 12", "123 me", "me"],
'B': [123, 123, 123, 123, 6, 123]})
Then I get:
ValueError: cannot convert float NaN to integer
NaN values are floats, so you can convert the extracted output to float instead of int:
df['C'] = df.A.str.extract(r'(\d+)', expand=False).astype(float).eq(df.B,0).astype(int)
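As a self-contained check of the float conversion, here is a minimal sketch using the second dataframe from the question. The intermediate name `extracted` is only for illustration. A row with no digits extracts to NaN, and NaN never compares equal to anything, so that row ends up as 0.

```python
import pandas as pd

df = pd.DataFrame({'A': ["me 123", "me-123", "1234", "me 12", "123 me", "me"],
                   'B': [123, 123, 123, 123, 6, 123]})

# Rows without any digits become NaN, which float can represent but int cannot.
extracted = df['A'].str.extract(r'(\d+)', expand=False).astype(float)

# NaN compares unequal to every value, so the "me" row yields False, i.e. 0.
df['C'] = extracted.eq(df['B']).astype(int)
print(df)
```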
I am selecting multiple rows based on a condition, and updating values in multiple columns. This works unless one of the values is a list.
First, a dataframe:
>>> dummy_data = {'A': ['abc', 'def', 'def'],
'B': ['red', 'purple', 'blue'],
'C': [25, 94, 57],
'D': [False, False, False],
'E': [[9,8,12], [36,72,4], [18,3,5]]}
>>> df = pd.DataFrame(dummy_data)
A B C D E
0 abc red 25 False [9, 8, 12]
1 def purple 94 False [36, 72, 4]
2 def blue 57 False [18, 3, 5]
Things that work:
This works to select multiple rows and update multiple columns:
>>> df.loc[df['A'] == 'def', ['B', 'C', 'D']] = ['orange', 42, True]
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 True [36, 72, 4]
2 def orange 42 True [18, 3, 5]
This works to update column E with a new list:
>>> new_list = [1,2,3]
>>> df.loc[df['A'] == 'def', ['E']] = pd.Series([new_list] * len(df))
A B C D E
0 abc red 25 False [9, 8, 12]
1 def purple 94 False [1, 2, 3]
2 def blue 57 False [1, 2, 3]
But how to do both?
I can't figure out an elegant way to combine these approaches.
Attempt 1: This works, but I get the "ndarray from ragged nested sequences" warning:
>>> new_list = [1,2,3]
>>> updates = ['orange', 42, new_list]
>>> num_rows = df.A.eq('def').sum()
>>> df.loc[df['A'] == 'def', ['B', 'C', 'E']] = [updates] * num_rows
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences ...
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
Attempt 2: This works, but seems overly complicated:
>>> new_list = [1,2,3]
>>> updates = ['orange', 42, new_list]
>>> num_rows = df.A.eq('def').sum()
>>> df2 = pd.DataFrame([updates] * num_rows)
>>> df.loc[df['A'] == 'def', ['B', 'C', 'E']] = df2[[0, 1, 2]].values
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
You can use pandas.DataFrame to assign/align the new values with the selected columns with the help of a boolean mask.
mask = df['A'] == 'def'
cols = ['B', 'C', 'D', 'E']
new_list = [1,2,3]
updates = ['orange', 42, True, [new_list]]
df.loc[mask, cols] = pd.DataFrame(dict(zip(cols, updates)), index=df.index)
>>> print(df)
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 True [1, 2, 3]
2 def orange 42 True [1, 2, 3]
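Another option, sketched here as an assumption rather than as part of the answer above, is to split the update in two: assign the scalar columns in one statement, then assign the list column from a Series indexed by the selected rows. Each row gets its own copy of the list, so the rows do not share one object.

```python
import pandas as pd

df = pd.DataFrame({'A': ['abc', 'def', 'def'],
                   'B': ['red', 'purple', 'blue'],
                   'C': [25, 94, 57],
                   'D': [False, False, False],
                   'E': [[9, 8, 12], [36, 72, 4], [18, 3, 5]]})

mask = df['A'] == 'def'
new_list = [1, 2, 3]

# Scalar columns: one value per column, broadcast over the selected rows.
df.loc[mask, ['B', 'C', 'D']] = ['orange', 42, True]

# List column: one independent copy of the list per selected row,
# indexed by the selected row labels so pandas aligns on them.
df.loc[mask, 'E'] = pd.Series([list(new_list) for _ in range(mask.sum())],
                              index=df.index[mask])
print(df)
```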
Create a numpy array with object dtype:
df.loc[df['A'] == 'def', ['B', 'C', 'E']] = np.array([updates] * num_rows, dtype='object')
Output:
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3]
2 def orange 42 False [1, 2, 3]
However, as commented, [updates] * num_rows is a dangerous operation because every repeated row refers to the same list object. For example, suppose you later modify one of the list values in place:
df.iloc[-1,-1].append(4)
Then your data becomes (notice the change in row 1 as well):
A B C D E
0 abc red 25 False [9, 8, 12]
1 def orange 42 False [1, 2, 3, 4]
2 def orange 42 False [1, 2, 3, 4]
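The aliasing can be reproduced in plain Python, without pandas. The sketch below contrasts list repetition with per-row deep copies; the names `shared` and `independent` are only illustrative.

```python
import copy

new_list = [1, 2, 3]
updates = ['orange', 42, new_list]

# Repetition copies references: both rows point at the very same inner list.
shared = [updates] * 2

# Deep copies give every row its own inner list.
independent = [copy.deepcopy(updates) for _ in range(2)]

shared[1][2].append(4)       # also visible through shared[0]
independent[1][2].append(5)  # only affects the second row

print(shared[0][2], shared[1][2])
print(independent[0][2], independent[1][2])
```

Building the rows with a comprehension of copies, instead of `[updates] * num_rows`, avoids the surprise at the cost of one copy per row.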
I would like to subset a dataframe without assigning it first to a variable.
Example with assigning:
df = pd.DataFrame({'A': range(10), 'B': range(5, 15)})
df[(df['A'] > 3) & (df['B'] < 12)]
Result:
A B
4 4 9
5 5 10
6 6 11
How to do this without creating df first?
Something like...
pd.DataFrame({'A': range(10), 'B': range(5, 15)}).loc[..., ...]
Or maybe using .pipe()?
Use selection by callable:
df = (pd.DataFrame({'A': range(10), 'B': range(5, 15)})
.loc[lambda x: (x['A'] > 3) & (x['B'] < 12)])
print (df)
A B
4 4 9
5 5 10
6 6 11
Another idea is to use query, thank you @sammywemmy:
df = pd.DataFrame({'A': range(10), 'B': range(5, 15)}).query("A > 3 and B < 12")
# this works the same way
df = pd.DataFrame({'A': range(10), 'B': range(5, 15)}).query("A > 3 & B < 12")
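Since the question also asks about .pipe(), here is a minimal sketch of that variant. The lambda receives the freshly built dataframe, so no intermediate variable is needed; `result` is just an illustrative name.

```python
import pandas as pd

result = (pd.DataFrame({'A': range(10), 'B': range(5, 15)})
          # pipe passes the dataframe to the function and returns its result
          .pipe(lambda d: d[(d['A'] > 3) & (d['B'] < 12)]))
print(result)
```

This is longer than the .loc callable form, but it can be handy when the filter already lives in a named function.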
I have a dataframe as below.
D1 = pd.DataFrame({'a': [15, 22, 107, 120],
'b': [25, 21, 95, 110]})
I am trying to insert two zeros at random positions in column 'b' to get the effect shown below. Each inserted 0 shifts the rows beneath it down by one, and column 'a' is padded with zeros at the end to match the new length.
D1 = pd.DataFrame({'a': [15, 22, 107, 120, 0, 0],
'b': [0, 25, 21, 0, 95, 110]})
Everything I have seen is about inserting into the whole column as opposed to individual rows.
Here is one potential way to achieve this using numpy.random.randint and numpy.insert:
import numpy as np
n = 2
rand_idx = np.random.randint(0, len(D1), size=n)
# Append 'n' rows of zeroes to D1
D2 = pd.concat([D1, pd.DataFrame(np.zeros((n, D1.shape[1]), dtype=int), columns=D1.columns)], ignore_index=True)
# Insert n zeroes into random indices and assign back to column 'b'
D2['b'] = np.insert(D1['b'].values, rand_idx, 0)
print(D2)
a b
0 15 25
1 22 0
2 107 0
3 120 21
4 0 95
5 0 110
Use numpy.insert with set positions: random positions for b, and the length of the original DataFrame (i.e. the end) for a:
n = 2
new = np.zeros(n, dtype=int)
a = np.insert(D1['a'].values, len(D1), new)
b = np.insert(D1['b'].values, np.random.randint(0, len(D1), size=n), new)
#pandas 0.24+
#a = np.insert(D1['a'].to_numpy(), len(D1), new)
#b = np.insert(D1['b'].to_numpy(), np.random.randint(0, len(D1), size=n), new)
df = pd.DataFrame({'a':a, 'b': b})
print (df)
a b
0 15 0
1 22 25
2 107 21
3 120 0
4 0 95
5 0 110
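For a repeatable run, the same idea can be sketched with a seeded NumPy generator. The seed value, the names `rng`, `positions` and `result`, and the choice to also allow insertion after the last row are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

D1 = pd.DataFrame({'a': [15, 22, 107, 120],
                   'b': [25, 21, 95, 110]})

n = 2
rng = np.random.default_rng(0)  # seeded only so the run is repeatable

# Positions refer to the original array; len(D1) means "after the last row".
positions = rng.integers(0, len(D1) + 1, size=n)

# Column 'a' is padded at the end, column 'b' gets zeros at the random positions.
a = np.append(D1['a'].to_numpy(), np.zeros(n, dtype=int))
b = np.insert(D1['b'].to_numpy(), positions, 0)

result = pd.DataFrame({'a': a, 'b': b})
print(result)
```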
I am having trouble merging two dataframes in the way I want. I tried many variations of the merge and join methods, but none of them produced the desired result.
import pandas as pd
d = {'A': [1, 1, 0, 1, 0, 1, 0],
'B': [0, 0, 0, 0, 0, 1, 1]
}
df = pd.DataFrame(data=d, index=["A", "B", "C", "D", "E", "F", "G"])
print(df)
d = {'A2': ["D", "A", "A", "B", "C", "C", "E", "X", "F", "G"],
'B2': ["DD", "AA", "AA", "BB", "CC", "CC", "EE", "XX", "FF", "GG"],
'C3': [1, 1, 11, 35, 53, 2, 76, 45, 5, 34]}
df2 = pd.DataFrame(data=d)
print(df2)
Console output:
A B
A 1 0
B 1 0
C 0 0
D 1 0
E 0 0
F 1 1
G 0 1
A2 B2 C3
0 D DD 1
1 A AA 1
2 A AA 11
3 B BB 35
4 C CC 53
5 C CC 2
6 E EE 76
7 X XX 45
8 F FF 5
9 G GG 34
I'm looking for a way to compute the following: Via the index of df I can look up in column A2 of df2 the value of B2 which should be added to df.
Desired result:
A B B2
A 1 0 AA
B 1 0 BB
C 0 0 CC
D 1 0 DD
E 0 0 EE
F 1 1 FF
G 0 1 GG
(This is only dummy data; simply duplicating the index and writing it to column B2 of df is not sufficient.)
Drop the duplicate keys, use set_index, and assign the result; pandas aligns it on the index of df:
df['B2']=df2.drop_duplicates('A2').set_index('A2')['B2']
df
Out[728]:
A B B2
A 1 0 AA
B 1 0 BB
C 0 0 CC
D 1 0 DD
E 0 0 EE
F 1 1 FF
G 0 1 GG
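The same lookup can also be written with an explicit map, which may make the alignment step easier to see. This is a sketch under the assumption that keeping the first B2 value per A2 key is acceptable; `lookup` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 0, 1, 0, 1, 0],
                   'B': [0, 0, 0, 0, 0, 1, 1]},
                  index=["A", "B", "C", "D", "E", "F", "G"])
df2 = pd.DataFrame({'A2': ["D", "A", "A", "B", "C", "C", "E", "X", "F", "G"],
                    'B2': ["DD", "AA", "AA", "BB", "CC", "CC", "EE", "XX", "FF", "GG"],
                    'C3': [1, 1, 11, 35, 53, 2, 76, 45, 5, 34]})

# One B2 value per A2 key; drop_duplicates keeps the first occurrence.
lookup = df2.drop_duplicates('A2').set_index('A2')['B2']

# Map each index label of df through the lookup table.
df['B2'] = df.index.to_series().map(lookup)
print(df)
```

Labels of df that are missing from A2 would come out as NaN rather than raising an error.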
I know this has been already answered by W-B in a very elegant way.
However, since I have spent the time to solve this in a less professional way, let me post also my solution.
From:
I'm looking for a way to compute the following: Via the index of df I
can look up in column A2 of df2 the value of B2 which should be added
to df.
I understood I should do:
get the index list from df, i.e. A, B, C...
for each element of the df index, find the row of df2 where df2['A2'] matches it and look up the value of df2['B2'] in that row
create a new column ['B2'] in df and copy into it the values of df2['B2'] whose df2['A2'] entry matches the index of df
This is my code:
import pandas as pd
d = {'A': [1, 1, 0, 1, 0, 1, 0],
'B': [0, 0, 0, 0, 0, 1, 1]
}
df = pd.DataFrame(data=d, index=["A", "B", "C", "D", "E", "F", "G"])
print(df)
d = {'A2': ["D", "A", "A", "B", "C", "C", "E", "X", "F", "G"],
'B2': ["DD", "AA", "AA", "BB", "CC", "CC", "EE", "XX", "FF", "GG"],
'C3': [1, 1, 11, 35, 53, 2, 76, 45, 5, 34]}
df2 = pd.DataFrame(data=d)
print(df2)
llista=[]
for i in df.index:
    # row labels of df2 where A2 equals the current index label of df
    m=df2['A2'].loc[df2['A2']==i].index
    if m[0]:
        print(m[0],i)
        llista.append(df2['B2'].iloc[m[0]])
    else:
        llista.append([])
df['B2'] = llista
Output is:
A B B2
A 1 0 AA
B 1 0 BB
C 0 0 CC
D 1 0 []
E 0 0 EE
F 1 1 FF
G 0 1 GG
As you can see, this is different from the accepted answer: row D gets [] instead of DD. This is because 'D' sits at position 0 in df2['A2'], so m[0] is 0 and the check if m[0] evaluates to False.
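A variant of the loop that tests whether a match exists, rather than testing the matched position itself, treats position 0 like any other. This is a sketch with renamed variables (`values`, `matches`), and the use of None for a missing key is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 0, 1, 0, 1, 0],
                   'B': [0, 0, 0, 0, 0, 1, 1]},
                  index=["A", "B", "C", "D", "E", "F", "G"])
df2 = pd.DataFrame({'A2': ["D", "A", "A", "B", "C", "C", "E", "X", "F", "G"],
                    'B2': ["DD", "AA", "AA", "BB", "CC", "CC", "EE", "XX", "FF", "GG"],
                    'C3': [1, 1, 11, 35, 53, 2, 76, 45, 5, 34]})

values = []
for key in df.index:
    # Row labels of df2 whose A2 equals the current key (df2 uses a default 0..n-1 index).
    matches = df2.index[df2['A2'] == key]
    if len(matches) > 0:   # "is there a match?", not "is the position truthy?"
        values.append(df2['B2'].iloc[matches[0]])
    else:
        values.append(None)
df['B2'] = values
print(df)
```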
QUESTION
How can I get rid of the repeated column labels for each line of data?
CODE
req = urllib.request.Request(newIsUrl)
resp = urllib.request.urlopen(req)
respData = resp.read()
dRespData = respData.decode('utf-8')
df = pd.DataFrame(columns= ['Ticker', 'GW', 'RE', 'OE', 'NI', 'CE'])
df = df.append({'Ticker':ticker,
'GW':gw,
'RE':rt,
'OE':oe,
'NI':netInc,
'CE':capExp}, ignore_index= True)
print(df)
yhooKeyStats()
acquireData()
OUTCOME
Ticker GW RE OE NI CE
0 MMM [7,050,000] [34,317,000] [13,109,000] [4,956,000] [(1,493,000)]
Ticker GW RE OE NI CE
0 ABT [17,501,000] [7,412,000] [12,156,000] [2,437,000]
NOTES
all of the headers and data line up respectively
headers are repeated in the dataframe for each line of data
You can skip every other line with a slice and iloc:
In [11]: df = pd.DataFrame({0: ['A', 1, 'A', 3], 1: ['B', 2, 'B', 4]})
In [12]: df
Out[12]:
0 1
0 A B
1 1 2
2 A B
3 3 4
In [13]: df.iloc[1::2]
Out[13]:
0 1
1 1 2
3 3 4
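If the repeated headers come from creating and printing a new one-row dataframe for every ticker, which is an assumption about the code not shown in the question, another approach is to collect one dict per ticker and build the dataframe once. In this sketch `fetch_row` is a hypothetical stand-in for the scraping code and returns placeholder values only.

```python
import pandas as pd

COLUMNS = ['Ticker', 'GW', 'RE', 'OE', 'NI', 'CE']

def fetch_row(ticker):
    # Hypothetical placeholder for the real scraping logic: returns dummy zeros.
    return {'Ticker': ticker, 'GW': 0, 'RE': 0, 'OE': 0, 'NI': 0, 'CE': 0}

# Collect all rows first, then build a single dataframe.
rows = [fetch_row(ticker) for ticker in ['MMM', 'ABT']]
df = pd.DataFrame(rows, columns=COLUMNS)

print(df)  # the header is printed once, above all rows
```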