Create values for a column in a pandas DataFrame only for rows containing certain elements of a column

df = pd.DataFrame({'x':['a','a','b','b'], 'y':[1,2,3,4]})
How can I create a column z whose elements are equal to y*2, but only for the rows where column x is 'a'?
This is what I'm trying to achieve:
x y z
0 a 1 2
1 a 2 4
2 b 3 na
3 b 4 na

#using list comprehension with if else statements
df['z']=[y*2 if x=='a' else 'na' for x,y in zip(df['x'],df['y']) ]
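As a possible alternative to the list comprehension, here is a minimal vectorized sketch using `Series.where`. Note that it fills the non-matching rows with NaN instead of the string 'na', which is an assumption about what is acceptable here; it keeps the column numeric.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b', 'b'], 'y': [1, 2, 3, 4]})

# Keep y*2 where x == 'a'; every other row becomes NaN
df['z'] = (df['y'] * 2).where(df['x'] == 'a')
print(df)
```

Because of the NaN values the resulting column has a float dtype.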

Related

drop consecutive duplicates of groups

I am removing consecutive duplicates in groups in a dataframe. I am looking for a faster way than this:
def remove_consecutive_dupes(subdf):
    dupe_ids = ["A", "B"]
    is_duped = (subdf[dupe_ids].shift(-1) == subdf[dupe_ids]).all(axis=1)
    subdf = subdf[~is_duped]
    return subdf

# dataframe with columns key, A, B
df.groupby("key").apply(remove_consecutive_dupes).reset_index()
Is it possible to remove these without grouping first? Applying the above function to each group individually takes a lot of time, especially if the group count is like half the row count. Is there a way to do this operation on the entire dataframe at once?
A simple example for the algorithm if the above was not clear:
input:
key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5
5 x 1 2
output:
key A B
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
5 x 1 2
Row 2 was dropped because A=1 B=2 was also the previous row in group x.
Row 5 will not be dropped because it is not a consecutive duplicate in group x.
According to your code, you only drop rows that directly follow each other once the data is grouped by the key, so rows with another key in between do not influence this logic. At the same time, you want to preserve the original order of the records.
I guess the biggest influence on the runtime is the repeated call of your function (once per group) and possibly not the grouping itself.
If you want to avoid this, you can try the following approach:
# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns= ['original_order'] + list(df.columns[1:])
# add a group column, that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns= ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group']= (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df
Testing this results in:
key A B
original_order
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
If you don't like the index name above (original_order), you just need to add the following line to remove it:
df.index.name= None
Test data:
import pandas as pd
from io import StringIO
infile = StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
# raw string avoids an invalid-escape warning for the regex separator
df = pd.read_csv(infile, sep=r'\s+')
df
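As another option that avoids both sorting and a per-group Python function, here is a minimal sketch based on `groupby(...).shift()`, which shifts within each group in one vectorized call. It assumes the compared columns A and B contain no NaN values (NaN never compares equal, so such rows would always be kept). The data is the six-row example from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["x", "y", "x", "x", "y", "x"],
    "A": [1, 1, 1, 1, 2, 1],
    "B": [2, 4, 2, 4, 5, 2],
})

compare_columns = ["A", "B"]

# For every row, fetch the previous row of the same key group (NaN for the first row of a group)
previous = df.groupby("key")[compare_columns].shift()

# A row is a consecutive duplicate if all compared columns equal the previous row in its group
is_duped = (df[compare_columns] == previous).all(axis=1)

result = df[~is_duped]
print(result)
```

The original row order and index are preserved, so no helper column is needed to restore them.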

shift specific rows from one column to next column

I have a column x, and I want to split specific rows into other columns.
x
76.25
'87.12'
1
345.65
'96.45'
2
78.12
'85.23'
3
35.1
'65.21'
1
I want to shift all values wrapped in quotes ('') to a new column y and all integers to a new column sequence. Note that all values are text.
desired output is
x y sequence
76.25 '87.12' 1
345.65 '96.45' 2
78.12 '85.23' 3
35.1 '65.21' 1
I have hundreds of rows. I read about shift() to shift values to the next column, but in this case I don't know the row positions because there are hundreds of rows. Is it possible to shift specific values based on these criteria? Any help will be appreciated.
If the data is regular, meaning every record consists of exactly three consecutive values, you can convert the values to a numpy array, reshape it, and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df['x'].to_numpy().reshape(-1,3), columns=['x','y','seq'])
# older pandas versions
#df1 = pd.DataFrame(df['x'].values.reshape(-1,3), columns=['x','y','seq'])
print(df1)
x y seq
0 76.25 '87.12' 1
1 345.65 '96.45' 2
2 78.12 '85.23' 3
3 35.1 '65.21' 1
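For reference, a self-contained sketch of the reshape approach using the sample values from the question. The explicit length check is an addition for illustration, since `reshape(-1, 3)` fails if the number of rows is not a multiple of three.

```python
import pandas as pd

df = pd.DataFrame({"x": [
    "76.25", "'87.12'", "1",
    "345.65", "'96.45'", "2",
    "78.12", "'85.23'", "3",
    "35.1", "'65.21'", "1",
]})

values = df["x"].to_numpy()

# The reshape only makes sense if the data really comes in complete triples
if len(values) % 3 != 0:
    raise ValueError("number of rows is not a multiple of 3")

# Every 3 consecutive values become one row: x, y, sequence
df1 = pd.DataFrame(values.reshape(-1, 3), columns=["x", "y", "sequence"])
print(df1)
```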

Pandas, multiply part of one DF against another based on condition

Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like a new dataframe produced with the result of multiplying each row of the second dataframe against each row of the first dataframe, where it multiplies the value of Name1 against the value in the column of dataframe1 which matches the Name4 value of dataframe2.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First I changed the data in df1 at this point so this new example will turn out better.
Okay, so from those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding factors from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1, but the elements of c are only multiplied with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    # repeat the df2 row once for every row of df1
    repeated_rows = pd.concat([row]*len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    # multiply the df1 column named by col4 with the col1 value
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows

pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
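If the row-by-row loop turns out to be slow, the same result can probably be built without `iterrows`. The following is only a sketch under the assumption that every value in col4 is a column name of df1; it repeats each df2 row with `Index.repeat` and looks up the matching df1 columns in one step.

```python
import pandas as pd

d1 = {'D': [1, 2, 3, 4, 5, 6], 'W': [2, 2, 2, 2, 2, 2],
      'L': [6, 5, 4, 3, 2, 1], 'S': [1, 2, 3, 4, 5, 6]}
d2 = {'col1': [3, 2, 7, 4, 5, 6], 'col2': [2, 2, 2, 2, 3, 4],
      'col3': ['data'] * 6, 'col4': ['D', 'L', 'D', 'W', 'S', 'S']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)

n = len(df1)

# Repeat every row of df2 once per row of df1
out = df2.loc[df2.index.repeat(n)].reset_index(drop=True)

# Select the df1 column named in col4 for each df2 row (duplicates allowed),
# then flatten so the factors line up with the repeated rows
factors = df1[df2['col4'].tolist()].to_numpy().T.ravel()

out['col1'] = out['col1'] * factors
print(out)
```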

Python: Add column to panda data frame with different column length

I have a pandas DataFrame and would like to add data columns using one common column as the index. If the new data does not contain an index value, a 0 should be entered instead. The new column will have a different length. Is there a better way than using a loop? Example below:
main Dataframe:
index_column date value
1 1 A
2 2 B
3 3 C
4 4 D
add new column:
date value
2 G
3 J
Result:
index_column date value new value
1 1 A 0
2 2 B G
3 3 C J
4 4 D 0
Many thanks!
Rolf
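One possible approach, sketched under the assumption that `date` is the common column in both frames: a left merge keeps every row of the main DataFrame, and the missing matches are then filled with 0. The frame names `main` and `new` are made up for this illustration.

```python
import pandas as pd

main = pd.DataFrame({
    "index_column": [1, 2, 3, 4],
    "date": [1, 2, 3, 4],
    "value": ["A", "B", "C", "D"],
})
new = pd.DataFrame({"date": [2, 3], "value": ["G", "J"]})

# Rename first so the two "value" columns do not clash, then left-merge on date
result = main.merge(new.rename(columns={"value": "new value"}), on="date", how="left")

# Dates missing from the new data end up as NaN; replace them with 0
# (cast to object so that strings and the integer 0 can share one column)
result["new value"] = result["new value"].astype(object).fillna(0)
print(result)
```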

Map column names if data is same in two dataframes

I have two pandas dataframes
df1 = A B C
1 2 3
2 3 4
3 4 5
df2 = X Y Z
1 2 3
2 3 4
3 4 5
I need to map the columns based on their data: if the data is the same, then map the column names.
Output = col1 col2
A X
B Y
C Z
I cannot find any built-in function to support this, hence simply loop over all columns:
import pandas

pairs = []
# compare every column of df1 with every column of df2
for col1 in df1.columns:
    for col2 in df2.columns:
        if df1[col1].equals(df2[col2]):
            pairs.append((col1, col2))
output = pandas.DataFrame(pairs, columns=['col1', 'col2'])
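To try this out, the loop can be wrapped in a small helper and applied to the example frames from the question. The function name `match_columns` is made up for this sketch. Note that `Series.equals` also compares dtypes and index labels, so columns with equal numbers but different dtypes would not be matched.

```python
import pandas as pd

def match_columns(left, right):
    # Collect every (left column, right column) pair holding identical data
    pairs = [(lc, rc)
             for lc in left.columns
             for rc in right.columns
             if left[lc].equals(right[rc])]
    return pd.DataFrame(pairs, columns=["col1", "col2"])

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]})
df2 = pd.DataFrame({"X": [1, 2, 3], "Y": [2, 3, 4], "Z": [3, 4, 5]})

output = match_columns(df1, df2)
print(output)
```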