shift specific rows from one column to next column - pandas

I have a column x, and I want to move specific rows out into other columns.
x
76.25
'87.12'
1
345.65
'96.45'
2
78.12
'85.23'
3
35.1
'65.21'
1
I want to shift all the quoted values (the ones wrapped in '') to a new column y and all the integers to a new column sequence. Note that all values are stored as text.
desired output is
x y sequence
76.25 '87.12' 1
345.65 '96.45' 2
78.12 '85.23' 3
35.1 '65.21' 1
I have hundreds of rows. I read about shift() for moving values to the next column, but in this case I don't know the row positions, since there are hundreds of rows. Is it possible to shift specific values based on these criteria? Any help will be appreciated.

If the data are regular and always come in complete triples, you can convert the values to a numpy array, reshape it, and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df['x'].to_numpy().reshape(-1, 3), columns=['x', 'y', 'seq'])
# for older pandas versions:
# df1 = pd.DataFrame(df['x'].values.reshape(-1, 3), columns=['x', 'y', 'seq'])
print(df1)
x y seq
0 76.25 '87.12' 1
1 345.65 '96.45' 2
2 78.12 '85.23' 3
3 35.1 '65.21' 1

Related

How to split pandas dataframe into multiple dataframes (holding together rows) based upon a column's value

My problem is similar to the split a dataframe into chunks of N rows problem, except that the number of rows in each chunk will be different. I have a dataframe as such:
A  B  C
1  2  0
1  2  1
1  2  2
1  2  0
1  2  1
1  2  2
1  2  3
1  2  4
1  2  0
A and B are just placeholders; don't pay attention to them. Column C, though, starts at 0 and increments with each row until it suddenly resets to 0. So in the dataframe above, the first 3 rows form one new dataframe, the next 5 are a second new dataframe, and this continues as my dataframe adds more and more rows.
To finish off the question,
df = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
allows me to split out all the subgroups, and with this groupby I can select each subgroup as a separate dataframe.
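A runnable sketch of this on the sample data:
import pandas as pd

df = pd.DataFrame({'A': [1] * 9,
                   'B': [2] * 9,
                   'C': [0, 1, 2, 0, 1, 2, 3, 4, 0]})

# C.eq(0) marks each reset; cumsum() turns the marks into group ids 1, 2, 3, ...
chunks = [x for _, x in df.groupby(df['C'].eq(0).cumsum())]
print(len(chunks))   # 3
print(chunks[1])     # the second chunk: the five rows with C = 0..4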

Pandas - assign column values to new column names

I have this dataframe:
player_id scout_occ round scout
812842 2 1 X
812842 4 1 Y
812842 1 1 Z
812842 1 2 X
812842 2 2 Y
812842 2 2 Z
And I need to transpose the 'scout' values into columns, using the number of occurrences as the value for these new columns, ending up with:
player_id round X Y Z
812842 1 2 4 1
812842 2 1 2 2
How do I achieve this?
Use pivot_table. For example:
df = df.pivot_table(values='scout_occ', index=['player_id', 'round'], columns='scout')
Then, if you don't want the leftover column axis name (scout):
df.columns.name = None
Also, if you want player_id and round as columns rather than as the index:
df = df.reset_index()
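A self-contained sketch of the whole answer on the sample data (pivot_table defaults to mean aggregation, which is harmless here since each player_id/round/scout combination appears exactly once):
import pandas as pd

df = pd.DataFrame({'player_id': [812842] * 6,
                   'scout_occ': [2, 4, 1, 1, 2, 2],
                   'round': [1, 1, 1, 2, 2, 2],
                   'scout': ['X', 'Y', 'Z', 'X', 'Y', 'Z']})

out = df.pivot_table(values='scout_occ', index=['player_id', 'round'], columns='scout')
out.columns.name = None   # drop the leftover 'scout' axis name
out = out.reset_index()   # player_id and round back to columns
print(out)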

Drop rows not equal to values for unique items - pandas

I've got a df that contains various strings associated with unique values. For each unique value, I want to drop the rows whose strings are not in a separate list, except for the last row.
In the example below, the various string values in Label are associated with Item. So for each unique Item, there can be multiple rows in Label with various strings. I only want to keep the strings that are in label_list, plus the last row.
I'm not sure I can do this another way, as there are too many strings not in label_list to account for them all. The ordering can also vary. So for each unique value in Item, I really only want the last row and whatever rows are in label_list.
label_list = ['A','B','C','D']
df = pd.DataFrame({
    'Item': [10, 10, 10, 10, 10, 20, 20, 20],
    'Label': ['A', 'X', 'C', 'D', 'Y', 'A', 'B', 'X'],
    'Count': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 310.0],
})
df = df[df['Label'].isin(label_list)]
Intended output:
Item Label Count
0 10 A 80.0
1 10 C 200.0
2 10 D 210.0
3 10 Y 260.0
4 20 A 260.0
5 20 B 300.0
6 20 X 310.0
This comes to mind as a quick and dirty solution:
df = pd.concat([df[df['Label'].isin(label_list)], df.drop_duplicates('Item', keep='last')]).drop_duplicates(keep='first')
We are appending the last row of each Item group, but since that last row may be duplicated because it is also in label_list, we drop duplicates from the concatenated output too.
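The same one-liner broken into steps may be easier to read (a sketch using the df and label_list from the question; sort_index restores the original row order after the concat):
keep_labels = df[df['Label'].isin(label_list)]         # rows whose Label is in the list
last_rows = df.drop_duplicates('Item', keep='last')    # the last row of each Item
# a last row that is also in label_list appears twice, so deduplicate
out = pd.concat([keep_labels, last_rows]).drop_duplicates(keep='first').sort_index()
print(out)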
Check whether 'Label' is in label_list, mark every row that is not the last of its Item as duplicated, then boolean-slice the dataframe:
isin_ = df['Label'].isin(label_list)
duped = df.duplicated('Item', keep='last')
df[isin_ | ~duped]
Item Label Count
0 10 A 80.0
2 10 C 200.0
3 10 D 210.0
4 10 Y 260.0
5 20 A 260.0
6 20 B 300.0
7 20 X 310.0

drop consecutive duplicates of groups

I am removing consecutive duplicates in groups in a dataframe. I am looking for a faster way than this:
def remove_consecutive_dupes(subdf):
    dupe_ids = ["A", "B"]
    is_duped = (subdf[dupe_ids].shift(-1) == subdf[dupe_ids]).all(axis=1)
    subdf = subdf[~is_duped]
    return subdf

# dataframe with columns key, A, B
df.groupby("key").apply(remove_consecutive_dupes).reset_index()
Is it possible to remove these without grouping first? Applying the above function to each group individually takes a lot of time, especially if the group count is like half the row count. Is there a way to do this operation on the entire dataframe at once?
A simple example for the algorithm if the above was not clear:
input:
key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5
5 x 1 2
output:
key A B
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
5 x 1 2
Row 2 was dropped because A=1 B=2 was also the previous row in group x.
Row 5 will not be dropped because it is not a consecutive duplicate in group x.
According to your code, you drop lines only if they appear directly below each other when grouped by the key; rows with another key in between do not affect this logic. At the same time, you want to preserve the original order of the records. I suspect the biggest runtime cost is the per-group call of your function, not the grouping itself.
If you want to avoid this, you can try the following approach:
# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns = ['original_order'] + list(df.columns[1:])

# add a group column that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns = ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group'] = (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)

# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df
Testing this, results in:
key A B
original_order
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
If you don't like the index name above (original_order), you just need to add the following line to remove it:
df.index.name = None
Testdata:
import pandas as pd
from io import StringIO

infile = StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df = pd.read_csv(infile, sep=r'\s+')
df
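For completeness, a more compact variant of the same idea (my own sketch, not from the original answer): GroupBy.shift keeps the original index, so no sorting or index bookkeeping is needed.
# compare each row with the previous row of its key group; keep rows
# that differ in at least one of A, B (the first row of each group
# compares against NaN and is therefore always kept)
prev = df.groupby('key')[['A', 'B']].shift(1)
df[(df[['A', 'B']] != prev).any(axis=1)]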

create values for column in pandas dataframe only for rows containing certain elements of a column

df = pd.DataFrame({'x':['a','a','b','b'], 'y':[1,2,3,4]})
How can I create a column z whose elements are equal to y*2, but only for rows where column x is 'a'?
This is what I'm trying to achieve:
x y z
0 a 1 2
1 a 2 4
2 b 3 na
3 b 4 na
# list comprehension with an if/else expression
df['z'] = [y * 2 if x == 'a' else 'na' for x, y in zip(df['x'], df['y'])]
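If a real missing value is preferable to the string 'na', a vectorized alternative (a sketch using numpy; the column becomes float because of the NaN):
import numpy as np

df['z'] = np.where(df['x'] == 'a', df['y'] * 2, np.nan)
print(df)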