Drop rows not equal to values for unique items - pandas - pandas

I've got a df that contains various strings that are associated with unique values. For these unique values, I want to drop the rows that are not equal to a separate list, except for the last row.
Using below, the various string values in Label are associated with Item. So for each unique Item, there could be multiple rows in Label with various strings. I only want to keep the strings that are in label_list, except for the last row.
I'm not sure I can do this another way as the amount of strings not in label_list is too many to account for. The ordering van also vary. So for each unique value in Item, I really only want the last row and whatever rows that are in label_list.
label_list = ['A','B','C','D']
df = pd.DataFrame({
'Item' : [10,10,10,10,10,20,20,20],
'Label' : ['A','X','C','D','Y','A','B','X'],
'Count' : [80.0,80.0,200.0,210.0,260.0,260.0,300.0,310.0],
})
df = df[df['Label'].isin(label_list)]
Intended output:
Item Label Value
0 10 A 80.0
1 10 C 200.0
2 10 D 210.0
3 10 Y 260.0
4 20 A 260.0
5 20 B 300.0
6 20 X 310.0

This comes to mind as a quick and dirty solution:
df = pd.concat([df[df['Label'].isin(label_list)],df.drop_duplicates('Item',keep='last')]).drop_duplicates(keep='first')
We are appending the last row of each Item group, but in case the last row is duplicsted because it is also in label_list we are using drop duplicates for the concatenated outout too.

check if 'Label' isin label_list
check if rows are duplicated
boolean slice the dataframe
isin_ = df['Label'].isin(label_list)
duped = df.duplicated('Item', keep='last')
df[isin_ | ~duped]
Item Label Count
0 10 A 80.0
2 10 C 200.0
3 10 D 210.0
4 10 Y 260.0
5 20 A 260.0
6 20 B 300.0
7 20 X 310.0

Related

Sorting df by column name of type timestamp

I have a dataframe df which consists of columns of countries and rows of dates. The index is of type "DateTime."
I would like to sort the df by the values of each country by the last element in the series (eg, the latest date) and the graph the "top N" countries by this latest value.
I thought if I sorted the transpose of the df and then slice it, I would have what I need. Hence, if N = 10, then I would select df[0:9].
However,when I attempt to select the last column, I get a 'keyerror' message, referencing the selected column:
KeyError: '2021-03-28 00:00:00'.
I'm stumped....
df_T = df.transpose()
column_name = str(df_T.columns[-1])
df_T.sort_values(by = column_name, axis = 'columns', inplace = True)
#select the top 10 countries by latest value, eg
# plot df_T[0:9]
What I'm trying to do, example df:
A B C .... X Y Z
2021-03-29 10 20 5 .... 50 100 7
2021-03-28 9 19 4 .... 45 90 6
2021-03-27 8 15 2 .... 40 80 4
...
2021-01-03 0 0 0 .... 0 0 0
I want to select series representing by the greatest N values as of the latest index value (eg, latest date).

pandas - how to vectorized group by calculations instead of iteration

Here is a code sniplet to simulate the problem i am facing. i am using iteration on large datasets
df = pd.DataFrame({'grp':np.random.choice([1,2,3,4,5],500),'col1':np.arange(0,500),'col2':np.random.randint(0,10,500),'col3':np.nan})
for index, row in df.iterrows():
#based on group label, get last 3 values to calculate mean
d=df.iloc[0:index].groupby('grp')
try:
dgrp_sum=d.get_group(row.grp).col2.tail(3).mean()
except:
dgrp_sum=999
#after getting last 3 values of group with reference to current row reference, multiply by other rows
df.at[index,'col3']=dgrp_sum*row.col1*row.col2
if i want to speed it up with vectors, how do i convert this code?
You basically calculate moving average over every group.
Which means you can group dataframe by "grp" and calculate rolling mean.
At the end you multiply columns in each row because it is not dependent on group.
df["col3"] = df.groupby("grp").col2.rolling(3, min_periods=1).mean().reset_index(0,drop=True)
df["col3"] = df[["col1", "col2", "col3"]].product(axis=1)
Note: In your code, each calculated mean is placed in the next row, thats why you probably have this try block.
# Skipping last product gives only mean
# np.random.seed(1234)
# print(df[df["grp"] == 2])
grp col1 col2 iter mask
4 2 4 6 999.000000 6.000000
5 2 5 0 6.000000 3.000000
6 2 6 9 3.000000 5.000000
17 2 17 1 5.000000 3.333333
27 2 27 9 3.333333 6.333333

Lookup a pandas df for a column value by matching rows with another dataframe

Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
Number of rows in df2 is equal to total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to this each unique combination. But I don't want to store these values in df1. Just need these values for some computation purpose and then I want to store the result in df2 only by creating a new column.
To summarize, lets say ,I want to do some computation using df1.Dist[3] for which I need values from columns df2.ABC and df2.XYZ also, so first find the row index in df2 where,
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(In this case its row#1),
so use df2.ABC[1] and df2.XYZ[1] and store results in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 Nan
1 18Sep 1 5 4 125 100 some computed value
Basically I guess I need a lookup kind of a function but don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.
You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = (df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day']) & (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID']) & (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID']) & (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID'])
You might want to include a check to verify that you found one and only one combination:
sum(bool_index) == 1
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation if you have large datasets. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.

shift specific rows from one column to next column

I have a column X, and I want to split specific rows in other columns.
x
76.25
'87.12'
1
345.65
'96.45'
2
78.12
'85.23'
3
35.1
'65.21'
1
I want to shift all values with '' to new column Y and all integers to new column sequence. Note all values are text.
desired output is
x y sequence
76.25 '87.12' 1
345.65 '96.45' 2
78.12 '85.23' 3
35.1 '65.21' 1
I have hundreds of rows. I read about shift() to shift values to next column but in this case i don't know row position as there are hundred of rows.is it possible to shift specific values with this criteria? any help will be appreciated.
If data are regular and exist each triple you can convert values to numpy array and reshape, then pass to DataFrame constructor:
df1 = pd.DataFrame(df['x'].to_numpy().reshape(-1,3), columns=['x','y','seq'])
#oldier pandas versions
#df1 = pd.DataFrame(df['x'].values.reshape(-1,3), columns=['x','y','seq'])
print (df1)
x y seq
0 76.25 '87.12' 1
1 345.65 '96.45' 2
2 78.12 '85.23' 3
3 35.1 '65.21' 1

Pandas sort grouby groups by arbitrary condition on its contents

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = gg['B'].max()
groups.sort_index(...)
But, of course, no sort_index on a group by object ..
EDIT:
I ended up using (almost) the solution suggested by #jezrael
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby, otherwise I would get the groups sort lex (A contains strings).
I think you need if possible same max for some groups use GroupBy.transform with max for new column and then sort by DataFrame.sort_values:
df = pd.DataFrame({
'A':list('aaabcc'),
'B':[7,8,9,100,20,30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If always max values are unique use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100