Drop duplicate rows of a dataframe, keeping only the row with the minimum time

I have two columns, "key" and "time". If the key is the same, compare the times, keep the row with the minimum time, and drop the other rows with that key.
import pandas as pd

data = {
    "key": [1, 2, 3, 4, 1, 2, 3, 1],
    "time": [12.4, 12.6, 12.8, 12.5, 12.9, 12.3, 12.8, 12.1],
}
df = pd.DataFrame(data)
print(df)
I tried with duplicated() but it returns a Series. I tried many other things but they didn't work.

You can use groupby on key and aggregate by minimum, as in:
df.groupby(['key']).agg('min')
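If you want to keep the whole original row and leave key as a regular column rather than the index, here are two equivalent sketches, assuming time is the only column you want to minimise over:
# sort by time and keep the first (smallest) row per key
df_min = df.sort_values('time').drop_duplicates(subset='key', keep='first')
# or select the rows at the index of the minimum time per key
df_min = df.loc[df.groupby('key')['time'].idxmin()]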

Related

Different behaviour between two ways of dropping duplicate values in a dataframe

I tested two ways of dropping duplicated rows in a dataframe, but they did not give the same result and I don't understand why.
First code:
file_df1 = open('df1.csv', 'r')
df1_list = []
for line in file_df1:
    new_line = line.rsplit(',')
    df1_firstcolumn = new_line[0]
    if df1_firstcolumn not in df1_list:
        df1_list.append(df1_firstcolumn)
    #else:
    #    print('firstcolumn: ' + df1_firstcolumn + ' is duplicated')
file_df1.close()
The second way, using pandas:
import pandas as pd
df1 = pd.read_csv('df1.csv', header=None, names=['firstcolumn','second','third','forth'])
df1.drop_duplicates(inplace=True)
I obtained more unique values using pandas.
The first approach you posted "drops duplicates" based only on the data in the first column.
By default, the pandas drop_duplicates function checks whether the values in all four columns are duplicated. The version below removes duplicates based on the first column only.
df1.drop_duplicates(subset=['firstcolumn'], inplace=True)
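A quick sketch to see the difference, assuming df1 has been read as above:
# default: a row is dropped only if ALL four columns match an earlier row
all_cols = df1.drop_duplicates()
# subset: a row is dropped if the first column alone matches an earlier row
first_col_only = df1.drop_duplicates(subset=['firstcolumn'])
print(len(all_cols), len(first_col_only))  # first_col_only can never have more rows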

PySpark dataframe Pandas UDF returns empty dataframe

I'm trying to apply a pandas_udf to my PySpark dataframe for some filtering, following the groupby('Key').apply(UDF) method. To use the pandas_udf I defined an output schema and have a condition on the column Number. As an example, the simplified idea here is that I wish only to return the ID of the rows with odd Number.
This brings up a problem: sometimes there is no odd Number in a group, so the UDF returns an empty dataframe, which conflicts with the defined schema that expects an int for Number.
Is there a way to solve this and combine only the odd Number rows into a new dataframe?
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Key", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID', 'Number']]
I came across this issue with empty DataFrames in some groups. I solved it by checking for an empty DataFrame and returning a DataFrame with the dtypes defined explicitly:
if df_out.empty:
    # change the schema as needed
    return pd.DataFrame({'fullVisitorId': pd.Series([], dtype='str'),
                         'time': pd.Series([], dtype='datetime64[ns]'),
                         'total_transactions': pd.Series([], dtype='int')})
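Applied to the get_odd example from the question, a minimal sketch might look like this (assuming the output columns really are Key and Number as in the declared schema; int32 is used to match Spark's IntegerType):
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    odd = df.loc[df['Number'] % 2 == 1]
    if odd.empty:
        # return an empty frame whose dtypes match the declared schema
        return pd.DataFrame({'Key': pd.Series([], dtype='str'),
                             'Number': pd.Series([], dtype='int32')})
    return odd[['Key', 'Number']]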

add unique column to a pandas dataframe

I have a pandas dataframe with 10 columns. I would like to add a column which will uniquely identify every row. I have to come up with the unique value myself (it could be as simple as a running sequence). How can I do this? I tried adding the index as a column itself, but for some reason I get a KeyError when I do this.
Add a column from a range the length of your index:
df['new'] = range(1, len(df.index)+1)
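If a plain running sequence is enough, the index itself can also be copied into a column; a sketch, assuming the dataframe still has its default RangeIndex:
df['row_id'] = range(len(df))  # 0-based running sequence
# or expose the existing index as a column (the new column is named 'index' by default)
df = df.reset_index()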

Upsampling datetime - ValueError: cannot reindex a non-unique index with a method or limit

I get the error below when I try to upsample...
import pandas as pd
from datetime import date
df1=pd.read_csv("C:/Codes/test.csv")
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index(['Date'])
df2 = pd.DataFrame()
df2 = df1.Gen.resample('H').ffill()
I get this error: ValueError: cannot reindex a non-unique index with a method or limit. Please advise.
My test.csv is a simple file with two columns containing these 5 records
Date|Gen
----|----
5/1/2017|Ggulf
5/2/2017|Ggulf
5/1/2017|Nelson
5/3/2017|Ggulf
5/4/2017|Nelson
An index needs unique values for this kind of reindexing. Your first and third records have the same date '5/1/2017', so once you set the Date column as the index it is not unique and resample/ffill cannot reindex it.
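One way forward, assuming you only want a single Gen value per date (here the first one seen), is to collapse the duplicate dates before resampling; a sketch:
# collapse duplicate dates so the index is unique, then upsample
df2 = df1.groupby('Date')['Gen'].first().resample('H').ffill()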

renaming columns after group by and sum in pandas dataframe

This is my group by command:
pdf_chart_data1 = pdf_chart_data.groupby('sell').value.agg(['sum']).rename(
    columns={'sum': 'valuesum', 'sell': 'selltime'}
)
I am able to change the column name for value but not for 'sell'.
Please help to resolve this issue.
You cannot rename it because it is the index. You can add as_index=False to return a DataFrame, or add reset_index:
pdf_chart_data1 = (pdf_chart_data.groupby('sell', as_index=False)['value'].sum()
                                 .rename(columns={'value': 'valuesum', 'sell': 'selltime'}))
Or:
pdf_chart_data1 = (pdf_chart_data.groupby('sell')['value'].sum()
                                 .reset_index()
                                 .rename(columns={'value': 'valuesum', 'sell': 'selltime'}))
# count rows per 'col1' value, then give the count column a new name
df = df.groupby('col1')['col1'].count()
df1 = df.to_frame().rename(columns={'col1': 'new_name'}).reset_index()
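Another option, assuming pandas 0.25 or later, is named aggregation, which sets the output column name in the same step so only the group key still needs renaming:
pdf_chart_data1 = (pdf_chart_data.groupby('sell', as_index=False)
                                 .agg(valuesum=('value', 'sum'))
                                 .rename(columns={'sell': 'selltime'}))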
If you join two groupby results on the same index, where one uses nunique (the number of unique items) and the other uses unique (the list of unique items), you get two columns called Sport. Using as_index=False I was able to rename the second Sport column with rename, then concat the two results together, sort descending on SportCount, and display the top five counts.
grouped = df.groupby('NOC', as_index=False)
Nsport = grouped['Sport'].nunique()\
                         .rename(columns={'Sport': 'SportCount'})
Nsport = Nsport.set_index('NOC')

country_grouped = df.groupby('NOC')
Nsport2 = country_grouped['Sport'].unique()

df2 = pd.concat([Nsport, Nsport2], join='inner', axis=1).reindex(Nsport.index)
df2 = df2.sort_values(by=["SportCount"], ascending=False)
print(df2.columns)
for key, item in df2.head(5).iterrows():
    print(key, item)