drop_duplicates isn't working on my imported CSV file - pandas

Looking for some help on this one. I don't know why, but drop_duplicates is not working; I even tried a loop with a lambda. Still, nothing I do will remove the multiple duplicates in the output.
# Import files for use in the program:
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import matplotlib.dates as mdates
import numpy as np
import csv
# Import CSV files into a data frame
Crash_Data_df = pd.read_csv("crash_data.csv",encoding='UTF-8')
#split date column
Crash_Data_df[["Day", "Month", "DayNum", "Time", "Zone", "Year"]] = Crash_Data_df["Date"].str.split(" ", n=6, expand=True)
#define max and min dates
d_min = Crash_Data_df["Date"].min()
d_max = Crash_Data_df["Date"].max()
#split name column
Crash_Data_df[["A", "B"]] = Crash_Data_df["Name"].str.split("_|2018100", n=2, expand=True)
#Drop time zone
Crash_Data_df.drop(['Zone'], axis = 1, inplace = True)
Crash_Data_df.reset_index(drop=True)
# keep only rows where the unnamed index column is 0
Crash_Data_df = Crash_Data_df.loc[Crash_Data_df['Unnamed: 0'] == 0, :]
#del columns
del Crash_Data_df['Unnamed: 0']
del Crash_Data_df['Name']
del Crash_Data_df['A']
Crash_Data_df = Crash_Data_df.loc[Crash_Data_df['B'] != 9954815, :]
Crash_Data_df = Crash_Data_df.dropna(how='any')
Crash_Data_df.drop_duplicates(subset=['Time'], keep=False)
Crash_Data_df.sort_values(by=['B'])
Crash_Data_df.reset_index(drop=True)
Crash_Data_df = Crash_Data_df.rename(columns=
{'B':'ID','Date':'DATE','Direction':'DIRECTION','Road':'ROAD',
'Location':'LOCATION','Event':'EVENT','Day':'DAY',
'Month':'MONTH','DayNum':'DAYNUM','Time':'TIME','Year':'YEAR'})
Crash_Data_df.set_index('ID', inplace=True,drop=True)
Crash_Data_df.to_csv("crash_data_check.csv", index=False, header=True)
Crash_Data_df.drop_duplicates()
Crash_Data_df.groupby("ID").filter(lambda x: len(x) > 1)
Crash_Data_df.head()
The ID duplicates are not dropped; I tried different columns with no luck. The output looks like:
DATE DIRECTION ROAD LOCATION EVENT DAY MONTH DAYNUM TIME YEAR
ID
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
9954815 Sun Oct 07 03:35:22 CDT 2018 WB T.H.62 # T.H.100 NB CRASH Sun Oct 07 03:35:22 2018
DATE object
DIRECTION object
ROAD object
LOCATION object
EVENT object
DAY object
MONTH object
DAYNUM object
TIME object
YEAR object
dtype: object

Because .drop_duplicates returns a copy of the DataFrame, you want to either assign the result back to the df variable or do the drop with inplace=True.
Try:
Crash_Data_df = Crash_Data_df.drop_duplicates(subset=['Time'], keep=False)
Or
Crash_Data_df.drop_duplicates(subset=['Time'], keep=False, inplace=True)
Both should work.
BTW, the same applies to the other drop_duplicates call (and to your sort_values and reset_index calls, which are also never assigned back).
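Note also that keep=False drops every row in a duplicate group; if you want to keep one copy of each duplicated Time, keep='first' (the default) does that. A minimal sketch of both points, on made-up data shaped like your output:
import pandas as pd
df = pd.DataFrame({'ID': [9954815, 9954815, 9954815, 1234567],
                   'TIME': ['03:35:22', '03:35:22', '03:35:22', '04:10:00']})
# Assign the result back; keep='first' keeps one row per duplicate group,
# whereas keep=False would have dropped all three 9954815 rows.
df = df.drop_duplicates(subset=['TIME'], keep='first')
print(df)  # two rows remain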

Related

In R, how to modify a 'User-defined function' for date parsing

In R, how can I modify the user-defined function my_date_parse below?
In the function, I want to remove the strings GMT+9 and GMT-8 before parsing, but my attempt failed.
library(lubridate)
library(tidyverse)
# raw data
df <- data.frame(
  date = c("1 Feb 2022 11:48:42 pm GMT+9", "1 Feb 2022 9:41:56 am GMT-8",
           "Feb 1, 2022 6:19:26 a.m. PST", "Feb 1, 2022 1:22:37 a.m. PST",
           "Feb 12, 2022 7:54:32 a.m. PST", "31.01.2022 23:11:54 UTC",
           "Feb 1, 2022 12:00:47 AM PST", "Feb 1, 2022 12:44:28 PM PST",
           "Feb 1, 2022 12:20:22 AM PST"),
  country = c("AU","AU","CA","CA","CA","DE","US","US","US"))
# User-defined function
my_date_parse <- function(country, date){
  # date <- str_replace_all(date, '([GMT+9])', "")
  # date <- str_replace_all(date, '([GMT-8])', "")
  case_when(
    country == 'US' ~ mdy_hms(date),
    country == 'DE' ~ dmy_hms(date),
    country == 'CA' ~ mdy_hms(date),
    country == 'AU' ~ dmy_hms(date)
  )
}
# parse the date with User-defined function
df %>% mutate(date_parsed=my_date_parse(country,date))
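The commented-out lines hint at why the attempt failed: in a regex, '([GMT+9])' is a character class matching any single one of the characters G, M, T, + or 9, not the literal string GMT+9 (in R, wrapping the pattern in stringr::fixed() avoids this). A sketch of the same strip-then-parse idea in pandas, with column names mirroring the R example:
import pandas as pd
df = pd.DataFrame({'date': ['1 Feb 2022 11:48:42 pm GMT+9',
                            '1 Feb 2022 9:41:56 am GMT-8'],
                   'country': ['AU', 'AU']})
# Remove the literal GMT tokens (regex=False sidesteps the character-class trap),
# then parse the day-first Australian format.
cleaned = (df['date'].str.replace('GMT+9', '', regex=False)
                     .str.replace('GMT-8', '', regex=False)
                     .str.strip())
df['date_parsed'] = pd.to_datetime(cleaned, format='%d %b %Y %I:%M:%S %p')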

Pandas changing the date format of a long column

I'm polishing my code. At one point I want to convert a date given as a string into another string that holds the same date but shows it in a different format.
After each date there is a code, always the same code for a given date.
Here is my df:
import pandas as pd
data = ['2012-06-29 A','2012-08-29 B','2012-10-29 X','2012-10-15 A']*50000
data.sort()
df = pd.DataFrame({'A':data})
A
2012-06-29 A
2012-06-29 A
2012-06-29 A
2012-06-29 A
2012-06-29 A
And here is how I'm doing it now:
df['A'] = df['A'].apply(lambda x: pd.to_datetime(x.split(' ')[0]).strftime('%d %b %Y ') + x.split(' ')[1])
A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
29 Jun 2012 A
It works fine, but it seems to create a bottleneck (not really; it is only one part of the data preparation).
Can it be done better/faster?
In total I have about 15 dates like this per df (and many dfs). I wonder whether creating a dict or a temporary support df from the unique dates and applying it somehow (how?) through the lambda would avoid the repeated conversions.
Additional info (maybe useful): column A later becomes part of a MultiIndex.
IIUC, my first attempt would be this method; no need for apply on the DataFrame:
(pd.to_datetime(df['A'].str.split().str[0]).dt.strftime('%d %b %Y') + ' '
+ df['A'].str.split().str[1])
Second attempt using list comprehension instead of .str accessor:
(pd.to_datetime(pd.Series([i.split()[0] for i in df.A])).dt.strftime('%d %b %Y')
+ ' ' + pd.Series([i.split()[1] for i in df.A]))
Third attempt:
ls = [i.split() for i in df.A]
i,j = zip(*ls)
pd.Series(pd.to_datetime(i).strftime('%d %b %Y')) + ' ' + pd.Series(j)
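The dict idea from the question works as well: convert each unique raw string once, then map the result over the whole column. A sketch, assuming the 'date code' layout shown above:
# Build the formatted string once per unique value, then map it over the column.
mapping = {v: pd.to_datetime(v.split(' ')[0]).strftime('%d %b %Y ') + v.split(' ')[1]
           for v in df['A'].unique()}
df['A'] = df['A'].map(mapping)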

bokeh select widget datetime inconsistency (pandas)

I have a dataframe full of dates and transactions:
ENTRYDATE | TRANSACTIONS
2017-01-02 20
2017-01-16 51
..
2018-02-01 12
I have a select widget where the user can view the data by ['Daily','Weekly','Monthly','Annually'].
When Daily or Annually is chosen, the plot accurately updates and summarizes the data into daily or annual transactions. However, when Weekly or Monthly is selected, the plot bundles the transactions from Jan 2018 and Feb 2018 into the 2017 Jan and Feb data, overstating the 2017 counts. Why is this happening? How can I fix it?
Here is the relevant piece of my code:
import pandas as pd
from bokeh.models import ColumnDataSource, DatetimeTickFormatter, NumeralTickFormatter, HoverTool, Select
from bokeh.plotting import figure
from bokeh.layouts import row
from bokeh.io import curdoc
df2=df[['ENTRYDATE']]
df2['ENTRYDATE']=pd.to_datetime(df2['ENTRYDATE'],infer_datetime_format=True)
#set data sources
dfdate=(df2.groupby([df2['ENTRYDATE'].dt.date]).size().reset_index(name='Transactions'))
dfweek=(df2.groupby([df2['ENTRYDATE'].dt.week]).size().reset_index(name='Transactions'))
dfmonth=(df2.groupby([df2['ENTRYDATE'].dt.month]).size().reset_index(name='Transactions'))
dfyear=(df2.groupby([df2['ENTRYDATE'].dt.year]).size().reset_index(name='Transactions'))
source1=ColumnDataSource(data=dfdate)
source2=ColumnDataSource(data=dfweek)
p=figure(plot_width=800,plot_height=500, y_axis_label="Count")
p.line(x="ENTRYDATE",y="Transactions",color='blue', source=source1)
p.xaxis.formatter=DatetimeTickFormatter()
#update function
def update_plot(attr, old, new):
    if new == 'Daily':
        source1.data = {"ENTRYDATE": dfdate["ENTRYDATE"], "Transactions": dfdate["Transactions"]}
    elif new == 'Weekly':
        source1.data = source2.data
    elif new == 'Monthly':
        source1.data = {"ENTRYDATE": dfmonth["ENTRYDATE"], "Transactions": dfmonth["Transactions"]}
    elif new == 'Annually':
        source1.data = {"ENTRYDATE": dfyear["ENTRYDATE"], "Transactions": dfyear["Transactions"]}
#selecttool
select = Select(title='Choose Your Time Interval:', options=['Daily','Weekly','Monthly','Annually'], value='Daily')
select.on_change('value',update_plot)
layout=row(select, p)
curdoc().add_root(layout)
One idea is to prefix the year to the week or month number and then sort it ascending.
df['YearWk']=df['ENTRYDATE'].dt.strftime('%Y.%W')
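A sketch of how that key could feed the weekly source so weeks from different years stay separate (names follow the df2 in the question):
# Group on a year-qualified week key instead of the bare week number.
df2['YearWk'] = df2['ENTRYDATE'].dt.strftime('%Y.%W')
dfweek = (df2.groupby('YearWk').size()
             .reset_index(name='Transactions')
             .sort_values('YearWk'))
# The same idea fixes the monthly view, e.g. df2['ENTRYDATE'].dt.strftime('%Y.%m').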

Pandas, Bokeh, or using any plotting library for shifting the x-axis for seasonal data (months 7 -> 12 -> 6 or July 01 - June 30)

I want to display seasonal snow data for the seasonal year from July 01 - June 30.
df = pd.DataFrame({'date1': ['1954-03-20','1955-02-23','1956-01-01','1956-11-21','1958-01-07'],
                   'date2': ['1954-03-25','1955-02-26','1956-02-11','1956-11-30','1958-01-17']},
                  index=['1954','1955','1956','1957','1958'])
It is an extension to my previous question Pandas: Visualizing Changes in Event Dates for Multiple Years using Bokeh or any other plotting library
Scott Boston, in his answer to my comment in that question, suggested using Range1d and modifying the answer in How can I accomplish `set_xlim` or `set_ylim` in Bokeh?. It works for a continuous range, but I couldn't get it to work with discontinuous ranges like [182:366], [1:181].
Adding x_range=Range1d(182, 366) shows me the first half of the seasonal year, but I can't get the second half of the seasonal year (1, 181).
df['date2'] = pd.to_datetime(df['date2'])
df['date1'] = pd.to_datetime(df['date1'])
df=df.assign(date2_DOY=df.date2.dt.dayofyear)
df=df.assign(date1_DOY=df.date1.dt.dayofyear)
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import FuncTickFormatter, FixedTicker, Range1d
p1 = figure(plot_width=1000, plot_height=300,x_range=Range1d(180, 366))
p1.circle(df.date1_DOY,df.index, color='red', legend='Date1')
p1.circle(df.date2_DOY,df.index, color='green', legend='Date2')
p1.xaxis[0].ticker=FixedTicker(ticks=[1,32,60,91,121,152,182,213,244,274,305,335,366])
p1.xaxis.formatter = FuncTickFormatter(code="""
var labels = {'1':'Jan',32:'Feb',60:'Mar',91:'Apr',121:'May',152:'Jun',182:'Jul',213:'Aug',244:'Sep',274:'Oct',305:'Nov',335:'Dec',366:'Jan'}
return labels[tick];
""")
show(p1)
#(Code from Scott's answer to my previous question.)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'date1': ['1954-03-20','1955-02-23','1956-01-01','1956-11-21','1958-01-07'],
                   'date2': ['1954-03-25','1955-02-26','1956-02-11','1956-11-30','1958-01-17']},
                  index=['1954','1955','1956','1957','1958'])
df['date2'] = pd.to_datetime(df['date2'])
df['date1'] = pd.to_datetime(df['date1'])
Adjusted the day-of-year mapping so the axis starts at Jun 1st:
df['date2_DOY_map'] = np.where(df['date2'].dt.dayofyear<151,df['date2'].dt.dayofyear-151+365,df['date2'].dt.dayofyear-151)
df['date1_DOY_map'] = np.where(df['date1'].dt.dayofyear<151,df['date1'].dt.dayofyear-151+365,df['date1'].dt.dayofyear-151)
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
Add the Range1d import from bokeh.models:
from bokeh.models import FuncTickFormatter, FixedTicker,Range1d
p1 = figure(plot_width=1000, plot_height=300,x_range=Range1d(1, 366))
p1.circle(df.date1_DOY_map,df.index, color='red', legend='Date1')
p1.circle(df.date2_DOY_map,df.index, color='green', legend='Date2')
Fixed the x-ticks and labels to match the Jun 1st start:
p1.xaxis[0].ticker=FixedTicker(ticks=[1,31,62,93,123,154,184,215,246,274,305,335,366])
p1.xaxis.formatter = FuncTickFormatter(code="""
var labels = {'1':'Jun',31:'Jul',62:'Aug',93:'Sep',123:'Oct',154:'Nov',184:'Dec',215:'Jan',246:'Feb',274:'Mar',305:'Apr',335:'May',366:'Jun'}
return labels[tick];
""")
show(p1)
EDIT (oops I started on the wrong month above)
Pretty easy to fix; we just need to modify the x-axis mapping of dates and redo the ticks and labels.
Use 181 instead of 151 because Jul 1st is day 182 of the year (so subtracting 181 maps it to 1), where Jun 1st was day 152.
df['date2_DOY_map'] = np.where(df['date2'].dt.dayofyear<181,df['date2'].dt.dayofyear-181+365,df['date2'].dt.dayofyear-181)
df['date1_DOY_map'] = np.where(df['date1'].dt.dayofyear<181,df['date1'].dt.dayofyear-181+365,df['date1'].dt.dayofyear-181)
p1.xaxis[0].ticker=FixedTicker(ticks=[1,32,63,93,124,154,185,216,244,275,305,336,366])
p1.xaxis.formatter = FuncTickFormatter(code="""
var labels = {'1':'Jul',32:'Aug',63:'Sep',93:'Oct',124:'Nov',154:'Dec',185:'Jan',216:'Feb',244:'Mar',275:'Apr',305:'May',336:'Jun',366:'Jul'}
return labels[tick];
""")

How to go from relative dates to absolute dates in DataFrame columns

I have a pandas DataFrame containing forward prices for future maturities, quoted on multiple different trading months ('trade date'). Trade dates are given in absolute terms ('January'). The maturities are given in relative terms ('M+1').
How can I convert the maturities into an absolute format, i.e. in trade date 'January' the maturity 'M+1' should say 'February'?
Here is example data:
import pandas as pd
import numpy as np
data_keys = ['trade date', 'm+1', 'm+2', 'm+3']
data = {'trade date': ['jan','feb','mar','apr'],
        'm+1': np.random.randn(4),
        'm+2': np.random.randn(4),
        'm+3': np.random.randn(4)}
df = pd.DataFrame(data)
df = df[data_keys]
Starting data:
trade date m+1 m+2 m+3
0 jan -0.446535 -1.012870 -0.839881
1 feb 0.013255 0.265500 1.130098
2 mar 0.406562 -1.122270 -1.851551
3 apr -0.890004 0.752648 0.778100
Result:
The result should have Feb, Mar, Apr, May, Jun, and Jul in the columns; NaN will appear in many places.
The starting DataFrame:
trade date m+1 m+2 m+3
0 jan -1.350746 0.948835 0.579352
1 feb 0.011813 2.020158 -1.221110
2 mar -0.183187 -0.303099 1.323092
3 apr 0.081105 0.662628 -0.703152
Solution:
1. Define a list of all possible absolute dates you will encounter, in chronological order. Do the same for relative dates.
2. Create a function to act on the groups coming from df.groupby. The function will rename the columns of each group to the appropriate absolute format.
3. Apply the function; pandas handles the clever concatenation of all groups.
Code:
abs_in_order = ['jan','feb','mar','apr','may','jun','jul','aug']
rel_in_order = ['m+0','m+1','m+2','m+3','m+4']
def rel2abs(group, abs_in_order, rel_in_order):
    abs_date = group['trade date'].unique()[0]
    l = len(rel_in_order)
    i = abs_in_order.index(abs_date)
    namesmap = dict(zip(rel_in_order, abs_in_order[i:i+l]))
    group.rename(columns=namesmap, inplace=True)
    return group
grouped = df.groupby(['trade date'])
df = grouped.apply(rel2abs, abs_in_order, rel_in_order)
Pandas may mess up the column order. Do this to get back to something in chronological order:
order = ['trade date'] + abs_in_order
cols = [e for e in order if e in df.columns]
df[cols]
Result:
trade date feb mar apr may jun jul
0 jan -1.350746 0.948835 0.579352 NaN NaN NaN
1 feb NaN 0.011813 2.020158 -1.221110 NaN NaN
2 mar NaN NaN -0.183187 -0.303099 1.323092 NaN
3 apr NaN NaN NaN 0.081105 0.662628 -0.703152
Your question doesn't contain enough information to answer it.
You say that the prices are quoted on dates given in absolute terms ('January'), but January is not a date; 2-Jan-2015 is.
What is your actual 'date', and what is its format (i.e. text, datetime.date, pd.Timestamp, etc.)? You can use type(date) to check, where date is an instance of whatever your quote date is.
The easiest solution is to get your trade dates into pd.Timestamps and then add an offset:
>>> trade_date = pd.Timestamp('2015-1-15')
>>> trade_date + pd.DateOffset(months=1)
Timestamp('2015-02-15 00:00:00')
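The same offset applies vectorised across a whole column, which is what each m+n column would need once the trade dates are real timestamps (a sketch with made-up dates):
>>> trade_dates = pd.Series(pd.to_datetime(['2015-01-15', '2015-02-15']))
>>> trade_dates + pd.DateOffset(months=1)  # absolute maturity dates for m+1
0   2015-02-15
1   2015-03-15
dtype: datetime64[ns]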