pandas reading csv date as a string that's a 5 digit number - pandas

I have a date column in a .csv in the format YYYY-MM-DD. Pandas reads it as a string, but instead of the format shown in the CSV, the value comes through as a 5-digit number stored as a string.
I've tried:
pd.to_datetime(df['alert_date'], unit = 's')
pd.to_datetime(df['alert_date'], unit = 'D')
I've also tried explicitly reading the columns as strings and letting the date parser take over. See below:
dtype_dict = {'alert_date': 'str', 'lossdate1': 'str', 'lossdate2': 'str',
              'lossdate3': 'str', 'lossdate4': 'str', 'lossdate5': 'str',
              'effdate': 'str'}
parse_dates = ['lossdate1', 'lossdate2', 'lossdate3',
               'lossdate4', 'lossdate5', 'effdate']
df = pd.read_csv("Agent Alerts Earned and Incurred with Loss Dates as of Q3 2021.csv",
                 encoding='latin1', dtype=dtype_dict, parse_dates=parse_dates)
I'm not sure what else to try or what is wrong with it to begin with.
Here is an example of what the data looks like.
alertflag,alert_type,alert_date,effdate,cal_year,totalep,eufactor,product,NonCatincrd1,Catincrd1,lossdate1,NonCatcvrcnt1,Catcvrcnt1,NonCatincrd2,Catincrd2,lossdate2,NonCatcvrcnt2,Catcvrcnt2,NonCatincrd3,Catincrd3,lossdate3,NonCatcvrcnt3,Catcvrcnt3,NonCatincrd4,Catincrd4,lossdate4,NonCatcvrcnt4,Catcvrcnt4,NonCatincrd5,Catincrd5,lossdate5,NonCatcvrcnt5,Catcvrcnt5,incurred
1,CANCEL NOTICE,2019-06-06,2018-12-17,2019,91.00,0.96,444,,,,,,,,,,,,,,,,,,,,,,,,,,
The alert_date comes through on that record as 21706.
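One observation that may explain the number: 21706 is exactly the number of days between 1960-01-01 and 2019-06-06. 1960-01-01 is the epoch SAS uses for dates, so a plausible guess (an assumption, not something the question confirms) is that the file was exported from SAS and holds raw day counts. Under that assumption, a minimal sketch of the conversion uses the origin argument of pd.to_datetime. The one-row frame below is made up for illustration.

```python
import pandas as pd

# Illustrative frame; the real data would come from pd.read_csv(..., dtype=str).
df = pd.DataFrame({"alert_date": ["21706"]})

# Assumption: the values are day counts since 1960-01-01 (the SAS date epoch).
days = pd.to_numeric(df["alert_date"], errors="coerce")
df["alert_date"] = pd.to_datetime(days, unit="D", origin="1960-01-01")

print(df["alert_date"].iloc[0].date())  # 2019-06-06
```

If the guess is wrong, the arithmetic match would be a coincidence, so it is worth checking a few more rows against known dates before relying on it.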

Related

Dataframe String Separation

How do I get year, month and day as three separate columns instead of one column holding (year, month, day) tuples?
Please refer to the photo
def temp(i):
    i = str(i)
    year = i[0:4]
    moth = i[4:6]
    day = i[6:8]
    return year,moth,day
profile_drop["year","moth","day"] = profile_drop["became_member_on"].apply(temp)
Although it isn't directly your question, the easiest way to extract the date parts is to convert the column to datetime and then use pandas' built-in operations:
profile_drop["became_member_on_date"] = pd.to_datetime(profile_drop["became_member_on"], format='%Y%m%d')
profile_drop['year'] = profile_drop["became_member_on_date"].dt.year
profile_drop['month'] = profile_drop["became_member_on_date"].dt.month
profile_drop['day'] = profile_drop["became_member_on_date"].dt.day
In this snippet I first convert the string to a full datetime using pd.to_datetime (stating the format explicitly), then extract year, month and day through the .dt accessor of the date column.
It also avoids .apply, which is best avoided unless you really need it.
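For the literal question: apply(temp) returns one tuple per row, so the result is a single Series of tuples rather than three columns. A hedged sketch of one way to get three separate columns without apply is vectorised string slicing. The sample values below are invented, and the columns are cast to int as an illustrative choice.

```python
import pandas as pd

# Made-up sample standing in for profile_drop.
profile_drop = pd.DataFrame({"became_member_on": [20170212, 20180426]})

# Slice the 8-digit string YYYYMMDD column-wise instead of row by row.
digits = profile_drop["became_member_on"].astype(str)
profile_drop["year"] = digits.str[0:4].astype(int)
profile_drop["month"] = digits.str[4:6].astype(int)
profile_drop["day"] = digits.str[6:8].astype(int)

print(profile_drop)
```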
A classic XY Question.

combining CSV files from Covid-data

I want to combine the CSV files from the Johns Hopkins Covid Data (e.g. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/05-10-2020.csv & https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv).
I already managed to load the files into a DataFrame as well as sanitizing the header (_ vs. / in some names). Now I want to pick one column (e.g. Confirmed), rename it to the day of the file and then combine those CSV files to get a progress over time.
This merge needs to be done by state_province. In both frames, the key may not be present. How can I do this? I experimented with rightjoin and outerjoin, but didn't have any success. Can someone point me the right way please?
I initially didn't want to share the code that I have so far because I didn't want to guide to a specific solution - but here it is. It is copied together from several Jupyter cells.
using Dates
start = Dates.Date(2020,1,22) #begin of recording
now = Dates.Date(Dates.now()) - Dates.Day(1) #yesterday, the latest complete daily report
date_range = collect(start:Dates.Day(1):now) #create a date range with 1 element per day
prefix = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
suffix = ".csv"
function create_url(date)
    return prefix * Dates.format(date, "mm-dd-YYYY") * suffix
end
function cleanup_column_names(name)
    if name == "Country/Region" || name == "Country_Region"
        return "country"
    elseif name == "Province/State" || name == "Province_State"
        return "state"
    else
        return name
    end
end
using CSV
using HTTP
using DataFrames
selected_data = "Confirmed"
date = date_range[1]
data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
DataFrames.rename!(cleanup_column_names, data)
DataFrames.select!(data,["state", "country", selected_data])
DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
Regards
Tobias
I am relatively new to Julia, so take my answer with a bit of scepticism:
First, we wrap the DataFrame creation into a function:
function prepare_date_df(date)
    data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
    DataFrames.rename!(cleanup_column_names, data)
    DataFrames.select!(data,["state", "country", selected_data])
    DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
    return data
end
Let's create our first Dataframe:
df = prepare_date_df(date_range[1])
Now, let's iterate over all the other dates, create a dataframe for each date and merge this with our first dataframe:
for date in date_range[2:end]
    df_new = prepare_date_df(date)
    df = outerjoin(df, df_new, on = [:state, :country])
end
This works fine for the first two months, but as the DataFrames grow it suddenly gets very slow (and may even hang). So I would be very interested in a more performant answer!
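A possible reason for the slowdown, offered as a guess rather than a verified diagnosis: if a daily file contains several rows for the same (state, country) pair, each outer join multiplies the matching rows, so the frame grows very quickly. The sketch below shows the idea in Python with pandas rather than Julia, on invented numbers: aggregate each day to one row per key first, then combine all days in a single concat instead of repeated joins. The helper name prepare_day and the sample frames are hypothetical.

```python
import pandas as pd

def prepare_day(frame, label):
    """Collapse one daily report to a single Confirmed value per (state, country)."""
    # Missing states are filled with "" so they survive the groupby as a key.
    keyed = frame.fillna({"state": ""})
    totals = keyed.groupby(["state", "country"])["Confirmed"].sum()
    return totals.rename(label)

# Invented sample data; day2 repeats one key and adds a key absent from day1.
day1 = pd.DataFrame({"state": ["Hubei", None],
                     "country": ["China", "Japan"],
                     "Confirmed": [444, 1]})
day2 = pd.DataFrame({"state": ["Hubei", "Hubei", None, None],
                     "country": ["China", "China", "Japan", "Germany"],
                     "Confirmed": [300, 249, 2, 1]})

# One concat aligns all days on the (state, country) index; absent keys become NaN.
combined = pd.concat([prepare_day(day1, "2020-01-22"),
                      prepare_day(day2, "2020-01-23")], axis=1)
print(combined)
```

Collecting the per-day results in a list and concatenating once also avoids rebuilding a growing frame on every loop iteration.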

The as.Date() Function does not work, my characters remain characters

I have an Excel file with a column containing the date and hour of each measurement in the format 01.01.2018 01:00.
The first 3 rows contain characters; the whole column is formatted as "Number" (in Excel/LibreOffice).
If I try to read the xlsx file with readxl:
NO2_2018 <- read_excel("NO2_2018.xlsx", sheet = "Seite 1",
range = "A2:AU8762", col_types = c("date",
"numeric", ....)
I get NA Values (format is POSIXct) and the warning
Expecting date in .... / .....: got '03.01.2018 02:00'
Then I thought I would read it as text and convert it with the as.Date() function:
as.Date(NO2_2018$Zeitpunkt,format = "%d.%m.%Y% H:%M", tz="CEST")
However, it does not change the class
class(NO2_2018$Zeitpunkt)
[1] "character"
Have you tried replacing the dots in the date and then using as.Date on the transformed variable? Note that "." is a regex wildcard in gsub, so it has to be matched literally:
gsub(".", "/", date, fixed = TRUE)

Change column format from 'factor' to 'date'

I would like to change the format of a column called "SampleStart" in my dataframe "xray50g".
Checking the data in this column shows it is currently in the "factor" format -
> lapply(xray50g,class)
$SampleStart
[1] "factor"
I would like to change the data format to "Date" in the form "%d/%m/%Y"
Can anyone help with this?
Thanks.
For some reason this shortens year to last two digits, but it's a start, and the class is at least correct.
library(chron)
x <- as.factor(c("2016-10-26", "2016-10-26", "2016-10-26", "2016-10-26"))
z <-format(as.Date(x), "%d/%m/%Y")
date <-chron(z, format = c(dates = "d/m/Y"))

Using Pyspark to convert column from string to timestamp

I have a pyspark dataframe with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. A sample of the data is below; the time is captured as HHmm with "A" or "P" representing am or pm. The data also has errors: some entries exceed 24 hours.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use pyspark to remove the "A" and "P" from both columns and subsequently convert the data (e.g., 0800, 1930 etc.) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column and create a new column "timestamp" to store the result (see code below). However, I can't get it to work. Any help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()
You can use the following
from pyspark.sql import functions as func  # alias assumed by this answer
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn(
    'timestamp',
    func.split(
        func.to_timestamp('Violation_Time', 'HHmm'),  # convert to a timestamp (date and time)
        ' '
    ).getItem(1)  # split on the space and take the second item, i.e. the time part
)
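Because the sample values carry a trailing A or P and some exceed 24 hours, the cleaning step may be easier to reason about in isolation. The following is a plain-Python sketch of that logic, not PySpark code. The function name parse_hhmm is hypothetical, and returning None for out-of-range values is an assumption about how bad entries should be handled.

```python
from datetime import time
from typing import Optional

def parse_hhmm(raw: str) -> Optional[time]:
    """Strip a trailing A/P marker and parse the rest as 24-hour HHmm."""
    digits = raw.strip().rstrip("AP")
    if len(digits) != 4 or not digits.isdigit():
        return None
    hour, minute = int(digits[:2]), int(digits[2:])
    if hour > 23 or minute > 59:
        return None  # e.g. "2540P" from the sample data
    return time(hour, minute)

for value in ["0830A", "1450P", "2540P"]:
    print(value, parse_hhmm(value))
```

In Spark, the same rule could be applied by removing the suffix with regexp_replace before calling to_timestamp, or by wrapping a function like this in a UDF. Either route is a suggestion, not something the answer above does.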