Change column format from 'factor' to 'date' - ggplot2

I would like to change the format of a column called "SampleStart" in my dataframe "xray50g".
Checking the data in this column shows it is currently in the "factor" format -
> lapply(xray50g,class)
$SampleStart
[1] "factor"
I would like to change the data format to "Date" in the form "%d/%m/%Y"
Can anyone help with this?
Thanks.

The two-digit year comes from chron's "y" format code, which abbreviates the year; using "year" instead keeps all four digits. Note the result is a chron "dates" object rather than base R's "Date" class, but it is no longer a factor.
library(chron)
# example values in the same "%Y-%m-%d" form as the SampleStart column
x <- as.factor(c("2016-10-26", "2016-10-26", "2016-10-26", "2016-10-26"))
# factor -> Date, then reprint as day/month/year text
z <- format(as.Date(x), "%d/%m/%Y")
date <- chron(z, format = c(dates = "d/m/year"))


Dataframe String Separation

How do I extract year, month, and day into separate columns instead of one column of (year, month, day) tuples? Please refer to the photo.
def temp(i):
    i = str(i)
    year = i[0:4]
    moth = i[4:6]
    day = i[6:8]
    return year, moth, day

profile_drop["year","moth","day"] = profile_drop["became_member_on"].apply(temp)
Although it isn't directly your question, the easiest way to extract the date is to convert it to a datetime and then use pandas' built-in operations:
profile_drop["became_member_on_date"] = pd.to_datetime(profile_drop["became_member_on"], format='%Y%m%d')
profile_drop['year'] = profile_drop["became_member_on_date"].dt.year
profile_drop['month'] = profile_drop["became_member_on_date"].dt.month
profile_drop['day'] = profile_drop["became_member_on_date"].dt.day
In this snippet I first convert the string to a full datetime with pd.to_datetime (explicitly passing the format to parse), and then pull out the year/month/day by accessing .dt.year (and so on) on the datetime column.
This also avoids .apply, which is best not used unless you have to.
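A minimal, self-contained sketch of that approach; the profile_drop frame and its values here are made up, in the same YYYYMMDD form as the question:
import pandas as pd

# hypothetical data in the same YYYYMMDD form as the question
profile_drop = pd.DataFrame({"became_member_on": ["20170415", "20180726"]})

# parse once, then pull the parts out with the .dt accessor
profile_drop["became_member_on_date"] = pd.to_datetime(
    profile_drop["became_member_on"], format="%Y%m%d")
profile_drop["year"] = profile_drop["became_member_on_date"].dt.year
profile_drop["month"] = profile_drop["became_member_on_date"].dt.month
profile_drop["day"] = profile_drop["became_member_on_date"].dt.day

print(profile_drop[["year", "month", "day"]].values.tolist())
# [[2017, 4, 15], [2018, 7, 26]]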
A classic XY Question.

Pandas Replace_ column values

Hello,
I am analyzing the following dataset.
The column ['program_number'] is an object, but I want to change it to an integer column.
I have tried replacing some values but it doesn't work.
As you can see, some values such as 6 appear twice, as '6 ' and as 6.
How can I resolve this? Many thanks.
UPDATE
Didn't see 1X and 3X at first.
If you need those numbers and just want to remove the X then:
df["Program"] = df["Program"].str.strip(" X").astype(int)
If there is data in the column which isn't numeric or which shouldn't be converted, you can use pd.to_numeric with errors='coerce'. Cells that can't be converted become NaN. Be aware that this results in a floating-point column, since NaN is a float.
df["Program"] = pd.to_numeric(df["Program"], errors="coerce")
Old answer:
You want to use str.strip() here, rather than replace.
Try this:
df1['program_number'] = df1['program_number'].str.strip().astype(int)
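A quick, self-contained sketch of both variants on a made-up column (the values mirror the '1X', '3X' and '6 ' cases mentioned above):
import pandas as pd

# hypothetical column mixing plain numbers, a stray space, and an X suffix
df = pd.DataFrame({"Program": ["1X", "3X", "6 ", "6", "12"]})

# variant 1: strip spaces and the trailing X, then cast to int
df["Program_int"] = df["Program"].str.strip(" X").astype(int)
print(df["Program_int"].tolist())   # [1, 3, 6, 6, 12]

# variant 2: coerce anything that still can't be parsed to NaN
# (a column containing NaN is promoted to float)
df["Program_num"] = pd.to_numeric(df["Program"].str.strip(" X"), errors="coerce")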

pandas reading csv date as a string that's a 5 digit number

I have a date column in a .csv with format YYYY-MM-DD. Pandas reads it in as a string, but instead of the format shown in the csv it comes through as a 5-digit number, encoded as a string.
I've tried:
pd.to_datetime(df['alert_date'], unit = 's')
pd.to_datetime(df['alert_date'], unit = 'D')
I've also tried explicitly reading it in as a string and letting the date parser take over. See below:
dtype_dict = {'alert_date': 'str', 'lossdate1': 'str', 'lossdate2': 'str',
              'lossdate3': 'str', 'lossdate4': 'str', 'lossdate5': 'str',
              'effdate': 'str'}
parse_dates = ['lossdate1', 'lossdate2', 'lossdate3',
               'lossdate4', 'lossdate5', 'effdate']
df = pd.read_csv("Agent Alerts Earned and Incurred with Loss Dates as of Q3 2021.csv",
                 encoding='latin1', dtype=dtype_dict, parse_dates=parse_dates)
I'm not sure what else to try or what is wrong with it to begin with.
Here is an example of what the data looks like.
alertflag,alert_type,alert_date,effdate,cal_year,totalep,eufactor,product,NonCatincrd1,Catincrd1,lossdate1,NonCatcvrcnt1,Catcvrcnt1,NonCatincrd2,Catincrd2,lossdate2,NonCatcvrcnt2,Catcvrcnt2,NonCatincrd3,Catincrd3,lossdate3,NonCatcvrcnt3,Catcvrcnt3,NonCatincrd4,Catincrd4,lossdate4,NonCatcvrcnt4,Catcvrcnt4,NonCatincrd5,Catincrd5,lossdate5,NonCatcvrcnt5,Catcvrcnt5,incurred
1,CANCEL NOTICE,2019-06-06,2018-12-17,2019,91.00,0.96,444,,,,,,,,,,,,,,,,,,,,,,,,,,
The alert_date comes through on that record as 21706.
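No guarantee this is the cause, but 21706 happens to be exactly the number of days from 1960-01-01 (the date epoch used by SAS, among other tools) to 2019-06-06, so the column may have been written out as a day offset rather than as YYYY-MM-DD text. A quick sketch of that check, assuming that epoch:
import pandas as pd

# hypothetical check: treat the 5-digit value as days since 1960-01-01
print(pd.to_datetime(21706, unit="D", origin="1960-01-01"))
# 2019-06-06 00:00:00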

Using Pyspark to convert column from string to timestamp

I have a pyspark dataframe with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. A sample of the data is below, where the time is captured as HHmm with "A" or "P" representing am or pm. Also, the data has errors where some entries exceed 24 hours.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use pyspark to remove the "A" and "P" from both columns and then convert the data (e.g., 0800, 1930 etc.) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column, creating a new column "timestamp" to store it (see code below). However, I can't seem to get it to work. Any help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()
You can use the following
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn('timestamp',
    func.split(
        func.to_timestamp('Violation_Time', 'HHmm'),  # parse the HHmm string into a timestamp (the missing date part defaults to 1970-01-01)
        ' '
    ).getItem(1)  # split the timestamp's string form on the space and keep the second element, i.e. the time-of-day part
)
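A different sketch that follows the question's wording more literally: drop the trailing "A"/"P" first with regexp_replace and then parse what is left. This is an alternative to the answer above, not part of it; the output column name Violation_Time_ts is made up, and with Spark's default (non-ANSI) settings out-of-range entries such as 2540P come back null rather than raising:
from pyspark.sql import functions as F

# strip the trailing am/pm letter, e.g. '0830A' -> '0830'
cleaned = F.regexp_replace(F.col('Violation_Time'), '[AP]$', '')

# parse the remaining HHmm digits into a timestamp (the date part defaults to 1970-01-01)
sparkdf3 = sparkdf3.withColumn('Violation_Time_ts', F.to_timestamp(cleaned, 'HHmm'))

sparkdf3.select('Violation_Time', 'Violation_Time_ts').show()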

How to change pandas Period Index to lower case? [duplicate]

This question already has an answer here: Convert pandas._period.Period type Column names to Lowercase (1 answer). Closed 4 years ago.
I have a dataframe where I used
df.groupby(pd.PeriodIndex(df.columns, freq='Q'), axis=1).mean()
to combine the monthly columns into quarters by taking the mean.
However, the resulting dataframe has columns like the ones below, and I could not change the upper-case 'Q' into a lower-case 'q'.
PeriodIndex(['2000Q1', '2000Q2', '2000Q3', '2000Q4', '2001Q1', '2001Q2',
'2001Q3', '2001Q4', '2002Q1', '2002Q2', '2002Q3', '2002Q4',
'2003Q1', '2003Q2', '2003Q3', '2003Q4', '2004Q1', '2004Q2',
'2004Q3', '2004Q4', '2005Q1', '2005Q2', '2005Q3', '2005Q4',
'2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2',
'2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3', '2008Q4',
'2009Q1', '2009Q2', '2009Q3', '2009Q4', '2010Q1', '2010Q2',
'2010Q3', '2010Q4', '2011Q1', '2011Q2', '2011Q3', '2011Q4',
'2012Q1', '2012Q2', '2012Q3', '2012Q4', '2013Q1', '2013Q2',
'2013Q3', '2013Q4', '2014Q1', '2014Q2', '2014Q3', '2014Q4',
'2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2',
'2016Q3'],
dtype='period[Q-DEC]', freq='Q-DEC')
I have tried using df.columns = [x.lower() for x in df.columns] and it gives the error: 'Period' object has no attribute 'lower'.
This looks like a duplicate of the issue posted here: Convert pandas._period.Period type Column names to Lowercase
Basically, you'll want to reformat the Period output to have a lowercase q like so:
df.columns = df.columns.strftime('%Yq%q')
Alternatively, if you want to build the new labels yourself, you can do something like:
# the grouped result's columns are the PeriodIndex you pasted in your question
grouped = df.groupby(pd.PeriodIndex(df.columns, freq='Q'), axis=1).mean()
# format each Period in that column index accordingly
grouped.columns = [p.strftime('%Yq%q') for p in grouped.columns]
The %Y denotes the year, the middle q is the literal lowercase 'q' you want, and %q is the quarter number.
Here is the documentation for a Period's strftime() method, which returns the formatted time string. At the bottom they have some nice examples!
Looking at the methods listed in the pandas documentation, lower() isn't an available method on the Period object, which is why you're getting this error (a PeriodIndex is just an array of Periods, each of which denotes a span of time).
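A tiny, self-contained sketch of the first approach on a made-up quarterly index (the values just mirror the start of the index in the question):
import pandas as pd

# hypothetical quarterly column labels, like the ones the groupby produces
cols = pd.PeriodIndex(['2000Q1', '2000Q2', '2000Q3', '2000Q4'], freq='Q')

# strftime on the whole index returns plain strings with a lowercase 'q'
print(cols.strftime('%Yq%q'))
# Index(['2000q1', '2000q2', '2000q3', '2000q4'], dtype='object')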