Pandas add a static date to a series - pandas

I don't know how to add a static date to a time series.
The series, as strings, looks like this:
time_s=pd.Series(['000329','000458','154259','232810'])
I convert it to a time series:
time_s=pd.to_datetime(time_s,format='%H%M%S')
But the date is contained in the name of the file:
date_file=datetime.datetime(year=year, month=month, day=day)
The simple "way" doesn't work:
date_file+time_s
I tried to create a series holding the static date and add the two together:
serie_date=[pd.to_datetime(date_file) for x in range(len(time_s)) ]
pd.Series(serie_date)+time_s
Can someone help me, please? Thanks.

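You can first concatenate the date, formatted as a string, with the time strings and then parse the result as datetimes (time_s must still be the original string Series, before the pd.to_datetime conversion):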
out = pd.to_datetime(date_file.strftime('%Y-%m-%d') + time_s,format='%Y-%m-%d%H%M%S')
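The attempts in the question fail because adding two datetimes is not defined, whereas adding a duration to a datetime is. A minimal sketch of both routes, assuming a hypothetical file date of 2021-03-14 (the real year, month and day come from the file name): the string concatenation from the answer, and an alternative that turns the parsed times into offsets from midnight.

```python
import datetime

import pandas as pd

time_s = pd.Series(['000329', '000458', '154259', '232810'])
# Hypothetical date; in the question it is extracted from the file name.
date_file = datetime.datetime(year=2021, month=3, day=14)

# Route 1 (from the answer): build "YYYY-MM-DDHHMMSS" strings and parse them once.
out = pd.to_datetime(date_file.strftime('%Y-%m-%d') + time_s,
                     format='%Y-%m-%d%H%M%S')

# Route 2: parse the times, keep only the offset from midnight,
# then add that offset (a timedelta) to the static date.
parsed = pd.to_datetime(time_s, format='%H%M%S')   # dates default to 1900-01-01
offsets = parsed - parsed.dt.normalize()           # timedelta since midnight
out2 = pd.Timestamp(date_file) + offsets
```

Both routes should give the same Series of timestamps; the second one is handy when time_s has already been converted to datetimes.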

Related

Dataframe String Separation

How do I extract the year, month and day into separate columns instead of a single column of (year, month, day) tuples?
Please refer to the photo
def temp(i):
    i = str(i)
    year = i[0:4]
    moth = i[4:6]
    day = i[6:8]
    return year, moth, day
profile_drop["year","moth","day"] = profile_drop["became_member_on"].apply(temp)
Although it isn't directly what you asked, the easiest way to extract the date parts is to convert the column to datetime and then use pandas' built-in accessors:
profile_drop["became_member_on_date"] = pd.to_datetime(profile_drop["became_member_on"], format='%Y%m%d')
profile_drop['year'] = profile_drop["became_member_on_date"].dt.year
profile_drop['month'] = profile_drop["became_member_on_date"].dt.month
profile_drop['day'] = profile_drop["became_member_on_date"].dt.day
In this snippet I first convert the string to a full datetime using pd.to_datetime (explicitly stating the format to parse) and then extract the year, month and day by calling .dt.year, .dt.month and .dt.day on the date column.
This also avoids .apply, which is not recommended unless you really need it.
A classic XY problem.
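For completeness, the literal question (one column per part, without .apply) can also be handled with vectorized string slicing. This is a sketch only; the sample values below are made up, and it assumes every entry is an eight-digit YYYYMMDD value.

```python
import pandas as pd

# Hypothetical sample data standing in for the real profile_drop frame.
profile_drop = pd.DataFrame({"became_member_on": [20170212, 20180726, 20151101]})

# Slice the string form of each value; .str works on the whole column at once.
as_text = profile_drop["became_member_on"].astype(str)
profile_drop["year"] = as_text.str[0:4].astype(int)
profile_drop["month"] = as_text.str[4:6].astype(int)
profile_drop["day"] = as_text.str[6:8].astype(int)
```

Unlike the datetime route, this does not validate the dates, so an impossible value such as 20171345 would pass through unnoticed.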

Pandas str split. Can I skip line which gives troubles?

I have a dataframe (all5) with one column of dates ('CREATIE_DATUM'). Sometimes the notation is 01/JAN/2015, and sometimes it's written as 01-JAN-15.
I only need the year, so I wrote the following code line:
all5[['Day','Month','Year']]=all5['CREATIE_DATUM'].str.split('-/',expand=True)
but I get the following error:
columns must be same length as key
so I assume somewhere in my dataframe (>100.000 lines) a value has more than two '/' signs.
How can I make my code skip this line?
You can try pd.to_datetime and then use the .dt accessor to get the day, month and year:
x = pd.to_datetime(all5["CREATIE_DATUM"])
all5["Day"] = x.dt.day
all5["Month"] = x.dt.month
all5["Year"] = x.dt.year
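A note on the error itself: str.split('-/') looks for the two-character sequence "-/", which never occurs in these values, so the split yields a single column and cannot be assigned to three keys; a character class such as '[-/]' would split on either sign. The sketch below parses the two notations from the question explicitly and leaves anything else as missing instead of failing. The sample rows are made up, and English month abbreviations are assumed.

```python
import pandas as pd

# Hypothetical sample rows: the two notations from the question plus a bad value.
all5 = pd.DataFrame({"CREATIE_DATUM": ["01/JAN/2015", "01-JAN-15", "not a date"]})

raw = all5["CREATIE_DATUM"]
# Try each known format; values that do not match become NaT rather than raising.
parsed = pd.to_datetime(raw, format="%d/%b/%Y", errors="coerce")
parsed = parsed.fillna(pd.to_datetime(raw, format="%d-%b-%y", errors="coerce"))

# Nullable integer dtype keeps the year as a whole number and unparseable rows as <NA>.
all5["Year"] = parsed.dt.year.astype("Int64")
```

Rows where Year is missing are the "troublesome" lines, so they can be inspected afterwards with all5[all5["Year"].isna()].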

Dataframe Row(sum(fld)) to a discrete value

I have this:
df = sqlContext.sql(qry)
df2 = df.withColumn("ext", df.lvl * df.cnt)
ttl = df2.agg(F.sum("ext")).collect()
which returns this:
[Row(sum(ext)=1285430)]
How do I reduce this to just the discrete value 1285430, without the surrounding list and Row(sum())?
I've researched and tried so many things that I'm totally stymied.
No need for collect:
n = ...your transformation logic and agg... .first().getInt(0)
Access the first row and then get the first element as int.
df2.agg(F.sum("ext")).collect()(0).getInt(0)
Take a look at the documentation: Spark ScalaDoc.
In PySpark you can also use df.collect()[0][0] or df.collect()[0]['sum(ext)'] on the aggregated dataframe.

Change column format from 'factor' to 'date'

I would like to change the format of a column called "SampleStart" in my dataframe "xray50g".
Checking the data in this column shows it is currently in the "factor" format -
> lapply(xray50g,class)
$SampleStart
[1] "factor"
I would like to change the data format to "Date" in the form "%d/%m/%Y"
Can anyone help with this?
Thanks.
For some reason this shortens year to last two digits, but it's a start, and the class is at least correct.
library(chron)
x <- as.factor(c("2016-10-26", "2016-10-26", "2016-10-26", "2016-10-26"))
z <- format(as.Date(x), "%d/%m/%Y")
date <- chron(z, format = c(dates = "d/m/Y"))

Using Pyspark to convert column from string to timestamp

I have a pyspark dataframe with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. A sample of the data is below; the time is captured as HHmm with "A" or "P" representing am or pm. The data also has errors, where some entries exceed 24 hours.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use pyspark to remove the "A" and "P" from both columns and then convert the data (e.g., 0800, 1930, etc.) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column, creating a new column "timestamp" to store the result (see code below), but I can't get it to work. Any help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()
You can use the following (func is assumed here to be pyspark.sql.functions):
from pyspark.sql import functions as func
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn('timestamp',
    func.split(
        func.to_timestamp('Violation_Time', 'HHmm')  # Convert to a timestamp (full date and time)
        , ' '
    ).getItem(1)  # Split on the space and take the item at index 1 (the time part)
)
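Note that the snippet above passes the raw column to to_timestamp, so the trailing "A"/"P" is still in the string when it is parsed, and values such as 2540P are not valid times. As an illustration of the cleaning rule only, here is a plain-Python sketch (not PySpark; parse_hhmm is a hypothetical helper name). It assumes the digits are already on a 24-hour clock, as the samples 1450P and 1600P suggest, so the marker is simply dropped.

```python
import datetime
from typing import Optional


def parse_hhmm(raw: str) -> Optional[datetime.time]:
    """Strip a trailing A/P marker and parse HHmm; return None for invalid input."""
    digits = raw.strip().rstrip("APap")
    if len(digits) != 4 or not digits.isdigit():
        return None
    hour, minute = int(digits[:2]), int(digits[2:])
    # Reject out-of-range entries such as "2540P" instead of guessing a value.
    if hour > 23 or minute > 59:
        return None
    return datetime.time(hour, minute)


cleaned = [parse_hhmm(v) for v in ["0830A", "1450P", "2540P"]]
```

Under a different assumption, for example a 12-hour clock where "0100P" means 13:00, the marker would have to be interpreted rather than discarded. The same rule could be expressed with Spark column functions or wrapped in a UDF.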