The as.Date() function does not work, my characters remain characters - readxl

I have an Excel file in which there is a column containing the date and hour of each measurement in the format 01.01.2018 01:00.
The first 3 rows contain characters, and the whole column is formatted as "Number" (in Excel/LibreOffice).
If I try to read the xlsx file with readxl:
NO2_2018 <- read_excel("NO2_2018.xlsx", sheet = "Seite 1",
                       range = "A2:AU8762",
                       col_types = c("date", "numeric", ....)
I get NA values (the column's class is POSIXct) and the warning
Expecting date in .... / .....: got '03.01.2018 02:00'
Then I tried reading the column in as text and converting it with the as.Date() function:
as.Date(NO2_2018$Zeitpunkt,format = "%d.%m.%Y% H:%M", tz="CEST")
However, it does not change the class
class(NO2_2018$Zeitpunkt)
[1] "character"

Have you tried changing the dots in the date to slashes and then using as.Date() on your transformed variable?
gsub(".", "/", date, fixed = TRUE)
(fixed = TRUE is needed because "." is a regex metacharacter.) Note also that as.Date() returns a new vector rather than modifying its argument, so assign the result back to the column; otherwise the class stays "character". Your format string also has a stray "%": it should be "%d.%m.%Y %H:%M", and since you need the time of day, as.POSIXct() is a better fit than as.Date(), which drops the time.

Related

pandas reading csv date as a string that's a 5 digit number

I have a date in a .csv with format YYYY-MM-DD. Pandas reads it in as a string, but instead of the format shown in the csv it comes through as a 5-digit number, coded as a string.
I've tried:
pd.to_datetime(df['alert_date'], unit = 's')
pd.to_datetime(df['alert_date'], unit = 'D')
I've also tried reading it in as a string and letting the date parser take over. See below:
dtype_dict = {'alert_date': 'str', 'lossdate1': 'str', 'lossdate2': 'str',
              'lossdate3': 'str', 'lossdate4': 'str', 'lossdate5': 'str',
              'effdate': 'str'}
parse_dates = ['lossdate1', 'lossdate2', 'lossdate3',
               'lossdate4', 'lossdate5', 'effdate']
df = pd.read_csv("Agent Alerts Earned and Incurred with Loss Dates as of Q3 2021.csv",
encoding='latin1', dtype = dtype_dict, parse_dates=parse_dates)
I'm not sure what else to try or what is wrong with it to begin with.
Here is an example of what the data looks like.
alertflag,alert_type,alert_date,effdate,cal_year,totalep,eufactor,product,NonCatincrd1,Catincrd1,lossdate1,NonCatcvrcnt1,Catcvrcnt1,NonCatincrd2,Catincrd2,lossdate2,NonCatcvrcnt2,Catcvrcnt2,NonCatincrd3,Catincrd3,lossdate3,NonCatcvrcnt3,Catcvrcnt3,NonCatincrd4,Catincrd4,lossdate4,NonCatcvrcnt4,Catcvrcnt4,NonCatincrd5,Catincrd5,lossdate5,NonCatcvrcnt5,Catcvrcnt5,incurred
1,CANCEL NOTICE,2019-06-06,2018-12-17,2019,91.00,0.96,444,,,,,,,,,,,,,,,,,,,,,,,,,,
The alert_date comes through on that record as 21706.
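One explanation: 21706 looks like a day count since an epoch. 21706 days after the SAS epoch of 1960-01-01 is exactly 2019-06-06, the alert_date shown in the sample row, so the file was plausibly exported from SAS with raw date numbers. A sketch under that assumption:
import pandas as pd

# Assumption: the date columns hold day counts since the SAS epoch
# (1960-01-01); 21706 days after 1960-01-01 is 2019-06-06, matching
# the sample row above.
df = pd.read_csv("Agent Alerts Earned and Incurred with Loss Dates as of Q3 2021.csv",
                 encoding="latin1")
days = pd.to_numeric(df["alert_date"], errors="coerce")
df["alert_date"] = pd.to_datetime(days, unit="D", origin="1960-01-01")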

Change column format from 'factor' to 'date'

I would like to change the format of a column called "SampleStart" in my dataframe "xray50g".
Checking the data in this column shows it is currently in the "factor" format -
> lapply(xray50g,class)
$SampleStart
[1] "factor"
I would like to change the data to class "Date", displayed in the form "%d/%m/%Y".
Can anyone help with this?
Thanks.
This shortens the year to the last two digits because chron's "y" format code prints an abbreviated year; the "year" code prints all four digits. With an explicit out.format, the class and the display are both correct:
library(chron)
x <- as.factor(c("2016-10-26", "2016-10-26", "2016-10-26", "2016-10-26"))
z <- format(as.Date(x), "%d/%m/%Y")
date <- chron(z, format = c(dates = "d/m/y"), out.format = c(dates = "d/m/year"))

Using Pyspark to convert column from string to timestamp

I have a pyspark dataframe with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. A sample of the data is below, where the time is captured as HHmm with "A" or "P" representing am or pm. The data also has errors where some entries exceed 24 hours.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use pyspark to remove the "A" and "P" from both columns and then convert the data (e.g., 0800, 1930, etc.) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column, creating a new column "timestamp" to store the result (see code below). However, I can't seem to get it to work. Any help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()
You can use the following:
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn('timestamp',
    func.split(
        func.to_timestamp('Violation_Time', 'HHmm'),  # parse to a timestamp; its string form is 'yyyy-MM-dd HH:mm:ss'
        ' '
    ).getItem(1)  # split that string on the space and take item 1, the time portion
)
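A fuller sketch along the lines the question actually asks for: strip the trailing "A"/"P" and parse the remaining digits as a 24-hour HHmm time. This reuses sparkdf3 from the question and assumes a Spark version where unparseable values simply come back NULL:
from pyspark.sql import functions as F

# Strip the trailing letter, then parse the digits as a 24-hour time.
# Invalid entries such as '2540' fail to parse and yield NULL, which
# makes them easy to filter out afterwards.
for c in ['Violation_Time', 'Time_First_Observed']:
    sparkdf3 = sparkdf3.withColumn(
        c + '_ts',
        F.to_timestamp(F.regexp_replace(c, '[AP]$', ''), 'HHmm')
    )
sparkdf3.select('Violation_Time', 'Violation_Time_ts').show()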

Exporting amounts using space as '000 delimiter

I would like all amounts exported to Excel to use a space as the thousands ('000) delimiter and "," for the decimal, e.g. "3 257 132,54" (a common number format in Europe).
I tried to adapt the example provided on xlsxwriter.readthedocs.io:
format1 = workbook.add_format({'num_format': '#,##0.00'})
As follows
format1 = workbook.add_format({'num_format': '# ##0,00'})
I am using the code from the xlsxwriter docs; I just modified the '000 delimiter and the decimal point:
# Add some cell formats.
format1 = workbook.add_format({'num_format': '#,##0.00'})
# Set the column width and format.
worksheet.set_column('B:B', 18, format1)
I obtain a very surprising result: the example above appears in Excel as 3257 132,54.
Almost right, but the '000 separator is applied only once, for thousands, and not for millions or billions. (The comma as decimal separator works fine.)
Is there a trick I missed?
You just need to use whatever number format you would use in Excel for this. Probably something like ### ### ###.00 (although, as written, that uses a point rather than a comma for the decimal):
import xlsxwriter
workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()
format1 = workbook.add_format({'num_format': '### ### ###.00'})
worksheet.set_column('B:B', 18, format1)
worksheet.write(0, 1, 123.123)
worksheet.write(1, 1, 1234.123)
worksheet.write(2, 1, 12345.123)
worksheet.write(3, 1, 123456.123)
worksheet.write(4, 1, 1234567.123)
worksheet.write(5, 1, 12345678.123)
workbook.close()
Output: the amounts display with spaces as thousands separators, e.g. 12345678.123 appears as 12 345 678.12.
You can find the exact number format you need by setting it in Excel and then checking what it is in the custom section of the number format dialog.
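An alternative worth knowing about (not part of the answer above): Excel stores number formats with US-style separators and renders them according to the viewer's locale, so under a locale that groups with spaces and uses a comma decimal (French, for instance) the stock format from the docs already displays as "3 257 132,54". A sketch, assuming the workbook is opened under such a locale:
import xlsxwriter

workbook = xlsxwriter.Workbook('localized.xlsx')
worksheet = workbook.add_worksheet()

# US-style format string; Excel substitutes the viewer's locale separators,
# e.g. "1 234 567,12" under a French locale (space grouping, comma decimal).
format1 = workbook.add_format({'num_format': '#,##0.00'})
worksheet.set_column('B:B', 18, format1)
worksheet.write(0, 1, 1234567.123)
workbook.close()
The downside is that the display then depends on each viewer's locale rather than being fixed in the file.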

TypeError: 'DataFrame' object is not callable in concatenating different dataframes of certain types

I keep getting the following error.
I read a file that contains time-series data in 3 columns: [meter ID] [daycode (explained later)] [meter reading in kWh]
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
test = consum.loc[[1048]]
I want to observe meter readings over the full length of the data in this file, but first I filter by meter ID.
test['day'] = test['daycode'].astype(str).str[:3]
test['hm'] = test['daycode'].astype(str).str[-2:]
For readability, I convert daycode based on its rule: the first 3 digits are in the range 1 to 365 × 2 = 730, and the last 2 digits are in the range 1 to 48. These are 30-minute interval readings spanning 2 years (though not all meters have the full span).
So I created two separate files, one containing the dates and the other the times. I use the index to convert the digits of daycode into the corresponding date and time that these files contain. (An arithmetic alternative is sketched just below.)
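As an aside, the rule above can also be decoded arithmetically, without codebook files; a sketch, with a hypothetical origin date standing in for day 1 of the recording period:
import pandas as pd

# Hypothetical origin: substitute the real date that day 1 maps to.
# Assumes slot 1 labels the half-hour interval ending at 00:30.
daycode = pd.Series([36548, 73048])
day = daycode // 100      # leading digits: day index 1..730
slot = daycode % 100      # last two digits: half-hour slot 1..48
timestamp = (pd.Timestamp('2009-01-01')
             + pd.to_timedelta(day - 1, unit='D')
             + pd.to_timedelta(slot * 30, unit='min'))
print(timestamp)   # 36548 -> 2010-01-01 00:00, i.e. day 365, slot 48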
#dcodebook index starts from 0. So minus 1 from the daycode before match
dcodebook = pd.read_csv("data/dcode.txt", encoding = "utf-8", sep = '\r', names =['match'])
#hcodebook starts from 1
hcodebook = pd.read_csv("data/hcode.txt", encoding = "utf-8", sep ='\t', lineterminator='\r', names =['code', 'print'])
hcodebook = hcodebook.drop(['code'], axis= 1)
For some reason, dcodebook had to be indexed with .iloc, as I understood it, while hcodebook needed .loc.
#iloc: by int-position
#loc: by label value
#ix: by both
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
# to avoid a duplicate-index ValueError, create separate dataframes...
hm_df = hcodebook.loc[test['hm'].astype(int) - 1]
#.to_frame error / do I need .reset_index(drop=True)?
The following line is where the code crashes.
datcode_df = day_df(['match']) + ' ' + hm_df(['print'])
print datcode_df
print test
What I don't understand:
I tested earlier that columns of different dataframes can be merged with simple addition, as seen above.
I also initially assigned the result to the existing ['daycode'] column of the test dataframe, so that the previous values would be replaced, and the same error message was returned.
Please advise.
You need both DataFrames to be the same length, so day and hm have to line up row for row.
Then call reset_index with drop=True so both share the same indices, and remove the () in the join: day_df(['match']) tries to call the DataFrame, which is what raises TypeError: 'DataFrame' object is not callable, while day_df['match'] selects the column:
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
hm_df = hcodebook.loc[test['hm'].astype(int) - 1].reset_index(drop=True)
datcode_df = day_df['match'] + ' ' + hm_df['print']
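For reference, a minimal sketch of the difference between calling and indexing, with made-up two-row codebooks:
import pandas as pd

day_df = pd.DataFrame({'match': ['01/01/2010', '02/01/2010']})
hm_df = pd.DataFrame({'print': ['00:30', '01:00']})

# day_df(['match'])   # TypeError: 'DataFrame' object is not callable
datcode_df = day_df['match'] + ' ' + hm_df['print']   # element-wise string join
print(datcode_df)
# 0    01/01/2010 00:30
# 1    02/01/2010 01:00
# dtype: object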