Using Pyspark to convert column from string to timestamp - apache-spark-sql

I have a PySpark DataFrame with two columns (Violation_Time, Time_First_Observed) that are stored as strings. A sample of the data is below; each value is captured as HHmm followed by "A" or "P" for am or pm. The data also contains errors: some entries exceed 24 hours.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use PySpark to remove the "A" and "P" from both columns and then convert the data (e.g., 0800, 1930) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column, creating a new column "timestamp" to store the result (see code below). However, I can't get it to work. Any help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()

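You can use the following. The trailing "A"/"P" has to be stripped first, because the pattern 'HHmm' only matches the four digits. Out-of-range entries such as 2540 are not valid for this pattern and need to be handled separately. Note that the result is a string holding the time part, not a timestamp.
from pyspark.sql import functions as func
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp(func.regexp_replace('Violation_Time', '[AP]$', ''), 'HHmm').cast('string'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn('timestamp',
    func.split(
        func.to_timestamp(
            func.regexp_replace('Violation_Time', '[AP]$', ''),  # Drop the trailing A/P marker
            'HHmm'  # Parse the remaining digits as hours and minutes
        ).cast('string')  # Render the timestamp as 'yyyy-MM-dd HH:mm:ss'
        , ' '
    ).getItem(1)  # Split on the space and take the element at index 1, i.e. the time part
)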
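The sample data mixes values such as 0830A and 1600P and also contains invalid entries such as 2540P, so it can help to pin down the parsing rules in plain Python before expressing them in Spark. The sketch below is only an illustration: the function name parse_hhmm is hypothetical, and it assumes that hours 1 to 11 with a "P" suffix are 12-hour clock values, that hours already above 12 are 24-hour values with a redundant suffix, and that anything out of range should become None. If those assumptions do not match your data, adjust the rules accordingly.

```python
from datetime import time
from typing import Optional


def parse_hhmm(raw: Optional[str]) -> Optional[time]:
    """Parse strings such as '0830A' or '1600P' into a datetime.time.

    Returns None for malformed or out-of-range input (hypothetical helper).
    """
    if raw is None:
        return None
    text = raw.strip().upper()
    # Expect exactly four digits followed by A or P.
    if len(text) != 5 or not text[:4].isdigit() or text[4] not in "AP":
        return None
    hour, minute, suffix = int(text[:2]), int(text[2:4]), text[4]
    # Assumption: 1-11 with 'P' is a 12-hour value; 12 with 'A' is midnight.
    if suffix == "P" and 1 <= hour <= 11:
        hour += 12
    elif suffix == "A" and hour == 12:
        hour = 0
    # Reject entries such as 2540 that are not valid clock times.
    if hour > 23 or minute > 59:
        return None
    return time(hour, minute)


for sample in ["0830A", "1600P", "2540P", "0100A"]:
    print(sample, parse_hhmm(sample))
```

The same rules could then be applied row by row (for example through a Python UDF) or translated into column expressions, depending on what fits your pipeline.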

Related

Text not read when using pd.read_csv() on a Google Sheet

I am trying to read a Google Sheet using pandas pd.read_csv(), however when the columns contain cells with text and other cells with numeric values, the text is not read. My code is:
import pandas as pd
def build_sheet_url(doc_id, sheet_id):
    return r"https://docs.google.com/spreadsheets/d/{}/gviz/tq?tqx=out:csv&sheet={}".format(doc_id, sheet_id)
sheet_url = build_sheet_url(doc_id, sheet_name)
df = pd.read_csv(sheet_url)
> df
Column1 Column2
0 12 21
1 13 22
2 14 23
3 15 24
This is what the spreadsheet looks like:
I have tried using dtype=str and dtype=object, but could not get the text to show up in my dataframe. Specifying encoding='utf-8' did not work either.
This is because query doesn't support mixed data types:
Data type. Supported data types are string, number, boolean, date, datetime and timeofday. All values of a column will have a data type that matches the column type, or a null value. These types are similar, but not identical, to the JavaScript types.
Use the /export endpoint (or the Drive API endpoint) instead:
https://docs.google.com/spreadsheets/d/[SPREADSHEET_ID]/export?format=[FORMAT]&gid=(SHEET_ID)&range=(A1NOTATION)
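As a rough sketch of how the /export URL above might be built in Python: the function name build_export_url and the placeholder IDs are hypothetical. It assumes that gid is the numeric sheet ID shown in the sheet's browser URL (not the sheet name used with the gviz endpoint) and that the document is readable through its shared link.

```python
def build_export_url(doc_id, gid, fmt="csv"):
    """Build the /export URL for one sheet of a Google Sheets document."""
    return (
        "https://docs.google.com/spreadsheets/d/{}/export?format={}&gid={}"
        .format(doc_id, fmt, gid)
    )


# Placeholder IDs for illustration only.
url = build_export_url("SPREADSHEET_ID", 0)
print(url)
# Reading it would then look like: df = pd.read_csv(url, dtype=str)
```

Passing dtype=str to pd.read_csv keeps every cell as text, which avoids any type guessing on the pandas side.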
Related:
Google sheet to pandas via shared link without credentials in python
Query is ignoring string (non numeric) value

Pandas Replace_ column values

Hello,
I am analyzing the following dataset.
The column ['program_number'] has dtype object, but I want to convert it to an integer column.
I have tried to replace some values, but it doesn't work.
As you can see, some values are duplicated, for example '6 ' and 6.
How can I resolve this? Many thanks.
UPDATE
Didn't see 1X and 3X at first.
If you need those numbers and just want to remove the X then:
df["Program"] = df["Program"].str.strip(" X").astype(int)
If the column contains values that aren't numbers or that shouldn't be converted, you can use pd.to_numeric with errors='coerce'. Cells that can't be converted become NaN. Be aware that this results in floating-point numbers.
df["Program"] = pd.to_numeric(df["Program"], errors="coerce")
old
You want to use str.strip() here, rather than replace.
Try this:
df1['program_number'] = df1['program_number'].str.strip().astype(int)
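Putting the two suggestions together, a minimal sketch might look like the following. The sample values are made up to resemble the ones mentioned in the question ('6 ', 6, 1X, 3X), and the column name program_number_clean is hypothetical. The nullable Int64 dtype is used here as one way to keep integers even when some cells cannot be converted.

```python
import pandas as pd

# Hypothetical sample resembling the column described in the question.
df = pd.DataFrame({"program_number": ["6 ", 6, "1X", "3X", " 12", "n/a"]})

cleaned = (
    df["program_number"]
    .astype(str)       # mixed int/str values all become strings
    .str.strip(" X")   # drop surrounding spaces and a trailing X
)

# Unparseable cells become NaN, then <NA> in the nullable integer dtype.
df["program_number_clean"] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")
print(df)
```

If every value is known to be convertible, the plain .astype(int) from the answer above is enough.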

Change column format from 'factor' to 'date'

I would like to change the format of a column called "SampleStart" in my dataframe "xray50g".
Checking the data in this column shows it is currently in the "factor" format -
> lapply(xray50g,class)
$SampleStart
[1] "factor"
I would like to change the data format to "Date" in the form "%d/%m/%Y"
Can anyone help with this?
Thanks.
For some reason this shortens year to last two digits, but it's a start, and the class is at least correct.
library(chron)
x <- as.factor(c("2016-10-26", "2016-10-26", "2016-10-26", "2016-10-26"))
z <- format(as.Date(x), "%d/%m/%Y")
date <- chron(z, format = c(dates = "d/m/Y"))

Python: Remove exponential in Strings

I have been trying to remove the exponential in a string for the longest time to no avail.
The column contains strings with letters in them as well as long numbers of more than 24 digits. I tried converting the column to string with .astype(str), but the value still shows up as "1.234123E+23". An example of the table is
A
345223423dd234324
1.234123E+23
How do I get the table to show the full string of digits in pandas? This is what I tried:
b = "1.234123E+23"
str(int(float(b)))
output is '123412299999999992791040'
I have no idea how to do this in pandas with mixed data types in the column.
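One possible approach, sketched here under stated assumptions, is to expand the exponent with decimal.Decimal instead of float, which avoids the binary rounding visible in the output above. The helper name expand_exponent is hypothetical. It leaves any value that is not a finite number untouched. Note that this only expands the digits still present in the string: digits that were already dropped when the value was written as 1.234123E+23 cannot be recovered, so the column would have to be read as text from the original source (for example with dtype=str in pd.read_csv).

```python
from decimal import Decimal, InvalidOperation


def expand_exponent(value):
    """Return numeric strings in plain notation; leave other values as text."""
    text = str(value).strip()
    try:
        number = Decimal(text)
    except InvalidOperation:
        # Not a number (e.g. '345223423dd234324'): keep the original text.
        return text
    if not number.is_finite():
        # 'NaN' or 'Infinity' parse as Decimal but have no plain digit form.
        return text
    # The 'f' format renders the value without an exponent.
    return format(number, "f")


for sample in ["345223423dd234324", "1.234123E+23"]:
    print(expand_exponent(sample))
```

With a pandas column this could be applied as df["A"].map(expand_exponent), which keeps the column as strings.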

How to change pandas Period Index to lower case? [duplicate]

I have a dataframe where I used
df.groupby(pd.PeriodIndex(df.columns, freq='Q'), axis=1).mean() to combine all column names from month into quarter by taking the mean.
However, the result dataframe has columns like below and I could not change all upper case Q into lower case 'q'.
PeriodIndex(['2000Q1', '2000Q2', '2000Q3', '2000Q4', '2001Q1', '2001Q2',
'2001Q3', '2001Q4', '2002Q1', '2002Q2', '2002Q3', '2002Q4',
'2003Q1', '2003Q2', '2003Q3', '2003Q4', '2004Q1', '2004Q2',
'2004Q3', '2004Q4', '2005Q1', '2005Q2', '2005Q3', '2005Q4',
'2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2',
'2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3', '2008Q4',
'2009Q1', '2009Q2', '2009Q3', '2009Q4', '2010Q1', '2010Q2',
'2010Q3', '2010Q4', '2011Q1', '2011Q2', '2011Q3', '2011Q4',
'2012Q1', '2012Q2', '2012Q3', '2012Q4', '2013Q1', '2013Q2',
'2013Q3', '2013Q4', '2014Q1', '2014Q2', '2014Q3', '2014Q4',
'2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2',
'2016Q3'],
dtype='period[Q-DEC]', freq='Q-DEC')
I have tried using df.columns=[x.lower() for x in df.columns], and it raises an error: 'Period' object has no attribute 'lower'
This looks like a duplicate of the issue posted here: Convert pandas._period.Period type Column names to Lowercase
Basically, you'll want to reformat the Period output to have a lowercase q like so:
df.columns = df.columns.strftime('%Yq%q')
Alternatively, if you want to work with the PeriodIndex object directly, you can do something like:
# get the PeriodIndex you pasted in your question (the columns of the grouped result)
periods = df.groupby(pd.PeriodIndex(df.columns, freq='Q'), axis=1).mean().columns
# format each Period as a string with a lowercase q
periods = [p.strftime('%Yq%q') for p in periods]
The %Y denotes the year format, the first q is the literal lowercase "q" you want, and the %q is the quarter number.
Here is the documentation for a Period's strftime() method, which returns the formatted time string. At the bottom they have some nice examples!
Looking at the methods listed in the Pandas documentation, lower() isn't an available method for the Period object, which is why you're getting this error (a PeriodIndex is just an array of Periods, which denote a chunk of time).
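To make the above concrete, here is a small self-contained sketch. The one-row frame and its values are made up for illustration. It shows the strftime approach next to a plain str(...).lower() conversion, which works because lower() is called on the string form of each Period rather than on the Period itself.

```python
import pandas as pd

# Hypothetical frame whose columns are quarterly Periods.
columns = pd.period_range("2000Q1", "2000Q4", freq="Q")
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], columns=columns)

# Option 1: format every Period with a literal lowercase q.
formatted = list(df.columns.strftime("%Yq%q"))

# Option 2: convert each Period to its string form, then lowercase it.
lowered = [str(p).lower() for p in df.columns]

df.columns = formatted
print(list(df.columns))
```

Either way the columns end up as plain strings, so they no longer behave as a PeriodIndex afterwards.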