Set a pandas column to a timezone only datetime object - pandas

I'm trying to refactor a piece of code that is very slow. It takes the timezoneId of every row and applies pytz.timezone() to transform it into a timezone object carrying only the timezone info.
For example, the DataFrame looks like this:
df = pd.DataFrame([['America/Sao_Paulo'], ['US/Eastern'], ['Europe/Moscow']], index=['ID1', 'ID2', 'ID3'])
I need these strings to be converted to datetime objects.
If I try to use .apply(pytz.timezone) I get the following error:
AttributeError: 'Series' object has no attribute 'upper'
And I cannot use the .to_datetime Pandas method with only timezone information.
How would I go about creating a datetime object with only timezone information?
EDIT:
Here's what the code I'm rewriting looks like:
try:
    tz_id = data[-1]['timezone']['timeZoneId']
    self.timezone = pytz.timezone(tz_id)
except:
    self.timezone = pytz.timezone("US/Eastern")
return self.timezone
It only takes one ID at a time, which is why it currently works but is so slow.
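The 'upper' error comes from calling .apply on the whole DataFrame, which passes each column (a Series) to pytz.timezone; mapping over the column itself works. A minimal sketch of a faster approach, assuming the column is named timezoneId (the name is made up for illustration), which builds each timezone object once per unique ID:
import pandas as pd
import pytz

df = pd.DataFrame([['America/Sao_Paulo'], ['US/Eastern'], ['Europe/Moscow']],
                  index=['ID1', 'ID2', 'ID3'], columns=['timezoneId'])

# Parse each unique name once, then map the cached objects onto the column
tz_cache = {name: pytz.timezone(name) for name in df['timezoneId'].unique()}
df['tz'] = df['timezoneId'].map(tz_cache)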

Related

wrong result shown in converting float to integer in pd

Hi, I want to ask why, when I convert to int, the result still remains float64.
As shown in the pandas docs, convert_dtypes returns a new DataFrame. This means you need to assign the result back to the variable df_cleaned to replace the previous value:
df_cleaned = df_cleaned.convert_dtypes(convert_integer=True)
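A short before/after illustration of that reassignment, with made-up data:
import pandas as pd

df_cleaned = pd.DataFrame({'a': [1.0, 2.0]})
print(df_cleaned.dtypes)  # a    float64

# convert_dtypes returns a new DataFrame; assign it back
df_cleaned = df_cleaned.convert_dtypes(convert_integer=True)
print(df_cleaned.dtypes)  # a    Int64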

Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp

I have a feature that lets me query a Databricks Delta table from a client app. This is the code I use for that purpose:
df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = df.toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)
However, the second line throws me the error
Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp
I know what this error means: my date-typed field is out of bounds. I tried searching for a solution, but none of the ones I found fit my scenario.
The solutions I found target one specific dataframe column, but in my case the problem is global: I have tons of Delta tables and I don't know which columns are date-typed, so I can't do the type manipulation up front to avoid this.
Is it possible to find all Timestamp type columns and cast them to string? Does this seem like a good solution? Do you have any other ideas on how can I achieve what I'm trying to do?
Is it possible to find all Timestamp type columns and cast them to string?
Yes, that's the way to go. You can loop through df.dtypes and handle columns of type "timestamp" by casting them to strings before calling df.toPandas():
import pyspark.sql.functions as F

df = df.select(*[
    F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])
dataframe = df.toPandas()
You can define this as a function that takes a DataFrame as a parameter and use it on all your tables:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def stringify_timestamps(df: DataFrame) -> DataFrame:
    # Cast every timestamp column to string; leave other columns untouched
    return df.select(*[
        F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c).alias(c)
        for c, t in df.dtypes
    ])
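For example, usage with the query from the question might look like this (a sketch assuming a live spark session, as in the question):
df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = stringify_timestamps(df).toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)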
If you want to preserve the timestamp type, you can consider nullifying the timestamp values that are greater than pd.Timestamp.max, as shown in this post, instead of converting them into strings.

Converting Date Time index in Pandas [duplicate]

My dataframe has a DOB column (example format 1/1/2016) which by default gets converted to Pandas dtype 'object'.
Converting this to date format with df['DOB'] = pd.to_datetime(df['DOB']), the date gets converted to: 2016-01-26 and its dtype is: datetime64[ns].
Now I want to convert this date format to 01/26/2016 or any other general date format. How do I do it?
(Whatever the method I try, it always shows the date in 2016-01-26 format.)
You can use dt.strftime if you need to convert datetime to other formats (but note that the dtype of the column will then be object, i.e. string):
import pandas as pd

df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
print(df)
         DOB
0  26/1/2016
1  26/1/2016

df['DOB'] = pd.to_datetime(df.DOB)
print(df)
         DOB
0 2016-01-26
1 2016-01-26

df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')
print(df)
         DOB        DOB1
0 2016-01-26  01/26/2016
1 2016-01-26  01/26/2016
Changing the format but not changing the type:
df['date'] = pd.to_datetime(df["date"].dt.strftime('%Y-%m'))
There is a difference between the content of a dataframe cell (a binary value) and its presentation (displaying it) for us, humans.
So the question is: How do I reach the appropriate presentation of my data without changing the data / data types themselves?
Here is the answer:
If you use the Jupyter notebook for displaying your dataframe, or
if you want to reach a presentation in the form of an HTML file (even with many prepared, possibly superfluous id and class attributes for further CSS styling, which you may or may not use),
use styling. Styling doesn't change the data / data types of the columns of your dataframe.
Now I'll show you how to reach it in the Jupyter notebook; for a presentation in the form of an HTML file, see the note near the end of this answer.
I will suppose that your column DOB already has the datetime64 type (you have shown that you know how to reach it). I prepared a simple dataframe (with only one column) to show you some basic styling:
Not styled:
df
DOB
0 2019-07-03
1 2019-08-03
2 2019-09-03
3 2019-10-03
Styling it as mm/dd/yyyy:
df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
DOB
0 07/03/2019
1 08/03/2019
2 09/03/2019
3 10/03/2019
Styling it as dd-mm-yyyy:
df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")})
DOB
0 03-07-2019
1 03-08-2019
2 03-09-2019
3 03-10-2019
Be careful!
The returned object is NOT a dataframe; it is an object of the class Styler, so don't assign it back to df:
Don't do this:
df = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")}) # Don't do this!
(Every dataframe has its Styler object accessible by its .style property, and we changed this df.style object, not the dataframe itself.)
Questions and Answers:
Q: Why does your Styler object (or an expression returning one), used as the last command in a Jupyter notebook cell, display your (styled) table and not the Styler object itself?
A: Because every Styler object has a callback method ._repr_html_() which returns an HTML code for rendering your dataframe (as a nice HTML table).
Jupyter Notebook IDE calls this method automatically to render objects which have it.
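A rough illustration of that hook, calling it manually (normally the notebook does this for you; _repr_html_ is a private method, so this is for understanding only):
styler = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
html = styler._repr_html_()  # the HTML string Jupyter renders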
Note:
You don't need the Jupyter notebook for styling (i.e., for nicely outputting a dataframe without changing its data / data types).
A Styler object has a method render(), too, if you want to obtain a string with the HTML code (e.g., for publishing your formatted dataframe on the Web, or simply presenting your table in HTML format):
df_styler = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")})
HTML_string = df_styler.render()
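If you need a standalone file, you can write that string out (the filename here is made up; note that in recent pandas versions Styler.to_html() replaces render()):
with open('dob_table.html', 'w') as f:
    f.write(HTML_string)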
Compared to the first answer, I recommend using dt.strftime() first and then pd.to_datetime(). This way, the column still ends up with the datetime data type.
For example,
import pandas as pd

df = pd.DataFrame({'DOB': {0: '26/1/2016', 1: '26/1/2016'}})
df['DOB'] = pd.to_datetime(df['DOB'])  # parse the strings first so .dt works
print(df.dtypes)

df['DOB1'] = df['DOB'].dt.strftime('%m/%d/%Y')  # DOB1 is object (string)
print(df.dtypes)

df['DOB1'] = pd.to_datetime(df['DOB1'])  # back to datetime64[ns]
print(df.dtypes)
The below code worked for me instead of the previous one:
df['DOB'] = pd.to_datetime(df['DOB'].astype(str), format='%m/%d/%Y')
You can try this. It parses dates that are written day-first (DD-MM-YYYY):
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst=True)
The below code reformats the values with the given format string while keeping the column as the datetime type (note that DOB must already be a datetime column for .dt to work):
df['DOB'] = pd.to_datetime(df['DOB'].dt.strftime('%m/%d/%Y'))
Below is the code that worked for me. You need to be very careful with the format. The link below is definitely useful for knowing your existing format and changing it into the desired format (follow the strftime() and strptime() format codes in strftime() and strptime() Behavior):
data['date_new_format'] = pd.to_datetime(data['date_to_be_changed'], format='%b-%y')
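For instance, with hypothetical sample values, '%b-%y' parses abbreviated-month, two-digit-year strings:
import pandas as pd

data = pd.DataFrame({'date_to_be_changed': ['Jan-19', 'Feb-20']})
data['date_new_format'] = pd.to_datetime(data['date_to_be_changed'], format='%b-%y')
print(data['date_new_format'])
# 0   2019-01-01
# 1   2020-02-01
# Name: date_new_format, dtype: datetime64[ns]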

pandas to_datetime leaves unconverted data

I'm trying to convert a column with strings that look like "201905011" (year/month/day) to datetime, ideally displayed as 05-01-2019 (month/day/year). I'm currently trying the following, but it's not working for me:
pd.to_datetime(data.datetime, format='%Y%m%d%H')
This leaves me with the error: "ValueError: unconverted data remains: 4"
I would like to know instead how to correctly do this.
I created an example based on ALollz's comment. I created a dataframe in which the first row is correct and the second row has an extra 0 at the end. If you use this method, it will return the rows in which the data doesn't match the specified format.
import pandas as pd

df = pd.DataFrame({"datefield": ["201901010", "20190101010"]})
df.loc[pd.to_datetime(df.datefield, format='%Y%m%d%H', errors='coerce').isnull(), 'datefield']

1    20190101010
Name: datefield, dtype: object
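And as a sketch of the follow-up step the asker wanted (month-day-year display), continuing from the coerced parse above:
parsed = pd.to_datetime(df.datefield, format='%Y%m%d%H', errors='coerce')
df['formatted'] = parsed.dt.strftime('%m-%d-%Y')  # rows that failed to parse become NaN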

Convert df column to a tuple

I am having trouble converting a df column into a tuple that I can iterate through. I started with a simple code that works like this:
set = 'pare-10040137', 'pare-10034330', 'pare-00022936', 'pare-10025987', 'pare-10036617'
for i in set:
    ref_data = req_data[req_data['REQ_NUM'] == i]
This works fine, but now I want my set to come from a df. The df looks like this:
open_reqs
Out[233]:
REQ_NUM
4825 pare-00023728
4826 pare-00023773
.... ..............
I want all of those REQ_NUM values thrown into a tuple, so I tried open_reqs.apply(tuple, axis=1) and tuple(zip(open_reqs.columns, open_reqs.T.values.tolist())), but I'm not able to iterate through either of these.
My old set looks like this, so this is the format I need to match to iterate through like I was before. I'm not sure if the Unicode is also an issue (when I print the above I get (u'pare-10052173',)).
In[236]: set
Out[236]:
('pare-10040137',
'pare-10034330',
'pare-00022936',
'pare-10025987',
'pare-10036617')
So basically I need the magic code to get a nice simple set like that from the REQ_NUM column of my open_reqs table. Thank you!
The following statement makes a list out of the specified column and then converts it to a tuple:
open_req_list = tuple(list(open_reqs['REQ_NUM']))
You can use the tolist() function to convert to a list and then tuple() the whole list:
req_num = tuple(open_reqs['REQ_NUM'].tolist())
#type(req_num)
req_num
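With that tuple in hand, the original loop works unchanged (a sketch reusing the variable names from the question):
req_num = tuple(open_reqs['REQ_NUM'].tolist())
for i in req_num:
    ref_data = req_data[req_data['REQ_NUM'] == i]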
df.columns has the datatype object (it is an Index). To convert it into a tuple of all the column names, use:
df = pd.DataFrame(data)
columns_tuple = tuple(df.columns)