Changing an object column into datetime format - pandas

I am currently working with a Google Sheet that I import into Python. When I import it, every column is in object format; I later converted the numeric columns to float, but when I try to change the format of the Date column I get an error.
Following is the DataFrame I have to work on:
df.head()
Out[21]:
Date Avg_Energy Avg_Voltage
1 24-06-2018 12-50-02 2452.93
2 24-06-2018 12-50-03 2452.98 228.03
3 24-06-2018 12-50-04 2453.04 228.7
4 24-06-2018 12-50-05 2453.1 228.4
5 24-06-2018 12-50-06 2453.16 228.74
I have applied the following code to change it into datetime format:
df['DateTime'] = pd.to_datetime(df['Date'])
It gives me the following error:
df2['DateTime'] = pd.to_datetime(df2['Date'])
Traceback (most recent call last):
File "<ipython-input-22-0636e9d0e511>", line 1, in <module>
df2['DateTime'] = pd.to_datetime(df2['Date'])
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 380, in _convert_listlike
raise e
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas\_libs\tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 739, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 733, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\dateutil\parser\_parser.py", line 1356, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\dateutil\parser\_parser.py", line 648, in parse
raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '24-06-2018 12-50-100')

You have an unorthodox datetime format. Use the format argument.
pd.to_datetime(df.Date, format='%d-%m-%Y %H-%M-%S')
0 2018-06-24 12:50:02
1 2018-06-24 12:50:03
2 2018-06-24 12:50:04
3 2018-06-24 12:50:05
4 2018-06-24 12:50:06
Name: Date, dtype: datetime64[ns]
See http://strftime.org/ for more information.
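As a hedged aside (not part of the original answer): the traceback shows one malformed value, '24-06-2018 12-50-100', which even an explicit format cannot parse. If the sheet may contain such rows, errors='coerce' turns them into NaT instead of raising, so you can find and fix them:
df['DateTime'] = pd.to_datetime(df['Date'], format='%d-%m-%Y %H-%M-%S', errors='coerce')
bad_rows = df[df['DateTime'].isna()]  # rows whose Date failed to parse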

On my end I tested with just:
pd.to_datetime(df.Date)
And it worked. It appears that you are missing the first Avg_Voltage value.
Date Energy Voltage
1 24-06-2018 12-50-02 2452.93 322323.00
2 24-06-2018 12-50-03 2452.98 228.03
3 24-06-2018 12-50-04 2453.04 228.70
4 24-06-2018 12-50-05 2453.10 228.40
5 24-06-2018 12-50-06 2453.16 228.74
1 2018-06-24 12:00:00-02:00
2 2018-06-24 12:00:00-03:00
3 2018-06-24 12:00:00-04:00
4 2018-06-24 12:00:00-05:00
5 2018-06-24 12:00:00-06:00
Name: Date, dtype: object
You may use:
pd.to_datetime(df.Date).dt.strftime('%Y-%m-%d %H:%M:%S')
to get a better format.
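One caveat, as a hedged aside: .dt.strftime returns strings (object dtype), so the column is no longer a real datetime. If you still need datetime operations, parsing with an explicit format (as in the first answer) keeps the datetime64 dtype:
df['DateTime'] = pd.to_datetime(df['Date'], format='%d-%m-%Y %H-%M-%S')
df['DateTime'].dt.strftime('%Y-%m-%d %H:%M:%S')  # display only; this result is an object column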

Related

Why does Series.min(skipna=True) throw an error caused by an NA value?

I work with timestamps that have mixed DST offsets. Tried in pandas 1.0.0:
import numpy as np
import pandas as pd

s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
Asking for min() or max() fails:
s.min(), s.max() # same result with s.min(skipna=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 11216, in stat_func
f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
File "C:\Anaconda\lib\site-packages\pandas\core\series.py", line 3892, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 125, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 837, in reduction
result = getattr(values, meth)(axis)
File "C:\Anaconda\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
TypeError: '<=' not supported between instances of 'Timestamp' and 'float'
Workaround:
s.loc[s.notna()].min(), s.loc[s.notna()].max()
(Timestamp('2019-04-13 12:10:20+0200', tz='pytz.FixedOffset(120)'), Timestamp('2020-02-01 11:35:44+0100', tz='pytz.FixedOffset(60)'))
What I am missing here? Is it a bug?
I think the problem here is that pandas stores a Series with different timezones as object dtype, so max and min fail here.
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
print (s)
0 2020-02-01 11:35:44+01:00
1 NaN
2 2019-04-13 12:10:20+02:00
dtype: object
So if you convert to datetimes (but without mixed timezones, here by normalizing to UTC), it works well:
print (pd.to_datetime(s, utc=True))
0 2020-02-01 10:35:44+00:00
1 NaT
2 2019-04-13 10:10:20+00:00
dtype: datetime64[ns, UTC]
print (pd.to_datetime(s, utc=True).max())
2020-02-01 10:35:44+00:00
Another possible solution, if you need to keep the different timezones, is:
print (s.dropna().max())
2020-02-01 11:35:44+01:00
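If you need the extreme value with its original offset rather than in UTC, a hedged sketch is to use the UTC-normalized copy only to locate it and then read it back from the original Series:
utc = pd.to_datetime(s, utc=True)   # NaT is skipped by idxmin/idxmax
print (s.loc[utc.idxmax()])         # 2020-02-01 11:35:44+01:00
print (s.loc[utc.idxmin()])         # 2019-04-13 12:10:20+02:00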

Error copying a dataframe into a temp table where not all values are set

This is a continuation of this situation: How does Snowflake handle NULL values?
I am trying to insert the dataframe into a temp table that was created during the session I opened with the Python connector, but I cannot insert values into the table while the dataframe is not fully set yet. How can I add a column of blank NaN and NULL values that I can set in the table later?
conn.cursor().execute("create or replace temp table x as")
>>> conn.cursor().execute("USE DATABASE temp_db;")
<snowflake.connector.cursor.SnowflakeCursor object at 0x10c78b048>
>>> conn.cursor().execute("create or replace temp table x(id number, first_name varchar, last_name varchar, email varchar, null_feild boolean, blank_feild varchar, letter_grade varchar(3));")
<snowflake.connector.cursor.SnowflakeCursor object at 0x10acebc88>
>>> df.to_sql('x', con=conn, index=False)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/generic.py:2712: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
method=method,
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1595, in execute
cur.execute(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/snowflake/connector/cursor.py", line 490, in execute
query = command % processed_params
TypeError: not all arguments converted during string formatting
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 2712, in to_sql
method=method,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 518, in to_sql
method=method,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1749, in to_sql
table.create()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 641, in create
if self.exists():
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 628, in exists
return self.pd_sql.has_table(self.name, self.schema)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1762, in has_table
return len(self.execute(query, [name]).fetchall()) > 0
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1610, in execute
raise_with_traceback(ex)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/compat/__init__.py", line 47, in raise_with_traceback
raise exc.with_traceback(traceback)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1595, in execute
cur.execute(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/snowflake/connector/cursor.py", line 490, in execute
query = command % processed_params
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': not all arguments converted during string formatting
>>>
The dataframe is below:
>>> df['letter_grade'] = np.nan
>>> df.head()
id first_name last_name ... null field blank field letter_grade
0 1 Paule Tohill ... False NaN NaN
1 2 Rebe Slyford ... True NaN NaN
2 3 Angelita Antoni ... False NaN NaN
3 4 Giffy Dehm ... False NaN NaN
4 5 Rob Beadle ... False NaN NaN
[5 rows x 7 columns]
>>> df.to_sql('x', con=conn, index=False)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1595, in execute
cur.execute(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/snowflake/connector/cursor.py", line 490, in execute
query = command % processed_params
TypeError: not all arguments converted during string formatting
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 2712, in to_sql
method=method,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 518, in to_sql
method=method,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1749, in to_sql
table.create()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 641, in create
if self.exists():
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 628, in exists
return self.pd_sql.has_table(self.name, self.schema)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1762, in has_table
return len(self.execute(query, [name]).fetchall()) > 0
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1610, in execute
raise_with_traceback(ex)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/compat/__init__.py", line 47, in raise_with_traceback
raise exc.with_traceback(traceback)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/sql.py", line 1595, in execute
cur.execute(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/snowflake/connector/cursor.py", line 490, in execute
query = command % processed_params
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': not all arguments converted during string formatting
>>>
Obviously I don't want this to remain a temp table; after the first test scores I will alter it. I am just not sure why the connection does not like the current dataframe, given the table definition above.
The error thrown here occurs at a point well before the actual values (such as NaN/None) are evaluated. Before executing the inserts, pandas runs checks to see whether the table exists or needs to be created, and that is the part that is failing according to the traceback (it contains calls to exists, has_table, etc.).
To use pandas' to_sql function against a Snowflake DB, ensure you're passing it an actual Snowflake SQLAlchemy Engine object and not a generic connection object.
For connection objects passed to to_sql that are not of SQLAlchemy Engine type, Pandas only supports SQLite3 dialects, which can be observed in the error (the sqlite_master table is not valid for Snowflake DBs, it is only valid for SQLite3 DBs):
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': not all arguments converted during string formatting
Follow this Snowflake documentation guide to install an SQLAlchemy engine for Snowflake DB, then rebuild the parts of the code that create the SQLAlchemy engine object. The Verifying Your Installation section in the guide has a code sample that makes use of the snowflake:// URI support:
engine = create_engine(
    'snowflake://{user}:{password}@{account}/'.format(
        user='<your_user_login_name>',
        password='<your_password>',
        account='<your_account_name>',
    )
)
Note 1: The SQLAlchemy support does not come with the standard Snowflake Python Connector installation and needs to be installed as an add-on for Pandas to make use of it.
Note 2: Support for NaN and NULL value inserts to databases is present in recent versions of Pandas, covered by another question: Python Pandas write to sql with NaN values
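With the engine in place, the to_sql call should go through SQLAlchemy rather than the raw connector connection. A hedged sketch (the database and table names are taken from the question; note that a temp table created in the connector session is session-scoped and will not be visible to a separate engine connection, so you may need to let to_sql create the table or create it through the engine session instead):
from sqlalchemy import create_engine
engine = create_engine(
    'snowflake://{user}:{password}@{account}/temp_db'.format(
        user='<your_user_login_name>',
        password='<your_password>',
        account='<your_account_name>',
    )
)
df.to_sql('x', con=engine, index=False, if_exists='append')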

How to merge pandas series on a column of dates

I have two series:
date DEF
0 1/31/1986 0.0140
1 2/28/1986 0.0150
2 3/31/1986 0.0160
3 4/30/1986 0.0120
4 5/30/1986 0.0120
date PE
0 1/31/1900 12.71
1 2/28/1900 12.94
2 3/31/1900 13.04
3 4/30/1900 13.21
4 5/31/1900 12.58
I need to iterate over several DataFrames of this nature and combine them all into one big DataFrame, where only values that align with the dates get added. My function so far:
def get_combined_vars(start, end):
    rows = pd.date_range(start=start, end=end, freq='BM')
    df1 = pd.DataFrame(rows, columns=['date'])
    for key in variables.keys():
        check = variables[key][0]
        if check == 1:
            df2 = pd.DataFrame(variables[key][1]())
            print(df2.head(5))
            pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
                          df2,
                          right_on='date',
                          left_on='datekey',
                          direction='nearest',
                          suffixes=('_x', ''))
    print(df1.head(10))
    return df1
I can't seem to find the right command to merge DataFrames based on a column.
Desired output:
date DEF PE
0 1/31/1900 0.0140 12.71
1 2/28/1900 0.0150 12.94
2 3/31/1900 0.0160 13.04
3 4/30/1900 0.0120 13.21
4 5/31/1900 0.0120 12.58
Merge_asof issue:
runfile('H:/Market Timing/Files/market_timing.py', wdir='H:/Market Timing/Files')
date BY
0 1/31/1963 0.98
1 2/28/1963 1
2 3/29/1963 1.01
3 4/30/1963 1.01
4 5/31/1963 1.01
Traceback (most recent call last):
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 303, in _convert_listlike
values, tz = tslib.datetime_to_datetime64(arg)
File "pandas\_libs\tslib.pyx", line 1884, in pandas._libs.tslib.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 89, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "H:/Market Timing/Files/market_timing.py", line 88, in <module>
print(get_combined_vars('1/31/1995', '1/31/2005').head(10))
File "H:/Market Timing/Files/market_timing.py", line 43, in get_combined_vars
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 373, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 306, in _convert_listlike
raise e
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 294, in _convert_listlike
require_iso8601=require_iso8601
File "pandas\_libs\tslib.pyx", line 2156, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2379, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2373, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 1182, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 581, in parse
ret = default.replace(**repl)
ValueError: day is out of range for month
I believe that on the third pass, when these two DataFrames are being combined, it runs into this error: ValueError: day is out of range for month
Can a buffer be added for discrepancies in data like this?
You can use pd.merge_asof; however, you'll first need to get your dates onto a common year.
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
              df2,
              right_on='date',
              left_on='datekey',
              direction='nearest',
              suffixes=('_x',''))[['date','DEF','PE']]
Output:
date DEF PE
0 1900-01-31 0.014 12.71
1 1900-02-28 0.015 12.94
2 1900-03-31 0.016 13.04
3 1900-04-30 0.012 13.21
4 1900-05-31 0.012 12.58
You would use pandas.merge (or the DataFrame.merge method) to do this:
import pandas as pd
pd.merge(df1, df2, on="date")
...But as Scott Boston mentioned in his comment, the data doesn't align so you won't get your expected results.
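Returning to the ValueError in the question edit, a hedged guess: 1900 is not a leap year, so a Feb-29 month/day taken from a leap-year date cannot be parsed once '-1900' is appended. One way to add a buffer for such rows is to coerce the unparseable dates and fall back to Feb 28:
datekey = pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900',
                         format='%m-%d-%Y', errors='coerce')
df1 = df1.assign(datekey=datekey.fillna(pd.Timestamp('1900-02-28')))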

Pandas to spark data frame converts datetime datatype to bigint

I have a pandas data frame in PySpark. I want to create/load this data frame into a Hive table.
pd_df = pandas data frame
id int64
TEST_TIME datetime64[ns]
status_time object
GROUP object
test_type object
dtype: object
id TEST_TIME status_time GROUP test_type
0 1 2017-03-12 02:19:51 Driver started
1 2 2017-03-12 02:19:53 2017-03-11 18:13:43.577 ALARM AL_PT2334_L
2 3 2017-03-12 02:19:53 2017-03-11 18:13:43.577 ALARM AL_Turb_CNet_Ch_A_Fault
3 4 2017-03-12 02:19:53 2017-03-11 18:13:43.577 ALARM AL_Encl_Fire_Sys_Trouble
4 5 2017-03-12 02:19:54 2017-03-11 18:13:44.611 STATUS ST_Engine_Turning_Mode
Now I converted the pandas data frame to a Spark data frame like below.
spark_df = sqlContext.createDataFrame(pd_df)
+---+-------------------+--------------------+------+--------------------+
| id| TEST_TIME| status_time| GROUP| test_type|
+---+-------------------+--------------------+------+--------------------+
| 1|1489285191000000000| | | Driver started|
| 2|1489285193000000000|2017-03-11 18:13:...| ALARM| AL_PT2334_L|
| 3|1489285193000000000|2017-03-11 18:13:...| ALARM|AL_Turb_CNet_Ch_A...|
| 4|1489285193000000000|2017-03-11 18:13:...| ALARM|AL_Encl_Fire_Sys_...|
| 5|1489285194000000000|2017-03-11 18:13:...|STATUS|ST_Engine_Turning...|
+---+-------------------+--------------------+------+--------------------+
DataFrame[id: bigint, TEST_TIME: bigint, status_time: string, GROUP: string, test_type: string]
I want the TEST_TIME column to be a timestamp column but I am getting bigint.
I want the timestamp to be exactly like in pd_df even in spark_df.
I have done the following while converting the pandas dataframe to a Spark dataframe:
spark_df = sqlContext.createDataFrame(pd_df).withColumn("TEST_TIME", (F.unix_timestamp("TEST_TIME") + 28800).cast('timestamp'))
I got the error below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/pyspark/sql/dataframe.py", line 1314, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'unixtimestamp(TEST_TIME,yyyy-MM-dd HH:mm:ss)' due to data type mismatch: argument 1 requires (string or date or timestamp) type, however, 'TEST_TIME' is of bigint type.;"
How can I achieve what I want?
Convert your pandas dataframe column of type datetime64 to Python datetime objects, like this:
pd_df['TEST_TIME'] = pandas.Series(pd_df['TEST_TIME'].dt.to_pydatetime(), dtype=object)
And then create the spark dataframe as you were doing.
Just convert it to the right range (from nanoseconds to seconds) and cast:
df.withColumn(
    "TEST_TIME",
    (F.col("TEST_TIME") / F.pow(F.lit(1000), F.lit(3))).cast('timestamp'))

df.Change[-1] producing errors.

I'm trying to slice the last value of the series Change from my dataframe df.
The dataframe looks something like this
Change
0 1.000000
1 0.917727
2 1.000000
3 0.914773
4 0.933182
5 0.936136
6 0.957500
14466949 1.998392
14466950 2.002413
14466951 1.998392
14466952 1.974266
14466953 1.966224
When I input the following code
df.Change[0]
df.Change[100]
df.Change[100000]
I'm getting output, but when I input
df.Change[-1]
I'm getting the following error
Traceback (most recent call last):
File "<pyshell#188>", line 1, in <module>
df.Change[-1]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\indexes\base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas\index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas\index.c:3041)
File "pandas/index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:3898)
KeyError: -1
Pretty much any negative number I use for slicing is resulting in an error, and I'm not exactly sure why.
Thanks.
There are several ways to do this. What's happening is that pandas has no issue with df.Change[100] because 100 is in its index, but -1 is not. Your index just happens to match the ordinal positions. To explicitly access by ordinal position, use iloc.
df.Change.iloc[-1]
or
df.Change.values[-1]
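A small demonstration (hedged, not from the original answer) of why the label lookup fails while positional access works:
s = pd.Series([10, 20, 30])  # default RangeIndex: labels are 0, 1, 2
s.iloc[-1]                   # 30 -> positional access, negative indices allowed
# s[-1]                      # KeyError: -1 -> label lookup; -1 is not a label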