How to read a time column in pandas and convert it into milliseconds - pandas

I used this code to read the Excel file:
df=pd.read_excel("XYZ.xlsb",engine='pyxlsb',dtype={'Time':str})
This is just to show what I am getting after reading the Excel file:
import pandas as pd
import numpy as np
data = {'Name': ['T1', 'T2', 'T3'],
        'Time column in excel': ['01:57:15', '00:30:00', '05:00:00'],
        'Time column in Python': ['0.0814236111111111', '0.0208333333333333', '0.208333333333333']}
df = pd.DataFrame(data)
print(df)
| Name | Time column in excel | Time column in Python |
| T1 | 01:57:15 | 0.0814236111111111 |
| T2 | 00:30:00 | 0.0208333333333333 |
| T3 | 05:00:00 | 0.208333333333333 |
I want to read this time exactly as it appears in Excel and convert it into milliseconds, because I want to use the times to calculate time differences as percentages for further work.

Try dividing the microsecond attribute of the datetime by 1000:
def get_milliseconds(dt):
    return dt.microsecond / 1000
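Note that in the data shown above the column is not a datetime at all: pandas is reading Excel's serial time, a fraction of a day (01:57:15 is 7035 seconds, and 7035 / 86400 ≈ 0.0814236). If that is what you have, here is a minimal sketch of the conversion, assuming the column is named 'Time' as in the read_excel call above:
import pandas as pd

df = pd.read_excel("XYZ.xlsb", engine='pyxlsb')

# Excel stores times as fractions of a day: 1.0 == 24 h == 86,400,000 ms
df['Time_ms'] = pd.to_numeric(df['Time']) * 86_400_000

# optional: a readable timedelta for checking (01:57:15 -> 7,035,000 ms)
df['Time_td'] = pd.to_timedelta(df['Time_ms'], unit='ms')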

Related

argument must be a string or a number, not 'datetime.datetime', but I have a string (Pandas + Matplotlib)

I have a pandas dataframe
published | sentiment
2022-01-31 10:00:00 | 0
2021-12-29 00:30:00 | 5
2021-12-20 | -5
Since some rows don't have hours, minutes and seconds I delete them:
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
I get:
published | sentiment
2022-01-31 | 0
2021-12-29 | 5
2021-12-20 | -5
If I plot the data:
plt.pyplot.plot_date(df['published'],df['sentiment'] )
I get this error:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
But I don't know why since it should be a string.
How can I plot it (possibly keeping the temporal order)? Thank you
Try like this:
import pandas as pd
from matplotlib import pyplot as plt
values=[('2022-01-31 10:00:00',0),('2021-12-29 00:30:00',5),('2021-12-20',-5)]
cols=['published','sentiment']
df_dominant_topic2 = pd.DataFrame.from_records(values, columns=cols)
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
#you may sort the data by date
df_dominant_topic2.sort_values(by='published', ascending=True, inplace=True)
plt.plot(df_dominant_topic2['published'],df_dominant_topic2['sentiment'])
plt.show()
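A note on the design choice: because the sliced strings are in ISO YYYY-MM-DD form, sorting them lexicographically already gives temporal order, but matplotlib then treats the x-axis values as plain labels. If you want a real date axis instead, a small sketch under the same column names, parsing with pandas instead of keeping strings:
import pandas as pd
from matplotlib import pyplot as plt

values = [('2022-01-31 10:00:00', 0), ('2021-12-29 00:30:00', 5), ('2021-12-20', -5)]
df = pd.DataFrame.from_records(values, columns=['published', 'sentiment'])

# keep only the date part, then parse it into real datetimes
df['published'] = pd.to_datetime(df['published'].astype(str).str.slice(0, 10), format='%Y-%m-%d')

df = df.sort_values('published')
plt.plot(df['published'], df['sentiment'])
plt.show()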

Loading csv with pandas, wrong columns

I loaded a csv into a DataFrame with pandas.
The format is the following:
Timestamp | 1014.temperature | 1014.humidity | 1015.temperature | 1015.humidity ....
-------------------------------------------------------------------------------------
2017-... | 23.12 | 12.2 | 25.10 | 10.34 .....
The problem is that the '1014' and '1015' numbers are IDs that should go into a separate column.
I would like to end up with the following format for my DF:
TimeStamp | ID | Temperature | Humidity
-----------------------------------------------
. | | |
.
.
.
The CSV is tab separated.
Thanks in advance guys!
import pandas as pd
from io import StringIO
# create sample data frame
s = """Timestamp|1014.temperature|1014.humidity|1015.temperature|1015.humidity
2017|23.12|12.2|25.10|10.34"""
df = pd.read_csv(StringIO(s), sep='|')
df = df.set_index('Timestamp')
# split columns on '.' with list comprehension
l = [col.split('.') for col in df.columns]
# create multi index columns
df.columns = pd.MultiIndex.from_tuples(l)
# stack column level 0, reset the index and rename level_1
final = df.stack(0).reset_index().rename(columns={'level_1': 'ID'})
Timestamp ID humidity temperature
0 2017 1014 12.20 23.12
1 2017 1015 10.34 25.10
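Since the question says the real CSV is tab-separated (the '|' above is only for building the inline sample), the actual read would presumably use sep='\t'; the filename here is a placeholder:
import pandas as pd

# hypothetical filename; the real path was not given in the question
df = pd.read_csv('sensors.csv', sep='\t', index_col='Timestamp')

# same reshaping as above: split '1014.temperature' into ('1014', 'temperature')
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('.')) for c in df.columns])
final = df.stack(0).reset_index().rename(columns={'level_1': 'ID'})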

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using the PySpark DataFrame API to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote.
First part of my code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output:
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. It's hard to figure out why this is happening, i.e. why there are illegal values in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same thing happens: illegal values appear in the final dataframe's column.
Any help would be appreciated. Thanks.
You should specify a delimiter for your CSV file. By default Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter", ";")

Efficient Dataframe column (Object) to DateTime conversion

I'm attempting to create a new column that contains the data of the Date input column as a datetime. I'd also happily accept changing the datatype of the Date column, but I'm just as unsure how to do that.
I'm currently using DateTime = dd.to_datetime. I'm importing from a CSV and letting Dask decide on the data types.
I'm fairly new to this, so I've tried a few stackoverflow answers, but I'm just fumbling and getting more errors than answers.
My input date string is, for example:
2019-20-09 04:00
This is what I currently have,
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\State_Weathergrids.csv')
print(ddf.describe(include='all'))
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%y-%d-%m %H:%M')
The error I'm receiving is below. I'm assuming the last line is the most relevant piece, but for the life of me I cannot work out why the date format given doesn't match the format I'm specifying.
TypeError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py in _convert_listlike_datetimes(arg, box, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
290 try:
--> 291 values, tz = conversion.datetime_to_datetime64(arg)
292 return DatetimeIndex._simple_new(values, name=name, tz=tz)
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
....
ValueError: time data '2019-20-09 04:00' does not match format '%y-%d-%m %H:%M' (match)
Current DataFrame properties, using describe:
Dask DataFrame Structure:
Location Date Temperature RH
npartitions=1
float64 object float64 float64
... ... ... ...
Dask Name: describe, 971 tasks
Sample Data
+-----------+------------------+-------------+--------+
| Location | Date | Temperature | RH |
+-----------+------------------+-------------+--------+
| 1075 | 2019-20-09 04:00 | 6.8 | 99.3 |
| 1075 | 2019-20-09 05:00 | 6.4 | 100.0 |
| 1075 | 2019-20-09 06:00 | 6.7 | 99.3 |
| 1075 | 2019-20-09 07:00 | 8.6 | 95.4 |
| 1075 | 2019-20-09 08:00 | 12.2 | 76.0 |
+-----------+------------------+-------------+--------+
Try this:
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M', errors='coerce')
The key change is %Y (four-digit year) instead of %y, which matches dates like '2019-20-09 04:00'. errors='coerce' returns NaT wherever to_datetime fails (errors='ignore' would instead return the input unchanged).
For more detail visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
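A quick way to sanity-check the format string outside of Dask, using plain pandas on one sample value from the question:
import pandas as pd

# %Y takes 2019 as the year, %d takes 20 as the day, %m takes 09 as the month
ts = pd.to_datetime('2019-20-09 04:00', format='%Y-%d-%m %H:%M')
print(ts)  # 2019-09-20 04:00:00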

Get distinct rows by creation date

I am working with a dataframe like this:
DeviceNumber | CreationDate | Name
1001 | 1.1.2018 | Testdevice
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp
I am using Databricks and PySpark for the ETL process. How can I reduce the dataframe so that there is only a single row per "DeviceNumber", and that row is the one with the highest "CreationDate"? In this example I want the result to look like this:
DeviceNumber | CreationDate | Name
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp
You can create an additional dataframe with DeviceNumber and its latest/max CreationDate.
import pyspark.sql.functions as psf
max_df = df\
.groupBy('DeviceNumber')\
.agg(psf.max('CreationDate').alias('max_CreationDate'))
and then join max_df with the original dataframe:
joining_condition = [ df.DeviceNumber == max_df.DeviceNumber, df.CreationDate == max_df.max_CreationDate ]
df.join(max_df,joining_condition,'left_semi').show()
A left_semi join is useful when you want the second dataframe only as a lookup and do not need any columns from it.
You can use PySpark windowing functionality:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
# make sure that creation is a date data-type
df = df.withColumn('CreationDate', f.to_timestamp('CreationDate', format='dd.MM.yyyy'))
# partition on device and get a row number by (descending) date
win = Window.partitionBy('DeviceNumber').orderBy(f.col('CreationDate').desc())
df = df.withColumn('rownum', f.row_number().over(win))
# finally take the first row in each group
df.filter(df['rownum']==1).select('DeviceNumber', 'CreationDate', 'Name').show()
+------------+------------+------+
|DeviceNumber|CreationDate| Name|
+------------+------------+------+
| 1002| 2019-01-01| Lamp|
| 1001| 2019-06-30|Device|
+------------+------------+------+