How to replicate the between_time function of Pandas in PySpark

I want to replicate the between_time function of Pandas in PySpark.
Is this possible, given that in Spark the dataframe is distributed and there is no datetime-based indexing?
import pandas as pd

i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts.between_time('0:45', '0:15')
Is something similar possible in PySpark?
Reference: pandas.DataFrame.between_time (API documentation)

If you have a timestamp column, say ts, in a Spark dataframe, then for your case above, you can just use
import pyspark.sql.functions as F
df2 = df.filter(F.hour(F.col('ts')).between(0, 0) & F.minute(F.col('ts')).between(15, 45))
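A more general sketch (not from the original answer, and it assumes the same df with a timestamp column ts): compare the time of day as a zero-padded 'HH:mm' string, which sorts correctly as text. Note that pandas between_time('0:45', '0:15'), with the start later than the end, selects the times outside the 00:15-00:45 window (it wraps past midnight), so the equivalent Spark filter uses OR rather than AND:
import pyspark.sql.functions as F

# Time of day as a zero-padded 'HH:mm' string; such strings compare correctly as text
tod = F.date_format(F.col('ts'), 'HH:mm')

# Equivalent of ts.between_time('0:15', '0:45'): keep rows inside the window
df_inside = df.filter((tod >= '00:15') & (tod <= '00:45'))

# Equivalent of ts.between_time('0:45', '0:15'): start > end wraps past midnight,
# so keep rows outside the 00:15-00:45 window instead
df_wrapped = df.filter((tod >= '00:45') | (tod <= '00:15'))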

Related

group by in pandas API on spark

I have a pandas dataframe below,
import pandas as pd

data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
Here df is a Pandas dataframe.
I am trying to convert this dataframe to the pandas API on Spark:
import pyspark.pandas as ps
pdf = ps.from_pandas(df)
print(type(pdf))
Now the dataframe type is <class 'pyspark.pandas.frame.DataFrame'>.
Now I am applying a groupby on pdf like below:
for i, j in pdf.groupby("Team"):
    print(i)
    print(j)
I am getting an error like:
KeyError: (0,)
Will this functionality work with the pandas API on Spark?
The pandas API on Spark does not implement every pandas feature as-is, because Spark has a distributed architecture; operations such as row-wise iteration therefore do not translate directly.
If you want to print the groups, then the pandas-on-Spark code:
pdf.groupby("Team").apply(lambda g: print(f"{g.Team.values[0]}\n{g}"))
is equivalent to pandas code:
for name, sub_grp in df.groupby("Team"):
    print(name)
    print(sub_grp)
Reference to source code
If you scan the source code, you will find that there is no __iter__() implementation for the pyspark pandas GroupBy (https://spark.apache.org/docs/latest/api/python/_modules/pyspark/pandas/groupby.html); this is most likely why the loop above fails with KeyError: (0,), since Python falls back to __getitem__, which pyspark pandas uses for column selection.
The pandas iterator, by contrast, yields (group_name, sub_group) pairs: https://github.com/pandas-dev/pandas/blob/v1.5.1/pandas/core/groupby/groupby.py#L816
Documentation references for iterating over groups
pyspark pandas : https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html?highlight=groupby#indexing-iteration
pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups
If you just want to see the groups, iterate over the groupby generator in plain pandas:
for i in df.groupby("Team"):
    print(i)
Note that the same loop over the pandas-on-Spark frame,
for i in pdf.groupby("Team"):
    print(i)
will not work, because (as noted above) the pandas-on-Spark GroupBy does not support iteration.
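As an alternative sketch (my assumption, not taken from either answer above): if you want the groups on the driver without relying on print side effects inside apply (those run on the executors), collect the distinct group keys and filter per key:
import pyspark.pandas as ps

pdf = ps.from_pandas(df)
# Collect the distinct team names to the driver, then materialize one group at a time
for team in pdf["Team"].unique().to_numpy():
    sub_grp = pdf[pdf["Team"] == team]
    print(team)
    print(sub_grp)
This launches one Spark job per group, so it is only practical when the number of groups is small.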

How to check for Range of Values (domain) in a Dataframe?

So I want to determine which values occur in a Pandas DataFrame:
import pandas as pd
d = {'col1': [1,2,3,4,5,6,7], 'col2': [3, 4, 3, 5, 7,22,3]}
df = pd.DataFrame(data=d)
col2 has the unique values 3, 4, 5, 7, 22 (its domain). Each value that occurs should be listed, but only once.
Is there any way to quickly extract the domain of a Pandas DataFrame column?
Use df.max() and df.min() to find the range.
print(df["col2"].unique())
by Andrej Kesely is the solution. Perfect!
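For completeness, a small sketch of the options above on the example data, showing both the domain and the min/max range:
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7], 'col2': [3, 4, 3, 5, 7, 22, 3]}
df = pd.DataFrame(data=d)

print(df["col2"].unique())                  # [ 3  4  5  7 22] -> distinct values, in order of appearance
print(sorted(df["col2"].unique().tolist())) # [3, 4, 5, 7, 22]
print(df["col2"].nunique())                 # 5 -> number of distinct values
print(df["col2"].min(), df["col2"].max())   # 3 22 -> the min/max range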

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get back only the first result (based on the index), like this in pandas:
df.loc[df.col_1 > 3].iloc[0]
col_1 col_2
2 4 d
I know there is no positional row indexing in dask using iloc, but I wonder if it would be possible to limit the query to 1 result like in SQL?
Got it, but I'm not sure about the efficiency here:
tmp = df.loc[df.col_1 > 3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()
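A possibly simpler alternative (a sketch, not benchmarked): take the first row of the filtered frame with head(). By default dask's head() only reads the first partition, which may be empty after filtering, so npartitions=-1 tells it to keep reading partitions until it has enough rows; with a sorted index (as produced by from_pandas) partition order matches index order:
# First row of the filtered frame; returns a pandas DataFrame because compute=True by default
first = df.loc[df.col_1 > 3].head(1, npartitions=-1)
print(first)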

Streamlit + Panda Dataframe: Change the way date gets displayed

I am using streamlit to render a pandas dataframe with st.table(dataframe). Right now dates in the dataframe get displayed like this:
https://ibb.co/dKq2KQm
I would like to display it in this way: 2020-12-30 00:30.
Is there any way to change this?
Thanks a lot!
You could use the pandas Styler and Python's strftime method to achieve what you need.
This is an example:
import streamlit as st
import pandas as pd
df = pd.DataFrame({'date': ['2015-01-05', '2015-01-06', '2015-01-07'],
                   'goal': [4, 2.1, 5.9],
                   'actual': [8, 5.1, 7.7]})
df['date'] = pd.to_datetime(df['date'])
df = df.style.format({'date': lambda x: x.strftime('%m/%d/%Y %H:%M:%S')})
st.dataframe(df)
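For the exact format asked for in the question (2020-12-30 00:30), the format string would be '%Y-%m-%d %H:%M'. A minimal self-contained sketch (the sample data here is made up):
import streamlit as st
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-12-30 00:30', '2020-12-31 12:45'])})
styled = df.style.format({'date': lambda x: x.strftime('%Y-%m-%d %H:%M')})
st.table(styled)  # st.table as in the question; st.dataframe(styled) should work too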

How to access the dask dataframe index value in map_partitions?

I am trying to use dask dataframe map_partitions to apply a function which accesses the value in the dataframe index row-wise and creates a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy=str(df.index)), meta={'index_copy': 'U'})
res.compute()
I am expecting df.index to be the value in the row's index, not the entire partition index, which it seems to refer to. From the doc here, this works well for columns but not for the index.
What you want to do is this:
df.index = ['row' + str(x) for x in df.index]
First create your pandas dataframe, then run this code; after that you will have your expected result.
Let me know if this works for you.
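For reference, a minimal runnable sketch of one way to copy each row's own index label into a column inside map_partitions (this is my reading of the intended result, not part of the answer above): convert the partition's index element-wise rather than stringifying the whole Index object.
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
ddf = dd.from_pandas(df, npartitions=2)

# astype(str) converts each index label individually, so index_copy holds the
# per-row value instead of the repr of the entire partition index
res = ddf.map_partitions(
    lambda pdf: pdf.assign(index_copy=pdf.index.astype(str)),
    meta={'index_copy': 'U'},
)
print(res.compute())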