Create a Dask DataFrame from a Dask Series of dicts

Everything is in the question.
When I did this pre-processing step with pandas, I only had to do this:
import pandas as pd
serie = pd.Series([{"Event1": 1, "Event42": 1}, {"Event2": 1, "Event5": 1}], name="events")
df = pd.DataFrame(serie.tolist())
Now I have my Dask DataFrame with a Series containing dicts, and I would like to obtain a DataFrame where each column is a key of my dicts.

Related

Convert multiple downloaded time series shares to a pandas dataframe

I downloaded the information about multiple shares using the nsepy library for the last 10 days, but could not save it in a pandas dataframe.
Below is the code to download the multiple share data:
import datetime
from datetime import date
from nsepy import get_history
import pandas as pd
symbol = ['SBIN', 'GAIL', 'NATIONALUM']
data = {}
for s in symbol:
    data[s] = get_history(s, start=date(2022, 11, 29), end=date(2022, 12, 12))
Below is the code used to convert the data to a pandas DataFrame, but I am getting an error:
new = pd.DataFrame(data, index=[0])
new
error message:
ValueError: Shape of passed values is (14, 3), indices imply (1, 3)
The documentation of get_history says:
Returns:
pandas.DataFrame : A pandas dataframe object
Thus, data is a dict with the symbols as keys and the pd.DataFrames as values. Then you are trying to insert a DataFrame inside another DataFrame, which does not work. If you want to create a new MultiIndex DataFrame from the 3 existing DataFrames, you can do something like this:
result = {}
for symbol, df in data.items():
    for key, value in df.to_dict().items():
        result[(symbol, key)] = value
df_multi = pd.DataFrame(result)
df_multi.columns
Result (just showing two columns per symbol to clarify the MultiIndex structure):
MultiIndex([(      'SBIN', 'Symbol'),
            (      'SBIN', 'Series'),
            (      'GAIL', 'Symbol'),
            (      'GAIL', 'Series'),
            ('NATIONALUM', 'Symbol'),
            ('NATIONALUM', 'Series')])
Edit
So if you just want a single-index DF, like in your attached file with the symbols in a column, you can simply do this:
new_df = pd.DataFrame()
for symbol in data:
    # sequentially concat the DataFrames from your dict of DataFrames
    new_df = pd.concat([data[symbol], new_df], axis=0)
new_df
Then the output looks like the one in your file.
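As a side note, pandas can also concatenate a dict of DataFrames in a single call and key the rows by symbol. A minimal sketch, using small stand-in frames in place of the ones returned by get_history:

```python
import pandas as pd

# Hypothetical stand-ins for the frames returned by get_history
data = {
    "SBIN": pd.DataFrame({"Close": [600.0, 605.5]}),
    "GAIL": pd.DataFrame({"Close": [90.0, 91.2]}),
}

# Concatenating a dict of DataFrames keys the rows by symbol,
# producing a two-level index we can name explicitly
new_df = pd.concat(data, names=["Symbol", "Row"])

# Turn the symbol level into an ordinary column
new_df = new_df.reset_index(level="Symbol")
```

This avoids growing the result inside a loop with repeated pd.concat calls.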

Group by in the pandas API on Spark

I have a pandas dataframe below,
data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
Here df is a pandas dataframe.
I am trying to convert this dataframe to the pandas API on Spark:
import pyspark.pandas as ps
pdf = ps.from_pandas(df)
print(type(pdf))
Now the dataframe type is <class 'pyspark.pandas.frame.DataFrame'>.
Now I am applying the groupby function on pdf like below:
for i, j in pdf.groupby("Team"):
    print(i)
    print(j)
I am getting an error like the one below:
KeyError: (0,)
I am not sure whether this functionality works in the pandas API on Spark?
Pyspark pandas does not implement all functionality as-is, because Spark has a distributed architecture. Hence operations like row-wise iteration can behave differently.
If you want to print the groups, then pyspark pandas code:
pdf.groupby("Team").apply(lambda g: print(f"{g.Team.values[0]}\n{g}"))
is equivalent to pandas code:
for name, sub_grp in df.groupby("Team"):
    print(name)
    print(sub_grp)
Reference to source code
If you scan the source code, you will find that there is no __iter__() implementation for pyspark pandas: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/pandas/groupby.html
but the iterator yields (group_name, sub_group) for pandas: https://github.com/pandas-dev/pandas/blob/v1.5.1/pandas/core/groupby/groupby.py#L816
Documentation reference to iterate groups
pyspark pandas : https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html?highlight=groupby#indexing-iteration
pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups
If you just want to see the groups, you can print the results of the generator in pandas:
for i in df.groupby("Team"):
    print(i)
Note that the equivalent loop over the pyspark pandas dataframe (for i in pdf.groupby("Team")) will not work, since group iteration is not implemented there.
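If group iteration is really needed, a workaround that avoids __iter__ entirely is to loop over the distinct keys and filter. The sketch below is plain pandas; the same pattern should carry over to pyspark.pandas (collecting the keys first, e.g. with to_numpy()), at the cost of one distributed filter per group:

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Kings"],
    "Points": [876, 863, 789, 741],
})

# Collect the distinct keys once, then filter the frame per key;
# no group iterator is required
for name in df["Team"].unique():
    sub_grp = df[df["Team"] == name]
    print(name)
    print(sub_grp)
```

This trades one pass over the data per group for compatibility with engines that do not implement group iteration.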

Converting dict to dataframe of Solution point values & plotting

I am trying to plot some results obtained after optimisation using Gurobi.
I have converted the dictionary to a pandas dataframe; it is 96×1.
But now how do I use this dataframe to plot it as 1st row vs. value, 2nd row vs. value? I am attaching a snapshot of the same.
Can anyone please help me with this?
x = {}
for t in time1:
    x[t] = [price_energy[t-1] * EnergyResource[174, t].X]
df = pd.DataFrame.from_dict(x, orient='index')
df
You can try pandas.DataFrame(data=x.values()) to properly create a pandas DataFrame while using row numbers as indices.
In the example below, I have generated a (pseudo) random dictionary with 10 values and stored it as a data frame using pandas.DataFrame, naming the only column xyz. To understand how indexing works, please see Indexing and selecting data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Create a dictionary 'x'
rng = np.random.default_rng(121)
x = dict(zip(np.arange(10), rng.random((1, 10))[0]))
# Create a dataframe from 'x'
df = pd.DataFrame(x.values(), index=x.keys(), columns=["xyz"])
print(df)
print(df.index)
# Plot the dataframe
plt.plot(df.index, df.xyz)
plt.show()
This prints df as:
        xyz
0  0.632816
1  0.297902
2  0.824260
3  0.580722
4  0.593562
5  0.793063
6  0.444513
7  0.386832
8  0.214222
9  0.029993
and gives df.index as:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
and also plots the figure:

How to replicate the between_time function of Pandas in PySpark

I want to replicate the between_time function of Pandas in PySpark.
Is it possible since in Spark the dataframe is distributed and there is no indexing based on datetime?
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts.between_time('0:45', '0:15')
Is something similar possible in PySpark?
pandas.between_time - API
If you have a timestamp column, say ts, in a Spark dataframe, then for a window such as 0:15 to 0:45 you can use:
import pyspark.sql.functions as F
df2 = df.filter(F.hour(F.col('ts')).between(0, 0) & F.minute(F.col('ts')).between(15, 45))
Note that between_time('0:45', '0:15'), with start later than end, selects the times outside that interval, so the condition would have to be inverted to match the exact call in the question.
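One subtlety worth noting: in the question's call the start '0:45' is later than the end '0:15', and pandas then selects the times outside the interval. A small pandas sketch of the semantics the Spark filter has to reproduce:

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="1D20min")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# start ('0:45') later than end ('0:15'): pandas keeps the times
# OUTSIDE the 0:15-0:45 window
inverted = ts.between_time("0:45", "0:15")

# Equivalent boolean filter on minutes-of-day, which is the
# condition the Spark filter has to express
minutes = ts.index.hour * 60 + ts.index.minute
manual = ts[(minutes >= 45) | (minutes <= 15)]
```

In Spark the same condition could presumably be built from a minutes-of-day expression, e.g. F.hour('ts') * 60 + F.minute('ts'), combined with | rather than &.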

How to access the dask dataframe index value in map_partitions?

I am trying to use Dask dataframe map_partitions to apply a function which accesses the value in the dataframe index row-wise and creates a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
df
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy=str(df.index)), meta={'index_copy': 'U'})
res.compute()
I am expecting df.index to be the value in the row index, not the entire partition index, which it seems to refer to. From the doc here, this works well for columns but not for the index.
What you want to do is this:
df.index = ['row' + str(x) for x in df.index]
First create your pandas dataframe, then run this code; after that you will have your expected result.
Let me know if this works for you.