Create a dask DataFrame from a list of dask Series

I need to create a dask DataFrame from a set of dask Series,
analogously to constructing a pandas DataFrame from lists:
pd.DataFrame({'l1': list1, 'l2': list2})
I am not seeing anything for this in the API. The dask DataFrame constructor is not supposed to be called by users directly and takes a computation graph as its main argument.

In general I agree that it would be nice for the dd.DataFrame constructor to behave like the pd.DataFrame constructor.
If your series have well defined divisions then you might try dask.dataframe.concat with axis=1.
You could also try converting one of the series into a DataFrame and then use assignment syntax:
L = [...]  # placeholder: your list of dask Series
df = L[0].to_frame()
for s in L[1:]:
    df[s.name] = s

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to a dataframe followed by a copy, or when a dataframe is saved to and loaded from disk. The following example contrasts the inconsistent hashes obtained from the dataframes with the consistent hashes obtained from the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utilize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so it will cache based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.
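One possible way to get a key that depends only on the data values is sketched below. Note that it does not use joblib.hash: it swaps in pandas.util.hash_pandas_object together with hashlib, and the helper name frame_digest is hypothetical.

```python
import hashlib
import pandas as pd

def frame_digest(df):
    """Hypothetical helper: digest built from labels and values only."""
    h = hashlib.sha256()
    # Include column labels and the index name so renames change the digest.
    h.update(repr((list(df.columns), df.index.name)).encode("utf-8"))
    # One uint64 hash per row, covering the row's values and its index entry.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    h.update(row_hashes.values.tobytes())
    return h.hexdigest()

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4., 5., 6.]})
df.index.name = 'MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()

print(frame_digest(df) == frame_digest(df_copy))
```

Such a digest could then be passed to the cached function as an explicit argument while the dataframe itself is excluded from the key via the ignore parameter of Memory.cache; whether that fits depends on how the cached function is called.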

Can't convert Matrix to DataFrame JULIA

How can I convert a matrix to a DataFrame in Julia?
I have a 10×2 Matrix{Any}, and when I try to convert it to a DataFrame using this:
df2 = convert(DataFrame,Xt2)
I get this error:
MethodError: Cannot `convert` an object of type Matrix{Any} to an object of type DataFrame
Try instead
df2 = DataFrame(Xt2,:auto)
You cannot use convert for this; you can use the DataFrame constructor, but then as the documentation (simply type ? DataFrame in the Julia REPL) will tell you, you need to either provide a vector of column names, or :auto to auto-generate column names.
Tangentially, I would also strongly recommend avoiding Matrix{Any} (or really anything involving Any) for any scenario where performance is at all important.

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data formatted so that each row is a single ID/time/data observation. This means there is not one row per ID: each ID has many rows across time.
I am trying to use dask delayed to have a function run on an entire ID sequence (it makes sense that the operation should be able to run on each individual ID at the same time since they don't affect each other). To do this I am first looping through each of the ID tags, pulling/locating all the data from that ID (with .loc in pandas, so it is a separate "mini" df), then delaying the function call on the mini df, adding a column with the delayed values and adding it to a list of all mini dfs. At the end of the for loop I want to call dask.compute() on all the mini-dfs at once but for some reason the mini df's values are still delayed. Below I will post some pseudocode about what I just tried to explain.
I have a feeling that this may not be the best way to go about this, but it's what made sense at the time and I can't understand what's wrong, so any help would be very much appreciated.
Here is what I am trying to do:
list_of_mini_dfs = []
for id in big_df:
    curr_df = big_df.loc[big_df['id'] == id]
    curr_df['new value 1'] = dask.delayed(myfunc)(args1)
    curr_df['new value 2'] = dask.delayed(myfunc)(args2) #same func as previous line
    list_of_mini_dfs.append(curr_df)
list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()
# Concat all mini dfs into new big df.
As you can see by the code I have to reach into my big/overall dataframe to pull out each ID's sequence of data since it is interspersed throughout the rows. I want to be able to call a delayed function on that single ID's data and then return the values from the function call into the big/overall dataframe.
Currently this method is not working. When I concat all the mini dataframes back together, the two values I delayed are still delayed, which leads me to think the problem lies in how I am delaying a function within a df and then trying to compute the list of dataframes. I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np
# generate big_df
dates = pd.date_range(start='2019-01-01',
                      end='2020-01-01')
l = len(dates)
out = []
for i in range(1000):
    df = pd.DataFrame({"ID": [i]*l,
                       "date": dates,
                       "data0": np.random.randn(l),
                       "data1": np.random.randn(l)})
    out.append(df)
big_df = pd.concat(out, ignore_index=True)\
           .sample(frac=1)\
           .reset_index(drop=True)
Now you want to apply your function fun on columns data0 and data1
Pandas
out = big_df.groupby("ID")[["data0","data1"]]\
            .apply(fun)\
            .reset_index()
df_pd = pd.merge(big_df, out, how="left", on="ID")
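The pandas snippet above leaves fun undefined. As a minimal sketch, assuming fun reduces each column of a group to a single number (here the mean, chosen only for illustration), the whole pattern looks like this on a tiny stand-in for big_df. The rename step is an addition so that the merged columns do not clash with the originals.

```python
import pandas as pd

# Tiny stand-in for big_df: two IDs with interleaved rows.
big_df = pd.DataFrame({
    "ID": [0, 1, 0, 1],
    "data0": [1.0, 10.0, 3.0, 30.0],
    "data1": [2.0, 20.0, 4.0, 40.0],
})

def fun(group):
    # Hypothetical per-ID function: one value per column.
    return group.mean()

out = (
    big_df.groupby("ID")[["data0", "data1"]]
    .apply(fun)
    .rename(columns={"data0": "new_values0", "data1": "new_values1"})
    .reset_index()
)

# Broadcast the per-ID results back onto every row of that ID.
df_pd = pd.merge(big_df, out, how="left", on="ID")
print(df_pd)
```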
Dask
df = dd.from_pandas(big_df, npartitions=4)
out = df.groupby("ID")[["data0","data1"]]\
        .apply(fun, meta={'data0':'f8',
                          'data1':'f8'})\
        .rename(columns={'data0': 'new_values0',
                         'data1': 'new_values1'})\
        .compute() # Here you need to compute otherwise you'll get NaNs
df_dask = dd.merge(df, out,
                   how="left",
                   left_on=["ID"],
                   right_index=True)
The dask version is not necessarily faster than the pandas one, in particular if your df fits in RAM.
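For the dask.delayed approach described in the question, one possible fix is to delay a function that returns the whole mini DataFrame, rather than assigning a Delayed object to a column, and then compute all tasks together. The sketch below assumes a small illustrative big_df and uses a hypothetical add_running_total in place of myfunc.

```python
import dask
import pandas as pd

# Illustrative data: rows for each id are interleaved.
big_df = pd.DataFrame({
    "id": [1, 2, 1, 2, 1],
    "data": [1.0, 10.0, 2.0, 20.0, 3.0],
})

def add_running_total(mini_df):
    # Hypothetical stand-in for myfunc: returns the mini df with a new column.
    mini_df = mini_df.copy()
    mini_df["new value 1"] = mini_df["data"].cumsum()
    return mini_df

# One delayed task per id; nothing runs until dask.compute is called.
tasks = [dask.delayed(add_running_total)(mini_df)
         for _, mini_df in big_df.groupby("id")]
mini_dfs = dask.compute(*tasks)  # tuple of ordinary pandas DataFrames

# Reassemble and restore the original row order.
new_big_df = pd.concat(mini_dfs).sort_index()
print(new_big_df)
```

Because each task returns a complete DataFrame, the computed results contain plain values rather than Delayed objects.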

Create Spark DataFrame from Pandas DataFrames inside RDD

I'm trying to convert a Pandas DataFrame on each worker node (an RDD where each element is a Pandas DataFrame) into a Spark DataFrame across all worker nodes.
Example:
def read_file_and_process_with_pandas(filename):
    data = pd.read_csv(filename)
    """
    some additional operations using pandas functionality
    here the data is a pandas dataframe, and I am using some datetime
    indexing which isn't available for spark dataframes
    """
    return data
filelist = ['file1.csv','file2.csv','file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)
The previous operations work, so I have an RDD of Pandas DataFrames. How can I convert this then into a Spark DataFrame after I'm done with the Pandas processing?
I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there a way to convert Pandas DataFrames in each worker node into a distributed DataFrame?
See this question: https://stackoverflow.com/a/51231046/7964197
I've had to deal with the same problem, which seems quite common (reading many files using pandas, e.g. excel/pickle/any other non-spark format, and converting the resulting RDD into a spark dataframe).
The supplied code adds a new method on the SparkSession that uses pyarrow to convert the pd.DataFrame objects into arrow record batches, which are then directly converted to a pyspark.DataFrame object:
spark_df = spark.createFromPandasDataframesRDD(prdd) # prdd is an RDD of pd.DataFrame objects
For large amounts of data, this is orders of magnitude faster than converting to an RDD of Row() objects.
Pandas dataframes cannot be converted directly to an RDD.
You can create a Spark DataFrame from a pandas DataFrame:
spark_df = context.createDataFrame(pandas_df)
Reference: Introducing DataFrames in Apache Spark for Large Scale Data Science

Applying functions to DataFrame columns in plots

I'd like to apply functions to columns of a DataFrame when plotting them.
I understand that the standard way to plot when using Pandas is the .plot method.
How can I do math operations within this method, say for example multiply two columns in the plot?
Thanks!
Series actually have a plot method as well, so it should work to apply
(df['col1'] * df['col2']).plot()
Otherwise, if you need to do this more than once it would be the usual thing to make a new column in your dataframe:
df['newcol'] = df['col1'] * df['col2']
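Both variants can be sketched together as below. This is a minimal example with made-up data; the non-interactive Agg backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# Plot the product directly; the expression yields a Series.
ax = (df["col1"] * df["col2"]).plot()
ax.set_ylabel("col1 * col2")

# Or store the product as a column and plot it by name.
df["newcol"] = df["col1"] * df["col2"]
ax2 = df.plot(y="newcol")
```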