Is there a way to speed up the conversion of a Spark dataframe to a pandas dataframe?

I tried to convert a Spark dataframe to pandas in a Databricks notebook with PySpark. It takes forever to run. Is there a better way to do this? There are more than 600,000 rows.
df_PD = sparkDF.toPandas()

Can you try changing your import statement and using the Pandas API on Spark? Note that a plain Spark dataframe has no to_pandas() method; in recent PySpark versions you first convert it to a pandas-on-Spark dataframe:
import pyspark.pandas as ps
df_PD = sparkDF.pandas_api().to_pandas()
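Often the bigger win, though, is enabling Arrow-based conversion before calling toPandas(), which replaces the slow row-by-row serialization with a columnar transfer. A minimal sketch; on Spark 3.x the config key is spark.sql.execution.arrow.pyspark.enabled (older 2.x versions used spark.sql.execution.arrow.enabled):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_PD = sparkDF.toPandas()  # the JVM-to-Python transfer now goes through Arrow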

Related

Dataframe conversion from pandas to polars -- difference in the final dimensions

I'm trying to convert a Pandas Dataframe to a Polars one.
I simply used the function result_polars = pl.from_pandas(result). The conversion proceeds without errors, but when I check the shape of the two dataframes I get that the Polars one has half the size of the original Pandas Dataframe.
I believe that a length of 4172903059 is close to the maximum dimension that the Polars dataframe allows.
Does anyone have suggestions?
Here is a minimal working example:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4292903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
With these dimensions the two dataframes have the same size. If instead I use the following:
import polars as pl
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros((4392903069,1), dtype=np.uint8))
df_polars = pl.from_pandas(df)
The Polars dataframe has a much smaller length (97935773).
The default Polars wheel retrieved with pip install polars "only" allows for 2^32, i.e. about 4.3 billion, rows; lengths beyond that wrap around modulo 2^32, which is why you see 97935773 (= 4392903069 - 2^32).
If you need more rows than that, pip install polars-u64-idx and uninstall the previous installation.
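You can verify the wraparound directly; a quick check in plain Python:
# a 32-bit row index wraps modulo 2**32
print(4392903069 % 2**32)  # 97935773, the length observed in the question
print(4292903069 % 2**32)  # 4292903069, still below 2**32, so no wraparound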

How can I convert csv to pickle?

I have some csv files which take quite a while to load as dataframes into my workspace. Is there a fast and easy way to convert them to pickle so they load faster?
After you load the data using Pandas, use the following:
import pandas as pd
df = pd.read_csv('/Drive Path/data.csv')    # load the csv once (adjust the path)
df.to_pickle('/Drive Path/df.pkl')          # save the dataframe df to df.pkl
df1 = pd.read_pickle('/Drive Path/df.pkl')  # load df.pkl back into a dataframe

Can spark dataframe (scala) be converted to dataframe in pandas (python)

The DataFrame is created using the Scala API for Spark:
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
I want to convert this to a Pandas DataFrame.
PySpark provides .toPandas() to convert a Spark dataframe to pandas, but there is no equivalent for Scala (that I can find).
Please help me in this regard.
To convert a Spark DataFrame into a Pandas DataFrame, you can set spark.sql.execution.arrow.enabled to true, create the DataFrame using Spark, and then convert it to a Pandas DataFrame with Arrow:
Enable Arrow: spark.conf.set("spark.sql.execution.arrow.enabled", "true")
Create the DataFrame using Spark like you did:
val someDF = spark.createDataFrame()
Convert it to a pandas DataFrame (this step is PySpark, not Scala):
result_pdf = someDF.select("*").toPandas()
The command above runs using Arrow because the config spark.sql.execution.arrow.enabled is set to true.
Hope this helps!
In Spark, a DataFrame is just an abstraction over data, and the most common data sources are files on a file system. When you convert a dataframe to Pandas in PySpark, PySpark simply converts one Python abstraction over the data into another. You can't do that conversion in Scala, because Pandas is a Python library for working with data, and bridging it from Scala would mean wrestling with Python/Scala integration. The simplest thing you can do here:
Write the dataframe to the file system from Scala Spark.
Read the data from the file system using Pandas.
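As a minimal sketch of that hand-off, assuming the Scala side wrote the dataframe as Parquet with someDF.write.parquet("/tmp/some_df") (path and format chosen here for illustration), the Python side is just:
import pandas as pd
# read the Parquet directory the Scala job wrote; needs pyarrow or fastparquet installed
df = pd.read_parquet("/tmp/some_df")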

create a dask dataframe from a dictionary

I have a dictionary like this:
d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'}
I want to create a dask dataframe from it. How do I do it? Normally, in Pandas, it can easily be turned into a Pandas df by:
df = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens,
                   'LocalFreq': local_freq_list, 'CorpusFreq': corpus_freq_list})
Should I first load into a bag and then convert from bag to ddf?
If your data fits in memory then I encourage you to use Pandas instead of Dask Dataframe.
If for some reason you still want to use Dask dataframe then I would convert things to a Pandas dataframe and then use the dask.dataframe.from_pandas function.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(...)
ddf = dd.from_pandas(df, npartitions=20)
But there are many cases where this will be slower than just using Pandas well.
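Applied to the dictionary in the question, that looks like the following (the lists here are stand-ins, since the real cap_list, unique_tokens, etc. aren't shown):
import dask.dataframe as dd
import pandas as pd

# stand-in data for the lists named in the question
cap_list = ['A', 'B', 'C']
unique_tokens = ['foo', 'bar', 'baz']
local_freq_list = [1, 2, 3]
corpus_freq_list = [10, 20, 30]

df = pd.DataFrame({'Caps': cap_list, 'Term': unique_tokens,
                   'LocalFreq': local_freq_list, 'CorpusFreq': corpus_freq_list})
ddf = dd.from_pandas(df, npartitions=1)  # tiny data, so a single partition suffices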

Convert IEX Finance API data to pandas dataframe

I want to pull data from the IEX finance api and put it into a pandas dataframe but I don't know the correct code. Can someone help?
URL call for the API:
https://api.iextrading.com/1.0/stock/aapl/chart/1d?chartInterval=5
I tried the code below but it doesn't work:
import pandas as pd
api_call = 'https://api.iextrading.com/1.0/stock/aapl/chart/1d?chartInterval=5'
price = pd.read_csv(api_call)
The data is in JSON format, not CSV, which is why read_csv fails. To load it into a dataframe you have to call the read_json function:
import pandas as pd
df = pd.read_json("https://api.iextrading.com/1.0/stock/aapl/chart/1d?chartInterval=5")
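Equivalently, if you want to inspect the payload first, you can fetch it with requests and build the frame yourself; a small sketch (the endpoint returns a JSON list of objects, one per interval):
import pandas as pd
import requests

url = "https://api.iextrading.com/1.0/stock/aapl/chart/1d?chartInterval=5"
records = requests.get(url).json()  # list of dicts, one row per interval
df = pd.DataFrame(records)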