The Spark CSV readers are not as flexible as pandas.read_csv and do not seem to be able to handle dates in different formats, etc. Is there a good way of passing pandas DataFrames to Spark DataFrames in an ETL map step? Spark's createDataFrame does not always appear to work; perhaps the type system has not been mapped exhaustively? Paratext looks promising, but it is likely new and not yet heavily used.
For example here: Get CSV to Spark dataframe
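For what it's worth, the kind of ETL step I have in mind looks roughly like this: let pandas do the flexible parsing, then hand the frame to Spark with an explicit schema so nothing is left to type inference. The file name, column names, and schema below are just illustrative:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.getOrCreate()

# pandas handles the messy date parsing...
pdf = pd.read_csv("input.csv", parse_dates=["event_date"])

# ...then the result is handed to Spark with an explicit schema,
# so nothing depends on Spark's type inference
schema = StructType([
    StructField("event_date", TimestampType()),
    StructField("name", StringType()),
    StructField("value", DoubleType()),
])
sdf = spark.createDataFrame(pdf, schema=schema)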
I want to save several Polars dataframes into one file, and later query that file with a filter on a timestamp (datetime) column. I don't need to load the whole file into memory, only the filtered part.
I see that the Polars API lists Feather/IPC and Parquet files, which in theory can do this, but I don't know how to read these files in Polars with a filter on the data.
With pandas I previously used the HDF5 format and it was very clear, but I have no experience with these formats, which are new to me. Maybe you can help me do this most effectively.
I have a Polars df and I want to save it into a Parquet file. And then the next df, and the next.
The code df.write_parquet("path.parquet") just overwrites the file. How can I do this in Polars?
Polars does not support appending to Parquet files, and most tools do not, see for example this SO post.
Your best bet would be to convert the dataframe to an Arrow table using .to_arrow() and use pyarrow.dataset.write_dataset. In particular, see the comment on the parameter existing_data_behavior. Still, that requires organizing your data in partitions, which effectively means you have a separate Parquet file per partition, stored in the same directory. So each df you have becomes its own Parquet file, and you abstract away from that on the read side. Polars does not support writing partitions as far as I'm aware, though there is support for reading them; see the source argument in pl.read_parquet.
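A rough sketch of that write/read pattern; the directory name "dataset_dir", the column names, and the cutoff date are illustrative. The unique basename_template is there so successive writes don't overwrite each other's files:
import uuid
from datetime import datetime

import polars as pl
import pyarrow.dataset as ds

df = pl.DataFrame({"ts": [datetime(2023, 1, 1), datetime(2023, 6, 1)], "value": [1.0, 2.0]})

# Each call adds new files to the directory instead of replacing it
ds.write_dataset(
    df.to_arrow(),
    "dataset_dir",
    format="parquet",
    existing_data_behavior="overwrite_or_ignore",
    basename_template=f"part-{{i}}-{uuid.uuid4()}.parquet",
)

# Read back lazily with a predicate on the timestamp column;
# only the matching data is materialized
filtered = (
    pl.scan_parquet("dataset_dir/*.parquet")
    .filter(pl.col("ts") >= datetime(2023, 3, 1))
    .collect()
)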
I plan to:
join
group by
filter
data using pyarrow (I'm new to it). The idea is to get better performance and memory utilisation (Apache Arrow compression) compared to pandas.
It seems that pyarrow has no support for joining two Tables / Datasets by key, so I have to fall back to pandas.
I don't really follow how the pyarrow <-> pandas integration works. Will pandas rely on the Apache Arrow data structures? I'm fine with using only these types:
string
long
decimal
I have a feeling that pandas will copy all the data from Apache Arrow and double the memory usage (according to the docs).
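For reference, the pandas fallback I have in mind looks roughly like this (the column names and data are just illustrative):
import pyarrow as pa

# Two Arrow tables I would like to join by "key"
left = pa.table({"key": [1, 2, 3], "name": ["a", "b", "c"]})
right = pa.table({"key": [1, 2, 4], "amount": [10.0, 20.0, 30.0]})

# Fallback: convert to pandas, join there, convert back.
# to_pandas() generally materializes NumPy-backed copies, which is where
# the "double the memory" worry above comes from.
joined = pa.Table.from_pandas(
    left.to_pandas().merge(right.to_pandas(), on="key", how="inner")
)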
pyarrow itself doesn't provide these capabilities to the end user but is rather meant as a library that can be used by DataFrame library developers as the base. Thus the intention is not that you as a DataFrame user switch one day to using pyarrow directly but that libraries like pandas use Arrow as a backend.
This is already happening with the new ArrowStringType introduced in pandas 1.2 (not yet really functional), or with the fletcher library, which makes it possible to use pyarrow as the backend for a selection of the columns of your pandas.DataFrame through pandas' ExtensionArray interface.
Disclaimer: I'm the main author of fletcher.
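As a small illustration of the Arrow-as-backend idea, assuming a recent pandas version where the pyarrow-backed string dtype is available (the data here is made up):
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob", None]})
# Store the column in a pyarrow-backed array instead of a NumPy object array
df["name"] = df["name"].astype("string[pyarrow]")
print(df["name"].dtype)  # string[pyarrow]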
Is there an implementation for Python pandas that caches the data on disk so I can avoid reproducing it every time?
In particular, is there a caching method for get_yahoo_data for financial data?
A big plus would be:
very few lines of code to write
possibility to integrate the persisted series when new data is downloaded for the same source
There are many ways to achieve this; however, probably the easiest is to use the built-in methods for writing and reading Python pickles. You can use pandas.DataFrame.to_pickle to store the DataFrame to disk and pandas.read_pickle to read the stored DataFrame from disk.
An example for a pandas.DataFrame:
# Store your DataFrame
df.to_pickle('cached_dataframe.pkl') # will be stored in current directory
# Read your DataFrame
df = pandas.read_pickle('cached_dataframe.pkl') # read from current directory
The same methods also work for pandas.Series:
# Store your Series
series.to_pickle('cached_series.pkl') # will be stored in current directory
# Read your Series
series = pandas.read_pickle('cached_series.pkl') # read from current directory
You could use the Data cache package.
from data_cache import pandas_cache
@pandas_cache
def foo():
...
Depending on your requirements, there are a dozen methods to do this, to and fro: CSV, Excel, JSON, Python pickle format, HDF5, and even SQL with a database, etc.
In terms of code, writing and reading many of these formats is just one line in each direction. Python and pandas already make the code about as clean as it gets, so you can worry less about that.
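For example, the CSV round trip really is one line each way (a quick sketch; the file name and data are made up):
import pandas as pd

df = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'close': [190.0, 410.0]})  # made-up sample data
df.to_csv('prices.csv', index=False)   # write: one line
df = pd.read_csv('prices.csv')         # read back: one line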
I think there is no single solution to fit all requirements, really case by case:
for human readability of saved data: CSV, Excel
for binary python object serialization (use-cases): Pickle
for data-interchange: JSON
for long-time and incrementally updating: SQL
etc.
And if you want to update the stock prices daily and keep them for later use, I prefer pandas with SQL queries; of course, this adds a few lines of code to set up the DB connection:
import pandas as pd
from sqlalchemy import create_engine

new_data = getting_daily_price()  # placeholder for your download function
# You can also choose other db drivers/URLs instead of an in-memory SQLite engine
engine = create_engine('sqlite:///:memory:')
with engine.connect() as conn:
    new_data.to_sql('table_name', conn)         # To Write
    df = pd.read_sql_table('table_name', conn)  # To Read
I have a defined list of S3 file paths and I want to read them as DataFrames:
ss = SparkSession(sc)
JSON_FILES = ['a.json.gz', 'b.json.gz', 'c.json.gz']
dataframes = {t: ss.read.json('s3a://bucket/' + t) for t in JSON_FILES}
The code above works, but in an unexpected way: when the code is submitted to a Spark cluster, only a single file is read at a time, keeping only a single node occupied.
Is there a more efficient way to read multiple files? A way to make all nodes work at the same time?
More details:
PySpark - Spark 2.2.0
Files stored on S3
Each file contains one JSON object per line
The files are compressed, as can be seen from their extensions
To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a dataframe or an rdd.
ss = SparkSession(sc)
dataframes = ss.read.json("s3a://bucket/*.json.gz")
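If you'd rather keep an explicit list of paths instead of a wildcard, the JSON reader also accepts a list, which still produces a single distributed dataframe (a sketch reusing the names from the question):
paths = ['s3a://bucket/' + t for t in JSON_FILES]  # JSON_FILES as defined in the question
dataframe = ss.read.json(paths)  # all files end up in one dataframe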
The problem was that I didn't understand Spark's runtime architecture. Spark has the notion of "workers" which, if I now understand it correctly (don't trust me), are capable of doing work in parallel. When we submit a Spark job, we can set both the number of workers and the level of parallelism they can leverage.
If you are using the Spark command spark-submit, these variables are represented as the following options:
--num-executors: similar to the notion of number of workers
--executor-cores: how many CPU cores a single worker should use
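For reference, roughly the same settings can also be applied from code when building the session; the numbers here are illustrative, not a recommendation:
from pyspark.sql import SparkSession

ss = (
    SparkSession.builder
    .config("spark.executor.instances", "3")  # roughly --num-executors
    .config("spark.executor.cores", "4")      # roughly --executor-cores
    .getOrCreate()
)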
This is a document that helped me understand these concepts and how to tune them.
Coming back to my problem, in that situation I would have one worker per file.