pyarrow and pandas integration

I plan to:
join
group by
filter
data using pyarrow (I'm new to it). The idea is to get better performance and memory utilisation (Apache Arrow compression) compared to pandas.
It seems that pyarrow has no support for joining two Tables / Datasets by key, so I have to fall back to pandas.
I don't really follow how the pyarrow <-> pandas integration works. Will pandas rely on the Apache Arrow data structures? I'm fine with using only these types:
string
long
decimal
I have a feeling that pandas will copy all the data from Apache Arrow and double the memory footprint (according to the docs).
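For reference, a minimal sketch of the round trip in question, using pyarrow's Table.from_pandas and to_pandas (the column names are placeholders):
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"key": ["a", "b"], "amount": [1.5, 2.5]})

# pandas -> Arrow: builds an Arrow Table from the DataFrame's columns
table = pa.Table.from_pandas(df)

# Arrow -> pandas: materializes pandas/NumPy objects again, which for
# string and decimal columns generally means the data is copied
df_back = table.to_pandas()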

pyarrow itself doesn't provide these capabilities to the end user but is rather meant as a library that can be used by DataFrame library developers as the base. Thus the intention is not that you as a DataFrame user switch one day to using pyarrow directly but that libraries like pandas use Arrow as a backend.
This is already happening with the new ArrowStringType introduced in pandas 1.2 (not yet really functional), or with the fletcher library, which lets you use pyarrow as the backend for a selection of the columns of your pandas.DataFrame through pandas's ExtensionArray interface.
Disclaimer: I'm the main author of fletcher.

Concatenate more than 2 dataframes side by side using a for loop

I am new to Pandas and was curious to know if I can merge more than 2 dataframes (generated within a for loop) side by side?
Use pandas's concat function.
Here's documentation: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
(Avoid looping over dataframes at all costs. It's much slower than the tools provided to you by pandas. In most cases, there's probably a pandas function for it, or a few you can use together to achieve the same thing.)
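A hedged sketch of the pattern (the frames here are made-up placeholders): collect the DataFrames in a list inside the loop, then concatenate once at the end with axis=1 to place them side by side.
import pandas as pd

frames = []
for i in range(4):  # stands in for whatever your loop generates
    frames.append(pd.DataFrame({f"col_{i}": range(3)}))

# A single concat after the loop is much cheaper than concatenating inside it
result = pd.concat(frames, axis=1)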

Fast ascii loader to NumPy arrays

It is well known [1] [2] that numpy.loadtxt is not particularly fast in loading simple text files containing numbers.
I have been googling around for alternatives, and of course I stumbled across pandas.read_csv and astropy io.ascii. However, these readers don’t appear to be easy to decouple from their library, and I’d like to avoid adding a 200 MB, 5-seconds-import-time gorilla just for reading some ascii files.
The files I usually read are simple, no missing data, no malformed rows, no NaNs, floating point only, space or comma separated. But I need numpy arrays as output.
Does anyone know if any of the parsers above can be used standalone or about any other quick parser I could use?
Thank you in advance.
[1] Numpy loading csv TOO slow compared to Matlab
[2] http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/
[Edit 1]
For the sake of clarity and to reduce background noise: as I stated at the beginning, my ascii files contain simple floats, no scientific notation, no Fortran-specific data, no funny stuff, nothing but simple floats.
Sample:
import numpy as np

arr = np.random.rand(1000, 100)
np.savetxt('float.csv', arr)
Personally I just use pandas and astropy for this. Yes, they are big and slow to import, but they are very widely available and import in under a second on my machine, so they aren't so bad. I haven't tried, but I would assume that extracting the CSV reader from pandas or astropy and getting it to build and run standalone isn't so easy, and probably not a good way to go.
Is writing your own CSV to NumPy array reader an option? If the CSV is simple, it should be possible to do in ~100 lines of e.g. C / Cython, and if you know your CSV format you can get performance and package size that can't be beaten by a generic solution.
Another option you could look at is https://odo.readthedocs.io/ . I don't have experience with it; from a quick look I didn't see a direct CSV -> NumPy path, but it does make fast CSV -> database simple, and I'm sure there are fast database -> NumPy array options. So it might be possible to get a fast e.g. CSV -> in-memory SQLite -> NumPy array pipeline via odo and possibly a second package.
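If depending on pandas is acceptable after all, here is a minimal sketch of using its C parser purely as a NumPy loader for the whitespace-separated float files described above (the function name is just an illustration):
import numpy as np
import pandas as pd

def load_floats(path):
    # header=None: the files have no header row; delim_whitespace handles
    # the space-separated output that np.savetxt produces by default
    return pd.read_csv(path, header=None, delim_whitespace=True,
                       dtype=np.float64).to_numpy()

arr = load_floats('float.csv')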

Python pandas persistent cache

Is there an implementation for python pandas that caches the data on disk so I can avoid reproducing it every time?
In particular, is there a caching method for get_yahoo_data for financial data?
A big plus would be:
very few lines of code to write
possibility to integrate the persisted series when new data is downloaded for the same source
There are many ways to achieve this, but probably the easiest way is to use the built-in methods for writing and reading Python pickles. You can use pandas.DataFrame.to_pickle to store the DataFrame to disk and pandas.read_pickle to read the stored DataFrame from disk.
An example for a pandas.DataFrame:
# Store your DataFrame
df.to_pickle('cached_dataframe.pkl') # will be stored in current directory
# Read your DataFrame
df = pandas.read_pickle('cached_dataframe.pkl') # read from current directory
The same methods also work for pandas.Series:
# Store your Series
series.to_pickle('cached_series.pkl') # will be stored in current directory
# Read your Series
series = pandas.read_pickle('cached_series.pkl') # read from current directory
You could use the Data cache package.
from data_cache import pandas_cache
@pandas_cache
def foo():
...
Depending on your requirements, there are a dozen ways to do this, in both directions, using CSV, Excel, JSON, Python pickle format, HDF5 and even SQL with a DB, etc.
In terms of code, writing and reading most of these formats is just one line per direction (see the short sketch after this list). Python and pandas already make the code about as clean as it can be, so you can worry less about that.
I think there is no single solution to fit all requirements; it is really case by case:
for human readability of saved data: CSV, Excel
for binary python object serialization (use-cases): Pickle
for data-interchange: JSON
for long-time and incrementally updating: SQL
etc.
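A minimal sketch of those one-liners (file names are placeholders; to_hdf/read_hdf additionally need the PyTables package installed):
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0]})  # placeholder data

# one line to write and one line to read, per format
df.to_csv('cache.csv', index=False)
df = pd.read_csv('cache.csv')

df.to_json('cache.json')
df = pd.read_json('cache.json')

df.to_hdf('cache.h5', key='df')
df = pd.read_hdf('cache.h5', 'df')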
And if you want to update the stock prices daily and keep them for later use, I prefer pandas with SQL queries; of course this will add a few lines of code to set up the DB connection:
import pandas as pd
from sqlalchemy import create_engine

new_data = getting_daily_price()  # your own download function
# You can also choose other db drivers instead of `sqlalchemy`;
# use a file path such as 'sqlite:///cache.db' if the cache should persist on disk
engine = create_engine('sqlite:///:memory:')
with engine.connect() as conn:
    new_data.to_sql('table_name', conn)        # To Write
    df = pd.read_sql_table('table_name', conn) # To Read

Why is pyspark so much slower in finding the max of a column?

Is there a general explanation, why spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400.000 rows) and I like what spark is doing when it comes to rowwise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value.
I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculate the max in pandas (you guessed it, df.toPandas took a long time.)
The only thing I did not try yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in spark), I'd like to know
can you give me a pointer to an article discussing this difference?
is spark more sensitive to memory constraints on my computer than pandas?
As @MattR has already said in the comments, you should use pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you run into a MemoryError with pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has an overhead, because it needs to split your data set first, process those distributed chunks, then process and join the "processed" data, collect it on one node and return it to you.
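For comparison, a hedged sketch of the standard way to get a column maximum on each side (the column name and values are placeholders, not the Quora data set):
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["value"])

# Spark: a full distributed aggregation job is launched for this one number
max_spark = sdf.agg(F.max("value")).collect()[0][0]

# pandas: a single in-memory scan over a NumPy-backed column
pdf = pd.DataFrame({"value": [1.0, 7.5, 3.2]})
max_pandas = pdf["value"].max()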
@MaxU, @MattR, I found an intermediate solution that also makes me reassess Spark's laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly (a rough sketch of this follows below).
In testing this I noticed that Spark is even lazier than expected, so the part of my original post that said "I like what spark is doing when it comes to rowwise feature extraction" boils down to "I like that Spark is doing nothing quite fast".
On the other hand, a lot of the time spent on calculating the maximum of the column was most likely the calculation of the intermediate values.
Thanks for your input; this topic really got me much further in understanding Spark.
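A rough sketch of that accumulator approach (assumed placeholder data and column name, not the original code):
from pyspark import AccumulatorParam
from pyspark.sql import SparkSession

class MaxAccumulatorParam(AccumulatorParam):
    def zero(self, initial_value):
        return initial_value
    def addInPlace(self, current_max, new_value):
        return max(current_max, new_value)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

sdf = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["value"])

# The accumulator is updated on the executors while the rows are streamed once
max_acc = sc.accumulator(float("-inf"), MaxAccumulatorParam())
sdf.foreach(lambda row: max_acc.add(row["value"]))
print(max_acc.value)  # 7.5, read back on the driver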

Load csv data to spark dataframes using pd.read_csv?

The Spark CSV readers are not as flexible as pandas.read_csv and do not seem to be able to handle parsing dates of different formats, etc. Is there a good way of passing pandas DataFrames to Spark DataFrames in an ETL map step? Spark's createDataFrame does not appear to always work; perhaps the type mapping has not been implemented exhaustively? Paratext looks promising, but it is likely new and not yet heavily used.
For example here: Get CSV to Spark dataframe
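For reference, a minimal sketch of the pattern being asked about: parse with pandas, then hand the result to Spark with an explicit schema so the type mapping is not left to inference (the file name, columns and types are placeholders):
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# pandas handles the awkward date format during parsing
pdf = pd.read_csv('input.csv', parse_dates=['event_time'])

schema = StructType([
    StructField('event_time', TimestampType(), True),
    StructField('name', StringType(), True),
    StructField('amount', DoubleType(), True),
])

# An explicit schema avoids surprises from createDataFrame's type inference
sdf = spark.createDataFrame(pdf, schema=schema)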