Is there an implementation for Python pandas that caches the data on disk, so I can avoid reproducing it every time?
In particular, is there a caching method for get_yahoo_data for financial data?
A big plus would be:
very few lines of code to write
possibility to integrate the persisted series when new data is downloaded for the same source
There are many ways to achieve this; probably the easiest is to use the built-in methods for writing and reading Python pickles. You can use pandas.DataFrame.to_pickle to store the DataFrame to disk and pandas.read_pickle to read the stored DataFrame from disk.
An example for a pandas.DataFrame:
# Store your DataFrame
df.to_pickle('cached_dataframe.pkl') # will be stored in current directory
# Read your DataFrame
df = pandas.read_pickle('cached_dataframe.pkl') # read from current directory
The same methods also work for pandas.Series:
# Store your Series
series.to_pickle('cached_series.pkl') # will be stored in current directory
# Read your Series
series = pandas.read_pickle('cached_series.pkl') # read from current directory
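Building on this, a minimal caching wrapper for the yahoo-data use case could look like the sketch below. Here get_yahoo_data stands in for whatever download function you already use (e.g. via pandas-datareader), and the cache file name is an arbitrary example:
import os
import pandas as pd

def get_yahoo_data_cached(ticker, cache_file=None):
    """Download data only if no cached copy exists on disk (sketch)."""
    cache_file = cache_file or f'{ticker}_cache.pkl'
    if os.path.exists(cache_file):
        # reuse the persisted DataFrame instead of downloading again
        return pd.read_pickle(cache_file)
    df = get_yahoo_data(ticker)   # hypothetical download function
    df.to_pickle(cache_file)
    return df
New rows could also be merged into the cached frame with pandas.concat before re-pickling, which would cover the requirement of integrating the persisted series with newly downloaded data.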
You could use the Data cache package.
from data_cache import pandas_cache

@pandas_cache
def foo():
    ...
Depending on your requirements, there are a dozen methods to do that, back and forth, in CSV, Excel, JSON, Python pickle format, HDF5 and even SQL with a database, etc.
In terms of lines of code, writing and reading many of these formats takes just one line of code in each direction. Python and pandas already keep the code as clean as possible, so you can worry less about that (see the sketch after the list below).
I think there is no single solution to fit all requirements; it really is case by case:
for human readability of saved data: CSV, Excel
for binary python object serialization (use-cases): Pickle
for data-interchange: JSON
for long-term storage and incremental updates: SQL
etc.
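As an illustration of those one-line round trips (assuming df is an existing DataFrame; the file names are arbitrary examples):
import pandas as pd

df.to_csv('data.csv')                      # human-readable
df = pd.read_csv('data.csv', index_col=0)

df.to_json('data.json')                    # data interchange
df = pd.read_json('data.json')

df.to_pickle('data.pkl')                   # binary Python serialization
df = pd.read_pickle('data.pkl')

df.to_hdf('data.h5', key='df')             # HDF5 (requires the `tables` package)
df = pd.read_hdf('data.h5', key='df')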
And if you want to update the stock prices daily and keep them for later usage, I prefer pandas with SQL queries; of course this adds a few lines of code to set up the DB connection:
import pandas as pd
from sqlalchemy import create_engine

new_data = getting_daily_price()  # placeholder for your daily download function
# You can also choose other db drivers instead of `sqlalchemy`
engine = create_engine('sqlite:///:memory:')
with engine.connect() as conn:
    new_data.to_sql('table_name', conn)            # To Write
    df = pd.read_sql_table('table_name', conn)     # To Read
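For the incremental-update part, a hedged sketch: pandas.DataFrame.to_sql accepts if_exists='append', so each day's download can simply be appended to the same table (the table name, database file and getting_daily_price remain placeholders):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///prices.db')   # file-backed DB instead of :memory:
new_data = getting_daily_price()                 # placeholder download function
with engine.connect() as conn:
    new_data.to_sql('table_name', conn, if_exists='append')  # add today's rows
    df = pd.read_sql_table('table_name', conn)                # full accumulated history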
Related
I was wondering if there is a method to store one's columns from a dataframe to an already existing CSV file without reading the entire file first?
I am working with a very large dataset, where I read 2-5 columns of the dataset and use them to calculate a new variable (column), and I want to store this variable back into the full dataset. My memory cannot load the entire dataset at once, so I am looking for a way to store the new column in the full dataset without loading all of it.
I have tried using chunking with:
df = pd.read_csv(Path, chunksize = 10000000)
But then I am faced with the error "TypeError: 'TextFileReader' object is not subscriptable" when trying to process the data.
The data is also grouped by two variables and therefore chunking is not preferred when doing these calculations.
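For reference, a minimal sketch of how the chunked reader is meant to be consumed: pd.read_csv with chunksize returns a TextFileReader iterator rather than a DataFrame, which is why indexing it raises the TypeError above. Paths and column names below are placeholders:
import os
import pandas as pd

out_path = 'large_file_with_new_col.csv'   # hypothetical output file
for chunk in pd.read_csv('large_file.csv', chunksize=10_000_000):
    # hypothetical calculation using two of the columns
    chunk['new_col'] = chunk['col_a'] * chunk['col_b']
    # append each processed chunk; write the header only for the first chunk
    chunk.to_csv(out_path, mode='a', header=not os.path.exists(out_path), index=False)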
I have a bunch of files in S3 which comprise a larger-than-memory dataframe.
Currently, I use Dask to read the files into a dataframe, perform an inner-join with a smaller dataset (which will change on each call to this function, whereas huge_df is basically the full dataset & does not change), call compute to get a much smaller pandas dataframe, and then do some processing. E.g:
import dask.dataframe as ddf

huge_df = ddf.read_csv("s3://folder/**/*.part")
merged_df = huge_df.join(small_df, how='inner', ...)
merged_df = merged_df.compute()
...other processing...
Most of the time is spent downloading the files from S3. My question is: is there a way to use Dask to cache the files from S3 on disk, so that on subsequent calls to this code, I could just read the dataframe files from disk rather than from S3? I figure I can't just call huge_df.to_csv("./local-dir/"), since that would bring huge_df into memory, which won't work.
I'm sure there is a way to do this using a combination of other tools plus standard Python IO utilities, but I wanted to see if there was a way to use Dask to download the file contents from S3 and store them on the local disk without bringing everything into memory.
Doing huge_df.to_csv would have worked, because it would write each partition to a separate file locally, and so the whole thing would not have been in memory at once.
However, to answer the specific question, dask uses fsspec to manage file operations, and it allows for local caching, e.g., you could do
huge_df = ddf.read_csv("simplecache::s3://folder/**/*.part")
By default, this will store the files in a temporary folder, which gets cleaned up when you exit the Python session. You can provide options using the optional argument storage_options={"simplecache": {..}} to specify the cache location, or use "filecache" instead of "simplecache" if you want the local copies to expire after some time or to check the target for updated versions.
Note that, obviously, these will work with a distributed cluster only if all the workers have access to the same cache location, since the loading of a partition might happen on any of your workers.
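As a concrete illustration of the storage_options argument, a minimal sketch assuming cache_storage as the fsspec option for the local cache directory (the path is an arbitrary example):
import dask.dataframe as ddf

huge_df = ddf.read_csv(
    "simplecache::s3://folder/**/*.part",
    # keep the cached copies in a fixed directory instead of a temporary one
    storage_options={"simplecache": {"cache_storage": "/path/to/local/cache"}},
)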
I have a two-part question about Dask+Parquet. I am trying to run queries on a dask dataframe created from a partitioned Parquet file as so:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import fastparquet

##### Generate random data to simulate a process creating a Parquet file #####
test_df = pd.DataFrame(data=np.random.randn(10000, 2), columns=['data1', 'data2'])
test_df['time'] = pd.bdate_range('1/1/2000', periods=test_df.shape[0], freq='1S')
# some grouping column
test_df['name'] = np.random.choice(['jim', 'bob', 'jamie'], test_df.shape[0])
##### Write to partitioned parquet file, hive and simple #####
fastparquet.write('test_simple.parquet', data=test_df, partition_on=['name'], file_scheme='simple')
fastparquet.write('test_hive.parquet', data=test_df, partition_on=['name'], file_scheme='hive')
# now check partition sizes. Only Hive version works.
assert test_df.name.nunique() == dd.read_parquet('test_hive.parquet').npartitions # works.
assert test_df.name.nunique() == dd.read_parquet('test_simple.parquet').npartitions # !!!!FAILS!!!
My goal here is to be able to quickly filter and process individual partitions in parallel using dask, something like this:
df = dd.read_parquet('test_hive.parquet')
df.map_partitions(<something>) # operate on each partition
I'm fine with using the Hive-style Parquet directory, but I've noticed it takes significantly longer to operate on compared to directly reading from a single parquet file.
Can someone tell me the idiomatic way to achieve this? Still fairly new to Dask/Parquet so apologies if this is a confused approach.
Maybe it wasn't clear from the docstring, but partitioning by value simply doesn't happen for the "simple" file type, which is why it only has one partition.
As for speed, reading the data in one single function call is fastest when the data are so small - especially if you intend to do any operation such as nunique which will require a combination of values from different partitions.
In Dask, every task incurs an overhead, so unless the amount of work being done by the call is large compared to that overhead, you can lose out. In addition, disk access is not generally parallelisable, and some parts of the computation may not be able to run in parallel in threads if they hold the GIL. Finally, the partitioned version contains more parquet metadata to be parsed.
>>> len(dd.read_parquet('test_hive.parquet').name.nunique().dask)    # tasks in the graph
12
>>> len(dd.read_parquet('test_simple.parquet').name.nunique().dask)  # tasks in the graph
6
TL;DR: make sure your partitions are big enough to keep dask busy.
(note: the set of unique values is already apparent from the parquet metadata, it shouldn't be necessary to load the data at all; but Dask doesn't know how to do this optimisation since, after all, some of the partitions may contain zero rows)
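For completeness, a small sketch of the per-partition processing the question aims for, under the assumption that the hive-style dataset is used so that each value of name gets its own partition (process_partition and the added column are placeholders):
import dask.dataframe as dd

df = dd.read_parquet('test_hive.parquet')

def process_partition(pdf):
    # pdf is a plain pandas DataFrame holding one partition
    # (with the hive scheme, one partition per `name` value)
    return pdf.assign(total=pdf['data1'] + pdf['data2'])   # hypothetical per-partition work

result = df.map_partitions(process_partition).compute()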
I am using the as_pandas utility from impala.util to read the data fetched from Hive into a dataframe. However, using pandas, I think I will not be able to handle a large amount of data, and it will also be slower. I have been reading about Dask, which provides excellent functionality for reading large data files. How can I use it to efficiently fetch data from Hive?
def as_dask(cursor):
    """Return a DataFrame out of an impyla cursor.

    This will pull the entire result set into memory. For richer pandas-
    like functionality on distributed data sets, see the Ibis project.

    Parameters
    ----------
    cursor : `HiveServer2Cursor`
        The cursor object that has a result set waiting to be fetched.

    Returns
    -------
    DataFrame
    """
    import pandas as pd
    import dask
    import dask.dataframe as dd

    names = [metadata[0] for metadata in cursor.description]
    dfs = dask.delayed(pd.DataFrame.from_records)(cursor.fetchall(),
                                                  columns=names)
    return dd.from_delayed(dfs).compute()
There is currently no straightforward way to do this. You would do well to look at the implementation of dask.dataframe.read_sql_table and similar code in intake-sql - you will probably want a way to partition your data, and have each of your workers fetch one partition via a call to delayed(). dd.from_delayed and dd.concat could then be used to stitch the pieces together.
-edit-
Your function has the delayed idea back to front. You are delaying and then immediately materialising the data within a function that operates on a single cursor - it can't be parallelised, and it will break your memory if the data is big (which is the reason you are trying this).
Let's suppose you can form a set of 10 queries, where each query gets a different part of the data; do not use OFFSET, use a condition on some column that is indexed by Hive.
You want to do something like:
queries = [SQL_STATEMENT.format(i) for i in range(10)]
def query_to_df(query):
    cursor = impyla.execute(query)
    return pd.DataFrame.from_records(cursor.fetchall())
Now you have a function that returns a partition and has no dependence on global objects - it only takes as input a string.
parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)
The Spark CSV readers are not as flexible as pandas.read_csv and do not seem to be able to handle parsing dates in different formats, etc. Is there a good way of passing pandas DataFrames to Spark DataFrames in an ETL map step? Spark's createDataFrame does not appear to always work; likely the type system has not been mapped exhaustively? Paratext looks promising, but it is likely new and not yet heavily used.
For example here: Get CSV to Spark dataframe
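For reference, a minimal sketch of the pattern described above: parse with pandas and hand the result to Spark with an explicit schema, so that type mapping is not left to inference. The path, column names and types below are placeholders:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# let pandas handle the awkward date parsing
pdf = pd.read_csv('input.csv', parse_dates=['date'])

# hypothetical explicit schema, so createDataFrame does not have to infer types
schema = StructType([
    StructField('date', TimestampType(), True),
    StructField('symbol', StringType(), True),
    StructField('price', DoubleType(), True),
])

sdf = spark.createDataFrame(pdf, schema=schema)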