I am looking for a solution to resample time series data at large scale (tens or hundreds of millions of records). Pandas resample() worked well up to about 10 million records; beyond that it stopped working because the machine ran out of memory. I have had this problem several times with Pandas on huge datasets. With a plain for loop, however, I could read the data and work with it, even though it was much slower. Does anybody know a good way to resample time series data without pandas?
The data comes from a MySQL server; each record contains OHLC data and a timestamp. The time series has a frequency of 1 minute, and the resampling frequencies are 5 min, 30 min, 1h, 6h, 1d, 1w, 1m, each of which I store in a separate table. I am considering switching to MongoDB in the future.
Look at this:
Pandas Panel resampling alternatives
The package is now called xarray. You can also check out dask, which together with xarray can offer fast, parallel resampling (and many other numpy and pandas functions).
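If you would rather stay with a plain loop, as the question mentions, here is a minimal sketch of streaming OHLC resampling in pure Python. It assumes the rows arrive sorted by timestamp (for example from a database cursor with ORDER BY), that timestamps are naive datetimes, and that the bucket has a fixed width in seconds; weekly and monthly buckets would need a calendar-aware key instead. The names floor_ts and resample_ohlc are made up for this illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby

EPOCH = datetime(1970, 1, 1)


def floor_ts(ts, bucket_seconds):
    """Round a naive datetime down to the start of its fixed-width bucket."""
    seconds = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=seconds - seconds % bucket_seconds)


def resample_ohlc(rows, bucket_seconds):
    """Yield (bucket_start, open, high, low, close) one bucket at a time.

    rows: iterable of (timestamp, open, high, low, close), sorted by timestamp.
    Only one bucket is held in memory at any moment.
    """
    for bucket, group in groupby(rows, key=lambda r: floor_ts(r[0], bucket_seconds)):
        _ts, o, h, l, c = next(group)          # first row sets the open
        for _ts, _open, hi, lo, cl in group:   # remaining rows update high/low/close
            h = max(h, hi)
            l = min(l, lo)
            c = cl
        yield bucket, o, h, l, c


# Ten synthetic 1-minute bars resampled to 5-minute bars
start = datetime(2020, 1, 1)
rows = [(start + timedelta(minutes=i), 100 + i, 101 + i, 99 + i, 100.5 + i)
        for i in range(10)]
bars = list(resample_ohlc(rows, 5 * 60))
for bar in bars:
    print(bar)
```

Because resample_ohlc is a generator, each finished bar can be written to its target table as soon as it is produced, so memory use stays constant regardless of the number of input records.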
I have a generic Spark 2.3 job that does many transformations and joins and produces a huge DAG.
This has a big impact on the driver side, as the DAG becomes very complex.
To relieve pressure on the driver, I've thought about checkpointing some intermediate dataframes to cut the DAG, but I've noticed that dataframe.checkpoint uses RDDs underneath and spends a lot of time serializing and deserializing the dataframe.
According to "Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk" and my own experience, writing the dataframe as parquet and reading it back is faster than checkpointing, but it has a disadvantage: the dataframe loses its partitioner.
Is there any way to write and read the dataframe and keep the partitioner? I've thought about using buckets when writing the dataframe, so that when it is read back Spark knows how the data is partitioned.
The problem is: how can I find out which columns the dataframe is partitioned by?
The Spark job I'm running is fairly generic, so I can't hardcode the columns.
Thanks
After updating to pandas v0.25.2, a script doing a groupby over many columns on a large dataframe no longer works. I get a memory error:
MemoryError: Unable to allocate array with shape (some huge number...,) and data type int64
Doing a bit of research, I found an issue (#14942) reported on GitHub for an earlier version:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')
# killed after 6 minutes of 100% cpu and 90G maximum main memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()
Running this code (on version 0.25.2) also gives a memory error. Am I doing something wrong (has the syntax changed in pandas v0.25?), or has this issue, which is marked as resolved, returned?
Use observed=True to fix it; this prevents the groupby from expanding every possible combination of the factor (categorical) variables:
df.groupby(index, observed=True)
There is a related GitHub Issue: PERF: groupby with many empty groups memory blowup.
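To see the effect on a small scale, here is a minimal sketch with a toy dataframe (the data is made up for illustration; only the observed parameter of groupby is taken from the answer above). The categorical column declares a category "c" that never occurs in the data.

```python
import pandas as pd

df = pd.DataFrame({
    # "c" is a declared category that never appears in the rows
    "cat": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    "int_id": [1, 2, 1],
    "foo": [0, 0, 0],
})

# observed=False keeps groups for category combinations that never occur
full = df.groupby(["cat", "int_id"], observed=False).count()

# observed=True keeps only the combinations actually present in the data
seen = df.groupby(["cat", "int_id"], observed=True).count()

print(len(full), len(seen))
```

With three high-cardinality keys, as in the question, the unobserved combinations vastly outnumber the observed ones, which is where the memory goes.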
While the proposed solution addresses the issue, another problem is likely to arise with larger datasets: pandas groupby is slow and memory hungry, and may need 5-10x the memory of the dataset. A more effective solution is to use a tool that is an order of magnitude faster, less memory hungry, and integrates seamlessly with pandas by reading directly from the dataframe's memory. There is no data round trip and typically no need for extensive data chunking.
My new tool of choice for quick data aggregation is https://duckdb.org. It takes your existing dataframe df and queries it directly, without even importing it into the database. Here is an example final result using your dataframe generation code. Notice that the total time was 0.45 sec. I'm not sure why pandas does not use DuckDB for groupby under the hood.
The db object is created using this small wrapper class, which lets you simply type db = DuckDB() and be ready to explore the data in any project. You can expand it further, or even simplify it by using %sql as described in the documentation. By the way, sql returns a dataframe, so you can also do db.sql(...).pivot_table(...); it is that simple.
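import duckdb

class DuckDB:
    def __init__(self, db=None):
        # Default to an in-memory database when no file path is given
        self.db_loc = db or ':memory:'
        self.db = duckdb.connect(self.db_loc)

    def sql(self, sql=""):
        # Run the query and return the result as a pandas DataFrame
        return self.db.execute(sql).fetchdf()

    def __del__(self):
        # Close the connection when the wrapper is garbage-collected
        self.db.close()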
Note: DuckDB is good but not perfect; still, it has turned out to be far more stable than Dask or even PySpark, with a much simpler setup. For larger datasets you may need a real database, but for datasets that fit in memory it is great. Regarding memory usage: if you have a larger dataset, make sure you limit DuckDB using pragmas, as it will otherwise eat all available memory in no time. The limit simply pushes the excess onto disk, so you don't have to deal with data chunking. Also, do not treat this as a persistent database; treat it as an in-memory database. If you need some results stored, export them to parquet instead of saving the database, because the storage format is not stable between releases and you would have to export to parquet anyway to move from one version to the next.
I expanded this dataframe to 300mn records, so in total it had around 1.2bn records, or around 9GB. It still completed your groupby and other summary stats on a 32GB machine, with 18GB still free.
I create a Spark DataFrame from a 4GB input text file using pyspark, then apply a condition like:
df.cache() #cache df for fast execution of later instruction
df_pd = df.where(df.column1=='some_value').toPandas() #around 70% of data
I now do all operations on the pandas DataFrame df_pd, and my memory usage has grown to around 13 GB.
Why is so much memory consumed?
What can I do to make my computation faster and more efficient? #here df.cache() took 10 minutes to cache.
I tried to free the pyspark DF memory using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
Note: I am mainly using PySpark because it uses the CPU cores efficiently, whereas pandas only uses a single core of my machine for reading the file.
Why is so much memory consumed?
I would say duplication of the dataframe in memory, as you suggested.
What can I do to make my computation faster and more efficient? #here df.cache() took 10 minutes to run
df.cache() is only useful if you're going to use this df multiple times. Think of it as a checkpoint, only useful when you need to do multiple operations on the same dataframe. Here it is not necessary, since you're running only one process. More info here.
I tried to free the pyspark DF memory using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
unpersist is the right thing to do. As for sqlContext.clearCache(), I don't know which version of Spark you're using, but you may want to take a look at spark.catalog.clearCache().
Although I know this does not directly answer your question, I hope it helps!
What about trying to delete the PySpark df?
del(df)
Is there a way to take a very large amount of data on disk (a few 100 GB) and interact with it on disk as a pandas dataframe?
Here's what I've done so far:
1. Described the data using pytables and this example: http://www.pytables.org/usersguide/introduction.html
2. Ran a test by loading a portion of the data (a few GB) into an HDF5 file
3. Converted the data into a dataframe using pd.DataFrame.from_records()
This last step loads all the data in memory.
I've looked for some way to describe the data as a pandas dataframe in step 1 but haven't been able to find a good set of instructions to do that. Is what I want to do feasible?
blaze is a nice way to interact with out-of-core data by using lazy expression evaluation. It uses pandas and PyTables under the hood (as well as a host of conversions with odo).
It's been months now since I started using Pandas DataFrames to deserialize GPS data and perform some data processing and analysis.
Although I am very impressed with Pandas' robustness, flexibility and power, I'm a bit lost about which features I should use, and in what way, to model the data properly for clarity, simplicity and computational speed.
Basically, each DataFrame is primarily indexed by a datetime object, with at least one column for a latitude-longitude tuple and one column for elevation.
The first thing I do is calculate a new column with the geodesic distance between consecutive coordinate pairs (the first one being 0.0), using a function that takes two coordinate pairs as arguments. From that new column I can calculate the cumulative distance along the track, which I use as a Linear Referencing System.
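For illustration, the computation described above could look like the following sketch in plain Python. It assumes a haversine great-circle distance on a spherical Earth of radius 6,371 km, which is only a stand-in since the actual geodesic function is not specified here; haversine_m and the sample track are made up for the example.

```python
import math
from itertools import accumulate

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius, in metres


def haversine_m(p1, p2):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Three points along the equator, one degree of longitude apart
track = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]

# Pairwise distance between consecutive points, with 0.0 for the first one
step = [0.0] + [haversine_m(a, b) for a, b in zip(track, track[1:])]

# Running total along the track
cumulative = list(accumulate(step))
print(step, cumulative)
```

Both lists have one entry per track point, so they can be assigned directly as new DataFrame columns.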
The questions I need to address would be:
Is there a way in which I can use, in the same dataframe, two different monotonically increasing columns (cumulative distance and timestamp), choosing whatever is more convenient in each given context at runtime, and use these indexes to auto-align newly inserted rows?
In the specific case of applying a diff function that could be vectorized (applied like an array operation instead of an iterative pairwise loop), is there a way to do that idiomatically in pandas? Should I create a "coordinate" class which supports the diff (__sub__) operation so I could use dataframe.latlng.diff directly?
I'm not sure these questions are well formulated, but that is due, at least in part, to the overwhelming number of possibilities and the (still) somewhat fragmented documentation.
Also, any tips about using Pandas for GPS data (tracklogs) or geospatial data in general are very welcome.
Thanks for any help!