How to filter data with read_parquet() in pandas?

I want to reduce memory usage when loading by filtering on certain gid values:
reg_df = pd.read_parquet('/data/2010r.pq',
                         columns=['timestamp', 'gid', 'uid', 'flag'])
But the docs don't show which kwargs can be passed.
For example:
gid = [100, 101, 102, 103, 104, 105]
gid_i_want_load = [100, 103, 105]
So, how can I load only the gid values I want to calculate on?

The introduction of **kwargs to the pandas library is documented here. It looks like the original intent was to pass columns into the request to limit IO volume. The contributors took the next step and added a general pass-through for **kwargs.
For pandas/io/parquet.py the following is for read_parquet:
def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.
    .. versionadded 0.21.0
    Parameters
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.
        .. versionadded 0.21.1
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet library to use. If 'auto', then the option
        ``io.parquet.engine`` is used. The default ``io.parquet.engine``
        behavior is to try 'pyarrow', falling back to 'fastparquet' if
        'pyarrow' is unavailable.
    kwargs are passed to the engine
    Returns
    -------
    DataFrame
    """
    impl = get_engine(engine)
    return impl.read(path, columns=columns, **kwargs)
For pandas/io/parquet.py the following is for read on the pyarrow engine:
def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)
    if self._pyarrow_lt_070:
        result = self.api.parquet.read_pandas(path, columns=columns,
                                              **kwargs).to_pandas()
    else:
        kwargs['use_pandas_metadata'] = True  # <-- only param for kwargs...
        result = self.api.parquet.read_table(path, columns=columns,
                                             **kwargs).to_pandas()
    if should_close:
        try:
            path.close()
        except:  # noqa: flake8
            pass
    return result
For pyarrow/parquet.py the following is for read_pandas:
def read_pandas(self, **kwargs):
    """
    Read dataset including pandas metadata, if any. Other arguments passed
    through to ParquetDataset.read, see docstring for further details
    Returns
    -------
    pyarrow.Table
        Content of the file as a table (of columns)
    """
    return self.read(use_pandas_metadata=True, **kwargs)  # <-- params being passed
For pyarrow/parquet.py the following is for read:
def read(self, columns=None, nthreads=1, use_pandas_metadata=False):  # <-- kwargs params at pyarrow
    """
    Read a Table from Parquet format
    Parameters
    ----------
    columns: list
        If not None, only these columns will be read from the file. A
        column name may be a prefix of a nested field, e.g. 'a' will select
        'a.b', 'a.c', and 'a.d.e'
    nthreads : int, default 1
        Number of columns to read in parallel. If > 1, requires that the
        underlying file source is threadsafe
    use_pandas_metadata : boolean, default False
        If True and file has custom pandas schema metadata, ensure that
        index columns are also loaded
    Returns
    -------
    pyarrow.table.Table
        Content of the file as a table (of columns)
    """
    column_indices = self._get_column_indices(
        columns, use_pandas_metadata=use_pandas_metadata)
    return self.reader.read_all(column_indices=column_indices,
                                nthreads=nthreads)
So, if I understand correctly, maybe you can access nthreads and use_pandas_metadata - but then again, neither is explicitly assigned (??). I haven't tested it, but it may be a start.
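For what it's worth, newer releases of pandas and pyarrow (well after the 0.21-era code quoted above) accept a filters keyword that is forwarded to the engine, which is exactly the row-level pruning the question asks for. A minimal sketch, assuming a reasonably recent pandas/pyarrow and reusing the question's file path and column list:

import pandas as pd

gid_i_want_load = [100, 103, 105]

# The filters kwarg is handed to pyarrow, which skips row groups (and, in
# newer versions, individual rows) whose gid is not in the list, so far
# less data is read into memory.
reg_df = pd.read_parquet(
    '/data/2010r.pq',
    engine='pyarrow',
    columns=['timestamp', 'gid', 'uid', 'flag'],
    filters=[('gid', 'in', gid_i_want_load)],
)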


Include partition steps as columns when reading Synapse spark dataframe

I have the following partition strategy in an ADLS Gen2 store
dir_parquet = "abfss://blah.windows.net/container_name/project=cars/make=*/model=*/*.parquet"
And this loads the already partitioned data into a dataframe accordingly. I am aware of using .filepath(n) in SQL to achieve this, and effectively require the same thing but for a notebook dataframe.
How can I keep the project, make and model values in the dataframe as separate columns?
According to this other SO thread, setting .option("mergeSchema","true") on read should work, however it did not.
Thanks.
Since I received no answer to this and cannot find an official means to do so, I wrote the below code.
People with this problem may also find it useful to recursively list blob directories; if so, please see the deep_ls function here (not my code).
import pyspark
import pyspark.sql.functions as F
from typing import List

def load_dataframes_with_partition_steps(dir_urls: List[str]) -> List[pyspark.sql.dataframe.DataFrame]:
    """
    Written by: Paul Wilson, 2022-07-29
    Takes in a list of blob directories including their partition steps and returns a list of
    dataframes with the associated partition steps in the dataframe.
    Ex. input...:
    ['abfss://container#yourgen2store.dfs.core.windows.net/projects/cars/make=Vauxhall/model=Astra/transmission=Manual',
     'abfss://container#yourgen2store.dfs.core.windows.net/projects/cars/make=Ford/model=Fiesta/transmission=Automatic']
    ...which is turned into a list of dicts...
    [{'url': 'abfss://container#yourgen2store.dfs.core.windows.net/projects/cars/make=Vauxhall/model=Astra/transmission=Manual',
      'make': 'Vauxhall',
      'model': 'Astra',
      'transmission': 'Manual'},
     {'url': 'abfss://container#yourgen2store.dfs.core.windows.net/projects/cars/make=Ford/model=Fiesta/transmission=Automatic',
      'make': 'Ford',
      'model': 'Fiesta',
      'transmission': 'Automatic'}]
    ...and from that list a list of dataframes per url and associated partition steps, such as:
    [df1, df2, ..., dfn]
    """
    def load_dataframe(url: str = None, partition_steps: dict = {}, file_format: str = None,
                       df: pyspark.sql.dataframe.DataFrame = None) -> pyspark.sql.dataframe.DataFrame:
        """
        Recursively load a dataframe and apply the partition steps via withColumn
        """
        if file_format is None or len(file_format) == 0:
            raise ValueError('file_format must not be none, the URL must end in the file format (.parquet, .csv, etc)')
        # if there is a url and non empty partition steps without a df loaded then load the dataframe
        if (url is not None and len(partition_steps.keys()) > 0 and df is None):
            # `spark` is the active SparkSession provided by the notebook environment
            df = spark.read.format(file_format).load(url)
            # df is loaded so do not pass a url indicating it is loaded
            return load_dataframe(url=None, partition_steps=partition_steps, file_format=file_format, df=df)
        # if here then the df is loaded and proceed to apply withColumn
        if (url is None and df is not None and len(partition_steps.keys()) > 0):
            # load the first item in the partition steps dict
            key = list(partition_steps.keys())[0]
            value = list(partition_steps.values())[0]
            # remove the first item from the partition steps dict
            partition_steps.pop(key)
            # load the dataframe with the new partition step
            df = df.withColumn(key, F.lit(value))
            return load_dataframe(url=None, partition_steps=partition_steps, file_format=file_format, df=df)
        # if it makes it here then the dataframe is loaded and the partition steps are applied
        return df

    # list of dataframe dict values of url and partition steps
    list_df_dicts = list()
    if not isinstance(dir_urls, list):
        raise TypeError('dir_urls must be a list of string values')
    # iterate over all urls and generate dict of partition values
    for url in dir_urls:
        # dict to store url and partition steps
        d_dict = dict()
        d_dict['url'] = url
        # get the format from the last part of the url
        file_format = url.split('.')[-1]
        d_dict['file_format'] = file_format
        # split the url keeping only partition steps (ex. make=Vauxhall)
        url_split = [u for u in d_dict['url'].split('/') if '=' in u]
        if len(url_split) == 0:
            raise ValueError('The list of URLs must contain the partition steps, ex. make=ford')
        # turn the partition=item into a key:value
        partition_items = [u.split('=') for u in url_split]
        # iterate over every item in partition_items=[['key', 'value']] and set dict[key] = value
        for item in partition_items:
            key = item[0]
            value = item[1]
            d_dict[key] = value
        list_df_dicts.append(d_dict)
    # iterate over all the dicts and load the dataframes to a list with their partition steps in place
    list_dfs = list()
    for d_dict in list_df_dicts:
        # get the url from the d_dict
        url = d_dict['url']
        # get the format
        file_format = d_dict['file_format']
        # remove the url and file_format from the d_dict so only partition steps remain
        d_dict.pop('url')
        d_dict.pop('file_format')
        df = load_dataframe(url=url, partition_steps=d_dict, file_format=file_format)
        list_dfs.append(df)
    # return the list of dataframes
    return list_dfs
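As an aside, depending on the Synapse runtime, Spark's built-in partition discovery may already give you these columns: if you read with a wildcard path and set the basePath option to the directory above the partition folders, the project, make and model path segments come back as ordinary columns without any withColumn work. A rough sketch under that assumption (the account/container below is a placeholder and spark is the notebook's active session):

# Sketch only: relies on Spark's standard partition discovery.
base = "abfss://container_name@yourgen2store.dfs.core.windows.net/"

df = (
    spark.read
         .option("basePath", base)  # keep partition columns derived from the path
         .parquet(base + "project=cars/make=*/model=*/")
)

df.select("project", "make", "model").distinct().show()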

Pandas: what are "string function names" called technically?

When using pandas you can in certain cases pass names of functions as strings instead of actual references to those functions. For example: df.transform('round').
In the pandas docs they call these "string function names" but is there another (perhaps more technical) name for these kinds of strings?
Well, Pandas doesn't really require this; it's just that in some cases, e.g. when using functions like mean, you have to put the name in quotes, otherwise you would get an error (there is no builtin mean to pass).
With cases like round the quotes aren't actually needed, since round is already a builtin function. The "string function names" are really just a way of referring to these functions so that they don't get mixed up with other objects.
As mentioned in the documentation link you provided, they call it:
string function name
There is really no special term for it, IMO.
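To make that concrete, here is a small sketch (the data is made up): the builtin round can be passed either way, while mean only works as a string unless you supply a function object such as np.mean yourself.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.25, 2.75], 'b': [3.5, 4.5]})  # toy data

df.transform('round')   # string function name
df.transform(round)     # also works: `round` is a Python builtin

df.agg('mean')          # string function name, resolved by pandas
df.agg(np.mean)         # passing an actual function object instead
# df.agg(mean)          # NameError: there is no builtin called `mean`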
By passing an invalid string to the aggregate method (ex. df.agg('max2')) and following the Traceback I got to the following code (pandas version 1.1.4):
class SelectionMixin:
    """
    mixin implementing the selection & aggregation interface on a group-like
    object sub-classes need to define: obj, exclusions
    """
    # < some lines deleted >
    def _try_aggregate_string_function(self, arg: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, str)
        f = getattr(self, arg, None)
        if f is not None:
            if callable(f):
                return f(*args, **kwargs)
            # people may try to aggregate on a non-callable attribute
            # but don't let them think they can pass args to it
            assert len(args) == 0
            assert len([kwarg for kwarg in kwargs if kwarg not in ["axis"]]) == 0
            return f
        f = getattr(np, arg, None)
        if f is not None:
            if hasattr(self, "__array__"):
                # in particular exclude Window
                return f(self, *args, **kwargs)
        raise AttributeError(
            f"'{arg}' is not a valid function for '{type(self).__name__}' object"
        )
It seems that we fall into this code whenever we pass a string function name to aggregate. If we were to look into the familiar pandas objects (Series, DataFrame, GroupBy) we would find that they inherit from SelectionMixin.
The string function names are looked up either in the pandas object itself (getattr(self, arg, None)) or in Numpy (getattr(np, arg, None)). So the string function names simply represent attributes of some object, either methods of a pandas object or functions defined in Numpy.
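In other words, a string function name is just the name of an attribute to be looked up. A tiny sketch of the equivalence this produces (not the actual pandas internals):

import pandas as pd

s = pd.Series([1, 2, 3])

# The string 'max' resolves to the method of the same name,
# so all three calls return the same value.
assert s.agg('max') == getattr(s, 'max')() == s.max()
print(s.agg('max'))  # 3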

How do I add a directory of .wav files to the Kedro data catalogue?

This is my first time trying to use the Kedro package.
I have a list of .wav files in an s3 bucket, and I'm keen to know how I can have them available within the Kedro data catalog.
Any thoughts?
I don't believe there's currently a dataset format that handles .wav files. You'll need to build a custom dataset that uses something like Wave - not as much work as it sounds!
This will enable you to do something like this in your catalog:
dataset:
  type: my_custom_path.WaveDataSet
  filepath: path/to/individual/wav_file.wav # this can be an s3:// URL
and you can then access your WAV data natively within your Kedro pipeline. You can do this for each .wav file you have.
If you wanted to be able to access a whole folder's worth of wav files, you might want to explore the notion of a "wrapper" dataset like the PartitionedDataSet, whose usage guide can be found in the documentation.
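For reference, a custom dataset along those lines might look roughly like the sketch below. It is untested and makes several assumptions: a pre-0.19 Kedro where AbstractDataSet lives in kedro.io, the standard-library wave module for reading, local file paths only (no s3:// handling), and placeholder class/module names that would need to match the type: entry in the catalog.

import wave
from pathlib import PurePosixPath

from kedro.io import AbstractDataSet


class WaveDataSet(AbstractDataSet):
    """Loads a single local .wav file as (params, frames) via the stdlib wave module."""

    def __init__(self, filepath: str):
        self._filepath = PurePosixPath(filepath)

    def _load(self):
        with wave.open(str(self._filepath), 'rb') as f:
            params = f.getparams()               # channels, sample width, framerate, ...
            frames = f.readframes(f.getnframes())
        return params, frames

    def _save(self, data) -> None:
        params, frames = data
        with wave.open(str(self._filepath), 'wb') as f:
            f.setparams(params)
            f.writeframes(frames)

    def _describe(self):
        return dict(filepath=str(self._filepath))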
This worked:
import pandas as pd
from pathlib import Path, PurePosixPath
from kedro.io import AbstractDataSet, PartitionedDataSet

class WavFile(AbstractDataSet):
    '''Used to load a .wav file'''
    def __init__(self, filepath):
        self._filepath = PurePosixPath(filepath)

    def _load(self) -> pd.DataFrame:
        # load_wav is a user-supplied helper that reads the audio data
        df = pd.DataFrame({'file': [self._filepath],
                           'data': [load_wav(self._filepath)]})
        return df

    def _save(self, df: pd.DataFrame) -> None:
        df.to_csv(str(self._filepath))

    def _exists(self) -> bool:
        return Path(self._filepath.as_posix()).exists()

    def _describe(self):
        return dict(filepath=self._filepath)

class WavFiles(PartitionedDataSet):
    '''Replaces the PartitionedDataSet.load() method to return a DataFrame.'''
    def load(self) -> pd.DataFrame:
        '''Returns dataframe'''
        dict_of_data = super().load()
        df = pd.concat(
            [delayed() for delayed in dict_of_data.values()]
        )
        return df

my_partitioned_dataset = WavFiles(
    path="path/to/folder/of/wav/files/",
    dataset=WavFile,
)

my_partitioned_dataset.load()

Dask array from_npy_stack misses info file

Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when writing a .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created
def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': x.dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da
data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2, 2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)
data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply use the lines of code that come after loading the "info" file, assuming you have the right values. Some of these values, i.e., the dtype and the shape of each array for making chunks, could presumably be obtained by looking at the first of the data files:
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
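If reproducing the graph construction feels too low-level, another option (not what the answer above does, just an alternative sketch) is to wrap each np.load in dask.delayed and stack the results; the shape and dtype are peeked from the first file, matching the question's example of three 2x2 arrays:

from pathlib import Path

import numpy as np
import dask
import dask.array as da

data_dir = Path('/home/tom/data/')
files = sorted(data_dir.glob('*.npy'))   # lexicographic order is fine for 0.npy..2.npy

# Peek at the first file for shape/dtype; assumes all files match it.
sample = np.load(files[0])

arrays = [
    da.from_delayed(dask.delayed(np.load)(f), shape=sample.shape, dtype=sample.dtype)
    for f in files
]
data = da.stack(arrays, axis=0)   # one chunk per file, stacked along a new axis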

Concurrently read an HDF5 file in Pandas

I have a data.h5 file organised in multiple chunks, the entire file being several hundred gigabytes. I need to work with a filtered subset of the file in memory, in the form of a Pandas DataFrame.
The goal of the following routine is to distribute the filtering work across several processes, then concatenate the filtered results into the final DataFrame.
Since reading from the file takes a significant amount of time, I'm trying to make each process read its own chunk in a concurrent manner as well.
import multiprocessing as mp, pandas as pd
store = pd.HDFStore('data.h5')
min_dset, max_dset = 0, len(store.keys()) - 1
dset_list = list(range(min_dset, max_dset))
frames = []
def read_and_return_subset(dset):
    # each process is intended to read its own chunk in a concurrent manner
    chunk = store.select('batch_{:03}'.format(dset))
    # and then process the chunk, do the filtering, and return the result
    output = chunk[chunk.some_condition == True]
    return output

with mp.Pool(processes=32) as pool:
    for frame in pool.map(read_and_return_subset, dset_list):
        frames.append(frame)

df = pd.concat(frames)
However, the above code triggers this error:
HDF5ExtError Traceback (most recent call last)
<ipython-input-174-867671c5a58f> in <module>()
53
54 with mp.Pool(processes=32) as pool:
---> 55 for frame in pool.map(read_and_return_subset, dset_list):
56 frames.append(frame)
57
/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
258 in a list that is returned.
259 '''
--> 260 return self._map_async(func, iterable, mapstar, chunksize).get()
261
262 def starmap(self, func, iterable, chunksize=None):
/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
HDF5ExtError: HDF5 error back trace
File "H5Dio.c", line 173, in H5Dread
can't read data
File "H5Dio.c", line 554, in H5D__read
can't read data
File "H5Dchunk.c", line 1856, in H5D__chunk_read
error looking up chunk address
File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
can't query chunk address
File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
can't get chunk info
File "H5B.c", line 340, in H5B_find
unable to load B-tree node
File "H5AC.c", line 1262, in H5AC_protect
H5C_protect() failed.
File "H5C.c", line 3574, in H5C_protect
can't load entry
File "H5C.c", line 7954, in H5C_load_entry
unable to load entry
File "H5Bcache.c", line 143, in H5B__load
wrong B-tree signature
End of HDF5 error back trace
Problems reading the array data.
It seems that Pandas/PyTables has trouble when trying to access the same file in a concurrent manner, even if it's only for reading.
Is there a way to make each process read its own chunk concurrently?
IIUC you can index the columns that are used for filtering the data (chunk.some_condition == True in your sample code) and then read only the subset of data that satisfies the needed conditions.
In order to be able to do that you need to:
save the HDF5 file in table format - use the parameter format='table'
index the columns that will be used for filtering - use the parameter data_columns=['col_name1', 'col_name2', etc.]
After that you should be able to filter your data just by reading:
store = pd.HDFStore(filename)
df = store.select('key_name', where="col1 in [11,13] & col2 == 'AAA'")
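For completeness, a short sketch of both halves with toy data (the column names and filter mirror the answer's example): the point is that the where clause is evaluated by PyTables on disk, so only matching rows ever reach memory.

import pandas as pd

df = pd.DataFrame({'col1': [11, 12, 13], 'col2': ['AAA', 'BBB', 'AAA']})  # toy data

# write: table format + data_columns makes col1/col2 queryable on disk
df.to_hdf('data.h5', key='batch_001', format='table', data_columns=['col1', 'col2'])

# read: the where clause is pushed down, so only matching rows are loaded
with pd.HDFStore('data.h5') as store:
    subset = store.select('batch_001', where="col1 in [11, 13] & col2 == 'AAA'")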