Combine/Append two Dictionaries of Dataframes with same keys

I have two dictionaries of DataFrames, dict_of_balance_winter and dict_of_balance_summer.
Both hold 365 days of data. For example, to access the data for 1st January 2023 I write dict_winter['20230101'] and it gives me the winter crop data for that day; similarly, dict_summer['20230101'] displays the DataFrame with the summer crop data. Both dictionaries have the same keys, so I cannot append the DataFrames with the dictionary update() method, because the value stored under a shared key simply gets overwritten.
I used the code below to append the dictionaries:
from itertools import chain
from collections import defaultdict

# Collect the values from both dictionaries under their shared keys.
dall = defaultdict(list)
for k, v in chain(dict_of_balance_summer.items(), dict_of_balance_winter.items()):
    dall[k].append(v)

# Convert the defaultdict (and any nested defaultdicts) back to a plain dict.
def ddict2dict(d):
    for k, v in d.items():
        if isinstance(v, dict):
            d[k] = ddict2dict(v)
    return dict(d)

mydict = ddict2dict(dall)
This returns a dictionary, but I am not able to access the DataFrame at each key.
I printed the dictionary I get after converting the defaultdict back to a dict: the keys are correct, but each value is now a list of DataFrames rather than a DataFrame. How can I concatenate the two dictionaries and still be able to access the DataFrames through the keys of the resulting dictionary? Please help!
In short: I tried to append two dictionaries of DataFrames with the same keys by using chain and defaultdict and then converting the defaultdict back to a regular dictionary. This did not preserve the result as a dictionary of DataFrames; the keys were retrieved correctly, but the values are lists instead of DataFrames.
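A minimal sketch of what seems to be wanted (assuming both dictionaries really do share exactly the same keys and each value is a DataFrame): build a new dictionary whose value for each key is the pd.concat of the summer and winter frames, so every key still maps to a single DataFrame.

import pandas as pd

# One DataFrame per key instead of a list of DataFrames.
combined = {
    key: pd.concat([dict_of_balance_summer[key], dict_of_balance_winter[key]])
    for key in dict_of_balance_summer
}

combined['20230101']  # a single DataFrame holding the summer and winter rows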

Related

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to dataframes followed by a copy, or when a dataframe is saved and loaded from disk. The following example compares the inconsistent hash output when applied to dataframes, or the consistent results when applied to the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utililize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so it will cache based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.

Sparse columns in pandas: directly access the indices of non-null values

I have a large dataframe (approx. 10^8 rows) with some sparse columns. I would like to be able to quickly access the non-null values in a given column, i.e. the values that are actually stored in the array. I figured this could be achieved by df.<column name>[<indices of non-null values>]. However, I can't see how to access <indices of non-null values> directly, i.e. without any computation. When I try df.<column name>.index it tells me that it's a RangeIndex, which doesn't help. I can even see <indices of non-null values> when I run df.<column name>.values, but looking through dir(df.<column name>.values) I still can't see a way to access them.
To make clear what I mean, here is a toy example:
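For illustration (the numbers here are placeholders, not the original data), a frame with one sparse column whose stored values sit at positions 0, 1 and 3 could be built like this:

import numpy as np
import pandas as pd

# One sparse column; only positions 0, 1 and 3 hold stored (non-null) values.
df = pd.DataFrame({'a': pd.arrays.SparseArray([1.0, 2.0, np.nan, 3.0, np.nan])})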
In this example <indices of non-null values> is [0,1,3].
EDIT: The answer below by @Piotr Żak is a viable solution, but it requires computation. Is there a way to access <indices of non-null values> directly via an attribute of the column or array?
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[1], [np.nan], [4], [np.nan], [9]]),
                  columns=['a'])

Just filter out the NaN values:

filtered_df = df[df['a'].notnull()]

Transform the column from the df to an array:

s_array = filtered_df[["a"]].to_numpy()

or transform the indexes from the df to a list:

filtered_df.index.tolist()

how to compress lists/nested lists in hdf5

I recently learned about hdf5 compression and have been working with it. It has some advantages over .npz/.npy when working with gigantic files.
I tried it out on a small list, since I sometimes work with lists that contain strings, as follows:
import h5py

def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
    f.close()
However I got this error:
    f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)
After searching the net for hours for a better way to do this, I couldn't find one.
Is there a better way to compress lists with H5?
This is a more general answer for nested lists where each nested list has a different length. It also works for the simpler case where the nested lists have equal length. There are two solutions: one with h5py and one with PyTables.
h5py example
h5py does not support ragged arrays, so you have to create a dataset sized to the longest sublist and pad the "short" sublists.
You will get 'None' (or a substring) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. The example below shows how to find the longest string in the list (as slen=##) and uses it to create dtype='S##'.
import h5py
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))

test_array = np.array([tl + [None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))

with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
PyTables example
PyTables supports ragged 2-D arrays as VLArrays (variable-length arrays). This avoids the complication of padding the "short" sublists with 'None' values. Also, you don't have to determine the array length in advance, as the number of rows is not defined when the VLArray is created (rows are appended after creation). Again, take care with the dtype= entry. The same method as above is used to determine the string length.
import tables as tb
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'],
             ['a11','a12','a13','a14','a15','a16','a17'],
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

slen = max(len(item) for sublist in test_list for item in sublist)

with tb.File('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/', 'vla_test', tb.StringAtom(slen))
    for slist in test_list:
        arr = np.array(slist, dtype='S'+str(slen))
        vlarray.append(arr)

    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)
You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.
I removed the f.close() statement, as this is not required when using a context manager (with / as: structure)
Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.
import h5py
import numpy as np

test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2',
                       'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')

data_list = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]

with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)
    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
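If it helps to verify the result, here is a short read-back sketch (note that h5py returns fixed-width 'S' strings as bytes, so decode them if you want Python str objects):

import h5py

with h5py.File('example_file.h5', 'r') as f:
    strings = [s.decode('ascii') for s in f['test3'][:]]   # bytes -> str
    numbers = f['test4'][:]                                 # NumPy integer array

print(strings[:4], numbers[:4])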

How to access generator object from executor.map?

I have a function that converts non-numerical data in a dataframe to numerical data.
import numpy as np
import pandas as pd
from concurrent import futures

def convert_to_num(df):
    # do stuff: convert the non-numerical columns to numerical ones
    return df
I want to use the futures library to speed up this task. This is how I am using the library:
with futures.ThreadPoolExecutor() as executor:
    df_test = executor.map(convert_to_num, df_sample)
First, I do not see the variable df_test being created, and second, when I inspect df_test I get this message:
<generator object Executor.map.<locals>.result_iterator at >
What am I doing wrong that I cannot use the futures library? Can I only use this library to iterate values into a function, rather than passing an entire dataframe to be edited?
The map method of the executor object, as per the documentation, takes the following arguments:
map(func, *iterables, timeout=None, chunksize=1)
In your example you only provide a single df (the df_sample), but you could provide a list of df_samples, which is passed in as the iterables parameter.
For example, let us create a list of dataframes:
import concurrent.futures
import pandas as pd
df_samples = [pd.DataFrame({f"col{j}{i}": [j,i] for i in range(1,5)}) for j in range(1,5)]
df_samples is then a list of four small DataFrames. Now we add a function which will add an additional column to a df:
def add_x_column(df):
    df['col_x'] = ['a', 'b']
    return df
and now use the ThreadPoolExecutor to apply this function to the df_samples list in a concurrent manner. You also need to convert the generator object returned by map to a list to access the changed df's:
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(add_x_column, df_samples))
results is then the list of resulting df's, each with the extra col_x column.
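If the aim is instead to speed up work on one large dataframe, one possible pattern (a sketch; the chunk size and the slicing approach are my own choices, not from the original post) is to split the frame into row-wise chunks, map the conversion over the chunks, and concatenate the results:

import pandas as pd
from concurrent import futures

def convert_to_num(df):
    return df  # placeholder for the real conversion

df_sample = pd.DataFrame({'x': ['1', '2', '3', '4', '5', '6', '7', '8']})

# Split into row-wise chunks, convert each chunk in a worker thread, then reassemble.
chunk_size = 2
chunks = [df_sample.iloc[i:i + chunk_size] for i in range(0, len(df_sample), chunk_size)]

with futures.ThreadPoolExecutor() as executor:
    df_test = pd.concat(executor.map(convert_to_num, chunks))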

create dask DataFrame from a list of dask Series

I need to create a dask DataFrame from a set of dask Series, analogously to constructing a pandas DataFrame from lists:
pd.DataFrame({'l1': list1, 'l2': list2})
I am not seeing anything for this in the API. The dask DataFrame constructor is not supposed to be called by users directly and takes a computation graph as its main argument.
In general I agree that it would be nice for the dd.DataFrame constructor to behave like the pd.DataFrame constructor.
If your series have well defined divisions then you might try dask.dataframe.concat with axis=1.
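For instance (a sketch; s1 and s2 stand in for your dask Series and are assumed to have known, matching divisions):

import dask.dataframe as dd

# Column-wise concatenation of two aligned dask Series into one dask DataFrame.
df = dd.concat([s1, s2], axis=1)
df.columns = ['l1', 'l2']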
You could also try converting one of the series into a DataFrame and then use assignment syntax:
L = [...]  # list of dask Series

df = L[0].to_frame()
for s in L[1:]:
    df[s.name] = s