Pandas to dict conversion with condition

My dataframe:
data_part = [{'Part': 'A', 'Engine': True, 'TurboCharger': True, 'Restricted': True},
             {'Part': 'B', 'Engine': False, 'TurboCharger': True, 'Restricted': False}]
My expected output is this:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
'B': {'TurboCharger': 1}}
This is what I am doing:
df_part = pd.DataFrame(data_part).set_index('Part').astype(int).to_dict('index')
This is what it gives:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
'B': {'Engine': 0, 'TurboCharger': 1, 'Restricted': 0}}
Is there anything that can be done to reach the expected output?

We can fix your output: stack, filter out the zeros, then group back by Part.
d = (pd.DataFrame(data_part)
       .set_index('Part')
       .astype(int)
       .stack()
       .loc[lambda x: x != 0]
       .reset_index('Part')
       .groupby('Part')
       .agg(dict)[0]
       .to_dict())
Out[192]:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
'B': {'TurboCharger': 1}}

You can call agg before to_dict to keep only the True entries in each row:
df_part = (pd.DataFrame(data_part).set_index('Part')
           .agg(lambda x: dict(x[x].astype(int)), axis=1)
           .to_dict())
Out[60]:
{'A': {'Engine': 1, 'Restricted': 1, 'TurboCharger': 1},
'B': {'TurboCharger': 1}}

Here's a way to convert the list to a dict without pandas:
from pprint import pprint
data_2 = dict()
for dp in data_part:
    ts = [(k, v) for k, v in dp.items()]
    key = ts[0][1]
    values = {k: int(v) for k, v in ts[1:] if v}
    data_2[key] = values
pprint(data_2)
{'A': {'Engine': 1, 'Restricted': 1, 'TurboCharger': 1},
'B': {'TurboCharger': 1}}
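Another option, not one of the posted answers but a small sketch along the same lines: keep the original to_dict('index') call and drop the zero-valued entries afterwards with a dict comprehension:
import pandas as pd

data_part = [{'Part': 'A', 'Engine': True, 'TurboCharger': True, 'Restricted': True},
             {'Part': 'B', 'Engine': False, 'TurboCharger': True, 'Restricted': False}]

# Same call as in the question, then filter out the 0 entries per row.
d = pd.DataFrame(data_part).set_index('Part').astype(int).to_dict('index')
d = {part: {k: v for k, v in row.items() if v} for part, row in d.items()}
# {'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1}, 'B': {'TurboCharger': 1}}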

Related

How to stack a pd.DataFrame until it becomes a pd.Series?

I have the following pd.DataFrame:
df = pd.DataFrame(
    data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
    columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))
and I would like to convert it to a flat 2D table with no MultiIndex on either the rows or the columns:
pd.DataFrame(
    data=[
        ['dog', 'kg', 100, '2019', 'Apr'],
        ['dog', 'kg', 241, '2018', 'Oct'],
        ['cat', 'lbs', 300, '2019', 'Apr'],
        ['cat', 'lbs', 1, '2018', 'Oct'],
    ],
    columns=['animal', 'unit', 'value', 'year', 'month'],
)
To achieve this, I use df.stack().stack(), which becomes a pd.Series, and then I call .reset_index() on that Series to convert it back to a DataFrame.
My question is: how do I avoid the second (or further) stack() calls?
Is there a way to stack a pd.DataFrame until it becomes a pd.Series?
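One possible answer, offered here only as a sketch (not from the original thread): stack all column levels in a single call by passing every level number; once no column levels remain, the result is a Series rather than a DataFrame.
import pandas as pd

# df as constructed in the question
df = pd.DataFrame(
    data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
    columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))

# Stack every column level at once; the result is a Series, not a DataFrame.
s = df.stack(level=list(range(df.columns.nlevels)))

# Flatten back to plain columns as in the desired output.
out = s.rename('value').reset_index()
out.columns = ['animal', 'unit', 'year', 'month', 'value']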

numpy unique over multiple arrays

numpy.unique expects a 1-D array; if the input is not 1-D, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's just say a pair of arrays, and we are unique-ing the pair of elements across the 2 arrays.
For example, say I have 2 numpy array as inputs
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing against both of these arrays, so against these 4 pairs (1, 10), (2, 20), (3, 30), and (3, 31). These 4 are all unique, so I want my result to say
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]
You could use the unique_indices value returned by numpy.unique():
In [243]: def is_unique(*lsts):
     ...:     arr = np.vstack(lsts)
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])
You may also find this thread helpful.
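As a quick sanity check (assuming numpy is imported as np and the is_unique helper defined above), the second pair of arrays from the question gives the expected mask:
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
print(is_unique(a, b))
# [ True  True  True False]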

Pandas dataframe to json with key

I have a dataframe with columns ['a', 'b', 'c']
and would like to export it as a dictionary as follows:
{'value of a': {'b': 3, 'c': 7},
 'value2 of a': {'b': 7, 'c': 9}}
I believe you need set_index with DataFrame.to_dict:
df = pd.DataFrame({'a': list('ABC'),
                   'b': [4, 5, 4],
                   'c': [7, 8, 9]})
print (df)
a b c
0 A 4 7
1 B 5 8
2 C 4 9
d = df.set_index('a').to_dict('index')
print (d)
{'A': {'b': 4, 'c': 7}, 'B': {'b': 5, 'c': 8}, 'C': {'b': 4, 'c': 9}}
And for json use DataFrame.to_json:
j = df.set_index('a').to_json(orient='index')
print (j)
{"A":{"b":4,"c":7},"B":{"b":5,"c":8},"C":{"b":4,"c":9}}

Use a ufunc analogous to numpy.where

For example, if I want to add conditionally, I can use:
y = numpy.where(condition, a+b, b)
Is there a way to directly combine a ufunc and where? Something like:
y = numpy.add.where(condition, a, b)
Something along those lines is add.at.
In [21]: b = np.arange(10)
In [22]: cond = b%3==0
Your where:
In [24]: np.where(cond, 10+b, b)
Out[24]: array([10, 1, 2, 13, 4, 5, 16, 7, 8, 19])
Use the other where (or np.nonzero) to turn the boolean mask into an index tuple:
In [25]: cond
Out[25]: array([ True, False, False, True, False, False, True, False, False, True], dtype=bool)
In [26]: idx = np.where(cond)
In [27]: idx
Out[27]: (array([0, 3, 6, 9], dtype=int32),)
add.at does in-place, unbuffered addition:
In [28]: np.add.at(b,idx[0],10)
In [29]: b
Out[29]: array([10, 1, 2, 13, 4, 5, 16, 7, 8, 19])
add.at is intended as a way of getting around buffering problems with the more direct index +=:
In [30]: b = np.arange(10)
In [31]: b[idx[0]] += 10
In [32]: b
Out[32]: array([10, 1, 2, 13, 4, 5, 16, 7, 8, 19])
Here the action is the same (add.at is slower). But if there were duplicates in idx, the results would be different.
+= also works with the boolean mask:
In [33]: b[cond] -= 10
In [34]: b
Out[34]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
There's got to be a ufunc equivalent to the += operator, but I don't use ufuncs enough to know it off hand.
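Worth adding, though it is not part of the original answer: ufuncs accept where= and out= keyword arguments, which is probably the closest built-in analogue to the add.where the question asks about. A minimal sketch:
import numpy as np

b = np.arange(10)
cond = b % 3 == 0

# np.add writes b + 10 into `out` only where cond is True and leaves the
# remaining entries of `out` untouched, so start from a copy of b.
out = b.copy()
np.add(b, 10, out=out, where=cond)
# out is now array([10, 1, 2, 13, 4, 5, 16, 7, 8, 19])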

Dask bag from multiple files into Dask dataframe with columns

I am given a list of filenames, files, which contain comma-delimited data that has to be cleaned and then extended with columns derived from the filenames. So I implemented a small read_file function that handles both the initial cleaning and the computation of the additional columns. Using db.from_sequence(files).map(read_file), I map the read function over all of the files and get a list of dictionaries for each one.
However, rather than a list of dictionaries, I want my bag to contain each individual line of the input files as an entry. Subsequently, I want to map the keys of the dictionaries to column names in a dask dataframe.
from dask import bag as db

def read_file(filename):
    ret = []
    with open(filename, 'r') as fp:
        ... # reading line of file and storing result in dict
        ret.append({'a': val_a, 'b': val_b, 'c': val_c})
    return ret

files = ['a.txt', 'b.txt', 'c.txt']
my_bag = db.from_sequence(files).map(read_file)
# a, b, c are the keys of the dictionaries returned by read_file
my_df = my_bag.to_dataframe(columns=['a', 'b', 'c'])
Could someone let me know what I have to change to get this code running? Are there different approaches that would be more suitable?
Edit:
I have created three test files a_20160101.txt, a_20160102.txt, a_20160103.txt. All of them contain just a few lines with a single string each.
asdf
sadfsadf
sadf
fsadff
asdf
sadfasd
fa
sf
ads
f
Previously I had a small error in read_file, but now calling my_bag.take(10) after mapping the reader works fine:
([{'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'asdf', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sadfsadf', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sadf', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'fsadff', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'asdf', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sadfasd', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'fa', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sf', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'ads', 'c': 'XY'}, {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'f', 'c': 'XY'}],)
However my_df = my_bag.to_dataframe(columns=['a', 'b', 'c']) and subsequently
my_df.head(10) still raises dask.async.AssertionError: 3 columns passed, passed data had 10 columns
You probably need to call flatten
Your bag of filenames looks like this:
['a.txt',
'b.txt',
'c.txt']
After you call map your bag looks like this:
[[{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}],
[{'a': 1, 'b': 2, 'c': 3}],
[{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]]
Each file was turned into a list of dicts. Now your bag is kind of like a list-of-lists-of-dicts.
The .to_dataframe method wants a list-of-dicts, so let's flatten our bag into a single collection of dicts:
my_bag = db.from_sequence(files).map(read_file).flatten()
[{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30},
{'a': 1, 'b': 2, 'c': 3},
{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]
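With the bag flattened, the to_dataframe call from the question should then line up with the dict keys. A sketch reusing read_file and the filenames from above:
from dask import bag as db

files = ['a.txt', 'b.txt', 'c.txt']
my_bag = db.from_sequence(files).map(read_file).flatten()
my_df = my_bag.to_dataframe(columns=['a', 'b', 'c'])
print(my_df.head(10))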