Boolean Indexing in Numpy involving two arrays

I was reading a book on Data Analysis with Python where there's a topic on Boolean Indexing.
This is the Code given in the Book:
>>> import numpy as np
>>> names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
>>> data = np.random.randn(7,4)
>>> names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
>>> data
array([[ 0.35214065, -0.6258314 , -1.18156785, -0.75981437],
       [-0.54500574, -0.21700484,  0.34375588, -0.99216205],
       [ 0.29883509, -3.08641931,  0.61289669,  0.58233649],
       [ 0.32047465,  0.05380018, -2.29797299,  0.04553794],
       [ 0.35764077, -0.51405297, -0.21406197, -0.88982479],
       [-0.59219242, -1.87402141, -2.66339726,  1.30208623],
       [ 0.32612407,  0.19612659, -0.63334406,  1.0275622 ]])
>>> names == 'Bob'
array([ True, False, False, True, False, False, False])
Up to this point everything is perfectly clear, but I'm unable to understand what happens when they do data[names == 'Bob'].
>>> data[names == 'Bob']
array([[ 0.35214065, -0.6258314 , -1.18156785, -0.75981437],
       [ 0.32047465,  0.05380018, -2.29797299,  0.04553794]])
>>> data[names == 'Bob', 2:]
array([[-1.18156785, -0.75981437],
       [-2.29797299,  0.04553794]])
How is this happening?

data[names == 'Bob']
is the same as:
data[[True, False, False, True, False, False, False]]
And this just means: take row 0 and row 3 from data, the positions where the mask is True.
data[names == 'Bob', 2:]
selects the same rows, but restricts the columns to those from index 2 onward. The part before the comma selects rows; the part after the comma selects columns.
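To make the equivalence concrete, here is a small deterministic sketch (using np.arange instead of random data, so the selected rows are easy to check):

import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)   # 7 rows, 4 columns, values 0..27

mask = names == 'Bob'   # array([ True, False, False,  True, False, False, False])

# Boolean indexing keeps the rows where the mask is True (rows 0 and 3):
print(np.array_equal(data[mask], data[[0, 3]]))           # True
# Before the comma: rows; after the comma: columns from index 2 on:
print(np.array_equal(data[mask, 2:], data[[0, 3], 2:]))   # True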

Related

How to combine two Pandas dataframes into a single one across axis=2 (i.e. so that the cell values are tuples)?

I have two (large) dataframes. They have the same index & columns, and I want to combine them so that they have tuple values in each cell.
The example explains it best:
df1 = pd.DataFrame({
    'A': [True, True, False],
    'B': [False, True, False],
})
df2 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [5, 6, 7],
})
# Desired output:
pd.DataFrame({
    'A': [(True, 1), (True, 2), (False, 3)],
    'B': [(False, 5), (True, 6), (False, 7)],
})
The DataFrames are large (1m+ rows), so I'm looking to do this reasonably efficiently.
I tried np.stack([df1.values, df2.values], axis=2) and that got me the right value array, but I could not convert it into a dataframe.
Any ideas?
I got your desired output with this solution:
import pandas as pd

df1 = pd.DataFrame({
    'A': [True, True, False],
    'B': [False, True, False],
})
df2 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [5, 6, 7],
})

# Walk the two frames column by column, zipping the cell values into tuples
for df_1k, df_2k in zip(df1.columns, df2.columns):
    df1[df_1k] = list(map(tuple, zip(df1[df_1k], df2[df_2k])))

print(df1)
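Alternatively, you can build on the np.stack attempt from the question and convert the stacked value array back into a DataFrame yourself. A minimal sketch, starting again from the original df1 and df2 (object dtype keeps the bools from being upcast to ints when stacked):

import numpy as np
import pandas as pd

# Shape (n_rows, n_cols, 2): each cell pairs the values from df1 and df2
stacked = np.stack([df1.to_numpy(dtype=object), df2.to_numpy(dtype=object)], axis=2)

# Rebuild a DataFrame with one (df1_value, df2_value) tuple per cell
out = pd.DataFrame(
    [[tuple(cell) for cell in row] for row in stacked],
    index=df1.index,
    columns=df1.columns,
)
print(out)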

Set index for aggregated dataframe

I did some calculations on a list of dataframes. I'd like the resulting dataframe to use a RangeIndex. However, it uses one of the column names as the index, even though I set index=None.
d1 = {'id': [1, 2, 3, 4, 5], 'is_free': [True, False, False, True, True], 'level': ['Top', 'Mid', 'Top', 'Top', 'Low']}
d2 = {'id': [1, 3, 4, 5, 7], 'is_free': [True, True, False, False, False], 'level': ['Top', 'High', 'Top', 'Top', 'Low']}
d1 = pd.DataFrame(data=d1)
d2 = pd.DataFrame(data=d2)
df_list = [d1, d2]
dfs = []
for i, df in enumerate(df_list):
    df = df.groupby('is_free')['id'].count()
    dfs.append(df)
df = pd.DataFrame(data=dfs, index=None)
It returns
is_free  False  True
id           2     3
id           3     2
df.index returns
Index(['id', 'id'], dtype='object')
From your code (index=None just means "use the default"; since each row is a Series named 'id', the default index is those names, so you have to drop it explicitly):
df = pd.DataFrame(data=dfs, index=None).reset_index(drop=True)
However, in general, I would avoid appending iteratively. Try concat:
pd.concat({i: d.groupby('is_free')['id'].count()
           for i, d in enumerate(df_list)},
          axis=1).T
Or use pd.DataFrame:
pd.DataFrame({i: d.groupby('is_free')['id'].count()
              for i, d in enumerate(df_list)}).T
Output:
is_free  False  True
0            2     3
1            3     2

numpy unique over multiple arrays

numpy.unique expects a 1-D array; if the input is not 1-D, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's just say a pair of arrays, where we are unique-ing pairs of elements across the two arrays.
For example, say I have two NumPy arrays as inputs:
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing against both of these arrays, i.e. against the 4 pairs (1,10), (2,20), (3,30), and (3,31). These 4 are all unique, so I want my result to say:
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]
You could use the unique_indices value returned by numpy.unique():
In [243]: def is_unique(*lsts):
     ...:     # Stack the inputs so that each column holds one tuple of paired elements
     ...:     arr = np.vstack(lsts)
     ...:     # Indices of the first occurrence of each unique column
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     # Mark first occurrences True, later duplicates False
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])
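Applied to the question's own inputs, the function should give the requested masks (np.vstack makes each pair one column, and np.unique with axis=1 treats each column as a single element):
In [249]: is_unique([1, 2, 3, 3], [10, 20, 30, 31])
Out[249]: array([ True,  True,  True,  True])
In [250]: is_unique([1, 2, 3, 3], [10, 20, 30, 30])
Out[250]: array([ True,  True,  True, False])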
You may also find this thread helpful.

Do operation based on a current event index, and the index of the next event of another column [pandas]

OK, so this one is very hard to describe, so I will just put together an example to explain.
df = pd.DataFrame({'event_a': [False, True, False, False, False, True, False, False, False, True, False],
                   'event_b': [False, False, False, True, False, False, False, False, True, False, False],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
Here we have two event columns and a value column. The events will always alternate (there will never be event a and event b at the same index, and there will never be two of the same event in a row without the other event in between).
The specific operation I want to perform is abs(next_value / current_value - 1), where next_value is the value at the next event row.
Given this, my output for this example should look like...
output = [na, 1, na, 0.5, na, 0.5, na, na, 0.111, na, na]
The second row (index 1), for example, is abs(4 (value at the next event) / 2 (value at the current event) - 1) = 1.
Try doing:
# Rows where either event fired
cond = df.loc[:, ['event_a', 'event_b']].any(axis=1)
output = np.ones(cond.size) * np.nan
# Among event rows only: abs(next event's value / current event's value - 1)
output[cond] = (df.loc[cond, 'value'].shift(-1) / df.loc[cond, 'value']).subtract(1).abs()
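A minimal check, assuming the df built in the question above and attaching the result as a new column:

df['output'] = output
print(df['output'].tolist())
# [nan, 1.0, nan, 0.5, nan, 0.5, nan, nan, 0.111..., nan, nan]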

numpy.genfromtxt cannot read boolean data correctly

No matter what the input value is, np.genfromtxt always returns False.
Using dtype='u1' I get 1 as expected, but with dtype='b1' (NumPy's bool) I get False.
I don't know if this is a bug or not, but so far, I've been able to get dtype=bool to work (without an explicit converter) only if the file contains the literal strings 'False' and 'True':
In [21]: bool_lines = ['False,False', 'False,True', 'True,False', 'True,True']
In [22]: genfromtxt(bool_lines, delimiter=',', dtype=bool)
Out[22]:
array([[False, False],
       [False,  True],
       [ True, False],
       [ True,  True]], dtype=bool)
If your data is 0s and 1s, you can read it as integers and then convert to bool:
In [26]: bits = ['0,0', '0,1', '1,0', '1,1']
In [27]: genfromtxt(bits, delimiter=',', dtype=np.uint8).astype(bool)
Out[27]:
array([[False, False],
       [False,  True],
       [ True, False],
       [ True,  True]], dtype=bool)
Or you can use a converter for each column:
In [28]: cnv = lambda s: bool(int(s))
In [29]: converters = {0: cnv, 1: cnv}
In [30]: genfromtxt(bits, delimiter=',', dtype=bool, converters=converters)
Out[30]:
array([[False, False],
       [False,  True],
       [ True, False],
       [ True,  True]], dtype=bool)
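The transcripts above pass lists of strings to genfromtxt; the same approaches work on a real file or file-like object. A minimal sketch of the read-as-integers-then-cast route, using io.StringIO as a stand-in for a file of comma-separated 0s and 1s:

from io import StringIO
import numpy as np

f = StringIO('0,0\n0,1\n1,0\n1,1')   # stand-in for a real file
arr = np.genfromtxt(f, delimiter=',', dtype=np.uint8).astype(bool)
print(arr)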