I was reading a book on Data Analysis with Python where there's a topic on Boolean Indexing.
This is the code given in the book:
>>> import numpy as np
>>> names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
>>> data = np.random.randn(7,4)
>>> names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
>>> data
array([[ 0.35214065, -0.6258314 , -1.18156785, -0.75981437],
       [-0.54500574, -0.21700484,  0.34375588, -0.99216205],
       [ 0.29883509, -3.08641931,  0.61289669,  0.58233649],
       [ 0.32047465,  0.05380018, -2.29797299,  0.04553794],
       [ 0.35764077, -0.51405297, -0.21406197, -0.88982479],
       [-0.59219242, -1.87402141, -2.66339726,  1.30208623],
       [ 0.32612407,  0.19612659, -0.63334406,  1.0275622 ]])
>>> names == 'Bob'
array([ True, False, False, True, False, False, False])
Up to this point it's perfectly clear, but I'm unable to understand what happens when they do data[names == 'Bob']:
>>> data[names == 'Bob']
array([[ 0.35214065, -0.6258314 , -1.18156785, -0.75981437],
       [ 0.32047465,  0.05380018, -2.29797299,  0.04553794]])
>>> data[names == 'Bob', 2:]
array([[-1.18156785, -0.75981437],
       [-2.29797299,  0.04553794]])
How is this happening?
data[names == 'Bob']
is the same as:
data[[True, False, False, True, False, False, False]]
And this just means: get row 0 and row 3 from data (the positions where the mask is True).
data[names == 'Bob',2:]
gives the same rows, but restricts the columns to those from column 2 onward. The part before the comma selects rows; the part after the comma selects columns.
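The row-selection behaviour can be verified with a small, reproducible sketch; it uses np.arange instead of randn so the selected values are predictable, and the variable names mirror the book's example:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)  # row i holds the values 4*i .. 4*i+3

mask = names == 'Bob'   # [True, False, False, True, False, False, False]
rows = data[mask]       # keeps only rows 0 and 3
sub = data[mask, 2:]    # same rows, columns 2 and onward

print(rows)
print(sub)
```

The mask has True at positions 0 and 3, so `rows` is exactly rows 0 and 3 of `data`, and `sub` is those same rows restricted to the last two columns.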
I have two (large) dataframes. They have the same index & columns, and I want to combine them so that they have tuple values in each cell.
The example explains it best:
df1 = pd.DataFrame({
    'A': [True, True, False],
    'B': [False, True, False],
})
df2 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [5, 6, 7],
})
# Desired output:
pd.DataFrame({
    'A': [(True, 1), (True, 2), (False, 3)],
    'B': [(False, 5), (True, 6), (False, 7)],
})
The DataFrames are large (1m rows+), so looking to do this somewhat efficiently.
I tried np.stack([df1.values, df2.values], axis=2) and that got me the right value array, but I could not convert it into a dataframe.
Any ideas?
I got your desired output with this solution:
import pandas as pd

df1 = pd.DataFrame({
    'A': [True, True, False],
    'B': [False, True, False],
})
df2 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [5, 6, 7],
})

for df_1k, df_2k in zip(df1.columns, df2.columns):
    df1[df_1k] = list(map(tuple, zip(df1[df_1k], df2[df_2k])))
print(df1)
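Alternatively, since the question notes that np.stack already produced the right value array, here is a sketch of converting that stacked array back into a DataFrame. Casting to object dtype before stacking preserves the bool/int distinction inside the tuples (a plain np.stack would coerce True to 1):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [True, True, False], 'B': [False, True, False]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 6, 7]})

# shape (rows, cols, 2); object dtype keeps True/False as bools
stacked = np.stack([df1.to_numpy(dtype=object),
                    df2.to_numpy(dtype=object)], axis=2)

# rebuild each column as a list of tuples
out = pd.DataFrame(
    {col: [tuple(pair) for pair in stacked[:, i]]
     for i, col in enumerate(df1.columns)},
    index=df1.index,
)
print(out)
```

This still materialises one Python tuple per cell, so for 1m+ rows the per-column zip in the answer above and this version should perform similarly; tuples in cells are inherently object dtype either way.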
I did some calculations on a list of dataframes. I'd like the result dataframe to use a RangeIndex. However, it uses one of the column names as the index, even though I set index=None.
d1 = {'id': [1, 2, 3, 4, 5], 'is_free': [True, False, False, True, True], 'level': ['Top', 'Mid', 'Top', 'Top', 'Low']}
d2 = {'id': [1, 3, 4, 5, 7], 'is_free': [True, True, False, False, False], 'level': ['Top', 'High', 'Top', 'Top', 'Low']}
d1 = pd.DataFrame(data=d1)
d2 = pd.DataFrame(data=d2)
df_list = [d1, d2]
dfs = []
for i, df in enumerate(df_list):
    df = df.groupby('is_free')['id'].count()
    dfs.append(df)
df = pd.DataFrame(data=dfs, index=None)
It returns
is_free  False  True
id           2     3
id           3     2
df.index returns
Index(['id', 'id'], dtype='object')
From your code:
df = pd.DataFrame(data=dfs, index=None).reset_index(drop=True)
However, in general, I would avoid appending iteratively. Try concat instead:
pd.concat({i:d.groupby('is_free')['id'].count()
for i,d in enumerate(df_list)},
axis=1).T
Or use pd.DataFrame:
pd.DataFrame({i:d.groupby('is_free')['id'].count()
for i,d in enumerate(df_list)}).T
Output:
is_free  False  True
0            2     3
1            3     2
numpy.unique expects a 1-D array; if the input is not 1-D, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's just say a pair of arrays, and we are unique-ing the pairs of elements across the two arrays.
For example, say I have 2 numpy array as inputs
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing across both of these arrays, so against these 4 pairs: (1, 10), (2, 20), (3, 30), and (3, 31). All 4 are unique, so I want my result to be
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]
You could use the indices returned by numpy.unique() when called with return_index=True:
In [243]: def is_unique(*lsts):
     ...:     arr = np.vstack(lsts)
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])
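As a sketch, running the same function on the arrays from the question produces the masks asked for; stacking makes each column one (a, b) pair, and return_index marks the first occurrence of each distinct column:

```python
import numpy as np

def is_unique(*lsts):
    # stack the inputs as rows; each column is one tuple to compare
    arr = np.vstack(lsts)
    # return_index gives the first-occurrence index of each distinct column
    _, ind = np.unique(arr, axis=1, return_index=True)
    out = np.zeros(shape=arr.shape[1], dtype=bool)
    out[ind] = True
    return out

a = [1, 2, 3, 3]
print(is_unique(a, [10, 20, 30, 31]))  # all four pairs are distinct
print(is_unique(a, [10, 20, 30, 30]))  # last pair repeats the third
```

Note that this marks only the first occurrence of a duplicated pair as True; if you instead want every duplicated pair marked False, a different approach (e.g. counting occurrences) would be needed.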
You may also find this thread helpful.
OK, this one is very hard to describe, so I will just put together an example to explain.
df = pd.DataFrame({'event_a': [False, True, False, False, False, True, False, False, False, True, False],
                   'event_b': [False, False, False, True, False, False, False, False, True, False, False],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
Here we have two event columns and a value column. The events always alternate: there will never be event a and event b at the same index, and never two of the same event in a row without the other event in between.
The specific operation I want to perform is abs(next_value / current_value - 1), computed from each event row to the next event row.
Given this, my output for this example should look like...
output = [na, 1, na, 0.5, na, 0.5, na, na, 0.111, na, na]
Row 2, for example, is abs(4 (value of the next event) / 2 (value of the current event) - 1) = 1.
Try doing:
cond = df.loc[:, ['event_a', 'event_b']].any(axis=1)
output = np.ones(cond.size) * np.nan
output[cond] = (df.loc[cond, 'value'].shift(-1) / df.loc[cond, 'value']).subtract(1).abs()
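Put together as a runnable sketch on the example data: the mask picks out only the event rows, shift(-1) within that subset pairs each event with the next one, and non-event rows stay NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'event_a': [False, True, False, False, False, True, False, False, False, True, False],
    'event_b': [False, False, False, True, False, False, False, False, True, False, False],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})

# rows where either event fired
cond = df.loc[:, ['event_a', 'event_b']].any(axis=1)

# start with all-NaN output, then fill in the event rows only
output = np.ones(cond.size) * np.nan
# shift(-1) is applied to the event-row subset, so "next" means the next event
output[cond] = (df.loc[cond, 'value'].shift(-1) / df.loc[cond, 'value']).subtract(1).abs()
print(output)
```

The event rows are indices 1, 3, 5, 8 and 9 with values 2, 4, 6, 9 and 10, giving abs ratios 1, 0.5, 0.5 and 0.111; the final event has no successor, so it stays NaN along with the non-event rows.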
No matter what the input value is, np.genfromtxt always returns False.
Using dtype='u1' I get 1 as expected, but with dtype='b1' (NumPy's bool) I get False.
I don't know whether this is a bug, but so far I've only been able to get dtype=bool to work (without an explicit converter) when the file contains the literal strings 'False' and 'True':
In [21]: bool_lines = ['False,False', 'False,True', 'True,False', 'True,True']
In [22]: genfromtxt(bool_lines, delimiter=',', dtype=bool)
Out[22]:
array([[False, False],
[False, True],
[ True, False],
[ True, True]], dtype=bool)
If your data is 0s and 1s, you can read it as integers and then convert to bool:
In [26]: bits = ['0,0', '0,1', '1,0', '1,1']
In [27]: genfromtxt(bits, delimiter=',', dtype=np.uint8).astype(bool)
Out[27]:
array([[False, False],
[False, True],
[ True, False],
[ True, True]], dtype=bool)
Or you can use a converter for each column:
In [28]: cnv = lambda s: bool(int(s))
In [29]: converters = {0: cnv, 1: cnv}
In [30]: genfromtxt(bits, delimiter=',', dtype=bool, converters=converters)
Out[30]:
array([[False, False],
[False, True],
[ True, False],
[ True, True]], dtype=bool)