How to find elements of a Series containing a list? - pandas

I can find Series cells matching tuples...
>>> s = pd.Series([(1,2,3),(4,5,6)], index=[1,2])
>>> print s[s==(1,2,3)]
1 (1, 2, 3)
dtype: object
How do I do the same for lists:
>>> s = pd.Series([[1,2,3],[4,5,6]], index=[1,2])
>>> print s[s==[1,2,3]]
ValueError: Arrays were different lengths: 2 vs 3

Easy Approach
s[s.apply(tuple) == (1, 2, 3)]
1 [1, 2, 3]
dtype: object
Less Easy
Assumes all sub-lists are the same length
import numpy as np

def contains_list(s, l):
    a = np.array(s.values.tolist())
    return (a == l).all(1)

s[contains_list(s, [1, 2, 3])]
1 [1, 2, 3]
dtype: object
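If the sub-lists can have different lengths, np.array(s.values.tolist()) no longer yields a rectangular 2-D array and the broadcast comparison above breaks down. A hedged fallback (my own sketch, not part of the original answer) is to compare each cell's list directly:
import pandas as pd

# ragged variant: the last list has only two elements
s2 = pd.Series([[1, 2, 3], [4, 5, 6], [7, 8]], index=[1, 2, 3])

# list == list compares length and elements, so this also works for ragged data
mask = s2.apply(lambda lst: lst == [1, 2, 3])
print(s2[mask])
# 1    [1, 2, 3]
# dtype: object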
Timing
Assume a larger series
s = pd.Series([[1,2,3],[4,5,6]] * 1000)
%timeit s[pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3]).all(1)]
100 loops, best of 3: 2.22 ms per loop
%timeit s[contains_list(s, [1, 2, 3])]
1000 loops, best of 3: 1.01 ms per loop
%timeit s[s.apply(tuple) == (1, 2, 3)]
1000 loops, best of 3: 1.07 ms per loop

Alternative solution:
In [352]: s[pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3]).all(1)]
Out[352]:
1 [1, 2, 3]
dtype: object
Step-by-step:
In [353]: pd.DataFrame(s.values.tolist(), index=s.index)
Out[353]:
   0  1  2
1  1  2  3
2  4  5  6
In [354]: pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3])
Out[354]:
       0      1      2
1   True   True   True
2  False  False  False
In [355]: pd.DataFrame(s.values.tolist(), index=s.index).isin([1,2,3]).all(1)
Out[355]:
1 True
2 False
dtype: bool

Related

How to create a multiIndex (hierarchical index) dataframe object from another df's column's unique values?

I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.
Is there an easier way to have this information summarized besides creating this dataframe?
Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...
Original Dataframe:
data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'],
                     'B': [2001, 2002, 2003, 2004, 2005],
                     'C': ['F', 'M', 'F', 'F', 'M'],
                     'D': [0, 0, 0, 1, 0],
                     'E': [1, 0, 1, 0, 1],
                     'F': [1, 1, 0, 0, 0]})
      A     B  C  D  E  F
0  case  2001  F  0  1  1
1  case  2002  M  0  0  1
2  case  2003  F  0  1  0
3  case  2004  F  1  0  0
4  case  2005  M  0  1  0
Desired outcome:
     unique  percent
A      case      100
B      2001       20
       2002       20
       2003       20
       2004       20
       2005       20
C         F       60
          M       40
D         0       80
          1       20
E         0       40
          1       60
F         0       60
          1       40
My failed for loop attempt:
def unique_values(df):
    values = {}
    columns = []
    df = pd.DataFrame(values, columns=columns)
    for col in data:
        df2 = data[col].value_counts(normalize=True)*100
        values = values.update(df2.to_dict)
        columns = columns.append(col*len(df2))
    return df
unique_values(data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
11
12
---> 13 unique_values(data)
<ipython-input-84-a341284fb859> in unique_values(df)
5 for col in data:
6 df2 = data[col].value_counts(normalize=True)*100
----> 7 values = values.update(df2.to_dict)
8 columns = columns.append(col*len(df2))
9 return df
TypeError: 'method' object is not iterable
Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.
This is a fairly straightforward application of .melt:
data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)
Output:
                  index
variable value
A        case       1.0
B        2001       0.2
         2002       0.2
         2003       0.2
         2004       0.2
         2005       0.2
C        F          0.6
         M          0.4
D        0          0.8
         1          0.2
E        0          0.4
         1          0.6
F        0          0.6
         1          0.4
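To land closer to the asker's desired layout (percentages and a 'percent' column), here is a small follow-up sketch building on the melt answer; the chained method calls below are my own choice, not part of the answer above:
import pandas as pd

data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'],
                     'B': [2001, 2002, 2003, 2004, 2005],
                     'C': ['F', 'M', 'F', 'F', 'M'],
                     'D': [0, 0, 0, 1, 0],
                     'E': [1, 0, 1, 0, 1],
                     'F': [1, 1, 0, 0, 0]})

# count each (column, value) pair and convert the counts to percentages of the row count
percent = (data.melt()
               .groupby(['variable', 'value'])
               .size()
               .div(len(data))
               .mul(100)
               .rename('percent')
               .to_frame())
print(percent)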
I'm sorry! I've written an answer, but it's in JavaScript. I came here thinking I had clicked on a JavaScript question and started coding, but on posting I saw that you're working in Python.
I will post it anyway; maybe it will help you. Python is not that different from JavaScript ;-)
const data = {
  A: ["case", "case", "case", "case", "case"],
  B: [2001, 2002, 2003, 2004, 2005],
  C: ["F", "M", "F", "F", "M"],
  D: [0, 0, 0, 1, 0],
  E: [1, 0, 1, 0, 1],
  F: [1, 1, 0, 0, 0]
};

const getUniqueStats = (_data) => {
  const results = [];
  for (let row in _data) {
    // create list of unique values
    const s = [...new Set(_data[row])];
    // filter for unique values and count them for percentage, then push
    results.push({
      index: row,
      values: s.map((x) => ({
        unique: x,
        percentage: (_data[row].filter((y) => y === x).length / data[row].length) * 100
      }))
    });
  }
  return results;
};

const results = getUniqueStats(data);
results.forEach((row) =>
  row.values.forEach((value) =>
    console.log(`${row.index}\t${value.unique}\t${value.percentage}%`)
  )
);

Group Pandas dataframe Age column by Age groups [duplicate]

I have a data frame column with numeric values:
df['percentage'].head()
46.5
44.2
100.0
42.12
I want to see the column as bin counts:
bins = [0, 1, 5, 10, 25, 50, 100]
How can I get the result as bins with their value counts?
[0, 1] bin amount
[1, 5] etc
[5, 10] etc
...
You can use pandas.cut:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Or numpy.searchsorted:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default, cut returns a Categorical.
Series methods like Series.value_counts() will use all categories, even categories that are not present in the data, because that is how operations on categoricals work.
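A quick way to see this (my own illustration, assuming the df and bins from the question): inspect the dtype, the categories, and the value counts of the cut result.
import pandas as pd

df = pd.DataFrame({'percentage': [46.5, 44.2, 100.0, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

binned = pd.cut(df['percentage'], bins=bins)
print(binned.dtype)           # category
print(binned.cat.categories)  # IntervalIndex listing all six intervals, including the empty ones
print(binned.value_counts())  # every category appears; empty bins count as 0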
Using the Numba module for speed up.
On big datasets (more than 500k), pd.cut can be quite slow for binning data.
I wrote my own function in Numba with just-in-time compilation, which is roughly six times faster:
import numpy as np
from numba import njit

@njit
def cut(arr):
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x < 1):
            bins[idx] = 1
        elif (x >= 1) & (x < 5):
            bins[idx] = 2
        elif (x >= 5) & (x < 10):
            bins[idx] = 3
        elif (x >= 10) & (x < 25):
            bins[idx] = 4
        elif (x >= 25) & (x < 50):
            bins[idx] = 5
        elif (x >= 50) & (x < 100):
            bins[idx] = 6
        else:
            bins[idx] = 7
    return bins
cut(df['percentage'].to_numpy())
# array([5., 5., 7., 5.])
Optional: you can also map it to bins as strings:
a = cut(df['percentage'].to_numpy())
conversion_dict = {1: 'bin1',
                   2: 'bin2',
                   3: 'bin3',
                   4: 'bin4',
                   5: 'bin5',
                   6: 'bin6',
                   7: 'bin7'}
bins = list(map(conversion_dict.get, a))
# ['bin5', 'bin5', 'bin7', 'bin5']
Speed comparison:
# Create a dataframe of 8 million rows for testing
dfbig = pd.concat([df]*2000000, ignore_index=True)
dfbig.shape
# (8000000, 1)
%%timeit
cut(dfbig['percentage'].to_numpy())
# 38 ms ± 616 µs per loop (mean ± standard deviation of 7 runs, 10 loops each)
%%timeit
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
pd.cut(dfbig['percentage'], bins=bins, labels=labels)
# 215 ms ± 9.76 ms per loop (mean ± standard deviation of 7 runs, 10 loops each)
We could also use np.select:
bins = [0, 1, 5, 10, 25, 50, 100]
df['groups'] = np.select([df['percentage'].between(i, j, inclusive='right')
                          for i, j in zip(bins, bins[1:])],
                         [1, 2, 3, 4, 5, 6])
Output:
percentage groups
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Convenient and fast version using Numpy
np.digitize is a convenient and fast option:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1,2,3,4,5]})
df['y'] = np.digitize(df['x'], bins=[3, 5])
print(df)
returns
   x  y
0  1  0
1  2  0
2  3  1
3  4  1
4  5  2
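Since the original question asked for counts per bin, a short follow-up (my own addition, reusing df and the digitized column y from the snippet above) groups on the codes:
print(df.groupby('y').size())
# y
# 0    2
# 1    2
# 2    1
# dtype: int64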

Pandas apply to index

I have a df that has some columns and a MultiIndex with bytes datatype. To clean the columns I can do
for c in df.columns:
    df[c] = df[c].apply(lambda x: x.decode('UTF-8'))
and for a single index this should work
df.index.map(lambda x: x.decode('UTF-8'))
But it appears to fail with multi-index. Is there anything similar I can do for the multi-index?
EDIT:
example
pd.DataFrame().from_dict({'val': {(b'A', b'a'): 1,
(b'A', b'b'): 2,
(b'B', b'a'): 3,
(b'B', b'b'): 4,
(b'B', b'c'): 5}})
and the desired output is
pd.DataFrame().from_dict({'val': {('A', 'a'): 1,
('A', 'b'): 2,
('B', 'a'): 3,
('B', 'b'): 4,
('B', 'c'): 5}})
Method 1:
df.index = pd.MultiIndex.from_tuples([(x[0].decode('utf-8'), x[1].decode('utf-8')) for x in df.index])
%timeit result: 1000 loops, best of 3: 573 µs per loop
Method 2:
df.reset_index().set_index('val').applymap(lambda x: x.decode('utf-8')).reset_index().set_index(['level_0', 'level_1'])
%timeit result: 100 loops, best of 3: 4.17 ms per loop
df.index = df.index.set_levels([level.map(lambda x: x.decode('UTF-8')) for level in df.index.levels])
OUTPUT:
val
A a 1
b 2
B a 3
b 4
c 5
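In recent pandas versions, Index.map on a MultiIndex passes each label as a tuple, so a tuple-returning function can decode all levels at once; this is my own sketch and the exact behaviour may depend on the pandas version:
import pandas as pd

df = pd.DataFrame.from_dict({'val': {(b'A', b'a'): 1,
                                     (b'A', b'b'): 2,
                                     (b'B', b'a'): 3,
                                     (b'B', b'b'): 4,
                                     (b'B', b'c'): 5}})

# decode every level of every label; returning tuples keeps the result a MultiIndex
df.index = df.index.map(lambda t: tuple(x.decode('UTF-8') for x in t))
print(df)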

Assign conditional values to columns in Dask

I am trying to do a conditional assignment to the rows of a specific column: target. I have done some research, and it seemed that the answer was given here: "How to do row processing and item assignment in dask".
I will reproduce my necessity. Mock data set:
x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))
The look of mock is:
In [4]: mock.head(7)
Out [4]:
speed target
0 200 3
1 300 0
2 400 3
3 215 4
4 219 0
5 360 0
6 280 0
Having this Pandas DataFrame, I convert it into a Dask DataFrame:
mock_dask = dd.from_pandas(mock, npartitions = 2)
I apply my conditional rule: all values in target above 0 must become 1, all others 0 (binarize target). Following the thread mentioned above, it should be:
result = mock_dask.target.where(mock_dask.target > 0, 1)
I have a look at the result dataset and it is not working as expected:
In [7]: result.head(7)
Out [7]:
0 3
1 1
2 3
3 4
4 1
5 1
6 1
Name: target, dtype: object
As we can see, the column target in mock and result are not the expected results. It seems that my code is converting all 0 original values to 1, instead of the values that are greater than 0 into 1 (the conditional rule).
Dask newbie here, Thanks in advance for your help.
OK, the documentation in the Dask DataFrame API is pretty clear. Thanks to @MRocklin's feedback I have realized my mistake. In the documentation, the where function (the last one in the list) is described with the following syntax:
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Thus, the correct code line would be:
result = mock_dask.target.where(mock_dask.target <= 0, 1)
This will output:
In [7]: result.head(7)
Out [7]:
0 1
1 0
2 1
3 1
4 0
5 0
6 0
Name: target, dtype: int64
Which is the expected output.
They seem to be the same to me
In [1]: import pandas as pd
In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
...: mock = pd.DataFrame(dict(target = x, speed = y))
...:
In [3]: import dask.dataframe as dd
In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)
In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
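For a plain binarization, a boolean comparison plus astype may read more directly than where; a sketch of my own (not from the answers above), using the same mock data:
import pandas as pd
import dask.dataframe as dd

x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock_dask = dd.from_pandas(pd.DataFrame(dict(target=x, speed=y)), npartitions=2)

# (target > 0) is a boolean series; astype(int) maps True -> 1 and False -> 0
result = (mock_dask.target > 0).astype(int)
print(result.head(7))
# 0    1
# 1    0
# 2    1
# 3    1
# 4    0
# 5    0
# 6    0
# Name: target, dtype: int64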

Get elements of a permutation

I have a 1D array whose elements are a permutation of 0:N, and I need to take the first K elements of this permutation.
For example, in the case the permutation is
0 [[9]
1 [0]
2 [1]
3 [2]
4 [3]
5 [4]
6 [5]
7 [6]
8 [7]
9 [8]]
The first 3 elements are 9, 8, 7.
The code is
n = start
r = zeros(nodeCount, dtype=int)
i = 0
while self.nodes[n][direction] != stop:
    r[i] = n
    n = self.nodes[n][direction]
    i += 1
I need a faster way to extract the elements from the permutation.
This works, but I don't think it is going to be especially fast:
>>> a
array([9, 0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> n = 3
>>> b = np.empty((n,), dtype=a.dtype)
>>> b[0] = a[0]
>>> for k in xrange(1, n):
... b[k] = a[b[k-1]]
...
>>> b
array([9, 8, 7])
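Since each step depends on the previous value, the chain is hard to vectorize; one option (my own sketch, echoing the Numba trick used for binning earlier on this page) is to compile the same loop with numba:
import numpy as np
from numba import njit

@njit
def first_k(a, k):
    # follow the chain: start at a[0], then repeatedly index with the previous value
    out = np.empty(k, dtype=np.int64)
    out[0] = a[0]
    for i in range(1, k):
        out[i] = a[out[i - 1]]
    return out

a = np.array([9, 0, 1, 2, 3, 4, 5, 6, 7, 8])
print(first_k(a, 3))   # [9 8 7]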
Is numpy.roll what you're after?
>>> a = np.arange(10)
>>> b = np.roll(a,1)
>>> b
array([9, 0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> np.roll(b[::-1],1)[:3]
array([9, 8, 7])
That last line of code is pretty cryptic, but b[::-1] reverses the array, np.roll shifts it, and the [:3] only takes the first three elements.