Crosstab using multi-element calculation - pandas

I would like to create a crosstab from a dataframe df, comparing each record of df to every other record, i.e. pairwise, and calculating one number from several elements of the rows of df. As an example, let's take the following dataframe and calculate the (squared) distance between the points:
import pandas as pd
df = pd.DataFrame({"Point": ["A", "B", "C"], "x": [10, 20, 30], "y": [1, 2, 3]})
df["XX"] = 1
result = (
    df.merge(df, on="XX")
    .assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
    .loc[:, ["Point_x", "Point_y", "distance"]]
    .pivot(index="Point_x", columns="Point_y")
)
yielding the desired result:
        distance
Point_y        A    B    C
Point_x
A              0  101  404
B            101    0  101
C            404  101    0
Is there a better way to do this without resorting to adding a dummy field XX and merging on it? I tried multiple variations of
df = df.drop("XX", axis=1)
result = pd.crosstab(index=df["Point"], columns=df["Point"])
with values= and aggfunc= parameters, but to no avail. Possibly there is also an easier way using numpy?

"cross" merge
Assuming pandas 1.2.0+, you can avoid the dummy XX column by merging with how="cross":
cross: creates the cartesian product from both frames, preserves the order of the left keys (new in version 1.2.0)
(df.merge(df, how="cross")
 .assign(distance=lambda d: (d["x_x"] - d["x_y"]) ** 2 + (d["y_x"] - d["y_y"]) ** 2)
 .loc[:, ["Point_x", "Point_y", "distance"]]
 .pivot(index="Point_x", columns="Point_y"))
# distance
# Point_y A B C
# Point_x
# A 0 101 404
# B 101 0 101
# C 404 101 0
numpy broadcasting
You can do the pairwise calculations in numpy by using singleton dimensions (None or np.newaxis):
x = (df.x.values[:, None] - df.x.values) ** 2
y = (df.y.values[:, None] - df.y.values) ** 2
pd.DataFrame(x + y, index=df.Point, columns=df.Point)
# Point A B C
# Point
# A 0 101 404
# B 101 0 101
# C 404 101 0
scipy squareform
If you compute a vector of pairwise values (e.g., result of pdist), you can use squareform to crosstab the vector:
from scipy.spatial.distance import squareform, pdist
pd.DataFrame(squareform(pdist(df[["x", "y"]]) ** 2), columns=df.Point, index=df.Point)
# Point A B C
# Point
# A 0.0 101.0 404.0
# B 101.0 0.0 101.0
# C 404.0 101.0 0.0
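If you would rather skip the squareform step, scipy's cdist can build the square matrix directly; a minimal sketch, assuming the same df and the 'sqeuclidean' metric:
from scipy.spatial.distance import cdist
pts = df[["x", "y"]].to_numpy()
# cdist returns the full n x n matrix, so no squareform is needed
pd.DataFrame(cdist(pts, pts, metric="sqeuclidean"), index=df.Point, columns=df.Point)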

As another option, using euclidean_distances from sklearn:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(df[['x', 'y']], df[['x', 'y']], squared=True)
Output:
array([[ 0., 101., 404.],
[101., 0., 101.],
[404., 101., 0.]])


Pandas aggregate to a list of dicts [duplicate]

I have a pandas data frame df like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
I want to group by the first column and get the second column as lists in rows:
A [1,2]
B [5,5,4]
C [6]
Is it possible to do something like this using pandas groupby?
You can do this using groupby to group on the column of interest and then apply list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
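For instance, named aggregation (pandas 0.25+) lets you combine the list with other custom aggregations in a single call; a minimal sketch on the same df:
# each keyword is output_column=(input_column, aggregation)
df.groupby('a').agg(b_list=('b', list), b_sum=('b', 'sum'))
#       b_list  b_sum
# a
# A     [1, 2]      3
# B  [5, 5, 4]     14
# C        [6]      6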
If performance is important, go down to the numpy level:
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6] * 100})
def f(df):
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2
Tests:
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6], 'c': [3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired by Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use:
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
As you were saying, the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(zip(L,N),columns = list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
which gives an index-wise description of the groups.
To get elements of single groups, you can do, for instance
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
It is time to use agg instead of apply.
When
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
If you want multiple columns stacked into lists, the result is a pd.DataFrame:
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want a single column as a list, the result is a pd.Series:
df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)
Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column, so reserve the DataFrame form for the multi-column case.
Just a supplement: pandas.pivot_table is more universal and often more convenient:
"""data"""
df = pd.DataFrame({'a': ['A','A','B','B','B','C'],
                   'b': [1,2,5,5,4,6],
                   'c': [1,2,1,1,1,6]})
print(df)
a b c
0 A 1 1
1 A 2 2
2 B 5 1
3 B 5 1
4 B 4 1
5 C 6 6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
b c
a
A [1, 2] {1, 2}
B [5, 5, 4] {1}
C [6] {6}
If you are looking for a unique list while grouping multiple columns, this could help:
df.groupby('a').agg(lambda x: list(set(x))).reset_index()
Building upon B.M's answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1).
This solution can also deal with multi-indices. However, it is not heavily tested, so use with caution.
If performance is important, go down to the numpy level:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})
def f_multi(df, col_names):
    if not isinstance(col_names, list):
        col_names = [col_names]

    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs, :]
    vals = values[other_col_idcs, :]

    # list of tuples of key pairs
    multikeys = list(zip(*keys))

    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)

    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #   rows = number of non-grouped data columns
    #   cols = number of data points grouped into that unique key pair

    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1]  # first entries are the subarrays from above
        col_name = tup[-1]   # last entry is the data-column name
        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
Tests:
In [227]: %timeit f_multi(df, ['a','d'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df.groupby(['a','d']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results: for random seed 0 one gets the expected grouped output (full table omitted).
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple (named aggregation) syntax for the aggregate function.
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
Let us use df.groupby with list and the Series constructor:
pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]:
A [1, 2]
B [5, 5, 4]
C [6]
dtype: object
Here I have grouped elements with "|" as a separator
import pandas as pd
df = pd.read_csv('input.csv')
df
Out[1]:
Area Keywords
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
df.dropna(inplace = True)
df['Area']=df['Area'].apply(lambda x:x.lower().strip())
print(df.columns)
df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})
df_op.to_csv('output.csv')
Out[2]:
df_op
Area Keywords
A [1| 2]
B [5| 5| 4]
C [6]
Answer based on @EdChum's comment on his answer. The comment is this -
groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think
Let's first create a dataframe with 500k categories in the first column and a total df shape of 20 million rows, as mentioned in the question.
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
df['a'] = np.random.randint(low=0, high=500000, size=(20000000,)).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))
# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
# Now create final list_b column, using min and max indexes for each category of a and filtering list of b.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)
print(gp_df.shape)
gp_df.head()
The above code takes 2 minutes for 20 million rows and 500k categories in the first column.
Sorting consumes O(n log(n)) time, which is the most time-consuming operation in the solutions suggested above.
For a simple solution (a single column), pd.Series.to_list works and can be considered more efficient, unless you are considering other frameworks.
e.g.
import pandas as pd
from string import ascii_lowercase
import random
def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])

df = pd.DataFrame({'num_val': [random.randint(0, 100) for _ in range(20000000)],
                   'string_val': [generate_string() for _ in range(20000000)]})
%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})
For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 s, and a lambda function, which takes about 20.6 s.
To add to the previous answers: in my case, I wanted the list together with other functions like min and max. The way to do that is:
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6]
})
df = df.groupby('a').agg({
    'b': ['min', 'max', lambda x: list(x)]
})
# then flatten and rename if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'}, inplace=True)
It's a bit old, but I was directed here. Is there any way to group by multiple different columns?
"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99
to this:
"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]

Filtering the Pandas data frame according to a condition of the timestamp [duplicate]

I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).
Here's what I've tried:
df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)
But these either throw an exception or return a different table than I'm expecting. I'd just like to filter, not compute a new table.
You can use len:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
A B
0 1 2
1 1 4
The number of rows is in the attribute .shape[0]:
df.groupby('A').filter(lambda x: x.shape[0] >= min_size)
NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).
groupby.filter can be very slow for larger datasets / a large number of groups. A faster approach is to use groupby.transform (sketched after the example below):
Here's an example, first create the dataset:
import pandas as pd
import numpy as np
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
y A
0 1.375980 1
1 -0.023861 1
2 -0.474707 1
3 -0.151859 2
4 -1.696823 2
... ... ...
2424 0.276737 998
2425 -0.142171 999
2426 -0.718891 999
2427 -0.621315 999
2428 1.335450 999
[2429 rows x 2 columns]
Time it:
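A minimal sketch of the transform-based filter, assuming min_size is the threshold from the question: groupby.transform('size') broadcasts each group's size back to its rows, so a boolean mask keeps the large-enough groups.
min_size = 3  # illustrative threshold
fast = df[df.groupby('A')['y'].transform('size') >= min_size]
slow = df.groupby('A').filter(lambda x: len(x) >= min_size)  # same rows, but much slower with many groups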

How to transform columns with method chaining?

What's the most fluent (or easy to read) method chaining solution for transforming columns in Pandas?
(“method chaining” or “fluent” is the coding style made popular by Tom Augspurger among others.)
For the sake of the example, let's set up some example data:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("iris").astype(str) # Just for this example
df.loc[1, :] = "NA"
df.head()
#
# sepal_length sepal_width petal_length petal_width species
# 0 5.1 3.5 1.4 0.2 setosa
# 1 NA NA NA NA NA
# 2 4.7 3.2 1.3 0.2 setosa
# 3 4.6 3.1 1.5 0.2 setosa
# 4 5.0 3.6 1.4 0.2 setosa
Just for this example: I want to map certain columns through a function - sepal_length using pd.to_numeric - while keeping the other columns as they were. What's the easiest way to do that in a method chaining style?
I can already use assign, but I'm repeating the column name here, which I don't want.
new_result = (
    df.assign(sepal_length=lambda df_: pd.to_numeric(df_.sepal_length, errors="coerce"))
    .head()  # Further chaining methods, whatever they may be
)
I can use transform, but transform drops(!) the unmentioned columns. Transform with passthrough for the other columns would be ideal:
# Columns not mentioned in transform are lost
new_result = (
    df.transform({'sepal_length': lambda series: pd.to_numeric(series, errors="coerce")})
    .head()  # Further chaining methods...
)
Is there a “best” way to apply transformations to certain columns, in a fluent style, and pass the other columns along?
Edit: Below this line, a suggestion after reading Laurent's ideas.
Add a helper function that allows applying a mapping to just one column:
import functools

coerce_numeric = functools.partial(pd.to_numeric, errors='coerce')

def on_column(column, mapping):
    """
    Adaptor that takes a column transformation and returns a "whole dataframe" function suitable for .pipe().

    Notice that the column takes the name of the returned series, if applicable.
    Columns mapped to None are removed from the result.
    """
    def on_column_(df):
        df = df.copy(deep=False)
        res = mapping(df[column])
        # drop column if mapped to None
        if res is None:
            df.pop(column)
            return df
        df[column] = res
        # update column name if the mapper changes its name
        if hasattr(res, 'name') and res.name != column:
            df = df.rename(columns={column: res.name})
        return df
    return on_column_
This now allows the following neat chaining in the previous example:
new_result = (
    df.pipe(on_column('sepal_length', coerce_numeric))
    .head()  # Further chaining methods...
)
However, I'm still open to ways to do this in native pandas without the glue code.
Edit 2: further adapting Laurent's ideas, as an alternative. Self-contained example:
import pandas as pd
df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)
def map_columns(mapping=None, /, **kwargs):
    """
    Transform the specified columns and let the rest pass through.

    Examples:
        df.pipe(map_columns(a=lambda x: x + 1, b=str.upper))

        # dict for non-string column names
        df.pipe(map_columns({(0, 0): np.sqrt, (0, 1): np.log10}))
    """
    if mapping is not None and kwargs:
        raise ValueError("Only one of a dict and kwargs can be used at the same time")
    mapping = mapping or kwargs

    def map_columns_(df: pd.DataFrame) -> pd.DataFrame:
        mapping_funcs = {**{k: lambda x: x for k in df.columns}, **mapping}
        # preserve original order of columns
        return df.transform({key: mapping_funcs[key] for key in df.columns})
    return map_columns_
df2 = (
    df
    .pipe(map_columns(col2=pd.to_numeric))
    .sort_values(by="col1")
    .pipe(map_columns(col1=lambda x: x.astype(str) + "0"))
    .pipe(map_columns({'col2': lambda x: -x, 'col3': str.upper}))
    .reset_index(drop=True)
)
df2
# col1 col2 col3
# 0 10 -7 Z
# 1 20 -5 Y
# 2 30 -6 X
# 3 40 -9 W
Here is my take on your interesting question.
I don't know of a more idiomatic way in Pandas to do method chaining than combining pipe, assign, or transform. But I understand that "transform with passthrough for the other columns would be ideal".
So, I suggest using it with a higher-order function to deal with the other columns, doing even more functional-style coding by taking advantage of the Python standard library's functools module.
For example, with the following toy dataframe:
df = pd.DataFrame(
    {"col1": ["4", "1", "3", "2"], "col2": [9, 7, 6, 5], "col3": ["w", "z", "x", "y"]}
)
You can define the following partial object:
from functools import partial
from typing import Any, Callable

import pandas as pd

def helper(df: pd.DataFrame, col: str, method: Callable[..., Any]) -> dict:
    funcs = {col: method} | {k: lambda x: x for k in df.columns if k != col}
    # preserve original order of columns
    return {key: funcs[key] for key in df.columns}

on = partial(helper, df)
And then do all sorts of chain assignments using transform, for instance:
df = (
    df
    .transform(on("col1", pd.to_numeric))
    .sort_values(by="col1")
    .transform(on("col2", lambda x: x.astype(str) + "0"))
    .transform(on("col3", str.upper))
    .reset_index(drop=True)
)
print(df)
# Output
col1 col2 col3
0 1 70 Z
1 2 50 Y
2 3 60 X
3 4 90 W
If I understand the question correctly, perhaps using ** within assign will be helpful. For example, if you just want to convert the numeric data types using pd.to_numeric, the following should work.
import numpy as np
df.assign(**df.select_dtypes(include=np.number).apply(pd.to_numeric, errors='coerce'))
By unpacking the df, you are essentially giving assign what it needs to assign each column. This would be equivalent to writing sepal_length = pd.to_numeric(df['sepal_length'],errors='coerce'), sepal_width = ... for each column.
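The same unpacking idea can also target just the columns you name, via a dict comprehension; a minimal sketch (the column list is only illustrative):
cols = ['sepal_length', 'sepal_width']
df.assign(**{c: pd.to_numeric(df[c], errors='coerce') for c in cols})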

Rolling Second highest in a pandas dataframe

I am trying to find the top and second-highest value in a rolling window.
I can get the highest using
df['B'] = df['A'].rolling(window=3).max()
But how do I get the second highest, please?
Such that df['C'] will display as per below:
A B C
1
6
5 6 5
4 6 5
12 12 5
Generic n-highest values in rolling/sliding windows
Here's one using np.lib.stride_tricks.as_strided to create sliding windows, which lets us pick any generic N-highest value in sliding windows -
# https://stackoverflow.com/a/40085052/ #Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

# Return N highest nums in rolling windows of length W off array ar
def N_highest(ar, W, N=1):
    # ar : Input array
    # W  : Window length
    # N  : Get us the N-highest in sliding windows
    A2D = strided_app(ar, W, 1)
    idx = (np.argpartition(A2D, -N, axis=1) == A2D.shape[1] - N).argmax(1)
    return A2D[np.arange(len(idx)), idx]
Sample runs -
In [634]: a = np.array([1,6,5,4,12]) # input array
In [635]: N_highest(a, W=3, N=1) # highest in W=3
Out[635]: array([ 6, 6, 12])
In [636]: N_highest(a, W=3, N=2) # second highest
Out[636]: array([5, 5, 5])
In [637]: N_highest(a, W=3, N=3) # third highest
Out[637]: array([1, 4, 4])
Another, shorter way based on strides would be with direct sorting, like so -
np.sort(strided_app(ar, W, 1), axis=1)[:, -N]
Solving our case
Hence, to solve our case, we need to prepend NaNs to the result of the above-mentioned function, like so -
W = 3
df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
Based on direct sorting, we would have -
df['C'] = np.r_[ [np.nan]*(W-1), np.sort(strided_app(df.A,W,1), axis=1)[:,-2]]
Sample run -
In [578]: df
Out[578]:
A
0 1
1 6
2 5
3 4
4 3 # <== Different from given sample, for variety
In [619]: W = 3
In [620]: df['C'] = np.r_[ [np.nan]*(W-1), N_highest(df.A.values, W=W, N=2)]
In [621]: df
Out[621]:
A C
0 1 NaN
1 6 NaN
2 5 5.0
3 4 5.0
4 3 4.0 # <== Second highest from the last group off : [5,4,3]
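For completeness, a plain rolling.apply can pick the second highest directly; a minimal sketch (simpler, though slower than the strided approaches above):
import numpy as np
# sort each length-3 window and take the second-largest entry
df['C'] = df['A'].rolling(window=3).apply(lambda w: np.sort(w)[-2], raw=True)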

pandas faster series of lists unrolling for one-hot encoding?

I'm reading from a database that has many array-type columns; pd.read_sql gives me a dataframe with columns that are dtype=object, containing lists.
I'd like an efficient way to find which rows have arrays containing some element:
s = pd.Series(
    [[1, 2, 3], [1, 2], [99], None, [88, 2]]
)
print(s)
..
0 [1, 2, 3]
1 [1, 2]
2 [99]
3 None
4 [88, 2]
I'm building 1-hot-encoded feature tables for an ML application and I'd like to end up with tables like:
contains_1 contains_2 contains_3 contains_88
0 1 ...
1 1
2 0
3 nan
4 0
...
I can unroll a series of arrays like so:
s2 = s.apply(pd.Series).stack()
0 0 1.0
1 2.0
2 3.0
1 0 1.0
1 2.0
2 0 99.0
4 0 88.0
1 2.0
which gets me to the point of being able to find the elements meeting some test:
>>> print(s2[(s2 == 2)].index.get_level_values(0))
Int64Index([0, 1, 4], dtype='int64')
Woot! This step:
s.apply(pd.Series).stack()
produces a great intermediate data-structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (many 10's of seconds for a single column with 500k rows with lists of 10's of items), and I have many columns.
Update: It seems likely that having the data in a series of lists to begin with is quite slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull array data into a better structure?
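On newer pandas (0.25+), Series.explode combined with get_dummies is one way to build the indicator table without the slow apply(pd.Series).stack() step; a minimal sketch (note the None row comes out as all zeros rather than NaN):
one_hot = (pd.get_dummies(s.explode(), prefix='contains')   # one row per list element
             .groupby(level=0).max())                        # collapse back to one row per record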
import numpy as np
import pandas as pd
import cytoolz

s0 = s.dropna()
v = s0.values.tolist()                 # list of lists
i = s0.index.values                    # original row labels
l = [len(x) for x in v]                # length of each list
c = list(cytoolz.concat(v))            # flattened values (materialised so len() works below)
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)  # start offset of each row, repeated
k = np.arange(len(c)) - n              # position within each original list
s1 = pd.Series(c, [i.repeat(l), k])    # MultiIndex of (row label, position)
UPDATE: What worked for me...
def unroll(s):
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v))  # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v])  # s.apply(len) is slower
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    k = np.arange(sum(lens)) - n
    s = pd.Series(c)
    s.index = [i.repeat(lens), k]
    s = s.dropna()
    return s
It should be possible to replace:
s = pd.Series(c)
s.index = [i.repeat(lens), k]
with:
s = pd.Series(c, index=[i.repeat(lens), k])
But this doesn't work (even though the docs suggest it should be OK).
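A possible explanation (an assumption, not confirmed in the thread): c is already a Series with its own default integer index, and passing index= to the pd.Series constructor aligns c to the new labels instead of relabelling it, which yields NaNs. Building the MultiIndex from the raw values sidesteps the alignment; a sketch:
idx = pd.MultiIndex.from_arrays([i.repeat(lens), k])
s = pd.Series(c.values, index=idx)  # raw ndarray, so no index alignment happens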