groupby on sparse matrix in pandas: filling them first - pandas

I have a pandas DataFrame df with shape (1000000,3) as follows:
id cat team
1 'cat1' A
1 'cat2' A
2 'cat3' B
3 'cat1' A
4 'cat3' B
4 'cat1' B
Then I dummify with respect to the cat column in order to get ready for a machine learning classification.
df2 = pandas.get_dummies(df,columns=['cat'], sparse=True)
But when I try to do:
df2.groupby(['id','team']).sum()
It gets stuck and the computation never ends. So instead of grouping right away, I try:
df2 = df2.fillna(0)
But it does not work and the DataFrame is still full of NaN values. Why doesn't the fillna() function fill my DataFrame as it should?
In other words, how can a pandas sparse matrix I got from get_dummies be filled with 0 instead of NaN?
I also tried:
df2 = pandas.get_dummies(df,columns=['cat'], sparse=True).to_sparse(fill_value=0)
This time, df2 is properly filled with 0, but when I try:
print df2.groupby(['id','team']).sum()
I get:
C:\Anaconda\lib\site-packages\pandas\core\groupby.pyc in loop(labels, shape)
3545 for i in range(1, nlev):
3546 stride //= shape[i]
-> 3547 out += labels[i] * stride
3548
3549 if xnull: # exclude nulls
ValueError: operands could not be broadcast together with shapes (1205800,) (306994,) (1205800,)
My solution was to do:
df2 = pandas.DataFrame(np.nan_to_num(df2.as_matrix()))
df2.groupby(['id','team']).sum()
And it works, but it takes a lot of memory. Can someone help me find a better solution, or at least help me understand why I can't easily fill a sparse matrix with zeros? And why is it impossible to use groupby() followed by sum() on a sparse matrix?

I think your problem is due to mixing of dtypes. But you could get around it like this. First, provide only the relevant column to get_dummies() rather than the whole dataframe:
df2 = pd.get_dummies(df['cat']).to_sparse(0)
After that, you can add other variables back, but everything needs to be numeric. A pandas sparse dataframe is just a wrapper around a sparse (and homogeneous-dtype) numpy array.
df2['id'] = df['id']
'cat1' 'cat2' 'cat3' id
0 1 0 0 1
1 0 1 0 1
2 0 0 1 2
3 1 0 0 3
4 0 0 1 4
5 1 0 0 4
For non-numeric types, you could do the following:
df2['team'] = df['team'].astype('category').cat.codes
This groupby seems to work OK:
df2.groupby('id').sum()
'cat1' 'cat2' 'cat3'
id
1 1 1 0
2 0 0 1
3 1 0 0
4 1 0 1
An additional but possibly important point for memory management is that you can often save substantial memory with categoricals rather than string objects (perhaps you are already doing this though):
df['cat2'] = df['cat'].astype('category')
df[['cat','cat2']].memory_usage()
cat 48
cat2 30
Not much savings here for the small example dataframe, but it could make a substantial difference in your actual dataframe.
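For reference, here is a minimal end-to-end sketch of the same approach on current pandas (where DataFrame.to_sparse no longer exists), keeping the dummies as dense uint8 columns; the sample data is the one from the question:
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 2, 3, 4, 4],
                   'cat':  ['cat1', 'cat2', 'cat3', 'cat1', 'cat3', 'cat1'],
                   'team': ['A', 'A', 'B', 'A', 'B', 'B']})

# dummify only the relevant column, not the whole frame
df2 = pd.get_dummies(df['cat']).astype('uint8')
df2['id'] = df['id']                                    # already numeric
df2['team'] = df['team'].astype('category').cat.codes   # numeric team codes

print(df2.groupby(['id', 'team']).sum())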

I was tackling a similar problem before. What I did was apply the groupby operation first and follow it up with get_dummies().
This worked for me because groupby is very slow once thousands of dummified columns have been created (in my case), especially on sparse dataframes; it basically gave up for me. Grouping first and then dummifying made it work.
# collect the unique cat values for each (id, team) group
df = pd.DataFrame(df.groupby(['id','team'])['cat'].unique())
df.columns = ['cat']
df.reset_index(inplace=True)
# join the cats into one string per row, dummify, and attach back to id/team
df = df[['id','team']].join(df['cat'].str.join('|').str.get_dummies().add_prefix('CAT_'))
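On the question's sample data (assuming the cat values are plain strings rather than the quoted ones shown above), this would yield roughly:
   id team  CAT_cat1  CAT_cat2  CAT_cat3
0   1    A         1         1         0
1   2    B         0         0         1
2   3    A         1         0         0
3   4    B         1         0         1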
Hope this helps out someone!

Related

Multiplying two data frames in pandas

I have two data frames, df1 and df2, as shown below. I want to create a third dataframe, df, also shown below. What would be the appropriate way?
df1={'id':['a','b','c'],
'val':[1,2,3]}
df1=pd.DataFrame(df1)
df1
id val
0 a 1
1 b 2
2 c 3
df2={'yr':['2010','2011','2012'],
'val':[4,5,6]}
df2=pd.DataFrame(df2)
df2
yr val
0 2010 4
1 2011 5
2 2012 6
df={'id':['a','b','c'],
'val':[1,2,3],
'2010':[4,8,12],
'2011':[5,10,15],
'2012':[6,12,18]}
df=pd.DataFrame(df)
df
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
I can basically convert df1 and df2 to 1-by-n matrices, get the n-by-n result, and assign it back to df1. But is there an easy pandas way?
TL;DR
We can do it in one line like this:
df1.join(df1.val.apply(lambda x: x * df2.set_index('yr').val))
or like this:
df1.join(df1.set_index('id') @ df2.set_index('yr').T, on='id')
Done.
The long story
Let's see what's going on here.
To multiply each value of df1.val by the values in df2.val we use apply:
df1['val'].apply(lambda x: x * df2.val)
The inner function receives the values of df1.val one by one and multiplies each by df2.val element-wise (see broadcasting for details if needed). Since df2.val is a pandas Series, the output is a data frame with index df1.val.index and columns df2.val.index. With df2.set_index('yr') we make the years the index before the multiplication, so they become the column names in the output.
DataFrame.join joins frames index-on-index by default. So, because df1 and the multiplication output have identical indexes, we can apply df1.join( <the output of multiplication> ) as is.
In the end we get the desired frame with index df1.index and columns id, val, *df2['yr'].
The second variant with the @ operator is essentially the same. The main difference is that we multiply 2-dimensional frames instead of Series: a vertical and a horizontal vector, respectively. The matrix multiplication therefore produces a frame with index df1.id and columns df2.yr, whose values are the pairwise products. In the end we join df1 with that output, matching df1's id column against the output's index.
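Putting both variants into a self-contained sketch, using the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'val': [1, 2, 3]})
df2 = pd.DataFrame({'yr': ['2010', '2011', '2012'], 'val': [4, 5, 6]})

# variant 1: broadcast each df1.val against df2.val; the years become column names
out1 = df1.join(df1['val'].apply(lambda x: x * df2.set_index('yr')['val']))

# variant 2: matrix-multiply the column vector by the row vector, then join on id
out2 = df1.join(df1.set_index('id') @ df2.set_index('yr').T, on='id')

print(out1)
print(out2)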
This works for me:
df2 = df2.T
new_df = pd.DataFrame(np.outer(df1['val'],df2.iloc[1:]))
df = pd.concat([df1, new_df], axis=1)
df.columns = ['id', 'val', '2010', '2011', '2012']
df
The output I get:
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
Your question is a bit vague, but I suppose you want to do something like this:
df = pd.concat([df1, df2], axis=1)

filter on pandas array

I'm writing this kind of code to find whether a value belongs to the array a inside a dataframe:
Solution 1
df = pd.DataFrame([{'a':[1,2,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
a b
0 1 4
Problem
This can get worse if there are repetitions:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
a b
0 1 4
0 1 4
Solution 2
Another solution could go like:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df[df['a'].map(lambda row: 1 in row)]
Problem
That lambda can't be fast if the DataFrame is big.
Question
As a first goal, I want all the rows where the value 1 belongs to a:
without using pure-Python loops, since they are slow
with high performance
avoiding memory issues
...
So I'm trying to understand what I can do with the arrays inside pandas. Is there some documentation on how to use this type efficiently?
IIUC, you are trying to do:
df[df['a'].eq(1).groupby(level=0).transform('any')]
Output:
a b
0 1 4
0 2 4
0 3 4
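A self-contained sketch of this, assuming df has already been exploded as in the question:
import pandas as pd

df = pd.DataFrame([{'a': [1, 2, 1, 3], 'b': 4}, {'a': [5, 6], 'b': 7}])
exploded = df.explode('a')

# keep the exploded rows whose original row (same index) contains a 1
print(exploded[exploded['a'].eq(1).groupby(level=0).transform('any')])

# or select the original, un-exploded rows instead
print(df[exploded['a'].eq(1).groupby(level=0).any()])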
Nothing is wrong. This is normal behavior of pandas.explode().
To check whether a value belongs to the values in a you may use this (note that in on a Series checks the index, so test against .values):
if x in df.a.explode().values
where x is what you test for.
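For example, with the frame from the question:
import pandas as pd

df = pd.DataFrame([{'a': [1, 2, 1, 3], 'b': 4}, {'a': [5, 6], 'b': 7}])
print(1 in df['a'].explode().values)   # True
print(9 in df['a'].explode().values)   # False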
I think you can convert the arrays to scalars with the DataFrame constructor and then test the value with DataFrame.eq and DataFrame.any:
df = df[pd.DataFrame(df['a'].tolist()).eq(1).any(axis=1)]
print (df)
a b
0 [1, 2, 1, 3] 4
Details:
print (pd.DataFrame(df['a'].tolist()))
0 1 2 3
0 1 2 1.0 3.0
1 5 6 NaN NaN
print (pd.DataFrame(df['a'].tolist()).eq(1))
0 1 2 3
0 True False True False
1 False False False False
So I'm trying to understand what I can do with the arrays inside pandas. Is there some documentation on how to use this type efficiently?
I think working with lists in pandas is not a good idea.

pandas syntax examples confusion

I am confused by some of the examples I see for pandas. For example this is shortened from a post I recently read:
df[df.duplicated()|df()]
What I don't understand is why df needs to be on the outside: df[df.duplicated()]
vs just using df.duplicated(). In the documentation I have not yet seen the first form; everything is presented in the format df.something_doing(). But I see many examples such as df[df.something_doing()], and I don't understand what the df on the outside does.
df.duplicated() returns boolean values. They provide a mask that is True where the condition is satisfied and False otherwise.
If you want a slice of the dataframe based on the boolean mask, you need:
df[df.duplicated()]
Another simple example, consider this dataframe
col1 id
0 1 a
1 0 a
2 1 a
3 1 b
If you only want the rows where 'id' is 'a',
df.id == 'a'
would give you boolean mask but
df[df.id == 'a']
would return the dataframe
col1 id
0 1 a
1 0 a
2 1 a
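The same pattern applies to duplicated() itself; rebuilding the frame above, it looks roughly like this:
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 1, 1], 'id': ['a', 'a', 'a', 'b']})

print(df.duplicated())      # row 2 repeats row 0, so only it is marked True
print(df[df.duplicated()])  # the boolean mask keeps only that row
#    col1 id
# 2     1  a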

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates and take the last observed row in a dataframe. Is there any way I can use it to take only the second-to-last observed row? Or any other way of doing it?
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]})
to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is to group the data by the duplicated column and then check the length of each group. If the length of the group is greater than or equal to 2, you can slice the second element of the group; if the group has length one, the value is not duplicated, so take index 0, the only element in the group.
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
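On the question's df this returns roughly:
   A  B
A
1  1  2
2  2  5
which matches the desired df1, apart from the group labels also ending up in the index.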
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
    if len(x) == 2:
        return x.iloc[0]
    if len(x) > 2:
        return x.iloc[-2]
    return
df.groupby('A').apply(user_apply_func)
Out[7]:
A B
A
1 1 2
2 2 5
3 3 7
4 NaN NaN
For reference, the apply method automatically passes each group's data frame as the first argument.
Also, since you are always reducing each group of data to a single observation, you could also use the agg method (aggregate). apply is more flexible in terms of the length of the sequences that can be returned, whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)

grouping by column and then doing a boxplot by the index in pandas

I have a large dataframe which I would like to group by some column and examine graphically the distribution per group using a boxplot. I found that df.boxplot() will do it for each column of the dataframe and put it in one plot, just as I need.
The problem is that after a groupby operation, my data is all in one column with the group labels in the index, so I can't call boxplot on the result.
Here is an example:
from numpy.random import rand
from pandas import DataFrame

df = DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})
df
a b
0 0.273548 0
1 0.378765 1
2 0.190848 0
3 0.646606 1
4 0.562591 0
5 0.409250 1
6 0.637074 0
7 0.946864 1
8 0.203656 0
9 0.276929 1
Now I want to group by column b and boxplot the distribution of both groups in one boxplot. How can I do that?
You can use the by argument of boxplot. Is that what you are looking for?
df.boxplot(column='a', by='b')
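For completeness, a minimal runnable version of this (matplotlib assumed for the display):
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import rand

df = pd.DataFrame({'a': rand(10), 'b': [x % 2 for x in range(10)]})

# one box of column 'a' per distinct value of 'b'
df.boxplot(column='a', by='b')
plt.show()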