This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
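For reference, here is the approach end to end on the sample input from the question (column names assumed to be the default RangeIndex):

```python
import numpy as np
import pandas as pd

# sample input from the question
data = pd.DataFrame([[0, 1, 0, 1, 1],
                     [1, 0, 1, 0, 1]])

# work on the underlying array of columns 1 onwards
a = data.values[:, 1:]

# replace non-zero entries with their column sums, keep zeros as-is
data.iloc[:, 1:] = np.where(a != 0, a.sum(axis=0), a)
print(data)
#    0  1  2  3  4
# 0  0  1  0  1  2
# 1  1  0  1  0  2
```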
The get_dummies method does not seem to work as expected when used with more than one column.
For example, if I have this dataframe...
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread", "Milk"],
["Rice", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
If I use the get_dummies method, the items are repeated across columns like this:
pd.get_dummies(df)
0_Apple 0_Rice 1_Bread 1_Milk 1_Rice 2_Bread 2_Fridge 2_Milk 3_Milk
0 1 0 1 0 0 0 1 0 0
1 0 1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0 0 1
3 0 1 0 1 0 0 0 0 0
4 1 0 1 0 0 0 0 1 0
While the expected result is:
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 1 1
3 0 0 0 1 1
4 1 1 0 1 0
Add the prefix and prefix_sep parameters to get_dummies, then aggregate with max to avoid duplicated column names (it aggregates by max):
df = pd.get_dummies(df, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 1 0
3 0 1 0 1 0
4 1 0 1 1 0
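Note that max(axis=1, level=0) was deprecated and later removed in pandas 2.x. A sketch of an equivalent that works on current pandas is to stack first, so get_dummies sees a single column, then collapse back to one row per list (using the shopping_list from the question):

```python
import pandas as pd

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread", "Milk"],
    ["Rice", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)

# stack() drops the NaN padding and yields one long Series of items;
# get_dummies then produces one column per distinct item, and
# groupby(level=0).max() collapses back to one row per shopping list
out = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()
print(out)
```

The column order comes out alphabetical here, matching the expected result in the question.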
I'm trying to find the intersection counts of (A, B, C) for all possible column-vector triples A, B, C in the following df.
dataframe
print(df)
0 1 2 3
0 1 0 0 0
1 0 0 1 1
2 1 0 1 1
3 0 1 0 0
4 1 0 1 0
df.T.dot(df)
gives pairwise intersection counts of column vectors
dot product
0 1 2 3
0 3 0 2 1
1 0 1 0 0
2 2 0 3 2
3 1 0 2 2
How do I get the intersection triples among the column vectors?
E.g. (col 0, col 2, col 3) has value 1, since only row 2 contributes 1*1*1 = 1.
I'm trying to make a three dimensional association matrix for item-item-item similarity. What is the best approach here?
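One way to extend the pairwise dot product to triples (a sketch, not from the original thread) is np.einsum, which sums M[i,a] * M[i,b] * M[i,c] over rows i for every column triple (a, b, c). Be aware the result is O(n^3) in the number of columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0, 0, 0],
                   [0, 0, 1, 1],
                   [1, 0, 1, 1],
                   [0, 1, 0, 0],
                   [1, 0, 1, 0]])
M = df.to_numpy()

# T[a, b, c] = number of rows where columns a, b and c are all 1
T = np.einsum('ia,ib,ic->abc', M, M, M)

print(T[0, 2, 3])  # 1 -> only row 2 has 1s in columns 0, 2 and 3
```

The slice T[a, a, :] reproduces the pairwise dot-product matrix from the question, since the entries are binary.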
I've generated a list of combinations and would like to turn it into a "dummies" matrix.
import pandas as pd
from itertools import combinations
comb = pd.DataFrame(list(combinations(range(1, 6), 4)))
0 1 2 3
0 1 2 3 4
1 1 2 3 5
2 1 2 4 5
3 1 3 4 5
4 2 3 4 5
I would like to turn the above dataframe into a dataframe that looks like the one below. Thanks.
1 2 3 4 5
0 1 1 1 1 0
1 1 1 1 0 1
2 1 1 0 1 1
3 1 0 1 1 1
4 0 1 1 1 1
You can use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()
df = pd.DataFrame(lb.fit_transform(comb.values), columns= lb.classes_)
print (df)
1 2 3 4 5
0 1 1 1 1 0
1 1 1 1 0 1
2 1 1 0 1 1
3 1 0 1 1 1
4 0 1 1 1 1
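If you prefer to stay within pandas, a crosstab of the row labels against the stacked values gives the same table (a sketch using the comb frame from the question):

```python
import pandas as pd
from itertools import combinations

comb = pd.DataFrame(list(combinations(range(1, 6), 4)))

# stack() turns the frame into one long Series of (row, position) -> value;
# crosstab then counts each value once per original row
s = comb.stack()
out = pd.crosstab(s.index.get_level_values(0), s)
print(out)
```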
I am new to Python. I have created dummy columns on a categorical column using pandas get_dummies. How do I create dummy columns on an ordinal column (say a column Rating with values 1, 2, 3, ..., 10)?
Consider the dataframe df
df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))
df
Cats Ords
0 a 3
1 b 2
2 c 1
3 d 0
4 c 1
5 b 2
6 a 3
pd.get_dummies
works the same on either column
with df.Cats
pd.get_dummies(df.Cats)
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 1 0
5 0 1 0 0
6 1 0 0 0
with df.Ords
pd.get_dummies(df.Ords)
0 1 2 3
0 0 0 0 1
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
with both
pd.get_dummies(df)
Ords Cats_a Cats_b Cats_c Cats_d
0 3 1 0 0 0
1 2 0 1 0 0
2 1 0 0 1 0
3 0 0 0 0 1
4 1 0 0 1 0
5 2 0 1 0 0
6 3 1 0 0 0
Notice that it split out Cats but not Ords.
Let's expand on this by adding another Cats2 column and calling pd.get_dummies
pd.get_dummies(df.assign(Cats2=df.Cats))
Ords Cats_a Cats_b Cats_c Cats_d Cats2_a Cats2_b Cats2_c Cats2_d
0 3 1 0 0 0 1 0 0 0
1 2 0 1 0 0 0 1 0 0
2 1 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 1
4 1 0 0 1 0 0 0 1 0
5 2 0 1 0 0 0 1 0 0
6 3 1 0 0 0 1 0 0 0
Interesting, it splits both object columns but not the numeric one.
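If you do want the numeric (ordinal) column split out as well, get_dummies accepts a columns parameter listing exactly which columns to encode, regardless of dtype (a sketch using the df from above):

```python
import pandas as pd

df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))

# explicitly ask for dummies on both columns, including the numeric one
out = pd.get_dummies(df, columns=['Cats', 'Ords'], dtype=int)
print(out.columns.tolist())
# ['Cats_a', 'Cats_b', 'Cats_c', 'Cats_d', 'Ords_0', 'Ords_1', 'Ords_2', 'Ords_3']
```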
I have a DataFrame with 75 columns.
How can I select rows based on a condition over a specific subset of columns? If I want to do this on all columns I can just use
df[(df.values > 1.5).any(1)]
But let's say I just want to do this on columns 3:45.
Use ix to slice the columns by ordinal position (note that ix has since been deprecated and removed from modern pandas; iloc is the replacement):
In [31]:
df = pd.DataFrame(np.random.randn(5,10), columns=list('abcdefghij'))
df
Out[31]:
a b c d e f g \
0 -0.362353 0.302614 -1.007816 -0.360570 0.317197 1.131796 0.351454
1 1.008945 0.831101 -0.438534 -0.653173 0.234772 -1.179667 0.172774
2 0.900610 0.409017 -0.257744 0.167611 1.041648 -0.054558 -0.056346
3 0.335052 0.195865 0.085661 0.090096 2.098490 0.074971 0.083902
4 -0.023429 -1.046709 0.607154 2.219594 0.381031 -2.047858 -0.725303
h i j
0 0.533436 -0.374395 0.633296
1 2.018426 -0.406507 -0.834638
2 -0.079477 0.506729 1.372538
3 -0.791867 0.220786 -1.275269
4 -0.584407 0.008437 -0.046714
So to slice the 4th to 5th columns inclusive:
In [32]:
df.ix[:, 3:5]
Out[32]:
d e
0 -0.360570 0.317197
1 -0.653173 0.234772
2 0.167611 1.041648
3 0.090096 2.098490
4 2.219594 0.381031
So in your case
df[((df.ix[:, 2:45]).values > 1.5).any(1)]
should work
Indexing is 0-based, and the start of the range is included but the end is not, so here the 3rd column is included and we slice up to the 46th column, which is excluded from the slice.
Another solution with iloc; values can be omitted:
#if need from 3rd to 45th columns
print (df[((df.iloc[:, 2:45]) > 1.5).any(1)])
Sample:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5,10)), columns=list('abcdefghij'))
print (df)
a b c d e f g h i j
0 1 0 0 1 1 0 0 1 0 1
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
3 2 1 1 1 1 2 1 1 0 0
4 1 0 0 1 2 1 0 2 2 1
print (df[((df.iloc[:, 2:5]) > 1.5).any(1)])
a b c d e f g h i j
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
4 1 0 0 1 2 1 0 2 2 1
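The same selection can also be written with loc and column labels, which is often more readable than remembering ordinal positions; note that label slices are inclusive on both ends, unlike positional slices (a sketch using the sample frame from above):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5, 10)), columns=list('abcdefghij'))

# 'c':'e' is a label slice: inclusive of both 'c' and 'e' (positions 2:5)
by_label = df[(df.loc[:, 'c':'e'] > 1.5).any(axis=1)]
by_position = df[(df.iloc[:, 2:5] > 1.5).any(axis=1)]

print(by_label.equals(by_position))  # True
```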