How to plot a bar chart from a dataframe with only dummy variables? - pandas

Given the following Pandas dataframe:
|---+---+---+---|
| A | B | C | D |
|---+---+---+---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 |
|---+---+---+---|
How would you make a bar chart of these counts, with 0 being failure and 1 success?

With pandas and matplotlib, use melt + crosstab (df being the dataframe shown above):
import pandas as pd
import matplotlib.pyplot as plt

# reshape to long form, then count how many 0s and 1s each column has
dfm = df.melt()
plot_df = (
    pd.crosstab(dfm['variable'], dfm['value'])
      .rename(columns={0: 'failure', 1: 'success'})
)
plot_df.plot.bar()
plt.tight_layout()
plt.show()
plot_df:
value     failure  success
variable
A               4        6
B               8        2
C               9        1
D               9        1
Or, with seaborn, use sns.countplot after melt:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame({
    'A': [0, 1, 1, 1, 1, 1, 0, 0, 0, 1], 'B': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    'C': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], 'D': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
})
dfm = df.melt()
ax = sns.countplot(data=dfm, x='variable', hue='value')
ax.legend(labels=['failure', 'success'])
plt.tight_layout()
plt.show()

Related

Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
   A     B
   u  v  x  y
0  1  1  0  1
1  1  0  0  0
2  0  0  1  1
3  1  0  1  0
I want to select the rows in which any of the A columns or any of the B columns are true.
I can do so explicitly.
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
   A     B
   u  v  x  y
0  1  1  0  1
1  1  0  0  0
3  1  0  1  0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
   A     B
   u  v  x  y
0  1  1  0  1
2  0  0  1  1
3  1  0  1  0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
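A minimal sketch of one way to write it, assuming the sub-columns are 0/1 values as above; the any(axis=1) reduction mirrors the explicit u | v expressions, and the helper body is a suggestion rather than something from the original post:
def select(f, top_level_name):
    # all sub-columns under the given top-level name, cast to bool,
    # then keep rows where at least one of them is true
    mask = f[top_level_name].astype(bool).any(axis=1)
    return f[mask]

select(f, "A")  # same rows as f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]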

Filtering a view using three bit columns

The view can be filtered on these three columns:
Profit(bit), Loss(bit), NoImpact(bit)
Backstory: On a webpage a user can choose to filter the data based on three checkboxes (Profit, Loss, No impact).
What I am looking for: If they check 'Profit' return everything where 'Profit' = 1, if then they check 'Loss' show 'Profit' AND 'Loss' results but exclude 'NoImpact', and so forth.
This is what I've tried so far; it is part of my stored procedure:
WHERE (
    ((@ProfitSelected IS NULL OR @ProfitSelected = 'false') OR (Profit = @ProfitSelected))
    -- I've tried using AND here as well.
    OR ((@LossSelected IS NULL OR @LossSelected = 'false') OR (Loss = @LossSelected))
    OR ((@NoImpactSelected IS NULL OR @NoImpactSelected = 'false') OR (NoImpact = @NoImpactSelected))
)
END
exec dbo.SearchErrorReports @ProfitSelected = 1, @LossSelected = 1, @NoImpactSelected = 0
Thank you.
EDIT: As requested, here are some tests and desired results:
TEST: exec dbo.SearchErrorReports @ProfitSelected = 1, @LossSelected = 1, @NoImpactSelected = 0
Result:
id | Profit | Loss | NoImpact
-----------------------------
 1 |      1 |    0 |        0
 2 |      1 |    0 |        0
 3 |      0 |    1 |        0
 4 |      0 |    1 |        0
 5 |      0 |    1 |        0
TEST: exec dbo.SearchErrorReports @ProfitSelected = 0, @LossSelected = 1, @NoImpactSelected = 0
Result:
id | Profit | Loss | NoImpact
-----------------------------
 1 |      0 |    1 |        0
 2 |      0 |    1 |        0
 3 |      0 |    1 |        0
TEST: exec dbo.SearchErrorReports @ProfitSelected = 1, @LossSelected = 1, @NoImpactSelected = 1
Result:
id | Profit | Loss | NoImpact
-----------------------------
 1 |      1 |    0 |        0
 2 |      0 |    1 |        0
 3 |      0 |    1 |        0
 4 |      0 |    0 |        1
 5 |      1 |    0 |        0
 6 |      0 |    0 |        1
And so on, for all the different permutations.
If I understand the question correctly, the following WHERE clause should return the expected results:
WHERE
    (@ProfitSelected = 1 AND Profit = 1) OR
    (@LossSelected = 1 AND Loss = 1) OR
    (@NoImpactSelected = 1 AND NoImpact = 1) OR
    (@ProfitSelected = 0 AND @LossSelected = 0 AND @NoImpactSelected = 0)
@Zhorov helped me a lot. I had to modify his query slightly to cover all the test cases:
WHERE (@ProfitSelected = 0 AND @LossSelected = 0 AND @NoImpactSelected = 0) OR
      (@ProfitSelected = 1 AND Profit = 1) OR
      (@LossSelected = 1 AND Loss = 1) OR
      (@NoImpactSelected = 1 AND NoImpact = 1) OR
      (@ProfitSelected = 1 AND @LossSelected = 1 AND @NoImpactSelected = 1)

how to ensure all rows/columns show up with pandas crosstab?

I am computing a simple crosstab for the purpose of a transition matrix like this:
test_df = pd.DataFrame({'from': ['A', 'A', 'B', 'C'], 'to': ['A', 'B', 'B', None]},
                       columns=['from', 'to'])
pd.crosstab(test_df['from'], test_df['to'], dropna=False)
It produces the following matrix:
A | B
---------
A 1 | 1
---------
B 0 | 1
I want it to include all transitions, even if they're 0, like the following:
A | B | C
-------------
A 1 | 1 | 0
-------------
B 0 | 1 | 0
-------------
C 0 | 0 | 0
Is there some setting I am missing to do this? I tried checking the options and couldn't find anything.
Use DataFrame.reindex at the end:
i = test_df[['from', 'to']].stack().unique()

new_df = (pd.crosstab(test_df['from'], test_df['to'], dropna=False)
            .reindex(index=i, columns=i, fill_value=0))
print(new_df)
to    A  B  C
from
A     1  1  0
B     0  1  0
C     0  0  0
Another approach: DataFrame.pivot_table
(test_df.pivot_table(index='from', columns='to', aggfunc='size', fill_value=0)
        .reindex(index=i, columns=i, fill_value=0))
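A small reusable sketch of the same idea, with the label union and the reindex wrapped in a helper (the full_crosstab name is just illustrative):
import pandas as pd

def full_crosstab(s1, s2):
    # union of every label seen in either series, so transitions
    # that never occur still get a row/column of zeros
    labels = pd.concat([s1, s2]).dropna().unique()
    return (pd.crosstab(s1, s2, dropna=False)
              .reindex(index=labels, columns=labels, fill_value=0))

full_crosstab(test_df['from'], test_df['to'])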

Broadcasting multi-dimensional array indices of the same shape

I have a mask array which represents a 2-dimensional binary image. Let's say it's simply:
mask = np.zeros((9, 9), dtype=np.uint8)
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
Suppose I want to flip the elements in the middle left ninth:
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 1 1 1 | 0 0 0 | 0 0 0
# 1 1 1 | 0 0 0 | 0 0 0
# 1 1 1 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
My incorrect approach was something like this:
x = np.arange(mask.shape[0])
y = np.arange(mask.shape[1])
mask[np.logical_and(y >= 3, y < 6), x < 3] = 1
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# ------+-------+------
# 1 0 0 | 0 0 0 | 0 0 0
# 0 1 0 | 0 0 0 | 0 0 0
# 0 0 1 | 0 0 0 | 0 0 0
# ------+-------+------
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
# 0 0 0 | 0 0 0 | 0 0 0
(This is a simplification of the constraints I'm really dealing with, which would not be easily expressed as something like mask[3:6, :3] = 1 as in this case. Consider the constraints arbitrary, like x % 2 == 0 and y % 3 == 0 if you will.)
NumPy's behavior when the two index arrays are the same shape is to take them pairwise, which ends up selecting only the 3 elements above rather than the 9 I would like.
How would I update the right elements with constraints that apply to different axes? Given that the constraints are independent, can I do this by only evaluating my constraints N+M times, rather than N*M?
You can't broadcast the boolean arrays, but you can construct the equivalent numeric indices with ix_:
In [330]: np.ix_((y>=3)&(y<6), x<3)
Out[330]:
(array([[3],
        [4],
        [5]]), array([[0, 1, 2]]))
Applying it:
In [331]: arr = np.zeros((9,9),int)
In [332]: arr[_330] = 1
In [333]: arr
Out[333]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])
Attempting to broadcast the booleans directly raises an error (too many indices):
arr[((y>=3)&(y<6))[:,None], x<3]
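That said, the two 1-D conditions can be combined into a single 2-D boolean mask by broadcasting, which still evaluates each constraint only N + M times (this variant is an addition, not part of the original answer):
rows = (y >= 3) & (y < 6)          # evaluated once per row
cols = x < 3                       # evaluated once per column
mask[rows[:, None] & cols] = 1     # outer AND -> (9, 9) boolean mask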
Per your comment, let's try this fancier example:
import numpy as np
import matplotlib.pyplot as plt

mask = np.zeros((90, 90), dtype=np.uint8)
# criteria
def f(x,y): return ((x-20)**2 < 50) & ((y-20)**2 < 50)
# ranges
x,y = np.arange(90), np.arange(90)
# meshgrid
xx,yy = np.meshgrid(x,y)
zz = f(xx,yy)
# mask
mask[zz] = 1
plt.imshow(mask, cmap='gray')
Output: (plot of the mask: a small white square centered near (20, 20) on a black background)

Python Pandas Dataframe cell value split

I am lost on how to split the binary values so that each 0/1 value takes up its own column of the data frame.
(screenshot of the dataframe, from a Jupyter notebook)
You can use concat with apply, converting each string to a Series of its characters:
df = pd.DataFrame({0: [1, 2, 3], 1: ['1010', '1100', '0101']})
print(df)

   0     1
0  1  1010
1  2  1100
2  3  0101
df = pd.concat([df[0],
                df[1].apply(lambda x: pd.Series(list(x))).astype(int)],
               axis=1, ignore_index=True)
print(df)

   0  1  2  3  4
0  1  1  0  1  0
1  2  1  1  0  0
2  3  0  1  0  1
Another solution with DataFrame constructor:
df = pd.concat([df[0],
                pd.DataFrame(df[1].apply(list).values.tolist()).astype(int)],
               axis=1, ignore_index=True)
print(df)

   0  1  2  3  4
0  1  1  0  1  0
1  2  1  1  0  0
2  3  0  1  0  1
EDIT:
df = pd.DataFrame({0: ['1010', '1100', '0101']})
df1 = pd.DataFrame(df[0].apply(list).values.tolist()).astype(int)
print(df1)

   0  1  2  3
0  1  0  1  0
1  1  1  0  0
2  0  1  0  1
But if you need lists:
df[0] = df[0].apply(lambda x: [int(y) for y in list(x)])
print(df)

              0
0  [1, 0, 1, 0]
1  [1, 1, 0, 0]
2  [0, 1, 0, 1]
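For completeness, a compact sketch of the same split done with a plain list comprehension, starting again from the string column (equivalent result, just another way to build the frame):
import pandas as pd

s = pd.Series(['1010', '1100', '0101'])
# one row per string, one column per character, already as ints
pd.DataFrame([list(map(int, value)) for value in s])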