How to ensure all rows/columns show up with pandas crosstab?

I am computing a simple crosstab for the purpose of a transition matrix, like this:

import pandas as pd

test_df = pd.DataFrame({'from': ['A', 'A', 'B', 'C'], 'to': ['A', 'B', 'B', None]},
                       columns=['from', 'to'])
pd.crosstab(test_df['from'], test_df['to'], dropna=False)
It produces the following matrix:
    A | B
---------
A | 1 | 1
B | 0 | 1
I want it to include all transitions, even if they're 0, like the following:
    A | B | C
-------------
A | 1 | 1 | 0
B | 0 | 1 | 0
C | 0 | 0 | 0
Is there some setting I am missing to do this? I tried checking the options and couldn't find anything.

Use DataFrame.reindex at the end:
i = test_df[['from', 'to']].stack().unique()
new_df = (pd.crosstab(test_df['from'], test_df['to'], dropna=False)
            .reindex(index=i, columns=i, fill_value=0))
print(new_df)
to    A  B  C
from
A     1  1  0
B     0  1  0
C     0  0  0
Another approach: DataFrame.pivot_table

(test_df.pivot_table(index='from', columns='to', aggfunc='size', fill_value=0)
        .reindex(index=i, columns=i, fill_value=0))
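Another option, not from the original answer but a sketch of a common workaround: declare the full set of states up front with pandas.Categorical. Unobserved categories are then kept by crosstab even when their count is zero, so no reindex is needed (the variable names here are mine):

```python
import pandas as pd

test_df = pd.DataFrame({'from': ['A', 'A', 'B', 'C'],
                        'to': ['A', 'B', 'B', None]})

# Declare every possible state so empty rows/columns are preserved
states = ['A', 'B', 'C']
from_cat = pd.Categorical(test_df['from'], categories=states)
to_cat = pd.Categorical(test_df['to'], categories=states)

result = pd.crosstab(from_cat, to_cat, dropna=False)
print(result)
```

This keeps the zero column for 'C' because the category is declared, not because it is observed in the data.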

Related

How can I aggregate strings from many cells into one cell?

Say I have two classes with a handful of students each, and I want to think of the possible pairings in each class. In my original data, I have one line per student.
What's the easiest way in Pandas to turn this dataset
   Class Students
0      1        A
1      1        B
2      1        C
3      1        D
4      1        E
5      2        F
6      2        G
7      2        H
Into this?
    Class Students
0       1      A,B
1       1      A,C
2       1      A,D
3       1      A,E
4       1      B,C
5       1      B,D
6       1      B,E
7       1      C,D
8       1      C,E
9       1      D,E
10      2      F,G
11      2      F,H
12      2      G,H
Try this:

import itertools
import pandas as pd

cla = [1, 1, 1, 1, 1, 2, 2, 2]
s = ["A", "B", "C", "D", "E", "F", "G", "H"]
df = pd.DataFrame(cla, columns=["Class"])
df['Student'] = s

def create_combos(list_students):
    # Build "X,Y" strings for every unordered pair of students
    combos = itertools.combinations(list_students, 2)
    str_students = []
    for i in combos:
        str_students.append(str(i[0]) + "," + str(i[1]))
    return str_students

def iterate_df(class_id):
    df_temp = df.loc[df['Class'] == class_id]
    list_student = list(df_temp['Student'])
    list_combos = create_combos(list_student)
    list_id = [class_id for i in list_combos]
    return list_id, list_combos

list_classes = set(df['Class'])
new_id = []
new_combos = []
for idx in list_classes:
    tmp_id, tmp_combo = iterate_df(idx)
    new_id += tmp_id
    new_combos += tmp_combo

new_df = pd.DataFrame(new_id, columns=["Class"])
new_df["Student"] = new_combos
print(new_df)
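A shorter alternative (a sketch, not from the original answer) iterates the groups directly and builds the rows in one comprehension:

```python
import itertools
import pandas as pd

df = pd.DataFrame({'Class': [1, 1, 1, 1, 1, 2, 2, 2],
                   'Student': list('ABCDEFGH')})

# One output row per unordered pair of students within each class
records = [{'Class': cls, 'Student': f'{a},{b}'}
           for cls, grp in df.groupby('Class')
           for a, b in itertools.combinations(grp['Student'], 2)]
new_df = pd.DataFrame(records)
print(new_df)
```

For 5 students in class 1 and 3 in class 2 this yields C(5,2) + C(3,2) = 13 rows.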

Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
   A     B
   u  v  x  y
0  1  1  0  1
1  1  0  0  0
2  0  0  1  1
3  1  0  1  0
I want to select the rows in which any of the A columns or any of the B columns are true.
I can do so explicitly.
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
   A     B
   u  v  x  y
0  1  1  0  1
1  1  0  0  0
3  1  0  1  0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
   A     B
   u  v  x  y
0  1  1  0  1
2  0  0  1  1
3  1  0  1  0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
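No answer is included here; one possible implementation (a sketch, matching the OP's explicit OR-based examples) selects the sub-frame under the top-level name, casts it to bool, and keeps rows where any sub-column is true. This works for an arbitrary number of arbitrarily named sub-columns:

```python
import pandas as pd

headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]],
                 columns=headers)

def select(frame, top_level_name):
    # frame[name] is the sub-frame of all columns under that top level,
    # so any(axis=1) ORs across however many sub-columns exist
    mask = frame[top_level_name].astype(bool).any(axis=1)
    return frame[mask]

print(select(f, "A"))  # rows 0, 1, 3
print(select(f, "B"))  # rows 0, 2, 3
```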

Filtering a view using three bit columns

The view can be filtered on these three columns:
Profit(bit), Loss(bit), NoImpact(bit)
Backstory: On a webpage a user can choose to filter the data based on three checkboxes (Profit, Loss, No impact).
What I am looking for: If they check 'Profit' return everything where 'Profit' = 1, if then they check 'Loss' show 'Profit' AND 'Loss' results but exclude 'NoImpact', and so forth.
This is what I've tried so far; it is part of my stored proc:
WHERE (
    ((@ProfitSelected IS NULL OR @ProfitSelected = 'false') OR (Profit = @ProfitSelected))
    -- I've tried using AND here as well.
    OR ((@LossSelected IS NULL OR @LossSelected = 'false') OR (Loss = @LossSelected))
    OR ((@NoImpactSelected IS NULL OR @NoImpactSelected = 'false') OR (NoImpact = @NoImpactSelected))
)
END

exec dbo.SearchErrorReports @ProfitSelected = 1, @LossSelected = 1, @NoImpactSelected = 0
Thank you.
EDIT: As requested here are some tests and desired results:
TEST: exec dbo.SearchErrorReports @ProfitSelected = 1, @LossSelected = 1, @NoImpactSelected = 0
Result:
id | Profit | Loss | NoImpact
-----------------------------
 1 |      1 |    0 |        0
 2 |      1 |    0 |        0
 3 |      0 |    1 |        0
 4 |      0 |    1 |        0
 5 |      0 |    1 |        0
TEST: exec dbo.SearchErrorReports @ProfitSelected = 0, @LossSelected = 1, @NoImpactSelected = 0
Result:
id | Profit | Loss | NoImpact
-----------------------------
 1 |      0 |    1 |        0
 2 |      0 |    1 |        0
 3 |      0 |    1 |        0
TEST: exec dbo.SearchErrorReports @ProfitSelected = 1, @LossSelected = 1, @NoImpactSelected = 1
Result:
id | Profit | Loss | NoImpact
-----------------------------
 1 |      1 |    0 |        0
 2 |      0 |    1 |        0
 3 |      0 |    1 |        0
 4 |      0 |    0 |        1
 5 |      1 |    0 |        0
 6 |      0 |    0 |        1
And so on for all the other permutations.
If I understand the question correctly, the following WHERE clause should return the expected results:
WHERE
    (@ProfitSelected = 1 AND Profit = 1) OR
    (@LossSelected = 1 AND Loss = 1) OR
    (@NoImpactSelected = 1 AND NoImpact = 1) OR
    (@ProfitSelected = 0 AND @LossSelected = 0 AND @NoImpactSelected = 0)
@Zhorov helped me a lot. I had to modify his query slightly to cover all test cases:
WHERE
    (@ProfitSelected = 0 AND @LossSelected = 0 AND @NoImpactSelected = 0) OR
    (@ProfitSelected = 1 AND Profit = 1) OR
    (@LossSelected = 1 AND Loss = 1) OR
    (@NoImpactSelected = 1 AND NoImpact = 1) OR
    (@ProfitSelected = 1 AND @LossSelected = 1 AND @NoImpactSelected = 1)

How to apply formulas with different conditions without splitting the dataframe

I have the following dataframe
import pandas as pd

d = {
    'ID': [1, 2, 3, 4, 5],
    'Price1': [5, 9, 4, 3, 9],
    'Price2': [9, 10, 13, 14, 18],
    'Type': ['A', 'A', 'B', 'C', 'D'],
}
df = pd.DataFrame(data=d)
df
To apply a formula without any condition, I use the following code:

df = df.eval('Price = (Price1*Price1)/2')
df
How can I do the same without splitting the dataframe when the conditions differ?
I need a new column called Price_on_type.
The formula differs for each type:
For type A: Price_on_type = Price1 + Price1
For type B: Price_on_type = (Price1 + Price1)/2
For type C: Price_on_type = Price1
For type D: Price_on_type = Price2
Expected Output:
import pandas as pd

d = {
    'ID': [1, 2, 3, 4, 5],
    'Price1': [5, 9, 4, 3, 9],
    'Price2': [9, 10, 13, 14, 18],
    'Price': [12.5, 40.5, 8.0, 4.5, 40.5],
    'Price_on_type': [14, 19, 8.0, 3, 18],
}
df = pd.DataFrame(data=d)
df
You can use numpy.select:

import numpy as np

masks = [df['Type'] == 'A',
         df['Type'] == 'B',
         df['Type'] == 'C',
         df['Type'] == 'D']
vals = [df.eval('(Price1*Price1)'),
        df.eval('(Price1*Price1)/2'),
        df['Price1'],
        df['Price2']]

Or, matching the expected output:

vals = [df['Price1'] + df['Price2'],
        (df['Price1'] + df['Price2']) / 2,
        df['Price1'],
        df['Price2']]

df['Price_on_type'] = np.select(masks, vals)
print(df)
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0
If your data is not too big, use apply with a custom function on axis=1:

def Prices(x):
    # Choose the formula based on the row's Type
    dict_sw = {
        'A': x.Price1 + x.Price2,
        'B': (x.Price1 + x.Price2) / 2,
        'C': x.Price1,
        'D': x.Price2,
    }
    return dict_sw[x.Type]

In [239]: df['Price_on_type'] = df.apply(Prices, axis=1)
In [240]: df
Out[240]:
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0
Or use the trick that True converts to 1 and False to 0:

df['Price_on_type'] = \
    (df.Type == 'A') * (df.Price1 + df.Price2) + \
    (df.Type == 'B') * (df.Price1 + df.Price2) / 2 + \
    (df.Type == 'C') * df.Price1 + \
    (df.Type == 'D') * df.Price2

Out[308]:
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0
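Pulling the numpy.select pieces above into one self-contained script (a sketch using the second vals list, which matches the expected output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Price1': [5, 9, 4, 3, 9],
                   'Price2': [9, 10, 13, 14, 18],
                   'Type': ['A', 'A', 'B', 'C', 'D']})

# One boolean mask and one vectorized formula per type
masks = [df['Type'] == t for t in 'ABCD']
vals = [df['Price1'] + df['Price2'],        # A
        (df['Price1'] + df['Price2']) / 2,  # B
        df['Price1'],                       # C
        df['Price2']]                       # D
df['Price_on_type'] = np.select(masks, vals)
print(df['Price_on_type'].tolist())  # [14.0, 19.0, 8.5, 3.0, 18.0]
```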

How to build column by column dataframe pandas

I have a dataframe looking like this example
     A    B    C
0    s    s  NaN
1  NaN    x    x
I would like to create a table of intersections between columns like this
  |   A   |  B   |   C
--|-------|------|------
A | True  | True | False
B | True  | True | True
C | False | True | True
Is there an elegant cycle-free way to do it?
Thank you!
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
Option 1
You can use numpy broadcasting to compare each column with every other column, then check whether any of the row-wise comparisons are True.
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).any(0),
    df.columns, df.columns
)
       A     B      C
A   True  True  False
B   True  True   True
C  False  True   True
By replacing any with sum you can get a count of how many intersections.
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).sum(0),
    df.columns, df.columns
)

   A  B  C
A  1  1  0
B  1  2  1
C  0  1  1
Or use np.count_nonzero instead of sum
v = df.values
pd.DataFrame(
    np.count_nonzero(v[:, :, None] == v[:, None], 0),
    df.columns, df.columns
)

   A  B  C
A  1  1  0
B  1  2  1
C  0  1  1
Option 2
A fun and creative way:
d = pd.get_dummies(df.stack()).unstack(fill_value=0)
d = d.T.dot(d)
d.groupby(level=1).sum().groupby(level=1, axis=1).sum()
   A  B  C
A  1  1  0
B  1  2  1
C  0  1  1
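For reference, here is Option 1 end to end as one runnable script (a sketch; note that NaN != NaN, which is exactly what keeps A and C from "intersecting" on their missing values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
v = df.values

# (rows, cols, 1) == (rows, 1, cols) broadcasts to (rows, cols, cols);
# any(0) collapses over rows, leaving a cols x cols boolean matrix
inter = pd.DataFrame((v[:, :, None] == v[:, None]).any(0),
                     df.columns, df.columns)
print(inter)

# sum(0) instead of any(0) counts the shared values per column pair
counts = pd.DataFrame((v[:, :, None] == v[:, None]).sum(0),
                      df.columns, df.columns)
print(counts)
```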