How to create dummy variables on Ordinal columns in Python

I am new to Python. I have created dummy columns for a categorical column using pandas get_dummies. How do I create dummy columns for an ordinal column (say, a column Rating with values 1, 2, 3, ..., 10)?

Consider the dataframe df
import pandas as pd
df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))
df
Cats Ords
0 a 3
1 b 2
2 c 1
3 d 0
4 c 1
5 b 2
6 a 3
pd.get_dummies works the same on either column
with df.Cats
pd.get_dummies(df.Cats)
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 1 0
5 0 1 0 0
6 1 0 0 0
with df.Ords
pd.get_dummies(df.Ords)
0 1 2 3
0 0 0 0 1
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
with both
pd.get_dummies(df)
Ords Cats_a Cats_b Cats_c Cats_d
0 3 1 0 0 0
1 2 0 1 0 0
2 1 0 0 1 0
3 0 0 0 0 1
4 1 0 0 1 0
5 2 0 1 0 0
6 3 1 0 0 0
Notice that it split out Cats but not Ords.
Let's expand on this by adding another Cats2 column and calling pd.get_dummies again:
pd.get_dummies(df.assign(Cats2=df.Cats))
Ords Cats_a Cats_b Cats_c Cats_d Cats2_a Cats2_b Cats2_c Cats2_d
0 3 1 0 0 0 1 0 0 0
1 2 0 1 0 0 0 1 0 0
2 1 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 1
4 1 0 0 1 0 0 0 1 0
5 2 0 1 0 0 0 1 0 0
6 3 1 0 0 0 1 0 0 0
Interesting, it splits both object columns but not the numeric one.
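To get dummy columns for the numeric (ordinal) column as well, one option is to name it explicitly through the columns parameter of pd.get_dummies. The sketch below reuses the df from above; dtype=int is only there so the output shows 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))

# Listing the numeric column in `columns` forces it to be encoded too;
# by default only object/string/category columns are split out.
dummies = pd.get_dummies(df, columns=['Cats', 'Ords'], dtype=int)
print(dummies)
```

The resulting columns are prefixed with the source column name (Cats_a ... Cats_d, Ords_0 ... Ords_3). Note that one-hot encoding like this discards the ordering of the ordinal values.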

Related

getting dummy values across all columns

The get_dummies method does not seem to work as expected when used with more than one column.
For example, if I have this dataframe...
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread", "Milk"],
["Rice", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
If I use the get_dummies method, the items are repeated across columns like this:
pd.get_dummies(df)
0_Apple 0_Rice 1_Bread 1_Milk 1_Rice 2_Bread 2_Fridge 2_Milk 3_Milk
0 1 0 1 0 0 0 1 0 0
1 0 1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0 0 1
3 0 1 0 1 0 0 0 0 0
4 1 0 1 0 0 0 0 1 0
While the expected result is:
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 1 1
3 0 0 0 1 1
4 1 1 0 1 0

Add the parameters prefix and prefix_sep to get_dummies, then take the max over columns that share a name to avoid duplicated column names (it aggregates by max):
df = pd.get_dummies(df, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 1 0
3 0 1 0 1 0
4 1 0 1 1 0
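The level argument of DataFrame.max was removed in pandas 2.0, so the one-liner above may fail on newer installations. A minimal sketch of the same idea, assuming pandas 2.0 or later, groups the duplicated column names after a transpose instead; the variable names dummies and result are only illustrative.

```python
import pandas as pd

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread", "Milk"],
    ["Rice", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)

# Empty prefix so the dummy columns are named after the items only;
# this produces duplicated column names (one per source column).
dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=int)

# Transpose, group the duplicated names, take the max, transpose back.
result = dummies.T.groupby(level=0).max().T
print(result)
```

Because groupby sorts its keys, the columns come out in alphabetical order (Apple, Bread, Fridge, Milk, Rice), which matches the expected layout in the question.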

T-SQL poor CTE join performance

Long story short: I have two very similar SQL queries that output pairs of short and long names. The input parameter of the first query is the short name; the input parameter of the second query is the long name. The queries are constructed to output all rows that contain the input parameter, not only exact matches (for example, if I set the tag name to ST2, the query outputs every object that contains ST2 and every other object whose name begins with ST2).
All the queries are executed in the same database. In the queries below the input parameters are set to '', which means each query outputs the short/long name pairs for all objects.
declare @tagName as varchar(50) = ''
set @tagName = @tagName + '.'
-- this query outputs ~700 rows
;with AnalogTag as
(
Select *
from [Runtime].[dbo].[AnalogTag]
where (substring(TagName, 0, charindex('.',@tagName))) in (substring(@tagName,0, len(@tagName)))
and (substring(TagName, charindex('.',TagName), 2)) not in ('.#')
),
-- this query outputs ~7000 rows
HierarchicalName as
(
Select *
from [proba7].[dbo].[internal_list_objects_view]
where substring(tag_name, 0,len(@tagName)) = substring(@tagName,0, len(@tagName))
)
select HierarchicalName.tag_name as TagName
,HierarchicalName.hierarchical_name ilo_view_HierarchicalName
from AnalogTag
inner join HierarchicalName
on substring(AnalogTag.TagName, 0, CHARINDEX('.',AnalogTag.TagName)) = HierarchicalName.tag_name
The whole query above runs in approximately 3 seconds and outputs about 450 rows.
I created a similar query on the same database:
declare @hierarchicalName as varchar(200) = ''
declare @Length as int
set @Length = LEN(@hierarchicalName)+1
-- this query outputs approx. 700 rows; run separately, it finishes
-- almost instantly
;with AnalogTag as
(
Select TagName
from [Runtime].[dbo].[AnalogTag]
where (substring(TagName, 0, CHARINDEX('.',TagName))) in
(
Select tag_name from [proba7].[dbo].[internal_list_objects_view]
where substring(hierarchical_name, 0, @Length) = @hierarchicalName
)
and (substring(TagName, CHARINDEX('.',TagName), 2)) not in ('.#')
),
-- this query outputs approx. 7000 rows; run separately, it finishes
-- almost instantly
HierarchicalName as
(
Select hierarchical_name, tag_name from [proba7].[dbo].[internal_list_objects_view]
where substring(hierarchical_name, 0, @Length) = @hierarchicalName
)
select HierarchicalName.tag_name as ilo_view_TagName
,HierarchicalName.hierarchical_name ilo_view_HierarchicalName
from AnalogTag
inner join HierarchicalName
on substring(AnalogTag.TagName, 0, CHARINDEX('.',AnalogTag.TagName)) = HierarchicalName.tag_name
This time the query runs in 28 seconds. It outputs a similar number of rows to the first query (the output of the two queries should be similar). I noticed that if I change "inner join" to, for example, "full join", the query runs instantly.
Analog tag example output:
TagName a b c d e f g h j j k l m
PomFPTemp.PV 1062 0 10 0 10 0 4 0 0 0 0 0 0
PomFPWilgWzgl.PV 1 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocD3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocP3f.PV 46 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3fIntExp.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQ3fIntImp.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1BrakFWHMocQn3f.PV 1063 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1EnkvarhExp3f.PV 1060 0 10 0 10 0 4 0 0 0 0 0 0
SNST3P1EnkvarhImp3f.PV 1060 0 10 0 10 0 4 0 0 0 0 0 0
hierarchical name example output:
ID tag_name contained_name hierarchical_name a b c d e f g h i j k l m n o p q r s t u v w x y z aa bb cc dd ee ff gg hh
1 $Galaxy $Galaxy 1 0 0 0 0 1 NULL NULL 0 0 1 1 1 0 0 0 0 0 1 0 NULL 23 0 0 0 0 0 0 0 0 0 133123 50026 1 NULL
2 proba7 Galaxy_001 0 0 0 0 1 1 NULL NULL 0 0 848 1 0 1 0 0 0 0 1 0 NULL 23 0 0 0 0 0 0 0 0 0 4098 878020 1 NULL
3 $_AutoImport $_AutoImport 1 0 0 0 0 3 NULL NULL 0 0 4 2 0 0 0 0 0 0 1 0 NULL 10 0 0 0 0 0 0 0 0 0 131075 3699 1 NULL
4 $_DiCommon $_DiCommon 1 0 0 0 0 4 NULL NULL 0 0 5 3 0 0 0 0 0 0 1 0 NULL 10 0 0 0 0 0 0 0 0 0 131075 50023 1 NULL
5 $WinPlatform $WinPlatform 1 0 0 0 0 5 NULL 1 0 0 6 4 1 0 0 0 0 0 0 0 NULL 1 0 0 0 0 0 0 0 0 0 133121 419340 1 1
6 $AppEngine $AppEngine 1 0 0 0 0 6 NULL 1 0 0 7 5 1 0 0 0 0 0 0 0 NULL 3 0 0 0 0 0 0 0 0 0 133121 419341 1 1
7 $Area $Area 1 0 0 0 0 7 NULL 1 0 0 8 6 1 0 0 0 0 0 0 0 NULL 13 0 0 0 0 0 0 0 0 0 133121 3452998 1 1
8 $AnalogDevice $AnalogDevice 1 0 0 0 0 8 NULL 2 0 0 9 7 0 0 0 0 0 0 0 0 NULL 10 0 0 0 0 0 0 0 0 0 131073 419343 1 2
9 $DDESuiteLinkClient $DDESuiteLinkClient 1 0 0 0 0 9 NULL 3 0 0 10 8 1 0 0 0 0 0 0 0 NULL 11 0 0 0 0 0 0 0 0 0 133121 419344 1 3
10 $DiscreteDevice $DiscreteDevice 1 0 0 0 0 10 NULL 2 0 0 11 9 0 0 0 0 0 0 0 0 NULL 10 0 0 0 0 0 0 0 0 0 131073 419345 1 2
11 $InTouchProxy $InTouchProxy 1 0 0 0 0 11 NULL 3 0 0 12 10 0 0 0 0 0 0 0 0 NULL 11 0 0 0 0 0 0 0 0 0 131073 419346 1 3
Output table (example):
ilo_view_TagName ilo_view_HierarchicalName
ST4FP12Rozl ST4_FP1.Galaz_2.Rozladowywanie
ST4FP21Rozl ST4_FP2.Galaz_1.Rozladowywanie
ST4FP22Rozl ST4_FP2.Galaz_2.Rozladowywanie
ST4FP31Rozl ST4_FP3.Galaz_1.Rozladowywanie
ST4RS41AnWspKFL2 ST4_S1_RS4_1.Wsp.K_Factor.L2
ST4FP32Rozl ST4_FP3.Galaz_2.Rozladowywanie
ST4RS31AnWspKFL2 ST4_S2_RS3_1.Wsp.K_Factor.L2
ST4RS51AnWspKFL2 ST4_S3_RS5_1.Wsp.K_Factor.L2
ST4FP11U ST4_FP1.Galaz_1.Napiecie
Best regards, and thanks in advance for any advice. I did my best to make these example tables readable.

Truth table with 5 inputs and 3 outputs

I have to make a truth table with 5 inputs and 3 outputs, something like this:
A B C D E red green blue
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1
0 0 0 1 0 0 0 1
.
.
.
.
1 1 0 1 0 0 1 1
.
.
.
1 1 1 1 1 1 0 1
etc. (32 rows in total; the numbers in the RGB columns represent the number of 1's in each row in binary, i.e. in row 1 1 0 1 0 there are three 1's, and three in binary is 0 1 1).
I would like to present the result in the Atanua (http://sol.gfxile.net/atanua/index.html) tool (so, for example, when I press button E the blue light shines, when I press A, B and D the green and blue lights shine, and so on). But there is a requirement that I can only use AND, OR and NOT operators, and each operator can have only two inputs. Although I'm using Karnaugh maps to minimize it, with so many rows the results for each output are still very long (especially for the last one).
I tried to simplify it further by combining all three output boolean functions into one, and the minimization went pretty well:
A + B + C + D
It seems to work fine (but as there is only one output light, it works only for the red, green and blue columns separately). My concern is that I would like to have three outputs (three lights, not one); is that even possible after this kind of minimization? Is there a good way to do it in Atanua? Or do I have to make 3 separate boolean functions, no matter how long they are (and they are long even after minimization)?
EDIT: the whole truth table :)
A B C D E R G B
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1
0 0 0 1 0 0 0 1
0 0 0 1 1 0 1 0
0 0 1 0 0 0 0 1
0 0 1 0 1 0 1 0
0 0 1 1 0 0 1 0
0 0 1 1 1 0 1 1
0 1 0 0 0 0 0 1
0 1 0 0 1 0 1 0
0 1 0 1 0 0 1 0
0 1 0 1 1 0 1 1
0 1 1 0 0 0 1 0
0 1 1 0 1 0 1 1
0 1 1 1 0 0 1 1
0 1 1 1 1 1 0 0
1 0 0 0 0 0 0 1
1 0 0 0 1 0 1 0
1 0 0 1 0 0 1 0
1 0 0 1 1 0 1 1
1 0 1 0 0 0 1 0
1 0 1 0 1 0 1 1
1 0 1 1 0 0 1 1
1 0 1 1 1 1 0 0
1 1 0 0 0 0 1 0
1 1 0 0 1 0 1 1
1 1 0 1 0 0 1 1
1 1 0 1 1 1 0 0
1 1 1 0 0 0 1 1
1 1 1 0 1 1 0 0
1 1 1 1 0 1 0 0
1 1 1 1 1 1 0 1
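The rule behind the table above (the three outputs are the number of 1's among the five inputs, written in binary) can be sketched in a few lines of Python. This is only a way to regenerate or check the table, not something Atanua itself runs; the function name count_outputs is made up for the illustration.

```python
from itertools import product


def count_outputs(bits):
    """Return (red, green, blue): the count of 1's in bits as a 3-bit number."""
    ones = sum(bits)
    return (ones >> 2) & 1, (ones >> 1) & 1, ones & 1


# All 32 input combinations in the same order as the table (A is the MSB).
table = [(bits, count_outputs(bits)) for bits in product((0, 1), repeat=5)]

print("A B C D E R G B")
for bits, rgb in table:
    print(*bits, *rgb)
```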
And the Karnaugh map result for each color (~ is the NOT gate, * is AND and is omitted between adjacent variables, + is OR):
RED:
BCDE+ACDE+ABDE+ABCE+ABCD
GREEN:
~A~BDE+~AC~DE+~ACD~E+~BCD~E+~AB~CE+B~CD~E+BC~D~E+A~B~CE+A~B~CD+A~BC~D+AB~C~D
BLUE:
~A~B~C~DE+~A~B~CD~E+~A~BC~D~E+~A~BCDE+~AB~C~D~E+~AB~CDE+~ABC~DE+~ABCD~E+A~B~C~D~E+A~B~CDE+A~BC~DE+A~BCD~E+AB~C~DE+AB~CD~E+ABC~D~E+ABCDE

I have to admit that the formulas are somewhat ugly, but they are not too complicated to implement with logic gates, because you can reuse parts.
A -----+------+------------- - - -
NOT |
+------|--AND- ~AB
| | |
AND-----|---|-- ~A~B
+--AND-+ |
| +--|---|-- A~B
NOT AND--|-- AB
B -----+------+---+---------- - - -
Here, as an example, I created all combinations of [not]A and [not]B. You can do the same for C and D. That way you can get any combination of [not]A, [not]B, [not]C and [not]D by combining a wire from each "box" with an AND gate (e.g. for ABCD we would take the AB wire AND the CD wire).
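As a rough illustration of the reuse idea, the RED formula from the question (BCDE+ACDE+ABDE+ABCE+ABCD) can be built from two-input AND/OR gates while sharing intermediate wires such as AB and CD. The Python sketch below only simulates the gates; the helper names AND, OR and red are invented for the example, and the particular way the terms are grouped is just one possibility.

```python
from itertools import product


def AND(x, y):
    """Two-input AND gate."""
    return x & y


def OR(x, y):
    """Two-input OR gate."""
    return x | y


def red(a, b, c, d, e):
    # Shared intermediate wires, each built once and reused.
    ab = AND(a, b)
    cd = AND(c, d)
    cde = AND(cd, e)
    abe = AND(ab, e)
    # The five product terms of the RED formula.
    bcde = AND(b, cde)
    acde = AND(a, cde)
    abde = AND(abe, d)
    abce = AND(abe, c)
    abcd = AND(ab, cd)
    return OR(OR(OR(bcde, acde), OR(abde, abce)), abcd)


# RED is the 4's bit of the count, so it should be 1 exactly when
# at least four of the five inputs are 1.
for bits in product((0, 1), repeat=5):
    assert red(*bits) == int(sum(bits) >= 4)
```

This version uses 9 AND gates and 4 OR gates for RED; the same sharing applies to GREEN and BLUE once the negated combinations are available.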

Extract columns from row values in Python

I am using this dataframe:
dfPredET.head(5)
id Class
1 Class_2
2 Class_1
3 Class_6
4 Class_2
5 Class_1
and I would like to transform it to indicate whether each instance belongs to a class (1) or not (0):
id Class_1 Class_2 Class_3 Class_4 Class_5 Class_6 Class_7 Class_8 Class_9
1 0 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0
4 0 1 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0
Can I do that using the pivot() function? And how?

Use get_dummies:
In [7]:
pd.get_dummies(df)
Out[7]:
id Class_Class_1 Class_Class_2 Class_Class_6
0 1 0 1 0
1 2 1 0 0
2 3 0 0 1
3 4 0 1 0
4 5 1 0 0
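The output above has doubled prefixes (Class_Class_1) and only contains the classes that occur in the data. A minimal sketch that gets closer to the layout asked for in the question encodes the Class column as a Series and then reindexes to the full class list. The list all_classes (Class_1 through Class_9) is an assumption taken from the desired output, and wide is just an illustrative name.

```python
import pandas as pd

dfPredET = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Class': ['Class_2', 'Class_1', 'Class_6', 'Class_2', 'Class_1'],
})

# Assumed full set of classes, taken from the desired output in the question.
all_classes = [f'Class_{i}' for i in range(1, 10)]

wide = (
    # Encoding a Series keeps the raw values as column names (no extra prefix).
    pd.get_dummies(dfPredET.set_index('id')['Class'], dtype=int)
    # Add the classes that never occur, filled with 0, in a fixed order.
    .reindex(columns=all_classes, fill_value=0)
    .reset_index()
)
print(wide)
```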

Matplotlib pcolor not plotting correctly

I am trying to create a heat map from a DataFrame (df) of IDs (rows) and Positions (columns) at which a motif is possible. The value in the table is 1 if the motif is present and 0 if it is not. For example:
ID Position 1 2 3 4 5 6 7 8 9 10 ...etc
A 0 1 0 0 0 1 0 0 0 1
B 1 0 1 0 1 0 0 1 0 0
C 0 0 0 1 0 0 1 0 1 0
D 1 0 1 0 0 0 1 0 1 0
I then multiply the transpose of this matrix by the matrix itself to find the number of times motifs co-occur with motifs at other positions, using the code:
df.T.dot(df)
To obtain the Data Frame:
POS 1 2 3 4 5 6 7 8 9 10 ...
1 2 0 2 0 1 0 1 1 1 0
2 0 1 0 0 0 1 0 0 0 1
3 2 0 2 0 1 0 1 1 1 0
4 0 0 0 1 0 0 1 0 1 0
5 1 0 1 0 1 0 0 1 0 0
6 0 1 0 0 0 1 0 0 0 1
7 1 0 1 1 0 0 2 0 2 0
8 1 0 1 0 1 0 0 1 0 0
9 1 0 1 1 0 0 2 0 2 0
10 0 1 0 0 0 1 0 0 0 1
...
This is symmetric about the diagonal; however, when I try to create the heat map using
pylab.pcolor(df)
it gives me an asymmetric map that does not seem to represent the dot-product matrix. I don't have enough reputation to post an image, though.
Does anyone know why this might be occurring? Thanks
