Is there a function to do one-hot encoding and remove duplicates in R?

I have this database:
ID LABEL
1 A
1 B
2 B
3 c
I'm trying to do one-hot encoding, which I was able to do. However, I also need to remove the duplicated IDs. My one-hot encoded output currently looks like this:
ID A B C
1 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
and I need this to be the final database:
ID A B C
1 1 1 0
2 0 1 0
3 0 0 1
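This is my code:
library(caret)  # provides dummyVars()
# Build the one-hot encoder from every column, then apply it to the same data
dummy <- dummyVars('~ .', data = data_to_be_encoded)
encoded_data <- data.frame(predict(dummy, newdata = data_to_be_encoded))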
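
The wanted table is the per-ID maximum of the one-hot rows: an ID keeps a 1 in every column where any of its rows had a 1. A minimal sketch of that idea, written in pandas (the language of the related answers on this page) rather than R, and assuming the lowercase 'c' in the sample data is meant to be 'C':

```python
import pandas as pd

# Sample data from the question; the label 'c' is written as 'C' (assumption)
# so that it matches the column name in the wanted output.
df = pd.DataFrame({"ID": [1, 1, 2, 3], "LABEL": ["A", "B", "B", "C"]})

# One-hot encode LABEL, then keep the maximum of each indicator per ID,
# so an ID that appears with several labels collapses into a single row.
one_hot = pd.get_dummies(df["LABEL"], dtype=int)
encoded = pd.concat([df[["ID"]], one_hot], axis=1).groupby("ID", as_index=False).max()
print(encoded)
```

The same two steps (encode, then aggregate by ID with max) should carry over to the R data frame produced by the code above.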

Related

Change 1st instance of every unique row as 1 in pandas

Hi, let us assume I have a data frame:
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like:
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
Essentially, I want to set the first row of every unique element to one.
Currently I am using code like:
def store_counter(df):
    # Names that have not been seen yet
    unique_names = list(df.Name.unique())
    df['quantity'] = 0
    for i, j in df.iterrows():
        # Flag the row only the first time its name is encountered
        if j['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(j['Name'])
    return df
which is highly inefficient. Is there a better approach for this?
Thank you in advance.

Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values, use numpy.where:
import numpy as np
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
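
To see why this works, it can help to look at the mask on its own. A small self-contained sketch on the sample data (the variable name first_seen is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "a", "b", "b", "c"], "quantity": 0})

# duplicated() is False for the first occurrence of each Name and True afterwards,
# so negating it selects exactly the first row of every unique value.
first_seen = ~df["Name"].duplicated()
print(first_seen.tolist())

# Only the selected rows are overwritten; the rest keep their existing 0.
df.loc[first_seen, "quantity"] = 1
print(df)
```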

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing 2 columns, ID and Code; Flag is the column I want to compute.
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists together with 'B' or 'C', then the 'B' and 'C' rows should be flagged 1.
I tried groupby('ID') with filter(), but it does not give the right result. Could anyone please help?

You can do the following:
First use df.groupby('ID') and concatenate the codes using 'sum' to create a new column. Then assign the value 1 if a row has B or C as Code and the new column contains an A:
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns = 's')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0

You can use boolean masks, direct for B/C, per group for A, then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an A in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1&m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
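
For reference, a self-contained version of the mask approach on the sample data, printing the two intermediate masks so each step can be checked by eye (a sketch only, with the data retyped from the question):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 3, 4],
                   "Code": ["A", "C", "B", "A", "B", "A", "C"]})

# Row-level mask: is this row a B or a C?
m1 = df["Code"].isin(["B", "C"])
# Group-level mask broadcast back to rows: does this row's ID contain an A anywhere?
m2 = df["Code"].eq("A").groupby(df["ID"]).transform("any")
print(m1.tolist())
print(m2.tolist())

# A row is flagged only when both masks are True.
df["Flag"] = (m1 & m2).astype(int)
print(df)
```

ID 4 shows why the group-level mask is needed: its row is a C, but there is no A in the group, so it stays 0.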

Pandas merge conflict rows by counts?

Conflict rows are rows that have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows into a single label chosen by their counts. If a has more rows labeled 1, then a will be labeled 1. Otherwise, a should be labeled 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby("feature")["label"].value_counts().reset_index(name="counts").
But how do I get these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Let's take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as an example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1

I think you first need to filter the groups by their count of unique values, using SeriesGroupBy.nunique with GroupBy.transform, before applying SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
Second, get the label with the most occurrences for each feature:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1
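
Which label value_counts().index[0] returns when the counts are tied (as for b) may depend on sort order. If the tie rule from the question (a tie becomes 0) should be explicit, one possible alternative for 0/1 labels is to compare the share of 1s against one half. A minimal sketch, assuming the labels are only ever 0 or 1:

```python
import pandas as pd

df = pd.DataFrame({"feature": list("aabbaccd"),
                   "label": [1, 0, 0, 1, 1, 0, 0, 1]})

# For 0/1 labels the per-feature mean is the share of 1s.
share_of_ones = df.groupby("feature")["label"].mean()

# Label 1 wins only with a strict majority, so a tie (share == 0.5) falls back to 0.
df_merged = share_of_ones.gt(0.5).astype(int).reset_index()
print(df_merged)
```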

Updating multiple columns based on multiple conditions

I have the table below with some results for both morning and afternoon sessions (for different periods).
I would like to update the results based on a simple condition:
Check whether there was a change between 2 consecutive morning sessions; if not, add 5 to the score.
Example: ID=1, Mor2=C, Mor3=C so Score_M3 = 5+5 = 10 (new value). All updated values are marked with * in the 'Wanted' table.
How can I write this in SQL? I will have a lot of columns and IDs.
My dataset:
ID Mor1 Aft1 Mor2 Aft2 Mor3 Aft3 Score_M1 Score_A1 Score_M2 Score_A2 Score_M3 Score_A3
1 A A C B C B 1 1 1 1 5 6
2 C C C B C B 1 1 1 1 4 5
3 A A A A A A 1 1 1 1 4 1
Wanted :
ID Mor1 Aft1 Mor2 Aft2 Mor3 Aft3 Score_M1 Score_A1 Score_M2 Score_A2 Score_M3 Score_A3
1 A A C B C B 1 1 1 1 *10 6
2 C C C B C B 1 1 *6 1 *9 5
3 A A A A A A 1 1 *6 1 *9 1

Here is the SQL to get you started. You can add many more columns as you see fit.
The condition can be restated as "same" rather than "change":
If Mor1 = Mor2 then add +5 to Score_M2
If Mor2 = Mor3 then add +5 to Score_M3
UPDATE [StackOver].[dbo].[UpdateMultiCols]
SET
[Score_M1] = Score_M1
,[Score_M2] = Score_M2 +
Case When Mor1 = Mor2 Then 5 else 0 End
,[Score_M3] = Score_M3 +
Case When Mor2 = Mor3 Then 5 else 0 End
GO
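
One way to check the statement against the sample data is to run it in an in-memory SQLite database from Python. This is only a sketch: the table is reduced to the morning columns, the database and schema prefix and the GO batch separator are dropped because they are specific to SQL Server, and the unchanged Score_M1 assignment is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE UpdateMultiCols (
    ID INTEGER, Mor1 TEXT, Mor2 TEXT, Mor3 TEXT,
    Score_M1 INTEGER, Score_M2 INTEGER, Score_M3 INTEGER)""")

# Morning columns of the question's dataset
rows = [(1, "A", "C", "C", 1, 1, 5),
        (2, "C", "C", "C", 1, 1, 4),
        (3, "A", "A", "A", 1, 1, 4)]
conn.executemany("INSERT INTO UpdateMultiCols VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Add 5 to a morning score when the session value equals the previous morning's value
conn.execute("""UPDATE UpdateMultiCols SET
    Score_M2 = Score_M2 + CASE WHEN Mor1 = Mor2 THEN 5 ELSE 0 END,
    Score_M3 = Score_M3 + CASE WHEN Mor2 = Mor3 THEN 5 ELSE 0 END""")

result = conn.execute(
    "SELECT ID, Score_M1, Score_M2, Score_M3 FROM UpdateMultiCols ORDER BY ID"
).fetchall()
print(result)
```

The resulting Score_M2 and Score_M3 values match the starred entries in the 'Wanted' table.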

How to define incomplete sets in GAMS?

There is an incomplete graph (e.g. including 5 vertices). The adjacency matrix "a" is available. I want to define the set which includes all edges but exclude any other pair of vertices. That is, the pair of vertices belongs to the set of edges iff the element in matrix "a" is positive.
The last line of following code does not work!
sets i "Set of vertices" /1*5/ ;
alias(i,j);
set a(i,j) "Adjacency matrix" ;
Table a(i,j)
1 2 3 4 5
1 0 1 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 1 0 0 0 1
5 1 0 0 1 0;
Set edges(i,j);
edges(i,j) = a(i,j)$(a(i,j)>0);

If you want the set of edges, you must define a set and a parameter like this:
sets i "Set of vertices" /1*5/ ;
alias(i,j);
set a(i,j) "Adjacency matrix" ;
Table a(i,j)
1 2 3 4 5
1 0 1 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 1 0 0 0 1
5 1 0 0 1 0;
Set edges(i,j);
edges(i,j) $ a(i,j) =yes;

You can simplify your last line to
edges(i,j) = a(i,j);
This automatically acts as if you wrote something like $(a<>0). However, since you already defined your symbol a as a set and not as a parameter, I think you actually do not have to do anything: a is just what you are looking for. Just do
display a;
and look at the result in the lst file.
