User-based encoding/conversion of interactions in pandas

I have a dataframe that looks like this:
user_id : identifies the user
question_id : the question number
user_answer : the option the user chose for that question (A, B, C, D)
correct_answer : the correct answer for that question
correct : 1.0 means the user's answer is right
elapsed_time : the time in minutes the user took to answer that question
timestamp : Unix timestamp of each interaction
real_date : a column I added by converting the timestamp to a human-readable date and time
user_iD  question_id  user_answer  correct_answer  correct  elapsed_time  solving_id  bundle_id  timestamp      real_date
1        1            A            A               1.0      5.00          1           b1         1547794902000  Friday, January 18, 2019 7:01:42 AM
1        2            D            D               1.0      3.00          2           b2         1547795130000  Friday, January 18, 2019 7:05:30 AM
1        5            C            C               1.0      7.00          5           b5         1547795370000  Friday, January 18, 2019 7:09:30 AM
2        10           C            C               1.0      5.00          10          b10        1547806170000  Friday, January 18, 2019 10:09:30 AM
2        1            B            B               1.0      15.0          1           b1         1547802150000  Friday, January 18, 2019 9:02:30 AM
2        15           A            A               1.0      2.00          15          b15        1547803230000  Friday, January 18, 2019 9:20:30 AM
2        7            C            C               1.0      5.00          7           b7         1547802730000  Friday, January 18, 2019 9:12:10 AM
3        12           A            A               1.0      1.00          25          b12        1547771110000  Friday, January 18, 2019 12:25:10 AM
3        10           C            C               1.0      2.00          10          b10        1547770810000  Friday, January 18, 2019 12:20:10 AM
3        3            D            D               1.0      5.00          3           b3         1547770390000  Friday, January 18, 2019 12:13:10 AM
104      6            C            C               1.0      6.00          6           b6         1553040610000  Wednesday, March 20, 2019 12:10:10 AM
104      4            A            A               1.0      5.00          4           b4         1553040547000  Wednesday, March 20, 2019 12:09:07 AM
104      1            A            A               1.0      2.00          1           b1         1553040285000  Wednesday, March 20, 2019 12:04:45 AM
I need to do some encoding, but I don't know which kind of encoding I should use or how to apply it.
I need the next dataframe to look like this:
user_id  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
1        1   2   0   0   3   0   0   0   0   0    0    0    0    0    0
2        1   0   0   0   0   0   0   0   0   2    0    0    0    0    3
3        0   0   1   0   0   0   0   0   0   2    0    3    0    0    0
104      1   0   0   2   0   3   0   0   0   0    0    0    0    0    0
As you can see from the timestamp and real_date columns, each user's question_id values are not in chronological order.
The new dataframe should show which bundles each user has interacted with, numbered in time-sorted order.

First create the rank of each interaction per user using groupby and cumcount on the timestamp-sorted frame, then pivot your dataframe. Finally, reindex it to get all bundle columns:
bundle = [f'b{i}' for i in range(1, 16)]
values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)
out = (
    df.assign(value=values)
      .pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
      .reindex(bundle, axis=1, fill_value=0)
)
Output:
>>> out
bundle_id b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
user_iD
1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
>>> out.reset_index().rename_axis(columns=None)
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
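One detail worth flagging: pivot_table aggregates with the mean by default, which only works cleanly here because each (user_iD, bundle_id) pair occurs once per user. If a user could interact with the same bundle more than once, passing an explicit aggfunc keeps the rank well defined. A minimal sketch with a hypothetical repeated pair:
import pandas as pd

# hypothetical tiny frame where user 1 hits bundle b1 twice
df = pd.DataFrame({'user_iD': [1, 1, 1],
                   'bundle_id': ['b1', 'b2', 'b1'],
                   'timestamp': [100, 200, 300]})

values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)
out = (df.assign(value=values)
         .pivot_table('value', 'user_iD', 'bundle_id', aggfunc='max', fill_value=0))
print(out)   # b1 -> 3 (rank of its latest interaction), b2 -> 2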

Lacking deeper Python experience, I'm proposing the following (partially commented) code snippet. It is not optimized in any way and relies only on the elementary pandas.DataFrame API.
import pandas as pd
import io
import sys

data_string = '''
user_iD;question_id;user_answer;correct_answer;correct;elapsed_time;solving_id;bundle_id;timestamp
1;1;A;A;1.0;5.00;1;b1;1547794902000
1;2;D;D;1.0;3.00;2;b2;1547795130000
1;5;C;C;1.0;7.00;5;b5;1547795370000
2;10;C;C;1.0;5.00;10;b10;1547806170000
2;1;B;B;1.0;15.0;1;b1;1547802150000
2;15;A;A;1.0;2.00;15;b15;1547803230000
2;7;C;C;1.0;5.00;7;b7;1547802730000
3;12;A;A;1.0;1.00;25;b12;1547771110000
3;10;C;C;1.0;2.00;10;b10;1547770810000
3;3;D;D;1.0;5.00;3;b3;1547770390000
104;6;C;C;1.0;6.00;6;b6;1553040610000
104;4;A;A;1.0;5.00;4;b4;1553040547000
104;1;A;A;1.0;2.00;1;b1;1553040285000
'''
df = pd.read_csv(io.StringIO(data_string), sep=";", encoding='utf-8')

# get only the necessary columns, ordered by timestamp
df_aux = df[['user_iD', 'bundle_id', 'correct', 'timestamp']].sort_values(by=['timestamp'])

# hard-coded new headers (it is also possible to build them from the real 'bundle_id's)
df_new_headers = ['b{}'.format(x + 1) for x in range(15)]
df_new_headers.insert(0, 'user_iD')

dict_answered = {}
# create a new dataframe (I'm sure there is a more Pythonic solution)
df_new_data = []
user_ids = sorted(set(x for label, x in df_aux.user_iD.items()))
for user_id in user_ids:
    dict_answered[user_id] = 0
    if len(sys.argv) > 1 and sys.argv[1]:
        # a supplied argument is used as the fill value for better result readability
        df_new_values = [sys.argv[1].strip('"').strip("'")
                         for x in range(len(df_new_headers) - 1)]
    else:
        # zeroes (original assignment)
        df_new_values = [0 for x in range(len(df_new_headers) - 1)]
    df_new_values.insert(0, user_id)
    df_new_data.append(df_new_values)
df_new = pd.DataFrame(data=df_new_data, columns=df_new_headers)

# fill the new dataframe using values from the original one
for aux in df_aux.itertuples(index=True, name=None):
    if aux[3] == 1.0:
        # add 1 to the number of already answered questions for the current user
        dict_answered[aux[1]] += 1
        df_new.loc[df_new["user_iD"] == aux[1], aux[2]] = dict_answered[aux[1]]

print(df_new)
Output examples
Example: .\SO\70751715.py
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
Example: .\SO\70751715.py .
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 . . 3 . . . . . . . . . .
1 2 1 . . . . . 2 . . 4 . . . . 3
2 3 . . 1 . . . . . . 2 . 3 . . .
3 104 1 . . 2 . 3 . . . . . . . . .
Example: .\SO\70751715.py ''
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 3
1 2 1 2 4 3
2 3 1 2 3
3 104 1 2 3

I think you are looking for LabelEncoder. First import the library:
#Common Model Helpers
from sklearn.preprocessing import LabelEncoder
Then you should be able to encode the object columns as integer codes:
# encode categorical data as integer codes
label = LabelEncoder()
dataset['question_id'] = label.fit_transform(dataset['question_id'])
dataset['user_answer'] = label.fit_transform(dataset['user_answer'])
dataset['correct_answer'] = label.fit_transform(dataset['correct_answer'])
Or simply apply it to every column:
dataset.apply(LabelEncoder().fit_transform)
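As a small illustration, here is a sketch with a hypothetical toy dataset (the column names follow the question); each fit_transform call maps a column's distinct values to integer codes:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy frame standing in for the real dataset
dataset = pd.DataFrame({'question_id': [1, 2, 5],
                        'user_answer': ['A', 'D', 'C'],
                        'correct_answer': ['A', 'D', 'C']})

label = LabelEncoder()
for col in ['question_id', 'user_answer', 'correct_answer']:
    dataset[col] = label.fit_transform(dataset[col])
print(dataset)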

Related

how to sum values in dataframes based on index match

I have about 16 dataframes representing weekly users' clickstream data; the samples below cover weeks 0-3. I want to make a new dataframe in this way: for example, if the new df is w2, then w2 = w0 + w1 + w2; for w3, w3 = w0 + w1 + w2 + w3. As you can see, the datasets do not have identical id_users: a user may not show up in a certain week. All dataframes have the same columns, but the indexes are not exactly the same. So how do I add them where the indexes match?
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
43284 1 8 0 8 5 0 0 0 2 3 1
45664 0 16 0 4 0 0 0 0 5 16 2
52014 0 0 0 5 4 0 0 0 0 2 2
53488 1 37 0 19 0 0 3 0 3 23 6
60135 0 124 0 87 3 0 24 0 8 19 14
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
40419 0 8 0 3 4 0 6 0 1 6 0
43284 1 4 0 14 26 2 0 0 2 4 2
45664 0 9 0 15 11 0 0 0 1 6 14
52014 0 0 0 8 9 0 8 0 2 2 1
53488 0 2 0 4 0 0 4 0 0 0 0
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
40419 0 8 0 3 4 0 6 0 1 6 0
43284 1 4 0 14 26 2 0 0 2 4 2
45664 0 9 0 15 11 0 0 0 1 6 14
52014 0 0 0 8 9 0 8 0 2 2 1
53488 0 2 0 4 0 0 4 0 0 0 0
Use concat, then groupby and sum:
out = pd.concat([df1,df2]).groupby('id_user',as_index=False).sum()
Out[147]:
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
0 40419 0 8 0 3 4 0 6 0 1 6 0
1 43284 2 12 0 22 31 2 0 0 4 7 3
2 45664 0 25 0 19 11 0 0 0 6 22 16
3 52014 0 0 0 13 13 0 8 0 2 4 3
4 53488 1 39 0 23 0 0 7 0 3 23 6
5 60135 0 124 0 87 3 0 24 0 8 19 14
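To get the cumulative frames the question asks for (w2 = w0 + w1 + w2, and so on), the same concat/groupby pattern can be applied to a growing slice of the weekly dataframes. A minimal sketch, with hypothetical tiny weekly frames standing in for the 16 real ones:
import pandas as pd

# hypothetical tiny weekly frames; weekly_dfs[0] is week 0, etc.
w0 = pd.DataFrame({'id_user': [43284, 45664], 'c1': [1, 0], 'c2': [8, 16]})
w1 = pd.DataFrame({'id_user': [40419, 43284], 'c1': [0, 1], 'c2': [8, 4]})
w2 = pd.DataFrame({'id_user': [43284, 52014], 'c1': [1, 0], 'c2': [4, 0]})
weekly_dfs = [w0, w1, w2]

# cumulative[k] holds w0 + ... + wk, summed where id_user matches
cumulative = [
    pd.concat(weekly_dfs[:k + 1]).groupby('id_user', as_index=False).sum()
    for k in range(len(weekly_dfs))
]
print(cumulative[2])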

Add positive (and negative) elements in each row?

For each row of my data I want to sum the positive values together and the negative values together:
c1 c2 c3 c4 c5
1 2 3 -1 -2
3 2 -1 2 -9
3 -5 1 2 4
Output
c1 c2 c3 c4 c5 sum_positive sum_negative
1 2 3 -1 -2 6 -3
3 2 -1 2 -9 7 -10
3 -5 1 2 4 10 -5
I was trying to use a for loop like the one below (G is my df) to collect the positive and negative elements in two lists and then add them, but I thought there might be a better way to do it.
g = []
for i in range(G.shape[0]):
    for j in range(G.shape[1]):
        if G.iloc[i, j] >= 0:
            g.append(G.iloc[i, j])
    g.append('skill_next')
Loops or .apply will be pretty slow, so your best bet is to just .clip the values and take the sum directly:
In [58]: df['sum_positive'] = df.clip(lower=0).sum(axis=1)
In [59]: df['sum_negative'] = df.clip(upper=0).sum(axis=1)
In [60]: df
Out[60]:
c1 c2 c3 c4 c5 sum_positive sum_negative
0 1 2 3 -1 -2 6 -3
1 3 2 -1 2 -9 7 -10
2 3 -5 1 2 4 10 -5
Or you can use where:
df['sum_negative'] = df.where(df<0).sum(1)
df['sum_positive'] = df.where(df>0).sum(1)
RESULT:
c1 c2 c3 c4 c5 sum_negative sum_positive
0 1 2 3 -1 -2 -3.0 6.0
1 3 2 -1 2 -9 -10.0 7.0
2 3 -5 1 2 4 -5.0 10.0
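One small caveat with the where variant: because the masked-out cells become NaN before summing, the results come back as floats. Passing 0 as the replacement value keeps the integer dtype; a minimal sketch on the same toy data:
import pandas as pd

df = pd.DataFrame({'c1': [1, 3, 3], 'c2': [2, 2, -5], 'c3': [3, -1, 1],
                   'c4': [-1, 2, 2], 'c5': [-2, -9, 4]})

# replace non-matching cells with 0 instead of NaN, so the sums stay integers
pos = df.where(df > 0, 0).sum(axis=1)
neg = df.where(df < 0, 0).sum(axis=1)
df['sum_positive'] = pos
df['sum_negative'] = neg
print(df)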

How do I one-hot encode pandas dataframe for whole columns, not for each column?

I want to one-hot encode a pandas dataframe across whole columns, not per column.
If there is a dataframe like below:
df = pd.DataFrame({'A': ['A1', 'A1', 'A1', 'A1', 'A4', 'A5'],
                   'B': ['A2', 'A2', 'A2', 'A3', np.nan, 'A6'],
                   'C': ['A4', 'A3', 'A3', 'A5', np.nan, np.nan]})
df =
A B C
0 A1 A2 A4
1 A1 A2 A3
2 A1 A2 A3
3 A1 A3 A5
4 A4 NaN NaN
5 A5 A6 NaN
I want to encode it like below:
df =
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1
However, if I write code like the below, the result looks like this:
df = pd.get_dummies(df, sparse=True)
df =
A_A1 A_A4 A_A5 B_A2 B_A3 B_A6 C_A3 C_A4 C_A5
0 1 0 0 1 0 0 0 1 0
1 1 0 0 1 0 0 1 0 0
2 1 0 0 1 0 0 1 0 0
3 1 0 0 0 1 0 0 0 1
4 0 1 0 0 0 0 0 0 0
5 0 0 1 0 0 1 0 0 0
How do I one-hot encode for whole columns?
If I use prefix='', it still creates columns such as _A1, _A4, _A5, _A2, _A3, _A6, _A3, _A4, _A5.
(I hope to write this with pandas or numpy rather than a naive for loop, because my data are huge, about 16,000,000 rows, so an iterative for loop would take a long time.)
In your case
df.stack().str.get_dummies().sum(level=0)
Out[116]:
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1
Or fix your pd.get_dummies with prefix
pd.get_dummies(df, prefix='',prefix_sep='').sum(level=0,axis=1)
Out[118]:
A1 A4 A5 A2 A3 A6
0 1 1 0 1 0 0
1 1 0 0 1 1 0
2 1 0 0 1 1 0
3 1 0 1 0 1 0
4 0 1 0 0 0 0
5 0 0 1 0 0 1
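On recent pandas versions the level= argument to sum has been removed, so both one-liners above need an explicit groupby instead. A sketch of the equivalents, rebuilding the same df for completeness:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A1', 'A1', 'A1', 'A4', 'A5'],
                   'B': ['A2', 'A2', 'A2', 'A3', np.nan, 'A6'],
                   'C': ['A4', 'A3', 'A3', 'A5', np.nan, np.nan]})

# stack() drops NaN, str.get_dummies() encodes values, groupby sums per original row
out = df.stack().str.get_dummies().groupby(level=0).sum()

# get_dummies variant: sum duplicate column labels per row via a double transpose
out2 = pd.get_dummies(df, prefix='', prefix_sep='').T.groupby(level=0).sum().T
print(out)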
Quicker
# Pandas 0.24 or greater use `.to_numpy()` instead of `.values`
v = df.values
n, m = v.shape
j, cols = pd.factorize(v.ravel()) # -1 when `np.nan`
# Used to grab only non-null values
mask = j >= 0
i = np.arange(n).repeat(m)[mask]
j = j[mask]
out = np.zeros((n, len(cols)), dtype=int)
# Useful when not one-hot. Otherwise use `out[i, j] = 1`
np.add.at(out, (i, j), 1)
pd.DataFrame(out, df.index, cols)
A1 A2 A4 A3 A5 A6
0 1 1 1 0 0 0
1 1 1 0 1 0 0
2 1 1 0 1 0 0
3 1 0 0 1 1 0
4 0 0 1 0 0 0
5 0 0 0 0 1 1
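Note that the column order from this factorize-based approach follows first appearance in the data rather than the sorted order of the other answers; if a sorted layout is wanted, a final sort_index(axis=1) takes care of it. A compact restatement of the same idea (a sketch, using to_numpy() as available on newer pandas):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A1', 'A1', 'A1', 'A4', 'A5'],
                   'B': ['A2', 'A2', 'A2', 'A3', np.nan, 'A6'],
                   'C': ['A4', 'A3', 'A3', 'A5', np.nan, np.nan]})

v = df.to_numpy()
n, m = v.shape
j, cols = pd.factorize(v.ravel())        # j is -1 where the value was NaN
mask = j >= 0
i = np.arange(n).repeat(m)[mask]
out = np.zeros((n, len(cols)), dtype=int)
out[i, j[mask]] = 1
result = pd.DataFrame(out, df.index, cols).sort_index(axis=1)
print(result)                            # columns now A1 A2 A3 A4 A5 A6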
Not quicker
This is intended to show that you can join the row values then use str.get_dummies
df.stack().groupby(level=0).apply('|'.join).str.get_dummies()
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1
sklearn
from sklearn.preprocessing import MultiLabelBinarizer as MLB
mlb = MLB()
out = mlb.fit_transform([[*filter(pd.notna, x)] for x in zip(*map(df.get, df))])
pd.DataFrame(out, df.index, mlb.classes_)
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1

Filter rows in pandas based on threshold

I have the following data frame.
A1 A2 A3 B1 B2 B3 C1 C2 C3
0 0 0 1 1 1 1 0 1 1
1 0 0 0 0 0 0 0 0 0
2 1 1 1 0 1 1 1 1 1
I am looking to filter it based on groups of columns and the occurrence of non-zero values. I wrote the following to achieve that:
import pandas as pd
df = pd.read_csv("TEST_TABLE.txt", sep='\t')
print(df)
group1 = ['A1','A2','A3']
group2 = ['B1','B2','B3']
group3 = ['C1','C2','C3']
df2 = df[(df[group1] !=0).any(axis=1) & (df[group2] !=0).any(axis=1) & (df[group3] !=0).any(axis=1)]
print(df2)
The output was perfect:
A1 A2 A3 B1 B2 B3 C1 C2 C3
0 0 0 1 1 1 1 0 1 1
2 1 1 1 0 1 1 1 1 1
Now, how do I modify the code so that I can impose a threshold value for "any", i.e. retain only rows where each group has at least 2 non-zero values? The final output would then be:
A1 A2 A3 B1 B2 B3 C1 C2 C3
2 1 1 1 0 1 1 1 1 1
Thanks in advance.
You can create boolean masks in a loop, using sum to count the non-zero values, comparing with ge (>=), and finally reduce the masks:
import numpy as np

groups = [group1, group2, group3]
df2 = df[np.logical_and.reduce([(df[g] != 0).sum(axis=1).ge(2) for g in groups])]
print(df2)
A1 A2 A3 B1 B2 B3 C1 C2 C3
2 1 1 1 0 1 1 1 1 1
Detail:
print([(df[g]!=0).sum(axis=1).ge(2) for g in groups])
[0 False
1 False
2 True
dtype: bool, 0 True
1 False
2 True
dtype: bool, 0 True
1 False
2 True
dtype: bool]
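An equivalent that stays inside pandas (no numpy import needed): build one mask per group, concat them, and require all of them to be True. A small sketch reconstructing the example frame:
import pandas as pd

df = pd.DataFrame({'A1': [0, 0, 1], 'A2': [0, 0, 1], 'A3': [1, 0, 1],
                   'B1': [1, 0, 0], 'B2': [1, 0, 1], 'B3': [1, 0, 1],
                   'C1': [0, 0, 1], 'C2': [1, 0, 1], 'C3': [1, 0, 1]})
groups = [['A1', 'A2', 'A3'], ['B1', 'B2', 'B3'], ['C1', 'C2', 'C3']]

# one boolean Series per group: at least 2 non-zero values among that group's columns
masks = [(df[g] != 0).sum(axis=1).ge(2) for g in groups]
df2 = df[pd.concat(masks, axis=1).all(axis=1)]
print(df2)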

adding one to all the values in a dataframe

I have a dataframe like the one below, and I would like to add one to all of the values in each row. I am new to this forum and to Python, so I can't conceptualise how to do this. I intend to use Bayes probability, and as the data stand the posterior probability becomes 0 when I multiply the values. (I am also new to probability, but others have applied the same method.) I am using pandas for this. Thanks for your help in advance.
Disease Gene1 Gene2 Gene3 Gene4
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
With this being your dataframe:
import pandas as pd

df = pd.DataFrame({
    "Disease": [f"D{i}" for i in range(1, 9)],
    "Gene1": [0, 0, 0, 24, 0, 0, 0, 4],
    "Gene2": [0, 0, 17, 0, 0, 32, 0, 0],
    "Gene3": [25, 0, 0, 0, 0, 0, 0, 0],
    "Gene4": [0, 0, 16, 0, 0, 11, 0, 0]})
Disease Gene1 Gene2 Gene3 Gene4
0 D1 0 0 25 0
1 D2 0 0 0 0
2 D3 0 17 0 16
3 D4 24 0 0 0
4 D5 0 0 0 0
5 D6 0 32 0 11
6 D7 0 0 0 0
7 D8 4 0 0 0
The easiest way to do this would be df += 1.
However, since you have a column that holds strings (the Disease column), this will not work directly.
But we can conveniently set the Disease column to be the index, like this:
df.set_index('Disease', inplace=True)
Now your dataframe looks like this:
Gene1 Gene2 Gene3 Gene4
Disease
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
And if we do df += 1 now, we get:
Gene1 Gene2 Gene3 Gene4
Disease
D1 1 1 26 1
D2 1 1 1 1
D3 1 18 1 17
D4 25 1 1 1
D5 1 1 1 1
D6 1 33 1 12
D7 1 1 1 1
D8 5 1 1 1
because the plus operation only acts on the data columns, not on the index.
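If the Disease column is needed back as an ordinary column afterwards, reset_index undoes the set_index step; a tiny sketch of the round trip:
import pandas as pd

df = pd.DataFrame({'Disease': ['D1', 'D2'], 'Gene1': [0, 24], 'Gene2': [25, 0]})
df = df.set_index('Disease')
df += 1                    # only the numeric gene columns are incremented
df = df.reset_index()      # 'Disease' becomes a regular column again
print(df)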
You can also do this on column basis, like this:
df["Gene1"] = df["Gene1"] + 1
You can also filter the df's columns by whether the underlying dtype is not 'object':
In [110]:
numeric_cols = [col for col in df if df[col].dtype.kind != 'O']
numeric_cols
Out[110]:
['Gene1', 'Gene2', 'Gene3', 'Gene4']
In [111]:
df[numeric_cols] += 1
df
Out[111]:
Disease Gene1 Gene2 Gene3 Gene4
0 D1 1 1 26 1
1 D2 1 1 1 1
2 D3 1 18 1 17
3 D4 25 1 1 1
4 D5 1 1 1 1
5 D6 1 33 1 12
6 D7 1 1 1 1
7 D8 5 1 1 1
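A shorter way to pick out the numeric columns is select_dtypes; a sketch doing the same increment on a rebuilt copy of the frame:
import pandas as pd

df = pd.DataFrame({
    "Disease": [f"D{i}" for i in range(1, 9)],
    "Gene1": [0, 0, 0, 24, 0, 0, 0, 4],
    "Gene2": [0, 0, 17, 0, 0, 32, 0, 0],
    "Gene3": [25, 0, 0, 0, 0, 0, 0, 0],
    "Gene4": [0, 0, 16, 0, 0, 11, 0, 0]})

# add 1 to every numeric column, leaving the string 'Disease' column untouched
num_cols = df.select_dtypes(include='number').columns
df[num_cols] += 1
print(df)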
EDIT
It looks like your df may have strings instead of numeric types; you can convert the dtypes to numeric using convert_objects:
df = df.convert_objects(convert_numeric=True)
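Note that convert_objects has since been removed from pandas; on current versions the same conversion can be done with pd.to_numeric, roughly like this (a sketch, applied only to the gene columns so the string Disease column is left alone):
import pandas as pd

# hypothetical frame where the gene counts arrived as strings
df = pd.DataFrame({'Disease': ['D1', 'D2'],
                   'Gene1': ['0', '24'],
                   'Gene2': ['25', '0']})

gene_cols = ['Gene1', 'Gene2']
df[gene_cols] = df[gene_cols].apply(pd.to_numeric, errors='coerce')
df[gene_cols] += 1
print(df.dtypes)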