I have a dataframe like the one below, and I would like to add 1 to every value in each row. I am new to this forum and to Python, so I can't quite conceptualise how to do this. I intend to use Bayes' theorem, and with the zeros present the posterior probability comes out as 0 when I multiply the values, which is why I need to add 1 to each value. PS. I am also new to probability, but others have applied the same method. I am using pandas for this. Thanks in advance for your help.
Disease Gene1 Gene2 Gene3 Gene4
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
With this being your dataframe:
import pandas as pd

df = pd.DataFrame({
    "Disease": [f"D{i}" for i in range(1, 9)],
    "Gene1": [0, 0, 0, 24, 0, 0, 0, 4],
    "Gene2": [0, 0, 17, 0, 0, 32, 0, 0],
    "Gene3": [25, 0, 0, 0, 0, 0, 0, 0],
    "Gene4": [0, 0, 16, 0, 0, 11, 0, 0]})
Disease Gene1 Gene2 Gene3 Gene4
0 D1 0 0 25 0
1 D2 0 0 0 0
2 D3 0 17 0 16
3 D4 24 0 0 0
4 D5 0 0 0 0
5 D6 0 32 0 11
6 D7 0 0 0 0
7 D8 4 0 0 0
The easiest way to do this is:
df += 1
However, this will not work here, because you have a column of strings (the Disease column).
But we can conveniently set the Disease column to be the index, like this:
df.set_index('Disease', inplace=True)
Now your dataframe looks like this:
Gene1 Gene2 Gene3 Gene4
Disease
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
And if we do df += 1 now, we get:
Gene1 Gene2 Gene3 Gene4
Disease
D1 1 1 26 1
D2 1 1 1 1
D3 1 18 1 17
D4 25 1 1 1
D5 1 1 1 1
D6 1 33 1 12
D7 1 1 1 1
D8 5 1 1 1
because the plus operation only acts on the data columns, not on the index.
You can also do this on a per-column basis, like this:
df["Gene1"] = df["Gene1"] + 1
You can also filter the dataframe for columns whose underlying dtype is not 'object':
In [110]:
numeric_cols = [col for col in df if df[col].dtype.kind != 'O']
numeric_cols
Out[110]:
['Gene1', 'Gene2', 'Gene3', 'Gene4']
In [111]:
df[numeric_cols] += 1
df
Out[111]:
Disease Gene1 Gene2 Gene3 Gene4
0 D1 1 1 26 1
1 D2 1 1 1 1
2 D3 1 18 1 17
3 D4 25 1 1 1
4 D5 1 1 1 1
5 D6 1 33 1 12
6 D7 1 1 1 1
7 D8 5 1 1 1
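A more concise spelling of the same idea, assuming a pandas version that provides select_dtypes (a sketch of mine, not part of the original answer):

# pick out the numeric columns and add 1 only to those
num_cols = df.select_dtypes(include='number').columns
df[num_cols] += 1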
EDIT
It looks like your df possibly has strings instead of numeric types. In older pandas versions you could convert the dtypes to numeric using convert_objects:
df = df.convert_objects(convert_numeric=True)
Note that convert_objects has since been removed from pandas; on current versions, with Disease already set as the index, the equivalent is:
df = df.apply(pd.to_numeric)
I have created a dataframe called df as follows:
import pandas as pd
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
print(df)
The dataframe looks like this:

   feature1  feature2  feature3
0         1        33       100
1        22         2         2
2        45         2       359
3        78        65        87
4        78        65         2

I want to create two new columns called Freq_1 and Freq_2 that count, for each record, how many times the numbers 1 and 2 appear, respectively. So, I'd like the resulting dataframe to look like this:

   feature1  feature2  feature3  Freq_1  Freq_2
0         1        33       100       1       0
1        22         2         2       0       2
2        45         2       359       0       1
3        78        65        87       0       0
4        78        65         2       0       1
So, let's take a look at the column called Freq_1:
for the first record, it's equal to 1 because the number 1 appears only once across the whole first record;
for the other records, it's equal to 0 because the number 1 never appears.
Let's take a look now at the column called Freq_2:
for the first record, Freq_2 is equal to 0 because number 2 doesn't appear;
for second record, Freq_2 is equal to 2 because the number 2 appears twice;
and so on ...
How do I create the columns Freq_1 and Freq_2 in pandas?
Try this:
freq = {
    i: df.eq(i).sum(axis=1) for i in range(10)
}
pd.concat([df, pd.DataFrame(freq).add_prefix("Freq_")], axis=1)
Result:
   feature1  feature2  feature3  Freq_0  Freq_1  Freq_2  Freq_3  Freq_4  Freq_5  Freq_6  Freq_7  Freq_8  Freq_9
0         1        33       100       0       1       0       0       0       0       0       0       0       0
1        22         2         2       0       0       2       0       0       0       0       0       0       0
2        45         2       359       0       0       1       0       0       0       0       0       0       0
3        78        65        87       0       0       0       0       0       0       0       0       0       0
4        78        65         2       0       0       1       0       0       0       0       0       0       0
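If you only need Freq_1 and Freq_2 as the question asks, the same comprehension can be restricted (a small variation of mine on the above):

freq = {i: df.eq(i).sum(axis=1) for i in (1, 2)}
pd.concat([df, pd.DataFrame(freq).add_prefix("Freq_")], axis=1)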
String pattern matching can be performed once the columns are cast to string columns.
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
df = df.stack().astype(str).unstack()
Now we can iterate over each pattern we are looking for:
usefull_columns = df.columns
for pattern in ['1', '2']:
    df[f'freq_{pattern}'] = df[usefull_columns].stack().str.count(pattern).unstack().max(axis=1)
Printing the output (note that this counts digit substrings per cell and then takes the per-row maximum rather than the sum, which happens to match the expected counts here):
feature1 feature2 feature3 freq_1 freq_2
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0
We can do
s = df.where(df.isin([1,2])).stack()
out = df.join(pd.crosstab(s.index.get_level_values(0),s).add_prefix('Freq_')).fillna(0)
Out[299]:
feature1 feature2 feature3 Freq_1.0 Freq_2.0
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0
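Because of the intermediate NaNs and the fillna, the Freq columns come back as floats (and are labelled Freq_1.0 and Freq_2.0); if you prefer integer counts, a follow-up cast works (my note, not the answerer's):

out[['Freq_1.0', 'Freq_2.0']] = out[['Freq_1.0', 'Freq_2.0']].astype(int)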
I have a dataframe which looks like this:
user_id: represents the user
question_id: represents the question number
user_answer: which option (A, B, C, D) the user chose for that question
correct_answer: the correct answer for that question
correct: 1.0 means the user's answer is right
elapsed_time: the time in minutes the user took to answer that question
timestamp: UNIX timestamp of each interaction
real_date: I added this column by converting the timestamp to a human-readable date & time
user_iD  question_id  user_answer  correct_answer  correct  elapsed_time  solving_id  bundle_id  timestamp      real_date
1        1            A            A               1.0      5.00          1           b1         1547794902000  Friday, January 18, 2019 7:01:42 AM
1        2            D            D               1.0      3.00          2           b2         1547795130000  Friday, January 18, 2019 7:05:30 AM
1        5            C            C               1.0      7.00          5           b5         1547795370000  Friday, January 18, 2019 7:09:30 AM
2        10           C            C               1.0      5.00          10          b10        1547806170000  Friday, January 18, 2019 10:09:30 AM
2        1            B            B               1.0      15.0          1           b1         1547802150000  Friday, January 18, 2019 9:02:30 AM
2        15           A            A               1.0      2.00          15          b15        1547803230000  Friday, January 18, 2019 9:20:30 AM
2        7            C            C               1.0      5.00          7           b7         1547802730000  Friday, January 18, 2019 9:12:10 AM
3        12           A            A               1.0      1.00          25          b12        1547771110000  Friday, January 18, 2019 12:25:10 AM
3        10           C            C               1.0      2.00          10          b10        1547770810000  Friday, January 18, 2019 12:20:10 AM
3        3            D            D               1.0      5.00          3           b3         1547770390000  Friday, January 18, 2019 12:13:10 AM
104      6            C            C               1.0      6.00          6           b6         1553040610000  Wednesday, March 20, 2019 12:10:10 AM
104      4            A            A               1.0      5.00          4           b4         1553040547000  Wednesday, March 20, 2019 12:09:07 AM
104      1            A            A               1.0      2.00          1           b1         1553040285000  Wednesday, March 20, 2019 12:04:45 AM
I need to do some encoding, but I don't know which encoding to use, or how. I need the next dataframe to look like this:
user_id  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11  b12  b13  b14  b15
1         1   2   0   0   3   0   0   0   0    0    0    0    0    0    0
2         1   0   0   0   0   0   0   0   0    2    0    0    0    0    3
3         0   0   1   0   0   0   0   0   0    2    0    3    0    0    0
104       1   0   0   2   0   3   0   0   0    0    0    0    0    0    0
As you can see from the timestamp and real_date columns, each user's question_ids are not sorted by time. The new dataframe should record which bundles each user interacted with, numbered in time-sorted order.
First create the per-user interaction order using groupby and cumcount, then pivot your dataframe. Finally, reindex it to get all columns:
bundle = [f'b{i}' for i in range(1, 16)]
values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)
out = (
    df.assign(value=values)
      .pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
      .reindex(bundle, axis=1, fill_value=0)
)
Output:
>>> out
bundle_id b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
user_iD
1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
>>> out.reset_index().rename_axis(columns=None)
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
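A small caveat from me: pivot_table aggregates duplicate (user, bundle) pairs with the mean by default, so if a user could interact with the same bundle more than once you may want an explicit aggfunc, for example keeping the first occurrence:

out = (
    df.assign(value=values)
      .pivot_table('value', 'user_iD', 'bundle_id', aggfunc='first', fill_value=0)
      .reindex(bundle, axis=1, fill_value=0)
)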
Lacking more Pythonic experience, I'm proposing the following (partially commented) code snippet, which is not optimized in any way and is based merely on the elementary pandas.DataFrame API reference.
import pandas as pd
import io
import sys
data_string = '''
user_iD;question_id;user_answer;correct_answer;correct;elapsed_time;solving_id;bundle_id;timestamp
1;1;A;A;1.0;5.00;1;b1;1547794902000
1;2;D;D;1.0;3.00;2;b2;1547795130000
1;5;C;C;1.0;7.00;5;b5;1547795370000
2;10;C;C;1.0;5.00;10;b10;1547806170000
2;1;B;B;1.0;15.0;1;b1;1547802150000
2;15;A;A;1.0;2.00;15;b15;1547803230000
2;7;C;C;1.0;5.00;7;b7;1547802730000
3;12;A;A;1.0;1.00;25;b12;1547771110000
3;10;C;C;1.0;2.00;10;b10;1547770810000
3;3;D;D;1.0;5.00;3;b3;1547770390000
104;6;C;C;1.0;6.00;6;b6;1553040610000
104;4;A;A;1.0;5.00;4;b4;1553040547000
104;1;A;A;1.0;2.00;1;b1;1553040285000
'''
df = pd.read_csv( io.StringIO(data_string), sep=";", encoding='utf-8')
# get only necessary columns ordered by timestamp
df_aux = df[['user_iD','bundle_id','correct', 'timestamp']].sort_values(by=['timestamp'])
# hard coded new headers (possible to build from real 'bundle_id's)
df_new_headers = ['b{}'.format(x+1) for x in range(15)]
df_new_headers.insert(0, 'user_iD')
dict_answered = {}
# create a new dataframe (I'm sure that there is a more Pythonish solution)
df_new_data = []
user_ids = sorted(set( [x for label, x in df_aux.user_iD.items()]))
for user_id in user_ids:
    dict_answered[user_id] = 0
    if len(sys.argv) > 1 and sys.argv[1]:
        # supplied arg in the next line for better result readability
        df_new_values = [sys.argv[1].strip('"').strip("'")
                         for x in range(len(df_new_headers) - 1)]
    else:
        # zeroes (original assignment)
        df_new_values = [0 for x in range(len(df_new_headers) - 1)]
    df_new_values.insert(0, user_id)
    df_new_data.append(df_new_values)
df_new = pd.DataFrame(data=df_new_data, columns=df_new_headers)
# fill the new dataframe using values from the original one
for aux in df_aux.itertuples(index=True, name=None):
    if aux[3] == 1.0:
        # add 1 to the number of already answered questions for the current user
        dict_answered[aux[1]] += 1
        df_new.loc[df_new["user_iD"] == aux[1], aux[2]] = dict_answered[aux[1]]
print(df_new)
Output examples
Example: .\SO\70751715.py
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
Example: .\SO\70751715.py .
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 . . 3 . . . . . . . . . .
1 2 1 . . . . . 2 . . 4 . . . . 3
2 3 . . 1 . . . . . . 2 . 3 . . .
3 104 1 . . 2 . 3 . . . . . . . . .
Example: .\SO\70751715.py ''
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 3
1 2 1 2 4 3
2 3 1 2 3
3 104 1 2 3
I think you are looking for LabelEncoder. First import the library:
#Common Model Helpers
from sklearn.preprocessing import LabelEncoder
Then you should be able to convert objects to category:
#CONVERT: convert objects to category
#code categorical data
label = LabelEncoder()
dataset['question_id'] = label.fit_transform(dataset['question_id'])
dataset['user_answer'] = label.fit_transform(dataset['user_answer'])
dataset['correct_answer'] = label.fit_transform(dataset['correct_answer'])
Or just apply it to the whole dataframe:
dataset.apply(LabelEncoder().fit_transform)
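One caveat from me: dataset.apply(LabelEncoder().fit_transform) fits a fresh encoder per column, so the integer codes are not comparable across columns, and LabelEncoder is really intended for target labels. For feature columns, scikit-learn's OrdinalEncoder does the same job in one call, roughly like this:

from sklearn.preprocessing import OrdinalEncoder

cols = ['question_id', 'user_answer', 'correct_answer']
dataset[cols] = OrdinalEncoder().fit_transform(dataset[cols])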
I have about 16 dataframes representing weekly users' clickstream data; the tables below show samples of these weekly dataframes. I want to make a new dataframe in this way: for example, if the new df is w=2, then w2 = w0 + w1 + w2; for w3, w3 = w0 + w1 + w2 + w3. As you can see, the dataframes do not have identical id_users, because a user who is not active in a given week does not appear in it. All dataframes have the same columns, but the indexes are not exactly the same. So how do I add them up where the indexes match?
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
43284 1 8 0 8 5 0 0 0 2 3 1
45664 0 16 0 4 0 0 0 0 5 16 2
52014 0 0 0 5 4 0 0 0 0 2 2
53488 1 37 0 19 0 0 3 0 3 23 6
60135 0 124 0 87 3 0 24 0 8 19 14
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
40419 0 8 0 3 4 0 6 0 1 6 0
43284 1 4 0 14 26 2 0 0 2 4 2
45664 0 9 0 15 11 0 0 0 1 6 14
52014 0 0 0 8 9 0 8 0 2 2 1
53488 0 2 0 4 0 0 4 0 0 0 0
concat then groupby sum
out = pd.concat([df1,df2]).groupby('id_user',as_index=False).sum()
Out[147]:
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
0 40419 0 8 0 3 4 0 6 0 1 6 0
1 43284 2 12 0 22 31 2 0 0 4 7 3
2 45664 0 25 0 19 11 0 0 0 6 22 16
3 52014 0 0 0 13 13 0 8 0 2 4 3
4 53488 1 39 0 23 0 0 7 0 3 23 6
5 60135 0 124 0 87 3 0 24 0 8 19 14
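To generalize this to the running weekly totals described in the question (a sketch of mine, assuming the 16 frames are collected, in week order, in a hypothetical list called weeks):

weeks = [w0, w1, w2]  # ... through w15, in week order
w = 2                 # cumulative frame up to and including week 2
out = (pd.concat(weeks[:w + 1])
         .groupby('id_user', as_index=False)
         .sum())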
I want to one-hot encode pandas dataframe for whole columns, not for each column.
If there is a dataframe like below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A1', 'A1', 'A1', 'A4', 'A5'], 'B': ['A2', 'A2', 'A2', 'A3', np.nan, 'A6'], 'C': ['A4', 'A3', 'A3', 'A5', np.nan, np.nan]})
df =
A B C
0 A1 A2 A4
1 A1 A2 A3
2 A1 A2 A3
3 A1 A3 A5
4 A4 NaN NaN
5 A5 A6 NaN
I want to encode it like below:
df =
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1
However, if I write code like the below, the result is this:
df = pd.get_dummies(df, sparse=True)
df =
A_A1 A_A4 A_A5 B_A2 B_A3 B_A6 C_A3 C_A4 C_A5
0 1 0 0 1 0 0 0 1 0
1 1 0 0 1 0 0 1 0 0
2 1 0 0 1 0 0 1 0 0
3 1 0 0 0 1 0 0 0 1
4 0 1 0 0 0 0 0 0 0
5 0 0 1 0 0 1 0 0 0
How do I one-hot encode for whole columns?
If I use prefix = '', it also makes columns such as _A1 _A4 _A5 _A2 _A3 _A6 _A3 _A4 _A5.
(I hope to write this using the pandas or numpy library rather than naive for-loop code, because my data are huge, 16,000,000 rows, so an iterative for-loop would require a long calculation time.)
In your case
df.stack().str.get_dummies().sum(level=0)
Out[116]:
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1
Or fix your pd.get_dummies with prefix
pd.get_dummies(df, prefix='',prefix_sep='').sum(level=0,axis=1)
Out[118]:
A1 A4 A5 A2 A3 A6
0 1 1 0 1 0 0
1 1 0 0 1 1 0
2 1 0 0 1 1 0
3 1 0 1 0 1 0
4 0 1 0 0 0 0
5 0 0 1 0 0 1
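A compatibility note from me: Series/DataFrame .sum(level=...) was deprecated and later removed from pandas, so on recent versions (2.0 and up) spell both variants with an explicit groupby:

# stack variant: group on the original row index
df.stack().str.get_dummies().groupby(level=0).sum()
# prefix variant: group the duplicate column labels (transpose twice)
pd.get_dummies(df, prefix='', prefix_sep='').T.groupby(level=0).sum().T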
Quicker
# Pandas 0.24 or greater use `.to_numpy()` instead of `.values`
v = df.values
n, m = v.shape
j, cols = pd.factorize(v.ravel()) # -1 when `np.nan`
# Used to grab only non-null values
mask = j >= 0
i = np.arange(n).repeat(m)[mask]
j = j[mask]
out = np.zeros((n, len(cols)), dtype=int)
# Useful when not one-hot. Otherwise use `out[i, j] = 1`
np.add.at(out, (i, j), 1)
pd.DataFrame(out, df.index, cols)
A1 A2 A4 A3 A5 A6
0 1 1 1 0 0 0
1 1 1 0 1 0 0
2 1 1 0 1 0 0
3 1 0 0 1 1 0
4 0 0 1 0 0 0
5 0 0 0 0 1 1
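If the first-appearance column order bothers you, sort the columns afterwards (my follow-up, not in the original):

pd.DataFrame(out, df.index, cols).sort_index(axis=1)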
Not quicker
This is intended to show that you can join the row values then use str.get_dummies
df.stack().groupby(level=0).apply('|'.join).str.get_dummies()
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1
sklearn
from sklearn.preprocessing import MultiLabelBinarizer as MLB
mlb = MLB()
out = mlb.fit_transform([[*filter(pd.notna, x)] for x in zip(*map(df.get, df))])
pd.DataFrame(out, df.index, mlb.classes_)
A1 A2 A3 A4 A5 A6
0 1 1 0 1 0 0
1 1 1 1 0 0 0
2 1 1 1 0 0 0
3 1 0 1 0 1 0
4 0 0 0 1 0 0
5 0 0 0 0 1 1
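Given the 16,000,000 rows mentioned in the question, it may help that MultiLabelBinarizer accepts sparse_output=True, returning a SciPy sparse matrix instead of a dense array (my note):

mlb = MLB(sparse_output=True)
sparse_out = mlb.fit_transform([[*filter(pd.notna, x)] for x in zip(*map(df.get, df))])
# sparse_out is a scipy.sparse matrix; densify only if it fits in memory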
I have 2 dataframes df1 and df2
df1;
A B C
0 11 22 55
1 66 34 54
2 0 34 66
df2;
A B C
0 11 33 455
1 0 0 54
2 0 34 766
Both dataframes have the same dimensions. I want to say: if a value is 0 in df2, then set the corresponding value (same column and index) in df1 to 0.
So df1 will be
df1;
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Use DataFrame.mask:
df1 = df1.mask(df2 == 0, 0)
For better performance use numpy.where:
df1 = pd.DataFrame(np.where(df2 == 0, 0, df1),
index=df1.index,
columns=df1.columns)
print (df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Using where:
df1 = df1.where(df2.ne(0), 0)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Another way -
df1 = df1[~df2.eq(0)].fillna(0)
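One more equivalent spelling (a sketch of mine, assuming aligned indexes and columns as in the question). Note that the fillna variant just above upcasts integer columns to float because of the intermediate NaNs, whereas direct boolean-mask assignment keeps them as integers:

df1[df2 == 0] = 0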