how to sum values in dataframes based on index match - pandas

I have about 16 dataframes representing weekly users' clickstream data. The photos show the samples for weeks 0-3. I want to make a new dataframe in this way: for example, if the new df is w=2, then w2 = w0 + w1 + w2. For w3, w3 = w0 + w1 + w2 + w3. As you can see, the datasets do not have identical id_users: an id is missing if that user does not show up in a certain week. All dataframes have the same columns, but the indexes are not exactly the same. So how do I add the values where the indexes match?
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
43284 1 8 0 8 5 0 0 0 2 3 1
45664 0 16 0 4 0 0 0 0 5 16 2
52014 0 0 0 5 4 0 0 0 0 2 2
53488 1 37 0 19 0 0 3 0 3 23 6
60135 0 124 0 87 3 0 24 0 8 19 14
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
40419 0 8 0 3 4 0 6 0 1 6 0
43284 1 4 0 14 26 2 0 0 2 4 2
45664 0 9 0 15 11 0 0 0 1 6 14
52014 0 0 0 8 9 0 8 0 2 2 1
53488 0 2 0 4 0 0 4 0 0 0 0
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
40419 0 8 0 3 4 0 6 0 1 6 0
43284 1 4 0 14 26 2 0 0 2 4 2
45664 0 9 0 15 11 0 0 0 1 6 14
52014 0 0 0 8 9 0 8 0 2 2 1
53488 0 2 0 4 0 0 4 0 0 0 0

Concat, then groupby and sum:
out = pd.concat([df1, df2]).groupby('id_user', as_index=False).sum()
Out[147]:
id_user c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11
0 40419 0 8 0 3 4 0 6 0 1 6 0
1 43284 2 12 0 22 31 2 0 0 4 7 3
2 45664 0 25 0 19 11 0 0 0 6 22 16
3 52014 0 0 0 13 13 0 8 0 2 4 3
4 53488 1 39 0 23 0 0 7 0 3 23 6
5 60135 0 124 0 87 3 0 24 0 8 19 14
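The same pattern extends to the full 16-week case: for each week i, concatenate the frames for weeks 0..i and groupby-sum. A minimal sketch with made-up three-week data (the frame names and values here are illustrative, not the question's real data):

```python
import pandas as pd

# illustrative weekly frames; the real data has columns c1..c11
w0 = pd.DataFrame({'id_user': [43284, 45664], 'c1': [1, 0], 'c2': [8, 16]})
w1 = pd.DataFrame({'id_user': [40419, 43284], 'c1': [0, 1], 'c2': [8, 4]})
w2 = pd.DataFrame({'id_user': [43284], 'c1': [2], 'c2': [1]})

weeks = [w0, w1, w2]
cumulative = []
for i in range(len(weeks)):
    # cumulative frame for week i = sum of weeks 0..i, matching rows on id_user
    cum = (pd.concat(weeks[:i + 1])
             .groupby('id_user', as_index=False)
             .sum())
    cumulative.append(cum)

print(cumulative[2])
```

Users absent in a given week simply contribute no rows to that week's concat, so their running totals carry over unchanged.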

Related

User-based encoding of interactions in pandas

I have this dataframe, which looks like this:
user_id : represents the user
question_id : represents the question number
user_answer : which option the user chose for that specific question (A, B, C or D)
correct_answer : the correct answer for that specific question
correct : 1.0 means the user's answer is right
elapsed_time : the time in minutes the user took to answer that question
timestamp : UNIX timestamp of each interaction
real_date : I have added this column by converting timestamp to a human-readable date & time
user_iD  question_id  user_answer  correct_answer  correct  elapsed_time  solving_id  bundle_id  timestamp      real_date
1        1            A            A               1.0      5.00          1           b1         1547794902000  Friday, January 18, 2019 7:01:42 AM
1        2            D            D               1.0      3.00          2           b2         1547795130000  Friday, January 18, 2019 7:05:30 AM
1        5            C            C               1.0      7.00          5           b5         1547795370000  Friday, January 18, 2019 7:09:30 AM
2        10           C            C               1.0      5.00          10          b10        1547806170000  Friday, January 18, 2019 10:09:30 AM
2        1            B            B               1.0      15.0          1           b1         1547802150000  Friday, January 18, 2019 9:02:30 AM
2        15           A            A               1.0      2.00          15          b15        1547803230000  Friday, January 18, 2019 9:20:30 AM
2        7            C            C               1.0      5.00          7           b7         1547802730000  Friday, January 18, 2019 9:12:10 AM
3        12           A            A               1.0      1.00          25          b12        1547771110000  Friday, January 18, 2019 12:25:10 AM
3        10           C            C               1.0      2.00          10          b10        1547770810000  Friday, January 18, 2019 12:20:10 AM
3        3            D            D               1.0      5.00          3           b3         1547770390000  Friday, January 18, 2019 12:13:10 AM
104      6            C            C               1.0      6.00          6           b6         1553040610000  Wednesday, March 20, 2019 12:10:10 AM
104      4            A            A               1.0      5.00          4           b4         1553040547000  Wednesday, March 20, 2019 12:09:07 AM
104      1            A            A               1.0      2.00          1           b1         1553040285000  Wednesday, March 20, 2019 12:04:45 AM
I need to do some encoding, but I don't know which encoding I should use, or how. I need the next dataframe to look like this:
user_id  b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
1        1  2  0  0  3  0  0  0  0  0   0   0   0   0   0
2        1  0  0  0  0  0  0  0  0  2   0   0   0   0   3
3        0  0  1  0  0  0  0  0  0  2   0   3   0   0   0
104      1  0  0  2  0  3  0  0  0  0   0   0   0   0   0
As you can see from timestamp and real_date, the question_id values of each user are not sorted. The new dataframe should show which of the bundles each user interacted with, sorted by time.
First create the order value for each bundle using groupby and cumcount, then pivot your dataframe. Finally, reindex it to get all the columns:
bundle = [f'b{i}' for i in range(1, 16)]
values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)
out = (
    df.assign(value=values)
      .pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
      .reindex(bundle, axis=1, fill_value=0)
)
Output:
>>> out
bundle_id b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
user_iD
1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
>>> out.reset_index().rename_axis(columns=None)
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
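As a self-contained check, the same cumcount-and-pivot approach can be run on a trimmed slice of the sample data (frame built inline; values copied from the question's first five rows):

```python
import pandas as pd

df = pd.DataFrame({
    'user_iD':   [1, 1, 1, 2, 2],
    'bundle_id': ['b1', 'b2', 'b5', 'b10', 'b1'],
    'timestamp': [1547794902000, 1547795130000, 1547795370000,
                  1547806170000, 1547802150000],
})

bundle = [f'b{i}' for i in range(1, 16)]
# cumcount runs over the time-sorted rows; index alignment maps it back
values = df.sort_values('timestamp').groupby('user_iD').cumcount().add(1)
out = (
    df.assign(value=values)
      .pivot_table('value', 'user_iD', 'bundle_id', fill_value=0)
      .reindex(bundle, axis=1, fill_value=0)
)
print(out)
```

For user 2, b1 (9:02 AM) gets order 1 and b10 (10:09 AM) gets order 2, even though b10 appears first in the raw data.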
Lacking more Pythonic experience, I'm proposing the following (partially commented) code snippet, which is not optimized in any way and is based merely on the elementary pandas.DataFrame API reference.
import pandas as pd
import io
import sys

data_string = '''
user_iD;question_id;user_answer;correct_answer;correct;elapsed_time;solving_id;bundle_id;timestamp
1;1;A;A;1.0;5.00;1;b1;1547794902000
1;2;D;D;1.0;3.00;2;b2;1547795130000
1;5;C;C;1.0;7.00;5;b5;1547795370000
2;10;C;C;1.0;5.00;10;b10;1547806170000
2;1;B;B;1.0;15.0;1;b1;1547802150000
2;15;A;A;1.0;2.00;15;b15;1547803230000
2;7;C;C;1.0;5.00;7;b7;1547802730000
3;12;A;A;1.0;1.00;25;b12;1547771110000
3;10;C;C;1.0;2.00;10;b10;1547770810000
3;3;D;D;1.0;5.00;3;b3;1547770390000
104;6;C;C;1.0;6.00;6;b6;1553040610000
104;4;A;A;1.0;5.00;4;b4;1553040547000
104;1;A;A;1.0;2.00;1;b1;1553040285000
'''
df = pd.read_csv(io.StringIO(data_string), sep=';', encoding='utf-8')
# get only necessary columns ordered by timestamp
df_aux = df[['user_iD', 'bundle_id', 'correct', 'timestamp']].sort_values(by=['timestamp'])
# hard coded new headers (possible to build from real 'bundle_id's)
df_new_headers = ['b{}'.format(x + 1) for x in range(15)]
df_new_headers.insert(0, 'user_iD')
dict_answered = {}
# create a new dataframe (I'm sure that there is a more Pythonic solution)
df_new_data = []
user_ids = sorted(set([x for label, x in df_aux.user_iD.items()]))
for user_id in user_ids:
    dict_answered[user_id] = 0
    if len(sys.argv) > 1 and sys.argv[1]:
        # supplied arg in the next line for better result readability
        df_new_values = [sys.argv[1].strip('"').strip("'")
                         for x in range(len(df_new_headers) - 1)]
    else:
        # zeroes (original assignment)
        df_new_values = [0 for x in range(len(df_new_headers) - 1)]
    df_new_values.insert(0, user_id)
    df_new_data.append(df_new_values)
df_new = pd.DataFrame(data=df_new_data, columns=df_new_headers)
# fill the new dataframe using values from the original one
for aux in df_aux.itertuples(index=True, name=None):
    if aux[3] == 1.0:
        # add 1 to number of already answered questions for current user
        dict_answered[aux[1]] += 1
        df_new.loc[df_new['user_iD'] == aux[1], aux[2]] = dict_answered[aux[1]]
print(df_new)
Output examples
Example: .\SO\70751715.py
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 0 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0 0 2 0 0 4 0 0 0 0 3
2 3 0 0 1 0 0 0 0 0 0 2 0 3 0 0 0
3 104 1 0 0 2 0 3 0 0 0 0 0 0 0 0 0
Example: .\SO\70751715.py .
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 . . 3 . . . . . . . . . .
1 2 1 . . . . . 2 . . 4 . . . . 3
2 3 . . 1 . . . . . . 2 . 3 . . .
3 104 1 . . 2 . 3 . . . . . . . . .
Example: .\SO\70751715.py ''
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15
0 1 1 2 3
1 2 1 2 4 3
2 3 1 2 3
3 104 1 2 3
I think you are looking for LabelEncoder. First import the library:
#Common Model Helpers
from sklearn.preprocessing import LabelEncoder
Then you should be able to convert objects to category:
#CONVERT: convert objects to category
#code categorical data
label = LabelEncoder()
dataset['question_id'] = label.fit_transform(dataset['question_id'])
dataset['user_answer'] = label.fit_transform(dataset['user_answer'])
dataset['correct_answer'] = label.fit_transform(dataset['correct_answer'])
Or just apply it to the whole dataframe:
dataset.apply(LabelEncoder().fit_transform)
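If you'd rather stay inside pandas, pd.factorize produces the same kind of integer codes as LabelEncoder; a sketch on a made-up answer column (note factorize numbers categories in order of appearance, whereas LabelEncoder sorts them):

```python
import pandas as pd

# hypothetical sample column standing in for the question's user_answer data
dataset = pd.DataFrame({'user_answer': ['A', 'D', 'C', 'A', 'B']})

# factorize returns (codes, uniques); codes are 0-based integers
codes, uniques = pd.factorize(dataset['user_answer'])
dataset['user_answer_code'] = codes
print(dataset)
```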

Pandas Groupby and divide the dataset into subgroups based on user input and label numbers to each subgroup

Here is my data:
ID Mnth Amt Flg
B 1 10 0
B 2 12 0
B 3 14 0
B 4 41 0
B 5 134 0
B 6 14 0
B 7 134 0
B 8 134 0
B 9 12 0
B 10 41 0
B 11 4 0
B 12 14 0
B 12 14 0
A 1 34 0
A 2 22 0
A 3 56 0
A 4 129 0
A 5 40 0
A 6 20 0
A 7 58 0
A 8 123 0
If I give 3 as input, my output should be:
ID Mnth Amt Flg Level_Flag
B 1 10 0 0
B 2 12 0 1
B 3 14 0 1
B 4 41 0 1
B 5 134 0 2
B 6 14 0 2
B 7 134 0 2
B 8 134 0 3
B 9 12 0 3
B 10 41 0 3
B 11 4 0 4
B 12 14 0 4
B 12 14 0 4
A 1 34 0 0
A 2 22 0 0
A 3 56 0 1
A 4 129 0 1
A 5 40 0 1
A 6 20 0 2
A 7 58 0 2
A 8 123 0 2
So basically I want to divide the data into subgroups of 3 rows each, from the bottom up, and label those subgroups as shown in the Level_Flag column. I have other IDs like A, C and so on, so I want to do this within each ID group. Thanks in advance.
Edit: I want the same thing done after grouping by ID.
First we compute the number of distinct flags nums by dividing the length of your df by n and rounding up. Then we repeat those flag numbers n times each. Finally we reverse the array, chop it off at the length of df, and reverse it one more time, so any incomplete subgroup ends up at the top with flag 0.
import numpy as np

def create_flags(d, n):
    nums = np.ceil(len(d) / n)
    level_flag = np.repeat(np.arange(nums), n)[::-1][:len(d)][::-1]
    return level_flag

df['Level_Flag'] = df.groupby('ID')['ID'].transform(lambda x: create_flags(x, 3))
ID Mnth Amt Flg Level_Flag
0 B 1 10 0 0.0
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
To remove the incomplete rows, use GroupBy.transform:
m = df.groupby(['ID', 'Level_Flag'])['Level_Flag'].transform('count').ge(3)
df = df[m]
ID Mnth Amt Flg Level_Flag
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
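The repeat-and-reverse trick above can also be written with plain integer arithmetic; a hypothetical equivalent (function name is mine) that computes each row's flag from its distance to the bottom of the group:

```python
import numpy as np

def create_flags_arith(length, n):
    # ceiling division: number of distinct flags
    nums = -(-length // n)
    # distance of each row from the bottom of the group
    d = np.arange(length)[::-1]
    # bottom n rows get the highest flag; the leftover top rows get 0
    return (nums - 1 - d // n).astype(int)

print(create_flags_arith(13, 3))
```

A side benefit is that this yields integer flags directly, avoiding the float 0.0/1.0 labels that np.arange of a float count produces.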

How to perform set-like operations in pandas?

I need to fill a column with values that are present in a set but not present in any of the other columns.
initial df
c0 c1 c2 c3 c4 c5
0 4 5 6 3 2 1
1 1 5 4 0 2 3
2 5 6 4 0 1 3
3 5 4 6 2 0 1
4 5 6 4 0 1 3
5 0 1 4 5 6 2
I need a df['c6'] column that is the set difference between the set {0, 1, 2, 3, 4, 5, 6} and each row of df,
so that the result df is
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3
Thank you!
Slightly different approach:
df['c6'] = sum(range(7)) - df.sum(axis=1)
or if you want to be more verbose:
df['c6'] = sum([0,1,2,3,4,5,6]) - df.sum(axis=1)
Use numpy's setdiff1d to find the difference between the two arrays and assign the output to column c6:
ck = np.array([0,1,2,3,4,5,6])
M = df.to_numpy()
df['c6'] = [np.setdiff1d(ck,i)[0] for i in M]
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3
A simple way I could think of is using a list comprehension and set difference:
s = {0, 1, 2, 3, 4, 5, 6}
s
{0, 1, 2, 3, 4, 5, 6}
df['c6'] = [tuple(s.difference(vals))[0] for vals in df.values]
df
c0 c1 c2 c3 c4 c5 c6
0 4 5 6 3 2 1 0
1 1 5 4 0 2 3 6
2 5 6 4 0 1 3 2
3 5 4 6 2 0 1 3
4 5 6 4 0 1 3 2
5 0 1 4 5 6 2 3
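All three answers hinge on each row missing exactly one member of the full set; a quick runnable check of the arithmetic variant (frame values copied from the question):

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6, 3, 2, 1],
                   [1, 5, 4, 0, 2, 3],
                   [5, 6, 4, 0, 1, 3],
                   [5, 4, 6, 2, 0, 1],
                   [5, 6, 4, 0, 1, 3],
                   [0, 1, 4, 5, 6, 2]],
                  columns=[f'c{i}' for i in range(6)])

# the missing element is the total of the full set minus the row sum
df['c6'] = sum(range(7)) - df.sum(axis=1)
print(df['c6'].tolist())
```

If a row could be missing more than one element, only the setdiff1d or set-difference answers would still work, since the sum trick only recovers a single missing value.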

adding one to all the values in a dataframe

I have a dataframe like the one below. I would like to add one to all of the values in each row. I am new to this forum and Python, so I can't conceptualise how to do this. I need to add 1 to each value because I intend to use Bayes probability, and the posterior probability will be 0 when I multiply them. P.S. I am also new to probability, but others have applied the same method. Thanks for your help in advance. I am using pandas to do this.
Disease Gene1 Gene2 Gene3 Gene4
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
With this being your dataframe:
df = pd.DataFrame({
    "Disease": [f"D{i}" for i in range(1, 9)],
    "Gene1": [0, 0, 0, 24, 0, 0, 0, 4],
    "Gene2": [0, 0, 17, 0, 0, 32, 0, 0],
    "Gene3": [25, 0, 0, 0, 0, 0, 0, 0],
    "Gene4": [0, 0, 16, 0, 0, 11, 0, 0]})
Disease Gene1 Gene2 Gene3 Gene4
0 D1 0 0 25 0
1 D2 0 0 0 0
2 D3 0 17 0 16
3 D4 24 0 0 0
4 D5 0 0 0 0
5 D6 0 32 0 11
6 D7 0 0 0 0
7 D8 4 0 0 0
The easiest way to do this is to do
df += 1
However, since you have a column of strings (the Disease column), this will not work directly.
But we can conveniently set the Disease column to be the index, like this:
df.set_index('Disease', inplace=True)
Now your dataframe looks like this:
Gene1 Gene2 Gene3 Gene4
Disease
D1 0 0 25 0
D2 0 0 0 0
D3 0 17 0 16
D4 24 0 0 0
D5 0 0 0 0
D6 0 32 0 11
D7 0 0 0 0
D8 4 0 0 0
And if we do df += 1 now, we get:
Gene1 Gene2 Gene3 Gene4
Disease
D1 1 1 26 1
D2 1 1 1 1
D3 1 18 1 17
D4 25 1 1 1
D5 1 1 1 1
D6 1 33 1 12
D7 1 1 1 1
D8 5 1 1 1
because the plus operation only acts on the data columns, not on the index.
You can also do this on column basis, like this:
df["Gene1"] = df["Gene1"] + 1
You can filter the df columns by whether the underlying dtype is not 'object':
In [110]:
numeric_cols = [col for col in df if df[col].dtype.kind != 'O']
numeric_cols
Out[110]:
['Gene1', 'Gene2', 'Gene3', 'Gene4']
In [111]:
df[numeric_cols] += 1
df
Out[111]:
Disease Gene1 Gene2 Gene3 Gene4
0 D1 1 1 26 1
1 D2 1 1 1 1
2 D3 1 18 1 17
3 D4 25 1 1 1
4 D5 1 1 1 1
5 D6 1 33 1 12
6 D7 1 1 1 1
7 D8 5 1 1 1
EDIT
It looks like your df possibly has strings instead of numeric types; you can convert the dtypes to numeric using convert_objects (note: this method was removed in later pandas versions, where pd.to_numeric is the replacement):
df = df.convert_objects(convert_numeric=True)
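On recent pandas versions, select_dtypes is a convenient way to pick the numeric columns without inspecting dtype.kind by hand; a sketch on a trimmed version of the frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Disease': ['D1', 'D2', 'D3'],
    'Gene1': [0, 0, 0],
    'Gene3': [25, 0, 0],
})

# select only numeric columns, leaving the string Disease column untouched
num_cols = df.select_dtypes(include='number').columns
df[num_cols] += 1
print(df)
```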

How would you split this given NSString into an NSDictionary?

I have some data I acquire from a Linux box and want to put it into an NSDictionary for later processing. How would you get this NSString into an NSDictionary like the following?
data
(
bytes
(
60 ( 1370515694 )
48 ( 812 )
49 ( 300 )
...
)
pkt
(
60 ( 380698 )
59 ( 8 )
58 ( 412 )
...
)
block
(
60 ( 5 )
48 ( 4 )
49 ( 7 )
...
)
drop
(
60 ( 706 )
48 ( 2 )
49 ( 4 )
...
)
session
(
60 ( 3 )
48 ( 1 )
49 ( 2 )
...
)
)
The data string looks like:
//time bytes pkt block drop session
60 1370515694 380698 5 706 3
48 812 8 4 2 1
49 300 412 7 4 2
50 0 0 0 0 0
51 87 2 0 0 0
52 87 2 0 0 0
53 0 0 0 0 0
54 0 0 0 0 0
55 0 0 0 0 0
56 0 0 0 0 0
57 812 8 0 0 0
58 812 8 0 0 0
59 0 0 0 0 0
0 0 0 0 0 0
1 2239 12 2 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 2882 19 2 0 0
8 4906 29 4 0 0
9 1844 15 11 0 0
10 4210 29 17 0 0
11 3370 18 4 0 0
12 3370 18 4 0 0
13 1184 7 3 0 0
14 0 0 0 0 0
15 4046 19 3 0 0
16 4956 23 3 0 0
17 2960 18 2 0 0
18 2960 18 2 0 0
19 1088 6 2 0 0
20 0 0 0 0 0
21 3261 17 3 0 0
22 3261 17 3 0 0
23 1228 6 2 0 0
24 1228 6 2 0 0
25 2628 17 2 0 0
26 4688 26 3 0 0
27 1752 13 5 0 0
28 3062 21 5 0 0
29 174 2 2 0 0
30 96 1 1 0 0
31 4351 23 5 0 0
32 0 0 0 0 0
33 4930 23 7 0 0
34 6750 31 7 0 0
35 1241 6 2 0 0
36 1241 6 2 0 0
37 3571 29 2 0 0
38 0 0 0 0 0
39 1010 5 1 0 0
40 1010 5 1 0 0
41 88859 72 3 0 1
42 90783 81 4 0 1
43 2914 19 3 0 0
44 0 0 0 0 0
45 2157 17 1 0 0
46 2157 17 1 0 0
47 78 1 1 0 0
Time (first column) should be the key for the sub-sub-dictionaries.
The idea behind all that is that I can later randomly access the PKT value at a given TIME x, as well as the BLOCK amount at TIME y, the SESSION value at TIME z, and so on. Thanks in advance.
You probably don't want a dictionary but an array containing dictionaries of all the data entries. The simplest way to parse something like this in Objective-C is to use the componentsSeparatedByString: method on NSString:
NSString *dataString = <Your Data String>; // Assumes the items are separated by newlines
NSArray *items = [dataString componentsSeparatedByString:@"\n"];
NSMutableArray *dataDictionaries = [NSMutableArray array];
for (NSString *item in items) {
    NSArray *elements = [item componentsSeparatedByString:@" "];
    NSDictionary *entry = @{
        @"time": [elements objectAtIndex:0],
        @"bytes": [elements objectAtIndex:1],
        @"pkt": [elements objectAtIndex:2],
        @"block": [elements objectAtIndex:3],
        @"drop": [elements objectAtIndex:4],
        @"session": [elements objectAtIndex:5],
    };
    [dataDictionaries addObject:entry];
}