I have a timeseries data frame that has columns like these:
Date temp_data holiday day
01.01.2000 10000 0 1
02.01.2000 0 1 2
03.01.2000 2000 0 3
..
..
..
26.01.2000 200 0 26
27.01.2000 0 1 27
28.01.2000 500 0 28
29.01.2000 0 1 29
30.01.2000 200 0 30
31.01.2000 0 1 31
01.02.2000 0 1 1
02.02.2000 2500 0 2
Here, holiday = 0 when data is present, which indicates a working day, and
holiday = 1 when no data is present, which indicates a non-working day.
I am trying to derive three new columns from this data: second_last_working_day_of_month, third_last_working_day_of_month, and fourth_last_wday.
The output data frame should look like this:
Date temp_data holiday day secondlast_wd thirdlast_wd fourthlast_wd
01.01.2000 10000 0 1 1 0 0
02.01.2000 0 1 2 0 0 0
03.01.2000 2000 0 3 0 0 0
..
..
25.01.2000 345 0 25 0 0 1
26.01.2000 200 0 26 0 1 0
27.01.2000 0 1 27 0 0 0
28.01.2000 500 0 28 1 0 0
29.01.2000 0 1 29 0 0 0
30.01.2000 200 0 30 0 0 0
31.01.2000 0 1 31 0 0 0
01.02.2000 0 1 1 0 0 0
02.02.2000 2500 0 2 0 0 0
Can anyone help me with this?
Example
import pandas as pd
data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
        ['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
        ['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],
        ['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])
df
Date temp_data holiday day
0 26.01.2000 200 0 26
1 27.01.2000 0 1 27
2 28.01.2000 500 0 28
3 29.01.2000 0 1 29
4 30.01.2000 200 0 30
5 31.01.2000 0 1 31
6 26.02.2000 200 0 26
7 27.02.2000 0 0 27
8 28.02.2000 500 0 28
9 29.02.2000 0 1 29
Code
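For example, to build the secondlast_wd column (n = 2):
n = 2
# Dates are day-first (dd.mm.yyyy), so parse them with an explicit format
s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
# Reverse the holiday column, count working days cumulatively within each month,
# and keep the working days whose count from the end of the month equals n
result = df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(n)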
result
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
Assign the result as the secondlast_wd column:
df.assign(secondlast_wd=result.astype('int'))
output:
Date temp_data holiday day secondlast_wd
0 26.01.2000 200 0 26 0
1 27.01.2000 0 1 27 0
2 28.01.2000 500 0 28 1
3 29.01.2000 0 1 29 0
4 30.01.2000 200 0 30 0
5 31.01.2000 0 1 31 0
6 26.02.2000 200 0 26 0
7 27.02.2000 0 0 27 1
8 28.02.2000 500 0 28 0
9 29.02.2000 0 1 29 0
You can change n to get the third-last, fourth-last, and so on.
Update in response to a comment
Check for working days (in reverse index order):
df.iloc[::-1, 2].eq(0) # 2 is the position of the 'holiday' column; df.loc[::-1, "holiday"] also works
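As a rough sketch of "changing n", the three requested columns can be built in one loop. The column names in `names` follow the question, and grouping by `s.dt.to_period('M')` instead of `s.dt.month` is my own assumption so that the same month in different years is not merged; the rest mirrors the expression above.

```python
import pandas as pd

data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
        ['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
        ['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],
        ['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])

# Day-first dates, parsed with an explicit format
s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
is_workday = df['holiday'].eq(0)

# Count working days from the end of each calendar month (year and month together)
rank_from_end = is_workday.iloc[::-1].groupby(s.dt.to_period('M')).cumsum()

# Hypothetical mapping from n to the column names used in the question
names = {2: 'secondlast_wd', 3: 'thirdlast_wd', 4: 'fourthlast_wd'}
for n, name in names.items():
    # Assignment aligns on the index, so the reversed order does not matter
    df[name] = (is_workday & rank_from_end.eq(n)).astype(int)

print(df)
```

With this small sample, each month has only three working days, so the fourth-last column stays all zeros.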
9 False
8 True
7 True
6 True
5 False
4 True
3 False
2 True
1 False
0 True
Name: holiday, dtype: bool
Take the reverse cumulative sum within each group (month). Reading in reverse index order, the value increases by 1 on a working day and stays the same as the previous value on a holiday.
df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum()
9 0
8 1
7 2
6 3
5 0
4 1
3 1
2 2
1 2
0 3
Name: holiday, dtype: int64
Rows where holiday == 0 and the reverse cumulative count == 2 are the second-last working days:
df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(2)
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
The result keeps the original (non-reversed) index order.
Another way
More readable code would be:
s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
idx1 = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-2).index
df.loc[idx1, 'lastsecondary_wd'] = 1
df['lastsecondary_wd'] = df['lastsecondary_wd'].fillna(0).astype('int')
This gives the same result.
I have a two-dimensional variable in AMPL and I want to display it with the order of its indices changed, but I do not know how to do that. Below are my code, my data, the output I get, and the output I want.
Here is my code:
param n;
param t;
param w;
param p;
set Var, default{1..n};
set Ind, default{1..t};
set mode, default{1..w};
var E{mode, Ind};
var B{mode,Var};
var C{mode,Ind};
param X{mode,Var,Ind};
var H{Ind};
minimize obj: sum{m in mode,i in Ind}E[m,i];
s.t. a1{m in mode, i in Ind}: sum{j in Var} X[m,j,i]*B[m,j] -C[m,i] <=E[m,i];
solve;
display C;
data;
param w:=4;
param n:=9;
param t:=2;
param X:=
[*,*,1]: 1 2 3 4 5 6 7 8 9 :=
1 69 59 100 70 35 1 1 0 0
2 34 31 372 71 35 1 0 1 0
3 35 25 417 70 35 1 0 0 1
4 0 10 180 30 35 1 0 0 0
[*,*,2]: 1 2 3 4 5 6 7 8 9 :=
1 64 58 68 68 30 2 1 0 0
2 44 31 354 84 30 2 0 1 0
3 53 25 399 85 30 2 0 0 1
4 0 11 255 50 30 2 0 0 0
The output of this code using glpsol looks like this:
C[1,1].val = -1.11111111111111
C[1,2].val = -1.11111111111111
C[2,1].val = -0.858585858585859
C[2,2].val = -1.11111111111111
C[3,1].val = -0.915032679738562
C[3,2].val = -1.11111111111111
C[4,1].val = 0.141414141414141
C[4,2].val = 0.2003367003367
but I want the result to be like this:
C[1,1].val = -1.11111111111111
C[2,1].val = -0.858585858585859
C[3,1].val = -0.915032679738562
C[4,1].val = 0.141414141414141
C[1,2].val = -1.11111111111111
C[2,2].val = -1.11111111111111
C[3,2].val = -1.11111111111111
C[4,2].val = 0.2003367003367
Any ideas?
You can use for loops and printf commands in your .run file:
for {i in Ind}
for {m in mode}
printf "C[%d,%d] = %.4f\n", m, i, C[m,i];
or even:
printf {i in Ind, m in mode} "C[%d,%d] = %.4f\n", m, i, C[m,i];
I don't get the same numerical results as you, but anyway the output works:
C[1,1] = 0.0000
C[2,1] = 0.0000
C[3,1] = 0.0000
C[4,1] = 0.0000
C[1,2] = 0.0000
C[2,2] = 0.0000
C[3,2] = 0.0000
C[4,2] = 0.0000
I am building a news recommendation system and need to build a table of users and the news items they have read. My raw data looks like this:
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
The first column is the userID and the second column is the list of newsIDs. Each newsID is a column index: for example, [12,456,157] in the first row means that this user has read the 12th, 456th, and 157th news items, so in the sparse vector the 12th, 456th, and 157th columns are 1 and all other columns are 0. I want to convert this data into a sparse vector format that can be used as input to the KMeans or DBSCAN algorithms in sklearn.
How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then cast to CSR format.
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
    if user not in user_row_map:
        user_row_map[user] = user_row_index
        user_row_index+=1
    for movie in movies:
        I.append(user_row_map[user])
        J.append(movie)
        data.append(1) # number of times users watched the movie
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
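To connect this back to the question, here is a minimal sketch of feeding such a CSR matrix to KMeans, which accepts SciPy sparse input. The value of `n_columns` and the choice of `n_clusters=2` are assumptions made only for this four-user sample, not recommendations.

```python
from scipy.sparse import coo_matrix
from sklearn.cluster import KMeans

input_data = [
    ("001436800277225", [12, 456, 157]),
    ("009092130698762", [248]),
    ("010003000431538", [361, 521, 83]),
    ("010156461231357", [173, 67, 244]),
]
n_columns = 1000  # assumed bound; it must be larger than the largest news index

# Collect (row, column) coordinates of the non-zero entries
rows, cols = [], []
for row, (_user, news_ids) in enumerate(input_data):
    for news_id in news_ids:
        rows.append(row)
        cols.append(news_id)
values = [1] * len(rows)

feature_matrix = coo_matrix(
    (values, (rows, cols)), shape=(len(input_data), n_columns)
).tocsr()

# n_clusters=2 is an arbitrary choice for this tiny sample
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feature_matrix)
labels = model.labels_  # one cluster label per user row
print(labels)
```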
Use MultiLabelBinarizer from sklearn.preprocessing:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# df is assumed to hold the lists of news indices in a column named newsID
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0
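Since the question asks for a sparse representation, it may be worth noting that MultiLabelBinarizer has a `sparse_output` parameter. The sketch below assumes a data frame with a `newsID` column of lists, which is my guess at how the raw data was loaded; only the first three users are included.

```python
import pandas as pd
from scipy.sparse import issparse
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed layout: one row per user, newsID holds the list of news indices
df = pd.DataFrame({
    "userID": ["001436800277225", "009092130698762", "010003000431538"],
    "newsID": [[12, 456, 157], [248], [361, 521, 83]],
})

# sparse_output=True returns a SciPy sparse matrix instead of a dense array
mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(df["newsID"])

print(X.shape)       # one column per distinct news index that occurs
print(mlb.classes_)  # maps column position back to the news index
```

Note that the columns here are the distinct news indices that actually occur, not a fixed range, so `mlb.classes_` is needed to map a column back to its newsID.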
I have a pandas data frame like the one below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have the target variable as an array, shown below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable onto the data frame and convert it into LIBSVM format?
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me to solve those errors.
I think you need to convert the index with to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or rename the index:
df["labels"] = df.rename(index=equi).index
All together (for the difference of columns, pandas has Index.difference):
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
Also, it seems the labels column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
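If the target array is already in row order, a simpler route may be to pass it to dump_svmlight_file directly, without mapping through the index. This is only a sketch: the 3x3 frame and the in-memory buffer are stand-ins I made up so the example runs without touching the disk, and the round trip through load_svmlight_file is just a sanity check.

```python
import io

import numpy as np
import pandas as pd
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Small stand-in for the 8 x 100 frame in the question
df = pd.DataFrame(np.array([[0, 1, 1],
                            [1, 0, 0],
                            [1, 1, 0]]))
y = np.array([1, -1, -1])  # one label per row, in row order

# A binary file-like object is accepted in place of a file path
buf = io.BytesIO()
dump_svmlight_file(df.to_numpy(), y, buf, zero_based=True)

# Read the data back to check that features and labels survived
buf.seek(0)
X_back, y_back = load_svmlight_file(buf, n_features=df.shape[1], zero_based=True)
print(buf.getvalue().decode())
```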
Excel Formulas I am trying to replicate in pandas:
Click here to download workbook
* Look at columns D, E and F
entsig and exsig are manual and can be changed. In real life they would be derived from the value of another column or a comparison of two other columns
ent = 1 if entsig previous = 1 and in = 0
in = 1 if ent previous = 1 or (in previous = 1 and ex = 0)
ex = 1 if exsig previous = 1 and in previous = 1
So at most one of ent, in, and ex is 1 in any given row.
import pandas as pd
df = pd.DataFrame(
[[0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,1,0,0,0], [0,1,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0],
[1,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0]],
columns=['entsig', 'exsig','ent', 'in', 'ex'])
for i in df.index:
    df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
    df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
    df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
for j in df.index:
    df['ent'][df['in'] == 1]=0
    df['in'][df['ex']==1]=0
    df['ex'][df['ex'].shift(1)==1]=0
df
results in
entsig exsig ent in ex
0 0 0 0 0 0
1 1 0 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 0 0 0 1 0
5 0 1 0 1 0
6 0 1 0 0 1
7 1 0 0 0 0
8 1 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 1 0
12 0 1 0 1 0
13 0 1 0 0 1
14 0 1 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 1 0 0 0 0
18 1 0 1 0 0
19 1 0 0 1 0
20 1 1 0 1 0
21 0 1 0 0 1
22 0 1 0 0 0
23 0 1 0 0 0
Question
How can I make this code faster? It runs slowly because it uses loops, but I have not been able to come up with a solution that avoids them. Any ideas or comments are appreciated.
If we can assume every group of 1's in entsig is followed by at least one 1 in
exsig, then you could compute ent, ex and in like this:
def ent_in_ex(df):
    entsig_mask = (df['entsig'].diff().shift(1) == 1)
    exsig_mask = (df['exsig'].diff().shift(1) == 1)
    df.loc[entsig_mask, 'ent'] = 1
    df.loc[exsig_mask, 'ex'] = 1
    df['in'] = df['ent'].shift(1).cumsum().subtract(df['ex'].cumsum(), fill_value=0)
    return df
If we can make this assumption, then ent_in_ex is significantly faster:
In [5]: %timeit orig(df)
10 loops, best of 3: 185 ms per loop
In [6]: %timeit ent_in_ex(df)
100 loops, best of 3: 2.23 ms per loop
In [95]: orig(df).equals(ent_in_ex(df))
Out[95]: True
where orig is the original code:
def orig(df):
    for i in df.index:
        df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)]=1
        df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)]=1
        df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))]=1
    for j in df.index:
        df['ent'][df['in'] == 1]=0
        df['in'][df['ex']==1]=0
        df['ex'][df['ex'].shift(1)==1]=0
    return df
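One caveat about the equality check: both functions modify df in place and return it, so orig(df).equals(ent_in_ex(df)) ends up comparing the frame with itself. A sketch of an independent check is below. It uses a copy-based variant of ent_in_ex (the `.copy()` and the final `.astype(int)` are my additions) and compares against the ent, in, and ex positions printed in the question's result table.

```python
import pandas as pd

# entsig and exsig columns from the question's sample frame
entsig = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
exsig = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]
df = pd.DataFrame({'entsig': entsig, 'exsig': exsig, 'ent': 0, 'in': 0, 'ex': 0})


def ent_in_ex_copy(frame):
    """Same logic as ent_in_ex above, but works on a copy of the input."""
    out = frame.copy()
    # A signal that switched from 0 to 1 on the previous row triggers ent / ex
    entsig_mask = out['entsig'].diff().shift(1) == 1
    exsig_mask = out['exsig'].diff().shift(1) == 1
    out.loc[entsig_mask, 'ent'] = 1
    out.loc[exsig_mask, 'ex'] = 1
    # in = entries so far (lagged by one row) minus exits so far
    out['in'] = (out['ent'].shift(1).cumsum()
                 .subtract(out['ex'].cumsum(), fill_value=0)
                 .astype(int))
    return out


result = ent_in_ex_copy(df)

# Row positions where each flag is 1, for comparison with the question's table
ent_rows = result.index[result['ent'] == 1].tolist()
in_rows = result.index[result['in'] == 1].tolist()
ex_rows = result.index[result['ex'] == 1].tolist()
print(ent_rows, in_rows, ex_rows)
```

This still relies on the same assumption as the answer, namely that every run of 1's in entsig is followed by at least one 1 in exsig.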