Delete set of lines between two strings in Excel VBA

I have a task to read a bunch of Notepad (.txt) files and, in each file, delete the set of lines that sits between the strings "dynamics" and "end dynamics", deleting those two marker lines as well.
The listing below shows the parts that need to be removed wherever they occur ("dynamics" through "end dynamics"; any content can be present between these two boundaries) in the text file.
L3.3
resizeText 1
zoomLines 0
zoomArrows 0
doLasso 1
opaqueMove 1
selectDistance 30
adjustFonts 1
doubleBuffer 1
clipping 1
nCopyAreas 0
drawTextLimit 0.5
saveObjects 1
canvasBackground #050048007800
defaultForeground #000000000000
layers 1
layerName 0 0
layerName 1 1
layerName 2 2
layerName 3 3
layerName 4 4
layerName 5 5
layerName 6 6
layerName 7 7
layerName 8 8
layerName 9 9
layerName 10 10
layerName 11 11
layerName 12 12
layerName 13 13
layerName 14 14
layerName 15 15
layerName 16 16
layerName 17 17
layerName 18 18
layerName 19 19
layerName 20 20
layerName 21 21
layerName 22 22
layerName 23 23
layerName 24 24
layerName 25 25
layerName 26 26
layerName 27 27
layerName 28 28
layerName 29 29
layerName 30 30
layerName 31 31
gend
N 0
P 0 0
T -1
R 0 0
0
0 4 1 0
Name #WVP
0 1 1
!
27e
054878
-1-1-1
0
0
0
0 0
dynamics
script
//***GblSymDetails***
;DTLS; GSA_TEXT = "CIUXX"
//***ApplReplace***
//GEMTool = 1
// = ASPECTLINK
end script
end dynamics
0 0 1920 1080 0 0
N 2
P 34.7792 181.549
T 2 21071 1 0 0
0 0
R 0 0
0
0 0 3 0
Name STAT5495
0 1 1
!
27e
a5a5a5
a5a5a5
0
0
0
2 0
0 0 0
0 0 0 0 1
4
0 12.7627
11.5724 0
192.874 0.255264
185.159 12.3798
N 4
P 221.604 181.887
T 2 21071 1 0 0
0 0
R 0 0
0
0 0 5 0
Name STAT5496
0 1 1
!
27e
7c7c7c
7c7c7c
0
0
0
2 0
dynamics
script
func ip_FillColor() {
return FILLCOLOR;
}
func ip_LineColor() {
return LINECOLOR;
}
func ip_TEXT() {
return TEXT;
}
func BackColor() {
return RGB(124,124,124);
}
func ForeColor() {
return RGB(124,124,124);
}
// when ...
object.background = BackColor();
object.foreground = ForeColor();
end script
end dynamics
0 0 0
0 0 0 0 1
4
0 11.6149
6.04944 0
6.04944 54.445
0 72.5933
N 6
P 81.7124 181.888
T 2 21071 1 0 0
0 0
R 0 0
0
0 0 7 0
Name STAT5497
0 1 1
!
27e
616161
616161
0
0
0
2 0
dynamics
script
func ip_FillColor() {
return FILLCOLOR;
}
func ip_LineColor() {
return LINECOLOR;
}
func ip_TEXT() {
return TEXT;
}
func BackColor() {
return RGB(97,97,97);
}
func ForeColor() {
return RGB(97,97,97);
}
// when ...
object.background = BackColor();
object.foreground = ForeColor();
end script
end dynamics
0 1 0
0 0 0 0 1
72
93.8536 5.86588

Something like the code below should work. You can add code to dump all the files into a folder and Dir through them, reading each one, stripping the text, and saving the output files automatically (a sketch of that loop follows the routine). Depending on the number of files it may or may not be time-efficient to do.
Sub cleantext()
    Dim lineOfText As String
    Dim skipLines As Boolean
    ' Open the input file for reading and the output file for writing
    Open "D:\inputfile.txt" For Input As #1
    Open "D:\outputfile.txt" For Output As #2
    skipLines = False
    Do Until EOF(1)
        Line Input #1, lineOfText
        If lineOfText = "dynamics" Then skipLines = True
        ' print only when outside a dynamics block; this check runs before
        ' skipLines is reset, so the "end dynamics" line is skipped as well
        If Not skipLines Then Print #2, lineOfText
        If lineOfText = "end dynamics" Then skipLines = False
    Loop
    Close #1
    Close #2
End Sub
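A sketch of the Dir loop mentioned above, assuming all the .txt files sit in a single folder; the folder path and the ".out" output naming are just examples, not part of the original answer:
Sub CleanAllFiles()
    Dim folderPath As String
    Dim fileName As String
    Dim lineOfText As String
    Dim skipLines As Boolean
    folderPath = "D:\notes\"                ' example folder; adjust to your setup
    fileName = Dir(folderPath & "*.txt")
    Do While fileName <> ""
        skipLines = False
        Open folderPath & fileName For Input As #1
        ' write to name.txt.out so the output does not match the *.txt pattern
        Open folderPath & fileName & ".out" For Output As #2
        Do Until EOF(1)
            Line Input #1, lineOfText
            If lineOfText = "dynamics" Then skipLines = True
            If Not skipLines Then Print #2, lineOfText
            If lineOfText = "end dynamics" Then skipLines = False
        Loop
        Close #1
        Close #2
        fileName = Dir                      ' next matching file
    Loop
End Sub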

Related

generate date feature column using pandas

I have a time-series data frame with columns like these:
Date temp_data holiday day
01.01.2000 10000 0 1
02.01.2000 0 1 2
03.01.2000 2000 0 3
..
..
..
26.01.2000 200 0 26
27.01.2000 0 1 27
28.01.2000 500 0 28
29.01.2000 0 1 29
30.01.2000 200 0 30
31.01.2000 0 1 31
01.02.2000 0 1 1
02.02.2000 2500 0 2
Here, holiday = 0 when data is present, indicating a working day; holiday = 1 when no data is present, indicating a non-working day.
I am trying to extract three new columns from this data: second_last_working_day_of_month, third_last_working_day_of_month, and fourth_last_working_day_of_month.
The output data frame should look like this:
Date temp_data holiday day secondlast_wd thirdlast_wd fourthlast_wd
01.01.2000 10000 0 1 1 0 0
02.01.2000 0 1 2 0 0 0
03.01.2000 2000 0 3 0 0 0
..
..
25.01.2000 345 0 25 0 0 1
26.01.2000 200 0 26 0 1 0
27.01.2000 0 1 27 0 0 0
28.01.2000 500 0 28 1 0 0
29.01.2000 0 1 29 0 0 0
30.01.2000 200 0 30 0 0 0
31.01.2000 0 1 31 0 0 0
01.02.2000 0 1 1 0 0 0
02.02.2000 2500 0 2 0 0 0
Can anyone help me with this?
Example
data = [['26.01.2000', 200, 0, 26], ['27.01.2000', 0, 1, 27], ['28.01.2000', 500, 0, 28],
['29.01.2000', 0, 1, 29], ['30.01.2000', 200, 0, 30], ['31.01.2000', 0, 1, 31],
['26.02.2000', 200, 0, 26], ['27.02.2000', 0, 0, 27], ['28.02.2000', 500, 0, 28],['29.02.2000', 0, 1, 29]]
df = pd.DataFrame(data, columns=['Date', 'temp_data', 'holiday', 'day'])
df
Date temp_data holiday day
0 26.01.2000 200 0 26
1 27.01.2000 0 1 27
2 28.01.2000 500 0 28
3 29.01.2000 0 1 29
4 30.01.2000 200 0 30
5 31.01.2000 0 1 31
6 26.02.2000 200 0 26
7 27.02.2000 0 0 27
8 28.02.2000 500 0 28
9 29.02.2000 0 1 29
Code
For example, to make the secondlast_wd column (n=2):
n = 2
s = pd.to_datetime(df['Date'])
result = df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(n)
result
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
Assign the result to the secondlast_wd column:
df.assign(secondlast_wd=result.astype('int'))
output:
Date temp_data holiday day secondlast_wd
0 26.01.2000 200 0 26 0
1 27.01.2000 0 1 27 0
2 28.01.2000 500 0 28 1
3 29.01.2000 0 1 29 0
4 30.01.2000 200 0 30 0
5 31.01.2000 0 1 31 0
6 26.02.2000 200 0 26 0
7 27.02.2000 0 0 27 1
8 28.02.2000 500 0 28 0
9 29.02.2000 0 1 29 0
You can change n to get the third-last, fourth-last, and so on.
Update for comment
Check working days (reversed index):
df.iloc[::-1, 2].eq(0)  # 2 is the position of 'holiday'; df.loc[::-1, "holiday"] also works
9 False
8 True
7 True
6 True
5 False
4 True
3 False
2 True
1 False
0 True
Name: holiday, dtype: bool
Reverse cumulative sum grouped by month: each working day adds 1 to the value above it, while each holiday keeps the same value as the row above (in the reversed index order, of course).
df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum()
9 0
8 1
7 2
6 3
5 0
4 1
3 1
2 2
1 2
0 3
Name: holiday, dtype: int64
Find the rows where holiday == 0 and the cumulative count equals 2; that is the second-last working day:
df['holiday'].eq(0) & df.iloc[::-1, 2].eq(0).groupby(s.dt.month).cumsum().eq(2)
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: holiday, dtype: bool
This operation returns the index in its original (not reversed) order.
Another way
More readable code would be:
s = pd.to_datetime(df['Date'])
idx1 = df[df['holiday'].eq(0)].groupby(s.dt.month, as_index=False).nth(-2).index
df.loc[idx1, 'lastsecondary_wd'] = 1
df['lastsecondary_wd'] = df['lastsecondary_wd'].fillna(0).astype('int')
This gives the same result.
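To get all three columns at once, the first approach generalizes naturally into a helper; a minimal sketch (the function name and the explicit date format are my own additions, not from the answers above):
import pandas as pd

def last_nth_wd(df, n):
    # mark the n-th last working day of each month, using the
    # cumsum-over-reversed-index idea from the answer above
    s = pd.to_datetime(df['Date'], format='%d.%m.%Y')
    mask = df['holiday'].eq(0) & df.loc[::-1, 'holiday'].eq(0).groupby(s.dt.month).cumsum().eq(n)
    return mask.astype('int')

df = df.assign(secondlast_wd=last_nth_wd(df, 2),
               thirdlast_wd=last_nth_wd(df, 3),
               fourthlast_wd=last_nth_wd(df, 4))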

Display command in AMPL

I have a two-dimensional variable in AMPL and I want to display it with the order of the indices changed, but I do not know how to do that. My code, data, and output are below, together with the output I would like to get.
Here is my code:
param n;
param t;
param w;
param p;
set Var, default{1..n};
set Ind, default{1..t};
set mode, default{1..w};
var E{mode, Ind};
var B{mode,Var};
var C{mode,Ind};
param X{mode,Var,Ind};
var H{Ind};
minimize obj: sum{m in mode,i in Ind}E[m,i];
s.t. a1{m in mode, i in Ind}: sum{j in Var} X[m,j,i]*B[m,j] -C[m,i] <=E[m,i];
solve;
display C;
data;
param w:=4;
param n:=9;
param t:=2;
param X:=
[*,*,1]: 1 2 3 4 5 6 7 8 9 :=
1 69 59 100 70 35 1 1 0 0
2 34 31 372 71 35 1 0 1 0
3 35 25 417 70 35 1 0 0 1
4 0 10 180 30 35 1 0 0 0
[*,*,2]: 1 2 3 4 5 6 7 8 9 :=
1 64 58 68 68 30 2 1 0 0
2 44 31 354 84 30 2 0 1 0
3 53 25 399 85 30 2 0 0 1
4 0 11 255 50 30 2 0 0 0
The output of this code using glpsol is like this:
C[1,1].val = -1.11111111111111
C[1,2].val = -1.11111111111111
C[2,1].val = -0.858585858585859
C[2,2].val = -1.11111111111111
C[3,1].val = -0.915032679738562
C[3,2].val = -1.11111111111111
C[4,1].val = 0.141414141414141
C[4,2].val = 0.2003367003367
but I want the result to be like this:
C[1,1].val = -1.11111111111111
C[2,1].val = -0.858585858585859
C[3,1].val = -0.915032679738562
C[4,1].val = 0.141414141414141
C[1,2].val = -1.11111111111111
C[2,2].val = -1.11111111111111
C[3,2].val = -1.11111111111111
C[4,2].val = 0.2003367003367
Any ideas?
You can use for loops and printf commands in your .run file:
for {i in Ind}
for {m in mode}
printf "C[%d,%d] = %.4f\n", m, i, C[m,i];
or even:
printf {i in Ind, m in mode} "C[%d,%d] = %.4f\n", m, i, C[m,i];
I don't get the same numerical results as you, but in any case the output format works:
C[1,1] = 0.0000
C[2,1] = 0.0000
C[3,1] = 0.0000
C[4,1] = 0.0000
C[1,2] = 0.0000
C[2,2] = 0.0000
C[3,2] = 0.0000
C[4,2] = 0.0000
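If you only need display rather than formatted printf output, display also accepts an explicit indexing expression, so you can iterate the indices in whichever order you like; a sketch, assuming the same model declarations (the exact layout of display output may differ from printf's):
display {i in Ind, m in mode} C[m,i];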

How can I change my index vector into sparse feature vector that can be used in sklearn?

I am building a news recommendation system and I need to build a table of users and the news they read. My raw data looks like this:
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
The first column is the userID; the second column is the newsID. newsID is an index column: for example, [12,456,157] in the first row means this user has read the 12th, 456th, and 157th news items (in a sparse vector, columns 12, 456, and 157 are 1, while all other columns are 0). I want to convert these data into a sparse vector format that can be used as the input vector for sklearn's KMeans or DBSCAN algorithms.
How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO format and then cast it to CSR format.
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
    if user not in user_row_map:
        user_row_map[user] = user_row_index
        user_row_index += 1
    for movie in movies:
        I.append(user_row_map[user])
        J.append(movie)
        data.append(1)  # number of times the user watched the movie
# create the matrix in COO format; then cast it to CSR, which is much easier to use
feature_matrix = coo_matrix((data, (I, J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
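The resulting CSR matrix can then be fed straight to the clustering estimators mentioned in the question; a minimal sketch (the choice of k here is arbitrary):
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0)  # k=5 chosen arbitrarily for illustration
labels = kmeans.fit_predict(feature_matrix)    # one cluster label per user row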
Use MultiLabelBinarizer from sklearn.preprocessing:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)  # df.newsID holds the lists of indices
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0
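If you want an actual sparse matrix rather than a dense DataFrame (for example, as input to KMeans or DBSCAN), MultiLabelBinarizer can emit one directly; a sketch:
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_features = mlb.fit_transform(df.newsID)  # a scipy sparse matrix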

how to convert pandas dataframe to libsvm format?

I have a pandas data frame like the one below.
df
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 \
0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
1 0 1 1 1 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
5 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
7 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
[8 rows x 100 columns]
I have the target variable as an array, as below.
[1, -1, -1, 1, 1, -1, 1, 1]
How can I map this target variable onto the data frame and convert it into libsvm format?
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.index.map[(equi)]
d = df[np.setdiff1d(df.columns,['indx','labels'])]
e = df.label
dump_svmlight_file(d,e,'D:/result/smvlight2.dat')
ERROR:
File "D:/spyder/april.py", line 54, in <module>
df["labels"] = df.index.map[(equi)]
TypeError: 'method' object is not subscriptable
When I use
df["labels"] = df.index.list(map[(equi)])
ERROR:
AttributeError: 'RangeIndex' object has no attribute 'list'
Please help me solve these errors.
I think you need to convert the index to_series and then call map:
df["labels"] = df.index.to_series().map(equi)
Or use rename on the index:
df["labels"] = df.rename(index=equi).index
All together (for the difference of columns, pandas has difference):
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
df["labels"] = df.rename(index=equi).index
e = df["labels"]
d = df[df.columns.difference(['indx','labels'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
Also, it seems the labels column is not necessary:
from sklearn.datasets import dump_svmlight_file
equi = {0:1, 1:-1, 2:-1,3:1,4:1,5:-1,6:1,7:1}
e = df.rename(index=equi).index
d = df[df.columns.difference(['indx'])]
dump_svmlight_file(d,e,'C:/result/smvlight2.dat')
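To sanity-check the written file, you can read it back with load_svmlight_file; a small sketch (the path is the one from the example above):
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file('C:/result/smvlight2.dat')
print(X.shape, y)  # X is a scipy sparse matrix; y holds the 1/-1 labels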

Complex Excel Formula in Pandas

Excel Formulas I am trying to replicate in pandas:
entsig and exsig are set manually here and can be changed; in real life they would be derived from the value of another column or from a comparison of two other columns.
ent = 1 if the previous entsig = 1 and in = 0
in = 1 if the previous ent = 1, or (the previous in = 1 and ex = 0)
ex = 1 if the previous exsig = 1 and the previous in = 1
So on any given row at most one of ent, in, and ex is 1.
import pandas as pd
df = pd.DataFrame(
[[0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,1,0,0,0], [0,1,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [0,0,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0],
[0,0,0,0,0], [0,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0], [1,0,0,0,0],
[1,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,1,0,0,0]],
columns=['entsig', 'exsig','ent', 'in', 'ex'])
for i in df.index:
    df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)] = 1
    df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)] = 1
    df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))] = 1
for j in df.index:
    df['ent'][df['in'] == 1] = 0
    df['in'][df['ex']==1] = 0
    df['ex'][df['ex'].shift(1)==1] = 0
df
results in
entsig exsig ent in ex
0 0 0 0 0 0
1 1 0 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 0 0 0 1 0
5 0 1 0 1 0
6 0 1 0 0 1
7 1 0 0 0 0
8 1 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 1 0
12 0 1 0 1 0
13 0 1 0 0 1
14 0 1 0 0 0
15 0 0 0 0 0
16 0 0 0 0 0
17 1 0 0 0 0
18 1 0 1 0 0
19 1 0 0 1 0
20 1 1 0 1 0
21 0 1 0 0 1
22 0 1 0 0 0
23 0 1 0 0 0
Question
How can I make this code faster? It runs slowly because of the loops, but I have not been able to come up with a solution that does not use them. Any ideas or comments are appreciated.
If we can assume every group of 1's in entsig is followed by at least one 1 in
exsig, then you could compute ent, ex and in like this:
def ent_in_ex(df):
    entsig_mask = (df['entsig'].diff().shift(1) == 1)
    exsig_mask = (df['exsig'].diff().shift(1) == 1)
    df.loc[entsig_mask, 'ent'] = 1
    df.loc[exsig_mask, 'ex'] = 1
    df['in'] = df['ent'].shift(1).cumsum().subtract(df['ex'].cumsum(), fill_value=0)
    return df
If we can make this assumption, then ent_in_ex is significantly faster:
In [5]: %timeit orig(df)
10 loops, best of 3: 185 ms per loop
In [6]: %timeit ent_in_ex(df)
100 loops, best of 3: 2.23 ms per loop
In [95]: orig(df).equals(ent_in_ex(df))
Out[95]: True
where orig is the original code:
def orig(df):
    for i in df.index:
        df['ent'][(df.entsig.shift(1)==1) & (df['ent'].shift(1) == 0) & (df['in'].shift(1) == 0)] = 1
        df['ex'][(df.exsig.shift(1)==1) & (df['in'].shift(1)==1)] = 1
        df['in'][(df.ent.shift(1)==1) | ((df['in'].shift(1)==1) & (df['ex']==0))] = 1
    for j in df.index:
        df['ent'][df['in'] == 1] = 0
        df['in'][df['ex']==1] = 0
        df['ex'][df['ex'].shift(1)==1] = 0
    return df
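For reference, the diff().shift(1) masks in ent_in_ex mark the row immediately after a 0-to-1 transition, which is exactly what the "previous = 1" rules above need; a minimal sketch of that idea on its own, using a made-up series:
import pandas as pd

sig = pd.Series([0, 1, 1, 0, 1])
# diff() is 1 exactly where sig steps from 0 to 1; shift(1) moves that
# marker onto the following row
mask = sig.diff().shift(1) == 1
print(mask.tolist())  # [False, False, True, False, False]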