Extracting Sequential Pattern - awk

Can anyone help me write a script to extract sequential lines?
I was able to find a working script that creates all the permutations of the given inputs, but that's not what I need.
awk 'function perm(p,s, i) {
         for(i=1;i<=n;i++)
             if(p==1)
                 printf "%s%s\n",s,A[i]
             else
                 perm(p-1,s A[i]", ")
     }
     {
         A[++n]=$1
     }
     END{
         perm(n)
     }' infile
Unfortunately, I don't understand the script well enough to make the modification myself (not for lack of trying).
I need to extract all runs of 2 to 5 sequential lines/word patterns.
An illustration of what I need is as follows:
Eg.
inputfile.txt:
A
B
C
D
E
F
G
outputfile.txt:
A B
B C
C D
D E
E F
F G
A B C
B C D
C D E
D E F
E F G
A B C D
B C D E
C D E F
D E F G
A B C D E
B C D E F
C D E F G

Here's a Python answer.
General algorithm:
Load all letters into a list
For n = 2..5, where n is the size of the "window", slide that window over the list and print those n characters.
Python is nice for this because of list slicing.
with open('input.txt') as f_in, open('output.txt', 'w') as f_out:
    chars = f_in.read().splitlines()
    for n in range(2, 6):                                  # window sizes 2..5
        for start_window in range(len(chars) - n + 1):     # slide the window
            f_out.write(' '.join(chars[start_window:start_window + n]))
            f_out.write('\n')

awk to the rescue!
$ awk 'BEGIN  {n=1}                              # window size; grows on every pass
       FNR==1 {n++}                              # a new pass over the file begins
              {a[c++]=$0; c=c%n}                 # ring buffer of the last n lines
       FNR>n-1{for(i=c;i<c+n-1;i++) printf "%s ",a[i%n];   # previous n-1 lines...
               print}' file{,,,}                 # ...then the current line
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
This makes multiple passes over the input file (one per extra comma in file{,,,}, so four passes here, one per window size). seq 9 was used as the input file, which is why the output shows the numbers 1-9.
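If the modular indexing above is hard to follow, here is a minimal Python sketch of the same ring-buffer idea (an illustration only, not part of the awk answer; the hard-coded list stands in for the input file):

# Minimal sketch of the ring-buffer idea used by the awk answer above.
lines = ['A', 'B', 'C', 'D', 'E', 'F', 'G']    # stands in for the input file

for n in range(2, 6):                          # one "pass" per window size 2..5
    buf = [None] * n                           # circular buffer of the last n lines
    c = 0                                      # next slot to overwrite
    for count, line in enumerate(lines, start=1):
        buf[c] = line
        c = (c + 1) % n
        if count >= n:                         # buffer is full: emit one window
            print(' '.join(buf[(c + k) % n] for k in range(n)))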

Another in awk:
{ a[NR]=$0 }                                   # buffer the whole file
END {
    o[0]=ORS
    for(i=2;i<=5;i++)                          # window sizes 2..5
        for(j=1;j<=length(a);j++) {
            # print a[j]; after the i:th item print ORS and reset k, otherwise OFS
            printf "%s%s", a[j], (++k==i?o[k=0]:OFS)
            if(!k&&j!=length(a)) j-=(i-1)      # rewind to start the next window
        }
}

Related

How to obtain the max value of the previous n periods for every row based on the values of another column in a dataframe?

Let's say I have a df:
idx c1
A 1
B 7
C 8
D 6
E 5
F 6
G 9
H 8
I 0
J 10
What's the fastest way to obtain the highest value of the previous n periods for every row based on c1, and then create a new column for it? E.g., if it's a 3-period window, then it will look like this:
idx c1 new_col
A 1 0
B 7 0
C 8 0
D 6 8 (prev. 3 periods: 1,7,8; 8 is the highest)
E 5 8 (prev. 3 periods: 7,8,6; 8 is the highest)
F 6 8 (prev. 3 periods: 8,6,5; 8 is the highest)
G 9 6 (prev. 3 periods: 6,5,6; 6 is the highest)
H 8 9 (prev. 3 periods: 5,6,9; 9 is the highest)
I 0 9 (prev. 3 periods: 6,9,8; 9 is the highest)
J 10 9 (prev. 3 periods: 9,8,0; 9 is the highest)
My current code is now:
list = []
for row in range(len(df)):
    if row < 3:
        list.append(0)
    else:
        list.append(max(c1[row - 3:row]))
df['new_col'] = list
This method is very slow because I have many rows and this has to loop through the whole thing. Is there a faster way to do it? Thanks.
This is just rolling and shift:
df['new_col'] = df['c1'].rolling(3).max().shift().fillna(0)
Output:
idx c1 new_col
0 A 1 0.0
1 B 7 0.0
2 C 8 0.0
3 D 6 8.0
4 E 5 8.0
5 F 6 8.0
6 G 9 6.0
7 H 8 9.0
8 I 0 9.0
9 J 10 9.0
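
For completeness, here is a minimal self-contained sketch of the rolling/shift approach above (the DataFrame construction is just an assumption matching the sample data):

import pandas as pd

# Rebuild the sample frame from the question (assumed layout).
df = pd.DataFrame({
    'idx': list('ABCDEFGHIJ'),
    'c1':  [1, 7, 8, 6, 5, 6, 9, 8, 0, 10],
})

# Max over a rolling window of 3, shifted down by 1 so the current row is
# excluded, with the first rows filled with 0.
df['new_col'] = df['c1'].rolling(3).max().shift().fillna(0)
print(df)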

Merging multiple files with null values using AWK

Sorry, I am posting this again as I messed up my earlier post.
I am interested in joining multiple files (e.g., file1, file2, file3, ...) on matching values in column 1 to get the desired output below. I would appreciate any help:
file1:
A 2 3 4
B 3 7 8
C 4 6 9
file2:
A 7 6 3
C 2 4 7
D 1 6 4
file3:
A 3 2 7
B 4 7 3
E 3 6 8
Output:
A 2 3 4 7 6 3 3 2 7
B 3 7 8 n n n 4 7 3
C 4 6 9 2 4 7 n n n
D n n n 1 6 4 n n n
E n n n n n n 3 6 8
Here is one for awk. Tested with GNU awk, mawk, original-awk (i.e., awk 20121220) and BusyBox awk:
$ awk '
function nn(c, b,i) {                # return a string of c "n " placeholders
    if(c)
        for(i=1;i<=c;i++)
            b=b "n "
    return b
}
FNR==1 { nf+=(NF-1) }                # total number of data columns seen so far
{
    for(i=2;i<=NF;i++)               # collect this record's data columns
        b[$1]=b[$1] $i OFS
    # pad with "n"s for any earlier files this key was missing from
    a[$1]=a[$1] (n[$1]<(nf-NF+1)?nn(nf-NF+1-n[$1]):"") b[$1]
    n[$1]=nf+0
    delete b[$1]
}
END {
    for(i in a)                      # pad the tail and print
        print i,a[i] (n[i]<(nf)?nn(nf-n[i]):"")
}' file1 file2 file3
Output:
A 2 3 4 7 6 3 3 2 7
B 3 7 8 n n n 4 7 3
C 4 6 9 2 4 7 n n n
D n n n 1 6 4 n n n
E n n n n n n 3 6 8
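
If you prefer a non-awk route, here is a minimal Python sketch of the same outer-join-with-"n"-padding idea (the filenames and the whitespace-separated format are assumptions based on the sample data):

# Minimal sketch: outer join on column 1, padding missing files with "n".
files = ['file1', 'file2', 'file3']          # assumed filenames

rows = {}      # key -> one list of data fields per file (None if the key is absent)
widths = []    # number of data columns contributed by each file

for idx, name in enumerate(files):
    width = 0
    with open(name) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            key, values = fields[0], fields[1:]
            width = max(width, len(values))
            rows.setdefault(key, [None] * len(files))[idx] = values
    widths.append(width)

for key in sorted(rows):
    out = [key]
    for idx, values in enumerate(rows[key]):
        out.extend(values if values is not None else ['n'] * widths[idx])
    print(' '.join(out))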

pandas groupby apply optimizing a loop

For the following data:
index bond stock investor_bond investor_stock
0 1 2 A B
1 1 2 A E
2 1 2 A F
3 1 2 B B
4 1 2 B E
5 1 2 B F
6 1 3 A A
7 1 3 A E
8 1 3 A G
9 1 3 B A
10 1 3 B E
11 1 3 B G
12 2 4 C F
13 2 4 C A
14 2 4 C C
15 2 5 B E
16 2 5 B B
17 2 5 B H
Bond 1 has two investors, A and B. Stock 2 has three investors, B, E, and F. For each investor pair (investor_bond, investor_stock), we want to filter the row out if the two investors have ever invested in the same bond/stock.
For example, the pair (B,F) at index=5 should be filtered out because both of them invested in stock 2.
The sample output should look like this:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list = A1.groupby(['bond', 'stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
stock_list = stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list = stock_list.rename(columns={0: 's'})
A1 = pd.merge(A1, stock_list, on='bond', how='left')

A1['in_out'] = False
for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
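One possible way to speed this up (a sketch only, not a tested drop-in; it assumes the frame is called df with the columns shown above, and that bond ids and stock ids are separate namespaces) is to precompute, for every investor, the set of assets they have ever invested in, and then keep only rows whose two investors' sets do not overlap:

import pandas as pd

# Assets each investor has ever invested in, tagged by type so that
# bond 2 and stock 2 are treated as different assets (an assumption).
bond_assets = df.groupby('investor_bond')['bond'].apply(set)
stock_assets = df.groupby('investor_stock')['stock'].apply(set)

def assets(investor):
    tagged = {('bond', b) for b in bond_assets.get(investor, set())}
    tagged |= {('stock', s) for s in stock_assets.get(investor, set())}
    return tagged

# Keep a row only if its two investors never shared an asset.
keep = [
    assets(ib).isdisjoint(assets(st))
    for ib, st in zip(df['investor_bond'], df['investor_stock'])
]
result = df[keep]
print(result)

On the sample data this keeps only row 11, matching the expected output; the per-row work is reduced to two set lookups instead of a scan over the whole frame.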

Join information from table A with the information from another table B multiple times

So this may be a simple question, but I would like to learn whether this can be done in one query.
Table A: contains gene information
gene start end
1 a 5 0
2 b 6 1
3 c 7 2
4 d 8 3
5 e 9 4
6 f 10 5
7 g 11 6
8 h 12 7
9 i 13 8
10 j 14 9
Table B: contains calculated gene information.
gene1 gene2 cor
1 d j -0.7600805
2 c i 0.4274278
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The result table I'm trying to get is:
gene1 gene2 cor start1 end1 start2 end2
1 d j -0.7600805 8 3 14 9
2 c i 0.4274278 7 2 13 8
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553 etc.
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The method I can think of is to join table A onto table B twice, first by gene1 and then by gene2, which would require an intermediate table. Is there any simpler way to achieve this in one step?
Yes, two joins will do it.
You simply need to do this:
SELECT b.Gene1
      ,b.Gene2
      ,b.cor
      ,a1.Start AS Start1
      ,a1.End   AS End1
      ,a2.Start AS Start2
      ,a2.End   AS End2
FROM TableB b
INNER JOIN TableA a1
    ON a1.Gene = b.Gene1
INNER JOIN TableA a2
    ON a2.Gene = b.Gene2
Depending on your DBMS you may need to tweak the syntax a bit (for example, End may be a reserved word that needs quoting).

selecting highest value in a table

I have a tab-separated file with 3 columns. I want to find the highest value in the 3rd column among rows that share the same name in the 1st column, and print it as a 4th column.
input file
A l 10
A l 2
A m 6
B l 12
B m 13
B n 7
C l 9
C l 8
C n 19
Output file
A l 10 10
A l 2 10
A m 6 10
B l 12 13
B m 13 13
B n 7 13
C l 9 19
C l 8 19
C n 19 19
Could you please suggest an awk or sed command? Thanks.
You can use this awk:
awk 'FNR==NR {arr[$1]=arr[$1]>$3?arr[$1]:$3;next} {print $0,arr[$1]}' OFS="\t" file{,}
A l 10 10
A l 2 10
A m 6 10
B l 12 13
B m 13 13
B n 7 13
C l 9 19
C l 8 19
C n 19 19
This makes two passes over the file: the first pass finds the highest value per name, the second prints each line with that value appended.
The file{,} brace expansion duplicates the filename; you can also write file file instead.
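
For comparison, here is a minimal Python sketch of the same two-pass idea (the filename input.txt is a placeholder; tab-separated columns are assumed, as in the question):

# Minimal sketch of the same two-pass approach in Python.
highest = {}

with open('input.txt') as fh:            # pass 1: highest value per name
    for line in fh:
        name, _, value = line.rstrip('\n').split('\t')
        highest[name] = max(highest.get(name, float('-inf')), float(value))

with open('input.txt') as fh:            # pass 2: append it to every row
    for line in fh:
        row = line.rstrip('\n')
        name = row.split('\t')[0]
        print(f"{row}\t{highest[name]:g}")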