restrict pattern to specified strings - awk

I have a set of strings in a file (list.txt). Let's say they are:
1abc_A
2pqr_X
4ghi_Z
I also have a text file (test.txt), which looks like this:
1abc_A 2pqr_X 0.55 0.87
2pqr_X 3def_Y 0.21 0.24
4ghi_Z 1abc_A 0.98 0.75
2pqr_X 4ghi_Z 0.99 0.76
2pqr_X 2pqr_X 1.00 1.00
I need to keep only those lines of test.txt in which the strings in columns 1 and 2 both appear in list.txt.
In this case, my output would be as follows:
1abc_A 2pqr_X 0.55 0.87
4ghi_Z 1abc_A 0.98 0.75
2pqr_X 4ghi_Z 0.99 0.76
2pqr_X 2pqr_X 1.00 1.00
That is, all the lines in test.txt except the 2nd, because column 2 of that line, 3def_Y, is not among the strings listed in list.txt.
How can I do this in awk?
Please note that test.txt is a large text file, of almost 7GB.
What is the fastest way to go about this problem?
Please help.

awk 'NR==FNR{a[$0];next} ($1 in a) && ($2 in a)' list.txt test.txt
This stores each line of list.txt as an index of the array a, then checks, line by line, that the 1st and 2nd fields of test.txt are both indices of that array. It works for any size of test.txt because none of test.txt is held in memory; only the contents of list.txt are.
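For comparison, the same two-pass idea can be sketched in Python. This is only an illustration, not the answer's method: the function name filter_rows is hypothetical, and the sample lines are copied from the question. The allowed strings go into a set, and the data lines are streamed one at a time, so memory use depends on list.txt only.

```python
def filter_rows(allowed_lines, data_lines):
    """Yield data lines whose first two fields are both in the allowed set."""
    # Pass 1: load the allowed strings (the role of list.txt) into a set.
    allowed = {line.strip() for line in allowed_lines if line.strip()}
    # Pass 2: stream the data (the role of test.txt) without storing it.
    for line in data_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in allowed and fields[1] in allowed:
            yield line.rstrip("\n")


# Sample data from the question; with real files, pass open file objects instead.
allowed = ["1abc_A\n", "2pqr_X\n", "4ghi_Z\n"]
rows = [
    "1abc_A 2pqr_X 0.55 0.87\n",
    "2pqr_X 3def_Y 0.21 0.24\n",
    "4ghi_Z 1abc_A 0.98 0.75\n",
    "2pqr_X 4ghi_Z 0.99 0.76\n",
    "2pqr_X 2pqr_X 1.00 1.00\n",
]
for row in filter_rows(allowed, rows):
    print(row)
```

For a 7GB input the awk one-liner is likely to be the faster choice; the sketch is only meant to make the logic explicit.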

Related

equalize length of the temp data

I have two text files (file1.txt, file2.txt) that contain a timestamp (Julian day) in the first column and temperature data in the second column. Based on the timestamps in file1.txt, I have to extend file2.txt by filling in zeros for the missing days, so that file2.txt ends up the same length as file1.txt.
Input data
cat file1.txt
023 4.5
024 6.8
025 9.8
030 2.3
125 1.4
129 5.8
168 1.0
cat file2.txt
024 1.2
025 2.3
125 1.6
output
023 0.0
024 1.2
025 2.3
030 0.0
125 1.6
129 0.0
168 0.0
In my code I am unable to write the main part that does the actual work.
I tried
import numpy as np
import pandas as pd
data1=np.loadtxt("data1.txt")
data2=np.loadtxt("data2.txt")
if data1==data2:
    print('same length data')
else:
    ............
You can try this in awk:
awk 'FNR==NR{
f2[$1]=$0
next
}
$1 in f2{print f2[$1]; next}
{printf("%s%s%1.1f\n", $1, OFS, 0.0)}
' file2 file1
Or this in Python:
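fn1, fn2 = 'file1.txt', 'file2.txt'  # input file names from the question

# Index file2 by its first field (the Julian day)
f2_data={}
with open(fn2) as f2:
    for line in f2:
        line=line.strip()
        field1, field2=line.split()
        f2_data[field1]=line

# Walk file1 in order: print the file2 line if present, else a 0.0 filler
with open(fn1) as f1:
    for line in f1:
        field1, field2=line.strip().split()
        if field1 in f2_data:
            print(f2_data[field1])
        else:
            print(field1, '0.0')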
Either prints:
023 0.0
024 1.2
025 2.3
030 0.0
125 1.6
129 0.0
168 0.0
In both cases the strategy is the same:
Build an index of file2 first, keyed by Julian day, so that its lines can be looked up;
For each line of file1, print the matching file2 line, or the Julian day and 0.0 if that day is not in file2.

How to save different file from one file using value in specific column using bash

I want to split a file by the value in column $1: lines with the same value should be saved in one file, and each different value should go into its own new file.
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
115 15.7 110.8 2.06 0.72
118 11.9 111.01 -1.04 0.98
what i want is
file1=p004112.txt
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
file2=p004113.txt
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
file3=p004115.txt
115 15.7 110.8 2.06 0.72
file4=p004118.txt
118 11.9 111.01 -1.04 0.98
The files that have to be split like this are named p004.txt, p005.txt.
I have tried this:
for i in `ls p????.txt|sed "s/.txt//g"`;do awk '{file=${i}$1".txt" print >> file}' ${i}.txt;done
But it doesn't work. Does anyone have a solution to this problem?
Thank you
With your shown samples, please try the following. Your samples appear to be sorted on the 1st column, but I have still used sort in case the real file is not. If your whole file is already sorted on the 1st column, remove the sort command and pass Input_file as the last argument of the awk program instead.
sort -k1 Input_file | awk 'prev!=$1{close(outputFile);outputFile=("p004"$1".txt")} {print > (outputFile);prev=$1}'
OR a non-oneliner form of above solution:
sort -k1 Input_file |
awk '
prev!=$1{
close(outputFile)
outputFile=("p004"$1".txt")
}
{
print > (outputFile)
prev=$1
}
'
Explanation: Input_file is first sorted on the 1st column and the result is piped to awk. In the awk program, whenever the 1st column differs from the previous line's value, the current output file is closed (to avoid a "too many open files" error) and outputFile is set to p004 followed by the 1st column value and .txt, as the OP needs. Each line is then printed to the output file, and prev is set to that line's 1st column value.
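A rough Python sketch of the same split is given here for comparison. It is an illustration under assumptions, not the answer's method: the function name split_by_first_column and its prefix and out_dir parameters are hypothetical. Unlike the awk version it does not need sorted input, because it keeps one open handle per distinct key; with a very large number of distinct keys that could hit the open-file limit that the awk version avoids by closing files.

```python
import os


def split_by_first_column(lines, prefix, out_dir="."):
    """Write each line to <prefix><first field>.txt; return the keys seen, sorted."""
    handles = {}
    try:
        for line in lines:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            key = fields[0]
            if key not in handles:
                # First time this key appears: open its output file.
                path = os.path.join(out_dir, f"{prefix}{key}.txt")
                handles[key] = open(path, "w")
            handles[key].write(line if line.endswith("\n") else line + "\n")
    finally:
        # Close every output file, even if an error occurred midway.
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Possible usage, assuming p004.txt exists in the current directory:
# with open("p004.txt") as source:
#     split_by_first_column(source, "p004")
```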

Comparing two columns in two files using awk with duplicates

File 1
A4gnt 0 0 0 0.3343
Aaas 2.79 2.54 1.098 0.1456
Aacs 0.94 0.88 1.063 0.6997
Aadac 0 0 0 0.3343
Aadacl2 0 0 0 0.3343
Aadat 0.01 0 1.723 0.7222
Aadat 0.06 0.03 1.585 0.2233
Aaed1 0.28 0.24 1.14 0.5337
Aaed1 1.24 1.27 0.976 0.9271
Aaed1 15.91 13.54 1.175 0.163
Aagab 1.46 1.14 1.285 0.3751
Aagab 6.12 6.3 0.972 0.6569
Aak1 0.02 0.01 1.716 0.528
Aak1 0.1 0.19 0.561 0.159
Aak1 0.14 0.19 0.756 0.5297
Aak1 0.16 0.18 0.907 0.6726
Aak1 0.21 0 0 0.066
Aak1 0.26 0.27 0.967 0.9657
Aak1 0.54 1.65 0.325 0.001
Aamdc 0.04 0 15.461 0.0875
Aamdc 1.03 1.01 1.019 0.8817
Aamdc 1.27 1.26 1.01 0.9285
Aamdc 7.21 6.94 1.039 0.7611
Aamp 0.06 0.05 1.056 0.9136
Aamp 0.11 0.11 1.044 0.9227
Aamp 0.12 0.13 0.875 0.7584
Aamp 0.22 0.2 1.072 0.7609
File 2
Adar
Ak3
Alox15b
Ampd2
Ampd3
Ankrd17
Apaf1
Aplp1
Arih1
Atg14
Aurkb
Bcl2l14
Bmp2
Brms1l
Cactin
Camta2
Cav1
Ccr5
Chfr
Clock
Cnot1
Crebrf
Crtc3
Csnk2b
Cul3
Cx3cl1
Dnaja3
Dnmt1
Dtl
Ednra
Eef1e1
Esr1
Ezr
Fam162a
Fas
Fbxo30
Fgr
Flcn
Foxp3
Frzb
Fzd6
Gdf3
Hey2
Hnf4
Wherever the first column matches between the two files, the desired output is all the columns of the matching line in the first file (including duplicates).
I've tried
awk 'NR==FNR{a[$1]=$2"\t"$3"\t"$4"\t"$5; next} { if($1 in a) { print $0,a[$1] } }' File2 File1 > output
But for some reason I'm getting just a few hits. Does anyone know why?
Read the second file first and store its 1st-column values as keys of the array arr. Then read the first file: if the 1st column of a line exists as a key in arr (built from file2), print that line of file1.
awk 'FNR==NR{arr[$1];next}$1 in arr' file2 file1
Advantage:
if you use a[$1]=$2"\t"$3"\t"$4"\t"$5; next, any later line with the same key overwrites the previously stored value,
but if you use arr[$1];next, only the unique key is stored, and $1 in arr is tested against every line of file1, so duplicate records are all printed
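The key-only lookup can also be sketched in Python to show why duplicates survive. This is an illustration only: the function name rows_with_known_keys is hypothetical, and the keys Aak1 and Aamp are made up for the demo, since the File 2 sample shown in the question shares no names with the File 1 sample.

```python
def rows_with_known_keys(key_lines, data_lines):
    """Return every data line whose first field appears in key_lines."""
    # Only the keys are stored, so nothing can be overwritten.
    keys = {line.split()[0] for line in key_lines if line.split()}
    matches = []
    for line in data_lines:
        fields = line.split()
        # Each data line is tested on its own, so repeated keys all pass.
        if fields and fields[0] in keys:
            matches.append(line.rstrip("\n"))
    return matches


# Hypothetical key list; data lines copied from the File 1 sample.
keys = ["Aak1\n", "Aamp\n"]
data = [
    "Aaed1 0.28 0.24 1.14 0.5337\n",
    "Aak1 0.02 0.01 1.716 0.528\n",
    "Aak1 0.1 0.19 0.561 0.159\n",
    "Aamp 0.06 0.05 1.056 0.9136\n",
]
for row in rows_with_known_keys(keys, data):
    print(row)
```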

Searching File2 with 3 columns from File 1 with awk

Does anyone know how to print "Not Found" if there is no match, such that the print output will always contain the same number of lines as File 1?
To be more specific, I have two files with four columns:
File 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
File 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 803 804 0.24
This is what I am using now:
$ awk 'NR==FNR{a[$1,$2,$3];next}($1,$2,$3) in a{print $4}' file1.txt file2.txt
This generates the following output:
0.55
0.09
0.88
However, I want to get this:
0.55
0.09
0.88
Not Found
Could you please help?
Sorry if this is presented in a confusing manner; I have little experience with awk and am confused myself.
In a separate issue, I want to end up having a file that has the data from File 2 added on to File1, like so:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 Not Found
I was going to generate the file as before (let's call it file2-matches.txt), then use the paste command:
paste -d"\t" file1.txt file2-matches.txt > output.txt
But considering I have to do this matching for over 100 files, is there any more efficient way to do this that you can suggest?
Add an else clause:
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
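The second part of the question, appending File 2's value (or "Not Found") to each line of File 1 in one step, is not covered above. One possible approach is sketched here in Python, under assumptions: the function name append_matches and its missing and sep parameters are hypothetical, and the first three columns are taken as the key. File 2 is indexed first, then File 1 is walked in order, so the output always has one line per File 1 line and no paste step is needed.

```python
def append_matches(file1_lines, file2_lines, missing="Not Found", sep=" "):
    """Append file2's 4th column to each file1 line, keyed on columns 1-3."""
    # Index file2: (col1, col2, col3) -> col4
    lookup = {}
    for line in file2_lines:
        fields = line.split()
        if len(fields) >= 4:
            lookup[tuple(fields[:3])] = fields[3]

    # Walk file1 in order so the output has one line per file1 line.
    merged = []
    for line in file1_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        value = lookup.get(tuple(fields[:3]), missing)
        merged.append(line.rstrip("\n") + sep + value)
    return merged


# Sample data from the question.
file1 = ["1 800 800 0.51\n", "2 801 801 0.01\n", "3 802 802 0.01\n", "4 803 803 0.23\n"]
file2 = ["1 800 800 0.55\n", "2 801 801 0.09\n", "3 802 802 0.88\n", "4 803 804 0.24\n"]
for row in append_matches(file1, file2):
    print(row)
```

For the 100-file case, the same function could be called once per pair of files in a loop.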

awk: divide odd columns by following even column

I want to divide all the odd columns in a file by the next even column, e.g. column1/column2, column3/column4,......, columnN/columnN+1
test1.txt
1 4 1 2 1 3
1 2 4 2 3 9
desired output
0.25 0.5 0.333
0.5 2 0.333
I tried this:
awk 'BEGIN{OFS="\t"} { for (i=2; i<NF+2; i+=2) printf $(i-1)/i OFS; printf "\n"}'
but it doesn't work.
I would like to add that my actual files have a very large and variable (but always even) number of columns and I would like something that would work on all of them.
awk '{for(i=1;i<NF;i+=2)printf "%f%s",$i/$(i+1),OFS;print "";}' input.txt
Output:
0.250000 0.500000 0.333333
0.500000 2.000000 0.333333
You can adjust the printf format to your needs; for example, "%.3g" prints three significant digits and drops trailing zeros.
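The pairwise division can be sketched in Python as well, as a cross-check of the loop bounds. The function names pair_ratios and format_row are hypothetical, and the sketch assumes an even number of numeric columns and no zero divisors (a zero in an even column would raise ZeroDivisionError).

```python
def pair_ratios(line):
    """Return column1/column2, column3/column4, ... for one line of numbers."""
    values = [float(token) for token in line.split()]
    # Step through 0-based indices 0, 2, 4, ... and divide by the next value.
    return [values[i] / values[i + 1] for i in range(0, len(values) - 1, 2)]


def format_row(line, sep="\t"):
    """Format the ratios with three significant digits, tab-separated by default."""
    return sep.join(f"{ratio:.3g}" for ratio in pair_ratios(line))


# Sample rows from test1.txt in the question.
for row in ["1 4 1 2 1 3", "1 2 4 2 3 9"]:
    print(format_row(row))
```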