Using awk to pull specific lines from a file

Using awk to pull specific lines from a file - awk

I have two files, one file is my data, and the other file is a list of line numbers that I want to extract from my data file. Can I use awk to read in my lines file, and then extract the lines that match the line numbers?
Example:
Data file:
This is the first line of my data
This is the second line of my data
This is the third line of my data
This is the fourth line of my data
This is the fifth line of my data
Line numbers file
1
4
5
Output:
This is the first line of my data
This is the fourth line of my data
This is the fifth line of my data
I've only ever used command line awk and sed for really simple stuff. This is way beyond me and I have been googling for an hour without an answer.

awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
simply referring to an array subscript creates the entry. Looping over the first file, while NR (record number) is equal to FNR (file record number) using the next statement stores all the line numbers in the array. After that when FNR of the second file is present in the array (true) the line is printed (which is the default action for "true").

One way with sed:
sed 's/$/p/' linesfile | sed -n -f - datafile
You can use the same trick with awk:
sed 's/^/NR==/' linesfile | awk -f - datafile
Edit - Huge files alternative
With regards to huge number of lines it is not prudent to keep whole files in memory. The solution in that case can be to sort the numbers-file and read one line at a time. The following has been tested with GNU awk:
extract.awk
BEGIN {
getline n < linesfile
if(length(ERRNO)) {
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
NR == n {
print
if(!(getline n < linesfile)) {
if(length(ERRNO))
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
Run it like this:
awk -v linesfile=$linesfile -f extract.awk infile
Testing:
echo "2
4
7
8
10
13" | awk -v linesfile=/dev/stdin -f extract.awk <(paste <(seq 50e3) <(seq 50e3 | tac))
Output:
2 49999
4 49997
7 49994
8 49993
10 49991
13 49988

Here is an awk example. inputfile is loaded up front, then matching records of datafile are output.
awk \
-v RS="[\r]*[\n]" \
-v FILE="inputfile" \
'BEGIN \
{
LINES = ","
while ((getline Line < FILE))
{
LINES = LINES Line ","
}
}
LINES ~ "," NR "," \
{
print
}
' datafile

I had the same problem. This is the solution already posted by Thor:
cat datafile \
| awk 'BEGIN{getline n<"numbers"} n==NR{print; getline n<"numbers"}'
If like me you don't have a numbers file, but it is instead passed on from stdin and you don't want to generate a temporary numbers file, then this is an alternative solution:
cat numbers \
| awk '{while((getline line<"datafile")>0) {n++; if(n==$0) {print line;next}}}'

This solution...
awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
...only prints unique numbers in the numberfile. What if the numberfile contains repeated entries? Then sed is a better (but much slower) alternative:
sed -nf <(sed 's/.*/&p/' numberfile) datafile

while read line; do echo $(sed -n '$(echo $line)p' Datafile.txt); done < numbersfile.txt

Related

Filtering using awk returns empty files

I have a similar problem to this question: How to do filtering of multiple files in a directory using awk?
The solution in the answers of the question above does not work for me.
I have tab-delimited txt files (all in folder Observation_by_pracid). For each file, I want to create a new file that only contains rows with a specific value in column $9 (medcodeid). The specific values are to be found in medicalcode_list.txt.
There is no error, however it returns only empty files.
Codelist
medcodeid
2576
3199
Format of input files
patid consid ... medcodeid
500470520002 3062539302 ... 2576
951924020002 3062538414 ... 310803013
503478020002 3061587464 ... 257619018
951924020002 3062537807 ... 55627011
503576720002 3062537720 ... 3199
Desired output
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
My code
mkdir HBA1C_observation_bypracid
awk '
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Solution
mkdir HBA1C_observation_bypracid
awk '
BEGIN{ FS=OFS="\t" }
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Adding "BEGIN..." solved my problem.

You can join two files on a column using join.
Files must be sorted on the joined column. To perform a numerical sort on a column, use sort this way, where N is the column number:
sort -kN -n FILE
You also need to get ride of the first line (column names) of each files. You can use tail command the way below, where N is the number of line from which you want to output the content (so 2nd line):
tail -n +N
... But still need to display the column values:
head -n 1 FILE
To join two files f1 and f2, on the fields c1 of f1 and c2 of f2, and output fields y of files x:
join -1 c1 -2 c2 f1 f2 -o "x.y, x.y"
Working sample:
head -n 1 input_file
for input_file in *.txt ; do
join -1 1 -2 9 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9" \
<(tail -n +2 PATH/medicalcode_list.txt | sort -k1 -n) \
<(tail -n +2 "$input_file" | sort -k3 -n)
done
Result (for the input file you gave):
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
Note: the column names arent aligned with the values. Don't know if it's a prerequisite. You can format the display with printf command.

Personally I think it would be simpler to loop over in the shell (understanding that this will reread the code list more than once), with a simpler awk function that you should be able to test and debug. Something like:
for file in *.txt; do
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
PATH/medicalcode_list.txt "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
You should be able to start without the redirection to make sure that for a single file, you get the results printed to the terminal that you were expected. If you don't there might be some incorrect assumption about the files.
Another option would be to write a separate awk script that writes the code to hard-code the list in another awk script. Also gives the advantage to check the contents of the variable mlist.
printf 'BEGIN {\n%s\n}\n $9 in mlist { print }' \
"$(awk '{ print "mlist[" $1 "]" }' PATH/medicalcode_list.txt)" > filter.awk
for file in *.txt; do
awk -f filter.awk "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done

Merge multiple files and split output to multiple files based in each column (post 2)

I have many csv files with exactly same format on rows and columns. In the example below I present only 2 files as input, but i have a lot files with same characteristics
The purpose is for each input file do:
Take value in row 1, 2 and 3.
example in first file
6174
15
3
Then, print first column from row 4 to 6.
Do same process for all input files and output a file with all information of all readed files.
When the process is done for all files and first column. Do the same of the rest columns
At the end the total files output created will be 4 files as there is 4 columns in each file.
Input1
Record Number 6174
Vibrator Identification 15
Start Time Error 3 us
1.6,19.5,,,
1.7,23.2,28.3,27.0
1.8,26.5,27.0,25.4
Input2
Record Number 6176
Vibrator Identification 17
Start Time Error 5 us
1.6,18.6,,,
1.5,23.5,19.7,19.2
1.3,26.8,19.2,18.5
Using the code below, I got the 4 output files as desired, although files 3-4, are not good as spected, because in the first lines there is empty values and my code does not work as supposed. Also I have an issue to get the good value in row 3 in each file.. I get us instead of a number.
output file1
6174,15,3,1.6,1.7,1.8
6176,17,5,1.6,1.5,1.3
output file2
6174,15,3,19.5,23.2,26.5
6176,17,5,18.6,23.5,26.8
output file3
6174,15,3,0,0,28.3,27.0
6176,17,5,0,0,19.7,19.2
output file4
6174,15,3,0,0,27.0,25.4
6176,17,5,0,0,19.2,18.5
code used
The code works almost fine, merge the csv files and output the 4 files requerides, but there is a problem for files 3-4, when there is empty values.
for f in *.csv ; do
awk -F, 'NR==1 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
awk -F, 'NR==2 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
awk -F, 'NR==3 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
sed -i 's/\r$//' a-"$f"
for i in seq $(1...4); do
awk -F, 'NR>=4{f=1} f{print '"$""$i"'} f==6{exit}' "$f" > "a""$i"-"$f"
cat a-"$f" a"$i""-""$f" >> t"$i"
sed -i 's/\r$//' t"$i"
done
for i in seq $(1...4); do
awk -v RS= -v OFS=',' -v ORS='\n' '{$1=$1}1' t"$i" > file"$i".csv
done
done
rm -f ./a* ./t*
Appreciate your help

With GNU awk for ENDFILE and automatic handling of multiple open files and assuming your posted sample output showing file3 and file4 each having more fields than file1 and file2 is a mistake:
$ cat tst.awk
BEGIN { FS=OFS=","; numHdrFlds=3 }
FNR <= numHdrFlds {
gsub(/[^0-9]/,"")
hdr = (FNR==1 ? "" : hdr OFS) $0
next
}
{
for (i=1; i<=NF; i++) {
data[i] = (FNR==(numHdrFlds+1) ? "" : data[i] OFS) ($i)+0
}
}
ENDFILE {
for ( fileNr=1; fileNr<=NF; fileNr++ ) {
print hdr, data[fileNr] > ("outputFile" fileNr)
}
}
.
$ awk -f tst.awk file1 file2
$ for i in outputFile*; do echo "$i"; cat "$i"; echo "---"; done
outputFile1
6174,15,3,1.6,1.7,1.8
6176,17,5,1.6,1.5,1.3
---
outputFile2
6174,15,3,19.5,23.2,26.5
6176,17,5,18.6,23.5,26.8
---
outputFile3
6174,15,3,0,28.3,27
6176,17,5,0,19.7,19.2
---
outputFile4
6174,15,3,0,27,25.4
6176,17,5,0,19.2,18.5
---

awk: print each column of a file into separate files

I have a file with 100 columns of data. I want to print the first column and i-th column in 99 separate files, I am trying to use
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting errors
awk: illegal field $(), name "i"
input record number 1, file input.txt
source line number 1
How to correctly use $i inside the {print }?

Following single awk may help you too here:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > "file"i;close("file"i)}}' Input_file

An all awk solution. First test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for iterates from 2 to the last field. If there are more fields that you desire to process, change the NF to the number you'd like. If, for some reason, a hundred open files would be a problem in your system, you'd need to put the print into a block and add a close call:
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo

If you want to do what you try to accomplish :
for i in {2..99}; do
awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done
Note
the -v switch of awk to pass variables
$x is the nth column defined in your variable x
Note2 : this is not the fastest solution, one awk call is fastest, but I just try to correct your logic. Ideally, take time to understand awk, it's never a wasted time

While using awk showing fatal : cannot open pipe ( Too many open files) error

I was trying to do masking of file with command 'tr' and 'awk' but failing with error fatal: cannot open pipe ( Too many open pipes) error. FILE has approx 1000000 records quite a huge number.
Below is the code I am trying :-
awk - F "|" - v OFS="|" '{ "echo \""$1"\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\"" | get line $1}1' FILE.CSV > test.CSV
It is showing error :-
awk: (FILENAME=- FNR=1019) fatal: cannot open pipe `echo ""TTP_123"" | tr "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" "QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq"' (Too many open pipes)
Please let me know what I am doing wrong here
Also a Note any number of columns could be used for masking and can be at any positions in this example I have taken 1 and 2 column positions but it could be 3 and 10 or 5,7,25 columns
Thanks
AJ

First things first, you can't have a space between - and F or v.
I was going to suggest sed, but as you only want to translate the first column, that's not as easy.
Unfortunately, awk doesn't have built-in tr functionality, so you'd have to use the shell like you are and just close the pipe:
awk -F "|" -v OFS="|" '{
command="echo \"\\"$1"\\\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\""
command | getline $1
close(command)
}1' FILE.CSV > test.CSV
However, I suggest using perl, which can do field splitting and character translation:
perl -F'\|' -lane '$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/; print join("|", #F)' FILE.CSV > test.CSV
Or, for a shorter command line, just put the program into a file, drop the e in -lane and use the file name instead of the '...' command.

you can do the mapping in awk instead of making a system call for each line, or perhaps simply
paste -d'|' <(cut -d'|' -f1 file | tr '0-9' 'a-z') <(cut -d'|' -f2- file)
replace the tr arguments with yours.

This does not answer your question, but you can implement tr as an awk function that would save having to spawn lots of external processes
$ cat tr.awk
function tr(str, from, to, s,i,c,idx) {
s = ""
for (i=1; i<=length($str); i++) {
c = substr(str, i, 1)
idx = index(from, c)
s = s (idx == 0 ? c : substr(to, idx, 1))
}
return s
}
{
print $1, tr($1,
" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
}
Example:
$ printf "%s\n" hello wor-ld | awk -f tr.awk
hello KGCCN
wor-ld 3N8-CF

How to do calculations over lines of a file in awk

I've got a file that looks like this:
88.3055
45.1482
37.7202
37.4035
53.777
What I have to do is isolate the value from the first line and divide it by the values of the other lines (it's a speedup calculation). I thought of maybe storing the first line in a variable (using NR) and then iterate over the other lines to obtain the values from the divisions. Desired output is:
1,9559
2,3410
2,3608
1,6420
UPDATE
Sorry Ed, my mistake, the desired decimal point is .
I made some small changes to Ed's answer so that awk prints the division of 88.3055 by itself and outputs it to a file speedup.dat:
awk 'NR==1{n=$0} {print n/$0}' tavg.dat > speedup.dat
Is it possible to combine the contents of speedup.dat and the results from another awk command without using intermediate files and in one single awk command?
First command:
awk 'BEGIN { FS = \"[ \\t]*=[ \\t]*\" } /Total processes/ { if (! CP) CP = $2 } END {print CP}' cg.B.".n.".log ".(n == 1 ? ">" : ">>")." processes.dat
This first command outputs:
1
2
4
8
16
Paste of the two files:
paste processes.dat speedup.dat > prsp.dat
which gives the now desired output:
1 1
2 1.9559
4 2.34107
8 2.36089
16 1.64207

$ awk 'NR==1{n=$0;next} {print n/$0}' file
1.9559
2.34107
2.36089
1.64207
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", n/$0}' file
1.9559
2.3411
2.3609
1.6421
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", int(n*10000/$0)/10000}' file
1.9559
2.3410
2.3608
1.6420
$ awk 'NR==1{n=$0;next} {x=sprintf("%.4f",int(n*10000/$0)/10000); sub(/\./,",",x); print x}' file
1,9559
2,3410
2,3608
1,6420
Normally you'd just use the correct locale to have . or , as your decimal point but your input uses . while your output uses , so I don't think that's an option.

awk '{if(n=="") n=$1; else print n/$1}' inputFile

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Using awk to pull specific lines from a file - awk

Here is an awk example. inputfile is loaded up front, then matching records of datafile are output. awk \ -v RS="[\r]*[\n]" \ -v FILE="inputfile" \ 'BEGIN \ { LINES = "," while ((getline Line < FILE)) { LINES = LINES Line "," } } LINES ~ "," NR "," \ { print } ' datafile

This solution... awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile ...only prints unique numbers in the numberfile. What if the numberfile contains repeated entries? Then sed is a better (but much slower) alternative: sed -nf <(sed 's/.*/&p/' numberfile) datafile

while read line; do echo $(sed -n '$(echo $line)p' Datafile.txt); done < numbersfile.txt

Related

Filtering using awk returns empty files

Merge multiple files and split output to multiple files based in each column (post 2)

awk: print each column of a file into separate files

While using awk showing fatal : cannot open pipe ( Too many open files) error

How to do calculations over lines of a file in awk

Categories

Resources