Filtering using awk returns empty files - awk

I have a similar problem to this question: How to do filtering of multiple files in a directory using awk?
The solution in the answers to that question does not work for me.
I have tab-delimited txt files (all in the folder Observation_by_pracid). For each file, I want to create a new file that only contains the rows whose value in column $9 (medcodeid) matches one of the values listed in medicalcode_list.txt.
There is no error; however, it produces only empty files.
Codelist
medcodeid
2576
3199
Format of input files
patid consid ... medcodeid
500470520002 3062539302 ... 2576
951924020002 3062538414 ... 310803013
503478020002 3061587464 ... 257619018
951924020002 3062537807 ... 55627011
503576720002 3062537720 ... 3199
Desired output
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
My code
mkdir HBA1C_observation_bypracid
awk '
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Solution
mkdir HBA1C_observation_bypracid
awk '
BEGIN{ FS=OFS="\t" }
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Adding "BEGIN..." solved my problem.

You can join two files on a column using join.
Files must be sorted on the joined column. To perform a numerical sort on a column, use sort this way, where N is the column number:
sort -kN -n FILE
You also need to get rid of the first line (the column names) of each file. You can use the tail command as shown below, where N is the line number from which you want to start output (so 2 to start from the 2nd line):
tail -n +N
... but you still need to display the header line with the column names:
head -n 1 FILE
To join two files f1 and f2 on field c1 of f1 and field c2 of f2, and output field y of file x for each x.y listed:
join -1 c1 -2 c2 f1 f2 -o "x.y, x.y"
Working sample:
for input_file in *.txt ; do
    head -n 1 "$input_file"
    join -1 1 -2 9 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9" \
        <(tail -n +2 PATH/medicalcode_list.txt | sort -k1 -n) \
        <(tail -n +2 "$input_file" | sort -k9 -n)
done
Result (for the input file you gave):
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
Note: the column names aren't aligned with the values; I don't know if that's a requirement. You can format the display with the printf command.
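For example, piping the output through column (an alternative to hand-rolled printf formatting, assuming the util-linux column tool is available) lines the fields up:
$ printf 'patid consid medcodeid\n500470520002 3062539302 2576\n' | column -t
patid         consid      medcodeid
500470520002  3062539302  2576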

Personally I think it would be simpler to loop over the files in the shell (understanding that this will reread the code list more than once), with a simpler awk script that you should be able to test and debug. Something like:
for file in *.txt; do
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
PATH/medicalcode_list.txt "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
You should be able to start without the redirection to make sure that, for a single file, you get the results printed to the terminal that you expected. If you don't, there may be some incorrect assumption about the files.
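For example, a first test on a single file might look like this (Observation_001.txt is a hypothetical stand-in for one of your real input files; the FS="\t" is needed for the same reason as in the accepted fix above):
awk 'BEGIN { FS="\t" } FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
    PATH/medicalcode_list.txt Observation_001.txt | head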
Another option would be to write a separate awk script that writes the code to hard-code the list in a second awk script. That also gives you the advantage of being able to check the contents of the mlist array.
printf 'BEGIN {\n%s\n}\n $9 in mlist { print }' \
"$(awk '{ print "mlist[" $1 "]" }' PATH/medicalcode_list.txt)" > filter.awk
for file in *.txt; do
awk -f filter.awk "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
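For the code list shown in the question, the generated filter.awk would look something like this (note that the header line sneaks in as mlist[medcodeid], a reference through an uninitialized variable; it is harmless here, but could be avoided by adding FNR > 1 to the generating awk):
BEGIN {
mlist[medcodeid]
mlist[2576]
mlist[3199]
}
 $9 in mlist { print }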

Related

Merge multiple files and split output to multiple files based in each column (post 2)

I have many csv files with exactly the same format of rows and columns. In the example below I present only 2 files as input, but I have a lot of files with the same characteristics.
The purpose is, for each input file, to:
Take the values in rows 1, 2 and 3.
For example, in the first file:
6174
15
3
Then print the first column from rows 4 to 6.
Do the same process for all input files, and output one file with the information of all the files read.
When the process is done for all files on the first column, do the same for the rest of the columns.
At the end, 4 output files will have been created in total, as there are 4 columns in each file.
Input1
Record Number 6174
Vibrator Identification 15
Start Time Error 3 us
1.6,19.5,,,
1.7,23.2,28.3,27.0
1.8,26.5,27.0,25.4
Input2
Record Number 6176
Vibrator Identification 17
Start Time Error 5 us
1.6,18.6,,,
1.5,23.5,19.7,19.2
1.3,26.8,19.2,18.5
Using the code below, I got the 4 output files as desired, although files 3 and 4 are not as expected: where there are empty values in the first rows, my code does not work as intended. I also have an issue getting the right value from row 3 of each file: I get 'us' instead of a number.
output file1
6174,15,3,1.6,1.7,1.8
6176,17,5,1.6,1.5,1.3
output file2
6174,15,3,19.5,23.2,26.5
6176,17,5,18.6,23.5,26.8
output file3
6174,15,3,0,0,28.3,27.0
6176,17,5,0,0,19.7,19.2
output file4
6174,15,3,0,0,27.0,25.4
6176,17,5,0,0,19.2,18.5
code used
The code works almost fine: it merges the csv files and outputs the 4 required files, but there is a problem in files 3 and 4 when there are empty values.
for f in *.csv ; do
awk -F, 'NR==1 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
awk -F, 'NR==2 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
awk -F, 'NR==3 {n=split($NF,f," ");print f[n]}' "$f" >> a-"$f"
sed -i 's/\r$//' a-"$f"
for i in $(seq 1 4); do
awk -F, 'NR>=4{f=1} f{print '"$""$i"'} f==6{exit}' "$f" > "a""$i"-"$f"
cat a-"$f" a"$i""-""$f" >> t"$i"
sed -i 's/\r$//' t"$i"
done
for i in $(seq 1 4); do
awk -v RS= -v OFS=',' -v ORS='\n' '{$1=$1}1' t"$i" > file"$i".csv
done
done
rm -f ./a* ./t*
Appreciate your help
With GNU awk for ENDFILE and automatic handling of multiple open files, and assuming that your posted sample output, where file3 and file4 each have more fields than file1 and file2, is a mistake:
$ cat tst.awk
BEGIN { FS=OFS=","; numHdrFlds=3 }
# header lines: keep only the digits (this also fixes the "us" issue) and build the common prefix
FNR <= numHdrFlds {
    gsub(/[^0-9]/,"")
    hdr = (FNR==1 ? "" : hdr OFS) $0
    next
}
# data lines: append each field to its per-column record; the "+0" maps empty fields to 0
{
    for (i=1; i<=NF; i++) {
        data[i] = (FNR==(numHdrFlds+1) ? "" : data[i] OFS) ($i)+0
    }
}
# after each input file, write one row into each per-column output file
ENDFILE {
    for ( fileNr=1; fileNr<=NF; fileNr++ ) {
        print hdr, data[fileNr] > ("outputFile" fileNr)
    }
}
$ awk -f tst.awk file1 file2
$ for i in outputFile*; do echo "$i"; cat "$i"; echo "---"; done
outputFile1
6174,15,3,1.6,1.7,1.8
6176,17,5,1.6,1.5,1.3
---
outputFile2
6174,15,3,19.5,23.2,26.5
6176,17,5,18.6,23.5,26.8
---
outputFile3
6174,15,3,0,28.3,27
6176,17,5,0,19.7,19.2
---
outputFile4
6174,15,3,0,27,25.4
6176,17,5,0,19.2,18.5
---
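A note on the ($i)+0 in the script above: adding zero coerces the empty CSV fields into the 0s you see in outputFile3 and outputFile4, e.g.:
$ echo '1.6,19.5,,,' | awk -F, '{print ($3)+0}'
0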

awk: print each column of a file into separate files

I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting this error:
awk: illegal field $(), name "i"
input record number 1, file input.txt
source line number 1
How do I correctly use $i inside the {print}?
The following single awk command may help you here too:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > "file"i;close("file"i)}}' Input_file
An all-awk solution. First, the test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for loop iterates from 2 to the last field. If there are more fields than you want to process, change NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call (note the append >>, since each reopen would otherwise truncate the file):
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to do it the way you were attempting:
for i in {2..99}; do
awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done
Note
the -v switch of awk passes a shell variable into awk
$x is then the nth column, where n is the value stored in the awk variable x
Note 2: this is not the fastest solution (a single awk call is fastest), but it corrects your logic. Ideally, take the time to understand awk; it's never time wasted.

Merge files with different columns and rows linux

I have some files that I need to join. I have looked for some solutions, but they do not fit what I need. I have the following files:
a.csv
date |A|B|C|D
15-03-2017|1|3|9|4
and
b.csv
date |A|C|D|E
16-03-2017|2|9|3|4
And I would like to get the next output:
date |A|B|C|D|E
15-03-2017|1|3|9|4|0
16-03-2017|2|0|9|3|4
Any insights or suggestions are appreciated!
EDIT:
Thanks to all.
These example files are not always the same; sometimes they can have between 10 and 50 columns and between 1 and 30 rows (dates).
something like this...
awk 'BEGIN {FS=OFS="|"}
FNR==1 {split($0,h); next}
{c++;
for(i=1;i<=NF;i++)
{a[h[i],c]=$i;
hall[h[i]]}}
END {for(k in hall) printf "%s", k OFS;
print "";
for(i=1;i<=c;i++)
{for(k in hall) printf "%s", ((k,i) in a?a[k,i]:0) OFS;
print ""}}' file1 file2
A|B|C|D|E|date |
1|3|9|4|0|15-03-2017|
2|0|9|3|4|16-03-2017|
you can reorder the columns with some additional code, but perhaps a better solution will show up...
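For instance, one way to force the date column first (a sketch replacing only the END block above; the remaining headers still print in awk's unspecified in-order):
END {printf "%s", "date " OFS;
     for(k in hall) if(k!="date ") printf "%s", k OFS;
     print "";
     for(i=1;i<=c;i++)
       {printf "%s", (("date ",i) in a ? a["date ",i] : 0) OFS;
        for(k in hall) if(k!="date ") printf "%s", ((k,i) in a ? a[k,i] : 0) OFS;
        print ""}}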
A slightly simpler solution using sed:
sed 's/$/|0/;1s/0/E/' a.csv; sed '1d;s/|/|0|/2' b.csv
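The same two commands with comments (behavior unchanged):
# a.csv: append "|0" (the missing E column) to every line,
# then on line 1 turn that appended 0 into the E header
sed 's/$/|0/;1s/0/E/' a.csv
# b.csv: drop the header, then insert a 0 after the second "|"
# (where the missing B column belongs)
sed '1d;s/|/|0|/2' b.csv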
I am expanding my answer into a script for readability; you can run the script with awk's -f option. It starts with a BEGIN block: specify | as the field separator, map an index to each header label in an associative array, and print the full header. Then, for each file: map labels to indices on line 1, map labels to data on line 2 (replacing empty data fields with "0"), print the filled-out line, and clear the arrays for the next file.
BEGIN{
    # field separator
    FS="|"
    # index:label mapping
    map[1]="date "; map[2]="A"; map[3]="B"
    map[4]="C"; map[5]="D"; map[6]="E"
    # print full header
    print "date |A|B|C|D|E"
}
# first line of each file, create index:label mapping
FNR==1{
    for (i=1;i<=NF;i++)
        label[i]=$i
}
# next line of the file, create label:data mapping
FNR==2{
    for (i=1;i<=NF;i++)
        data[label[i]]=$i
    # cycle through index:label mapping and print data
    # for each label, or "0" if there is no data
    printf("%s", data["date "])
    for (i=2;i<=6;i++) {
        s = (data[map[i]] ? data[map[i]] : 0)
        printf("|%s", s)
    }
    print "" # print empty string for newline
    # delete arrays to start from scratch on the following file
    delete label
    delete data
}
Result on the two example files:
$ awk -f joiner.awk a.csv b.csv
date |A|B|C|D|E
15-03-2017|1|3|9|4|0
16-03-2017|2|0|9|3|4
I solved this problem by changing the way I get the data. Something like this:
today=$(date +%Y%m%d)
echo "DataBase "$(date +%d/%m/%Y) > jdb"$today".txt
du -s $(ls -l | grep ^d | awk '{print $9}') | awk '{print $2" "$1" "}' >> jdb"$today".txt
the output is like:
jdb_20170507.txt:
database 07/05/2017
jdb_A 4345
jdb_CFX 7654
jdb_ZZXD 97865
jdb_20170508.txt:
database 08/05/2017
jdb_A 9876
jdb_CFX 7545
jdb_ZXCFG 2344
In this example, between the two days, the jdb_ZZXD database was deleted and the jdb_ZXCFG database was created.
With this structure I can use the join command:
x=0
touch jdbaux$x.txt
for jdbfile in $(ls -1t|grep jdb2)
do
y=$(($x+1))
join -a1 -a2 -e0 -o auto --nocheck-order jdbaux$x.txt $jdbfile >jdbaux$y.txt
rm jdbaux$x.txt
x=$(($x+1))
done
This is my rustic recursive join option for all the files of the month:
-a1 = also print unpairable lines from file 1
-a2 = also print unpairable lines from file 2
-e0 = replace missing input fields with 0
-o auto = automatic output format
--nocheck-order = do not check that the input is correctly sorted
the output is like:
jdb_sizes201705.txt:
database 07/05/2017 08/05/2017
jdb_A 4345 9876
jdb_CFX 7654 7545
jdb_ZZXD 97865 0
jdb_ZXCFG 0 2344
and the last step is a pivot (transposing rows and columns):
cat jdb_sizes201705.txt | awk '
{
    # store every field, indexed by (row, column)
    for (i=1; i<=NF; i++) {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {
    # print the stored fields column by column, i.e. transposed
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j]
        }
        print str
    }
}'
Obtaining the expected output
database jdb_A jdb_CFX jdb_ZZXD jdb_ZXCFG
07/05/2017 4345 7654 97865 0
08/05/2017 9876 7545 0 2344
I know it's not the best solution but it works!
Thanks!

How to do calculations over lines of a file in awk

I've got a file that looks like this:
88.3055
45.1482
37.7202
37.4035
53.777
What I have to do is isolate the value on the first line and divide it by the values on the other lines (it's a speedup calculation). I thought of storing the first line in a variable (using NR) and then iterating over the other lines to obtain the values from the divisions. Desired output is:
1,9559
2,3410
2,3608
1,6420
UPDATE
Sorry Ed, my mistake: the desired decimal point is '.'
I made some small changes to Ed's answer so that awk also prints the division of 88.3055 by itself, and outputs everything to a file speedup.dat:
awk 'NR==1{n=$0} {print n/$0}' tavg.dat > speedup.dat
Is it possible to combine the contents of speedup.dat and the results from another awk command without using intermediate files and in one single awk command?
First command:
awk 'BEGIN { FS = \"[ \\t]*=[ \\t]*\" } /Total processes/ { if (! CP) CP = $2 } END {print CP}' cg.B.".n.".log ".(n == 1 ? ">" : ">>")." processes.dat
This first command outputs:
1
2
4
8
16
Paste of the two files:
paste processes.dat speedup.dat > prsp.dat
which gives the now desired output:
1 1
2 1.9559
4 2.34107
8 2.36089
16 1.64207
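One way to get there without the intermediate files (a sketch, assuming bash, the cg.B.$n.log naming implied by the fragment above, and the process counts 1 to 16) is process substitution, so that paste reads both streams directly:
paste <(for n in 1 2 4 8 16; do
            awk 'BEGIN { FS="[ \t]*=[ \t]*" } /Total processes/ { if (!CP) CP = $2 } END { print CP }' "cg.B.$n.log"
        done) \
      <(awk 'NR==1{n=$0} {print n/$0}' tavg.dat)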
$ awk 'NR==1{n=$0;next} {print n/$0}' file
1.9559
2.34107
2.36089
1.64207
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", n/$0}' file
1.9559
2.3411
2.3609
1.6421
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", int(n*10000/$0)/10000}' file
1.9559
2.3410
2.3608
1.6420
$ awk 'NR==1{n=$0;next} {x=sprintf("%.4f",int(n*10000/$0)/10000); sub(/\./,",",x); print x}' file
1,9559
2,3410
2,3608
1,6420
Normally you'd just use the correct locale to have . or , as your decimal point, but your input uses . while your output uses , so I don't think that's an option.
awk '{if(n=="") n=$1; else print n/$1}' inputFile

Using awk to pull specific lines from a file

I have two files, one file is my data, and the other file is a list of line numbers that I want to extract from my data file. Can I use awk to read in my lines file, and then extract the lines that match the line numbers?
Example:
Data file:
This is the first line of my data
This is the second line of my data
This is the third line of my data
This is the fourth line of my data
This is the fifth line of my data
Line numbers file
1
4
5
Output:
This is the first line of my data
This is the fourth line of my data
This is the fifth line of my data
I've only ever used command line awk and sed for really simple stuff. This is way beyond me and I have been googling for an hour without an answer.
awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
Simply referring to an array subscript creates the entry. While looping over the first file (where NR, the overall record number, equals FNR, the per-file record number), the next statement stores all the line numbers in the array and skips the rest of the program. After that, whenever the FNR of the second file is present in the array (a true condition), the line is printed, printing being the default action for a true pattern.
One way with sed:
sed 's/$/p/' linesfile | sed -n -f - datafile
You can use the same trick with awk:
sed 's/^/NR==/' linesfile | awk -f - datafile
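To see why these work, here are the scripts the first sed in each pipeline generates from the question's line numbers file:
$ sed 's/$/p/' linesfile
1p
4p
5p
$ sed 's/^/NR==/' linesfile
NR==1
NR==4
NR==5
The first is a sed script of print commands (run with -n so only those lines are printed); the second is an awk program of bare patterns whose default action is to print the line.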
Edit - Huge files alternative
With regard to a huge number of lines, it is not prudent to keep whole files in memory. The solution in that case is to sort the numbers file and read it one line at a time. The following has been tested with GNU awk:
extract.awk
BEGIN {
    getline n < linesfile
    if(length(ERRNO)) {
        print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
        exit
    }
}
NR == n {
    print
    if(!(getline n < linesfile)) {
        if(length(ERRNO))
            print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
        exit
    }
}
Run it like this:
awk -v linesfile=$linesfile -f extract.awk infile
Testing:
echo "2
4
7
8
10
13" | awk -v linesfile=/dev/stdin -f extract.awk <(paste <(seq 50e3) <(seq 50e3 | tac))
Output:
2 49999
4 49997
7 49994
8 49993
10 49991
13 49988
Here is an awk example. inputfile is loaded up front, then matching records of datafile are output.
awk \
    -v RS="[\r]*[\n]" \
    -v FILE="inputfile" \
    'BEGIN {
        LINES = ","
        while ((getline Line < FILE)) {
            LINES = LINES Line ","
        }
    }
    LINES ~ "," NR "," {
        print
    }' datafile
I had the same problem. This is the solution already posted by Thor:
cat datafile \
| awk 'BEGIN{getline n<"numbers"} n==NR{print; getline n<"numbers"}'
If like me you don't have a numbers file, but it is instead passed on from stdin and you don't want to generate a temporary numbers file, then this is an alternative solution:
cat numbers \
| awk '{while((getline line<"datafile")>0) {n++; if(n==$0) {print line;next}}}'
This solution...
awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
...only prints unique numbers in the numberfile. What if the numberfile contains repeated entries? Then sed is a better (but much slower) alternative:
sed -nf <(sed 's/.*/&p/' numberfile) datafile
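For example, with a duplicated entry and the question's datafile:
$ printf '2\n2\n' > numberfile
$ sed -nf <(sed 's/.*/&p/' numberfile) datafile
This is the second line of my data
This is the second line of my data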
while read line; do sed -n "${line}p" Datafile.txt; done < numbersfile.txt