subtract columns from different files with awk - awk

I have two folders, A1 and A2. The file names and the number of files are the same in both folders. Each file has 15 columns. Column 6 of each file in folder 'A2' needs to be subtracted from column 6 of the corresponding file in folder 'A1'. I would like to print column 2 and column 6 (after subtraction) from each file to a folder A3 with the same filenames. How can I do this with awk?
f1.txt file in folder A1
RAM AA 159.03 113.3 122.9 34.78 116.3
RAM BB 151.24 70 122.9 142.78 66.4
RAM CC 156.70 80 86.2 70.1 54.8
f1.txt file in folder A2
RAM AA 110.05 113 122.9 34.78 116.3
RAM BB 150.15 70 122.9 140.60 69.4
RAM CC 154.70 89.2 86.2 72.1 55.8
desired output
AA 0
BB 2.18
CC -2

Try this:
paste {A1,A2}/f1.txt | awk '{print $2,$6-$13}'
In bash, {A1,A2}/f1.txt expands to A1/f1.txt A2/f1.txt (it's just a brace-expansion shortcut).
I use the paste command to merge the files horizontally, line by line.
The awk command is then quite simple: with the seven whitespace-separated columns of the sample data, $13 is column 6 of the second file.
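A quick sanity check of this answer, recreating the sample files from the question (the directory layout here is illustrative; the $13 offset assumes the 7-column sample data):

```shell
# Recreate the toy layout from the question (illustrative only).
mkdir -p A1 A2
printf 'RAM AA 159.03 113.3 122.9 34.78 116.3\nRAM BB 151.24 70 122.9 142.78 66.4\nRAM CC 156.70 80 86.2 70.1 54.8\n' > A1/f1.txt
printf 'RAM AA 110.05 113 122.9 34.78 116.3\nRAM BB 150.15 70 122.9 140.60 69.4\nRAM CC 154.70 89.2 86.2 72.1 55.8\n' > A2/f1.txt

# paste glues matching lines side by side, so $13 is column 6 of the A2 file.
paste A1/f1.txt A2/f1.txt | awk '{print $2, $6 - $13}'
# AA 0
# BB 2.18
# CC -2
```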

One way using awk:
awk 'FNR==NR { array[$1]=$2; next } { if ($1 in array) print $1, array[$1] - $2 > "A3/f1.txt" }' ~/A1/f1.txt ~/A2/f1.txt
FIRST EDIT:
Assuming an equal number of files in both directories (A1 and A2) with filenames paired in the way you describe:
for i in A1/*; do awk -v FILE=A3/${i/A1\//} 'FNR==NR { array[$1]=$2; next } { if ($1 in array) print $1, array[$1] - $2 > FILE }' A1/${i/A1\//} A2/${i/A1\//}; done
You will need to create the directory A3 first, or you'll get an error.
SECOND EDIT:
awk 'FNR==NR { array[$2]=$6; next } { if ($2 in array) print $2, array[$2] - $6 > "A3/f1.txt" }' ~/A1/f1.txt ~/A2/f1.txt
THIRD EDIT:
for i in A1/*; do awk -v FILE=A3/${i/A1\//} 'FNR==NR { array[$2]=$6; next } { if ($2 in array) print $2, array[$2] - $6 > FILE }' A1/${i/A1\//} A2/${i/A1\//}; done
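The THIRD EDIT loop can be sketched more readably like this (same logic, with the filename substitution pulled out into a shell variable; the data and directory names here are illustrative stand-ins):

```shell
# Minimal sketch of the per-file loop; A3 must exist before awk writes into it.
mkdir -p A1 A2 A3
printf 'RAM AA 1 2 3 10 4\nRAM BB 1 2 3 20 4\n' > A1/f1.txt
printf 'RAM AA 1 2 3 4 4\nRAM BB 1 2 3 5 4\n' > A2/f1.txt

for i in A1/*; do
  f=${i#A1/}                      # bare filename, e.g. f1.txt
  awk -v out="A3/$f" '
    FNR==NR { a[$2] = $6; next }  # first file: remember column 6 keyed by column 2
    $2 in a { print $2, a[$2] - $6 > out }
  ' "A1/$f" "A2/$f"
done
cat A3/f1.txt
# AA 6
# BB 15
```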

Related

Count rows and columns for multiple CSV files and make new file

I have multiple large comma-separated CSV files in a directory. But, as a toy example:
one.csv has 3 rows, 2 columns
two.csv has 4 rows, 5 columns
This is what the files look like -
# one.csv
a b
1 1 3
2 2 2
3 3 1
# two.csv
c d e f g
1 4 1 1 4 1
2 3 2 2 3 2
3 2 3 3 2 3
4 1 4 4 1 4
The goal is to make a new .txt or .csv that gives the rows and columns for each:
one 3 2
two 4 5
To get the rows and columns (and dump it into a file) for a single file
$ awk -F "," '{print NF}' *.csv | sort | uniq -c > dims.txt
But I'm not understanding the syntax to get counts for multiple files.
What I've tried
$ awk '{for (i=1; i<=2; i++) -F "," '{print NF}' *.csv$i | sort | uniq -c}'
With any awk, you could try the following program.
awk '
FNR==1{
if(cols && rows){
print file,rows,cols
}
rows=cols=file=""
file=FILENAME
sub(/\..*/,"",file)
cols=NF
next
}
{
rows=(FNR-1)
}
END{
if(cols && rows){
print file,rows,cols
}
}
' one.csv two.csv
Explanation: a detailed, commented version of the solution above.
awk ' ##Starting awk program from here.
FNR==1{ ##If this is the first line of a file, do the following.
if(cols && rows){ ##Checking if cols AND rows are NOT NULL then do following.
print file,rows,cols ##Printing file, rows and cols variables here.
}
rows=cols=file="" ##Nullifying rows, cols and file here.
file=FILENAME ##Setting FILENAME value to file here.
sub(/\..*/,"",file) ##Strip everything from the first dot to the end of the filename.
cols=NF ##Setting NF values to cols here.
next ##next will skip all further statements from here.
}
{
rows=(FNR-1) ##Setting FNR-1 value to rows here.
}
END{ ##Starting END block of this program from here.
if(cols && rows){ ##Checking if cols AND rows are NOT NULL then do following.
print file,rows,cols ##Printing file, rows and cols variables here.
}
}
' one.csv two.csv ##Mentioning Input_file names here.
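Condensed to one line and run against the toy files from the question (whitespace-separated, with a header row and a leading row-number column), the program behaves as expected:

```shell
# Recreate the question's toy files, then run the solution as a one-liner.
printf 'a b\n1 1 3\n2 2 2\n3 3 1\n' > one.csv
printf 'c d e f g\n1 4 1 1 4 1\n2 3 2 2 3 2\n3 2 3 3 2 3\n4 1 4 4 1 4\n' > two.csv
awk 'FNR==1{if(cols && rows) print file,rows,cols; rows=cols=file=""; file=FILENAME; sub(/\..*/,"",file); cols=NF; next} {rows=FNR-1} END{if(cols && rows) print file,rows,cols}' one.csv two.csv
# one 3 2
# two 4 5
```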
Using gnu awk you can do this in a single awk:
awk -F, 'ENDFILE {
print gensub(/\.[^.]+$/, "", "1", FILENAME), FNR-1, NF-1
}' one.csv two.csv > dims.txt
cat dims.txt
one 3 2
two 4 5
You will need to iterate over all CSVs, printing the name and dimensions for each file:
for i in *.csv; do awk -F "," 'END{print FILENAME, NR, NF}' "$i"; done > dims.txt
If you want to avoid awk, you can also get the line count with wc -l and the field count with grep -o "," | wc -l on a single line.
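A rough sketch of that non-awk route, using headerless comma-separated versions of the toy files (filenames are illustrative; tr -cd ',' stands in here for grep -o ',' | wc -l, since both just count separators on one line):

```shell
printf '1,3\n2,2\n3,1\n' > one.csv
printf '4,1,1,4,1\n3,2,2,3,2\n2,3,3,2,3\n1,4,4,1,4\n' > two.csv

for f in one.csv two.csv; do
  rows=$(( $(wc -l < "$f") ))                           # line count
  cols=$(( $(head -n1 "$f" | tr -cd ',' | wc -c) + 1 )) # commas in first line + 1
  printf '%s %s %s\n' "${f%.csv}" "$rows" "$cols"
done
# one 3 2
# two 4 5
```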
I would harness GNU AWK's ENDFILE for this task as follows. Let the content of one.csv be
1,3
2,2
3,1
and two.csv be
4,1,1,4,1
3,2,2,3,2
2,3,3,2,3
1,4,4,1,4
then
awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR, NF}' one.csv two.csv
output
one.csv 3 2
two.csv 4 5
Explanation: ENDFILE is executed after processing every file. I set FS to , assuming that fields are ,-separated and there is no , inside a field. FILENAME, FNR and NF are built-in AWK variables: FNR is the number of the current row in the file (so inside ENDFILE it is the number of the last row), and NF is the number of fields (again, of the last row). If your files have headers, use FNR-1; if rows are prepended with a row number, use NF-1.
edit: changed NR to FNR
Without GNU awk you can use the shell plus POSIX awk this way:
for fn in *.csv; do
cols=$(awk '{print NF; exit}' "$fn")
rows=$(awk 'END{print NR-1}' "$fn")
printf "%s %s %s\n" "${fn%.csv}" "$rows" "$cols"
done
Prints:
one 3 2
two 4 5

how to fetch the data of column from txt file without column name in unix?

This is student.txt file:
RollNo|Name|Marks
123|Raghu|80
342|Maya|45
561|Gita|56
480|Mohan|71
I want to fetch data from the "Marks" column, and I used this awk command:
awk -F "|" '{print $3}' student.txt
and it gives output like this:
Marks
80
45
56
71
This output includes the column name "Marks", but I only want to fetch the data and show the output like this:
80
45
56
71
Add a condition to your awk script to print the third field only if the input record number is greater than 1:
awk -F'|' 'FNR>1{print $3}' student.txt
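For the sample student.txt this gives exactly the wanted output:

```shell
# Recreate the sample input, then skip the header row with FNR>1.
printf 'RollNo|Name|Marks\n123|Raghu|80\n342|Maya|45\n561|Gita|56\n480|Mohan|71\n' > student.txt
awk -F'|' 'FNR>1{print $3}' student.txt
# 80
# 45
# 56
# 71
```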
Could you please try the following; it does NOT hard-code the field number for the Marks column. It looks for that string in the header line and prints only the column under Marks, so even if your Marks column is in a different field position this should work fine. Written and tested at https://ideone.com/Ufq5E2.
awk '
BEGIN{ FS="|" }
FNR==1{
for(i=1;i<=NF;i++){
if($i=="Marks"){ field=i }
}
next
}
{
print $field
}
' Input_file
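Condensed slightly and run against the sample data, the header-lookup version prints the same result no matter where the Marks column sits:

```shell
# Recreate the sample input, then locate the Marks column from the header.
printf 'RollNo|Name|Marks\n123|Raghu|80\n342|Maya|45\n561|Gita|56\n480|Mohan|71\n' > student.txt
awk -F'|' '
  FNR==1 { for (i=1; i<=NF; i++) if ($i == "Marks") field = i; next }
  { print $field }
' student.txt
# 80
# 45
# 56
# 71
```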
To fetch data from the "Marks" column using awk:
$ awk -F\| '
FNR==1 {
for(i=1;i<=NF;i++)
if($i=="Marks")
next
exit
}
{
print $i
}' file
80
45
56
71
Another example:
awk -F'|' 'NR != 1 {print $3}' input_file
A non-awk alternative:
$ sed 1d file | cut -d\| -f3
80
45
56
71

Maintaining the separator in awk output

I would like to subset a file while keeping the separator in the subsetted output, using awk in bash.
This is what I am using:
The input file is created in the R language with:
inp <- 'AX-1 1 125 AA 0.2 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AA 0.005 0.55'
inp <- read.table(text=inp, header=F)
write.table(inp, "inp.txt", col.names=F, row.names=F, quote=F, sep="\t")
(So fields are separated by tabs)
The code in bash:
awk {'print $1 $2 $3'} inp.txt
The result:
AX-11125
AX-22456
AX-333445
Please note that my columns were merged in the awk output (and I would like it to be tab-delimited like the input file). Probably it is a simple syntax problem, but I would be grateful for any ideas.
Use
awk -v OFS='\t' '{ print $1, $2, $3 }'
or
awk '{ print $1 "\t" $2 "\t" $3 }'
Written one after another without an operator between them, expressions in awk are concatenated - $1 $2 $3 is no different from $1$2$3 in this respect.
The first solution sets the output field separator OFS to a tab, then uses the comma operator to print separated fields. The second solution simply sprinkles tabs in there directly, and everything is concatenated as it was before.
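A side-by-side check of the two behaviours, with a single sample line piped in (tabs written via printf):

```shell
# Comma between expressions: joined with OFS (a tab here).
printf 'AX-1\t1\t125\n' | awk -v OFS='\t' '{ print $1, $2, $3 }'
# No operator between expressions: plain concatenation.
printf 'AX-1\t1\t125\n' | awk '{ print $1 $2 $3 }'   # prints AX-11125
```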

how to use AWK to merge two columns from each of 10 files

I have 10 files that have the same tab-delimited column structure. I need to merge columns 8 and 9 from each of the files. I came up with the following AWK code, but it only merges two files at a time. I am looking for help with merging all 10 files at the same time; I am not sure if this is feasible.
All my file names following the same pattern (s1s2.txt, s3s4.txt, s5s6.txt, .... s19s20.txt)
#!/bin/bash
awk '
BEGIN {
#load array with contents of the first file
while ( getline < "s1s2.txt" > 0)
{
s1s2_counter++
f1_8[s1s2_counter] = $8
f1_9[s1s2_counter] = $9
}
}
{OFS="\t"}
{ #output the columns 8 and 9 from the first file before the second file
print f1_8[NR],f1_9[NR], $8, $9
} ' s3s4.txt
awk -F'\t' '{a[FNR] = a[FNR] (NR==FNR?"":FS) $8 FS $9} END{for (i=1;i<=FNR;i++) print a[i]}' s1s2.txt .... s19s20.txt
Using getline is usually the wrong approach, see http://awk.info/?tip/getline.
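For reference, here is the recommended one-liner on two tiny tab-separated stand-in files (extend the argument list with the remaining s*.txt names the same way; it assumes all files have the same number of lines, since the END loop runs up to the last file's FNR):

```shell
# Two 9-column tab-separated stand-ins for s1s2.txt and s3s4.txt.
printf 'a\tb\tc\td\te\tf\tg\t10\t11\n' > s1s2.txt
printf 'a\tb\tc\td\te\tf\tg\t20\t21\n' > s3s4.txt

# Accumulate $8 and $9 of every file per line number, then print the rows.
awk -F'\t' '{a[FNR] = a[FNR] (NR==FNR?"":FS) $8 FS $9}
            END{for (i=1;i<=FNR;i++) print a[i]}' s1s2.txt s3s4.txt
```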

awk print line of file2 based on condition of file1

I have two files:
cat file1:
0 xxx
1 yyy
1 zzz
0 aaa
cat file2:
A bbb
B ccc
C ddd
D eee
How do I get the following output using awk:
B ccc
C ddd
My question is, how do I print lines from file2 only if a certain field in file1 (i.e. field 1) matches a certain value (i.e. 1)?
Additional information:
Files file1 and file2 have an equal number of lines.
Files file1 and file2 have millions of lines and cannot be read into memory.
file1 has 4 columns.
file2 has approximately 1000 columns.
Try doing this (a bit obfuscated):
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
On multiple lines it can be clearer (reminder: awk works like condition { action }):
awk '
NR==FNR{arr[NR]=$1}
NR!=FNR && arr[FNR]
' file1 file2
If I remove the "clever" parts of the snippet, it looks like this:
awk '
{
    if (NR == FNR) { arr[NR] = $1 }
    if (NR != FNR && arr[FNR]) { print $0 }
}
' file1 file2
When awk finds a condition alone (without an action), like NR!=FNR && arr[FNR], it prints the current record to STDOUT by default if the expression is true.
Explanations
NR is the number of the current record from the start of input
FNR is the ordinal number of the current record in the current file (so NR differs from FNR once we are in the second file)
arr[NR]=$1 : fill the array arr, indexed by the current NR, with the first column
if NR!=FNR, we are in the second file; when the stored value arr[FNR] is 1 (true), the current line of file2 is printed
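A minimal end-to-end check of this answer, using the sample files from the question (note it stores one field of file1 per line in memory, not whole lines, which keeps the footprint modest even for large files):

```shell
# Recreate the two sample files, then run the condition-only awk program.
printf '0 xxx\n1 yyy\n1 zzz\n0 aaa\n' > file1
printf 'A bbb\nB ccc\nC ddd\nD eee\n' > file2
awk 'NR==FNR{a[NR]=$1}NR!=FNR&&a[FNR]' file1 file2
# B ccc
# C ddd
```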
Not as clean as an awk solution (note that /0/ matches a 0 anywhere in the pasted line, so this is fragile with data that can contain zeros elsewhere):
$ paste file2 file1 | sed '/0/d' | cut -f1
B
C
You mentioned something about millions of lines, in order to just do a single pass through the files, I'd resort to python. Something like this perhaps (python 2.7):
with open("file1") as fd1, open("file2") as fd2:
for l1, l2 in zip(fd1, fd2):
if not l1.startswith('0'):
print l2.strip()
awk '{
getline value <"file2";
if ($1)
print value;
}' file1
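A quick check of the getline variant above against the sample files (it reads file2 in lockstep with file1, so both files must have the same number of lines):

```shell
# Recreate the sample files, then pair each file1 line with one file2 line.
printf '0 xxx\n1 yyy\n1 zzz\n0 aaa\n' > file1
printf 'A bbb\nB ccc\nC ddd\nD eee\n' > file2
awk '{ getline value < "file2"; if ($1) print value }' file1
# B ccc
# C ddd
```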