awk '{print $<number>}' but without knowing <number> before hand - sql

Printing a specific column per line by piping to awk is fine.
But how do I do it if I do not know the column number in advance, only that it is the column whose first row matches a given string?
Example.
Title1 Title2 TargetTitle Title3
x y z a
b c d e
The above table, I want to filter out only:
z
d
BUT, two problems
1) I don't know exactly the column number
2) I don't want first row (not a big deal, I can just sed lines 2 to $).
Thanks.

You can build your output using awk like this:
awk -v OFS='\t' 'NR>1{for (i=1; i<=NF; i++) {
if ($i=="b"||$i=="d") $i=""; printf "%s%s", $i, (i==NF)?ORS:OFS}}' file
x y z a
c e

To filter out one column, you could use something like this:
awk -v title="TargetTitle" 'NR==1 { for (i=1;i<=NF;++i) if ($i==title) col=i }
NR>1 { for (i=1;i<=NF;++i) if (i!=col) printf "%s%s", $i, (i<NF?OFS:ORS)}' file
Output:
x y a
b c e
If you want to add more space between each column in the output, you can change the value of the OFS variable or change the first format specifier from %s to %4s, for example.
If you want to only print one column, you can do something like this:
awk -v title="TargetTitle" 'NR==1 { for (i=1;i<=NF;++i) if ($i==title) col=i }
NR>1 { print $col }' file
Output:
z
d
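One gap in the header lookup above: if no header matches, col stays unset, $col becomes $0, and every full line is printed. Below is a minimal sketch of a guard, assuming you would rather fail loudly. The function name print_col, the temporary file, and the exit status 2 are illustrative choices, not part of the answer above.

```shell
# Sample data from the question, written to a temporary file (path is illustrative).
tmp=$(mktemp)
printf '%s\n' 'Title1 Title2 TargetTitle Title3' 'x y z a' 'b c d e' > "$tmp"

# print_col TITLE FILE: print the column whose header equals TITLE, skipping the header row.
print_col() {
  awk -v title="$1" '
    NR == 1 {
      for (i = 1; i <= NF; ++i) if ($i == title) col = i
      if (!col) {                       # no header matched: stop instead of printing $0
        print "no column named " title > "/dev/stderr"
        exit 2
      }
      next
    }
    { print $col }
  ' "$2"
}

out=$(print_col TargetTitle "$tmp")
echo "$out"
```

Calling print_col with a title that does not exist prints nothing to standard output and returns a non-zero status, so a calling script can detect the failure.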

Related

selecting columns in awk discarding corresponding header

How to properly select columns in awk after some processing. My file here:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column as we added linenumber as first):
linenumber;B
1;6
2;5
3;2
[fixed and revised]
To get your expected output, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1)
}' file
Output:
linenumber;C
1;9
2;8
3;1
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And if you want only two columns in your output, why do you print three (FNR-1, $1 and $3)? Finally, your output fields are separated by spaces instead of the expected ; simply because you did not specify the output field separator (OFS). You can set it with a command-line variable assignment (OFS=\;), as shown in the second and third versions below, with the -v option (-v OFS=\;), or in a BEGIN block (BEGIN {OFS=";"}); the three methods differ, but not in ways that matter here.
[EDIT]: see a generic solution at the end.
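As a quick check of the three ways of setting OFS mentioned above, here is a small sketch that runs the same program with each and compares the results. The temporary file and the variable names prog, a, b and c are assumptions made for illustration.

```shell
# Sample input from the question (temporary file; the path is illustrative).
foo=$(mktemp)
printf '%s\n' 'A;B;C' '9;6;7' '8;5;4' '1;2;3' > "$foo"

# Same program each time: line number (or a header label) followed by column 2.
prog='{ n = (FNR == 1) ? "linenumber" : FNR - 1; print n, $2 }'

# 1) command-line assignment, processed when awk reaches it in the argument list
a=$(awk -F';' "$prog" OFS=';' "$foo")
# 2) -v option, set before the program (including any BEGIN block) runs
b=$(awk -F';' -v OFS=';' "$prog" "$foo")
# 3) BEGIN block inside the program
c=$(awk 'BEGIN { FS = OFS = ";" } '"$prog" "$foo")

echo "$a"
```

All three variables hold the same text here. The methods only diverge when the timing matters, for example when a BEGIN block needs the value.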
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3

linux csv file concatenate columns into one column

I've been looking to do this with sed, awk, or cut. I am willing to use any other command-line program that I can pipe data through.
I have a large set of data that is comma delimited. The rows have between 14 and 20 columns. I need to recursively concatenate column 10 with column 11 per row such that every row has exactly 14 columns. In other words, this:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
will become:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
I can get the first 10 columns. I can get the last N columns. I can concatenate columns. I cannot think of how to do it in one line so I can pass a stream of endless data through it and end up with exactly 14 columns per row.
Examples (by request):
How many columns are in the row?
sed 's/[^,]//g' | wc -c
Get the first 10 columns:
cut -d, -f1-10
Get the last 4 columns:
rev | cut -d, -f1-4 | rev
Concatenate columns 10 and 11, showing columns 1-10 after that:
awk -F',' ' NF { print $1","$2","$3","$4","$5","$6","$7","$8","$9","$10$11}'
Awk solution:
awk 'BEGIN{ FS=OFS="," }
{
diff = NF - 14;
for (i=1; i <= NF; i++)
printf "%s%s", $i, (diff > 0 && i >= 10 && i < (10+diff)?
"": (i == NF? ORS : ","))
}' file
The output:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
With GNU awk for the 3rd arg to match() and gensub():
$ cat tst.awk
BEGIN{ FS="," }
match($0,"(([^,]+,){9})(([^,]+,){"NF-14"})(.*)",a) {
$0 = a[1] gensub(/,/,"","g",a[3]) a[5]
}
{ print }
$ awk -f tst.awk file
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
If perl is okay - can be used just like awk for stream processing
$ cat ip.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
$ awk -F, '{print NF}' ip.txt
16
18
22
$ perl -F, -lane '$n = $#F - 4;
print join ",", (@F[0..8], join("", @F[9..$n]), @F[$n+1..$#F])
' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
-F, -lane split on , results saved in @F array
$n = $#F - 4 magic number, to ensure output ends with 14 columns. $#F gives the index of last element of array (won't work if input line has less than 14 columns)
join helps to stitch array elements together with specified string
@F[0..8] array slice with first 9 elements
@F[9..$n] and @F[$n+1..$#F] the other slices as needed
Borrowing from Ed Morton's regex based solution
$ perl -F, -lape '$n=$#F-13; s/^([^,]*,){9}\K([^,]*,){$n}/$&=~tr|,||dr/e' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
$n=$#F-13 magic number
^([^,]*,){9}\K first 9 fields
([^,]*,){$n} fields to change
$&=~tr|,||dr use tr to delete the commas
e this modifier allows use of Perl code in replacement section
this solution also has the added advantage of working even if the input line has fewer than 14 fields
You can try this gnu sed
sed -E '
s/,/\n/9g
:A
s/([^\n]*\n)(.*)(\n)(([^\n]*\n){4})/\1\2\4/
tA
s/\n/,/g
' infile
First variant - with awk
awk -F, '
{
for(i = 1; i <= NF; i++) {
OFS = (i > 9 && i < NF - 4) ? "" : ","
if(i == NF) OFS = "\n"
printf "%s%s", $i, OFS
}
}' input.txt
Second variant - with sed
sed -r 's/,/#/10g; :l; s/#(.*)((#[^#]){4})/\1\2/; tl; s/#/,/g' input.txt
or, more straightforwardly (without loop) and probably faster.
sed -r 's/,(.),(.),(.),(.)$/#\1#\2#\3#\4/; s/,//10g; s/#/,/g' input.txt
Testing
Input
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u
Output
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
a,b,c,d,e,f,g,h,i,jklmn,o,p,q,r
a,b,c,d,e,f,g,h,i,jklmnopq,r,s,t,u
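The awk variant above hard-codes 9 leading and 4 trailing columns. Here is a sketch of the same idea with those counts as parameters. The function name merge_middle and the variable names lead and tail are hypothetical, and the defaults mirror the 14-column target from the question.

```shell
# merge_middle: keep the first `lead` fields and the last `tail` fields,
# and glue everything in between into one field. Reads standard input.
merge_middle() {
  awk -F, -v lead=9 -v tail=4 '
    {
      line = ""
      for (i = 1; i <= NF; i++) {
        # Fields lead+1 .. NF-tail-1 get no separator after them, so they fuse
        # with the field that follows; every other field gets a comma.
        sep = (i > lead && i < NF - tail) ? "" : ","
        line = line $i (i < NF ? sep : "")
      }
      print line
    }'
}

out=$(printf '%s\n' \
  'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p' \
  'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r' | merge_middle)
echo "$out"
```

Rows that already have 14 or fewer fields pass through unchanged, because the range of fields to fuse is then empty.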
Solved a similar problem using csvtool. Source file, copied from one of the other answers:
$ cat input.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
Concatenating columns:
$ cat input.txt | csvtool format '%1,%2,%3,%4,%5,%6,%7,%8,%9,%10%11%12,%13,%14,%15,%16,%17,%18,%19,%20,%21,%22\n' -
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p,,,,,,
1,2,3,4,5,6,3,4,2,434,3,2,5,2,3,4,,,,
1,2,3,4,5,6,3,4,2,4as,f,e,3,4,3,2,5,2,3,4

Print the 1st and every nth column of a text file using awk

I have a txt file that contains 10177 columns and approximately 450,000 rows. The fields are separated by tabs. I am trying to trim the file down using awk so that it only prints columns 1-3, the 5th, and every 14th column after the fifth one.
My file has a format that looks like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... 10177
A B C D E F G H I J K L M N O P Q R S T ...
X Y X Y X Y X Y X Y X Y X Y X Y X Y X Y ...
I am hoping to generate an output txt file (also separated with tab) that contains:
1 2 3 5 18 ...
A B C E R ...
X Y X X Y ...
The current awk code I have looks like (I am using cygwin to use the code):
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
But the result I am getting shows something like:
123518...ABCER...XYXXY...
When opened with excel program, the results are all mashed into 1 single cell.
In addition, when I try to include code
for (i=0;i<=3;i++) printf "%s ",$i
in the awk to get the first 3 columns, it just prints out the original input document together with the mashed result. I am not familiar with awk, so I am not sure what causes this issue.
Awk field numbers, strings, and array indices all start at 1, not 0, so when you do:
for (i=0;i<=3;i++) printf "%s ",$i
the first iteration prints $0 which is the whole record.
You're on the right track with:
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
but never call printf with input data as its only argument: printf then treats the data as a format string with no arguments, rather than as plain text to print, and fails cryptically when the input contains formatting characters like %s or %d. So always use printf "%s", $i, never printf $i.
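To make the printf advice concrete, here is a small sketch using a made-up record that contains a conversion specifier. Only the safe form is run; how the unsafe form misbehaves depends on the awk implementation, so no output is claimed for it.

```shell
# A record that happens to contain a printf conversion specifier (invented sample).
line='discount 50%d today'

# Safe: the data is passed as an argument to a fixed "%s" format,
# so the %d inside it is printed literally.
safe=$(printf '%s\n' "$line" | awk '{ printf "%s\n", $0 }')
echo "$safe"

# Unsafe, shown for contrast only and deliberately not executed:
#   awk '{ printf $0 }'
# Here awk would interpret %d as a conversion with no matching argument.
```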
The problem you're having with excel, I would guess, is you're trying to double click on the file and hoping excel knows what to do with it (it won't, unlike if this was a CSV). You can import tab-separated files into excel after it's opened though - google that.
You want something like:
awk '
BEGIN { FS=OFS="\t" }
{
for (i=1; i<=3; i++) {
printf "%s%s", (i>1?OFS:""), $i
}
for (i=5; i<=NF; i+=14) {
printf "%s%s", OFS, $i
}
print ""
}
' file
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
In awk using conditional operator in for:
$ awk 'BEGIN { FS=OFS="\t" }
{
for(i=1; i<=NF; i+=( i<3 ? 1 : ( i==3 ? 2 : 14 )))
printf "%s%s", $i, ( (i+14)>NF ? ORS : OFS)
}' file
1 2 3 5 19
A B C E S
X Y X X X
In the for if i<3 increment by one, if i==3 increment by two to get to 5 and after that by 14.
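To see which field numbers that increment expression visits, the loop can be run on its own. This sketch assumes a hypothetical record width of 40 columns (the variable nf) and prints the indexes instead of the fields.

```shell
# Print the field numbers the loop would visit for a hypothetical 40-column record.
idx=$(awk 'BEGIN {
  nf = 40
  for (i = 1; i <= nf; i += (i < 3 ? 1 : (i == 3 ? 2 : 14)))
    printf "%s%s", (i > 1 ? " " : ""), i
  print ""
}')
echo "$idx"   # 1 2 3 5 19 33
```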
I would be tempted to solve the problem along the following lines. I think you'll find you save time by not iterating in awk.
$ cols="$( { printf '%s\n' 1 2 3; seq 5 14 10177; } | sed 's/^/$/' | paste -sd, - )"
$ awk -F'\t' -v OFS='\t' "{print $cols}" test.txt

Given 2 or more tsv files and only a terminal, how can you calculate the average difference of dates that only appear in both?

Let's say we have some files:
File 1 (date, value):
20130510\t50000
20130520\t3400
20130601\t4500
File 2 (date, something, value):
20130511\tx\t123
20130520\ty\t456
20130601\tz\t789
We want the average of the difference in values associated with the dates that appear in both files.
20130520 and 20130601 appear in both (need some kind of filter)
the difference in values is abs(3400-456) and abs(4500-789)
the average is (abs(3400-456)+abs(4500-789))/2.0
I can easily do this in Python, but how about with awk in the terminal?
You could try:
awk -f a.awk file1 file2
where a.awk is:
BEGIN {FS="\t"}
NR==FNR{
x[$1]=$2; next
}
$1 in x {
y[$1]=$3
}
END{
for (i in y) {
s=s+abs(x[i]-y[i])
j++
}
print s/j
}
function abs(x){return ((x < 0.0) ? -x : x)}
Output:
3327.5
Using awk
awk 'NR==FNR{a[$1]=$2;next}
{if ($1 in a) { s+=sqrt((a[$1]-$3)*(a[$1]-$3));i++}}
END{print s/i}' file1 file2
Explanation
No need to set FS to "\t", because awk's default whitespace splitting already treats tabs as separators
sqrt((x-y)*(x-y)) can be easily used for ABS function.
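Both versions above divide by the match count without checking it, so they fail when the files share no dates. A sketch with that guard added, using the sample data from the question. The temporary files and the message text are illustrative assumptions.

```shell
# Sample files from the question (temporary paths are illustrative).
f1=$(mktemp); f2=$(mktemp)
printf '20130510\t50000\n20130520\t3400\n20130601\t4500\n' > "$f1"
printf '20130511\tx\t123\n20130520\ty\t456\n20130601\tz\t789\n' > "$f2"

avg=$(awk -F'\t' '
  NR == FNR { a[$1] = $2; next }            # first file: remember value per date
  $1 in a   { d = a[$1] - $3; s += (d < 0 ? -d : d); n++ }
  END {
    if (n) print s / n                      # average absolute difference
    else   print "no common dates"          # avoid dividing by zero
  }' "$f1" "$f2")
echo "$avg"   # 3327.5
```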

sum same column across multiple files using awk ?

I want to add the 3rd column of 5 files such that the new file will have the same 2nd col and the sum of the 3rd col of the 5 files.
I tried something like this:
$ cat freqdat044.dat | awk '{n=$3; getline <"freqdat046.dat";print $2" " n+$3}' > freqtrial1.dat
The files names:
freqdat044.dat
freqdat045.dat
freqdat046.dat
freqdat047.dat
freqdat049.dat
freqdat050.dat
The output file should contain only $2 and a new column holding the sum of the 3rd columns.
awk '{x[$2] += $3} END {for(y in x) print y,x[y]}' freqdat044.dat freqdat045.dat freqdat046.dat freqdat047.dat freqdat049.dat freqdat050.dat
This does not necessarily print lines as they appear in the first file. If you want to preserve that sorting, then you have to save that ordering somewhere:
awk 'FNR==NR {keys[FNR]=$2; cnt=FNR} {x[$2] += $3} END {for(i=1; i<=cnt; ++i) print keys[i],x[keys[i]]}' freqdat044.dat freqdat045.dat freqdat046.dat freqdat047.dat freqdat049.dat freqdat050.dat
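A small runnable check of the order-preserving version, using two stand-in files. The file names and contents are made up for illustration and are much smaller than the real freqdat files.

```shell
# Two tiny stand-in files (names and contents are made up for illustration).
d=$(mktemp -d)
printf '1 k1 10\n2 k2 20\n' > "$d/a.dat"
printf '1 k1 1\n2 k2 2\n'   > "$d/b.dat"

sums=$(awk 'FNR == NR { keys[FNR] = $2; cnt = FNR }   # remember key order from the first file
            { x[$2] += $3 }                           # accumulate column 3 per key
            END { for (i = 1; i <= cnt; ++i) print keys[i], x[keys[i]] }' \
       "$d/a.dat" "$d/b.dat")
echo "$sums"
```

Keys that appear only in later files are summed but not printed, since the output loop walks the key list taken from the first file.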