Store each of the first 2 blocks of lines in arrays - awk

I've been doing this in Google Sheets, but it takes a long time, so I'd like to handle it with awk instead.
input.txt
Column 1
2
2
2
4
4

Column 2
562
564
119
215
12

Range
13455,13457
13161
11409
13285,13277-13269
11409
I've tried this script to rearrange the values:
awk '/Column 1/' RS= input.txt
(as referred in How can I set the grep after context to be "until the next blank line"?)
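With the blocks separated by blank lines as above, RS= puts awk in paragraph mode, so that command selects and prints whole blocks; it outputs just the first one:
$ awk '/Column 1/' RS= input.txt
Column 1
2
2
2
4
4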
But that command only selects the one matching block. What I actually need is the three blocks combined line by line.
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409
In other words, when a Range entry contains a comma, the corresponding Column 1 and Column 2 values are repeated for each comma-separated item, e.g.:
Range:
13455,13457
Result:
562Value2#13455
562Value2#13457

I don't know what sorting has to do with it, but it seems like this is what you're looking for:
$ cat tst.awk
BEGIN { FS=","; recNr=1; print "Result:" }
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
for (i=1; i<=NF; i++) {
print b[lineNr] "Value" a[lineNr] "#" $i
}
}
$ awk -f tst.awk input.txt
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409

Related

Using If-Statement to assign Variable in awk

I am trying to create an awk script file that takes an input file and converts the fourth column of information for the first three lines into a single row.
For example, if input.txt looks like this:
XX YY val1 1234
XX YY val2 2345
XX YY val3 3456
stuff random garbage junk extrajunk
useless 343059 random3
I want to print the fourth column for rows 1, 2 and 3 into a single row:
1234 2345 3456
I was trying to do this by using if/else-if statements so my file looks like this right now:
#!/usr/bin/awk -f
{
    if ($1 == "XX" && $3 == "val1")
    {
        var1=$4;
    }
    else if ($1 == "XX" && $3 == "val2")
    {
        var2=$4;
    }
    else if ($1 == "XX" && $3 == "val3")
    {
        var3=$4;
    }
}
END {
    print var1, var2, var3
}
and then I would print the variables on one line. However, when I try to implement this, I get syntax errors pointing to the "=" symbol in the var2=$4 line.
EDIT
Solved: in my real file I had given the variables funky (yet descriptive) names, and that was messing it all up. Oops.
Thanks
You can write something like this:
$ awk '$1=="XX" { if      ($3=="val1") var1=$4
                  else if ($3=="val2") var2=$4
                  else if ($3=="val3") var3=$4 }
       # ... do something with the vars ...
      ' file
However, if you just want to print the fourth column of the first 3 lines:
$ awk '{printf "%s ", $4} NR==3{exit}' file
1234 2345 3456
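That leaves a trailing space and no final newline; a variant that terminates the line cleanly (my tweak, not part of the original answer):
$ awk '{printf "%s%s", $4, (NR==3 ? ORS : OFS)} NR==3{exit}' file
1234 2345 3456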
Try this instead:
#!/usr/bin/env bash
awk '
$1 == "XX" { var[$3] = $4 }
END { print var["val1"], var["val2"], var["val3"] }
' "$#"
There's almost certainly a much simpler solution depending on your real requirements though, e.g. maybe:
awk '
{ vars = (NR>1 ? vars OFS : "") $4 }
NR == 3 { print vars; exit }
' "$#"
For ease of future enhancements if nothing else, don't call awk from a shebang; just call it explicitly.

How to swap lines with awk in only a single pass and with limited memory use?

In a previous post, this answer by user2138595 was shown; though beautiful, the problem is that it reads the input file twice.
I wish to make a GNU awk script that reads the input only once.
$ cat swap_line.awk
BEGIN {
    if (init > end) {
        exit 1;
    }
    flag = 1;
    memory_init = "";
    memory = "";
}
{
    if (NR != init && NR != end) {
        if (flag == 1) {
            print $0;
        } else {
            memory = memory $0 "\n";
        }
    } else if (end == init) {
        print $0;
    } else if (NR == init) {
        flag = 0;
        memory_init = $0;
    } else {
        # NR == end
        print $0;
        printf("%s", memory);
        print memory_init;
        flag = 1;
    }
}
END {
    # if end is greater than the number of lines of the file
    if (flag == 0) {
        printf("%s", memory);
        print memory_init;
    }
}
The script works well:
cat input
1
2
3
4
5
awk -v init=2 -v end=4 -f swap_line.awk input
1
4
3
2
5
awk -v init=2 -v end=2 -f swap_line.awk input
1
2
3
4
5
awk -v init=2 -v end=8 -f swap_line.awk input
1
3
4
5
2
QUESTION
How could I write this script in a better way? I don't like using the memory variable, since it can cause problems for large files: for example, if the input file is 10 million lines and I want to swap line 1 and line 10 million, I end up storing 9,999,998 lines in the memory variable.
@JoseRicardoBustosM, it is impossible to do it in one pass in awk without saving the lines from init to one before the end line in memory. Just think about the impossibility of getting a line N lines ahead of what you've already read to miraculously show up in place of the current line. The best solution for this is definitely a simple 2-pass approach: save the lines in the first pass and use them in the second. I am including all solutions that involve grep-ing in advance or using a getline loop in the "2"-pass bucket.
FWIW here's the way I'd really do it (this IS a 2-pass approach):
$ cat swap_line.awk
BEGIN     { ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }   # queue the input file a second time
NR==FNR   { if (NR==end) tl=$0; next }          # pass 1: save the "end" line
FNR==init { hd=$0; $0=tl; nr=NR-FNR; if (nr<end) next }   # pass 2: swap in the saved line; nr = file length
FNR==end  { $0=hd }
FNR==nr   { if (nr<end) $0 = $0 ORS hd }        # "end" was past EOF: append the saved line to the last line
{ print }
$ awk -v init=2 -v end=4 -f swap_line.awk input
1
4
3
2
5
$ awk -v init=2 -v end=2 -f swap_line.awk input
1
2
3
4
5
$ awk -v init=2 -v end=8 -f swap_line.awk input
1
3
4
5
2
Note that if you didn't have that very specific requirement for how to handle an "end" that's past the end of the file then the solution would simply be:
$ cat swap_line.awk
BEGIN { ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR { if (NR==end) tl=$0; next }
FNR==init { hd=$0; $0=tl }
FNR==end { $0=hd }
{ print }
and if you really want something to think about (again, just for the sunny day cases):
$ cat swap_line.awk
NR==init { hd=$0; while ((getline<FILENAME)>0 && ++c<end); }
NR==end { $0=hd }
{ print }
$ awk -v init=2 -v end=4 -f swap_line.awk input
1
4
3
2
5
I would still consider that last one as a "2"-pass approach and I wouldn't do it if I didn't fully understand all the caveats listed at http://awk.info/?tip/getline.
I think you are working too hard. This makes no attempt to deal with extreme cases (e.g., if end is greater than the number of lines, the initial line will not be printed, but that can easily be handled in an END block), because handling the edge cases obscures the idea. Namely: print until you reach the line you want swapped out, then store data in a file, then print the line to swap, the stored data, and the initial line, and then print the rest of the file:
$ cat swap.sh
#!/bin/sh
trap 'rm -f $T1' 0
T1=$(mktemp)
awk '
NR<init { print; next; }
NR==init { f = $0; next; }
NR<end { print > t1; next; }
NR==end { print; system("cat "t1); print f; next; }
1
' init=${1?} end=${2?} t1=$T1
$ yes | sed 10q | nl -ba | ./swap.sh 4 8
1 y
2 y
3 y
8 y
5 y
6 y
7 y
4 y
9 y
10 y
I agree that 2 passes are required. The first pass can be done with tools that are designed specifically for the task:
# $init and $end have been defined
endline=$( tail -n "+$end" file | head -n 1 )
awk -v init="$init" -v end="$end" -v endline="$endline" '
NR == init {saved = $0; $0 = endline}
NR == end {$0 = saved}
{print}
' file
Hide the details away in a function:
swap_lines () {
awk -v init="$1" \
-v end="$2" \
-v endline="$(tail -n "+$2" "$3" | head -n 1)" \
'
NR == init {saved = $0; $0 = endline}
NR == end {$0 = saved}
1
' "$3"
}
seq 5 > file
swap_lines 2 4 file
1
4
3
2
5

Sum up from line "A" to line "B" from a big file using awk

aNumber|bNumber|startDate|timeZone|duration|currencyType|cost|
22677512549|778|2014-07-02 10:16:35.000|NULL|NULL|localCurrency|0.00|
22675557361|76457227|2014-07-02 10:16:38.000|NULL|NULL|localCurrency|10.00|
22677521277|778|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|0.00|
22676099496|77250331|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|1.00|
22667222160|22667262389|2014-07-02 10:16:43.000|NULL|NULL|localCurrency|10.00|
22665799922|70110055|2014-07-02 10:16:45.000|NULL|NULL|localCurrency|20.00|
22676239633|433|2014-07-02 10:16:48.000|NULL|NULL|localCurrency|0.00|
22677277255|76919167|2014-07-02 10:16:51.000|NULL|NULL|localCurrency|1.00|
This is the input (a sample out of millions of lines) I have in a CSV file.
I want to sum up duration based on date.
My concern is that I only want to sum up the first 1000000 lines.
The awk program I'm using is:
test.awk
BEGIN { FS = "|" }
NR>1 && NR<=1000000
FNR == 1{ next }
{
sub(/ .*/,"",$3)
key=sprintf("%10s",$3)
duration[key] += $5 } END {
printf "%-10s %16s,"dAccused","Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}}
I run my script as:
$ awk -f test.awk file
My input doesn't seem to respect my condition NR>1 && NR<=1000000.
Any suggestions, please?
You're looking for this:
BEGIN { FS = "|" }
1 < NR && NR <= 1000000 {
sub(/ .*/, "", $3)
key = sprintf("%10s",$3)
duration[key] += $5
}
END {
printf "%-10s %16s\n", "dAccused", "Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}
}
A lot of errors become obvious with proper indentation.
The reason you saw 1,000,000 lines was due to this:
NR>1 && NR<=1000000
That is a condition with no action block. The default action is to print the current record if the condition is true. That's why you see a lot of awk one-liners end with just the number 1.
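For instance (a quick illustration, not from the original answer), a bare condition prints every record it matches, and a bare 1 is an always-true condition:
$ seq 3 | awk 'NR>1'
2
3
$ seq 3 | awk '1'
1
2
3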
You didn't post any expected output and your duration field is always NULL, so it's still not clear what output you really want, but this is probably the right approach:
$ cat tst.awk
BEGIN { FS = "|" }
NR==1 { for (i=1;i<NF;i++) f[$i] = i; next }
{
sub(/ .*/,"",$(f["startDate"]))
sum[$(f["startDate"])] += $(f["duration"])
}
NR==1000000 { exit }
END { for (date in sum) print date, sum[date] }
$ awk -f tst.awk file
2014-07-02 0
Instead of discarding your header line, it uses it to create an array f[] that maps the field names to their order in each line so instead of having to hard-code that duration is field 4 (or whatever) you just reference it as $(f["duration"]).
Any time your input file has a header line, don't discard it - use it so your script is not coupled to the order of fields in your input file.
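As a quick illustration of that header-mapping idiom (a made-up two-field example, not from the answer):
$ printf 'name|age\nalice|30\nbob|40\n' | awk -F'|' '
    NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
    { print $(f["age"]) }'
30
40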

How can I subtract from each column its mean using awk?

I have a file such as the following (but with thousands of rows and hundreds of columns)
1 2 1
1 2 2
3 2 3
3 2 6
How can I subtract from each column/field its mean using awk, in order to obtain the output below?
-1 0 -2
-1 0 -1
1 0 0
1 0 3
Thank you very much for your help.
The closest solution I found (http://www.unix.com/shell-programming-scripting/102293-normalize-dataset-awk.html) does not seem to do the job element by element. It performs a different operation, but the generic concept is the same: perform an operation on each column using a value calculated from that column.
With awk in two passes:
awk '
NR==FNR {
    for (i=1; i<=NF; i++) {
        a[i] += $i
    }
    next
}
{
    for (y=1; y<=NF; y++) {
        printf "%2d ", $y -= (a[y] / (NR - FNR))
    }
    print ""
}' file file
With awk in one pass:
awk '{
    for (i=1; i<=NF; i++) {
        a[i] += $i
        b[NR,i] = $i
    }
}
END {
    for (i=1; i<=NR; i++) {
        for (j=1; j<=NF; j++) {
            printf "%2d ", b[i,j] -= (a[j] / NR)
        }
        print ""
    }
}' file
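Another answer solves it outside awk, using Python with numpy to subtract the column means in one vectorized step: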
import sys, numpy as np

# load the whitespace-separated file into a float matrix
a = np.array([line.split() for line in open(sys.argv[1])], dtype=float)
# subtract each column's mean and print row by row
for row in a - np.mean(a, axis=0):
    print(' '.join(map(str, row)))
Usage: python script.py inputFile

Awk merge the results of processing two files into a single file

I use awk to extract and calculate information from two different files, and I want to merge the results into a single file in columns (for example, the output of the first file in columns 1 and 2, and the output of the second in columns 3 and 4).
The input files contain:
file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
To parse the first file I do this:
awk '
{
    s = NF
    center = $1
}
{
    printf "%s\t %d\n", center, s
}
' file1
To parse the second file I do this:
awk '
/^>/ {
    if (count != "")
        printf "%s\t %d\n", seq_id, count
    count = 0
    seq_id = $0
    next
}
NF {
    long = length($0)
    count = count + long
}
END {
    if (count != "")
        printf "%s\t %d\n", seq_id, count
}
' file2
My provisional solution is to create a temporary file in the first step and overwrite it in the second. Is there a more "elegant" way to get this output?
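For reference, a minimal sketch of that provisional two-step approach (my reconstruction; the temp-file names are made up):
# write each result to a temp file, then paste them side by side
awk '{ printf "%s\t%d\n", $1, NF }' file1 > cols12.tmp
awk '/^>/ { if (seq != "") print seq "\t" len; seq = $0; len = 0; next }
     NF   { len += length($0) }
     END  { if (seq != "") print seq "\t" len }' file2 > cols34.tmp
paste cols12.tmp cols34.tmp    # file1 results in columns 1-2, file2 results in 3-4
rm -f cols12.tmp cols34.tmp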
I am not fully clear on the requirement; if you can update the question, maybe we can improve the answer. From what I have gathered, you would like to summarize the output from both files. I have assumed that the content in both files is in sequential order. If that is not the case, we will have to add additional checks while printing the summary.
Content of script.awk (re-using most of your existing code):
NR==FNR {                # first file: remember the field count and first field per line
    s[NR] = NF
    center[NR] = $1
    next
}
/^>/ {                   # second file: a new sequence header starts
    seq_id[++y] = $0
    ++i
    next
}
NF {                     # sequence data: accumulate its length
    long[i] += length($0)
}
END {
    for (x=1; x<=length(s); x++) {
        printf "%s\t %d\t %d\n", center[x], s[x], long[x]
    }
}
Test:
$ cat file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
$ cat file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
$ awk -f script.awk file1 file2
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 4 200
ST695_116193610:4:2206:10596:165949 3 0