command working for small data but not for big data - awk

I'm trying to select the list of IDs where specific positions are not empty (for example, positions 29, 30, 103 and 104). If a position is empty, the record should be rejected. I tried this with awk and it works well with small data (<100 records), but with big data (>1000000 records) all the IDs get selected.
Please provide a suggestion.
awk '
{FS=",";$0=$0;
if ($29!="" && $30!="" && $323!="" && $324!= "") print "ID", NR, "selected" }
' file.csv
This command works only with small data; please advise where I'm going wrong.

I've generated a file with 1024456 records. Each record has 152 comma-separated fields, and each field has some probability of being left empty.
Sample of the data:
,,29378,,,,10154,,,,,,,,6118,,29,15384,,,,27106,30693,,,,2021,,,,,30609,,15148,,,3406,10181,,,,178,,,,,,,,31308,10049,,,14783,,,,,26032,,,,21999,,,15978,,,,,,12975,22933,,,18981,,,,,,21590,21196,,,,,,14680,,18167,9839,,,5282,,,27112,,,1264,,,22086,,,,,,,,,,,,,,18940,,,11353,,,29966,32569,2495,,11841,,25529,,15423,,,,2799,,15511,,,3010,,,4359,,,,,,12244,18968,13926
As expected, avoiding the forced re-split of every record (setting FS once with -F instead of reassigning FS and $0 inside the main block) yields better results:
for run in {1..10}; do \
/usr/bin/time --format='%C took %e seconds' \
awk -F"," \
'{if ($29!="" || $30!="" || $92!="" || $132!= "") print "ID", NR, "selected" }' \
file1.txt > /dev/null;
done
awk (...) took 3.36 seconds
awk (...) took 3.35 seconds
awk (...) took 3.78 seconds
awk (...) took 3.48 seconds
awk (...) took 3.58 seconds
awk (...) took 3.75 seconds
awk (...) took 3.49 seconds
awk (...) took 3.53 seconds
awk (...) took 3.47 seconds
awk (...) took 3.93 seconds
Compare that with the OP's original solution:
for run in {1..10}; do \
/usr/bin/time --format='%C took %e seconds' \
awk \
'{FS=",";$0=$0; if ($29!="" || $30!="" || $92!="" || $132!= "") print "ID", NR, "selected" }' \
file1.txt > /dev/null;
done
awk (...) took 9.04 seconds
awk (...) took 8.93 seconds
awk (...) took 9.05 seconds
awk (...) took 9.14 seconds
awk (...) took 8.93 seconds
awk (...) took 9.05 seconds
awk (...) took 8.76 seconds
awk (...) took 9.72 seconds
awk (...) took 9.29 seconds
awk (...) took 9.17 seconds
PS: I ran each solution 10 times to average out the results, and I made a slight change to the OP's condition (using ||, so a record is selected when at least one of the checked fields is non-empty).
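For completeness, here is a minimal sketch of the same -F approach applied to the OP's original all-fields-non-empty condition (field numbers as in the OP's code):
awk -F',' '$29!="" && $30!="" && $323!="" && $324!="" { print "ID", NR, "selected" }' file.csv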

Related

Sum specific column value until a certain value is reached

I want to print lines until the first column's running sum reaches a certain value, like this:
43 12.00 53888
29 10.00 36507
14 9.00 18365
8 8.00 10244
1 7.00 2079
1 9.50 1633
0 6.00 760
I would like the output to be:
val = 90
43 12.00 53888
29 10.00 36507
14 9.00 18365
Could you please try the following, written and tested with the shown samples. It explicitly exits once the running sum of the 1st column exceeds the given value, to avoid needlessly reading the rest of the Input_file.
awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val{print}{prev+=$1}' Input_file
OR
awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val; {prev+=$1}' Input_file
Explanation: a detailed explanation of the above.
awk -v val="90" ' ##Starting awk program from here and mentioning variable val as 90 here.
($1+prev)>val{ ##Checking condition if first field and prev variable sum is greater than val then do following.
exit ##exit from program to save some time.
}
($1+prev)<=val; ##If the sum of $1 and prev is less than or equal to val, print the current line (the trailing ; means the default action, print).
{
prev+=$1 ##keep adding 1st field to prev variable here.
}
' Input_file ##Mentioning Input_file name here.
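With the sample input from the question and val=90, either command prints exactly the three expected lines, for example:
$ awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val{print}{prev+=$1}' Input_file
43 12.00 53888
29 10.00 36507
14 9.00 18365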
Perl to the rescue!
perl -sape ' $s += $F[0] ; exit if $s > $vv' -- -vv=90 file
-s enables setting variables from the command line, -vv=90 sets the $vv variable to 90
-p processes the input line by line, it prints each line after processing
-a splits each line on whitespace and populates the @F array
The variable $s holds the running sum. Each line is printed only while the sum does not exceed $vv; once the sum grows too large, the program exits before printing that line.
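With the same sample input and -vv=90, it would print the same three lines before exiting:
$ perl -sape ' $s += $F[0] ; exit if $s > $vv' -- -vv=90 file
43 12.00 53888
29 10.00 36507
14 9.00 18365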
Consider a small one-line awk.
Revised (Sep 2020): modified to take Bruno's comments into account, going for a readable solution; see kvantour's answer for a compact one.
awk -v val=85 '{ s+= $1 ; if ( s > val ) exit ; print }'
Original post (Aug 2020):
awk -v val=85 '{ s += $1 ; if ( s <= val ) print }'
Or even
awk -v val=85 '{ s+= $1 } s <= val'
Consider an even smaller awk, very much in line with dash-o's solution:
awk -v v=90 '((v-=$1)<0){exit}1' file
Or the smallest (note this variant never exits early, so it reads the whole file):
awk -v v=90 '0<=(v-=$1)' file

awk: print each column of a file into separate files

I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting errors
awk: illegal field $(), name "i"
input record number 1, file input.txt
source line number 1
How to correctly use $i inside the {print }?
The following single awk may help you here:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){print $1,$i > ("file" i); close("file" i)}}' Input_file
An all-awk solution. First, some test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for loop iterates from 2 to the last field. If there are more fields than you want to process, change NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, you'd need to put the print into a block and add a close call:
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to fix your original approach:
for i in {2..99}; do
awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done
Note:
the -v switch of awk passes shell variables into awk
$x is then the n-th column, where n is the value stored in the variable x
Note 2: this is not the fastest solution (a single awk call is faster), but it corrects your original logic. Ideally, take time to understand awk; it's never wasted time.
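For example, run against the two-line foo sample from the earlier answer (and limited to fields 2 and 3), the corrected loop would produce the same data2 and data3 files shown there:
for i in {2..3}; do
    awk -v x=$i '{print $1" " $x }' foo > data${i}
done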

How to sum first 100 rows of a specific column using Awk?

How do I sum the first 100 rows of a specific column using awk? I wrote
awk 'BEGIN{FS="|"} NR<=100 {x+=$5}END {print x}' temp.txt
But this is taking a lot of time to process; is there another way that gives the result quickly?
Just exit after the required first 100 records:
awk -v iwant=100 '{x+=$5} NR==iwant{exit} END{print x+0}' test.in
Take it out for a spin:
$ for i in {1..1000}; do echo 1 >> test.in ; done # a thousand records
$ awk -v iwant=100 '{x+=$1} NR==iwant{exit} END{print x+0}' test.in
100
You can always trim the input and use the same script:
head -100 file | awk ... your script here ...
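Applied to the OP's original command, that would look something like this (the NR<=100 guard is no longer needed, since head already limits the input):
head -100 temp.txt | awk 'BEGIN{FS="|"} {x+=$5} END{print x}'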

I have a script which works, but how do I make it more "elegant"?

Some background. I have two files (A and B) which contain data I need to extract.
For file A, I only need the last two lines which look like this:
RMM: 17 -0.221674395053E+01 0.59892E-04 0.00000E+00 31 0.259E-03
1 F= -.22167440E+01 E0= -.22167440E+01 d E =-.398708E-10 mag= 2.0000
I need to extract the following numbers:
- 1st line, 2nd field (17)
- 1st line, 4th field (0.59892E-04)
- 2nd line, 1st field (1)
- 2nd line, 3rd field (-.22167440E+01)
- 2nd line, 5th field (-.22167440E+01)
- 2nd line, 8th field (-.398708E-10)
- 2nd line, 10th field (2.0000)
For file B, I only need the last 11 lines which look like this:
Total CPU time used (sec): 0.364
User time (sec): 0.355
System time (sec): 0.009
Elapsed time (sec): 1.423
Maximum memory used (kb): 9896.
Average memory used (kb): 0.
Minor page faults: 2761
Major page faults: 4
Voluntary context switches: 24
I need to extract the following numbers:
- 1st line, 6th field (0.364)
- 2nd line, 4th field (0.355)
- 3rd line, 4th field (0.009)
- 4th line, 4th field (1.423)
- 6th line, 5th field (9896.)
- 7th line, 5th field (0.)
My output should be like this:
mainfolder1[tab/space]subfolder1[tab/space][all the extracted info separated by tab]
mainfolder2[tab/space]subfolder2[tab/space][all the extracted info separated by tab]
mainfolder3[tab/space]subfolder3[tab/space][all the extracted info separated by tab]
...
mainfoldern[tab/space]subfoldern[tab/space][all the extracted info separated by tab]
Now this is my script code:
for m in ./*/; do
    main=$(basename "$m")
    for s in "$m"*/; do
        sub=$(basename "$s")
        vdata=$(tail -n2 ./$main/$sub/A | awk -F'[ =]+' 'NR==1{a=$2;b=$4;next}{print s,a,$2,$4,$6,$9,$11}')
        ctime=$(tail -n11 ./$main/$sub/B | head -n1 | awk '{print $6}')
        utime=$(tail -n10 ./$main/$sub/B | head -n1 | awk '{print $4}')
        stime=$(tail -n9 ./$main/$sub/B | head -n1 | awk '{print $4}')
        etime=$(tail -n8 ./$main/$sub/B | head -n1 | awk '{print $4}')
        maxmem=$(tail -n6 ./$main/$sub/B | head -n1 | awk '{print $5}')
        avemem=$(tail -n5 ./$main/$sub/B | head -n1 | awk '{print $5}')
        c=$(echo $sub | cut -c 2-)
        echo "$m $c $vdata $ctime $utime $stime $etime $maxmem $avemem"
    done
done > output
Now, the vdata line was actually "recycled" from a previous forum question, and I do not fully understand it. I want my file B code to be as elegant as that awk code for file A. How do I do it? Thank you! :)
For file B, you could write:
awk 'NR==1{print $6} NR==2{print $4} NR==3{print $4} ...'
You could simplify a bit with:
NR==2 || NR==3 || NR==4
but that seems hard to maintain. Or you could use an array:
awk 'BEGIN{a[1]=6;a[2]=4...} NR in a{ print $a[NR]}'
But I think you really just want:
awk '{print $NF}' ORS=\\t
(You don't want the 6th field from row 1. You want the last field.)
Rather than trying to collect the output into variables just to be echoed, add ORS=\\t to get tab separated output, and just let it print to stdout of the script.
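For a single B file, a minimal sketch of that idea might be:
tail -n11 B | awk '{print $NF}' ORS=\\t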
For file B try something like:
tail -n11 B | awk -F':' '{ print $2 }'
If you need to retain the values and then echo them, you could do something like:
array=($(tail -n11 B | awk -F':' '{ print $2 }'))
for value in "${array[@]}"
do
echo "$value"
done
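If instead you want the tab-separated line described in the question, a small sketch using the same array could be:
printf '%s\t' "${array[@]}"   # each captured value followed by a tab
printf '\n'                   # terminate the line
(This leaves a trailing tab, which is usually harmless for output like this.)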
You should look into find and xargs, since every time you write a loop in shell just to manipulate text you have the wrong approach. BUT to keep it simple and retain your original structure, it sounds like you could use something like:
for m in ./*/; do
    main=$(basename "$m")
    for s in "$m"*/; do
        sub=$(basename "$s")
        fileA="${main}/${sub}/A"
        fileB="${main}/${sub}/B"
        awk -v sizeA=$(wc -l < "$fileA") -v sizeB=$(wc -l < "$fileB") '
            NR==FNR {
                if ( FNR == (sizeA-1) ) { split($0,p) }
                if ( FNR == sizeA )     { split($0,a) }
                next
            }
            { b[sizeB + 1 - FNR] = $NF }
            END {
                split(FILENAME,f,"/")
                print f[1], f[2], p[2], p[4], a[1], a[3], a[5], a[8], a[10], b[11], b[10], b[9], b[8], b[6], b[5]
            }
        ' "$fileA" "$fileB"
    done
done > output
Note that the above opens each "B" file only once instead of six times.

how to get the output of 'system' command in awk

I have a file in which one field is a timestamp like 20141028 20:49:49. I want to get the hour (20), so I used the system command:
hour=system("date -d\""$5"\" +'%H'")
The timestamp is the fifth field in my file, so I used $5. But when I executed the program I found that the command above just outputs 20 and returns 0, so hour ends up as 0, not 20. So my question is: how do I get the hour from the timestamp?
I know a method which uses the split function twice, like this:
split($5, vec, " " )
split(vec[2], vec2, ":")
But this method is a little inefficient and ugly.
So are there any other solutions? Thanks.
Another way using gawk:
gawk 'match($5, " ([0-9]+):", r){print r[1]}' input_file
If you want to know how to capture external process output in awk:
awk '{cmd="date -d \""$5"\" +%H";cmd|getline hour;print hour;close(cmd)}' input_file
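As a standalone illustration of the same command-via-getline idea (assuming GNU date and a hard-coded timestamp in place of $5), this prints 20:
awk 'BEGIN{cmd="date -d \"20141028 20:49:49\" +%H"; cmd | getline hour; print hour; close(cmd)}'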
You can use the substr function to extract the hour without using the system command.
For example:
awk 'BEGIN{print substr("20:49:49",1,2)}'
will produce the output:
20
Or, more specifically, as in the question:
$ awk 'BEGIN{print substr("20141028 20:49:49",10,2)}'
20
substr(str, pos, len) extracts a substring of length len from str, starting at position pos.
If the value of $5 is "20141028 20:49:49":
$ awk '{print substr($5,10,2)}' input_file
20