command working for small data but not for big data - awk
I'm trying to select the list of IDs where specific positions are not empty (e.g., positions 29, 30, 103 and 104). If a position is empty, the record should be rejected. I tried this with awk and it works well with small data (<100 records), but with big data (>1000000 records) all the IDs get selected.
Please suggest what I'm doing wrong.
awk '
{ FS=","; $0=$0
  if ($29!="" && $30!="" && $323!="" && $324!="") print "ID", NR, "selected" }
' file.csv
This command works only with small data; please advise where I'm going wrong.
I've generated a file with 1024456 records. Each record has 152 comma-separated fields, and each field has some probability of being empty.
Sample of the data:
,,29378,,,,10154,,,,,,,,6118,,29,15384,,,,27106,30693,,,,2021,,,,,30609,,15148,,,3406,10181,,,,178,,,,,,,,31308,10049,,,14783,,,,,26032,,,,21999,,,15978,,,,,,12975,22933,,,18981,,,,,,21590,21196,,,,,,14680,,18167,9839,,,5282,,,27112,,,1264,,,22086,,,,,,,,,,,,,,18940,,,11353,,,29966,32569,2495,,11841,,25529,,15423,,,,2799,,15511,,,3010,,,4359,,,,,,12244,18968,13926
As expected, explicitly avoiding re-splitting every record (setting FS once with -F instead of assigning FS and $0=$0 inside the main block) yields better results:
for run in {1..10}; do \
/usr/bin/time --format='%C took %e seconds' \
awk -F"," \
'{if ($29!="" || $30!="" || $92!="" || $132!= "") print "ID", NR, "selected" }' \
file1.txt > /dev/null;
done
awk (...) took 3.36 seconds
awk (...) took 3.35 seconds
awk (...) took 3.78 seconds
awk (...) took 3.48 seconds
awk (...) took 3.58 seconds
awk (...) took 3.75 seconds
awk (...) took 3.49 seconds
awk (...) took 3.53 seconds
awk (...) took 3.47 seconds
awk (...) took 3.93 seconds
compared to the original solution the OP provided:
for run in {1..10}; do \
/usr/bin/time --format='%C took %e seconds' \
awk \
'{FS=",";$0=$0; if ($29!="" || $30!="" || $92!="" || $132!= "") print "ID", NR, "selected" }' \
file1.txt > /dev/null;
done
awk (...) took 9.04 seconds
awk (...) took 8.93 seconds
awk (...) took 9.05 seconds
awk (...) took 9.14 seconds
awk (...) took 8.93 seconds
awk (...) took 9.05 seconds
awk (...) took 8.76 seconds
awk (...) took 9.72 seconds
awk (...) took 9.29 seconds
awk (...) took 9.17 seconds
PS: I ran each solution 10 times to average out the results, and I made a slight change to the OP's code per the requirement (a record is selected if at least one of the listed fields is non-empty).
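The fix can be sketched on a tiny made-up sample: set FS once on the command line with -F rather than assigning FS and re-splitting with $0=$0 inside the main block, and keep the && logic when all listed positions must be non-empty. The file name and field positions below are hypothetical, chosen only for illustration.

```shell
# Hypothetical 3-record sample: fields 2 and 3 stand in for the
# "must not be empty" positions from the question.
printf 'a,1,2\nb,,3\nc,4,\n' > sample.csv

# FS is set once via -F, so awk splits each record exactly once.
awk -F',' '$2 != "" && $3 != "" { print "ID", NR, "selected" }' sample.csv
# prints: ID 1 selected
```

Only the first record has both fields non-empty, so only it is selected.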
Related
Sum specific column value until a certain value is reached
I want to print rows until the first column's running sum reaches a certain value. Input:
43 12.00 53888
29 10.00 36507
14 9.00 18365
8 8.00 10244
1 7.00 2079
1 9.50 1633
0 6.00 760
With val = 90, I would like the output to be:
43 12.00 53888
29 10.00 36507
14 9.00 18365
Could you please try the following, written and tested with the shown samples. It explicitly exits once the first column's running sum exceeds the given value, to avoid unnecessarily reading the rest of the Input_file.
awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val{print}{prev+=$1}' Input_file
OR
awk -v val="90" '($1+prev)>val{exit} ($1+prev)<=val; {prev+=$1}' Input_file
Explanation: adding a detailed explanation for the above.
awk -v val="90" '    ##Start the awk program and set variable val to 90.
($1+prev)>val{       ##If the sum of the 1st field and prev is greater than val:
  exit               ##exit the program to save some time.
}
($1+prev)<=val;      ##If the sum of $1 and prev is less than or equal to val, print the current line.
{
  prev+=$1           ##Keep adding the 1st field to the prev variable.
}
' Input_file         ##Mention the Input_file name here.
Perl to the rescue!
perl -sape '$s += $F[0]; exit if $s > $vv' -- -vv=90 file
-s enables setting variables from the command line; -vv=90 sets the $vv variable to 90
-p processes the input line by line and prints each line after processing
-a splits each line on whitespace and populates the @F array
The variable $s holds the running sum. Each line is printed while the sum stays at or below $vv; once the sum grows too large, the program exits.
Consider a small one-line awk.
Revised (Sep 2020): modified to take Bruno's comments into account, going for a readable solution; see kvantour's answer for a compact one.
awk -v val=85 '{ s += $1 ; if ( s > val ) exit ; print }'
Original post (Aug 2020):
awk -v val=85 '{ s += $1 ; if ( s <= val ) print }'
Or even:
awk -v val=85 '{ s += $1 } s <= val'
Consider an even smaller awk, very much in line with dash-o's solution:
awk -v v=90 '((v-=$1)<0){exit}1' file
or the smallest:
awk -v v=90 '0<=(v-=$1)' file
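A quick sanity check of the exit-early approach on the question's sample data (with val=90, the fourth row pushes the sum past the limit, so nothing after the third row is printed):

```shell
printf '43 12.00 53888\n29 10.00 36507\n14 9.00 18365\n8 8.00 10244\n' |
awk -v val=90 '{ s += $1 } s > val { exit } { print }'
# prints the first three rows: the running sum 43+29+14 = 86 stays <= 90,
# and the fourth row (sum 94) triggers the exit before its print.
```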
awk: print each column of a file into separate files
I have a file with 100 columns of data. I want to print the first column and the i-th column into 99 separate files. I am trying to use:
for i in {2..99}; do awk '{print $1" " $i }' input.txt > data${i}; done
But I am getting errors:
awk: illegal field $(), name "i"
 input record number 1, file input.txt
 source line number 1
How do I correctly use $i inside {print}?
The following single awk may help you here:
awk -v start=2 -v end=99 '{for(i=start;i<=end;i++){f="file"i; print $1,$i >> f; close(f)}}' Input_file
(Note the >> together with close(): reopening a file with > would truncate it again on every record, so append is needed when the file is closed after each write.)
An all-awk solution. First, test data:
$ cat foo
11 12 13
21 22 23
Then the awk:
$ awk '{for(i=2;i<=NF;i++) print $1,$i > ("data" i)}' foo
and results:
$ ls data*
data2 data3
$ cat data2
11 12
21 22
The for iterates from 2 to the last field. If there are more fields than you want to process, change NF to the number you'd like. If, for some reason, a hundred open files would be a problem on your system, put the print into a block and add a close call:
$ awk '{for(i=2;i<=NF;i++){f=("data" i); print $1,$i >> f; close(f)}}' foo
If you want to do what you are trying to accomplish:
for i in {2..99}; do
    awk -v x=$i '{print $1" " $x }' input.txt > data${i}
done
Note the -v switch of awk to pass variables; $x is the n-th column, with n taken from your shell variable.
Note 2: this is not the fastest solution (a single awk call is faster), but I just tried to correct your logic. Ideally, take time to understand awk; it's never wasted time.
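An end-to-end check of the single-awk approach on the two-line sample above (the dataN file names are as in the answers):

```shell
printf '11 12 13\n21 22 23\n' > foo

# One pass writes every per-column file at once.
awk '{ for (i = 2; i <= NF; i++) print $1, $i > ("data" i) }' foo

cat data2
# prints:
# 11 12
# 21 22
```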
How to sum first 100 rows of a specific column using Awk?
How to sum the first 100 rows of a specific column using awk? I wrote:
awk 'BEGIN{FS="|"} NR<=100 {x+=$5}END {print x}' temp.txt
But this takes a lot of time to process; is there another way that gives the result quickly?
Just exit after the required first 100 records:
awk -v iwant=100 '{x+=$5} NR==iwant{exit} END{print x+0}' test.in
Take it out for a spin:
$ for i in {1..1000}; do echo 1 >> test.in ; done   # a thousand records
$ awk -v iwant=100 '{x+=$1} NR==iwant{exit} END{print x+0}' test.in
100
You can always trim the input and use the same script:
head -100 file | awk ... your script here ...
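To see why the early exit matters, a synthetic |-separated file works; the file name test.in and the field values below are made up for the demo:

```shell
# Build 1000 records whose 5th |-separated field is always 1.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "a|b|c|d|1" }' > test.in

# Sum field 5 of the first 100 records, then stop reading;
# exit still runs the END block, so the total is printed.
awk -F'|' -v iwant=100 '{ x += $5 } NR == iwant { exit } END { print x + 0 }' test.in
# prints: 100
```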
I have a script which works, but how do I make it more "elegant"?
Some background. I have two files (A and B) which contain data I need to extract.
For file A, I only need the last two lines, which look like this:
RMM: 17 -0.221674395053E+01 0.59892E-04 0.00000E+00 31 0.259E-03
1 F= -.22167440E+01 E0= -.22167440E+01 d E =-.398708E-10 mag= 2.0000
I need to extract the following numbers:
- 1st line, 2nd field (17)
- 1st line, 4th field (0.59892E-04)
- 2nd line, 1st field (1)
- 2nd line, 3rd field (-.22167440E+01)
- 2nd line, 5th field (-.22167440E+01)
- 2nd line, 8th field (-.398708E-10)
- 2nd line, 10th field (2.0000)
For file B, I only need the last 11 lines, which look like this:
Total CPU time used (sec): 0.364
User time (sec): 0.355
System time (sec): 0.009
Elapsed time (sec): 1.423
Maximum memory used (kb): 9896.
Average memory used (kb): 0.
Minor page faults: 2761
Major page faults: 4
Voluntary context switches: 24
I need to extract the following numbers:
- 1st line, 6th field (0.364)
- 2nd line, 4th field (0.355)
- 3rd line, 4th field (0.009)
- 4th line, 4th field (1.423)
- 6th line, 5th field (9896.)
- 7th line, 5th field (0.)
My output should be like this:
mainfolder1[tab/space]subfolder1[tab/space][all the extracted info separated by tab]
mainfolder2[tab/space]subfolder2[tab/space][all the extracted info separated by tab]
mainfolder3[tab/space]subfolder3[tab/space][all the extracted info separated by tab]
...
mainfoldern[tab/space]subfoldern[tab/space][all the extracted info separated by tab]
Now this is my script code:
for m in ./*/; do
    main=$(basename "$m")
    for s in "$m"*/; do
        sub=$(basename "$s")
        vdata=$(tail -n2 ./$main/$sub/A | awk -F'[ =]+' 'NR==1{a=$2;b=$4;next}{print s,a,$2,$4,$6,$9,$11}')
        ctime=$(tail -n11 ./$main/$sub/B | head -n1 | awk '{print $6}')
        utime=$(tail -n10 ./$main/$sub/B | head -n1 | awk '{print $4}')
        stime=$(tail -n9 ./$main/$sub/B | head -n1 | awk '{print $4}')
        etime=$(tail -n8 ./$main/$sub/B | head -n1 | awk '{print $4}')
        maxmem=$(tail -n6 ./$main/$sub/B | head -n1 | awk '{print $5}')
        avemem=$(tail -n5 ./$main/$sub/B | head -n1 | awk '{print $5}')
        c=$(echo $sub | cut -c 2-)
        echo "$m $c $vdata $ctime $utime $stime $etime $maxmem $avemem"
    done
done > output
Now, the fourth line, the vdata part, was actually a "recycled" line from a previous forum question; I do not fully understand it. I want my file-B code to be as elegant as that awk code for file A. How do I do it? Thank you! :)
awk 'NR==1{print $6} NR==2{print $4} NR==3{print $4} ...'
You could simplify a bit with NR==2 || NR==3 || NR==4, but that seems hard to maintain. Or you could use an array:
awk 'BEGIN{a[1]=6;a[2]=4;...} NR in a{ print $a[NR] }'
But I think you really just want:
awk '{print $NF}' ORS=\\t
(You don't want the 6th field from row 1; you want the last field.) Rather than collecting the output into variables just to echo them, add ORS=\\t to get tab-separated output, and just let it print to the script's stdout.
For file B, try something like:
tail -n11 B | awk -F':' '{ print $2 }'
If you need to retain the values and then echo them, you could do something like:
array=($(tail -n11 B | awk -F':' '{ print $2 }'))
for value in "${array[@]}"
do
    echo $value
done
You should look into find and xargs, since every time you write a loop in shell just to manipulate text you have the wrong approach. BUT, to keep it simple and retain your original structure, it sounds like you could use something like:
for m in ./*/; do
    main=$(basename "$m")
    for s in "$m"*/; do
        sub=$(basename "$s")
        fileA="${main}/${sub}/A"
        fileB="${main}/${sub}/B"
        awk -v sizeA=$(wc -l < "$fileA") -v sizeB=$(wc -l < "$fileB") '
            NR==FNR {
                if ( FNR == (sizeA-1) ) { split($0,p) }
                if ( FNR == sizeA )     { split($0,a) }
                next
            }
            { b[sizeB + 1 - FNR] = $NF }
            END {
                split(FILENAME,f,"/")
                print f[1], f[2], p[2], p[4], a[1], a[3], a[5], a[8], a[10], b[11], b[10], b[9], b[8], b[6], b[5]
            }
        ' "$fileA" "$fileB"
    done
done > output
Note that the above opens each "B" file only once instead of six times.
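The $NF idea from the first answer can be checked in isolation; this three-line B fragment is trimmed from the question's sample:

```shell
printf 'Total CPU time used (sec): 0.364\nUser time (sec): 0.355\nSystem time (sec): 0.009\n' > B

# Print the last whitespace-separated field of every line, tab-separated.
awk '{ print $NF }' ORS='\t' B; echo
# prints the three values on one line: 0.364, 0.355, 0.009
```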
how to get the output of 'system' command in awk
I have a file in which one field is a timestamp like 20141028 20:49:49, and I want to get the hour (20), so I used the system command:
hour=system("date -d\""$5"\" +'%H'")
The timestamp is the fifth field in my file, hence $5. But when I executed the program, the command above just printed 20 and returned 0, so hour ended up as 0, not 20. My question is: how do I get the hour from the timestamp? I know a method that uses the split function twice:
split($5, vec, " ")
split(vec[2], vec2, ":")
But this is a little inefficient and ugly, so are there any other solutions? Thanks.
Another way, using gawk:
gawk 'match($5, " ([0-9]+):", r){print r[1]}' input_file
If you want to know how to capture external process output in awk:
awk '{cmd="date -d \""$5"\" +%H"; cmd | getline hour; print hour; close(cmd)}' input_file
You can use the substr function to extract the hour without the system command. For example:
awk '{print substr("20:49:49",1,2)}'
will produce the output 20. Or, more specifically, as in the question:
$ awk '{print substr("20141028 20:49:49",10,2)}'
20
substr(str, pos, len) extracts a substring from str at position pos with length len. If the value of $5 is 20141028 20:49:49:
$ awk '{print substr($5,10,2)}'
20
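Note that with the default FS a timestamp containing a space cannot sit entirely in $5; assuming the file is comma-separated (an assumption here, since the question never shows a full record), the substr variant looks like this:

```shell
# Hypothetical comma-separated record; $5 holds "20141028 20:49:49".
echo 'f1,f2,f3,f4,20141028 20:49:49' |
awk -F',' '{ print substr($5, 10, 2) }'   # chars 10-11 are the hour
# prints: 20
```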