awk / split to return lines with a certain value in a certain column - create blocks of 100,000

I have a csv file where the third column is a number. Some of the entries don't have a value in this column.
I want to pull blocks of 100k lines from the file, but include only entries with a valid value in that column.
I could use split, but how do I make it check that column for a value?

$ cat test.txt
1,2,3,get me
4,5,,skip me
6,7,8,get me
9,10,11,stop before me
$ awk -F, '$3!="" && ++i<=2' test.txt
1,2,3,get me
6,7,8,get me

If you're trying to verify whether the third field within a record has a value, and output its contents if it does, you could try the following:
awk -F , '{ if($3 != ""){print $3} }'
This could also be written as:
awk -F , '$3 != ""{print $3}'

Related

COUNTIF-like function in AWK with field headers

I am looking for a way of counting the number of times a value in a field appears in a range of fields in a csv file, much the same as COUNTIF in Excel, although I would like to use an awk command if possible.
Column 6 holds the values to be counted, and column 7 should hold the number of times each value appears in column 6, as per below.
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What I want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
# adds the field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from a previous/similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
(very similar question to this one; similar Python-related Q, for my ref)
I would harness GNU AWK for this task in the following way. Let file.txt content be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). Then for the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the 1st line. Then during the first pass, i.e. where the global line number (NR) is equal to the line number within the file (FNR), I count the number of occurrences of the values in the 6th field and store them as values in array arr, then instruct GNU AWK to go to the next line, so nothing else is done in that pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from array arr.
(tested in GNU Awk 5.0.1)
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6] since it's the sixth field that has the key you're counting:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}'
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}'
One awk idea:
awk '
BEGIN { FS=OFS="," }          # define input/output field delimiters as comma
{ lines[NR]=$0
  if (NR==1) next
  col6[NR]=$6                 # copy field 6 so we do not have to parse the contents of lines[] in the END block
  cnt[$6]++
}
END { for (i=1; i<=NR; i++)
          print lines[i], (i==1 ? "count" : cnt[col6[i]])
}
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

Awk Remove lines if one column matches another column, and keep line if max value from another column

I have a file of ~8,000 lines. I am trying to remove the lines where the 5th column matches (in this case ga2016mldlzd), keeping only the line with the max value in the 6th column. For example, if given this:
-25.559,129.8529,6674.560547,2.0,ga2016mldlzd,6
-25.5596,129.8565,6902.750651,2.0,ga2016mldlzd,7
-25.5450,129.830,969.8079427,2.0,ga2016mldlzd,8
-25.5450,129.834,57.04752604,2.0,ga2016mldlzd,9
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
remove all lines except the final line, which has the max value of 10 in the 6th column, to get this. I'm stumped as to how this could be done in either awk or sed:
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I tried this:
awk -F, '!a[$5]++'
but I want to keep the line with the max value in the last column, e.g. the line ending in '10', rather than the line ending in '6'. Thanks
Keep track of the max and line associated with that max and print at the end:
awk -F, '
{
if ($6>max[$5]) {
max[$5]=$6
tl[$5]=$0
}
}
END{
for (l in tl) print tl[l]
}' file
Prints:
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
The order of the file will be lost; i.e., the groups may be reordered compared to the original file.
If you are dealing with a file with many different keys for $5, not all of which can fit in memory, you could sort into blocks grouped by the fifth field and then by the numeric value of the sixth, and have awk print the last line every time the fifth field changes. Since the input is sorted, that line will be the max:
sort -t , -k 5,5 -k 6n file |
awk -F, '
FNR==1{lf=$5; ll=$0}   # initialize the key/line trackers on the first line
lf!=$5{print ll}       # key changed: print the last (max) line of the previous group
{ll=$0; lf=$5}         # remember the current line and its key
END{print ll}'         # print the max line of the final group
# same output as above
This second approach will be much slower, but it uses far less memory when there are many unique $5 values.
If you want to maintain original order of lines then use this awk:
awk -F, 'NR==FNR {if ($6 > max[$5]) max[$5] = $6; next} $5 in max && max[$5] == $6' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
If you want to filter for ga2016mldlzd while maintaining original order of lines then use this awk:
awk -F, '
NR==FNR {
if ($5 == "ga2016mldlzd" && $6 > max[$5]) {
max[$5] = $6
n = FNR
}
next
}
FNR == n' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10

awk conditional statement based on a value between colons

I was just introduced to awk and I'm trying to retrieve rows from my file based on the value in column 10.
I need to filter the data based on the third value within column 10 (the last column), where ":" is used as a separator.
Here is an example of the data in column 10: 0/1:1,9:10:15:337,0,15.
I was able to extract the third value using this command awk '{print $10}' file.txt | awk -F ":" '/1/ {print $3}'
This returns the value 10, but how can I print the whole row (not just the value in column 10) if this third value is less than or greater than a specific number?
I tried this: awk '{if($10 -F ":" "/1/ ($3<10))" print $0;}' file.txt, but it returns a syntax error.
Thanks!
Your code:
awk '{print $10}' file.txt | awk -F ":" '/1/ {print $3}'
should be just 1 awk script:
awk '$10 ~ /1/ { split($10,f,/:/); print f[3] }' file.txt
but I'm not sure that code is doing what you think it does. If you want to print the 3rd value of all $10s that contain :s, as it sounds like from your text, that'd be:
awk 'split($10,f,/:/) > 1 { print f[3] }' file.txt
and to print the rows where that value is less than 7 would be:
awk '(split($10,f,/:/) > 1) && (f[3] < 7)' file.txt
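As a quick sanity check, here is that last filter run against the sample value from the question; the f1 … f9 filler fields are made up just to pad the line out to ten whitespace-separated fields:
$ echo 'f1 f2 f3 f4 f5 f6 f7 f8 f9 0/1:1,9:10:15:337,0,15' | awk '(split($10,f,/:/) > 1) && (f[3] < 7)'
$ echo 'f1 f2 f3 f4 f5 f6 f7 f8 f9 0/1:1,9:5:15:337,0,15' | awk '(split($10,f,/:/) > 1) && (f[3] < 7)'
f1 f2 f3 f4 f5 f6 f7 f8 f9 0/1:1,9:5:15:337,0,15
The first command prints nothing because f[3] is 10; the second prints the whole row because f[3] is 5.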

How to use awk script to generate a file

I have a very large compressed file (dataFile.gz) from which I want to generate another file using cat and awk: using cat to view the contents and then piping them to awk to generate the new file.
The contents of the compressed file are like below:
Time,SequenceNumber,MsgType,MsgLength,CityOrign,RTime
7:20:13,1,A,34,Tokyo,0
7:20:13,2,C,35,Nairobi,7:20:14
7:20:14,3,E,30,Berlin,7:20:15
7:20:16,4,A,34,Berlin,7:20:17
7:20:17,5,C,35,Denver,0
7:20:17,6,D,33,Helsinki,7:20:18
7:20:18,7,F,37,Tokyo,0
….
….
….
For the new file I want to generate, I only want the Time, MsgType and RTime, meaning columns 0, 2 and 5. And for column 5, if the value is 0, replace it with the value at column 0, i.e. replace RTime with Time:
Time,MsgType,RTime
7:20:13,A,7:20:13
7:20:13,C,7:20:14
7:20:14,E,7:20:15
7:20:16,A,7:20:17
7:20:17,C,7:20:17
7:20:17,D,7:20:18
7:20:18,F,7:20:18
This is my script so far:
#!/usr/bin/awk -f
BEGIN {FS=","
print %0,%2,
if ($5 == "0") {
print $0
} else {
print $5
}
}
My question is, will this script work, and how do I call it? Can I call it on the terminal like below?
zcat dataFile.gz | <awk script> > generatedFile.csv
awk field indexing starts at 1 and $0 represents the full record, so the column numbers would be 1, 3 and 6.
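A quick illustration of the 1-based indexing, using a made-up three-field line:
$ echo 'a,b,c' | awk -F, '{print $0; print $1; print $3}'
a,b,c
a
c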
You may use this awk:
awk 'BEGIN{FS=OFS=","} !$6{$6=$1} {print $1, $3, $6}' file
Time,MsgType,RTime
7:20:13,A,7:20:13
7:20:13,C,7:20:14
7:20:14,E,7:20:15
7:20:16,A,7:20:17
7:20:17,C,7:20:17
7:20:17,D,7:20:18
7:20:18,F,7:20:18
Could you please try the following, a slightly shorter version of @anubhava's solution. This one does NOT assign to the 6th field; it only checks whether that field is zero and prints the values accordingly.
awk 'BEGIN{FS=OFS=","} {print $1, $3, $6==0?$1:$6}' Input_file

how to insert new row in 1st position with single quotes with awk

I have very limited knowledge of awk.
I have big csv files (500,000 lines) with the following line format:
'0000011197118123','136',,'35993706', '33745', '22052', 'appsflyer.com'
'0000011194967123','136',,'35282806', '74518', '30317', 'crashlytics.com'
'0000011199022123’,’139',,'01363100', '8776250', '373671', 'whatsapp.com'
............
I need to cut the first 8 digits from the first column and add a date field as a new first column (the date should be the day-1 date), like the following:
'2016/03/12','97118123','136',,'35993706','33745','22052','appsflyer.com'
'2016/03/12','94967123','136',,'35282806','74518','30317','crashlytics.com'
'2016/03/12','99022123’,’139',,'01363100','8776250','373671','whatsapp.com'
Thanks a lot for your time.
M.Tave
You can do something similar to:
awk -F, -v date="2016/03/12" 'BEGIN{OFS=FS}
{sub(/^.{8}/, "'\''", $1)
s="'\''"date"'\''"
$1=s OFS $1
print }' csv_file
I did not understand how you are determining your date, so I just used a string.
Based on comments, you can do:
awk -v d="2016/03/12" 'sub(/^.{8}/,"'\''"d"'\'','\''")' csv_file
$ awk -v d='2016/03/12' '{print "\047" d "\047,\047" substr($0,10)}' file
'2016/03/12','97118123','136',,'35993706', '33745', '22052', 'appsflyer.com'
'2016/03/12','94967123','136',,'35282806', '74518', '30317', 'crashlytics.com'
'2016/03/12','99022123’,’139',,'01363100', '8776250', '373671', 'whatsapp.com'
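Since the question asks for the day-1 (yesterday's) date rather than a fixed string, one way is to compute it in the shell and pass it in; a sketch assuming GNU date (on BSD/macOS, use date -v-1d +%Y/%m/%d instead):
$ d=$(date -d yesterday +%Y/%m/%d)   # e.g. 2016/03/12
$ awk -v d="$d" '{print "\047" d "\047,\047" substr($0,10)}' file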