The input file the date block change every each 4 lines (column 1). Example for days 061218 and 061418, but not in the case for date 061318, which contends 8 lines.
Then in the case where the date does not change after 5 lines,like the example on date 061318 in that case the values of the second part lines 5-8 need to be added to the END ond the lines 1-4. To get correctly in the output file desired.
Input file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654
061318,0,0
061318,114,60
061318,SD/05,F1/R0
061318,2666
061318,0
061318,1
061318,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
Output file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654,2666
061318,0,0,0
061318,114,60,1
061318,SD/05,F1/R0,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
I tried:
awk -F, '{a[$1]=a[$1]?a[$1]","$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Thanks in advance
If your Input_file is same as shown sample(which you mentioned in your comments it is) then could you please try following.
awk '
BEGIN{
FS=OFS=","
}
prev!=$1 && prev{
for(i=1;i<=count;i++){
print prev,a[prev,i]
}
prev=count=""
}
{
prev=$1
sub(/[^,]*,/,"")
if(count==4){
count=1
}
else{
count++
}
a[prev,count]=a[prev,count]?a[prev,count] OFS $0:$0
}
END{
if(prev){
for(i=1;i<=count;i++){
print prev,a[prev,i]
}
}
}' Input_file
Change above a[prev,count] line to a[prev,count]=(a[prev,count]?a[prev,count] OFS:"")$0 in Ed Morton sir's style too, to shorten and make it compatible to other awks too.
Related
I have a series of paired files, tab separated.
I want to compare line by line each pair and keep in file B only the fields that contain a match with the paired file A.
Example file A:
a b
d c
Example file B:
f>543 h<456 b>536 d>834 v<75345 a>12343
t>4562 c>623 f>3246 h>1345 d<52312
Desired output:
b>536 a>12343
c>623 d<52312
So far I have tried:
Convert files B in one-liner files:
cat file B | sed 's/\t/\n/g' > file B.mod
Grep one string in file A from file B, print the matching line and the next line, convert the output from 2 line back to single tab separated line:
cat file B.mod | grep -A1 (string) | awk '{printf "%s%s",$0,NR%2?"\t":"\n" ; }'
...but this failed since I realized that the matches can be in different order in A and B, as in the example above.
I'd appreciate some help as this goes far beyond my bash skills.
With your shown samples, please try following awk code.
awk '
FNR==NR{
for(i=1;i<=NF;i++){
arr[FNR,$i]
}
next
}
{
val=""
for(i=1;i<=NF;i++){
if((FNR,substr($i,1,1)) in arr){
val=(val?val OFS:"")$i
}
}
print val
}
' filea fileb
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk Program from here.
FNR==NR{ ##Checking condition FNR==NR which will be true when filea is being read.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
arr[FNR,$i] ##Creating array with index of FNR,current field value here.
}
next ##next will skip all further statements from here.
}
{
val="" ##Nullify val here.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
if((FNR,substr($i,1,1)) in arr){ ##checking condition if 1st letter of each field with FNR is present in arr then do following.
val=(val?val OFS:"")$i ##Creating val which has current $i value in it and keep adding values per line here.
}
}
print val ##Printing val here.
}
' filea fileb ##Mentioning Input_file names here.
Input/File
A:1111
B:21222
C:33rf33
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
J:efee
Basically need to select the line that contains 2122 (i.e line B/2)
& line which starts with 444dct4 (i.e Line D) till efe989ef (i.e line I/9)
To summarize
Select Line B (contains 2122)
Select Line D (444dct4) till Line I
Desired Output
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Could you please try following, written and tested with shown samples in GNU awk. This one also takes care in case line's 2nd column 21222 in between range of 444dct4 to efe989ef then it will NOT re-print it.
awk -F':' '
$2=="21222" && !found{
print
next
}
$2=="444dct4"{
found=1
}
found
$2=="efe989ef"{
found=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' ' ##Starting awk program from here and setting field separator as colon here.
$2=="21222" && !found{ ##Checking if 2nd field is 21222 and found is NOT set then try following.
print ##Printing the current line here.
next ##next will skip all further statements from here.
}
$2=="444dct4"{ ##Checking condition if 2nd field is 444dct4 then do following.
found=1 ##Setting found to 1 here.
}
found ##Checking condition if found is SET then print that line.
$2=="efe989ef"{ ##Checking condition if 2nd field is efe989ef then do following.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
$ awk -F: '
/2122/ { # line that contains 2122
print
next # to avoid duplicate printing if 2122 also in D-I
}
$2~/^444dct4/,$2~/efe989ef/ # starts with 444dct4 till efe989ef
' file
Output:
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Edit:
One-liner:
$ awk -F: '/2122/{print; next} $2~/^444dct4/,$2~/efe989ef/' file.txt
awk -v str1="2122" -v str2="444dct4" -v str3="efe989ef" 'BEGIN { flag=0 } $0 ~ str1 { print } $0 ~ str2 { flag=1 } $0 ~ str3 { flag=0;print;next } flag' file
For flexibility, set the line to find as str1, the from as str2 and the to as str3. Set a print flag (flag) to begin with. When 2122 is in the second field print. Then when the second field begins with 44dct4 set the print flag to one. When the second field starts with efe989ef, set the print flag to 0, print the line and skip to the next record. The variable flag will then determine what does and doesn't get printed.
I have a little problem with my awk command.
The objective is to add a new column to my CSV :
The header must be "customer_id"
The next rows must be a customer_id from an array
Here is my csv :
email|event_date|id|type|cha|external_id|name|date
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
I would like to have this output :
email|event_date|id|type|cha|external_id|name|date|customer_id
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
But when I'm doing the awk I have this result :
awk -v a="$(echo "${customerIdList[#]}")" 'BEGIN{FS=OFS="|"} FNR==1{$(NF+1)="customer_id"} FNR>1{split(a,b," ")} {print $0,b[NR-1]}' test.csv
email|event_date|id|type|cha|external_id|name|date|customer_id|
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
Where customerIdList = (20200 20201)
There is a pipe just after the "customer_id" header and I don't know why :(
Can someone help me ?
Could you please try following, written and tested with shown samples.
awk -v var="${customerIdList[*]}" '
BEGIN{
num=split(var,arr," ")
}
FNR==1{
print $0"|customer_id"
next
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"")
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v var="${customerIdList[*]}" ' ##Starting awk program from here, creating var variable and passing array values to it.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var,arr," ") ##Splitting var into arr with space delimiter.
}
FNR==1{ ##Checking condition if this is first line.
print $0"|customer_id" ##Then printing current line with string here.
next ##next will skip all further statements from here.
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"") ##Checking condition if value of arr with current line number -1 is NOT NULL then add its value to current line with pipe else do nothing.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
awk -v IdList="${customerIdList[*]}" 'BEGIN { split(IdList,ListId," ") } NR > 1 { $0=$0"|"ListId[NR-1]}1' file
An array will need to be created within awk and so pass the array as a space separated string and then use awk's split function to create the array IdList. The ignoring the headers (NR>1), set the line equal to the line plus the index of ListId array NR-1.
sample.txt does have "tab-separated column", and there's semi-colon seperated that needed to be splitted accordingly from sequence of number into repeated value.
cat sample.txt
2 2627 588;577
2 2629 566
2 2685 568-564
2 2771 573
2 2773 597
2 2779 533
2 2799 558
2 6919 726;740-742;777
2 7295 761;771-772
Please be noted that, some of line may have inverted sequence 568-564
By using previous script, I manage to split it, but failed to extract from sequence (splitted by dash)
#!/bin/sh
awk -F"\t" '{print $1}' $1 >> $2 &&
awk -F"\t" '{print $2}' $1 >> $2 &&
awk -F"\t" '{print $3}' $1 >> $2 &&
sed -i "s/^M//;s/;\r//g" $2
#!/bin/awk -f
BEGIN { FS=";"; recNr=1}
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
for (i=1; i<=NF; i++) {
print a[lineNr] "," b[lineNr] "," $i
}
}
Expected
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Could you please try following(will add explanation in few mins).
awk '
BEGIN{
OFS=","
}
{
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(array[i]~/-/){
split(array[i],array2,"-")
to=array2[1]>array2[2]?array2[1]:array2[2]
from=array2[1]<array2[2]?array2[1]:array2[2]
while(from<=to){
print $1,$2,from++
}
}
else{
print $1,$2,array[i]
}
from=to=""
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
OFS="," ##Setting OFS as comma here.
}
{
num=split($NF,array,";") ##Splitting last field of line into an array named array with delimiter semi-colon here.
for(i=1;i<=num;i++){ ##Starting a for loop from 1 to till value of num which is actually length of array created in previous step.
if(array[i]~/-/){ ##Checking condition if array value with index i is having dash then do followong.
split(array[i],array2,"-") ##Split value of array with index i to array2 here with delimiter -(dash) here.
to=array2[1]>array2[2]?array2[1]:array2[2] ##Creating to variable which will compare 2 elements of array2 and have maximum value out of them here.
from=array2[1]<array2[2]?array2[1]:array2[2] ##Creating from variable which will compare 2 elements of array2 and will have minimum out of them.
while(from<=to){ ##Running while loop from variable from to till value of variable to here.
print $1,$2,from++ ##Printing 1st, 2nd fields with value of from variable and increasing from value with 1 each time it comes here.
}
}
else{ ##Mention else part of if condition here.
print $1,$2,array[i] ##Printing only 1st, 2nd fields along with value of array with index i here.
}
from=to="" ##Nullifying variables from and to here.
}
}
' Input_file ##Mentioning Input_file name here.
Adding link for conditional statements ? and : explanation as per James sir's comments:
https://www.gnu.org/software/gawk/manual/html_node/Conditional-Exp.html
For shown sample output will be as follows.
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
$ awk '
BEGIN {
FS="( +|;)" # input field separator is space or ;
OFS="," # output fs is comma
}
{
for(i=3;i<=NF;i++) { # from the 3rd field to the end
n=split($i,t,"-") # split on - if any. below loop from smaller to greater
if(n) # in case of empty fields
for(j=(t[1]<t[n]?t[1]:t[n]); j<=(t[1]<t[n]?t[n]:t[1]);j++)
print $1,$2,j # output
}
}' file
Output
2,2627,588
2,2627,577
2,2629,566
2,2685,564 <─┐
2,2685,565 │
2,2685,566 ├─ wrong order, from smaller to greater
2,2685,567 │
2,2685,568 <─┘
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
i'm new to this site and trying to learn awk. i'm trying to find the maximum value of field5, grouping by years, and also months..
for every month (of a year), printing just the line with the maximum of probability
input file: (comma separated)
year,month,lat,lng,probability
0,0,40,331,1.00000
0,2,38,334,0.01111
0,2,38,334,0.05511
0,4,38,335,0.06667
0,8,38,336,0.16667
1,2,39,334,0.12222
1,2,39,335,0.04444
1,4,39,336,0.02222
1,4,40,333,0.14444
1,4,40,334,0.12222
2,6,40,335,0.06667
2,6,40,336,0.14444
output file desired
months,lat,lng
2,38,334
4,38,335
8,38,336
14,40,333
16,40,336
thank you everyone for the help
There are inconsistencies in your example. If by 'group' you mean a group defined by $1,$2 needs to have more than one entry, that explains why 0,40,331 is not included. But then why is 4,38,335 included?
Anyway, you ask for a start and here it is:
$ awk 'BEGIN{FS=OFS=","}
NR==1{print $2,$3,$4; next}
NR==FNR && FNR>1 {
if ($5>max[$1 OFS $2]) max[$1 OFS $2]=$5
next
}
max[$1 OFS $2]==$5 { print $1*12+$2,$3,$4}
' file file
Prints:
month,lat,lng
0,40,331
2,38,334
4,38,335
8,38,336
14,39,334
16,40,333
30,40,336
Notice that the script traverses the file twice (by using file twice on the command line). The first time is to find the max of the group defined by $1,$2 and the second time to print that line.
If you only want groups included, count them:
$ awk 'BEGIN{FS=OFS=","}
NR==1{print $2,$3,$4; next}
NR==FNR && FNR>1 {
cnt[$1 OFS $2]++
if ($5>max[$1 OFS $2]) max[$1 OFS $2]=$5
next
}
max[$1 OFS $2]==$5 && cnt[$1 OFS $2]>1 { print $1*12+$2,$3,$4}
' file file
month,lat,lng
2,38,334
14,39,334
16,40,333
30,40,336
I acknowledge that is different than your example, but I think your example needs more explanation.
thank you all, and thanks #dawg for the help
i want to give a feedback of my final code:
#!/bin/bash
awk 'BEGIN{FS=OFS=","}
NR==1{print "months",$3,$4; next}
NR==FNR && FNR>1 {
if ($5>max[$1,$2])
max[$1,$2]=$5
next
}
{if (max[$1,$2] == $5)
print $1*12+$2,$3,$4;}' example.csv example.csv `