awk separate field based on length

I have a file containing rows similar to:
HEJ;DU;NORDEN;13322;90
ER;HER;NOGEN;334333;1
I want to output a file where $4 (which can be 5 or 6 digits) is split into two separate fields, depending on the length:
if 5, the split should be 3-2; if 6, the split should be 3-3.
So the output should be
HEJ;DU;NORDEN;133;22;90
ER;HER;NOGEN;334;333;1
Does anyone have a good suggestion on how to make that separation?
I have been toying around with awk and gsub, and it works if I do it just for the field, but then the hassle is getting it back aligned with the other fields, and I haven't figured out how to embed the gsub call in an expression so that it only touches one column of data.

You can use the substr function.
first = substr($4,1,3)
second = substr($4,4)
$4 = first ";" second
You don't need a conditional, since the first part is always 3 digits long.
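A complete command along those lines might look like this (a minimal sketch; Input_file stands in for your data file):
awk 'BEGIN{FS=OFS=";"} {$4 = substr($4,1,3) ";" substr($4,4)} 1' Input_file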

EDIT: A simpler approach.
awk -F";" '{sub(/^.../,"&" OFS,$4)} 1' OFS=";" Input_file
This doesn't check conditions like whether the column's length is 5 or 6; in case you want that, it can be added as in the code below.
Could you please try the following and let me know if it helps you.
awk -F";" -v s1=";" '
{
$4=(length($4)==5 || length($4)==6)?substr($4,1,3) s1 substr($4,4):$4
}
1' OFS=";" Input_file
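Run on the sample input, this prints:
HEJ;DU;NORDEN;133;22;90
ER;HER;NOGEN;334;333;1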

Related

awk: counting fields in a variable

Given a string like {running_db_nodes,[ejabberd#host002,ejabberd#host001]}, , how could the number of comma-delimited strings in square brackets be counted?
The useful substring can be extracted with gensub:
awk '/running_db_nodes/ {print gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1)}'
A naive approach with NF gets fields from the original input string:
awk -F, '/running_db_nodes/ {nodes=gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1); print NF}'
How could the number of fields in a variable like nodes in the last example be extracted?
You can set your FS to characters [ and ], then split your $2 to an array and capture the count of elements returned from split():
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk -F"[][]" '{print split($2,a,",")}'
2
Based on your shown samples and attempts only, please try the following awk code.
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk '
{
gsub(/.*\[|\].*$/,"")
print gsub(/,/,"&")+1
}
'
Explanation:
gsub(/.*\[|\].*$/,""): globally substitutes everything from the start of the line through [, and from ] through the end of the line, with the empty string, leaving only the bracketed list in the current line.
print gsub(/,/,"&")+1: globally substitutes , with itself (just to count the occurrences), adds 1 to that count, and prints the result, as the requirement asks.
A naive approach with NF gets fields from the original input string
gensub does not change the string it operates on; you might use sub (or gsub) instead, which does alter the string it works on and thereby updates the relevant built-in variables. That is,
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}" | awk 'BEGIN{FS=","}{sub(/^.*\[/,"");sub(/].*$/,"");print NF}'
gives output
2
Explanation: use sub to delete everything up to and including [, then delete ] and everything after it; then print the number of fields.
(tested in GNU Awk 5.0.1)

Replace case of 2nd column of dataset within awk?

I'm trying the command
awk 'BEGIN{FS=","}NR>1{tolower(substr($2,2))} {print $0}' emp.txt
on the data below, but it is not working:
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,AJIT,D003,8-Mar-07,8-Sep-07,70000
M004,SHARVARI,D004,28-Mar-07,28-Mar-08,120000
M005,ADITYA,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
Expected output
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
With your shown samples, in GNU awk please try the following code. It uses GNU awk's match function with the regex (^[^,]*,.)([^,]*)(.*), which creates 3 capturing groups and stores their values in an array named arr (whose indexes are 1, 2, 3, and so on, depending on the number of capturing groups). When the match succeeds, the array elements are printed, applying the tolower function to the 2nd element of arr to lower-case the name and get the expected output.
awk '
FNR==1{
print
next
}
match($0,/(^[^,]*,.)([^,]*)(.*)/,arr){
print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
You need to assign the result of tolower() to something; it doesn't operate in place. And in this case, you need to concatenate it with the first character of the field and assign that back to the field.
$2 = substr($2, 1, 1) tolower(substr($2, 2));
To get comma separators in the output file, you need to set OFS. So you need:
BEGIN {OFS=FS=","}
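Putting those pieces together, the whole fix might look like this (a sketch assembled from the fragments above, using the question's emp.txt):
awk 'BEGIN {OFS=FS=","} NR>1 {$2 = substr($2, 1, 1) tolower(substr($2, 2))} {print}' emp.txt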
mawk, gawk, or nawk:
awk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || $_ = substr( toupper($(+_)=\
tolower($_)), --_,_) substr($++_,_)'
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999

Using awk pattern to filter file data

I have the following file (named /tmp/test99) which contains the rows:
"0","15","wall15"
123132,09808098,"0","15"
I am trying to filter the rows that contain "0" in the 3rd field and "15" in the 4th field (like in the second row).
I tried running:
cat /tmp/test99 | awk '/"0","15"/{print>"/tmp/0_15_file.out"} '
but instead of getting only the second row, I also get the first row, which starts with "0","15".
Could you please help with the pattern ?
Thanks:)
You may check if Fields 3 and 4 are equal to some hardcoded value using
awk -F, '$3=="\"0\"" && $4=="\"15\""'
Set the field separator to a comma and then, if Field 3 is "0" and Field 4 is "15" print the line, else discard.
See the online demo:
s='"0","15","wall15"
123132,09808098,"0","15"'
awk -F, '$3=="\"0\"" && $4=="\"15\""' <<< "$s"
# => 123132,09808098,"0","15"
Could you please try the following. (A comment on your effort: you need not use cat with awk; it can read Input_file by itself.)
awk -F, '$3~/^"0"$/ && $4~/^"15"$/' Input_file
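Run against /tmp/test99, this prints only the second row:
123132,09808098,"0","15"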

Comparing corresponding values of two lines in a file using awk [duplicate]

This question already has answers here:
Finding max value of a specific date awk
(3 answers)
Closed 6 years ago.
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
I have the line positions of two lines in a file, say line1 and line2. These lines may be anywhere in the file, but I can access them using a search keyword based on the name (the first word) in each line.
20160801 means yyyymmdd and has an associated value separated by |
I need to compare the values associated with each of the date for the given two lines.
I am a newbie in awk. I do not understand how to compare these two lines at the same time.
Your question is not at all clear. Perhaps the first step is to clearly articulate: 1) what problem am I trying to solve? 2) what tools or data do I have to solve it?
The only hints specific to your question I can offer (since your problem statement is not clearly articulated) are these:
In awk, you can compare two different files by using the test FNR==NR, which is true only while reading the first file.
You can find the key words by using a regular expression of the form /^name1/, which matches lines that start with that pattern.
You can split on a delimiter in awk by setting the field separator to that delimiter; in this case it sounds like that is |, though you are also comparing whitespace-delimited fields inside of those fields.
You can compare by saving the data from the first line and comparing it with the data from the second line in the other file, once you can articulate what 'compare' means to you.
Wrapping that up, given:
$ cat /tmp/f1.txt
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
$ cat /tmp/f2.txt
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
You can find the lines in question like so:
$ awk -F"|" '/^name/ && FNR==NR {print $1}' f1.txt f2.txt
name1 20160801
$ awk -F"|" '/^name/ && FNR<NR {print $1}' f1.txt f2.txt
name2 20160801
(I have only printed the first field for clarity)
Then use that to compare. Save the first in an associative array and then compare the second when found.
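For instance, if 'compare' means printing each date alongside its two values, a rough sketch (assuming both lines sit in the same file, name1 before name2, with whitespace-separated date|value pairs as in the samples) could be:
awk '
/^name1/ { for (i = 2; i <= NF; i++) { split($i, kv, "|"); a[kv[1]] = kv[2] } }
/^name2/ { for (i = 2; i <= NF; i++) { split($i, kv, "|"); print kv[1], a[kv[1]], kv[2] } }
' Input_file
Each output line is then a date followed by the name1 value and the name2 value for that date.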

Awk: printing undetermined number of columns

I have a file that contains a number of fields separated by tabs. I am trying to print all columns except the first one, but I want to print them all in a single column with awk. The format of the file is
col 1 col 2 ... col n
There are at least 2 columns in one row.
Sample
2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844
So I want to tell awk to print columns 2 through n (for whatever n each row has, with n at least 2), without printing blank lines when there is no info in a given column of that row, all in one column like the following.
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
This is the first time I am using awk, so bear with me. I wrote this from command line which works:
awk '{i=2;
while ($i ~ /[0-9]+/)
{
printf "%s\n", $i
i++
}
}' bth.data
This is more a request for approval than a question: is this the right way of doing something like this in awk, or is there a better/shorter way of doing it?
Note that the actual input file could be millions of lines.
Thanks
Is this what you want as output?
awk '{for(i=2; i<=NF; i++) print $i}' bth.data
gives
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
NF is one of several pre-defined awk variables. It indicates the number of fields on a given input line. For instance, it is useful if you want to always print out the last field in a line: print $NF. It is also useful, of course, if you want to iterate through all or part of the fields on a given line.
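For example, a quick illustration on the same sample file:
awk '{print $NF}' bth.data
prints only the last field of each input line.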
Seems like awk is the wrong tool. I would do:
cut -f 2- < bth.data | tr -s '\t' '\n'
Note that -s squeezes runs of tabs into a single newline, which avoids printing blank lines, as required in the original problem.