Add a field to the current record before processing in awk

I want to use a large awk script that was designed to take a particular input, for example "city zipcode street housenumber", so $2 is zipcode, etc.
Now the input is provided to me in a new format; in my example, "city" is now missing. The new file is "zipcode street housenumber" (not for real, just trying to make the example simple),
but I happen to know that the city is a constant for that input (which is why it's not in the dataset). So if I run it through the original script, $2 is now street, and everything is one field off.
I could first process the input file to prepend the city name to each line (using awk, sed, or whatever) and then run it through the original script, but I would prefer to run only one script that supports both formats. I could add a command-line option that tells it the city, but I don't know how to insert it in front of the current record at the top of the script so that the rest of the script can be unchanged. It looks like I can change a field, but what I want is to "shift" the fields right so I can modify $1.
Did I mention I am a complete awk novice? (Perl is my poison.)

I think I fixed my own problem; I'm doing the following (I haven't figured out how to do this conditionally based on a command-line option yet, but it should be easy to find tutorials for that):
NF+=1;
for(i=NF; i>1; --i) $(i)=$(i-1);
$1="Vancouver";
I had the loop wrong in my comment above, but the basic idea of manipulating NF and copying fields into each other seems to work.
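For the "conditionally based on a command-line option" part, a possible sketch is to pass the city with -v and only shift the fields when it is non-empty; the print $2 at the end is just a stand-in for the body of the original script:
awk -v city="Vancouver" '
city != "" {                         # only shift when -v city=... was supplied
    for (i = NF + 1; i > 1; --i)     # walk right to left so nothing is overwritten
        $i = $(i - 1)                # assigning $(NF+1) grows the record by one field
    $1 = city                        # the constant city becomes the new first field
}
{ print $2 }                         # stand-in for the rest of the original script
' file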

Something along these lines should do it. First, some mock test data:
$ cat file
1 2 3 4
2 3 4
The awk:
$ awk -v c=V '{         # define external var
    if (NF==3)          # if record has only three fields
        $0 = c FS $0    # prepend the var to the record
    print $1            # print first field
}' file
Output:
1
V

Related

awk - store first occurrence based on cell

I have a file (around 10k entries) with the following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
The file contains duplicates: some entries are without coordinates (n/a), others have mildly contradicting LAT;LONG values.
I only want to store the first unique value of [$1;$2;$3;$4;$5] as output, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with the proper formatting of it... so any help is appreciated!
I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to roughly your example, which simply checks whether the index built from the first three fields has been seen yet and relies on the default print action to output the first record for each unique combination:
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
If you say yours works, then it is certainly shorter. Let me know if you have further questions.
I found it once I stopped searching for how to create arrays.
I had created a new $1 consisting of $1, $2 and $3, but the other solution is indeed more elegant. Here is the command I came up with after merging the fields in the file (and setting them as the new $1), which I then didn't actually have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv
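As a side note on the index: the comma in seen[$1,$2,$3] joins the fields with awk's SUBSEP character, so different field combinations cannot collapse into the same key the way plain concatenation can. A tiny made-up illustration:
$ printf 'ab;c;d\na;bc;d\n' | awk -F';' '!seen[$1 $2 $3]++'
ab;c;d
$ printf 'ab;c;d\na;bc;d\n' | awk -F';' '!seen[$1,$2,$3]++'
ab;c;d
a;bc;d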

Comparing column of two files

I want to compare the first column of two csv files. I found this answer and tried to adapt it minimally (I want the first column, not the second and I want a print out on any mismatch, regardless of whether the value was present in a control column).
I thought this would be the way to go:
BEGIN { FS = "," }
{
if(FNR==NR) {a[$1]=$1}
else {if (a[$1] != $1) {print}}
}
[Here I have already removed one syntax error thanks to a comment by RavinderSingh13.]
The first line was supposed to set the separator to comma.
The second line was supposed to fill the array exactly for as long as I am still reading the first file.
The third line was to compare the elements of the first column of the second file elementwise to said array. Then print the entire line with a mismatch.
However, if I apply this to the following tiny files, which differ in the first non-header entry:
output2.csv:
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
and output.csv:
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
I don't get any printout. I call it like this:
ludi#ludi-M17xR4:~/Jason$ gawk -f compare_col_print_diff.awk output.csv output2.csv
ludi#ludi-M17xR4:~/Jason$
For a line-by-line comparison, it's easier to match up the records first:
$ paste -d, file1 file2 | awk -F, '$1!=(f=$(NF/2+1)){print NR":",$1, f}'
will print values for which the first fields don't agree.
With your input files, this will give
2: 4238 4237
3: 4239 4238
The comment by Luuk made me realise a huge fundamental error in my original script, which I think should be recorded. The instruction
a[$1]=$1
does not produce an array entry per line, but an array entry per distinct ID. Hence, such an array is no basis for a general strict comparison of the files. To remedy this, I wrote the following, which works on the example but may still contain traps, as I am still learning:
BEGIN { FS = "," }
{
if(FNR==NR) {a[NR]=$1}
else {if (a[FNR] != $1) {print FNR, $0}}
}
Producing:
$ gawk -f compare_col_print_diff.awk output.csv output2.csv
2 4238,"CHN",2000,1
3 4239,"CHN",2000,1

Duplicate results obtained for awk command to replace field value

I have a file that contains records of animals and birds in our shop; a small snippet is given below:
Dogs: 4
Cats: 10
Parrots: 5
I want a one-liner with awk, if possible, to increase the number of Cats by 1 and decrease the number of Parrots by 3. The command should be a one-liner and not use pipes, because with pipes I have obtained the desired results, but I need to do this in a single awk command.
Output required is:
Dogs: 4
Cats: 11
Parrots: 2
I was using the commands below to do this:
Try 1:
awk -F ':' '{OFS=":"}; NR==2{$2=$2+1}1; NR==3{$2=$2-3}1' file
Try 2
awk -F ':' '{OFS=":"}; /Cats/ {$2=$2+1}1; /Parrots/ {$2=$2-3}1' file
Output of both commands is:
Dogs: 4
Dogs: 4
Cats: 11
Cats: 11
Parrots: 5
Parrots: 2
The results were duplicated. I would have used uniq to filter duplicates, but the problem is that the third animal, Parrots, shows the original as well as the changed value, while the second, Cats, shows the changed value twice.
Also, I wanted to know: just as sed has the ability to do multiple changes in one single command using the -e option or ;, like sed -e '1d' -e '$d' file or sed '1d;$d' file, does awk have such a capability? Because I tried the same thing but got duplicated results.
Changes to and clarification of my command are invited.
awk -F ':' 'BEGIN {OFS=":"} /Cats/ {$2=$2+1} /Parrots/ {$2=$2-3} 1' file
The above increases Cats by 1 and decreases Parrots by 3.
Your command prints twice because the pattern 1 with empty action appears twice in your command.
I've moved setting of OFS to a BEGIN block, so that it does not do that for every line. I've included the pattern 1 only once at the end so that the default print($0) action runs only once, after all modifications are done.
It is printing twice because you are asking it to print twice. The meaning of 1; is "this is a true condition which will match on every line; do the default action at this point (that is, print the current input record)". If you only want that to happen once, only do it once (probably at the very end of the script).
As you have actually already discovered, the processing model of Awk is quite similar to that of sed; the entire script is applied to each input line (or, in Awk, more generally, each record, where by default a record is a line, but you can easily change that) in turn, and whatever changes you make to the input record during the script's execution will be visible in subsequent lines of the script. However, as Awk is much more general than sed, many possible script actions do not directly affect the current input line; you could manipulate variables, call functions, and generally do anything a proper grown-up programming language can do (but be nice to sed even though it's so little; it does the one thing it does really well, albeit obscurely).
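As for the sed -e analogy: within a single awk program you just write several pattern { action } pairs, as the answer above shows. If you really want separate program fragments on the command line, here is a hedged, gawk-specific sketch (gawk's -e/--source fragments are concatenated into one program; POSIX awk offers the same with multiple -f progfile options):
gawk -F':' -e 'BEGIN { OFS = ":" }' -e '/Cats/ { $2 += 1 }' -e '/Parrots/ { $2 -= 3 }' -e '1' file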

Using awk on a folder and adding file name to output rows

I should start by thanking you all for all the work you put into the answers on this site. I have spent many hours reading through them but have not found anything fitting my question yet. Hence my own post.
I have a folder with multiple subfolders and txt-files within those. In column 7 of those files, there are gene names (I do genetics for a living :)). These are the strings I am trying to extract. In short, I would like to search the whole folder for any rows within any of the files that contain a particular gene name/string. I have been using grep for this, writing something like:
grep -r GENE . > GENE.txt
Simple, but I need to be able to tweak the search further and it seems that then awk is the way to go.
So I tried using awk. I wrote something like this:
awk '$7 == "GENENAME"' FOLDER/* > GENENAME.txt
This works well (and now I can specify that the string has to be in a particular column, which I cannot do with grep, right?).
However, in contrast to grep, which writes the file name at the start of each row, I now cannot directly see which file each row in my output file comes from (which mostly defeats the point of the search). Adding the name of the origin file somewhere to each row seems like something that should absolutely be doable, but I am not able to figure it out.
The files I am searching within change (or rather get more numerous), but otherwise my search will always be for some specific string in column 7 of the same big folder. How can I get this working?
Thank you in advance,
Elisabet E
You can use FNR (the record number within the current file) to print the row number and FILENAME to print the file's name; then you can see which file and which row each matching line comes from. For instance:
sample.csv:
aaa 123
bbb 456
aaa 789
command:
awk '$1 =="aaa"{print $0, FNR, FILENAME}' sample.csv
The output is:
aaa 123 1 sample.csv
aaa 789 3 sample.csv
Sounds like you're looking for:
awk '$7 == "GENENAME"{print FILENAME, $0}' FOLDER/*
If not then edit your question to clarify with sample input and expected output.
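Since the question mentions subfolders: FOLDER/* by itself does not descend into them. A hedged sketch that combines find with the same awk filter, assuming the data files end in .txt and passing the gene name in as a variable:
$ find FOLDER -type f -name '*.txt' -exec awk -v gene="GENENAME" '$7 == gene { print FILENAME, $0 }' {} + > GENENAME.txt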

Linux parsing space delimited log files

I need to parse apache-access log files which have 16 space-delimited columns, that is,
xyz abc ... ... home?querystring
I need to count the total number of hits for each page in that file, that is, the total number of home-page hits, ignoring the query string.
For a few lines the URL is column 16 and for others it's 14 or 15. Hence I need to parse each line from the end (get the last column, ignore the query string of that column, aggregate page hits).
I am new to Linux and shell scripting. How do I approach this; do I have to look into awk or shell scripting? Can you give a small code sample that would perform such a task?
ANSWER: perl one liner solved the problem
perl -lane | scalar array
Well, for starters, if you are only interested in working on columns 14-16, I would start by running:
cut -d\  -f14-16 <input_file.log> | awk '{ one = match($1,/www/)
    two = match($2,/www/)
    three = match($3,/www/)
    if (one)
        print $1
    else if (two)
        print $2
    else if (three)
        print $3
}'
Note: there are two spaces after the d\
You can then pretty easily count up the URLs that you see. I also think this would be solved a lot more easily using a few lines of Python or Perl.
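One possible way to finish that counting step in awk, still assuming (as above) that the URL is whichever of the three cut fields contains www, and stripping the query string before aggregating:
cut -d\  -f14-16 <input_file.log> | awk '{
    for (i = 1; i <= NF; i++)       # check each of the three candidate fields
        if ($i ~ /www/) {           # treat the field containing www as the URL
            sub(/\?.*/, "", $i)     # drop the query string
            hits[$i]++              # aggregate hits per page
        }
} END { for (u in hits) print u, hits[u] }'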
You can read input line by line using the bash read command:
while read my_variable; do
    echo "The text is: $my_variable"
done
To get input from a specific file, use the input redirect <:
while read my_variable; do
    echo "The text is: $my_variable"
done < my_logfile
Now, to get the last column, you can use the ${var##* } construction. For example, if the variable my_var is the string some_file_name, then ${my_var##*_} is the same string, but with everything before (and including) the last _ deleted.
We come up with:
while read line; do
    echo "The last column is: ${line##* }"
done < my_logfile
If you want to echo it to another file, use the >> redirect:
while read line; do
    echo "The last column is: ${line##* }" >> another_file
done < my_logfile
Now, to take away the querystring, you can use the same technique:
while read line; do
    last_column="${line##* }"
    url="${last_column%%\?*}"
    echo "The last column without querystring is: $url" >> another_file
done < my_logfile
This time, we have %%\?* instead of ##* because we want to delete what comes after the first ?, instead of what comes before the last space. (Note that I have escaped the character ?, which is special to bash.) You can read all about it here.
I didn't understand where to get the page hits, but I think the main idea is there.
EDIT: Now the code works. I had forgotten the do bash keyword. Also, we need to use >> instead of > so as not to overwrite another_file every time we do echo "..." > another_file. By using >>, we append to the file. I have also corrected the %% instead of ##.
It's hard to say without a few lines of concrete sample input and expected output, but it sounds like all you need is:
awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file
For example:
$ cat file
xyz abc ... ... http://www.google.com?querystring
xyz abc ... ... some other http://www.google.com?querystring1
xyz abc ... some stuff we ignore http://yahoo.com?querystring1
$
$ awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file
http://www.google.com 2
http://yahoo.com 1