Using grep / sed to replace values from a list, with semicolons and newlines as separators - awk

I am fairly new to grep, sed, and awk. I have used them in the past to extract lines and/or replace things from exact lists.
In this case I am confused about how to go about it. I have two CSV files.
My first CSV file contains names that are separated by spaces and semicolons:
Name,
Frank ,
Frank; John; Rob; ,
John; Nick; ,
The second CSV file contains locations and names:
Location, Name,
France, Frank,
John, New Jersey,
Nick, Germany,
Rob, Japan,
I would like the output to add the location as a column next to the name.
Name, Location,
Frank , France,
Frank; John; Rob; , France; New Jersey; Japan,
John; Nick; , New Jersey; Germany,
How can I search through the 2nd CSV file line by line, treat each name as unique, and extract its respective location? Then output it so each line keeps its information, with semicolons as separators.
What I have done so far is:
cat file1.csv | cut -f1 | tr ';' '\t' > file-test.tsv
Thank you.

Your files are formatted somewhat strangely: comma delimited overall, with individual fields delimited by semicolons, but sometimes with a trailing semicolon and sometimes not.
Also, at the time this answer is written, your second file still has "Location, Name" order for the first data row and "Name, Location" for all the rest. I'm assuming that the actual file is "Location, Name" on every row.
Here's how I'm approaching it:
Make one pass through the 2nd file and create a mapping from name to location
Make one pass through the 1st file and apply the mapping
Here is my solution, using just awk:
# use delimiter of zero or more spaces on either side of a comma
awk -F ' *, *' '
# First line of first file processed; set flag variable
FNR == 1 && NR == 1 {mapfile = 1;}
# Lines 2+ in the map file: save the mapping
mapfile && FNR > 1 {map[$2] = $1;}
# First line of second file; print header and reset flag
FNR == 1 && NR > 1 {print "Name, Location,"; mapfile = 0;}
# Process lines 2+ in the name file (i.e. not the map file)
!mapfile && FNR > 1 {
    data = $0;
    sub(/ *, *$/,"",data); # remove trailing comma
    sub(/ *; *$/,"",data); # remove trailing semicolon
    # create "names" array of length "num"
    num = split(data,names,/ *; */);
    locs = ""; # init location string to empty
    for (i = 1; i <= num; i++)
    {
        locs = locs map[names[i]] "; ";
    }
    sub(/; $/,",",locs); # change last semicolon to comma
    # print original line from name file, and append locations
    print $0 " " locs;
}' file2.csv file1.csv
Some more explanation:
NR = "Number of Row" being processed. This starts at 1 and increments forever, regardless of how many files are processed by awk
FNR = "File Number of Row". This starts over at 1 with every file being processed
So when both are 1, the first line of the map file is being processed.
When FNR is 1 but NR is greater than 1, the 2nd file is being processed.
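If it helps, here's a quick throwaway demo (not part of the solution; f1 and f2 are made-up files):
$ printf 'a\nb\n' > f1; printf 'c\nd\n' > f2
$ awk '{print FILENAME, NR, FNR}' f1 f2
f1 1 1
f1 2 2
f2 3 1
f2 4 2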
Also,
awk can use regular expressions as delimiters, so I've told it to use a comma with zero or more spaces on either side as the delimiter ( *, *).
$0 = entire line
$1, $2, etc are the individual fields of each line when split using the specified delimiter.
The rest of the logic should be self-evident from the code and comments within the script.
When processing your files in this order
file2.csv = your second file, but with "location, name" order on all rows
file1.csv = your first file
the output is:
Name, Location,
Frank , France,
Frank; John; Rob; , France; New Jersey; Japan,
John; Nick; , New Jersey; Germany,

Assuming the lines of your 2nd file are actually always in location, name order (instead of sometimes one, sometimes the other as in the example in your question), here's how to output the data you want:
$ cat tst.awk
BEGIN { FS=" *, *"; OFS=" , " }
NR == FNR {
    name2loc[$2] = $1
    next
}
{
    for (i=1; i<=NF; i++) {
        n = split($i,names,/ *; */)
        for (j=1; j<=n; j++) {
            locs = (j>1 ? locs "; " : "") name2loc[names[j]]
        }
    }
    print $1, locs
}
$ awk -f tst.awk file2 file1
Name , Location
Frank , France
Frank; John; Rob; , France; New Jersey; Japan;
John; Nick; , New Jersey; Germany;
Massage the output format to suit whatever you really want your output to look like.
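For instance, if you want the trailing semicolon on those lines turned back into a comma as in your sample output, one option (a rough sketch, not a polished solution) is to post-process with sed:
$ awk -f tst.awk file2 file1 | sed 's/; *$/,/'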

Related

Loop through files in a directory and select rows based on column value using awk for large files

I have 15 text files (each about 1.5 - 2 GB) in a folder, each with about 300,000 to 500,000 rows, about 250 columns, and a header row with column names. I also have a list of five values ("a123", "b234", "c345", "d456", and "e567"). (These are arbitrary values; they are not in any order and have no relation to each other.)
For each of the five values, I would like to query each of the 15 text files and select the rows where "COL_ABC" or "COL_DEF" equals the value. ("COL_ABC" and "COL_DEF" are arbitrary names; the column names have no relation to each other.) I do not know which column number "COL_ABC" or "COL_DEF" is; they differ between files, because each file has a different number of columns, but the columns are named "COL_ABC"/"COL_DEF" in each of the files. Additionally, some of the files have both "COL_ABC" and "COL_DEF" but others have only "COL_ABC". If only "COL_ABC" exists, I would like to query "COL_ABC", but if both exist, I would like to query both columns (i.e. check whether "a123" is present in either "COL_ABC" or "COL_DEF" and select the row if true).
I'm very new to awk, so forgive me if this is a simple question. I am only able to do simple filtering such as:
awk -F "\t" '{ if(($1 == "1") && ($2 == "2")) { print } }' file1.txt
For each of the fifteen files, I would like to print the results to a new file.
Typically I could do this in R but my files are too big to be read into R. Thank you!
Assuming:
The input filenames have the form "*.txt".
The columns are separated by a tab character.
Each of the five values is compared with the target column (COL_ABC or COL_DEF) one by one, and individual result files are created per value, so 15 x 5 = 75 files will be created. (If this is not what you want, please let me know.)
Then would you please try:
awk -F"\t" '
BEGIN {
values["a123"] # assign values
values["b234"]
values["c345"]
values["d456"]
values["e567"]
}
FNR==1 { # header line
for (i in values) { # loop over values
if (outfile[i] != "") close(outfile[i]) # close previous file
outfile[i] = "result_" i "_" FILENAME # filename to create
print > outfile[i] # print the header
}
abc = def = 0 # reset the indexes
for (i = 1; i <= NF; i++) { # loop over the column names
if ($i == "COL_ABC") abc = i # "COL_ABC" is found: assign abc to the index
else if ($i == "COL_DEF") def = i # "COL_DEF" is found: assign def to the index
}
next
}
{
for (i in values) {
if (abc > 0 && $abc == i || def > 0 && $def == i)
print > outfile[i] # abc_th column or def_th column matches i
}
}
' *.txt
If your 15 text files are located in the directory, e.g. /path/to/the/dir/ and you want to specify the directory as an argument, change the *.txt in the last line to /path/to/the/dir/*.txt.
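One caveat if you do pass a directory path: FILENAME will then include the directory part, which would end up embedded in the result_... output names. A small tweak (an untested sketch) strips it first:
fname = FILENAME
sub(/.*\//, "", fname)                  # drop everything up to the last slash
outfile[i] = "result_" i "_" fname      # filename to create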
for f in file*.txt; do
    awk -F'\t' '
        BEGIN {
            n1 = "COL_DEF"
            n2 = "COL_ABC"
            val["a123"]
            val["b234"]
            val["c345"]
            val["d456"]
            val["e567"]
        }
        NR==1 {
            for (i=1; i<=NF; i++)    # scan every header field
                col[$i] = i
            c = col[n1]
            if (!c) c = col[n2]
            next
        }
        $c in val { print }
    ' "$f" > "$f.new"
done
we don't really need to set n1 and n2 (we could use the string values directly) but it keeps all the definitions in one place
awk doesn't have a very nice way to declare all elements of an entire array at once, so we set the val elements individually (alternatively, for simple values, we could use split; see the sketch after this list)
on the first line of the file (NR==1), we store the header names, then immediately look up the ones we care about and store the index in c: we take col[n1] if it is defined (non-zero), otherwise col[n2], as the column index to be searched
next skips the remaining awk actions for this line
then for every remaining line we check whether the value in the relevant column is one of the values in val and, if so, print that line.
The awk script is enclosed in a bash for loop and we write output to a new file based on the loop variable. (This could all be done in awk itself, but this way is easy enough.)
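As mentioned in the second bullet, val could also be populated with split; a minimal equivalent sketch:
BEGIN {
    n = split("a123 b234 c345 d456 e567", tmp)   # split on whitespace
    for (i = 1; i <= n; i++)
        val[tmp[i]]                              # create the keys
}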

AWK script - not showing data

I'm trying to create a variable to sum columns 26 to 30 and 32.
So far I have this code, which prints the header and the output format like I want, but no data is being shown.
#! /usr/bin/awk -f
BEGIN { FS="," }
NR>1 {
TotalPositiveStats= ($26+$27+$28+$29+$30+$32)
}
{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n",
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
}
NR==1 {
print "EndYear,Rk,G,Date,Years,Days,Age,Tm,HOme,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats" }#header
Input data:
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
Output expected:
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,35
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4,34
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,54
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7,38
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2,29
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,36
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3,51
This script will be called like gawk -f script.awk <filename>.
Currently, when calling this, the output seems to be calculating the variable, but the rest of the fields are empty.
awk is well suited to summing columns:
awk 'NR>1{$(NF+1)=$26+$27+$28+$29+$30+$32}1' FS=, OFS=, input-file > tmp
mv tmp input-file
That doesn't add a field in the header line, so you might want something like:
awk '{$(NF+1) = NR>1 ? ($26+$27+$28+$29+$30+$32) : "TotalPositiveStats"}1' FS=, OFS=,
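If you happen to have GNU awk 4.1 or later, you could also skip the temp-file dance with its inplace extension (assuming gawk is available):
gawk -i inplace '{$(NF+1) = NR>1 ? ($26+$27+$28+$29+$30+$32) : "TotalPositiveStats"}1' FS=, OFS=, input-file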
An explanation of the issues with the current printf output is covered in the 2nd half of this answer (below).
It appears OP's objective is to reformat three of the current fields while also adding a new field on the end of each line. (NOTE: certain aspects of OP's code are not reflected in the expected output, so I'm not 100% sure what OP is looking to generate; regardless, OP should be able to tweak the provided code to generate the desired result.)
Using sprintf() to reformat the three fields we can rewrite OP's current code as:
awk '
BEGIN { FS=OFS="," }
NR==1 { print $0, "TotalPositiveStats"; next }
{
    TotalPositiveStats = ($26+$27+$28+$29+$30+$32)
    $17 = sprintf("%.3f",$17)                    # FG_PCT
    if ($20 != "") $20 = sprintf("%.3f",$20)     # 3P_PCT
    $23 = sprintf("%.3f",$23)                    # FT_PCT
    print $0, TotalPositiveStats
}
' raw.dat
NOTE: while OP's printf shows a format of "%.2f %" for the 3 fields of interest ($17, $20, $23), the expected output shows that the fields are not actually being reformatted (eg, $17 remains %.3f, $20 is an empty string, $23 remains %.2f); I've opted to leave $20 alone when it's blank and otherwise reformat all 3 fields as %.3f; OP can modify the sprintf() calls as needed
This generates:
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5,40
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1.000,3,2,5,5,2,1,3,4,21,19.4,37
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9,57
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1.000,2,2,4,5,3,1,6,5,25,14.7,44
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.750,3,2,5,5,1,1,2,4,17,13.2,31
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9,41
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.750,4,4,8,5,3,2,5,2,33,29.3,56
NOTE: in OP's expected output it appears the last/new field (TotalPositiveStats) does not contain the value from $30, hence the mismatch between the expected results and this answer; again, OP can modify the assignment statement for TotalPositiveStats to include/exclude fields as needed
Regarding the issues with the current printf ...
{printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%.2f %,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, %s\n",
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,TotalPositiveStats}
... is referencing (awk) variables that have not been defined (eg, EndYear, Rk, G). [NOTE: one exception is the very last variable in the list - TotalPositiveStats - which has in fact been defined earlier in the script.]
The default value for undefined variables is the empty string ("") or zero (0), depending on how the awk code is referencing the variable, eg:
printf "%s", EndYear => EndYear is treated as a string and the printed result is an empty string; with an output field delimiter of a comma (,) this empty strings shows up as 2 commas next to each other (,,)
printf "%.2f %", FG_PCT => FG_PCT is treated as a numeric (because of the %f format) and the printed result is 0.00 %
Where it gets a little interesting is when the (undefined) variable name starts with a numeric (eg, 3P): awk parses this as the number 3 concatenated with the undefined variable P, so in effect the P is ignored and the whole reference is treated as a number, eg:
printf "%s", 3P => 3P is processed as 3 and the printed result is 3
This should explain the 5 static values (0.00 %, 3, 3, 3.00 % and 0.00 %) printed in all output lines as well as the 'missing' values between the rest of the commas (eg, ,,,,).
Obviously the last value in the line is an actual number, ie, the value of the awk variable TotalPositiveStats.
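A quick throwaway test demonstrates all three behaviors at once:
$ awk 'BEGIN{ printf "[%s] [%.2f] [%s]\n", EndYear, FG_PCT, 3P }'
[] [0.00] [3]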

How to print user-specified fields

I am writing an AWK script where the user inputs the fields and the script counts the number of times each word appears in those fields. I have the code set up so that it already prints out all of the fields and the number of times each word occurs, but I am trying to have only the user-specified fields counted. The user will be inputting CSV files, so I am setting the FS to a comma.
Knowing that AWK assumes all of the arguments passed to it are files, I copy the arguments into an array and then delete them from the ARGV array so it will not throw an error.
#!/usr/bin/awk -f
BEGIN{ FS = ",";
for(i = 1; i < ARGC-1; i++){
arg[i] = ARGV[i];
delete ARGV[i];
}
}
{
for(i=1; i <=NF; i++)
words[($i)]++
}
END{
for( i in words)
print i, words[i];
}
So if the user inputs a CSV file such as...
A,B,D,D
Z,C,F,G
Z,A,C,D
Z,Z,C,Q
and the user wants only field 3 counted, the output should be...
C 2
D 1
F 1
Or if the user specifies fields 1 and 2, the output should be...
A 2
B 1
C 1
Z 4
Could you please try the following (I have written this on mobile so couldn't test it):
awk -v fields="1,3" '
BEGIN{
    FS=OFS=","
    num=split(fields,array,",")
    for(j=1;j<=num;j++){
        a[array[j]]
    }
}
{
    for(i=1;i<=NF;i++){
        if(i in a){
            count[$i]++
        }
    }
}
END{
    for(h in count){
        print h,count[h]
    }
}
' Input_file
I believe this should work for parsing multiple Input_files too. If needed you could try passing multiple files to it.
Explanation: Following is only for explanation purposes.
-v fields="1,3" creating a variable named fields whose value is user defined, it should be comma separated, for an example I have taken 1 and 3, you could keep it as per Your need too.
BEGIN{......} starting BEGIN section here where mentioning field separator and output field separator as Comma for all lines of Input_file(s). Then using split I am splitting variable fields to an array named array whose delimiter is comma. Variable num is having length of fields variable in it. Starring a for loop from 1 to till value of num. In it creating an array named a whose index is value of array whose index is variable j value.
MAIN Section: now starting a for loop which traverse through all of the fields of lines. Then it checks if any field number is coming into array named a which we created in BEGIN section, if yes then it is creating an array named count with index of current column + taking its count too. Which we need as per OP's requirement.
Finally in this program's END section traversing through array count and printing it's indexes with their counts.
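For example, if you save the script body as count.awk (a made-up name) and run it on the sample input with fields 1 and 2, you should get (line order may vary, since for(h in count) visits indexes in no particular order):
$ awk -v fields="1,2" -f count.awk input.csv
A 2
B 1
C 1
Z 4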
Another:
$ awk -F, -v p="1,2" '{ # parameters in comma-separated var
split(p,f) # split parameters to fields var
for(i in f) # for the given fields
c[$f[i]]++ # count chars in them
}
END { # in the end
for(i in c)
print i,c[i] # output chars and counts
}' file
Output for fields 1 and 2:
A 2
B 1
C 1
Z 4
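The same one-liner with a single field, e.g. p="3", counts only the third column (again, output order may vary):
$ awk -F, -v p="3" '{split(p,f); for(i in f) c[$f[i]]++} END{for(i in c) print i,c[i]}' file
C 2
D 1
F 1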

How to append last column of every other row with the last column of the subsequent row

I'd like to append every other (odd-numbered rows) row with the last column of the subsequent row (even-numbered rows). I've tried several different commands but none seem to do the task I'm trying to achieve.
Raw data:
user|396012_232|program|30720Mn|
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|
|391983_233.batch|batch|30720Mn|5050424K
I'd like to take the last field of each "batch" line and append it to the line above.
Desired output:
user|396012_232|program|30720Mn|5108656K
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|36426336K
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|5050424K
|391983_233.batch|batch|30720Mn|5050424K
The "batch" lines would then be discarded from the output, so in those lines there is no preference if the line is cut or copied or changed in any way.
Where I got stumped, my attempts to finish the logic were embarrassingly illogical:
awk 'BEGIN{OFS="|"} {FS="|"} {if ($3=="batch") {a=$5} else {} ' file.data
Thanks!
If you do not need to keep the lines with batch in Field 3, you may use
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; $3=="batch" { print prev $5 }' file.data
or
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; NR%2==0 { print prev $5 }' file.data
Details
BEGIN{OFS=FS="|"} - sets the field separator to pipe
NR%2==1 { prev=$0 }; - saves the odd lines in prev variable
$3=="batch" - checks if Field 3 is equal to batch (probably, with this logic you may replace it with NR%2==0 to get the even line)
{ print prev $5 } - prints the previous line and Field 5.
You may consider also a sed option:
sed 'N;s/\x0A.*|\([^|]*\)$/\1/' file.data > newfile
Details
N; - appends the next line of input to the pattern space, separated by a newline; if there is no more input, sed exits without processing any more commands
s/\x0A.*|\([^|]*\)$/\1/ - replaces the matched text with the Group 1 contents; the pattern matches:
\x0A - a newline
.*| - any 0+ chars up to the last | and
\([^|]*\) - (capturing group 1): any 0+ chars other than |
$ - end of line
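Tracing the first pair of input lines may make this clearer: after N the pattern space holds both lines joined by a newline, and the substitution deletes everything from that newline through the last |, keeping only the captured field:
user|396012_232|program|30720Mn|\x0A|396012_232.batch|batch|30720Mn|5108656K
becomes
user|396012_232|program|30720Mn|5108656K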
If your data is in file 'd', try GNU awk:
awk 'BEGIN{FS="|"} {if(getline n) {if(n~/batch/){b=split(n,a,"|");print $0 a[b]"\n"n} } }' d

Selecting a field after a string using awk

I'm very new to awk having just been introduced to it over the weekend.
I have a question that I'm hoping someone may be able to help me with.
How would one select a field that follows a specific string?
How would I expand this code to select more than one field following a specific string?
As an example, for any given line in my text file I have something like
2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.
Some attributes are very consistent, so I can easily extract those fields: for example 2, 10 and the date. However, there is often a lot of variable text before the next field that I wish to extract. Hence the question: using awk, can I extract the next field following a string? For example, I'm interested in the fields following the /distance/ or /time/ strings, in combination with $1, $3, $4, $5.
Your help will be greatly appreciated.
Andy
Using awk you can select the field following a string. Here is an example:
echo '2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.' |
awk '{
    for(i=1; i<=NF; i++) {
        if ( i ~ /^[1345]$/ ) {
            extract = (extract ? extract FS $i : $i)
        }
        if ( $i ~ /distance|time/ ) {
            extract = (extract ? extract FS $(i+1) : $(i+1))
        }
    }
    print extract
}'
2 10 19/4/2014 school 800m 2:20:22
What we are doing here is basically allowing awk to split on default delimiter. We create a for loop to iterate over all fields. NF stores number of fields for a given line. So we start from 1 and go all the way to the end.
In our first conditional block, we just inspect the field number. If it is 1 or 3 or 4 or 5, we create a variable called extract which concatenates the values of these fields separated by the field separator.
In our second conditional block, we check if the value of the field is either distance or time. If it is we again append to our variable but this time instead of the current value, we do $(i+1) which is basically the value of the next field or you can say value of a field that follows a specific string.
When you have name = value situations like you do here, it's best to create an array that maps the names to the values and then just print the values for the names you're interested in, e.g.:
$ awk '{for (i=1;i<=NF;i++) v[$i]=$(i+1); print $1, $3, $4, $5, v["distance"], v["time"]}' file
2 10 19/4/2014 school 800m 2:20:22
Basic:
awk '{
    for (i = 6; i <= NF; ++i) {
        if ($i == "distance") distance = $(i + 1)
        if ($i == "time") time = $(i + 1)
    }
    print $1, $3, $4, $5, distance, time
}' file
Output:
2 10 19/4/2014 school 800m 2:20:22
But this isn't enough to capture the rest of the school name, which continues past $5; you would need another condition for that.
A better solution is to use another delimiter besides spaces, such as tabs, and set \t as FS.
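For example, if the raw lines were laid out with tabs around the free-text parts (a hypothetical layout, not the actual data above), the interesting fields could be addressed directly:
$ printf '2 of 10\t19/4/2014\tschool name random text\t800m\t2:20:22\tsomeonefast.\n' |
awk -F'\t' '{print $1, $2, $4, $5}'
2 of 10 19/4/2014 800m 2:20:22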