I have a file containing URLs plus params, like the following:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
https://example.com/endpoint/endpoint2/?param1=123
https://example.com/endpoint/endpoint2/
https://example.com/endpoint/endpoint5/"//i.example.com/00/s/Nzk5WDEwMjQ=/z/47IAAOSwBu5hXIKF
and I need to keep only the URLs with unique params.
The desired output:
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
I managed to filter only the URLs that have params with grep:
grep -E '(\?[a-zA-Z0-9]{1,9}\=)'
but I need to filter on the params at the same time, so I tried awk with the same regex, but it gives an error:
awk '{sub(\?[a-zA-Z0-9]{1,9}\=)} !seen[$0]++'
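(For reference, sub() needs at least a /regex/ and a replacement string, which is why this errors out. A syntactically valid variant of that idea would be something like
awk '{sub(/\?[a-zA-Z0-9]{1,9}=/,"")} !seen[$0]++' file
but note it deletes the first param and dedupes on what remains, which is not quite the goal; the answers below key on the parameter itself instead.)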
Update:
I am sorry for editing the desired output, but when I tried the scripts I realized there is a lot of garbage in my file that needs to be filtered out too.
I tried @James Brown's answer with some editing and it looks good, except that unfortunately it does not filter out the last line:
awk -F '?|&' '$2&&!a[$2]++'
And to be clearer about why that output is right for me:
it chose the 1st line because it has at least param1
the 2nd line because it has at least param3
the 3rd line because it has at least param2
The comparison method here is to choose just the unique parameter, whether or not it is concatenated with others via the & character.
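Put differently, one possible reading of this criterion is: keep a line if it carries at least one param=value pair that no earlier line has shown. A minimal sketch of that reading (an interpretation, not necessarily the intended comparison):
awk '{
  n = split($0, parts, /[?&]/)    # parts[2..n] hold the param=value pairs, if any
  keep = 0
  for (i = 2; i <= n; i++)
    if (parts[i] ~ /=/ && !seen[parts[i]]++) keep = 1
  if (keep) print
}' file
On the sample above this prints exactly the three desired lines.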
Edited version after the requirements changed somewhat:
$ awk -F? '{ # ? as field delimiter
split($2,b,/&/) # split at & to get what's between ? and &
if(b[1]!=""&&!a[b[1]]++) # no ? means no $2
print
}' file
Output as expected. Original answer was:
A short one:
$ awk -F? '$2&&!a[$2]++' file
Explained: split records at ? (-F?); if there is a second field ($2) and (&&) it is unique so far, judged by counting the instances of the parameters in the array a (!a[$2]++), output it.
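(On the updated sample, this original one-liner would also pass https://example.com/endpoint/endpoint2/?param1=123 through, because its full query string param1=123 was only ever seen as part of param1=123&param2=1212; that is why the edited version above keys on the first parameter alone.)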
EDIT: The following solution may help when the query string has ? as well as & present in it and we want to consider both of them for removing duplicates.
awk '
/\?/{
  match($0,/\?[^&]*/)                    # first param: from ? up to the first &
  val=substr($0,RSTART,RLENGTH)
  rest=(match($0,/&.*/) ? substr($0,RSTART,RLENGTH) : "")  # the & part, if any
  if(!seen[val]++ && !(rest!="" && seen[rest]++)){         # both parts must be new
    print
  }
}' Input_file
2nd solution (may help when we don't have & parameters in the query string): with your shown samples, please try the following awk program.
awk 'match($0,/\?.*$/) && !seen[substr($0,RSTART,RLENGTH)]++' Input_file
Or the above could be shortened as follows (as per Ed sir's suggestion); note the parentheses around the assignment, without them s would get the result of the && rather than the position of the ?:
awk '(s=index($0,"?")) && !seen[substr($0,s)]++' Input_file
Explanation: the match function of awk matches everything from ? to the end of the line. An AND condition is then added to make sure we print only the first occurrence of each matched value across all lines.
With GNU awk, you could also match the URL up to the first occurrence of the question mark, and then capture what follows using your initial pattern for the first parameter, [a-zA-Z0-9]{1,9}=[^&]+, i.e. the param name, =, and then any characters except &.
Then you can apply the !seen[...]++ idea to the value of capture group 1.
awk '
match($0, /https?:\/\/[^?]+\?([a-zA-Z0-9]{1,9}=[^&]+)/, arr) && !seen[arr[1]]++
' file
Output
https://example.com/endpoint/?param1=123&param2=1212
https://example.com/endpoint/?param3=123&param1=98989
https://example.com/endpoint/endpoint3/?param2=123
Using awk you can check that the string starts with the protocol and contains a question mark.
Then, to get the first parameter only, you can split on ? and & and use the second part of the split as the key for seen:
awk '
/^https?:\/\/[^?]*\?/ && split($0, arr, /[?&]/) > 1 && !seen[arr[2]]++
' file
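With the shown samples this prints the same three lines, since arr[2] again holds only the first param=value pair.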
I have a file dump which has the records of individuals:
.....Detail....account=xxxxx,......state=yyyyy,....
.....Detail....account=aaaaa,......state=bbbbb,....
What would be a way to extract the 2 phrases concatenated together using awk, sed or grep?
Would it be possible in a single-pass command line?
Expected output(delimiter does not matter):
xxxxx-yyyyy
aaaaa-bbbbb
awk -F'[=,]' '{print $2"-"$4}' file
xxxxx-yyyyy
aaaaa-bbbbb
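Here both = and , act as field separators, so $2 is the account value and $4 the state value. Note this assumes the dotted filler contains no extra = or , characters; if it does, the field positions shift.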
The details about the input data are a bit vague, but the following sed filter will probably have the desired effect, and could most likely be tweaked to do so otherwise:
s/.*account=\([^,]*\).*state=\([^,]*\),.*/\1-\2/
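To run it as a single-pass command and print only the lines where the substitution succeeds, the expression can be wrapped in the usual sed -n ... p form (file being your input dump):
sed -n 's/.*account=\([^,]*\).*state=\([^,]*\),.*/\1-\2/p' file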
@IUnknown: I believe the ..... (dots) in your Input_file are actual data; could you please try the following and let me know if this helps.
awk '{for(i=1;i<=NF;i++){if($i ~ /=/){split($i, A,"=");Q=Q?Q"-"A[2]:A[2]}};print Q;Q=""}' Input_file
It assumes that you need only those fields which have = in them, and that you want the value after the =; let me know if this helps you.
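One caveat: if the dotted filler is literal (no whitespace), the whole sample line is a single default-FS field, so the split would grab everything after the first = up to the next one. Making the comma the field separator keeps the same loop working on the shown sample; a sketch of that variant:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /=/){split($i, A,"=");Q=Q?Q"-"A[2]:A[2]}};print Q;Q=""}' Input_file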
I'm trying to parse out all of the lines between different headers and footers into different files, using an awk script in a for loop. For example, I have a file with a list of mismatches with sample-name headers (compiled.csv) that looks like this:
19-T00,,,,,,,,,,,,,,,,
1557,WT,,,,,,,,,,,,,,,
6,109-G->A,110-G->A,,,,,,,,,,,,,,
3,183-G->A,,,,,,,,,,,,,,,
19-T10,,,,,,,,,,,,,,,,
642,WT,,,,,,,,,,,,,,,
206,24->G,,,,,,,,,,,,,,,
19-T21,,,,,,,,,,,,,,,,
464,24->G,,,,,,,,,,,,,,,
19-TSpl,,,,,,,,,,,,,,,,
2219,24->G,,,,,,,,,,,,,,,
20-T00,,,,,,,,,,,,,,,,,,
...
...
My goal for the lines above would be to pass all the lines from 19-T00 through 2219,24->G,,,,,,,,,,,,,,, into a sample output file called sample-19.csv.
The sample names all share the pattern [0-9][0-9]-T*. My first approach was based on creating an array with all 20 sample names (i.e. 19, 20, 21, ...). I am trying to execute the following loop; the output files are created, but they are blank.
for i in {0..19}
do a="$i"
b=`echo $i+1 | bc`
header="${array[$a]}-T"; footer="${array[$b]}-T"
name=`echo $header | cut -d"-" -f1`
awk -F, -v start="$header" -v finish="$footer" '/^start*/,/^finish*/' compiled.csv >"sample-"$name".csv"
done
If I do this manually with the one-liner
awk '/^19-T*/,/^20-T*/' compiled.csv >sample-19.csv
it works fine. So I think there may be a problem in the variable passing, but I don't know how to fix it.
I know there are some other threads discussing the header-footer approach using awk, but I just think my syntax needs some help. If anyone has any advice by way of more experienced eyes, it would be much appreciated. Let me know if anything isn't clear.
Thanks,
Matt
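(A note on the variable-passing suspicion before the answers: awk never expands variables inside /.../ regex literals, so /^start*/ looks for the literal text start. The values passed with -v can be used through dynamic regexps instead; an untested sketch of that one change would be
awk -F, -v start="$header" -v finish="$footer" '$0 ~ "^"start, $0 ~ "^"finish' compiled.csv >"sample-"$name".csv"
though the answers below avoid the shell loop entirely.)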
All you need is something like this (untested):
awk '
/^[0-9][0-9]-T00,/ {             # a sample header line starts a new section
    close(out)                   # close the previous output file, if any
    out = "sample-" $0
    sub(/-T00.*/,".csv",out)     # turn e.g. "19-T00,,,," into "sample-19.csv"
}
{ print > out }                  # every line goes to the current output file
' compiled.csv
If you're ever again considering processing text with a shell loop, make sure to read why-is-using-a-shell-loop-to-process-text-considered-bad-practice first.
Using awk:
awk --posix '/[0-9]{2}-T00/{split($0,a,"-"); name=a[1]} {print $0>"sample-"name".csv"}' file
The output will be two files, sample-19.csv and sample-20.csv, for your contents.