sed/awk + regex delete duplicate lines where first field matches (ip address)

I need a solution to delete duplicate lines where the first field is an IPv4 address. For example, I have the following lines in a file:
192.168.0.1/text1/text2
192.168.0.18/text03/text7
192.168.0.15/sometext/sometext
192.168.0.1/text100/ntext
192.168.0.23/othertext/sometext
So the only thing that matches between the duplicate lines in this scenario is the IP address. All I know is that a regex for an IP address is:
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
It would be nice if the solution is one line and as fast as possible.

If the file contains only lines in the format you show, i.e. the first field is always an IP address, you can get away with one line of awk:
awk '!x[$1]++' FS="/" $PATH_TO_FILE
EDIT: This removes duplicates based only on IP address. I'm not sure this is what the OP wanted when I wrote this answer.
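For readers who find the awk idiom terse, here is a minimal Python sketch of what !x[$1]++ does with FS="/": it keeps a line only the first time its first field is seen. The function name dedupe_by_first_field and the sample list are made up for illustration; the data is the example from the question.

```python
def dedupe_by_first_field(lines, sep="/"):
    """Keep only the first line seen for each value of the first field."""
    seen = set()
    kept = []
    for line in lines:
        key = line.split(sep, 1)[0]  # same role as awk's $1 with FS="/"
        if key not in seen:          # awk: !x[$1]++ is true only on first sight
            seen.add(key)
            kept.append(line)
    return kept


sample = [
    "192.168.0.1/text1/text2",
    "192.168.0.18/text03/text7",
    "192.168.0.15/sometext/sometext",
    "192.168.0.1/text100/ntext",
    "192.168.0.23/othertext/sometext",
]

for line in dedupe_by_first_field(sample):
    print(line)
```

Like the awk one-liner, this preserves the original order and makes a single pass over the input.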

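If you don't need to preserve the original ordering, one way to do this is using sort, keyed on the first /-separated field (a plain sort -u would compare whole lines, not just the IP address):
sort -t/ -k1,1 -u <file>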

The awk that ArjunShankar posted worked wonders for me.
I had a huge list of items, which had multiple copies in field 1, and a special sequential number in field 2. I needed the "newest" or highest sequential number from each unique field 1.
I had to use sort -rn first to push the highest number for each item up to the "first entry" position, because the awk command keeps the first line it sees for a key and skips the later ones, as opposed to keeping the last/most recent one in the list.
Thanks, ArjunShankar!
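A rough Python sketch of that sort-then-dedupe idea, in case it helps. The record layout (key in field 1, a numeric sequence in field 2, separated by "/") is an assumption, since the comment does not show its data, and newest_per_key is a hypothetical name.

```python
def newest_per_key(lines, sep="/"):
    """Return, for each distinct first field, the line with the highest second field."""
    # Sort by the numeric second field, highest first (the job sort -rn did).
    ordered = sorted(lines, key=lambda line: int(line.split(sep)[1]), reverse=True)
    seen = set()
    kept = []
    for line in ordered:
        key = line.split(sep)[0]
        if key not in seen:  # first sight of a key is now its highest number
            seen.add(key)
            kept.append(line)
    return kept


records = ["a/1", "b/7", "a/3", "b/2"]
print(newest_per_key(records))
```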

Related

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which has a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I run the following command . . .
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.

How to get the part of a string after the last occurrence of certain character?

I would like to have the substring after the last occurrence of a certain character.
Now I found here how to get the first, second or so parts, but I need only the last part.
The input data is a list of file directories:
c:\dir\subdir\subdir\file.txt
c:\dir\subdir\subdir\file2.dat
c:\dir\subdir\file3.png
c:\dir\subdir\subdir\subdir\file4.txt
Unfortunately this is the data I have to work with, otherwise I could list it using the command prompt.
The problem is that the number of directories varies from path to path.
My code based on the previous link is:
select (regexp_split_to_array(BTRIM(path),'\\'))[1] from myschema.mytable
So far I've tried some things in the brackets that came in to my mind. For example [end], [-1] etc.
None of them work. Is there a way to get the last part without reversing my strings, getting the first part, and then reversing it back?

You can use regexp_matches():
select (regexp_matches(path, '[^\\]+$'))[1]
Here is a db<>fiddle.
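To see why that pattern works, here is a small Python sketch applying the same regular expression outside the database: [^\\]+$ matches the longest run of non-backslash characters that ends at the end of the string, which is whatever follows the last backslash. The helper name last_component is made up for illustration.

```python
import re

# One or more non-backslash characters anchored at the end of the string.
LAST_PART = re.compile(r"[^\\]+$")


def last_component(path):
    """Return the text after the last backslash, or None if nothing follows it."""
    match = LAST_PART.search(path.strip())
    return match.group(0) if match else None


for p in [r"c:\dir\subdir\subdir\file.txt", r"c:\dir\subdir\file3.png"]:
    print(last_component(p))
```

Because the match is anchored at the end, the number of directories in front does not matter.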

Targeting a string for deletion with grep, sed, awk (or cut)

I am trying to parse some logs to gain the user agent and account id per line. I have already managed to pull the user agent and a string which contains the account id all on the same line.
The next step is to extract the account id from its longer string. I thought this would be fairly simple, as I know the start of the string and there are / slashes for the delimiter, but the user agent also contains slashes and has a varying number of fields.
The log file currently looks something like the following example but there are hundreds to thousands of lines to parse. Luckily I am working off a partition with plenty of space to spare.
USER_AGENT_PART ACCOUNT_ID_Part_/plus/path/to/stuff/they/access
some user agent/1.3 KnownString1_32d4-56e-009f98/some/stuff/here
user/agent KnownString1_12d3-345e-4c534/more/stuff/here
User/Agent cURL/1.5.0 KnownString2_12d34e56/stuff/things/stuff/stuff
one/User Agent/2.0 KnownString1_12d3_456e_7g8/more/random/stuff/stuff
So the goal is to keep the user agent part and the account id part and drop the path of the stuff they are accessing in the last string. But I can't use / or spaces as general delimiters because many user agents have / and various amounts of spaces in their name.
Also, the different types of user agents is way more than this little sample I have here. There are anywhere from 25 - 50 distinct types depending on the log. So it doesn't seem worth it to target the user agent and try to exclude it.
It seems the logical way to start is by targeting the part of the account ID which is a known string (KnownString1 or KnownString2) and grab everything from there (which is unknown numbers and letters with dashes) up until the first / of that account string.
Then I would delete the first / (In the account ID string) and everything after. I expect I will need to do this in two passes to utilize the two known parts of the user IDs.
This seemed like it would be easy but I just can't wrap my head around how to start targeting that last string. I don't even have a good example of something that is close to working because I don't know how to target the last string by delimiters without catching the same delimiters in the user agent part.
Any ideas?
Edit: Every line will have an account id that starts with one of two common KnownString_ in it but then is followed by a series of unknown digits and dashes until it gets to the first /. So I don't need to search for lines containing that before targeting the string.
Edit2: My original examples of the Account ID did not reflect there were letters mixed in with the numbers.
Edit3: Thanks to the responses from oguz ismail and kesubagu I was able to solve this using egrep. Looks like I was trying to make things more complicated than they were. I also realized I need to revisit grep, as it's capable of doing far more than what I tend to use it for.
This is what I ended up using which worked in one pass:
egrep -o ".+(KnownString1|KnownString2)_[^/]+" logfile > logfile2

Using grep:
$ grep -o '.*KnownString[^/]*' file
some user agent/1.3 KnownString1_32d4-56e-009f98
user/agent KnownString1_12d3-345e-4c534
User/Agent cURL/1.5.0 KnownString2_12d34e56
one/User Agent/2.0 KnownString1_12d3_456e_7g8
.* matches everything before KnownString, and [^/]* matches everything after KnownString until the first /.
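The same pattern can be tried in Python to check the greedy behaviour; this is only a sketch, and agent_and_account is a hypothetical helper name. search(...).group(0) plays the role of grep -o here, returning just the matched part of the line.

```python
import re

# Greedy .* runs up to the last "KnownString"; [^/]* then stops at the first slash.
PATTERN = re.compile(r".*KnownString[^/]*")


def agent_and_account(line):
    """Return the user agent plus account id, dropping the trailing path."""
    match = PATTERN.search(line)
    return match.group(0) if match else None


log_lines = [
    "some user agent/1.3 KnownString1_32d4-56e-009f98/some/stuff/here",
    "User/Agent cURL/1.5.0 KnownString2_12d34e56/stuff/things/stuff/stuff",
]
for line in log_lines:
    print(agent_and_account(line))
```

Slashes inside the user agent are harmless because they sit before KnownString, where .* is free to match anything.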

You can use egrep with the -o option, which outputs only the part of the line that matches the provided regex, so you could do something like this:
cat test | egrep -o ".+(KnownString1|KnownString2)_[_0-9-]+"
where the test file contains the input you've given, the output in this case was
some user agent/1.3 KnownString1_324-56-00998
user/agent KnownString1_123-345-4534
User/Agent cURL/1.5.0 KnownString2_123456
one/User Agent/2.0 KnownString1_123_456_78

Regex: match line if previous line satisfies a criteria

What's a regex that will match lines whose previous line starts with a set of characters?
I'm trying to parse M3U files, and I need to match the lines whose preceding line starts with #EXTINF: So if we take this example:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:11.54
ASMIK_tid_0000250058_m.600000-00000.ts
#EXTINF:8.51
ASMIK_tid_0000250058_m.600000-00001.ts
#EXTINF:11.76
ASMIK_tid_0000250058_m.600000-00002.ts
#EXTINF:10.05
ASMIK_tid_0000250058_m.600000-00003.ts
etc...
I only want to extract these lines:
ASMIK_tid_0000250058_m.600000-00000.ts
ASMIK_tid_0000250058_m.600000-00001.ts
ASMIK_tid_0000250058_m.600000-00002.ts
ASMIK_tid_0000250058_m.600000-00003.ts
I've tried variations on this answer and this: (?#EXT.*\n) but had no luck...

Firstly you have to be sure that the function you are using is matching the whole file instead of line by line, otherwise this is impossible.
Then you would need to specify a lookbehind:
(?<=#EXTINF.*\r\n).*
If your regex implementation does not support lookbehinds OR repetition inside of a lookbehind, you can use two capture groups instead:
(#EXTINF.*\r\n)(.*)
Obviously you would simply ignore the first capture group, but keep all of the data in the second capture group.
If you need to manually specify that the . does not match newlines, you can specify the mode at the beginning of the regex: (?-s)
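As an illustration of the capture-group variant, here is a minimal Python sketch. Python's re module rejects variable-width lookbehinds, so it is one of the implementations where the two-group approach is the practical one. The pattern is adapted slightly (an assumption on my part): it uses \r?\n to accept both line-ending styles and [^\r\n]* so that a stray \r is not captured. The name segment_lines is hypothetical.

```python
import re

# Match a line starting with #EXTINF, then capture the whole following line.
SEGMENT_AFTER_EXTINF = re.compile(
    r"^#EXTINF[^\r\n]*\r?\n([^\r\n]*)", re.MULTILINE
)


def segment_lines(playlist_text):
    """Return every line whose preceding line starts with #EXTINF."""
    # The text must be passed in whole, not line by line.
    return SEGMENT_AFTER_EXTINF.findall(playlist_text)


playlist = "\n".join([
    "#EXTM3U",
    "#EXT-X-VERSION:3",
    "#EXT-X-TARGETDURATION:10",
    "#EXTINF:11.54",
    "ASMIK_tid_0000250058_m.600000-00000.ts",
    "#EXTINF:8.51",
    "ASMIK_tid_0000250058_m.600000-00001.ts",
])
print(segment_lines(playlist))
```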

DCL sort - different start positions

I have a DCL script that creates a .txt file that looks something like this
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
anotherline,line,00000015
I need to sort the file by the 3rd column highest to lowest
ex:
anotherline,line,00000015
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
Is it best to use the sort command? If so, everything I've seen requires a position number; how can this be done if the field starts at a different position on each line?
If sort is a bad way to handle this, is there something else, or can I somehow handle this while writing the lines to the file?
I've only been working with VMS/DCL for a few weeks now, so I'm not familiar with all of the commands yet.
Thanks!

As you already noticed, the VMS sort expects fields with a fixed start position within a record. You cannot specify a field by a separator. If you want to use the VMS sort you have to make sure your third field starts at the same column for all records. In other words, you have to pad the preceding fields. If you have control over how the file is created, this may work for you. If you don't, or you don't know how long the strings in front of the sort field will be, this may not be a workaround. Maybe changing the order of the fields is an option.
On the other hand, you may find GNV installed on your system. Then you can try to use its sort, which is a GNU style sort. That is, $ mcr gnv$gnu:[bin]sort -t, -k3 -r x.txt may get you the wanted results.

VMS Sort is indeed not really equipped for this.
Reformatting as you did is about the only way.
If you do not have access to GNV sort on the OpenVMS system, then perhaps you have, or can install, Perl? It is somewhat easier to install.
In perl there are of course many ways.
For example using an anonymous sort function ( $a is first arg, $b second; <> reads all input )
$ perl -e "print sort { 0+(split /,/,$b)[2] <=> 0+(split /,/,$a)[2]} <>" x.x
where the 0 + forces numeric evaluation. For (fixed length?) string compare use:
$ perl -e "print sort { (split /,/,$b)[2] cmp (split /,/,$a)[2]} <>" x.x
hth,
Hein.
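If Python happens to be available on the system (an assumption, the answers above only mention GNV and Perl), the same "split on the separator, compare the third field numerically, highest first" idea can be sketched like this. The function name sort_by_third_field_desc is made up for illustration; the rows are the sample from the question.

```python
def sort_by_third_field_desc(lines):
    """Sort comma-separated records by their numeric third field, highest first."""
    # int() accepts zero-padded text such as "00000015", so the compare is numeric.
    return sorted(lines, key=lambda line: int(line.split(",")[2]), reverse=True)


rows = [
    "something,somethingelse,00000004",
    "somethingdifferent,somethingelse1,00000002",
    "anotherline,line,00000015",
]
for row in sort_by_third_field_desc(rows):
    print(row)
```

Splitting on the comma sidesteps the fixed-start-position requirement, since the field is located by separator rather than by column.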
