Recursively search directory for occurrences of each string from one column of a .csv file - awk

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which has a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I run the following command . . .
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.
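One possible way to get per-string results is to read Col 1 line by line in the shell instead of passing it through xargs, which is what trips over the single quotes. The following is only a sketch under stated assumptions: the function name search_col1 is made up for illustration, search.csv is assumed to have a header row and no quoted fields containing commas, and grep is assumed to support -r as in the command above.

```shell
# Sketch: print "search string<TAB>matching file" for every Col 1 entry.
# Assumptions: the CSV has a header row and no quoted fields that contain commas.
search_col1() {
    csv=$1
    shift
    # Skip the header, keep column 1, and read each string literally
    # (IFS= and -r stop the shell from trimming spaces or eating backslashes).
    tail -n +2 "$csv" | cut -d, -f1 | while IFS= read -r term; do
        [ -n "$term" ] || continue
        # -F fixed string, -r recursive, -l list file names, -i ignore case;
        # "--" keeps a term that starts with "-" from being parsed as an option.
        # grep exits with 1 when nothing matches; that is not treated as an error here.
        { grep -Frli -- "$term" "$@" </dev/null || [ "$?" -eq 1 ]; } |
            while IFS= read -r file; do
                printf '%s\t%s\n' "$term" "$file"
            done
    done
}
```

It could be called as search_col1 search.csv ~/parentDirectory/*/source, and the tab-separated output can then be filtered by string to decide which files to edit by hand and which to auto-replace.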

Related

Targeting a string for deletion with grep, sed, awk (or cut)

I am trying to parse some logs to gain the user agent and account id per line. I have already managed to pull the user agent and a string which contains the account id all on the same line.
The next step is to extract the account id from its longer string. I thought this would be fairly simple as I will know the start of the string and there are / slashes for the delimiter but the user agent also contains slashes and has a varying number of fields.
The log file currently looks something like the following example but there are hundreds to thousands of lines to parse. Luckily I am working off a partition with plenty of space to spare.
USER_AGENT_PART ACCOUNT_ID_Part_/plus/path/to/stuff/they/access
some user agent/1.3 KnownString1_32d4-56e-009f98/some/stuff/here
user/agent KnownString1_12d3-345e-4c534/more/stuff/here
User/Agent cURL/1.5.0 KnownString2_12d34e56/stuff/things/stuff/stuff
one/User Agent/2.0 KnownString1_12d3_456e_7g8/more/random/stuff/stuff
So the goal is to keep the user agent part and the account id part and drop the path of the stuff they are accessing in the last string. But I can't use / or spaces as general delimiters because many user agents have / and various amounts of spaces in their name.
Also, there are far more types of user agent than this little sample shows: anywhere from 25 to 50 distinct types, depending on the log. So it doesn't seem worth it to target the user agent and try to exclude it.
It seems the logical way to start is by targeting the part of the account ID which is a known string (KnownString1 or KnownString2) and grab everything from there (which is unknown numbers and letters with dashes) up until the first / of that account string.
Then I would delete the first / (In the account ID string) and everything after. I expect I will need to do this in two passes to utilize the two known parts of the user IDs.
This seemed like it would be easy but I just can't wrap my head around how to start targeting that last string. I don't even have a good example of something that is close to working because I don't know how to target the last string by delimiters without catching the same delimiters in the user agent part.
Any ideas?
Edit: Every line will have an account id that starts with one of two common KnownString_ in it but then is followed by a series of unknown digits and dashes until it gets to the first /. So I don't need to search for lines containing that before targeting the string.
Edit2: My original examples of the Account ID did not reflect there were letters mixed in with the numbers.
Edit3: Thanks to the responses from oguz ismail and kesubagu I was able to solve this using egrep. Looks like I was trying to make things more complicated than they were. I also realized I need to revisit grep as it's capable of doing far more than what I tend to use it for.
This is what I ended up using which worked in one pass:
egrep -o ".+(KnownString1|KnownString2)_[^/]+" logfile > logfile2
Using grep:
$ grep -o '.*KnownString[^/]*' file
some user agent/1.3 KnownString1_32d4-56e-009f98
user/agent KnownString1_12d3-345e-4c534
User/Agent cURL/1.5.0 KnownString2_12d34e56
one/User Agent/2.0 KnownString1_12d3_456e_7g8
.* matches everything before KnownString, and [^/]* matches everything after KnownString until the first /.
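The same idea can also be written as a substitution rather than a match. This is a sketch using sed instead of grep (a swapped-in tool, not what the answer above uses); the function name strip_path is hypothetical. It captures the account ID up to its first / and deletes that slash and everything after it. Unlike grep -o, lines that contain no KnownString pass through unchanged.

```shell
# Sketch: keep the user agent and account ID, drop the path after the ID.
strip_path() {
    # \( ... \) captures "KnownString1_" or "KnownString2_" plus every non-slash
    # character after it; "/.*" then matches the first slash of the account
    # string and the rest of the line, which the replacement discards.
    sed 's|\(KnownString[12]_[^/]*\)/.*|\1|' "$@"
}
```

For example, strip_path logfile > logfile2 would write the trimmed lines to a new file.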
You can use egrep with the -o option, which outputs only the part of the line that matches the provided regex, so you could do something like this:
cat test | egrep -o ".+(KnownString1|KnownString2)_[_0-9-]+"
where the test file contains the input you've given, the output in this case was
some user agent/1.3 KnownString1_324-56-00998
user/agent KnownString1_123-345-4534
User/Agent cURL/1.5.0 KnownString2_123456
one/User Agent/2.0 KnownString1_123_456_78

Extracting a specific value from a text file

I am running a script that outputs data.
I am specifically trying to extract one number. However, each time I run the script, the number I am interested in appears at a different position in the output file (because the file is a running log).
I have tried several awk, sed, and grep commands, but I can't get any to work because many of them rely on the position of the word or number remaining constant.
This is what I am dealing with. The value I require is the bold one:
Energy initial, next-to-last, final =
-5.96306582435 -5.96306582435 -5.96349956298
You can try
awk '{print $(i++%3+6)}' infile
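An alternative that does not depend on where the value sits in the file is to anchor on the label text. The sketch below makes two assumptions that the question does not confirm: that the wanted value is the final (third) energy, since the bold marking is not visible here, and that the label reads exactly "Energy initial, next-to-last, final =". The function name final_energy is made up for illustration.

```shell
# Sketch: print the last energy value after each "Energy ... final =" label.
# Assumption: the wanted number is the final (third) value.
final_energy() {
    awk '/Energy initial, next-to-last, final =/ {
        sub(/.*=/, "")          # drop the label, keep whatever follows "="
        if (NF == 0) getline    # nothing left: the values are on the next line
        print $NF               # last field is the "final" energy
    }' "$@"
}
```

Running final_energy outputfile would print one value per occurrence of the label; to pick the first or second value instead, $NF could be replaced with $1 or $2.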

AIX: remove the last symbols (CRLF) from a file

There is a large file whose last characters are \r\n. I need to remove them. It seems to be equivalent to removing the last line(?).
UPD: no, it's not: the file has only one line, which ends with \r\n.
I know two ways, but neither works on AIX:
sed 's/\r\n$//' file # I don't know why it doesn't work
head -c-2 # head doesn't work with negative numbers
Is there any solution for AIX? A lot of large files must be processed, so performance is important.
Usually, if you need to edit a file via a script in place, I use ed due to historical reasons. For example:
ed - /tmp/foo.txt <<EOF
g/^$/d
w
q
EOF
ed is more than a bit cantankerous. Note also that you did not really remove the empty lines at the bottom of the file but rather all of the empty lines. With ed and some practice you can probably achieve deleting only the empty lines at the bottom of the file. e.g. go to the bottom of the file, search up for a non-empty line, then move down a line and delete from that point to the end of the file. ed command scripts act (pretty much) as you would expect.
Also, if they really do have \r\n, then those are not going to be considered empty lines but rather lines with a control-M (\r) in them. You may need to adjust your pattern if that is the case.
My answer https://stackoverflow.com/a/46083912/3220113 to the duplicate question should work here too. Another solution is using
awk ' (NR>1) { print s }
{s=$0}
END { printf("%s",substr(s, 1, length(s)-1) ) }
' inputfile

Find duplicate records with only text case difference

I have a log file with 8M entries/records with URLs. I'd like to find duplicate URLs (same URLs) with the only difference being their type / text case.
Example:
origin-www.example.com/this/is/hard.html
origin-www.example.com/this/is/HARD.html
origin-www.example.com/this/is/Hard.html
In this case, there are three duplicates with case sensitivity.
Output should be just the count -c and a new file with the duplicates.
Use the typical awk '!seen[$0]++' file trick combined with tolower() or toupper() to make all lines be in the same case:
$ awk '!seen[tolower($0)]++' file
origin-www.example.com/this/is/hard.html
For different output or counters, please show the desired output.
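If the goal is to list the duplicates themselves rather than one representative per group, a two-pass variant of the same tolower() idea may help. This is a sketch, not the answer's method: the function name case_dups is hypothetical, and it treats byte-identical repeated lines as duplicates too, which may or may not be wanted.

```shell
# Sketch: print every line whose lowercased form occurs more than once.
case_dups() {
    # The file is read twice. Pass 1 (NR == FNR) counts lowercased lines;
    # pass 2 prints each original line whose lowercased key was seen more than once.
    awk 'NR == FNR { count[tolower($0)]++; next }
         count[tolower($0)] > 1' "$1" "$1"
}
```

Something like case_dups logfile > duplicates.txt followed by wc -l < duplicates.txt would give both the new file and the count. The array holds one key per distinct lowercased URL, so memory use grows with the number of distinct entries.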

DCL sort - different start positions

I have a DCL script that creates a .txt file that looks something like this
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
anotherline,line,00000015
I need to sort the file by the 3rd column highest to lowest
ex:
anotherline,line,00000015
something,somethingelse,00000004
somethingdifferent,somethingelse1,00000002
Is it best to use the sort command? If so, everything I've seen requires a position number. How can this be done if the field starts at a different position on each line?
If sort is a bad way to handle this, is there something else, or can I somehow handle this while writing the lines to the file?
I've only been working with VMS/DCL for a few weeks now, so I'm not familiar with all of the commands yet.
Thanks!
As you already noticed, the VMS sort expects fields with a fixed start position within a record. You cannot specify a field by a separator. If you want to use the VMS sort, you have to make sure your third field starts at the same column for all records. In other words, you have to pad the preceding fields. If you have control over how the file is created, this may work for you. If you don't, or you don't know how long the strings in front of the sort field will be, this may not be a workable option. Maybe changing the order of the fields is an option.
On the other hand, you may find GNV installed on your system. Then you can try to use its sort, which is a GNU style sort. That is, $ mcr gnv$gnu:[bin]sort -t, -k3 -r x.txt may get you the wanted results.
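For reference, here is how a Unix-style sort handles the sample data when restricted to the third field and told to compare numerically. This is a sketch in POSIX shell syntax, not DCL; the function name sort_by_third is made up, and under GNV the command would be invoked as shown in the line above rather than through a shell function.

```shell
# Sketch: sort comma-separated records by the third field, highest first.
sort_by_third() {
    # -t, splits fields on commas; -k3,3 limits the key to field 3 only;
    # the n modifier compares numerically and r reverses the order.
    sort -t, -k3,3nr "$@"
}
```

Because the counters in the sample are zero-padded to a fixed width, a plain reverse text sort on the field gives the same order; the n modifier only matters if the padding is ever dropped.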
VMS Sort is indeed not really equipped for this.
Reformatting as you did is about the only way.
If you do not have access to GNV sort on the OpenVMS system, then perhaps you have, or can install, Perl? It is somewhat easier to install.
In perl there are of course many ways.
For example using an anonymous sort function ( $a is first arg, $b second; <> reads all input )
$ perl -e "print sort { 0+(split /,/,$b)[2] <=> 0+(split /,/,$a)[2]} <>" x.x
where the 0 + forces numeric evaluation. For (fixed length?) string compare use:
$ perl -e "print sort { (split /,/,$b)[2] cmp (split /,/,$a)[2]} <>" x.x
hth,
Hein.