How can I make nawk evaluate two if conditions as one? - awk

For the data set:
12345 78945
12345 45678
I need the output to be
12345 45678
Long story short, sometimes two values are representing the same object due to certain practices instituted above my pay grade. So with a list of known value-pairs that represent the same object, I need awk to filter these from the output. In the above situation 12345 and 78945 represent the same object, so it should be filtered out. The remaining rows are errors that should be brought to my attention.
My attempted code
cat data | nawk '(($1!="12345")&&($2!="78945"))'
produces an empty set as output. So either I'm committing a logical error in my mind, or a syntactical one whereby nawk is evaluating each condition individually, as if written cat data | nawk '($1!="12345")&&($2!="78945")', thus filtering out both lines since both fail the first condition.
I'm sure it's just my unfamiliarity with how nawk resolves such things. Thanks in advance for any assistance. And for reasons, this has to be done in nawk.

There is no line in your sample data for which $1!="12345" is true, so there is no line for which that condition && anything else can be true. Think about it. This has nothing to do with awk - it's simple boolean logic.
Try any of these instead, whichever you feel is clearer:
nawk '($1!="12345") || ($2!="78945")' data
nawk '!(($1=="12345") && ($2=="78945"))' data
nawk '($1=="12345") && ($2=="78945"){next} 1' data
nawk '($1" "$2) != "12345 78945"' data
nawk '!/^[ \t]*12345[ \t]+78945([ \t]|$)/' data
Also google UUOC to understand why I got rid of cat data |.
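For instance (reproducing the sample data and desired result from the question), the first variant keeps only the mismatched pair:
nawk '($1!="12345") || ($2!="78945")' data
12345 45678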

Related

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which have a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I run the following command . . .
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.
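One way around the xargs quoting problem, as a rough sketch (untested; it assumes GNU grep, that column 1 never contains embedded commas, and reuses the ~/parentDirectory/*/source paths from the question): read the terms line by line in the shell, pass each one to grep -F as a quoted argument so single quotes in the data never reach a quote parser, and label each block of output so the hits stay associated with their term:
awk -F, 'NR>1 {print $1}' search.csv |
while IFS= read -r term; do
    printf '== %s ==\n' "$term"
    grep -Frli -e "$term" ~/parentDirectory/*/source/
done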

Huge file with 55000 rows * 1800 columns - need to delete only specific columns with a partial pattern

I have a huge file (cancer gene expression data, a ~2 GB .csv file) with 55000 rows and ~1800 columns, so my table header looks like this:
TCGA-4N-A93T-01A-11R-A37K-07, **TCGA-5M-AAT4-11A-11R-A41B-07**, TCGA-5M-AATE-01A-11R-A41B-07, TCGA-A6-2677-01B-02R-A277-07, **TCGA-A6-2677-11A-01R-0821-07**
For example, in column TCGA-5M-AAT4-11A-11R-A41B-07 the fourth position is -11A. Now my problem is that I have to delete every column which has -11A at the 4th position (xx-xx-xx-11A-xx-xx-xx). This has to search all 1800 columns and keep only those columns which do not have -11A at the fourth position.
Can you please tell me what command I should use to get the required data?
I am a biologist and have limited experience in coding.
EDITED:
I have a data file collected from 1800 breast cancer patients; the table has 55000 gene names as rows and 1800 samples as columns (a 55000 * 1800 matrix file). A few samples designed by our lab were faulty and we have to remove them from our analysis. I have identified those samples and want to remove them from my file1.csv. The faulty samples are the ones of the form xx-xx-xx-11A-xx-xx-xx, i.e. the samples which show 11A in the fourth place of the column name; I need to identify only those samples and remove them from the file. I can do this in R but it takes too long to process. Thanks in advance, and sorry for the trouble.
Try this
#! /usr/local/bin/gawk -f
# blacklist_columns.awk
# https://stackoverflow.com/questions/49578756
# i.e. TCGA-5M-AAT4-11A-11R-A41B-07
BEGIN {
    PATTERN = "TCGA-..-....-11A-...-....-.."
}
$0 ~ ".*" PATTERN ".*" {       # matches rows containing the pattern
    for (col = 1; col <= NF; col++)
        # find column(s) in the row with the pattern
        if ($col ~ PATTERN) {
            blacklist[col]++   # note which column
        }
}
END {                          # output the list collected
    n = asorti(blacklist)
    for (i = 1; i <= n; i++)
        bl = bl "," blacklist[i]
    print substr(bl, 2)
}
# Usage, try ... :
# BLACKLIST=$(./blacklist_columns.awk table.tab)
#
# cut --complement -f "$BLACKLIST" table.tab > table_purged.tab
You can't do it in one pass, so you might as well let an existing tool do the second pass, especially since you are more on the wet-lab side. The script should spit out a list of columns it thinks you should skip; you can feed that list as an argument to the program cut and get it to keep only the columns not mentioned.
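Spelled out as shell, as a sketch (untested; --complement is a GNU cut extension, and the tab-separated table.tab name is taken from the usage comment above; for the comma-separated file in the question you would set FS="," in the script's BEGIN block and give cut -d','):
BLACKLIST=$(./blacklist_columns.awk table.tab)
cut --complement -f "$BLACKLIST" table.tab > table_purged.tab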
Edit(orial):
Thank you for your sentiment, Wojciech Kaczmarek; I could not agree more.
There is also a flip side, where some biologists discount "coders", which I find annoying. The paper being worked on here may credit some water-cooler collaborator but fail to mention the technical help on a show stopper (hey, they fixed it, so it must not have been a big deal).
Not sure what you're really asking for; this script will delete, row by row, the fields which have "11A" in the 4th position (based on the - delimiter).
$ awk -F', *' -v OFS=', ' '{for(i=1;i<=NF;i++)
                              {split($i,a,"-");
                               if(a[4]=="11A") $i=""}}1' input > output
If you're asking to remove the entire column for all rows, not just the found row, this is not it. Also, it's not tested, but perhaps it will give you ideas...
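If dropping whole columns really is the goal, the names in the header row are enough to decide which fields to keep. A rough sketch, untested, using the ', ' separator from the sample and placeholder file names (file1.csv, filtered.csv): build the keep-list from line 1, then print only those fields on every row.
awk -F', *' -v OFS=', ' '
NR==1 { for (i=1; i<=NF; i++) { split($i, a, "-"); if (a[4] != "11A") keep[++n] = i } }
      { out = ""; for (j=1; j<=n; j++) out = out (j>1 ? OFS : "") $(keep[j]); print out }
' file1.csv > filtered.csv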

Is a /start/,/end/ range expression ever useful in awk?

I've always contended that you should never use a range expression like:
/start/,/end/
in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:
/start/{f=1} f{print; if (/end/) f=0}
when you want to tweak it even slightly to do anything else, it requires a complete re-write or results in duplicated or otherwise undesirable code. E.g., if you want to print the matching text excluding the range delimiters using the second form above, you'd just tweak it to move the components around:
f{if (/end/) f=0; else print} /start/{f=1}
but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:
/start/,/end/{ if (!/start|end/) print }
i.e. duplicate the conditions which is undesirable.
Then I saw a question posted that required identifying the LAST end in a file and where a range expression was used in the solution and I thought it seemed like that might have some value (see https://stackoverflow.com/a/21145009/1745001).
Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all and a solution that doesn't use range expressions would have worked just as well for that case.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
*I used to use:
/start/{f=1} f; /end/{f=0}
but too many times I found I had to do something additional when f is true and /end/ is found (or to put it another way ONLY do something when /end/ is found IF f were true) so now I just try to stick to the slightly less brief but much more robust and extensible:
/start/{f=1} f{print; if (/end/) f=0}
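For instance (a made-up extension, purely to show the shape): counting the blocks as they close only needs one more statement inside the same action, with no restructuring:
awk '/start/{f=1} f{print; if (/end/){f=0; n++}} END{printf "%d blocks seen\n", n}' file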
Interesting. I also often start with a range expression and then later on switch to using a variable.
I think a situation where this could be useful, aside from the pure range-only situations, is if you want to print a match, but only if it lies in a certain range. Also, it is immediately obvious what it does. For example:
awk '/start/,/end/{if(/ppp/)print}' file
with this input:
start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd
will produce:
ppp 1
ppp 4
ppp 5
--
One could of course also use:
awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file
But it is longer and somewhat less readable.
While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use cases where it is used on its own. As you observe, it might have little value for processing tabular data, the main but not the only use case of awk.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
In the mentioned use cases, the range expression improves legibility. Here are a few examples where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the incredible versatility of awk.
Filter logs within a time range
Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given range of 1 hour:
awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'
Extract a document from a file
awk '/---- begin file.data ----/,/---- end file.data ----/'
This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or, more generally, parts of MIME messages.
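As a quick sketch of the bundling case (the marker text and file names are invented for illustration): given a file carrying a payload between marker lines, the range expression pulls the block out and sed drops the two marker lines themselves:
awk '/---- begin file.data ----/,/---- end file.data ----/' bundle.txt | sed '1d;$d' > file.data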
Process LaTeX files
The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:
awk '/begin{abstract}/,/end{abstract}/' *.tex
or all the theorems, to prepare a theorem database!
awk '/begin{theorem}/,/end{theorem}/' *.tex
or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):
awk '
/begin{theorem}/,/end{theorem}/ { if(/\\cite{/) { c+= 1 } }
END { printf("There were %d bad-style citations.\n", c) }
'
or preprocess tables, etc.

How to skip records that turn on/off the range pattern?

gawk '/<Lexer>/,/<\/Lexer>/' file
this works but it prints the first and last records, which I'd like to omit. How to do so?
The gawk manual says: "The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in." But it gives no example.
I tried something like
gawk '/<Lexer>/,/<\/Lexer>/' {1,FNR-1} file
but it doesn't work.
If you have a better way to do this, without using awk, say so.
You can do it with 2 separate match statements and a variable
gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}' file
This matches <Lexer>, sets p to 1 and then skips to the next line. While p is 1 it prints the current line. When it matches </Lexer> it sets p to 0, so the p==1 rule no longer fires and printing is suppressed until the next <Lexer>.
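For completeness, the if-statement approach the gawk manual alludes to keeps the range expression and filters out the delimiter records inside the action (a sketch of the same idea):
gawk '/<Lexer>/,/<\/Lexer>/ { if (!/<\/?Lexer>/) print }' file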

How do I create a sub array in awk?

Given a list like:
Dog bone
Cat catnip
Human ipad
Dog collar
Dog collar
Cat collar
Human car
Human laptop
Cat catnip
Human ipad
How can I get results like this, using awk:
Dog bone 1
Dog collar 2
Cat catnip 2
Cat collar 1
Human car 1
Human laptop 1
Human ipad 2
Do I need a sub array? It seems to me like I need an array of "owners" which is populated by arrays of "things."
I'd like to use awk to do this, as this is a subscript of another program in awk, and for now, I'd rather not create a separate program.
By the way, I can already do it using sort and grep -c, and a few other pipes, but I really won't be able to do that on gigantic data files, as it would be too slow. Awk is generally much faster for this kind of thing, I'm told.
Thanks,
Kevin
EDIT: Be aware that the columns are actually not next to each other like this; in the real file they are more like columns $8 and $11. I say this because I suppose if they were next to each other I could incorporate an awk regex ~/Dog\ Collar/ or something, but I won't have that option. Thanks!
awk does not have multi-dimensional arrays, but you can manage by constructing 2D-ish array keys:
awk '{count[$1 " " $2]++} END {for (key in count) print key, count[key]}' | sort
which, from your input, outputs
Cat catnip 2
Cat collar 1
Dog bone 1
Dog collar 2
Human car 1
Human ipad 2
Human laptop 1
Here, I use a space to separate the key values. If your data contains spaces, you can use some other character that does not appear in your input. I typically use array[$a FS $b] when I have a specific field separator, since that's guaranteed not to appear in the field values.
Awk has some built-in support for multi-dimensional subscripts (count[$1, $2]), but it's really just cleverly concatenating the keys (with SUBSEP) to form a sort of compound key.
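For reference, a sketch of that compound-subscript form applied to the fields mentioned in the EDIT ($8 and $11); the subscripts are glued together with SUBSEP, so the pieces can be split back apart for printing:
awk '{count[$8, $11]++} END {for (key in count) {split(key, k, SUBSEP); print k[1], k[2], count[key]}}' file | sort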
I'd recommend learning Perl, which will be fairly familiar to you if you like awk, but Perl supports true Lists of Lists. In general, Perl will take you much further than awk.
Re your comment:
I'm not trying to be superior. I understand you asked how to accomplish a task with a specific tool, awk. I did give a link to the documentation for simulating multi-dimensional arrays in awk. But awk doesn't do that task well, and it was effectively replaced by Perl nearly 20 years ago.
If you ask how to cross a lake on a bicycle, and I tell you it'll be easier in a boat, I don't think that's unreasonable. If I tell you it'll be easier to first build a bridge, or first invent a Star Trek transporter, then that would be unreasonable.