group lines with a tag - awk

Each line belongs to the nearest tag line above it.
For instance, 'The trucker is drunk' belongs to AN8. I want to group all lines under their corresponding tag, keeping the lines in their original relative order.
input:
AN9
the cow is eating way too much
AN8
The trucker is drunk
AN9
The field are running out of herbs.
AN8
the truck is not going that staight
well of course the road is in curve
AN9
and
another line
AN8
The cop needs to check this out
AN9
now the cow is soooo big dude !
output:
AN9
the cow is eating way too much
The field are running out of herbs.
and
another line
now the cow is soooo big dude !
AN8
The trucker is drunk
the truck is not going that staight
well of course the road is in curve
The cop needs to check this out

Here is one awk approach:
awk '/^AN/ {id=$0;next} {a[id]=a[id]"\n"$0} END {for (i in a) print i,a[i]}' file
AN8
The trucker is drunk
the truck is not going that staight
well of course the road is in curve
The cop needs to check this out
AN9
the cow is eating way too much
The field are running out of herbs.
and
another line
now the cow is soooo big dude !
If a line starts with AN, use it as the key (id) into array a and skip to the next line; otherwise, append the current line to a[id]. Finally, print every key together with its collected lines. Note that for (i in a) visits the keys in an unspecified order, so the tags are not guaranteed to come out in first-seen order.
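A variant that also remembers the order in which each tag first appears, so the groups come out in first-seen order rather than awk's arbitrary array order (a sketch against the same input file):

```shell
awk '/^AN/ { id = $0; if (!(id in seen)) { seen[id] = 1; order[++n] = id }; next }
     { a[id] = a[id] "\n" $0 }
     END { for (i = 1; i <= n; i++) printf "%s%s\n", order[i], a[order[i]] }' file
```

The order array records each new tag once, and the END loop walks that array instead of iterating over a directly.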

Related

segmenting bs4.element.Tag

Is it possible to segment a bs4.element.Tag into several bs4.element.Tag instances?
You can think of an application as the following:
1- The original bs4.element.Tag contains a paragraph.
2- We want to segment the paragraph in the original bs4.element.Tag into sentences and get a bs4.element.Tag corresponding to each sentence.
Example:
paragraphs = soup.find_all('p') gives all the paragraphs in an HTML file.
Suppose a paragraph (which is also a bs4.element.Tag instance) is the following:
<p><i>Le Bassin Aux Nymphéas</i>, 1919. Monet's late series of water lily paintings are among his best-known works.</p>
I would like to turn this bs4.element.Tag instance (which is also a paragraph) into 2 bs4.element.Tag instances as the following (one for each sentence):
First bs4.element.Tag should correspond to the first sentence:
<i>Le Bassin Aux Nymphéas</i>, 1919.
Second bs4.element.Tag should correspond to the second sentence:
Monet's late series of water lily paintings are among his best-known works.

How to awk to read a dictionary and replace words in a file?

We have a source file ("source-A") that looks like this:
The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences
Each sentence in "source-A" is on its own line and terminates with a newline (\n)
We have a dictionary/conversion file ("converse-B") that looks like this:
aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes
"converse-B" is a two column, tab delimited file.
Each equivalence map (term-on-left<tab>term-on-right) is on its own line and terminates with a newline (\n)
How to read "converse-B", and replace terms in "source-A" where a term in "converse-B" column-1 is replaced with the term in column-2, and then write to an output file ("output-C")?
For example, the "output-C" would look like this:
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
The tricky part is the term potato.
If a "simple" awk solution cannot handle a singular term (potato) and a plural term (potatoes), we'll use a manual substitution method. The awk solution can skip that use case.
In other words, an awk solution can stipulate that it only works for an unambiguous word or a term composed of space separated, unambiguous words.
An awk solution will get us to a 90% completion rate; we'll do the remaining 10% manually.
sed probably suits this better, since it's only phrase/word replacements. Note that if the same words appear in multiple phrases, the first matching substitution wins, so order your dictionary accordingly.
$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
The inner sed converts the dictionary entries into sed substitution expressions, and the main sed applies them to the content.
NB: a production-quality script should take care of letter case and of word boundaries to avoid unwanted substring substitutions; both are ignored here.
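Since the question asked for awk, here is a rough awk equivalent of the sed pipeline above. Caveats: gsub() treats the dictionary's left-hand column as a regex, and the for (k in map) loop applies the substitutions in an unspecified order, so overlapping phrases (the potato case) are again not handled:

```shell
awk 'BEGIN { FS = "\t" }
     NR == FNR { map[$1] = $2; next }           # first file: load converse-B
     { for (k in map) gsub(k, map[k]); print }  # then rewrite source-A
    ' converse-B source-A > output-C
```

The NR == FNR trick distinguishes the first input file (the dictionary) from the second (the content).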

I need to figure out how to delimit a string in SQL

So, I'm at work atm and I had a co-worker create some SQL code for me to extract out text from a larger description field. The problem I'm running into is it doesn't stop extracting where I need it to. I need it to stop after it either sees the word "Specifications:" or when it finds two CRLF back to back. This would allow it to grab out only the "Features" which is what I'm trying for.
Here's an example of the current code:
SELECT IN_Desc, Replace(IN_Desc, Left(IN_Desc, InStr(IN_Desc, "- ") - 1), "")
FROM Inventory
WHERE IN_MfgName = "BERK"
Here's an example of the text it's looking through:
Gulp! has 400 times more scent dispersion than ordinary plastic bait.
The extreme scent dispersion greatly expands the strike zone allowing
you to catch more fish! Even more impressive, the natural formulation
of Gulp! out fishes live bait in head to head field tests. Berkley
Gulp! truly is the next generation in soft bait!
Features:
Ideal on jigs or as a trailer
Favorite for all SW species when targeting big fish
Proven tail action design swims under all conditions
Expand your strike zone with 400x more scent dispersion than plastic baits
15 years of Gulp! evolution…the best keeps getting better
Specifications:
Bait Length: 6"
Color: White
Quantity: Per 4
Packaging: Bag
Desired output:
Ideal on jigs or as a trailer
Favorite for all SW species when targeting big fish
Proven tail action design swims under all conditions
Expand your strike zone with 400x more scent dispersion than plastic baits
15 years of Gulp! evolution…the best keeps getting better
Thanks to everyone in advance for any and all help.
This is a bit ugly, but it seems to do the trick. It may need some tweaking to get exactly what you want, but this will get everything between Features and the next double carriage return/line feed:
Mid(yourfield, InStr(1, yourfield, "Features:") + Len("Features: "), InStr(InStr(1, yourfield, "Features:") + Len("Features: "), yourfield, Chr(13) & Chr(10) & Chr(13) & Chr(10)) - (InStr(1, yourfield, "Features:") + Len("Features: ")))
I'm certain that it could be written prettier, but my Access is rusty as hell. I feel like a VBA UDF would be a lot cleaner, and then you could employ regex to really pick this apart.

Huge file with 55000 rows * 1800 columns - need to delete only specific columns with a partial pattern

I have a huge file (cancer gene expression data, a ~2 GB .csv file) with 55000 rows and ~1800 columns. My table looks like this:
TCGA-4N-A93T-01A-11R-A37K-07, **TCGA-5M-AAT4-11A-11R-A41B-07**, TCGA-5M-AATE-01A-11R-A41B-07, TCGA-A6-2677-01B-02R-A277-07, **TCGA-A6-2677-11A-01R-0821-07**
For example, in column TCGA-5M-AAT4-11A-11R-A41B-07 the fourth position is -11A. My problem is that I have to delete every column that has -11A at the fourth position (xx-xx-xx-11A-xx-xx-xx). This has to search all 1800 columns and keep only those that do not have -11A at the fourth position.
Can you please help me with what command I should use to get the required data?
I am a biologist and have limited experience in coding.
EDITED:
I have a data file collected from 1800 breast cancer patients; the table has 55000 gene names as rows and 1800 samples as columns (a 55000 * 1800 matrix file). A few samples designed by our lab were faulty and we have to remove them from our analysis. I have identified those samples and want to remove them from my file1.csv. The faulty samples are the ones with 11A in the fourth place of the column name (xx-xx-xx-11A-xx-xx-xx); I need to identify only those and remove them from the file. I can do this in R but it takes too long to process. Thanks in advance.
Try this:
#! /usr/local/bin/gawk -f
# blacklist_columns.awk
# https://stackoverflow.com/questions/49578756
# i.e. TCGA-5M-AAT4-11A-11R-A41B-07
BEGIN {
    PATTERN = "TCGA-..-....-11A-...-....-.."
}
$0 ~ PATTERN {                        # matches rows containing the pattern
    for (col = 1; col <= NF; col++)
        # find column(s) in the row with the pattern
        if ($col ~ PATTERN)
            blacklist[col]++          # note which column
}
END {                                 # output the list collected
    n = asorti(blacklist)
    for (i = 1; i <= n; i++)
        bl = bl "," blacklist[i]
    print substr(bl, 2)
}
# Usage:
#   BLACKLIST=$(./blacklist_columns.awk table.tab)
#   cut --complement -f "$BLACKLIST" table.tab > table_purged.tab
You can't do it in one pass over the data rows, so you might as well let an existing tool do the second pass, especially since you work more on the wet-lab side. The script spits out a list of columns it thinks you should skip; you can feed that list as an argument to cut and have it keep only the columns not mentioned.
Edit(orial):
Thank you for your sentiment, Wojciech Kaczmarek; I could not agree more.
There is also a flip side, where some biologists discount "coders", which I find annoying. The paper being worked on here may credit some water-cooler collaborator yet fail to mention the technical help on a show-stopper (hey, they fixed it, so it must not have been a big deal).
Not sure what you're really asking for; this script will delete, row by row, the fields that have "11A" in the fourth position (based on the - delimiter):
$ awk -F', *' -v OFS=', ' '{for (i = 1; i <= NF; i++) {
        split($i, a, "-")
        if (a[4] == "11A") $i = ""
  }} 1' input > output
If you're asking to remove the entire column for all rows, not just the matching rows, this is not it. Also not tested, but perhaps it will give you ideas...
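Since the sample IDs are the column names in the header row, the keep/drop decision for whole columns can also be made from the first line alone. A sketch, assuming a comma-separated file whose first row holds the sample IDs (the file names file1.csv and filtered.csv are hypothetical):

```shell
awk -F',' -v OFS=',' '
NR == 1 {                              # header row: decide which columns to keep
    for (i = 1; i <= NF; i++) {
        split($i, parts, "-")
        keep[i] = (parts[4] != "11A")  # drop samples with 11A in the 4th dash-field
    }
}
{                                      # every row (header included): print kept columns
    line = ""
    for (i = 1; i <= NF; i++)
        if (keep[i]) line = line (line == "" ? "" : OFS) $i
    print line
}' file1.csv > filtered.csv
```

Columns without four dash-separated parts (such as a leading gene-name column) have an empty parts[4] and are kept automatically.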

How do I create a sub-array in awk?

Given a list like:
Dog bone
Cat catnip
Human ipad
Dog collar
Dog collar
Cat collar
Human car
Human laptop
Cat catnip
Human ipad
How can I get results like this, using awk:
Dog bone 1
Dog collar 2
Cat catnip 2
Cat collar 1
Human car 1
Human laptop 1
Human ipad 2
Do I need a sub-array? It seems to me like I need an array of "owners" which is populated by arrays of "things."
I'd like to use awk to do this, as this is part of a larger awk program, and for now, I'd rather not create a separate program.
By the way, I can already do it using sort and grep -c, and a few other pipes, but I really won't be able to do that on gigantic data files, as it would be too slow. Awk is generally much faster for this kind of thing, I'm told.
Thanks,
Kevin
EDIT: Be aware that the columns are actually not next to each other like this; in the real file, they are more like columns $8 and $11. I say this because I suppose if they were next to each other I could incorporate an awk regex ~/Dog\ Collar/ or something. But I won't have that option. Thanks!
awk does not have multi-dimensional arrays, but you can manage by constructing 2D-ish array keys:
awk '{count[$1 " " $2]++} END {for (key in count) print key, count[key]}' file | sort
which, from your input, outputs
Cat catnip 2
Cat collar 1
Dog bone 1
Dog collar 2
Human car 1
Human ipad 2
Human laptop 1
Here, I use a space to separate the key values. If your data contains spaces, you can use some other character that does not appear in your input. I typically use array[$a FS $b] when I have a specific field separator, since that's guaranteed not to appear in the field values.
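If first-seen order matters more than sorted order, a plain POSIX awk variant (no multi-dimensional arrays) can track insertion order explicitly: owners come out in the order they first appear, and each owner's things likewise. A sketch:

```shell
awk '{
    if (!($1 in owner_seen)) { owner_seen[$1] = 1; owners[++no] = $1 }
    key = $1 " " $2
    if (!(key in count)) things[$1] = things[$1] $2 SUBSEP
    count[key]++
}
END {
    for (i = 1; i <= no; i++) {
        owner = owners[i]
        n = split(things[owner], t, SUBSEP)  # trailing SUBSEP leaves an empty last field
        for (j = 1; j < n; j++)
            print owner, t[j], count[owner " " t[j]]
    }
}' file
```

SUBSEP (the built-in key separator, "\034" by default) is used here as a delimiter that is unlikely to appear in the data.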
GNU Awk has some support for multi-dimensional arrays, but it's really just cleverly concatenating keys to form a sort of compound key.
I'd recommend learning Perl, which will be fairly familiar to you if you like awk, but Perl supports true Lists of Lists. In general, Perl will take you much further than awk.
Re your comment:
I'm not trying to be superior. I understand you asked how to accomplish a task with a specific tool, awk. I did give a link to the documentation for simulating multi-dimensional arrays in awk. But awk doesn't do that task well, and it was effectively replaced by Perl nearly 20 years ago.
If you ask how to cross a lake on a bicycle, and I tell you it'll be easier in a boat, I don't think that's unreasonable. If I tell you it'll be easier to first build a bridge, or first invent a Star Trek transporter, then that would be unreasonable.