How do I determine the model organism from a FASTA-format sequence?

So I have this FASTA record, for example:
>sp|A9X7L0|ANMT_RUTGR Anthranilate N-methyltransferase OS=Ruta graveolens OX=37565 PE=1 SV=1
MGSLSESHTQYKHGVEVEEDEEESYSRAMQLSMAIVLPMATQSAIQLGVFEIIAKAPGGR
LSASEIATILQAQNPKAPVMLDRMLRLLVSHRVLDCSVSGPAGERLYGLTSVSKYFVPDQ
DGASLGNFMALPLDKVFMESWMGVKGAVMEGGIPFNRVHGMHIFEYASSNSKFSDTYHRA
MFNHSTIALKRILEHYKGFENVTKLVDVGGGLGVTLSMIASKYPHIQAINFDLPHVVQDA
ASYPGVEHVGGNMFESVPEGDAILMKWILHCWDDEQCLRILKNCYKATPENGKVIVMNSV
VPETPEVSSSARETSLLDVLLMTRDGGGRERTQKEFTELAIGAGFKGINFACCVCNLHIM
EFFK
So I am wondering: how do I determine whether the organism is one of:
Bacteria
Viruses
Archaea
Eukaryota

The answer can be found in the OS= part of your FASTA header. But suppose you don't have this information; then you would perform a BLAST search. If your sequence consisted only of the letters A, T, C and G, it would be a DNA sequence. Since it does not, you are dealing with a protein sequence, so we need to use protein BLAST.
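The alphabet check described above is easy to automate; a minimal sketch (the helper name and the example sequence line are illustrative, not part of any BLAST tool):

```python
def looks_like_dna(sequence):
    """True if the sequence uses only the DNA alphabet A, T, C, G."""
    return set(sequence.upper()) <= set("ATCG")

# First line of the example sequence from the question
seq = "MGSLSESHTQYKHGVEVEEDEEESYSRAMQLSMAIVLPMATQSAIQLGVFEIIAKAPGGR"
program = "blastn" if looks_like_dna(seq) else "blastp"
print(program)  # the sequence contains amino-acid letters, so blastp
```

Note this is only a heuristic: a short protein made solely of Ala, Cys, Gly and Thr would pass the DNA check, so the header (when present) remains the more reliable signal.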
Copy/paste the FASTA record into the online protein BLAST tool, leave the rest at the default settings, and click the BLAST button. After some time you will get the results.
You will see that there is a 100% match with Ruta graveolens (as mentioned in the FASTA header) and a roughly 80% match with Citrus sinensis.
If you want to know to which domain these species belong, you can click the link to the accession record. For Ruta graveolens that is A9X7L0.1. There you will see that the common name of this plant is common rue, which has the following taxonomy:
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
Pentapetalae; rosids; malvids; Sapindales; Rutaceae; Rutoideae;
Ruta.
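When the header does carry the metadata, the organism can also be read directly from the OS= field without any BLAST search; a sketch using a regular expression over the UniProt-style header shown above (the assumption is that OS= runs up to the next KEY= token):

```python
import re

header = (">sp|A9X7L0|ANMT_RUTGR Anthranilate N-methyltransferase "
          "OS=Ruta graveolens OX=37565 PE=1 SV=1")

# The organism name runs from OS= up to the next field (OX=, GN=, PE= or SV=)
match = re.search(r"OS=(.+?)\s+(?:OX|GN|PE|SV)=", header)
organism = match.group(1) if match else None
print(organism)  # Ruta graveolens
```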

Related

Arabic pdf text extraction

I'm trying to extract text from Arabic pdfs - raw data extraction not OCR -.
I tried many packages and tools (Python packages, PDFBox, the Adobe API, and many others) and none of them extracted the text correctly: either they read the text left-to-right or they decode it wrongly.
Here are two samples from different tools.
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
and yes I can copy it and get the same rendered text.
Is there any tool that can extract Arabic text correctly?
the book link can be found here
The text stored in a PDF is not the same as the text used for its construction; we can see that in your example, where the page number is shown in Arabic on the surface but is coded as a plain 7 in the underlying text.
However, a greater problem is language support in fonts: in Notepad I had to accept a script font to see anything similar, and that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result of
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will, at best, look much like the samples you already have.
Thus, in summary, your Sample 1 is as good as, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way to work around this: after extracting the text, I open the .txt file and normalize its content using Python's unicodedata module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function or library.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
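The normalization step mentioned above takes only a few lines with Python's standard unicodedata module; a sketch (the helper name is illustrative):

```python
import unicodedata

def normalize_extracted(text):
    """Fold compatibility characters, including the Arabic
    presentation-form glyphs (U+FB50-U+FEFF) that PDF extractors
    often emit, back to their base letters with NFKC."""
    return unicodedata.normalize("NFKC", text)

# U+FE8E is the final-form glyph of ALEF; NFKC folds it to plain ALEF (U+0627)
folded = normalize_extracted("\uFE8E")
```

Note that normalization only repairs the codepoints; it does not reorder text that was extracted in visual (reversed) order, so the word-order problem in the samples above is a separate issue.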

.docx file chapter extraction

I would like to extract the content of a .docx file, chapter by chapter.
My .docx document has an index, and every chapter has some content:
1. Intro
some text about Intro, these things, those things
2. Special information
these information are really special
2.1 General information about the environment
environment should be also important
2.2 Further information
and so on and so on
So finally it would be great to receive an Nx3 matrix containing the index number, the index name and the content.
i_number i_name content
1 Intro some text about Intro, these things, those things
2 Special Information these information are really special
...
Thanks for your help
You could export or copy-paste your .docx into a .txt file and apply this R script:
library(stringr)
library(readr)

doc <- read_file("filename.txt")

# A heading is a number followed by a dot and a short title line
pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\r\n)", dotall = TRUE)

i_name <- str_match_all(doc, pattern_chapter)[[1]][, 1]
paragraphs <- str_split(doc, pattern_chapter)[[1]]
# Drop empty fragments; avoids -which(), which empties the
# whole vector when nothing matches
content <- paragraphs[paragraphs != ""]

result <- data.frame(i_name, content)
result$i_number <- seq.int(nrow(result))
View(result)
It doesn't work if your document contains any line that begins with a number but is not a heading (e.g. footnotes or numbered lists).
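For anyone working in Python instead, the same splitting idea can be sketched with the standard library, under the same assumption that every heading is a line starting with a (possibly dotted) chapter number:

```python
import re

# In-memory stand-in for the exported .txt of the question's document
text = """1. Intro
some text about Intro, these things, those things
2. Special information
these information are really special
2.1 General information about the environment
environment should be also important
"""

# A heading is a chapter number like "1." or "2.1" followed by a title
heading = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")

rows = []                      # (i_number, i_name, content) triples
number = name = None
content_lines = []
for line in text.splitlines():
    m = heading.match(line)
    if m:
        if number is not None:
            rows.append((number, name, " ".join(content_lines)))
        number, name = m.group(1), m.group(2)
        content_lines = []
    elif number is not None:
        content_lines.append(line)
if number is not None:
    rows.append((number, name, " ".join(content_lines)))
```

It shares the limitation noted above: any non-heading line that starts with a number will be misread as a heading.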

Extract sections of PDF

I am trying to extract sections of a PDF file for use in text analysis. I tried using pdf-extract to accomplish this. However, a command such as
pdf-extract extract --regions --no-lines Bauer2010.pdf
only extracts the (x, y) coordinates of each region, as in the example below:
<region x="226.32" y="750.47" width="165.57" height="6.37"
line_height="6.37" font="BGBFHO+AdvP4DF60E">Patient Education and
Counseling 79 (2010) 315-319</region>
Can sections of a PDF be extracted?
Have a look at http://text-analyzer.com, where you can upload your PDF file and have it converted into a format suitable for natural language processing. Once converted to text, it processes the file, breaking it into sentences with sentiment analysis. It has over 40 different sentence views in which you can tag sections, and those tagged sentences can be exported.
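Note also that the pdf-extract output in the question carries the text of each region, not just its coordinates, so the regions can be post-processed locally; a sketch parsing that XML with Python's standard library (the fragment is wrapped in a root element for parsing, and the bottom-left coordinate origin is an assumption about pdf-extract's output):

```python
import xml.etree.ElementTree as ET

# A fragment shaped like the pdf-extract output shown above
xml = ('<regions>'
       '<region x="226.32" y="750.47" width="165.57" height="6.37" '
       'line_height="6.37" font="BGBFHO+AdvP4DF60E">'
       'Patient Education and Counseling 79 (2010) 315-319'
       '</region>'
       '</regions>')

root = ET.fromstring(xml)
# PDF coordinates usually have the origin at the bottom left, so a
# larger y means nearer the top of the page
regions = sorted(root.iter("region"), key=lambda r: -float(r.get("y")))
text = "\n".join(r.text for r in regions)
```

Grouping the sorted regions by font name or y-gaps is then one rough way to recover section boundaries.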

How to find out similarity degree of the two sentences?

I want to analyze guests' review texts and hosts' comment texts from Airbnb.
I have some pairs of text data (of guests and hosts).
ex)
guest1 review with host1 comment
guest2 review with host1 comment
guest3 review with host2 comment
guest2 review with host2 comment
guest4 review with host3 comment
And then I want to see the similarity or conformity of each pair of paragraphs.
Do I need to extract the main topic word in each sentence?
Which text mining algorithm can help me?
Can LDA show topics for each paragraph? (not for whole text data)
There are many ways. Try shingling the sentences into k-shingles: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
or look at it on Wikipedia: https://en.wikipedia.org/wiki/W-shingling
You can then compute the Jaccard similarity between the shingle sets of two sentences.
Also take a look at the bag-of-words model, which maps each sentence to a vector; you can then find the similarity between two vectors (two sentences) easily via the dot product over matching words: https://en.wikipedia.org/wiki/Bag-of-words_model
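The shingling-plus-Jaccard approach fits in a few lines of plain Python; a sketch (the review and comment strings are made-up examples, not Airbnb data):

```python
def shingles(text, k=3):
    """Set of k-word shingles (contiguous windows of k words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical guest review and host comment, for illustration
review = "the host was friendly and the room was clean"
comment = "the guest was friendly and the room was spotless"
sim = jaccard(shingles(review), shingles(comment))
print(sim)  # 0.4
```

Smaller k makes the measure more forgiving; k=1 reduces it to bag-of-words overlap.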

Python: Search Journal.txt for dates and write the corresponding text into a new file for Evernote import

I've been learning Python for a week now and am currently at Exercise 26 of Learn Python the Hard Way, so I know very little. I tried searching but couldn't find what I need.
What I need:
I want to write a script that breaks my Journal.txt file into several text files so I can import them into Evernote. Evernote pulls the title for a note from the first line of the .txt file.
This is a random example of the date format I used in my Journal.txt
1/13/2013, 09/02/2012, so I'm afraid the date format is not consistent. I know about:
if 'blabla' in open('example.txt').read():
but I don't know how to use it with a date. Please help me extract the journal entries for each date from one large file into new files. This is literally all I have so far:
Journal = open("D:/Australien/Journal.txt", 'r').read()
Consider doing it as recommended here, replacing YOUR_TEXT_HERE with a search pattern for a date, e.g. [0-9]+\/[0-9]+\/[0-9]+:
awk "/[0-9]+\/[0-9]+\/[0-9]+/{n++}{print > n \".txt\" }" a.txt
If you don't have awk installed on your PC, you can fetch it here.
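The same split can also be done in Python itself, closer to where the asker started; a sketch (the journal text, the date pattern, and the in-memory notes dict are assumptions for illustration; matching dates like 1/13/2013 or 09/02/2012):

```python
import re

# An in-memory stand-in for Journal.txt; a date line starts each entry
journal = """1/13/2013
Went surfing today.
09/02/2012
Long drive up the coast.
"""

date_line = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")
entries = {}            # date -> list of lines belonging to that entry
current = None
for line in journal.splitlines():
    if date_line.match(line.strip()):
        current = line.strip()
        entries[current] = []
    elif current is not None:
        entries[current].append(line)

# One note per entry; Evernote takes the first line as the note title
notes = {date.replace("/", "-") + ".txt": date + "\n" + "\n".join(lines)
         for date, lines in entries.items()}
```

Writing each `notes` value to its filename (e.g. with `open(name, "w")`) then yields one importable .txt per journal entry.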