How to find the degree of similarity between two sentences? - text-mining

I want to analyze guests' review texts and hosts' comment texts on Airbnb.
I have pairs of texts (one from a guest, one from a host).
For example:
guest1 review with host1 comment
guest2 review with host1 comment
guest3 review with host2 comment
guest2 review with host2 comment
guest4 review with host3 comment
I then want to measure the similarity or conformity within each pair of paragraphs.
Do I need to extract the main topic word in each sentence?
Which text mining algorithm can help me?
Can LDA show topics for each paragraph (rather than for the whole text data)?

There are many ways. Try splitting each sentence into k-shingles: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
or look at it on Wikipedia: https://en.wikipedia.org/wiki/W-shingling
You can then compute the Jaccard similarity between the shingle sets of two sentences.
Also take a look at the bag-of-words model, which maps each sentence to a vector of word counts. You can then find the similarity between two vectors (two sentences) easily by taking the dot product over the matching words: https://en.wikipedia.org/wiki/Bag-of-words_model
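
The two approaches above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned pipeline: the function names, the whitespace tokenization, and the two sample sentences are all made up for the example, and the bag-of-words score is normalized to cosine similarity (the dot product divided by the vector lengths), which is an addition to the plain dot product mentioned above so that long texts do not score higher just for being long.

```python
from collections import Counter
import math

def shingles(text, k=2):
    """Return the set of k-word shingles (contiguous word k-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def bag_of_words(text):
    """Map a text to a word -> count vector."""
    return Counter(text.lower().split())

def cosine(c1, c2):
    """Dot product over shared words, normalized by both vector lengths."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

# Hypothetical guest review / host comment pair.
review = "the room was clean and the host was friendly"
comment = "the guest was friendly and kept the room clean"

print(jaccard(shingles(review), shingles(comment)))
print(cosine(bag_of_words(review), bag_of_words(comment)))
```

Both scores range from 0 (nothing shared) to 1 (identical shingle sets or identical word distributions), so they can be computed per guest/host pair and compared directly.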

Related

segmenting bs4.element.Tag

Is it possible to segment a bs4.element.Tag into several bs4.element.Tag?
You can think of an application as the following:
1- The original bs4.element.Tag contains a paragraph.
2- We want to segment the paragraph in the original bs4.element.Tag into sentences and get a bs4.element.Tag corresponding to each sentence.
Example:
paragraphs = soup.find_all('p') gives all the paragraphs in an HTML file.
Suppose a paragraph (which is also a bs4.element.Tag instance) is the following:
<p><i>Le Bassin Aux Nymphéas</i>, 1919. Monet's late series of water lily paintings are among his best-known works.
I would like to turn this bs4.element.Tag instance (which is also a paragraph) into 2 bs4.element.Tag instances as the following (one for each sentence):
First bs4.element.Tag should correspond to the first sentence:
<i>Le Bassin Aux Nymphéas</i>, 1919.
Second bs4.element.Tag should correspond to the second sentence:
Monet's late series of water lily paintings are among his best-known works.

Generate rule-based passwords with John the Ripper

I am trying to recover a password I have not used in a long time.
I know the words used in the passphrase, but I do not remember exactly the character substitutions,
and upper/lower case I have used. I only remember some, and know the possibilities for others.
The passphrase I am trying to recover is 15 characters long.
I have installed John the Ripper (jumbo version 1.9), and I tried to create some rules for character
substitutions I know I have used hoping to quickly generate a wordlist with all possible passphrases
based on my rules.
Let's say my passphrase is password with some character substitutions. If I use this set of rules:
sa#
ss$
so0
soO
I get those results:
p#ssword
pa$$word
passw0rd
passwOrd
When I say I am looking for all possible combinations, I mean something more like the following (this list is not exhaustive):
p#ssword
p#$sword
p#$$word
pa$sword
pa$$word
p#ssw0rd
p#$sw0rd
p#$$w0rd
pa$sw0rd
pa$$w0rd
p#sswOrd
p#$swOrd
p#$$wOrd
pa$swOrd
pa$$wOrd
Gathering all rules in one line does not help me achieve my goal, and making one rule (line) per substitution by character position basically amounts to generating my list by hand.
I am now wondering how I can achieve my goal, or whether JtR is the right tool for the job.
I have found a solution that fits my use case. The oNx syntax replaces the character at the Nth position (zero-based) with x.
In addition, using brackets applies more than one substitution to the same position. So oN[xy] will yield two passwords: one with the character at the Nth position replaced with x, and one with it replaced with y.
For my password example above, the rule needed to achieve my goal would be:
o1[a#] o2[sS$] o3[sS$] o5[oO0]
I hope it helps someone with an old database to unlock.
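
To see what the bracketed rule above expands to, the same enumeration can be sketched in plain Python. This is only an illustration of the combinatorics, not a reimplementation of John the Ripper's rule engine; the function name `expand` and the dictionary layout are invented for the example. Note that each bracket also lists the original character (a, s, o), which is what keeps the unsubstituted variants in the output.

```python
from itertools import product

def expand(word, choices):
    """Enumerate every candidate obtained by picking, for each position
    listed in `choices` (zero-based), one of the allowed characters.
    Positions not listed keep their original character."""
    pools = [choices.get(i, ch) for i, ch in enumerate(word)]
    return ["".join(combo) for combo in product(*pools)]

# Mirrors the rule  o1[a#] o2[sS$] o3[sS$] o5[oO0]  applied to "password".
candidates = expand("password", {1: "a#", 2: "sS$", 3: "sS$", 5: "oO0"})

print(len(candidates))  # 2 * 3 * 3 * 3 = 54
for candidate in candidates[:5]:
    print(candidate)
```

The candidate count is the product of the bracket sizes, so it grows quickly as more positions get alternatives; for a 15-character passphrase it is worth estimating that product before generating the full list.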

Vector Source block in GNU Radio PSK tutorial

I don't understand the inputs in the Vector Source block in the fourth flowchart in the GNU Radio Guided PSK Tutorial. What is behind the three dots? Please state, in full, the input to the Vector line in the first couple of Vector Source blocks so that I can see and understand the inputs. The tutorial is at: https://wiki.gnuradio.org/index.php/Guided_Tutorial_PSK_Demodulation.
The problem I have is in the section called Recovering Timing. There is no link to any file that explains the inputs to the Vector Source blocks. The tutorial shows the surface of the block but not the input. The surface shows 49*[0,]+[1,]+5... and then the next one is 50*[0,]+[1,]+4... I don't understand the input to these Vector Source blocks.
When we updated that tutorial to 3.8, it was decided that only the final flowgraph source would be included in the active GNU Radio tree. However, all of the previous ones from 3.7 can be found at https://github.com/gnuradio/gr-tutorial/tree/master/examples/tutorial7, where you can get the specific parameters.
Also note that both the old and new versions of mpsk_stage6.grc were incorrect. Look at https://github.com/gnuradio/gnuradio/issues/3599 to find the solution. NOTE: As of 9 July 2020, that flowgraph has been incorporated into the GNU Radio tree, so the link in the tutorial is correct.
That's just a Pythonic way of making vectors of the form
0,0,0,0,…,0,1,0,…,0
49*[0,] has the shape integer * list, which means "generate a list containing integer repetitions of list". So 49*[0,] is a list containing 49 zeros. You can append more lists using +.
Here, the first source contains 49 zeros, followed by a single one, followed by more zeros (likely 50).
The second source contains 50 zeros, followed by a one, followed by more zeros (probably 49), and so on.
The idea here is just that they contain the same signal, but shifted!
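
The list arithmetic described above can be tried directly in a Python shell. In this sketch the helper name `shifted_impulse` and the total length of 100 are assumptions: the number of trailing zeros is hidden behind the "..." on the block, and 50 and 49 are only the guesses given above, so check the .grc files for the real values.

```python
def shifted_impulse(leading_zeros, total_len=100):
    """Build a vector of zeros with a single 1 after `leading_zeros` zeros.
    total_len is an assumed value, not taken from the tutorial flowgraph."""
    trailing_zeros = total_len - leading_zeros - 1
    return leading_zeros * [0,] + [1,] + trailing_zeros * [0,]

first = shifted_impulse(49)   # same as 49*[0,] + [1,] + 50*[0,]
second = shifted_impulse(50)  # same as 50*[0,] + [1,] + 49*[0,]

print(first.index(1), second.index(1))  # position of the single 1 in each
print(len(first), len(second))
```

Each vector is the same impulse delayed by one more sample than the previous one.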

How do I determine the model organism from FASTA format

I have a sequence in FASTA format, for example:
>sp|A9X7L0|ANMT_RUTGR Anthranilate N-methyltransferase OS=Ruta graveolens OX=37565 PE=1 SV=1
MGSLSESHTQYKHGVEVEEDEEESYSRAMQLSMAIVLPMATQSAIQLGVFEIIAKAPGGR
LSASEIATILQAQNPKAPVMLDRMLRLLVSHRVLDCSVSGPAGERLYGLTSVSKYFVPDQ
DGASLGNFMALPLDKVFMESWMGVKGAVMEGGIPFNRVHGMHIFEYASSNSKFSDTYHRA
MFNHSTIALKRILEHYKGFENVTKLVDVGGGLGVTLSMIASKYPHIQAINFDLPHVVQDA
ASYPGVEHVGGNMFESVPEGDAILMKWILHCWDDEQCLRILKNCYKATPENGKVIVMNSV
VPETPEVSSSARETSLLDVLLMTRDGGGRERTQKEFTELAIGAGFKGINFACCVCNLHIM
EFFK
How do I determine which of the following the organism belongs to?
Bacteria
Viruses
Archaea
Eukaryota
The answer can be found by looking at the OS part of the header of your FASTA file (here, OS=Ruta graveolens). But suppose you don't have this information; then you would perform a BLAST search. If the letters in your sequence consisted of only A, T, C and G, it would be a DNA sequence. Since they do not, you are dealing with a protein sequence, so you need to use protein BLAST.
Copy/paste the FASTA file in the online tool:
Leave the rest at the default settings and click on the BLAST button. After some time you will get the following results:
You will see that there is a 100% similarity match found with Ruta graveolens (as mentioned in the FASTA header) and around 80% similarity match found in Citrus sinensis.
If you want to know to which domain these species belong, you can click on the link to the accession records. For Ruta graveolens that is A9X7L0.1. There you see that the common name of this plant is common rue which has the following taxonomy:
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
Pentapetalae; rosids; malvids; Sapindales; Rutaceae; Rutoideae;
Ruta.
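
The first two checks described above (reading the OS field and telling DNA from protein) can be sketched in Python. This is a minimal sketch that assumes a UniProt-style header containing OS= and OX= fields, as in the example; the function names are invented here, and headers from other databases will not match the pattern. Mapping the organism to a domain still requires a taxonomy lookup, as shown above.

```python
import re

def parse_uniprot_header(header):
    """Extract organism name (OS) and taxonomy ID (OX) from a UniProt-style
    FASTA header. Returns None if the fields are not present."""
    match = re.search(r"OS=(.+?)\s+OX=(\d+)", header)
    if match is None:
        return None
    return {"organism": match.group(1), "taxonomy_id": int(match.group(2))}

def looks_like_dna(sequence):
    """True if the sequence uses only the letters A, T, C and G."""
    return bool(sequence) and set(sequence.upper()) <= set("ACGT")

header = (">sp|A9X7L0|ANMT_RUTGR Anthranilate N-methyltransferase "
          "OS=Ruta graveolens OX=37565 PE=1 SV=1")
first_line = "MGSLSESHTQYKHGVEVEEDEEESYSRAMQLSMAIVLPMATQSAIQLGVFEIIAKAPGGR"

print(parse_uniprot_header(header))
print("DNA" if looks_like_dna(first_line) else "protein")
```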

Tools for generating strong passphrases?

Passphrases seem like a good alternative for traditional
guidelines for strong passwords. See http://xkcd.com/936/ for an entertaining take on passwords vs. passphrases.
There are many tools for generating more traditional passwords (e.g. pwgen).
Such tools are useful when, for example, providing users with good initial passwords.
What tools are available for generating good passphrases?
Do you have experience using them, or insight into their security or other features?
I've recently released a couple of Perl scripts, gen-password and gen-passphrase, on GitHub here.
The gen-passphrase script could suit your needs. It takes three arguments: a word used as a sequence of initials, a minimum length, and a maximum length. For example:
$ gen-passphrase abcde 6 8
acrimony borrowed chasten drifts educable
or you can ask for a number of words without specifying their initials (a new feature I just added):
$ gen-passphrase 5 6 8
poplin outbreak aconites academic azimuths
It requires a word list; it uses /usr/share/dict/words by default if it exists. It uses /dev/urandom by default, but can be told to use /dev/random. See this answer of mine on superuser.com for more information about /dev/urandom vs. /dev/random.
NOTE: So far, nobody other than me has tested these scripts. I've made my best effort to have them generate strong passwords/passphrases, but I guarantee nothing.
I wrote a command-line Perl script, passphrase-generator.
The passphrase-generator defaults to only 3 words instead of the 4 suggested by XKCD, but uses a larger dictionary found on many Linux-based systems at /usr/share/dict/words. It also provides estimates for the entropy of the generated passphrases. The randomization is based on /dev/urandom and SHA1.
Example run:
$ passphrase-generator
Random passphrase generator
Entropy per passphrase is 43.2 bits (per word: 14.4 bits.)
For reference, entropy of completely random 8 character (very hard to memorize)
password of upper and lowercase letters plus numbers is 47.6 bits
Entropy of a typical human generated "strong" 8 character password is in the
ballpark of 20 - 30 bits.
Below is a list of 16 passphrases.
Assuming you select one of these based on some non random preference
your new passphrase will have entropy of 39.2 bits.
Note that first letter is always capitalized and spaces are
replaced with '1' to meet password requirements of many systems.
Goatees1maneuver1pods
Aught1fuel1hungers
Flavor1knock1foreman
Holding1holster1smarts
Vitamin1mislead1abhors
Proverbs1lactose1brat
... and so on 10 more
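
The entropy figures in the run above follow from simple arithmetic: picking each word uniformly at random from a list of N words gives log2(N) bits per word. The Python sketch below illustrates that calculation and a basic generator. It is not the Perl script itself; the function names and the eight-word demo list are made up for the example, and a real run would load a large list such as /usr/share/dict/words. Python's secrets module draws from the operating system's random source.

```python
import math
import secrets

def generate_passphrase(words, count=3, separator=" "):
    """Pick `count` words uniformly at random using the OS random source."""
    return separator.join(secrets.choice(words) for _ in range(count))

def entropy_bits(pool_size, picks):
    """Entropy of `picks` independent uniform choices from `pool_size` items."""
    return picks * math.log2(pool_size)

# Tiny hypothetical word list, far too small for real use.
demo_words = ["acrimony", "borrowed", "chasten", "drifts",
              "educable", "poplin", "outbreak", "azimuths"]

phrase = generate_passphrase(demo_words, count=3)
print(phrase)
print(entropy_bits(len(demo_words), 3))  # 8 words -> 3 bits each
print(round(entropy_bits(62, 8), 1))     # 8 chars from a-z, A-Z, 0-9
```

The last line reproduces the 47.6-bit reference figure quoted in the run above for a random 8-character password of upper- and lowercase letters plus digits. Letting a person choose among 16 generated passphrases costs up to log2(16) = 4 bits, which matches the drop from 43.2 to 39.2 bits.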
There are also some browser/javascript based tools:
http://preshing.com/20110811/xkcd-password-generator
http://passphra.se/
http://lightsecond.com/passphrase.html
CPAN hosts a Perl module for generating XKCD style passphrases:
http://metacpan.org/pod/Crypt::XkcdPassword