How can I search for a specific word using either
1) Map Reduce WC program - Java code or
2) Word count with Hive
E.g., below is my file:
Hello my name is Jammy
Jammy is the best
Jammy likes football
I want to count how many times the word "Jammy" appears. Please suggest an approach.
Here is a word count example with an integration test:
http://bytepadding.com/big-data/map-reduce/word-count-map-reduce
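If you only want a specific word in the output, you can add a filter in the Map phase. Read each token once and reuse it, because every call to nextToken() consumes a token:
String token = tokenizer.nextToken();
if (token.equals("Jammy")) {
    word.set(token);
    // emit (word, 1) here, as the word count mapper does for every token
}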
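The filtering idea can also be sketched outside Hadoop. The following is a minimal Python illustration of the same map/filter/reduce logic, not the Hadoop API; the function names map_line, reduce_counts and count_word are hypothetical, and the sample lines come from the question.

```python
from collections import Counter

TARGET = "Jammy"  # the word to search for (taken from the question)

def map_line(line, target=TARGET):
    """Map step: emit (word, 1) only for tokens equal to the target word."""
    for token in line.split():
        if token == target:
            yield (token, 1)

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def count_word(lines, target=TARGET):
    """Run the map step over all lines, then reduce and return the total."""
    pairs = (pair for line in lines for pair in map_line(line, target))
    return reduce_counts(pairs).get(target, 0)

lines = [
    "Hello my name is Jammy",
    "Jammy is the best",
    "Jammy likes football",
]
print(count_word(lines))  # prints 3
```

The comparison is case-sensitive and splits on whitespace only, so "jammy" or "Jammy," would not be counted unless the tokens are normalized first.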
I would like to know if it's possible to create a PDF file, send it to a print shop, and have each printed copy carry a unique identification number.
For instance, the document would have fixed content like "Banana". But next to it, there would be a field that gives a different result every time the document gets printed. Maybe it would be possible to force the printer to print the time of printing in milliseconds, or any other unique identifier? The result I would like to get would be:
Copy #1 : Banana - 42822435
Copy #2 : Banana - 42922998
Copy #3 : Banana - 43059609
etc.
Is it possible to get a result like this with a PDF document? I feel like I could get something like that with some code, but I have to go through a print shop. So I have to be able to just upload a file and ask them to print multiple copies of it for me.
I hope my question is clear, thank you in advance for any help.
Such functionality depends entirely on your print shop and the equipment and software they use. It has nothing to do with PDF itself, unless you generate a single PDF that contains 1000 copies of the document with a number injected into each, or you generate 1000 PDFs, each with its own number.
I have a sequence in FASTA format, for example:
>sp|A9X7L0|ANMT_RUTGR Anthranilate N-methyltransferase OS=Ruta graveolens OX=37565 PE=1 SV=1
MGSLSESHTQYKHGVEVEEDEEESYSRAMQLSMAIVLPMATQSAIQLGVFEIIAKAPGGR
LSASEIATILQAQNPKAPVMLDRMLRLLVSHRVLDCSVSGPAGERLYGLTSVSKYFVPDQ
DGASLGNFMALPLDKVFMESWMGVKGAVMEGGIPFNRVHGMHIFEYASSNSKFSDTYHRA
MFNHSTIALKRILEHYKGFENVTKLVDVGGGLGVTLSMIASKYPHIQAINFDLPHVVQDA
ASYPGVEHVGGNMFESVPEGDAILMKWILHCWDDEQCLRILKNCYKATPENGKVIVMNSV
VPETPEVSSSARETSLLDVLLMTRDGGGRERTQKEFTELAIGAGFKGINFACCVCNLHIM
EFFK
How do I determine whether a sequence like this comes from:
Bacteria
Viruses
Archaea
Eukaryota
The answer can be found by looking at the OS (organism) part of the header of your FASTA file. But suppose you don't have this information; then you would perform a BLAST search. If the letters in your sequence consisted only of A, T, C and G, it would be a DNA sequence. Since they do not, you are dealing with a protein sequence, so we need to use protein BLAST.
Paste the FASTA record into the online protein BLAST tool.
Leave the rest at the default settings and click on the BLAST button. After some time you will get the results.
You will see a 100% match with Ruta graveolens (as mentioned in the FASTA header) and a match of around 80% with Citrus sinensis.
If you want to know to which domain these species belong, you can click on the link to the accession records. For Ruta graveolens that is A9X7L0.1. There you see that the common name of this plant is common rue, which has the following taxonomy:
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
Pentapetalae; rosids; malvids; Sapindales; Rutaceae; Rutoideae;
Ruta.
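The two header and alphabet checks from the start of this answer can be sketched in Python. This is a minimal sketch under stated assumptions: the header follows the UniProt-style layout shown in the question (OS= followed by other two-letter fields such as OX=), the record is shortened to its first sequence line, and the function names are hypothetical. It does not look up the taxonomy; that still requires BLAST or the accession record.

```python
import re

DNA_LETTERS = set("ACGTN")  # N (unknown base) is allowed here as an assumption

def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:]).upper()

def organism_from_header(header):
    """Extract the OS= field of a UniProt-style header, or None if absent."""
    match = re.search(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|$)", header)
    return match.group(1) if match else None

def looks_like_dna(sequence):
    """True if the sequence uses only nucleotide letters."""
    return bool(sequence) and set(sequence) <= DNA_LETTERS

# Header and first sequence line from the question (record shortened).
record = """>sp|A9X7L0|ANMT_RUTGR Anthranilate N-methyltransferase OS=Ruta graveolens OX=37565 PE=1 SV=1
MGSLSESHTQYKHGVEVEEDEEESYSRAMQLSMAIVLPMATQSAIQLGVFEIIAKAPGGR
"""
header, sequence = parse_fasta(record)
print(organism_from_header(header))  # prints Ruta graveolens
print(looks_like_dna(sequence))      # prints False
```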
I've been learning Python for a week now and am currently at Exercise 26 (learnpythonthehardway). So I know nothing. I tried searching but couldn't find what I need.
What I need:
I want to write a script that breaks my Journal.txt file into several text files so that I can import them into Evernote. Evernote pulls the title for a note from the first line of the .txt file.
These are random examples of the date format I used in my Journal.txt:
1/13/2013, 09/02/2012, so I'm afraid the date format is not consistent. I know about:
if 'blabla' in open('example.txt').read():
but I don't know how to use it with a date. Please help me extract the journal entries for each date from the large file into new files. This is literally all I've got so far:
Journal = open("D:/Australien/Journal.txt", 'r').read()
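Consider the approach recommended here, replacing YOUR_TEXT_HERE with a search pattern for a date, e.g. [0-9]+\/[0-9]+\/[0-9]+. The parentheses around the output file name keep the concatenation unambiguous across awk implementations.
awk "/[0-9]+\/[0-9]+\/[0-9]+/{n++}{print > (n \".txt\")}" a.txt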
If you don't have awk installed on your PC, you can fetch it here.
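Since the question asks about Python, the same split can be sketched there as well. This is a minimal sketch, assuming each entry begins with a line that starts with a date such as 1/13/2013 or 09/02/2012; split_journal and write_entries are hypothetical names, and the sample text is invented for illustration.

```python
import re
from pathlib import Path

# A line that starts with a date such as 1/13/2013 or 09/02/2012.
DATE_LINE = re.compile(r"^\s*\d{1,2}/\d{1,2}/\d{4}\b")

def split_journal(text):
    """Split journal text into entries, each beginning at a date line."""
    entries, current = [], []
    for line in text.splitlines():
        # A new date line closes the entry collected so far.
        if DATE_LINE.match(line) and current:
            entries.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries

def write_entries(entries, out_dir):
    """Write each entry to 1.txt, 2.txt, ... inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, entry in enumerate(entries, start=1):
        (out / f"{i}.txt").write_text(entry + "\n", encoding="utf-8")

sample = (
    "1/13/2013\n"
    "Went to the beach.\n"
    "09/02/2012\n"
    "Arrived in Sydney.\n"
    "Long flight."
)
entries = split_journal(sample)
print(len(entries))  # prints 2
```

Because every entry starts with its date line, the date becomes the first line of each output file, which is what Evernote uses as the note title. Any text before the first date ends up in an entry of its own.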
I need to parse a CSV file with blocks of text being processed in different ways according to certain rules, e.g.
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
customerone,columnone
customertwo,columntwo
singlevalueone
singlevaluetwo
singlevalueone_otherruleapplies
singlevaluethree_otherruleapplies
Each block of text forms a group, so the first three rows will be parsed using certain rules, and so on. Notice that the last two groups have only a single column, but each group must be handled in a different way.
I have the chance to propose the file format to the customer, so I'm thinking of proposing the following.
[group 1]
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
[group N]
rowN
That is, sections like those in the INI files from some years ago. However, I'd like to hear your comments because I think there must be a better way to handle this.
I proposed to use XML but the customer prefers the text files.
Any suggestions are welcome.
m0dest0.
P.S. I'm using VB.NET and VS 2008.
You can use regular expression groups. If each line has the same format, match line by line; the RegexOptions.Multiline option makes ^ and $ match at the start and end of every line. If a record is not constrained to a single line, include \n in your pattern so that it matches across lines. Note that \n is a line feed; Windows text files end lines with \r\n, a carriage return followed by a line feed. If your pattern stays on a single line, you don't need \n in it.
VB.NET, like many other modern programming languages, has extensive support for grouping operations. You can use indexed groups or named groups.
A named group is written (?<myname>...), where myname is whatever you want to name it, such as header1.
See this link for more info: How do I access named capturing groups in a .NET Regex?.
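As a rough illustration of how a named group can drive the sectioned format proposed in the question, here is a minimal Python sketch of the parsing logic (it is not VB.NET code). Note that Python writes named groups as (?P<name>...), whereas .NET uses (?<name>...). The function name parse_sections and the sample rows assigned to each group are assumptions made for the example.

```python
import re

# Matches a section header such as "[group 1]" and captures its name.
SECTION = re.compile(r"^\[(?P<name>[^\]]+)\]$")

def parse_sections(text):
    """Group the rows of an INI-like file under their [section] headers."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        header = SECTION.match(line)
        if header:
            current = header.group("name")
            sections[current] = []
        elif current is None:
            raise ValueError(f"row before any section header: {line!r}")
        else:
            sections[current].append(line.split(","))
    return sections

sample = """[group 1]
userone,columnone,columntwo
userthirteen,columnone,columntwo
[group 2]
customerone,columnone
[group 3]
singlevalueone
"""
for name, rows in parse_sections(sample).items():
    print(name, len(rows))
```

Once the rows are grouped by section name, each group can be handed to its own rule, which is what lets two single-column groups be processed differently.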
Good luck.
What is the easiest/simplest way of finding out the number of lines in your VB project?
You might be interested to know how this can be a relevant metric: Wikipedia SLOC
On Linux/Mac, to count lines in files with a certain extension ('vbs'):
find . -name '*.vbs' -print0 | xargs -0 cat | wc -l
Third party software, like VB Pure Lines of Code
If you're in Visual Studio, look at the line numbers on the left. If you want the total for all your files, open each one, note its last line number, and add them up.
There are a ton of utilities out there that will count lines for you. They typically count total lines, comment lines, white space lines, and code lines. Just Google your particular language and you'll find plenty.
What do you want to see: logical lines of code, or the number of physical lines? Strictly speaking, lines of code do not include empty lines, comments, or generated code. There is no easy built-in way to find the lines of code in VB. Maybe you will find something here.
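The distinction between physical lines and code lines can be made concrete with a short script. This is a minimal sketch, assuming VB source files with the .vb extension and treating a line as a comment when it starts with an apostrophe or the REM keyword; it does not detect generated code, and the names classify_lines and count_project are hypothetical.

```python
from pathlib import Path

def classify_lines(source):
    """Count physical, blank, comment, and code lines in VB source text."""
    counts = {"physical": 0, "blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        counts["physical"] += 1
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith("'") or stripped.split(None, 1)[0].upper() == "REM":
            counts["comment"] += 1
        else:
            # Lines with a trailing comment still count as code.
            counts["code"] += 1
    return counts

def count_project(root, pattern="*.vb"):
    """Sum the line counts over every matching file below root."""
    totals = {"physical": 0, "blank": 0, "comment": 0, "code": 0}
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8", errors="replace")
        for key, value in classify_lines(text).items():
            totals[key] += value
    return totals

sample = "\n".join([
    "' greeting module",
    "Module Hello",
    "",
    "    Sub Main()",
    "        Console.WriteLine(\"Hi\") ' trailing comment",
    "    End Sub",
    "End Module",
])
print(classify_lines(sample))
# prints {'physical': 7, 'blank': 1, 'comment': 1, 'code': 5}
```

Calling count_project on the project folder (for example count_project(".")) gives the totals for all .vb files below it.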