How to generate combinations with Pig

I have a map like this
{Tim, [Badminton, Basketball]}
{Viola, [Badminton, Baseball]}
{David, [Basketball]}
....
I use Pig to find which games they can play together.
For example, Tim and Viola can play Badminton together.
Tim, Viola, and David cannot all play together.
I also need to find which combinations of people can play more than N types of ball games together.
How can I do that?

It's straightforward if you change the way you represent the data.
At the moment, you have:
{Tim, [Badminton, Basketball]}
{Viola, [Badminton, Baseball]}
Now, consider flattening the games map so that you have a two-column dataset:
{Tim, Badminton}
{Tim, Basketball}
{Viola, Badminton}
{Viola, Baseball}
Group on the second column and you immediately have the people who can play together.
You could also use DataFu's bag join, BagLeftOuterJoin. However, for your example, it may not be worth it.
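
To make the flatten-then-group idea concrete, here is a minimal sketch in plain Python rather than Pig. The function names `group_by_game` and `groups_sharing_more_than` are made up for illustration, and the second one is only one possible reading of the "more than N games" part of the question.

```python
from collections import defaultdict
from itertools import combinations

# Sample data from the question: person -> games they can play
players = {
    "Tim": ["Badminton", "Basketball"],
    "Viola": ["Badminton", "Baseball"],
    "David": ["Basketball"],
}

def group_by_game(players):
    """Flatten to (person, game) pairs, then group the people by game."""
    by_game = defaultdict(set)
    for person, games in players.items():
        for game in games:
            by_game[game].add(person)
    return dict(by_game)

def groups_sharing_more_than(players, n, size=2):
    """Return {group: shared games} for groups of `size` people sharing more than n games."""
    result = {}
    for group in combinations(sorted(players), size):
        shared = set.intersection(*(set(players[p]) for p in group))
        if len(shared) > n:
            result[group] = shared
    return result

by_game = group_by_game(players)
pairs = groups_sharing_more_than(players, 0)
```

Here `by_game["Badminton"]` holds Tim and Viola, and asking for groups of three (`size=3`) returns nothing, which matches the statement that Tim, Viola and David cannot all play together.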


Insert and Select Data in Redis

I need to enter the below data in Redis:
Atlanta_96_Bronze Ana_Moser Ida_Alvares Ana_Paula Hilma_Caldeira Leila_Barros Virna_Dias Marcia_Fu Ericleia_Bodziak Ana_Flavia_Sanglard Fernanda_Venturini Fofao_Helia_Souza Sandra_Suruagy
Sidney_00_Bronze Elisangela_Oliveira Erika_Coimbra Fofao_Helia_Souza Janina_Conceicao Karin_Rodrigues Katia_Lopes Kely_Fraga Leila_Barros Raquel_Silva Ricarda_Lima Virna_Dias Walewska_Oliveira
Pequim_08_Gold Marianne_Steinbrecher Fofao_Helia_Souza Paula_Pequeno Walewska_Oliveira Thaisa_Menezes Valeska_Menezes Welissa_Gonzaga Fabiana_Oliveira Fabiana_Claudino Sheilla_Castro Jaqueline_Carvalho Carolina_Albuquerque
Londres_12_Gold Fabiana_Claudino Dani_Lins Paula_Pequeno Adenizia_Silva Thaisa_Menezes Jaqueline_Carvalho Fernanda_Ferreira Tandara_Caixeta Natalia_Pereira Sheilla_Castro Fabiana_Oliveira Fernanda_Garay
And then perform the following queries:
Which players won gold and silver medals?
Which players won two gold medals?
Which players only won medal in '96?
Which players were in the 96, 00 and 08 Olympics?
Which players were only in the 12 olympics?
But I have never touched Redis; I come from the relational world and need help.
Redis doesn't really seem like the right tool for the job as you've described it, as Redis is more of a cache than a database. There's no concept of running a "query" in Redis.
If you must use Redis, I would recommend storing the data in multiple sets to facilitate getting to the answers you need.
You might make a set for each type of medal, for example (a set of gold medalists, a set of silver medalists, and a set of bronze medalists). Then you could ask for the union of the gold and silver sets (Redis's SUNION command) to get the players who won either medal, or for their intersection (SINTER) to get the players who won both, which answers your first question.
You might also make a set for each year, so that you could retrieve information by year (for your last three questions).
In some cases, there may be no way around doing some coding to refine the results to give you exactly the answers you need.
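
As a rough sketch of the set-based approach, the snippet below uses plain Python sets as stand-ins for Redis sets, so it runs without a server. In Redis you would presumably load each roster with SADD and use SINTER, SUNION and SDIFF where the code uses `&`, `|` and `-`. Note that the sample data contains no silver medals, so the sketch uses bronze where the question says silver; that substitution is my assumption.

```python
# Rosters copied from the question, keyed by "<city>_<year>_<medal>"
rosters = {
    "Atlanta_96_Bronze": "Ana_Moser Ida_Alvares Ana_Paula Hilma_Caldeira Leila_Barros Virna_Dias "
                         "Marcia_Fu Ericleia_Bodziak Ana_Flavia_Sanglard Fernanda_Venturini "
                         "Fofao_Helia_Souza Sandra_Suruagy",
    "Sidney_00_Bronze": "Elisangela_Oliveira Erika_Coimbra Fofao_Helia_Souza Janina_Conceicao "
                        "Karin_Rodrigues Katia_Lopes Kely_Fraga Leila_Barros Raquel_Silva "
                        "Ricarda_Lima Virna_Dias Walewska_Oliveira",
    "Pequim_08_Gold": "Marianne_Steinbrecher Fofao_Helia_Souza Paula_Pequeno Walewska_Oliveira "
                      "Thaisa_Menezes Valeska_Menezes Welissa_Gonzaga Fabiana_Oliveira "
                      "Fabiana_Claudino Sheilla_Castro Jaqueline_Carvalho Carolina_Albuquerque",
    "Londres_12_Gold": "Fabiana_Claudino Dani_Lins Paula_Pequeno Adenizia_Silva Thaisa_Menezes "
                       "Jaqueline_Carvalho Fernanda_Ferreira Tandara_Caixeta Natalia_Pereira "
                       "Sheilla_Castro Fabiana_Oliveira Fernanda_Garay",
}

# One set per Olympics (in Redis: one SADD per key)
teams = {key: set(names.split()) for key, names in rosters.items()}
y96 = teams["Atlanta_96_Bronze"]
y00 = teams["Sidney_00_Bronze"]
y08 = teams["Pequim_08_Gold"]
y12 = teams["Londres_12_Gold"]

# One set per medal type, built as unions (SUNION)
gold = y08 | y12
bronze = y96 | y00

gold_and_bronze = gold & bronze        # intersection (SINTER)
two_golds = y08 & y12                  # won gold in both 08 and 12
only_96 = y96 - (y00 | y08 | y12)      # difference (SDIFF)
in_96_00_08 = y96 & y00 & y08
only_12 = y12 - (y96 | y00 | y08)
```

With this data, `in_96_00_08` contains only Fofao_Helia_Souza, and `two_golds` contains the six players who appear in both the 08 and 12 rosters.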

Hadoop Pig - Replace strings in a relation with their corresponding values in a map

I have a relation called conversations_grouped made up of bags of tuples of varying sizes, like so:
DUMP conversations_grouped:
...
({(L194),(L195),(L196),(L197)})
({(L198),(L199)})
({(L200),(L201),(L202),(L203)})
({(L204),(L205),(L206)})
({(L207),(L208)})
({(L271),(L272),(L273),(L274),(L275)})
({(L276),(L277)})
({(L280),(L281)})
({(L363),(L364)})
({(L365),(L366)})
({(L666256),(L666257)})
({(L666369),(L666370),(L666371),(L666372)})
({(L666520),(L666521),(L666522)})
Each L[0-9]+ is a tag corresponding to a string. For example, L194 might be "Hello, how are you doing?" and L195 might be "fine, how are you?". This correspondence is maintained by a map called line_map. Here's a sample:
DUMP line_map;
...
([L666324#Do you think she might be interested in someone?])
([L666264#Well that's typical of Her Majesty's army. Appoint an engineer to do a soldier's work.])
([L666263#Um. There are rumours that my Lord Chelmsford intends to make Durnford Second in Command.])
([L666262#Lighting COGHILL' 5 cigar: Our good Colonel Dumford scored quite a coup with the Sikali Horse.])
([L666522#So far only their scouts. But we have had reports of a small Impi farther north, over there. ])
([L666521#And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?])
([L666520#Well I assure you, Sir, I have no desire to create difficulties. 45])
([L666372#I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.])
([L666371#Lord Chelmsford seems to want me to stay back with my Basutos.])
([L666370#I'm to take the Sikali with the main column to the river])
([L666369#Your orders, Mr Vereker?])
([L666257#Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot])
([L666256#Colonel Durnford... William Vereker. I hear you 've been seeking Officers?])
What I'm trying to do now is parse through each line and replace the L[0-9]+ tags with their corresponding text from line_map. Is it possible to make references to line_map from within a Pig FOREACH statement, or is there something else I have to do?
The first issue with this is that in a map the key must be a quoted string, so you can't use a schema value to access the map. For example, this will not work:
C: {foo: chararray, M: [value:chararray]}
D = FOREACH C GENERATE M#foo ;
The solution that comes to mind is to FLATTEN conversations_grouped. Then do a join between conversations_grouped and line_map on the L[0-9]+ tag. You'll probably want to project out some of the extra fields (like the L[0-9]+ tag after the join) to make the next step faster. After that you'll have to regroup the data, and massage it into the correct format.
This won't work unless each bag has its own unique ID for the regrouping, but if each of the L[0-9]+ tags appears in only one bag (conversation), you can use one of them to create a unique id.
-- A is dumped conversations_grouped
B = FOREACH A {
    -- Pulls out an element from the bag to use as the id
    id = LIMIT tags 1 ;
    -- Flattens B into id, tag form. Each group of tags will have the same id.
    GENERATE FLATTEN(id), FLATTEN(tags) ;
}
The schema and output for B is:
B: {id: chararray,tags::tag: chararray}
(L194,L194)
(L194,L195)
(L194,L196)
(L194,L197)
(L198,L198)
(L198,L199)
(L200,L200)
(L200,L201)
(L200,L202)
(L200,L203)
(L204,L204)
(L204,L205)
(L204,L206)
(L207,L207)
(L207,L208)
(L271,L271)
(L271,L272)
(L271,L273)
(L271,L274)
(L271,L275)
(L276,L276)
(L276,L277)
(L280,L280)
(L280,L281)
(L363,L363)
(L363,L364)
(L365,L365)
(L365,L366)
(L666256,L666256)
(L666256,L666257)
(L666369,L666369)
(L666369,L666370)
(L666369,L666371)
(L666369,L666372)
(L666520,L666520)
(L666520,L666521)
(L666520,L666522)
Assuming that the tags are unique, the rest is done like:
-- A2 is line_map, loaded in tag/message pairs instead of a map
-- Joins conversations_grouped and line_map on tag
C = FOREACH (JOIN B by tags::tag, A2 by tag)
    -- This generate removes the tag
    GENERATE id, message ;
-- Regroups C on the id created in B
D = FOREACH (GROUP C BY id)
    -- This step limits the output to just messages
    GENERATE C.(message) AS messages ;
Schema and output from D:
D: {messages: {(A2::message: chararray)}}
({(Colonel Durnford... William Vereker. I hear you 've been seeking Officers?),(Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot)})
({(Your orders, Mr Vereker?),(I'm to take the Sikali with the main column to the river),(Lord Chelmsford seems to want me to stay back with my Basutos.),(I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.)})
({(Well I assure you, Sir, I have no desire to create difficulties. 45),(And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?),(So far only their scouts. But we have had reports of a small Impi farther north, over there. )})
NOTE: In the worst case (the L[0-9]+ tags aren't unique), you can give each line of the input file(s) a sequential integer id before you load it into Pig.
UPDATE: If you are using Pig 0.11 or later, you can also use the RANK operator to generate the ids.
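
For readers who find the flatten, join, regroup sequence easier to follow outside Pig, here is the same idea as a small plain-Python sketch. It uses two conversations and five lines from the sample dumps; the variable names are mine, and it takes the first tag of each conversation as the id, whereas `LIMIT tags 1` in Pig may pick any element.

```python
from collections import defaultdict

# Two conversations and their lines, taken from the sample dumps
conversations = [
    ["L666256", "L666257"],
    ["L666520", "L666521", "L666522"],
]
line_map = {
    "L666256": "Colonel Durnford... William Vereker. I hear you 've been seeking Officers?",
    "L666257": "Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot",
    "L666520": "Well I assure you, Sir, I have no desire to create difficulties. 45",
    "L666521": "And I assure you, you do not In fact I'd be obliged for your best advice. "
               "What have your scouts seen?",
    "L666522": "So far only their scouts. But we have had reports of a small Impi "
               "farther north, over there. ",
}

# Step 1 (relation B): flatten to (id, tag) rows, using the first tag as the id
rows = [(tags[0], tag) for tags in conversations for tag in tags]

# Step 2 (relation C): inner join on the tag, keeping only (id, message)
joined = [(conv_id, line_map[tag]) for conv_id, tag in rows if tag in line_map]

# Step 3 (relation D): regroup the messages by id
grouped = defaultdict(list)
for conv_id, message in joined:
    grouped[conv_id].append(message)
messages = list(grouped.values())
```

Each entry of `messages` is one conversation with the tags replaced by their text. Unlike a Pig bag, the Python list keeps the original line order.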

Can Redis do prefix matching?

Let's say I have a set of cities in the world like so:
EUKLOND
EUKMANC
EUKEDIN
EITROME
EITMILA
EITNAPE
EFRPARI
EFRAVIG
EFRBRES
Where the first letter is continent, next two are country and the trailing 4 are an abbreviated city name.
I would like to be able to search this set by passing in "E", which would return all the entries, or "EIT" to retrieve all the entries for Italy, or "EFRPARI" to get just the Paris entry.
Is this something I can do with Redis?
This is essentially an auto-complete scenario.
Salvatore Sanfilippo (@antirez), Redis's author, wrote a thorough blog post about how to accomplish this.
UPDATE: I just saw another great blog post, that first takes Salvatore's solution and explains it in a clear way, and second offers another solution that is good also for multiple-word phrases.
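
The linked posts are not reproduced here, so the following is only a sketch of the general idea such solutions tend to rely on: keep the keys in lexicographic order and read off the contiguous range that starts with the prefix. It uses Python's `bisect` on a sorted list instead of Redis. In Redis the closest analogue I know of is a sorted set whose members all have the same score, queried with ZRANGEBYLEX (available since Redis 2.8.9); whether that fits your version is something to check.

```python
import bisect

# City codes from the question, kept in lexicographic order
cities = sorted([
    "EUKLOND", "EUKMANC", "EUKEDIN",
    "EITROME", "EITMILA", "EITNAPE",
    "EFRPARI", "EFRAVIG", "EFRBRES",
])

def with_prefix(sorted_keys, prefix):
    """Return every key that starts with `prefix`, using two binary searches."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after any character used in the keys, so this is an upper bound
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]

italy = with_prefix(cities, "EIT")
paris = with_prefix(cities, "EFRPARI")
europe = with_prefix(cities, "E")
```

Because matching keys are adjacent in sorted order, each lookup costs two binary searches plus the size of the result, regardless of how many other keys exist.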

MongoDB: How to query for a container document that contains only embedded documents with a certain attribute value?

Ok, let me try to explain what I am trying to achieve...
Let's say that I have a collection HOUSE that embeds ROOMS. Each house has many rooms.
Let's say that each room has a color attribute (blue, red, green, etc.)
Now if I want to retrieve all the houses that have a room of the color blue, I can go ahead and simply do for instance
House.where(:'rooms.color' => :blue)
However what I really want is to query all the houses that ONLY have blue rooms. And that I have no idea how to do... I could create a new attribute at the HOUSE level to "mark" if the rooms are all of the same given colors... but I would rather avoid that if I could since my current data set would need to be upgraded to reflect that.
Thanks,
Alex
Have you tried?
House.only(:'rooms.color' => :blue)
Thinking about it with a step back... I was actually going at this the wrong way. Sometimes you have to negate :)
Basically, a house that only has blue rooms is a house that has no rooms of other colors...
So, imagining that I have a finite set of possible colors like :red, :green, :blue, in order to find the houses that have only blue rooms, I only need to find the houses that have no :red or :green rooms :)
House.where(:'rooms.color'.nin => [:red, :green])
should do the trick :)
Alex
@Alex, you are the better judge of your dataset, but in theory the following should also do the trick.
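house_ids = House.where("rooms.color" => :blue).only(:_id).map(&:_id)
# A plain $ne on "rooms.color" matches houses where NO room is blue, so use
# $elemMatch to find houses with at least one non-blue room instead.
unwanted_house_ids = House.where(:rooms => { "$elemMatch" => { :color => { "$ne" => :blue } } }).only(:_id).map(&:_id)
houses_with_only_blue_rooms = House.all.for_ids(house_ids - unwanted_house_ids)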
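
To check the reasoning behind these answers without a database, here is a small plain-Python sketch. The `houses` list is hypothetical test data, not anything from the original collection. It compares "has a blue room, minus has a non-blue room" with the pure negation "has no red or green room".

```python
# Hypothetical in-memory stand-ins for House documents with embedded rooms
houses = [
    {"_id": 1, "rooms": [{"color": "blue"}, {"color": "blue"}]},
    {"_id": 2, "rooms": [{"color": "blue"}, {"color": "red"}]},
    {"_id": 3, "rooms": [{"color": "green"}]},
    {"_id": 4, "rooms": []},
]

def ids_where(houses, predicate):
    """Return the set of _id values of the houses matching the predicate."""
    return {h["_id"] for h in houses if predicate(h)}

# Houses with at least one blue room
has_blue = ids_where(houses, lambda h: any(r["color"] == "blue" for r in h["rooms"]))
# Houses with at least one room that is not blue
has_other = ids_where(houses, lambda h: any(r["color"] != "blue" for r in h["rooms"]))
only_blue = has_blue - has_other

# Pure negation: no room is red or green (assumes the set of colors is finite)
no_red_or_green = ids_where(
    houses, lambda h: not any(r["color"] in ("red", "green") for r in h["rooms"])
)
```

The two approaches agree on house 1, but the pure negation also returns house 4, which has no rooms at all. As far as I know MongoDB's `$nin` behaves the same way for documents where the field is missing, so it is worth deciding whether empty houses should count.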

Creating, Visualizing and Querying simple Data Structures

Simple and common tree like data structures
Data Structure example
Animated Cartoons have 4 extremities (arm, leg, limb...)
Humans have 4 ext.
Insects have 6 ext.
Arachnids have 6 ext.
Animated Cartoons have 4 fingers per extremity
Humans have 5 per ext.
Insects have 1 per ext.
Arachnids have 1 per ext.
Some Kind of Implementation
Level/Table0
Quantity, Item
Level/Table1
ItemName, Kingdom
Level/Table2
Kingdom, NumberOfExtremities
Level/Table3
ExtremityName, NumberOfFingers
Example Dataset
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
Querying.. "Number of fingers"
Number = 1*4*4 + 1*4*4 + 1*4*5 + 3*6*1 + 2*6*1 = 82 fingers (Let Jon be a Human)
I wonder if there is any tool for defining this in a parseable way, so that the inherited data is created automatically and this kind of tree can be drawn (with the bonus of allowing this kind of data access, if possible).
It could be drawn manually with, for example, FreeMind, but AFAIK it doesn't let you define data types or structures to automatically create inherited branches of items, so it's really annoying to have to repeat a structure over and over by copying (with the risk of mistakes). Repeated work over repeated data (a human running repeated code) is a buggy feature.
So I would like to write the data in a proper language that lets me reuse it for queries and visualization. If all the data is in XML, or Java classes, or a database file, etc., is there some tool for viewing the tree and making the query?
PS: Creating nested folders in a filesystem and using Norton Commander in tree view is not an option, I hope (just because it has to be built manually).
Your answer is mostly going to depend on what programming skills you already have and what skills you are willing to acquire. I can tell you what I would do with what I know.
I think for drawing trees you want a LaTeX package like qtree. If you don't like this one, there are a bunch of others out there. You'd have to write a script in whatever your favorite scripting language is to parse your input into the LaTeX code to generate the trees, but this could easily be done with less than 100 lines in most languages, if I properly understand your intentions. I would definitely recommend storing your data in an XML format using a library like Ruby's REXML, or whatever your favorite scripting language has.
If you are looking to generate more interactive trees, check out the Adobe Flex Framework. Again, if you don't like this specific framework, there are bunches of others out there (I recommend the blog FlowingData).
Hope this helps and I didn't miserably misunderstand your question.
The data structure that you are describing looks like it can fit in XML format. Take a look at the eXist XML database; if I may say so, it is the most complete XML database. It comes with many tools to get you started fast, like the XQuery Sandbox option in the admin HTTP interface.
Example Dataset
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
I am assuming that there are 2 instances of jon skeet, 3 instances of Atomic ant and 2 instances of Shelob
Here is an XQuery example:
let $doc :=
<root>
    <definition>
        <AnimatedCartoons>
            <extremities>4</extremities>
            <fingers_per_ext>4</fingers_per_ext>
        </AnimatedCartoons>
        <Human>
            <extremities>4</extremities>
            <fingers_per_ext>5</fingers_per_ext>
        </Human>
        <Insects>
            <extremities>6</extremities>
            <fingers_per_ext>1</fingers_per_ext>
        </Insects>
        <Arachnids>
            <extremities>6</extremities>
            <fingers_per_ext>1</fingers_per_ext>
        </Arachnids>
    </definition>
    <subject><name>Homer Simpson</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>Ralph Wiggum</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
</root>
let $definitions := $doc/definition/*
let $subjects := $doc/subject
(: here goes some query logic :)
let $fingers := fn:sum(
    for $subject in $subjects
    return (
        for $x in $definitions
        where fn:name($x) = $subject/kind
        return $x/extremities * $x/fingers_per_ext
    )
)
return $fingers
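
For comparison, the same lookup-and-sum can be sketched in a few lines of Python. The `kinds` and `subjects` structures are my own encoding of the tables and dataset from the question, with the quantity stored per subject instead of repeating the element.

```python
# (extremities, fingers per extremity) for each kind, from the question
kinds = {
    "AnimatedCartoons": (4, 4),
    "Human": (4, 5),
    "Insects": (6, 1),
    "Arachnids": (6, 1),
}

# (quantity, name, kind), following the example dataset
subjects = [
    (1, "Homer Simpson", "AnimatedCartoons"),
    (1, "Ralph Wiggum", "AnimatedCartoons"),
    (2, "jon skeet", "Human"),
    (3, "Atomic ant", "Insects"),
    (2, "Shelob", "Arachnids"),
]

def total_fingers(subjects, kinds):
    """Sum quantity * extremities * fingers-per-extremity over all subjects."""
    total = 0
    for quantity, _name, kind in subjects:
        extremities, fingers = kinds[kind]
        total += quantity * extremities * fingers
    return total

total = total_fingers(subjects, kinds)
```

With two instances of jon skeet, as assumed in this answer, the total comes to 102. The hand calculation in the question counts him once (1*4*5), which is where its figure of 82 comes from.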
An XML Schema editor with visualization is perhaps what I am searching for:
http://en.wikipedia.org/wiki/XML_Schema_Editor
checking it..