Matching an element in a column, to others in the same column - pandas

I have a dataframe loaded from Excel with the following columns:
HolidayTourProvider|Packages|Meals|Accommodation|LocalTravelVehicle|Cancellationfee
HolidayTourProvider holds a handful of company names.
Packages holds the package names. The features offered in each package (Meals, Accommodation, etc.) are mostly the same across companies, even though one company may call its package "Saver" and another may call it "Budget". Most feature columns hold Yes/No values; LocalTravelVehicle holds car names such as Ford Taurus or Jeep Cherokee, and Cancellationfee holds integers.
I need to write a function like
match(HolidayTP,Package)
that the user can call like
match(AdventureLife, Luxury)
It should then return all packages from other holiday tour providers whose features are similar to those of Luxury, no matter what name they give the package ('Semi Lux', 'Comfort', etc.).
I want to keep a counter of matching features for every package and display all packages whose counter reaches 3 or 4.
This is my first Python code, and I am stuck here.
fb is the full dataframe I loaded the data into.
def mapHol(HTP, PACKAGE):
    mfb = (fb['HTP']== HTP)&(fb['package']== package)
    B = fb[mfb]
    for i in fb[i]:
        for j in B[j]:
            if fb[i]==B[j]:
                count+=1
I don't know how to proceed. Please help me; this is my first major project, and I started it on my own.
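One possible approach, as a minimal sketch under assumptions: the column names come from the question, while the sample rows, the FEATURES list and the min_matches parameter are invented for illustration. The idea is to pick the row for the given provider and package, compare each feature column of every other provider's rows against it, and keep the rows whose number of equal features reaches the threshold.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
fb = pd.DataFrame({
    "HolidayTourProvider": ["AdventureLife", "AdventureLife", "SunTrips", "SunTrips", "GoFar"],
    "Packages": ["Luxury", "Saver", "Comfort", "Budget", "Semi Lux"],
    "Meals": ["Yes", "No", "Yes", "No", "Yes"],
    "Accommodation": ["Yes", "Yes", "Yes", "No", "Yes"],
    "LocalTravelVehicle": ["Ford Taurus", "None", "Ford Taurus", "None", "Jeep Cherokee"],
    "Cancellationfee": [100, 20, 100, 10, 100],
})

# Columns that describe what a package offers (assumed list).
FEATURES = ["Meals", "Accommodation", "LocalTravelVehicle", "Cancellationfee"]


def match(df, provider, package, min_matches=3):
    """Return other providers' packages sharing at least min_matches features."""
    target = df[(df["HolidayTourProvider"] == provider) & (df["Packages"] == package)]
    if target.empty:
        raise ValueError(f"No package {package!r} found for provider {provider!r}")
    target_row = target.iloc[0]

    # Only packages offered by other providers are candidates.
    others = df[df["HolidayTourProvider"] != provider].copy()

    # For each feature, 1 if the value equals the target's value, else 0; then add up.
    others["match_count"] = sum(
        (others[col] == target_row[col]).astype(int) for col in FEATURES
    )
    return others[others["match_count"] >= min_matches]


print(match(fb, "AdventureLife", "Luxury", min_matches=3))
```

Exact equality is a crude notion of similarity for Cancellationfee; a tolerance on the fee difference may be a better fit, depending on the data.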

Get rows in pandas with a Query based approach

I have a dataset of call records and I need to extract the conference calls from it using pandas.
This would have been very easy in SQL by creating 2 tables.
So I created 2 tables:
conference1= conference
My approach:
conference.loc[( (conference['Call date'] < conference1['Call date']) & conference['Cell Id']==conference1['Cell Id'])
&(conference['Called (B) Party Number'] != conference1['Called (B) Party Number']) ]
So I tried roughly the same query approach here in Python, and it gives me no rows.
To make it clear, a conference call would be:
1. conference['Call date'] < conference1['Call date'] where user is same(i.e. cell ids will be same)
2. conference['Cell Id'] != conference1['Cell Id']
3. Also, the persons called by same person should be different, therefore,
conference['Called (B) Party Number'] != conference1['Called (B) Party Number']
The output should look like this file.
In this file:
The call date in the 2nd row is greater (this user must have been added a little later in the conference).
The called B parties are different (the users are different).
As for the end date, a participant might end the call at the same time as the others or leave at any point, so it doesn't count much for the analysis.
Can somebody help me out with this? A reference link or an idea would also work.
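A note on why the attempt above returns no rows: conference1 is the same frame as conference, so every comparison is made row by row between a row and itself, and a call date is never earlier than itself. The SQL-style self join can be sketched with merge, as below. The column names come from the question; the sample rows and the _first/_later suffixes are assumptions, and the sketch takes condition 1 at its word (the same Cell Id means the same caller).

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
conference = pd.DataFrame({
    "Cell Id": [1, 1, 2, 3],
    "Call date": pd.to_datetime([
        "2020-01-01 10:00", "2020-01-01 10:02",
        "2020-01-01 11:00", "2020-01-01 12:00",
    ]),
    "Called (B) Party Number": ["111", "222", "333", "444"],
})

# Self join: pair every call with every call made from the same Cell Id.
pairs = conference.merge(conference, on="Cell Id", suffixes=("_first", "_later"))

# Keep pairs where the second call started later and went to a different B party.
mask = (
    (pairs["Call date_first"] < pairs["Call date_later"])
    & (pairs["Called (B) Party Number_first"] != pairs["Called (B) Party Number_later"])
)
conference_calls = pairs[mask]
print(conference_calls)
```

Each comparison is wrapped in its own parentheses because & binds more tightly than < and != in Python. A real analysis would probably also require the later call to start before the first one ends, which this sketch does not check.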

(Neo4j) Carry variable over subsequent queries

I am trying to carry a variable over through 2 subsequent queries. It seems like WITH only carries the variable over to the next query, but not to any query after that. Suggestions?
This is an example of what I am trying to do:
Person nodes contain information on publishers, writers and editors (e.g. name, gender, etc.)
Story nodes contain data on Story (e.g. title, publish date, etc.)
IN relationships have categories: created, edited, published.
Goal: return editor-publishers who have edited stories published by another editor-publisher (assume no duplicate Person names):
1. Find all Persons who have edited at least one story and have also published at least one story
2. Find the list of stories published by the editor-publishers from step 1
3. Among all editors of the stories from step 2, return those who are also in the list from step 1
MATCH (EditorPublisher:Person)-[:IN{category: "published"}]->(:Story) // 1
WHERE (EditorPublisher:Person)-[:IN{category: "edited"}]->(:Story)
WITH COLLECT(EditorPublisher.name) as EditorPublisher_list
MATCH (EditorPublisher_stories:Story)<-[:IN{category: "published"}]-(publisher:Person) // 2
WHERE publisher.name in EditorPublisher_list
WITH EditorPublisher_list // throws error EditorPublisher_list variable not found
WITH COLLECT(EditorPublisher_stories.title) as EditorPublisher_stories_list
MATCH (epe:Person)-[contribution:PLAYED]-(eps:Movie) // 3
WHERE epe.name in EditorPublisher_list
AND eps IN EditorPublisher_stories_list
RETURN epe.name
Never mind, I got it to work. WITH does keep the variables if I don't rename them.
I just had to do WITH return.nodes and refer to return.nodes in the subsequent queries instead of using in [return.nodes.list]

Finding a string embedded with several strings in a single cell

I'm working on a large macro to automate an end-of-day email process that sends emails to different customers. Everything works fairly well except in the several cases where a firm has different email addresses for different employees. It would be simple if each of these employees were the only one sent to that address, but that is of course not the case.
I'm struggling to find a way to look up the names and reference the matching email when I don't know ahead of time how many names share an address: in the future a firm could have 10 employees with the same confirmation email, and some may have as few as 1 (for firms that require a separate email per employee). Would I use an array for this and test against the array? I would also need to store the employee names to ensure duplicates aren't created. See the text example and code below:
The names are stored in a sheet named emailMaster in this format: starting in, say, Cells(1, 3), a single cell holds all the names (Joe GoodGuy; James Johanson; Jimmy TheHat) and the email they correspond to is found at .Offset(0,1). To clarify, these gentlemen may work for the firm "CodersUnited", but there may be another group from the firm "CodersUnited" who require a different email address for their end-of-day receipts. They could be in Cells(1, 5) (Jimmy John; Franky TwoToes; Jimmy Hendrix), with their corresponding email found at .Offset(0,1).
Row | C                                         | D      | E                                        | F
1   | Joe GoodGuy; James Johanson; Jimmy TheHat | Emails | Jimmy John; Frank TwoToes; Jimmy Hendrix | Email
The solution below only works if there is one name corresponding to one email. It needs to handle multiple names corresponding to one email.
'Gets firm name
If firmName = emailMaster.Cells(emrow_num, 1) Then
    continue = False
    cFirm = firmName
    iFirm = emailMaster.Cells(emrow_num, 2)
    If IsEmpty(emailMaster.Cells(emrow_num, 4)) = True Then
        firmEmail = emailMaster.Cells(emrow_num, 3)
    'Tests for separate employee emails
    ElseIf emailMaster.Cells(emrow_num, 4) = "Yes" Then
        empSeparate = True
        'Captures separate emp email
        Set empTestFinder = emailMaster.Rows(emrow_num).Find(empName)
        empFinder = empTestFinder.Address
        firmEmail = emailMaster.Range(empFinder).Offset(0, 1)
    Else
        MsgBox ("Firm designated as different emails for employees. Either change designation of firm or add employee. Contact dev for more assistance.")
        Exit Sub
    End If
End If
In my view you are mixing three tasks:
modelling your data
determining a convenient method of holding that data in Excel
determining a convenient method of holding that data in VBA variables
You need to tackle these tasks in sequence but be willing to return to a previous task if you encounter problems with the current one.
Below is a data model which I have deduced from your question. This model does not conform to any standard modelling notation but, if you are not a data modeller, I think you will find this notation easier to understand.
You need to send emails to all your customers. I have shown three example customers. At Customer1 there is only one person you send emails to, and that person has their own email address. At Customer2, there are three people to whom you send emails, and each has their own email address. Customer3 is more complicated. Person13 has their own address but the other contacts share addresses. For example, EmailAddr5 is shared by Person5, Person6 and Person7.
Customer1────EmailAddr1────Person1
Customer2─┬──EmailAddr2────Person2
          ├──EmailAddr3────Person3
          └──EmailAddr4────Person4
Customer3─┬──EmailAddr5─┬──Person5
          │             ├──Person6
          │             └──Person7
          ├──EmailAddr6─┬──Person8
          │             ├──Person9
          │             ├──Person10
          │             ├──Person11
          │             └──Person12
          └──EmailAddr7────Person13
Is this a correct representation of your data? I would summarise this as:
There are 1 or many PERSONs per EMAILADDR
There are 1 or many EMAILADDRs per CUSTOMER
If you are not familiar with modelling data, this may be confusing at first but I believe that with a little study it will become clear.
The question you need to answer is: “Is this a complete description of my data?” Only when you are convinced there are no special cases not covered by this model can you proceed to Task 2.
Unless your final model is much more complicated than mine, I do not believe you will need two or more worksheets to hold this data. So, task 2 is to map your data model onto an Excel worksheet.
During task 2, the person to consider is the user who will create and maintain this data. For example, I would have thought holding multiple people in a single cell would be awkward to maintain. How does this data arrive? Does Acme Supplies tell you to contact John Smith, whose email address is Sales#AcmeSuppliers.Com, or do they tell you to use Sales#AcmeSuppliers.Com to contact any of Angela Brown, Cherry White and John Smith? If data arrives as “name – address”, the model above may be correct but inconvenient. Would this be a convenient arrangement of the data for the user to maintain?
Acme Supplies | Brown, Angela | Sales#AcmeSuppliers.Com
Acme Supplies | Chester, Neal | Admin#AcmeSuppliers.Com
Acme Supplies | Smith, John | Sales#AcmeSuppliers.Com
Acme Supplies | White, Cherry | Sales#AcmeSuppliers.Com
With this arrangement, there is a row per person per company.
If you really think your data model is correct, how about:
Acme Supplies|Admin#AcmeSuppliers.Com|Chester, Neal|Sales#AcmeSuppliers.Com|Brown, Angela|Smith, John|White, Cherry|
where I am using vertical lines to represent cell boundaries.
What I have done is take:
Customer3─┬──EmailAddr5─┬──Person5
          │             ├──Person6
          │             └──Person7
          ├──EmailAddr6─┬──Person8
          │             ├──Person9
          │             ├──Person10
          │             ├──Person11
          │             └──Person12
          └──EmailAddr7────Person13
and arrange it as:
Customer3|EmailAddr5|Person5|Person6|Person7|EmailAddr6|Person8|Person9| and so on
This could work because every address contains the symbol “#” while no name contains this symbol.
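To illustrate why the “#” rule is enough to read such a row back, here is a minimal sketch in Python rather than VBA. The function name parse_customer_row is hypothetical, and the row is assumed to arrive as a list of cell values with the customer name first. Any cell containing “#” starts a new address; every other cell is a name belonging to the most recent address.

```python
def parse_customer_row(cells):
    """Split one worksheet row into (customer, {address: [names]})."""
    customer, rest = cells[0], cells[1:]
    addresses = {}
    current = None
    for cell in rest:
        cell = cell.strip()
        if not cell:
            continue  # ignore empty trailing cells
        if "#" in cell:
            # A cell containing "#" is an address and starts a new group.
            current = cell
            addresses[current] = []
        elif current is None:
            raise ValueError(f"Name {cell!r} appears before any address")
        else:
            addresses[current].append(cell)
    return customer, addresses


row = [
    "Acme Supplies",
    "Admin#AcmeSuppliers.Com", "Chester, Neal",
    "Sales#AcmeSuppliers.Com", "Brown, Angela", "Smith, John", "White, Cherry",
    "",
]
customer, addresses = parse_customer_row(row)
print(customer, addresses)
```

The same loop translates directly to VBA by walking the cells of a row from left to right and testing each value with InStr.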
As you develop the mapping from the data model to Excel, you may need to revise the data model. Your original model may be technically correct but inconvenient to implement in Excel. Provided your revised model can handle any combination of company, address and name you can envisage, and provided it maps cleanly to Excel, you will be all right.
Lastly, task 3 is to map the Excel worksheet to VBA variables. This depends on how you need to process the data. For example, you might have a list like:
CompanyA, PersonZ
CompanyB, PersonY
CompanyB, PersonX
: :
With this it might be easier to leave the data in the worksheet and have code like this:
For each line of list
  Search for Person
  If not found
    Report “No such person”
  Else
    Record row on which first Person found
    Do While True
      If Person belongs to Company Then
        Generate email
        Exit Do
      End If
      Search for next Person
      If row for Person is the row for the first Person Then
        Report “No such person”
        Exit Do
      End If
    Loop
  End If
Next
I think the above is as much as I can offer because there is too much uncertainty about your requirement. This may be enough for you to solve your own problem. If not, you need to update your question to clarify your requirement.
Perhaps you would like a function like the below:
Function FoundEMail(groupArray As Variant, emailArray As Variant) As Boolean
    Dim PersonInGroup As Variant, PersonInEMail As Variant, PersonFound As Boolean
    For Each PersonInGroup In groupArray
        PersonFound = False
        For Each PersonInEMail In emailArray
            PersonFound = StrConv(Trim(PersonInGroup), vbUpperCase) = StrConv(Trim(PersonInEMail), vbUpperCase)
            If PersonFound Then Exit For
        Next PersonInEMail
        If Not PersonFound Then Exit Function
    Next PersonInGroup
    FoundEMail = True
End Function
So instead of
If firmName = emailMaster.Cells(emrow_num, 1) Then
you might have something like
If FoundEMail(Split(firmName,";"),Split(emailMaster.Cells(emrow_num, 1),";")) Then
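For readers who find the matching rule easier to follow outside VBA, the same check can be sketched in Python. The helper names split_names and all_names_found are hypothetical. As in FoundEMail, names are trimmed and compared case-insensitively, and the result is True only if every name in the first cell also appears in the second.

```python
def split_names(cell_text):
    """Split a 'Name; Name; Name' cell into trimmed, upper-cased names."""
    return [name.strip().upper() for name in cell_text.split(";") if name.strip()]


def all_names_found(group_cell, email_cell):
    """True if every name in group_cell also appears in email_cell."""
    email_names = set(split_names(email_cell))
    return all(name in email_names for name in split_names(group_cell))


cell = "Joe GoodGuy; James Johanson; Jimmy TheHat"
print(all_names_found("jimmy thehat; Joe GoodGuy", cell))
print(all_names_found("Jimmy John", cell))
```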

Hadoop Pig - Replace strings in a relation with their corresponding values in a map

I have a relation called conversations_grouped made up of bags of tuples of varying sizes, like so:
DUMP conversations_grouped;
...
({(L194),(L195),(L196),(L197)})
({(L198),(L199)})
({(L200),(L201),(L202),(L203)})
({(L204),(L205),(L206)})
({(L207),(L208)})
({(L271),(L272),(L273),(L274),(L275)})
({(L276),(L277)})
({(L280),(L281)})
({(L363),(L364)})
({(L365),(L366)})
({(L666256),(L666257)})
({(L666369),(L666370),(L666371),(L666372)})
({(L666520),(L666521),(L666522)})
Each L[0-9]+ is a tag corresponding to a string. For example, L194 might be "Hello, how are you doing?" and L195 might be "fine, how are you?". This correspondence is maintained by a map called line_map. Here's a sample:
DUMP line_map;
...
([L666324#Do you think she might be interested in someone?])
([L666264#Well that's typical of Her Majesty's army. Appoint an engineer to do a soldier's work.])
([L666263#Um. There are rumours that my Lord Chelmsford intends to make Durnford Second in Command.])
([L666262#Lighting COGHILL' 5 cigar: Our good Colonel Dumford scored quite a coup with the Sikali Horse.])
([L666522#So far only their scouts. But we have had reports of a small Impi farther north, over there. ])
([L666521#And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?])
([L666520#Well I assure you, Sir, I have no desire to create difficulties. 45])
([L666372#I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.])
([L666371#Lord Chelmsford seems to want me to stay back with my Basutos.])
([L666370#I'm to take the Sikali with the main column to the river])
([L666369#Your orders, Mr Vereker?])
([L666257#Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot])
([L666256#Colonel Durnford... William Vereker. I hear you 've been seeking Officers?])
What I'm trying to do now is parse through each line and replace the L[0-9]+ tags with their corresponding text from line_map. Is it possible to make references to line_map from within a Pig FOREACH statement, or is there something else I have to do?
The first issue with this is that in a map the key must be a quoted string, so you can't use a schema value to access the map. For example, this will not work:
C: {foo: chararray, M: [value:chararray]}
D = FOREACH C GENERATE M#foo ;
The solution that comes to mind is to FLATTEN conversations_grouped. Then do a join between conversations_grouped and line_map on the L[0-9]+ tag. You'll probably want to project out some of the extra fields (like the L[0-9]+ tag after the join) to make the next step faster. After that you'll have to regroup the data, and massage it into the correct format.
This won't work unless each bag has its own unique ID for the regrouping, but if each of the L[0-9]+ tags appears in only one bag (conversation), you can use one of its tags to create a unique id.
-- A is dumped conversations_grouped
B = FOREACH A {
    -- Pulls out an element from the bag to use as the id
    id = LIMIT tags 1 ;
    -- Flattens B into id, tag form. Each group of tags will have the same id.
    GENERATE FLATTEN(id), FLATTEN(tags) ;
}
The schema and output for B are:
B: {id: chararray,tags::tag: chararray}
(L194,L194)
(L194,L195)
(L194,L196)
(L194,L197)
(L198,L198)
(L198,L199)
(L200,L200)
(L200,L201)
(L200,L202)
(L200,L203)
(L204,L204)
(L204,L205)
(L204,L206)
(L207,L207)
(L207,L208)
(L271,L271)
(L271,L272)
(L271,L273)
(L271,L274)
(L271,L275)
(L276,L276)
(L276,L277)
(L280,L280)
(L280,L281)
(L363,L363)
(L363,L364)
(L365,L365)
(L365,L366)
(L666256,L666256)
(L666256,L666257)
(L666369,L666369)
(L666369,L666370)
(L666369,L666371)
(L666369,L666372)
(L666520,L666520)
(L666520,L666521)
(L666520,L666522)
Assuming that the tags are unique, the rest is done like:
-- A2 is line_map, loaded in tag/message pairs instead of a map
-- Joins conversations_grouped and line_map on tag
C = FOREACH (JOIN B by tags::tag, A2 by tag)
        -- This generate removes the tag
        GENERATE id, message ;
-- Regroups C on the id created in B
D = FOREACH (GROUP C BY id)
        -- This step limits the output to just messages
        GENERATE C.(message) AS messages ;
Schema and output from D:
D: {messages: {(A2::message: chararray)}}
({(Colonel Durnford... William Vereker. I hear you 've been seeking Officers?),(Good ones, yes, Mr Vereker. Gentlemen who can ride and shoot)})
({(Your orders, Mr Vereker?),(I'm to take the Sikali with the main column to the river),(Lord Chelmsford seems to want me to stay back with my Basutos.),(I think Chelmsford wants a good man on the border Why he fears a flanking attack and requires a steady Commander in reserve.)})
({(Well I assure you, Sir, I have no desire to create difficulties. 45),(And I assure you, you do not In fact I'd be obliged for your best advice. What have your scouts seen?),(So far only their scouts. But we have had reports of a small Impi farther north, over there. )})
NOTE: In the worst case (the L[0-9]+ tags aren't unique), you can give each line of the input file(s) a sequential integer id before you load it into Pig.
UPDATE: If you are using Pig 0.11, you can also use the RANK operator.
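The flatten, join and regroup flow, with a sequential integer id per conversation as suggested in the note, can be modelled in a few lines of Python. This is only a sketch of the data flow, not Pig code: the two texts for L194 and L195 are the examples from the question, while the texts for L198 and L199 are invented.

```python
from collections import defaultdict

# Each inner list stands for one bag of tags (one conversation).
conversations = [["L194", "L195"], ["L198", "L199"]]

# Tag -> text; the L198 and L199 texts are made up for illustration.
line_map = {
    "L194": "Hello, how are you doing?",
    "L195": "fine, how are you?",
    "L198": "Where are we going?",
    "L199": "To the river.",
}

# FLATTEN: one (conversation id, tag) pair per tag, with a sequential id per bag.
flat = [(conv_id, tag) for conv_id, bag in enumerate(conversations) for tag in bag]

# JOIN on tag, keeping only the id and the message.
joined = [(conv_id, line_map[tag]) for conv_id, tag in flat if tag in line_map]

# GROUP BY id to rebuild each conversation as a list of messages.
grouped = defaultdict(list)
for conv_id, message in joined:
    grouped[conv_id].append(message)

messages = [grouped[conv_id] for conv_id in sorted(grouped)]
print(messages)
```

Because the id is the position of the bag rather than one of its tags, this version does not depend on the tags being unique across conversations.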

How to change values in facet to same in Google Refine?

I'm trying to clean this data: https://dl.dropbox.com/u/820037/local_council_election_data_w_occupation.gz
It contains all the candidates for a local council election in Finland. The column "Ammatti" holds the occupation of each candidate as reported by him/her.
I want to find all the students, but the problem is that they can be "opiskelija" (student), "yliopisto-opiskelija" (university student) and so on.
I clicked the column title "Ammatti" and filtered it with "opiskelija", then I created a "text facet" from the menu in the column title.
That gives me the following facet:
agrol. opiskelija AMK 1
agrologiopiskelija 9
agronomiopiskelija 1
...and so on.
I want to change the value of "Ammatti" (occupation) to "opiskelija" (student) in every one of these cases.
To make things a bit more complicated, the facet also has some occupations (mature students and administrative staff) that I don't want to change to "opiskelija":
aikuisopiskelija 10
opiskelijakunnan hallituksen varapuheenjohtaja 1
opiskelijapalvelun päällikkö 1
opiskelijapalvelupäällikkö 1
I did this by hand clicking through the whole list in the facet and changing the occupations one by one.
I suppose there is a better way to do this, but could someone please tell me how I should've done it?
Using the 'include' option in the facet, select all the values in the column "Ammatti" that you want to transform. Then invoke the Transform function for this column and replace "value" with "opiskelija".
This will replace all the values you have selected with "opiskelija".
Hope this helps (and that it doesn't come too late).
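The rule being applied can also be written down explicitly, which may help when the list of exceptions grows. The following is a standalone Python sketch, not a Refine expression: the function name normalize_occupation is hypothetical and the exclusion list contains only the four values named in the question.

```python
# Occupations that contain "opiskelija" but should be left unchanged.
KEEP_AS_IS = {
    "aikuisopiskelija",
    "opiskelijakunnan hallituksen varapuheenjohtaja",
    "opiskelijapalvelun päällikkö",
    "opiskelijapalvelupäällikkö",
}


def normalize_occupation(value):
    """Map every student-like occupation to 'opiskelija', except exclusions."""
    if "opiskelija" in value and value not in KEEP_AS_IS:
        return "opiskelija"
    return value


for occupation in ["agrologiopiskelija", "aikuisopiskelija", "yliopisto-opiskelija"]:
    print(occupation, "->", normalize_occupation(occupation))
```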