Query with lambda function in pymongo?

I use the following query to fish for all men in a database:
f = pd.DataFrame(x for x in collection.find({"gender": "M"}, {"_id": 0}))
How could I find only the men whose "name" starts with an "A"? Obviously I could filter the resulting huge DataFrame, but how can I avoid creating this frame in the first place?
Many thanks

You can use a MongoDB regular expression query, something like:
from bson.regex import Regex
f = pd.DataFrame(x for x in collection.find({"gender": "M", "name": Regex(r"^A.*")}, {"_id": 0}))
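If you'd rather avoid the bson import, the same server-side filter can be written with the plain $regex operator; a minimal sketch, reusing the collection from the question:

import pandas as pd

# The regex is evaluated by MongoDB, so only matching documents are
# transferred and the DataFrame is built from men whose name starts
# with "A" in the first place.
f = pd.DataFrame(x for x in collection.find({"gender": "M", "name": {"$regex": "^A"}}, {"_id": 0}))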

Related

R code for matching multiple strings in two columns and returning them into a third separated by a comma

I have two data frames. The first df includes columns b and c, each of which holds multiple strings separated by commas. The second has three columns: one that includes all strings in column b, a second that includes all strings in c, and a third that is the resulting string I want to use.
x <- data.frame("uuid" = 1:2, "first" = c("jeff,fred,amy","tina,cat,dog"), "job" = c("bank teller,short cook, sky diver, no job, unknown job","bank clerk,short pet, ocean diver, hot job, rad job"))
x1 <- data.frame("meta" = c("ace", "king", "queen", "jack", 10, 9, 8,7,6,5,4,3), "first" = c("jeff","jeff","fred","amy","tina","cat","dog","fred","amy","tina","cat","dog"), "job" = c("bank teller","short cook", "sky diver", "no job", "unknown job","bank clerk","short pet", "ocean diver", "hot job", "rad job","bank teller","short cook"))
The result would be
result <- data.frame("uuid" = 1:2, "combined" = c("ace,king,queen,jack","5,9,8"))
Thank you in advance!
I tried to beat my head against the wall and it didn't help
Edit: This is the first half of the puzzle, BUT it does not search for and then concatenate the strings together in a cell; it only returns the first match found rather than all matches.
Is there a way to exactly match a string in one column with a couple of strings in another column in R?

How to write a function that evaluates two data frame columns of choice and returns the output?

Given the following data frame:
site <- c("site_1", "site_2", "site_3", "site_4", "site_5", "site_6")
protein1 <- c("M", "Q", "W", "F", "M", "M")
protein2 <- c("M", "W", "V", "M", "M", "M")
protein3 <- c("M", "D", "W", "F", "M", "M")
df <- data.frame(site, protein1, protein2, protein3)
I would like to extract the first column of the data frame, plus the two additional columns (proteins) that are being compared. However, the latter two columns will vary depending on the comparison, and only the rows (site_number) where the two proteins differ should be returned. I have achieved this using subset(), but I was hoping to avoid copying and pasting the same line many times and replacing the column names in each line. Here's what I've been able to do, which I feel is more script than necessary:
comparison1 <- subset(df, protein1 != protein2, select = c(site, protein1, protein2))
comparison2 <- subset(df, protein1 != protein3, select = c(site, protein1, protein3))
# in each case, this produces the desired result of showing the "site" and "protein" values, in rows where the "protein" values differ.
In a large dataset, one would have many columns (18) with different names. Additionally, two different pairwise comparisons would be performed for each column. So I thought it would be wise to write a function that takes the column names of interest as input. Not having so much experience, I tried the following function before learning that you should not use subset() inside functions:
# to establish the function:
compare <- function(first, second) {
  result <- subset(df, df$first != df$second, select = c(site, first, second))
  return(result)
}
# then to do my comparisons:
compare(protein1, protein2)
compare(protein1, protein3)
This returned the following error:
Error in `[.data.frame`(x, r, vars, drop = drop) :
  undefined columns selected
Downstream, I would like to put the results into a list of data frames.
I'm quite sure I'm overlooking something simple. Perhaps the answer lies in using double square brackets ([[). At least it seems that R is not converting the "first" and "second" variables to character strings that can match the column names, as the error says the columns are undefined. If anyone knows whether writing a function for this is the right approach, or if I should do something else, I would be very grateful for the feedback!
Thanks, and take care,
A

How to return different rows from the same table on single query as a list

I have a table with a boolean column that is used to filter some responses. I need to return a response as a tuple like {claimed, unclaimed} (imagine the table is called winnings).
While working on it I've run two separate queries to return the claimed and then the unclaimed rows, constructing the response manually; then I switched to returning all rows without checking the boolean column and splitting them outside of the query. Now I'm wondering if there's a way to run a single query on the same table and return both claimed and unclaimed as separate results, mainly for performance, hoping it runs better. I've tried doing it with joins, but it returns a list of two-item tuples like:
[{claimed, unclaimed}, {claimed, unclaimed}]...
While I want:
{claimed, unclaimed}
# OR
[{claimed, unclaimed}]
At most that, with no more tuples. Note that I'm not running raw queries but using a library, so excuse me if the terminology is not right.
This is the last query I ran:
SELECT w0."claimed", w1."claimed"
FROM "winnings" AS w0
INNER JOIN "winnings" AS w1 ON TRUE
WHERE (w0."claimed" AND NOT (w1."claimed"))
LIMIT 10;
EDIT: More details.
When I run the query from above this is the result I get:
=> SELECT w0."claimed", w1."claimed" FROM "winnings" AS w0 INNER JOIN "winnings" AS w1 ON TRUE WHERE (w0."claimed" AND NOT (w1."claimed")) LIMIT 10;
claimed | claimed
---------+---------
t | f
t | f
t | f
t | f
t | f
t | f
t | f
t | f
t | f
t | f
(10 rows)
This is converted to the following on Elixir which is the language I'm using:
[
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false,
  true: false
]
This is a keyword list, which internally is a list of tuples like [{true, false}, {true, false}] - I want: [{[true, true], [false, false]}]
That means I want two lists, each with its respective rows: only claimed in one and only unclaimed in the other.
I don't really mind the output type as long as it includes two lists with their rows as I said.
To get the first column from a list of rows, you can use Enum.map/2 to get the first element of each tuple:
Enum.map(rows, &elem(&1, 0))
If you're a newcomer to Elixir, the & syntax may be a bit confusing. That code is shorthand for
Enum.map(rows, fn field -> elem(field, 0) end)
You could make that into a function that does that for all your columns like this:
def columnize(rows = [first_row | _]) when is_tuple(first_row) do
  for column <- 1..tuple_size(first_row), do: Enum.map(rows, &elem(&1, column - 1))
end

def columnize([]) do
  []
end
The rows = [first_row | _] pattern guarantees the argument is a list with at least one element and binds its head to first_row (the same tuple hd/1 would return); the when is_tuple(first_row) guard assures at least the first row is a tuple.
If you'd like to know more about the for, it's a comprehension.
Note that this function assumes all rows have the same number of columns and are tuples (which should be true in the results of a query, but may not be in other cases), and will throw an ArgumentError if the first row is wider than any other row. It will also cause errors if other elements of the list are not tuples.
This feels very unidiomatic though. Is this maybe an XY problem?

Creating a function to count the number of parts of speech in a pandas instance

I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.
The head is here, as a tsv: https://pastebin.com/FnnBq9rf
Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!
def NounCounter(x):
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns
df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
As an example, for a dataframe df, the noun count of the column "reviews" can be saved to a new column "noun_count" using this code.
from nltk import pos_tag, word_tokenize

def NounCount(x):
    noun_count = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return noun_count

df["noun_count"] = df["reviews"].apply(NounCount)
df.to_csv('./dataset.csv')
There are a number of ways you can do that, and one very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, then count the number of 1's you have.
Assume you have something like this (please correct me if it's not, as you didn't provide an example):
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])
You can do something like this to map the Series and sum the count:
a.map(lambda x: 1 if x[1] == "verb" else 0).sum()
This will return 2.
I grabbed a sentence from the link you shared:
import nltk
import pandas as pd

text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2
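More generally, the indicator trick can be wrapped in a small helper that counts any tag prefix per review. A minimal sketch, assuming a dataframe with a "reviews" text column as in the solution above (the helper name count_tags is hypothetical):

from nltk import pos_tag, word_tokenize
import pandas as pd

def count_tags(text, prefix):
    # Count (word, POS) pairs whose tag starts with the given prefix,
    # e.g. "VB" matches VB, VBD, VBG, ... and "NN" matches all noun tags.
    return sum(1 for _, pos in pos_tag(word_tokenize(text)) if pos.startswith(prefix))

df["verb_count"] = df["reviews"].apply(lambda s: count_tags(s, "VB"))
df["noun_count"] = df["reviews"].apply(lambda s: count_tags(s, "NN"))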

Apply function with pandas dataframe - POS tagger computation time

I'm very confused about the apply function in pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply the way I have it set up. Basically, I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using a list comprehension (someone posted an answer, but they've deleted it):
df['string'].apply(lambda row: tagger_nouns(row))
which required an adjustment to my function:
from collections import Counter

def tagger_nouns(x):
    # st is the Stanford tagger instance
    list_of_lists = st.tag(x.split())
    flat = [y for z in list_of_lists for y in z]
    parts_of_speech = [row[1] for row in flat]
    c = Counter(parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
I'm using the Stanford tagger with the left 3 words model, but I have a big problem with computation time. I'm noticing that it's calling the .jar file again and again (Java keeps opening and closing in the task manager), and maybe that's unavoidable, but it's really taking far too long to run. Is there any way I can speed it up?
I don't know what 'tagger' is, but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2
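On the computation-time part of the question: the per-row apply launches a fresh Java process for every st.tag call. A possible workaround, sketched here under the assumption that st is nltk's StanfordPOSTagger wrapper (the file paths below are placeholders), is to batch all rows into a single tag_sents call so the JVM starts only once:

from collections import Counter
from nltk.tag.stanford import StanfordPOSTagger

# Placeholder paths; point these at your local model and jar files.
st = StanfordPOSTagger('english-left3words-distsim.tagger', 'stanford-postagger.jar')

def count_nouns_batch(series):
    # One Java invocation for all rows instead of one per row.
    tagged_rows = st.tag_sents([s.split() for s in series])
    # startswith('NN') covers NN, NNS, NNP, and NNPS, matching the
    # c['NN'] + c['NNS'] + c['NNP'] + c['NNPS'] logic above.
    return [sum(1 for _, pos in tagged if pos.startswith('NN')) for tagged in tagged_rows]

df['num_nouns'] = count_nouns_batch(df['string'])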