Grabbing an unordered index given a matching criteria - pandas

I have a pandas data frame that is already filtered, so the indices are not in order (i.e., they do not run from 0 to the end of my data frame). I have a column that is a 'bag of words'. This is simply a list of words. I know the word I am searching for. How can I find the index/indices of the rows that contain this word?
I tried using the '.index' function but I may not be using it correctly.
Sample of DataFrame
Desired output would just be the index/indices of the entry/entries that contain the word I am looking for.

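A simple hack for your case, if you only need a containment check, is to treat the bag column as a string rather than a list. Note that this matches substrings, so searching for 'cat' would also flag an entry containing 'category':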
df['word_flag'] = df['bag'].astype(str).str.contains('your-word-here')
If you specifically need the indices
df[df['word_flag'] == 1].index
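If the word has to match a whole list element rather than a substring, a membership test on each list is one alternative. The following is a minimal sketch, assuming the bag column holds Python lists; the sample data, the index labels, and the names `word`, `mask`, `matching_index` and `substring_index` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the non-contiguous index mimics a filtered frame.
df = pd.DataFrame(
    {"bag": [["red", "apple"], ["green", "pear"], ["pineapple"], ["apple", "pie"]]},
    index=[3, 7, 12, 20],
)

word = "apple"

# True only where the list contains the word as a whole element.
mask = df["bag"].apply(lambda words: word in words)
matching_index = df.index[mask].tolist()

# For comparison: the string-based check also flags "pineapple".
substring_mask = df["bag"].astype(str).str.contains(word, regex=False)
substring_index = df.index[substring_mask].tolist()

print(matching_index)   # [3, 20]
print(substring_index)  # [3, 12, 20]
```

Because the mask is used to select from `df.index`, the original labels are returned, not positions.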

Related

Pandas Identify Column name where Pattern was found using df.iterrows()

I have datasets that I am scanning for a certain pattern using regex. Some of these tables have millions of rows, and doing a column-by-column search is time-consuming, so I am using iterrows.
This way, it flags the first index and row where it finds the matching pattern and ends the loop. The problem is that I can't determine the column name. Ideally, I want the name of the column where the match was found.
Code sample:
for index, row in df.iterrows():
    # regex to identify any 9 digit number starting with 456 goes here
Currently my output prints the index of the row where it found the first match and exits. What's a better way to write this so that I can capture the name or index of the column where the match was found? For the data sample above, I would ideally want the column name "Acc_Number" printed.
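One possible approach, sketched here under assumptions, is to scan column by column with the vectorized `str.contains` and stop at the first column that has a match. The sample frame, the regex `^456\d{6}$`, and the helper name `first_match_column` are all hypothetical; only the column name "Acc_Number" comes from the question. Note that this returns the first matching column in column order, which is not necessarily the column of the first matching row.

```python
import pandas as pd

# Hypothetical data; only the column name "Acc_Number" is from the question.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Acc_Number": ["123456789", "456123789", "999"],
    "Note": ["x", "y", "z"],
})

# Assumed pattern: the whole cell is a 9-digit number starting with 456.
pattern = r"^456\d{6}$"


def first_match_column(frame, regex):
    """Return (column name, index label) of the first column with a match."""
    for col in frame.columns:
        mask = frame[col].astype(str).str.contains(regex, regex=True, na=False)
        if mask.any():
            # idxmax on a boolean Series gives the label of the first True.
            return col, mask.idxmax()
    return None


result = first_match_column(df, pattern)
print(result)
```

Each column is checked in a single vectorized pass, and the loop exits as soon as one column matches, so later columns are never scanned.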

Filtering and display unique column pairs in Excel

Following on from Excel Count unique value multiple columns
I am trying to filter and set up a table containing all the unique combinations of message types.
So with three message types as an example below, I want to create a table with all the possible flows from this.
So every time MessageA exists, it is either followed by a MessageA, MessageB, MessageC or is the last of the sequence.
And every time we see MessageC, it is only followed by MessageA.
The data is on the left and the desired result is on the right.
I want this to be able to scale to multiple columns/rows
You could do it by comparing two offset ranges, A1:D5 and B1:E5
=SUMPRODUCT(($A$1:$D$5=$G2)*($B$1:$E$5=K$1))
As you can see, I have cheated slightly by setting K1 blank so it compares correctly with column E, but this could be made part of a longer formula if it was necessary to have END as the column header for K.
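The offset-range idea (compare each cell with the cell to its right, and treat a blank as the end of the sequence) can also be sketched outside Excel. This is only an illustration in Python; the sample rows, the `END` label, and the function name `transition_counts` are assumptions, not data from the question.

```python
from collections import Counter

# Hypothetical message sequences, one list per spreadsheet row.
rows = [
    ["MessageA", "MessageB", "MessageC", "MessageA"],
    ["MessageA", "MessageA", "MessageB"],
]


def transition_counts(sequences, end="END"):
    """Count (message, next message) pairs; the last message pairs with `end`."""
    counts = Counter()
    for seq in sequences:
        # Pair each item with its right-hand neighbour, like the two offset ranges.
        for current, following in zip(seq, seq[1:] + [end]):
            counts[(current, following)] += 1
    return counts


counts = transition_counts(rows)
print(counts[("MessageA", "MessageB")])  # 2
print(counts[("MessageC", "MessageA")])  # 1
```

Each key of the counter corresponds to one cell of the result table: the row header is the first element and the column header is the second.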

How to query the presence of an element inside a Spark Dataframe Column that contains a set?

I have a spark dataframe where one column has the type Set<text>.
This column contains a set of strings, for example ["eenie","meenie","mo"].
How do I filter the contents of the whole dataframe so that
I only get those rows that (for example) contain the value eenie in the set?
I'm looking for something similar to
dataframe.where($"list".contains("eenie"))
The example above is only valid when the content of the column list is a string, not a Set. What alternatives are there to fit my circumstances?
Edit: My question is not a duplicate. The user in that question has a set of values and wants to know which ones are located inside a specific column. I have a column that contains a set, and I want to know if a specific value is part of the set. My approach is the opposite of that.
Try:
import org.apache.spark.sql.functions.array_contains
dataframe.where(array_contains($"list", "eenie"))

Need a simple search function to display most common value in a column. (with ambiguous choices)

I have a very large array of data with many columns that display different outputs for the values presented. I would like to add a row above the data that displays the most commonly occurring value or word in the column below.
Specifically, I would like the top of each column (right under the column label in row 1) to show the most common value below it. I will then use this value for various data analysis functions!
Is this possible, and if so, how? Preferably this will not require VBA, but simply a short formula in the cell.
One caveat: The exact values may vary, so there is no set list where I can say "it will be one of these."
Any ideas appreciated!
Try a series of =COUNTIF(A:A,"VALUE TO SEARCH") functions if you want to stay away from VBA.
Otherwise, the best method would be to iterate through each column via VBA. With this method, you can even count the "varying" values and return the count and/or the value itself.
http://www.excel-easy.com/examples/most-frequently-occurring-word.html
This is a single formula you would write at the top of each column. It does not require VBA. You can replace the set range with an entire column, such as (A:A) instead of (A1:A7).
If you mean an array as in a data type, it could work differently but it depends what you're trying to do.
With data from A3 through A16, in A2 enter:
=INDEX($A$3:$A$16,MODE(MATCH($A$3:$A$16,$A$3:$A$16,0)))
This will work for text as well as numbers. Adjust this to match the column size.
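For readers who want to check a result outside Excel, the same "most frequent value, text or number" idea can be sketched in Python. The sample values and the function name `most_common_value` are hypothetical, and the behaviour on ties is not claimed to match the Excel formula.

```python
from collections import Counter

# Hypothetical column contents; text and numbers are counted the same way.
values = ["cat", "dog", "cat", "bird", "dog", "cat"]


def most_common_value(items):
    """Return (value, count) for the most frequent item in the list."""
    counts = Counter(items)
    value, count = counts.most_common(1)[0]
    return value, count


top_value, top_count = most_common_value(values)
print(top_value, top_count)  # cat 3
```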

Count unique string variants

There could be quite a simple solution to this, but I am trying to find the number of times a unique variant (i.e. non-duplicates) of a string appears in a column. However this string is only part of the text contained in a cell, and not the entire cell. To illustrate:
EuropeSpainMadrid
EuropeSpainBarcelona
AsiaChinaShanghai
AsiaJapanTokyo
EuropeEnglandLondon
EuropeSpainMadrid
I would like to find how many unique instances there are of a string that contains "EuropeSpain". So using this example, I would find that a variant of "EuropeSpain" appears only twice (given that the second instance of "EuropeSpainMadrid" is a duplicate).
One solution is to use pivots to summarise the data and remove duplicates; however, given that my underlying dataset changes often, this would require manual adjustments and corrections. I would therefore like to avoid adding any intermediate steps (i.e. PivotTables, other data sets etc.) between my data and the counts.
UPDATE: I now understand to use wildcards to solve the first part of my question (counting the occurrences of "EuropeSpain"), however I am not yet clear on the second part of my question (how to find the number of unique occurrences).
Is there a formula or VBA code that could do this?
Using wildcards:
=COUNTIF(A1:A6,"="&"*"&C1&"*")
For a solution without VBA but with some versatility, I suggest putting the text in column A (labelled), labelling column B Flag, and entering EuropeSpain in C1, then:
=FIND(C$1,A2)
in B2 copied down.
Then pivot A:B with Flag for FILTERS (and 1 selected), Text for ROWS and Count of Text for Sigma VALUES.
Apply Distinct Values if required (and available!), alternatively a formula of the kind:
=MATCH("Grand Total",E:E)-4
would count uniques.
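The distinction between the two counts (all occurrences versus unique variants) can be illustrated with a short Python sketch using the six values from the question. The variable names are made up for illustration, and this is a cross-check outside Excel rather than a replacement for the formulas above.

```python
# The six cell values listed in the question.
cells = [
    "EuropeSpainMadrid",
    "EuropeSpainBarcelona",
    "AsiaChinaShanghai",
    "AsiaJapanTokyo",
    "EuropeEnglandLondon",
    "EuropeSpainMadrid",
]

needle = "EuropeSpain"

# All cells containing the text, duplicates included (what the wildcard COUNTIF counts).
total_count = sum(needle in cell for cell in cells)

# A set keeps each matching variant once, so duplicates collapse.
unique_matches = {cell for cell in cells if needle in cell}
unique_count = len(unique_matches)

print(total_count)   # 3
print(unique_count)  # 2
```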