I have datasets that I am scanning for a certain pattern using regex. Some of these tables have millions of rows, and doing a column-by-column search is time consuming, so I am using iterrows.
This way, at the first (index, row) where it finds the matching pattern, it flags it and ends the loop. The problem with this is that I can't determine the column name. Ideally I want the name of the column where it found the match.
Code sample:
for index, row in df.iterrows():
    # regex to identify any 9-digit number starting with 456
    if row.astype(str).str.contains(r"\b456\d{6}\b").any():
        print(index); break
Currently my output prints the index of the row where the first match was found and exits. What's a better way to write this so that I can also capture the column name or column index it was found in? For the data sample above, ideally I want the column "Acc_Number" printed.
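One way to also capture the column (a rough, untested sketch; df and "Acc_Number" come from the question, and the exact regex is an assumption) is to build a Boolean mask over the whole frame and then locate the first matching cell:

pattern = r"\b456\d{6}\b"  # assumed regex: any 9-digit number starting with 456

# True wherever a cell matches the pattern (everything cast to string first).
mask = df.astype(str).apply(lambda col: col.str.contains(pattern, na=False))

if mask.any().any():
    first_row = mask.any(axis=1).idxmax()     # index label of the first matching row
    first_col = mask.loc[first_row].idxmax()  # column name of the match, e.g. "Acc_Number"
    print(first_row, first_col)

Because the mask is built with vectorized string operations, this also avoids iterrows, which tends to be slow on millions of rows.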
I have a Google sheet with 186,000 rows. I have included a dummy spreadsheet to give you an idea of the data. I need to select ALL duplicates, including rows where the first names might not match (e.g. Cathy vs. Catherine) but still refer to the same individual. There are also instances where the addresses might be slightly different (like omitting "Ave" in one row but including it in another).
I need to write a query to account for all of these instances, including just regular duplicates. Or I could do multiple queries and just copy the results into one spreadsheet. In any case, I'm at a loss.
Dummy spreadsheet. I have included one example of each case I am trying to account for (3 total).
I have something that may be useful. See my example sheet here:
https://docs.google.com/spreadsheets/d/19h28go-nzunW6zexcMD61QjySUKJA3Q2Ci2Hu3OMuAg/edit?usp=sharing
Basically I build a key value for each record, along the lines you asked for.
It combines all of the last name, part of the first name, part of the address, and the ZIP code. Other variations are easily added.
The formula is just a string concatenation of parts of these fields, as follows:
=ArrayFormula(
IF(ROW(A2:A)=2,"DupeKey",
IF(A2:A<>"",A2:A &LEFT(B2:B,$N$1) &LEFT(G2:G,$N$1) &K2:K,"")))
A valuable option is being able to vary the length of the required matching substring from the first name and the address. In the formula this is controlled by setting a substring length of 1 to 6 in cell N1 and seeing how that changes the duplicate records that are found. The shorter the substring length, the more duplicate (or possibly duplicate) records will be found.
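For comparison, the same key-building idea can be sketched in pandas (the column names and sample rows below are made up for illustration, not taken from your sheet):

import pandas as pd

# Hypothetical stand-ins for the sheet's name/address/ZIP columns.
df = pd.DataFrame({
    "last":    ["Smith", "Smith"],
    "first":   ["Cathy", "Catherine"],
    "address": ["12 Main Ave", "12 Main"],
    "zip":     ["30301", "30301"],
})

n = 3  # matching-prefix length, playing the role of cell N1
df["dupe_key"] = df["last"] + df["first"].str[:n] + df["address"].str[:n] + df["zip"]

# Rows that share a key are candidate duplicates.
print(df[df.duplicated("dupe_key", keep=False)])

As in the sheet, lowering n makes the match looser and surfaces more candidate duplicates.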
Conditional formatting is used to highlight the duplicate records.
You can also use the column filters to sort by different data columns; to put all of the duplicates at the top, sort column N in Z-A order and exclude blanks.
Note that this isn't perfect. If someone accidentally types a space, or anything else, at the start of a data field, it will not be considered a duplicate. Better logic would be required to catch those.
Let me know if this helps.
You can use these formulas:
If cell B3 matches "John", write "match"; if it doesn't match, write "no":
=IF(REGEXMATCH(B3,"John"), "match", "no")
If cell F2 contains the content of cell B3, write "match"; if it doesn't, write "no" (SEARCH returns an error rather than a number when there is no match, so wrap it in ISNUMBER):
=IF(ISNUMBER(SEARCH(B3, F2)),"match","no")
References:
REGEXMATCH
SEARCH
Follow-on from Excel Count unique value multiple columns
I am trying to filter and set up a table containing all the unique combinations of message types.
So with three message types as an example below, I want to create a table with all the possible flows from this.
So every time MessageA exists, it is either followed by MessageA, MessageB, or MessageC, or it is the last in the sequence.
And every time we see MessageC, it is only followed by MessageA.
On the left is the data, and on the right is the desired result.
I want this to be able to scale to multiple columns/rows.
You could do it by comparing two offset ranges, A1:D5 and B1:E5:
=SUMPRODUCT(($A$1:$D$5=$G2)*($B$1:$E$5=K$1))
As you can see, I have cheated slightly by setting K1 blank so it compares correctly with column E, but this could be made part of a longer formula if it was necessary to have END as the column header for K.
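For readers who end up doing this outside Excel, here is a rough Python sketch of the same transition count, END column included (the sequences below are made-up placeholders, not the asker's data):

from collections import Counter
import pandas as pd

# Hypothetical message sequences, one per row of the source data.
rows = [
    ["MessageA", "MessageB", "MessageA"],
    ["MessageC", "MessageA"],
    ["MessageA", "MessageC", "MessageA", "MessageB"],
]

counts = Counter()
for seq in rows:
    for current, nxt in zip(seq, seq[1:] + ["END"]):  # pad with END as the terminator
        counts[(current, nxt)] += 1

# Pivot into a from/to table, mirroring what the SUMPRODUCT grid produces.
table = pd.Series(counts).rename_axis(["from", "to"]).unstack(fill_value=0)
print(table)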
I have a pandas data frame that is already filtered, so the indices are not in order (i.e. not from 0 to the end of my data frame). I have a column that is a 'bag of words': simply a list of words. I know the word I am searching for. How can I find the index/indices that contain this word?
I tried using the '.index' function but I may not be using it correctly.
Sample of DataFrame
Desired output would just be the index/indices of the entry/entries that contain the word I am looking for.
A simple hack for your case, if you only need a contains check, is to treat the bag column as a string rather than a list:
df['word_flag'] = df['bag'].astype(str).str.contains('your-word-here')
If you specifically need the indices:
df[df['word_flag']].index
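If the bag column actually holds Python lists, a slightly more direct sketch (the word below is a placeholder) is to test list membership and take the matching index labels, which also avoids substring false positives such as 'cat' matching 'catalog':

target = "your-word-here"  # placeholder for the word you are searching for
matches = df.index[df["bag"].apply(lambda words: target in words)]
print(matches)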
I'd like to create a super substitute function in Excel (if there is such a thing). For my needs, nesting is insufficient because my substitution list is long.
What I'd like to do is have a named table with two columns for the function to use as a look-up for potential substitutions. The first column would contain the original text and the second would contain the replacement text.
If we look at the above example, columns R and S make up my table, with column R as the original text that I want replaced with the equivalent row entry in column S, if found in my list. My data in column V is the list that I want to work on. I want the formula in column W to look at the equivalent entry in column V and look up the table for any matches within the text. If there is a match, then I want that text from column R replaced with the text in column S. In some cases there can be 2 matches, as with the top example of 'LHR:JFK', which should ideally get replaced with 'LON:NYC'.
There are ways to do this with VBA but I would like to know if there is an Excel formula option for the same as I don't know where to begin with VBA.
Your help is greatly appreciated.
Thanks,
Neha
This will do as you ask:
=SUBSTITUTE(
    SUBSTITUTE(V2, LEFT(V2,3), IFERROR(VLOOKUP(LEFT(V2,3), R:S, 2, FALSE), LEFT(V2,3))),
    RIGHT(SUBSTITUTE(V2, LEFT(V2,3), IFERROR(VLOOKUP(LEFT(V2,3), R:S, 2, FALSE), LEFT(V2,3))), 3),
    IFERROR(
        VLOOKUP(RIGHT(SUBSTITUTE(V2, LEFT(V2,3), IFERROR(VLOOKUP(LEFT(V2,3), R:S, 2, FALSE), LEFT(V2,3))), 3), R:S, 2, FALSE),
        RIGHT(SUBSTITUTE(V2, LEFT(V2,3), IFERROR(VLOOKUP(LEFT(V2,3), R:S, 2, FALSE), LEFT(V2,3))), 3)))
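This formula looks up the first three characters and the last three characters of V2 in R:S and substitutes each one if a match is found, which fits the airport-code style example in the question. For anyone doing the same thing outside Excel, here is a rough Python sketch of the table-driven idea (the two mappings below are just the LHR/JFK example from the question):

# Hypothetical lookup table mirroring columns R (original) and S (replacement).
subs = {"LHR": "LON", "JFK": "NYC"}

def super_substitute(text, subs):
    # Apply every replacement in the lookup table to the text.
    for original, replacement in subs.items():
        text = text.replace(original, replacement)
    return text

print(super_substitute("LHR:JFK", subs))  # -> LON:NYC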
I have a Spark dataframe where one column has the type Set<text>.
This column contains a set of strings, for example ["eenie","meenie","mo"].
How do I filter the contents of the whole dataframe so that
I only get those rows that (for example) contain the value eenie in the set?
I'm looking for something similar to
dataframe.where($"list".contains("eenie"))
The example shown above is only valid when the content of column list is a string, not a Set. What alternatives are there to fit my circumstances?
Edit: My question is not a duplicate. The user in that question has a set of values and wants to know which ones are located inside a specific column. I have a column that contains a set, and I want to know if a specific value is part of the set. My approach is the opposite of that.
Try:
import org.apache.spark.sql.functions.array_contains
dataframe.where(array_contains($"list", "eenie"))
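For reference, a roughly equivalent PySpark sketch (assuming the column is an array-typed column named list):

# PySpark version of the same filter; array_contains expects an ArrayType column.
from pyspark.sql import functions as F

dataframe.where(F.array_contains("list", "eenie"))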