Filtering and display unique column pairs in Excel - sql

Follow on from Excel Count unique value multiple columns
I am trying to filter and setup a table containing all the unique combinations of message types.
So with three message types as an example below, I want to create a table with all the possible flows from this.
So every time MessageA exists, it is either followed by a MessageA, MessageB, MessageC or is the last of the sequence.
And everytime we see MessageC it is only followed by MessageA.
On the left, is the data and on the right is the desired result.
I want this to be able to scale to multiple columns/rows

You could do it by comparing two offset ranges, A1:D5 and B1:E5
=SUMPRODUCT(($A$1:$D$5=$G2)*($B$1:$E$5=K$1))
As you can see, I have cheated slightly by setting K1 blank so it compares correctly with column E, but this could be made part of a longer formula if it was necessary to have END as the column header for K.

Related

How do you select irregular duplicates with Google Sheets queries?

I have a Google sheet with 186,000 rows. I have included a dummy spreadsheet to give you an idea of the data. I need to select ALL duplicates, that includes rows where the first names might not match (i.e. Cathy vs Catherine), but they still refer to the same individual. There are also instances where the addresses might be slightly different (like omitting "Ave" in one row but including it in another).
I need to write a query to account for all of these instances, including just regular duplicates. Or I could do multiple queries and just copy the results into one spreadsheet. In any case, I'm at a loss.
Dummy spreadsheet. I have included one example of each case I am trying to account for (3 total).
I have something that may be useful. See my example sheet here:
https://docs.google.com/spreadsheets/d/19h28go-nzunW6zexcMD61QjySUKJA3Q2Ci2Hu3OMuAg/edit?usp=sharing
Basically I build a key value for each record, along the lines you asked for.
All of the last name, part of the first name, part of the address, and the ZIP code. Other variations are easily added.
The formula is just a string concatenation of parts of these fields, as follows:
=ArrayFormula(
IF(ROW(A2:A)=2,"DupeKey",
IF(A2:A<>"",A2:A &LEFT(B2:B,$N$1) &LEFT(G2:G,$N$1) &K2:K,"")))
A valuable option is to allow varying the length of the required matching sub-string, from the first name and address. This is controlled for the formula by selecting a substring length of 1 to 6 in cell N1, and seeing how this changes the duplicate records that are found. The shorter the substring length, the more duplicate (or possibly duplicate) records will be found.
Conditional formating is used to highlight the duplicate records.
And you can use the column filters to sort by different data columns - to put all of the duplicates at the top, sort by column N, in Z-A order, and exclude blanks.
Note that this isn't perfect. If someone accidentally types a space, or anything else, at the start of a data field, it will not be considered a duplicate. Better logic would be required to catch those.
Let me know if this helps.
You can use these formulas:
If cell B3 match "John" write "match", if doesn't match write "no"
=IF(REGEXMATCH(B3,"John"), "match", "no")
If cell F2 contains content of cell B3, write "match", if doesn't match write "no"
=IF(SEARCH(B3, F2)>0,"match","no")
References:
REGEXMATCH
SEARCH

Find first non-blank cell in column that meets criteria in another column

I've compiled multiple spreadsheets containing sporadic employee information, and I'm now trying to consolidate all of the information to remove duplicates and blanks. The formula below is my starting point, but if the first cell that meets that criteria is blank, it returns a blank. I want it to find the next cell that meets that criteria but has a value.
=INDEX(Working!C:C,MATCH($A3,Working!$B:$B,0))
Below is what the Working tab looks like, which contains the master list of data including blanks and duplicates. Working!C:C is the list of last names; $A3 is the Employee ID I'm hoping to retrieve data for, and Working!$B:$B is the list of Employee IDs. I'll be doing this for many columns, so to illustrate this, in the table example below I've shown that Column D is the phone number. Any help you can provide is appreciated!
Column B-------C-------D
---------287-----Doe----blank
---------287-----blank---333-333-3333
---------287-----Doe----blank
Use the following array formula:
=INDEX(Working!C$1:C$100,MATCH(1,($A3 = Working!$B$1:$B$100)*(Working!C$1:C$100<>""),0))
Being an array formula it needs to be confirmed with Ctrl-Shift-Enter instead of Enter when exiting edit mode. If done correctly then Excel will put {} around the formula.
Please note that with an array formula the references need to be the smallest range possible that covers the dataset.

OpenRefine - Fill between cells but not at the end of the list

I have a list of stock prices for several stocks. Some of the values are missing due to weekends, holidays and probably other reasons.
The gaps are not consistent. Some are two days and some are more than that.
I want to fill the gaps with the last known value but not at the end of the list.
I have tried in Excel to test a few cells below and if it's now empty, do the fill. The problem is that due to the inconsistency of the gaps, it's a tedious task to change the function for all the cases.
Is there a way to test for the end of a list?
UPDATE - added a screenshot.
See this screenshot. I want to fill where the blue dots are. The red dots are at the end of the list and I don't want to fill those cells.
I am looking for a way to detect the end of the list and stop the filling when the end is detected.
I think this is pretty difficult in OpenRefine and probably a different tool would work better. The main issue is that OpenRefine does not offer the ability to easily work across rows so 'summing a column' (or part of a column) is tricky - this is mentioned in https://github.com/OpenRefine/OpenRefine/issues/200
However, you can do this by forcing OpenRefine in Record mode with the whole project containing a single record. Once you've done this you can access all values in a column using syntax like:
row.record.cells["Column name"].value
This gives an array of all the non-blank values in the column. Since this ignores blank values, in order to have a true view of the values in the column you have to fill in blank cells with a value.
So I think you could probably achieve what you want as follows:
For each column you are going to work with do a cell transform to put a dummy value in empty cells - e.g. if(isBlank(value),"null",value)
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project - e.g.
You can now access all cells in a column using syntax like row.record.cells["Column 1"].value. You can combine this with 'forRange' to iterate through the contents of this array, using the row.index as the marker for the current row.
I used the following formula to add a new column to the project:
with(row.record.cells["Column 1"].value,w,if(forRange(row.index,w.length(),1,i,w[i].toNumber()).sum()>0,"a","b"))
Then...
Change back to 'Row' mode
Remove the 'null' placeholder from the original column
Create a facet on the 'fill filter' column
In my case I filter to 'a'
Use the 'fill down' option
Remove the filter
And remove the 'record' column
Rather a long winded way of doing it to say the least, but so far I've not been able to find anything better while not going outside OpenRefine. I'm guessing you could probably compress steps 5-11 into a single step or smaller number of steps.
If you want to access the array of cell values using Jython as suggested by iMitwe you need to use:
row["record"]["cells"]["Column 1"]["value"]
instead of
row.record.cells["Column 1"].value
(step 5)
I am doing this on the top of my head, but I think your best chance my be using the fill down option in record mode:
first move your column to the first column and switch to record mode.
then use the following GREL: row.record.cells["data"].value[-1] where data is the name of your column
The [-1] will take the last value and fill the blank. For the case with the red dot, since there is no value it should remains empty. Let us know how it goes.
Unless there's something I am missing or not seeing...
I would have just sorted reverse (date ascending) on the Date column, then individually use Fill Down on each column, except for that last column where you could then use a Date facet on your column Date to specify the exact Date range you wanted to work with, then fill down on that last column, then remove the Date range facet.

Count unique string variants

There could be quite a simple solution to this, but I am trying to find the number of times a unique variant (i.e. non-duplicates) of a string appears in a column. However this string is only part of the text contained in a cell, and not the entire cell. To illustrate:
EuropeSpainMadrid
EuropeSpainBarcelona
AsiaChinaShanghai
AsiaJapanTokyo
EuropeEnglandLondon
EuropeSpainMadrid
I would like to find how many unique instances there are of a string that contains "EuropeSpain". So using this example, I would find that a variant of "EuropeSpain" appears only twice (given that the second instance of "EuropeSpainMadrid" is a duplicate).
A solution to this is to use pivots to summarise the data and remove duplicated; however given that my underlying dataset changes often this would require manual adjustments and corrections. I would therefore like to avoid adding any intermediate steps (i.e. PivotTables, other data sets etc) between my data and the counts.
UPDATE: I now understand to use wildcards to solve the first part of my question (counting the occurrences of "EuropeSpain"), however I am not yet clear on the second part of my question (how to find the number of unique occurrences).
Is there a formula or VBA code that could do this?
Using wildcards:
=COUNTIF(A1:A6,"="&"*"&C1&"*")
For without VBA but with some versatility, I suggest with Text in ColumnA (labelled), ColumnB labelled Flag and EuropeSpain in C1:
=FIND(C$1,A2)
in B2 copied down.
Then pivot A:B with Flag for FILTERS (and 1 selected), Text for ROWS and Count of Text for Sigma VALUES.
Apply Distinct Values if required (and available!), alternatively a formula of the kind:
=MATCH("Grand Total",E:E)-4
would count uniques.

Sorting by cell value or color in multiple cells

I am trying to sort data based upon either color (RGB(186,200,8)) or value ("AMP") within a cell. That part is easy but the problem comes when I want to look for the same value/color in multiple columns (it can occur up to four times) and put the ones with all for equal to the value at the top and then three values next and on down to no match.
I'm not sure how to go about, I think a for loop and/or else would work but I can't come up with one. Any suggestions?
My suggestion would be to calculate "hit" value for the row and based on that you could do the sort easily. For instance if you have two matches on the row value for that row = 2 etc. after each row validated just sort by the value and clean the data.