MS Excel Partial Compare and Copy values from one cell to another - vba

How can I do a partial compare of records in two cells using an Excel formula?
Say I have "AA BB" in Sheet1 cell A1 and "test." in Sheet1 cell B1, and "AA VV" in Sheet2 cell A1. How can I partially compare the two cells using the first word (the text before the space) and produce something like #3 below?
1. Sheet1 looks like this:
   A2       B2
   -----    -----
   AA BB    test.
2. Sheet2 looks like this:
   A2
   -----
   AA VV
3. I want Sheet3 to look like this:
   A2       B2
   -----    -----
   AA VV    test.
Compare the first word and, if it matches, copy the value across.

Assuming Sheet1 has a range of values like the one below, and column A of Sheet3 is set equal to column A of Sheet2:
   A      B
   ----------------
1  AA BB  test.
2  CC DD  test..
3  EE FF  test...
4  GG HH  test....
5  II JJ  test.....
You can use this formula in column B of Sheet3 - MID($A1, 1, FIND(" ", $A1)-1) extracts the first word of A1, and the surrounding * wildcards let VLOOKUP match any Sheet1 entry that contains it:
=VLOOKUP("*" & MID($A1, 1, FIND(" ", $A1)-1) & "*", Sheet1!$A$1:$B$5, 2, FALSE)
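As an aside, if the same first-word match is ever needed outside Excel, a rough pandas sketch (the frames and column names below are just stand-ins for the sheets) could look like this:
import pandas as pd

# Stand-ins for Sheet1 (lookup table) and Sheet2 (values to match)
sheet1 = pd.DataFrame({'A': ['AA BB', 'CC DD'], 'B': ['test.', 'test..']})
sheet2 = pd.DataFrame({'A': ['AA VV', 'CC XX']})

# Key both frames on the first word (the text before the first space)
sheet1['key'] = sheet1['A'].str.split().str[0]
sheet2['key'] = sheet2['A'].str.split().str[0]

# Sheet3: Sheet2's column A plus the matching B value from Sheet1
sheet3 = sheet2.merge(sheet1[['key', 'B']], on='key', how='left').drop(columns='key')
print(sheet3)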

replacing first row of selected column from each group with 0

Existing df:
Id   status      value
A1   clear          23
A1   in-process     50
A1   done           20
B1   start           2
B1   end            30
Expected df:
Id   status      value
A1   clear           0
A1   in-process     50
A1   done           20
B1   start           0
B1   end            30
I am looking to replace the first value of each group with 0.
Use Series.duplicated to flag repeated Id values, invert the mask with ~ to select the first row of each group, and set those rows with DataFrame.loc:
df.loc[~df['Id'].duplicated(), 'value'] = 0
print (df)
   Id      status  value
0  A1       clear      0
1  A1  in-process     50
2  A1        done     20
3  B1       start      0
4  B1         end     30
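For reference, a self-contained sketch of this answer with the example frame rebuilt inline:
import pandas as pd

df = pd.DataFrame({'Id': ['A1', 'A1', 'A1', 'B1', 'B1'],
                   'status': ['clear', 'in-process', 'done', 'start', 'end'],
                   'value': [23, 50, 20, 2, 30]})

# Rows where Id appears for the first time are the non-duplicated ones
df.loc[~df['Id'].duplicated(), 'value'] = 0
print(df)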
One approach could be as follows:
Compare each value in df.Id with the value in the previous row, combining Series.shift with Series.ne. This returns a boolean Series that is True for the first row of each new Id value.
Next, use df.loc to select the rows where that mask is True and assign 0 to column value.
df.loc[df.Id.ne(df.Id.shift()), 'value'] = 0
print(df)
   Id      status  value
0  A1       clear      0
1  A1  in-process     50
2  A1        done     20
3  B1       start      0
4  B1         end     30
N.B. this approach assumes that the "groups" in Id are contiguous (as they appear to be). If this is not the case, you could use df.sort_values('Id', inplace=True) first, but if that is necessary, the answer by @jezrael will surely be faster.
The same replacement as a one-liner with Series.mask on the value column (masking the whole frame would overwrite Id and status as well):
df['value'] = df['value'].mask(~df['Id'].duplicated(), 0)

How do I move the data in one indexed range to another in Pandas?

I would like to move all entries from columns N1 to M1, rows A1 to B1, to columns N2 to M2, rows A2 to B2. In other words, the pandas equivalent of copying and pasting a highlighted region of cells to a different location.
For example, in this case, move row 0, col B:C to row 3, col A:B:
   A      B     C
0  apple  pear  banana
1
2
3
to
   A      B     C
0  apple
1
2
3  pear   banana
How could I do this?
Reassign the values using .iloc:
df.iloc[3,1:3]=df.iloc[0,1:3]
df.iloc[0,1:3]=None
df.iloc docs
Update:
To append the row instead (note that df.append returns a new frame rather than modifying df, and it was removed in pandas 2.0 in favour of pd.concat):
df = df.append(df.iloc[0, 0:3])
df.append docs
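Putting it together, a self-contained sketch of the exact move the question describes (row 0, columns B:C into row 3, columns A:B); .to_numpy() is used so the values are placed positionally rather than realigned by column label, and empty cells are plain empty strings here:
import pandas as pd

df = pd.DataFrame({'A': ['apple', '', '', ''],
                   'B': ['pear', '', '', ''],
                   'C': ['banana', '', '', '']})

# Copy row 0, columns B:C into row 3, columns A:B, then blank out the source cells
df.iloc[3, 0:2] = df.iloc[0, 1:3].to_numpy()
df.iloc[0, 1:3] = ''
print(df)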

append lists of different length to dataframe pandas

Consider I have multiple lists
A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3','B4']
C = ['acc_num=3', 'C1']
How do I put them in a dataframe to export to Excel as:
     acc_num  _1  _2  _3  _4
_1         1  A1  A2
_2         2  B1  B2  B3  B4
_3         3  C1
Here is a solution in 3 basic steps:
Create a DataFrame just by passing a list of your lists.
Strip the leading string "acc_num=" from the first column; this is done with a vectorized string method on that column.
Rename the column headers as you wish by passing a dictionary to df.rename.
The code:
import pandas as pd

# Create a DataFrame from your lists; shorter lists are padded with None
df = pd.DataFrame([A, B, C])
# Strip the leading "acc_num=" from column 0
df[0] = df[0].str.replace('acc_num=', '')
# Rename column 0
df.rename(columns={0: "acc_num"}, inplace=True)
Final result:
Out[26]:
  acc_num   1     2     3     4
0       1  A1    A2  None  None
1       2  B1    B2    B3    B4
2       3  C1  None  None  None
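To get the Excel file the question asks for, the result can then be written out with to_excel (the file name and filling the None padding with blanks are assumptions):
# Replace the None padding with blanks and write the sheet (requires openpyxl)
df.fillna('').to_excel('accounts.xlsx', index=False)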

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
ID  A   B
i1  a1  b1
i1  a1  b2
i1  a2  b2
i2  a1  b2
should become
ID  A   An  B   Bn
i1  a1  2   b1  1
i1  a1  2   b2  2
i1  a2  1   b2  2
i2  a1  1   b2  1
Basically An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. I also tried individual groupbys for ID, A and ID, B. Maybe there is a way to pre-group by ID first and then do all the other variables? (There are many variables and I have very many rows!)
"Also I tried individual groupby for ID, A and ID, B"
I think this is a straightforward way to tackle it: as you suggest, you can group by each pair separately and compute the size of the groups, then use transform to add the results back to the original dataframe:
df['An'] = df.groupby(['ID','A'])['A'].transform(np.size)
df['Bn'] = df.groupby(['ID','B'])['B'].transform(np.size)
print(df)
   ID   A   B  An  Bn
0  i1  a1  b1   2   1
1  i1  a1  b2   2   2
2  i1  a2  b2   1   2
3  i2  a1  b2   1   1
Of course, with lots of columns you could do:
for col in ['A','B']:
    df[col + 'n'] = df.groupby(['ID',col])[col].transform(np.size)
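A side note for current pandas: the built-in 'size' aggregator can be used in place of np.size, avoiding the numpy dependency (a minimal variant of the loop above):
for col in ['A', 'B']:
    # 'size' counts the rows in each (ID, col) group and broadcasts the count back
    df[col + 'n'] = df.groupby(['ID', col])[col].transform('size')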
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A','B']:
    df[col + 'n'] = df.duplicated(['ID',col])
print(df)
   ID   A   B     An     Bn
0  i1  a1  b1  False  False
1  i1  a1  b2   True  False
2  i1  a2  b2  False   True
3  i2  a1  b2  False  False
EDIT: increasing performance for large data. I tried this on a large dataset (4 million rows) and it was significantly faster to avoid transform with something like the following (it is much less elegant):
for col in ['A','B']:
    x = df.groupby(['ID',col]).size()
    df.set_index(['ID',col], inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)
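If swapping the index in and out feels fragile, an equivalent lookup can be sketched (assuming pandas is imported as pd) by building per-row keys explicitly and reindexing the size Series:
for col in ['A', 'B']:
    sizes = df.groupby(['ID', col]).size()
    # One (ID, col) key per row, used to look up that row's group size
    keys = pd.MultiIndex.from_frame(df[['ID', col]])
    df[col + 'n'] = sizes.reindex(keys).to_numpy()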

How can I identify the text values that have the lowest row IDs across 4 columns?

I found a few articles that are close, but not the same as what I am trying to do. I have an Excel file that has 4 columns of duplicated data; each column is sorted based on a numeric value that came from a different worksheet.
I need to identify the 25 (or so) rows where the values match across all four columns and the row IDs are the lowest. There will be roughly 250 rows of data to sift through, so I only really need the top 10%.
I don't HAVE to approach it this way. I can dump this data into Access if this cannot be done in Excel. Or I can assign columns next to each text column (a way of assigning IDs to each field in columns 1, 2, 3, and 4) and use those values. The approach is negotiable, as long as the outcome works.
Here's what my data looks like in Excel:
A    B    C    D
abc  bcd  abc  def
cde  fgh  def  bcd
def  def  bcd  abc
bcd  hji  xyz  lmn
So in this case I would want to highlight (or somehow identify) the value "def" because it appears closest to the top of all 4 columns, hence it has the lowest row IDs. The value "bcd" would be second on the list since it also appears in all 4 and has low row IDs.
Any suggestions would be appreciated. I know SQL fairly well, so if you think dumping it in a DB would be best and you can suggest a query that would be awesome. But ideally... keeping it in Excel would be the least amount of work for me. I'm open to formulas, conditional formatting, etc.
Thanks!!
I THINK I came up with a fairly cool solution...
So, supposing you have this data in columns A-D, beginning in cell A2, say.
Now, you know that you ONLY want values if they already exist in column A - otherwise they're not in all 4 columns.
So:
In E2, type in the formula =ROW() - this records which row A's value is located in.
In F2, type in =MATCH($A2,B:B,0) - this will find the first match for A2's value in column B.
Drag that formula across to G2 & H2 (to find the first match for A2's value in C & D respectively).
In I2, type in the formula =SUM(E2:H2).
Now, drag E:I down for your entire dataset.
So, if column I shows #N/A, that means the value isn't in all 4 columns.
And the lower the value in column I, the lower (better) the rank of the match (column A's text being the value you're matching for).
Now you could sort according to column I, etc., to suit your needs.
Hope this does the trick (and makes sense)!
Cool Q, BTW!!!
Do you have, or can you create, a master list of all of the possible cell values? If so, then some simple VLOOKUPs on each of the 4 data columns could give, for each unique cell value, the row number in each column. Add up the 4 results and sort on the total.
If you don't have the master list of unique values, I'd tend to go to Access because it's a pretty easy set of queries to get what you want.
Clarification Needed
When I first came up with this answer I used the same approach that John used in his clever Excel answer, namely to use the sum of the minimum rows per column to produce the rank. That produces the sample result in the question, but consider the following modified test data:
F1   F2   F3   F4   RowNum
---  ---  ---  ---  ------
XXX  bar  baz  bat       1
foo  XXX  baz  bat       2
YYY  bar  XXX  bat       3
foo  YYY  baz  bat       4
foo  bar  YYY  bat       5
foo  bar  baz  YYY       6
foo  bar  baz  bat       7
foo  bar  baz  bat       8
foo  bar  baz  bat       9
foo  bar  baz  XXX      10
XXX appears in rows 1, 2, 3, and 10, so the sum would be 16. YYY appears in rows 3, 4, 5, and 6 so the sum would be 18. Ranking by sum would declare XXX the winner, even though if you started scanning for XXX from row 1 you would have to go all the way to row 10 to reach the last XXX, whereas if you started scanning for YYY from row 1 you would only have to go down to row 6 to reach the last YYY.
In this case should YYY actually be the winner?
(original answer)
The following code will import the Excel data into Access and add a [RowNum] column
Sub ImportExcelData()
    On Error Resume Next '' in case it doesn't already exist
    DoCmd.DeleteObject acTable, "ExcelData"
    On Error GoTo 0
    DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12Xml, "ExcelData", "C:\Users\Gord\Documents\ExcelData.xlsx", False
    CurrentDb.Execute "ALTER TABLE ExcelData ADD COLUMN RowNum AUTOINCREMENT(1,1)", dbFailOnError
End Sub
So now we have an [ExcelData] table in Access like this
F1   F2   F3   F4   RowNum
---  ---  ---  ---  ------
abc  bcd  abc  def       1
cde  fgh  def  bcd       2
def  def  bcd  abc       3
bcd  hji  xyz  lmn       4
Let's create a saved query named ExcelItems in Access to string the entries out in a long "list"...
SELECT F1 AS Item, RowNum, 1 AS ColNum FROM ExcelData
UNION ALL
SELECT F2 AS Item, RowNum, 2 AS ColNum FROM ExcelData
UNION ALL
SELECT F3 AS Item, RowNum, 3 AS ColNum FROM ExcelData
UNION ALL
SELECT F4 AS Item, RowNum, 4 AS ColNum FROM ExcelData
...returning...
Item  RowNum  ColNum
----  ------  ------
abc        1       1
cde        2       1
def        3       1
bcd        4       1
bcd        1       2
fgh        2       2
def        3       2
hji        4       2
abc        1       3
def        2       3
bcd        3       3
xyz        4       3
def        1       4
bcd        2       4
abc        3       4
lmn        4       4
Now we can find the lowest RowNum where Item is found for each ColNum...
TRANSFORM Min(ExcelItems.[RowNum]) AS MinOfRowNum
SELECT ExcelItems.[Item]
FROM ExcelItems
GROUP BY ExcelItems.[Item]
PIVOT ExcelItems.[ColNum] In (1,2,3,4);
...returning...
Item  1  2  3  4
----  -  -  -  -
abc   1     1  3
bcd   4  1  3  2
cde   2
def   3  3  2  1
fgh      2
hji      4
lmn            4
xyz         4
If we save that query as ExcelItems_Crosstab then we can use it to rank the items that appear in all four columns:
SELECT Item, [1]+[2]+[3]+[4] AS Rank
FROM ExcelItems_Crosstab
WHERE ([1]+[2]+[3]+[4]) IS NOT NULL
ORDER BY 2
...returning...
Item  Rank
----  ----
def      9
bcd     10
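For completeness, the same ranking logic can be sketched in pandas (the F1-F4/RowNum names mirror the Access table above; purely illustrative):
import pandas as pd

df = pd.DataFrame({'F1': ['abc', 'cde', 'def', 'bcd'],
                   'F2': ['bcd', 'fgh', 'def', 'hji'],
                   'F3': ['abc', 'def', 'bcd', 'xyz'],
                   'F4': ['def', 'bcd', 'abc', 'lmn']})
df['RowNum'] = range(1, len(df) + 1)

# Long "list" form: one (Item, ColNum, RowNum) record per cell, like the UNION query
long_form = df.melt(id_vars='RowNum', var_name='ColNum', value_name='Item')

# Minimum row number per item and column, pivoted like the crosstab
crosstab = long_form.groupby(['Item', 'ColNum'])['RowNum'].min().unstack()

# Keep only items found in all four columns and rank by the summed row numbers
rank = crosstab.dropna().sum(axis=1).sort_values()
print(rank)  # def 9.0, bcd 10.0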