How to cut a string at a certain string in Python? - pandas

I am working with Python and want to cut a dataset before the string "Customer Information" occurs. The goal is that everything before this string ends up in my new dataset and the part after the string is cut off.
I have tried two approaches, but neither worked. See the code below.
df = dataset.copy()
df.Description = df.Description.str.split('Customer Information').str[0].str.strip()
df['Description'] = [x.lstrip('').rstrip('Customer Information') for x in df['Description']]
The expected result for the string "test Customer Information: Many lines of customer information" is:
"test"
My actual result is: "test Customer Information: Many lines of customer information"

Why not:
df.Description = df.Description.str.split('Customer Information').str.get(0)
Here is how I tested. Input:
   Item                    Description
0     1   abc Customer Information abc
1     1   aaa Customer Information abc
2     1   bbb Customer Information abc
3     1   ccc Customer Information abc
Output:
   Item Description
0     1         abc
1     1         aaa
2     1         bbb
3     1         ccc
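For what it's worth, here is a minimal, self-contained sketch of that approach; the one-row frame below is made up to mirror the question's example:
import pandas as pd

dataset = pd.DataFrame({
    'Item': [1],
    'Description': ['test Customer Information: Many lines of customer information'],
})

df = dataset.copy()
# Keep only the text before the marker and drop surrounding whitespace
df['Description'] = df['Description'].str.split('Customer Information').str[0].str.strip()
print(df)
#    Item Description
# 0     1        test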

Related

Find a certain value in each row of data and count the frequency with pandas

I want to calculate the frequency for each row of data. For instance:
column_nameA | column_nameB | column_nameC | title                               | content
AAA company  | AAA          | Ben Simons   | AAA company has new product lanuch. | AAA company has released new product. AAA claims that the product X has significant changed than before. Ben Simons, who is AAA company CEO, also mentioned.......
BBB company  | BBB          | Alex Wong    | AAA company has new product lanuch. | AAA company has released new product. BBB claims that the product X has significant changed than before, and BBB company has invested around 1 millions…....
The result I expect: if "AAA company" appears once in the title it counts as 1, and if it appears twice in the title it counts as 2. The same idea applies to the content: one occurrence counts as 1, two occurrences count as 2. However, for the second row only "BBB company" or "BBB" should be considered, not "AAA company" or "AAA".
So the result would look like:
nameA_appear_in_title | nameB_appear_in_title | nameC_appear_in_title | nameA_appear_in_content | nameB_appear_in_content | nameC_appear_in_content
1                     | 1                     | 0                     | 2                       | 1                       | 1
0                     | 0                     | 0                     | 1                       | 1                       | 0
All the data is stored in a DataFrame, and I hope this can be done with pandas.
One more thing to highlight: the title and content cannot be tokenized to count the frequency.
Use itertools.product for all combinations of the two lists of column names, create the new count columns, and finally remove the original columns if necessary:
from itertools import product

cols = df.columns
L1 = ['column_nameA', 'column_nameB', 'column_nameC']
L2 = ['title', 'content']

for a, b in product(L2, L1):
    df[f'{b}_{a}'] = df.apply(lambda x: x[a].count(x[b]), axis=1)

df = df.drop(cols, axis=1)
print(df)

   column_nameA_title  column_nameB_title  column_nameC_title  \
0                   1                   1                   0
1                   0                   0                   0

   column_nameA_content  column_nameB_content  column_nameC_content
0                     2                     3                     1
1                     1                     2                     0
Finally, if necessary, subtract the column_nameA counts from the column_nameB counts:
cola = df.columns.str.startswith('column_nameA')
colb = df.columns.str.startswith('column_nameB')

df.loc[:, colb] = df.loc[:, colb] - df.loc[:, cola].to_numpy()
print(df)

   column_nameA_title  column_nameB_title  column_nameC_title  \
0                   1                   0                   0
1                   0                   0                   0

   column_nameA_content  column_nameB_content  column_nameC_content
0                     2                     1                     1
1                     1                     1                     0
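Not part of the original answer, but if you want to run the snippet above as-is, here is a sketch of the input DataFrame built from the question's two sample rows (running the loop on this frame reproduces the printed counts above):
import pandas as pd

df = pd.DataFrame({
    'column_nameA': ['AAA company', 'BBB company'],
    'column_nameB': ['AAA', 'BBB'],
    'column_nameC': ['Ben Simons', 'Alex Wong'],
    'title': ['AAA company has new product lanuch.',
              'AAA company has new product lanuch.'],
    'content': ['AAA company has released new product. AAA claims that the product X '
                'has significant changed than before. Ben Simons, who is AAA company '
                'CEO, also mentioned.......',
                'AAA company has released new product. BBB claims that the product X '
                'has significant changed than before, and BBB company has invested '
                'around 1 millions…....'],
})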

Remove group rows

How do I remove group rows based on other columns for a particular ID such that:
ID   Att  Comp Att.  Inc. Att
aaa    2          0         2
aaa    2          0         2
bbb    3          1         2
bbb    3          1         2
bbb    3          0         2
becomes:
ID   Att  Comp Att.  Inc. Att
aaa    2          0         2
bbb    3          1         2
I need to discard rows that are not just exact duplicates but that also imply the same data based on the other columns.
Use drop_duplicates -- see the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
I can't tell for sure from your description what you want to treat as a duplicate, but you can tell drop_duplicates which column(s) to look at.
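A minimal sketch of that suggestion, assuming the sample frame and the (guessed) column names from the question:
import pandas as pd

df = pd.DataFrame(
    [['aaa', 2, 0, 2],
     ['aaa', 2, 0, 2],
     ['bbb', 3, 1, 2],
     ['bbb', 3, 1, 2],
     ['bbb', 3, 0, 2]],
    columns=['ID', 'Att', 'Comp Att.', 'Inc. Att'],
)

# Drop exact duplicates only (still keeps both distinct bbb rows)
print(df.drop_duplicates())

# Or treat rows as duplicates based on a chosen subset of columns and keep the
# first one per group, which matches the expected output from the question
print(df.drop_duplicates(subset=['ID'], keep='first'))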

Count duplicate rows in a datatable using vb.net

I have a datatable called dtstore with 4 columns called section, department, palletnumber and batchnumber. I am trying to make a new datatable called dtmulti, which has an extra column called multi showing the count of duplicate rows...
dtstore
section | department | palletnumber | batchnumber
---------------------------------------------------
pipes 2012 1234 21
taps 2011 5678 345
pipes 2012 1234 21
taps 2011 5678 345
taps 2011 5678 345
plugs 2009 7643 63
dtmulti
section | department | palletnumber | batchnumber | multi
----------------------------------------------------------
pipes 2012 1234 21 2
taps 2011 5678 345 3
I have tried lots of approaches, but my code always feels clumsy and bloated; is there an efficient way to do this?
Here is the code I am using:
Dim dupmulti = dataTable.AsEnumerable().GroupBy(Function(i) i).Where(Function(g) g.Count() = 2).Select(Function(g) g.Key)
For Each row In dupmulti
    multirow("Section") = dup("Section")
    multirow("Department") = dup("Department")
    multirow("PalletNumber") = dup("PalletNumber")
    multirow("BatchNumber") = dup("BatchNumber")
    multirow("Multi") = 2
Next
Assumptions of the code below: the DataTable containing the original information is called dup. It might contain any number of duplicates, and all of them can be identified by looking at just the first column.
'Creating the final table from the columns in the original table
Dim multirow As DataTable = New DataTable
For Each col As DataColumn In dup.Columns
    multirow.Columns.Add(col.ColumnName, col.DataType)
Next
multirow.Columns.Add("multi", GetType(Integer))

'Looping through the grouped rows (= no duplicates) on account of the first column
For Each groups In dup.AsEnumerable().GroupBy(Function(x) x(0))
    Dim newRow As DataRow = multirow.NewRow()
    Dim firstRow As DataRow = groups.First()

    'Copying the values of the group's first row into all columns except the added one
    For c As Integer = 0 To dup.Columns.Count - 1
        newRow(c) = firstRow(c)
    Next

    'Adding the last cell (duplicates count)
    newRow(multirow.Columns.Count - 1) = groups.Count()
    multirow.Rows.Add(newRow)
Next

How can I identify the text values that have the lowest row IDs across 4 columns?

I found a few articles that are close, but not the same as what I am trying to do. I have an Excel file that has 4 columns of duplicated data; each column is sorted based on a numeric value that came from a different worksheet.
I need to identify the 25 (or so?) rows where the value of the four columns match and the row ID is the lowest. There will be roughly 250 rows of data to sift through, so I only really need the top 10%.
I don't HAVE to approach it this way. I can dump this data into Access if this cannot be done in Excel. Or I can assign columns next to each text column (a way of assigning IDs to each field in column 1, 2, 3, and 4) and use those values. The approach is negotiable, as long as the outcome works.
Here's what my data looks like in Excel:
A    B    C    D
abc  bcd  abc  def
cde  fgh  def  bcd
def  def  bcd  abc
bcd  hji  xyz  lmn
So in this case I would want to highlight (or somehow identify) the value "def" because it appears closest to the top of all 4 columns, hence it has the lowest row ID. The value "bcd" would be second on the list since it also is identified in all 4 and has a low row id.
Any suggestions would be appreciated. I know SQL fairly well, so if you think dumping it in a DB would be best and you can suggest a query that would be awesome. But ideally... keeping it in Excel would be the least amount of work for me. I'm open to formulas, conditional formatting, etc.
Thanks!!
I THINK I came up with a fairly cool solution...
So, supposing you have this data in columns A-D, beginning in cell A2, say.
Now, you know that you ONLY want values if they already exist in column A - Otherwise they're not in all 4 columns.
So:
In E2, type in the formula =Row() - this basically says where A's value is located.
In F2, type in =Match($A2,B:B,0) - this will find the first match for A2's value in column B.
Drag that formula across to G2 & H2 (to find the first match for A2's value in C & D respectively).
In I2, type in the formula =Sum(E2:H2).
Now, drag E:I down for your entire dataset.
So, if column I shows #N/A, that means the value wasn't in all 4 columns.
And the lower the value in column I, the better the rank of the match (column A's text being the value you're matching for).
Now you could sort according to column I, etc., to suit your needs.
Hope this does the trick (and makes sense)!
Cool Q, BTW!!!
Do you have, or can you create, a master list of all of the possible cell values? If so, then some simple VLOOKUPs on each of the 4 data columns could give, for each unique cell value, the row number in each column. Add up the 4 results and sort on the total.
If you don't have the master list of unique values, I'd tend to go to Access because it's a pretty easy set of queries to get what you want.
Clarification Needed
When I first came up with this answer I used the same approach that John used in his clever Excel answer, namely to use the sum of the minimum rows per column to produce the rank. That produces the sample result in the question, but consider the following modified test data:
F1 F2 F3 F4 RowNum
--- --- --- --- ------
XXX bar baz bat 1
foo XXX baz bat 2
YYY bar XXX bat 3
foo YYY baz bat 4
foo bar YYY bat 5
foo bar baz YYY 6
foo bar baz bat 7
foo bar baz bat 8
foo bar baz bat 9
foo bar baz XXX 10
XXX appears in rows 1, 2, 3, and 10, so the sum would be 16. YYY appears in rows 3, 4, 5, and 6 so the sum would be 18. Ranking by sum would declare XXX the winner, even though if you started scanning for XXX from row 1 you would have to go all the way to row 10 to reach the last XXX, whereas if you started scanning for YYY from row 1 you would only have to go down to row 6 to reach the last YYY.
In this case should YYY actually be the winner?
(original answer)
The following code will import the Excel data into Access and add a [RowNum] column
Sub ImportExcelData()
    On Error Resume Next '' in case it doesn't already exist
    DoCmd.DeleteObject acTable, "ExcelData"
    On Error GoTo 0
    DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12Xml, "ExcelData", "C:\Users\Gord\Documents\ExcelData.xlsx", False
    CurrentDb.Execute "ALTER TABLE ExcelData ADD COLUMN RowNum AUTOINCREMENT(1,1)", dbFailOnError
End Sub
So now we have an [ExcelData] table in Access like this
F1 F2 F3 F4 RowNum
--- --- --- --- ------
abc bcd abc def 1
cde fgh def bcd 2
def def bcd abc 3
bcd hji xyz lmn 4
Let's create a saved query named ExcelItems in Access to string the entries out in a long "list"...
SELECT F1 AS Item, RowNum, 1 AS ColNum FROM ExcelData
UNION ALL
SELECT F2 AS Item, RowNum, 2 AS ColNum FROM ExcelData
UNION ALL
SELECT F3 AS Item, RowNum, 3 AS ColNum FROM ExcelData
UNION ALL
SELECT F4 AS Item, RowNum, 4 AS ColNum FROM ExcelData
...returning...
Item RowNum ColNum
---- ------ ------
abc 1 1
cde 2 1
def 3 1
bcd 4 1
bcd 1 2
fgh 2 2
def 3 2
hji 4 2
abc 1 3
def 2 3
bcd 3 3
xyz 4 3
def 1 4
bcd 2 4
abc 3 4
lmn 4 4
Now we can find the lowest RowNum where Item is found for each ColNum...
TRANSFORM Min(ExcelItems.[RowNum]) AS MinOfRowNum
SELECT ExcelItems.[Item]
FROM ExcelItems
GROUP BY ExcelItems.[Item]
PIVOT ExcelItems.[ColNum] In (1,2,3,4);
...returning...
Item  1  2  3  4
----  -  -  -  -
abc   1     1  3
bcd   4  1  3  2
cde   2
def   3  3  2  1
fgh      2
hji      4
lmn            4
xyz         4
If we save that query as ExcelItems_Crosstab then we can use it to rank the items that appear in all four columns:
SELECT Item, [1]+[2]+[3]+[4] AS Rank
FROM ExcelItems_Crosstab
WHERE ([1]+[2]+[3]+[4]) IS NOT NULL
ORDER BY 2
...returning...
Item Rank
---- ----
def 9
bcd 10
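Neither answer uses pandas, but since most of this page is pandas-oriented, here is a hedged sketch of the same idea - rank values by the sum of their first-occurrence rows across the four columns - using the question's sample data; it reproduces the def 9 / bcd 10 ranking above:
import pandas as pd

# The four columns from the question
df = pd.DataFrame({
    'A': ['abc', 'cde', 'def', 'bcd'],
    'B': ['bcd', 'fgh', 'def', 'hji'],
    'C': ['abc', 'def', 'bcd', 'xyz'],
    'D': ['def', 'bcd', 'abc', 'lmn'],
})

# One long (row, column, item) list, like the ExcelItems UNION query
items = df.stack().rename_axis(['row', 'col']).reset_index(name='item')

# First (1-based) row where each item appears in each column, like the crosstab
first = items.groupby(['item', 'col'])['row'].min().unstack() + 1

# Keep items present in all four columns and rank by the summed row positions
rank = first.dropna().sum(axis=1).sort_values()
print(rank)  # def 9.0, bcd 10.0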

SQL - querying without duplicates based on another column / improving the condition?

I have written a query which involves joins and finally returns the result below:
Name ID
AAA 1
BBB 1
BBB 6
CCC 1
CCC 6
DDD 6
EEE 1
But I want the result to be filtered further so that, for duplicate values in the first column, the row with the lesser ID is ignored, i.e. the CCC and BBB rows with value 1 should be removed. The result should be:
AAA 1
BBB 6
CCC 6
DDD 6
EEE 1
Note: I have a condition WHERE (ID = '6' OR ID = '1'); is there any way to improve this condition so that it means "WHERE ID = 6, or ID = 1 if no 6 is available in that table"?
You will likely want to add:
GROUP BY name
to the bottom of your query, and change ID to MAX(ID) in your SELECT statement.
It is hard to give a more specific answer without seeing the query you've already written.