Pandas - Extract all content based on certain keywords


I am trying to extract all the content of a DataFrame column up to the point where a specific word appears. The text should be cut at the first occurrence of any of the following words:
high, medium, low
Sample view of the text in the Dataframe:
text
Ticket creation dropped in last 24 hours medium range for cust_a
Calls dropped in last 3 months high range for cust_x
Expected output:
text, new_text
Ticket creation dropped in last 24 hours medium range for cust_a, Ticket creation dropped in last 24 hours
Calls dropped in last 3 months high range for cust_x, Calls dropped in last 3 months

You need str.replace with a regex.
The idea is to match any word from your list and then replace it, and everything after it, with an empty string.
We use .* to match anything until the end of the string:
words = 'high, medium, low'
match_words = '|'.join(words.split(', '))
# 'high|medium|low'
df['new_text'] = df['text'].str.replace(f"({match_words}).*", '', regex=True)
print(df['new_text'])
0 Ticket creation dropped in last 24 hours
1 Calls dropped in last 3 months
Name: new_text, dtype: object
Note that this pattern keeps the space that precedes the keyword, so chain .str.strip() if trailing whitespace matters.
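The pattern above also fires on keywords embedded in longer words (for example "low" inside "slow"). A minimal sketch of a stricter variant follows, assuming whole-word matching is wanted; the `keywords` list and the `pattern` name are illustrative, not part of the original answer:

```python
import re

import pandas as pd

df = pd.DataFrame({"text": [
    "Ticket creation dropped in last 24 hours medium range for cust_a",
    "Calls dropped in last 3 months high range for cust_x",
]})

keywords = ["high", "medium", "low"]  # illustrative list of stop words

# \b restricts the match to whole words; the leading \s* also removes
# the whitespace in front of the keyword, so no strip() is needed.
pattern = r"\s*\b(?:" + "|".join(map(re.escape, keywords)) + r")\b.*"

df["new_text"] = df["text"].str.replace(pattern, "", regex=True)
print(df["new_text"].tolist())
```

Rows that contain none of the keywords are left unchanged, since there is nothing for the pattern to match.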

Related

updating the next several row values based on the value of a row in another column

I'm trying to figure out how to add the values of one column (the amount column) to the next few rows based on the condition of another column (the days column). If the condition of the days column is greater than 1, for each day greater than 1 I add the amount column to that many following rows. So if days is three, I add the amount to the next two rows (the first day is just the current row). I actually think this is easier if I make a copy of the amount column, so I made a copy called backlog.
So let's say I have an amount column that represents the amount of support tickets that need to be resolved each day. Each amount has a number of days it takes for the amount to be resolved. I need the total amount to be a sum of the value today and the sum of the outstanding tickets. So if I have an amount of 1 for 2 days, I have 1 ticket amount today and I add that same 1 tomorrow to the ticket amount of tomorrow. If this doesn't make sense, the below examples will. I have a solution as well, but my main issue is doing this efficiently.
Here is a sample dataframe to use:
import random
import numpy as np
import pandas as pd

# 10 zero-amount days plus 15 days with 1-3 tickets, in random order
amount = list(np.zeros(10)) + [random.randint(1,3) for val in range(15)]
random.shuffle(amount)
ex = pd.DataFrame({
    'Amount': amount
})
ex.loc[ex['Amount']>0, 'Days'] = [random.randint(0,4) for val in range(15)]
ex.loc[ex['Amount']==0, 'Days'] = 0
ex['Days'] = ex['Days'].astype(int)
ex['Backlog'] = ex['Amount']
ex.head(10)
Input Dataframe:
Amount  Days  Backlog
2       0     2
1       3     1
2       2     2
3       0     3
Desired Output Dataframe:
Amount  Days  Backlog
2       0     2
1       3     1
2       2     3
3       0     6
In the last two values of the backlog column, I have a value of 3 (2 from the current day amount plus 1 from the prior day amount) and a value of 6 (3 for the current day + 2 from the previous day amount + 1 from two days ago).
I have made code for this below, which I think achieves the outcome:
for i in range(0, len(ex['Amount'])):
    Days = ex['Days'].iloc[i]
    if Days >= 2:
        for j in range(1, Days):
            if (i+j) >= len(ex['Amount']):
                break
            ex['Backlog'].iloc[i+j] += ex['Amount'].iloc[i]
The problem is that I'm already using two for loops to slice the data frame for two features first, so when this code is used as a function for a very large data frame it runs far too slowly, and my main goal has been to implement a faster way to do this. Is there a more efficient pandas method to achieve the same outcome? Possibly without having to use slow iteration or a nested for loop? I'm at a loss.
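One way to avoid the nested loop is a difference array followed by a cumulative sum. The following is only a sketch under the rules stated in the question (a row with Days >= 2 adds its Amount to the next Days - 1 rows); the function name `add_backlog` is hypothetical:

```python
import numpy as np
import pandas as pd


def add_backlog(ex: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of ex with Backlog = Amount + amounts carried over from earlier rows."""
    n = len(ex)
    amount = ex["Amount"].to_numpy(dtype=float)
    days = ex["Days"].to_numpy(dtype=int)

    # Rows whose amount spills into following rows.
    idx = np.flatnonzero(days >= 2)

    # Difference array: +amount where the carry-over starts (row i+1),
    # -amount just past where it ends (row i+Days), clipped to the frame.
    diff = np.zeros(n + 1)
    start = idx + 1
    stop = np.minimum(idx + days[idx], n)
    np.add.at(diff, start, amount[idx])   # add.at accumulates repeated indices
    np.add.at(diff, stop, -amount[idx])

    carried = np.cumsum(diff[:n])

    out = ex.copy()
    out["Backlog"] = amount + carried
    return out


sample = pd.DataFrame({"Amount": [2, 1, 2, 3], "Days": [0, 3, 2, 0]})
result = add_backlog(sample)
print(result)
```

On the four-row example from the question this gives a Backlog of 2, 1, 3, 6. The work is proportional to the number of rows rather than to rows times days, which is where the speed-up over the loop would come from.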

Pandas create graph from Date and Time while them being in different columns

My data looks like this:
Creation Day Time St1 Time St2
0 28.01.2022 14:18:00 15:12:00
1 28.01.2022 14:35:00 16:01:00
2 29.01.2022 00:07:00 03:04:00
3 30.01.2022 17:03:00 22:12:00
It represents parts being at a given station. What I need now is something that counts how many rows share the same day and hour, i.e. how many parts were at the same station during a given hour.
Here, 2 parts were at Station 1 on the 28th in the 14-15 timespan.
In the end I want a bar graph that shows production speed. Later in the project I also want to highlight parts that haven't moved for more than 2 hours.
Is it practical to create a datetime object for every Station (I have 5 in total)? Or is there a much simpler way to do this?
FYI I import this data from an excel sheet

I found the solution. As they are just strings, I can concatenate them and parse the result with pd.to_datetime().
Example:
df["Time St1"] = pd.to_datetime(
    df["Creation Day"] + ' ' + df["Time St1"],
    format='%d.%m.%Y %H:%M:%S'  # an explicit format makes infer_datetime_format unnecessary
)
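Once the day and time are combined, the per-hour count asked for in the question could be obtained by flooring each timestamp to the hour and counting. This is a sketch for Station 1 only, using the sample rows from the question; the variable names `st1` and `parts_per_hour` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Creation Day": ["28.01.2022", "28.01.2022", "29.01.2022", "30.01.2022"],
    "Time St1": ["14:18:00", "14:35:00", "00:07:00", "17:03:00"],
})

# Combine the two string columns into one datetime column.
st1 = pd.to_datetime(
    df["Creation Day"] + " " + df["Time St1"],
    format="%d.%m.%Y %H:%M:%S",
)

# Round every timestamp down to the start of its hour, then count rows per hour.
parts_per_hour = st1.dt.floor("60min").value_counts().sort_index()
print(parts_per_hour)
```

Calling `parts_per_hour.plot.bar()` should then give the bar graph, assuming matplotlib is installed. The same steps would be repeated for each of the other station columns.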

need to extract all the content between two string in pandas dataframe

I have data in a pandas DataFrame. I need to extract all the content between the string that starts with "Impact Factor:" and ends with "&#". If the content doesn't contain "Impact Factor:", I want null in that row of the DataFrame.
This is sample data from a single row:
Save to EndNote online &# Add to Marked List &# Impact Factor: Journal 2 and Citation Reports 500 &# Other Information &# IDS Number: EW5UR &#
I want the content like the below in a dataframe .
Journal 2 and Citation Reports 500
Journal 6 and Citation Reports 120
Journal 50 and Citation Reports 360
Journal 30 and Citation Reports 120

Hi, you can just use a regular expression here (this needs the re module):
import re
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:(.*?)&#', x))
You may want to strip whitespace too, in which case you could use:
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:\s*(.*?)\s*&#', x))
Note that re.findall returns a list for every row, and an empty list where nothing matches.
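Since the question asks for null where "Impact Factor:" is missing, pandas' own `Series.str.extract` may be a closer fit: it returns the first capture group as a string and NaN when the pattern does not match. A small sketch, with the column names `raw` and `impact_factor` made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"raw": [
    "Save to EndNote online &# Add to Marked List &# Impact Factor: "
    "Journal 2 and Citation Reports 500 &# Other Information &# IDS Number: EW5UR &#",
    "Save to EndNote online &# Other Information &#",  # no impact factor here
]})

# expand=False returns a Series instead of a one-column DataFrame.
df["impact_factor"] = df["raw"].str.extract(
    r"Impact Factor:\s*(.*?)\s*&#", expand=False
)
print(df["impact_factor"])
```

Only the first occurrence per row is captured; `Series.str.extractall` would be the option if a row can hold several impact factors.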

How to find multiple subsets of numbers that are approximately equal to a given value?

I am using VBA that gets data from an Excel 2013 spreadsheet. I have a couple years experience in computer science from a while back using VBA and java, but I'm by no means an expert.
I have a column of numbers ranging from 20 to 60 total. Each of those numbers represents 'minutes' and can range from 3 to 500 (normally 60 to 300). Each number has an assigner called a 'load number' (such as N03, N22 and etc.) and a date/time. All of these values are attributed to a 'load' that needs to be picked. 'Pickers' are the ones that have the loads or minutes assigned to them. They can only pick so many minutes per given day which ranges from 400-600 (8 hour shift = 400 minutes).
What I need to do is assign sets of loads that are equal to an approximate amount of total minutes (set number w/ threshold) to two groups of pickers (The groups are AM and PM, each have 3-5 pickers). Once one load is assigned to a picker, it can't be assigned to another UNLESS the loads for a given day have too many minutes and all the pickers can't be assigned an approximate amount of minutes.
Example: Out of 8 pickers, 6 can be assigned loads totaling between 380-420 minutes, but 2 can't be assigned between 380-420 because of the remaining loads.
In the case of the given example, for the remaining 2 pickers, a total of 760 - 840 minutes can be assigned to BOTH of them.
Loads also need to be assigned based on their date/time. If pickers are picking loads due on the same day, the earliest loads need to be assigned to the AM group of pickers and, accordingly, the latest to the PM group of pickers. If all loads to be assigned are for the next day, they can be assigned to anyone as long as the earliest loads are prioritized.
Example: AM shift starts at 5AM w/ 5 pickers. There are three loads that are 200 minutes (4 hours, actual) due at 9AM on the same day.
The three loads should be assigned to three different pickers, so the loads can be done on time. They would be marked as the #1 load, so each picker knows to do it first
Example: Another load is due at 9AM on the same day. It is 400 minutes though.
2 pickers can be assigned to this load as their #1 pick and 200 minutes would be assigned to both of them.
Once the loads are assigned to the pickers, the results will be displayed in a separate spreadsheet with each row having: AM/PM, Picker's name, Load number #'s 1-10 w/ load number and minutes to pick and the total minutes.
Example: PICKER | AM | Toby | 029-N10 (268), 030-N05 (93), 030-N04 (111) | 472 TOTAL
Any help / pointers on this problem would be appreciated. I've looked at similar questions posted here and elsewhere, but couldn't find any that gave me enough to go on to start working on a solution. It's not too bad assigning loads manually, but it gets complex once there are over 30 loads and 4,000 minutes total, especially when most of them are large. It would just be much easier to have a program assign everything and save 1-2 hours every day.
Edit:
The data, in Excel, is structured into 8 columns and up to 50 rows. Each row represents a 'load' and has only 3 useful cells. I got all the information into three arrays, which can be used to display the info for any load by using the same element (1-50) for each array.
Dim LoadNumbers(1 To 50) As String
Dim LoadTimes(1 To 50) As Double
Dim LoadMinutes(1 To 50) As Double
Dim C As Integer
C = 1
' Use <= so that row 50 is read as well
Do While C <= 50
    LoadNumbers(C) = Cells(C, 2)
    LoadTimes(C) = Cells(C, 5) * 24   ' day fraction -> hours
    LoadMinutes(C) = Cells(C, 7)
    C = C + 1
Loop
For example:
LoadNumbers(5) & " # " & LoadTimes(5) & " Hours PST # " & LoadMinutes(5) & " Minutes"
Will return:
039-N06  # 9.5 Hours PST # 67.4 Minutes (9.5 hours = 9:30AM)
The LoadTimes and LoadMinutes arrays are the ones I need to assign loads. I will have another two cells where users will input the desired minutes (M) to be assigned and the threshold (T). I then need the VBA script to assign (M-T to M+T) minutes to each picker.
Here's what the values in LoadMinutes look like:
141.8
96
73.7
32.2
67.4
106.1
21.3
14.2
141.6
49.5
68.6
200.6
72
174.9
223.1
161.8
76.6
235.5
76.2
134.9
236.7
166.3
170.7
134.6
63.9
352.9
136.2
146.3
243.2
There are 29 loads totaling 3,818 minutes.
Let's say the minutes need to be between 430 and 470. Out of those 29 loads, I need to assign sets of different numbers adding up to 430 to 470 based on their time. The times in LoadTimes range from 7 to 20 (7AM to 8PM).

how to calculate a rolling average based on a column in spotfire

I have a data set with a Document Property that selects "items"; each "item" has a particular "usage days" value. I want to calculate a "Moving Average" for one or more selected items. The data for the moving average lives in a column named "usage days".
How do I calculate this, taking into account a selected date of my choice and a rolling-average window (in days) of my choice?
Do you have any ideas on how I can perform the calculation, e.g. in a calculated column or a text field?
Car/ Trip / Start Date/ End Date / Days on trip
1 AB123 / 2 / 6/07/2013
1 AB234 / 29/07/2013 / 6/09/2013 / 42
1 AB345 /6/09/2013 /28/09/2013 /22
1 AB456 /29/09/2013 /21/10/2013 /23
2 AB567 / 26/10/2013 / 12/11/2013 / 22
2 AB678 /12/11/2013 /8/12/2013 /26
[The rows above show an example of the problem (sorry, I couldn't paste an image because I'm new). I want to calculate the % usage of a car or cars for a selected range of time, e.g. select the date range July to August, then ((# of days on trip for cars 1 and 2) / # of days in that period) / 2 * 100]

As phiver said, it is still difficult to see what you expect as a result... but I think I have something that might work. First, I slightly altered the dataset you provided, like so:
car trip startDate endDate daysOnTrip
1 AB123 7/6/2013 7/29/2013 23
1 AB234 7/29/2013 9/6/2013 42
1 AB345 9/6/2013 9/28/2013 22
1 AB456 9/29/2013 10/21/2013 23
2 AB567 10/26/2013 11/12/2013 22
2 AB678 11/12/2013 12/8/2013 26
I then added 2 document properties, "DateRangeFirst" and "DateRangeLast", to allow the user to select beginning and ending dates. Next I made input box property controls for each of the aforementioned document properties in a text area so the user can alter the date range. I then added a datatable visualization with a "Limit data using expression:" of "[startDate] >= Date(${DateRangeFirst}) and [endDate]<= Date(${DateRangeLast})" so we could see the trips selected. Finally, to get the average you appear to be looking for, a barchart set to % of total (daysOnTrip) / car with the same data limiting expression as above. The below screenshot should have everything you need to reproduce my results. I hope this gives you what you need.
NOTE: With this method if you select a date in the middle of a trip, an entire row and all of the days on that trip will be ignored.
