*Edited for clarity
I need to find values from ['DateA'], ['DateB'], etc. that fall between a range of dates ['Start'] and ['End'], and return a running count to ['A'], ['B'], etc. for every time ['DateA'], ['DateB'], etc. falls within the ['Start']/['End'] range.
The start datetime and end datetime are overlapping 53-week increments. If any of the dates (A, B, C, or D) are within the 53-week window, the script should count those datetimes and return an integer. To add to the confusion, DateD should only be counted if the datetime coincides with a value of "Fail" in the Pass_Fail column. For example:
Expected Output:
Start     End       A  B  C  D
10/30/19  11/04/20  0  0  4  3
11/06/19  11/11/20  1  0  3  3
11/13/19  11/18/20  1  0  3  3
Dates DataFrame (simplified)
- The coinciding Fail dates below are 02/13/20, 06/21/20, and 07/15/20 (hence the 3 in column D of the Expected Output above).
    DateA     DateB  DateC     DateD     Pass_Fail
0   11/07/20  None   06/21/20  02/09/20  Pass
1   None      None   06/11/20  12/14/19  Pass
2   None      None   09/21/19  03/26/20  Pass
3   None      None   03/20/20  02/13/20  Fail
4   None      None   08/16/20  06/21/20  Fail
5   None      None   None      01/06/20  Pass
6   None      None   None      04/03/20  Pass
7   None      None   None      07/15/20  Fail
8   None      None   None      02/20/20  Pass
9   None      None   None      03/22/20  Pass
10  None      None   None      11/15/19  Pass
I'm sure this is simple, but I'm just starting out and couldn't find a direct answer or solve this myself.
Many thanks!
-DD
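One way this could be approached, as a rough sketch rather than a worked answer (the frame names dates_df and windows_df are assumptions, and the dates are parsed with pd.to_datetime):
import pandas as pd

# Parse the date columns; None becomes NaT, which never falls inside a window.
date_cols = ['DateA', 'DateB', 'DateC', 'DateD']
dates = dates_df[date_cols].apply(pd.to_datetime, errors='coerce')
windows = windows_df[['Start', 'End']].apply(pd.to_datetime)

# Count DateA/DateB/DateC occurrences inside each Start/End window.
for col, out in zip(['DateA', 'DateB', 'DateC'], ['A', 'B', 'C']):
    windows[out] = [dates[col].between(start, end).sum()
                    for start, end in zip(windows['Start'], windows['End'])]

# DateD only counts when the row's Pass_Fail is 'Fail'.
failed = dates.loc[dates_df['Pass_Fail'] == 'Fail', 'DateD']
windows['D'] = [failed.between(start, end).sum()
                for start, end in zip(windows['Start'], windows['End'])]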
Related
I am trying to change the values in the DataFrame below to ints so I can convert these hh/mm/ss times into a numeric value in hours (e.g. for row 2, hrs_cor would equal 5.5).
hrs mins secs
0 None None
1 None None
2 5 30 00
3 5 22 30
4 8 00 00
... .. ... ...
1052 None None
1053 None None
1054 None None
1055 None None
1056 None None
The issue I am running into is converting the DataFrame to numeric values, and I think it is due to the empty cells. So far I have tried variations of the code below:
MID_calc['hrs'] = MID_calc.to_numeric(MID_calc['hrs'], errors='coerce').astype('INT46')
And this error is returned:
AttributeError: 'DataFrame' object has no attribute 'to_numeric'
Currently, all values are objects
hrs object
mins object
secs object
dtype: object
I have looked through several posts, but nothing seems to be working. Any help would be greatly appreciated!
You need to use
import pandas as pd
MID_calc['hrs'] = pd.to_numeric(MID_calc['hrs'], errors='coerce').astype('Int64')
Don't just copy the code directly from the website; understand what it does.
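To get the single hours value the question is after (e.g. 5.5 for 5 h 30 m 00 s), a minimal sketch building on the same idea; the column names and hrs_cor come from the post, everything else is an assumption:
import pandas as pd

# Coerce each part to numeric; empty cells / None become NaN.
for col in ['hrs', 'mins', 'secs']:
    MID_calc[col] = pd.to_numeric(MID_calc[col], errors='coerce')

# Combine into fractional hours, e.g. 5 h 30 m 00 s -> 5.5.
MID_calc['hrs_cor'] = (MID_calc['hrs']
                       + MID_calc['mins'] / 60
                       + MID_calc['secs'] / 3600)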
I can think of 2 ways of doing this:
Apply df.query to match each row, then collect the index of each result
Set the column domain to be the index, and then reorder based on the index (but this would lose the index which I want, so may be trickier)
However I'm not sure these are good solutions (I may be missing something obvious)
Here's an example set up:
import pandas as pd

domain_vals = list("ABCDEF")
df_domain_vals = list("DECAFB")
df_num_vals = [0, 5, 10, 15, 20, 25]
df = pd.DataFrame.from_dict({"domain": df_domain_vals, "num": df_num_vals})
This gives df:
domain num
0 D 0
1 E 5
2 C 10
3 A 15
4 F 20
5 B 25
1: Use df.query on each row
So I want to reorder the rows so that the domain column follows the order given in domain_vals.
A possible way to do this is to repeatedly use df.query but this seems like an un-Pythonic (un-panda-ese?) solution:
>>> pd.concat([df.query(f"domain == '{d}'") for d in domain_vals])
domain num
3 A 15
5 B 25
2 C 10
0 D 0
1 E 5
4 F 20
2: Setting the column domain as the index
reorder = df.domain.apply(lambda x: domain_vals.index(x))
df_reorder = df.set_index(reorder)
df_reorder.sort_index(inplace=True)
df_reorder.index.name = None
Again this gives
>>> df_reorder
domain num
0 A 15
1 B 25
2 C 10
3 D 0
4 E 5
5 F 20
Can anyone suggest something better (in the sense of "less of a hack")? I understand that my solution works; I just don't think that calling pandas.concat along with a list comprehension is the right approach here.
Having said that, it's shorter than the 2nd option, so I presume there must be some equally simple way to do this with pandas methods that I've overlooked?
Another way is merge (note the left frame has to hold the target order, domain_vals):
(pd.DataFrame({'domain': domain_vals})
   .merge(df, on='domain', how='left')
)
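If the original integer index needs to be preserved (one of the concerns raised in the question), a small variation of the same merge, offered as a sketch:
ordered = (pd.DataFrame({'domain': domain_vals})
           .merge(df.reset_index(), on='domain', how='left')
           .set_index('index'))
ordered.index.name = None   # drop the leftover 'index' label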
I am trying to fill each row of a new column (Previous time) with a value from the previous row of a specific subset (when a condition is met). The strange thing is that if I interrupt the kernel and check the values, they are OK, but if it runs to the end, all rows of the new column are filled with None. If a previous row doesn't exist, I fill it with the first value.
Name      First round  Previous time
Runner 1  2            2
Runner 2  5            5
Runner 3  5            5
Runner 1  6            2
Runner 2  8            5
Runner 3  4            5
Runner 1  2            6
Runner 2  5            8
Runner 3  5            4
What I tried:
df.insert(column = "Previous time", value = 999)

def fce(arg):
    runner = arg[0]
    stat = arg[1]
    if stat == 999:
        # I used this to avoid filling all rows in a new column again for the same runner
        first = df.loc[df['Name'] == runner, "First round"].iloc[0]
        df.loc[df['Name'] == runner, "Previous time"] = df.loc[df['Name'] == runner]["First round"].shift(1, fill_value=first)

df["Previous time"] = df[['Name', "Previous time"]].apply(fce, axis=1)
Conduct a groupby shift for each Name and fill the missing values with the original series.
df['Previous time'] = (df.groupby('Name')['First round']
                         .shift()
                         .fillna(df['First round'], downcast='infer'))
The problem is that your function fce returns None for every row, so the Series produced by the term df[['Name', "Previous time"]].apply(fce, axis=1) is a Series of None.
That is, instead of overwriting the DataFrame with df.loc inside the function, you need to return the value to fill for that position. Unfortunately, this is impossible here, since you would then need to know which indices you have already calculated.
A better way to do it is to use groupby. This is more natural, since you want to perform an action on each group. If you use apply after groupby and return a Series, you in fact define a value for each row. Just remember to remove the extra index level "Name" that groupby adds.
def fce(g):
    first = g["First round"].iloc[0]
    return g["First round"].shift(1, fill_value=first)

df["Previous time"] = df.groupby("Name").apply(fce).reset_index("Name", drop=True)
Thank you very much. Can you please answer one more question? How does it work with groupby on multiple columns, if I want to return the mean of all rounds for a specific runner and sleep time before the race?
Expected output:
Name      First round  Sleep before race  Mean
Runner 1  2            8                  4
Runner 2  5            7                  6
Runner 3  5            8                  5
Runner 1  6            8                  4
Runner 2  8            7                  6
Runner 3  4            9                  4.5
Runner 1  2            9                  2
Runner 2  5            7                  6
Runner 3  5            9                  4.5
This does not work for me.
def last_season(g):
    aa = g["First round"].mean()

df["Mean"] = df.groupby(["Name", "Sleep before race"]).apply(g).reset_index(["Name", "Sleep before race"], drop=True)
I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting (in this case) in a list of DataFrames of the form:
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks is important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is the code for my current workaround, adapted to the above example:
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 > 10 using 'loc[]'
subset = df.loc[df['col2'] > 10]
block_start = 0
block_end = None
# loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i] - subset.index[i-1] > 1:
        # ... this is the current block's end
        next_block_start = i
        # extract the according block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        # the next_block_start index is now the new block's starting index
        block_start = next_block_start
# close and add last block
blocks.append(subset[block_start:])
Edit: I was by mistake previously referring to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First, check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum generates a step function, which we set to zero (via the mask column) where this is not a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
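For the example frame above this yields grp = [1, 1, 0, 0, 0, 2, 2, 2, 0, 0], i.e. group 1 covers rows 0 and 1, and group 2 covers rows 5 to 7.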
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames:
import numpy as np

grp = {}
# np.unique(s)[1:] skips group 0, i.e. the rows that did not match the condition
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
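The dict values are the coherent blocks from the question; for instance, a plain list of them can be built like this (just a small usage sketch):
blocks = list(grp.values())   # for the example data: rows 0-1 and rows 5-7
for block in blocks:
    print(block)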
Can anyone help me?
I need to write a formula to check if some of my fields are null, but I'm not sure how to do this.
There are 4 items contained within one field, and I need to check whether they are null/blank and then mark these as 'None'.
I've tried the formula below, but now I'm finding that everything is showing as 'None' as it's only checking if they are all filled in.
if {VR_ACCESS_Broker.ACCID} <> 17 and
{VR_ACCESS_CHB2B.ACCID} <> 11 and
{VR_ACCESS_Fleet.ACCID} <> 9 and
{VR_ACCESS_Prefs.ACCID} <> 10
then 'None'
So :- if ACCID 1 has been selected but ACCIDs 2,3 & 4 haven't then I want to show ACCID 1's name
Else if ACCID 2 has been selected but ACCIDs 1,3 & 4 haven't, then I want to show ACCID 2's name
and so on
i.e. if none of ACCID 1, 2, 3 & 4 have been selected then I want that to show the name as 'None'
Basically, the result I'm getting is :-
Quote ID Result
48088 None
48088 9
48090 10
48090 None
48091 None
48092 None
48094 9
48094 None
As you can see, in some instances (e.g. Quote ID 48094) there are 2 lines for the quote. What I need the report to state is: if there are any instances of 9, 10, 11 or 17, then just state 9, 10, 11 or 17; otherwise show 'None'.
So I want my results to look like this:-
Quote ID Result Removed (example - I don't need to see this)
48088 None
48088 9
48090 10
48090 None
48091 None
48092 None
48094 9
48094 None
i.e. :--
Quote ID Result
48088 9
48090 10
48091 None
48092 None
48094 9
So that I get 5 quote ID's and can count 2 for 'None', 2 for '9' and 1 for '10'.
Can anyone please help?
Many thanks
Louise
If you need to check for null, then there's no need to write that many conditions; you can just write:
if ISNULL({VR_ACCESS_Broker.ACCID})
Then "None"
Assuming that at any point only one item is present in the field {VR_ACCESS_Broker.ACCID}:
if {VR_ACCESS_Broker.ACCID} <> 17 OR
{VR_ACCESS_CHB2B.ACCID} <> 11 OR
{VR_ACCESS_Fleet.ACCID} <> 9 OR
{VR_ACCESS_Prefs.ACCID} <> 10
then 'None'
Let me know if this is not your requirement.