Python Pandas groupby and join

I am fairly new to python pandas and cannot find the answer to my problem in any older posts.
I have a simple dataframe that looks something like this:
dfA = {'stop': [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915, ...],
       'seq': ['B', 'B', 'D', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'A', ...]}
Now I want to merge the 'seq' values within each group where the difference between consecutive 'stop' values is equal to 1. When the difference is large, as between 5 and 1610, that is where the next cluster begins, and so on.
What I need is to write all values from each cluster into separate rows:
0 BBDAC    # joined 'stop' cluster 1-5
1 CABAC    # joined 'stop' cluster 1610-1614
2 A...     # joined 'stop' cluster 2915-...
etc...
What I am getting with my current code is like:
True BDACABAC...
False BCA...
for the entire huge dataframe.
I understand the logic behind the way it merges, i.e. the condition I specified (not perfect, it loses the cluster edges), but I am running out of ideas for how to get it joined and split properly into clusters, rather than over all rows of the dataframe.
Please see my code below:
dfB = dfA.groupby((dfA.stop - dfA.stop.shift(1) == 1))['seq'].apply(lambda x: ''.join(x)).reset_index()
Please help.
P.S. I have also tried various combinations with diff() but that didn't help either. I am not sure if groupby is any good for this solution as well. Please advise!
dfC = dfA.groupby((dfA['stop'].diff(periods=1)))['seq'].apply(lambda x: ''.join(x)).reset_index()
This somehow split the dataframe into smaller, cluster-like chunks, but I do not understand the logic behind the way it did it, and I know the result makes no sense and is not what I intended to get.

I think you need to create a helper Series for grouping:
g = dfA['stop'].diff().ne(1).cumsum()
dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
print (dfC)
stop seq
0 1 BBDAC
1 2 CABAC
2 3 A
Details:
First get differences by diff:
print (dfA['stop'].diff())
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 1605.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1301.0
Name: stop, dtype: float64
Compare by ne (!=) to flag the first value of each group:
print (dfA['stop'].diff().ne(1))
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
Name: stop, dtype: bool
And last, create the groups by cumsum:
print (dfA['stop'].diff().ne(1).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
Name: stop, dtype: int32
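
Putting it all together, here is a minimal self-contained sketch (the dfA construction uses placeholder values in place of the truncated part of the question's data):
import pandas as pd

# Stand-in for the truncated dfA from the question.
dfA = pd.DataFrame({
    'stop': [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915],
    'seq':  ['B', 'B', 'D', 'A', 'C', 'C', 'A', 'B', 'A', 'C', 'A'],
})

# A new cluster starts wherever the gap to the previous 'stop' is not 1;
# cumsum turns those flags into consecutive group ids.
g = dfA['stop'].diff().ne(1).cumsum()

dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
print (dfC)
   stop    seq
0     1  BBDAC
1     2  CABAC
2     3      A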

I just figured it out.
I managed to round the values of 'stop' down to the nearest 100 and assigned the result as a new column.
Then my previous code works....
Thank you so much for quick answer though.
dfA['new_val'] = (dfA['stop'] / 100).astype(int) *100
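
For reference, a small sketch of how that rounding key behaves on the sample values (same placeholder data as above; note that this only works as a cluster key as long as no cluster crosses a multiple of 100):
dfA['new_val'] = (dfA['stop'] / 100).astype(int) * 100
print (dfA.groupby('new_val')['seq'].apply(''.join).reset_index())
   new_val    seq
0        0  BBDAC
1     1600  CABAC
2     2900      A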

Related

Create a new column based on another column in a dataframe

I have a df with multiple columns. One of my columns is extra_type. Now I want to create a new column based on the values of the extra_type column. For example:
extra_type
NaN
legbyes
wides
byes
Now I want to create a new column with 1s and 0s: if extra_type is not equal to 'wides' then 1, else 0.
I tried it like this:
df1['ball_faced'] = df1[df1['extra_type'].apply(lambda x: 1 if [df1['extra_type']!= 'wides'] else 0)]
It is not working this way. Any help on how to make this work is appreciated.
expected output is like below
extra_type ball_faced
NaN 1
legbyes 1
wides 0
byes 1
Note that there's no need to use apply() or a lambda as in the original question, since comparison of a pandas Series and a string value can be done in a vectorized manner as follows:
df1['ball_faced'] = df1.extra_type.ne('wides').astype(int)
Output:
extra_type ball_faced
0 NaN 1
1 legbyes 1
2 wides 0
3 byes 1
See the pandas docs for ne() and astype().
For some useful insights on when to use apply (and when not to), see this SO question and its answers. TL;DR from the accepted answer: "If you're not sure whether you should be using apply, you probably shouldn't."
df['ball_faced'] = df.extra_type.apply(lambda x: x != 'wides').astype(int)
  extra_type  ball_faced
0        NaN           1
1    legbyes           1
2      wides           0
3       byes           1
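
For completeness, a self-contained sketch of the vectorized version on the sample data (the df1 construction here is an assumption, since the question only lists the column values):
import numpy as np
import pandas as pd

# Hypothetical reconstruction of df1 from the sample values in the question.
df1 = pd.DataFrame({'extra_type': [np.nan, 'legbyes', 'wides', 'byes']})

# ne() returns a boolean Series; astype(int) maps True/False to 1/0.
df1['ball_faced'] = df1['extra_type'].ne('wides').astype(int)
print (df1)
  extra_type  ball_faced
0        NaN           1
1    legbyes           1
2      wides           0
3       byes           1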

Compare Values of 2 dataframes conditionally

I have the following problem. I have a dataframe which looks like this.
Dataframe1
start end
0 0 2
1 3 7
2 8 9
and another dataframe which looks like this.
Dataframe2
data
1 ...
4 ...
8 ...
11 ...
What I am trying to achieve is the following:
For each row in Dataframe1 I want to check if there is any index value in Dataframe2 which is in range(start, end) of Dataframe1.
If the condition is True, I want to create a new column["condition"] where the outcome is stored.
Since I may have to deal with large amounts of data, I tried using numpy.select.
Like this:
range_start = df1.start
range_end = df1.end
condition = [
    df2.index.to_series().between(range_start, range_end)
]
choice = ["True"]
df1["condition"] = np.select(condition, choice, default=0)
This gives me an error:
ValueError: Can only compare identically-labeled Series objects
I also tried a list comprehension, but that didn't work either. Everything I tried fails because I am dealing with Series (range_start, range_end). There has to be a way to make this work, I think.
I have already searched Stack Overflow for this particular problem, but I wasn't able to find a solution. It could be that I'm just too inexperienced with this type of problem to search for the right solution.
So maybe you can help me out here.
Thank you!
expected output:
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
Use DataFrame.drop_duplicates to remove duplicates by both columns and the index, create all combinations with DataFrame.merge using a cross join, and finally test for at least one match per group with GroupBy.any:
df3 = (df1.drop_duplicates(['start','end'])
          .merge(df2.index.drop_duplicates().to_frame(), how='cross'))
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df1.join(df3.groupby(['start','end'])['condition'].any(), on=['start','end'])
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
If all pairs in df1 are unique, it is possible to use:
df3 = df1.merge(df2.index.to_frame(), how='cross')
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
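
To make the second variant reproducible, here is a sketch with a hypothetical reconstruction of the two frames (df2's data values are placeholders, and how='cross' needs pandas 1.2 or newer):
import pandas as pd

df1 = pd.DataFrame({'start': [0, 3, 8], 'end': [2, 7, 9]})
df2 = pd.DataFrame({'data': ['a', 'b', 'c', 'd']}, index=[1, 4, 8, 11])

# Cross join every (start, end) pair with every df2 index value,
# test membership in the range, then reduce back per pair with any().
df3 = df1.merge(df2.index.to_frame(), how='cross')
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print (df3)
   start  end  condition
0      0    2       True
1      3    7       True
2      8    9       True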

Reorder rows of pandas DataFrame according to a known list of values

I can think of 2 ways of doing this:
Apply df.query to match each row, then collect the index of each result
Set the column domain to be the index, and then reorder based on the index (but this would lose the index which I want, so may be trickier)
However I'm not sure these are good solutions (I may be missing something obvious)
Here's an example set up:
domain_vals = list("ABCDEF")
df_domain_vals = list("DECAFB")
df_num_vals = [0,5,10,15,20,25]
df = pd.DataFrame.from_dict({"domain": df_domain_vals, "num": df_num_vals})
This gives df:
domain num
0 D 0
1 E 5
2 C 10
3 A 15
4 F 20
5 B 25
1: Use df.query on each row
So I want to reorder the rows according to the order of the values in domain_vals for the column domain.
A possible way to do this is to repeatedly use df.query but this seems like an un-Pythonic (un-panda-ese?) solution:
>>> pd.concat([df.query(f"domain == '{d}'") for d in domain_vals])
domain num
3 A 15
5 B 25
2 C 10
0 D 0
1 E 5
4 F 20
2: Setting the column domain as the index
reorder = df.domain.apply(lambda x: domain_vals.index(x))
df_reorder = df.set_index(reorder)
df_reorder.sort_index(inplace=True)
df_reorder.index.name = None
Again this gives
>>> df_reorder
domain num
0 A 15
1 B 25
2 C 10
3 D 0
4 E 5
5 F 20
Can anyone suggest something better (in the sense of "less of a hack")? I understand that my solution works, I just don't think that calling pandas.concat along with a list comprehension is the right approach here.
Having said that, it's shorter than the 2nd option, so I presume there must be some equally simple way I can do this with pandas methods I've overlooked?
Another way is merge, building the left frame from domain_vals (the desired order) and left-joining df onto it:
(pd.DataFrame({'domain': domain_vals})
   .merge(df, on='domain', how='left')
)
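
A quick check of the result on the example data (note the merge resets the index to 0..5, unlike the query-based option, which keeps df's original index):
out = (pd.DataFrame({'domain': domain_vals})
         .merge(df, on='domain', how='left'))
print (out)
  domain  num
0      A   15
1      B   25
2      C   10
3      D    0
4      E    5
5      F   20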

Filling previous value by field - Pandas apply function filling None

I am trying to fill each row in a new column ("Previous time") with a value from the previous row of a specific subset (when a condition is met). The thing is that if I interrupt the kernel and check the values, it is OK, but if it runs to the end, then all rows in the new column are filled with None. If a previous row doesn't exist, then I fill it with the first value.
Name First round Previous time
Runner 1 2 2
Runner 2 5 5
Runner 3 5 5
Runner 1 6 2
Runner 2 8 5
Runner 3 4 5
Runner 1 2 6
Runner 2 5 8
Runner 3 5 4
What I tried:
df.insert(column = "Previous time", value = 999)

def fce(arg):
    runner = arg[0]
    stat = arg[1]
    if stat == 999:
        # I used this to avoid filling all rows in a new column again for the same runner
        first = df.loc[df['Name'] == runner, "First round"].iloc[0]
        df.loc[df['Name'] == runner, "Previous time"] = df.loc[df['Name'] == runner]["First round"].shift(1, fill_value = first)

df["Previous time"] = df[['Name', "Previous time"]].apply(fce, axis=1)
Conduct a groupby shift for each Name and fill the missing values with the original series.
df['Previous time'] = (df.groupby('Name')['First round']
                         .shift()
                         .fillna(df['First round'], downcast='infer'))
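
A self-contained sketch on a reconstruction of the sample data (the DataFrame construction is my assumption, based on the table in the question):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Runner 1', 'Runner 2', 'Runner 3'] * 3,
    'First round': [2, 5, 5, 6, 8, 4, 2, 5, 5],
})

# For each runner, shift the previous round down one row, then fall back
# to that runner's first value where no previous row exists.
df['Previous time'] = (df.groupby('Name')['First round']
                         .shift()
                         .fillna(df['First round'], downcast='infer'))
print (df)
       Name  First round  Previous time
0  Runner 1            2              2
1  Runner 2            5              5
2  Runner 3            5              5
3  Runner 1            6              2
4  Runner 2            8              5
5  Runner 3            4              5
6  Runner 1            2              6
7  Runner 2            5              8
8  Runner 3            5              4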
The problem is that your function fce returns None for every row, so the Series produced by the term df[['Name', "Previous time"]].apply(fce, axis=1) is a Series of None.
That is, instead of overriding the DataFrame with df.loc inside the function, you need to return the value to fill for that position. Unfortunately, that is impossible here, since you would then need to know which indices you have already calculated.
A better way to do it is to use groupby. This is more natural, since you want to perform an action on each group. If you use apply after groupby and return a Series, you in fact define a value for each row. Just remember to remove the extra index "Name" that groupby adds.
def fce(g):
    first = g["First round"].iloc[0]
    return g["First round"].shift(1, fill_value=first)

df["Previous time"] = df.groupby("Name").apply(fce).reset_index("Name", drop=True)
Thank you very much. Can you please answer one more question? How does it work with groupby on multiple columns, if I want to return the mean of all rounds for a specific runner and sleep time before the race?
Expected output:
Name First round Sleep before race Mean
Runner 1 2 8 4
Runner 2 5 7 6
Runner 3 5 8 5
Runner 1 6 8 4
Runner 2 8 7 6
Runner 3 4 9 4,5
Runner 1 2 9 2
Runner 2 5 7 6
Runner 3 5 9 4,5
This does not work for me.
def last_season(g):
    aa = g["First round"].mean()

df["Mean"] = df.groupby(["Name", "Sleep before race"]).apply(g).reset_index(["Name", "Sleep before race"], drop=True)

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] -1
Obs_2 = df.loc[end_2] / df.loc[start_2] -1
The output I get from, e.g., Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
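
Not an answer from the thread, but a minimal sketch of how one could check what Obs_1 is and combine the two Series (assuming the code above has already been run):
# Obs_1 and Obs_2 are pandas Series indexed by the column labels A-E.
print (type(Obs_1))
<class 'pandas.core.series.Series'>

# concat column-wise; the shared A-E index aligns the two.
df_1 = pd.concat([Obs_1, Obs_2], axis=1, keys=['Obs_1', 'Obs_2'])

# Correlation between the two observation sets.
print (df_1.corr())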