Reorder rows of pandas DataFrame according to a known list of values - pandas

I can think of 2 ways of doing this:
1. Apply df.query to match each row, then collect the index of each result
2. Set the column domain to be the index, and then reorder based on the index (but this would lose the index which I want, so may be trickier)
However, I'm not sure these are good solutions (I may be missing something obvious).
Here's an example set up:
import pandas as pd

domain_vals = list("ABCDEF")
df_domain_vals = list("DECAFB")
df_num_vals = [0,5,10,15,20,25]
df = pd.DataFrame.from_dict({"domain": df_domain_vals, "num": df_num_vals})
This gives df:
domain num
0 D 0
1 E 5
2 C 10
3 A 15
4 F 20
5 B 25
1: Use df.query on each row
So I want to reorder the rows so that the values in the column domain follow the order given by domain_vals.
A possible way to do this is to repeatedly use df.query, but this seems like an un-Pythonic (un-panda-ese?) solution:
>>> pd.concat([df.query(f"domain == '{d}'") for d in domain_vals])
domain num
3 A 15
5 B 25
2 C 10
0 D 0
1 E 5
4 F 20
2: Setting the column domain as the index
reorder = df.domain.apply(lambda x: domain_vals.index(x))
df_reorder = df.set_index(reorder)
df_reorder.sort_index(inplace=True)
df_reorder.index.name = None
Again this gives
>>> df_reorder
domain num
0 A 15
1 B 25
2 C 10
3 D 0
4 E 5
5 F 20
Can anyone suggest something better (in the sense of "less of a hack")? I understand that my solution works, I just don't think that calling pandas.concat along with a list comprehension is the right approach here.
Having said that, it's shorter than the 2nd option, so I presume there must be some equally simple way I can do this with pandas methods I've overlooked?

Another way is merge, with the left frame built from the desired order (domain_vals):
(pd.DataFrame({'domain':domain_vals})
.merge(df, on='domain', how='left')
)
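For comparison, here is a minimal self-contained sketch of two ways to reorder by a known list, using the example data from the question. Option A is the merge idea, assuming the left frame holds the desired order (domain_vals). Option B is a different technique, not taken from the answer: it maps each value to its position in domain_vals and sorts on that. The names merged, position and reordered are illustrative only.

```python
import pandas as pd

domain_vals = list("ABCDEF")
df = pd.DataFrame({"domain": list("DECAFB"), "num": [0, 5, 10, 15, 20, 25]})

# Option A: left-merge onto a frame that holds the desired order.
# The result gets a fresh 0..n-1 index.
merged = pd.DataFrame({"domain": domain_vals}).merge(df, on="domain", how="left")

# Option B: map each value to its position in domain_vals, sort those
# positions, and use the sorted index labels to reorder df.
# This keeps the original index labels.
position = df["domain"].map({value: i for i, value in enumerate(domain_vals)})
reordered = df.loc[position.sort_values().index]

print(merged)
print(reordered)
```

Option B may be the better fit when the original index has to survive the reordering, since the merge discards it.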

Related

Compare Values of 2 dataframes conditionally

I have the following problem. I have a dataframe which looks like this.
Dataframe1
start end
0 0 2
1 3 7
2 8 9
and another dataframe which looks like this.
Dataframe2
data
1 ...
4 ...
8 ...
11 ...
What I am trying to achieve is following:
For each row in Dataframe1 I want to check if there is any index value in Dataframe2 which is in range(start, end) of Dataframe1.
If the condition is True, I want to create a new column["condition"] where the outcome is stored.
Since I may have to deal with large amounts of data, I tried using numpy.select.
Like this:
range_start = df1.start
range_end = df1.end
condition = [
df2.index.to_series().between(range_start, range_end)
]
choice = ["True"]
df1["condition"] = np.select(condition, choice, default=0)
This gives me an error:
ValueError: Can only compare identically-labeled Series objects
I also tried a list comprehension. That didn't work either. Everything I tried fails because I am dealing with Series (range_start, range_end). I think there has to be a way to make this work.
I already searched Stack Overflow for this particular problem, but I wasn't able to find a solution. It could be that I'm just too inexperienced with this type of problem to search for the right solution.
So maybe you can help me out here.
Thank you!
expected output:
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
Use DataFrame.drop_duplicates to remove duplicates from both the columns and the index, create all combinations with DataFrame.merge using a cross join, and finally test for at least one match per group with GroupBy.any:
df3 = (df1.drop_duplicates(['start','end'])
.merge(df2.index.drop_duplicates().to_frame(), how='cross'))
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df1.join(df3.groupby(['start','end'])['condition'].any(), on=['start','end'])
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
If all pairs in df1 are unique, it is possible to use:
df3 = (df1.merge(df2.index.to_frame(), how='cross'))
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
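The ValueError in the question most likely comes from comparing Series that carry different index labels (df2's index against df1's start and end columns). As an alternative to the cross join, here is a hedged sketch that does the same "any index value within [start, end]" test with NumPy broadcasting. It is a different technique from the answer, treats both bounds as inclusive to match Series.between, and builds an n1 x n2 boolean array, so it assumes that array fits in memory. The frames are rebuilt from the example data in the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"start": [0, 3, 8], "end": [2, 7, 9]})
df2 = pd.DataFrame({"data": ["...", "...", "...", "..."]}, index=[1, 4, 8, 11])

# Reshape so every (row of df1, index value of df2) pair gets compared.
idx = df2.index.to_numpy()[None, :]        # shape (1, n2)
start = df1["start"].to_numpy()[:, None]   # shape (n1, 1)
end = df1["end"].to_numpy()[:, None]       # shape (n1, 1)

# True where at least one index value falls inside the inclusive range.
df1["condition"] = ((idx >= start) & (idx <= end)).any(axis=1)
print(df1)
```

Because plain arrays carry no labels, there is no index alignment and the comparison does not raise.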

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] -1
Obs_2 = df.loc[end_2] / df.loc[start_2] -1
The output I get from - eg Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
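One way to see what Obs_1 and Obs_2 are is to print their type: selecting a single row label with .loc returns a pandas Series, and dividing two Series gives another Series. The sketch below uses invented numbers and assumes the dates are stored as plain string labels in the index, since the real data is not shown. It then places the two Series side by side with pd.concat; the column names "Obs_1" and "Obs_2" are illustrative.

```python
import pandas as pd

# Hypothetical values; the question's actual data is not available.
df = pd.DataFrame(
    {
        "A": [100.0, 101.0, 102.0, 104.0],
        "B": [50.0, 50.5, 51.0, 51.0],
        "C": [10.0, 10.2, 10.1, 10.4],
    },
    index=pd.Index(
        ["1999-01-01", "1999-01-02", "1999-01-03", "1999-01-04"], name="Date"
    ),
)

Obs_1 = df.loc["1999-01-03"] / df.loc["1999-01-02"] - 1
Obs_2 = df.loc["1999-01-04"] / df.loc["1999-01-03"] - 1

# Shows the class of the object, here a pandas Series.
print(type(Obs_1))

# Put the two Series next to each other; the labels A, B, C become the index.
df_1 = pd.concat([Obs_1, Obs_2], axis=1, keys=["Obs_1", "Obs_2"])
print(df_1)

# Correlation between the two columns.
print(df_1["Obs_1"].corr(df_1["Obs_2"]))
```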

Merging two dataframes on the same type column gives me wrong result

I have two dataframes, assume A and B, which have been created after reading the sheets of an Excel file and performing some basic functions. I need to merge right the two dataframes on a column named ID which has first been converted to astype(str) for both dataframes.
The ID column of the left Dataframe (A) is:
0 5815518813016
1 5835503994014
2 5835504934023
3 5845535359006
4 5865520960012
5 5865532845006
6 5875531550008
7 5885498289039
8 5885498289039_A2
9 5885498289039_A3
10 5885498289039_X2
11 5885498289039_X3
12 5885509768698
13 5885522349999
14 5895507791025
Name: ID, dtype: object
The ID column of the right Dataframe (B) is:
0 5835503994014
1 5845535359006
2 5835504934023
3 5815518813016
4 5885498289039_A1
5 5885498289039_A2
6 5885498289039_A3
7 5885498289039_X1
8 5885498289039_X2
9 5885498289039_X3
10 5885498289039
11 5865532845006
12 5875531550008
13 5865520960012
14 5885522349998
15 5895507791025
16 5885509768698
Name: ID, dtype: object
However, when I merge the two, the rest of the columns of the left (A) dataframe become "empty" (np.nan) except for the rows where the ID does not contain only numbers but letters too. This is the pd.merge() I do:
A_B=A.merge(B[['ID','col_B']], left_on='ID', right_on='ID', how='right')
Do you have any ideas what might be so wrong? Your input is valuable.
Try turning all values in both columns into strings:
A['ID'] = A['ID'].astype(str)
B['ID'] = B['ID'].astype(str)
Generally, when a merge like this doesn't work, I would try to debug by printing out the unique values in each column to check if anything pops out (usually dtype issues).
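The debugging step described above could look like the following sketch. The two small frames are hypothetical stand-ins for A and B, and the ".0" suffix is an invented mismatch used only to make the check show something; it is not a diagnosis of the question's data. The indicator=True option of merge adds a "_merge" column that reports whether each row was found on both sides.

```python
import pandas as pd

# Hypothetical stand-ins for A and B; the ".0" suffix is an invented mismatch.
A = pd.DataFrame({"ID": ["5815518813016.0", "5885498289039_A2"], "col_A": [1, 2]})
B = pd.DataFrame({"ID": ["5815518813016", "5885498289039_A2"], "col_B": ["x", "y"]})

# Which Python types are actually stored in each ID column?
print(A["ID"].map(type).value_counts())
print(B["ID"].map(type).value_counts())

# IDs present on one side only; repr() makes stray whitespace or suffixes visible.
only_in_A = sorted(set(A["ID"]) - set(B["ID"]))
only_in_B = sorted(set(B["ID"]) - set(A["ID"]))
print([repr(v) for v in only_in_A], [repr(v) for v in only_in_B])

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
check = A.merge(B[["ID", "col_B"]], on="ID", how="right", indicator=True)
print(check[["ID", "_merge"]])
```

Rows marked "right_only" are the ones whose ID did not find a partner in A, which is where the empty columns come from.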

Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting maybe in a list of DataFrames in the form of (in this case):
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks are important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is code representing my current workaround as adapted to the above example
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 > 10 using 'loc[]'
subset = df.loc[df['col2'] > 10]
block_start = 0
block_end = None
# loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i] - subset.index[i-1] > 1:
        # ... this is the current block's end
        next_block_start = i
        # extract the according block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        # the next_block_start index is now the new block's starting index
        block_start = next_block_start
# close and add last block
blocks.append(subset[block_start:])
Edit: I was by mistake previously referring to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First, check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum will generate a step function which we set to zero (via the mask column) if this is not a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames
grp = {}
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
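As a self-contained illustration of the same idea, the sketch below marks the rows where the condition switches from False to True, turns those marks into a running block number with cumsum, and lets groupby do the split. It uses the example data from the question and the > 10 condition from the code above. The names mask, starts, block_id and blocks are illustrative, and using groupby for the split is a variation on the dict loop, not the answer's exact method.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
    "col2": [11, 15, 1, 2, 2, 16, 17, 13, 4, 3],
})

mask = df["col2"] > 10

# A new block starts wherever the mask flips from False to True.
starts = mask & ~mask.shift(fill_value=False)

# Running count of block starts: every row gets the number of its block.
block_id = starts.cumsum()

# Keep only rows that satisfy the condition, then split them by block number.
blocks = [block for _, block in df[mask].groupby(block_id[mask])]

for block in blocks:
    # First and last index label of each coherent block.
    print(block.index[0], block.index[-1])
```

Each element of blocks keeps its original index labels, so the start and end of a block can be read directly from block.index.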

collapse pandas dataframe rows based on index column

I have a dataframe that contains information that is linked by an ID column. The rows are sequential with the odd rows containing a "start-point" and the even rows containing an "end" point. My goal is to collapse the data from these into a single row with columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.
example:
df:
0 1 2 3 4 5
0 hs6 106956570 106956648 ID_A1 60 -
1 hs1 153649721 153649769 ID_A1 60 -
2 hs1 865130744 865130819 ID_A2 0 -
3 hs7 21882206 21882237 ID_A2 0 -
4 hs1 74230744 74230819 ID_A3 0 +
5 hs8 92041314 92041508 ID_A3 0 +
The resulting dataframe that I am trying to achieve is:
new_df
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
with each row containing the information on both the start and the end-point.
I have tried to pass the IDs into an array and use a for loop to pull the information out of the original dataframe into a new dataframe, but this has not worked. I was looking at the melt documentation, which would suggest that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work, but I cannot see how to get the corresponding row into positions new_df[3,4,5].
I think that it may be something really simple that I am missing but any suggestions would be appreciated.
You can try this:
df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
.unstack(0).sort_index(level=1, axis=1)
df_out.columns = np.arange(len(df_out.columns))
df_out
Output:
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
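The approach above relies on the rows strictly alternating between start and end. The question mentions that the packet ID (column 3) could link the rows if that ordering is not reliable, so here is a hedged sketch of that variant: it numbers the rows within each ID and joins the first row to the second on the ID. It assumes exactly two rows per ID, with the start row appearing before the end row. The output column names (id, start_0, end_0 and so on) are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    ["hs6", 106956570, 106956648, "ID_A1", 60, "-"],
    ["hs1", 153649721, 153649769, "ID_A1", 60, "-"],
    ["hs1", 865130744, 865130819, "ID_A2", 0, "-"],
    ["hs7", 21882206, 21882237, "ID_A2", 0, "-"],
    ["hs1", 74230744, 74230819, "ID_A3", 0, "+"],
    ["hs8", 92041314, 92041508, "ID_A3", 0, "+"],
])

# Position of each row within its packet ID: 0 = first (start), 1 = second (end).
pos = df.groupby(df[3]).cumcount()

start = df.loc[pos == 0, [3, 0, 1, 2]].copy()
end = df.loc[pos == 1, [3, 0, 1, 2]].copy()
start.columns = ["id", "start_0", "start_1", "start_2"]
end.columns = ["id", "end_0", "end_1", "end_2"]

# One row per packet ID, with the start columns followed by the end columns.
new_df = start.merge(end, on="id")
print(new_df)
```

Dropping the id column afterwards gives the six-column layout shown in the question.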