Splitting Dataframe with hierarchical index [duplicate] - pandas

This question already has answers here:
Splitting dataframe into multiple dataframes
(13 answers)
Closed 3 years ago.
I have a large dataframe with hierarchical indexing (a simplistic/ format example provided in the code below). I would like to setup a loop/automated way of splitting the dataframe into subsets per unique index value, i.e. dfa, dfb, dfc etc. in the coded example below and store in a list.
I have tried the following but unfortunately to no success. Any help appreciated!
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b',
'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 1, 2, 2, 3]])
split = []
for value in data.index.unique():
split.append(data[data.index == value])

I am not exactly sure if this is what you are looking for but have you checked groupby pandas function? The crucial part is that you can apply it across MultiIndex specifying which level of indexing (or what subset of levels) to group by. e.g.
split = {}
for value, split_group in data.groupby(level=0):
split[value] = split_group
print(split)
as #jezrael points out a simpler way to do it is:
dict(tuple(df.groupby(level=0)))

Related

Subset multiindex dataframe keeps original index value

I found subsetting multi-index dataframe will keep original index values behind.
Here is the sample code for test.
level_one = ["foo","bar","baz"]
level_two = ["a","b","c"]
df_index = pd.MultiIndex.from_product((level_one,level_two))
df = pd.DataFrame(range(9), index = df_index, columns=["number"])
df
Above code will show dataframe like this.
number
foo a 0
b 1
c 2
bar a 3
b 4
c 5
baz a 6
b 7
c 8
Code below subset the dataframe to contain only 'a' and 'b' for index level 1.
df_subset = df.query("(number%3) <=1")
df_subset
number
foo a 0
b 1
bar a 3
b 4
baz a 6
b 7
The dataframe itself is expected result.
BUT index level of it is still containing the original index level, which is NOT expected.
#Following code is still returnning index 'c'
df_subset.index.levels[1]
#Result
Index(['a', 'b', 'c'], dtype='object')
My first question is how can I remove the 'original' index after subsetting?
The Second question is this is expected behavior for pandas?
Thanks
Yes, this is expected, it can allow you to access the missing levels after filtering. You can remove the unused levels with remove_unused_levels:
df_subset.index = df_subset.index.remove_unused_levels()
print(df_subset.index.levels[1])
Output:
Index(['a', 'b'], dtype='object')
It is normal that the "original" index after subsetting remains, because it's a behavior of pandas, according to the documentation "The MultiIndex keeps all the defined levels of an index, even if they are not actually used.This is done to avoid a recomputation of the levels in order to make slicing highly performant."
You can see that the index levels is a FrozenList using:
[I]: df_subset.index.levels
[O]: FrozenList([['bar', 'baz', 'foo'], ['a', 'b', 'c']])
If you want to see only the used levels, you can use the get_level_values() or the unique() methods.
Here some example:
[I]: df_subset.index.get_level_values(level=1)
[O]: Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object')
[I]: df_subset.index.unique(level=1)
[O]: Index(['a', 'b'], dtype='object')
Hope it can help you!

Fastest way to locate rows of a dataframe from two lists and concatenate them?

Apologies if this has already been asked, I haven't found anything specific enough although this does seem like a general question. Anyways, I have two lists of values which correspond to values in a dataframe, and I need to pull those rows which contain those values and make them into another dataframe. The code I have works, but it seems quite slow (14 seconds per 250 items). Is there a smart way to speed it up?
row_list = []
for i, x in enumerate(datetime_list):
row_list.append(df.loc[(df["datetimes"] == x) & (df.loc["b"] == b_list[i])])
data = pd.concat(row_list)
Edit: Sorry for the vagueness #anky, here's an example dataframe
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'datetimes' : [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
'b' : [0, 1, 2],
'c' : [500, 600, 700]})
IIUC, try this
dfi = df.set_index(['datetime', 'b'])
data = dfi.loc[list(enumerate(datetime_list)), :].reset_index()
Without test data in question it is hard to verify if this correct.

split content of a column pandas

I have the following Pandas Dataframe
Which can also be generated using this list of dictionaries:
list_of_dictionaries = [
{'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
{'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
{'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
{'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]
I have implemented kind of what I need, but adding columns vertically:
df['people_id1']=[x[0] for x in df['people_ids'].tolist()]
df['people_id2']=[x[1] for x in df['people_ids'].tolist()]
And then I get a different column of every single people_id, just until the second element, because when I add the extraction 3rd element on a third column, it crashes because , there is no 3rd element to extract from the first row.
Even though, what I am trying to do is to extract every people_id from people_ids column, and then each one of those will have their associated value from the Project and Hours columns, so I get a dataset like this one:
Any idea on how could I get this output?
I think what you are looking for is explode on 'people_ids' column.
df = df.explode('people_ids', ignore_index=True)

Is there an easier way to grab a single value from within a Pandas DataFrame with multiindexed columns?

I have a Pandas DataFrame of ML experiment results (from MLFlow). I am trying to access the run_id of a single element in the 0th row and under the "tags" -> "run_id" multi-index in the columns.
The DataFrame is called experiment_results_df. I can access the element with the following command:
experiment_results_df.loc[0,(slice(None),'run_id')].values[0]
I thought I should be able to grab the value itself with a statement like the following:
experiment_results_df.at[0,('tags','run_id')]
# or...
experiment_results_df.loc[0,('tags','run_id')]
But either of those just results in the following rather confusing error (as I'm not setting anything):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
It's working now, but I'd prefer to use a simpler syntax. And more than that, I want to understand why the other approach isn't working, and if I can modify it. I find multiindexes very frustrating to work with in Pandas compared to regular indexes, but the additional formatting is nice when I print the DF to the console, or display it in a CSV viewer as I currently have 41 columns (and growing).
I don't understand what is the problem:
df = pd.DataFrame({('T', 'A'): {0: 1, 1: 4},
('T', 'B'): {0: 2, 1: 5},
('T', 'C'): {0: 3, 1: 6}})
print(df)
# Output
T
A B C
0 1 2 3
1 4 5 6
How to extract 1:
>>> df.loc[0, ('T', 'A')]
1
>>> df.at[0, ('T', 'A')]
1
>>> df.loc[0, (slice(None), 'A')][0]
1

Removing selected features from dataset

I am following this program: https://scikit-learn.org/dev/auto_examples/inspection/plot_permutation_importance_multicollinear.html
since I have a problem with highly correlated features in my model (different from that one shown in the example). In this step
selected_features = [v[0] for v in cluster_id_to_feature_ids.values()]
I can get information on the features that I will need to remove from my classifier. They are given as numbers ([0, 3, 5, 6, 8, 9, 10, 17]). How can I get names of these features?
Ok, there are two different elements to this problem I think.
First, you need to get a list of the column names. In the example code you linked, it looks like the list of feature names is stored like this:
data.feature_names
Once you have the feature names, you'd need a way to loop through them and grab only the ones you want. Something like this should work:
columns = ['a', 'b', 'c', 'd']
keep_index = [0, 3]
new_columns = [columns[i] for i in keep_index]
new_columns
['a', 'b']