Managing MultiIndex DataFrame objects - pandas

So, yet another problem using grouped DataFrames that I am getting so confused over...
I have defined an aggregation dictionary as:
aggregations_level_1 = {
    'A': {
        'mean': 'mean',
    },
    'B': {
        'mean': 'mean',
    },
}
And now I have two grouped DataFrames that I have aggregated using the above, then joined:
grouped_top = df1.groupby(['group_lvl']).agg(aggregations_level_1)
grouped_bottom = df2.groupby(['group_lvl']).agg(aggregations_level_1)
Joining these:
df3 = grouped_top.join(grouped_bottom, how='left', lsuffix='_top_10', rsuffix='_low_10')
           A_top_10   A_low_10   B_top_10   B_low_10
               mean       mean       mean       mean
group_lvl
a          3.711413  14.515901   3.711413  14.515901
b          4.024877  14.442106   3.694689  14.209040
c          3.694689  14.209040   4.024877  14.442106
Now, if I call index and columns I have:
print df3.index
>> Index([u'a', u'b', u'c'], dtype='object', name=u'group_lvl')
print df3.columns
>> MultiIndex(levels=[[u'A_top_10', u'A_low_10', u'B_top_10', u'B_low_10'], [u'mean']],
              labels=[[0, 1, 2, 3], [0, 0, 0, 0]])
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
How do I slice and call this? Say I would like to have only A_top_10, A_low_10 for all a,b,c?
Only A_top_10, B_top_10 for a and c?
I am pretty confused so any overall help would be great!
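(For reference, a stand-in frame built from the output above, so the slicing in the answer can be run; df here plays the role of df3 and only the structure matters.)
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['A_top_10', 'A_low_10', 'B_top_10', 'B_low_10'], ['mean']])
df = pd.DataFrame([[3.711413, 14.515901, 3.711413, 14.515901],
                   [4.024877, 14.442106, 3.694689, 14.209040],
                   [3.694689, 14.209040, 4.024877, 14.442106]],
                  index=pd.Index(list('abc'), name='group_lvl'),
                  columns=cols)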

You need slicers, but first sort the columns with sort_index, otherwise you get this error:
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
df = df.sort_index(axis=1)
idx = pd.IndexSlice
df1 = df.loc[:, idx[['A_low_10', 'A_top_10'], :]]
print (df1)
            A_low_10  A_top_10
                mean      mean
group_lvl
a          14.515901  3.711413
b          14.442106  4.024877
c          14.209040  3.694689
And:
idx = pd.IndexSlice
df2 = df.loc[['a','c'], idx[['A_top_10', 'B_top_10'], :]]
print (df2)
           A_top_10  B_top_10
               mean      mean
group_lvl
a          3.711413  3.711413
c          3.694689  4.024877
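A couple of equivalent selections that avoid IndexSlice, sketched against the same df as above (selecting whole top-level groups by label also works on an unsorted column index, though it may emit a lexsort PerformanceWarning):
df[['A_top_10', 'A_low_10']]                   # all rows, two top-level column groups
df.loc[['a', 'c'], ['A_top_10', 'B_top_10']]   # selected rows and column groups
df.xs('mean', axis=1, level=1)                 # the 'mean' sub-column of every group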
EDIT:
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
I think that is very close; it is more accurate to say the DataFrame has a MultiIndex in its columns.

Related

Compare two dataframes with varying index lengths and multiple occurrences

df = pd.DataFrame({"_id": [1, 2, 3, 4], "names_e": ["emil", "emma", "enton", "emma"]})
df2 = pd.DataFrame({"id": [1, 3, 4], "name": ["emma", "emma", "emma"]})
#df2 = df2.set_index("id", drop="False")
#df = df.set_index("_id", drop="False")
df[(df['_id']==df2["id"]) & (df['names_e'] == df2["name"])] #-> Can only compare identically-labeled Series objects
#df[[x for x in (df2["name"] == df["names_e"].values)]] #->'Lengths must match to compare'
#df[[x for x in (df2["name"] == df["names_e"])]] # ->Can only compare identically-labeled Series objects
I'm trying to make an intersection of two dataframes based on the name column and the unique identifier id. The expected result would only include id: 4 and name: 'emma', but I keep running into the errors shown above.
Use an inner merge after renaming the columns of df to match df2:
df2.merge(df.rename(columns={'_id': 'id', 'names_e': 'name'}))
   id  name
0   4  emma
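If you prefer to keep both frames' column names, an explicit inner merge on the two key pairs should give the same single row (a sketch using the df and df2 defined above):
out = df.merge(df2, left_on=['_id', 'names_e'], right_on=['id', 'name'], how='inner')
print(out)   # one row: _id 4, names_e 'emma', id 4, name 'emma'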

groupby with transform minmax

For every city, I want to create a new column that is the min-max scaling of another column (age).
I tried this and get: Input contains infinity or a value too large for dtype('float64').
from sklearn import preprocessing

cols = ['age']

def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x

df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
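If the goal is only the per-city min-max scaling, a plain groupby().transform() avoids scikit-learn entirely; a sketch, assuming the infinities in age have already been replaced as above:
grp = df.groupby('city')['age']
rng = grp.transform('max') - grp.transform('min')
df['age_minmax'] = (df['age'] - grp.transform('min')) / rng
# cities whose age is constant give rng == 0 and come out as NaN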

series.str.split(expand=True) returns error: Wrong number of items passed 2, placement implies 1

I have a series of web addresses that I want to split by the first '.'. For example, return 'google' if the web address is 'google.co.uk'.
d1 = {'id':['1', '2', '3'], 'website':['google.co.uk', 'google.com.au', 'google.com']}
df1 = pd.DataFrame(data=d1)
d2 = {'id':['4', '5', '6'], 'website':['google.co.jp', 'google.com.tw', 'google.kr']}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
I use enumerate to iterate over the dataframe list:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)
Received error: ValueError: Wrong number of items passed 2, placement implies 1
You are splitting the website, which gives you a list-like structure; think ['google', 'co.uk']. With expand=True that becomes two columns, but you are assigning them to a single new column, hence the error. You just want the first element of that split, so:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)[0]
Another alternative is to use extract. It is also ~40% faster for your data:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.extract(r'(.*?)\.')
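Another variant is to skip expand and take the first element of the split list with .str[0]; a small sketch over the same df_list:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1).str[0]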

pandas: Calculate the rowwise max of categorical columns

I have a DataFrame containing 2 columns of ordered categorical data (of the same category). I want to construct another column that contains the categorical maximum of the first 2 columns. I set up the following.
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
cats = CategoricalDtype(categories=['small', 'normal', 'large'], ordered=True)
data = {
    'A': ['normal', 'small', 'normal', 'large', np.nan],
    'B': ['small', 'normal', 'large', np.nan, 'small'],
    'desired max(A,B)': ['normal', 'normal', 'large', 'large', 'small']
}
df = pd.DataFrame(data).astype(cats)
The columns can be compared, although the np.nan items are problematic, as running the following code shows.
df['A'] > df['B']
The manual suggests that max() works on categorical data, so I try to define my new column as follows.
df[['A', 'B']].max(axis=1)
This yields a column of NaN. Why?
The following code constructs the desired column using the comparability of the categorical columns. I still don't know why max() fails here.
dfA = df['A']
dfB = df['B']
conditions = [dfA.isna(), (dfB.isna() | (dfA >= dfB)), True]
cases = [dfB, dfA, dfB]
df['maxAB'] = np.select(conditions, cases)
Columns A and B hold string labels, so you need to assign an integer value to each of these categories first.
# size string -> integer value mapping
size2int_map = {
    'small': 0,
    'normal': 1,
    'large': 2
}
# integer value -> size string mapping
int2size_map = {
    0: 'small',
    1: 'normal',
    2: 'large'
}
# create columns containing the integer value for each size string
for c in ['A', 'B']:
    df['%s_int' % c] = df[c].map(size2int_map)
# take the row-wise max of the integers and map back to size strings
print(df[['A_int', 'B_int']].max(axis=1).map(int2size_map))
and you should get
0    normal
1    normal
2     large
3     large
4     small
dtype: object
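An alternative that keeps the result as an ordered categorical is to take the row-wise max of the integer category codes; a sketch, assuming df and cats from the question (NaN has code -1, so a row that is NaN in both columns comes back as NaN):
codes = df[['A', 'B']].apply(lambda s: s.cat.codes)   # NaN becomes -1
df['maxAB'] = pd.Categorical.from_codes(codes.max(axis=1), dtype=cats)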

Quantile across rows and down columns using selected columns only [duplicate]

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns; here is an example in which you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of column names
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if it contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
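str.contains also accepts case and regex arguments, which helps when column names vary in capitalisation or when the pattern should be matched literally (a small sketch):
df.columns[df.columns.str.contains('spike', case=False, regex=False)]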
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name or by regular expression. Refer to pandas.DataFrame.filter.
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You can also use this code:
spike_cols = [x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)