I have a DataFrame with a multi-index, and each row has a value.
I want to unstack it, but it has duplicated keys. I want to convert it into a format
where index1 and index2 are values from one level of the multi-index, index3 and index4 are values from another level, and value1~3 are values from the original df; if a value doesn't exist, it should be NaN.
How should I do this?
See if this example helps you:
import pandas as pd
df = pd.DataFrame(data={'value': [1, 3, 5, 8, 5, 7]}, index=[[1, 2, 3, 4, 5, 6], ['A', 'B', 'A', 'C', 'C', 'A']])
df = df.reset_index(level=1)
df = df.pivot_table(values='value', index=df.index, columns='level_1')
Original data:
     value
1 A      1
2 B      3
3 A      5
4 C      8
5 C      5
6 A      7
Result:
level_1    A    B    C
1        1.0  NaN  NaN
2        NaN  3.0  NaN
3        5.0  NaN  NaN
4        NaN  NaN  8.0
5        NaN  NaN  5.0
6        7.0  NaN  NaN
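On the duplicated-keys part of the question: unstack raises on duplicate index entries, while pivot_table aggregates them (mean by default). A minimal sketch with a made-up duplicate key:

```python
import pandas as pd

# Hypothetical frame where the key (1, 'A') appears twice.
df = pd.DataFrame(data={'value': [1, 3, 5]},
                  index=[[1, 1, 2], ['A', 'A', 'B']])

# df['value'].unstack() would raise "Index contains duplicate entries".
# pivot_table aggregates the duplicates instead (mean by default):
df = df.reset_index(level=1)
out = df.pivot_table(values='value', index=df.index, columns='level_1')
print(out)
```

Here the two (1, 'A') values 1 and 3 collapse to their mean, 2.0; pass aggfunc to pivot_table if a different aggregation is wanted.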
a = cosmos.isna().sum()
c = len(cosmos)
a = a / c * 100
for i in range(len(a)):
    if a[i] > 80:
        cosmos.drop(columns=cosmos.columns[i], axis=1, inplace=True)
I am trying to drop some columns, but it shows:
IndexError: index 7 is out of bounds for axis 0 with size 6
I get an index-out-of-bounds error even though a and cosmos.columns should have the same length. I specifically passed axis=1, so I don't know what it has to do with axis 0.
I just want to drop all columns with more than 80 percent empty rows. I could do it one by one this time, but I tried running it all again and it didn't help.
The error you're having is likely due to the fact that you're dropping columns inplace. As you iterate, the dataframe cosmos gets shorter, yet you index those columns using the original integer index i. As a rule of thumb, you should avoid modifying a dataframe (or any sequence in general) while iterating that same object.
That aside, there are better pandas-esque solutions that take (or drop) the relevant columns in one operation, which avoids iterating altogether. Here is one:
import numpy as np
import pandas as pd
# Sample data
cosmos = pd.DataFrame({
"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
"b": [np.nan, 3, 4, 7, 5, 3, 2, np.nan, 1, 2],
"c": [np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, np.nan],
"d": [np.nan] * 10
})
# Use .mean instead of .sum, which avoids the `/ len(df)` step
nan_pct = cosmos.isna().mean()
cosmos = cosmos.loc[:, nan_pct <= 0.8]
which uses a boolean mask to keep only those columns where at most 80% of the values are NaN.
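If you prefer the drop phrasing from the original attempt, the same mask works with a single drop call; a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

cosmos = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, 2, 3, 4, 5],                      # 20% NaN -> kept
    "c": [np.nan, np.nan, np.nan, np.nan, np.nan],  # 100% NaN -> dropped
})

# Compute the NaN fraction once, then drop the offending columns in one call.
nan_pct = cosmos.isna().mean()
cosmos = cosmos.drop(columns=nan_pct.index[nan_pct > 0.8])
print(list(cosmos.columns))
```

Because the drop happens in one call after iteration-free filtering, the shifting-index problem from the loop version cannot occur.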
I have a pandas dataframe with a MultiIndex containing various data. A minimal example could be this one:
import numpy as np
import pandas as pd

elev = [1, 100, 10, 1000]
number = [4, 3, 1, 2]
name = ['foo', 'bar', 'baz', 'qux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
data = np.random.rand(4, 4)
df = pd.DataFrame(data=data, columns=idx)
Now I want to sort it by its elevation or number. Seems like there's a built-in function for it: MultiIndex.sortlevel, but it just sorts the MultiIndex, and I can't figure out how to make it sort the dataframe along that index too.
df.columns.sortlevel(level=1) gives me a sorted Multiindex
(MultiIndex([('foo', 1, 4),
('baz', 10, 1),
('bar', 100, 3),
('qux', 1000, 2)],
names=['name', 'elev', 'number']),
array([0, 2, 1, 3], dtype=int64))
but trying to apply it with df.columns = df.columns.sortlevel(level=1) just gives me ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements (sortlevel returns a tuple of the sorted index and an indexer), and df = df.columns.sortlevel(level=1) turns the df into the sorted MultiIndex. The axis and inplace keywords I'm used to for similar operations aren't supported by sortlevel.
How do I apply my sorting to my dataframe?
Use DataFrame.sort_index:
df = df.sort_index(level=1, axis=1)
print(df)
name foo baz bar qux
elev 1 10 100 1000
number 4 1 3 2
0 0.009359 0.113384 0.499058 0.049974
1 0.685408 0.897657 0.486988 0.647452
2 0.896963 0.831353 0.721135 0.827568
3 0.833580 0.368044 0.957044 0.494838
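Levels can also be addressed by name rather than position, which reads a bit more clearly with the question's named MultiIndex; a self-contained sketch:

```python
import numpy as np
import pandas as pd

elev = [1, 100, 10, 1000]
number = [4, 3, 1, 2]
name = ['foo', 'bar', 'baz', 'qux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
df = pd.DataFrame(data=np.random.rand(4, 4), columns=idx)

# Sort the columns by the 'number' level instead of its integer position.
by_number = df.sort_index(level='number', axis=1)
print(by_number.columns.get_level_values('number').tolist())  # [1, 2, 3, 4]
```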
I want to determine what values are in a Pandas DataFrame column:
import pandas as pd
d = {'col1': [1,2,3,4,5,6,7], 'col2': [3, 4, 3, 5, 7,22,3]}
df = pd.DataFrame(data=d)
col2 has the unique values 3, 4, 5, 7, 22 (its domain). Each value that exists shall be listed, but only once.
Is there any way to quickly extract the domain of a Pandas DataFrame column?
Use df.max() and df.min() to find the range.
print(df["col2"].unique())
by Andrej Kesely is the solution. Perfect!
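For reference, a self-contained version of that one-liner on the question's data; unique returns values in order of first appearance:

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7], 'col2': [3, 4, 3, 5, 7, 22, 3]}
df = pd.DataFrame(data=d)

# tolist() converts the NumPy array to plain Python ints.
domain = df["col2"].unique().tolist()  # order of first appearance
print(domain)  # [3, 4, 5, 7, 22]
```

If you need the domain sorted or as a set, wrap the result in sorted() or set().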
My goal is to transform the array into a DataFrame, and the error occurs only with the columns=... argument:
housing_extra = pd.DataFrame(housing_extra_attribs,
index=housing_num.index,
columns=[housing.columns,'rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Consequently, it returns
AssertionError: Number of manager items must equal union of block items
# manager items: 4, # tot_items: 12
It says I only passed 4 columns, but housing.columns itself has 9 columns.
Here is what I get when I run housing.columns:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income',
'ocean_proximity'],
dtype='object')
So, my question is: how can I merge the existing columns, housing.columns, with the 3 new columns ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']?
You can use Index.union to add a list of columns to existing dataframe columns:
columns= housing.columns.union(
['rooms_per_household', 'population_per_household', 'bedrooms_per_room'],
sort=False)
Or convert to list and then add the remaining columns as list:
columns = (housing.columns.tolist() +
['rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Then:
housing_extra = pd.DataFrame(housing_extra_attribs,
index=housing_num.index,
columns=columns)
An example:
Assume this df:
df = pd.util.testing.makeDataFrame()
print(df.columns)
#Index(['A', 'B', 'C', 'D'], dtype='object')
When you pass this into a list:
[df.columns,'E','F','G']
you get:
[Index(['A', 'B', 'C', 'D'], dtype='object'), 'E', 'F', 'G']
v/s when you use union:
df.columns.union(['E','F','G'],sort=False)
You get:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
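A runnable sketch of the union approach (the housing column names are abbreviated stand-ins for illustration):

```python
import pandas as pd

# Abbreviated stand-in for housing.columns.
housing_columns = pd.Index(['longitude', 'latitude', 'median_income'])
extra = ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']

# sort=False keeps the existing columns first instead of sorting alphabetically.
columns = housing_columns.union(extra, sort=False)
print(list(columns))
```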
I'm creating a function that accepts 3 inputs: a dataframe, a column and a list of columns.
The function should apply a short calculation to the single column, and a different short calculation to the list of other columns. It should return a dataframe containing just the amended columns (and their amended rows) from the original dataframe.
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4], [1, 3, 5, 6], [4, 6, 7, 8], [5, 4, 3, 6]],
                  columns=['A', 'B', 'C', 'D'])
def pre_process(dataframe, y_col_name, x_col_names):
    # ...
    return new_dataframe
The calculation to be applied to y_col_name's rows is each value of y_col_name divided by the mean of y_col_name.
The calculation to be applied to each of the list of columns in x_col_name is each value of each column, divided by the column's standard deviation.
I would like some help to write the function. I think I need to use an "apply" or a "lambda" function but I'm unsure.
This is what calling the command would look like:
pre_process_data = pre_process(df, 'A', ['B', 'D'])
Thanks
def pre_process(dataframe, y_col_name, x_col_names):
    new_dataframe = dataframe.copy()
    new_dataframe[y_col_name] = new_dataframe[y_col_name] / new_dataframe[y_col_name].mean()
    new_dataframe[x_col_names] = new_dataframe[x_col_names] / new_dataframe[x_col_names].std()
    return new_dataframe
Is this what you mean?
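A quick sanity check of that function (reproduced here so the snippet is self-contained): after scaling, the y column should average 1.0 and each x column should have unit sample standard deviation.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [1, 3, 5, 6], [4, 6, 7, 8], [5, 4, 3, 6]],
                  columns=['A', 'B', 'C', 'D'])

def pre_process(dataframe, y_col_name, x_col_names):
    new_dataframe = dataframe.copy()
    new_dataframe[y_col_name] = new_dataframe[y_col_name] / new_dataframe[y_col_name].mean()
    new_dataframe[x_col_names] = new_dataframe[x_col_names] / new_dataframe[x_col_names].std()
    return new_dataframe

out = pre_process(df, 'A', ['B', 'D'])
# Dividing by the mean forces mean 1; dividing by the std forces std 1.
print(out['A'].mean(), out['B'].std(), out['D'].std())
```

Columns not named in the call (C here) pass through unchanged.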