Why can't I merge the columns together - pandas

My goal is to convert the array to a DataFrame, and the error occurs only at the columns=... argument:
housing_extra = pd.DataFrame(housing_extra_attribs,
                             index=housing_num.index,
                             columns=[housing.columns, 'rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Consequently, it returns
AssertionError: Number of manager items must equal union of block items
# manager items: 4, # tot_items: 12
It says I only passed in 4 columns, but housing.columns by itself has 9 columns.
Here is what I get when I run housing.columns:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income',
'ocean_proximity'],
dtype='object')
So, my question is: how can I merge the existing columns (housing.columns) with the 3 new columns ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']?

You can use Index.union to add a list of columns to existing dataframe columns:
columns = housing.columns.union(
    ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'],
    sort=False)
Or convert to list and then add the remaining columns as list:
columns = (housing.columns.tolist() +
           ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Then:
housing_extra = pd.DataFrame(housing_extra_attribs,
                             index=housing_num.index,
                             columns=columns)
Here is an example:
Assume this df:
df = pd.DataFrame(np.random.randn(30, 4), columns=list('ABCD'))  # stand-in for the removed pd.util.testing.makeDataFrame()
print(df.columns)
# Index(['A', 'B', 'C', 'D'], dtype='object')
When you pass this into a list:
[df.columns,'E','F','G']
you get:
[Index(['A', 'B', 'C', 'D'], dtype='object'), 'E', 'F', 'G']
versus when you use union:
df.columns.union(['E','F','G'],sort=False)
You get:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
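As a fully self-contained sketch (the frame and column names here are made up, not the original housing data), the union approach looks like this:

```python
import numpy as np
import pandas as pd

# Small stand-in for the original housing frame
housing = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# Pretend we appended two engineered features to the raw array
extra_attribs = np.hstack([housing.to_numpy(), np.ones((2, 2))])

# Index.union keeps the existing columns flat and appends the new names
columns = housing.columns.union(['x', 'y'], sort=False)
housing_extra = pd.DataFrame(extra_attribs, index=housing.index, columns=columns)
print(housing_extra.columns.tolist())  # ['a', 'b', 'c', 'x', 'y']
```

The key point is that sort=False preserves the original column order instead of sorting the union alphabetically.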

Subset multiindex dataframe keeps original index value

I found that subsetting a multi-index DataFrame keeps the original index values behind.
Here is the sample code for test.
level_one = ["foo","bar","baz"]
level_two = ["a","b","c"]
df_index = pd.MultiIndex.from_product((level_one,level_two))
df = pd.DataFrame(range(9), index = df_index, columns=["number"])
df
The above code shows a DataFrame like this:
number
foo a 0
b 1
c 2
bar a 3
b 4
c 5
baz a 6
b 7
c 8
The code below subsets the DataFrame so that index level 1 contains only 'a' and 'b'.
df_subset = df.query("(number%3) <=1")
df_subset
number
foo a 0
b 1
bar a 3
b 4
baz a 6
b 7
The dataframe itself is the expected result,
BUT its index levels still contain the original values, which is NOT expected.
# The following code still returns index 'c'
df_subset.index.levels[1]
#Result
Index(['a', 'b', 'c'], dtype='object')
My first question is: how can I remove the 'original' index values after subsetting?
The second question: is this expected behavior for pandas?
Thanks
Yes, this is expected, it can allow you to access the missing levels after filtering. You can remove the unused levels with remove_unused_levels:
df_subset.index = df_subset.index.remove_unused_levels()
print(df_subset.index.levels[1])
Output:
Index(['a', 'b'], dtype='object')
It is normal that the "original" index remains after subsetting, because it's the documented behavior of pandas: "The MultiIndex keeps all the defined levels of an index, even if they are not actually used. This is done to avoid a recomputation of the levels in order to make slicing highly performant."
You can see that the index levels are a FrozenList using:
[I]: df_subset.index.levels
[O]: FrozenList([['bar', 'baz', 'foo'], ['a', 'b', 'c']])
If you want to see only the used levels, you can use the get_level_values() or the unique() methods.
Here are some examples:
[I]: df_subset.index.get_level_values(level=1)
[O]: Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object')
[I]: df_subset.index.unique(level=1)
[O]: Index(['a', 'b'], dtype='object')
Hope it can help you!
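Putting the pieces of the answer together as one runnable sketch:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["foo", "bar", "baz"], ["a", "b", "c"]])
df = pd.DataFrame({"number": range(9)}, index=idx)

# Subsetting keeps the full level definitions behind the scenes
subset = df.query("number % 3 <= 1")
print(subset.index.levels[1])  # Index(['a', 'b', 'c'], dtype='object')

# Drop the now-unused 'c' from the level definitions
subset.index = subset.index.remove_unused_levels()
print(subset.index.levels[1])  # Index(['a', 'b'], dtype='object')
```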

How to convert pandas multiindex into table columns and index?

I have a DataFrame
It has a multi-index and each row has a value
I want to unstack it, but it has duplicated keys. Now I want to convert it into this format:
where index1 and index2 are from one level of the multi-index, index3 and index4 are from another level, value1~3 are values in the original df, and if a value doesn't exist, the value is NaN.
How should I do this?
See if this example helps you:
import pandas as pd
df = pd.DataFrame(data={'value': [1, 3, 5, 8, 5, 7]}, index=[[1, 2, 3, 4, 5, 6], ['A', 'B', 'A', 'C', 'C', 'A']])
df = df.reset_index(level=1)
df = df.pivot_table(values='value', index=df.index, columns='level_1')
Original data:
Result:
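Since the original screenshots did not survive, here is the same snippet as a runnable block, with its computed result (values become floats because the pivot introduces NaNs for missing combinations):

```python
import pandas as pd

df = pd.DataFrame(data={'value': [1, 3, 5, 8, 5, 7]},
                  index=[[1, 2, 3, 4, 5, 6], ['A', 'B', 'A', 'C', 'C', 'A']])
df = df.reset_index(level=1)
df = df.pivot_table(values='value', index=df.index, columns='level_1')
print(df)
```

The result has one row per first-level key and one column per second-level key, roughly:

```python
# level_1    A    B    C
# 1        1.0  NaN  NaN
# 2        NaN  3.0  NaN
# 3        5.0  NaN  NaN
# 4        NaN  NaN  8.0
# 5        NaN  NaN  5.0
# 6        7.0  NaN  NaN
```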

Pandas: Issue with min() on Categorical columns

I have the following df where columns A,B,C are categorical variables with strict ordering:
df = pd.DataFrame([[0, 1, 'PASS', 'PASS', 'PASS'],
                   [0, 2, 'CHAIN', 'FAIL', 'PASS'],
                   [0, 3, 'PASS', 'PASS', 'TATPG'],
                   [0, 4, 'FAIL', 'PASS', 'FAIL'],
                   [0, 5, 'FAIL', 'ATPG', 'FAIL']],
                  columns=['X', 'Y', 'A', 'B', 'C'])
for c in ['A', 'B', 'C']:
    df[c] = df[c].astype('category', categories=['CHAIN', 'ATPG', 'TATPG', 'PASS', 'FAIL'], ordered=True)
I want to create a new column D which is defined by the min('A', 'B', 'C'). For example, row 1 says 'CHAIN'. That is the smallest value. Hence, D[1] = CHAIN and so on. The D column should result as follows:
D[0] = PASS, D[1] = CHAIN, D[2] = TATPG, D[3] = PASS, D[4] = ATPG
I tried:
df['D'] = df[['A','B','C']].apply(min, axis=1)
However, this does not work as apply() makes the A/B/C column become of type object and hence min() sorts the values lexicographically instead of the ordering that I provided.
I also tried:
df['D'] = df[['A', 'B', 'C']].transpose().min(axis=0)
transpose() too results in the columns A/B/C being changed to type object instead of category.
Any ideas on how to do this correctly? I'd rather not recast the columns as categorical a 2nd time if using apply(). In general, I'll be creating a bunch of indicator columns using this formula:
df[indicator] = df[[any subset of (A,B,C)]].min()
I have found a solution that applies sorted with keys:
d = {'CHAIN': 0,
     'ATPG': 1,
     'TATPG': 2,
     'PASS': 3,
     'FAIL': 4}

def func(row):
    return sorted(row, key=lambda x: d[x])[0]

df['D'] = df[['A', 'B', 'C']].apply(func, axis=1)
It gives you the result you're looking for:
0 PASS
1 CHAIN
2 TATPG
3 PASS
4 ATPG
However, it does not make use of pandas' native sorting of categorical variables.
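To stay within pandas' categorical machinery, one sketch (using the modern pd.CategoricalDtype API rather than the old astype('category', categories=...) signature, which was removed in later pandas versions) is to take the row-wise minimum of the category codes and map it back to a label:

```python
import pandas as pd

order = ['CHAIN', 'ATPG', 'TATPG', 'PASS', 'FAIL']
cat_type = pd.CategoricalDtype(categories=order, ordered=True)

df = pd.DataFrame([[0, 1, 'PASS', 'PASS', 'PASS'],
                   [0, 2, 'CHAIN', 'FAIL', 'PASS'],
                   [0, 3, 'PASS', 'PASS', 'TATPG'],
                   [0, 4, 'FAIL', 'PASS', 'FAIL'],
                   [0, 5, 'FAIL', 'ATPG', 'FAIL']],
                  columns=['X', 'Y', 'A', 'B', 'C'])
for c in ['A', 'B', 'C']:
    df[c] = df[c].astype(cat_type)

# Codes are integers in category order, so a plain numeric min respects the ordering
codes = df[['A', 'B', 'C']].apply(lambda s: s.cat.codes)
df['D'] = pd.Categorical.from_codes(codes.min(axis=1), dtype=cat_type)
print(df['D'].tolist())  # ['PASS', 'CHAIN', 'TATPG', 'PASS', 'ATPG']
```

This keeps D as an ordered categorical with the same dtype as A/B/C, and it generalizes to any subset of the columns.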

Pandas - understanding output of pivot table

Here is my example:
import pandas as pd
df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': [72, 19, 92]})
df = df.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=lambda x: x)
print(df)
The output looks like:
Assessor    C       D
Student
A          72     NaN
B         NaN  [1, 2]
I am not sure why I get '[1,2]' as output. I would expect something like:
Assessor    C    D
Student
A          72  NaN
B         NaN   19
B         NaN   92
Here is related question:
if I replace my dataframe with
df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': ['foo', 'bar', 'foo']})
The output of the same pivot is going to be
Process finished with exit code 255
Any thoughts?
pivot_table finds the unique values of the index/columns and aggregates if there are multiple rows in the original DataFrame in a particular cell.
Indexes/columns are generally meant to be unique, so if you want to get the data in that form, you have to do something a little ugly like this, although you probably don't want to:
In [21]: pivoted = pd.DataFrame(columns=df['Assessor'], index=df['Student'])
In [22]: for (assessor, score, student) in df.itertuples(index=False):
    ...:     pivoted.loc[student, assessor] = score
For your second question, the reason that groupby generally fails is that there are no numeric columns to aggregate, although it seems to be a bug that it completely crashes like that. I added a note to the issue here.
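A tidier route to the "keep duplicates as extra rows" shape (not from the original answer; the 'n' helper column is my own addition) is to number the repeated (Student, Assessor) pairs with cumcount so every index key becomes unique before pivoting:

```python
import pandas as pd

df = pd.DataFrame({'Student': ['A', 'B', 'B'],
                   'Assessor': ['C', 'D', 'D'],
                   'Score': [72, 19, 92]})

# Number repeated (Student, Assessor) pairs: 0, 1, 2, ... within each group
df['n'] = df.groupby(['Student', 'Assessor']).cumcount()
out = df.pivot_table(index=['Student', 'n'], columns='Assessor', values='Score')
print(out)
```

This prints something like:

```python
# Assessor        C     D
# Student n
# A       0    72.0   NaN
# B       0     NaN  19.0
#         1     NaN  92.0
```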

How do I use pandas to add a calculated column in a pivot table?

I'm using pandas 0.16.0 & numpy 1.9.2
I did the following to add a calculated field (column) in the pivot table
Set up dataframe as follows,
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24),
                   'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)] +
                        [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
Pivoted the data frame as follows,
df1 = df.pivot_table(values=['D'],index=['A'],columns=['C'],aggfunc=np.sum,margins=False)
Tried adding a calculated field as follows, but I get an error (see below),
df1['D2'] = df1['D'] * 2
Error,
ValueError: Wrong number of items passed 2, placement implies 1
This is because you have a Hierarchical Index (i.e. MultiIndex) as columns in your 'pivot table' dataframe.
If you print out the results of df1['D'] * 2 you will notice that you get two columns:
C        bar     foo
A
one   -3.163 -10.478
three -2.988   1.418
two   -2.218   3.405
So to put it back to df1 you need to provide two columns to assign it to:
df1[[('D2','bar'), ('D2','foo')]] = df1['D'] * 2
Which yields:
           D               D2
C        bar    foo      bar     foo
A
one   -1.581 -5.239   -3.163 -10.478
three -1.494  0.709   -2.988   1.418
two   -1.109  1.703   -2.218   3.405
A more generalized approach:
new_cols = pd.MultiIndex.from_product([['D2'], df1.D.columns])
df1[new_cols] = df1.D * 2
You can find more info on how to deal with a MultiIndex in the docs.
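A compact runnable version of the generalized approach, with small made-up data in place of the original random frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['foo', 'bar', 'foo', 'bar'],
                   'D': [1.0, 2.0, 3.0, 4.0]})
df1 = df.pivot_table(values=['D'], index=['A'], columns=['C'], aggfunc='sum')

# Build ('D2', <sub-column>) labels and assign the doubled block in one go.
# Note the list-of-lists: from_product([['D2'], ...]) keeps 'D2' as one label
# instead of splitting the string into characters.
new_cols = pd.MultiIndex.from_product([['D2'], df1['D'].columns])
df1[new_cols] = df1['D'] * 2
print(df1[('D2', 'foo')].tolist())  # [2.0, 6.0]
```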