persistent column label in pandas dataframe - pandas

I have run into an issue with pandas indexing. It first happened on a larger dataset, and I was able to reproduce it with this dummy dataframe. Apologies if my table formatting is poor; I don't know how to make it look better.
  Unnamed: 0 col1 col2 col3
0       Name  Sun  Mon  Tue
1        one    1    2    1
2        two    4    4    3
3      three    2    1    1
4       four    1    5    5
5       five    1    5    5
6        six    5    1    1
7      seven    5    5    6
8      eight    5    3    4
9       nine    5    3    3
What I am trying to do is rename the first column label ('Unnamed: 0') to something meaningful, but when I finally call reset_index, the index "column" has the name "test" for some reason, while the first actual column gets the label "index".
df.rename({df.columns[0]: 'test'}, axis=1, inplace=True)
df.set_index('test', inplace=True)
dft = df.transpose()
dft
test Name one two three four five six seven eight nine
col1  Sun   1   4     2    1    1   5     5     5    5
col2  Mon   2   4     1    5    5   1     5     3    3
col3  Tue   1   3     1    5    5   1     6     4    3
First of all, if my understanding is correct, the index is not even an actual column in the dataframe, so why does it get a label when resetting the index?
And more importantly, why are the labels "test" and "index" reversed?
dft.reset_index(inplace=True)
dft
test index Name one two three four five six seven eight nine
0     col1  Sun   1   4     2    1    1   5     5     5    5
1     col2  Mon   2   4     1    5    5   1     5     3    3
2     col3  Tue   1   3     1    5    5   1     6     4    3
I have tried every possible combination of set_index / reset_index I can think of, including drop=True and inplace=True, but I cannot find a way to create a proper index like the one I started with.

Yes, both axes (the index and the column axis) can have names.
This is useful for multi-indexing.
When you call .reset_index, the index is extracted into a new column, which takes the name of your index (or 'index' if the index has no name).
If you want, you can reset and rename index in one line:
df.rename_axis('Name').reset_index()
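To make the naming rule concrete, here is a minimal sketch on a small made-up frame (the values and the index labels are only assumptions for illustration, not the asker's data):

```python
import pandas as pd

# Hypothetical toy frame with an unnamed index.
df = pd.DataFrame({"col1": [1, 4], "col2": [2, 4]}, index=["one", "two"])

# With an unnamed index, reset_index() labels the new column 'index'.
default_reset = df.reset_index()

# Naming the index first makes reset_index() reuse that name.
named_reset = df.rename_axis("Name").reset_index()

print(list(default_reset.columns))  # ['index', 'col1', 'col2']
print(list(named_reset.columns))    # ['Name', 'col1', 'col2']
```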
Why is 'test' printed not where I expect?
After your code, if you print(dft.columns), you will see:
Index(['index', 'Name', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'],
dtype='object',
name='test')
There are 11 columns. The column axis is called 'test' (see name='test' in the output above).
Also: print(dft.columns.name) prints test.
So what you actually see when you print your dataframe are the column names, to the left of which is the name of the column axis: 'test'.
It is NOT how the index axis is named. You can check: print(type(dft.index.name)) prints <class 'NoneType'>.
Now, why is column axis named 'test'?
Let's see how it got there step by step.
df.rename({df.columns[0]: 'test'}, axis=1, inplace=True)
First column is now named 'test'.
df.set_index('test', inplace=True)
First column has moved from being a column to being an index. The index name is 'test'. The old index disappeared.
dft = df.transpose()
The column axis is now named 'test'. The index is now named however the column axis was named before transposing.
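The steps above can be replayed on a cut-down copy of the question's data. This is only a sketch: the frame below keeps three rows and two columns, and clearing the column axis name with rename_axis(None, axis=1) is one possible way to get rid of the stray 'test' label, not the only one.

```python
import pandas as pd

# Cut-down version of the question's frame (assumed subset of the data).
df = pd.DataFrame(
    {
        "Unnamed: 0": ["Name", "one", "two"],
        "col1": ["Sun", 1, 4],
        "col2": ["Mon", 2, 4],
    }
)

df = df.rename(columns={"Unnamed: 0": "test"}).set_index("test")
print(df.index.name)     # test

# Transposing swaps the axes, and the axis names travel with them.
dft = df.transpose()
print(dft.columns.name)  # test
print(dft.index.name)    # None

# Clear the column axis name before resetting the index.
clean = dft.rename_axis(None, axis=1).reset_index()
print(clean.columns.name)   # None
print(list(clean.columns))  # ['index', 'Name', 'one', 'two']
```

If the new column should not be called 'index', naming the index first (for example with rename_axis on the row axis) before reset_index gives it that name instead.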

Related

Add/subtract value of a column to the entire column of the dataframe pandas

I have a DataFrame like this, where for column2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value at row 1 column 3. How do I efficiently execute this?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i,j) in zip(range(0, 5999),range(1,len(df.columns))):
    if j==1:
        df2.values[i,j] = df.values[i,j] + df.values[0,1]
    elif j>1:
        df2.iloc[i,j] = df.iloc[i,j] - df.iloc[0,j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (column names of the dataframe), and the second one - the actual values of the first row of the dataframe.
We can convert it to a list to better see its values
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, through broadcasting, each of these values is subtracted from every entry of the column it came from, so the first row of the result is all zeros.
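Put together, the example can be run as the short sketch below. The explicit df.sub call is added only to show the same alignment written out; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))

# The Series df.iloc[0] is aligned on the column labels and
# subtracted from every row.
df2 = df - df.iloc[0]

# Same operation written explicitly.
df2_explicit = df.sub(df.iloc[0], axis="columns")

print(df2.iloc[0].tolist())  # [0, 0, 0, 0, 0]
print(df2.iloc[1].tolist())  # [5, 5, 5, 5, 5]
```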

pandas read csv is returning extra unknown column

I am creating a csv file from pandas dataframe by combining two lists using:
df= pd.DataFrame(list(zip(patients_full, labels)),
columns=['id','cancer'])
df.to_csv("labels.csv")
but when I read the csv back, an unknown column labelled "Unnamed: 0" shows up. How do I remove it?
    Unnamed: 0          id  cancer
0            0  HF0953.npy       1
1            1  HF1058.npy       3
2            2  HF1071.npy       3
3            3  HF1122.npy       3
4            4  HF1235.npy       1
5            5  HF1280.npy       2
6            6  HF1344.npy       1
7            7  HF1463.npy       1
8            8  HF1489.npy       1
9            9  HF1490.npy       2
10          10  HF1587.npy       2
11          11  HF1613.npy       2
This happens because to_csv("labels.csv") saves the index by default. The index of the dataframe you saved had no name, so its header cell in the file is blank; when you call read_csv("labels.csv"), it is treated like any other column and the blank header becomes Unnamed: 0. To avoid this you have 2 options:
Option 1 - read the first column back as the index:
read_csv("labels.csv", index_col=0)
Option 2 - not save the index:
to_csv("labels.csv", index=False)
That column in your output is the index of the dataframe. To leave it out of the file, use df.to_csv('labels.csv', index=False). More information is available in the pandas docs for DataFrame.to_csv.
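The round trip can be checked without touching the disk. The sketch below is an illustration on two made-up rows; it writes the CSV to a string (to_csv returns a string when no path is given) and reads it back through io.StringIO instead of using labels.csv.

```python
import io
import pandas as pd

# Two sample rows in the shape of the question's data.
df = pd.DataFrame({"id": ["HF0953.npy", "HF1058.npy"], "cancer": [1, 3]})

# Default: the index is written as a first column with a blank header.
with_index = df.to_csv()
cols_default = pd.read_csv(io.StringIO(with_index)).columns.tolist()

# Not saving the index avoids the extra column altogether.
without_index = df.to_csv(index=False)
cols_no_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

# Alternatively, read the saved first column back as the index.
cols_index_col = pd.read_csv(io.StringIO(with_index), index_col=0).columns.tolist()

print(cols_default)    # ['Unnamed: 0', 'id', 'cancer']
print(cols_no_index)   # ['id', 'cancer']
print(cols_index_col)  # ['id', 'cancer']
```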

Why use to_frame before reset_index?

Using a data set like this one
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
we often see this pattern:
df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
But we get exactly the same result from
df.groupby(['user_id'])['module_id'].count().reset_index(name='count')
(N.B. we need the additional rename in the former because reset_index on a Series includes a name parameter and returns a data frame, while reset_index on a DataFrame does not include the name parameter.)
Is there any advantage in using to_frame first?
(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:
Series.reset_index was added in this commit on the 27th of January 2012.
Series.to_frame was added in this commit on the 13th of October 2013.
So Series.reset_index was available over a year before Series.to_frame.)
There is no noticeable advantage to using to_frame(). Both approaches achieve the same result, and pandas often offers several ways to solve a problem. The only advantage I can think of is that, for larger sets of data, it may be more convenient to look at a dataframe view before resetting the index. With your dataframe as an example, to_frame() displays the counts as a neat dataframe table rather than as a count series, which may make the data easier to understand. The use of to_frame() also makes the intent clearer to a new user who looks at your code for the first time.
The example dataframe:
In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
In [8]: df.head()
Out[8]:
user_id module_id week
0 3 4 4
1 1 3 4
2 1 2 2
3 1 3 4
4 1 2 2
The count() function returns a Series:
In [18]: test1 = df.groupby(['user_id'])['module_id'].count()
In [19]: type(test1)
Out[19]: pandas.core.series.Series
In [20]: test1
Out[20]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')
Using to_frame makes it explicit that you intend to convert the Series to a Dataframe. The index here is user_id:
In [22]: test1.to_frame()
Out[22]:
module_id
user_id
0 2
1 7
2 4
3 6
4 1
Now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter, so we have to rename the column explicitly.
In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
In [25]: testdf1
Out[25]:
user_id count
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
Now let's look at the other case. We compute the same count() series as test1 but call it test2 to differentiate between the two approaches. In other words, test1 is equal to test2.
In [26]: test2 = df.groupby(['user_id'])['module_id'].count()
In [27]: test2
Out[27]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [28]: test2.reset_index()
Out[28]:
user_id module_id
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
In [30]: testdf2 = test2.reset_index(name='count')
In [31]: testdf1 == testdf2
Out[31]:
user_id count
0 True True
1 True True
2 True True
3 True True
4 True True
As you can see, both dataframes are equivalent. In the second approach we only had to call reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.
The second case needs less code but is less readable to new eyes. I'd prefer the first approach of using to_frame() because it makes the intent clear: "Convert this count series to a dataframe and rename the column 'module_id' to 'count'".
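For a check that does not depend on random data, here is a compact sketch on a small fixed frame (the values are made up). It uses pd.testing.assert_frame_equal, which raises if the two results differ in values, dtypes, or labels.

```python
import pandas as pd

# Small fixed frame with the same columns as the example (values made up).
df = pd.DataFrame(
    {
        "user_id": [0, 1, 1, 2, 2, 2],
        "module_id": [3, 4, 4, 1, 0, 2],
        "week": [1, 1, 2, 2, 3, 3],
    }
)

counts = df.groupby(["user_id"])["module_id"].count()

# Approach 1: Series -> DataFrame -> reset index -> rename column.
via_to_frame = counts.to_frame().reset_index().rename({"module_id": "count"}, axis="columns")

# Approach 2: Series.reset_index with its name parameter.
via_name = counts.reset_index(name="count")

# Raises AssertionError if the frames are not identical.
pd.testing.assert_frame_equal(via_to_frame, via_name)
print(via_name["count"].tolist())  # [1, 2, 3]
```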

Group one column by another column in pandas?

I would like to get the median value of one column and use the associated value of another column. For example,
col1 col2 index
0 1 3 A
1 2 4 A
2 3 5 A
3 4 6 B
4 5 7 B
5 6 8 B
6 7 9 B
I group by the index to get the median value of col 1, and use the associated value of col 2 to get
col1 col2 index
2 4 A
5 7 B
I can't use the actual median value for index B because it will average the two middle values and that value won't have a corresponding value in col 2.
What's the best way to do this? Will a groupby method work? Or somehow use sort? Do I need to define my own function?
It seems you need to take the row at the middle position, not the median, from the original df:
df.groupby('index')[['col1','col2']].apply(lambda x : pd.Series(sorted(x.values.tolist())[len(x)//2]))
Out[297]:
0 1
index
A 2 4
B 6 8
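Note that for a group with an even number of rows, len(x)//2 picks the upper of the two middle rows (6, 8 for B), while the question's expected output uses the lower one (5, 7). A possible variation is sketched below; the helper middle_row and its lower flag are hypothetical names introduced here, not part of pandas or of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6, 7],
        "col2": [3, 4, 5, 6, 7, 8, 9],
        "index": list("AAABBBB"),
    }
)

def middle_row(group, lower=False):
    """Return the middle row of a group after sorting by col1.

    For an even number of rows, lower=True picks the lower of the two
    middle rows and lower=False the upper one.
    """
    ordered = group.sort_values("col1")
    pos = (len(ordered) - 1) // 2 if lower else len(ordered) // 2
    return ordered.iloc[pos]

upper = df.groupby("index")[["col1", "col2"]].apply(middle_row)
lower = df.groupby("index")[["col1", "col2"]].apply(middle_row, lower=True)

print(upper.loc["B"].tolist())  # [6, 8]
print(lower.loc["B"].tolist())  # [5, 7]
```

Both variants return an actual row of the group, so the col2 value always corresponds to the chosen col1 value.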

Pandas dropping columns by index drops all columns with same name

Consider the following dataframe, which has columns with the same name (apparently this does happen; I currently have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index, only the column at that position would be removed, but apparently this is not the case.
>>> df.drop(df.columns[-1],1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I chose misleading values for the first column, fixed now
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
a
0 0
1 1
2 2
3 3
4 4
This selects all rows, then uses the column mask generated by duplicated, inverted with ~, so that only the first occurrence of each column name is kept.
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.1) and was removed in pandas 1.0, you should use either of the following:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to @DavideFiocco for alerting me
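For reference, a self-contained sketch using the values from the question. The keep="last" variant and the positional iloc slice are extra illustrations added here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10, 15), "b": range(5, 10)})
df.columns = ["a", "a"]  # force duplicate column names

# Boolean mask: True for the first occurrence of each column name.
keep_first = ~df.columns.duplicated()
deduped = df.loc[:, keep_first]
print(deduped["a"].tolist())  # [10, 11, 12, 13, 14]

# Keep the last occurrence of each name instead.
deduped_last = df.loc[:, ~df.columns.duplicated(keep="last")]
print(deduped_last["a"].tolist())  # [5, 6, 7, 8, 9]

# Dropping purely by position also works, regardless of the names.
without_last = df.iloc[:, :-1]
print(without_last.shape)  # (5, 1)
```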