Pandas dataframe first value shows up as column name - pandas

I am new to pandas. I have a pandas data frame, but the first value (0,0) is being used as the name instead of as data. I want 0.9121 to be the first value. How do I do that?
0 0.2171
1 0.21163
2 0.87221
3 0.432735
4 0.3231
Name: 0.9121, dtype: float64
I would like to have:
0 0.9121
1 0.2171
2 0.21163
3 0.87221
4 0.432735
5 0.3231
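One possible explanation, offered as a guess since the question does not show how the data was loaded: if the values came from a file without a header row, pandas treats the first line as the column name by default. The sketch below assumes a hypothetical one-column text source (`raw`) and shows both re-reading with `header=None` and repairing an already-loaded Series.

```python
import io
import pandas as pd

# Hypothetical raw data: one value per line, no header row (an assumption).
raw = "0.9121\n0.2171\n0.21163\n0.87221\n0.432735\n0.3231\n"

# Default behaviour: the first line is consumed as the column name.
with_header = pd.read_csv(io.StringIO(raw))

# header=None keeps the first line as data; the column is then labelled 0.
no_header = pd.read_csv(io.StringIO(raw), header=None)
first_col = no_header[0]

# If re-reading is not an option, put the name back as the first value.
s = with_header.iloc[:, 0]
repaired = pd.concat([pd.Series([float(s.name)]), s], ignore_index=True)

print(first_col)
print(repaired)
```

Both `first_col` and `repaired` hold six values starting with 0.9121. Re-reading with `header=None` is usually the cleaner fix when the source file is available.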

Related

Trying to convert column to be row indexes, set_index error

data_new.set_index('Usual Mode of Transport to Work')
jupyter notebook
I am trying to convert a column into the row index, but it shows up as NaN. How do I resolve this? Thanks. I'm a beginner in Python.
Let's start with a toy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(5, 4)), columns=list('ABCD'))
print(df)
A B C D
0 3 1 2 1
1 2 2 3 4
2 2 4 4 1
3 1 0 3 2
4 1 2 4 0
Now, let's set column A as the index
df.set_index('A')
B C D
A
3 1 2 1
2 2 3 4
2 4 4 1
1 0 3 2
1 2 4 0
This returns a correctly indexed dataframe but does not store it in the original variable, df. So when you check the value of df afterwards, you still see the original dataframe.
To keep the new index, you can do one of the following:
df = df.set_index('A')
or
df.set_index('A', inplace=True)
Coming to the NaN values, I believe it has got something to do with using Jupyter notebook. Since Jupyter allows jumping between cells, it does not necessarily follow the linear execution order like traditional coding. This can get confusing. You can use the "Variable View" in Jupyter to cross-check if you are passing the value you intend to. I hope this can help you figure out the NaN issue.
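The difference between calling set_index and assigning its result, described above, can be checked with a small sketch. The data here is fixed rather than random so the outcome is reproducible; the values are made up for illustration.

```python
import pandas as pd

# Small fixed frame (illustrative values, not the asker's data).
df = pd.DataFrame({"A": [3, 2, 2, 1, 1], "B": [1, 2, 4, 0, 2]})

# set_index returns a new frame; df itself is left untouched.
returned = df.set_index("A")
print(df.columns.tolist())        # 'A' is still a regular column here
print(returned.index.name)        # the returned frame is indexed by 'A'

# Assigning the result back makes the change stick.
df = df.set_index("A")
print(df.index.name)
```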

Add/subtract value of a column to the entire column of the dataframe pandas

I have a DataFrame like this, where for column2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value at row 1 column 3. How do I efficiently execute this?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i,j) in zip(range(0, 5999),range(1,len(df.columns))):
    if j==1:
        df2.values[i,j] = df.values[i,j] + df.values[0,1]
    elif j>1:
        df2.iloc[i,j] = df.iloc[i,j] - df.iloc[0,j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (the column names of the dataframe), and the second column holds the actual values of the first row of the dataframe.
We can convert it to a list to better see its values:
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, through broadcasting, pandas aligns this Series with the dataframe's columns and subtracts each value from every entry of the column it came from.
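The explanation above can be run end to end on the same example frame. This is a minimal sketch of the subtraction, not the asker's actual data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))

# The first row is a Series indexed by the column labels; pandas aligns it
# with the columns and subtracts it from every row.
df2 = df - df.iloc[0]
print(df2)
```

The first row of `df2` is all zeros, and every other row holds the offset from the first row. The same expression covers a column whose first value is negative, since subtracting a negative value adds it.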

sort and count values in a column DataFrame (Python Pandas)

I have the following DataFrame
df
I count the values this way
I want to have the category values in the following order:
1.0 1
4.0 1
7.0 2
10.0 1
and so on ...
That is, in ascending order of category, each with its respective count.
You can sort by index using sort_index()
df['col_1'].value_counts().sort_index()
You can sort on the index after calling value_counts. Here's an example:
df = pd.DataFrame({'x':[1,2,2,2,1,4,5,5,4,3,5,6,3]})
df['x'].value_counts().sort_index()
Output:
1 2
2 3
3 2
4 2
5 3
6 1
Name: x, dtype: int64
Given a data frame df1 with a category column Cat, you can group by category, count the rows, and sort the result with the code below:
df1.groupby(by=['Cat']).count().sort_values(by='col1')
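Since the data for this answer is not shown, here is a small sketch with a made-up `df1`. The column names `Cat` and `col1` come from the code above, while the values are assumptions chosen to match the counts in the question.

```python
import pandas as pd

# Hypothetical data: 'Cat' holds the categories, 'col1' is any other column.
df1 = pd.DataFrame({
    "Cat": [7.0, 1.0, 10.0, 7.0, 4.0],
    "col1": ["a", "b", "c", "d", "e"],
})

# groupby sorts the group keys by default, so the counts already come out
# in ascending order of category.
counts = df1.groupby(by=["Cat"]).count()
print(counts)

# sort_values orders by the count itself rather than by category.
by_count = counts.sort_values(by="col1")
print(by_count)
```

Note that `sort_values(by='col1')` orders by frequency. If the goal is ascending category order, as in the question, the unsorted `counts` frame (or `value_counts().sort_index()`) already provides it.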

get second largest value in row in selected columns in dataframe in pandas

I have a dataframe, a subset of which is shown below. There are more columns to the right and left of the ones shown.
M_cols 10D_MA 30D_MA 50D_MA 100D_MA 200D_MA Max Min 2nd smallest
68.58 70.89 69.37 **68.24** 64.41 70.89 64.41 68.24
**68.32** 71.00 69.47 68.50 64.49 71.00 64.49 68.32
68.57 **68.40** 69.57 71.07 64.57 71.07 64.57 68.40
I can get the min (and max is easy as well) with the following code:
df2['MIN'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].min(axis=1)
But how do I get the 2nd smallest? I tried this and got the following error:
df2['2nd SMALLEST'] = df2[['10D_MA','30D_MA','50D_MA','100D_MA','200D_MA']].nsmallest(2)
TypeError: nsmallest() missing 1 required positional argument: 'columns'
Seems like this should be a simple answer but I am stuck
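For example, suppose you have the following df:
df=pd.DataFrame({'V1':[1,2,3],'V2':[3,2,1],'V3':[3,4,9]})
After picking the columns you need to compare, sort the values within each row. np.sort sorts along the last axis by default (axis=-1), which for a 2-D array means row by row: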
sortdf=pd.DataFrame(np.sort(df[['V1','V2','V3']].values))
sortdf
Out[419]:
0 1 2
0 1 3 3
1 2 2 4
2 1 3 9
1st max:
sortdf.iloc[:,-1]
Out[421]:
0 3
1 4
2 9
Name: 2, dtype: int64
2nd max:
sortdf.iloc[:,-2]
Out[422]:
0 3
1 2
2 3
Name: 1, dtype: int64
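The same row-wise sort also answers the original question about the 2nd smallest value: it sits in the second position of each sorted row. A minimal sketch on the toy frame, with the result column names borrowed from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'V1': [1, 2, 3], 'V2': [3, 2, 1], 'V3': [3, 4, 9]})
cols = ['V1', 'V2', 'V3']

# Sort each row in ascending order (axis=1 is the last axis of a 2-D array).
sorted_rows = np.sort(df[cols].to_numpy(), axis=1)

df['MIN'] = sorted_rows[:, 0]            # smallest value per row
df['2nd SMALLEST'] = sorted_rows[:, 1]   # second position per row
df['2nd LARGEST'] = sorted_rows[:, -2]   # second from the end per row
print(df)
```

Ties are counted by position, not by distinct value: in the second row (2, 2, 4) the 2nd smallest is 2, the same as the minimum.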

How to check dependency of one column to another in a pandas dataframe

I have the following dataframe:
import pandas as pd
df=pd.DataFrame([[1,11,'a'],[1,12,'a'],[1,11,'a'],[1,12,'a'],[1,7,'a'],
[1,12,'a']])
df.columns=['id','code','name']
df
id code name
0 1 11 a
1 1 12 a
2 1 11 a
3 1 12 a
4 1 7 a
5 1 12 a
As shown in the above dataframe, the value of column "id" is directly related to the value of column "name". If I have, say, a million records, how can I tell whether one column is totally dependent on another column in a dataframe?
If they are totally dependent, then their factorizations will be the same:
(df.id.factorize()[0] == df.name.factorize()[0]).all()
True
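The factorization check above tests a one-to-one correspondence between the two columns. As a sketch of a related, one-directional check (an alternative I am adding, not part of the original answer), counting distinct values per group shows whether one column is determined by another:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 11, 'a'], [1, 12, 'a'], [1, 11, 'a'],
     [1, 12, 'a'], [1, 7, 'a'], [1, 12, 'a']],
    columns=['id', 'code', 'name'],
)

# One-to-one check from the answer: identical integer codes for both columns.
same_codes = (df['id'].factorize()[0] == df['name'].factorize()[0]).all()

# One-directional check: each id maps to at most one distinct name.
name_depends_on_id = df.groupby('id')['name'].nunique().le(1).all()

# Counter-example: a single id maps to several codes, so code is not
# determined by id.
code_depends_on_id = df.groupby('id')['code'].nunique().le(1).all()

print(same_codes, name_depends_on_id, code_depends_on_id)
```

Bracket access (`df['name']`) is used here instead of `df.name` to avoid clashes between column names and DataFrame attributes.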