How to check dependency of one column on another in a pandas dataframe

I have the following dataframe:
import pandas as pd
df=pd.DataFrame([[1,11,'a'],[1,12,'a'],[1,11,'a'],[1,12,'a'],[1,7,'a'],
[1,12,'a']])
df.columns=['id','code','name']
df
id code name
0 1 11 a
1 1 12 a
2 1 11 a
3 1 12 a
4 1 7 a
5 1 12 a
As the dataframe above shows, the value of column "id" is directly related to the value of column "name". If I have, say, a million records, how can I tell whether one column is totally dependent on another column in a dataframe?

If the two columns are totally dependent on each other (a one-to-one mapping), then their factorizations will be the same:
(df.id.factorize()[0] == df.name.factorize()[0]).all()
True
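Equal factorizations indicate a one-to-one relationship. If you only need one direction (each id maps to exactly one name, while several ids may share a name), a groupby count is one way to check it. This is a minimal sketch, not the only approach; the helper name `determines` is hypothetical, and NaN values are ignored by `nunique` by default.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 11, 'a'], [1, 12, 'a'], [1, 11, 'a'], [1, 12, 'a'], [1, 7, 'a'], [1, 12, 'a']],
    columns=['id', 'code', 'name'],
)

def determines(frame, source, target):
    """Return True if every value of `source` maps to at most one value of `target`."""
    # Count the distinct target values inside each source group;
    # the dependency holds when no group contains more than one.
    return bool(frame.groupby(source)[target].nunique().le(1).all())

id_determines_name = determines(df, 'id', 'name')   # id 1 only ever appears with 'a'
id_determines_code = determines(df, 'id', 'code')   # id 1 appears with 11, 12 and 7
print(id_determines_name, id_determines_code)       # True False
```

On a million rows this is a single groupby pass, so it should stay cheap.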

Related

Subtract a specific row from a csv using Python

I have two csv files: one containing data, the other containing a single row with the same columns as the first file. I am trying to subtract that single row from every row of the first file using pandas.
I have tried the following, but to no avail.
df = df.subtract(row, axis=1)
You're looking for the "drop" method. From pandas docs:
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
drop by index:
df.drop([0, 1])
A B C D
2 8 9 10 11
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
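If the goal is arithmetic subtraction rather than removing rows, the following is a minimal sketch. It assumes `row` was read as a one-row DataFrame (the question does not show how it was loaded), and the values here are hypothetical stand-ins for the two csv files.

```python
import pandas as pd

# Hypothetical stand-ins for the data file and the single-row file.
df = pd.DataFrame({'A': [0, 4, 8], 'B': [1, 5, 9], 'C': [2, 6, 10], 'D': [3, 7, 11]})
row = pd.DataFrame({'A': [1], 'B': [1], 'C': [1], 'D': [1]})

# DataFrame minus DataFrame aligns on the row index as well as the columns,
# so only index 0 lines up and every other row becomes NaN.
aligned = df.subtract(row)

# Taking the single row as a Series broadcasts it over all rows,
# matching values by column name.
result = df.subtract(row.iloc[0], axis=1)
print(result)
```

The key step is `row.iloc[0]`, which turns the one-row frame into a Series indexed by column name.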

Trying to convert column to be row indexes, set_index error

data_new.set_index('Usual Mode of Transport to Work')
jupyter notebook
I am trying to convert a column into the row index, but it shows up as NaN. How do I resolve this? Thanks. I'm a beginner in Python.
Let's start with a toy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(5, 4)), columns=list('ABCD'))
print(df)
A B C D
0 3 1 2 1
1 2 2 3 4
2 2 4 4 1
3 1 0 3 2
4 1 2 4 0
Now, let's set column A as the index
df.set_index('A')
B C D
A
3 1 2 1
2 2 3 4
2 4 4 1
1 0 3 2
1 2 4 0
This sets the index correctly but doesn't save the newly indexed dataframe back into the original variable, i.e., df. So when you check the value of df, you still find the original dataframe.
To save the new indexing, you can do one of the following
df = df.set_index('A')
or
df.set_index('A', inplace=True)
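The difference can be checked directly. This is a small sketch with fixed values (instead of the random ones above) so that the result is reproducible:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 2, 1, 1], 'B': [1, 2, 4, 0, 2]})

df.set_index('A')                  # returns a new DataFrame; df itself is unchanged
columns_before = list(df.columns)  # 'A' is still an ordinary column

df = df.set_index('A')             # reassigning keeps the new index
columns_after = list(df.columns)
index_name = df.index.name

print(columns_before, columns_after, index_name)  # ['A', 'B'] ['B'] A
```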
As for the NaN values, I believe they have something to do with using Jupyter notebook. Since Jupyter lets you run cells in any order, execution does not necessarily follow the linear order of a traditional script. This can get confusing. You can use the "Variable View" in Jupyter to cross-check that you are passing the value you intend to. I hope this helps you figure out the NaN issue.

Replace value in pandas dataframe based on where condition

I have created a dataframe called df with this code:
import numpy as np
import pandas as pd
# initialize data of lists.
data = {'Feature1': [1, 2, -9999999, 4, 5],
        'Age': [20, 21, 19, 18, 34]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
The dataframe looks like this:
Feature1 Age
0 1 20
1 2 21
2 -9999999 19
3 4 18
4 5 34
Every time there is a value of -9999999 in column Feature1, I need to replace it with the corresponding value from column Age. So the output dataframe would look like this:
Feature1 Age
0 1 20
1 2 21
2 19 19
3 4 18
4 5 34
Bear in mind that the actual dataframe that I am using has 200K records (the one I have shown above is just an example).
How do I do that in pandas?
You can use np.where or Series.mask
df['Feature1'] = df['Feature1'].mask(df['Feature1'].eq(-9999999), df['Age'])
# or
df['Feature1'] = np.where(df['Feature1'].eq(-9999999), df['Age'], df['Feature1'])
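Both variants can be run against the example data from the question to confirm that they agree. This is a sketch; the names `is_sentinel`, `via_mask` and `via_where` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Feature1': [1, 2, -9999999, 4, 5],
                   'Age': [20, 21, 19, 18, 34]})

# Boolean Series marking the rows that hold the placeholder value.
is_sentinel = df['Feature1'].eq(-9999999)

# mask replaces the values where the condition is True.
via_mask = df['Feature1'].mask(is_sentinel, df['Age'])

# np.where picks from Age where the condition is True, otherwise from Feature1.
via_where = np.where(is_sentinel, df['Age'], df['Feature1'])

df['Feature1'] = via_mask
print(df)
```

Both are vectorized, so 200K rows should not be a problem.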

the column in csv that comes from the index of DataFrame does not have a header name

Here is a pandas DataFrame:
>>> print(df)
A B C
0 0 1 2
1 3 4 5
2 6 7 8
With df.to_csv('df.csv') I got this file.
The column in the csv that comes from the index of the DataFrame does not have a header name. Is it possible to specify a column name with pandas?
Try with rename_axis
df.rename_axis('index').to_csv('df.csv')
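to_csv also accepts an index_label argument, which names the index column in the file without touching the dataframe. A short sketch that writes to in-memory buffers rather than to df.csv, so that the header line can be inspected:

```python
import io
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]], columns=list('ABC'))

# Variant 1: name the index first, then write.
buf_axis = io.StringIO()
df.rename_axis('index').to_csv(buf_axis)

# Variant 2: let to_csv label the index column.
buf_label = io.StringIO()
df.to_csv(buf_label, index_label='index')

header_axis = buf_axis.getvalue().splitlines()[0]
header_label = buf_label.getvalue().splitlines()[0]
print(header_axis)   # index,A,B,C
print(header_label)  # index,A,B,C
```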

pandas read csv is returning extra unknown column

I am creating a csv file from pandas dataframe by combining two lists using:
df = pd.DataFrame(list(zip(patients_full, labels)),
                  columns=['id', 'cancer'])
df.to_csv("labels.csv")
But when I read the csv back, an extra column named "Unnamed: 0" shows up. How do I remove it?
Unnamed: 0 id cancer
0 0 HF0953.npy 1
1 1 HF1058.npy 3
2 2 HF1071.npy 3
3 3 HF1122.npy 3
4 4 HF1235.npy 1
5 5 HF1280.npy 2
6 6 HF1344.npy 1
7 7 HF1463.npy 1
8 8 HF1489.npy 1
9 9 HF1490.npy 2
10 10 HF1587.npy 2
11 11 HF1613.npy 2
This happens because to_csv("labels.csv") saves the index column by default. The index of the dataframe you saved had no name, so its header in the file is blank. When you read the file back with read_csv("labels.csv"), that column is treated like any other column, and its blank header becomes Unnamed: 0. To avoid this you have 2 options:
Option 1 - read the first column back as the index:
pd.read_csv("labels.csv", index_col=0)
Option 2 - do not save the index:
df.to_csv("labels.csv", index=False)
That column in your output is the index of the dataframe. To leave it out of the output, use df.to_csv('labels.csv', index=False). More information on that method is available in the pandas docs.
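The round trip can be reproduced without touching the disk. This is a sketch that uses two rows taken from the output in the question and in-memory strings instead of labels.csv:

```python
import io
import pandas as pd

df = pd.DataFrame({'id': ['HF0953.npy', 'HF1058.npy'], 'cancer': [1, 3]})

# By default to_csv writes the unnamed index as a first column with an empty header.
with_index = df.to_csv()
default_read = pd.read_csv(io.StringIO(with_index))

# Either read that first column back as the index ...
as_index = pd.read_csv(io.StringIO(with_index), index_col=0)

# ... or do not write it in the first place.
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(default_read.columns))  # ['Unnamed: 0', 'id', 'cancer']
print(list(as_index.columns))      # ['id', 'cancer']
print(list(no_index.columns))      # ['id', 'cancer']
```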