Pandas Function to Add Underscore to all Column Headers in a DataFrame - pandas

I am looking to write a pandas function that adds underscores to the beginning of all column headers of a given data frame.

Use DataFrame.add_prefix.
It works even if the original column labels aren't strings.
import pandas as pd
df = pd.DataFrame([[1, 1, 1]], columns=['a', 0, 'foo'])
#    a  0  foo
# 0  1  1    1
df.add_prefix('_')
#    _a  _0  _foo
# 0   1   1     1
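Since the question asks for a function, the call above can be wrapped in a small helper. This is only a minimal sketch: the name add_underscore_prefix is made up for illustration, and the rename-based variant is an equivalent alternative rather than part of the answer above.

```python
import pandas as pd

def add_underscore_prefix(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame with '_' prepended to every column label."""
    # add_prefix turns each label into a string before prepending the prefix
    return df.add_prefix('_')

df = pd.DataFrame([[1, 1, 1]], columns=['a', 0, 'foo'])
prefixed = add_underscore_prefix(df)
print(list(prefixed.columns))

# Equivalent alternative: rename with a callable applied to each label
renamed = df.rename(columns=lambda c: f'_{c}')
print(list(renamed.columns))
```

Neither call modifies df, so both variants can be used inside a method chain.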

Related

Create new column based on two columns

I have two columns in a DataFrame. I want to create a third column that is 1 if the first column is greater than the second column, and 0 otherwise, as below:
df
Value1  value 2  Newcolumn
100     101      0
101     97       1
To compare two columns in a pandas DataFrame and write the result of the comparison to a third column, you can use np.select with the following pattern:
conditions=[(condition1),(condition2)]
choices=["choice1","choice2"]
df["new_column_name"]=np.select(conditions, choices, default)
conditions is the list of boolean conditions to check between the two columns.
choices is the list of values to return, one per condition.
np.select returns, for each row, the choice belonging to the first condition that is true, or default if none is.
A complete example:
import numpy as np
import pandas as pd

# create DataFrame
df = pd.DataFrame({'Value1': [100, 101],
                   'value 2': [101, 97]})
# define conditions
conditions = [df['Value1'] < df['value 2'],
              df['Value1'] > df['value 2']]
# define choices (one per condition)
choices = ['0', '1']
# create new column in DataFrame that displays results of comparisons
df['Newcolumn'] = np.select(conditions, choices, default='Tie')
Final dataframe
print(df)
Output:
Value1 value 2 Newcolumn
0 100 101 0
1 101 97 1
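If only the integer 1/0 flag from the question is needed, a single boolean comparison may be enough. A minimal sketch, assuming ties should map to 0 (the question does not say); the column name Newcolumn_where is made up for the comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value1': [100, 101],
                   'value 2': [101, 97]})

# True where Value1 > value 2; casting the booleans to int gives 1/0
df['Newcolumn'] = (df['Value1'] > df['value 2']).astype(int)

# Same result with np.where, which spells out both outcomes
df['Newcolumn_where'] = np.where(df['Value1'] > df['value 2'], 1, 0)
print(df)
```

Unlike the np.select version, this yields integers rather than the strings '0' and '1'.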

pandas assign column names in method chain

I can assign a list as column names in pandas easily in one line, but (how) can I do the same thing in a method chain?
import pandas as pd
df = pd.DataFrame(data={'a':[1,2], 'b':[2,4]})
new_column_names =['aa', 'bb']
# classical way:
df.columns= new_column_names
What I want is to do this inside a longer method chain:
# method chain
(df.some_chain_method(...)
.another_chain_method(...)
.assign_columnnames(new_column_names))
You can assume the number of columns is known and matches new_column_names.
Yes, use set_axis:
df.set_axis(new_column_names, axis=1)
Output:
aa bb
0 1 2
1 2 4
Note: in older versions of pandas, set_axis defaulted to inplace=True, so you needed to pass inplace=False to chain other methods. The default was later changed to inplace=False, and the inplace parameter was removed entirely in pandas 2.0.
Example with chaining:
df.set_axis(new_column_names, axis=1).eval('cc = aa + bb')
Output:
aa bb cc
0 1 2 3
1 2 4 6
I think you can also use the .rename() method, with the inplace option enabled, for example:
import pandas as pd
df = pd.DataFrame(data={'a':[1,2], 'b':[2,4]})
df.rename(columns={"a": "aa", "b":"bb"}, inplace=True)
df
which results in
aa bb
0 1 2
1 2 4
Using rename means you can change just a subset of the column names.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
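Note that with inplace=True, rename returns None, so it cannot sit in the middle of a chain. A sketch of the chainable form, leaving inplace at its default and building the mapping from the new_column_names list in the question (the cc column is only there to show that the chain continues):

```python
import pandas as pd

df = pd.DataFrame(data={'a': [1, 2], 'b': [2, 4]})
new_column_names = ['aa', 'bb']

result = (
    df
    # pair old labels with new ones by position; rename returns a new DataFrame
    .rename(columns=dict(zip(df.columns, new_column_names)))
    # keep chaining on the renamed frame
    .assign(cc=lambda d: d['aa'] + d['bb'])
)
print(result)
```

The original df keeps its labels a and b, since nothing is modified in place.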

How to rename pandas dataframe column with another dataframe?

I really don't understand what I'm doing. I have two data frames. One has a list of column labels and another has a bunch of data. I want to just label the columns in my data with my column labels.
My Code:
airportLabels = pd.read_csv('airportsLabels.csv', header= None)
airportData = pd.read_table('airports.dat', sep=",", header = None)
df = DataFrame(airportData, columns = airportLabels)
When I do this, all the data turns into "NaN" and there is only one column anymore. I am really confused.
I think you need to add the nrows parameter to read_csv if you only need to read the column names. Remove header=None, because the first row of the CSV holds the column names, and then pass the columns of the airportLabels DataFrame to the names parameter of read_table:
import pandas as pd
import io
temp=u"""col1,col2,col3
1,5,4
7,8,5"""
# after testing, replace io.StringIO(temp) with the filename
airportLabels = pd.read_csv(io.StringIO(temp), nrows=0)
print(airportLabels)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
temp=u"""
a,d,f
e,r,t"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_table(io.StringIO(temp), sep=",", header=None, names=airportLabels.columns)
print(df)
col1 col2 col3
0 a d f
1 e r t
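If the labels file is read with header=None, as in the question, the names end up in the first data row instead of in the columns. A sketch for that case, assuming the labels file holds all names on a single comma-separated line (the question does not show the file) and using made-up in-memory text in place of the two files:

```python
import io
import pandas as pd

# stand-ins for airportsLabels.csv and airports.dat (contents are made up)
labels_text = "col1,col2,col3\n"
data_text = "a,d,f\ne,r,t\n"

# read the single row of labels and turn it into a plain list of strings
labels = pd.read_csv(io.StringIO(labels_text), header=None).iloc[0].tolist()

# use that list as the column names while reading the data
df = pd.read_csv(io.StringIO(data_text), header=None, names=labels)
print(df)
```

If the labels were stored one per line instead, iloc[:, 0] would pick them out in place of iloc[0].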

How to shift the column headers in pandas

I have .txt files I'm reading in with pandas and the header line starts with '~A'. I need to ignore the '~A' and have the next header correspond to the data in the first column. Thanks!
You can do this:
import pandas as pd
data = pd.read_csv("./test.txt", names=[ 'A', 'B' ], skiprows=1)
print(data)
and the output for input:
~A, A, B
1, 2
3, 4
is:
c:\Temp\python>python test.py
A B
0 1 2
1 3 4
You have to name the columns yourself, but given that your file seems to be malformed, I guess that is not too bad.
If your header lines are not the same in all files, then you can just read them in Python:
import pandas as pd
# read first line
with open("./test.txt") as myfile:
    headRow = next(myfile)
# read column names
columns = [x.strip() for x in headRow.split(',')]
# process by pandas
data = pd.read_csv("./test.txt", names=columns[1:], skiprows=1)
print(data)
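The same idea can be tried without a file on disk. A sketch that feeds the sample input from above as in-memory text through a hypothetical helper, read_shifted (the name and the use of skipinitialspace are illustrative choices, not part of the answer):

```python
import io
import pandas as pd

def read_shifted(text):
    """Parse CSV text whose header line starts with an extra '~A' token."""
    # split off the header line from the data lines
    header, body = text.split('\n', 1)
    columns = [x.strip() for x in header.split(',')]
    # drop the leading '~A' so the remaining names line up with the data
    return pd.read_csv(io.StringIO(body), names=columns[1:],
                       skipinitialspace=True)

sample = "~A, A, B\n1, 2\n3, 4\n"
data = read_shifted(sample)
print(data)
```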

Equivalent of R's which in pandas

How do I get the column of the min in the example below, not the actual number?
In R I would do:
which(min(abs(_quantiles - mean(_quantiles))))
In pandas I tried (did not work):
_quantiles.which(min(abs(_quantiles - mean(_quantiles))))
You could do it this way: call np.min on the DataFrame's values as a NumPy array, use the result to build a boolean mask, and drop the columns that don't contain at least one non-NaN value:
In [2]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df
Out[2]:
a b
0 -0.860548 -2.427571
1 0.136942 1.020901
2 -1.262078 -1.122940
3 -1.290127 -1.031050
4 1.227465 1.027870
In [15]:
df[df==np.min(df.values)].dropna(axis=1, thresh=1).columns
Out[15]:
Index(['b'], dtype='object')
idxmin and idxmax exist, but there is no general which as far as I can see. Note that idxmin is called on the data itself rather than taking it as an argument:
(_quantiles - _quantiles.mean()).abs().idxmin()
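As a rough mapping, R's which.min corresponds to idxmin, and which(condition) corresponds to selecting index labels with a boolean mask. A sketch, assuming _quantiles is a Series; the name quantiles, its values, and its labels are made up for illustration:

```python
import pandas as pd

quantiles = pd.Series([1.0, 4.0, 6.0, 9.0],
                      index=['q10', 'q25', 'q75', 'q90'])

# absolute distance of each entry from the mean (the mean here is 5.0)
dist = (quantiles - quantiles.mean()).abs()

# like R's which.min: label of the first entry with the smallest distance
closest = dist.idxmin()
print(closest)

# like R's which(cond): every label where the condition holds
ties = dist.index[dist == dist.min()].tolist()
print(ties)
```

idxmin returns only the first match, so the boolean-mask form is the one to use when ties matter.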