Altering rownames in tibble - data-manipulation

I have 2 different tibbles, and have to find out how many of the rows from the first tibble is also present in the second tibble. Both tibbles have a first column named GeneID, but the problem is, that in one tibble the genes are names as 1, 2, 3, 4 ect, and in the sencond tibble they are named Gene1, Gene2, Gene3, Gene4... Are there anyway to either add 'Gene' before the number in the first tibble or remove 'Gene' in the second?

It is always good to include a sample of your data so that the responders can answer correctly. For example, if the field ordering are identical between the 2 datasets, e.g. df1 and df2, you can make the names the same by a simple:
names(df1) <- names(df2)

Is this what you'd like to do?
library(tidyverse)
df1 <- tribble(
~gene,
1,
2,
5,
6
)
df2 <- tribble(
~gene,
"Gene1",
"Gene2",
"Gene3",
"Gene4",
"Gene5"
)
# df1 rows also in df2
df1 |>
mutate(gene = str_c("Gene", gene)) |>
inner_join(df2, by = "gene")
#> # A tibble: 3 × 1
#> gene
#> <chr>
#> 1 Gene1
#> 2 Gene2
#> 3 Gene5
Created on 2022-06-16 by the reprex package (v2.0.1)

Related

How to concatenate values of a column B grouped by the other column A SQL and R

I have this dataframe
df <- data.frame(scientific_name = c("Mandevilla grazielae",
"Aosa parvifolia",
"Mandevilla grazielae",
"Dyckia ebracteata"),
collection_number = c("254", "658","445", "568"))
And I'd like to combine the collection number of the plants in a column named "vouchers"
scientific_name vouchers
Mandevilla grazielae 254, 445
Aosa parvifolia 658
Dyckia ebracteata 568
So far this is what I was able to do:
x <- sqldf('SELECT * FROM df GROUP BY scientific_name ORDER BY scientific_name ASC')
And then I started to learn more...
x <- sqldf('SELECT scientific_name, COUNT(collection_number)
FROM df GROUP BY scientific_name ORDER BY scientific_name ASC')
I'd like to know if there's something like the COUNT function above to concatenate the column "collection_number" in a "vouchers" column. I also tried CONCAT and it did't work.
I'd be very grateful!
In base R, use aggregate with paste
aggregate(cbind(vouchers = collection_number) ~ scientific_name, df, toString)
-output
scientific_name vouchers
1 Aosa parvifolia 658
2 Dyckia ebracteata 568
3 Mandevilla grazielae 254, 445
1) Using the same setup as in the question use group_concat to concatenate all collection numbers associated with a scientific name. 1 means the first column in the output or write out the column name in its place, i.e. scientific_name.
library(sqldf)
sqldf("select scientific_name, group_concat(collection_number) as vouchers
from df
group by 1
order by 1")
giving:
scientific_name vouchers
1 Aosa parvifolia 658
2 Dyckia ebracteata 568
3 Mandevilla grazielae 254,445
2) Using tidyverse it would be done like this:
library(dplyr)
df %>%
group_by(scientific_name) %>%
summarize(vendors = toString(collection_number)) %>%
ungroup %>%
arrange(scientific_name)
giving:
# A tibble: 3 × 2
scientific_name vendors
<chr> <chr>
1 Aosa parvifolia 658
2 Dyckia ebracteata 568
3 Mandevilla grazielae 254, 445

Pandas - Is it possible to permanently change dataframe column labels to the default column numbers. The dataframe has at least 40 columns [duplicate]

I want to change the column labels of a Pandas DataFrame from
['$a', '$b', '$c', '$d', '$e']
to
['a', 'b', 'c', 'd', 'e']
Rename Specific Columns
Use the df.rename() function and refer the columns to be renamed. Not all the columns have to be renamed:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# Or rename the existing DataFrame (rather than creating a copy)
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
Minimal Code Example
df = pd.DataFrame('x', index=range(3), columns=list('abcde'))
df
a b c d e
0 x x x x x
1 x x x x x
2 x x x x x
The following methods all work and produce the same output:
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1) # new method
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis='columns')
df2 = df.rename(columns={'a': 'X', 'b': 'Y'}) # old method
df2
X Y c d e
0 x x x x x
1 x x x x x
2 x x x x x
Remember to assign the result back, as the modification is not-inplace. Alternatively, specify inplace=True:
df.rename({'a': 'X', 'b': 'Y'}, axis=1, inplace=True)
df
X Y c d e
0 x x x x x
1 x x x x x
2 x x x x x
From v0.25, you can also specify errors='raise' to raise errors if an invalid column-to-rename is specified. See v0.25 rename() docs.
Reassign Column Headers
Use df.set_axis() with axis=1 and inplace=False (to return a copy).
df2 = df.set_axis(['V', 'W', 'X', 'Y', 'Z'], axis=1, inplace=False)
df2
V W X Y Z
0 x x x x x
1 x x x x x
2 x x x x x
This returns a copy, but you can modify the DataFrame in-place by setting inplace=True (this is the default behaviour for versions <=0.24 but is likely to change in the future).
You can also assign headers directly:
df.columns = ['V', 'W', 'X', 'Y', 'Z']
df
V W X Y Z
0 x x x x x
1 x x x x x
2 x x x x x
Just assign it to the .columns attribute:
>>> df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
>>> df
$a $b
0 1 10
1 2 20
>>> df.columns = ['a', 'b']
>>> df
a b
0 1 10
1 2 20
The rename method can take a function, for example:
In [11]: df.columns
Out[11]: Index([u'$a', u'$b', u'$c', u'$d', u'$e'], dtype=object)
In [12]: df.rename(columns=lambda x: x[1:], inplace=True)
In [13]: df.columns
Out[13]: Index([u'a', u'b', u'c', u'd', u'e'], dtype=object)
As documented in Working with text data:
df.columns = df.columns.str.replace('$', '')
Pandas 0.21+ Answer
There have been some significant updates to column renaming in version 0.21.
The rename method has added the axis parameter which may be set to columns or 1. This update makes this method match the rest of the pandas API. It still has the index and columns parameters but you are no longer forced to use them.
The set_axis method with the inplace set to False enables you to rename all the index or column labels with a list.
Examples for Pandas 0.21+
Construct sample DataFrame:
df = pd.DataFrame({'$a':[1,2], '$b': [3,4],
'$c':[5,6], '$d':[7,8],
'$e':[9,10]})
$a $b $c $d $e
0 1 3 5 7 9
1 2 4 6 8 10
Using rename with axis='columns' or axis=1
df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis='columns')
or
df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis=1)
Both result in the following:
a b c d e
0 1 3 5 7 9
1 2 4 6 8 10
It is still possible to use the old method signature:
df.rename(columns={'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'})
The rename function also accepts functions that will be applied to each column name.
df.rename(lambda x: x[1:], axis='columns')
or
df.rename(lambda x: x[1:], axis=1)
Using set_axis with a list and inplace=False
You can supply a list to the set_axis method that is equal in length to the number of columns (or index). Currently, inplace defaults to True, but inplace will be defaulted to False in future releases.
df.set_axis(['a', 'b', 'c', 'd', 'e'], axis='columns', inplace=False)
or
df.set_axis(['a', 'b', 'c', 'd', 'e'], axis=1, inplace=False)
Why not use df.columns = ['a', 'b', 'c', 'd', 'e']?
There is nothing wrong with assigning columns directly like this. It is a perfectly good solution.
The advantage of using set_axis is that it can be used as part of a method chain and that it returns a new copy of the DataFrame. Without it, you would have to store your intermediate steps of the chain to another variable before reassigning the columns.
# new for pandas 0.21+
df.some_method1()
.some_method2()
.set_axis()
.some_method3()
# old way
df1 = df.some_method1()
.some_method2()
df1.columns = columns
df1.some_method3()
Since you only want to remove the $ sign in all column names, you could just do:
df = df.rename(columns=lambda x: x.replace('$', ''))
OR
df.rename(columns=lambda x: x.replace('$', ''), inplace=True)
Renaming columns in Pandas is an easy task.
df.rename(columns={'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}, inplace=True)
df.columns = ['a', 'b', 'c', 'd', 'e']
It will replace the existing names with the names you provide, in the order you provide.
Use:
old_names = ['$a', '$b', '$c', '$d', '$e']
new_names = ['a', 'b', 'c', 'd', 'e']
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)
This way you can manually edit the new_names as you wish. It works great when you need to rename only a few columns to correct misspellings, accents, remove special characters, etc.
One line or Pipeline solutions
I'll focus on two things:
OP clearly states
I have the edited column names stored it in a list, but I don't know how to replace the column names.
I do not want to solve the problem of how to replace '$' or strip the first character off of each column header. OP has already done this step. Instead I want to focus on replacing the existing columns object with a new one given a list of replacement column names.
df.columns = new where new is the list of new columns names is as simple as it gets. The drawback of this approach is that it requires editing the existing dataframe's columns attribute and it isn't done inline. I'll show a few ways to perform this via pipelining without editing the existing dataframe.
Setup 1
To focus on the need to rename of replace column names with a pre-existing list, I'll create a new sample dataframe df with initial column names and unrelated new column names.
df = pd.DataFrame({'Jack': [1, 2], 'Mahesh': [3, 4], 'Xin': [5, 6]})
new = ['x098', 'y765', 'z432']
df
Jack Mahesh Xin
0 1 3 5
1 2 4 6
Solution 1
pd.DataFrame.rename
It has been said already that if you had a dictionary mapping the old column names to new column names, you could use pd.DataFrame.rename.
d = {'Jack': 'x098', 'Mahesh': 'y765', 'Xin': 'z432'}
df.rename(columns=d)
x098 y765 z432
0 1 3 5
1 2 4 6
However, you can easily create that dictionary and include it in the call to rename. The following takes advantage of the fact that when iterating over df, we iterate over each column name.
# Given just a list of new column names
df.rename(columns=dict(zip(df, new)))
x098 y765 z432
0 1 3 5
1 2 4 6
This works great if your original column names are unique. But if they are not, then this breaks down.
Setup 2
Non-unique columns
df = pd.DataFrame(
[[1, 3, 5], [2, 4, 6]],
columns=['Mahesh', 'Mahesh', 'Xin']
)
new = ['x098', 'y765', 'z432']
df
Mahesh Mahesh Xin
0 1 3 5
1 2 4 6
Solution 2
pd.concat using the keys argument
First, notice what happens when we attempt to use solution 1:
df.rename(columns=dict(zip(df, new)))
y765 y765 z432
0 1 3 5
1 2 4 6
We didn't map the new list as the column names. We ended up repeating y765. Instead, we can use the keys argument of the pd.concat function while iterating through the columns of df.
pd.concat([c for _, c in df.items()], axis=1, keys=new)
x098 y765 z432
0 1 3 5
1 2 4 6
Solution 3
Reconstruct. This should only be used if you have a single dtype for all columns. Otherwise, you'll end up with dtype object for all columns and converting them back requires more dictionary work.
Single dtype
pd.DataFrame(df.values, df.index, new)
x098 y765 z432
0 1 3 5
1 2 4 6
Mixed dtype
pd.DataFrame(df.values, df.index, new).astype(dict(zip(new, df.dtypes)))
x098 y765 z432
0 1 3 5
1 2 4 6
Solution 4
This is a gimmicky trick with transpose and set_index. pd.DataFrame.set_index allows us to set an index inline, but there is no corresponding set_columns. So we can transpose, then set_index, and transpose back. However, the same single dtype versus mixed dtype caveat from solution 3 applies here.
Single dtype
df.T.set_index(np.asarray(new)).T
x098 y765 z432
0 1 3 5
1 2 4 6
Mixed dtype
df.T.set_index(np.asarray(new)).T.astype(dict(zip(new, df.dtypes)))
x098 y765 z432
0 1 3 5
1 2 4 6
Solution 5
Use a lambda in pd.DataFrame.rename that cycles through each element of new.
In this solution, we pass a lambda that takes x but then ignores it. It also takes a y but doesn't expect it. Instead, an iterator is given as a default value and I can then use that to cycle through one at a time without regard to what the value of x is.
df.rename(columns=lambda x, y=iter(new): next(y))
x098 y765 z432
0 1 3 5
1 2 4 6
And as pointed out to me by the folks in sopython chat, if I add a * in between x and y, I can protect my y variable. Though, in this context I don't believe it needs protecting. It is still worth mentioning.
df.rename(columns=lambda x, *, y=iter(new): next(y))
x098 y765 z432
0 1 3 5
1 2 4 6
Column names vs Names of Series
I would like to explain a bit what happens behind the scenes.
Dataframes are a set of Series.
Series in turn are an extension of a numpy.array.
numpy.arrays have a property .name.
This is the name of the series. It is seldom that Pandas respects this attribute, but it lingers in places and can be used to hack some Pandas behaviors.
Naming the list of columns
A lot of answers here talks about the df.columns attribute being a list when in fact it is a Series. This means it has a .name attribute.
This is what happens if you decide to fill in the name of the columns Series:
df.columns = ['column_one', 'column_two']
df.columns.names = ['name of the list of columns']
df.index.names = ['name of the index']
name of the list of columns column_one column_two
name of the index
0 4 1
1 5 2
2 6 3
Note that the name of the index always comes one column lower.
Artefacts that linger
The .name attribute lingers on sometimes. If you set df.columns = ['one', 'two'] then the df.one.name will be 'one'.
If you set df.one.name = 'three' then df.columns will still give you ['one', 'two'], and df.one.name will give you 'three'.
BUT
pd.DataFrame(df.one) will return
three
0 1
1 2
2 3
Because Pandas reuses the .name of the already defined Series.
Multi-level column names
Pandas has ways of doing multi-layered column names. There is not so much magic involved, but I wanted to cover this in my answer too since I don't see anyone picking up on this here.
|one |
|one |two |
0 | 4 | 1 |
1 | 5 | 2 |
2 | 6 | 3 |
This is easily achievable by setting columns to lists, like this:
df.columns = [['one', 'one'], ['one', 'two']]
Many of pandas functions have an inplace parameter. When setting it True, the transformation applies directly to the dataframe that you are calling it on. For example:
df = pd.DataFrame({'$a':[1,2], '$b': [3,4]})
df.rename(columns={'$a': 'a'}, inplace=True)
df.columns
>>> Index(['a', '$b'], dtype='object')
Alternatively, there are cases where you want to preserve the original dataframe. I have often seen people fall into this case if creating the dataframe is an expensive task. For example, if creating the dataframe required querying a snowflake database. In this case, just make sure the the inplace parameter is set to False.
df = pd.DataFrame({'$a':[1,2], '$b': [3,4]})
df2 = df.rename(columns={'$a': 'a'}, inplace=False)
df.columns
>>> Index(['$a', '$b'], dtype='object')
df2.columns
>>> Index(['a', '$b'], dtype='object')
If these types of transformations are something that you do often, you could also look into a number of different pandas GUI tools. I'm the creator of one called Mito. It’s a spreadsheet that automatically converts your edits to python code.
Let's understand renaming by a small example...
Renaming columns using mapping:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) # Creating a df with column name A and B
df.rename({"A": "new_a", "B": "new_b"}, axis='columns', inplace =True) # Renaming column A with 'new_a' and B with 'new_b'
Output:
new_a new_b
0 1 4
1 2 5
2 3 6
Renaming index/Row_Name using mapping:
df.rename({0: "x", 1: "y", 2: "z"}, axis='index', inplace =True) # Row name are getting replaced by 'x', 'y', and 'z'.
Output:
new_a new_b
x 1 4
y 2 5
z 3 6
Suppose your dataset name is df, and df has.
df = ['$a', '$b', '$c', '$d', '$e']`
So, to rename these, we would simply do.
df.columns = ['a','b','c','d','e']
Let's say this is your dataframe.
You can rename the columns using two methods.
Using dataframe.columns=[#list]
df.columns=['a','b','c','d','e']
The limitation of this method is that if one column has to be changed, full column list has to be passed. Also, this method is not applicable on index labels.
For example, if you passed this:
df.columns = ['a','b','c','d']
This will throw an error. Length mismatch: Expected axis has 5 elements, new values have 4 elements.
Another method is the Pandas rename() method which is used to rename any index, column or row
df = df.rename(columns={'$a':'a'})
Similarly, you can change any rows or columns.
If you've got the dataframe, df.columns dumps everything into a list you can manipulate and then reassign into your dataframe as the names of columns...
columns = df.columns
columns = [row.replace("$", "") for row in columns]
df.rename(columns=dict(zip(columns, things)), inplace=True)
df.head() # To validate the output
Best way? I don't know. A way - yes.
A better way of evaluating all the main techniques put forward in the answers to the question is below using cProfile to gage memory and execution time. #kadee, #kaitlyn, and #eumiro had the functions with the fastest execution times - though these functions are so fast we're comparing the rounding of 0.000 and 0.001 seconds for all the answers. Moral: my answer above likely isn't the 'best' way.
import pandas as pd
import cProfile, pstats, re
old_names = ['$a', '$b', '$c', '$d', '$e']
new_names = ['a', 'b', 'c', 'd', 'e']
col_dict = {'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}
df = pd.DataFrame({'$a':[1, 2], '$b': [10, 20], '$c': ['bleep', 'blorp'], '$d': [1, 2], '$e': ['texa$', '']})
df.head()
def eumiro(df, nn):
df.columns = nn
# This direct renaming approach is duplicated in methodology in several other answers:
return df
def lexual1(df):
return df.rename(columns=col_dict)
def lexual2(df, col_dict):
return df.rename(columns=col_dict, inplace=True)
def Panda_Master_Hayden(df):
return df.rename(columns=lambda x: x[1:], inplace=True)
def paulo1(df):
return df.rename(columns=lambda x: x.replace('$', ''))
def paulo2(df):
return df.rename(columns=lambda x: x.replace('$', ''), inplace=True)
def migloo(df, on, nn):
return df.rename(columns=dict(zip(on, nn)), inplace=True)
def kadee(df):
return df.columns.str.replace('$', '')
def awo(df):
columns = df.columns
columns = [row.replace("$", "") for row in columns]
return df.rename(columns=dict(zip(columns, '')), inplace=True)
def kaitlyn(df):
df.columns = [col.strip('$') for col in df.columns]
return df
print 'eumiro'
cProfile.run('eumiro(df, new_names)')
print 'lexual1'
cProfile.run('lexual1(df)')
print 'lexual2'
cProfile.run('lexual2(df, col_dict)')
print 'andy hayden'
cProfile.run('Panda_Master_Hayden(df)')
print 'paulo1'
cProfile.run('paulo1(df)')
print 'paulo2'
cProfile.run('paulo2(df)')
print 'migloo'
cProfile.run('migloo(df, old_names, new_names)')
print 'kadee'
cProfile.run('kadee(df)')
print 'awo'
cProfile.run('awo(df)')
print 'kaitlyn'
cProfile.run('kaitlyn(df)')
df = pd.DataFrame({'$a': [1], '$b': [1], '$c': [1], '$d': [1], '$e': [1]})
If your new list of columns is in the same order as the existing columns, the assignment is simple:
new_cols = ['a', 'b', 'c', 'd', 'e']
df.columns = new_cols
>>> df
a b c d e
0 1 1 1 1 1
If you had a dictionary keyed on old column names to new column names, you could do the following:
d = {'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}
df.columns = df.columns.map(lambda col: d[col]) # Or `.map(d.get)` as pointed out by #PiRSquared.
>>> df
a b c d e
0 1 1 1 1 1
If you don't have a list or dictionary mapping, you could strip the leading $ symbol via a list comprehension:
df.columns = [col[1:] if col[0] == '$' else col for col in df]
df.rename(index=str, columns={'A':'a', 'B':'b'})
pandas.DataFrame.rename
If you already have a list for the new column names, you can try this:
new_cols = ['a', 'b', 'c', 'd', 'e']
new_names_map = {df.columns[i]:new_cols[i] for i in range(len(new_cols))}
df.rename(new_names_map, axis=1, inplace=True)
Another way we could replace the original column labels is by stripping the unwanted characters (here '$') from the original column labels.
This could have been done by running a for loop over df.columns and appending the stripped columns to df.columns.
Instead, we can do this neatly in a single statement by using list comprehension like below:
df.columns = [col.strip('$') for col in df.columns]
(strip method in Python strips the given character from beginning and end of the string.)
It is real simple. Just use:
df.columns = ['Name1', 'Name2', 'Name3'...]
And it will assign the column names by the order you put them in.
# This way it will work
import pandas as pd
# Define a dictionary
rankings = {'test': ['a'],
'odi': ['E'],
't20': ['P']}
# Convert the dictionary into DataFrame
rankings_pd = pd.DataFrame(rankings)
# Before renaming the columns
print(rankings_pd)
rankings_pd.rename(columns = {'test':'TEST'}, inplace = True)
You could use str.slice for that:
df.columns = df.columns.str.slice(1)
Another option is to rename using a regular expression:
import pandas as pd
import re
df = pd.DataFrame({'$a':[1,2], '$b':[3,4], '$c':[5,6]})
df = df.rename(columns=lambda x: re.sub('\$','',x))
>>> df
a b c
0 1 3 5
1 2 4 6
My method is generic wherein you can add additional delimiters by comma separating delimiters= variable and future-proof it.
Working Code:
import pandas as pd
import re
df = pd.DataFrame({'$a':[1,2], '$b': [3,4],'$c':[5,6], '$d': [7,8], '$e': [9,10]})
delimiters = '$'
matchPattern = '|'.join(map(re.escape, delimiters))
df.columns = [re.split(matchPattern, i)[1] for i in df.columns ]
Output:
>>> df
$a $b $c $d $e
0 1 3 5 7 9
1 2 4 6 8 10
>>> df
a b c d e
0 1 3 5 7 9
1 2 4 6 8 10
Note that the approaches in previous answers do not work for a MultiIndex. For a MultiIndex, you need to do something like the following:
>>> df = pd.DataFrame({('$a','$x'):[1,2], ('$b','$y'): [3,4], ('e','f'):[5,6]})
>>> df
$a $b e
$x $y f
0 1 3 5
1 2 4 6
>>> rename = {('$a','$x'):('a','x'), ('$b','$y'):('b','y')}
>>> df.columns = pandas.MultiIndex.from_tuples([
rename.get(item, item) for item in df.columns.tolist()])
>>> df
a b e
x y f
0 1 3 5
1 2 4 6
If you have to deal with loads of columns named by the providing system out of your control, I came up with the following approach that is a combination of a general approach and specific replacements in one go.
First create a dictionary from the dataframe column names using regular expressions in order to throw away certain appendixes of column names and then add specific replacements to the dictionary to name core columns as expected later in the receiving database.
This is then applied to the dataframe in one go.
dict = dict(zip(df.columns, df.columns.str.replace('(:S$|:C1$|:L$|:D$|\.Serial:L$)', '')))
dict['brand_timeseries:C1'] = 'BTS'
dict['respid:L'] = 'RespID'
dict['country:C1'] = 'CountryID'
dict['pim1:D'] = 'pim_actual'
df.rename(columns=dict, inplace=True)
If you just want to remove the '$' sign then use the below code
df.columns = pd.Series(df.columns.str.replace("$", ""))
In addition to the solution already provided, you can replace all the columns while you are reading the file. We can use names and header=0 to do that.
First, we create a list of the names that we like to use as our column names:
import pandas as pd
ufo_cols = ['city', 'color reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo = pd.read_csv('link to the file you are using', names = ufo_cols, header = 0)
In this case, all the column names will be replaced with the names you have in your list.
Here's a nifty little function I like to use to cut down on typing:
def rename(data, oldnames, newname):
if type(oldnames) == str: # Input can be a string or list of strings
oldnames = [oldnames] # When renaming multiple columns
newname = [newname] # Make sure you pass the corresponding list of new names
i = 0
for name in oldnames:
oldvar = [c for c in data.columns if name in c]
if len(oldvar) == 0:
raise ValueError("Sorry, couldn't find that column in the dataset")
if len(oldvar) > 1: # Doesn't have to be an exact match
print("Found multiple columns that matched " + str(name) + ": ")
for c in oldvar:
print(str(oldvar.index(c)) + ": " + str(c))
ind = input('Please enter the index of the column you would like to rename: ')
oldvar = oldvar[int(ind)]
if len(oldvar) == 1:
oldvar = oldvar[0]
data = data.rename(columns = {oldvar : newname[i]})
i += 1
return data
Here is an example of how it works:
In [2]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns = ['col1', 'col2', 'omg', 'idk'])
# First list = existing variables
# Second list = new names for those variables
In [3]: df = rename(df, ['col', 'omg'],['first', 'ohmy'])
Found multiple columns that matched col:
0: col1
1: col2
Please enter the index of the column you would like to rename: 0
In [4]: df.columns
Out[5]: Index(['first', 'col2', 'ohmy', 'idk'], dtype='object')

Rename default column name in DataFrame [duplicate]

I want to change the column labels of a Pandas DataFrame from
['$a', '$b', '$c', '$d', '$e']
to
['a', 'b', 'c', 'd', 'e']
Rename Specific Columns
Use the df.rename() function and refer the columns to be renamed. Not all the columns have to be renamed:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# Or rename the existing DataFrame (rather than creating a copy)
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
Minimal Code Example
df = pd.DataFrame('x', index=range(3), columns=list('abcde'))
df
a b c d e
0 x x x x x
1 x x x x x
2 x x x x x
The following methods all work and produce the same output:
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1) # new method
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis='columns')
df2 = df.rename(columns={'a': 'X', 'b': 'Y'}) # old method
df2
X Y c d e
0 x x x x x
1 x x x x x
2 x x x x x
Remember to assign the result back, as the modification is not-inplace. Alternatively, specify inplace=True:
df.rename({'a': 'X', 'b': 'Y'}, axis=1, inplace=True)
df
X Y c d e
0 x x x x x
1 x x x x x
2 x x x x x
From v0.25, you can also specify errors='raise' to raise errors if an invalid column-to-rename is specified. See v0.25 rename() docs.
Reassign Column Headers
Use df.set_axis() with axis=1 and inplace=False (to return a copy).
df2 = df.set_axis(['V', 'W', 'X', 'Y', 'Z'], axis=1, inplace=False)
df2
V W X Y Z
0 x x x x x
1 x x x x x
2 x x x x x
This returns a copy, but you can modify the DataFrame in-place by setting inplace=True (this is the default behaviour for versions <=0.24 but is likely to change in the future).
You can also assign headers directly:
df.columns = ['V', 'W', 'X', 'Y', 'Z']
df
V W X Y Z
0 x x x x x
1 x x x x x
2 x x x x x
Just assign it to the .columns attribute:
>>> df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
>>> df
$a $b
0 1 10
1 2 20
>>> df.columns = ['a', 'b']
>>> df
a b
0 1 10
1 2 20
The rename method can take a function, for example:
In [11]: df.columns
Out[11]: Index([u'$a', u'$b', u'$c', u'$d', u'$e'], dtype=object)
In [12]: df.rename(columns=lambda x: x[1:], inplace=True)
In [13]: df.columns
Out[13]: Index([u'a', u'b', u'c', u'd', u'e'], dtype=object)
As documented in Working with text data:
df.columns = df.columns.str.replace('$', '')
Pandas 0.21+ Answer
There have been some significant updates to column renaming in version 0.21.
The rename method has added the axis parameter which may be set to columns or 1. This update makes this method match the rest of the pandas API. It still has the index and columns parameters but you are no longer forced to use them.
The set_axis method with the inplace set to False enables you to rename all the index or column labels with a list.
Examples for Pandas 0.21+
Construct sample DataFrame:
df = pd.DataFrame({'$a':[1,2], '$b': [3,4],
'$c':[5,6], '$d':[7,8],
'$e':[9,10]})
$a $b $c $d $e
0 1 3 5 7 9
1 2 4 6 8 10
Using rename with axis='columns' or axis=1
df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis='columns')
or
df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis=1)
Both result in the following:
a b c d e
0 1 3 5 7 9
1 2 4 6 8 10
It is still possible to use the old method signature:
df.rename(columns={'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'})
The rename function also accepts functions that will be applied to each column name.
df.rename(lambda x: x[1:], axis='columns')
or
df.rename(lambda x: x[1:], axis=1)
Using set_axis with a list and inplace=False
You can supply a list to the set_axis method that is equal in length to the number of columns (or index). Currently, inplace defaults to True, but inplace will be defaulted to False in future releases.
df.set_axis(['a', 'b', 'c', 'd', 'e'], axis='columns', inplace=False)
or
df.set_axis(['a', 'b', 'c', 'd', 'e'], axis=1, inplace=False)
Why not use df.columns = ['a', 'b', 'c', 'd', 'e']?
There is nothing wrong with assigning columns directly like this. It is a perfectly good solution.
The advantage of using set_axis is that it can be used as part of a method chain and that it returns a new copy of the DataFrame. Without it, you would have to store your intermediate steps of the chain to another variable before reassigning the columns.
# new for pandas 0.21+
df.some_method1()
.some_method2()
.set_axis()
.some_method3()
# old way
df1 = df.some_method1()
.some_method2()
df1.columns = columns
df1.some_method3()
Since you only want to remove the $ sign in all column names, you could just do:
df = df.rename(columns=lambda x: x.replace('$', ''))
OR
df.rename(columns=lambda x: x.replace('$', ''), inplace=True)
Renaming columns in Pandas is an easy task.
df.rename(columns={'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}, inplace=True)
df.columns = ['a', 'b', 'c', 'd', 'e']
It will replace the existing names with the names you provide, in the order you provide.
Use:
old_names = ['$a', '$b', '$c', '$d', '$e']
new_names = ['a', 'b', 'c', 'd', 'e']
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)
This way you can manually edit the new_names as you wish. It works great when you need to rename only a few columns to correct misspellings, accents, remove special characters, etc.
One line or Pipeline solutions
I'll focus on two things:
OP clearly states
I have the edited column names stored it in a list, but I don't know how to replace the column names.
I do not want to solve the problem of how to replace '$' or strip the first character off of each column header. OP has already done this step. Instead I want to focus on replacing the existing columns object with a new one given a list of replacement column names.
df.columns = new where new is the list of new columns names is as simple as it gets. The drawback of this approach is that it requires editing the existing dataframe's columns attribute and it isn't done inline. I'll show a few ways to perform this via pipelining without editing the existing dataframe.
Setup 1
To focus on the need to rename of replace column names with a pre-existing list, I'll create a new sample dataframe df with initial column names and unrelated new column names.
df = pd.DataFrame({'Jack': [1, 2], 'Mahesh': [3, 4], 'Xin': [5, 6]})
new = ['x098', 'y765', 'z432']
df
Jack Mahesh Xin
0 1 3 5
1 2 4 6
Solution 1
pd.DataFrame.rename
It has been said already that if you had a dictionary mapping the old column names to new column names, you could use pd.DataFrame.rename.
d = {'Jack': 'x098', 'Mahesh': 'y765', 'Xin': 'z432'}
df.rename(columns=d)
x098 y765 z432
0 1 3 5
1 2 4 6
However, you can easily create that dictionary and include it in the call to rename. The following takes advantage of the fact that when iterating over df, we iterate over each column name.
# Given just a list of new column names
df.rename(columns=dict(zip(df, new)))
x098 y765 z432
0 1 3 5
1 2 4 6
This works great if your original column names are unique. But if they are not, then this breaks down.
Setup 2
Non-unique columns
df = pd.DataFrame(
[[1, 3, 5], [2, 4, 6]],
columns=['Mahesh', 'Mahesh', 'Xin']
)
new = ['x098', 'y765', 'z432']
df
Mahesh Mahesh Xin
0 1 3 5
1 2 4 6
Solution 2
pd.concat using the keys argument
First, notice what happens when we attempt to use solution 1:
df.rename(columns=dict(zip(df, new)))
y765 y765 z432
0 1 3 5
1 2 4 6
We didn't map the new list as the column names. We ended up repeating y765. Instead, we can use the keys argument of the pd.concat function while iterating through the columns of df.
pd.concat([c for _, c in df.items()], axis=1, keys=new)
x098 y765 z432
0 1 3 5
1 2 4 6
Solution 3
Reconstruct. This should only be used if you have a single dtype for all columns. Otherwise, you'll end up with dtype object for all columns and converting them back requires more dictionary work.
Single dtype
pd.DataFrame(df.values, df.index, new)
x098 y765 z432
0 1 3 5
1 2 4 6
Mixed dtype
pd.DataFrame(df.values, df.index, new).astype(dict(zip(new, df.dtypes)))
x098 y765 z432
0 1 3 5
1 2 4 6
Solution 4
This is a gimmicky trick with transpose and set_index. pd.DataFrame.set_index allows us to set an index inline, but there is no corresponding set_columns. So we can transpose, then set_index, and transpose back. However, the same single dtype versus mixed dtype caveat from solution 3 applies here.
Single dtype
df.T.set_index(np.asarray(new)).T
x098 y765 z432
0 1 3 5
1 2 4 6
Mixed dtype
df.T.set_index(np.asarray(new)).T.astype(dict(zip(new, df.dtypes)))
x098 y765 z432
0 1 3 5
1 2 4 6
Solution 5
Use a lambda in pd.DataFrame.rename that cycles through each element of new.
In this solution, we pass a lambda that takes x but then ignores it. It also takes a y but doesn't expect it. Instead, an iterator is given as a default value and I can then use that to cycle through one at a time without regard to what the value of x is.
df.rename(columns=lambda x, y=iter(new): next(y))
x098 y765 z432
0 1 3 5
1 2 4 6
And as pointed out to me by the folks in sopython chat, if I add a * in between x and y, I can protect my y variable. Though, in this context I don't believe it needs protecting. It is still worth mentioning.
df.rename(columns=lambda x, *, y=iter(new): next(y))
x098 y765 z432
0 1 3 5
1 2 4 6
Column names vs Names of Series
I would like to explain a bit what happens behind the scenes.
Dataframes are a set of Series.
Series in turn are an extension of a numpy.array.
numpy.arrays have a property .name.
This is the name of the series. It is seldom that Pandas respects this attribute, but it lingers in places and can be used to hack some Pandas behaviors.
Naming the list of columns
A lot of answers here talks about the df.columns attribute being a list when in fact it is a Series. This means it has a .name attribute.
This is what happens if you decide to fill in the name of the columns Series:
df.columns = ['column_one', 'column_two']
df.columns.names = ['name of the list of columns']
df.index.names = ['name of the index']
name of the list of columns column_one column_two
name of the index
0 4 1
1 5 2
2 6 3
Note that the name of the index always comes one column lower.
Artefacts that linger
The .name attribute lingers on sometimes. If you set df.columns = ['one', 'two'] then the df.one.name will be 'one'.
If you set df.one.name = 'three' then df.columns will still give you ['one', 'two'], and df.one.name will give you 'three'.
BUT
pd.DataFrame(df.one) will return
three
0 1
1 2
2 3
Because Pandas reuses the .name of the already defined Series.
Multi-level column names
Pandas has ways of doing multi-layered column names. There is not so much magic involved, but I wanted to cover this in my answer too since I don't see anyone picking up on this here.
|one |
|one |two |
0 | 4 | 1 |
1 | 5 | 2 |
2 | 6 | 3 |
This is easily achievable by setting columns to lists, like this:
df.columns = [['one', 'one'], ['one', 'two']]
Many of pandas functions have an inplace parameter. When setting it True, the transformation applies directly to the dataframe that you are calling it on. For example:
df = pd.DataFrame({'$a':[1,2], '$b': [3,4]})
df.rename(columns={'$a': 'a'}, inplace=True)
df.columns
>>> Index(['a', '$b'], dtype='object')
Alternatively, there are cases where you want to preserve the original dataframe. I have often seen people fall into this case if creating the dataframe is an expensive task. For example, if creating the dataframe required querying a snowflake database. In this case, just make sure the the inplace parameter is set to False.
df = pd.DataFrame({'$a':[1,2], '$b': [3,4]})
df2 = df.rename(columns={'$a': 'a'}, inplace=False)
df.columns
>>> Index(['$a', '$b'], dtype='object')
df2.columns
>>> Index(['a', '$b'], dtype='object')
If these types of transformations are something that you do often, you could also look into a number of different pandas GUI tools. I'm the creator of one called Mito. It’s a spreadsheet that automatically converts your edits to python code.
Let's understand renaming by a small example...
Renaming columns using mapping:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) # Creating a df with column name A and B
df.rename({"A": "new_a", "B": "new_b"}, axis='columns', inplace =True) # Renaming column A with 'new_a' and B with 'new_b'
Output:
new_a new_b
0 1 4
1 2 5
2 3 6
Renaming index/Row_Name using mapping:
df.rename({0: "x", 1: "y", 2: "z"}, axis='index', inplace =True) # Row name are getting replaced by 'x', 'y', and 'z'.
Output:
new_a new_b
x 1 4
y 2 5
z 3 6
Suppose your dataset name is df, and df has.
df = ['$a', '$b', '$c', '$d', '$e']`
So, to rename these, we would simply do.
df.columns = ['a','b','c','d','e']
Let's say this is your dataframe.
You can rename the columns using two methods.
Using dataframe.columns=[#list]
df.columns=['a','b','c','d','e']
The limitation of this method is that if one column has to be changed, full column list has to be passed. Also, this method is not applicable on index labels.
For example, if you passed this:
df.columns = ['a','b','c','d']
This will throw an error. Length mismatch: Expected axis has 5 elements, new values have 4 elements.
Another method is the Pandas rename() method which is used to rename any index, column or row
df = df.rename(columns={'$a':'a'})
Similarly, you can change any rows or columns.
If you've got the dataframe, df.columns dumps everything into a list you can manipulate and then reassign into your dataframe as the names of columns...
columns = df.columns
columns = [row.replace("$", "") for row in columns]
df.rename(columns=dict(zip(columns, things)), inplace=True)
df.head() # To validate the output
Best way? I don't know. A way - yes.
A better way of evaluating all the main techniques put forward in the answers to the question is below using cProfile to gage memory and execution time. #kadee, #kaitlyn, and #eumiro had the functions with the fastest execution times - though these functions are so fast we're comparing the rounding of 0.000 and 0.001 seconds for all the answers. Moral: my answer above likely isn't the 'best' way.
import pandas as pd
import cProfile, pstats, re
old_names = ['$a', '$b', '$c', '$d', '$e']
new_names = ['a', 'b', 'c', 'd', 'e']
col_dict = {'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}
df = pd.DataFrame({'$a':[1, 2], '$b': [10, 20], '$c': ['bleep', 'blorp'], '$d': [1, 2], '$e': ['texa$', '']})
df.head()
def eumiro(df, nn):
df.columns = nn
# This direct renaming approach is duplicated in methodology in several other answers:
return df
def lexual1(df):
return df.rename(columns=col_dict)
def lexual2(df, col_dict):
return df.rename(columns=col_dict, inplace=True)
def Panda_Master_Hayden(df):
return df.rename(columns=lambda x: x[1:], inplace=True)
def paulo1(df):
return df.rename(columns=lambda x: x.replace('$', ''))
def paulo2(df):
return df.rename(columns=lambda x: x.replace('$', ''), inplace=True)
def migloo(df, on, nn):
return df.rename(columns=dict(zip(on, nn)), inplace=True)
def kadee(df):
return df.columns.str.replace('$', '')
def awo(df):
columns = df.columns
columns = [row.replace("$", "") for row in columns]
return df.rename(columns=dict(zip(columns, '')), inplace=True)
def kaitlyn(df):
df.columns = [col.strip('$') for col in df.columns]
return df
print 'eumiro'
cProfile.run('eumiro(df, new_names)')
print 'lexual1'
cProfile.run('lexual1(df)')
print 'lexual2'
cProfile.run('lexual2(df, col_dict)')
print 'andy hayden'
cProfile.run('Panda_Master_Hayden(df)')
print 'paulo1'
cProfile.run('paulo1(df)')
print 'paulo2'
cProfile.run('paulo2(df)')
print 'migloo'
cProfile.run('migloo(df, old_names, new_names)')
print 'kadee'
cProfile.run('kadee(df)')
print 'awo'
cProfile.run('awo(df)')
print 'kaitlyn'
cProfile.run('kaitlyn(df)')
df = pd.DataFrame({'$a': [1], '$b': [1], '$c': [1], '$d': [1], '$e': [1]})
If your new list of columns is in the same order as the existing columns, the assignment is simple:
new_cols = ['a', 'b', 'c', 'd', 'e']
df.columns = new_cols
>>> df
a b c d e
0 1 1 1 1 1
If you had a dictionary keyed on old column names to new column names, you could do the following:
d = {'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}
df.columns = df.columns.map(lambda col: d[col]) # Or `.map(d.get)` as pointed out by #PiRSquared.
>>> df
a b c d e
0 1 1 1 1 1
If you don't have a list or dictionary mapping, you could strip the leading $ symbol via a list comprehension:
df.columns = [col[1:] if col[0] == '$' else col for col in df]
df.rename(index=str, columns={'A':'a', 'B':'b'})
pandas.DataFrame.rename
If you already have a list for the new column names, you can try this:
new_cols = ['a', 'b', 'c', 'd', 'e']
new_names_map = {df.columns[i]:new_cols[i] for i in range(len(new_cols))}
df.rename(new_names_map, axis=1, inplace=True)
Another way we could replace the original column labels is by stripping the unwanted characters (here '$') from the original column labels.
This could have been done by running a for loop over df.columns and appending the stripped columns to df.columns.
Instead, we can do this neatly in a single statement by using list comprehension like below:
df.columns = [col.strip('$') for col in df.columns]
(strip method in Python strips the given character from beginning and end of the string.)
It is real simple. Just use:
df.columns = ['Name1', 'Name2', 'Name3'...]
And it will assign the column names by the order you put them in.
# This way it will work
import pandas as pd
# Define a dictionary
rankings = {'test': ['a'],
'odi': ['E'],
't20': ['P']}
# Convert the dictionary into DataFrame
rankings_pd = pd.DataFrame(rankings)
# Before renaming the columns
print(rankings_pd)
rankings_pd.rename(columns = {'test':'TEST'}, inplace = True)
You could use str.slice for that:
df.columns = df.columns.str.slice(1)
Another option is to rename using a regular expression:
import pandas as pd
import re
df = pd.DataFrame({'$a':[1,2], '$b':[3,4], '$c':[5,6]})
df = df.rename(columns=lambda x: re.sub('\$','',x))
>>> df
a b c
0 1 3 5
1 2 4 6
My method is generic wherein you can add additional delimiters by comma separating delimiters= variable and future-proof it.
Working Code:
import pandas as pd
import re
df = pd.DataFrame({'$a':[1,2], '$b': [3,4],'$c':[5,6], '$d': [7,8], '$e': [9,10]})
delimiters = '$'
matchPattern = '|'.join(map(re.escape, delimiters))
df.columns = [re.split(matchPattern, i)[1] for i in df.columns ]
Output:
>>> df
$a $b $c $d $e
0 1 3 5 7 9
1 2 4 6 8 10
>>> df
a b c d e
0 1 3 5 7 9
1 2 4 6 8 10
Note that the approaches in previous answers do not work for a MultiIndex. For a MultiIndex, you need to do something like the following:
>>> df = pd.DataFrame({('$a','$x'):[1,2], ('$b','$y'): [3,4], ('e','f'):[5,6]})
>>> df
$a $b e
$x $y f
0 1 3 5
1 2 4 6
>>> rename = {('$a','$x'):('a','x'), ('$b','$y'):('b','y')}
>>> df.columns = pandas.MultiIndex.from_tuples([
rename.get(item, item) for item in df.columns.tolist()])
>>> df
a b e
x y f
0 1 3 5
1 2 4 6
If you have to deal with loads of columns named by the providing system out of your control, I came up with the following approach that is a combination of a general approach and specific replacements in one go.
First create a dictionary from the dataframe column names using regular expressions in order to throw away certain appendixes of column names and then add specific replacements to the dictionary to name core columns as expected later in the receiving database.
This is then applied to the dataframe in one go.
dict = dict(zip(df.columns, df.columns.str.replace('(:S$|:C1$|:L$|:D$|\.Serial:L$)', '')))
dict['brand_timeseries:C1'] = 'BTS'
dict['respid:L'] = 'RespID'
dict['country:C1'] = 'CountryID'
dict['pim1:D'] = 'pim_actual'
df.rename(columns=dict, inplace=True)
If you just want to remove the '$' sign then use the below code
df.columns = pd.Series(df.columns.str.replace("$", ""))
In addition to the solution already provided, you can replace all the columns while you are reading the file. We can use names and header=0 to do that.
First, we create a list of the names that we like to use as our column names:
import pandas as pd
ufo_cols = ['city', 'color reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo = pd.read_csv('link to the file you are using', names = ufo_cols, header = 0)
In this case, all the column names will be replaced with the names you have in your list.
Here's a nifty little function I like to use to cut down on typing:
def rename(data, oldnames, newname):
if type(oldnames) == str: # Input can be a string or list of strings
oldnames = [oldnames] # When renaming multiple columns
newname = [newname] # Make sure you pass the corresponding list of new names
i = 0
for name in oldnames:
oldvar = [c for c in data.columns if name in c]
if len(oldvar) == 0:
raise ValueError("Sorry, couldn't find that column in the dataset")
if len(oldvar) > 1: # Doesn't have to be an exact match
print("Found multiple columns that matched " + str(name) + ": ")
for c in oldvar:
print(str(oldvar.index(c)) + ": " + str(c))
ind = input('Please enter the index of the column you would like to rename: ')
oldvar = oldvar[int(ind)]
if len(oldvar) == 1:
oldvar = oldvar[0]
data = data.rename(columns = {oldvar : newname[i]})
i += 1
return data
Here is an example of how it works:
In [2]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns = ['col1', 'col2', 'omg', 'idk'])
# First list = existing variables
# Second list = new names for those variables
In [3]: df = rename(df, ['col', 'omg'],['first', 'ohmy'])
Found multiple columns that matched col:
0: col1
1: col2
Please enter the index of the column you would like to rename: 0
In [4]: df.columns
Out[5]: Index(['first', 'col2', 'ohmy', 'idk'], dtype='object')

dataframe slicing with loc [duplicate]

How do I select columns a and b from df, and save them into a new dataframe df1?
index a b c
1 2 3 4
2 3 4 5
Unsuccessful attempt:
df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This will happen with the second way of indexing, so you can modify it with the .copy() method to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the look out for this.
df1 = df.iloc[0, 0:2].copy() # To avoid the case where changing df1 also changes df
To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices, you can use iloc along with get_loc function of columns method of dataframe object to obtain column indices.
{df.columns.get_loc(c): c for idx, c in enumerate(df.columns)}
Now you can use this dictionary to access columns through names and using iloc.
As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:
df.loc[:, 'C':'E']
is equivalent to
df[['C', 'D', 'E']] # or df.loc[:, ['C', 'D', 'E']]
and returns columns C through E.
A demo on a randomly generated DataFrame:
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
columns=list('ABCDEF'),
index=['R{}'.format(i) for i in range(100)])
df.head()
Out:
A B C D E F
R0 99 78 61 16 73 8
R1 62 27 30 80 7 76
R2 15 53 80 27 44 77
R3 75 65 47 30 84 86
R4 18 9 41 62 1 82
To get the columns from C to E (note that unlike integer slicing, E is included in the columns):
df.loc[:, 'C':'E']
Out:
C D E
R0 61 16 73
R1 30 80 7
R2 80 27 44
R3 47 30 84
R4 41 62 1
R5 5 58 0
...
The same works for selecting rows based on labels. Get the rows R6 to R10 from those columns:
df.loc['R6':'R10', 'C':'E']
Out:
C D E
R6 51 27 31
R7 83 19 18
R8 11 67 65
R9 78 27 29
R10 7 16 94
.loc also accepts a Boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) - True if the column name is in the list ['B', 'C', 'D']; False, otherwise.
df.loc[:, df.columns.isin(list('BCD'))]
Out:
B C D
R0 78 61 16
R1 27 30 80
R2 53 80 27
R3 65 47 30
R4 9 41 62
R5 78 5 58
...
Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the
third and fourth columns. If you don't know their names when your script runs, you can do this
newdf = df[df.columns[2:4]] # Remember, Python is zero-offset! The "third" entry is at slot two.
As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural, because it uses the vanilla one-dimensional Python list indexing/slicing syntax.
Warning: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, an Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of Series optimized for lookup of its elements' values. For df.index it's for looking up rows by their label. That df.columns attribute is also a pd.Index array, for looking up columns by their labels.
In the latest version of Pandas there is an easy way to do exactly this. Column names (which are strings) can be sliced in whatever manner you like.
columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)
In [39]: df
Out[39]:
index a b c
0 1 2 3 4
1 2 3 4 5
In [40]: df1 = df[['b', 'c']]
In [41]: df1
Out[41]:
b c
0 3 4
1 4 5
With Pandas,
wit column names
dataframe[['column1','column2']]
to select by iloc and specific columns with index number:
dataframe.iloc[:,[1,2]]
with loc column names can be used like
dataframe.loc[:,['column1','column2']]
You can use the pandas.DataFrame.filter method to either filter or reorder columns like this:
df1 = df.filter(['a', 'b'])
This is also very useful when you are chaining methods.
You could provide a list of columns to be dropped and return back the DataFrame with only the columns needed using the drop() function on a Pandas DataFrame.
Just saying
colsToDrop = ['a']
df.drop(colsToDrop, axis=1)
would return a DataFrame with just the columns b and c.
The drop method is documented here.
I found this method to be very useful:
# iloc[row slicing, column slicing]
surveys_df.iloc [0:3, 1:4]
More details can be found here.
Starting with 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:
df1 = df.reindex(columns=['b','c'])
In prior versions, using .loc[list-of-labels] would work as long as at least one of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().
Read more at Indexing and Selecting Data.
You can use Pandas.
I create the DataFrame:
import pandas as pd
df = pd.DataFrame([[1, 2,5], [5,4, 5], [7,7, 8], [7,6,9]],
index=['Jane', 'Peter','Alex','Ann'],
columns=['Test_1', 'Test_2', 'Test_3'])
The DataFrame:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
To select one or more columns by name:
df[['Test_1', 'Test_3']]
Test_1 Test_3
Jane 1 5
Peter 5 5
Alex 7 8
Ann 7 9
You can also use:
df.Test_2
And you get column Test_2:
Jane 2
Peter 4
Alex 7
Ann 6
You can also select columns and rows from these rows using .loc(). This is called "slicing". Notice that I take from column Test_1 to Test_3:
df.loc[:, 'Test_1':'Test_3']
The "Slice" is:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
And if you just want Peter and Ann from columns Test_1 and Test_3:
df.loc[['Peter', 'Ann'], ['Test_1', 'Test_3']]
You get:
Test_1 Test_3
Peter 5 5
Ann 7 9
If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can imagine.
Or you can use df.ix[0,'b'] - mixed usage of index and label.
Note: Since v0.20, ix has been deprecated in favour of loc / iloc.
df[['a', 'b']] # Select all rows of 'a' and 'b'column
df.loc[0:10, ['a', 'b']] # Index 0 to 10 select column 'a' and 'b'
df.loc[0:10, 'a':'b'] # Index 0 to 10 select column 'a' to 'b'
df.iloc[0:10, 3:5] # Index 0 to 10 and column 3 to 5
df.iloc[3, 3:5] # Index 3 of column 3 to 5
Try to use pandas.DataFrame.get (see the documentation):
import pandas as pd
import numpy as np
dates = pd.date_range('20200102', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.get(['A', 'C'])
One different and easy approach: iterating rows
Using iterows
df1 = pd.DataFrame() # Creating an empty dataframe
for index,i in df.iterrows():
df1.loc[index, 'A'] = df.loc[index, 'A']
df1.loc[index, 'B'] = df.loc[index, 'B']
df1.head()
The different approaches discussed in the previous answers are based on the assumption that either the user knows column indices to drop or subset on, or the user wishes to subset a dataframe using a range of columns (for instance between 'C' : 'E').
pandas.DataFrame.drop() is certainly an option to subset data based on a list of columns defined by user (though you have to be cautious that you always use copy of dataframe and inplace parameters should not be set to True!!)
Another option is to use pandas.columns.difference(), which does a set difference on column names, and returns an index type of array containing desired columns. Following is the solution:
df = pd.DataFrame([[2,3,4], [3,4,5]], columns=['a','b','c'], index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)
The output would be:
b c
1 3 4
2 4 5
You can also use df.pop():
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
... ('parrot', 'bird', 24.0),
... ('lion', 'mammal', 80.5),
... ('monkey', 'mammal', np.nan)],
... columns=('name', 'class', 'max_speed'))
>>> df
name class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal
>>> df.pop('class')
0 bird
1 bird
2 mammal
3 mammal
Name: class, dtype: object
>>> df
name max_speed
0 falcon 389.0
1 parrot 24.0
2 lion 80.5
3 monkey NaN
Please use df.pop(c).
I've seen several answers on that, but one remained unclear to me. How would you select those columns of interest?
The answer to that is that if you have them gathered in a list, you can just reference the columns using the list.
Example
print(extracted_features.shape)
print(extracted_features)
(63,)
['f000004' 'f000005' 'f000006' 'f000014' 'f000039' 'f000040' 'f000043'
'f000047' 'f000048' 'f000049' 'f000050' 'f000051' 'f000052' 'f000053'
'f000054' 'f000055' 'f000056' 'f000057' 'f000058' 'f000059' 'f000060'
'f000061' 'f000062' 'f000063' 'f000064' 'f000065' 'f000066' 'f000067'
'f000068' 'f000069' 'f000070' 'f000071' 'f000072' 'f000073' 'f000074'
'f000075' 'f000076' 'f000077' 'f000078' 'f000079' 'f000080' 'f000081'
'f000082' 'f000083' 'f000084' 'f000085' 'f000086' 'f000087' 'f000088'
'f000089' 'f000090' 'f000091' 'f000092' 'f000093' 'f000094' 'f000095'
'f000096' 'f000097' 'f000098' 'f000099' 'f000100' 'f000101' 'f000103']
I have the following list/NumPy array extracted_features, specifying 63 columns. The original dataset has 103 columns, and I would like to extract exactly those, then I would use
dataset[extracted_features]
And you will end up with this
This something you would use quite often in machine learning (more specifically, in feature selection). I would like to discuss other ways too, but I think that has already been covered by other Stack Overflower users.
To exclude some columns you can drop them in the column index. For example:
A B C D
0 1 10 100 1000
1 2 20 200 2000
Select all except two:
df[df.columns.drop(['B', 'D'])]
Output:
A C
0 1 100
1 2 200
You can also use the method truncate to select middle columns:
df.truncate(before='B', after='C', axis=1)
Output:
B C
0 10 100
1 20 200
To select multiple columns, extract and view them thereafter: df is the previously named data frame. Then create a new data frame df1, and select the columns A to D which you want to extract and view.
df1 = pd.DataFrame(data_frame, columns=['Column A', 'Column B', 'Column C', 'Column D'])
df1
All required columns will show up!
def get_slize(dataframe, start_row, end_row, start_col, end_col):
assert len(dataframe) > end_row and start_row >= 0
assert len(dataframe.columns) > end_col and start_col >= 0
list_of_indexes = list(dataframe.columns)[start_col:end_col]
ans = dataframe.iloc[start_row:end_row][list_of_indexes]
return ans
Just use this function
I think this is the easiest way to reach your goal.
import pandas as pd
cols = ['a', 'b']
df1 = pd.DataFrame(df, columns=cols)
df1 = df.iloc[:, 0:2]

Match fields within one data frame with column names in another data frame

I have two data frames. In the last column ("Bill") in the first data frame, I want to apply a function (fixed price + Quantity*price/qty). In order to apply the function, R should match the values in the first column of df1 to the column names of df2.
I have solved the problem by creating a function and several ifelse statements, but I would want to use a statement that automatically matches the values in df1 with the column names in df2. The data set that I have contains more than 2 million rows and I would need to apply the same rationale into building other similar functions. It would be nice to use something that does not require a loop or takes too long to process.
### Set up your data frames like so ###
Code <- c("a1", "a2", "c3", "a1")
Name <- c("Dan", "David", "Anna", "Lisa")
Quantity <- c(30, 12, 10, 10)
df1 <- as.data.frame(cbind("Code" = Code, "Name" = Name, "Quantity" = Quantity), stringsAsFactors = F)
df1$Quantity <- as.numeric(df1$Quantity)
fixed_price <- c(12, 5, 23)
price_per_qty <- c(1, 4, 7)
df2 <- as.data.frame(rbind("fixed_price" = fixed_price, "price_per_qty" = price_per_qty))
colnames(df2) <- c("a1", "a2", "c3")
### Combine dataframe 1 and 2 into a single dataframe ###
# Code below pulls individual columns from df2 based on the
# index provided by the "Code" column in df1, transposes them
# so they'll line up with df1, then column binds them to df1
df3 <- cbind(df1, t(df2[,df1$Code]))
# the bill is calculated simply enough
bill <- df3[4] + df3[3] * df3[5]
colnames(bill) <- "bill"
# Finally, output the results as you wanted
cbind(df3, bill)
So I have a fairly similar answer to graggsd, but here is what worked for me. I merged two data frames based on the key word "Code" and then combined it into the big data frame into combined_data. I then used a function which I think is what you defined above and then passed the respective data frames through it.
df2 <- t(data.frame(c(12,1),c(5,4),c(23,7)))
rownames(df2) <- c("a1","a2","c3")
test <- rownames(df2)
df2 <- cbind.data.frame(df2,test)
colnames(df2) <- c("fixed price","price/qty","Code")
df1 <- data.frame(c("a1","a2","c3","a1"), c("Dan","David","Anna","Lisa"),c(30,12,10,10))
colnames(df1) <- c("Code","Name","Quantity")
combined_data <- dplyr::inner_join(df1,df2, by = "Code")
f1 <- function(x,y,z){
x + y * z
}
bill <- f1(combined_data[,4],combined_data[,3],combined_data[,5])
finalDataSet <- cbind.data.frame(combined_data,bill)
The final data set:
Code Name Quantity fixed price price/qty bill
1 a1 Dan 30 12 1 42
2 a2 David 12 5 4 53
3 c3 Anna 10 23 7 93
4 a1 Lisa 10 12 1 22