How to convert rows into columns by group? - dataframe

I'd like to do association analysis using the apriori algorithm.
To do that, I have to build a dataset.
My data looks like this:
data.frame("order_number"=c("100145", "100155", "100155", "100155",
"500002", "500002", "500002", "500007"),
"order_item"=c("27684535","15755576",
"1357954","124776249","12478324","15755576","13577","27684535"))
order_number order_item
1 100145 27684535
2 100155 15755576
3 100155 1357954
4 100155 124776249
5 500002 12478324
6 500002 15755576
7 500002 13577
8 500007 27684535
and I want to transform the data like this:
data.frame("order_number"=c("100145","100155","500002","500007"),
"col1"=c("27684535","15755576","12478324","27684535"),
"col2"=c(NA,"1357954","15755576",NA),
"col3"=c(NA,"124776249","13577",NA))
order_number col1 col2 col3
1 100145 27684535 <NA> <NA>
2 100155 15755576 1357954 124776249
3 500002 12478324 15755576 13577
4 500007 27684535 <NA> <NA>
Thank you for your help.

This is a job for pivot_wider (or another function that changes column layout). The first step is to create a within-group row id that marks each item as 1, 2 or 3; then pivot that id into the column names to get the data frame you want:
df <- data.frame("order_number"=c("100145", "100155", "100155", "100155",
"500002", "500002", "500002", "500007"),
"order_item"=c("27684535","15755576",
"1357954","124776249","12478324","15755576","13577","27684535"))
library(tidyr)
library(dplyr)
df |>
group_by(order_number) |>
mutate(rank = row_number()) |>
pivot_wider(names_from = rank, values_from = order_item,
names_prefix = "col")
#> # A tibble: 4 × 4
#> # Groups: order_number [4]
#> order_number col1 col2 col3
#> <chr> <chr> <chr> <chr>
#> 1 100145 27684535 <NA> <NA>
#> 2 100155 15755576 1357954 124776249
#> 3 500002 12478324 15755576 13577
#> 4 500007 27684535 <NA> <NA>
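For readers who work in pandas rather than R, the same two steps (number the rows within each group, then pivot on that number) can be sketched roughly as follows. This is only an illustration, assuming the data sits in a pandas DataFrame with the same two columns; the names `item_no` and `wide` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "order_number": ["100145", "100155", "100155", "100155",
                     "500002", "500002", "500002", "500007"],
    "order_item": ["27684535", "15755576", "1357954", "124776249",
                   "12478324", "15755576", "13577", "27684535"],
})

# Step 1: number the items within each order (1, 2, 3, ...).
item_no = df.groupby("order_number").cumcount() + 1

# Step 2: spread the items into columns named col1, col2, ...
wide = (
    df.assign(item_no=item_no)
      .pivot(index="order_number", columns="item_no", values="order_item")
      .add_prefix("col")
      .reset_index()
)
wide.columns.name = None  # drop the leftover "item_no" axis label

print(wide)
```

Orders with fewer items than the longest order get missing values in the trailing columns, as in the desired output above.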

Related

How to melt a dataframe with tidyverse, and create a new column

I have pet survey data from 6 households.
The households are split into levels (a,b).
I would like to melt the dataframe by animal name (id.var), household (var.name), abundance (value.name), whilst adding a new column ("level") for the levels a&b.
My dataframe looks like this:
[image: pet abundance data]
I can split it using reshape2:melt, but I don't know how to cut the a, b, from the column names and make a new column of them? Please help.
raw_data = as.data.frame(raw_data)
melt(raw_data,
id.vars = 'Animal', variable.name = 'Site', value.name = 'Abundance')
Having a go on some simulated data, pivot_longer is your best bet:
library(tidyverse)
df <- tibble(
Animal = c("dog", "cat", "fish", "horse"),
`1a` = sample(1:10, 4),
`1b` = sample(1:10, 4),
`2a` = sample(1:10, 4),
`2b` = sample(1:10, 4),
`3a` = sample(1:10, 4),
`3b` = sample(1:10, 4)
)
df |>
pivot_longer(
-Animal,
names_to = c("Site", "level"),
values_to = "Abundance",
names_pattern = "(.)(.)"
) |>
arrange(Site, level)
#> # A tibble: 24 × 4
#> Animal Site level Abundance
#> <chr> <chr> <chr> <int>
#> 1 dog 1 a 9
#> 2 cat 1 a 5
#> 3 fish 1 a 8
#> 4 horse 1 a 6
#> 5 dog 1 b 4
#> 6 cat 1 b 2
#> 7 fish 1 b 8
#> 8 horse 1 b 10
#> 9 dog 2 a 8
#> 10 cat 2 a 3
#> # … with 14 more rows
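The same reshaping can be sketched in pandas, which is used further down this page. The data values below are invented for illustration, and the intermediate name `column` is an assumption of this sketch: melt first, then split the old column names into a site part and a level part with a regular expression.

```python
import pandas as pd

# Illustrative data only; the real survey values are not shown in the question.
df = pd.DataFrame({
    "Animal": ["dog", "cat", "fish", "horse"],
    "1a": [9, 5, 8, 6], "1b": [4, 2, 8, 10],
    "2a": [8, 3, 1, 7], "2b": [2, 6, 5, 9],
    "3a": [1, 4, 7, 3], "3b": [10, 2, 6, 5],
})

# Long format: one row per (Animal, original column).
long = df.melt(id_vars="Animal", var_name="column", value_name="Abundance")

# Split e.g. "1a" into Site "1" and level "a" using named groups.
parts = long["column"].str.extract(r"^(?P<Site>\d+)(?P<level>[a-z])$")

long = pd.concat([long[["Animal"]], parts, long[["Abundance"]]], axis=1)
long = long.sort_values(["Site", "level"]).reset_index(drop=True)

print(long.head(8))
```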

Calculate mean-deviated values (subtract mean of all columns except one from this one column)

I have a dataset with the following structure:
df <- data.frame(id = 1:5,
study = c("st1","st2","st3","st4","st5"),
a_var = c(10,20,30,40,50),
b_var = c(6,5,4,3,2),
c_var = c(3,4,5,6,7),
d_var = c(80,70,60,50,40))
I would like to calculate the difference between each column that has _var in its name and the mean of all other columns containing _var in their names, like this:
mean_deviated_value <- function(data, variable) {
md_value = data[,variable] - rowMeans(data[,names(data) != variable])
md_value
}
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
df$b_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "b_var")
df$c_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "c_var")
df$d_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "d_var")
Which gives me my desired output:
id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
1 1 st1 10 6 3 80 -19.666667 -12.33333 -9.80 83.80000
2 2 st2 20 5 4 70 -6.333333 -16.91667 -10.35 70.76667
3 3 st3 30 4 5 60 7.000000 -21.50000 -10.90 57.73333
4 4 st4 40 3 6 50 20.333333 -26.08333 -11.45 44.70000
5 5 st5 50 2 7 40 33.666667 -30.66667 -12.00 31.66667
How do I do it in one go, without repeating the code, preferably with dplyr/purrr?
I tried this:
df %>%
mutate(across(contains("_var"), ~ list(md = .x - rowMeans(select(., contains("_var") & !.x)))))
And got this error:
Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
x no applicable method for 'select' applied to an object of class "c('double', 'numeric')"
We can use map_dfc with transmute to create *_md columns, and glue syntax for the names.
library(tidyverse)
nms <- names(df) %>%
str_subset('^.*_')
bind_cols(df, map_dfc(nms, ~transmute(df, '{.x}_md' := mean_deviated_value(select(df, contains("_var")), .x))))
#> id study a_var b_var c_var d_var a_var_md b_var_md c_var_md d_var_md
#> 1 1 st1 10 6 3 80 -19.666667 -25.00000 -29.00000 73.66667
#> 2 2 st2 20 5 4 70 -6.333333 -26.33333 -27.66667 60.33333
#> 3 3 st3 30 4 5 60 7.000000 -27.66667 -26.33333 47.00000
#> 4 4 st4 40 3 6 50 20.333333 -29.00000 -25.00000 33.66667
#> 5 5 st5 50 2 7 40 33.666667 -30.33333 -23.66667 20.33333
Note that if you use assignment, the first call to rowMeans computes with b_var, c_var and d_var. But the second time, contains("_var") also captures the previously created a_var_md and uses it to compute the means. I don't know if this is intended behaviour, but it is worth mentioning.
df$a_var_md <- mean_deviated_value(dplyr::select(df, contains("_var")), "a_var")
select(df, contains("_var"))
#> a_var b_var c_var d_var a_var_md
#> 1 10 6 3 80 -19.666667
#> 2 20 5 4 70 -6.333333
#> 3 30 4 5 60 7.000000
#> 4 40 3 6 50 20.333333
#> 5 50 2 7 40 33.666667
We can avoid this by replacing contains("_var") with matches("^.*_var$")
Created on 2021-12-20 by the reprex package (v2.0.1)
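As a cross-check in pandas, the mean of "all other _var columns" can be obtained from the row sum without looping over columns. This is a sketch, not the answerer's method; it assumes the variable columns are exactly those whose names end in `_var`, which also sidesteps the `_md` capture problem described above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 6),
    "study": ["st1", "st2", "st3", "st4", "st5"],
    "a_var": [10, 20, 30, 40, 50],
    "b_var": [6, 5, 4, 3, 2],
    "c_var": [3, 4, 5, 6, 7],
    "d_var": [80, 70, 60, 50, 40],
})

# Only columns ending in "_var", so derived "*_var_md" columns are never picked up.
var_cols = [c for c in df.columns if c.endswith("_var")]
vars_ = df[var_cols]

# Mean of the other columns = (row sum - own value) / (number of columns - 1).
row_sum = vars_.sum(axis=1)
others_mean = vars_.rsub(row_sum, axis=0).div(len(var_cols) - 1)

md = (vars_ - others_mean).add_suffix("_md")
out = pd.concat([df, md], axis=1)

print(out)
```

For the first row this gives 10 - (6 + 3 + 80) / 3 for a_var_md, which agrees with the map_dfc output above.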

Iteratively get the max of a data frame column, add one and repeat for all rows in r

I need to perform a database operation where I'll be adding new data to an existing table and then assigning the new rows a unique id. I'm asking about this in R so I can get the logic straight before I attempt to rewrite it in sql or pyspark.
Imagine that I've already added the new data to the existing data. Here's a simplified version of what it might look like:
library(tidyverse)
df <- tibble(id = c(1, 2, 3, NA, NA),
descriptions = c("dodgers", "yankees","giants", "orioles", "mets"))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
What I want is:
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
And I can't use arrange with rowid_to_column, because ids may have been deleted.
To get a unique id for the NA rows while not changing the existing ones, I want to get the max of the id column, add one, replace NA with that value and then move to the next row. My instinct was to do something like this: df %>% mutate(new_id = max(id, na.rm = TRUE) + 1) but that only gets the max plus one, not a new max for each row. I feel like I could do this with a mapping function, but what I've tried returns a result identical to the input dataframe:
df %>%
mutate(id = ifelse(is.na(id),
map_dbl(id, ~max(.) + 1, na.rm = FALSE),
id))
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 NA orioles
5 NA mets
Thanks in advance--now if someone can help me directly in sql, that's also a plus!
SQL option, using sqldf for demo:
sqldf::sqldf("
with cte as (
select max(id) as maxid from df
)
select cte.maxid + row_number() over () as id, df.descriptions
from df
left join cte where df.id is null
union
select * from df where id is not null")
# id descriptions
# 1 1 dodgers
# 2 2 yankees
# 3 3 giants
# 4 4 orioles
# 5 5 mets
Here is one method: add the max value to the cumulative sum of the logical vector that marks the NA values, then coalesce the result with the original 'id' column.
library(dplyr)
df <- df %>%
mutate(id = coalesce(id, max(id, na.rm = TRUE) + cumsum(is.na(id))))
Output:
df
# A tibble: 5 x 2
id descriptions
<dbl> <chr>
1 1 dodgers
2 2 yankees
3 3 giants
4 4 orioles
5 5 mets
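The same logic carries over to pandas almost word for word, which may help when porting it to pyspark later. A minimal sketch, assuming the existing ids are numeric and at least one of them is not missing:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, None, None],
    "descriptions": ["dodgers", "yankees", "giants", "orioles", "mets"],
})

# Running count of missing ids: 0, 0, 0, 1, 2
missing_so_far = df["id"].isna().cumsum()

# Candidate id for every row: current max + running count; used only where id is missing.
df["id"] = df["id"].fillna(df["id"].max() + missing_so_far)

print(df)
```

Here `fillna` plays the role of `coalesce`: existing ids are kept and only the missing ones are filled.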

Ordering Columns in custom orders after unstacking [duplicate]

I have the following DataFrame (df):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
I add more column(s) by assignment:
df['mean'] = df.mean(1)
How can I move the column mean to the front, i.e. set it as first column leaving the order of the other columns untouched?
One easy way would be to reassign the dataframe with a list of the columns, rearranged as needed.
This is what you have now:
In [6]: df
Out[6]:
0 1 2 3 4 mean
0 0.445598 0.173835 0.343415 0.682252 0.582616 0.445543
1 0.881592 0.696942 0.702232 0.696724 0.373551 0.670208
2 0.662527 0.955193 0.131016 0.609548 0.804694 0.632596
3 0.260919 0.783467 0.593433 0.033426 0.512019 0.436653
4 0.131842 0.799367 0.182828 0.683330 0.019485 0.363371
5 0.498784 0.873495 0.383811 0.699289 0.480447 0.587165
6 0.388771 0.395757 0.745237 0.628406 0.784473 0.588529
7 0.147986 0.459451 0.310961 0.706435 0.100914 0.345149
8 0.394947 0.863494 0.585030 0.565944 0.356561 0.553195
9 0.689260 0.865243 0.136481 0.386582 0.730399 0.561593
In [7]: cols = df.columns.tolist()
In [8]: cols
Out[8]: [0L, 1L, 2L, 3L, 4L, 'mean']
Rearrange cols in any way you want. This is how I moved the last element to the first position:
In [12]: cols = cols[-1:] + cols[:-1]
In [13]: cols
Out[13]: ['mean', 0L, 1L, 2L, 3L, 4L]
Then reorder the dataframe like this:
In [16]: df = df[cols] # OR df = df.loc[:, cols] (df.ix was removed in pandas 1.0)
In [17]: df
Out[17]:
mean 0 1 2 3 4
0 0.445543 0.445598 0.173835 0.343415 0.682252 0.582616
1 0.670208 0.881592 0.696942 0.702232 0.696724 0.373551
2 0.632596 0.662527 0.955193 0.131016 0.609548 0.804694
3 0.436653 0.260919 0.783467 0.593433 0.033426 0.512019
4 0.363371 0.131842 0.799367 0.182828 0.683330 0.019485
5 0.587165 0.498784 0.873495 0.383811 0.699289 0.480447
6 0.588529 0.388771 0.395757 0.745237 0.628406 0.784473
7 0.345149 0.147986 0.459451 0.310961 0.706435 0.100914
8 0.553195 0.394947 0.863494 0.585030 0.565944 0.356561
9 0.561593 0.689260 0.865243 0.136481 0.386582 0.730399
You could also do something like this:
df = df[['mean', '0', '1', '2', '3']]
You can get the list of columns with:
cols = list(df.columns.values)
The output will produce:
['0', '1', '2', '3', 'mean']
...which is then easy to rearrange manually before dropping it into the first function
Just assign the column names in the order you want them:
In [39]: df
Out[39]:
0 1 2 3 4 mean
0 0.172742 0.915661 0.043387 0.712833 0.190717 1
1 0.128186 0.424771 0.590779 0.771080 0.617472 1
2 0.125709 0.085894 0.989798 0.829491 0.155563 1
3 0.742578 0.104061 0.299708 0.616751 0.951802 1
4 0.721118 0.528156 0.421360 0.105886 0.322311 1
5 0.900878 0.082047 0.224656 0.195162 0.736652 1
6 0.897832 0.558108 0.318016 0.586563 0.507564 1
7 0.027178 0.375183 0.930248 0.921786 0.337060 1
8 0.763028 0.182905 0.931756 0.110675 0.423398 1
9 0.848996 0.310562 0.140873 0.304561 0.417808 1
In [40]: df = df[['mean', 4,3,2,1]]
Now, 'mean' column comes out in the front:
In [41]: df
Out[41]:
mean 4 3 2 1
0 1 0.190717 0.712833 0.043387 0.915661
1 1 0.617472 0.771080 0.590779 0.424771
2 1 0.155563 0.829491 0.989798 0.085894
3 1 0.951802 0.616751 0.299708 0.104061
4 1 0.322311 0.105886 0.421360 0.528156
5 1 0.736652 0.195162 0.224656 0.082047
6 1 0.507564 0.586563 0.318016 0.558108
7 1 0.337060 0.921786 0.930248 0.375183
8 1 0.423398 0.110675 0.931756 0.182905
9 1 0.417808 0.304561 0.140873 0.310562
For pandas >= 1.3 (Edited in 2022):
df.insert(0, 'mean', df.pop('mean'))
How about (for Pandas < 1.3, the original answer)
df.insert(0, 'mean', df['mean'])
https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#column-selection-addition-deletion
In your case,
df = df.reindex(columns=['mean',0,1,2,3,4])
will do exactly what you want.
In my case (general form):
df = df.reindex(columns=sorted(df.columns))
df = df.reindex(columns=(['opened'] + list([a for a in df.columns if a != 'opened']) ))
import numpy as np
import pandas as pd
df = pd.DataFrame()
column_names = ['x','y','z','mean']
for col in column_names:
    df[col] = np.random.randint(0,100, size=10000)
You can try out the following solutions :
Solution 1:
df = df[ ['mean'] + [ col for col in df.columns if col != 'mean' ] ]
Solution 2:
df = df[['mean', 'x', 'y', 'z']]
Solution 3:
col = df.pop("mean")
df.insert(0, col.name, col)  # insert works in place and returns None, so do not reassign df
Solution 4:
df.set_index(df.columns[-1], inplace=True)
df.reset_index(inplace=True)
Solution 5:
cols = list(df)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
Solution 6:
order = [3, 0, 1, 2]  # column positions in the desired order ('mean' is at position 3)
df = df[[df.columns[i] for i in order]]
Time Comparison:
Solution 1:
CPU times: user 1.05 ms, sys: 35 µs, total: 1.08 ms Wall time: 995 µs
Solution 2:
CPU times: user 933 µs, sys: 0 ns, total: 933 µs
Wall time: 800 µs
Solution 3:
CPU times: user 0 ns, sys: 1.35 ms, total: 1.35 ms
Wall time: 1.08 ms
Solution 4:
CPU times: user 1.23 ms, sys: 45 µs, total: 1.27 ms
Wall time: 986 µs
Solution 5:
CPU times: user 1.09 ms, sys: 19 µs, total: 1.11 ms
Wall time: 949 µs
Solution 6:
CPU times: user 955 µs, sys: 34 µs, total: 989 µs
Wall time: 859 µs
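Single-run `%time` figures like the ones above are noisy at the microsecond scale. A rough way to repeat the comparison outside IPython is sketched below with `timeit`; the function names are invented for this example, only three of the solutions are included, and the in-place variant works on a copy, so its figure includes the cost of copying.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({name: rng.integers(0, 100, size=10_000)
                   for name in ["x", "y", "z", "mean"]})


def by_comprehension(frame):
    # Solution 1: 'mean' first, then every other column in its current order.
    return frame[["mean"] + [c for c in frame.columns if c != "mean"]]


def by_explicit_list(frame):
    # Solution 2: spell out the full order.
    return frame[["mean", "x", "y", "z"]]


def by_pop_insert(frame):
    # Solution 3: pop and insert mutate in place, so operate on a copy here.
    frame = frame.copy()
    col = frame.pop("mean")
    frame.insert(0, col.name, col)
    return frame


candidates = {
    "comprehension": by_comprehension,
    "explicit list": by_explicit_list,
    "pop + insert (with copy)": by_pop_insert,
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(lambda f=func: f(df), number=20, repeat=3))
    timings[name] = best / 20

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

The absolute numbers depend on the machine and pandas version, so they are best read as a relative comparison.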
You need to create a new list of your columns in the desired order, then use df = df[cols] to rearrange the columns in this new order.
cols = ['mean'] + [col for col in df if col != 'mean']
df = df[cols]
You can also use a more general approach. In this example, the last column (indicated by -1) is inserted as the first column.
cols = [df.columns[-1]] + [col for col in df if col != df.columns[-1]]
df = df[cols]
You can also use this approach for reordering columns in a desired order if they are present in the DataFrame.
inserted_cols = ['a', 'b', 'c']
cols = ([col for col in inserted_cols if col in df]
+ [col for col in df if col not in inserted_cols])
df = df[cols]
Suppose you have df with columns A B C.
The most simple way is:
df = df.reindex(['B','C','A'], axis=1)
If your column names are too-long-to-type then you could specify the new order through a list of integers with the positions:
Data:
0 1 2 3 4 mean
0 0.397312 0.361846 0.719802 0.575223 0.449205 0.500678
1 0.287256 0.522337 0.992154 0.584221 0.042739 0.485741
2 0.884812 0.464172 0.149296 0.167698 0.793634 0.491923
3 0.656891 0.500179 0.046006 0.862769 0.651065 0.543382
4 0.673702 0.223489 0.438760 0.468954 0.308509 0.422683
5 0.764020 0.093050 0.100932 0.572475 0.416471 0.389390
6 0.259181 0.248186 0.626101 0.556980 0.559413 0.449972
7 0.400591 0.075461 0.096072 0.308755 0.157078 0.207592
8 0.639745 0.368987 0.340573 0.997547 0.011892 0.471749
9 0.050582 0.714160 0.168839 0.899230 0.359690 0.438500
Generic example:
new_order = [3,2,1,4,5,0]
print(df[df.columns[new_order]])
3 2 1 4 mean 0
0 0.575223 0.719802 0.361846 0.449205 0.500678 0.397312
1 0.584221 0.992154 0.522337 0.042739 0.485741 0.287256
2 0.167698 0.149296 0.464172 0.793634 0.491923 0.884812
3 0.862769 0.046006 0.500179 0.651065 0.543382 0.656891
4 0.468954 0.438760 0.223489 0.308509 0.422683 0.673702
5 0.572475 0.100932 0.093050 0.416471 0.389390 0.764020
6 0.556980 0.626101 0.248186 0.559413 0.449972 0.259181
7 0.308755 0.096072 0.075461 0.157078 0.207592 0.400591
8 0.997547 0.340573 0.368987 0.011892 0.471749 0.639745
9 0.899230 0.168839 0.714160 0.359690 0.438500 0.050582
Although it might seem like I'm just explicitly typing the column names in a different order, the fact that there's a column 'mean' should make it clear that new_order relates to actual positions and not column names.
For the specific case of OP's question:
new_order = [-1,0,1,2,3,4]
df = df[df.columns[new_order]]
print(df)
mean 0 1 2 3 4
0 0.500678 0.397312 0.361846 0.719802 0.575223 0.449205
1 0.485741 0.287256 0.522337 0.992154 0.584221 0.042739
2 0.491923 0.884812 0.464172 0.149296 0.167698 0.793634
3 0.543382 0.656891 0.500179 0.046006 0.862769 0.651065
4 0.422683 0.673702 0.223489 0.438760 0.468954 0.308509
5 0.389390 0.764020 0.093050 0.100932 0.572475 0.416471
6 0.449972 0.259181 0.248186 0.626101 0.556980 0.559413
7 0.207592 0.400591 0.075461 0.096072 0.308755 0.157078
8 0.471749 0.639745 0.368987 0.340573 0.997547 0.011892
9 0.438500 0.050582 0.714160 0.168839 0.899230 0.359690
The main problem with this approach is that calling the same code multiple times will create different results each time, so one needs to be careful :)
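To make that caveat concrete, the sketch below applies the positional reorder twice and compares it with a name-based reorder. The helper `mean_to_front` is a made-up name for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 5)))
df["mean"] = df.mean(axis=1)

new_order = [-1, 0, 1, 2, 3, 4]

# Positional reordering: each application moves whatever is last to the front.
once = df[df.columns[new_order]]
twice = once[once.columns[new_order]]
print(list(once.columns))
print(list(twice.columns))


def mean_to_front(frame):
    # Name-based reordering: always puts 'mean' first, whatever the current order.
    return frame[["mean"] + [c for c in frame.columns if c != "mean"]]


stable_once = mean_to_front(df)
stable_twice = mean_to_front(stable_once)
print(list(stable_twice.columns))
```

After the second positional pass, column 4 sits in front of 'mean', whereas the name-based version gives the same layout no matter how many times it runs.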
This question has been answered before but reindex_axis is deprecated now so I would suggest to use:
df = df.reindex(sorted(df.columns), axis=1)
For those who want to specify the order they want instead of just sorting them, here's the solution spelled out:
df = df.reindex(['the','order','you','want'], axis=1)
Now, how you want to sort the list of column names is really not a pandas question, that's a Python list manipulation question. There are many ways of doing that, and I think this answer has a very neat way of doing it.
You can reorder the dataframe columns using a list of names with:
df = df.filter(list_of_col_names)
I think this is a slightly neater solution:
df.insert(0, 'mean', df.pop("mean"))
This solution is somewhat similar to @JoeHeffer's solution, but it is a one-liner.
Here we remove the column "mean" from the dataframe and attach it to index 0 with the same column name.
I ran into a similar question myself, and just wanted to add what I settled on. I liked the reindex_axis() method for changing column order (note that it is deprecated, as mentioned above, and is not available in recent pandas releases). This worked:
df = df.reindex_axis(['mean'] + list(df.columns[:-1]), axis=1)
An alternate method based on the comment from @Jorge:
df = df.reindex(columns=['mean'] + list(df.columns[:-1]))
Although reindex_axis seems to be slightly faster in micro benchmarks than reindex, I think I prefer the latter for its directness.
This function avoids you having to list out every variable in your dataset just to order a few of them.
def order(frame, var):
    if isinstance(var, str):
        var = [var]  # let the function take a string or a list
    varlist = [w for w in frame.columns if w not in var]
    frame = frame[var + varlist]
    return frame
It takes two arguments, the first is the dataset, the second are the columns in the data set that you want to bring to the front.
So in my case I have a data set called Frame with variables A1, A2, B1, B2, Total and Date. If I want to bring Total to the front then all I have to do is:
frame = order(frame,['Total'])
If I want to bring Total and Date to the front then I do:
frame = order(frame,['Total','Date'])
EDIT:
Another useful way to use this: if you have an unfamiliar table and you're looking for variables with a particular term in them, like VAR1, VAR2,... you may execute something like:
frame = order(frame,[v for v in frame.columns if "VAR" in v])
Simply do,
df = df[['mean'] + df.columns[:-1].tolist()]
Here's a way to move one existing column that will modify the existing dataframe in place.
my_column = df.pop('column name')
df.insert(3, my_column.name, my_column) # Is in-place
Just type the column name you want to change, and set the index for the new location.
def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]
For your case, this would be like:
df = change_column_order(df, 'mean', 0)
You could do the following (borrowing parts from Aman's answer):
cols = df.columns.tolist()
cols.insert(0, cols.pop(-1))
cols
>>>['mean', 0L, 1L, 2L, 3L, 4L]
df = df[cols]
Moving any column to any position:
import pandas as pd
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
cols = df.columns.tolist()
column_to_move = "C"
new_position = 1
cols.insert(new_position, cols.pop(cols.index(column_to_move)))
df = df[cols]
I wanted to bring two columns in front from a dataframe where I do not know exactly the names of all columns, because they are generated from a pivot statement before.
So, if you are in the same situation: To bring columns in front that you know the name of and then let them follow by "all the other columns", I came up with the following general solution:
df = df.reindex(['Col1','Col2'] + list(df.columns.drop(['Col1','Col2'])), axis=1)  # reindex_axis was removed in pandas 1.0; reindex does the same job
Here is a very simple answer to this (only one line).
You can do it after you have added the 'mean' column to your df, as follows.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
df['mean'] = df.mean(1)
df
0 1 2 3 4 mean
0 0.929616 0.316376 0.183919 0.204560 0.567725 0.440439
1 0.595545 0.964515 0.653177 0.748907 0.653570 0.723143
2 0.747715 0.961307 0.008388 0.106444 0.298704 0.424512
3 0.656411 0.809813 0.872176 0.964648 0.723685 0.805347
4 0.642475 0.717454 0.467599 0.325585 0.439645 0.518551
5 0.729689 0.994015 0.676874 0.790823 0.170914 0.672463
6 0.026849 0.800370 0.903723 0.024676 0.491747 0.449473
7 0.526255 0.596366 0.051958 0.895090 0.728266 0.559587
8 0.818350 0.500223 0.810189 0.095969 0.218950 0.488736
9 0.258719 0.468106 0.459373 0.709510 0.178053 0.414752
### here you can add the line below and it should work
# Don't forget the two (( )) brackets around the column names. Otherwise, it'll give you an error.
df = df[list(('mean',0, 1, 2,3,4))]
df
mean 0 1 2 3 4
0 0.440439 0.929616 0.316376 0.183919 0.204560 0.567725
1 0.723143 0.595545 0.964515 0.653177 0.748907 0.653570
2 0.424512 0.747715 0.961307 0.008388 0.106444 0.298704
3 0.805347 0.656411 0.809813 0.872176 0.964648 0.723685
4 0.518551 0.642475 0.717454 0.467599 0.325585 0.439645
5 0.672463 0.729689 0.994015 0.676874 0.790823 0.170914
6 0.449473 0.026849 0.800370 0.903723 0.024676 0.491747
7 0.559587 0.526255 0.596366 0.051958 0.895090 0.728266
8 0.488736 0.818350 0.500223 0.810189 0.095969 0.218950
9 0.414752 0.258719 0.468106 0.459373 0.709510 0.178053
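You can use a set to collect the other columns. Be aware, though, that a set is an unordered collection of unique elements, so keeping the "order of the other columns untouched" is not guaranteed; it typically happens to work for small integer labels like these: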
other_columns = list(set(df.columns).difference(["mean"])) #[0, 1, 2, 3, 4]
Then, you can use a lambda to move a specific column to the front by:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.rand(10, 5))
In [4]: df["mean"] = df.mean(1)
In [5]: move_col_to_front = lambda df, col: df[[col]+list(set(df.columns).difference([col]))]
In [6]: move_col_to_front(df, "mean")
Out[6]:
mean 0 1 2 3 4
0 0.697253 0.600377 0.464852 0.938360 0.945293 0.537384
1 0.609213 0.703387 0.096176 0.971407 0.955666 0.319429
2 0.561261 0.791842 0.302573 0.662365 0.728368 0.321158
3 0.518720 0.710443 0.504060 0.663423 0.208756 0.506916
4 0.616316 0.665932 0.794385 0.163000 0.664265 0.793995
5 0.519757 0.585462 0.653995 0.338893 0.714782 0.305654
6 0.532584 0.434472 0.283501 0.633156 0.317520 0.994271
7 0.640571 0.732680 0.187151 0.937983 0.921097 0.423945
8 0.562447 0.790987 0.200080 0.317812 0.641340 0.862018
9 0.563092 0.811533 0.662709 0.396048 0.596528 0.348642
In [7]: move_col_to_front(df, 2)
Out[7]:
2 0 1 3 4 mean
0 0.938360 0.600377 0.464852 0.945293 0.537384 0.697253
1 0.971407 0.703387 0.096176 0.955666 0.319429 0.609213
2 0.662365 0.791842 0.302573 0.728368 0.321158 0.561261
3 0.663423 0.710443 0.504060 0.208756 0.506916 0.518720
4 0.163000 0.665932 0.794385 0.664265 0.793995 0.616316
5 0.338893 0.585462 0.653995 0.714782 0.305654 0.519757
6 0.633156 0.434472 0.283501 0.317520 0.994271 0.532584
7 0.937983 0.732680 0.187151 0.921097 0.423945 0.640571
8 0.317812 0.790987 0.200080 0.641340 0.862018 0.562447
9 0.396048 0.811533 0.662709 0.596528 0.348642 0.563092
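Because set iteration order is not tied to the column order, a variant that is guaranteed to keep the remaining columns in place may be safer, especially with string labels. A minimal sketch using `Index.drop`, with invented column names and the same function name reused for comparison:

```python
import pandas as pd

# String labels in a deliberately non-alphabetical order.
df = pd.DataFrame({"zeta": [1, 2], "alpha": [3, 4], "mid": [5, 6]})
df["mean"] = df.mean(axis=1)


def move_col_to_front(frame, col):
    # Index.drop removes one label and keeps the rest in their original order.
    return frame[[col] + list(frame.columns.drop(col))]


result = move_col_to_front(df, "mean")
print(list(result.columns))
```

Unlike the set-based lambda, this does not depend on how Python happens to order the set's elements.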
Just flipping helps often.
df[df.columns[::-1]]
Or just shuffle for a look.
import random
cols = list(df.columns)
random.shuffle(cols)
df[cols]
You can use reindex, which works on both axes:
df
# 0 1 2 3 4 mean
# 0 0.943825 0.202490 0.071908 0.452985 0.678397 0.469921
# 1 0.745569 0.103029 0.268984 0.663710 0.037813 0.363821
# 2 0.693016 0.621525 0.031589 0.956703 0.118434 0.484254
# 3 0.284922 0.527293 0.791596 0.243768 0.629102 0.495336
# 4 0.354870 0.113014 0.326395 0.656415 0.172445 0.324628
# 5 0.815584 0.532382 0.195437 0.829670 0.019001 0.478415
# 6 0.944587 0.068690 0.811771 0.006846 0.698785 0.506136
# 7 0.595077 0.437571 0.023520 0.772187 0.862554 0.538182
# 8 0.700771 0.413958 0.097996 0.355228 0.656919 0.444974
# 9 0.263138 0.906283 0.121386 0.624336 0.859904 0.555009
df.reindex(['mean', *range(5)], axis=1)
# mean 0 1 2 3 4
# 0 0.469921 0.943825 0.202490 0.071908 0.452985 0.678397
# 1 0.363821 0.745569 0.103029 0.268984 0.663710 0.037813
# 2 0.484254 0.693016 0.621525 0.031589 0.956703 0.118434
# 3 0.495336 0.284922 0.527293 0.791596 0.243768 0.629102
# 4 0.324628 0.354870 0.113014 0.326395 0.656415 0.172445
# 5 0.478415 0.815584 0.532382 0.195437 0.829670 0.019001
# 6 0.506136 0.944587 0.068690 0.811771 0.006846 0.698785
# 7 0.538182 0.595077 0.437571 0.023520 0.772187 0.862554
# 8 0.444974 0.700771 0.413958 0.097996 0.355228 0.656919
# 9 0.555009 0.263138 0.906283 0.121386 0.624336 0.859904
Hackiest method in the book
df.insert(0, "test", df["mean"])
df = df.drop(columns=["mean"]).rename(columns={"test": "mean"})
A pretty straightforward solution that worked for me is to use .reindex on df.columns:
df = df[df.columns.reindex(['mean', 0, 1, 2, 3, 4])[0]]
Here is a function to do this for any number of columns.
def mean_first(df):
    ncols = df.shape[1]  # Get the number of columns
    index = list(range(ncols))  # Create an index to reorder the columns
    index.insert(0, ncols)  # This puts the last column at the front
    return df.assign(mean=df.mean(1)).iloc[:, index]  # new df with last column (mean) first
A simple approach is using set(), in particular when you have a long list of columns and do not want to handle them manually (note that a set does not preserve the original order of the remaining columns):
cols = list(set(df.columns.tolist()) - set(['mean']))
cols.insert(0, 'mean')
df = df[cols]
How about using T?
df = df.T.reindex(['mean', 0, 1, 2, 3, 4]).T
I believe @Aman's answer is the best if you know the location of the other column.
If you don't know the location of mean, but only have its name, you cannot resort directly to cols = cols[-1:] + cols[:-1]. Following is the next-best thing I could come up with:
meanDf = pd.DataFrame(df.pop('mean'))
# now df doesn't contain "mean" anymore. Order of join will move it to left or right:
meanDf.join(df) # has mean as first column
df.join(meanDf) # has mean as last column

How to make smaller table grow and match the contents of larger table in R?

I have three columns. The first is large and contains various letters. The second is the same size but contains fewer letters with some NAs. Each letter can be found in the larger column. The third is also the same size and contains the values that go with the second column, with corresponding NAs.
My question is how do I make it so that the second and third column are re-arranged so that the second column matches the first column where possible.
I feel the answer is something to do with left join but I can't figure it out.
A bit weird to explain in words but example shows it easily.
# Original Situation
Large <- c("B", "D", "C", "A", "E")
Small <- c("D", "A", NA, NA, NA)
Number <- c(5, 12, NA, NA, NA)
data.frame(Large, Small, Number)
#> Large Small Number
#> 1 B D 5
#> 2 D A 12
#> 3 C <NA> NA
#> 4 A <NA> NA
#> 5 E <NA> NA
# I want it to finish like this:
Large <- c("B", "D", "C", "A", "E")
Small <- c(NA, "D", NA, "A", NA)
Number <- c(NA, 5, NA, 12, NA)
data.frame(Large, Small, Number)
#> Large Small Number
#> 1 B <NA> NA
#> 2 D D 5
#> 3 C <NA> NA
#> 4 A A 12
#> 5 E <NA> NA
library(dplyr)
# I find `tibble` generally better than `data.frame`
# If you want to use `data.frame` remember to specify stringsAsFactors = FALSE
df_large <- tibble(large = Large)
df_small <- tibble(small = Small, number = Number)
left_join(df_large, df_small, by = c("large" = "small"))
If you want to keep both large and small columns (I don't really see a reason to):
left_join(df_large, df_small, by = c("large" = "small")) %>%
mutate(small = if_else(!is.na(number), large, NA_character_))
# A tibble: 5 x 3
large number small
<chr> <dbl> <chr>
1 B NA NA
2 D 5 D
3 C NA NA
4 A 12 A
5 E NA NA
Here is a base way:
x <- df[1]
y <- setNames(df[c(2, 2, 3)], names(df))
merge(x, y, all.x = T)
# Large Small Number
# 1 A A 12
# 2 B <NA> NA
# 3 C <NA> NA
# 4 D D 5
# 5 E <NA> NA
The same logic with a dplyr join (here right_join()):
library(tidyverse)
df %>%
mutate(Large = Small) %>%
right_join(df[1])
# Large Small Number
# 1 B <NA> NA
# 2 D D 5
# 3 C <NA> NA
# 4 A A 12
# 5 E <NA> NA
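For comparison, the left-join idea can also be sketched in pandas. This assumes the three vectors are held in one DataFrame with the same column names as in the question; the intermediate name `small` is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Large": ["B", "D", "C", "A", "E"],
    "Small": ["D", "A", None, None, None],
    "Number": [5, 12, None, None, None],
})

# Keep only the real (Small, Number) pairs, dropping the NA padding rows.
small = df[["Small", "Number"]].dropna(subset=["Small"])

# A left join keeps every row of Large, in its original order.
out = df[["Large"]].merge(small, how="left", left_on="Large", right_on="Small")

print(out)
```

Rows of Large without a match in Small get missing values in both Small and Number, which is the layout asked for.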