From long to wide format with multiple variables in R

I have a table in long format like this:
gene tissue tpm
A liver 5
A brain 2
B ovary 10
B brain 1
C brain 15
C liver 6
I'd like to convert it into a wider format:
gene tissue1 tissue2 tpm1 tpm2
A liver brain 5 2
B ovary brain 10 1
C brain liver 15 6
I have tried with dcast and spread but I get this result:
gene liver brain ovary
A 5 2 NA
B NA 1 10
C 6 15 NA
Which is NOT what I want.
Thank you!

I am not aware of a single R function that solves this in one step, but you can use a for loop to rearrange your data frame.
The code is shown below:
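data <- data.frame(gene=c("A","A","B","B","C","C"),
                   tissue=c("liver", "brain", "ovary", "brain", "brain", "liver"),
                   tpm=c(5,2,10,1,15,6),
                   stringsAsFactors=FALSE)  # keep tissue as character (matters in R < 4.0)
gene.unique <- unique(data$gene)
n <- length(gene.unique)
# create the result vectors first; the loop fails if they do not exist
tissue1 <- character(n); tissue2 <- character(n)
tpm1 <- numeric(n); tpm2 <- numeric(n)
i <- 1
for (dummy in gene.unique) {
  genes.idx <- which(data$gene == dummy)  # rows of this gene (assumes exactly two per gene)
  tissue1[i] <- data$tissue[genes.idx[1]]
  tissue2[i] <- data$tissue[genes.idx[2]]
  tpm1[i] <- data$tpm[genes.idx[1]]
  tpm2[i] <- data$tpm[genes.idx[2]]
  i <- i+1
}
data.final <- data.frame(gene=gene.unique, tissue1, tissue2, tpm1, tpm2)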
gene tissue1 tissue2 tpm1 tpm2
1 A liver brain 5 2
2 B ovary brain 10 1
3 C brain liver 15 6
I hope it helps you.

Related

how to change specific value in one table based on coordinates listed in second table

I have two large csv files (tables).
First file look like this:
row_name a b c
X 1 2 3
Y 1 2 3
Z 1 2 3
Second file looks like this:
a X
a Z
b Y
c X
c Z
I need to look up each value in the first file at the coordinates listed in the second file and change it to NA.
The result should look like this:
row_name a b c
X NA 2 NA
Y 1 NA 3
Z NA 2 NA
I have no experience with this kind of data manipulation, and I am lost for the moment. Can you help me?
I had a look at the dplyr package in R, but it did not help. I have no idea how to do it.
With pandas, you can achieve what you want with crosstab and mask :
#pip install pandas
import pandas as pd
df1 = pd.read_csv("tmp/f1.csv", index_col="row_name")
df2 = pd.read_csv("tmp/f2.csv", header=None)
out = df1.mask(pd.crosstab(df2[1], df2[0]).astype(bool), other="NA").reset_index()
Output:
print(out)
row_name a b c
0 X NA 2 NA
1 Y 1 NA 3
2 Z NA 2 NA
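An alternative sketch, in case real missing values are preferred over the string "NA": it translates the (column, row) pairs from the second file into integer positions and blanks those cells directly. The inline CSV text and the names `col` and `row` are stand-ins assumed for illustration, not part of the original files. Note that the numeric columns become floats, because NaN is a float value.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical inline stand-ins for the two CSV files.
f1 = io.StringIO("row_name,a,b,c\nX,1,2,3\nY,1,2,3\nZ,1,2,3\n")
f2 = io.StringIO("a,X\na,Z\nb,Y\nc,X\nc,Z\n")

df1 = pd.read_csv(f1, index_col="row_name")
df2 = pd.read_csv(f2, header=None, names=["col", "row"])

# Translate labels into integer positions; -1 marks a label not found in df1.
r = df1.index.get_indexer(df2["row"])
c = df1.columns.get_indexer(df2["col"])
keep = (r >= 0) & (c >= 0)  # skip coordinates that do not exist in df1

values = df1.to_numpy(dtype=float)    # float copy so NaN can be stored
values[r[keep], c[keep]] = np.nan     # blank every listed cell in one step

out = pd.DataFrame(values, index=df1.index, columns=df1.columns)
print(out.reset_index())
```

Coordinates that do not match any row or column label are silently skipped here; raising an error instead may be the better choice if such mismatches indicate a data problem.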

Replace value in column based on value in another column

I have a dataframe with 3240 rows and 3 columns. Column Block gives the block in which the values in columns A and B appeared. There are 6 unique blocks, and they repeat in sequence from 1 to 6 throughout the whole dataframe. Within each block, the values in column A run in exact order from 1 to 10. Column B holds the values a-j (n = 10) in random order within each block, so they are never duplicated within a block.
So in each of the 6 blocks, the values in column A (1-10) repeat in exact order, while the values in column B (a-j) repeat in random order.
Df looks like this:
Block A B ID
1 1 a XY
1 2 b XY
1 3 c XY
1 4 d XY
1 5 e XY
1 6 f XY
1 7 g XY
1 8 h XY
1 9 i XY
1 10 j XY
....
6 1 d XY
...
6 6 j XY
....
1 1 g XX
1 2 a XX
Throughout the dataframe, I would like to replace all values in column B based on the corresponding value in column A, separately for each Block. The logic is to replace the values in column B according to this pattern in column A: 1=6, 2=7, 3=8, 4=9, 5=10.
Result would look like this:
Block A B ID
1 1 f XY
1 2 g XY
1 3 h XY
1 4 i XY
1 5 j XY
1 6 a XY
1 7 b XY
1 8 c XY
1 9 d XY
1 10 e XY
....
6 1 j XY
...
6 6 d XY
....
1 1 g XX
1 2 a XX
What would be an efficient way to do this?
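You want to identify the two halves of 5 within each block of 10 and swap them. This is my solution:
import numpy as np
# assumes every block is exactly 10 consecutive rows
# sort on the running 10-row index rather than on Block, because the Block labels 1-6 repeat
df['B'] = (df.assign(blk_5 = (np.arange(len(df))//5+1) % 2,  # 1 for the first half, 0 for the second
                     blk_10 = np.arange(len(df)) // 10        # running index of each 10-row block
                    )
             .sort_values(['blk_10','blk_5'])
             ['B'].values
          )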
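The same half-swap can also be sketched without sorting, by viewing column B as an array of shape (blocks, 2 halves, 5 values) and reversing the middle axis. The toy frame below is hypothetical, and the approach assumes that every block is exactly 10 consecutive rows, so that the row count is a multiple of 10.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two blocks of 10 rows each.
df = pd.DataFrame({
    "Block": np.repeat([1, 2], 10),
    "A": np.tile(np.arange(1, 11), 2),
    "B": list("abcdefghij") + list("jihgfedcba"),
})

# Shape (n_blocks, 2, 5): axis 1 indexes the first and second half of each block.
halves = df["B"].to_numpy().reshape(-1, 2, 5)

# Reversing axis 1 swaps the two halves; flatten back to one value per row.
df["B"] = halves[:, ::-1, :].reshape(-1)

print(df)
```

The reshape raises a ValueError when the number of rows is not a multiple of 10, which acts as a cheap check on the block-size assumption.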

Python3 pandas: data frame grouped by a columns(such as name), then extract a number of rows for each group

There is a data frame called df, as follows:
name id age text
a 1 1 very good, and I like him
b 2 2 I play basketball with his brother
c 3 3 I hope to get a offer
d 4 4 everything goes well, I think
a 1 1 I will visit china
b 2 2 no one can understand me, I will solve it
c 3 3 I like followers
d 4 4 maybe I will be good
a 1 1 I should work hard to finish my research
b 2 2 water is the source of earth, I agree it
c 3 3 I hope you can keep in touch with me
d 4 4 My baby is very cute, I like him
The data frame is grouped by name, and I then want to extract a number of rows per group by row index (for example, the first 2) into a new dataframe, df_new:
name id age text
a 1 1 very good, and I like him
a 1 1 I will visit china
b 2 2 I play basketball with his brother
b 2 2 no one can understand me, I will solve it
c 3 3 I hope to get a offer
c 3 3 I like followers
d 4 4 everything goes well, I think
d 4 4 maybe I will be good
df_new = (df.groupby('screen_name'))[0:2]
But this raises an error:
hash(key)
TypeError: unhashable type: 'slice'
Try using head() instead.
import pandas as pd
from io import StringIO
buff = StringIO('''
name,id,age,text
a,1,1,"very good, and I like him"
b,2,2,I play basketball with his brother
c,3,3,I hope to get a offer
d,4,4,"everything goes well, I think"
a,1,1,I will visit china
b,2,2,"no one can understand me, I will solve it"
c,3,3,I like followers
d,4,4,maybe I will be good
a,1,1,I should work hard to finish my research
b,2,2,"water is the source of earth, I agree it"
c,3,3,I hope you can keep in touch with me
d,4,4,"My baby is very cute, I like him"
''')
df = pd.read_csv(buff)
# use head() instead of [:2], then sort by name (a stable sort keeps the row order within each group)
df_new = df.groupby('name').head(2).sort_values('name', kind='mergesort')
print(df_new)
name id age text
0 a 1 1 very good, and I like him
4 a 1 1 I will visit china
1 b 2 2 I play basketball with his brother
5 b 2 2 no one can understand me, I will solve it
2 c 3 3 I hope to get a offer
6 c 3 3 I like followers
3 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
Another solution with iloc:
df_new = df.groupby('name').apply(lambda x: x.iloc[:2]).reset_index(drop=True)
print(df_new)
name id age text
0 a 1 1 very good, and I like him
1 a 1 1 I will visit china
2 b 2 2 I play basketball with his brother
3 b 2 2 no one can understand me, I will solve it
4 c 3 3 I hope to get a offer
5 c 3 3 I like followers
6 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
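A third variant, sketched here with groupby().cumcount(), numbers the rows inside each group and keeps those whose number is below the limit. The small frame and the `text` values are hypothetical; `n_rows` is an assumed name for the per-group limit.

```python
import pandas as pd

# Hypothetical small frame with three rows per name.
df = pd.DataFrame({
    "name": ["a", "b", "c", "a", "b", "c", "a", "b", "c"],
    "text": [f"t{i}" for i in range(9)],
})

n_rows = 2  # how many rows to keep per group

# cumcount() gives 0, 1, 2, ... within each name, in original row order.
position = df.groupby("name").cumcount()

# Keep the first n_rows of every group; mergesort is stable, so the
# original order inside each group survives the sort by name.
df_new = df[position < n_rows].sort_values("name", kind="mergesort")

print(df_new)
```

Because the selection is a plain boolean mask, it can be combined with other conditions, or changed to something like `position.between(1, 2)` to take rows from the middle of each group.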

How to match already-calculated means to the original data set?

I am now learning R. I feel that there is a very easy succinct answer to my problem, but I am having trouble solving it myself.
I have a large data set. One column contains various 'categories'. I aggregated these categories to get the mean for each one. So, right now, my aggregated table looks like this:
Category __ Average
A ________ a
B ________ b
C ________ c
etc...
I now want to take these averages and add them as another column to my original data.
So, I want it to look something like this:
Categories _____ Averages
B _____________ b
A______________a
B______________b
C______________c
B______________b
C______________c
In other words, I want to match each category with its corresponding mean. I have tried variations of merge(), match(), and different apply functions. The fact that my aggregated table is so much smaller than my original data is causing some problems.
Is there a specific function I can use for this simple problem? Thanks in advance.
In base R:
data <- data.frame(Category=c(rep("A",3), rep("B",4), rep("C",2)), Value=1:9)
> data
Category Value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 B 7
8 C 8
9 C 9
> avg <- sapply(split(data$Value, data$Category), mean)  # named numeric vector, one mean per category
> avg
A B C
2.0 5.5 8.5
> data$Averages <- unname(avg[as.character(data$Category)])  # look up each row's mean by category name
> data
Category Value Averages
1 A 1 2.0
2 A 2 2.0
3 A 3 2.0
4 B 4 5.5
5 B 5 5.5
6 B 6 5.5
7 B 7 5.5
8 C 8 8.5
9 C 9 8.5
For larger datasets, packages such as plyr or data.table can do this more efficiently.

Calculating Growth-Rates by applying log-differences

I am trying to transform my data.frame by calculating the log-differences of each column while controlling for the row's id. Basically, I would like to calculate the growth rates of each id's variables.
Here is a random df with an id column, a time period column p, and three variable columns:
df <- data.frame (id = c("a","a","a","c","c","d","d","d","d","d"),
p = c(1,2,3,1,2,1,2,3,4,5),
var1 = rnorm(10, 5),
var2 = rnorm(10, 5),
var3 = rnorm(10, 5)
)
df
id p var1 var2 var3
1 a 1 5.375797 4.110324 5.773473
2 a 2 4.574700 6.541862 6.116153
3 a 3 3.029428 4.931924 5.631847
4 c 1 5.375855 4.181034 5.756510
5 c 2 5.067131 6.053009 6.746442
6 d 1 3.846438 4.515268 6.920389
7 d 2 4.910792 5.525340 4.625942
8 d 3 6.410238 5.138040 7.404533
9 d 4 4.637469 3.522542 3.661668
10 d 5 5.519138 4.599829 5.566892
Now I have written a function which does exactly what I want BUT I had to take a detour which is possibly unnecessary and can be removed. However, somehow I am not able to locate
the shortcut.
Here is the function and the output for the posted data frame:
fct.logDiff <- function (df) {
df.log <- dlply (df, "id", function(x) data.frame (p = x$p, log(x[, -c(1,2)])))
list.nalog <- llply (df.log, function(x) data.frame (p = x$p, rbind(NA, sapply(x[,-1], diff))))
ldply (list.nalog, data.frame)
}
fct.logDiff(df)
id p var1 var2 var3
1 a 1 NA NA NA
2 a 2 -0.16136569 0.46472004 0.05765945
3 a 3 -0.41216720 -0.28249264 -0.08249587
4 c 1 NA NA NA
5 c 2 -0.05914281 0.36999681 0.15868378
6 d 1 NA NA NA
7 d 2 0.24428771 0.20188025 -0.40279188
8 d 3 0.26646102 -0.07267311 0.47041227
9 d 4 -0.32372771 -0.37748866 -0.70417351
10 d 5 0.17405309 0.26683625 0.41891802
The trouble comes from the added NA rows. I don't want the frame to collapse and shrink, which diff() would do automatically. I had 10 rows in my original frame and want to keep the same number of rows after the transformation, so I had to add some NAs. I took a detour by transforming the data.frame into a list, adding the NAs to each id's first row, and then transforming the list back into a data.frame. That looks tedious.
Any ideas to avoid the data.frame-list-data.frame class transformation and optimize the function?
How about this?
library(plyr)
# NA-padded diff keeps each group's length; wrap x in log() to get log-differences
nadiff <- function(x, ...) c(NA, diff(x, ...))
ddply(df, "id", colwise(nadiff, c("var1", "var2", "var3")))