pandas: iterate over 3 data frames element-wise into a function

I wrote:
def revertcheck(basevalue, first, second):
    if basevalue == 1:
        return 0
    elif basevalue > first and first > second:
        return -abs(first - second)
    elif basevalue < first and first < second:
        return -abs(first - second)
    else:
        return abs(first - second)
and now I have three same-sized correlation matrices of type
pandas.core.frame.DataFrame
I want to iterate over every element position and feed the three corresponding values into my function at once. Can someone give me a hint on how to do that?
AAPL AMZN BAC GE GM GOOG GS SNP XOM
AAPL 1.000000 0.567053 0.410656 0.232328 0.562110 0.616592 0.800797 -0.139989 0.147852
AMZN 0.567053 1.000000 -0.012830 0.071066 0.271695 0.715317 0.146355 -0.861710 -0.015936
BAC 0.410656 -0.012830 1.000000 0.953016 0.958784 0.680979 0.843638 0.466912 0.942582
GE 0.232328 0.071066 0.953016 1.000000 0.935008 0.741110 0.667574 0.308813 0.995237
GM 0.562110 0.271695 0.958784 0.935008 1.000000 0.857678 0.857719 0.206432 0.899904
GOOG 0.616592 0.715317 0.680979 0.741110 0.857678 1.000000 0.632255 -0.326059 0.675568
GS 0.800797 0.146355 0.843638 0.667574 0.857719 0.632255 1.000000 0.373738 0.623147
SNP -0.139989 -0.861710 0.466912 0.308813 0.206432 -0.326059 0.373738 1.000000 0.369004
XOM 0.147852 -0.015936 0.942582 0.995237 0.899904 0.675568 0.623147 0.369004 1.000000

Assuming basevalue, first and second are your three DataFrames of exactly the same size and structure, you can do what you want in a vectorised manner:
output = abs(first - second)
output = output.mask((basevalue > first) & (first > second), -abs(first - second))
output = output.mask((basevalue < first) & (first < second), -abs(first - second))
# apply the basevalue == 1 case last so it takes precedence, as in the original function
output = output.mask(basevalue == 1, 0)
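To sanity-check the vectorised version against the scalar function, you can run both on small toy inputs; a minimal sketch (the values here are made up):
import pandas as pd

# hypothetical 2x2 inputs with identical shape and labels
basevalue = pd.DataFrame([[1.0, 0.8], [0.3, 0.6]])
first = pd.DataFrame([[0.5, 0.6], [0.5, 0.6]])
second = pd.DataFrame([[0.2, 0.7], [0.7, 0.6]])

output = abs(first - second)
output = output.mask((basevalue > first) & (first > second), -abs(first - second))
output = output.mask((basevalue < first) & (first < second), -abs(first - second))
output = output.mask(basevalue == 1, 0)  # basevalue == 1 wins, as in revertcheck
print(output)  # element [0, 0] is 0, element [1, 0] is -0.2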

Related

Splitting a coordinate string into X and Y columns with a pandas data frame

So I created a pandas data frame showing the coordinates for an event and the number of times those coordinates appear; the coordinates are stored in a string, like this:
Coordinates Occurrences x
0 (76.0, -8.0) 1 0
1 (-41.0, -24.0) 1 1
2 (69.0, -1.0) 1 2
3 (37.0, 30.0) 1 3
4 (-60.0, 1.0) 1 4
.. ... ... ..
63 (-45.0, -11.0) 1 63
64 (80.0, -1.0) 1 64
65 (84.0, 24.0) 1 65
66 (76.0, 7.0) 1 66
67 (-81.0, -5.0) 1 67
I want to create a new data frame that shows the x and y coordinates individually, along with their occurrences, like this:
x Occurrences y Occurrences
76 ... -8 ...
-41 ... -24 ...
69 ... -1 ...
37 ... 30 ...
-60 ... 1 ...
I have tried to split the string, but I don't think I am doing it correctly, and I don't know how to add the result to the table regardless. I think I'd have to do something like a for loop later in my code. I scraped the data from an API; here is the code that sets up the data frame shown:
for key in contents['liveData']['plays']['allPlays']:
    if key['result']['event'] == "Shot":
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = 1
        else:
            shots[scoordinates] += 1
    if key['result']['event'] == "Goal":
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in goals:
            goals[gcoordinates] = 1
        else:
            goals[gcoordinates] += 1

# create the data frames using pandas
gdf = pd.DataFrame(list(goals.items()), columns=['Coordinates', 'Occurrences'])
print(gdf)
sdf = pd.DataFrame(list(shots.items()), columns=['Coordinates', 'Occurrences'])
print(sdf)
Try this:
import re

df[['x', 'y']] = df.Coordinates.apply(
    lambda c: pd.Series(dict(zip(['x', 'y'],
                                 re.findall(r'[-]?[0-9]+\.[0-9]+', c.strip())))))
Using the built-in string methods should be performant:
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
(this also converts x and y to float values; not requested, but probably desired)
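Note that, judging by the construction code in the question, the Coordinates column may actually hold tuples rather than strings; if so, no string parsing is needed at all. A minimal sketch under that assumption:
import pandas as pd

# hypothetical frame mirroring the question's structure, with tuple coordinates
df = pd.DataFrame({'Coordinates': [(76.0, -8.0), (-41.0, -24.0)],
                   'Occurrences': [1, 1]})

# unpack each tuple directly into two new columns
df[['x', 'y']] = pd.DataFrame(df['Coordinates'].tolist(), index=df.index)
print(df)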

Finding the distances from each point to the rest, looping

I am new to Python.
I have a csv file containing 400 pairs of x and y in two columns.
I want to loop over the data so that it starts from a pair (x_i, y_i) and finds the distance between that pair and the rest of the 399 points. I want the process to be repeated for all pairs (x_i, y_i), with each result appended to a list Dist_i.
import numpy as np
import pandas as pd

x_y_data = pd.read_csv("x_y_points400_labeled_csv.csv")
x = x_y_data.loc[:, 'x']
y = x_y_data.loc[:, 'y']

i = 0
j = 0
while i < len(x):
    Dist = np.sqrt((x[i] - x)**2 + (y[j] - y)**2)
    i = 1 + i
    j = 1 + j
print(Dist)
output:
0 676.144955
1 675.503342
2 674.642602
..
396 9.897127
397 21.659654
398 15.508062
399 0.000000
Length: 400, dtype: float64
This is how far I got, but it is not what I intend to obtain. My goal is to get something like the matrix in the attached picture.
Thanks in advance for your help.
You can use broadcasting (arr[:, None]) to do this calculation all at once, which gives you the full set of pairwise calculations you want. Alternatively, scipy.spatial.distance.pdist gives you just the upper triangle of the calculations.
Sample Data
import pandas as pd
import numpy as np

np.random.seed(123)
N = 6
df = pd.DataFrame(np.random.normal(0, 1, (N, 2)),
                  columns=['X', 'Y'],
                  index=[f'point{i}' for i in range(N)])

x = df['X'].to_numpy()
y = df['Y'].to_numpy()

# broadcasting: an (N, 1) array minus an (N,) array produces an (N, N) grid of differences
result = pd.DataFrame(np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2),
                      index=df.index,
                      columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000
With scipy:
from scipy.spatial.distance import pdist
pdist(df[['X', 'Y']])
array([2.8532972 , 0.82759587, 1.95770875, 3.00078036, 1.16534282,
3.27316125, 2.91598992, 1.17270443, 1.70814458, 2.78266933,
3.1214628 , 1.74902298, 3.7184812 , 1.77945856, 2.09245472])
To turn this into the DataFrame above:
L = len(df)
arr = np.zeros((L, L))
arr[np.triu_indices(L, 1)] = pdist(df[['X', 'Y']])
arr = arr + arr.T  # mirror the upper triangle into the lower; the matrix is symmetric
pd.DataFrame(arr, index=df.index, columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000
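As an aside, scipy also ships squareform, which does this condensed-to-square conversion in one call; a short sketch (reusing df and pd from the sample data above):
from scipy.spatial.distance import pdist, squareform

# squareform expands the condensed distance vector into the full symmetric matrix
full = pd.DataFrame(squareform(pdist(df[['X', 'Y']])),
                    index=df.index, columns=df.index)
print(full)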

How to create a new column in a Pandas DataFrame using pandas.cut method?

I have a column with house prices that looks like this:
0 0.0
1 1480000.0
2 1035000.0
3 0.0
4 1465000.0
5 850000.0
6 1600000.0
7 0.0
8 0.0
9 0.0
Name: Price, dtype: float64
and I want to create a new column, data['PriceRange'], which assigns each price to a given range. This is what my code looks like:
data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)

for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))
data['PriceRange'] = pd.cut(data.Price, bins=bins, labels=labels, right=True)
And I get this error message:
TypeError: len() of unsized object
I've been trying different approaches and seem to be stuck here. I'd really appreciate some help.
Thanks,
Hugo
The problem is that you overwrite bins and labels in the loop, so only the last value remains:
for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))

print (bins)
11950000
print (labels)
11950000
No loop is necessary. Instead of range, use the numpy alternative arange, and build the labels as range strings. Finally, pass include_lowest=True to cut so that the first bin edge (0) is included in the first group.
bins = np.arange(0, 12000000, 50000)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
#correct first value
labels[0] = '0 - 50000'
print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000',
'200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000',
'400001 - 450000', '450001 - 500000']
data['PriceRange'] = pd.cut(data.Price,
                            bins=bins,
                            labels=labels,
                            right=True,
                            include_lowest=True)
print (data)
Price PriceRange
0 0.0 0 - 50000
1 1480000.0 1450001 - 1500000
2 1035000.0 1000001 - 1050000
3 0.0 0 - 50000
4 1465000.0 1450001 - 1500000
5 850000.0 800001 - 850000
6 1600000.0 1550001 - 1600000
7 0.0 0 - 50000
8 0.0 0 - 50000
9 0.0 0 - 50000
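If you only need bin numbers rather than readable labels, passing labels=False makes pd.cut return integer bin indices instead; a short sketch using the same bins (PriceBin is a hypothetical column name):
# index k marks the k-th 50,000-wide bucket
data['PriceBin'] = pd.cut(data.Price, bins=bins, labels=False, include_lowest=True)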

How do I aggregate sub-dataframes in pandas?

Suppose I have a two-level multi-indexed DataFrame:
In [1]: index = pd.MultiIndex.from_tuples([(i, j) for i in range(3)
   ...:                                    for j in range(1 + i)], names=list('ij'))
   ...: df = pd.DataFrame(0.1 * np.arange(2 * len(index)).reshape(-1, 2),
   ...:                   columns=list('xy'), index=index)
   ...: df
Out[1]:
x y
i j
0 0 0.0 0.1
1 0 0.2 0.3
1 0.4 0.5
2 0 0.6 0.7
1 0.8 0.9
2 1.0 1.1
And I want to run a custom function on every sub-dataframe:
In [2]: def my_aggr_func(subdf):
   ...:     return subdf['x'].mean() / subdf['y'].mean()
   ...:
   ...: level0 = df.index.levels[0].values
   ...: pd.DataFrame({'mean_ratio': [my_aggr_func(df.loc[i]) for i in level0]},
   ...:              index=pd.Index(level0, name=index.names[0]))
Out[2]:
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
Is there an elegant way to do it with df.groupby('i').agg(__something__) or something similar?
You need GroupBy.apply, which works on each sub-DataFrame as a whole:
df1 = df.groupby('i').apply(my_aggr_func).to_frame('mean_ratio')
print (df1)
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
You don't need the custom function. You can calculate the within-group means with agg, then use eval to compute the ratio you want.
df.groupby('i').agg('mean').eval('x / y')
i
0 0.000000
1 0.750000
2 0.888889
dtype: float64
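If you'd rather avoid eval, dividing the aggregated columns directly gives the same result; a small sketch:
# compute the group means, then take the column ratio
means = df.groupby('i')[['x', 'y']].mean()
ratio = means['x'] / means['y']
print(ratio)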

Faster way to split a string and count characters using R?

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.
I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp) {
  chars <- strsplit(as.character(line), "")[[1]]
  numGC <- 0
  for (j in st:sp) {
    ## nested ifs are faster than an OR (|) construction
    if (chars[[j]] == "g") {
      numGC <- numGC + 1
    } else if (chars[[j]] == "G") {
      numGC <- numGC + 1
    } else if (chars[[j]] == "c") {
      numGC <- numGC + 1
    } else if (chars[[j]] == "C") {
      numGC <- numGC + 1
    }
  }
  return(numGC)
}
Running Rprof gives me the following output:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
self.time self.pct total.time total.pct
"gcCount" 77.36 76.8 100.74 100.0
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.58 3.6 3.64 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
"gcCount" 100.74 100.0 77.36 76.8
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.64 3.6 3.58 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$sampling.time
[1] 100.74
Any advice for making this code faster?
Better to not split at all, just count the matches:
gcCount2 <- function(line, st, sp) {
  sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
That's an order of magnitude faster.
A small C function that just iterates over the characters would be yet another order of magnitude faster.
A one-liner:
table(strsplit(toupper(a), '')[[1]])
I don't know whether it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's on CRAN (which seems to be down as I write this).
GC content would be:
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence) # 0.4761905
That's from a string; you can also read in a FASTA file using read.fasta().
There's no need to use a loop here.
Try this:
gcCount <- function(line, st, sp) {
  chars <- strsplit(as.character(line), "")[[1]][st:sp]
  length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5
or you can use the regex version to count all four characters at once:
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12
or you can apply the tolower function first and then use stri_count:
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
Time performance:
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492 100
gcCount2(x, 1, 40) 15.010 16.51 18.312 19.213 40.826 100
stri_count_regex(x, c("[GgCc]")) 15.610 16.51 18.912 20.112 61.239 100
Another example for a longer string; stri_dup replicates a string n times:
> stri_dup("abc",3)
[1] "abcabcabc"
As you can see, for longer sequences stri_count is faster :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828 100
gcCount2(y, 1, 40 * 100) 360.225 369.5315 383.6400 399.100 438.274 100
stri_count_regex(y, c("[GgCc]")) 131.483 137.9370 151.8955 176.511 221.839 100
Thanks to all for this post.
To optimize a script that calculates the GC content of 100M sequences of 200 bp, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using stringr's str_count reduced that to 1.5 hours.
In the end I coded it in C++ and called it from R using Rcpp, which cut the computation time down to 10 minutes!
Here is the C++ code:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
float pGC_cpp(std::string s) {
  // counts uppercase G and C only
  int count = 0;
  for (size_t i = 0; i < s.size(); i++)
    if (s[i] == 'G') count++;
    else if (s[i] == 'C') count++;
  float pGC = (float)count / s.size();
  pGC = pGC * 100;
  return pGC;
}
which I call from R by typing:
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")