Finding the distances from each point to the rest, looping - pandas

I am new to python.
I have a csv file containing 400 pairs of x and y in two columns.
I want to loop over the data so that it starts from a pair (x_i, y_i) and finds the distance between that pair and each of the other 399 points. I want the process repeated for every pair (x_i, y_i), with each result appended to a list Dist_i.
import numpy as np
import pandas as pd
x_y_data = pd.read_csv("x_y_points400_labeled_csv.csv")
x = x_y_data.loc[:,'x']
y = x_y_data.loc[:,'y']
i=0
j=0
while (i<len(x)):
    Dist=np.sqrt((x[i]-x)**2 + (y[j]-y)**2)
    i = 1 + i
    j = 1 + j
print(Dist)
output:
0 676.144955
1 675.503342
2 674.642602
..
396 9.897127
397 21.659654
398 15.508062
399 0.000000
Length: 400, dtype: float64
This is how far I went, but it is not what I intend to obtain. My goal is to get something like in the picture attached.
Thanks for your help in advance

You can use broadcasting (arr[:, None]) to do this calculation all at once. This gives the full N x N matrix, including the repeated values you asked for (the distance from point i to point j and from point j to point i). Alternatively, scipy.spatial.distance.pdist returns only the upper triangle, with each pair computed once.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
N = 6
df = pd.DataFrame(np.random.normal(0, 1, (N, 2)),
                  columns=['X', 'Y'],
                  index=[f'point{i}' for i in range(N)])
x = df['X'].to_numpy()
y = df['Y'].to_numpy()
result = pd.DataFrame(np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2),
                      index=df.index,
                      columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000
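To make the broadcasting step easier to follow, here is a minimal sketch on three hand-picked points (a 3-4-5 right triangle). The function name pairwise_distances and the sample coordinates are illustrative assumptions, not part of the original answer. x[:, None] has shape (n, 1), so subtracting the flat array x of shape (n,) yields an (n, n) array of every difference x_i - x_j.

```python
import numpy as np

def pairwise_distances(x, y):
    """Return the (n, n) matrix of Euclidean distances between points (x_i, y_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x[:, None] - x  # shape (n, n): dx[i, j] = x[i] - x[j]
    dy = y[:, None] - y  # shape (n, n): dy[i, j] = y[i] - y[j]
    return np.hypot(dx, dy)  # sqrt(dx**2 + dy**2), element-wise

# Hypothetical sample points forming a 3-4-5 right triangle.
x = [0.0, 3.0, 0.0]
y = [0.0, 0.0, 4.0]
D = pairwise_distances(x, y)
print(D)
```

Row D[i] holds the distances from point i to every point (including itself, which is 0), so each row plays the role of one Dist_i list from the question.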
With scipy.
from scipy.spatial.distance import pdist
pdist(df[['X', 'Y']])
array([2.8532972 , 0.82759587, 1.95770875, 3.00078036, 1.16534282,
3.27316125, 2.91598992, 1.17270443, 1.70814458, 2.78266933,
3.1214628 , 1.74902298, 3.7184812 , 1.77945856, 2.09245472])
To turn this into the DataFrame above, fill the upper triangle with the pdist values and mirror it into the lower triangle:
L = len(df)
arr = np.zeros((L, L))
arr[np.triu_indices(L, 1)] = pdist(df[['X', 'Y']])
arr = arr + arr.T # Lower triangle b/c symmetric
pd.DataFrame(arr, index=df.index, columns=df.index)
point0 point1 point2 point3 point4 point5
point0 0.000000 2.853297 0.827596 1.957709 3.000780 1.165343
point1 2.853297 0.000000 3.273161 2.915990 1.172704 1.708145
point2 0.827596 3.273161 0.000000 2.782669 3.121463 1.749023
point3 1.957709 2.915990 2.782669 0.000000 3.718481 1.779459
point4 3.000780 1.172704 3.121463 3.718481 0.000000 2.092455
point5 1.165343 1.708145 1.749023 1.779459 2.092455 0.000000

Related

Dirichlet regression coefficients

I am starting from this example of Dirichlet regression.
My response y is a vector of N = 3 components, and the Dirichlet regression model estimates coefficients for only N-1 of them.
Suppose I am interested in the coefficients for all 3 components. How can I get them?
Thanks!
library(brms)
library(rstan)
library(dplyr)
bind <- function(...) cbind(...)
N <- 20
df <- data.frame(
  y1 = rbinom(N, 10, 0.5), y2 = rbinom(N, 10, 0.7),
  y3 = rbinom(N, 10, 0.9), x = rnorm(N)
) %>%
  mutate(
    size = y1 + y2 + y3,
    y1 = y1 / size,
    y2 = y2 / size,
    y3 = y3 / size
  )
df$y <- with(df, cbind(y1, y2, y3))
make_stancode(bind(y1, y2, y3) ~ x, df, dirichlet())
make_standata(bind(y1, y2, y3) ~ x, df, dirichlet())
fit <- brm(bind(y1, y2, y3) ~ x, df, dirichlet())
summary(fit)
Family: dirichlet
Links: muy2 = logit; muy3 = logit; phi = identity
Formula: bind(y1, y2, y3) ~ x
Data: df (Number of observations: 20)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
muy2_Intercept 0.29 0.10 0.10 0.47 1.00 2830 2514
muy3_Intercept 0.56 0.09 0.38 0.73 1.00 2833 2623
muy2_x 0.04 0.11 -0.17 0.24 1.00 3265 2890
muy3_x -0.00 0.10 -0.20 0.19 1.00 3229 2973
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 39.85 9.13 23.83 59.78 1.00 3358 2652
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

probability of sample of distribution

I am trying to generate a sample of 100 scenarios (X, Y) where X and Y are both normally distributed, X ~ N(50, 5^2) and Y ~ N(30, 2^2), and correlated with Cov(X,Y)=0.4.
I have been able to generate 100 scenarios with the Cholesky decomposition:
# We do a Cholesky decomposition to generate correlated scenarios
using LinearAlgebra, Distributions
nScenarios = 10
Σ = [25 0.4; 0.4 4]
μ = [50, 30]
nVars = length(μ)  # number of correlated variables (2)
L = cholesky(Σ)
v = [rand(Normal(0, 1), nScenarios), rand(Normal(0, 1), nScenarios)]
X = reshape(zeros(nScenarios),1,nScenarios)
Y = reshape(zeros(nScenarios),1,nScenarios)
for i = 1:nScenarios
    # Use the lower factor L.L, since Σ = L.L * L.L'
    X[1, i] = sum(L.L[1, j] * v[j][i] for j = 1:nVars) + μ[1]
    Y[1, i] = sum(L.L[2, j] * v[j][i] for j = 1:nVars) + μ[2]
end
However, I need the probability of each scenario, i.e. P(X=k and Y=p). My question is: how can we draw a sample from a given distribution together with the probability of each scenario?
Following BatWannaBe's explanation, I would normally do it like this:
julia> using Distributions
julia> d = MvNormal([50.0, 30.0], [25.0 0.4; 0.4 4.0])
FullNormal(
dim: 2
μ: [50.0, 30.0]
Σ: [25.0 0.4; 0.4 4.0]
)
julia> point = rand(d)
2-element Vector{Float64}:
52.807189619051485
32.693811008760676
julia> pdf(d, point)
0.0056519503173830515
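For readers working in Python, the same idea can be sketched with NumPy alone. This is an illustration under stated assumptions rather than a translation of the answer: the names sample_scenarios and density are hypothetical, and the density is written out from the bivariate normal formula instead of using a library call. Note that for a continuous distribution the value returned is a probability density, not a probability; P(X=k and Y=p) for any exact point is zero. When sampled scenarios are used as a discrete approximation, a common convention is to give each of the n draws equal weight 1/n.

```python
import numpy as np

mu = np.array([50.0, 30.0])
sigma = np.array([[25.0, 0.4], [0.4, 4.0]])

def sample_scenarios(n, rng):
    """Draw n correlated (X, Y) pairs using the lower Cholesky factor of sigma."""
    lower = np.linalg.cholesky(sigma)   # sigma == lower @ lower.T
    z = rng.standard_normal((n, 2))     # independent N(0, 1) draws
    return mu + z @ lower.T             # each row has covariance sigma

def density(points):
    """Bivariate normal density evaluated at each row of points."""
    diff = np.atleast_2d(points) - mu
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

rng = np.random.default_rng(0)
scenarios = sample_scenarios(100, rng)
densities = density(scenarios)                           # one density value per scenario
weights = np.full(len(scenarios), 1.0 / len(scenarios))  # equal scenario weights
```

Evaluating density at the point shown in the Julia session above reproduces the pdf value printed there.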

RuntimeWarning: invalid value encountered

I'm trying to make my Philips Hue lights change colors based on the frequency (in Hz) of a song being played. But I ran into a RuntimeWarning and can't figure out what's going on. I'd highly appreciate it if anyone could help me out here :)
wf = wave.open('visualize.wav', 'rb')
swidth = wf.getsampwidth()
RATE = wf.getframerate()
window = np.blackman(chunk)
p = pyaudio.PyAudio()
channels = wf.getnchannels()
stream = p.open(format =
                p.get_format_from_width(wf.getsampwidth()),
                channels = channels,
                rate = RATE,
                output = True)
data = wf.readframes(chunk)
print('switdth {} chunk {} data {} ch {}'.format(swidth,chunk,len(data), channels))
while len(data) == chunk*swidth*channels:
    stream.write(data)
    indata = np.fromstring(data, dtype='int16')
    channel0 = indata[0::channels]
    fftData=abs(np.fft.rfft(indata))**2
    which = fftData[1:].argmax() + 1
    if which != len(fftData)-1:
        y0,y1,y2 = np.log(fftData[which-1:which+2:])
        x1 = (y2 - y0) * .5 / (2 * y1 - y2 - y0)
        thefreq = (which+x1)*RATE/chunk
        print ("The freq is %f Hz." % (thefreq))
    elif thefreq > 4000:
        for i in cycle(color_list):
            change_light_color(room, *color_list[i])
            time.sleep(0.5)
    else:
        if thefreq < 4000:
            for i in cycle(color_list_2):
                change_light_color(room, *color_list_2[i])
                time.sleep(0.5)
if data:
    stream.write(data)
stream.close()
p.terminate()
This is what I end up with:
/usr/local/bin/python3 /Users/Sem/Desktop/hue_visualizer/visualize.py
Sem#Sems-MacBook-Pro hue_visualizer % /usr/local/bin/python3 /Users/Sem/Desktop/hue_visualizer/visualize.py
switdth 2 chunk 1024 data 4096 ch 2
/Users/Sem/Desktop/hue_visualizer/visualize.py:69: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
indata = np.fromstring(data, dtype='int16')
/Users/Sem/Desktop/hue_visualizer/visualize.py:74: RuntimeWarning: divide by zero encountered in log
y0,y1,y2 = np.log(fftData[which-1:which+2:])
/Users/Sem/Desktop/hue_visualizer/visualize.py:75: RuntimeWarning: invalid value encountered in double_scalars
x1 = (y2 - y0) * .5 / (2 * y1 - y2 - y0)
The freq is nan Hz.
The freq is nan Hz.
The freq is nan Hz.
The freq is nan Hz.
The freq is nan Hz.
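The two warnings are linked. np.log returns -inf for a power value of exactly 0 (for example, a block of digital silence), and subtracting -inf from -inf in the interpolation step gives nan. Below is a minimal sketch of a guarded frequency estimate, assuming a mono block of samples. The function name dominant_frequency and the synthetic test tone are hypothetical and not part of the original script. Two changes beyond the guard are also assumptions about what was intended: the Blackman window is actually applied, and the FFT runs on a single channel rather than on the interleaved stereo buffer.

```python
import numpy as np

def dominant_frequency(samples, rate):
    """Estimate the strongest frequency (Hz) in a mono block, or None if it is silent."""
    samples = np.asarray(samples, dtype=np.float64)
    n = len(samples)
    power = np.abs(np.fft.rfft(samples * np.blackman(n))) ** 2
    peak = int(power[1:].argmax()) + 1  # skip the DC bin
    if power[peak] <= 0.0:
        return None  # silent block: log(0) would give -inf and then nan
    neighbours = power[peak - 1:peak + 2]
    if len(neighbours) < 3 or np.any(neighbours <= 0.0):
        return peak * rate / n  # cannot interpolate, fall back to the bin centre
    # Parabolic interpolation on the log power around the peak bin.
    y0, y1, y2 = np.log(neighbours)
    denom = 2.0 * y1 - y2 - y0
    offset = 0.0 if denom == 0.0 else (y2 - y0) * 0.5 / denom
    return (peak + offset) * rate / n

# Synthetic check: a 440 Hz tone and a silent block, both as int16 samples.
rate, chunk = 44100, 1024
t = np.arange(chunk) / rate
tone = (10000 * np.sin(2 * np.pi * 440.0 * t)).astype(np.int16)
silence = np.zeros(chunk, dtype=np.int16)
print(dominant_frequency(tone, rate))
print(dominant_frequency(silence, rate))
```

In the script above, the block would come from np.frombuffer(data, dtype='int16')[0::channels], which also avoids the DeprecationWarning raised by np.fromstring. The caller should skip the light update when the function returns None.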

pandas iterate over 3 data frames element wise into a function

I wrote:
def revertcheck(basevalue,first,second):
    if basevalue==1:
        return 0
    elif basevalue > first and first > second:
        return -abs(first-second)
    elif basevalue < first and first < second:
        return -abs(first-second)
    else:
        return abs(first-second)
and now I have 3 same sized correlation matrices of the type
pandas.core.frame.DataFrame
I want to iterate over every element, and feed all those 3 values into my function at a time. Can someone give me a hint how to do that?
AAPL AMZN BAC GE GM GOOG GS SNP XOM
AAPL 1.000000 0.567053 0.410656 0.232328 0.562110 0.616592 0.800797 -0.139989 0.147852
AMZN 0.567053 1.000000 -0.012830 0.071066 0.271695 0.715317 0.146355 -0.861710 -0.015936
BAC 0.410656 -0.012830 1.000000 0.953016 0.958784 0.680979 0.843638 0.466912 0.942582
GE 0.232328 0.071066 0.953016 1.000000 0.935008 0.741110 0.667574 0.308813 0.995237
GM 0.562110 0.271695 0.958784 0.935008 1.000000 0.857678 0.857719 0.206432 0.899904
GOOG 0.616592 0.715317 0.680979 0.741110 0.857678 1.000000 0.632255 -0.326059 0.675568
GS 0.800797 0.146355 0.843638 0.667574 0.857719 0.632255 1.000000 0.373738 0.623147
SNP -0.139989 -0.861710 0.466912 0.308813 0.206432 -0.326059 0.373738 1.000000 0.369004
XOM 0.147852 -0.015936 0.942582 0.995237 0.899904 0.675568 0.623147 0.369004 1.000000
Let's assume basevalue, first and second are your three dataframes of exactly the same size and structure; then you can do what you want in a vectorised manner. Each mask overwrites the previous result wherever its condition holds, so the basevalue == 1 mask goes last to take precedence, just as the first branch of the function does:
output = abs(first - second)
output = output.mask((basevalue > first) & (first > second), -abs(first - second))
output = output.mask((basevalue < first) & (first < second), -abs(first - second))
output = output.mask(basevalue == 1, 0)
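An alternative sketch uses np.select, which picks the first matching condition for each element and so mirrors the if/elif order of the function directly. The wrapper name revertcheck_frames and the small 2 x 2 frames are illustrative assumptions. The scalar revertcheck from the question is repeated only so that the results can be compared.

```python
import numpy as np
import pandas as pd

def revertcheck(basevalue, first, second):
    """Scalar version from the question, kept here for comparison."""
    if basevalue == 1:
        return 0
    elif basevalue > first and first > second:
        return -abs(first - second)
    elif basevalue < first and first < second:
        return -abs(first - second)
    else:
        return abs(first - second)

def revertcheck_frames(basevalue, first, second):
    """Element-wise revertcheck over three identically shaped and labelled DataFrames."""
    diff = (first - second).abs().to_numpy()
    conditions = [
        (basevalue == 1).to_numpy(),
        ((basevalue > first) & (first > second)).to_numpy(),
        ((basevalue < first) & (first < second)).to_numpy(),
    ]
    choices = [np.zeros(diff.shape), -diff, -diff]
    # np.select takes the first condition that is True for each element.
    values = np.select(conditions, choices, default=diff)
    return pd.DataFrame(values, index=first.index, columns=first.columns)

# Hypothetical 2 x 2 inputs sharing the same labels.
idx = ["A", "B"]
basevalue = pd.DataFrame([[1.0, 0.9], [0.9, 1.0]], index=idx, columns=idx)
first = pd.DataFrame([[0.8, 0.5], [0.95, 0.3]], index=idx, columns=idx)
second = pd.DataFrame([[0.2, 0.7], [0.99, 0.1]], index=idx, columns=idx)
output = revertcheck_frames(basevalue, first, second)
print(output)
```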

How to create a new column in a Pandas DataFrame using pandas.cut method?

I have a column with house prices that looks like this:
0 0.0
1 1480000.0
2 1035000.0
3 0.0
4 1465000.0
5 850000.0
6 1600000.0
7 0.0
8 0.0
9 0.0
Name: Price, dtype: float64
and I want to create a new column called data['PriceRange'] which assigns each price to a given range. This is what my code looks like:
data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)
for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))
data['PriceRange'] = pd.cut(data.Price, bins=bins, labels=labels, right=True)
And I get this Error message:
TypeError: len() of unsized object
I've been trying different approaches and seem to be stuck here. I'd really appreciate some help.
Thanks,
Hugo
The problem is that you overwrite bins and labels on every pass through the loop, so only the last value remains:
for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))
print (bins)
11950000
print (labels)
11950000
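The error message itself most likely comes from the same cause: np.array applied to a single number or string creates a zero-dimensional array, which has no length. The short sketch below, with hypothetical variable names, reproduces the message outside of pandas and contrasts it with a proper one-dimensional array of edges.

```python
import numpy as np

scalar_bins = np.array(11950000)  # zero-dimensional: a single value, not a sequence
print(scalar_bins.ndim)

try:
    len(scalar_bins)
except TypeError as exc:
    message = str(exc)
    print(message)

edges = np.arange(0, 200000, 50000)  # one-dimensional array of bin edges
print(len(edges))
```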
No loop is necessary. Instead of range, use the NumPy alternative np.arange to build all the bin edges at once, and create the labels as ranges from consecutive edges. Finally, pass include_lowest=True to cut so that the first bin edge (0) is included in the first group.
bins = np.arange(0, 12000000, 50000)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
#correct first value
labels[0] = '0 - 50000'
print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000',
'200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000',
'400001 - 450000', '450001 - 500000']
data['PriceRange'] = pd.cut(data.Price,
                            bins=bins,
                            labels=labels,
                            right=True,
                            include_lowest=True)
print (data)
Price PriceRange
0 0.0 0 - 50000
1 1480000.0 1450001 - 1500000
2 1035000.0 1000001 - 1050000
3 0.0 0 - 50000
4 1465000.0 1450001 - 1500000
5 850000.0 800001 - 850000
6 1600000.0 1550001 - 1600000
7 0.0 0 - 50000
8 0.0 0 - 50000
9 0.0 0 - 50000
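To check how the bin edges behave, here is a small sketch with a shortened set of edges and a few hypothetical prices chosen to sit exactly on or just past an edge. It also shows that a price above the last edge is assigned NaN. The same applies to the full bins above, where np.arange stops at 11950000.

```python
import numpy as np
import pandas as pd

bins = np.arange(0, 200001, 50000)  # edges: 0, 50000, 100000, 150000, 200000
labels = [f"{int(lo) + 1} - {int(hi)}" for lo, hi in zip(bins[:-1], bins[1:])]
labels[0] = "0 - 50000"  # the first bin includes 0 because of include_lowest=True

# Hypothetical prices: on the lowest edge, on an inner edge, just past it,
# on the last edge, and beyond the last edge.
prices = pd.Series([0.0, 50000.0, 50001.0, 200000.0, 250000.0])
ranges = pd.cut(prices, bins=bins, labels=labels, right=True, include_lowest=True)
print(ranges.tolist())
```

With right=True each bin includes its right edge, so 50000 falls in the first range and 50001 in the second.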