Better way to concatenate pandas matrices - pandas

I need to concatenate multiple matrices (containing numbers and strings) in a loop. So far I have written the solution below, but I don't like using a dummy variable (h) and I'm sure the code could be improved.
Here it is:
h = 0
for name in list_of_matrices:
    h += 1
    Matrix = pd.read_csv(name)
    if h == 1:
        Matrix_final = Matrix
        continue
    Matrix_final = pd.concat([Matrix_final, Matrix])
For some reason, if I use the following code instead, I end up with the matrices listed one after the other rather than a single joined one (so this code does not fit my need):
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
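For reference, the second snippet above only builds a list of DataFrames; the step it is missing is a single pd.concat call over that list. A minimal sketch of the list-then-concat pattern, reusing list_of_matrices from the question (ignore_index is an assumed choice, not from the original post):

import pandas as pd

# read each CSV into a DataFrame, then join them all with one concat call
frames = [pd.read_csv(name) for name in list_of_matrices]
Matrix_final = pd.concat(frames, ignore_index=True)  # ignore_index rebuilds a clean row index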

Related

Can't get dimensions of arrays equal to plot with Matplotlib

I am trying to plot arrays where one of them is calculated from my x-axis values inside a for loop. I've gone through my code multiple times and checked the lengths of my arrays along the way, but I can't seem to find a solution that makes them equal length.
This is the code I have started with:
import numpy as np
import matplotlib.pyplot as plt

a = 1; b = 2; c = 3; d = 1; e = 2
t0 = 0
t_end = 10
dt = 0.05
t = np.arange(t0, t_end, dt)
n = len(t)
fout = 1
M = 1
Ca = np.zeros(n)
Cb = np.zeros(n)  # assumed: Cb and Cc initialized like Ca
Cc = np.zeros(n)
Ca[0] = a; Cb[0] = b
Cc[0] = 0
k1 = 1

def rA(Ca, Cb, Cc, t):
    return -k1 * Ca**a * Cb**b * dt
while e > 1e-3:
    t = np.arange(t0, t_end, dt)
    n = len(t)
    for i in range(1, n-1):
        Ca[i+1] = Ca[i] + rA(Ca[i], Cb[i], Cc[i], t[i])
    e = abs((M - Ca[n-1]) / M)
    M = Ca[n-1]
    dt = dt / 2

plt.plot(t, Ca)
plt.grid()
plt.show()
Afterwards, I try to calculate a second function for different y-values. Within the for loop I added:
Cb[i+1] = Cb[i] + rB(Ca[i], Cb[i], Cc[i], t[i])
(while also defining rB in a similar manner to rA). The error message I received at this point is:
IndexError: index 200 is out of bounds for axis 0 with size 200
I feel like it has to do with the way I'm initializing the arrays for Ca. The equivalent initialization in MATLAB, which I'm more familiar with, looks like this:
Ca = zeros(1,n)
I have recreated the code I've written here in MATLAB and there I do get a plot, so I'm wondering where I am going wrong here.
I thought my best course of action was to change n to a fixed int by just changing it in the while loop, but after changing n = len(t) to n = 100 I received the following error message:
ValueError: x and y must have same first dimension, but have shapes (200,) and (400,)
As my previous question turned out to be something trivial I just kept missing, I feel like this is the same kind of thing. But I have spent over an hour looking for and trying fixes without success.
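For what it's worth, the IndexError is consistent with Ca keeping its original length of 200 while n doubles once dt is halved inside the while loop. Below is a minimal sketch of one possible fix, not from the original post: the concentration arrays are re-created at the new size on every pass (simplified to Ca and Cb only, using the same rate expression as rA above).

import numpy as np
import matplotlib.pyplot as plt

a, b, k1 = 1, 2, 1
t0, t_end = 0, 10
dt = 0.05
e, M = 2, 1

while e > 1e-3:
    t = np.arange(t0, t_end, dt)
    n = len(t)
    Ca = np.zeros(n); Ca[0] = a   # re-allocate so the arrays match the new n
    Cb = np.zeros(n); Cb[0] = b
    for i in range(n - 1):
        dC = -k1 * Ca[i]**a * Cb[i]**b * dt
        Ca[i+1] = Ca[i] + dC
        Cb[i+1] = Cb[i] + dC
    e = abs((M - Ca[-1]) / M)     # relative change between successive refinements
    M = Ca[-1]
    dt = dt / 2

plt.plot(t, Ca)
plt.grid()
plt.show()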

Vectorizing np.maximum & np.minimum over axes with broadcasting

I've roughly got something like
A = np.random.random([n, 2])
B = np.random.random([3, 2])
...
ret = 0
for b in B:
    for a in A:
        start = np.max([a[0], b[0]])
        end = np.min([a[1], b[1]])
        ret += np.max([0, end - start])
return ret
Putting it into words, A is an input array of n intervals (each a [start, end] pair) and B is a known array of 3 such intervals, and I'm trying to compute the total length of intersection between all pairs of intervals.
Is there a way to vectorize it? My first thought was using np.maximum and np.minimum along with broadcasting, but nothing I tried seems to work.
Broadcast after extending dimensions to vectorize things -
p1 = np.maximum(A[:,None,0],B[:,0])
p2 = np.minimum(A[:,None,1],B[:,1])
ret = np.maximum(0,p2-p1).sum()
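A quick self-contained check (with a small n assumed purely for illustration) that the broadcast version matches the original double loop:

import numpy as np

n = 5
A = np.random.random([n, 2])
B = np.random.random([3, 2])

# original nested loop
ret_loop = 0.0
for b in B:
    for a in A:
        start = max(a[0], b[0])
        end = min(a[1], b[1])
        ret_loop += max(0.0, end - start)

# broadcast version: pairwise starts/ends have shape (n, 3)
p1 = np.maximum(A[:, None, 0], B[:, 0])
p2 = np.minimum(A[:, None, 1], B[:, 1])
ret_vec = np.maximum(0, p2 - p1).sum()

print(np.isclose(ret_loop, ret_vec))  # expected: True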

vectorization of loop in pandas

I've been trying to vectorize the following with no such luck:
Consider two data frames. One is a list of dates:
cols = ['col1', 'col2']
index = pd.date_range('1/1/15','8/31/18')
df = pd.DataFrame(columns = cols )
What I'm doing currently is looping through the dates and, for each one, counting the rows of my main (large) dataframe df_main whose n_date is less than or equal to the date in question:
for x in range(len(index)):
    temp_arr = []
    active = len(df_main[(df_main.n_date <= index[x])])
    temp_arr = [index[x], active]
    df = df.append(pd.Series(temp_arr, index=cols), ignore_index=True)
Is there a way to vectorize the above?
What about something like the following?
# initializing
mycols = ['col1', 'col2']
myindex = pd.date_range('1/1/15', '8/31/18')
mydf = pd.DataFrame(columns=mycols)

# create df_main (which has each of myindex's dates minus 10 days)
df_main = pd.DataFrame(data=myindex - pd.Timedelta(days=10), columns=['n_date'])

# wrap a dataframe around a list comprehension
mydf = pd.DataFrame([[x, len(df_main[df_main['n_date'] <= x])] for x in myindex])
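If df_main is large, a fully vectorized alternative (not shown in the original answer) is to sort n_date once and let np.searchsorted produce all the counts at the same time; a sketch under the same toy setup:

import numpy as np
import pandas as pd

myindex = pd.date_range('1/1/15', '8/31/18')
df_main = pd.DataFrame(data=myindex - pd.Timedelta(days=10), columns=['n_date'])

# sort once, then count how many n_date values fall at or before each date
dates_sorted = np.sort(df_main['n_date'].values)
counts = np.searchsorted(dates_sorted, myindex.values, side='right')

mydf = pd.DataFrame({'col1': myindex, 'col2': counts})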

How to plot 4-D data embedded in a dataframe in Julia using a subplots approach?

I have a Julia DataFrame where the first 4 columns are dimensions and the 5th one contains the actual data.
I would like to plot it using a subplots approach, where the two main plot axes concern the first two dimensions and each subplot is then a contour plot over the remaining two dimensions.
I am almost there with the following code:
using DataFrames,Plots
# plotlyjs() # doesn't work with plotlyjs backend
pyplot()
X = [1,2,3,4]
Y = [0.1,0.15,0.2]
I = [2,4,6,8,10,12,14]
J = [10,20,30,40,50,60]
df = DataFrame(X=Int64[], Y=Float64[], I=Float64[], J=Float64[], V=Float64[] )
[push!(df,[x,y,i,j,(5*x+20*y+2)*(0.2*i^2+0.5*j^2+3*i*j+2*i^2*j+1)]) for x in X, y in Y, i in I, j in J]
minvalue = minimum(df[:V])
maxvalue = maximum(df[:V])
function toDict(df, dimCols, valueCol)
    toReturn = Dict()
    for r in eachrow(df)
        keyValues = []
        [push!(keyValues,r[d]) for d in dimCols]
        toReturn[(keyValues...)] = r[valueCol]
    end
    return toReturn
end
dict = toDict(df, [:X,:Y,:I,:J], :V )
M = [dict[(x,y,i,j)] for j in J, i in I, y in Y, x in X ]
yL = length(Y)
xL = length(X)
plot(contour(M[:,:,3,1], ylabel="y = $(string(Y[3]))", zlims=(minvalue,maxvalue)), contour(M[:,:,3,2]), contour(M[:,:,3,3]), contour(M[:,:,3,4]),
     contour(M[:,:,2,1], ylabel="y = $(string(Y[2]))", zlims=(minvalue,maxvalue)), contour(M[:,:,2,2]), contour(M[:,:,2,3]), contour(M[:,:,2,4]),
     contour(M[:,:,1,1], ylabel="y = $(string(Y[1]))", xlabel="x = $(string(X[1]))"), contour(M[:,:,1,2], xlabel="x = $(string(X[2]))"), contour(M[:,:,1,3], xlabel="x = $(string(X[3]))"), contour(M[:,:,1,4], xlabel="x = $(string(X[4]))"),
     layout=(yL,xL) )
This produces the grid of contour subplots, but I am left with the following concerns:
How do I automate the creation of each subplot in the plot() call? Do I need to write a macro?
I would like each subplot to share the same limits on the z axis, but zlims seems not to work. Is zlims not yet supported?
How do I hide the z-axis legend on each subplot and instead plot a single one separately (ideally on the right side of the overall plot)?
EDIT:
For the first point I don't need a macro: I can create the subplots in a for loop, collect them in an array, and pass the array to the plot() call using the splat operator (...):
plots = []
for y in length(Y):-1:1
    for x in 1:length(X)
        xlabel = y == 1 ? "x = $(string(X[x]))" : ""
        ylabel = x == 1 ? "y = $(string(Y[y]))" : ""
        println("$y - $x")
        p = contour(I, J, M[:,:,y,x], xlabel=xlabel, ylabel=ylabel, zlims=(minvalue,maxvalue))  # named p to avoid shadowing plot()
        push!(plots, p)
    end
end
plot(plots..., layout=(yL,xL))

Iterating over multidimensional Numpy array

What is the fastest way to iterate over all elements in a 3D NumPy array? If array.shape = (r,c,z), there must be something faster than this:
x = np.asarray(range(12)).reshape((1, 4, 3))

# function that sums nearest-neighbor values
# e is my element location, d is the distance
def nn(arr, e, d=1):
    d = e[0]
    r = e[1]
    c = e[2]
    return sum(arr[d, r-1, c-1:c+2]) + sum(arr[d, r+1, c-1:c+2]) + arr[d, r, c-1] + arr[d, r, c+1]
Instead of creating a nested for loop like the one below to generate the values of e and run nn for each pixel:
for dim in range(z):
    for row in range(r):
        for col in range(c):
            e = (dim, row, col)
I'd like to vectorize my nn function in a way that extracts location information for each element (e = (0,1,1) for example) and iterates over ALL elements in my matrix without having to manually input each locational value of e OR creating a messy nested for loop. I'm not sure how to apply np.vectorize to this problem. Thanks!
It is easy to vectorize over the d dimension:
def nn(arr, e):
    r, c = e  # (e[0], e[1])
    return (np.sum(arr[:, r-1, c-1:c+2], axis=1) + np.sum(arr[:, r+1, c-1:c+2], axis=1)
            + arr[:, r, c-1] + arr[:, r, c+1])
Now just iterate over the row and col dimensions; each call returns a vector that is assigned to the appropriate slot in x.
for row in <correct range>:
    for col in <correct range>:
        x[:, row, col] = nn(data, (row, col))
The next step is to make index arrays that broadcast against each other:
rows = <correct range>[:, None]
cols = <correct range>
arr[:, rows-1, cols+2] + arr[:, rows, cols+2]  # etc.
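Spelled out, that broadcasting step might look like the sketch below (interior indices built with np.arange; this is one reading of the hint, not code from the answer):

import numpy as np

x = np.asarray(range(12)).reshape((1, 4, 3))
z, R, C = x.shape

# interior row/column indices shaped so they broadcast to an (R-2, C-2) grid
rows = np.arange(1, R - 1)[:, None]   # shape (R-2, 1)
cols = np.arange(1, C - 1)            # shape (C-2,)

# 8-neighbor sum for every interior element at once, result shape (z, R-2, C-2)
neighbors = (x[:, rows - 1, cols - 1] + x[:, rows - 1, cols] + x[:, rows - 1, cols + 1]
             + x[:, rows, cols - 1] + x[:, rows, cols + 1]
             + x[:, rows + 1, cols - 1] + x[:, rows + 1, cols] + x[:, rows + 1, cols + 1])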
This kind of problem has come up many times, under various descriptions: convolution, smoothing, filtering, etc.
We could do some searches to find the best approach, or if you prefer, we could guide you through the steps.
Converting a nested loop calculation to Numpy for speedup
is a question similar to yours. There are only 2 levels of looping and the sum expression is different, but I think it has the same issues:
for h in xrange(1, height-1):
    for w in xrange(1, width-1):
        new_gr[h][w] = (gr[h][w] + gr[h][w-1] + gr[h-1][w]
                        + t * gr[h+1][w-1] - 2 * (gr[h][w-1] + t * gr[h-1][w]))
Here's what I ended up doing. Since I'm returning the xv vector and slipping it into the larger 3D array lag, this should speed up the process, right? data is my input dataset.
def nn3d(arr, e):
    r, c = e
    n = np.copy(arr[:, r-1:r+2, c-1:c+2])
    n[:, 1, 1] = 0
    n3d = np.ma.masked_where(n == nodata, n)
    xv = np.zeros(arr.shape[0])
    for d in range(arr.shape[0]):
        if np.ma.count(n3d[d, :, :]) < 2:
            element = nodata
        else:
            element = np.sum(n3d[d, :, :]) / (np.ma.count(n3d[d, :, :]) - 1)
        xv[d] = element
    return xv

lag = np.zeros(shape=data.shape)
for r in range(1, data.shape[1]-1):  # boundary effects
    for c in range(1, data.shape[2]-1):
        lag[:, r, c] = nn3d(data, (r, c))
What you are looking for is probably np.nditer:
a = np.arange(6).reshape(2, 3)
for x in np.nditer(a):
    print(x, end=' ')
which prints
0 1 2 3 4 5
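np.nditer yields only the values; if the element locations e are also needed, as in the question, np.ndenumerate gives index/value pairs. A small illustration:

import numpy as np

a = np.arange(6).reshape(2, 3)
for idx, val in np.ndenumerate(a):
    print(idx, val)   # (0, 0) 0, (0, 1) 1, ... (1, 2) 5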