Julia: Collapsing DataFrame by multiple values retaining additional variables

I have some data with duplicate fields, with the exception of a single field that I would like to join. Everything but the report should stay the same for each day and each company; companies can file multiple reports on the same day.
I can join using the following code, but I am losing the variables which are not in my by function. Any suggestions?
Mock Data
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
               report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
    df[:, Symbol("v" * string(i))] = ""
end
for i in 1:size(x, 1), j in 1:nv
    df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v" * string(j))] =
        join(rand('a':'z', 3), "")
end
Collapsed data
outdf = by(df, [:company, :day]) do sub
    t = DataFrame(fullreport = join(sub.report, "\n(Joined)\n"))
end

Here are some minor tweaks to your data preparation code:
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
               report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
    df[:, Symbol("v", i)] .= ""
end
for i in 1:size(x, 1), j in 1:nv
    df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v", j)] .= join(rand('a':'z', 3), "")
end
and here is a by that keeps all the other variables (assuming they are constant per group; this code should be efficient even for relatively large data):
outdf = by(df, [:company, :day]) do sub
    merge((fullreport = join(sub.report, "\n(Joined)\n"),),
          copy(sub[1, Not([:company, :day, :report])]))
end
I put the fullreport variable first.
Here is the code that would keep all rows from the original data frame:
outdf = by(df, [:company, :day]) do sub
    insertcols!(select(sub, Not([:company, :day, :report])), 1,
                fullreport = join(sub.report, "\n(Joined)\n"))
end
and now you can e.g. check that unique(outdf) produces the same data frame as the one generated by the first by.
(In the code above I also dropped the :report variable, as I guess you did not want it in the result - right?)

Related

How to calculate the number of scatterplot data points in a particular 'region' of the graph

As my question says, I'm trying to find a way to calculate the number of scatterplot data points (pink dots) in a particular 'region' of the graph, i.e. on either side of the black lines/boundaries. Open to any ideas as I don't even know where to start. Thank you!!
The code:
import numpy as np
import matplotlib.pyplot as plt
from astropy.io import fits  # assumption: the FITS file is opened with astropy

fig, ax1a = plt.subplots()   # ax1a is set up elsewhere in the original script

################################
############ GES ##############
################################
p = fits.open('GES_DR17.fits')
pfeh = p[1].data['Fe_H']
pmgfe = p[1].data['Mg_Fe']
pmnfe = p[1].data['Mn_Fe']
palfe = p[1].data['Al_Fe']
# Calculate [Mg/Mn]
pmgmn = pmgfe - pmnfe
ax1a.scatter(palfe, pmgmn, c='thistle', marker='.', alpha=0.8, s=500, edgecolors='black', lw=0.3, vmin=-2.5, vmax=0.65)
ax1a.plot([-1, -0.07], [0.25, 0.25], c='black')
ax1a.plot([-0.07, 1.0], [0.25, 0.25], '--', c='black')
x = np.arange(-0.15, 0.4, 0.01)
ax1a.plot(x, 4.25*x + 0.8875, 'k', c='black')
Let's call the two axes x and y. Any line in this plot can be written as
a*x + b*y + c = 0
for some values of a, b, c. If we plug a point with coordinates (x, y) into the left-hand side of this equation, we get a positive value for all points on one side of the line and a negative value for the points on the other side. So if you have multiple regions delimited by lines, you can just check the signs. With this you can create a boolean mask for each region and count the number of True entries using np.sum.
# assign the coordinates to the variables x and y as numpy arrays
x = ...
y = ...
line1 = a1*x + b1*y + c1
line2 = a2*x + b2*y + c2
mask = (line1 > 0) & (line2 < 0) # just an example, signs might vary
count = np.sum(mask)
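For example, with the two boundaries drawn above (the horizontal line y = 0.25 and the sloped line y = 4.25*x + 0.8875), the masks could look like this. This is only a sketch with synthetic points, since the real palfe and pmgmn arrays come from the FITS file:
import numpy as np
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500) # stand-in for palfe
y = rng.uniform(-0.5, 1.0, 500) # stand-in for pmgmn
# y = 0.25            ->  0*x + 1*y - 0.25      = 0
# y = 4.25*x + 0.8875 ->  4.25*x - 1*y + 0.8875 = 0
line1 = y - 0.25
line2 = 4.25 * x - y + 0.8875
# points above the flat line and above the sloped line;
# flip the comparisons to select the other regions
mask = (line1 > 0) & (line2 < 0)
print(np.sum(mask))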

Better way to concatenate pandas matrices

I need to concatenate multiple matrices (containing numbers and strings) in a loop. So far I have written this solution, but I don't like using a dummy variable (h), and I'm sure the code could be improved.
Here it is:
h = 0
for name in list_of_matrices:
    h += 1
    Matrix = pd.read_csv(name)
    if h == 1:
        Matrix_final = Matrix
        continue
    Matrix_final = pd.concat([Matrix_final, Matrix])
For some reason, if I use the following code I end up with the two matrices one after the other rather than a joined one (so this code does not fit):
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
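The list-building approach is the right idea; it just needs one more step, a single pd.concat after the loop. A minimal sketch (all_files here is a placeholder for your list of CSV paths):
import pandas as pd
all_files = ["a.csv", "b.csv"] # placeholder paths
li = []
for filename in all_files:
    li.append(pd.read_csv(filename, index_col=None, header=0))
# one concat at the end avoids repeated copying inside the loop
matrix_final = pd.concat(li, ignore_index=True)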

Create Dataframe name from 2 strings or variables pandas

I am extracting selected pages from a PDF file and want to assign the dataframe name based on the pages extracted:
file = "abc"
selected_pages = ['10','11'] # can be any combination, e.g. ['6','14','20']
for i in selected_pages():
    df{str(i)} = read_pdf(path + file + ".pdf", encoding='ISO-8859-1', stream=True, area=[100,10,740,950], pages=(i), index=False)
    print(df{str(i)})
The idea, ultimately, as in the example above, is to have the dataframes df10 and df11. I have tried "df" + str(i), "df" & str(i) and df{str(i)}; however, all give the error message SyntaxError: invalid syntax.
Any better way of doing it is most welcome. Thanks
This is where a dictionary would be a much better option.
Also note the error at the start of your loop: selected_pages is a list, so you can't call selected_pages().
from tabula import read_pdf  # assumption: read_pdf comes from tabula-py

file = "abc"
selected_pages = ['10','11'] # can be any combination, e.g. ['6','14','20']
df = {}
for i in selected_pages:
    df[i] = read_pdf(path + file + ".pdf", encoding='ISO-8859-1', stream=True, area=[100,10,740,950], pages=(i), index=False)

i = int(i) - 1 # this will bring it to 10
dfB = df[str(i)]
# select row numbers to drop: 0:4
dfB.drop(dfB.index[0:4], axis=0, inplace=True)
dfB.columns = ['col1','col2','col3','col4','col5']
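A small usage sketch of why the dictionary pays off: the extracted frames can now be processed in a single loop instead of through separate df10, df11, ... variables:
for page, frame in df.items():
    print("page", page, "has", len(frame), "rows")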

Incomplete line-plotting

I am working on an app that plots a graph based on the input from the user. However when I try plotting the graph, the line doesn't go all the way.
The code is below:
import numpy as np
import matplotlib.pyplot as plt
import tkinter.messagebox as tkm  # assumption: tkm is tkinter.messagebox

# text_v and text_i are tkinter Entry widgets defined elsewhere in the app
v = text_v.get()
i = text_i.get()
v2 = v.split(", ")
i2 = i.split(", ")
outcome = True
# Testing for pairing of values
if len(v2) != len(i2):
    tkm.showerror("Incomplete values", "You did not enter an equal number of voltage and current values")
    outcome = False
# Testing for v-value type
for x in range(len(v2)):
    try:
        float(v2[x])
        outcome = True
    except ValueError:
        tkm.showerror("Wrong value type", "All your voltage values must be floats or integers")
        outcome = False
        break
# Testing for i-value type
for x in range(len(i2)):
    try:
        float(i2[x])
    except ValueError:
        tkm.showerror("Wrong value type", "All your current values must be floats or integers")
        outcome = False
        break
# ------------------------Graph plotting function----------------------------
if outcome == True:
    v = np.array(list(map(float, v.split(", "))))
    i = np.array(list(map(float, i.split(", "))))
    fit = np.polyfit(i, v, 1)
    fit_fn = np.poly1d(fit)
    plt.plot(i, v, 'bo', v, fit_fn(v), '--k')
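For what it's worth, a likely culprit is the last line: fit = np.polyfit(i, v, 1) fits v as a function of i, so fit_fn expects current values, and evaluating it at v draws the dashed line over the wrong x-range. A plausible fix is to evaluate the fit at i:
plt.plot(i, v, 'bo', i, fit_fn(i), '--k')
plt.show()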

Iterating over multidimensional Numpy array

What is the fastest way to iterate over all elements in a 3D NumPy array? If array.shape = (r,c,z), there must be something faster than this:
x = np.asarray(range(12)).reshape((1, 4, 3))

# function that sums nearest neighbor values
# e is my element location, d is the distance
def nn(arr, e, d=1):
    d = e[0]
    r = e[1]
    c = e[2]
    return sum(arr[d,r-1,c-1:c+2]) + sum(arr[d,r+1,c-1:c+2]) + sum(arr[d,r,c-1]) + sum(arr[d,r,c+1])
Instead of creating a nested for loop like the one below to create my values of e to run the function nn for each pixel :
for dim in range(z):
    for row in range(r):
        for col in range(c):
            e = (dim, row, col)
I'd like to vectorize my nn function in a way that extracts location information for each element (e.g. e = (0,1,1)) and iterates over ALL elements in my matrix without having to manually input each location value of e OR creating a messy nested for loop. I'm not sure how to apply np.vectorize to this problem. Thanks!
It is easy to vectorize over the d dimension:
def nn(arr, e):
    r,c = e # (e[0],e[1])
    return (np.sum(arr[:,r-1,c-1:c+2],axis=2) + np.sum(arr[:,r+1,c-1:c+2],axis=2) +
            np.sum(arr[:,r,c-1],axis=?) + np.sum(arr[:,r,c+1],axis=?))
now just iterate over the row and col dimensions, returning a vector, that is assigned to the appropriate slot in x.
for row in <correct range>:
    for col in <correct range>:
        x[:,row,col] = nn(data, (row,col))
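A filled-in version of that loop might look as follows. This is only a sketch assuming a (z, rows, cols) array with a one-cell border; note that arr[:,r-1,c-1:c+2] has shape (z, 3), so the sums reduce axis 1, and the single-cell terms need no sum at all:
import numpy as np
def nn(arr, e):
    r, c = e
    return (np.sum(arr[:, r-1, c-1:c+2], axis=1)   # row above
            + np.sum(arr[:, r+1, c-1:c+2], axis=1) # row below
            + arr[:, r, c-1]                       # left neighbor
            + arr[:, r, c+1])                      # right neighbor
data = np.arange(24).reshape(2, 4, 3)
x = np.zeros_like(data)
for row in range(1, data.shape[1] - 1):
    for col in range(1, data.shape[2] - 1):
        x[:, row, col] = nn(data, (row, col))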
The next step is to build broadcastable index arrays, along the lines of
rows = <row indices>[:,None]
cols = <col indices>
arr[:,rows-1,cols+2] + arr[:,rows,cols+2] # etc.
This kind of problem has come up many times, with various descriptions - convolution, smoothing, filtering etc.
We could do some searches to find the best approach, or if you prefer, we could guide you through the steps.
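As a concrete illustration of the convolution framing (a sketch, assuming SciPy is available): the nn sum of the 8 surrounding cells is exactly a 3x3 kernel of ones with a zero center, applied per z-slice:
import numpy as np
from scipy import ndimage
data = np.arange(24).reshape(2, 4, 3)
# the leading axis of length 1 leaves the z dimension untouched;
# the 3x3 ring of ones sums the 8 neighbors, center excluded
kernel = np.array([[[1, 1, 1],
                    [1, 0, 1],
                    [1, 1, 1]]])
neigh_sum = ndimage.convolve(data, kernel, mode='constant', cval=0)
Unlike the explicit loops, this also fills the border cells (using zero padding here), so compare only the interior if that matters.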
Converting a nested loop calculation to Numpy for speedup
is a question similar to yours. There are only 2 levels of looping, and the sum expression is different, but I think it has the same issues:
for h in xrange(1, height-1):
    for w in xrange(1, width-1):
        new_gr[h][w] = (gr[h][w] + gr[h][w-1] + gr[h-1][w] +
                        t * gr[h+1][w-1] - 2 * (gr[h][w-1] + t * gr[h-1][w]))
Here's what I ended up doing. Since I'm returning the xv vector and slipping it into the larger 3D array lag, this should speed up the process, right? data is my input dataset.
def nn3d(arr, e):
    r, c = e
    n = np.copy(arr[:, r-1:r+2, c-1:c+2])
    n[:, 1, 1] = 0
    # nodata is the missing-value sentinel defined elsewhere
    n3d = np.ma.masked_where(n == nodata, n)
    xv = np.zeros(arr.shape[0])
    for d in range(arr.shape[0]):
        if np.ma.count(n3d[d, :, :]) < 2:
            element = nodata
        else:
            element = np.sum(n3d[d, :, :]) / (np.ma.count(n3d[d, :, :]) - 1)
        xv[d] = element
    return xv

lag = np.zeros(shape=data.shape)
for r in range(1, data.shape[1]-1): # boundary effects
    for c in range(1, data.shape[2]-1):
        lag[:, r, c] = nn3d(data, (r, c))
What you are looking for is probably np.nditer:
a = np.arange(6).reshape(2,3)
for x in np.nditer(a):
    print(x, end=' ')
which prints
0 1 2 3 4 5
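If you also need each element's location (the e tuples from the question), nditer can track it through the multi_index flag; a small sketch:
it = np.nditer(a, flags=['multi_index'])
for x in it:
    print(it.multi_index, x)
which prints (0, 0) 0, (0, 1) 1 and so on.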