How do I split the input for a mxnet neural network? - mxnet

I'm making a neural network with a not very obvious connection architecture. It comes down to 4 chunks of inputs, each with its own path towards the final output layer, merging in between in several steps.
I'm using mxnet on R. I defined 4 mx.symbol.Variable("data"), each being the input of the 'chunks' of input. Only 1 is shown on the graph (graph.viz(out)). I can understand that, it makes sense that I should provide only 1 large input-vector. However... how do I split that inputvector? (Oh, and to be sure, state.size of a lstm in mxnet, that points to the number of memory blocks, equal to the # input and outputs, right?)
Code (minimal example):
in1 <- mx.symbol.Variable("data1") # 7 values
in2 <- mx.symbol.Variable("data2") # 7 values
l1.1 <- mx.symbol.RNN(in1, name="lstm1.1", mode="lstm", num.layers=1, state.size=7) # first nn on the first 7 inputs
l1.2 <- mx.symbol.RNN(in2, name="lstm1.2", mode="lstm", num.layers=1, state.size=7) # second nn on the last 7 inputs
l2 <- mx.symbol.RNN(data=l1.1+l1.2, name="lstml2", mode="lstm", num.layers=1, state.size=14) # state.size must be 14, since each input separate has 7
out.dens <- mx.symbol.FullyConnected(l2, name="out-dens", num.hidden=4) # dense layer on top of the merged LSTM
pred <- mx.symbol.LinearRegressionOutput(out.dens, name="pred")
graph.viz(pred) # only lstm1.1 has a data input
So the question is: how do I provide 2 inputs, or how do I split 1 input vector in 2? (in1 and in2 don't seem to work out fine)
-- EDIT 1
One could imagine that the name of the symbolic variable first two lines
in1 <- mx.symbol.Variable("data") # 7 values
in2 <- mx.symbol.Variable("data") # 7 values
should be different. I'm pretty sure they should, more like:
in1 <- mx.symbol.Variable("data1") # 7 values
in2 <- mx.symbol.Variable("data2") # 7 values
However, changing them doesn't change the visual computation graph. Well, it does: now again only 1 input is shown, but not as input (green circle), but as an actual layer (pink rectangle). This doesn't seem to help me out...

Changing the first lines to:
input <- mx.symbol.Variable("data")
in1 <- mx.symbol.slice(input, begin=0, end=6)
in2 <- mx.symbol.slice(input, begin=7, end=13)
and not adding but using concat
l2 <- mx.symbol.concat(c(l1.1, l1.2), num.args=14)
l2 <- mx.symbol.RNN(data=l2, name="lstml2", mode="lstm", num.layers=1, state.size=14) # state.size must be 14, since each input separate has 7
seems to be ok

Related

How to connect points with different indices (one data file) in gnuplot

I have a file "a_test.dat" with two data blocks that I can select via the corresponding index.
# first
x1 y1
3 1
6 2
9 8
# second
x2 y2
4 5
8 2
2 7
Now I want to connect the data points of both indices with an arrow.
set arrow from (x1,y1) to (x2,y2).
I can plot both blocks with one plot statement. But I cannot get the points to set the arrows.
plot "a_test.dat" index "first" u 1:2, "" index "second" u 1:2
From version 5.2 you can use gnuplot arrays:
stats "a_test.dat" nooutput
array xx[STATS_records]
array yy[STATS_records]
# save all data into two arrays
i = 1
fnset(x,y) = (xx[i]=x, yy[i]=y, i=i+1)
# parse data ignoring output
set table $dummy
plot "" using (fnset($1,$2)) with table
unset table
# x2,y2 data starts at midpoint in array
numi = int((i-1)/2)
plot for [i=1:numi] $dummy using (xx[i]):(yy[i]):(xx[numi+i]-xx[i]):(yy[numi+i]-yy[i]) with vectors
Use stats to count the number of lines in the file, so that the array can
be large enough. Create an array xx and another yy to hold the data.
Use plot ... with table to read the file again, calling your function
fnset() for each data line with the x and y column values. The function
saves them at the current index i, which it increments. It was
initialised to 1.
For 3+3 data lines, i ends up at 7, so we set numi to (i-1)/2 i.e. 3.
Use plot for ... vectors to draw the arrows. Each arrow needs 4 data
items from the array. Note that the second x,y must be a relative delta,
not an absolute position.
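The midpoint split and the relative deltas can also be sketched outside gnuplot. This is only an illustration in Python, assuming the six data rows have already been read in file order and that both blocks have the same number of points; the name `arrow_vectors` is made up for the example.

```python
def arrow_vectors(points):
    """Pair point i of the first half with point i of the second half.

    Returns (x, y, dx, dy) tuples, the same four values that
    gnuplot's "with vectors" style expects.
    """
    half = len(points) // 2  # second block starts at the midpoint
    starts = points[:half]
    ends = points[half:2 * half]
    return [(x1, y1, x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(starts, ends)]


# Rows of a_test.dat in file order: block "first", then block "second".
points = [(3, 1), (6, 2), (9, 8), (4, 5), (8, 2), (2, 7)]
vectors = arrow_vectors(points)
print(vectors)
```

Each tuple starts at a point of the first block and carries the offset to the matching point of the second block, which is why the deltas rather than the absolute end coordinates are stored.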

Organizing data (pandas dataframe)

I have data in the following form:
product/productId B000EVS4TY
1 product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1...
2 product/price unknown
3 review/userId A2SRVDDDOQ8QJL
4 review/profileName MJ23447
5 review/helpfulness 2/4
6 review/score 4.0
7 review/time 1206576000
8 review/summary Delicious cookie mix
9 review/text I thought it was funny that I bought this pro...
10 product/productId B0000DF3IX
11 product/title Paprika Hungarian Sweet
12 product/price unknown
13 review/userId A244MHL2UN2EYL
14 review/profileName P. J. Whiting "book cook"
15 review/helpfulness 0/0
16 review/score 5.0
17 review/time 1127088000
I want to convert it to a dataframe such that the entries in the 1st column
product/productId
product/title
product/price
review/userId
review/profileName
review/helpfulness
review/score
review/time
review/summary
review/text
are the column headers with the values arranged corresponding to each header in the table.
I still had a tiny doubt about your file, but since both my suggestions are quite similar, I will try to address both the scenarios you might have.
In case your file doesn't actually have the line numbers inside of it, this should do it:
import pandas as pd
filepath = "./untitled.txt" # you need to change this to your file path
column_separator = r"\s{3,}" # we'll use a regex (raw string); I explain some caveats of this below...
# engine='python' suppresses a warning by pandas
# header=None makes pandas treat every line as data
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)
df = df.set_index(0) # this takes column '0' and uses it as the dataframe index
df = df.T # this makes the data look like you were asking (goes from multiple rows + 1 column to multiple columns + 1 row)
df = df.reset_index(drop=True) # this is just so the first row starts at index '0' instead of '1'
# you could just do the last 3 lines with:
# df = df.set_index(0).T.reset_index(drop=True)
If you do have line numbers, then we just need to make a few small adjustments:
filepath = "./untitled1.txt"
column_separator = r"\s{3,}"
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df = df.set_index(1).T.reset_index(drop=True) # I did all the 3 steps in 1 line, for brevity
In this last case, I would advise changing the file so that every line has a line number (in the example you provided, the numbering starts at the second line; this might be an option about how headers are handled when exporting the data from whatever tool you are using).
Regarding the regex, the caveat is that "\s{3,}" treats any run of 3 or more consecutive whitespace characters as the column separator, so finding the columns depends somewhat on the data. For instance, if any value happens to contain 3 consecutive spaces, pandas will raise an exception, since that line will have one more column than the others. One solution could be increasing the threshold to some other 'appropriate' number, but then we still depend on the data (for instance, with more than 3, in your example, "review/text" would have enough spaces for the two columns to be identified).
edit after realising what you meant by "stacked"
Whatever "line-number scenario" you have, you'll need to make sure you always have the same number of columns for all registers and reshape the continuous dataframe with something similar to this:
number_of_columns = 10 # you'll need to make sure all "registers" do have the same number of columns otherwise this will break
new_shape = (-1, number_of_columns) # this tuple will mean "whatever number of lines", by 10 columns
# take the first 10 names as headers; they repeat for every register
final_df = pd.DataFrame(data=df.values.reshape(new_shape),
columns=df.columns.tolist()[:number_of_columns])
Again, take notice of making sure that all lines have the same number of columns (for instance, a file with just the data you provided, assuming 10 columns, wouldn't work). Also, this solution assumes all columns will have the same name.
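The reshape idea can be sketched without pandas as well. The following is a minimal standard-library illustration, assuming the file has already been parsed into a flat list of (key, value) pairs and that every register has exactly `number_of_columns` entries; `to_records` is a name invented for this example, and only three of the ten fields are used to keep it short.

```python
def to_records(pairs, number_of_columns):
    """Cut a flat list of (key, value) pairs into one dict per register."""
    if len(pairs) % number_of_columns != 0:
        # same failure mode as the reshape: an incomplete register
        raise ValueError("the last register is incomplete")
    records = []
    for start in range(0, len(pairs), number_of_columns):
        chunk = pairs[start:start + number_of_columns]
        records.append(dict(chunk))
    return records


pairs = [
    ("product/productId", "B000EVS4TY"),
    ("review/score", "4.0"),
    ("review/time", "1206576000"),
    ("product/productId", "B0000DF3IX"),
    ("review/score", "5.0"),
    ("review/time", "1127088000"),
]
records = to_records(pairs, number_of_columns=3)
print(records)
```

A list of dicts like `records` can be passed to `pd.DataFrame(records)` to get one row per register with the keys as column headers.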

Array Numpy Side Effect

I found a strange effect when permuting array with numpy:
def permute(yy, kmax) :
    kmax=5
    kk= np.random.uniform(1,kmax)
    nn= int(np.floor(len(yy)/kk))
    yy3= np.zeros_like(yy );
    np.copyto(yy3,yy)
    for ii in range(0, nn):
        ax= kk*ii-kk*nn
        aux= yy[ax]
        aux2= yy[kk*ii]
        yy3[ax] = aux
        yy3[kk*ii] = aux2
    return yy3
and
yy= np.random.normal(0,1,50000)
yy1= permute(yy,2)
( np.var(yy)- np.var(yy1) )
( np.mean(yy)- np.mean(yy1) )
Result is not zero !!!
Do you think this comes from reference assignment in the array ?
I ran your function with np.arange(10) and got
1752:~/mypy$ python stack35004877.py
0.0
0.0
[0 1 2 3 4 5 6 7 8 9] # yy
[0 1 2 3 4 5 6 7 8 9] # yy1
And repeated it with the large random array, with the same 0s for the statistics.
Note that your code did not permute the input
Maybe it will be clearer if I clean it up:
def permute(yy, kmax=5) :
    kk= np.random.randint(1,kmax) # int rather than float
    nn= int(np.floor(len(yy)/kk))
    print(nn,kk)
    yy3= yy.copy()
    for ii in range(0, nn):
        ind1 = kk*ii
        ind2 = ind1-kk*nn
        yy3[ind2] = yy[ind2]
        yy3[ind1] = yy[ind1]
    return yy3
You aren't moving anything; and with kmax=2 you just copy everything from yy to yy3, something you already did outside the loop. With kmax=5 you don't copy everything in the loop, but the initial copy hides that.
With random.uniform(), kk is a float, and the indexes are also floats. That's not desirable; it apparently was not a problem here, but recent NumPy versions raise an IndexError for float indices.
But even if I switch the indices:
yy3[ind2] = yy[ind1]
yy3[ind1] = yy[ind2]
I don't permute anything, because ind2 is a negative value that maps onto the same element as ind1. yy[-1] is the last item of yy.
[(0, -10), (1, -9), (2, -8),... (9, -1)]
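The aliasing is easy to check on a small case. This sketch uses a plain Python list, which follows the same negative-index rule as a NumPy array, and assumes kk=1 with 10 elements so that ind2 = ind1 - 10.

```python
n = 10
yy = list(range(n))  # distinct values, so equal values mean the same position

# (ind1, ind2) pairs as produced by the loop when kk = 1 and nn = n
index_pairs = [(ind1, ind1 - n) for ind1 in range(n)]

# every negative index refers to the very element its positive partner does
same_element = all(yy[ind1] == yy[ind2] for ind1, ind2 in index_pairs)
print(index_pairs[:3], same_element)
```

Swapping yy[ind1] with yy[ind2] therefore swaps an element with itself.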
I could work out the details, but I think you should do that yourself - with a small test case. And skip that initial copyto, that just hides errors in the iteration. Print the details, not just summary statistics from large random arrays.
And in the long run you don't want to use an iteration like this. You want to do the permutation with one indexing call. But first get this version working correctly.
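As a rough sketch of what a real permutation looks like, here is a standard-library version that builds one shuffled index list and applies it in a single pass. The helper name `permute_once` and the seed are assumptions for the example; with NumPy the equivalent one-liner would be `yy[np.random.permutation(len(yy))]`.

```python
import random
import statistics


def permute_once(values, rng):
    """Return a new list holding the same values in shuffled order."""
    order = list(range(len(values)))
    rng.shuffle(order)  # one random ordering of all positions
    return [values[j] for j in order]


rng = random.Random(0)
yy = [rng.gauss(0, 1) for _ in range(1000)]
yy1 = permute_once(yy, rng)

# a permutation keeps exactly the same values, so these differences are ~0
mean_diff = statistics.mean(yy) - statistics.mean(yy1)
var_diff = statistics.pvariance(yy) - statistics.pvariance(yy1)
print(mean_diff, var_diff)
```

Comparing `sorted(yy)` with `sorted(yy1)` on a small input is a more direct test than summary statistics, because it shows whether any value was lost or duplicated.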

Create 20 unique bingo cards

I'm trying to create 20 unique cards with numbers, but I struggle a bit.. So basically I need to create 20 unique matrices 3x3 having numbers 1-10 in first column, numbers 11-20 in the second column and 21-30 in the third column.. Any ideas? I'd prefer to have it done in r, especially as I don't know Visual Basic. In excel I know how to generate the cards, but not sure how to ensure they are unique..
It seems to be quite precise and straightforward to me. Anyway, I needed to create 20 matrices that would look like:
[,1] [,2] [,3]
[1,] 5 17 23
[2,] 8 18 22
[3,] 3 16 24
Each of the matrices should be unique and each of the columns should consist of three unique numbers ( the 1st column - numbers 1-10, the 2nd column 11-20, the 3rd column - 21-30).
Generating random numbers is easy, but how do I make sure that the generated cards are unique? Please have a look at the post that I voted for as an answer, as it gives a thorough explanation of how to achieve it.
(N.B. : I misread "rows" instead of "columns", so the following code and explanation will deal with matrices with random numbers 1-10 on 1st row, 11-20 on 2nd row etc., instead of columns, but it's exactly the same just transposed)
This code should guarantee uniqueness and good randomness :
library(gtools)
# helper function
getKthPermWithRep <- function(k,n,r){
k <- k - 1
if(n^r <= k){ # k is 0-based here, so valid values are 0 .. n^r - 1
stop('k is greater than possible permutations')
}
v <- rep.int(0,r)
index <- length(v)
while ( k != 0 )
{
remainder<- k %% n
k <- k %/% n
v[index] <- remainder
index <- index - 1
}
return(v+1)
}
# get all possible permutations of 10 elements taken 3 at a time
# (singlerowperms = 720)
allperms <- permutations(10,3)
singlerowperms <- nrow(allperms)
# get 20 random and unique bingo cards
cards <- lapply(sample.int(singlerowperms^3,20),FUN=function(k){
perm2use <- getKthPermWithRep(k,singlerowperms,3)
m <- allperms[perm2use,]
m[2,] <- m[2,] + 10
m[3,] <- m[3,] + 20
return(m)
# if you want transpose the result just do:
# return(t(m))
})
Explanation
(disclaimer tl;dr)
To guarantee both randomness and uniqueness, one safe approach is to generate all the possible bingo cards and then choose randomly among them without replacement.
To generate all the possible cards, we should :
generate all the possibilities for each row of 3 elements
get the cartesian product of them
Step (1) can be easily obtained using the function permutations of package gtools (see the object allperms in the code). Note that we just need the permutations for the first row (i.e. 3 elements taken from 1-10) since the permutations of the other rows can be easily obtained from the first by adding 10 and 20 respectively.
Step (2) is also easy to get in R, but let's first consider how many possibilities will be generated. Step (1) returned 720 cases for each row, so, in the end we will have 720*720*720 = 720^3 = 373248000 possible bingo cards!
Generating all of them is not practical since the occupied memory would be huge, so we need a way to get 20 random elements from this big range of possibilities without actually keeping them in memory.
The solution comes from the function getKthPermWithRep, which, given an index k, returns the k-th permutation with repetition of r elements taken from 1:n (note that in this case permutation with repetition corresponds to the cartesian product).
e.g.
# all permutations with repetition of 2 elements in 1:3 are
permutations(n = 3, r = 2,repeats.allowed = TRUE)
# [,1] [,2]
# [1,] 1 1
# [2,] 1 2
# [3,] 1 3
# [4,] 2 1
# [5,] 2 2
# [6,] 2 3
# [7,] 3 1
# [8,] 3 2
# [9,] 3 3
# using the getKthPermWithRep you can get directly the k-th permutation you want :
getKthPermWithRep(k=4,n=3,r=2)
# [1] 2 1
getKthPermWithRep(k=8,n=3,r=2)
# [1] 3 2
Hence now we just choose 20 random indexes in the range 1:720^3 (using sample.int function), then for each of them we get the corresponding permutation of 3 numbers taken from 1:720 using function getKthPermWithRep.
Finally, these triplets of numbers can be converted to actual card rows by using them as indexes to subset allperms and get our final matrix (after, of course, adding +10 and +20 to the 2nd and 3rd row).
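For readers who want to trace the steps outside R, here is a hedged Python sketch of the same idea: draw 20 distinct indexes from 1..720^3 and decode each one into three row permutations. The names `kth_perm_with_rep` and `make_cards` and the `seed` argument are inventions for this illustration, and the cards are returned as tuples of rows.

```python
import itertools
import random


def kth_perm_with_rep(k, n, r):
    """k-th (1-based) r-tuple over 1..n: write k-1 in base n, then add 1 to each digit."""
    k -= 1
    if not 0 <= k < n ** r:
        raise ValueError("k is outside the range of possible permutations")
    digits = [0] * r
    for pos in range(r - 1, -1, -1):  # fill from the last digit backwards
        k, digits[pos] = divmod(k, n)
    return [d + 1 for d in digits]


def make_cards(count=20, seed=None):
    rng = random.Random(seed)
    # all ordered triples of distinct numbers from 1..10 (720 of them)
    row_perms = list(itertools.permutations(range(1, 11), 3))
    total = len(row_perms) ** 3
    cards = []
    # sampling without replacement gives distinct indexes, hence distinct cards
    for k in rng.sample(range(1, total + 1), count):
        picks = kth_perm_with_rep(k, len(row_perms), 3)
        card = tuple(
            tuple(value + 10 * row for value in row_perms[p - 1])
            for row, p in enumerate(picks)
        )
        cards.append(card)
    return cards


cards = make_cards(20, seed=1)
print(cards[0])
```

Uniqueness follows from the decoding being one-to-one: two different indexes always differ in at least one base-720 digit, so at least one row differs.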
Bonus
Explanation of getKthPermWithRep
If you look at the example above (permutations with repetition of 2 elements in 1:3), and subtract 1 from every number in the result, you get this:
> permutations(n = 3, r = 2,repeats.allowed = T) - 1
[,1] [,2]
[1,] 0 0
[2,] 0 1
[3,] 0 2
[4,] 1 0
[5,] 1 1
[6,] 1 2
[7,] 2 0
[8,] 2 1
[9,] 2 2
If you consider each number of each row as a number digit, you can notice that those rows (00, 01, 02...) are all the numbers from 0 to 8, represented in base 3 (yes, 3 as n). So, when you ask the k-th permutation with repetition of r elements in 1:n, you are also asking to translate k-1 into base n and return the digits increased by 1.
Therefore, given the algorithm to change any number from base 10 to base n :
changeBase <- function(num,base){
v <- NULL
while ( num != 0 )
{
remainder = num %% base # assume K > 1
num = num %/% base # integer division
v <- c(remainder,v)
}
if(is.null(v)){
return(0)
}
return(v)
}
you can easily obtain getKthPermWithRep function.
One 3x3 matrix with the desired value range can be generated with the following code:
mat <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30, 3)), nrow=3)
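Furthermore, you can use a for loop to generate a list of 20 matrices as follows (duplicates are very unlikely but not prevented, so check the result afterwards):
cards <- vector("list", 20)
for (i in 1:20) {
cards[[i]] <- matrix(c(sample(1:10,3), sample(11:20,3), sample(21:30,3)), nrow=3)
print(cards[[i]])
}
# anyDuplicated(cards) returns 0 when no two cards are identical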
Well OK I may fall on my face here but I propose a checksum (using Excel).
This is a unique signature for each bingo card which will remain invariant if the order of numbers within any column is changed without changing the actual numbers. The formula is
=SUM(10^MOD(A2:A4,10)+2*10^MOD(B2:B4,10)+4*10^MOD(C2:C4,10))
where the bingo numbers for the first card are in A2:C4.
The idea is to generate a 10-digit number for each column, then multiply each by a constant and add them to get the signature.
So here I have generated two random bingo cards using a standard formula from here plus two which are deliberately made to be just permutations of each other.
Then I check if any of the signatures are duplicates using the formula
=MAX(COUNTIF(D5:D20,D5:D20))
which shouldn't give an answer of more than 1.
In the unlikely event that there were duplicates, then you would just press F9 and generate some new cards.
All formulae are array formulae and must be entered with Ctrl+Shift+Enter.
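The same signature can be sketched in Python to see why it ignores the order within a column. This is only an illustration of the spreadsheet formula above; `card_signature` and the three sample cards are made up for the example, with each card given as a list of three columns.

```python
def card_signature(card):
    """Mirror of the spreadsheet formula: weights 1, 2 and 4 for columns A, B and C."""
    weights = (1, 2, 4)
    return sum(
        weight * sum(10 ** (number % 10) for number in column)
        for weight, column in zip(weights, card)
    )


card_a = [[5, 8, 3], [17, 18, 16], [23, 22, 24]]
card_b = [[3, 5, 8], [16, 17, 18], [24, 23, 22]]  # same numbers, reordered within columns
card_c = [[5, 8, 4], [17, 18, 16], [23, 22, 24]]  # one number differs

print(card_signature(card_a), card_signature(card_b), card_signature(card_c))
```

Within a column the last digits are all different, so each column contributes only digits 0 or 1, and after weighting every decimal digit of the total stays between 0 and 7. No carries occur, which is why two cards share a signature only when their columns hold the same numbers.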
Here is an inelegant way to do this. Generate all possible combinations and then sample without replacement. These are permutations, not combinations: order does matter in bingo.
library(dplyr)
library(tidyr)
library(magrittr)
generate_samples = function(n) {
first = data_frame(first = (n-9):n)
first %>%
merge(first %>% rename(second = first)) %>%
merge(first %>% rename(third = first)) %>%
sample_n(20)
}
suffix = function(df, suffix)
df %>%
setNames(names(.) %>%
paste0(suffix))
generate_samples(10) %>% suffix(10) %>%
bind_cols(generate_samples(20) %>% suffix(20)) %>%
bind_cols(generate_samples(30) %>% suffix(30)) %>%
rowwise %>%
do(matrix = t(.) %>% matrix(3)) %>%
use_series(matrix)

Read only n-th column of a text file which has no header with R and sqldf

I have a problem similar to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (table of 150rows, >500,000 columns, space separated, filled with numeric data and only a 32 bit system available). This file has no header, therefore the code in the thread above didn't work and I decided to write a new post.
Do you have an idea to solve this problem?
I thought about something like that, but any results with fread or read.table are also ok:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))
You can use substr to specify the start position and the length of the columns you want to read in if they are fixed width:
x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.