Cannot make sense of keras.datasets.imdb

Cannot make sense of keras.datasets.imdb - tensorflow

I have two problems:
First off, the documentation for tf.keras.datasets.imdb.get_word_index says
Retrieves the dictionary mapping word indices back to words.
While in fact it's the contrary,
print(tf.keras.datasets.imdb.get_word_index())
{'fawn': 34701, 'tsukino': 52006, 'nunnery': 52007
I tried to run this in TensorFlow 2.0
(train_data_raw, train_labels), (test_data_raw, test_labels) = keras.datasets.imdb.load_data()
words2idx = tf.keras.datasets.imdb.get_word_index()
idx2words = {idx:word for word, idx in words2idx.items()}
i = 0
train_ex = [idx2words[x] for x in train_data_raw[0]]
train_ex = ' '.join(train_ex)
print(train_ex)
This result in a nonsense string
the as you with out themselves powerful lets loves their [...]
Shouldn't I get a valid movie review?

I did a bit of digging and found that there are a few "offsets" in the processing which need to be undone in order to return a sensible review language. I modified your line to subtract 3 from the index that appears in the raw sequence (since the default is to start real words with index=3), and also the first character is a dummy marker (set to 1), so the real text starts at position 2 (or index 1 in python).
train_ex = [idx2words[x-3] for x in train_data_raw[0][1:]]
Using the above modification gives me the following for the review you originally selected:
this film was just brilliant casting location scenery story direction everyone's really suited the part they played ...
It appears that some punctuation and capitalization is removed etc, but this seems to return sensible reviews.
I hope this helps.

Related

VTK cutter output

I am seeking a solution of connecting all the lines that have the same slope and share a common point. For example, after I load a STL file and cut it using a plane, the cutter output includes the points defining the contour. Connecting them one by one forms a (or multiple) polyline. However, some lines can be merged when their slopes are the same and they share a common point. E.g., [[0,0,0],[0,0,1]] and [[0,0,1],[0,0,2]] can be represented by one single line [[0,0,0],[0,0,2]].
I wrote a function that can analyse all the lines and connect them if they can be merged. But when the number of lines are huge, this process is slow. I am thinking in the VTK pipeline, is there a way to do the line merging?
Cheers!
plane = vtk.vtkPlane()
plane.SetOrigin([0,0,5])
plane.SetNormal([0,0,1])
cutter = vtk.vtkCutter()
cutter.SetCutFunction(plane)
cutter.SetInput(triangleFilter.GetOutput())
cutter.Update()
cutStrips = vtk.vtkStripper()
cutStrips.SetInputConnection(cutter.GetOutputPort())
cutStrips.Update()
cleanDataFilter = vtk.vtkCleanPolyData()
cleanDataFilter.AddInput(cutStrips.GetOutput())
cleanDataFilter.Update()
cleanData = cleanDataFilter.GetOutput()
print cleanData.GetPoint(0)
print cleanData.GetPoint(1)
print cleanData.GetPoint(2)
print cleanData.GetPoint(3)
print cleanData.GetPoint(4)
The output is:
(0.0, 0.0, 5.0)
(5.0, 0.0, 5.0)
(10.0, 0.0, 5.0)
(10.0, 5.0, 5.0)
(10.0, 10.0, 5.0)
Connect the above points one by one will form a polyline representing the cut result. As we can see, the line [point0, point1] and [point1, point2] can be merged.
Below is the code for merging the lines:
Assume that the LINES are represented by list: [[(p0),(p1)],[(p1),(p2)],[(p2),(p3)],...]
appended = 0
CurrentLine = LINES[0]
CurrentConnectedLine = CurrentLine
tempLineCollection = LINES[1:len(LINES)]
while True:
for HL in tempLineCollection:
QCoreApplication.processEvents()
if checkParallelAndConnect(CurrentConnectedLine, HL):
appended = 1
LINES.remove(HL)
CurrentConnectedLine = ConnectLines(CurrentConnectedLine, HL)
processedPool.append(CurrentConnectedLine)
if len(tempLineCollection) == 1:
processedPool.append(tempLineCollection[0])
LINES.remove(CurrentLine)
if len(LINES) >= 2:
CurrentLine = LINES[0]
CurrentConnectedLine = CurrentLine
tempLineCollection = LINES[1:len(LINES)]
appended = 0
else:
break
Solution:
I figured out a way of further accelerating this process using some vtk data structure. I found out that a polyline line will be stored in a cell, which can be checked by using GetCellType(). Since the point order for a polyline is sorted already, We do not need to search globally which lines are colinear with the current one. For each point on the polyline, I just need to check the point[i-1], point[i], point[i+1]. And if they are colinear, the end of the line will be updated to the next point. This process continues until the end of the polyline is reached. The speed increases by a huge amount compared with the global search approach.

Not sure if it is the main source of slowness (depends on how many positive hits on the colinearity you have), but removing items from a vector is costly (O(n)), since it requires reorganizing the rest of the vector, you should avoid it. But even without hits on colinearity, the LINES.remove(CurrentLine) call is surely slowing things down and there isn't really any need for it - just leave the vector untouched, write the final results to a new vector (processedPool) and get rid of the LINES vector in the end. You can modify your algorithm by making a bool array (vector), initialized at "false" for each item, then when you remove a line, you don't actually remove it, but only mark it as "true" and you skip all lines for which you have "true", i.e. something like this (I don't speak python so the syntax is not accurate):
wasRemoved = bool vector of the size of LINES initialized at false for each entry
for CurrentLineIndex = 0; CurrentLineIndex < sizeof(LINES); CurrentLineIndex++
if (wasRemoved[CurrentLineIndex])
continue // skip a segment that was already removed
CurrentConnectedLine = LINES[CurrentLineIndex]
for HLIndex = CurrentLineIndex + 1; HLIndex < sizeof(LINES); HLIndex++:
if (wasRemoved[HLIndex])
continue;
HL = LINES[HLIndex]
QCoreApplication.processEvents()
if checkParallelAndConnect(CurrentConnectedLine, HL):
wasRemoved[HLIndex] = true
CurrentConnectedLine = ConnectLines(CurrentConnectedLine, HL)
processedPool.append(CurrentConnectedLine)
wasRemoved[CurrentLineIndex] = true // this is technically not needed since you won't go back in the vector anyway
LINES = processedPool
BTW, the really correct data structure for LINES to use for that kind of algorithm would be a linked list, since then you would have O(1) complexity for removal and you wouldn't need the boolean array. But a quick googling showed that that's not how lists are implemented in Python, also don't know if it would not interfere with other parts of your program. Alternatively, using a set might make it faster (though I would expect times similar to my "bool array" solution), see python 2.7 set and list remove time complexity
If this does not do the trick, I suggest you measure times of individual parts of the program to find the bottleneck.

error in LDA in r: each row of the input matrix needs to contain at least one non-zero entry

I am a starter in text mining topic. When I run LDA() over a huge dataset with 996165 observations, it displays the following error:
Error in LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, :
Each row of the input matrix needs to contain at least one non-zero entry.
I am pretty sure that there is no missing values in my corpus and also. The table of "DocumentTermMatrix" and "simple_triplet_matrix" is:
table(is.na(dtm[[1]]))
#FALSE
#57100956
table(is.na(dtm[[2]]))
#FALSE
#57100956
A little confused how "57100956" comes. But as my dataset is pretty large, I don't know how to check why does this error occurs. My LDA command is:
ldaOut<-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
Can anyone provide some insights? Thanks.

In my opinion the problem is not the presence of missing values, but the presence of all 0 rows.
To check it:
raw.sum=apply(table,1,FUN=sum) #sum by raw each raw of the table
Then you can delete all raws which are all 0 doing:
table=table[raw.sum!=0,]
Now table should has all "non 0" raws.

I had the same problem. The design matrix, dtm, in your case, had rows with all zeroes because dome documents did not contain certain words (i.e. their frequency was zero). I suppose this somehow causes a singular matrix problem somewhere along the line. I fixed this by adding a common word to each of the documents so that every row would have at least one non-zero entry. At the very least, the LDA ran successfully and classified each of the documents. Hope this helps!

Efficiently: Random numbers in fixed range without repetitions

Hey guys, I know that there are a million questions on random numbers, but exactly because of that I searched a lot but I couldn't find something similar to mine - without implying it's not there. In any case, pardon me if I am repeating a question, just point me to it if that's the case.
So, I wanna do something simple in the most efficient way.
I want to generate randomly all N integers in the range [0, N], one by one, such that there are no repetitions.
I know, I can do this by inserting everything in a list, shuffle it, get the head and then remove head from the list. But then I will have shuffled my list of length N, N-1 times.
Any better / faster idea?

You can just do one shuffle, and then step through the list.
I'd recommend a Fisher-Yates shuffle.

This question has been asked a few times, and in each case the correct answer given is to shuffle an array (either the original, or an array of indices), however this isn't a satisfactory answer in cases where the number of possible indices is prohibitively large (either it's huge, or memory is tight, or you simply crave maximum efficiency for whatever reason).
As such I want to add an alternative for the sake of completeness. Now, this isn't truly random, so if that's what you need then do not use this, however, if your goal is simply "good enough" with minimal memory requirements then the following pseudo-code may be of interest:
function init:
start = random [0, length) // Pick a fully random starting index
stride = random [1, length - 1) // Pick a random step size
next_index = start
function advance_next_index:
next_index = (next_index + stride) % length
if next_index is equal to start then
start = (start + 1) % length
next_index = start
Here's an example of how to implement a re-usable function for grabbing pseudo-random values:
counter = length
function pseudo_random:
counter = counter + 1
if counter is equal to length then
init()
counter = 0
advance_next_index()
return next_index
Quite simply pseudo_random will call init once every length iterations, thus re-shuffling the "random" pattern of results produced by advance_next_index, and ensure that for every length values there is not a single duplicate.
To reiterate; this isn't a particularly random algorithm, so it must not be used in situations where true randomness is required. However, the results are random enough for some basic, non-critical, tasks, and it has a tiny memory footprint. For example, if you just want to randomise some behaviour in a game to avoid something becoming repetitive, or the data-set is large and never exposed to the user (in which case it is effectively random to them) it would take a long time to piece together the order and somehow exploit it.
If anyone knows of any better algorithms with similar properties then please share!

File Input output in Mathematica

I am writing test-cases for a problem,I want to check my test-cases with Mathematica but I am facing some problems with file input/output.
I have to take Input from a file say "Test.in",the date consists of an Integer/String in each line and the input is terminated by EOF,I have to take the input (every line,one at a time) and in each step I have to process the input and output to a file say "output.out".How can we do this in Mathematica?
PS:I am using Mathematica 7.0
ADDED:
A sample of Test.in would be like this.

You asked to read (every line,one at a time). Well, that is surely not the Mathematica way of doing things, but as you asked for it, try something along the lines of:
strInp = OpenRead ["Test.in"];
strOut = OpenWrite ["Test.out"];
While[(str=Read[strInp, Number) != EndOfFile,
out = yourprocess[str];
Write [strOut,out];
];
Close [strOut];
Close [strInp];
(* Now show the output file *)
FilePrint ["Test.out]
Edit Other answers posted better and more Mathematica-ish ways of doing this, but that generally implies NOT reading one at a time, as Mathematica favors functional, list-wide programming rather than the iterative way.

It's rather clunky to read each value one at a time, but it's natural in M- to read them all at once and then process each one.
Here's a simple infrastructure I use all the time:
(* step one: get data *)
data = Import["ideone_fM0rs.txt", "Lines"];
(* step two: ??? *)
res = {};
Module[{value, result},
value = #;
result = yourCodeHere[value];
AppendTo[res, result];
]& /# data;
(* step three: PROFIT! *)
Export["out.txt", res, "Lines"];
but see Jon McLoone on AppendTo vs Sow/Reap for large data sets: http://blog.wolfram.com/2011/12/07/10-tips-for-writing-fast-mathematica-code.
Here's a variation with Sow/Reap for the times you'd like to collect values under various tags or categories or genuses or whatever:
data = Import["ideone_fM0rs.txt", "Lines"];
res = Reap[Module[{value, result, tag},
value = #;
result = yourCodeHere[value];
tag = generateTag[value]
Sow[result, tag];
]& /# data, _, Rule][[2]];
Export["out.txt", res, "Lines"];
It's tempting to roll all that up into a single awe-inspiring one-liner, but for maintainability I like to keep it unrolled with each step made explicit.
Of course, yourCodeHere[value] could instead be many lines of well-commented code.
Note: I downloaded your data to a local file ideone_fM0rs.txt using the download link at http://ideone.com/fM0rs

That's straightforward to do, and like everything in Mathematica there is more than one way to do it. Personally, I'd use
data = ReadList["Test.in", Number, RecordLists-> True];
and then process data using Map. There's also Import and your data probably is best loaded as type Table, although you can check the full list and see what's there. You could also use Read, but you'd have to control the file open/close yourself.

For the input aspect, this might give you a start.
vals = Import["http://ideone.com/fM0rs",
"Table"] /. {aa_ /; ! NumberQ[aa] && FreeQ[aa, List], ___} :>
Sequence[] /. {} :> Sequence[]
I think others on this forum may have better ways to go about it though; I'm not much versed in the Import/Export realm.
Daniel Lichtblau
Wolfram Research

Circumventing R's `Error in if (nbins > .Machine$integer.max)`

This is a saga which began with the problem of how to do survey weighting. Now that I appear to be doing that correctly, I have hit a bit of a wall (see previous post for details on the import process and where the strata variable came from):
> require(foreign)
> ipums <- read.dta('/path/to/data.dta')
> require(survey)
> ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt)
Error in if (nbins > .Machine$integer.max) stop("attempt to make a table with >= 2^31 elements") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In pd * nl : NAs produced by integer overflow
> traceback()
9: tabulate(bin, pd)
8: as.vector(data)
7: array(tabulate(bin, pd), dims, dimnames = dn)
6: table(ids[, 1], strata[, 1])
5: inherits(x, "data.frame")
4: is.data.frame(x)
3: rowSums(table(ids[, 1], strata[, 1]) > 0)
2: svydesign.default(id = ~serial, weights = ~perwt, strata = ~strata,
data = ipums)
1: svydesign(id = ~serial, weights = ~perwt, strata = ~strata, data = ipums)
This error seems to come from the tabulate function, which I hoped would be straightforward enough to circumvent, first by changing .Machine$integer.max
> .Machine$integer.max <- 2^40
and when that didn't work the whole source code of tabulate:
> tabulate <- function(bin, nbins = max(1L, bin, na.rm=TRUE))
{
if(!is.numeric(bin) && !is.factor(bin))
stop("'bin' must be numeric or a factor")
#if (nbins > .Machine$integer.max)
if (nbins > 2^40) #replacement line
stop("attempt to make a table with >= 2^31 elements")
.C("R_tabulate",
as.integer(bin),
as.integer(length(bin)),
as.integer(nbins),
ans = integer(nbins),
NAOK = TRUE,
PACKAGE="base")$ans
}
Neither circumvented the problem. Apparently this is one reason why the ff package was created, but what worries me is the extent to which this is a problem I cannot avoid in R. This post seems to indicate that even if I were to use a package that would avoid this problem, I would only be able to access 2^31 elements at a time. My hope was to use sql (either sqlite or postgresql) to get around the memory problems, but I'm afraid I'll spend a while getting that to work, only to run into the same fundamental limit.
Attempting to switch back to Stata doesn't solve the problem either. Again see the previous post for how I use svyset, but the calculation I would like to run causes Stata to hang:
svy: mean age, over(strata)
Whether throwing more memory at it will solve the problem I don't know. I run R on my desktop which has 16 gigs, and I use Stata through a Windows server, currently setting memory allocation to 2000MB, but I could theoretically experiment with increasing that.
So in sum:
Is this a hard limit in R?
Would sql solve my R problems?
If I split it up into many separate files would that fix it (a lot of work...)?
Would throwing a lot of memory at Stata do it?
Am I seriously barking up the wrong tree somehow?

Yes, R uses 32-bit indexes for vectors so they can contain no more than 2^31-1 entries and you are trying to create something with 2^40. There is talk of introducing 64-bit indexes but that will be some way off before appearing in R. Vectors have the stated hard limit and that is it as far as base R is concerned.
I am unfamiliar with the details of what you are doing to offer any further advice on the other parts of your Q.
Why do you want to work with the full data set? Wouldn't a smaller sample that can fit in to the restrictions R places on you be just as useful? You could use SQL to store all the data and query it from R to return a random subset of more appropriate size.

Since this question was asked some time ago, I'd like to point that my answer here uses the version 3.3 of the survey package.
If you check the code of svydesign, you can see that the function that causes all the problem is within a check step that looks whether you should set the nest parameter to TRUE or not. This step can be disabled setting the option check.strata=FALSE.
Of course, you shouldn't disable a check step unless you know what you are doing. In this case, you should be able to decide yourself whether you need to set the nest option to TRUE or FALSE. nest should be set to TRUE when the same PSU (cluster) id is recycled in different strata.
Concretely for the IPUMS dataset, since you are using the serial variable for cluster identification and serial is unique for each household in a given sample, you may want to set nest to FALSE.
So, your survey design line would be:
ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=perwt, check.strata=FALSE, nest=FALSE)
Extra advice: even after circumventing this problem you will find that the code is pretty slow unless you remap strata to a range from 1 to length(unique(ipums$strata)):
ipums$strata <- match(ipums$strata,unique(ipums$strata))

Both #Gavin and #Martin deserve credit for this answer, or at least leading me in the right direction. I'm mostly answering it separately to make it easier to read.
In the order I asked:
Yes 2^31 is a hard limit in R, though it seems to matter what type it is (which is a bit strange given it is the length of the vector, rather than the amount of memory (which I have plenty of) which is the stated problem. Do not convert strata or id variables to factors, that will just fix their length and nullify the effects of subsetting (which is the way to get around this problem).
sql could probably help, provided I learn how to use it correctly. I did the following test:
library(multicore) # make svy fast!
ri.ny <- subset(ipums, statefips_num %in% c(36, 44))
ri.ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri.ny)
svyby(~incwage, ~strata, ri.ny.design, svymean, data=ri.ny, na.rm=TRUE, multicore=TRUE)
ri <- subset(ri.ny, statefips_num==44)
ri.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri)
ri.mean <- svymean(~incwage, ri.design, data=ri, na.rm=TRUE)
ny <- subset(ri.ny, statefips_num==36)
ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ny)
ny.mean <- svymean(~incwage, ny.design, data=ny, na.rm=TRUE, multicore=TRUE)
And found the means to be the same, which seems like a reasonable test.
So: in theory, provided I can split up the calculation by either using plyr or sql, the results should still be fine.
See 2.
Throwing a lot of memory at Stata definitely helps, but now I'm running into annoying formatting issues. I seem to be able to perform most of the calculation I want (much quicker and with more stability as well) but I can't figure out how to get it into the form I want. Will probably ask a separate question on this. I think the short version here is that for big survey data, Stata is much better out of the box.
In many ways yes. Trying to do analysis with data this big is not something I should have taken on lightly, and I'm far from figuring it out even now. I was using the svydesign function correctly, but I didn't really know what's going on. I have a (very slightly) better grasp now, and it's heartening to know I was generally correct about how to solve the problem. #Gavin's general suggestion of trying out small data with external results to compare to is invaluable, something I should have started ages ago. Many thanks to both #Gavin and #Martin.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Cannot make sense of keras.datasets.imdb - tensorflow

Related

VTK cutter output

error in LDA in r: each row of the input matrix needs to contain at least one non-zero entry

Efficiently: Random numbers in fixed range without repetitions

File Input output in Mathematica

Circumventing R's `Error in if (nbins > .Machine$integer.max)`

Categories

Resources