OpenRefine speed puzzlement - openrefine

I am curious about OpenRefine's speed. I have two projects, each about 5 MB in size and 35,000-40,000 lines long.
This dataset works normally: https://raw.githubusercontent.com/whanley/egypt-data/main/exp-manifests-rough(1).tsv
This dataset works slowly: https://raw.githubusercontent.com/whanley/b-g/master/bslc-members/bslc-members-to-1900-tsv.tsv
I notice the slow speed when faceting. For example, faceting the second dataset's "Surname" column by count is very slow.
I have tried increasing memory, etc. What confuses me is the difference in speeds between two quite similar projects. Any insight or tips would be appreciated.

Related

Does increasing the number of iterations affect log-lik, AIC etc.?

Whenever I try to solve a convergence issue in one of my glmer models with the help of a different optimizer, I repeat the entire model optimization procedure with the new optimizer. That is, I re-run all the models I've computed so far with the new optimizer and again conduct comparisons with anova(). I do this because as far as I know different optimizers may lead to differences in AICs and log-lik ratios for one and the same model, making comparisons between two models that use different optimizers problematic.
In my most recent analysis, I've increased the number of iterations with optCtrl=list(maxfun=100000) to avoid convergence errors. I'm now wondering whether this can also lead to differences in AIC/log-lik etc. for one and the same model? Is it equally problematic to compare two models that differ with regard to the inclusion of the optCtrl=list(maxfun=100000) argument?
I actually thought that increasing the number of iterations would simply lead to longer computation times (rather than different results), but I was unable to verify this online. Any hint/explanation is appreciated.
As far as I know, you should be fine. As long as the models were fit with the same number of observations you should be able to compare them using the AIC. Hopefully someone else can comment on the nuances of the computations of the AIC itself, but I just fit a bunch of models with the same formula and dataset and different number of max iterations, getting the AIC each time. It didn't change as a function of the iterations. The iterations are just the time the model fitting process can take to maximize the likelihood, which for complex models can be tricky. Once a model is fit, and has converged on an answer, the number of iterations shouldn't change anything about the model itself.
If you look at this question, the top answer explains the AIC quite well: https://stats.stackexchange.com/questions/232465/how-to-compare-models-on-the-basis-of-aic
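To illustrate the point with a toy case, here is a minimal Python sketch. It is not glmer: the Poisson model, the Newton optimizer, the data and the function name fit_poisson_aic are all assumptions made for illustration. The iteration cap only matters if the optimizer hits it before converging:

```python
import math

def fit_poisson_aic(counts, max_iter, tol=1e-12):
    """Fit the log-rate of a Poisson model by Newton's method.

    Returns (aic, converged). The loop stops as soon as the step is
    smaller than tol, so a larger max_iter changes nothing after that.
    """
    n = len(counts)
    total = sum(counts)
    theta = 0.0  # log-rate, arbitrary starting value
    converged = False
    for _ in range(max_iter):
        grad = total - n * math.exp(theta)   # first derivative of the log-likelihood
        hess = -n * math.exp(theta)          # second derivative
        step = grad / hess
        theta -= step
        if abs(step) < tol:
            converged = True
            break
    loglik = sum(y * theta - math.exp(theta) - math.lgamma(y + 1) for y in counts)
    aic = 2 * 1 - 2 * loglik  # one estimated parameter
    return aic, converged

counts = [2, 4, 3, 5, 1, 3, 4, 2]  # made-up data
aic_small, ok_small = fit_poisson_aic(counts, max_iter=50)
aic_large, ok_large = fit_poisson_aic(counts, max_iter=100000)
aic_capped, ok_capped = fit_poisson_aic(counts, max_iter=1)
print(aic_small == aic_large, ok_capped)
```

In this sketch the fits capped at 50 and at 100000 iterations stop at the same point and report the same AIC, while the fit capped at 1 iteration stops before converging and reports a worse one. What matters for the comparison is whether each model converged, not the value of the cap.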

Large Dataframe slow with Lifelines Survival Analysis

I'm trying to run a survival analysis on a large dataset (about 80 rows x 12,000 cols) in Python.
Currently I'm using:
from lifelines import CoxPHFitter
cf = CoxPHFitter()
cf.fit(df, duration_col='Time', event_col='Status')
But it is extremely slow. Breaking up the dataframe into chunks of 100 and running cf.fit multiple times is slightly faster, but it's still clocking in at around 80s. This is notably slower than R's coxph, and I'd really prefer not to use rpy2 to run the analysis in R.
I'm a bit at a loss for how to make this faster, so any suggestions would be greatly appreciated.

Word2Vec: Is there any way to train the model faster?

I use Gensim Word2Vec to train on word sets in my database.
I have about 400,000 phrases (each phrase is short; 700 MB in total) in my PostgreSQL database.
This is how I train on this data using the Django ORM:
post_vector_list = []
for post in Post.objects.all():
    post_vector = my_tokenizer(post.category.name)
    post_vector.extend(my_tokenizer(post.title))
    post_vector.extend(my_tokenizer(post.contents))
    post_vector_list.append(post_vector)
word2vec_model = gensim.models.Word2Vec(post_vector_list, window=10, min_count=2, size=300)
But this job takes a lot of time and does not feel efficient.
In particular, building post_vector_list takes a lot of time and memory.
I want to speed up training but have no idea how.
I would appreciate your advice. Thanks.
To optimize such code, you need to collect good information about where the time is spent.
Is most of the time spent preparing post_vector_list?
If so, you will want to make sure my_tokenizer (whose code is not shown) is as efficient as possible. You may want to try to minimize the number of extend()s and append()s that are done on large lists. You might have to even take a look at your DB's configuration or options to speed up the DB-to-Object mapping started inside Post.objects.all().
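As a rough way to answer that first question, the two phases can be timed separately. This is a minimal sketch using only the standard library; my_tokenizer and train below are hypothetical stand-ins (the real tokenizer is not shown in the question, and train replaces the Word2Vec call), and the fake rows replace the database query:

```python
import time

def my_tokenizer(text):
    # Hypothetical stand-in for the question's tokenizer (its code is not shown).
    return text.lower().split()

def train(corpus):
    # Hypothetical stand-in for the Word2Vec(...) call; it only counts tokens.
    return sum(len(tokens) for tokens in corpus)

# Fake rows standing in for Post.objects.all(): (category, title, contents)
posts = [("News", "Some title", "Some longer contents here")] * 1000

# Phase 1: building the token lists
start = time.perf_counter()
post_vector_list = []
for category, title, contents in posts:
    post_vector_list.append(
        my_tokenizer(category) + my_tokenizer(title) + my_tokenizer(contents)
    )
prepare_seconds = time.perf_counter() - start

# Phase 2: the training call
start = time.perf_counter()
token_count = train(post_vector_list)
train_seconds = time.perf_counter() - start

print(f"prepare: {prepare_seconds:.4f}s, train: {train_seconds:.4f}s")
```

Whichever phase dominates on the real data tells you which set of suggestions is worth trying first.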
Is most of the time spent in the call to Word2Vec()?
If so, other steps may help:
ensure you're using gensim's Cython-optimized routines – if not, you should be seeing a logged warning (and training will be up to 100X slower)
consider using a workers=4 or workers=8 optional argument to use more threads, if your machine has at least 4 or 8 CPU cores
consider using a larger min_count, which speeds training somewhat (and since vectors for words where there are only a few examples typically aren't very good anyway, doesn't lose much and can even improve the quality of the surviving words)
consider using a smaller window, since training takes longer for larger windows
consider using a smaller vector_size (previously called size), since training takes longer for larger-size vectors
consider using a more-aggressive (smaller) value for the optional sample argument, which randomly skips more of the most-frequent words. The default is 1e-03, but values of 1e-05 or 1e-06 (especially on larger corpora) can offer additional speedup, and often even improve the final vectors (by spending relatively less training time on words with an excess of usage examples)
consider using a lower-than-default (5) value for the optional epochs parameter (previously called iter). (I wouldn't recommend this unless the corpus is very large – so it already has many redundant, equally-good examples of the same words throughout.)
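Put together, the suggestions above might look like the following sketch. The specific values are illustrative assumptions, not recommendations, and the helper names faster_word2vec_kwargs and train are hypothetical; the keyword names assume gensim 4.x (where size became vector_size):

```python
import os

def faster_word2vec_kwargs(cpu_count=None):
    """Illustrative settings only; tune each value against your own data."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return {
        "vector_size": 100,            # smaller than the question's 300
        "window": 5,                   # smaller than the question's 10
        "min_count": 5,                # larger than the question's 2
        "sample": 1e-5,                # more aggressive downsampling of frequent words
        "workers": min(8, cpu_count),  # more threads when the cores are available
    }

def train(sentences):
    # Imported lazily so the settings can be inspected without gensim installed.
    from gensim.models import Word2Vec  # assumes gensim 4.x parameter names
    return Word2Vec(sentences, **faster_word2vec_kwargs())

print(faster_word2vec_kwargs())
```

Changing one setting at a time and checking both the training time and the quality of the resulting vectors is safer than applying all of them at once.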
You could stream the data from an iterable object instead of loading it all into a list. Gensim accepts any iterable that can be restarted, which matters because training passes over the corpus more than once, so a class with an __iter__ method works where a one-shot generator would not. The code will look something like this:
class Post_Vectors(object):
    def __init__(self, Post):
        self.Post = Post
    def __iter__(self):
        # Tokenize one post at a time, so the full corpus is never held in memory
        for post in self.Post.objects.all():
            post_vector = my_tokenizer(post.category.name)
            post_vector.extend(my_tokenizer(post.title))
            post_vector.extend(my_tokenizer(post.contents))
            yield post_vector
post_vectors = Post_Vectors(Post)
word2vec_model = gensim.models.Word2Vec(post_vectors, window=10, min_count=2, size=300, workers=??)
For the gensim speedup, if you have a multi-core CPU, you could use the workers parameter. (By default it is 3)
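The reason a class with __iter__ is used rather than a bare generator can be shown without gensim or Django. In this minimal sketch, TokenStream, my_tokenizer and the sample rows are hypothetical stand-ins; the point is only that an object with __iter__ can be traversed repeatedly, which a training routine that makes several passes over the corpus needs:

```python
def my_tokenizer(text):
    # Hypothetical stand-in for the question's tokenizer.
    return text.split()

class TokenStream:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, rows):
        self.rows = rows  # stands in for a lazily evaluated database query

    def __iter__(self):
        for row in self.rows:
            yield my_tokenizer(row)

rows = ["first short phrase", "second short phrase"]

stream = TokenStream(rows)
first_pass = list(stream)
second_pass = list(stream)   # works again: __iter__ is called anew

one_shot = (my_tokenizer(row) for row in rows)
gen_first = list(one_shot)
gen_second = list(one_shot)  # empty: a generator is exhausted after one pass

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```

The trade-off is that the tokenizing work is repeated on every pass, so this saves memory rather than time.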

Why are embedding_lookup_sparse and string_to_hash_bucket in TensorFlow slow with a large number of embedding rows?

In TensorFlow, embedding_lookup_sparse looks up rows of the embeddings according to sp_ids. I think this is similar to random access. However, when the embeddings table is large, e.g. 10M rows, inference takes more time than when the table has only about 1M rows. As I understand it, the lookup is similar to random access and the hash function takes constant time, so both should be fast and largely insensitive to the table size. Is there anything wrong with my reasoning? Is there any way to optimize this so that inference is faster? Thank you!
Are you sure it is caused by embedding_lookup? In my case I also have millions of rows to look up. It is very fast if I use the GradientDescent optimizer, and very slow if I use Adam or the others. It is probably not the embedding_lookup op that slows down your training but other ops that depend on the total number of parameters.
It is true that "embedding_lookup" works slowly when there are many rows in the table.
You can see why by reading its source code. Here is the relevant source code in "embedding_lookup":
image of the source code: variable "np" is the length of table
image of the source code: loop with np
As you can see, there is a loop with a time complexity of O(table length). In fact, "embedding_lookup" uses dynamic partition to split the input data into several partitions of ids, and then uses this loop to look up the word vectors for each partition of ids. In my opinion, this trick fixes the time complexity at O(table length) no matter how big the input data is.
So I think the best way for you to increase training speed is to input more samples in each batch.

Getting each example exactly once

For monitoring my model's performance on my evaluation dataset, I'm using tf.train.string_input_producer for the filenames queue on .tfr files, then I feed the parsed examples to the tf.train.batch function, that produces batches of a fixed size.
Assume my evaluation dataset contains exactly 761 examples (a prime number). To read all the examples exactly once, I need a batch size that divides 761, but there is none except 1, which would be too slow, and 761, which would not fit on my GPU. Is there a standard way to read each example exactly once?
Actually, my dataset size is not 761, but there is no number in the reasonable range of 50-300 that divides it exactly. Also I'm working with many different datasets, and finding a number that approximately divides the number of examples in each dataset can be a hassle.
Note that using the num_epochs parameter to tf.train.string_input_producer does not solve the issue.
Thanks!
You can use reader.read_up_to as in this example. Your last batch will be smaller, so you need to make sure your network doesn't hard-wire the batch size anywhere.
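The idea of a smaller final batch can be sketched in plain Python. This is not the TensorFlow reader API; the batches helper and the batch size of 100 are assumptions made for illustration:

```python
def batches(examples, batch_size):
    """Yield consecutive batches; the last one may be smaller than batch_size."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = list(range(761))  # 761 examples, as in the question

sizes = [len(batch) for batch in batches(examples, 100)]
seen = [x for batch in batches(examples, 100) for x in batch]

print(sizes)
```

With 761 examples and a batch size of 100, this gives seven full batches and a final batch of 61, and every example appears exactly once. Any metric averaged over batches then needs to weight the last batch by its actual size.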