Is there a way to iterate over Bio.trie in a memory-friendly way?

I would like to use a Bio.trie as an index for sequences to count how many times each of them occurs in a dataset. The index is created like this:
from Bio.trie import trie

seqs = trie()
for seq in dataset:
    if seqs.has_key(seq):
        seqs[seq] += 1
    else:
        seqs[seq] = 1
And after that I would like to extract some information from the trie, e.g. a list of sequences that have been counted only once. It's possible to do it like this:
singletons = []
for seq in seqs.keys():
    if seqs[seq] == 1:
        singletons.append(seq)
The problem is that the machine I'm working on is low on memory and calling trie.keys() returns a list object which is roughly as large as the trie itself. Is it possible to iterate over the keys using a generator instead of a list, so that you would only get one key at a time?
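Bio.trie may not expose a key iterator, but here is one hedged workaround sketch (not a Bio.trie feature, and it assumes the dataset can be iterated a second time, e.g. re-opened from disk): because each singleton occurs exactly once in the input, a second pass over the dataset itself never needs the full key list in memory.
# Sketch: collect singletons by re-reading the dataset instead of calling keys().
# Assumes `dataset` can be iterated a second time.
singletons = []
for seq in dataset:
    if seqs[seq] == 1:   # a singleton appears exactly once in the dataset
        singletons.append(seq)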

Related

Building relationships in Neo4j via Py2neo is very slow

We have 5 different types of nodes in the database. The largest type has ~290k nodes, the smallest only ~3k. Each node type has an id field and they are all indexed. I am using py2neo to build relationships, but it is very slow (~2 relationships inserted per second).
I used pandas to read from a relationship CSV and iterate over each row, creating a relationship wrapped in a transaction. I tried batching out 10k creation statements in one transaction, but it did not seem to improve the speed much.
Below is the code:
import pandas as pd  # graph, datatype and fields are defined elsewhere in the script

df = pd.read_csv(r"C:\relationship.csv", dtype=datatype, skipinitialspace=True, usecols=fields)
df.fillna('', inplace=True)

def f(node_1, rel_type, node_2):
    try:
        tx = graph.begin()
        tx.evaluate('MATCH (a {node_id:$label1}),(b {node_id:$label2}) MERGE (a)-[r:' + rel_type + ']->(b)',
                    parameters={'label1': node_1, 'label2': node_2})
        tx.commit()
    except Exception as e:
        print(str(e))

for index, row in df.iterrows():
    if index % 1000000 == 0:
        print(index)
    try:
        f(row["node_1"], row["rel_type"], row["node_2"])
    except:
        print("error index: " + str(index))
You state that there are "5 different types of nodes" (which I interpret to mean 5 node labels, in neo4j terminology). And, furthermore, you state that their id properties are already indexed.
But your f() function is not generating a Cypher query that uses the labels at all, and neither does it use the id property. In order to take advantage of your indexes, your Cypher query has to specify the node label and the id value.
Since there is currently no efficient way to parameterize the label when performing a MATCH, the following version of the f() function generates a Cypher query that has hardcoded labels (as well as a hardcoded relationship type):
def f(label_1, id_1, rel_type, label_2, id_2):
    try:
        tx = graph.begin()
        tx.evaluate(
            'MATCH' +
            '(a:' + label_1 + '{id:$id1}),' +
            '(b:' + label_2 + '{id:$id2}) ' +
            'MERGE (a)-[r:' + rel_type + ']->(b)',
            parameters={'id1': id_1, 'id2': id_2})
        tx.commit()
    except Exception as e:
        print(str(e))
The code that calls f() will also have to be changed to pass in both the label names and the id values for a and b. Hopefully, your df rows will contain that data (or enough info for you to derive that data).
If your aim is for better performance then you will need to consider a different pattern for loading these, i.e. batching. You're currently running one Cypher MERGE statement for each relationship and wrapping that in its own transaction in a separate function call.
Batching these by looking at multiple statements per transaction or per function call will reduce the number of network hops and should improve performance.
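As a hedged sketch of that batching idea (not from the original answer): Cypher's UNWIND lets you MERGE a whole chunk of rows in a single transaction. The labels Label1/Label2, the REL_TYPE relationship type, the id_1/id_2 column names and the chunk size below are all assumptions to adapt to your data; it assumes :Label1(id) and :Label2(id) are indexed.
# Sketch: batch relationship creation with UNWIND (labels/type hardcoded, as above).
CHUNK_SIZE = 10000  # hypothetical batch size

query = (
    'UNWIND $rows AS row '
    'MATCH (a:Label1 {id: row.id_1}), (b:Label2 {id: row.id_2}) '
    'MERGE (a)-[:REL_TYPE]->(b)'
)

rows = [{'id_1': r['node_1'], 'id_2': r['node_2']} for _, r in df.iterrows()]
for start in range(0, len(rows), CHUNK_SIZE):
    tx = graph.begin()
    tx.run(query, rows=rows[start:start + CHUNK_SIZE])  # one round trip per chunk
    tx.commit()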

Why does random always return an even number

I want to learn Elm and currently I want to create a random List with tuples containing an index and a random number.
My current approach is to create a list and for each element create a random value:
randomList =
    List.map randomEntry (List.range 0 1000)

randomEntry index =
    let
        seed = Random.initialSeed index
        randomResult = Random.step (Random.int 1 10) seed
    in
    (index, Tuple.first randomResult)
But this only creates even numbers.
Why does it always create even numbers and what is the correct way of doing this?
Strange - using your randomEntry function, the first odd numbers don't start showing up until 53668, and then it's odd numbers for a while. An example from the REPL:
> List.range 0 100000 |> List.map randomEntry |> List.filter (\(a,b) -> b % 2 /= 0) |> List.take 10
[(53668,35),(53669,87),(53670,1),(53671,15),(53672,29),(53673,43),(53674,57),(53675,71),(53676,85),(53677,99)]
: List ( Int, Int )
Now I can't tell you why there is such stickiness in this range (here's the source for the int generator if you're curious), but hopefully I can shed some light on randomness in Elm.
There is really no way to create a truly random number generator using standard computing technology (quantum computers aside), so the typical way of creating randomness is to provide a function that takes a seed value, returns a pseudo-random number and the next seed value to use.
This makes random number generation predictable, since you will always get the same "random" number for the same seed.
This is why you always get identical results for the input you've given: You are using the same seed values of 0 through 1000. In addition, you are ignoring the "next seed" value passed back from the step function, which is returned as the second value of the tuple.
Now, when dealing with random number generators, it is a good rule of thumb to avoid dealing with the seed as much as possible. You can write generators without referring to seeds by building on smaller generators like int, list, and so on.
The way you execute a generator is either by returning a Cmd created with Random.generate from your update function, which leaves the responsibility of deciding which seed to use to the Elm Architecture (it probably uses some time-based seed), or by passing in the seed using Random.step, as you've done above.
So, going back to your original example, if you were to write a generator for returning a list of random numbers of a certain size, where each number is within a certain range, it could look something like this:
randomListGenerator : Int -> (Int, Int) -> Random.Generator (List Int)
randomListGenerator size (low, high) =
    Random.list size (Random.int low high)
Executing this using step in the REPL shows how it can be used:
> initialSeed 0 |> step (randomListGenerator 20 (1, 10)) |> Tuple.first
[6,6,6,1,3,10,4,4,4,9,6,3,5,3,7,8,3,4,8,5] : List Int
You'll see that this includes some odd numbers, unlike your initial example. It differs from your example because the generator passes along the next seed to use at each consecutive step, whereas your example used the integers 0 through 1000 in order. I still have no explanation for your original question of why there is such a big block of evens with your original input, other than to say it is very odd.

Getting Term Frequencies For Query

In Lucene, a query can be composed of many sub-queries. (such as TermQuery objects)
I'd like a way to iterate over the documents returned by a search, and for each document, to then iterate over the sub-queries.
For each sub-query, I'd like to get the number of times it matched. (I'm also interested in the fieldNorm, etc.)
I can get access to that data by using indexSearcher.explain, but that feels quite hacky because I would then need to parse the "description" member of each nested Explanation object to try and find the term frequency, etc. (also, calling "explain" is very slow, so I'm hoping for a faster approach)
The context here is that I'd like to experiment with re-ranking Lucene's top N search results, and to do that it's obviously helpful to extract as many "features" as possible about the matches.
From looking at the source code for classes like TermQuery, the following appears to be a basic approach:
// For each document... (scoreDoc.doc is an integer)
Weight weight = weightCache.get(query);
if (weight == null)
{
    weight = query.createWeight(indexSearcher, true);
    weightCache.put(query, weight);
}

IndexReaderContext context = indexReader.getContext();
List<LeafReaderContext> leafContexts = context.leaves();
int n = ReaderUtil.subIndex(scoreDoc.doc, leafContexts);
LeafReaderContext leafReaderContext = leafContexts.get(n);
Scorer scorer = weight.scorer(leafReaderContext);
int deBasedDoc = scoreDoc.doc - leafReaderContext.docBase;
int thisDoc = scorer.iterator().advance(deBasedDoc);

float freq = 0;
if (thisDoc == deBasedDoc)
{
    freq = scorer.freq();
}
The 'weightCache' is of type Map and is useful so that you don't have to re-create the Weight object for every document you process. (otherwise, the code runs about 10x slower)
Is this approximately what I should be doing? Are there any obvious ways to make this run faster? (it takes approx 2 ms for 280 documents, as compared to about 1 ms to perform the query itself)
Another challenge with this approach is that it requires code to navigate through your Query object to try and find the sub-queries. For example, if it's a BooleanQuery, you call query.clauses() and recurse on them to look for all leaf TermQuery objects, etc. Not sure if there is a more elegant / less brittle way to do that.

Redis scan count: How to force SCAN to return all keys matching a pattern?

I am trying to find the values stored in a list of keys which match a pattern from Redis. I tried using SCAN so that later on I can use MGET to get all the values. The problem is:
SCAN 0 MATCH "foo:bar:*" COUNT 1000
does not return any value whereas
SCAN 0 MATCH "foo:bar:*" COUNT 10000
returns the desired keys.
How do I force SCAN to look through all the existing keys? Do I have to look into Lua for this?
With the command below you will scan the first 1000 objects from cursor 0:
SCAN 0 MATCH "foo:bar:*" COUNT 1000
In the result, you will get a new cursor to use in the next call:
SCAN YOUR_NEW_CURSOR MATCH "foo:bar:*" COUNT 1000
This scans the next 1000 objects. When you increase COUNT from 1000 to 10000, each call scans more keys and, in your case, matches more of them.
To scan the entire keyspace you need to keep calling SCAN until the cursor returned in the response is zero (i.e. the scan is complete).
Use the INFO command to get your number of keys, e.g.
db0:keys=YOUR_AMOUNT_OF_KEYS,expires=0,avg_ttl=0
Then call
SCAN 0 MATCH "foo:bar:*" COUNT YOUR_AMOUNT_OF_KEYS
Just going to put this here for anyone interested in how to do it using the python redis library:
import redis

redis_server = redis.StrictRedis(host=settings.redis_ip, port=6379, db=0)

mid_results = []
cur, results = redis_server.scan(0, 'foo:bar:*', 1000)
mid_results += results

while cur != 0:
    cur, results = redis_server.scan(cur, 'foo:bar:*', 1000)
    mid_results += results

final_uniq_results = set(mid_results)
It took me a few days to figure this out, but basically each scan will return a tuple.
Examples:
(cursor, results_list)
(5433L, [... keys here ...])
(3244L, [... keys here, maybe ...])
(6543L, [... keys here, duplicates maybe too ...])
(0L, [... last items here ...])
Keep scanning cursor until it returns to 0.
There is a guarantee it will return to 0.
Even if the scan returns an empty results_list between scans.
However, as noted by #Josh in the comments, SCAN is not guaranteed to terminate under a race condition where inserts are happening at the same time.
I had a hard time figuring out what the cursor number was and why I would randomly get an empty list or repeated items, even though I knew I had just put items in.
After reading:
https://github.com/antirez/redis/blob/unstable/src/dict.c#L772-L855
It made more sense, but still there is some deep programming magic and compromises happening to iterate the sets.
If your use case involves Python, or if you just want to get the values once and have Python installed on your machine, this is a trivial task using the scan_iter method of the redis Python library:
from redis import StrictRedis

redis = StrictRedis.from_url(REDIS_URI)
keys = []
for key in redis.scan_iter('foo:bar:*', 1000):
    keys.append(key)
In the end, keys will contain all the keys you would get by applying #khanou's method.
This is also more efficient than shell scripts, since those spawn a new client on each iteration of the loop.
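Since the original goal was to feed those keys into MGET, here is a hedged sketch (assuming the matched keys hold plain string values) that fetches the values in chunks rather than all at once; the chunk size is an arbitrary choice:
from redis import StrictRedis

redis = StrictRedis.from_url(REDIS_URI)  # REDIS_URI as in the snippet above

CHUNK = 1000  # hypothetical chunk size
values = []
chunk = []
for key in redis.scan_iter('foo:bar:*', count=CHUNK):
    chunk.append(key)
    if len(chunk) >= CHUNK:
        values.extend(redis.mget(chunk))  # MGET one chunk of keys at a time
        chunk = []
if chunk:
    values.extend(redis.mget(chunk))      # fetch the final partial chunk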

Why is iterating through a large Django QuerySet consuming massive amounts of memory?

The table in question contains roughly ten million rows.
for event in Event.objects.all():
    print event
This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.
I also tried Event.objects.iterator() which behaved the same way.
I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which would mean the results would be printed at roughly a constant rate (rather than all at once after a lengthy wait).
What have I misunderstood?
(I don't know whether it's relevant, but I'm using PostgreSQL.)
Nate C was close, but not quite.
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print e.headline
So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.
From my reading of the docs, iterator() does nothing more than bypass QuerySet's internal caching mechanisms. I think it might make sense for it to do a one-by-one thing, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.
Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes:
Memory Efficient Django QuerySet iterator
batch querysets
QuerySet Foreach
It might not be the fastest or most efficient, but as a ready-made solution why not use Django core's Paginator and Page objects, documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
from django.core.paginator import Paginator
from djangoapp.models import model

paginator = Paginator(model.objects.all(), 1000)  # chunks of 1000; change this to the desired chunk size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        pass  # here you can do whatever you want with the row
    print "done processing page %s" % page
Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:
for event in Event.objects.all().iterator():
    print event
https://docs.djangoproject.com/en/stable/ref/models/querysets/#iterator
The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.
Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.
from django.core.paginator import Paginator

def chunked_iterator(queryset, chunk_size=10000):
    paginator = Paginator(queryset, chunk_size)
    for page in range(1, paginator.num_pages + 1):
        for obj in paginator.page(page).object_list:
            yield obj

for event in chunked_iterator(Event.objects.all()):
    print event
For large numbers of records, a database cursor performs even better. You do need raw SQL in Django; the Django cursor is something different from an SQL cursor.
The LIMIT - OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.
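As a hedged sketch of that cursor idea (not part of the original answer): with PostgreSQL you can stream rows through a psycopg2 named (server-side) cursor. The connection string, table and column names below are assumptions.
import psycopg2

# Hypothetical connection settings; adapt to your database.
conn = psycopg2.connect("dbname=mydb user=myuser")

# A named cursor is a server-side cursor: rows are fetched in batches of itersize.
cur = conn.cursor(name='event_stream')
cur.itersize = 2000
cur.execute("SELECT id, title FROM myapp_event")  # hypothetical table/columns

for row in cur:
    print(row)  # do something with the raw row tuple

cur.close()
conn.close()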
Django doesn't have a good solution for fetching large numbers of items from the database.
import gc

# Get the events in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)

for index, eid in enumerate(eids):
    event = Event.objects.get(id=eid)
    # do necessary work with event
    if index % 100 == 0:
        gc.collect()
        print("completed 100 items")
values_list can be used to fetch all the ids from the database and then fetch each object separately. Over time, large objects accumulate in memory and won't be garbage collected until the for loop exits. The code above does manual garbage collection after every 100th item is consumed.
This is from the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/
No database activity actually occurs until you do something to evaluate the queryset.
So when the print event runs, the query fires (which is a full table scan according to your command) and loads the results. You're asking for all the objects, and there is no way to get the first object without getting all of them.
But if you do something like:
Event.objects.all()[300:900]
http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Then it will add offsets and limits to the SQL internally.
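Building on that, here is a hedged sketch (the chunk size is an arbitrary assumption) of walking the table in LIMIT/OFFSET chunks via queryset slicing:
CHUNK = 1000  # hypothetical chunk size

qs = Event.objects.order_by('pk')  # a stable ordering keeps the chunks consistent
total = qs.count()

for offset in range(0, total, CHUNK):
    for event in qs[offset:offset + CHUNK]:  # translated to LIMIT/OFFSET in SQL
        print(event)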
A massive amount of memory gets consumed before the queryset can be iterated because all the database rows for the whole query are processed into objects at once, and that can be a lot of processing depending on the number of rows.
You can chunk up your queryset into smaller, digestible bits. I call the pattern to do this "spoonfeeding". Here's an implementation with a progress bar that I use in my management commands; first pip3 install tqdm:
from tqdm import tqdm

def spoonfeed(qs, func, chunk=1000, start=0):
    """
    Chunk up a large queryset and run func on each item.
    Works with automatic primary key fields.

    chunk -- how many objects to take on at once
    start -- PK to start from

    >>> spoonfeed(Spam.objects.all(), nom_nom)
    """
    end = qs.order_by('pk').last()
    progressbar = tqdm(total=qs.count())
    if not end:
        return
    while start < end.pk:
        for o in qs.filter(pk__gt=start, pk__lte=start + chunk):
            func(o)
            progressbar.update(1)
        start += chunk
    progressbar.close()
To use this you write a function that does operations on your object:
def set_population(town):
    town.population = calculate_population(...)
    town.save()
and then run that function on your queryset:
spoonfeed(Town.objects.all(), set_population)
Here is a solution including len and count:
class GeneratorWithLen(object):
    """
    Generator that includes len and count for given queryset
    """
    def __init__(self, generator, length):
        self.generator = generator
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.generator

    def __getitem__(self, item):
        return self.generator.__getitem__(item)

    def next(self):
        return next(self.generator)

    def count(self):
        return self.__len__()


def batch(queryset, batch_size=1024):
    """
    Returns a generator that does not cache results on the QuerySet.
    Aimed at expected HUGE/ENORMOUS data sets: no caching, no more memory used than batch_size.

    :param batch_size: Size for the maximum chunk of data in memory
    :return: generator
    """
    total = queryset.count()

    def batch_qs(_qs, _batch_size=batch_size):
        """
        Returns a (start, end, total, queryset) tuple for each batch in the given
        queryset.
        """
        for start in range(0, total, _batch_size):
            end = min(start + _batch_size, total)
            yield (start, end, total, _qs[start:end])

    def generate_items():
        queryset.order_by()  # Clearing... ordering by id if PK autoincremental
        for start, end, total, qs in batch_qs(queryset):
            for item in qs:
                yield item

    return GeneratorWithLen(generate_items(), total)
Usage:
events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
    pass  # Do something with the Event
There are a lot of outdated results here. Not sure when it was added, but Django's QuerySet.iterator() method uses a server-side cursor with a chunk size, to stream results from the database. So if you're using postgres, this should now be handled out of the box for you.
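For illustration, a minimal sketch (the chunk_size value is an arbitrary choice, and the keyword argument requires a sufficiently recent Django):
# Streams rows via a server-side cursor on PostgreSQL; only about chunk_size
# rows are held in memory at a time (2000 here is a hypothetical choice).
for event in Event.objects.all().iterator(chunk_size=2000):
    print(event)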
I usually use a raw MySQL query instead of the Django ORM for this kind of task.
MySQL supports streaming mode, so we can loop through all records safely and quickly without running out of memory.
import MySQLdb

db_config = {}  # config your db here
connection = MySQLdb.connect(
    host=db_config['HOST'], user=db_config['USER'],
    port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])

cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")

while True:
    record = cursor.fetchone()
    if record is None:
        break
    # Do something with record here

cursor.close()
connection.close()
Ref:
Retrieving million of rows from MySQL
How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once