how to keep visited urls and maintain the job queue when writing a crawler - redis

I'm writing a crawler. I keep the visited URLs in a Redis set and maintain the job queue using a Redis list. As the data grows, memory gets used up; my machine has 4 GB. How can I maintain these without Redis? I have no idea; if I store them in files, they would also need to be in memory.
If I used MySQL to store them, I think it would be much slower than Redis.
I have 5 machines with 4 GB of memory each. If anyone has some material on setting up a Redis cluster, that would also help a lot. I have material on setting up a failover cluster, but what I need is a load-balanced cluster.
Thanks.

If you are just doing the basic operations of adding/removing from sets and lists, take a look at twemproxy/nutcracker. With it you can use all of the nodes.
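For example, here is a minimal nutcracker config sketch that shards keys across five Redis nodes (the pool name, IPs and ports are placeholders, not from your setup):

crawler:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true
  redis: true
  server_retry_timeout: 2000
  server_failure_limit: 1
  servers:
   - 192.168.0.1:6379:1
   - 192.168.0.2:6379:1
   - 192.168.0.3:6379:1
   - 192.168.0.4:6379:1
   - 192.168.0.5:6379:1

Your clients then talk to 127.0.0.1:22121 as if it were a single Redis server and twemproxy spreads the keys across the nodes. Note that it only supports a subset of Redis commands, but basic single-key set and list operations are among them.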
Regarding your usage pattern itself, are you removing or expiring jobs and URLs? How much repetition is there in the system? For example, are you repeatedly crawling the same URLs? If so, perhaps you only need a mapping of URLs to their last crawl time, and instead of a job queue you pull URLs that are new or outside a given window since their last run.
Without the details on how your crawler actually runs or interacts with Redis, that is about what I can offer. If memory grows continually, it likely means you aren't cleaning up the DB.
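To illustrate that last-crawl-time idea, here is a minimal sketch using a sorted set (phpredis; the key name crawl:last is just a placeholder):

<?php
// Sketch only: keep one sorted set scored by last crawl time, instead of an
// ever-growing visited set plus a separate job list.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Record that a URL was just crawled (score = unix timestamp).
function markCrawled($redis, $url) {
    $redis->zAdd('crawl:last', time(), $url);
}

// Pull URLs whose last crawl is older than the window, i.e. due again.
function urlsDue($redis, $windowSeconds, $limit = 100) {
    $cutoff = time() - $windowSeconds;
    return $redis->zRangeByScore('crawl:last', 0, $cutoff,
        array('limit' => array(0, $limit)));
}

Brand-new URLs can be added with a score of 0 so they come back first from urlsDue().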

Related

Redis cache in a clustered web farm? Sync between two member nodes?

Ok, so what I have are 2 web servers running inside of a Windows NLB clustered environment. The servers are identical in every respect, and as you'd expect in an NLB clustered environment, everybody is hitting the cluster name and not the individual members. We also have affinity turned off on the members in the cluster.
But, what I'm trying to do is to turn on some caching for a few large files (MP3s). It's easy enough to dial up a Redis node on one particular member and hit it; everything works like you'd expect. I can pull the data from the cache and serve it up as needed.
Now, let's add the overhead of the NLB. With an NLB in play, you may not be hitting the same web server each time. You might make your first hit to member 01, and the second hit to 02. So, I'd need a way to sync between the two servers. That way it doesn't matter which cluster member you hit, you are going to get the same data.
I don't need to worry about one cache being out of date, the only thing I'm storing in there is read only data from an internal web service.
I've only got 2 servers and it looks like redis clusters need 3. So I guess that's out.
Is this the best approach? Or perhaps there is something else better?
Reasons for Redis: we want the cache to be in-memory only, with no writes to the database. I thought this would be a good fit, but I need to make sure the data is available on both servers.
It's not possible to have Redis multi-master (writing on both). But I might say its replication is blazing fast (check the slaveof command of Redis).
But why do you need it on the same server? Access it as a service, so every node hits the same actual data. If the main server goes down, the slave can be promoted to master (via Redis Sentinel, or manually with SLAVEOF NO ONE).
One observation: Redis does use the disk, but asynchronously, through an append-only file that it rewrites from time to time depending on its size.
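A rough sketch of what that looks like in redis.conf (the master IP is a placeholder; on Redis 5+ the directive is spelled replicaof):

# On the second box: replicate from the box you write to
slaveof 192.168.1.10 6379

# If the master dies, promote the slave by hand (or let Redis Sentinel do it):
# redis-cli SLAVEOF NO ONE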

Redis configuration for production

I'm developing a project with Redis. My Redis configuration is just the standard default setup.
I don't know how I should configure Redis: master-slave? Cluster?
Do you have any suggestions for a production Redis configuration?
Standard approach would be to have one master and at least one slave. Depending on your I/O requirements and number of ops/sec, you can always have multiple read-only slaves. Slaves can be read from but not written to. So you'll want to design your application to take advantage of doing round-robin requests to the slaves and writes only to the single master.
Depending on your data storage/backup requirement, you can set fsync for append-only mode to be every second. So while this means you can lose up to one second worth of data, it's really much less than that because your slaves serve as hot backups, and they will have the data within milliseconds.
You'll at least want to do a BGSAVE every hour to get a dump.rdb produced. You can copy this file while the server is still running and store it in some off-site backup facility.
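In config terms, that roughly amounts to the following redis.conf lines plus a couple of crontab entries (paths are placeholders):

# redis.conf: append-only persistence, fsync once per second
appendonly yes
appendfsync everysec

# crontab: background RDB snapshot every hour, copy it off-box half an hour later
0 * * * * redis-cli BGSAVE
30 * * * * cp /var/lib/redis/dump.rdb /backups/dump.rdb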
But if you're just using Redis as a standard memcache replacement and don't care about data, then you can ignore all of this. Much of it will be changing in Redis Cluster in the 3.0 version.
It depends on what your read/write requirements are. Could you give us more information on that matter?
I think about 10,000 people use my application at any one time. I persist member login tokens in Redis, and that's important for me: if I can't write to Redis, members can't log in to the application.
Even a single Redis instance will be enough to handle 10K users (run redis-benchmark to see the throughput available), but just to be safe, use a master/slave configuration with auto-promotion of the slave if the master goes down.
Since you want persistence, use RDB (maybe along with AOF); see the persistence topic on redis.io.
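For example, a quick way to see the throughput mentioned above (host, port and test selection are placeholders):

redis-benchmark -h 127.0.0.1 -p 6379 -c 50 -n 100000 -t set,get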

How to share the APC user cache between CLI and Web Server instances?

I am using PHP's APC to store a large amount of information (with apc_fetch(), etc.). This information occasionally needs to be analyzed and dumped elsewhere.
The story goes, I'm getting several hundred hits/sec. These hits increase various counters (with apc_inc(), and friends). Every hour, I would like to iterate over all the values I've accumulated, and do some other processing with them, and then save them on disk.
I could do this as a random or time-based switch in each request, but it's a potentially long operation (may require 20-30 sec, if not several minutes) and I do not want to hang a request for that long.
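For context, the per-request counter bump looks roughly like this (the key prefix and $pageId are hypothetical):

<?php
// Create the counter once (apc_add is a no-op if the key already exists)...
apc_add('counter:' . $pageId, 0);
// ...then bump it atomically on every hit.
apc_inc('counter:' . $pageId);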
I thought a simple PHP cronjob would do the task. However, I can't even get it to read back the cache information.
<?php
// Run from the CLI: inspect whatever APC cache this process can see
print_r(apc_cache_info());
?>
Yields a seemingly different APC memory segment, with:
[num_entries] => 1
(The single entry appears to be the opcode cache entry for the script itself.)
While my webserver, powered by nginx/php5-fpm, yields:
[num_entries] => 3175
So, they are obviously not sharing the same chunk of memory. How can I either access the same chunk of memory in the CLI script (preferred), or if that is simply not possible, what would be the absolute safest way to execute a long running sequence on say, a random HTTP request every hour?
For the latter, would using register_shutdown_function() and immediately set_time_limit(0) and ignore_user_abort(true) do the trick to ensure execution completes and doesn't "hang" anyone's browser?
And yes, I am aware of Redis, memcached, etc., which would not have this problem, but I am stuck with APC for now, as neither could demonstrate the same speed as APC.
This is really a design issue and a matter of selecting preferred costs vs. payoffs.
You are thrilled by the speed of APC since you do not spend time to persist the data. You also want to persist the data but now the performance hit is too big. You have to balance these out somehow.
If persistence is important, take the hit and persist (file, DB, etc.) on every request. If speed is all you care about, change nothing - this whole question becomes moot. There are cache systems with persistent storage that can optimize your disk writes by aggregating what gets written to disk and when but you will generally always have a payoff between the two with varying tipping points. You just have to choose which of those suits your objectives.
There will probably never exist an enduring, wholesome technological solution to the wolf being sated and the lamb being whole.
If you really must do it your way, you could have a cron that CURLs a special request to your application which would trigger persisting your cache to disk. That way you control the request, its timeout, etc., and don't have to worry about everything users might do to kill their requests.
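A rough sketch of that cron-triggered endpoint (the URL, token and output path are made up; the cron line would be something like curl -s "http://localhost/internal/flush-counters.php?token=..."):

<?php
// flush-counters.php - hit by cron, not by end users.
if (!isset($_GET['token']) || $_GET['token'] !== 'CHANGE_ME') {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

ignore_user_abort(true); // keep going even if the curl disconnects
set_time_limit(0);       // allow the long-running dump to finish

$dump = array();
// Iterate the APC user cache; APCIterator ships with the APC extension.
foreach (new APCIterator('user', '/^counter:/') as $entry) {
    $dump[$entry['key']] = $entry['value'];
}
file_put_contents('/var/tmp/apc-counters-' . time() . '.json', json_encode($dump));
echo 'flushed ' . count($dump) . " counters\n";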
Potential risks in this case, however, are data integrity (as you will be writing the cache to disk while it is being updated by other requests in the meantime) as well as requests being served while you are persisting the cache paying the performance hit of your server being busy.
Essentially, we introduced a bundle of hay to the wolf/lamb dilemma ;)

Sharing a Redis database?

I'm using Redis as a session store in my app. Can I use the same instance (and db) of Redis for my job queue? If it's of any significance, it's hosted with redistogo.
It is perfectly fine to use the same redis for multiple operations.
We had a similar use case where we used Redis as a key value store as well as a job queue.
However, you may want to consider other aspects, like the performance requirements of your application. Redis can ideally handle around 70k operations per second, and if you think you may hit that benchmark at some point in the future, it's much better to split your operations across multiple Redis instances based on the kind of operations you perform. This will allow you to make decisions about availability and replication at a finer level, depending on your requirements. As a simple example, once your key size grows you may be able to flush your session-app Redis, or shard your keys using Redis Cluster, without affecting the job-queuing infrastructure.
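One common way to keep the two workloads separable later is to namespace the keys, so either side can be moved to its own instance without touching the other. A small phpredis sketch with made-up key names and variables:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Session store: sessions get their own prefix and a TTL.
$redis->setex('session:' . $sessionId, 3600, $serializedSession);

// Job queue: jobs live under a different prefix.
$redis->rPush('jobs:email', json_encode($job));
$nextJob = $redis->lPop('jobs:email');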

Solr approaches to re-indexing large document corpus

We are looking for some recommendations around systematically re-indexing an ever-growing corpus of documents in Solr (tens of millions now, hundreds of millions in less than a year) without taking the currently running index down. Re-indexing is needed on a periodic basis because:
1. New features are introduced around searching the existing corpus that require additional schema fields, which we can't always anticipate in advance.
2. The corpus is indexed across multiple shards. When it grows past a certain threshold, we need to create more shards and re-balance documents evenly across all of them (which SolrCloud does not yet seem to support).
The current index receives very frequent updates and additions, which need to be available for search within minutes. Therefore, approaches where the corpus is re-indexed in batch offline don't really work as by the time the batch is finished, new documents will have been made available.
The approaches we are looking into at the moment are:
1. Create a new cluster of shards and batch re-index there while the old cluster is still available for searching. New documents that are not part of the re-indexed batch are sent to both the old cluster and the new cluster. When ready to switch, point the load balancer to the new cluster.
2. Use CoreAdmin: spawn a new core per shard and send the re-indexed batch to the new cores. New documents that are not part of the re-indexed batch are sent to both the old cores and the new cores. When ready to switch, use CoreAdmin to dynamically swap cores (see the sketch after this list).
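For reference, a hedged sketch of the CoreAdmin calls behind approach 2 (host, core names and paths are placeholders):

# Create a sibling core that indexes into a fresh data directory
http://localhost:8983/solr/admin/cores?action=CREATE&name=items_reindex&instanceDir=items&dataDir=data_reindex

# ... batch re-index into items_reindex while dual-writing new docs to both ...

# Atomically swap the re-indexed core into place
http://localhost:8983/solr/admin/cores?action=SWAP&core=items&other=items_reindex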
We'd appreciate it if folks could either confirm or poke holes in either or both of these approaches. Is one more appropriate than the other? Or are we completely off? Thank you in advance.
This may not be applicable to you guys, but I'll offer my approach to this problem.
Our Solr setup is currently a single core. We'll be adding more cores in the future, but the overwhelming majority of the data is written to a single core.
With this in mind, sharding wasn't really applicable to us. I looked into distributed searches - cutting up the data and having different slices of it running on different servers. This, to me, just seemed to complicate things too much. It would make backup/restores more difficult and you end up losing out on certain features when performing distributed searches.
The approach we ended up going with was a very simple clustered master/slave setup.
Each cluster consists of a master database and two Solr slaves that are load balanced. All new data is written to the master and the slaves are configured to sync new data every 5 minutes. Under normal circumstances this is a very nice setup: re-indexing operations occur on the master, and while that is happening the slaves can still be read from.
When a major re-indexing operation is happening, I remove one slave from the load balancer and turn off polling on the other. So, the customer facing Solr database is now not syncing with the master, while the other is being updated. Once the re-index is complete and the offline slave database is in sync, I add it back to the load balancer, remove the other slave database from the load balancer, and re-configure it to sync with the master.
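For reference, turning polling off and on for that step is just an HTTP call to the slave's ReplicationHandler (the host name is a placeholder):

# On the slave still serving traffic: stop pulling from the master
http://slave01:8983/solr/replication?command=disablepoll

# Once it has been rotated out and can safely catch up: resume polling
http://slave01:8983/solr/replication?command=enablepoll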
So far this has worked very well. We currently have around 5 million documents in our database and this number will scale much higher across multiple clusters.
Hope this helps!