So, I've been extracting lot of data with import.io desktop app for quite some time; but what always bugged me is when you try to bulk extract multiple URLs it always skips around half of them.
It's not URL problem, if you take same let's say 15 URLs it will return for example first time 8, second time 7, third time 9; some links will be extracted first time but will be skipped second time and so on.
I am wondering is there a way to make it process all URL I feed it?
I have encountered this issue a few times when I am extracting data. This typically is due to the speed of the Bulk Extract requesting URLs from the site's servers.
A workaround is to use a Crawler like an Extractor. You can paste the URLs that you created/collected into the Where to Start, Where to Crawl, and Where to Get Data From sections (you need to click on the advanced settings button in the Crawler).
Make sure to turn on 0 depth Crawl. (This turns the Crawler into an Extractor; i.e. no discovery of additional URLs)
Increase the Pause Between Pages.
Here is screenshot of one I built sometime ago.
http://i.gyazo.com/92de3b7c7fbca2bc4830c27aefd7cba4.png
Related
I am new to Lucene and trying to use it for searching log files/entries generated by a SystemA.
Architecture
Receive each log entry (i.e. XML) in a INPUT Directory. SystemA sends log entries to a MQ queue which is polled by a small utility, that picks the message and create a file in INPUT directory.
WriteIndex.java (i.e. IndexWriter/Lucene) keep checking if a new file received in INPUT directory. If yes, it takes the file, puts in Index and move the file to OUTPUT directory. As part of Indexing, I am putting filename, path, timestamp, contents in Index.
"Note: I am creating index on Content as well putting whole Content as StringField."
SearchIndex.java (ie. SeacherManager/Lucene/refereshIfChanged) is created. As part of Creation I started a new thread as well that keep checking every 1 min if Index has changed on not. I acquire IndexSearcher for every request. It's working fine.
Everything so far worked very fine. But I am not sure what will happen in production as I have tested it for few hundred files but in production, I will be getting like 500K log entries in a day which means 500K small file, each having an XML. "WriteIndex.java" will have to run non-stop to update index whenever new file received.
I have following questions
Anyone has done any similar work? Any issues/best practices I should follow.
Do you see any problem with Index files generated for such large number of xml files. Each XML file would be 2KB max. Remember I am indexing on the content as well as putting content as String in index so that I can retrieve from the index whenever I found a match on index while searching.
I would be exposing SearchIndex.java as Servlet to allow admins to come on a WebPage and search log entries. Any issues you see with it?
Please let me know if anyone need anything specific.
Thanks,
Rohit Goyal
Architecture looks fine.
Few things
Consider using TextField instead of StringField. TextField will be tokenized and hence user would be able to search on tokens. StringField is not tokenized and hence for document to match search, full text should match.
No problem in performance for lucene. Check out Lucene performance graphs. Lucene can generate index for over a billion wikipedia documents in minutes. Searching is fast too.
I need to create a list of all DNS Queries required to display a large number of sites (ideally up to 1 000 000). The list needs to assign the queries to the page that required them.
Example: Visiting google.com required a DNS query for google.com, ssl.gstatic.com, apis.google.com and other sites. My List would read something along the lines of
google.com:google.com,ssl.gstatic.com,apis.google.com,...
(exact format not relevant here)
I currently have two ideas on how to do this:
Set up a DNS Server with logging, build a script that visits a given list of domains using my DNS Server as a resolver
Building a script that loads the source code of the site (think python's urllib2, for example), parsing all embedded content and constructing a list of queries that would be needed
Both ideas have problems though. Visiting 1 000 000 Domains with a space of 2 seconds between visits (to make it possible to assign queries to the visited site afterwards), taking about 1 second to load (which is pretty optimistic) would take over 34 days, probably longer. But to build a parser I would need a complete list of all possible forms of embedded content that would result in a DNS Query, and I would need to query some of the target URLs as well (think iframes), and some content would be impossible to check for further queries (think flash content which connects to other servers).
I'm kind of stuck here, and would appreciate some input on how to deal with this. It would be possible to shorten the List of URLs to maybe 100 000, but any less would dramatically reduce the use of the result.
For context: I need this list for my bachelor thesis dealing with a attack strategy on a proposed DNS privacy extension.
You can use PhantomJS to do this, as it provides an interface that will let you capture network requests and log them, something along the lines of this example.
You'd need to write some simple Javascript, but as it's Node, it should be fairly easy to run this asynchronously to gather the data you need within a reasonable time.
There is a tool that can do this and produce a graphic representation. It is part of dnssec-tools called DNSpktflow (DNS Packet Flow)
It may not do what you want exactly but it is open source so you can see how they do it.
i trying to build a json file that is future proof. meaning the urls for media in the json file will need to be changed from time to time (like if i change a host or similar ).
so if my url is
http://mediaserver.com/media/video.mp4
then later i need to change all to
http://new-mediaserver.com/media/video.mp4
or
rtmp://s2ziuhvw5dd27c.cloudfront.net/cfx/
is this possible to avoid hand input of hundreds or urls every time sources change?
maybe have to do it using a database? any tips?
Context:
I am load testing a prototype enterprise web app that performs quick searches on a large dataset. It's backed by a database and uses JQuery datables backed by a servlet to narrow the results upon each keystroke.
I want to find out how it will behave under load and measure response time, stability and usability under various loads and come up with a SLA. The load in this case would be a number of users logging in, typing various search string, simultaneously.
Tools:
I am using Apache Jmeter to do this.
Question:
To truly make my load tests random and eliminate the effect of caching at database level (or anywhere else), I want my HTTP requests for each search to be random. I want to do something like this: send a character, wait, send another character, send backspace, send one more character, send two backspaces, etc.
What is the most elegant/efficient way of doing something like that using JMeter?
Right now I am looking into using CSV dataset and read random characters from a large file, but I'm wondering if there is a better way.
You can achieve the random search strings by using functions.
Specifically, look at RANDOM and CHAR.
basically, you'd have something like ${__CHAR(${__RANDOM(0,82)})} to generate a single character.
I'd also recommend having a CSV file with the top 100 most popular search terms to test against.
I don't know your use case, but it seems unlikely that people are going to be typing in random characters. If I am correct, simulating random keystrokes could be just as misleading as using a very small set of search keywords.
Instead, you should locate or develop a set of keywords that people are likely to use - possibly be scanning the content they will be searching? Then use that to populate what users will enter when they are searching.
We have our production server running our website. Then we have a test server which has exact same data but with changes to code to do some new functionality. This web app has over 500 pages.
Is there any program that can
Login to the test site
Crawl through each page and then save the page as html
Compare with the same page saved with live site?
This way we can make sure that new features that we add to our test site will not break the live site when code updates are applied to production.
I am currently trying to use WinHTTrack website copier and then comparing the test and live folders with some code comparison tool like beyond compare. This works ok but there are lot of files changed because of the domain name changes.
Looking forward to ideas / solutions for this problem.
Regards
Have you looked at using Watir for this? It's not exactly the thing you are looking for but it might allow you some more granularity in your tests and ensure the site is functionally identical rather than getting caught up on changing guids, timestamps and all the other things that tend to change across any significant size website from day to day as part of it's standard functionality.
Apparently you can't make consistent, reproduceable builds in your project, can you? I would recommend moving towards that in the long run, it will save you a lot of headaches. That way you would know exactly what was deployed to which server when, so there would be no more need to bend around backwards to get the deployed sources back like this...
I know this is not a direct solution to your problem... but maybe it is worth comparing, whether you would save more in the long run by investing the efforts into your build process now, instead of implementing this workaround (and then improving your build process anyway - because one day you will almost surely need to do that).
wget has a --convert-links option, there are also some options to preserve cookies that might let you do it logged in http://drupal.org/node/118759#comment-664498
use an Offline Downloader, download all files to your computer from both sources, then compare the folder contents using a free tool like Total Commander.
EDIT
Load both of your sources into a CVS, and compare it there.