Download malware samples by searching hash values

I am conducting research to download ransomware samples in order to analyze them. The challenge lies in downloading the ransomware binaries.
I have gone through various websites, such as VirusSign, MalShare and Malwr, and downloaded more than 60,000 samples. Then, I used the AVClass script (https://github.com/malicialab/avclass) to label the samples and extracted the ransomware ones. However, out of the 60,000 samples I only got 200 ransomware.
Later on, I downloaded hashes of ransomware binaries from VT. The current challenge lies in how to download the samples corresponding to these hashes.
- I cannot use VT, since downloading samples requires the private API.
- I used the malwaresearch script, but it seems broken and unsupported (https://medium.com/#unixbr/malwaresearch-a-command-line-tool-to-find-malwares-on-openmalware-org-53bafeadb5e2).
I was wondering what the best approach is to download ransomware samples for the hash values I have.
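One practical route is to loop over the hash list and query repositories that allow lookups by hash. Below is a minimal sketch against MalShare's HTTP API; the api.php endpoint with the getfile action is what their API docs describe, but the exact parameters, the API key placeholder and the hash file name are assumptions to verify against the current documentation. Hashes MalShare has never seen will simply come back empty, so expect many misses.

    import os
    import requests

    MALSHARE_API_KEY = "YOUR_API_KEY"              # placeholder; register at malshare.com to get one
    MALSHARE_URL = "https://malshare.com/api.php"

    def download_sample(file_hash, out_dir="samples"):
        """Try to fetch one sample by hash; returns True on a hit."""
        os.makedirs(out_dir, exist_ok=True)
        params = {"api_key": MALSHARE_API_KEY, "action": "getfile", "hash": file_hash}
        resp = requests.get(MALSHARE_URL, params=params, timeout=60)
        if resp.status_code == 200 and resp.content:
            with open(os.path.join(out_dir, file_hash + ".bin"), "wb") as f:
                f.write(resp.content)
            return True
        return False

    # ransomware_hashes.txt: one hash per line, as exported from the VT search
    with open("ransomware_hashes.txt") as fh:
        for line in fh:
            h = line.strip()
            if h:
                print(h, "found" if download_sample(h) else "not found")

The same loop can be pointed at other repositories by swapping in their endpoints; no single repository will hold every hash VT reports.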

Related

Store tesseract OCR traineddata on S3?

I have an app hosted on Heroku. I seek to extract text from various PDFs. I'm currently using tesseract for this.
Since Heroku does not offer that much storage space and the .traineddata files are big (and I need to use all of them), is it possible to somehow store the tessdata language data on S3? I was not able to find any solution to this yet.
All I could find is that I can define the --tessdata-dir PATH, but that points to a local directory.
Sadly, I'm not sure Heroku is a good fit for your needs if you can't make all the data fit within the Heroku slug. Even if you could get it to work, it would be quite a performance hit.
You'd probably be better off setting up Tesseract as an API with its own server(s), then sending whatever you need to that API from Heroku (or moving the entire app over). Depending on the size of the rest of your app and how quickly Tesseract is growing in size, that might just mean Tesseract gets its own Heroku app with absolutely minimal dependencies, or it might mean moving that part of the app to AWS or something.
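That said, if you do want to try the S3 route from the question, the usual pattern is to download only the .traineddata files you need into the dyno's ephemeral filesystem at boot and point --tessdata-dir at that directory. A minimal sketch with boto3, where the bucket name and key layout are made-up placeholders:

    import os
    import subprocess

    import boto3  # assumes AWS credentials are available in the environment/config vars

    TESSDATA_DIR = "/tmp/tessdata"       # dyno-local, ephemeral storage
    BUCKET = "my-tessdata-bucket"        # placeholder bucket name

    def fetch_tessdata(languages=("eng", "deu")):
        """Pull only the .traineddata files we actually need from S3 at boot."""
        os.makedirs(TESSDATA_DIR, exist_ok=True)
        s3 = boto3.client("s3")
        for lang in languages:
            key = "tessdata/%s.traineddata" % lang               # placeholder key layout
            s3.download_file(BUCKET, key, os.path.join(TESSDATA_DIR, lang + ".traineddata"))

    def ocr(image_path, lang="eng"):
        """Run tesseract against the language data downloaded above."""
        result = subprocess.run(
            ["tesseract", image_path, "stdout", "-l", lang, "--tessdata-dir", TESSDATA_DIR],
            capture_output=True, text=True, check=True)
        return result.stdout

    fetch_tessdata()
    print(ocr("page.png"))

Keep in mind the download happens on every dyno restart, which adds to boot time and reinforces the performance concern above.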

What is the concept of CNTKTextFormatDeserializer and why use it?

I am using the CNTKTextReader to read in my training and test sets. The train file is getting large (2.7 GB now, and soon to get bigger).
I don't understand what "CNTKTextFormatDeserializer" is -- the doc I found didn't explain what the big picture is -- what is it and why use it? The doc just went into its syntax.
So, is it a way to use a binary version of these files to make them more compact?
Readers in general are just a way to make certain aspects of training easier. These include:
- randomization: SGD generalizes better when the data presented to it come in random order. The reader can randomize the data for you, with shuffling happening on the fly.
- distributed training: For distributed training, the reader is aware of the multiple workers and can make sure they receive distinct chunks of data.
- memory budget issues: The reader does not load the whole training file into memory.
- language-agnostic I/O: The reader provides a cross-platform way to read data. (If you want to always be in Python, you might not care about this, but others do.)
The CTF format is a little verbose and indeed there is a binary format deserializer that was recently added.
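To make that concrete, here is roughly how the CTF deserializer is wired up in the Python API: a CTF file has one sample per line with named streams (e.g. |features ... |labels ...), and CTFDeserializer maps those stream names onto model inputs. The stream names and dimensions below are placeholders for your own data:

    from cntk.io import (MinibatchSource, CTFDeserializer, StreamDef, StreamDefs,
                         INFINITELY_REPEAT)

    def create_reader(path, is_training, input_dim, num_classes):
        # Each CTF line looks like: |features 0 1 0 ... |labels 0 0 1
        return MinibatchSource(
            CTFDeserializer(path, StreamDefs(
                features=StreamDef(field='features', shape=input_dim),
                labels=StreamDef(field='labels', shape=num_classes))),
            randomize=is_training,                          # on-the-fly shuffling
            max_sweeps=INFINITELY_REPEAT if is_training else 1)

    reader = create_reader('train.ctf', True, input_dim=784, num_classes=10)

The training loop then pulls minibatches from the reader, so the 2.7 GB file never has to fit in memory at once.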

Could someone explain Tesseract OCR training to me?

I'm trying to do the training process, but I don't understand even how to start. I would like to train it to read numbers. My images are from the real world, so the reading process didn't go so well.
It says that I have to have a ".tif" image with the examples... Is it a single image for each number (in this case), or one image with a lot of different examples of the numbers (same font, though)?
And what about the makebox? The command didn't work here.
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
Could someone explain it to me better, at least how to start?
I saw a few programs that do this more quickly; I tried one (SunnyPage 1.8), but it isn't free. Does anyone know any free software that does this? Or a good tutorial?
Using Tesseract 3, Windows 8 (32-bit).
It is important to patiently follow the training wiki on the Google Code project site, multiple times if needed. Tesseract is an open-source library and is constantly evolving.
You will have to create a training image (TIFF) with a lot of different examples of numbers; it should probably contain all the numbers you wish the engine to recognize.
Please consider posting the exact error message you got with makebox.
I think Tesseract is the best free solution available. You have to keep working at it and seek help from the community.
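For reference, the makebox step on that wiki is just a tesseract run with the batch.nochop makebox config. A small sketch wrapping the documented commands (the filenames follow the wiki's [lang].[fontname].exp[num] convention and are placeholders for your own language and font codes):

    import subprocess

    # Placeholder base name following the [lang].[fontname].exp[num] convention.
    base = "eng.myfont.exp0"

    # Step 1: generate the .box file (the "makebox" step from the wiki).
    subprocess.run(["tesseract", base + ".tif", base, "batch.nochop", "makebox"], check=True)

    # Step 2: after hand-correcting the .box file, run the box training pass.
    subprocess.run(["tesseract", base + ".tif", base, "box.train"], check=True)

If the first command fails, the error it prints is exactly what would help diagnose your makebox problem.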
There is a very good post from Cédric here explaining the training process for Tesseract.
A good free OCR software is PDF OCR X, which is also based on Tesseract. I tried to copy my notes in German, which I had scanned at 1200 dpi, and the results were commendable but not perfect. I found that this website - http://onlineocr.net - is a lot more accurate. If you are not registered, it allows a maximum file size of 4 MB from most image formats (BMP, PNG, JPEG etc.) and PDF. It can output them as a Word file, an Excel file or a txt file.
Hope this helps.

Looking for a lossless compression api similar to smushit

Does anyone know of a lossless image compression API/service similar to Smush.it from Yahoo?
From their own FAQ:
WHAT TOOLS DOES SMUSH.IT USE TO SMUSH IMAGES?
We have found many good tools for reducing image size. Often times these tools are specific to particular image formats and work much better in certain circumstances than others. To "smush" really means to try many different image reduction algorithms and figure out which one gives the best result.
These are the algorithms currently in use:
- ImageMagick: to identify the image type and to convert GIF files to PNG files.
- pngcrush: to strip unneeded chunks from PNGs. We are also experimenting with other PNG reduction tools such as pngout, optipng, pngrewrite. Hopefully these tools will provide improved optimization of PNG files.
- jpegtran: to strip all metadata from JPEGs (currently disabled) and try progressive JPEGs.
- gifsicle: to optimize GIF animations by stripping repeating pixels in different frames.
More information about the smushing process is available at the Optimize Images section of Best Practices for High Performance Web pages.
It mentions several good tools. By the way, the very same FAQ mentions that Yahoo will make Smush.it a public API sooner or later so that you can run it on your own. Until then you can just upload images separately to Smush.it here.
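If you would rather not depend on a hosted service at all, the same tools the FAQ lists can be run locally. A rough sketch, assuming pngcrush, jpegtran and gifsicle are installed and on your PATH (the flags are the commonly documented lossless ones; double-check them against your installed versions):

    import subprocess
    from pathlib import Path

    def optimize(path):
        """Losslessly re-encode a single image next to the original."""
        src = Path(path)
        dst = src.with_name(src.stem + ".opt" + src.suffix)
        ext = src.suffix.lower()
        if ext == ".png":
            cmd = ["pngcrush", "-rem", "alla", str(src), str(dst)]       # strip ancillary chunks
        elif ext in (".jpg", ".jpeg"):
            cmd = ["jpegtran", "-copy", "none", "-optimize", "-progressive",
                   "-outfile", str(dst), str(src)]                       # strip metadata, optimize coding
        elif ext == ".gif":
            cmd = ["gifsicle", "-O3", str(src), "-o", str(dst)]          # optimize frames
        else:
            raise ValueError("unsupported format: " + ext)
        subprocess.run(cmd, check=True)
        return dst

    print(optimize("logo.png"))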
Try Kraken Image Optimizer: https://kraken.io/signup
The developer plan is free, but it only returns dummy results. You must subscribe to one of the paid plans to use the API; however, the web interface is free and unlimited for images of up to 1 MB.
Find out more in the Kraken documentation.
See this:
http://github.com/thebeansgroup/smush.py
It's a Python implementation of smushit that can be run off-line to optimise your images without uploading them to Yahoo's service.
As far as I know, the best image compression for me is TinyPNG.
They also have an API: https://tinypng.com/developers
Once you retrieve your key, you can immediately start shrinking images. Official client libraries are available for Ruby, PHP, Node.js, Python and Java. You can also use the WordPress plugin, the Magento 1 extension or the improved Magento 2 extension to compress your JPEG and PNG images.
And the first 500 images per month are free.
Tip: when using their API, there is no file-size limit (unlike the 5 MB maximum per image in their online tool).
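For example, with the official Python client (install with pip install tinify; the key below is a placeholder for your own from the developers page):

    import tinify  # pip install tinify

    tinify.key = "YOUR_API_KEY"   # placeholder; get yours from https://tinypng.com/developers

    # Compress a local PNG and write the optimized version next to it.
    source = tinify.from_file("unoptimized.png")
    source.to_file("optimized.png")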

ArXiv replication brainstorming

The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. They are looking to reduce bandwidth load using some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.
The full PDF content is in the Amazon cloud.
While there are > 600k papers on arXiv, the total size of the PDFs is < 1/2 TB.
http://arxiv.org/help/bulk_data_s3
T.
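A small boto3 sketch of pulling those bulk PDF tars, following the layout the bulk_data_s3 help page describes (a requester-pays bucket named "arxiv" with the PDF tars under a pdf/ prefix); treat the bucket name and key layout as details to verify on that page, and note that as the requester you pay the AWS transfer costs:

    import boto3  # assumes AWS credentials are configured

    # Bucket name and prefix follow the bulk_data_s3 help page linked above.
    s3 = boto3.client("s3", region_name="us-east-1")

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="arxiv", Prefix="pdf/", RequestPayer="requester"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(".tar"):
                local_name = key.replace("/", "_")
                s3.download_file("arxiv", key, local_name,
                                 ExtraArgs={"RequestPayer": "requester"})
                print("downloaded", local_name)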
arXiv recommends squid in httpd accelerator mode for precisely this purpose. Any particular reason why this is not good enough?
My first idea is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well it works with PDFs, though.