Create discrepancy between size on disk and actual size in NTFS - ntfs

I keep finding files which show a size of 10 KB but a size on disk of 10 GB. I'm trying to figure out how this is done; does anyone have any ideas?

You can make sparse files on NTFS, as well as on any real filesystem. :-)
Seek to (10 GB - 10 kB), write 10 kB of data. There, you have a so-called 10 GB file, which in reality is only 10 kB big. :-)
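A minimal sketch of that seek-and-write trick in Python (the file name and sizes are placeholders; on Linux and most Unix filesystems the hole is created automatically, while on NTFS you may also need to mark the file sparse, e.g. with fsutil sparse setflag, before the allocated size actually shrinks):

size = 10 * 1024**3          # nominal 10 GB file size
tail = 10 * 1024             # only 10 KB of real data

with open("big_but_small.bin", "wb") as f:
    f.seek(size - tail)      # jump to (10 GB - 10 KB)
    f.write(b"\0" * tail)    # write the last 10 KB; the rest is a hole

The file then reports a Size of about 10 GB, while a sparse file's Size on disk stays close to 10 KB.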

You can create streams in NTFS files. It's like a separate file, but with the same filename. See here: Alternate Data Streams
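A hedged sketch of the idea (Windows/NTFS only; the file and stream names are made up): data written to an alternate stream does not show up in the file's reported size, but it still occupies disk space, which can produce the kind of discrepancy described depending on how the tool measures allocation.

with open("notes.txt", "w") as f:            # main stream: a few bytes
    f.write("just a small note\n")

with open("notes.txt:payload", "wb") as f:   # alternate data stream "payload"
    f.write(b"\0" * (10 * 1024 * 1024))      # 10 MB hidden behind the same filename

dir /r lists the hidden stream; Explorer's Size still shows only the main stream.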

I'm not sure about your case (or it might be a mistake in your question), but when you create an NTFS sparse file it will show different values for these two fields.
When I create a 10 MB sparse file and fill it with 1 MB of data, Windows Explorer shows:
Size: 10 MB
Size on disk: 1 MB
But in your case it's the opposite (or a mistake).
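If you want to check the two numbers programmatically rather than through Explorer, here is a rough sketch (an addition of mine, POSIX-only since it relies on st_blocks; on Windows you would query the allocated size via GetCompressedFileSize instead, which is not shown):

import os

st = os.stat("sparse.bin")          # placeholder file name
apparent = st.st_size               # the "Size" figure
allocated = st.st_blocks * 512      # rough equivalent of "Size on disk"
print(f"Size: {apparent} bytes, Size on disk: {allocated} bytes")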

Related

MDF File Size Not Growing

I am having an issue with the MDF and LDF file sizes. I can see the physical file paths are fine, and the files are set to auto-growth by 64 MB. The table sizes change frequently, whereas the MDF and LDF file sizes stay the same.
I am curious whether this indicates data loss or something else that is keeping the files from growing. There are many tables in the database, with multiple transactions in almost every table every minute.
Can any database expert help me out in this regard? I will be very thankful.

Reading partitioned Parquet file with Pyarrow uses too much memory

I have a large Impala database composed of partitioned Parquet files.
I copied one Parquet partition to the local disk using HDFS directly. The partition totals 15 GB and is composed of lots of files of about 10 MB each. I'm trying to read it with Pandas using the Pyarrow engine, or with Pyarrow directly, but it uses more than 60 GB of RAM and runs out of memory before reading the entire dataset. What could be the reason for such large memory usage?
The size of Parquet data on disk and in memory can differ by up to an order of magnitude. Parquet uses efficient encoding and compression techniques to store columns; when you load the data into RAM, it is unpacked into its uncompressed form. For a dataset of 15 GB on disk, a RAM usage of up to 150 GB would therefore not be unusual.
If you're unsure whether this is your problem, load a single file with df = pandas.read_parquet(...) and inspect its memory usage with df.memory_usage(deep=True). That should give you a good indication of the disk-to-RAM scaling for your whole dataset.
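A small sketch of that check, assuming a single file from the partition (the file name is just a placeholder):

import os
import pandas as pd

path = "part-00000.parquet"                      # one file from the partition
df = pd.read_parquet(path, engine="pyarrow")

disk = os.path.getsize(path)
ram = df.memory_usage(deep=True).sum()
print(f"on disk: {disk / 1e6:.1f} MB, in RAM: {ram / 1e6:.1f} MB, "
      f"blow-up: {ram / disk:.1f}x")

Multiplying the blow-up factor by the 15 GB partition size tells you roughly how much RAM a full read would need.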

Estimate size of .tar.gz file before compressing

We are working on a system (on Linux) that has very limited transmission resources. There is a defined maximum size for a single transmitted file, and we would like to send the minimum number of files. Because of this, all files sent are packed and compressed in GZip format (.tar.gz).
There are a lot of small files of different types (binary, text, images...) that should be packed in the most efficient way, to send the maximum amount of data every time.
The problem is: is there a way to estimate the size of the .tar.gz file without running the tar utility? (So the best combination of files can be calculated.)
Yes, there is a way to estimate the tar size before running the command:
tar -czf - /directory/to/archive/ | wc -c
Meaning:
This creates the archive on standard output and pipes it to the wc command, a tool that counts bytes. The output is the size of the archive in bytes. Technically it does run tar, but it doesn't save the archive anywhere.
Source: The Ultimate Tar Command Tutorial with 10 Practical Examples
It depends on what you mean by "small files", but generally, no. If you have a large file that is relatively homogeneous in its contents, then you could compress 100K or 200K from the middle and use that compression ratio as an estimate for the remainder of the file.
For files around 32K or less, you need to compress them to see how big they will be. Also, when you concatenate many small files in a tar file, you will get better compression overall than you would on the small files individually.
I would recommend a simple greedy approach: take the largest file whose size, plus some overhead, is less than the remaining space in the "maximum file size". The overhead is chosen to cover the tar header and the maximum expansion from compression (a fraction of a percent). Add that file to the archive, and repeat.
You can flush the compression at each step to see how big the result is so far (a sketch of this is below).
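A rough sketch of that greedy loop in Python, using zlib with a gzip wrapper (wbits=31) so the stream can be flushed after each file to measure its real size so far. The size limit, the overhead allowance, and the omission of actual tar headers are simplifications of my own, not part of the original answer:

import os
import zlib

LIMIT = 10 * 1024 * 1024      # assumed maximum size of one transmitted file
OVERHEAD = 1024               # rough allowance for tar header + padding per file

def pack_greedy(paths):
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)   # wbits=31 -> gzip framing
    written, chosen = 0, []
    for path in sorted(paths, key=os.path.getsize, reverse=True):
        size = os.path.getsize(path)
        # Worst case: the file is incompressible and grows by a fraction of a
        # percent; if even that still fits in the remaining space, accept it.
        if written + size + size // 200 + OVERHEAD > LIMIT:
            continue
        with open(path, "rb") as f:                  # small files, so read whole
            written += len(comp.compress(f.read()))
        written += len(comp.flush(zlib.Z_FULL_FLUSH))  # checkpoint: exact size so far
        chosen.append(path)
    return chosen, written + len(comp.flush())         # finish the gzip stream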

How to increase memory size of jpg file?

My JPG file size is 5 KB. How can I make it a 15 KB file?
Please go through these links:
http://www.mkyong.com/java/how-to-resize-an-image-in-java/
http://www.java2s.com/Code/Java/2D-Graphics-GUI/Imagesize.htm
JPEG is a compression format, which is why the file is only 5 KB. Save the image as a 32-bit (or higher) bitmap to get the maximum possible size for that image. Also, your question has a lot of scope for improvement: give context, reasons, and the methods you have already tried.
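If Java isn't a requirement, here is a hedged sketch of the same idea using Pillow (a library choice of mine, not from the answers above): re-saving the JPEG as an uncompressed BMP makes the file much larger.

from PIL import Image

img = Image.open("photo.jpg")   # assumed 5 KB input
img.save("photo.bmp")           # uncompressed BMP: roughly width * height * 3 bytes plus headers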

Efficient thumbnail generation of huge pdf file?

In a system I'm working on we're generating thumbnails as part of the workflow.
Sometimes the PDF files are quite large (print size 3 m²) and can contain huge bitmap images.
Are there thumbnail generation capable programs that are optimized for memory footprint handling such large pdf files?
The resulting thumbnail can be png or jpg.
ImageMagick is what I use for all my CLI graphics, so maybe it can work for you:
convert foo.pdf foo-%d.png
For a three-page PDF, this produces three separate PNG files:
foo-0.png
foo-1.png
foo-2.png
To create only one thumbnail, treat the PDF as if it were an array ([0] is the first page, [1] is the second, etc.):
convert foo.pdf[0] foo-thumb.png
Since you're worried about memory, you can restrict memory usage with the -cache option:
-cache threshold: megabytes of memory available to the pixel cache. Image pixels are stored in memory until threshold megabytes of memory have been consumed; subsequent pixel operations are cached on disk. Operations in memory are significantly faster, but if your computer does not have a sufficient amount of free memory you may want to adjust this threshold value.
So to thumbnail a PDF file and resize it, you could run this command, which should have a maximum memory usage of around 20 MB:
convert -cache 20 foo.pdf[0] -resize 10%x10% foo-thumb.png
Or you could use -density to control the resolution at which the PDF is rasterized; it must come before the input file, and a low value (the ImageMagick default is 72 dpi) keeps the rendered image small:
convert -cache 20 -density 36 foo.pdf[0] foo-thumb.png
Should you care? Current affordable servers have 512 GB of RAM. That is enough to hold a full-colour uncompressed bitmap nearly 9 metres (about 350 inches) square at 1200 dpi, comfortably more than the 3 m² print size mentioned in the question. The performance hit you take from using disk is large.
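A quick back-of-the-envelope check of that figure (assuming 3 bytes per pixel and a decimal 512 GB; both assumptions are mine):

ram_bytes = 512e9                        # 512 GB of RAM
pixels = ram_bytes / 3                   # 24-bit uncompressed colour
side_px = pixels ** 0.5                  # square bitmap
side_m = side_px / 1200 * 0.0254         # at 1200 dpi, 1 inch = 0.0254 m
print(f"{side_m:.1f} m per side")        # roughly 8.7 m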