Why does a received ZFS dataset use less space than the original? - backup

I have a dataset on server1 that I want to back up to a second server, server2.
Server1 (original):
zfs list -o name,used,avail,refer,creation,usedds,usedsnap,origin,compression,compressratio,refcompressratio,mounted,atime,lused storage/iscsi/webhost-old produces:
NAME                       USED   AVAIL  REFER  CREATION             USEDDS  USEDSNAP  ORIGIN  COMPRESS  RATIO  REFRATIO  MOUNTED  ATIME  LUSED
storage/iscsi/webhost-old  67,8G  1,87T  67,8G  Út kvě 31 6:54 2016  67,8G   16K       -       lz4       1.00x  1.00x     -        -      67,4G
Sending volume to the 2nd server:
zfs send storage/iscsi/webhost-old | pv | ssh -c arcfour,aes128-gcm@openssh.com root@10.0.0.2 zfs receive -Fduv pool/bkp-storage
received 69,6GB stream in 378 seconds (189MB/sec)
Server2 zfs list produces:
NAME                                USED   AVAIL  REFER  CREATION              USEDDS  USEDSNAP  ORIGIN  COMPRESS  RATIO  REFRATIO  MOUNTED  ATIME  LUSED
pool/bkp-storage/iscsi/webhost-old  36,1G  3,01T  36,1G  Pá pro 29 10:25 2017  36,1G   0         -       lz4       1.15x  1.15x     -        -      28,4G
Why is there such a difference in sizes? Thanks.

From what you posted, I noticed 3 things that seemed odd:
(1) the compressratio is 1.15x on system 2, but 1.00x on system 1
(2) on system 2, used is 1.27x higher than logicalused
(3) the logicalused and the number zfs receive reports are ~2.3x higher on system 1 than on system 2
These terms are all defined in the man page, but are still confusing to reverse-engineer explanations for in practice.
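The ratios in the list above can be reproduced from the posted output. This is only a small sketch using the figures quoted in the question (decimal commas converted to points, units as printed by the tools); the variable names are mine:

```python
# Figures copied from the two `zfs list` outputs and the `zfs receive` message.
system1 = {"used": 67.8, "logicalused": 67.4}
system2 = {"used": 36.1, "logicalused": 28.4}
stream_size = 69.6  # size of the stream reported by zfs receive

# Observation (2): physical vs. logical space on system 2.
used_vs_logical_2 = system2["used"] / system2["logicalused"]

# Observation (3): system 1's logical size and the stream size vs. system 2's logical size.
logical_1_vs_2 = system1["logicalused"] / system2["logicalused"]
stream_vs_logical_2 = stream_size / system2["logicalused"]

print(f"system 2 used / logicalused:        {used_vs_logical_2:.2f}x")
print(f"system 1 / system 2 logicalused:    {logical_1_vs_2:.2f}x")
print(f"stream size / system 2 logicalused: {stream_vs_logical_2:.2f}x")
```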
(1) could happen if you enabled compression on the source dataset after you wrote all the data to it, since ZFS doesn't rewrite the data to compress it when you enable that setting. The data sent by zfs send is uncompressed unless you use -c, but system 2 will try to compress it as it runs zfs receive if the setting is enabled on the destination dataset. If both system 1 and system 2 had the same compression settings before the data was written, they would have the same compressratio as well.
(2) can happen due to metadata written along with your data, but in this case it's too high for "normal" metadata, which accounts for 1-2% of most pools. It's probably caused by a pool-wide setting, like configuring RAID-Z, or a weird combination of striping and mirroring (like 4 stripes, but with one of them being a mirror).
For (3), I re-read the man page to try to figure it out:
logicalused
The amount of space that is "logically" consumed by this dataset and
all its descendents. See the used property. The logical space
ignores the effect of the compression and copies properties, giving a
quantity closer to the amount of data that applications see.
If you were sending a dataset (instead of a single iSCSI volume) and the send size matched system 2's logicalused value (instead of system 1's), I would guess you forgot to send some child datasets (which zfs send -R would have included). However, neither of those is true in this case.
I had to do some additional digging -- this blog post from 2005 might contain the explanation. If system 1 didn't have compression enabled when the data was written (like I guessed above for (1)), the function responsible for not writing zeroed-out blocks (zio_compress_data) would not be run, so you probably have a bunch of empty blocks written to disk, and accounted for in the logicalused size. However, since lz4 is configured on system 2, it would run there, and those blocks would not be counted.
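If you want to test that guess, one way is to count how many blocks of the volume are entirely zero. The sketch below is only an illustration: the function name is made up, the 8 KiB block size is an assumption (check the volblocksize of your zvol), and it reads any binary stream, such as the volume's device node or a raw image of it.

```python
import io

def zero_block_stats(stream, block_size=8192):
    """Return (zero_blocks, total_blocks) for a binary stream read in fixed-size blocks.

    block_size is an assumption; use the volblocksize of the zvol being inspected.
    """
    zero_blocks = 0
    total_blocks = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        total_blocks += 1
        # A block counts as "empty" only if every byte in it is zero.
        if block.count(0) == len(block):
            zero_blocks += 1
    return zero_blocks, total_blocks

# Self-contained demonstration on an in-memory "volume": 3 empty blocks, 1 data block.
demo = io.BytesIO(b"\0" * 8192 * 3 + b"x" * 8192)
zeros, total = zero_block_stats(demo)
print(f"{zeros} of {total} blocks are all zeros")
```

A large share of all-zero blocks on system 1 would be consistent with the explanation above; a small share would point elsewhere.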

Reading and handling many small CSV-s to concatenate one large Dataframe

I have two folders, each containing about 8,000 small CSV files: one with an aggregated size of around 2GB and the other with an aggregated size of around 200GB.
The files are stored like this so that they are easier to update on a daily basis. However, when I conduct EDA, I would like them to be assigned to a single variable. For example:
import os
import pandas as pd
path = "some random path"
df = pd.concat([pd.read_csv(os.path.join(path, name)) for name in os.listdir(path)])
Reading the 2GB dataset takes much less time on my local machine than on the supercomputer cluster, and it is impossible to read the 200GB dataset on the local machine without some kind of scaled-out Pandas solution. The situation does not seem to improve on the cluster, even with popular open-source tools like Dask and Modin.
Is there an effective way to read those CSV files in this situation?
Q : "Is there an effective way to read those CSV files ... ?"
A :Oh, sure, there is :
The CSV format ( standardisation attempts in RFC 4180 ) is not unambiguous and is not obeyed under all circumstances ( commas inside fields, header present or not ), so some caution & care is needed here. Given that you are your own data curator, you should be able to decide on plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
Speaking of efficiency, O/S coreutils provide the most stable, proven and efficient tools ( as system tools have always been ) for the phase of merging the content of thousands and thousands of plain CSV-files :
###################### if need be,
###################### use an in-place remove of all CSV-file headers first :
for F in *.csv; do sed -i '1d' "$F"; done
This helps in case we cannot avoid headers on the CSV-exporter side. It works like this :
(base):~$ cat ?.csv
HEADER
1
2
3
HEADER
4
5
6
HEADER
7
8
9
(base):~$ for i in $( ls ?.csv ); do sed -i '1d' $i; done
(base):~$ cat ?.csv
1
2
3
4
5
6
7
8
9
Now, the merging phase :
###################### join
cat *.csv > ___all_CSVs_JOINED.csv
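For readers who would rather stay in Python, the two shell steps above ( drop the header, then join ) can be sketched as one streaming pass. This is not the coreutils approach itself, only a rough equivalent: the function name and the output file name in the usage line are made up, and unlike sed -i it leaves the source files untouched. It also assumes every file ends with a newline, otherwise the last row of one file and the first row of the next would be glued together.

```python
import glob
import os
import shutil

def join_csvs(src_dir, out_path, skip_header=True):
    """Append every *.csv in src_dir to out_path, optionally dropping each file's first line.

    Returns the number of files that were joined.
    """
    out_abs = os.path.abspath(out_path)
    # Sort for a reproducible order and never read the output file as an input.
    paths = sorted(
        p for p in glob.glob(os.path.join(src_dir, "*.csv"))
        if os.path.abspath(p) != out_abs
    )
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                if skip_header:
                    src.readline()            # discard the header line
                shutil.copyfileobj(src, out)  # stream the rest without loading it into RAM
    return len(paths)
```

Usage would look like join_csvs("<_folder_1_>", "joined_output.txt"), run once per folder.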
Given the nature of the said file storage policy, performance can be boosted by using more processes that handle the small files and the large files separately and independently, as defined above, with the logic put inside a pair of conversion_script_?.sh shell-scripts :
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
The removal of the CSV-headers is a "just"-[CONCURRENT] flow of processing, but the joining is pure-[SERIAL] ( for larger numbers of files, it might become interesting to use a multi-staged tree of trees - several stages of [SERIAL]-collections of [CONCURRENT]-ly pre-processed leaves - yet for just 8000 files, and not knowing the actual file-system details, the latency-masking from a just-[CONCURRENT] processing of both directories independently will be fine to start with ).
Last but not least, the final pair of ___all_CSVs_JOINED.csv files are safe to open in a way that prevents moving all disk-stored data into RAM at once ( using a chunk-size-fused file-reading iterator, avoiding RAM-spillovers by using the mmap-ed mode as a context manager ) :
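import pandas
# SAFE_CHUNK_SIZE : number of rows per chunk, to be set so that one chunk fits in RAM
with pandas.read_csv( "<_folder_1_>//___all_CSVs_JOINED.csv",
                      header     = None,             # only if the CSV-headers were removed above
                      chunksize  = SAFE_CHUNK_SIZE,  # returns an iterator of DataFrames
                      memory_map = True,             # map the file into memory instead of buffered reads
                      # ... other read_csv() options left at their defaults
                      ) \
  as df_reader_MMAPer_CtxMGR:
    for df_chunk in df_reader_MMAPer_CtxMGR:
        ...                                          # process one chunk at a time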
When tweaking for ultimate performance, details matter and depend on physical hardware bottlenecks ( disk-I/O-wise, filesystem-wise, RAM-I/O-wise ), so due care may bring further improvements in minimising the repeatedly performed end-to-end processing times ( sometimes even by turning the data into a compressed/zipped form, in cases where CPU/RAM resources permit a sufficient performance advantage over the limited disk-I/O throughput - moving fewer bytes is so much faster that the CPU/RAM-decompression costs are still lower than those of moving 200+ [GB] of uncompressed plain-text data ).
Details matter: tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice to post your progress on testing the performance:
end-2-end duration of strategy ... [s] AS-IS now
end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.

Log file size calculated using len(_raw) in Splunk does not match even close to the actual file size on the host?

I am using a Splunk query to calculate the size of the log files sent to Splunk. This is the Splunk query I have used:
index="<my_index>" path="/<my_path>/<my_log_file>"
| eval raw_len=len(_raw)
| eval raw_len_kb = raw_len/1024
| eval raw_len_mb = raw_len/1024/1024
| eval raw_len_gb = raw_len/1024/1024/1024
| stats sum(raw_len) as Bytes sum(raw_len_kb) as KB sum(raw_len_mb) as MB sum(raw_len_gb) as GB by source
| addcoltotals
Splunk reports the size as 17 GB. On the other hand, when I do this on the Unix host:
ls -l /<my_path>/<my_log_file>
the value is just a few MB.
Any idea why there is so much difference?
One should not expect the size of data indexed in Splunk to exactly match the size reported by an OS. This is because Splunk by default removes line ends and because the len function counts characters rather than bytes.
Also, the query shown does not account for multiple hosts sending data to Splunk. There's no time window indicated, so we don't know whether the file may have been truncated at some point while Splunk still retains all of the data the file ever held.
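The first point (characters vs. bytes, and stripped line ends) can be illustrated outside Splunk. This is a plain Python sketch with a made-up log line, assuming the file is UTF-8 encoded; it does not call any Splunk API:

```python
# A made-up log line as it would sit in the file, including its CRLF line end.
raw_line = "user=José status=déjà vu\r\n"

# What the OS counts: bytes on disk (UTF-8 uses 2 bytes for each accented letter here).
on_disk_bytes = len(raw_line.encode("utf-8"))

# Roughly what a character count over the stored event sees:
# the line end is gone and every character counts as 1.
event_text = raw_line.rstrip("\r\n")
event_chars = len(event_text)

print(f"bytes on disk:       {on_disk_bytes}")
print(f"characters in event: {event_chars}")
```

Note that both effects make the character count smaller than the byte count, so on their own they cannot explain a total that is far larger than the file; that part has to come from the causes in the second paragraph (several hosts, or data retained from before the file was truncated).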

How to restore Virtualbox ? lost last two months of work

https://forums.virtualbox.org/viewtopic.php?f=7&t=90893
Hello, I'm desperate and need help because I have lost about two months of work in my Windows 10 guest system.
Everything worked smoothly until I needed more free space (although I have a dynamic disk). So I followed some tutorials and made some changes:
1 - I have the original, almost full disk in: /Maquinas VirtualBox/Clientes Windows/Windows 10/Windows10-disk1.vmdk
2 - I made a copy on an external USB device.
3 - Converted it to VDI: VBoxManage clonehd /media/eduardo/Seagate\ Backup\ Plus\ Drive/Windows10-disk1.vmdk /media/eduardo/Seagate\ Backup\ Plus\ Drive/Windows10-disk.vdi --format vdi
4 - Tried to resize the disk (from 80 GB to 100 GB): VBoxManage modifyhd "/media/eduardo/Seagate Backup Plus Drive/Windows10-disk1.vmdk" --resize 100000 and VBoxManage modifymedium disk "/media/eduardo/Seagate Backup Plus Drive/Windows10-disk1.vmdk" --resize 100000 (I think this could be an error, as I should have changed the size of the VDI file).
5 - Then I had to change the UUID (because a "UUID in use" error arose): VBoxManage internalcommands sethduuid "/media/eduardo/Seagate Backup Plus Drive/Windows10-disk1.vmdk"
6 - Then I went back to: VBoxManage clonehd "/media/eduardo/Seagate Backup Plus Drive/Windows10-disk1.vmdk" " " --format vdi
and resized: VBoxManage modifymedium disk "/media/eduardo/Seagate Backup Plus Drive/Windows10-disk.vdi" --resize 120000
I tried switching my virtual machine to the new VDI file to test whether everything was fine (changing the disk connection from /Maquinas VirtualBox/Clientes Windows/Windows 10/Windows10-disk1.vmdk to the new /media/eduardo/Seagate Backup Plus Drive/Windows10-disk.vdi). But I noticed that the system had gone back to its state of two months ago!
I was not worried and decided to go back to my "untouched" VMDK, but the strangest thing is that the original "untouched" file, /Maquinas VirtualBox/Clientes Windows/Windows 10/Windows10-disk1.vmdk, also boots with the files and state of about two months ago. So I'm quite nervous.
[Screenshot: Selección_058.png]
Looking at the files, the 6c***** one has to be the "good state", as it was modified last night. Here is my file manager:
[Screenshot: Selección_059.png]
Here is my VM (I made a snapshot about two months ago; I don't remember exactly when):
https://imagebin.ca/v/4QlKV3Equ1fW
My log:
https://pastebin.com/JSLFRNMs
I hope somebody can help...
I think the key is to somehow return to the 6c**** state of my VMDK file. I don't understand how this VMDK got changed, as it was not touched.
Thanks in advance
The problem was solved. It had nothing to do with resizing disks. I selected the { 6cc3c***-*****} hard disk (although it was "only" 47 GB) and, to my surprise, it loaded its 47 GB "snapshot" part together with the whole disk windows10-disk1.vmdk....
Sorry for my bad English, it's difficult to explain: in the Storage section of the virtual machine's settings, select the 6cc***** disk as the main disk and start/boot the VM.
Once it was loaded and working fine, I deleted the snapshot (to merge everything into the present state) and then made another snapshot as a backup.
Thanks

required size of a configuration file for a HX1K (in "SPI slave" mode)

I am reworking the programmer for the Olimex iCE40HX1K board (targeted towards a STM32F103 ma), where I would also like to implement the "SPI Slave" mode to load an image directly into RAM without using the serial flash.
Looking at the Lattice "Programming and Configuration Guide" (page 11), Table 8 notes that an EPROM for an iCE40-LP/HX1K must be at least 34112 bytes (which, I guess, means that the configuration files can be up to that size).
However, all images I have created so far with the icestorm tools are 32220 octets.
I am a bit puzzled here.
Can somebody explain the difference between these two figures?
Does the HX1K need a configuration-file of 32220 or 34112 bytes?
I don't know how Lattice arrived at this number. A complete HX1K bin file with BRAM initialization but without comment and without multiboot header is 32220 bytes in size. The (optional) multiboot header would add another 160 bytes (32220 + 160 = 32380). The lattice tools usually add about 80 bytes to the comment field (32220 + 80 = 32300). Whatever I do, all numbers I have are more than 1000 short of 34112.
I don't know if there is a maximum length for the comment. Maybe there is and 34112 is the size of a bit stream with a comment of maximum length?
34112 - 32220 = 1892. Maybe someone decided to add 8kB (8192 bytes) just in case, but that person accidentally swapped the first two digits? Idk..
If you don't care about comments or multiboot headers, then iCE40 1K bit-streams have a fixed size, and that size is 32220 bytes.
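For reference, the arithmetic above in one place. This is just a sketch that restates the numbers from this answer (the 80-byte comment is the approximate figure mentioned above, and the constant names are mine); nothing here comes from the Lattice documentation beyond the 34112 figure quoted in the question:

```python
BARE_BITSTREAM = 32220    # HX1K .bin with BRAM init, no comment, no multiboot header
MULTIBOOT_HEADER = 160    # optional
TYPICAL_COMMENT = 80      # approximate size of the comment added by the Lattice tools
TABLE_VALUE = 34112       # minimum EPROM size quoted from the Lattice guide

candidates = {
    "bare bitstream": BARE_BITSTREAM,
    "with multiboot header": BARE_BITSTREAM + MULTIBOOT_HEADER,
    "with typical comment": BARE_BITSTREAM + TYPICAL_COMMENT,
}

# Print each candidate size and how far it falls short of the table value.
for name, size in candidates.items():
    print(f"{name:22s} {size:6d} bytes, {TABLE_VALUE - size} short of {TABLE_VALUE}")
```

If you implement the SPI slave loader, comparing the length of your .bin file against 32220 before sending it is a cheap sanity check.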

Content of the fsimage hdfs

I have a question about what the metadata in the fsimage actually contains. I read that all mutations to the file system namespace, such as file renames, permission changes, file creations and block allocations, are recorded in the fsimage. But does that include the block location data as well?
Does it also contain the information about where (on which datanode) the blocks are stored?
I gather from this source: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/ that the metadata on where blocks are stored is built from the block reports of the datanodes.
Is this true? So the fsimage does not contain information about the block locations?
The Namenode maintains two types of data:
Block location data: since files are chopped into blocks, the NN needs to know which piece is where. This data is kept in memory and never persisted to disk; the DNs talk to the NN periodically and share their block reports.
File system metadata: such as the file system hierarchy, permissions, etc. This info is persisted to disk.
When the namenode starts up, it loads a "snapshot" of the filesystem from the fsimage and applies the edit logs from edits onto it; after this process we get a new snapshot. From this point on, the namenode can accept file system requests from clients / DNs.
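The split described above can be pictured with a toy model. To be clear, this is not Hadoop code: the class, its method names and the JSON "fsimage" are invented for illustration (the real fsimage is a binary file and the edit log is left out here). It only shows that the namespace, including which block IDs belong to a file, survives a restart, while the block-to-datanode map has to be rebuilt from block reports.

```python
import json

class ToyNameNode:
    """Toy model: the namespace is persisted, block locations live in memory only."""

    def __init__(self):
        self.namespace = {}        # file path -> list of block IDs (goes into the "fsimage")
        self.block_locations = {}  # block ID -> set of datanode names (never persisted)

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def save_fsimage(self):
        # Only the namespace is written out.
        return json.dumps(self.namespace)

    @classmethod
    def load_fsimage(cls, image):
        # A restarted namenode knows the files and their block IDs, but no locations yet.
        namenode = cls()
        namenode.namespace = json.loads(image)
        return namenode

    def receive_block_report(self, datanode, block_ids):
        # Locations are learned from what each datanode reports.
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

# Write a file, "restart" the namenode, then let two datanodes report in.
before = ToyNameNode()
before.add_file("/logs/a.txt", ["blk_1", "blk_2"])
image = before.save_fsimage()

after = ToyNameNode.load_fsimage(image)
print(after.namespace)         # the file and its block IDs are back
print(after.block_locations)   # still empty right after start-up

after.receive_block_report("dn1", ["blk_1", "blk_2"])
after.receive_block_report("dn2", ["blk_1"])
print(sorted(after.block_locations["blk_1"]))
```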
Yes, as far as I know the fsimage does not contain any information about block locations. This information is held by the data nodes; the Namenode gets it from the datanodes when it starts up.
Hadoop provides a tool that converts the fsimage file into human readable formats. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html
Sample output:
bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
FSImage
  ImageVersion = -19
  NamespaceID = 2109123098
  GenerationStamp = 1003
  INodes [NumInodes = 12]
    Inode
      INodePath =
      Replication = 0
      ModificationTime = 2009-03-16 14:16
      AccessTime = 1969-12-31 16:00
      BlockSize = 0
      Blocks [NumBlocks = -1]
      NSQuota = 2147483647
      DSQuota = -1
      Permissions
        Username = theuser
        GroupName = supergroup
        PermString = rwxr-xr-x
...remaining output omitted...