Rsync - destination is smaller, even though files are the same - ssh

I am running a regular rsync between a local folder and a remote one via ssh. I was confused to see that the remote (target) folder had a different size: it was smaller. I first suspected excluded files, but that wasn't the case. Instead I discovered the following.
The sizes in a local folder (a subfolder of the one I am syncing) look like this:
112K .
48K ./workspace.xml
12K ./vcs.xml
12K ./preferred-vcs.xml
12K ./pm-client.iml
12K ./modules.xml
12K ./misc.xml
The remote ones, however, look like this:
64K .
40K ./workspace.xml
4,0K ./vcs.xml
4,0K ./preferred-vcs.xml
4,0K ./pm-client.iml
4,0K ./modules.xml
4,0K ./misc.xml
When I check the file contents, however, they look just the same. I see this a lot in the target folder, which ultimately leads to big differences in folder sizes.
The rsync I am running looks like this:
rsync -aPEh -e ssh --delete --delete-excluded --stats --exclude-from=<some-ignorelist> /source/folder/ /target/backup/folder
What can be the reason for this?

The sizes that du and ls report are different: du reports the amount of space actually allocated on the filesystem, while ls reports the logical file size.
There are several questions on various StackExchange sites about this.
Why does du report different sizes on your two machines? Because they either use different filesystems or are configured differently. It all boils down to the block size used on each filesystem: a file occupies whole blocks, so the allocated space that du reports is the logical size rounded up to a multiple of the block size.
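The difference can be checked per file. Below is a minimal Python sketch, assuming a Unix-like system where os.stat exposes st_blocks (counted in 512-byte units). The helper names logical_and_allocated and round_up_to_block are made up for illustration, and the block sizes in the demo are hypothetical, not measured from the machines in the question.

```python
import os


def logical_and_allocated(path):
    """Return (logical size, allocated size) in bytes for one file."""
    st = os.stat(path)
    # st_size is the logical size that `ls -l` shows; st_blocks is counted
    # in 512-byte units and is what `du` adds up (Unix-like systems only).
    return st.st_size, st.st_blocks * 512


def round_up_to_block(logical_size, block_size):
    """Space a file occupies if it is stored in whole blocks (simplified)."""
    blocks = -(-logical_size // block_size)  # ceiling division
    return blocks * block_size


if __name__ == "__main__":
    # The same 1500-byte file under two hypothetical block sizes.
    print(round_up_to_block(1500, 4 * 1024))   # 4096
    print(round_up_to_block(1500, 12 * 1024))  # 12288
```

Running logical_and_allocated on the same file on both machines should show identical logical sizes and differing allocated sizes if block size is indeed the cause.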

Using PDFtk to Update Web Server Files in Many Directories

Long-time reader, first-time poster. I'm trying to automate a process that takes many .PDF floorplan files and combines them into a single .PDF floorplan which will be referenced by a website.
To cut down on manual cut-and-paste from network shares to a web server as is current practice, I've written a PowerShell command as follows:
$SourcePath = '\\network\share\location\CAD Miniatures'
$DestinationPath = 'C:\inetpub\wwwroot\floorplans'
$LogFile = 'C:\Floorplan Transfer Logs\TransferLog.txt'
Robocopy $SourcePath $DestinationPath *.pdf /E /MIR /ZB /DCOPY:DAT /R:5 /W:10 /LOG+:$LogFile
My plan is to have this script run every hour as a Scheduled Task to mirror our local files and web files to ensure they remain up-to-date automatically.
The curve ball is that the files being copied are individual files within directories. I would like to take all .pdf files in a given folder and combine them into a single .pdf.
File structure is as such:
/floorplans
    /ABC
        /ABC-01.pdf
        /ABC-02.pdf
        /ABC-03.pdf
    /XYZ
        /XYZ-01.pdf
        /XYZ-02.pdf
        /XYZ-03.pdf
        /XYZ-04.pdf
        /XYZ-05.pdf
        /XYZ-06.pdf
Within each directory (or in a subdirectory), I would like the combined output file to be simply abc.pdf and xyz.pdf, as per the examples above.
The file naming always follows the same format, but the number of files varies from a single file to over a dozen.
I would like to run the Robocopy and PDFtk tasks in the same script if possible (the idea to update all files, and combine them together). There would also be no need to merge files in which no updates have been detected.

Recursive rsync over ssh, include only one file extension

I'm trying to rsync files over ssh from a server to my machine. Files are in various subdirectories, but I only want to keep the ones that match a certain pattern (e.g. blah.txt). I have done extensive googling and searching on Stack Overflow, and I've tried just about every permutation of --include and --exclude that has been suggested. No matter what I try, rsync grabs all files.
Just as an example of one of my attempts, I have used:
rsync -avze 'ssh' --include='*blah*.txt' --exclude='*' myusername@myserver.com:/path/top/files/directory /path/to/local/directory
To troubleshoot, I tried this command:
rsync -avze 'ssh' --exclude='*' myusername@myserver.com:/path/top/files/directory /path/to/local/directory
expecting it to not copy anything, but it still grabbed all of the files.
I am using rsync version 2.6.9 on OSX.
Is there something obvious I'm missing? I've been struggling with this for quite a while.
I was able to find a solution, with a caveat. Here is the working command:
rsync -vre 'ssh' --prune-empty-dirs --include='*/' --include='*blah*.txt' --exclude='*' user@server.com:/path/to/server/files /path/to/local/files
However! If I type this into my command line directly, it works. If I save it to a file, myfile.txt, and I try `cat myfile.txt` it no longer works! This makes no sense to me.
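The role of --include='*/' in the working command can be illustrated with a deliberately simplified model of rsync's filter logic: rules are checked in order, the first match wins, and an excluded directory is never descended into. The Python sketch below is not rsync's actual matcher. It uses fnmatch on each path component and ignores anchored patterns, `**`, and per-directory rules; the function names are made up for illustration.

```python
from fnmatch import fnmatchcase


def first_match(name, is_dir, rules):
    """Return '+' or '-' for one path component; the first matching rule wins."""
    for action, pattern in rules:
        if pattern.endswith("/"):
            # A trailing slash restricts the rule to directories.
            if not is_dir:
                continue
            pattern = pattern[:-1]
        if fnmatchcase(name, pattern):
            return action
    return "+"  # nothing matched: included by default


def is_transferred(path, rules):
    """True if no component of the path is excluded on the way down."""
    parts = path.split("/")
    for i, part in enumerate(parts):
        is_dir = i < len(parts) - 1
        if first_match(part, is_dir, rules) == "-":
            return False  # an excluded directory hides everything below it
    return True


if __name__ == "__main__":
    without_dirs = [("+", "*blah*.txt"), ("-", "*")]
    with_dirs = [("+", "*/"), ("+", "*blah*.txt"), ("-", "*")]
    print(is_transferred("sub/blah.txt", without_dirs))  # False
    print(is_transferred("sub/blah.txt", with_dirs))     # True
```

In this model, without the '*/' rule the directory "sub" itself falls through to --exclude='*', so the matching file inside it is never considered.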
OSX follows BSD style rsync
https://www.freebsd.org/cgi/man.cgi?query=rsync&apropos=0&sektion=0&manpath=FreeBSD+8.0-RELEASE+and+Ports&format=html
-C, --cvs-exclude
This is a useful shorthand for excluding a broad range of files
that you often don't want to transfer between systems. It uses a
similar algorithm to CVS to determine if a file should be
ignored.
The exclude list is initialized to exclude the following items
(these initial items are marked as perishable -- see the FILTER
RULES section):
RCS SCCS CVS CVS.adm RCSLOG cvslog.* tags TAGS
.make.state .nse_depinfo *~ #* .#* ,* _$* *$ *.old *.bak
*.BAK *.orig *.rej .del-* *.a *.olb *.o *.obj *.so *.exe
*.Z *.elc *.ln core .svn/ .git/ .bzr/
then, files listed in a $HOME/.cvsignore are added to the list
and any files listed in the CVSIGNORE environment variable (all
cvsignore names are delimited by whitespace).
Finally, any file is ignored if it is in the same directory as a
.cvsignore file and matches one of the patterns listed therein.
Unlike rsync's filter/exclude files, these patterns are split on
whitespace. See the cvs(1) manual for more information.
If you're combining -C with your own --filter rules, you should
note that these CVS excludes are appended at the end of your own
rules, regardless of where the -C was placed on the command-line.
This makes them a lower priority than any rules you specified
explicitly. If you want to control where these CVS
excludes get inserted into your filter rules, you should omit
the -C as a command-line option and use a combination of
--filter=:C and --filter=-C (either on your command-line or by
putting the ":C" and "-C" rules into a filter file with your
other rules). The first option turns on the per-directory scanning
for the .cvsignore file. The second option does a one-time
import of the CVS excludes mentioned above.
-f, --filter=RULE
This option allows you to add rules to selectively exclude certain
files from the list of files to be transferred. This is
most useful in combination with a recursive transfer.
You may use as many --filter options on the command line as you
like to build up the list of files to exclude. If the filter
contains whitespace, be sure to quote it so that the shell gives
the rule to rsync as a single argument. The text below also
mentions that you can use an underscore to replace the space
that separates a rule from its arg.
See the FILTER RULES section for detailed information on this
option.

How to split sql in MAC OSX?

Is there an app or a script for Mac to split SQL files?
I have a large file that I have to upload to a host that doesn't support files over 8 MB.
Note: I don't have SSH access.
You can use this : http://www.ozerov.de/bigdump/
Or
Use this command to split the sql file
split -l 5000 ./path/to/mysqldump.sql ./mysqldump/dbpart-
The split command takes a file and breaks it into multiple files. The -l 5000 part tells it to split the file every five thousand lines. The next bit is the path to your file, and the next part is the path you want to save the output to. Files will be saved as whatever filename you specify (e.g. “dbpart-”) with an alphabetical letter combination appended.
Now you should be able to import your files one at a time through phpMyAdmin without issue.
More info http://www.webmaster-source.com/2011/09/26/how-to-import-a-very-large-sql-dump-with-phpmyadmin/
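For readers who want to see what split -l does, or who need the same result on a machine without the split command, here is a rough Python sketch of the same idea. The function name split_by_lines is made up for illustration, and the sketch only mimics the default two-letter suffixes (aa, ab, ...), so it stops working beyond 676 output files.

```python
import itertools
import string


def split_by_lines(source_path, prefix, lines_per_file=5000):
    """Write consecutive chunks of source_path to prefix + 'aa', 'ab', ..."""
    suffixes = ("".join(pair) for pair in
                itertools.product(string.ascii_lowercase, repeat=2))
    written = []
    # Binary mode keeps the bytes of the dump untouched, whatever its encoding.
    with open(source_path, "rb") as src:
        while True:
            chunk = list(itertools.islice(src, lines_per_file))
            if not chunk:
                break
            out_path = prefix + next(suffixes)
            with open(out_path, "wb") as out:
                out.writelines(chunk)
            written.append(out_path)
    return written
```

Like split -l, this cuts purely on line boundaries, so it only yields importable pieces when every SQL statement sits on its own line.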
This tool should do the trick: MySQLDumpSplitter
It's free and open source.
Unlike the accepted answer to this question, this app will always keep extended inserts intact so the precise form of your query doesn't matter; the resulting files will always have valid SQL syntax.
Full disclosure: I am a shareholder of the company that hosts this program.
The UploadDir feature in phpMyAdmin could help you, if you have FTP access and can modify your phpMyAdmin's configuration (or are allowed to install your own instance of phpMyAdmin).
http://docs.phpmyadmin.net/en/latest/config.html?highlight=uploaddir#cfg_UploadDir
You can split into working SQL statements with:
csplit -s -f db-part db.sql "/^# Dump of table/" "{99}"
Which makes up to 99 files named 'db-part[n]' from db.sql
You can use "CREATE TABLE" or "INSERT INTO" instead of "# Dump of ..."
Also: Avoid installing any programs or uploading your data into any online service. You don't know what will be done with your information!

finding a corrupted part from the parts of a split archive

I have 7 files with extensions xyz.rar.001 through xyz.rar.007; clearly they are parts of a single file, and I have all 7 parts. I join them with a file joiner into a single file xyz.rar and try to extract it with WinRAR, which says that the archive is corrupted. Evidently one or two parts are corrupted. Is there any way to find out which ones? I don't want to download all of them again. Note: WinRAR can detect a corrupt part if the parts were split using WinRAR (with names like part1.rar, part2.rar, etc.), but not if they are named rar.001.
Parts .001 - .006 should have the same size. Check if there is a file with a different byte size.
Are there multiple files in the RAR or just the one? With multiple you could run a Test and see which is the first file to fail.
I think it's strange that a second tool (e.g. HJSplit) was used to split the RAR archive up. This makes me think that .002 could be a RAR archive too. Try opening xyz.rar.001 with WinRAR and test/extract it. It happens quite often that RAR archives have the extension .001 instead of .rar. An example.
Naming your archives in WinRAR like this can be accomplished by putting "xyz.rar.001" as Archive name on the General tab and checking "Old style volume names" on the Advanced tab.
If I then join the files with HJSplit, I get one .rar file (that is corrupt). When I Test it, it says "Next volume is required". In the diagnostic messages I can see "The required volume is absent" and "CRC failed in X. The file is corrupt"
If there is one file stored inside the RAR and the RAR is indeed just chopped up into 7 pieces, there is no way of telling without additional files such as .sfv or .par2. (unless the RAR does not use compression: you can parse the underlying file for errors and calculate the part where it goes wrong)
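The two checks mentioned in these answers, comparing part sizes and comparing checksums against an .sfv file, can be scripted. The Python sketch below is only an illustration with made-up function names. The CRC comparison helps only if the uploader published an .sfv file (which lists a CRC-32 per part); without one there is nothing to compare against.

```python
import zlib
from collections import Counter


def odd_sized_parts(parts):
    """parts: list of (name, size) in order. Every part except the last one
    should have the same size; return the names of those that do not."""
    body = parts[:-1]
    if not body:
        return []
    expected = Counter(size for _, size in body).most_common(1)[0][0]
    return [name for name, size in body if size != expected]


def crc32_of_file(path, chunk_size=1 << 20):
    """CRC-32 of a file as 8 uppercase hex digits, the form used in .sfv files."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08X}"
```

A part whose size differs from the others, or whose CRC-32 differs from the value listed in the .sfv, is the one to download again.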

What does f+++++++++ mean in rsync logs?

I'm using rsync to make a backup of my server files, and I have two questions:
1. In the middle of the process I need to stop and start rsync again. Will rsync start from the point where it stopped, or will it restart from the beginning?
2. In the log files I see "f+++++++++". What does it mean?
e.g.:
2010/12/21 08:28:37 [4537] >f.st...... iddd/logs/website-production-access_log
2010/12/21 08:29:11 [4537] >f.st...... iddd/web/website/production/shared/log/production.log
2010/12/21 08:29:14 [4537] .d..t...... iddd/web/website/production/shared/sessions/
2010/12/21 08:29:14 [4537] >f+++++++++ iddd/web/website/production/shared/sessions/ruby_sess.017a771cc19b18cd
2010/12/21 08:29:14 [4537] >f+++++++++ iddd/web/website/production/shared/sessions/ruby_sess.01eade9d317ca79a
Let's take a look at how rsync works and better understand the cryptic result lines:
1 - A huge advantage of rsync is that after an interruption, the next run continues smoothly.
The next rsync invocation will not re-transfer files that it has already transferred, provided they were not changed in the meantime. It will, however, check all the files again from the beginning, because it is not aware that it was interrupted.
2 - Each character is a code that can be translated if you read the section for -i, --itemize-changes in man rsync
Decoding your example log file from the question:
>f.st......
    > - the item is received
    f - it is a regular file
    s - the file size is different
    t - the time stamp is different
.d..t......
    . - the item is not being updated (though it might have attributes that are being modified)
    d - it is a directory
    t - the time stamp is different
>f+++++++++
    > - the item is received
    f - a regular file
    +++++++++ - this is a newly created item
The relevant part of the rsync man page:
-i, --itemize-changes
Requests a simple itemized list of the changes that are being made to
each file, including attribute changes. This is exactly the same as
specifying --out-format='%i %n%L'. If you repeat the option, unchanged
files will also be output, but only if the receiving rsync is at least
version 2.6.7 (you can use -vv with older versions of rsync, but that
also turns on the output of other verbose messages).
The "%i" escape has a cryptic output that is 11 letters long. The
general format is like the string YXcstpoguax, where Y is replaced by
the type of update being done, X is replaced by the file-type, and the
other letters represent attributes that may be output if they are
being modified.
The update types that replace the Y are as follows:
A < means that a file is being transferred to the remote host (sent).
A > means that a file is being transferred to the local host (received).
A c means that a local change/creation is occurring for the item (such as the creation of a directory or the changing of a symlink,
etc.).
A h means that the item is a hard link to another item (requires --hard-links).
A . means that the item is not being updated (though it might have attributes that are being modified).
A * means that the rest of the itemized-output area contains a message (e.g. "deleting").
The file-types that replace the X are: f for a file, a d for a
directory, an L for a symlink, a D for a device, and a S for a
special file (e.g. named sockets and fifos).
The other letters in the string above are the actual letters that will
be output if the associated attribute for the item is being updated or
a "." for no change. Three exceptions to this are: (1) a newly created
item replaces each letter with a "+", (2) an identical item replaces
the dots with spaces, and (3) an unknown attribute replaces each
letter with a "?" (this can happen when talking to an older rsync).
The attribute that is associated with each letter is as follows:
A c means either that a regular file has a different checksum (requires --checksum) or that a symlink, device, or special file has a
changed value. Note that if you are sending files to an rsync prior to
3.0.1, this change flag will be present only for checksum-differing regular files.
A s means the size of a regular file is different and will be updated by the file transfer.
A t means the modification time is different and is being updated to the sender’s value (requires --times). An alternate value of T
means that the modification time will be set to the transfer time,
which happens when a file/symlink/device is updated without --times
and when a symlink is changed and the receiver can’t set its time.
(Note: when using an rsync 3.0.0 client, you might see the s flag
combined with t instead of the proper T flag for this time-setting
failure.)
A p means the permissions are different and are being updated to the sender’s value (requires --perms).
An o means the owner is different and is being updated to the sender’s value (requires --owner and super-user privileges).
A g means the group is different and is being updated to the sender’s value (requires --group and the authority to set the group).
The u slot is reserved for future use.
The a means that the ACL information changed.
The x means that the extended attribute information changed.
One other output is possible: when deleting files, the "%i" will
output the string "*deleting" for each item that is being removed
(assuming that you are talking to a recent enough rsync that it logs
deletions instead of outputting them as a verbose message).
Some time back, I needed to understand the rsync output for a script that I was writing. While writing that script I googled around and came across what @mit had written above. I used that information, as well as documentation from other sources, to create my own primer on the bit flags and on how to get rsync to output bit flags for all actions (it does not do this by default).
I am posting that information here in the hope that it helps others who (like me) stumble upon this page via search and need a better explanation of rsync.
With the combination of the --itemize-changes flag and the -vvv flag, rsync gives us detailed output of all file system changes that were identified in the source directory when compared to the target directory. The bit flags produced by rsync can then be decoded to determine what changed. To decode each bit's meaning, use the following table.
Explanation of each bit position and value in rsync's output:
YXcstpoguax path/to/file
|||||||||||
||||||||||╰- x: The extended attribute information changed
|||||||||╰-- a: The ACL information changed
||||||||╰--- u: The u slot is reserved for future use
|||||||╰---- g: Group is different
||||||╰----- o: Owner is different
|||||╰------ p: Permissions are different
||||╰------- t: Modification time is different
|||╰-------- s: Size is different
||╰--------- c: Different checksum (for regular files), or
||              changed value (for symlinks, devices, and special files)
|╰---------- the file type:
|            f: for a file,
|            d: for a directory,
|            L: for a symlink,
|            D: for a device,
|            S: for a special file (e.g. named sockets and fifos)
╰----------- the type of update being done:
             <: file is being transferred to the remote host (sent)
             >: file is being transferred to the local host (received)
             c: local change/creation for the item, such as:
                - the creation of a directory
                - the changing of a symlink,
                - etc.
             h: the item is a hard link to another item (requires
                --hard-links).
             .: the item is not being updated (though it might have
                attributes that are being modified)
             *: means that the rest of the itemized-output area contains
                a message (e.g. "deleting")
Some example output from rsync for various scenarios:
>f+++++++++ some/dir/new-file.txt
.f....og..x some/dir/existing-file-with-changed-owner-and-group.txt
.f........x some/dir/existing-file-with-changed-unnamed-attribute.txt
>f...p....x some/dir/existing-file-with-changed-permissions.txt
>f..t..g..x some/dir/existing-file-with-changed-time-and-group.txt
>f.s......x some/dir/existing-file-with-changed-size.txt
>f.st.....x some/dir/existing-file-with-changed-size-and-time-stamp.txt
cd+++++++++ some/dir/new-directory/
.d....og... some/dir/existing-directory-with-changed-owner-and-group/
.d..t...... some/dir/existing-directory-with-different-time-stamp/
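The table above can also be turned into a small decoder. This is a minimal Python sketch based only on the letters listed in the man page excerpt quoted earlier; the function name decode_itemized and the wording of the labels are my own, and the reserved u slot is deliberately left out.

```python
UPDATE_TYPES = {
    "<": "sent", ">": "received", "c": "local change/creation",
    "h": "hard link", ".": "not updated",
}
FILE_TYPES = {
    "f": "file", "d": "directory", "L": "symlink",
    "D": "device", "S": "special file",
}
ATTRIBUTES = {
    "c": "checksum/value", "s": "size", "t": "modification time",
    "T": "time set to transfer time", "p": "permissions",
    "o": "owner", "g": "group", "a": "ACL", "x": "extended attributes",
}


def decode_itemized(flags):
    """Decode an 11-character %i string such as '>f.st......'."""
    if flags.startswith("*"):
        # '*' means the rest of the field is a message, e.g. '*deleting'.
        return {"update": "message", "message": flags[1:].strip()}
    attrs = flags[2:]
    is_new = bool(attrs) and set(attrs) == {"+"}
    changes = [] if is_new else [ATTRIBUTES[ch] for ch in attrs if ch in ATTRIBUTES]
    return {
        "update": UPDATE_TYPES.get(flags[0], "unknown"),
        "type": FILE_TYPES.get(flags[1], "unknown"),
        "new": is_new,
        "changes": changes,
    }


if __name__ == "__main__":
    print(decode_itemized(">f.st......"))
    print(decode_itemized("cd+++++++++"))
```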
Capturing rsync's output (focused on the bit flags):
In my experimentation, both the --itemize-changes flag and the -vvv flag are needed to get rsync to output an entry for all file system changes. Without the triple verbose (-vvv) flag, I was not seeing directory, link and device changes listed. It is worth experimenting with your version of rsync to make sure that it is observing and noting all that you expected.
One handy use of this technique is to add the --dry-run flag to the command and collect the change list, as determined by rsync, into a variable (without making any changes) so you can do some processing on the list yourself. Something like the following would capture the output in a variable:
file_system_changes=$(rsync --archive --acls --xattrs \
    --checksum --dry-run \
    --itemize-changes -vvv \
    "/some/source-path/" \
    "/some/destination-path/" \
    | grep -E '^(\.|>|<|c|h|\*).......... .')
In the example above, the standard output of rsync is piped to grep so that only the lines containing bit flags are kept.
Processing the captured output:
The contents of the variable can then be logged for later use or immediately iterated over for items of interest. I use this exact tactic in the script I wrote during researching more about rsync. You can look at the script (https://github.com/jmmitchell/movestough) for examples of post-processing the captured output to isolate new files, duplicate files (same name, same contents), file collisions (same name, different contents), as well as the changes in subdirectory structures.
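As a rough illustration of that kind of post-processing (this is not code from the linked script), the Python sketch below splits captured output into (flags, path) pairs and picks out newly created files. The regular expression mirrors the grep pattern above, so it shares its limitation: an unrelated verbose line that happens to have the same shape would also match. The function names are hypothetical.

```python
import re

# One update-type character, ten more flag characters, a space, then the path.
ITEMIZED_LINE = re.compile(r"^([.<>ch*].{10}) (.+)$")


def parse_itemized(output):
    """Return (flags, path) pairs for every line that looks itemized."""
    pairs = []
    for line in output.splitlines():
        match = ITEMIZED_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def new_files(pairs):
    """Paths of regular files whose attribute flags are all '+'."""
    return [path for flags, path in pairs
            if flags[1] == "f" and flags[2:] == "+" * 9]
```

The same pairs can be filtered on other flag positions, for example flags[3] == "s" to list files whose size changed.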
1.) It will "restart the sync", but it will not transfer files that are the same size and timestamp etc. It first builds up a list of files to transfer and during this stage it will see that it has already transferred some files and will skip them. You should tell rsync to preserve the timestamps etc. (e.g. using rsync -a ...)
While rsync is transferring a file, it will call it something like .filename.XYZABC instead of filename. Then when it has finished transferring that file it will rename it. So, if you kill rsync while it is transferring a large file, you will have to use the --partial option to continue the transfer instead of starting from scratch.
2.) I don't know what that is. Can you paste some examples?
EDIT: As per http://ubuntuforums.org/showthread.php?t=1342171 those codes are defined in the rsync man page in the section for the -i, --itemize-changes option.
Fixed part of my answer based on Joao's.