Meaning of "near-atomic" transfer using temporary files? - scp

I was reading the scp man page on Linux, and near the end it says:
"No attempt is made at "near-atomic" transfer using temporary files".
I can vaguely guess what this means, but can anyone give me a clear technical definition of this sentence?

An atomic copy would, as Craig states, write to a temporary file and then mv the temporary file to the intended destination. The mv is atomic provided the source and destination are on the same partition. Only processes that already have the tmp file open will be able to read its contents. A move between partitions is not atomic: rename() cannot cross filesystems, so mv has to copy the data.
This assumes you're scp'ing to a UNIX system of course :)

Atomic would mean that nothing else could read or write the file until scp had finished with it. "Near-atomic" refers to the common practise of copying a file into a temporary location (on the target machine/disk) and then moving it into the final location. The move operation is much faster than a copy ("near-atomic" by comparison) but it's not necessarily atomic in the true sense of the word. Another process could still read the file in an inconsistent state during a non-atomic move.
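
The temporary-file-then-rename pattern described above can be sketched in Python. This is a minimal illustration of the general technique, not what scp does (the man page says scp makes no such attempt); the function name near_atomic_write is made up for this example.

```python
import os
import tempfile


def near_atomic_write(dest_path: str, data: bytes) -> None:
    """Write data to dest_path via a temporary file in the same directory."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temp file next to the destination so both live on the same
    # filesystem; otherwise the final rename could not be a single step.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace maps to rename() on POSIX: readers see either the old
        # file or the complete new one, never a half-written file.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The slow part (copying the bytes) happens on a name nobody else is reading; only the final rename touches the real destination.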


redshift Unload operation causing redundant data

We use UNLOAD commands to run some transformation on s3-based external tables and publish data into a different s3 bucket in PARQUET format.
I use the ALLOWOVERWRITE option in the unload operation to replace the files if they already exist. This works fine in most cases, but at times it writes duplicate files for the same data, which causes the external table to show duplicate numbers.
For example, suppose the partition contains 0000_part_00.parquet, which holds the complete data. In the next run, the unload is expected to overwrite this file, but instead it writes a new file, 0000_part_01.parquet, which doubles the total output.
This does not happen if I clean up the entire partition and rerun. This inconsistency is making our system unreliable.
unload (<simple select statement>)
to 's3://<s3 bucket>/<prefix>/'
iam_role '<iam-role>' allowoverwrite
PARTITION BY (partition_col1, partition_col2);
Thank you.
To prevent redundant data, you must use Redshift's CLEANPATH option in your UNLOAD statement. Note the difference, from the documentation (Perhaps AWS could clear this up a bit more):
By default, UNLOAD fails if it finds files that it would possibly overwrite. If ALLOWOVERWRITE is specified, UNLOAD overwrites existing files, including the manifest file.
The CLEANPATH option removes existing files located in the Amazon S3 path specified in the TO clause before unloading files to the specified location.
If you include the PARTITION BY clause, existing files are removed only from the partition folders to receive new files generated by the UNLOAD operation.
You must have the s3:DeleteObject permission on the Amazon S3 bucket. For information, see Policies and Permissions in Amazon S3 in the Amazon Simple Storage Service Console User Guide. Files that you remove by using the `CLEANPATH` option are permanently deleted and can't be recovered.
You can't specify the `CLEANPATH` option if you specify the `ALLOWOVERWRITE` option.
Therefore, as @Vzzarr says, ALLOWOVERWRITE only overwrites files that share the same name as an incoming file. For recurring unload operations that do not need the past data to remain intact, you must use CLEANPATH.
And note that you cannot use both ALLOWOVERWRITE and CLEANPATH in the same UNLOAD statement.
Here's an example:
UNLOAD ('{your_query}')
TO 's3://{destination_prefix}/'
iam_role '{IAM_ROLE_ARN}'
MANIFEST verbose
CLEANPATH;
In my experience the ALLOWOVERWRITE parameter is based only on the generated file names, so a result is overwritten only if two files have the same name.
This parameter works in most cases, but in this domain "most cases" is not good enough. I have stopped using it (and I was quite disappointed). What I do instead is manually delete the files from the S3 console (or actually move them to a staging folder) and then unload the data without relying on the ALLOWOVERWRITE parameter.
Also mentioned in comments of this answer
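
To make the difference concrete, here is a toy Python model of the two behaviours described in the answers above. It does not talk to Redshift or S3; the dictionaries map file names to row counts, and both function names are invented for the illustration.

```python
def unload_allowoverwrite(existing: dict, new_files: dict) -> dict:
    """Name-based overwrite: same-named files are replaced, others are kept."""
    result = dict(existing)
    result.update(new_files)
    return result


def unload_cleanpath(existing: dict, new_files: dict) -> dict:
    """Clean first: everything already in the partition folder is removed."""
    return dict(new_files)


def total_rows(files: dict) -> int:
    return sum(files.values())


# First run wrote part_00; the second run happens to name its output part_01.
first_run = {"0000_part_00.parquet": 100}
second_run = {"0000_part_01.parquet": 100}

print(total_rows(unload_allowoverwrite(first_run, second_run)))  # both files remain
print(total_rows(unload_cleanpath(first_run, second_run)))       # only the new file
```

With name-based overwriting, a stale file survives whenever the new run produces a different file name, which is exactly the doubling seen in the question.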

how to put name difference for daily backup

I created a backup cmd file with this code
It works well, but when I run the backup again, it finds that the file already exists and terminates the process. It will not run unless I delete or rename the previous file. I want to add something to the dumpfile and logfile names that makes them differ from day to day, such as the system date or a copy number.
The option REUSE_DUMPFILES specifies whether to overwrite a preexisting dump file.
Normally, Data Pump Export will return an error if you specify a dump
file name that already exists. The REUSE_DUMPFILES parameter allows
you to override that behavior and reuse a dump file name.
If you wish to write a separate dump file for each day, you can build the file name with the date command in a Unix/Linux environment.
DUMPFILE=FULLDB_$(date '+%Y-%m-%d').DMP
Similar techniques are available on Windows, which you may explore if you're running expdp in a Windows environment.
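
As a platform-neutral alternative, the same dated names could be generated by a small Python wrapper before calling expdp. This is only a sketch: the FULLDB prefix comes from the example above, while the function name and the matching .LOG name are assumptions.

```python
from datetime import date


def dated_names(day: date, prefix: str = "FULLDB") -> tuple:
    """Return (dumpfile, logfile) names that embed the given date."""
    stamp = day.strftime("%Y-%m-%d")
    return f"{prefix}_{stamp}.DMP", f"{prefix}_{stamp}.LOG"


dumpfile, logfile = dated_names(date.today())
# These two values would be passed to expdp as DUMPFILE= and LOGFILE=.
print(f"DUMPFILE={dumpfile} LOGFILE={logfile}")
```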

Why does git interpret sql files as binaries during a merge conflict?

I have a problem resolving merge conflicts in sql files.
MenkeTTA#909086 MINGW64 //FILE0019 (master)
$ git pull
remote: Microsoft (R) Visual Studio (R) Team Services
remote: Found 5 objects to send. (5 ms)
Unpacking objects: 100% (5/5), done.
From https://***
d58a69b..4830c58 master -> origin/master
warning: Cannot merge binary files: example_StoredProcedure.sql (HEAD vs. 4830c5886d3e1eac5ac76d1d49496afb43f444c3)
Auto-merging WRR - example_StoredProcedure.sql
CONFLICT (content): Merge conflict in example_StoredProcedure.sql
Automatic merge failed; fix conflicts and then commit the result.
When the merge conflict occurs, git does not create a pre-merged file with the competing changes in the usual structure:
<<<<<<< HEAD
competing change A
=======
competing change B
>>>>>>> branch-a
Git is treating both files as binaries, but only for merge-conflict operations (a normal merge without conflicts works properly). I can only choose either my own version of the file or the pulled competing version from the remote as the new head for the next push.
I reproduced this conflict with a normal .txt file. There git handles the merge conflict as expected, creating one pre-merged file with both competing changes/commits, where I can manually fix the code however I want.
To make git recognize the sql files as text I added
.sql diff
to the .gitattributes file as described here. Does anyone know how I can make git create an ordinary pre-merged file with both versions of the competing commits when working with sql files?
First, a quick note:
To make git recognize the sql files as text I added
.sql diff
to the .gitattributes file ...
The .gitattributes line should read *.sql diff (I've fixed the linked answer, which is on a question about getting git diff to treat the file as text). However, if the file really is text, you may want, or even need, *.sql text. Note: this will not help at all if the file is not text. If the file's content is UTF-16, it is not text to Git, at least.
Consider marking the file as example_StoredProcedure.sql text, i.e., not all .sql files, just this one particular file. I'm also curious to see whether just marking it diff suffices! Update, Nov 2019: apparently marking the file as diff is not sufficient, though I have not verified this myself.
(The difference is that the diff attribute tells Git how to show the file in git diff output,[1] while the text attribute tells Git that instead of using its built-in guessing algorithm, it should, for all purposes, use the setting to decide whether the file is text. The guessing algorithm consists of scanning an initial chunk of the file's contents to see how many "text-like" characters there are vs "non-text-like" characters. Probably there should be a special allowance for UTF-8 Byte Order Markers at the top, but there isn't. Curiously, during filtering, there are explicit checks.)
[1] Well, it's actually more involved than just showing, but I think this is a good way to start thinking about the issues. Note that you can augment the diff setting with a driver. It's not clear to me how the low level file merge interacts with a diff driver and I do not have time to experiment with it right now.
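
To illustrate the kind of guessing described above, here is a simplified Python heuristic. It is not Git's actual code: the 8000-byte sample size and the 10% threshold are assumptions chosen for the sketch, and the function name is made up.

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Guess whether data is binary by inspecting only its first bytes."""
    sample = data[:sample_size]
    if not sample:
        return False  # treat an empty file as text
    if b"\x00" in sample:
        return True  # NUL bytes essentially never occur in plain text
    # Printable ASCII, tab/LF/CR, and high bytes (UTF-8 sequences) count as text.
    text_like = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D} | set(range(0x80, 0x100))
    non_text = sum(1 for byte in sample if byte not in text_like)
    return non_text * 10 > len(sample)  # assumed threshold: more than 10%


sql = "SELECT 1;\n"
print(looks_binary(sql.encode("utf-8")))      # plain ASCII/UTF-8
print(looks_binary(sql.encode("utf-16-le")))  # every other byte is NUL
```

This shows why a UTF-16 file "is not text" to a check like this: each ASCII character is followed by a NUL byte.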
Longer explanation
warning: Cannot merge binary files: example_StoredProcedure.sql (... vs ...)
tells us that you are correct, that Git is treating the three versions of example_StoredProcedure.sql as binary. (I see you added this output after the initial question; good thing, since it's the key!)
But why did I say three versions, when the line goes on to say:
HEAD vs. 4830c5886d3e1eac5ac76d1d49496afb43f444c3
Git is being a little lazy here: all merges involve three inputs, not just two. One of these is the one you supply explicitly—or, as in this case, git pull ran git merge and git pull itself supplied the big ugly hash ID 4830c5886d3e1eac5ac76d1d49496afb43f444c3.
The second input to a merge is always the current commit, aka HEAD. You normally get this by being on the branch in the first place: HEAD names the branch-name, the branch-name identifies the commit, and this is where you want the final merge commit to go, so it all fits together.
The third input (or internally, the first; internally the "theirs" version is the third input) is one that Git computes for you, based on the HEAD and other or --theirs commits: Git walks through enough of the commit graph to find the best common ancestor commit.[1] It's this common ancestor commit that determines which files need merging, and if a file does need merging, the built-in merge driver needs to use diffs to get textual changes to merge. For both this and for git diff, Git has a differencing engine built into it (modified from LibXDiff).
Hence Git can, in effect, run:
git diff --find-renames <merge base commit> HEAD
to see what we did to each of our files, and:
git diff --find-renames <merge base commit> <other commit>
to see what they did to each of our files. Then:
- If we changed a file and they did not touch it at all, the merge is easy: take ours.
- If they changed a file and we did not touch it at all, the merge is easy: take theirs.
- If we both changed a file but made the new file exactly the same, the merge is easy: take either one (ours, really, since it's in place).
- Otherwise, attempt to combine the changes.
For speed reasons, Git uses the hash IDs ("blob" hashes, for the file's content) to accomplish the first three bullet points without ever having to fire up the file-level diff. This can, and does, merge unconflicted binary-file changes. It's only the final stage, where all three blob hashes differ, that requires a textual diff so as to combine changes.
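
The hash-based shortcut can be sketched in Python. The blob ID computation (SHA-1 over a "blob <size>" header, a NUL byte, and the content) matches how Git names blobs; the merge_action function and its return strings are invented for this illustration and are not Git's internal API.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Return the Git blob ID (SHA-1) for the given file content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def merge_action(base: bytes, ours: bytes, theirs: bytes) -> str:
    """Decide what to do with one file using only the three blob hashes."""
    b, o, t = blob_hash(base), blob_hash(ours), blob_hash(theirs)
    if t == b:
        return "take ours"      # they did not touch the file
    if o == b:
        return "take theirs"    # we did not touch the file
    if o == t:
        return "take ours"      # both sides made the identical change
    return "content merge needed"  # all three differ: a real diff is required


print(merge_action(b"v1", b"v1", b"v2"))
print(merge_action(b"v1", b"v2", b"v3"))
```

Only the last case needs the file-level diff, which is why a file detected as binary merges fine until both sides change it differently.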
Obviously, if Git can't diff the file, it cannot merge the two diff outputs. But does just marking the file as text-diff-able (pattern diff in .gitattributes) make the merge proceed? What happens if you set a diff driver, does the low-level file merge code use that driver? It "wants" to use the xdiff internal interface to find hunks; that's a lot easier than interpreting text output from a driver; you probably have to define a merge driver to get a detected-as-binary file to be merged, even if you have marked it as diff.
Additional note, Nov 2019: Since Git 2.18, Git has the ability to convert between committed UTF-8 data and in-work-tree other-format data. To use this, set the working-tree-encoding attribute. For instance, [the gitattributes documentation] shows an example line:
*.ps1 text working-tree-encoding=UTF-16LE eol=CRLF
that would keep all *.ps1 files in UTF-8 internally (in the frozen, committed files inside each commit) but keep the useful-format versions of those files in your work-tree in UTF-16-LE. I have no data as to whether this would work with these SQL files.
[1] In all cases, but especially in problem cases where there's more than one best common ancestor, git merge's behavior actually depends on the strategy you chose. The usual recursive strategy will merge the merge bases, commit the result, and then use that commit as the merge base! Other merge strategies work differently.

Dynamically populate external tables location

I'm trying to use Oracle external tables to load flat files into a database, but I'm having a bit of an issue with the location clause. The files we receive have several pieces of information appended to their names, including the date, so I was hoping to use wildcards in the location clause, but it doesn't look like I'm able to.
I think I'm right in assuming I can't use wildcards. Does anyone have a suggestion for how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and a parameter table. The user can specify the input directory, file mask, external table, etc. When a file appears in the directory, the shell script generates a list of the files that match the file mask. For each file found, it issues an alter table command to change the location of the given external table to that file and launches the rest of the PL/SQL associated with that file. This is repeated for each file found with the file mask. I guess the benefit of this is that I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end, which appears to be the only way.
I have a file watcher that looks for files in a given input dir with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an alter table on the external table with the list of new file names.
For me this wasn't much of an issue, as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone searching for ages for a solution.
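
The statement-building step of that watcher can be sketched in Python. This is an illustration only: the answer used shell, and the table name, file names, and function name here are hypothetical. It only builds the ALTER TABLE text; running it against the database is left out.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def build_alter_location(table: str, filenames: List[str], mask: str) -> Optional[str]:
    """Build an ALTER TABLE ... LOCATION statement for files matching mask."""
    matches = sorted(name for name in filenames if fnmatchcase(name, mask))
    if not matches:
        return None  # nothing new arrived, so leave the table alone
    # Quote each file name as a SQL string literal, doubling embedded quotes.
    quoted = ", ".join("'" + name.replace("'", "''") + "'" for name in matches)
    return f"ALTER TABLE {table} LOCATION ({quoted})"


found = ["sales_20240105.csv", "readme.txt", "sales_20240106.csv"]
print(build_alter_location("sales_ext", found, "sales_*.csv"))
```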

SQL Server 2008: Filestream how to physically delete uploaded file from filestreamgroup?

I have created a filestream group at C:\Test\FilestreamGroup1
and a table with a varbinary FILESTREAM column.
Now when a file is uploaded, it is physically stored in FilestreamGroup1.
I want to know two things:
In which format is the file stored in FilestreamGroup1? (For every single uploaded file I found 2 encoded files.)
Secondly, how do I delete an uploaded file physically? Deleting a record from the table with a DELETE command does not result in physical deletion of the file, so how can I delete the file physically?
If you want to delete files from the file system instantly, you need to force garbage collection manually by using checkpoint.
This is not a Stack Overflow question; it belongs on Server Fault (admin). It touches dev, though.
i.e. deleting a record from the table with a DELETE command does not result in physical deletion of the file, so how can I delete a file physically?
Do you know what the primary reason to have a database is? To guarantee data integrity.
A delete must keep the data around until a backup is taken. What is your backup policy? You may note that when you make an update, another copy of the file is created, for that simple reason. The old one must still be available for backup, and that is just how they integrate it.
In which format file is stored at FilestreamGroup1 (for every single uploaded file I found 2 encoded file)?
No, files are stored raw. What would be the sense in encoding them, given that there are SQL functions to get the path and it is a supported scenario that the client does not use SQL to load the file (but asks SQL for the file name and path, then accesses it via an NTFS file share)? This also supports interop, as any program that loads from a network can be fed a SQL-driven location.
I strongly assume you have only 1 copy and somehow made an update, resulting in a second file being written.
has an explanation of how to access FILESTREAM data with SQL.
has an explanation of how to access FILESTREAM data using Win32.
FILESTREAM files being left behind after row deleted
explains why files are left behind when a row is deleted. I found that using the extremely trivial Google search for "sql filestream delete file" and it was item 1 on the result list - did you even try Google?
secondly, how do I delete an uploaded file physically (i.e. deleting a record from the table with a DELETE command does not result in physical deletion of the file, so how can I delete a file physically)?
Checkpoint does not remove the files; files are removed in a background process, and it can take quite a while. To force deletion use
EDIT: works with SQL Server 2012 only
Write "checkpoint" after deleting a row. It will remove the physical file.
Run the below query and check; the file gets deleted from the file system automatically