Rename filename.ext.crswap to filename.ext rather than copying - native-file-system-api-js

When performing this sequence:
Obtain a handle to a new file via window.showSaveFilePicker, say filename.ext
Obtain a writable file stream from the handle
Write some content into the file using the stream
Close the stream to signal completion
the File System API writes to filename.ext.crswap and, on close, copies filename.ext.crswap to filename.ext.
Is there a reason that filename.ext.crswap is not simply renamed to filename.ext, rather than copied?

The reason for this behavior is to avoid partial writes:
"User agents try to ensure that no partial writes happen, i.e. the file represented by fileHandle will either contain its old contents or it will contain whatever data was written through stream up until the stream has been closed."—Spec.

Related

File create time doesn't change even after it is deleted

I am using the following code:
from datetime import datetime
import time, pandas as pd, numpy as np, os, pickle
df = pd.DataFrame(np.arange(1,200))
fn = r'C:\z1.p'
pickle.dump(df, open(fn, 'wb'))
print(datetime.fromtimestamp(os.stat(fn).st_ctime))  # st_ctime is the creation time on Windows
os.remove(fn)
time.sleep(5)
pickle.dump(df, open(fn, 'wb'))
print(datetime.fromtimestamp(os.stat(fn).st_ctime))
But both print statements show the same creation time:
2022-03-16 08:43:30.885011
2022-03-16 08:43:30.885011
How do I make sure that a new time gets printed for the second print statement?
This is a Windows feature, called "file system tunnelling".
The apocryphal history of file system tunnelling
One of the file system features you may find yourself surprised by is
tunneling, wherein the creation timestamp and short/long names of a
file are taken from a file that existed in the directory previously.
In other words, if you delete some file “File with long name.txt” and
then create a new file with the same name, that new file will have the
same short name and the same creation time as the original file. You
can read this KB article for details on what operations are sensitive
to tunnelling.
Why does tunneling exist at all?
When you use a program to edit an existing file, then save it, you
expect the original creation timestamp to be preserved, since you’re
editing a file, not creating a new one. But internally, many programs
save a file by performing a combination of save, delete, and rename
operations (such as the ones listed in the linked article), and
without tunneling, the creation time of the file would seem to change
even though from the end user’s point of view, no file got created.
...
See this archived copy of Windows NT Contains File System Tunneling Capabilities:
When a name is removed from a directory (rename or delete), its
short/long name pair and creation time are saved in a cache, keyed by
the name that was removed. When a name is added to a directory (rename
or create), the cache is searched to see if there is information to
restore. The cache is effective per instance of a directory. If a
directory is deleted, the cache for it is removed.
These paired operations can cause tunneling on "name."
delete(name)/create(name)
delete(name)/rename(source, name)
rename(name, newname)/create(name)
rename(name, newname)/rename(source, name)
The idea is to mimic the behavior MS-DOS programs expect when they use
the safe save method. They copy the modified data to a temporary file,
delete the original and rename the temporary to the original. This
should seem to be the original file when complete. Windows performs
tunneling on both FAT and NTFS file systems to ensure long/short file
names are retained when 16-bit applications perform this safe save
operation.
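As an illustration, the safe-save pattern described above looks roughly like this in Python (file names and contents are made up):

import os

# Create an "original" file so the safe-save sequence below has something to replace.
with open('report.txt', 'w') as f:
    f.write('original data')

# Safe-save: write the modified data to a temporary file, delete the
# original, then rename the temporary file to the original name.
with open('report.txt.tmp', 'w') as f:
    f.write('updated data')
os.remove('report.txt')
os.rename('report.txt.tmp', 'report.txt')

# Without tunneling the new report.txt would get a fresh creation time;
# with tunneling it inherits the original file's creation time.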
One Windows function related to file tunneling is FltGetTunneledName():
The FltGetTunneledName routine retrieves the tunneled name for a file, given the normalized name returned for the file by a previous call to FltGetFileNameInformation, FltGetFileNameInformationUnsafe, or FltGetDestinationFileNameInformation.
...
To disable tunnelling:
Open regedit
Navigate here:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
On the Edit menu, point to New and then click DWORD Value
Type MaximumTunnelEntries and then press Enter
On the Edit menu, click Modify
Type 0 and then click OK
Restart your computer
Done
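The same registry value can also be set programmatically; for example, a small Python sketch using the standard winreg module (it must run from an elevated process, and a reboot is still required):

import winreg

# Set MaximumTunnelEntries = 0 to disable file system tunneling.
key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                     r'SYSTEM\CurrentControlSet\Control\FileSystem',
                     0, winreg.KEY_SET_VALUE)
winreg.SetValueEx(key, 'MaximumTunnelEntries', 0, winreg.REG_DWORD, 0)
winreg.CloseKey(key)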

What is the best practice for downloading large CSV files from S3 in Java?

I'm trying to get a large CSV file from S3 but the download fails with “java.net.SocketException: Connection reset”, which is probably due to the InputStream simply being open for too long (the download often takes more than an hour since I am doing multiple time-consuming processes on the streamed content). This is how I currently parse the file:
InputStream inputStream = new GZIPInputStream(s3Client.getObject("bucket", "key").getObjectContent());
Reader decoder = new InputStreamReader(inputStream, Charset.defaultCharset());
BufferedReader isr = new BufferedReader(decoder);
CSVParser csvParser = new CSVParser(isr, CSVFormat.DEFAULT);
CSVRecord nextRecord = csvParser.iterator().next();
...
I know I have to split the download into multiple short getObject calls with a defined offset for the GetObjectRequest, but I'm wondering how to define this offset in the case of a CSV, since I need complete lines.
Do I have to ditch the parser library and parse each line into an Object myself so I can keep a count of the read bytes and use it as an offset for the next batch? That doesn't seem very robust to me. Is there any best practice way to achieve "batch downloading" of CSV records?
I decided on simply using the dedicated getObject(GetObjectRequest getObjectRequest, File destinationFile) method to copy the entire CSV to a temporary file on disk. This closes the HTTP connection as soon as possible and allows me to get the InputStream from the local file with no problems. It doesn't resolve the question of the best way to download in batches, but it's a nice and simple workaround.
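For reference, a rough sketch of that workaround with the AWS SDK for Java v1 (the client construction, bucket name, key, and per-record processing are placeholders):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.zip.GZIPInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class S3CsvDownload {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();

        // Copy the whole object to a local temp file; the HTTP connection is
        // closed as soon as the copy finishes, so slow processing can no
        // longer cause a connection reset.
        File tmp = Files.createTempFile("large-csv", ".gz").toFile();
        s3Client.getObject(new GetObjectRequest("bucket", "key"), tmp);

        // Parse from disk at whatever pace the downstream processing needs.
        // The Commons CSV parser from the question can wrap this reader as before.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(tmp)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // time-consuming per-record processing goes here
            }
        } finally {
            tmp.delete();
        }
    }
}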

How to handle file inputs with changing schemas in Talend

Question: How do I continue to process files that differ substantially from a base schema and that trigger tSchemaComplianceCheck errors?
Background
Suppose I have a folder with Customer xls files called file1, file2, ..., file1000. Assume I have imported the file schema into the Talend repository, called it 6Columns, and configured the Talend job to iterate through each of the files and process them:
1-tFileInput ->2-tSchemaCompliance-6Columns -> 3-tMap ->4-FurtherProcessing
Read each excel file
Compare it to the schema 6Columns
Format the output (rename columns)
Take the collection of Customer data and process it more
While processing, I notice that the schema compliance check is generating errors (errorCode 16) which point to a number of files (200) with a different schema, 13Columns, but there isn't a way to identify these files in advance to filter them into a subjob.
How do I amend my processing to correctly integrate the files with the 13Columns schema (what's the recommended way of handling this), and how do I design for the case where other schema changes occur?
1-tFileInput ->2-tSchemaCompliance-6Columns -> 3-tMap ->4-FurtherProcessing
|
|Reject Flow (ErrorCode 16)
|Schema-13Columns
|
|-> ??
Current thinking when errorCode 16 is detected:
Option 1, parallel: take the file path for the current file and process it against 13Columns using a new FileInput before merging the 2 flows back into 1.
Option 2, serial: collect the list of files that triggered the error and process them after I've finished with the compliant files.
You could try something like the below:
tFileList: read your input repository
tFileInput "schema6" - tSchemaComplianceCheck: read files with the 6-column schema
tMap_1: further processing
In the reject part:
tMap after the reject link: add a new column containing the filepath that has been rejected
tFlowToIterate: used to get an iterate link, an acceptable input for the tFileInputDelimited that follows
tFileInput: read data with the 13-column schema. The following components are the same as in part 1.
After that, you can push your data to tHashOutput, in order to read it further in another subjob.

How to create an HTTP request that contains multiple FileHeaders?

I am trying to test an uploading service that supports uploading multiple files, and I found this:
golang POST data using the Content-Type multipart/form-data
which shows how to create a request that uploads a single file, but I need to upload multiple files. Is there a simple way to create this kind of request?
Update:
Please check lines 38 and 39 in that post (to support HTML5 multiple-file uploading):
line 38: files := m.File["myfiles"]
line 39: for i, _ := range files {
It seems that it needs to use a single field name for multiple file headers to simulate HTML5 multiple-file uploading.
For each file, call CreateFormFile to create the header for the file. Call Write on the writer returned from CreateFormFile one or more times to write data to the file. When done with all files, close the multipart writer.
The top answer in the linked question uploads two files, one named "image" and one named "key". The data for the "image" is copied from a file. The data for "key" is simply the bytes "KEY".
The field name is the first argument to CreateFormFile. If you want to upload multiple files with the same name, use the same name each time you call CreateFormFile.
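A minimal sketch of that approach (the "myfiles" field name follows the linked post; the upload URL and file paths are placeholders):

package main

import (
	"bytes"
	"io"
	"log"
	"mime/multipart"
	"net/http"
	"os"
)

// buildMultipartRequest writes one "myfiles" part per path into a single
// multipart body, mimicking an HTML5 <input type="file" multiple> submission.
func buildMultipartRequest(url string, paths []string) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	for _, p := range paths {
		part, err := w.CreateFormFile("myfiles", p) // same field name for every file
		if err != nil {
			return nil, err
		}
		f, err := os.Open(p)
		if err != nil {
			return nil, err
		}
		_, err = io.Copy(part, f)
		f.Close()
		if err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil { // writes the trailing boundary
		return nil, err
	}
	req, err := http.NewRequest("POST", url, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, err := buildMultipartRequest("http://localhost:8080/upload", []string{"a.txt", "b.txt"})
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}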

SSIS 2005 flat file source - partial row which isn't actually a partial row

I'm currently working on an SSIS package to load mainframe logs from multiple server/file sources into a database.
As it stands at the moment I'm using a foreach loop container to loop through a recordset containing filenames and load the files using a Data Flow task from a Flat File Source and File connection to an OLE DB Destination through a Derived column.
I've built in error handling on the Data Flow task to allow for the fact that there won't always be a log file in the location specified (ie. because the server was down for maintenance during a specific period as the files are generated on an hourly basis), but the problems start after it finishes handling these errors.
If the file immediately following an attempt to load a file that wasn't found exists, it begins to load it but then throws the following warning: "[Message Log File Source (NORDXSL) [57]] Warning: There is a partial row at the end of the file." and doesn't load all of the records in that file.
However, when I remove the files I know won't exist from the recordset (so that it only attempts to load files that do exist, including the one with the alleged "partial row"), everything works fine and all files/rows are loaded without a problem. It just seems unwilling to correctly load the first file after it has failed on a missing file, and I can't for the life of me work out why.
I've tried calling Dispose() and ReleaseConnection() on the file connection after the Data Flow task has finished processing but this makes no difference and I'm now completely out of ideas.
Any help would be really appreciated as this is the last bug in this project and I want to get it out the door. PLEASE!!
Thanks,
James
I've now found a workaround for this problem...
I've added a Script Task before the Data Flow Task to load the files that checks to see if the file I want to read exists:
' Fail this iteration early if the expected log file doesn't exist.
If (System.IO.File.Exists(Dts.Variables("MQLogMessagePath").Value.ToString)) Then
    Dts.TaskResult = Dts.Results.Success
Else
    Dts.TaskResult = Dts.Results.Failure
End If
If it doesn't exist it fails the iteration of the Foreach Loop container and continues onto the next file.
BINGO!