Getting extra files without any extension on Azure Data Lake Store - azure-data-lake

I am using Azure Data Lake Store for file storage. I am using operations like:
Creating a main file
Creating part files
Appending these part files to the main file (using concurrent append)
Example:
There is a main log file (which will eventually contain logs from all programs)
There are part log files that each program creates on its own and then appends to the main log file
The workflow runs fine, but I have noticed some unknown files getting uploaded to the store directory. These files are named with a GUID, have no extension, and are empty.
Does anyone know what might be the reason for these extra files?
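For context, here is a minimal sketch of the append part of this workflow, assuming the Microsoft.Azure.DataLake.Store .NET SDK; the path is a placeholder, and the exact method signature should be verified against the SDK version in use:

    // Hedged sketch: several programs append their part logs to one main log file.
    // "/logs/main.log" is a placeholder path; check the ConcurrentAppend signature
    // against the Microsoft.Azure.DataLake.Store SDK you are actually using.
    using System.Text;
    using Microsoft.Azure.DataLake.Store;

    class LogAppender
    {
        static void AppendPartToMainLog(AdlsClient client, string partText)
        {
            byte[] data = Encoding.UTF8.GetBytes(partText);

            // ConcurrentAppend allows multiple writers to append to the same file;
            // autoCreate = true creates the main log file if it does not exist yet.
            client.ConcurrentAppend("/logs/main.log", true, data, 0, data.Length);
        }
    }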

Thanks for reformatting your question. These look like processing artefacts that will probably disappear shortly. How did you upload/create your files?

Related

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3, you either provide a base path and it loads all the files under that path, or you specify a manifest file with specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket, however we can't do that, this bucket is the final storage place for another process which is not our own.
The only alternative I can think of is to have some other process running that tracks files that have been successfully loaded to Redshift, and then periodically compares that to the S3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the copy process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
There must be a better way!
This is how I solved the problem:
S3 -- (Lambda Trigger on newly created Logs) -- Lambda -- Firehose -- Redshift
It works at any scale. With more load, there are more calls to Lambda and more data to Firehose, and everything is taken care of automatically.
If there are issues with the format of a file, you can configure dead-letter queues; events will be sent there and you can reprocess them once you fix the Lambda.
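As a rough illustration of the Lambda-to-Firehose leg of that pipeline, a handler along these lines could be used; this is a sketch assuming the AWS SDK for .NET, and the delivery stream name is a placeholder:

    // Hedged sketch of S3 -> Lambda -> Firehose, using Amazon.Lambda.S3Events,
    // AWSSDK.S3 and AWSSDK.KinesisFirehose. "my-delivery-stream" is a placeholder;
    // Firehose then buffers the records and COPYs them into Redshift.
    using System.IO;
    using System.Threading.Tasks;
    using Amazon.Lambda.Core;
    using Amazon.Lambda.S3Events;
    using Amazon.S3;
    using Amazon.KinesisFirehose;
    using Amazon.KinesisFirehose.Model;

    public class NewLogFileHandler
    {
        private readonly IAmazonS3 _s3 = new AmazonS3Client();
        private readonly IAmazonKinesisFirehose _firehose = new AmazonKinesisFirehoseClient();

        public async Task HandleAsync(S3Event s3Event, ILambdaContext context)
        {
            foreach (var record in s3Event.Records)
            {
                // Read the newly created file that triggered the Lambda.
                var obj = await _s3.GetObjectAsync(record.S3.Bucket.Name, record.S3.Object.Key);
                using var buffer = new MemoryStream();
                await obj.ResponseStream.CopyToAsync(buffer);
                buffer.Position = 0;

                // Forward the raw bytes to the Firehose delivery stream.
                await _firehose.PutRecordAsync(new PutRecordRequest
                {
                    DeliveryStreamName = "my-delivery-stream",
                    Record = new Record { Data = buffer }
                });
            }
        }
    }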
Here I would like to mention the steps involved in loading data into Redshift.
Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during export).
Split files into 10-15 MB chunks to get optimal performance during upload and the final data load.
Compress files to *.gz format so you don’t end up with a $1000 surprise bill :) .. In my case, text files were compressed 10-20 times.
List all file names in a manifest file so that when you issue the COPY command to Redshift it is treated as one unit of load.
Upload manifest file to Amazon S3 bucket.
Upload local *.gz files to Amazon S3 bucket.
Issue the Redshift COPY command with the appropriate options (a rough sketch of the manifest and COPY steps follows below).
Schedule file archiving from on-premises and the S3 staging area on AWS.
Capture errors and set up restartability in case something fails.
To do it the easy way, you can follow this link.
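To make the manifest and COPY steps above concrete, here is a hedged sketch; the bucket, table, IAM role, driver name and credentials are all placeholders, and the COPY is issued over ODBC only as one possible way to run it from code:

    // Hedged sketch: write a manifest listing the *.gz parts already uploaded to S3,
    // then issue a single COPY so Redshift loads them as one unit of load.
    using System.Data.Odbc;
    using System.IO;

    class RedshiftLoader
    {
        static void LoadBatch()
        {
            string manifest = @"{
      ""entries"": [
        { ""url"": ""s3://my-bucket/stage/part-0001.gz"", ""mandatory"": true },
        { ""url"": ""s3://my-bucket/stage/part-0002.gz"", ""mandatory"": true }
      ]
    }";
            File.WriteAllText("batch.manifest", manifest);
            // ...upload batch.manifest to s3://my-bucket/stage/batch.manifest with your S3 client...

            using var conn = new OdbcConnection(
                "Driver={Amazon Redshift (x64)};Server=my-cluster.example.redshift.amazonaws.com;" +
                "Database=mydb;UID=loader;PWD=...;Port=5439");
            conn.Open();

            using var cmd = conn.CreateCommand();
            cmd.CommandText =
                "COPY target_table " +
                "FROM 's3://my-bucket/stage/batch.manifest' " +
                "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' " +
                "MANIFEST GZIP DELIMITER '|';";
            cmd.ExecuteNonQuery();
        }
    }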
In general, comparing already-loaded files against the files currently on S3 is a bad but workable practice. The common "industrial" practice is to use a message queue between the data producer and the data consumer that actually loads the data. Take a look at RabbitMQ vs. Amazon SQS, etc.

FTP client sees a file that isn't there... How can I successfully delete/overwrite this "ghost" file?

So we have a client that creates "training packages" and then uploads them via ftp to their website. They create the training packages in PowerPoint, and then use some program to convert them into html/swf files and package them within a folder. When they upload, they use Filezilla, and just transfer the entire folder over. The folder is uniquely named, uses no spaces or special characters.
These files have uploaded fine for about a year. Recently, they've run into a problem. Whenever they try to upload a training package folder, they are immediately presented with the "This file already exists, do you want to overwrite?" message. Except... the folder they're moving is brand new, and the file it's asking to overwrite DOESN'T EXIST. When they choose "Overwrite", the file looks like it transfers, but the file size is wrong, and the training package doesn't work correctly.
This happens with every training package they try to upload. It's not just a badly outputted package. Also, it's always the same file that has the problem--it's the main "player" for the training package, and though it contains different content for every package, it is the same file name (cplayer.swf) every time.
Things they've tried without success:
-Re-uploading the file again by itself, and overwriting
-Deleting the "bad" file and re-uploading the single file - Get the overwrite message again, even though the file DOES NOT EXIST.
-Renaming the file on the server and re-uploading the single file - Get the overwrite message.
-Renaming the single file locally within the package and uploading/renaming it - Won't let us rename because the file already exists.
-Used another FTP client - Same results as above, so not a client-specific problem.
-Used a different FTP login - Same results as above, so not a permissions problem.
Other things of note:
-The file is small--it's not a timeout problem. Plus, all other files upload fine, and some are a lot larger.
-They've emailed this file to me, and I've uploaded it successfully.
I am completely at my wits end. Does anyone have any ideas where I can at least troubleshoot a little further?
Thanks for the non-help, the downvote, and the general lack of response on what was a pretty serious issue for me.
In case anyone else has a similar problem, here's what was going on:
Virus software (specifically Malwarebytes) was blocking THIS ONE SINGLE FILE. All I had to do was exclude the folder that contained the file.

WCF returning "dynamic" GZipStream

I need to create a service that returns a GZipStream consisting of one or more files. The number of files could be hundreds and each file could potentially take up more than 500MB.
Is it somehow possible to add the files dynamically to the GZipStream as the stream is being transferred? (To avoid running into an out-of-memory exception when the files need to be copied into the stream.)
E.g.:
Copy fileA to the stream being returned.
The client starts reading the stream.
When fileA has been read (client side), copy fileB to the stream (server side).
The client continue to read the stream.
... and so on until there are no more files.
Btw. it's not important that the files are compressed, just that they are combined into a zip file so that the client only has to download one single file.
So my goal is: Stream multiple files back to the client as one single file without processing all the files at once on the server (to avoid loading all files into memory and therefore raise an out-of-memory exception).
Could this be done by creating a custom stream somehow or is there an easier way to go?
Thanks.
You could combine the files to a single zip file on disk and then stream that file back.
For how to combine the files into a zip file, see: c# sharpziplib adding file to existing archive
This solves the out of memory problem, but it does mean that you need a lot of disk space.
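A rough sketch of that approach, assuming SharpZipLib for the archive and a streamed WCF operation; the service contract, folder path and compression level are illustrative only:

    // Hedged sketch: build the zip on disk file-by-file with SharpZipLib, then return
    // it from a WCF operation as a Stream. Requires transferMode="Streamed" in the
    // binding configuration so the response is not buffered in memory.
    using System.IO;
    using System.ServiceModel;
    using ICSharpCode.SharpZipLib.Zip;

    [ServiceContract]
    public interface IPackageService
    {
        [OperationContract]
        Stream GetPackage();
    }

    public class PackageService : IPackageService
    {
        public Stream GetPackage()
        {
            string zipPath = Path.GetTempFileName();

            using (var zipStream = new ZipOutputStream(File.Create(zipPath)))
            {
                zipStream.SetLevel(3); // modest compression; compression is not the goal here

                foreach (string file in Directory.GetFiles(@"C:\data\export"))
                {
                    zipStream.PutNextEntry(new ZipEntry(Path.GetFileName(file)));
                    using (var source = File.OpenRead(file))
                    {
                        // Files are copied into the archive one at a time,
                        // so only one file's buffer is in memory at any moment.
                        source.CopyTo(zipStream);
                    }
                    zipStream.CloseEntry();
                }
            }

            // The FileStream is handed to WCF, which streams it to the client.
            return File.OpenRead(zipPath);
        }
    }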

Ways to achieve de-duplicated file storage within Amazon S3?

I am wondering the best way to achieve de-duplicated (single-instance storage) file storage within Amazon S3. For example, if I have 3 identical files, I would like to only store the file once. Is there a library, API, or program out there to help implement this? Is this functionality present in S3 natively? Perhaps something that checks the file hash, etc.
I'm wondering what approaches people have used to accomplish this.
You could probably roll your own solution to do this. Something along the lines of:
To upload a file:
Hash the file first, using SHA-1 or stronger.
Use the hash to name the file. Do not use the actual file name.
Create a virtual file system of sorts to save the directory structure - each file can simply be a text file that contains the calculated hash. This 'file system' should be placed separately from the data blob storage to prevent name conflicts - like in a separate bucket.
To upload subsequent files:
Calculate the hash, and only upload the data blob file if it doesn't already exist.
Save the directory entry with the hash as the content, just as for all other files.
To read a file:
Open the file from the virtual file system to discover the hash, and then get the actual file using that information.
You could also make this technique more efficient by uploading files in fixed-size blocks - and de-duplicating, as above, at the block level rather than the full-file level. Each file in the virtual file system would then contain one or more hashes, representing the block chain for that file. That would also have the advantage that uploading a large file which is only slightly different from another previously uploaded file would involve a lot less storage and data transfer.
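Here is a minimal sketch of the whole-file variant of that scheme, using the AWS SDK for .NET and SHA-256; the bucket names and key layout are assumptions, not anything S3 provides natively:

    // Hedged sketch of hash-addressed storage on S3: blobs are keyed by their SHA-256,
    // and a small "directory entry" object maps the logical path to that hash.
    using System;
    using System.IO;
    using System.Net;
    using System.Security.Cryptography;
    using System.Threading.Tasks;
    using Amazon.S3;
    using Amazon.S3.Model;

    class DedupStore
    {
        private readonly IAmazonS3 _s3 = new AmazonS3Client();
        private const string BlobBucket = "my-blob-bucket";      // data blobs, named by hash
        private const string DirBucket  = "my-directory-bucket"; // virtual file system entries

        public async Task UploadAsync(string logicalPath, string localFile)
        {
            // 1. Hash the file contents.
            string hash;
            using (var sha = SHA256.Create())
            using (var fs = File.OpenRead(localFile))
                hash = BitConverter.ToString(sha.ComputeHash(fs)).Replace("-", "");

            // 2. Upload the blob only if an object with that hash does not already exist.
            if (!await BlobExistsAsync(hash))
                await _s3.PutObjectAsync(new PutObjectRequest
                {
                    BucketName = BlobBucket,
                    Key = hash,
                    FilePath = localFile
                });

            // 3. Write the directory entry: a tiny object whose body is the hash.
            await _s3.PutObjectAsync(new PutObjectRequest
            {
                BucketName = DirBucket,
                Key = logicalPath,
                ContentBody = hash
            });
        }

        private async Task<bool> BlobExistsAsync(string hash)
        {
            try
            {
                await _s3.GetObjectMetadataAsync(BlobBucket, hash);
                return true;
            }
            catch (AmazonS3Exception e) when (e.StatusCode == HttpStatusCode.NotFound)
            {
                return false;
            }
        }
    }

Reading a file is then the reverse: fetch the directory entry, read the hash from its body, and download the blob stored under that hash.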

DotNetZip - create zip from accessed file

Is it possible to use DotNetZip to create a zip from a file that is in use (e.g. a log file from another application)?
That is, create the zip while the log file is being written by the other application.
Hmm, well, yes, if you are willing to write some code.
One way to do it is to compress the file AFTER it has been written and closed.
You would need to have an app that runs with a filesystem watcher, and when it sees the log file being closed, it compresses that log file into a zip.
If you mean a distinct app that writes to a file and it automagically gets written into a zip file, no, I don't know of a simple way to do that. There is one possibility: if the 3rd-party app accepts a System.IO.Stream to write the log entries into, you can do that with DotNetZip. You can get a writeable stream from DotNetZip, into which the app writes content. It is compressed as it is written, and when the writing is complete, DotNetZip closes the zip file. To use this, check the ZipFile.AddEntry() method that accepts a WriteDelegate. It's in the documentation.
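A minimal sketch of that WriteDelegate approach; the entry name, archive name and the logger interface are placeholders for whatever the third-party component actually exposes:

    // Hedged sketch of ZipFile.AddEntry with a WriteDelegate (DotNetZip / Ionic.Zip).
    // The delegate receives a writable stream that is compressed as it is written;
    // IThirdPartyLogger is a made-up stand-in for any component that can write to a Stream.
    using Ionic.Zip;

    class ZipOnTheFly
    {
        static void WriteLogIntoZip(IThirdPartyLogger thirdPartyLogger)
        {
            using (var zip = new ZipFile())
            {
                // The delegate is invoked when the zip is saved; whatever is written
                // to 'stream' becomes the (compressed) content of "app.log".
                zip.AddEntry("app.log", (entryName, stream) => thirdPartyLogger.WriteTo(stream));
                zip.Save("logs.zip");
            }
        }
    }

    interface IThirdPartyLogger
    {
        void WriteTo(System.IO.Stream stream);
    }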