Can snowpipe pick up files from within sub-folders? - amazon-s3

I'm looking to try to pick up all parquet files from an s3 bucket that have been placed into partitioned sub-folders by date.
In the past I've used Snowpipe with a sort of 1:1 relationship, one sub-folder to one table, but I would be interested to know if it is possible to crawl over partitioned data into a single table.
Many thanks!

Short answer: Yes!
With COPY INTO you can load a particular file, a whole folder, or all sub-folders within a folder. All you need to do is adjust your path accordingly: specify the parent folder in the FROM clause and the files in all of its sub-folders will be loaded.
copy into mytable
from @my_stage/your_main_folder/;
Docs: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
Edit: Variations are possible. The stage itself can also already point to the main folder, in which case you do not need to extend the path in the COPY INTO statement.

Yes, you can include the sub-folder you want to crawl as part of the Snowflake stage referenced in the pipe definition.
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html#step-3-create-a-pipe-with-auto-ingest-enabled
Make sure that the S3 stage includes the folder path you want to crawl.
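For illustration, a hedged sketch of what such a pipe definition could look like when issued through the snowflake-connector-python package (the pipe, stage, table, and connection values below are placeholders, not details from the question):
import snowflake.connector  # assumes the snowflake-connector-python package is installed

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", warehouse="my_wh"
)
conn.cursor().execute("""
    CREATE PIPE my_db.my_schema.my_pipe
      AUTO_INGEST = TRUE
      AS
      COPY INTO my_db.my_schema.mytable
      FROM @my_stage/your_main_folder/      -- the stage path points at the parent folder,
      FILE_FORMAT = (TYPE = 'PARQUET')      -- so files in all sub-folders are picked up
""")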

Related

How to unzip the same zip file multiple times?

I am developing a zip extractor app in which, if I unzip the same zip file multiple times, it should extract to something like myfile, myfile-1, myfile-2, myfile-3.
Example: there is sampleproject.zip on my desktop; when I unzip it repeatedly, it should produce sampleproject, sampleproject-1, sampleproject-2.
Any suggestions?
Thanks in advance!
Based on your comment I suggest you unzip your file to a temporary directory and then move its contents into the actual directory, handling any name clashes as you do that. In outline:
Use URLForDirectory:inDomain:appropriateForURL:create:error: to create a temporary directory suitable to unzip into. You should pass the URL of your destinationPath for the appropriateForURL: parameter; this should give you a temporary directory on the same volume as destinationPath, so that placing the unzipped items into the right place is a move rather than a copy.
Unzip into the temporary directory returned by (1)
Now use NSFileManager calls to traverse the temporary directory, moving each item found to destinationPath and renaming as needed to avoid name clashes (see the sketch after this list).
Remove the temporary directory.
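For illustration only, the same algorithm sketched in Python (the answer itself targets Cocoa's NSFileManager; destination_path and the helper below are invented for the example):
import os, shutil, tempfile, zipfile

def unique_path(directory, name):
    # return a path in `directory` that does not clash: name, name-1, name-2, ...
    base, ext = os.path.splitext(name)
    candidate = os.path.join(directory, name)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, "{}-{}{}".format(base, counter, ext))
        counter += 1
    return candidate

def extract_with_renaming(zip_path, destination_path):
    # 1. unzip into a temporary directory created on the same volume as the destination
    with tempfile.TemporaryDirectory(dir=destination_path) as tmp:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)
        # 2. move each top-level item into the destination, renaming on clashes
        for item in os.listdir(tmp):
            shutil.move(os.path.join(tmp, item), unique_path(destination_path, item))
    # 3. the temporary directory is removed automatically when the `with` block exits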
If you have problems implementing this algorithm ask a new question, show your code, explain your problem, and include a link back to this question so the thread can be followed. Someone will then undoubtedly help you with the next step.
HTH

Pentaho - Check if a csv file is already loaded before loading

I am loading CSV files from a folder using Pentaho, and once files are loaded, I am making an entry into a table with the filenames that are loaded.
I need to add a check, before loading a file, that it has not already been loaded; for that I want to take the filename and compare it with the names in the table that holds the already-loaded files. Since I am new to Pentaho, I am struggling to design this approach.
Please suggest how I should go about doing this, or whether there is a totally different approach.
Your approach is valid. Keep some bookkeeping of the processed filenames in a database (you may also use a CSV file for that).
The difficulty with this approach is that the filename may not be in a field. So you have to write a master job that uses Add file name to results and hands over to a transformation that loads the CSV (press Ctrl-Space in the box and find your variable in the drop-down), checks the database with a Stream lookup, and keeps the unmatched rows with Filter rows. After the load, you Update the bookkeeping table.
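Outside Pentaho, the same bookkeeping idea looks roughly like this (a minimal Python sketch; the loaded_files table, the connection details, and the load step itself are assumptions, not part of the Pentaho design):
import glob, os
import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="localhost", user="etl", password="...", database="staging")
cur = conn.cursor()

# names of the files that were already processed
cur.execute("SELECT filename FROM loaded_files")
already_loaded = {row[0] for row in cur.fetchall()}

for path in glob.glob("/data/incoming/*.csv"):
    name = os.path.basename(path)
    if name in already_loaded:
        continue  # skip files loaded on a previous run
    # ... load the CSV into the target table here ...
    cur.execute("INSERT INTO loaded_files (filename) VALUES (%s)", (name,))
    conn.commit()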
Another approach we used successfully in the past was to load the files from a directory and move each processed file into another directory. This way it was easy to drop new files into the directory, and to retrieve processed files in case of problems.
This could be a start (the original answer included screenshots of the job and of the transformation).

Automatically add database entry after ftp upload

Sorry if this seems stupid, but I wonder if it's possible to add a database entry after an FTP upload.
To be more clear, thanks to winSCP I have several folders sending everything I put in there automatically to my server.
However, I would like to create a MySQL entry for each uploaded file, again automatically. Is it possible to do that? How?
To give the full details of what I need to do, you can read the following.
I have several folders with pictures, and each folder is uploaded automatically.
Each of those folders belongs to one user, and the goal is to give them an account and allow them to see and download those files through a web interface. Since one account = one folder, that's kind of easy.
And I think a simple .htaccess can secure things so one user can only see and download the files in his own folder, no?
However, if I want them to be able to see what's new (= something they haven't downloaded or simply marked as read), I think I need a table to manage those files.
Something like id | file (string) | read (bool).
If you think this way of proceeding is bad, then I'm open to changing how I do things, but to be clear, uploading the files needs to work this way, not through any kind of form.
Thanks for reading, and sorry for my English.
Your problem contains three steps:
Folders/files are automatically uploaded to your server directory; as you say, this is handled efficiently by winSCP.
You need to update your database with all the files and folders present in your server directory.
You need to track whether or not each file has been read/downloaded by the user.
Since your first step is in place, we don't need anything there. For the second step, you should write a script and schedule it to run at a fixed time interval using cron (on Linux or Unix) or a scheduled task (on Windows). The script would be responsible for creating a list of the file(s) present in the directory and simply inserting the information for the file(s) that are not yet present in your database.
EDIT:
This edit describes how your script file should work. As I explained, the cron job simply runs your script file at a fixed interval (which can be every minute, every hour, every day, and so on). Let's say your database table has the following columns:
fileid (varchar[20])
filepath (varchar[20])
status (boolean)
Your script file should do the following things (a sketch follows the note below):
Create a list of existing filepaths in your server directory
Run a SELECT query and create a list of the filepaths already present in the database table.
Compare list 1 with list 2 and find the ones that don't exist in list 2 (this gives you the list of filepaths that need to be inserted into the table).
Insert the filepaths you found above, and set their status to false (meaning the file has not been read/downloaded yet).
NOTE: Please keep in mind that I am not prescribing how your database table should look. It can be what you have proposed, or it can differ depending on your requirements.
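A minimal Python sketch of such a script, assuming a table like the one above with an auto-incrementing fileid (the directory, the connection details, and the table name files are placeholders); cron simply runs it at whatever interval you choose:
import os
import mysql.connector  # assumes the mysql-connector-python package

UPLOAD_DIR = "/var/ftp/uploads"  # the directory winSCP uploads into

conn = mysql.connector.connect(host="localhost", user="app", password="...", database="gallery")
cur = conn.cursor()

# list 1: files currently present in the server directory
on_disk = set()
for root, _dirs, files in os.walk(UPLOAD_DIR):
    for name in files:
        on_disk.add(os.path.relpath(os.path.join(root, name), UPLOAD_DIR))

# list 2: files already recorded in the database
cur.execute("SELECT filepath FROM files")
in_db = {row[0] for row in cur.fetchall()}

# insert the difference, marking each new file as not yet read/downloaded
for path in sorted(on_disk - in_db):
    cur.execute("INSERT INTO files (filepath, status) VALUES (%s, FALSE)", (path,))
conn.commit()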
For the third step, simply set the status of each file to unread when creating entries in your table in the second step; then, when the user clicks a file link in your application to view or download it, send a POST request to your server that updates that file's status to read.
Let me know if this helps!

How to delete lot of objects named with common prefix from S3 bucket?

I have files in S3 bucket, and their names have the following format:
username#file_id#...
How can I remove all john#doe#* items without listing them? There are thousands of them, so when a user asks my app to delete all of them, he has to wait.
For anyone who stumbles upon this now: you can create a lifecycle rule to delete, or set an expiration for, objects with a certain prefix.
There's no way to tell S3 to delete all files that meet specific criteria - you have to delete one key at a time.
Most client libraries offer a way to filter and paginate so that you only list the files you need to delete, and you can provide a status update along the way. For example, Boto's bucket listing accepts prefix as one of its parameters.
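With the newer boto3 library the same listing-plus-deleting idea looks roughly like this (a sketch; the bucket name is a placeholder):
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# list only the keys under the user's prefix, then delete them in batches of up to 1000
for page in paginator.paginate(Bucket="my-bucket", Prefix="john#doe#"):
    keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if keys:
        s3.delete_objects(Bucket="my-bucket", Delete={"Objects": keys})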
I mistakenly created logging files in the same bucket, and now there are tons of log files in my bucket.
Luckily I came across a Node.js utility, node-s3-utils, and it saved my day!
An example of deleting files with the foo/ prefix and the .txt extension:
$ s3utils files delete -c ./.s3-credentials.json -p foo/ -r 'foo\/(\w)+\.txt'

Vb.Net Document Storage

I am attempting to add a document storage module to our AR software.
I will be prompting the user to attach a doc/image to their account. I will then put a copy of this file into our folder so that we can reference it without having to rely on them keeping the file in its original place. This system is not using a database; instead it is using multiple flat files.
I am looking for guidance on how to handle these files once they have attached them to our system.
How should I store these attached files?
I was thinking I could copy the file over to a sub-directory and then rename it to an auto-generated number so that we do not have duplicates. The bad thing about this is that the contents of the folder can get rather large.
Anyone have a better way? Should I create directories and store them...?
This system is not using a database but instead its using multiple flat files.
This sounds like a multi-user system. How are you handling concurrent access issues? Your answer to that will greatly influence anything we tell you here.
Since you aren't doing anything special with your other files to handle concurrent access, what I would do is add a new folder under your main data folder specifically for document storage, and write your user files there. Additionally, you need to worry about name collisions. To handle that, I'd name each file by appending the date and username to the original file name and taking the md5 or sha1 hash of that string. Then add a file to your other data files that maps the hash values back to the original file names for users.
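The question is about VB.NET, but the naming scheme itself is language-agnostic; here it is sketched in Python purely for illustration (the function name and the separator are invented for the example):
import hashlib
from datetime import date

def stored_name(original_name, username):
    # hash the original name plus the date and username so stored names cannot collide
    raw = "{}|{}|{}".format(original_name, username, date.today().isoformat())
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# e.g. stored_name("invoice.pdf", "jsmith") returns a 40-character hex name;
# keep a separate mapping file (hash -> original name and user) with your other flat files.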
Given your constraints (and assuming a limited number of total users) I'd also be inclined to go with a "documents" folder -- plus a subfolder for each user. Each file name should include the date to prevent collisions. Over time, you'll have to deal with getting rid of old or outdated files either administratively or with a UI for users. Consider setting a maximum number of files or maximum byte count for each user. You'll also want to handle the files of departed users.