I have a folder called "masterData"; this folder has approximately 500GB of data, 100k files, 500 folders.
I would like to load a list of all the file names and their locations into SQL Server.
I can do this using SSIS, but I am wondering if there is a more efficient way, instead of having a job run for 10 hours a day checking whether each filename already exists in the table.
I wonder if there is a way to set a pointer/marker somehow, so that every new filename is added to the table automatically.
This sounds like a job for a FileSystemWatcher with associated event handlers that update your tables appropriately.
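The answer above refers to .NET's FileSystemWatcher. As a rough illustration of the same idea in Python (not the answerer's code), here is a minimal sketch using the watchdog package and pyodbc; the table name, column, folder path, and connection string are all assumptions.

    # Minimal sketch: watch masterData and record each newly created file
    # in a SQL Server table. dbo.MasterFiles and the connection string are hypothetical.
    import pyodbc
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )

    class NewFileHandler(FileSystemEventHandler):
        def on_created(self, event):
            if event.is_directory:
                return
            cur = conn.cursor()
            cur.execute("INSERT INTO dbo.MasterFiles (FilePath) VALUES (?)", event.src_path)
            conn.commit()

    observer = Observer()
    observer.schedule(NewFileHandler(), r"C:\masterData", recursive=True)
    observer.start()
    observer.join()

Either way, only new files are processed as they appear, so there is no need to rescan 100k existing rows every day.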
I have several CSV files on GCS which share the same schema but have different timestamps, for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should run every day with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into the raw folder. A Cloud Function then moves 1 file from the raw folder into the queue folder if the queue folder is empty, and does nothing otherwise (a sketch of this function appears after these steps).
Dataprep scans the queue folder as per the scheduler. If a csv file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function runs whenever a new file is added to the wrangled folder. This one creates a new BigQuery table named according to the timestamp column in the csv file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and proceeds to move 1 file from the raw folder to the queue folder if any remain.
Repeat until all tables are created.
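As a rough illustration of the file-moving Cloud Function in step 3 (not a tested implementation), here is a hedged Python sketch using the google-cloud-storage client; the bucket name and the choice of prefixes within a single bucket are assumptions.

    # Hypothetical Cloud Function: move one file from raw/ to queue/ when queue/ is empty.
    from google.cloud import storage

    BUCKET = "my-dataprep-bucket"   # assumed bucket name

    def move_next_raw_file(event, context):
        client = storage.Client()
        bucket = client.bucket(BUCKET)

        # Do nothing if a file is already waiting in queue/
        queued = [b for b in client.list_blobs(BUCKET, prefix="queue/") if not b.name.endswith("/")]
        if queued:
            return

        raw = [b for b in client.list_blobs(BUCKET, prefix="raw/") if not b.name.endswith("/")]
        if not raw:
            return

        blob = raw[0]
        bucket.copy_blob(blob, bucket, blob.name.replace("raw/", "queue/", 1))
        blob.delete()   # GCS has no rename, so a "move" is copy + delete

The step-5 function could follow the same pattern, loading the wrangled file into BigQuery (for example with the google-cloud-bigquery client) before cleaning up the queue and wrangled folders.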
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestion for my use-case is appreciated.
I have 4 different text files, each with a different name and different columns, placed in one folder. I want these 4 files to be inserted or updated into 4 different existing tables. How can I read these 4 files dynamically and insert each of them into its respective table dynamically in SSIS?
Well, you need to use a Data Flow Task to move data from a Flat File Source to a table destination (an OLE DB Destination, perhaps). Are the columns in your files delimited in any way? For example, with any of these: (;), (|) or something like that? If they are, you can create a Flat File Connection Manager and configure it to split the columns. If not, you might need to use the fixed-width option to separate your columns. To use the OLE DB Destination, you will need to create an OLE DB Connection Manager pointing to the table in your database. I could help you more if I had more information about the files you want to read the data from.
EDIT
Well, you said at the start you were working with 4 files and 4 tables, so you can create 4 Flat File Sources with 4 OLE DB Destinations as well (one of each for each flat file). If I understood you correctly, these 4 files may or may not exist yet. So if you know the names the files will get, change the package property DelayValidation to true, and then create a connection with a sample text file. You do this so the file path gets saved. The tables, in my opinion, DO need to exist. Now, when you said:
I want to load all the text files into each different existing table whenever there are files inside the folder.
The only way I know to do something similar is to schedule the execution of your package at a certain time with a SQL Server Agent Job. Please let me know if this was what you were looking for.
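Outside of SSIS, the dynamic filename-to-table routing can also be sketched in plain Python; this is only an illustration of the idea, not the SSIS approach described above, and the file names, delimiter, table names, and connection string are all assumptions.

    # Hypothetical mapping from file name to destination table.
    import csv, glob, os
    import pyodbc

    TABLE_FOR_FILE = {
        "customers.txt": "dbo.Customers",
        "orders.txt": "dbo.Orders",
        "products.txt": "dbo.Products",
        "suppliers.txt": "dbo.Suppliers",
    }

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    for path in glob.glob(r"C:\incoming\*.txt"):
        table = TABLE_FOR_FILE.get(os.path.basename(path))
        if table is None:
            continue                                  # skip unexpected files
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter="|")     # assumed delimiter
            header = next(reader)
            columns = ", ".join(header)
            placeholders = ", ".join("?" for _ in header)
            for row in reader:
                cur.execute(f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", row)
    conn.commit()

Within SSIS itself, the equivalent would be four Data Flow Tasks (or a Foreach Loop container) behind the scheduled Agent job described above.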
I am new to SQL.
What is the best way to create a TXT file if a table has more than 0 records?
The code to remove or add records to this table already exists.
I am looking for a way to create a trigger file (with no content in it) in a specific network folder.
Preferably, I would want this TXT file to be removed at the end of the day, so that the process can repeat itself every morning.
In an AFTER DELETE trigger, do a SELECT COUNT(*) from the table, or query one of the system catalog views. If the count is greater than zero, call a stored proc that drops a file onto your share drive.
To produce the file you could create a small package, or call PowerShell or bcp (after enabling xp_cmdshell), or you could create a CLR function (after enabling CLR). Since the latter two require changing a server setting, you could just create a package.
And since you don't actually need to export any data, you just create a blank file!
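A minimal sketch of the same idea as a small scheduled script rather than a package, assuming pyodbc, a hypothetical table dbo.MyTable, and a hypothetical share path; none of these names come from the question.

    # Create an empty trigger file on the share when the table has rows;
    # remove it otherwise (e.g. when run again at end of day).
    import os
    import pyodbc

    FLAG = r"\\fileserver\drops\table_has_rows.txt"   # hypothetical network path

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )
    count = conn.cursor().execute("SELECT COUNT(*) FROM dbo.MyTable").fetchval()

    if count > 0:
        open(FLAG, "w").close()      # blank file, no content needed
    elif os.path.exists(FLAG):
        os.remove(FLAG)

Scheduled in the morning (and again at end of day to clean up), this reproduces the daily create/remove cycle the question describes.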
I'm trying to use Oracle external tables to load flat files into a database, but I'm having a bit of an issue with the LOCATION clause. The files we receive have several pieces of information appended to their names, including the date, so I was hoping to use wildcards in the LOCATION clause, but it doesn't look like I'm able to.
I think I'm right in assuming I can't use wildcards; does anyone have a suggestion on how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and a parameter table. The user can specify: input directory, file mask, external table, etc. Then, when a file is found in the directory, the shell script generates a list of files matching the file mask. For each file found, issue an ALTER TABLE command to change the location on the given external table to that file and launch the rest of the PL/SQL associated with that file. This can be repeated for each file found with the file mask. I guess the benefit of this is that I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end which appears to be the only way.
I have a file watcher that looks for files in a given input directory with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an ALTER TABLE on the external table with the list of new file names.
For me this wasn't much of an issue, as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone from searching for ages for a solution.
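As a hedged sketch of that approach (directory, file mask, table name, and credentials are all assumptions), the ALTER TABLE statement can be generated from whatever files the mask currently matches:

    # Point the external table at all files currently matching the mask.
    import glob, os
    import cx_Oracle

    input_dir = "/data/incoming"                      # hypothetical input directory
    files = sorted(os.path.basename(f) for f in glob.glob(os.path.join(input_dir, "data_*.csv")))

    if files:
        location = ", ".join(f"'{f}'" for f in files)
        ddl = f"ALTER TABLE my_ext_table LOCATION ({location})"   # DDL cannot use bind variables

        conn = cx_Oracle.connect("scott", "tiger", "dbhost/ORCLPDB1")
        conn.cursor().execute(ddl)
        # ...then run the PL/SQL that loads from the external table.

The same statement could just as easily be emitted by the shell watcher and run through SQL*Plus.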
I have created a FILESTREAM filegroup at C:\Test\FilestreamGroup1
and a table with a varbinary(max) FILESTREAM column.
Now when a file is uploaded, it is physically stored under FilestreamGroup1...
Now I want to know two things:
In which format is the file stored at FilestreamGroup1 (for every single uploaded file I found 2 encoded files)?
Secondly, how do I delete an uploaded file physically? (i.e. deleting a record from the table just executes a DELETE command, but doing so will not result in physical deletion of the file from NTFS... so how can I delete the file physically?)
If you want to delete files from the file system instantly, then you need to force garbage collection manually by using CHECKPOINT.
Link
This is not a Stack Overflow question; it belongs on Server Fault (admin). It touches dev, though.
i.e. deleting a record from the table just executes a DELETE command, but doing so will not result in physical deletion of the file from NTFS... so how can I delete the file physically
Do you know what the primary reason to have a database is? To guarantee data integrity.
A delete must keep the data around until a backup is taken. What is your backup policy? You may note that when you make an update, another copy of the file is created... for that simple reason. The old one must still be available for backup, and that is just how they integrate it.
In which format is the file stored at FilestreamGroup1 (for every single uploaded file I found 2 encoded files)?
No, files are stored raw. What would be the sense in encoding them, when there are SQL functions to get the path, and it is a supported scenario for the client not to use SQL to load the file (but to ask SQL Server for the file name and path, then access it via an NTFS file share)? This also supports interop, as any program loading from a network share can be fed a SQL-driven location.
I strongly assume you have only 1 copy and somehow made an update, resulting in a second file being written.
http://msdn.microsoft.com/en-us/library/cc645962.aspx
explains how to access FILESTREAM data with SQL.
http://technet.microsoft.com/en-us/library/cc645940(v=sql.105).aspx
explains how to access FILESTREAM data using Win32.
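As a small hedged illustration of the SQL-side access the first link covers, the PathName() function returns the NTFS path of a FILESTREAM value; the table, column, and key names and the connection string below are assumptions.

    # Ask SQL Server for the NTFS path of a FILESTREAM value via PathName().
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )
    path = conn.cursor().execute(
        "SELECT FileData.PathName() FROM dbo.Documents WHERE DocId = ?", 1
    ).fetchval()
    print(path)   # the Win32 streaming access described in the second link starts from this path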
FILESTREAM files being left behind after row deleted
explains why files are left behind when a row is deleted. I found that using the extremely trivial Google search for "sql filestream delete file" and it was item 1 on the result list - did you even try Google?
Secondly, how do I delete an uploaded file physically? (i.e. deleting a record from the table just executes a DELETE command, but doing so will not result in physical deletion of the file from NTFS... so how can I delete the file physically?)
CHECKPOINT does not remove the files; they are removed by a background process, and that can take quite a while. To force deletion, use
sp_filestream_force_garbage_collection
EDIT: this works with SQL Server 2012 and later only
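Putting the pieces together, here is a hedged sketch of the full cleanup sequence (delete the row, checkpoint, then force FILESTREAM garbage collection); the table, key, database name, and connection string are assumptions, and the stored procedure requires SQL Server 2012 or later.

    # Delete a FILESTREAM row and force the physical file to be reclaimed.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )
    conn.autocommit = True   # run each statement outside an explicit transaction

    cur = conn.cursor()
    cur.execute("DELETE FROM dbo.Documents WHERE DocId = ?", 1)
    cur.execute("CHECKPOINT")
    cur.execute("EXEC sp_filestream_force_garbage_collection @dbname = N'mydb'")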
Write "checkpoint" after deleting a row. it will remove physical existence of file.
Run the below query and check, the file getting deleted from file system automatically
DELETE FROM TableName CHECKPOINT
Thanks.