ETL file loading: files created today, or files not already loaded? - pentaho

I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.
Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.
New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.
Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.
I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?

File name comparison:
If you have millions of files, comparing the whole directory listing against the database might not be what you are looking for.
You must also be sure that the files in that folder never get deleted.
Get filenames by date:
Since the files are retrieved only once a day, the timestamps can guarantee accuracy (even if files were created milliseconds apart).
This will be more efficient if there are many files.
Note, however, that Pentaho gives you the modified date, not the created date.
To do either of the above, you can use the following Pentaho step.
Configuration of the Get File Names step:
File/Directory: the folder path that contains the files.
Wildcard (RegExp): .*\.* to get all files, or .*\.pdf to get a specific format.
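For the comparison route, here is a minimal sketch of the idea (shown in Python with made-up table and column names; in Pentaho itself the same diff can be built by joining the Get File Names output against a Table Input of already loaded filenames):
# Sketch: process only filenames present on disk but not yet in the raw data table.
# Table and column names (raw_data, source_filename) are hypothetical.
import os
import sqlite3  # stand-in for whatever database holds the raw data table

def new_files(sync_dir, conn):
    on_disk = set(os.listdir(sync_dir))
    already_loaded = {row[0] for row in
                      conn.execute("SELECT DISTINCT source_filename FROM raw_data")}
    # Only filenames that are in the directory but not yet in the table.
    return sorted(on_disk - already_loaded)

conn = sqlite3.connect("warehouse.db")
for name in new_files("/data/sync", conn):
    print("would load", name)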

Related

ADF Better way to count the number of files matching a FileMask, in a known folder

I have a Known\Folder Path.
That folder contains several hundred small txt files.
Generally the filenames are of the form Prefix_<Code1>_<SubCode2>_<State>.txt
I want to know how many files there are for a specific value of Code1.
I was hoping to use the GetMetadata activity, with Path Known\Folder\Prefix_Value_*.txt, but that just returns an empty set :(
Currently I've got it working with GetMetadata on Known\Folder, with childItems captured, and then a foreach over all the files, with If on #startsWith(file.name, 'Prefix_Value').
But that results in hundreds of iterations of the loop, in sequence, and each activity takes ~1 second so it ends up taking minutes to do this check.
Is there a better way to do this? Either to directly locate all files matching my mask, or a better way to count the matching elements of a hundreds-of-items array?
Lots of little activities might be expensive if you do it often.
If you only want the count, you can do this in the following hideous way (promise it isn't written in Brainf&ck) ... it relies on the fact that you can use XPATH to scan XML in ADF. You only need a set-variable activity after your metadata lookup.
Set a variable equal to this - it will contain the number of files with 'Code1' in the name.
#{xpath(xml(json(concat('{"files":{',replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":',''),'}}'))),'count(/files/*[contains(local-name(),''Code1'')])')}
The inner part:
replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":','')
takes the metadata activity's output and strips the []{} parts and the type and name elements, then
json(concat('{"files":{',<the foregoing>,'}}')
wraps that up into a JSON object, with files as the outer key and the filenames as inner keys (with text = "file", but that's going to be irrelevant).
Then you can take that JSON, turn it into XML and query the XML.
xpath(xml(<the above JSON>), 'count(/files/*[contains(local-name(),''Code1'')])')
The XPATH query counts all the elements under /files (which are now our filenames) whose names contain the text 'Code1'.
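For anyone who wants to see what that expression actually builds, here is a rough local illustration in Python (not ADF), using a made-up childItems payload:
# Local illustration of the string-to-XML trick above (hypothetical childItems).
import json
import xml.etree.ElementTree as ET

child_items = ('[{"name":"Prefix_Code1_A_OK.txt","type":"File"},'
               '{"name":"Prefix_Code2_B_OK.txt","type":"File"}]')

# Same effect as the nested replace() calls: strip []{} and the name/type keys.
flattened = (child_items
             .replace('[', '').replace(']', '')
             .replace('{', '').replace('}', '')
             .replace(',"type":', ':')
             .replace('"name":', ''))
# flattened is now: "Prefix_Code1_A_OK.txt":"File","Prefix_Code2_B_OK.txt":"File"

# Wrap as JSON keyed by filename, then build the equivalent XML document.
files = json.loads('{"files":{' + flattened + '}}')['files']
root = ET.Element('files')
for name, kind in files.items():
    ET.SubElement(root, name).text = kind

# Equivalent of count(/files/*[contains(local-name(),'Code1')])
print(sum(1 for el in root if 'Code1' in el.tag))  # -> 1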
As of now there is no way to get the count of files matching a wildcard directly in the Get Metadata activity. You can vote for Get Metadata for Multiple Files Matching Wildcard to help progress this feature.
If you only want to copy those files, you can use Wildcard file path.
If those files stored in Azure Blob Storage or somewhere that can be got file count with prefix by API, you can use Azure Function activity.

Updating Parquet datasets where the schema changes over time

I have a single parquet file that I have been incrementally building every day for several months. The file size is around 1.1 GB now, and when read into memory it approaches my PC's memory limit. So I would like to split it up into several files, based on the year and month combination (e.g. Data_YYYYMM.parquet.snappy), that will all be in a directory.
My current process reads in the daily csv that I need to append, reads in the historical parquet file with pyarrow and converts it to pandas, concats the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])), and then writes back to a single parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I am overwriting the file each time.
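For reference, the current flow looks roughly like this (a sketch with placeholder paths):
# Sketch of the current daily append flow described above (hypothetical paths).
import pandas as pd
import pyarrow.parquet as pq

df_daily_csv = pd.read_csv("incoming/2021-06-01.csv")
df_historical_parquet = pq.read_table("data/historical.parquet").to_pandas()

# pandas aligns columns by name and fills missing ones with NaN,
# which is why the occasional new column is not a problem here.
combined = pd.concat([df_daily_csv, df_historical_parquet])

combined.to_parquet("data/historical.parquet", compression="snappy")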
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data over multiple months. I have tried this already and got errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.Dataset. From the doc it looks like it takes a type of pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter, but got an error message (sorry, I forget the error right now and can't connect to my workstation, so I can't reproduce it - I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure if my use case fits what that covers, and the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case, and I wonder if I am missing something or if parquet isn't meant for schema changes over time like I'm experiencing. I've considered inspecting the schema of each new file, comparing it against the historical schema, and, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset, but I'm really hoping to avoid that.
So my questions are:
Will using a pyarrow parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data in multiple parquet files even if the schema is different? To be specific, my expectation is that the new column would be appended and the values prior to when this column was available would be null. If so, how do you do this?
If the answer to 1 is no, is there another method or library for handling this?
Some resources I've been going through.
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset

How to find less frequently accessed files in HDFS

Besides using Cloudera Navigator, how can I find the less frequently accessed files in HDFS?
I assume that you are looking for the time a file was last accessed (opened, read, etc.), because the further in the past that is, the less frequently the file is being accessed.
While you can do this in Linux quite simply via ls -l -someMoreOptions, in HDFS more work is necessary.
You could monitor /hdfs-audit.log for cmd=open entries for the files in question. Or you could implement a small function that reads out FileStatus.getAccessTime(), as mentioned under "Is there anyway to get last access time of HDFS files?" or "How to get last access time of any files in HDFS?" in the Cloudera Community.
In other words, it will be necessary to create a small program which scans all the files and reads out the properties
...
status = fs.getFileStatus(new Path(line));
...
long lastAccessTimeLong = status.getAccessTime();
Date lastAccessTimeDate = new Date(lastAccessTimeLong);
...
and sort the result. With that you will be able to find files which have not been accessed for a long time.
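For the audit-log option mentioned above, here is a rough sketch of the idea (Python, assuming the usual key=value audit line format with cmd= and src= fields; the log path depends on your installation):
# Count opens per file from an hdfs-audit.log snapshot, so rarely opened files stand out.
import re
from collections import Counter

opens = Counter()
pattern = re.compile(r'cmd=open\s+src=(\S+)')

with open('hdfs-audit.log') as log:   # actual path depends on your installation
    for line in log:
        match = pattern.search(line)
        if match:
            opens[match.group(1)] += 1

# Least frequently opened files first; files that never appear were not opened at all.
for path, count in sorted(opens.items(), key=lambda kv: kv[1])[:20]:
    print(count, path)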

SSIS looping through Files multiple times

I have an SSIS package where part of it loops through a directory and dumps every filename into a sql table.
Functionally, it works and I can use it.
However, the files in the directory are all Excel files (not sure if that's relevant), and instead of getting all 50 filenames, I get duplicates of every filename.
Oddly enough, it saves every file name exactly 1024 times.
I'm able to find what I need easily with a SELECT DISTINCT, but for performance purposes I'd like to stop it from doing this.
Anyone have any hints?

Processing Files - Keeping Track

Currently we have an application that picks files out of a folder and processes them. It's simple enough but there are two pretty major issues with it. The processing is simply converting images to a base64 string and putting that into a database.
Problem
The problem is that after a file has been processed it doesn't need processing again, and for performance reasons we really don't want it to be reprocessed.
Moving the files after processing is also not an option as these image files need to always be available in the same directory for other parts of the system to use.
This program must be written in VB.NET as it is an extension of a product already using this.
Ideal Solution
What we are looking for really is a way of keeping track of which files have been processed so we can develop a kind of ignore list when running the application.
For every image file Image0001.ext, once it has been processed, create a second file Image0001.ext.done. When looking for files to process, use a filter on the extension type of your images, and as each filename is found check for the existence of a matching .done file.
This approach will get incrementally slower as the number of files increases, but unless you move (or delete) files this is inevitable. On NTFS you should be OK until you get well into the tens of thousands of files.
EDIT: My approach would be to apply KISS:
Everything is in one folder, so there cannot be a huge number of images: I don't need to handle hundreds of files per hour, every hour of every day (the first run might be different).
Writing a console application to convert one file (passed on the command line) is easy. Left as an exercise.
There is no indication of any urgency to the conversion: it can be scheduled to run every 15 minutes (say). Also left as an exercise.
Use PowerShell to run the program for all images not already processed:
cd $TheImageFolder;
# .png assumed as image type. Can have multiple filters here for more image types.
# ProcessFile stands for the one-file converter console app from the step above.
Get-ChildItem -Filter *.png |
Where-Object { -not (Test-Path -Path ($_.FullName + '.done')) } |
ForEach-Object { ProcessFile $_.FullName; New-Item -Path ($_.FullName + '.done') -ItemType File }
In a table, store the file name, file size (and file hash, if you need to be more sure about the file) for each file processed. Now, when you're taking a new file to process, you can compare it with your table entries (a simple query would do). Using hashes might degrade your performance, but you can be a bit more certain about an already processed file.
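A minimal sketch of that tracking-table idea (table and column names are made up, and it is shown in Python only for brevity; per the constraint above the real implementation would be in VB.NET):
# Sketch: record name, size and hash of each processed file and skip known ones.
import hashlib
import os
import sqlite3

conn = sqlite3.connect("processed.db")
conn.execute("""CREATE TABLE IF NOT EXISTS processed_files
                (name TEXT, size INTEGER, sha256 TEXT,
                 PRIMARY KEY (name, size, sha256))""")

def file_key(path):
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return os.path.basename(path), os.path.getsize(path), digest

def already_processed(path):
    row = conn.execute(
        "SELECT 1 FROM processed_files WHERE name=? AND size=? AND sha256=?",
        file_key(path)).fetchone()
    return row is not None

def mark_processed(path):
    conn.execute("INSERT OR IGNORE INTO processed_files VALUES (?, ?, ?)", file_key(path))
    conn.commit()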