I have an SSIS package where part of it loops through a directory and dumps every filename into a sql table.
Functionally, it works and I can use it.
I'm able to find what I need easy with a SELECT DISTINCT, but for performance purposes, I'd like to stop it from doing this.
However, the files in the directory are all excel files(not sure if that's relevant) and instead of getting all 50 files, I get duplicates of every filename.
Oddly enough, it saves every file name exactly 1024 times.
Anyone have any hints?
Related
I'm very new to Databricks and Spark, so I hope my question is clear. If not, please let me know.
I have a folder in azure with more than 2 million XML-files. The goal is to convert all these files into one CSV-file. I have code that can convert XML to CSV and then add it to a CSV-file in azure. I've tested it with 50.000 files and it worked.
However, when I want to convert all XML-files (+2 million), I get the error that the driver limit is exceeded. I do not want to increase this limit, since it is not very efficient, so I came up with the idea to convert one XML-file at a time and then add it (append) to the CSV-file. So instead of converting all XML-files in one job, I want to convert one XML-file per job.
A colleague was able to develop a code in Scala that creates a table with all +2 million file paths. I can access this table sing SQL:
(The full paths are not shown due to security reasons).
What I actually need is a code in Python that can loop thru this table and retrieve one path (as string) at a time. The reason I need this in Python is because I have the code to convert to CSV in Python. The conversion only needs the path as string to perform. If I’m able to put this in a loop, each loop a new string is retrieved from the table as string, converted to CSV and then added to one CSV-file.
So my question is: how can I loop thru this table, returning the path (value of the table) as string with each iteration? This iteration should be able to go thru the whole list (+2 million paths).
I hope my question is clear and someone can help.
Best regards,
Ganesh
I tried a Hive process,
which generate words frequency rank from
sentences,
I would like to output not multiple files but
one file.
I searched the similar question this web site,
I found mapred.reduce.tasks=1,
but it didn't generate one file but 50 files.
The process l tried has 50 input files and
they are all gzip file.
How do I get one merged file?
50 input files size is so large that I suppose the
reason may be some kind of limit.
in your job use Order By clause with some field.
So that hive will enforce to run only one reducer as a result you are going to end up with one file has created in the HDFS.
hive> Insert into default.target
Select * from default.source
order by id;
For more details regards to order by clause refer to this and this links.
Thank you for your kind answers,
you are really saving me.
I am trying order by,
but it is taking much time,
i am waiting for it.
All I have to do is get one file
to make output file into input of
the next step,
I am also going to try simply cat all files from reducer outputs according to the advice,
if I will do it, I am worried that files are unique and does not have any the same word between files , and whether it is normal gzip file made by catting multiple gzip files.
We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We tried then to make use of the fact that our first columns of SKUs is already ordered and not have to apply this ordering inside of our application.
But then we just found that the files we get from GCS changes the order in which the skus appear.
It looks like the files are created by several processes reading the results and saving it in different files, which breaks the ordering we wanted to maintain.
As an example, if we have 2 files created, the first file would look something like that:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.0848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.328
sku_1,sku_3,0.736
This is an example of what it looks like two processes reading the results and each one saving its current row on a specific file which changes the order of the column.
So we looked for some flag that we could use to force the preservation of the ordering but couldn't find any so far.
Is there some way we could use to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,
As far I know there is no flag to maintain order.
As a workaround you can rethink your data output to use of NESTED type, and make sure that what you want to group together are converted in NESTED rows, and you can export to JSON.
is there some workaround?
As an option - you can move your processing logic from Python to BigQuery, thus eliminating moving data out of BigQuery to GCS.
I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.
Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.
New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.
Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.
I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?
File name comparison:
If you have millions of files then comparison might not what you are
looking for.
You must be sure that the files in the said folder never gets
deleted.
Get filenames by date:
Since these filenames are retrieved once a day can guarantee the
accuracy. (Even they created in millisecond difference)
Will be efficient if many files are there.
Pentaho gives the modified date not the created date.
To do either of the above, you can use the following Pentaho step.
Configuration Get File Names step:
File/Directory: Give the folder path contains the files.
Wildcard (RegExp): .*\.* to get all or .*\.pdf to get specific
format.
Currently we have an application that picks files out of a folder and processes them. It's simple enough but there are two pretty major issues with it. The processing is simply converting images to a base64 string and putting that into a database.
Problem
The problem is after the file has been processed, it won't need processing again and for performance reasons we don't really want it to be so.
Moving the files after processing is also not an option as these image files need to always be available in the same directory for other parts of the system to use.
This program must be written in VB.NET as it is an extension of a product already using this.
Ideal Solution
What we are looking for really is a way of keeping track of which files have been processed so we can develop a kind of ignore list when running the application.
For every processed image file Image0001.ext, once processed create a second file Image0001.ext.done. When looking for files to process, use a filter on the extension type of your images, and as each filename is found check for the existence of a .done file.
This approach will get incrementally slower as the number of files increases, but unless you move (or delete) files this is inevitable. On NTFS you should be OK until you get well into the tens of thousands of files.
EDIT: My approach would be to apply KISS:
Everything is in one folder, therefore cannot be a big number of images: I don't need to handle hundreds of files per hour every hour of every day (first run might be different).
Writing a console application to convert one file (passed on the command line) is each. Left as an exercise.
There is no indication of any urgency to the conversion: can schedule to run every 15min (say). Also left as an exercise.
Use PowerShell to run the program for all images not already processed:
cd $TheImageFolder;
# .png assumed as image type. Can have multiple filters here for more image types.
Get-Item -filter *.png |
Where-Object { -not (Test-File -path ($_.FullName + '.done') } |
Foreach-Object { ProcessFile $_.FullName; New-Item ($_.FullName + '.done') -ItemType file }
In a table, store the file name, file size, (and file hash if you need to be more sure about the file), for each file processed. Now, when you're taking a new file to process, you can compare it with your table entries (a simple query would do). Using hashes might degrade your performance, but you can be a bit more certain about an already processed file.