I've been looking for a lightweight, scalable solution to enrich a CSV file with additional metadata from a database. Each line in the CSV represents a data item, and the columns hold the metadata belonging to that item.
Basically I have a CSV extract and I need to add additional metadata from a database. The metadata can be accessed via ODBC or REST API call.
I have a number of options in my head but I'm looking for other ideas. My options are as follows:
Import the CSV into a database table, apply the additional metadata with SQL UPDATE statements by finding the necessary metadata with SELECT statements, and then export the data back into CSV format. For this solution I was thinking of using an ETL tool, which may be a bit heavyweight for this problem.
I also thought about a Node.js-based solution where I read the CSV in, call the web service to get the metadata, and write the data back into the CSV file. However, the CSV can be quite large, with potentially tens of thousands of rows, so this could be heavy on memory or, with line-by-line processing, not very performant.
If you have a better solution in mind, please post. Many thanks.
I think you've come up with a couple of pretty good ideas here already.
Running with your first suggestion using an ETL tool to enrich your CSV files, you should check out https://github.com/streamsets/datacollector
It's a continuous ingestion approach, so you could even monitor a directory of CSV files to load as you get them. While there's no specific functionality yet for doing lookups in a database, it's certainly possible in a number of ways (including writing your own custom logic in Java, or a script in Python or JavaScript).
*Full disclosure: I work on this project.
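To illustrate the memory and performance concern raised in the question, here is a minimal sketch of a streaming approach that reads the CSV row by row but performs one metadata lookup per batch instead of one per row. It is written in Python rather than Node.js purely for illustration. The names `enrich_csv`, `fetch_metadata`, and `batch_size` are hypothetical; `fetch_metadata` stands in for whatever ODBC query or REST call returns metadata for a list of keys.

```python
import csv
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` rows so memory use stays bounded."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def enrich_csv(src, dst, key: str, extra_fields: List[str],
               fetch_metadata: Callable[[List[str]], Dict[str, dict]],
               batch_size: int = 500) -> None:
    """Stream rows from src to dst, adding extra_fields via one lookup per batch.

    fetch_metadata is a hypothetical callback: it receives a list of key
    values and returns {key_value: {field: value}} for the keys it found.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + extra_fields
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for chunk in batched(reader, batch_size):
        found = fetch_metadata([row[key] for row in chunk])
        for row in chunk:
            meta = found.get(row[key], {})
            for field in extra_fields:
                # Rows without a match get an empty value rather than failing.
                row[field] = meta.get(field, "")
            writer.writerow(row)
```

Only one batch is held in memory at a time, and the number of round trips to the metadata source drops from one per row to one per batch.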
We are designing a new ingestion framework (Cloud Storage -> BigQuery) using Cloud Functions. However, we receive some files (json, csv) that are corrupted and cannot be inserted as is (bad field names, missing columns, etc.) not even as external tables. Therefore, we would like to ingest every row to one cell as a JSON string and deal with the issues when we cleanse the data in BigQuery.
Is there a way to do that natively and efficiently, with as little processing as possible (so Cloud Functions wouldn't time out)? I wrote a function that processes the files and wraps lines one by one, but for bigger files it won't be an option. We would prefer to stay with Cloud Functions to keep this as lightweight as possible.
My option in that case is to ingest the CSV with a dummy separator, for instance # or |. I know that I will never have those characters in my data, which is why I chose them.
That way, schema autodetection detects only one column and creates a table with a single string column.
If you can pick a character like that, it's the easiest solution, but without any guarantee of course (it's a corrupted file, so it's hard to know in advance which characters will be unused).
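One way to reduce the guesswork is to scan the file once and pick a separator that really does not occur in it. The following is a minimal sketch under that assumption; `pick_unused_delimiter` and the candidate list are hypothetical, and the chosen character would then be supplied as the field delimiter of the load job.

```python
from typing import Iterable, Optional, Sequence

# Hypothetical candidate separators, tried in order of preference.
CANDIDATES = ("|", "#", "~", "\t", "^")


def pick_unused_delimiter(lines: Iterable[str],
                          candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that appears in no line, or None if all are used."""
    remaining = list(candidates)
    for line in lines:
        # Drop every candidate that occurs in this line.
        remaining = [c for c in remaining if c not in line]
        if not remaining:
            return None  # every candidate occurs somewhere in the data
    return remaining[0]
```

Because the function only iterates over lines, it can be fed an open file object and never holds the whole file in memory. It does still read the whole file once, so it trades a little processing for certainty.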
I’m using the Mosaic Decisions data flow feature to read a file from Azure Blob storage, do a few transformations, and write the data back to Azure. It worked fine, except that at the output file path I gave, it created a folder containing many files with strange names such as “part-000”. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic Decisions uses Apache Spark as its backend execution engine. In Spark, the dataframe that is read is split into multiple partitions, and these partitions are written to the output location in parallel. That's why it creates multiple files at the target location named "part-0000", "part-0001", etc. ("part" here stands for partition).
The workaround for this is to check "combine-output-files-into-one" in the writer node. This will combine all of the part files into one big file. But use this with caution and only if you really need a single file, as it comes with a performance tradeoff.
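To make the idea of combining part files concrete, here is a minimal sketch of what such a step does conceptually. It is not how Mosaic Decisions or Spark implements the option; `combine_part_files` and its parameters are hypothetical, and it assumes each part file ends with a newline.

```python
import shutil
from pathlib import Path


def combine_part_files(output_dir: str, target: str, has_header: bool = False) -> int:
    """Concatenate part-* files (in name order) into one file; return the part count."""
    parts = sorted(Path(output_dir).glob("part-*"))
    with open(target, "wb") as out:
        for index, part in enumerate(parts):
            with open(part, "rb") as src:
                if has_header and index > 0:
                    src.readline()  # skip the header repeated in later parts
                shutil.copyfileobj(src, out)
    return len(parts)
```

The sketch also shows where the performance tradeoff comes from: the partitions are written in parallel, but the final file has to be produced by a single sequential writer.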
We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files, but can ConcurrentAppend write to a file that has been outputted from a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, no append operations can be executed on a file after the join (concatenate) operation. We are currently working on a feature to remove this limitation.
Could you please suggest a solution? I have two files, each with 80 to 90k products, and the two files are interlinked (one file has information about the other). I need to generate a single file by looking up records across the two files. The files probably arrive at the same time, with different names.
Both files are CSV, and I need to generate a new CSV.
Is the only way to keep one of these files in memory and look records up by iterating over it?
I planned to use Batch inside DataMapper. Is there any way to keep the first file in a DataMapper user-defined table or something like that, and then have the second file look records up in it? (I'm not provided with an external DB.)
If one of the files had only some 5,000 or 10k lines, I could keep that in memory and have the 80k file look records up in it. I'm not comfortable keeping an 80 or 90k file in memory.
I have referred to this link: Mule ESB - design a multi file processing flow when files are dependent on each other.
Could you please suggest the best solution?
Also, any idea how long it takes to process the files? Thanks in advance.
Mule Studio: 5.3.1 and Runtime: 3.7.2
I would think of the problem as two distinct events from Mule's perspective, and plan to keep state from the first one in a "database" of some kind. This doesn't have to be an Oracle cluster or anything; you can run H2 in process, or Redis on the same server as Mule, for example.
I think you're on the right track with the Batch idea. When the first file is received, I'd create a database record for each of its rows in a batch job. Then, when the second file is received, I'd run a second batch job that looks up the relevant information from the database and generates the CSV file you need. It could also remove the records that have been matched from the database in a subsequent batch step.
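The two-step flow described above can be sketched outside Mule as follows. This is only an illustration in Python, with SQLite standing in for an in-process database such as H2. The table name `staged` and the functions `stage_first_file` and `merge_second_file` are hypothetical.

```python
import csv
import json
import sqlite3
from typing import List


def stage_first_file(conn: sqlite3.Connection, src, key: str) -> None:
    """Step 1: store each row of the first file, keyed by its lookup column."""
    conn.execute("CREATE TABLE IF NOT EXISTS staged (k TEXT PRIMARY KEY, row TEXT)")
    reader = csv.DictReader(src)
    conn.executemany(
        "INSERT OR REPLACE INTO staged (k, row) VALUES (?, ?)",
        ((r[key], json.dumps(r)) for r in reader),
    )
    conn.commit()


def merge_second_file(conn: sqlite3.Connection, src, dst,
                      key: str, extra_fields: List[str]) -> None:
    """Step 2: stream the second file, look up each key, write the merged CSV."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + extra_fields
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for r in reader:
        hit = conn.execute("SELECT row FROM staged WHERE k = ?", (r[key],)).fetchone()
        staged = json.loads(hit[0]) if hit else {}
        for field in extra_fields:
            r[field] = staged.get(field, "")
        writer.writerow(r)
        if hit:
            # Remove matched records so only unmatched ones remain afterwards.
            conn.execute("DELETE FROM staged WHERE k = ?", (r[key],))
    conn.commit()
```

Neither file is held in memory as a whole: the first is written to the database row by row, and the second is streamed while each lookup goes through the primary-key index.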
For the transformations, I'd recommend trying DataWeave instead of DataMapper. It's a better way to write transformation logic, and Mulesoft has deprecated DataMapper, to be removed as of Mule 4.0.
This is my first time working on a big project for a client, so I was not sure how to solve this problem. I have come up with two different ideas, but I need a professional opinion about which one is better :)
Situation :
There is an application which runs on different clients' iPads. Application data is stored in a giant XML file. This XML file is shared among all clients by a server, so the server has a centralised copy and each client has its own copy. Once a client has made changes to its XML copy, it updates the server copy, and the other clients update their copies from the updated server copy.
Only one client can make changes at a time. To enforce this, I have logic by which a client needs to get ownership from the server before it starts editing the XML, and the server will only allow one client to edit at a time.
Now, on the client side, I have to come up with logic to update my client copy and upload it to the server. There are two options:
Option 1 :
In option 1, I can directly manipulate the XML file using the GDataXML parser and upload that copy to the server. For persistence, I can save the client copy on my iPad in the Documents directory.
Option 2 :
In option 2, I can read the XML file and create a Core Data representation for local storage. Whenever I update data inside Core Data, I will change the XML file too and then upload that file to the server. Double work, but I guess better persistence.
Now, which one is more robust and advisable? Personally, I was planning to go with option 2 because it seems more robust, as I am persisting application data in Core Data. Option 1 seems like less work, but I don't know how good the persistence will be.
Sorry for the lengthy question.
Thanks for any input given.
There are a number of factors which would influence selecting the second option over the first.
How big is the XML file? If you need to work with very large documents, you may need to incrementally parse the XML (SAX) into Core Data. This will allow you to access the document's contents without loading it all into memory at once.
Do you need to run complex queries on the data? If so, you may be better off using Core Data fetch predicates, rather than XPath or XSL.
Are you already using Core Data? Depending on how the XML data is structured, it might be simpler overall to import the data into your existing persistent store.
Otherwise, you can probably make do with parsing the entire document and either traversing the resulting tree or querying with XPath.
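As a language-neutral illustration of the incremental parsing mentioned in the first point, here is a minimal sketch in Python (on iOS the event-driven equivalent would be NSXMLParser). The function `iter_records` and the element name passed as `tag` are hypothetical. Each yielded dictionary could be turned into a stored object before the next element is read.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def iter_records(source, tag: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <tag> element without building the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Collect the text of the direct children of this record.
            yield {child.tag: (child.text or "") for child in elem}
            # Release the element's children so memory does not grow
            # with the size of the document.
            elem.clear()
```

For example, given a document of `<item>` elements, `iter_records(open("data.xml", "rb"), "item")` would produce the items one at a time, so a very large file can be imported in a single pass with roughly constant memory.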
If you need to create an object graph based on what you get from the server and show it to the user (which you most probably need to do), you should stick with the second option, since it allows easy and robust data persistence.
If you do not need to present the user with any data from the XML file, you can, of course, store it in the Documents directory.
So, if this is a client application and it has at least some visual representation of the data from an XML file, you should use Core Data.
If you need the data to be updated regularly, use Core Data.