Split CSV file by the value of a column - Apache NiFi

I have a CSV file that has the following structure.
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
MARKETING,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
As you can see there is no header, but for your information the first column represents the sector the data comes from.
What I have to do is, depending on the value of the first column (for example MARKETING or ERP), send those rows to a different output directory.
For example, all rows with ERP to /output/ERP/
all rows with MARKETING to /output/marketing/
I have an idea of how to do it, but my problem is with the RouteOnAttribute processor I am using: I don't know how to refer to the first column and check its value (ERP or MARKETING) so that I can later send the rows to the correct output directory.
Here is my schema.
Thanks.

Use the PartitionRecord processor for this case.
Configure the processor with Record Reader/Writer controller services. Even though you don't have a header, you can use col1,col2,...etc. as the field names in the Avro schema.
Add a new dynamic property that tells the processor which field to use to partition the flowfile.
PartitionRecord then adds an attribute named after the partition field, with the field's value; by making use of this attribute value we can dynamically store the files into the respective directories.
Flow:
1.GetFile
2.PartitionRecord
3.PutFile //configure directory as /output/${<keep_partition_field_name_here>}
Please refer to this link for an example of configuring and using the PartitionRecord processor.
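If it helps to see the routing logic outside NiFi, here is a minimal standalone Java sketch of what the flow accomplishes: read the headerless CSV, take the first column as the partition key, and append each row under /output/<first column value>/. The input file name is a placeholder and this is only an illustration; in the actual flow this work is done by GetFile, PartitionRecord and PutFile.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.List;

    public class PartitionCsvBySector {
        public static void main(String[] args) throws IOException {
            Path input = Paths.get("input/data.csv");   // hypothetical input file
            Path outputBase = Paths.get("/output");     // base directory from the question

            List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
            for (String line : lines) {
                int comma = line.indexOf(',');
                if (comma < 0) continue;                // skip malformed/empty lines
                String sector = line.substring(0, comma);   // first column, e.g. ERP or MARKETING
                Path dir = outputBase.resolve(sector);      // e.g. /output/ERP/
                Files.createDirectories(dir);
                Files.write(dir.resolve("data.csv"),
                        (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }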
(or)
Old Approach:
Use the RouteText processor instead of the SplitText + RouteOnAttribute processors.
Configure the RouteText processor to route on the first column value (for example, Matching Strategy set to Starts With, with dynamic properties ERP and MARKETING).
Connect the ERP/MARKETING relationships to a PutFile processor and use the RouteText.Route attribute value to dynamically save the files into directories.
Flow:
1.GetFile
2.RouteText
3.PutFile //configure directory as /output/${RouteText.Route}/
You can also use the Grouping Regular Expression property value to create partitions.
Note: Using the PartitionRecord processor will be more efficient than using RouteText.

Related

Architectural design clarification

I built an API in Node.js + Express that allows React clients to upload CSV files (maximum size at most 1 GB) to the server.
I also wrote another API which, given a filename and an array of row numbers as input, selects the rows corresponding to those row numbers from the previously stored file and writes them to another result file (writeStream).
The resultant file is then piped back to the client (all via streaming).
Currently, as you can see, I am using files (basically Node.js read and write streams) to manage this asynchronously.
But I have faced serious latency (only 2 cores are used) and some memory leakage (900 MB consumption) with 15 requests, each asking for about 600 rows from files of approximately 150 MB.
I have also planned an alternate design.
Basically, I will store the entire file as a SQL table with the row number as the primary, indexed key.
I will convert the user-supplied array of row numbers into another table using SQL unnest and then join both tables to get the rows needed (see the sketch below).
Then I will send the resultant table back to the client as a CSV file.
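For what it's worth, the row-number join you describe can be expressed as a single query. A minimal sketch, shown in Java/JDBC purely for illustration (the same SQL works unchanged from node-postgres); the table csv_rows(row_num, line) and the PostgreSQL dialect are assumptions, not something from your setup:

    import java.sql.*;

    public class FetchRowsByNumber {
        // Prints the line text for each requested row number, assuming a table
        // csv_rows(row_num int primary key, line text) that holds the uploaded CSV.
        static void fetchRows(Connection conn, Integer[] rowNumbers) throws SQLException {
            String sql = "SELECT t.row_num, t.line "
                       + "FROM csv_rows t "
                       + "JOIN unnest(?) AS wanted(n) ON wanted.n = t.row_num "
                       + "ORDER BY t.row_num";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setArray(1, conn.createArrayOf("integer", rowNumbers));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1) + ": " + rs.getString(2));
                    }
                }
            }
        }
    }

The join keeps all the filtering inside the database, so only the requested rows ever leave it.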
Would this architecture be better than the previous architecture?
Any suggestions from devs are highly appreciated.
Thanks.
Use the client to do all the heavy lifting by using the XLSX package for any manipulation of the content. Then have an API to save information about the transaction. This will remove the upload to and download from the server and help you provide a better experience.

TfsAreaAndIterationProcessorOptions configuration2.json

I am looking at the source code and the configuration2.json file, and at the TfsAreaAndIterationProcessorOptions processor, which has the ability to specify a Query that is unique for both the Source and the Target. Is it possible to use different queries to migrate based on Area Path?
The TfsAreaAndIterationProcessor does not use queries and is only used to migrate Area and Iteration nodes. You can filter those nodes.

Add dataset parameters into column to use them in BigQuery later with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in this BigQuery output table, I have no means of selecting only the rows that belong to, for instance, a given month parameter.
How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.
In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there's now also a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage.
There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot, join, or unnest . . . but in your scenario, you could easily bring it into a column and do any needed matching or dropping using it.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here:
https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
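If you end up bringing $filepath into a column before writing to BigQuery, filtering by month downstream is just a string extraction. A rough sketch with the BigQuery Java client, where the table name my_dataset.output and the column name filepath are assumptions based on the file layout in the question:

    import com.google.cloud.bigquery.*;

    public class FilterByMonth {
        public static void main(String[] args) throws Exception {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
            // Extract "2017-02" from e.g. ".../datasets/{month}/2017-02-28-file.csv" and filter on it.
            String sql = "SELECT * "
                       + "FROM `my_dataset.output` "
                       + "WHERE REGEXP_EXTRACT(filepath, r'(\\d{4}-\\d{2})-\\d{2}-file\\.csv$') = '2017-02'";
            TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
            result.iterateAll().forEach(row -> System.out.println(row.get(0).getStringValue()));
        }
    }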
There is currently no access to data source location or parameter match values within the flow. Only the data in the dataset is available to you. (except SOURCEROWNUMBER())
Partial Solution
One method I have been using to mimic parameter insertion into the eventual table is to have multiple dataset imports by parameter and then union these before running your transformations into a final table.
For each known parameter search dataset, have a recipe that fills a column with that parameter per dataset and then union the results of each of these.
Obviously, this is only so scalable, i.e. it works if you know the set of parameter values that will match. Once you get to the granularity of a timestamp in the source file, this is no longer feasible.
In this example just the year value is the filtered parameter.
Longer Solution (An aside)
The alternative I eventually skated to was to define Dataflow jobs using Dataprep, use these as Dataflow templates, and then run an orchestration function that ran the Dataflow job (not Dataprep) and amended the parameters for input AND output via the API. Then a transformation BigQuery job did the final roll-up/append.
Worth it if the flow is pretty settled, but not for ad hoc work; it all depends on your scale.

Save large BigQuery results to another project's BigQuery

I need to run a join query on BigQuery in one project that may return a large amount of data (which may not fit in the VM's memory), and then save the results in BigQuery in another project.
Is there an easy way to do this without loading the data onto the VM, since the data size can vary and the VM may not have enough memory to hold it?
One method is to bypass the VM for the operation and utilize Google Cloud Storage instead.
The process will look like the following:
Create a GCS bucket that both projects have access to
Source project - Export the table to the GCS bucket (this is possible from the web interface; pretty sure the CLI tools can do it too)
Destination project - Create a new table from the files in the GCS bucket (both steps are sketched below)
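A rough sketch of those two steps with the BigQuery Java client; the project, dataset, table and bucket names are placeholders, and it assumes the query results have already been materialized into a table in the source project:

    import com.google.cloud.bigquery.*;

    public class CopyViaGcs {
        public static void main(String[] args) throws InterruptedException {
            String gcsUri = "gs://shared-bucket/export/results-*.csv";   // bucket both projects can access

            // Source project: export the (already materialized) result table to GCS.
            BigQuery source = BigQueryOptions.newBuilder().setProjectId("source-project").build().getService();
            ExtractJobConfiguration extract = ExtractJobConfiguration.of(
                    TableId.of("source-project", "source_dataset", "query_results"), gcsUri);
            source.create(JobInfo.of(extract)).waitFor();

            // Destination project: load a new table from the exported files.
            BigQuery dest = BigQueryOptions.newBuilder().setProjectId("dest-project").build().getService();
            LoadJobConfiguration load = LoadJobConfiguration.newBuilder(
                            TableId.of("dest-project", "dest_dataset", "query_results"), gcsUri)
                    .setFormatOptions(FormatOptions.csv())
                    .setAutodetect(true)
                    .build();
            dest.create(JobInfo.of(load)).waitFor();
        }
    }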
To save the result of a query to a table in any project, you do not need to save it to the VM first; you just need to set the destination property correctly, and of course you need write permissions on the dataset that contains that table!
The destination property can vary depending on the client tool you use.
For example, if you are using the REST API's jobs.insert, you should set the property below:
configuration.query.destinationTable nested object [Optional]
Describes the table where the query results should be stored. If not present, a new table will be created to store the results. This property must be set for large results that exceed the maximum response size.
configuration.query.destinationTable.datasetId string [Required]
The ID of the dataset containing this table.
configuration.query.destinationTable.projectId string [Required]
The ID of the project containing this table.
configuration.query.destinationTable.tableId string [Required]
The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.
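The same destination setting through a client library, sketched here with the BigQuery Java client; all project, dataset and table names as well as the query itself are placeholders. The query runs server-side and the result lands directly in the other project's dataset, never touching the VM:

    import com.google.cloud.bigquery.*;

    public class QueryIntoOtherProject {
        public static void main(String[] args) throws InterruptedException {
            // Client that runs (and bills) the query under the source project.
            BigQuery bigquery = BigQueryOptions.newBuilder().setProjectId("source-project").build().getService();

            QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
                            "SELECT a.*, b.extra FROM `source_dataset.a` a JOIN `source_dataset.b` b USING (id)")
                    // Destination table in the other project; you need write access to that dataset.
                    .setDestinationTable(TableId.of("dest-project", "dest_dataset", "join_results"))
                    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                    .build();

            Job job = bigquery.create(JobInfo.of(config));
            job.waitFor();   // results are written server-side; nothing is pulled to this machine
        }
    }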

How to know the configured size of the Chronicle Map?

We use Chronicle Map as persisted storage. As new data arrives all the time, we keep putting new data into the map. Thus we cannot predict the correct value for net.openhft.chronicle.map.ChronicleMapBuilder#entries(long). Chronicle 3 will not break when we put in more data than expected, but performance will degrade. So we would like to recreate this map with a new configuration from time to time.
Now to the real question: given a Chronicle Map file, how can we know which configuration was used for that file? Then we can compare it with the actual amount of data (the source of this knowledge is irrelevant here) and recreate the map if needed.
entries() is a high-level config that is not stored internally. What is stored internally is the number of segments, the expected number of entries per segment, and the number of "chunks" allocated in the segment's entry space. They are configured via ChronicleMapBuilder.actualSegments(), entriesPerSegment() and actualChunksPerSegmentTier() respectively. However, there is no way at the moment to query the last two numbers from the created ChronicleMap, so it doesn't help much. (You can query the number of segments via ChronicleMap.segments().)
You can contribute to Chronicle-Map by adding getters to ChronicleMap to expose those configurations. Or, you need to store the number of entries separately, e.g. in a file alongside the persisted ChronicleMap file.
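A minimal sketch of that second suggestion: store the configured entries() in a small sidecar file next to the persisted map, so a later run can compare it with the real size and decide whether to recreate the map. The file names, key/value types and the 1.5x threshold are arbitrary choices for illustration:

    import java.io.*;
    import java.util.Properties;
    import net.openhft.chronicle.map.ChronicleMap;

    public class MapWithSidecarConfig {
        public static void main(String[] args) throws IOException {
            File mapFile = new File("data.cmap");
            File configFile = new File("data.cmap.properties");   // sidecar holding the builder config
            long configuredEntries = 1_000_000;

            // Remember which entries() value the map was created with.
            Properties props = new Properties();
            props.setProperty("entries", Long.toString(configuredEntries));
            try (OutputStream out = new FileOutputStream(configFile)) {
                props.store(out, "ChronicleMap builder configuration");
            }

            ChronicleMap<CharSequence, CharSequence> map = ChronicleMap
                    .of(CharSequence.class, CharSequence.class)
                    .averageKey("user:1234567890")
                    .averageValue("{\"some\":\"payload\"}")
                    .entries(configuredEntries)
                    .createPersistedTo(mapFile);

            // Later (e.g. on startup): read the sidecar back and compare with the actual size.
            try (InputStream in = new FileInputStream(configFile)) {
                props.load(in);
            }
            long originallyConfigured = Long.parseLong(props.getProperty("entries"));
            if (map.size() > originallyConfigured * 1.5) {
                // recreate the map with a larger entries() value, copying the data over
            }
            map.close();
        }
    }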