Need help in Apache Camel multicast/parallel/concurrent processing

I am trying to achieve concurrent/parallel processing for my requirement, but I have not been able to get it working despite multiple attempts.
I have 5 remote directories (which may be added or removed) that contain log files. I want to download them every 15 minutes to my local directory, and I want to perform Lucene indexing after the FTP transfer job completes. I also want to add routes dynamically.
Since all those remote machines are different endpoints and different routes, I don't have any particular endpoint to kick off all of these.
Start
<parallel>
<download remote dir from: sftp1>
<download remote dir from: sftp2>
....
</parallel>
<After above task complete>
<start Lucene indexing>
<end>
Repeat the above every 15 minutes.
I want to download all folders in parallel; kindly suggest a solution if anybody has worked on a similar requirement.
I would like to know how to start/initiate these multiple routes (for these multiple remote directories) when I don't have a starter endpoint. I would like to run all FTP operations in parallel and, on completing those, start the indexing. Thanks for taking the time to read this post; I really appreciate your help.
I tried something like this:
from("bean:foo?method=start").multicast().to("direct:a").to("direct:b")...
from("direct:a").from("sftp:xxx").to("localdir")
from("direct:b").from("sftp:xxx").to("localdir")

camel-ftp supports periodic polling via the consumer.delay property
add camel-ftp consumer routes dynamically for each server, as shown in this unit test
you can then aggregate your results based on a size or timeout value to initiate the Lucene indexing, etc.
[todo - put together an example]
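In the meantime, here is a rough sketch of the idea (in Scala, using the Java DSL; the host names, credentials, local paths, the 15-minute delay value and the luceneIndexer bean are all placeholders, and this is untested):

import org.apache.camel.builder.RouteBuilder
import org.apache.camel.processor.aggregate.GroupedExchangeAggregationStrategy

// One sftp consumer route per remote directory; the host list could come from
// configuration so directories can be added or removed without code changes.
class FtpDownloadRoutes(hosts: Seq[String]) extends RouteBuilder {
  override def configure(): Unit = {
    hosts.foreach { host =>
      // consumer.delay=900000 ms polls each remote directory every 15 minutes
      from(s"sftp://user@$host/logs?password=secret&consumer.delay=900000&binary=true")
        .to(s"file:/local/download/$host")        // store each file in a per-host local dir
        .setHeader("sourceHost", constant(host))
        .to("direct:downloaded")                  // tell the aggregator about this file
    }

    // Collect the downloaded-file exchanges and trigger indexing either once a
    // batch of the expected size has arrived or after a quiet period.
    from("direct:downloaded")
      .aggregate(constant(true), new GroupedExchangeAggregationStrategy())
        .completionSize(hosts.size)
        .completionTimeout(600000)
      .to("bean:luceneIndexer?method=index")      // assumed bean in the registry
  }
}

Routes for newly added servers could then be registered at runtime with something like camelContext.addRoutes(new FtpDownloadRoutes(Seq("sftp6.example.com"))), which is the "add routes dynamically" part; since the number of files per poll is not fixed, the completionTimeout is what would usually end a batch in practice.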

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads something like 70K records from the DB into a data frame; each record has 2 fields.
After reading the data from the DB, we do a minor mapping and load the result as a broadcast for later use.
Now, in the local environment, there is a timeout exception from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the Spark session with a local "spark.master".
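That is, something roughly like this (the app name is just a placeholder, and the exact builder call may differ):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")      // run everything inside the local JVM
  .appName("local-test")
  .getOrCreate()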
When I limit the max of records to 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code will work properly?
Update:
I tried changing a lot of Spark-related configurations in my local environment: memory, the number of executors, timeout-related settings, and more, but nothing helped! I just got the timeout after a longer time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records; when repartitioning it to 2 or more partitions, the process worked correctly and I managed to map and collect as needed.
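For reference, the variant that worked looks roughly like this (a sketch only: dataframe and the column names are the ones from the snippet above, and 2 is simply the smallest partition count that worked here):

// same select as before, but split the single 62K-row partition before collecting
val mapping: Map[String, Long] =
  dataframe
    .select("id", "mapping_id")
    .repartition(2)
    .rdd
    .map(row => row.getString(0) -> row.getLong(1))
    .collectAsMap()
    .toMap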
Any idea why this solves the issue? Is there a Spark configuration that can solve this instead of repartitioning?
Thanks!

how to run multiple coordinators in oozie bundle

I'm new to Oozie bundles. I want to run multiple coordinators one after another in a bundle job. My requirement is that after completion of one coordinator job a _SUCCESS file will be generated, and then, using that _SUCCESS file, the second coordinator should be triggered. I don't know how to do that. For that I used the data dependency technique, which keeps track of the output files generated by the previous coordinator. I'm sharing some code which I tried.
Let's say there are 2 coordinator jobs: A and B. I want to trigger only coordinator A, and only if the _SUCCESS file for coordinator A is generated should coordinator B start.
A - coordinator.xml
<workflow>
<app-path>${aDir}/aWorkflow</app-path>
</workflow>
This will call the respective workflow, and the _SUCCESS file is generated at the ${aDir}/aWorkflow/final_data/${date}/aDim location, so I included this location in the B coordinator:
<dataset name="input1" frequency="${freq}" initial-instance="${START_TIME1}" timezone="UTC">
<uri-template>${aDir}/aWorkflow/final_data/${date}/aDim</uri-template>
</dataset>
<done-flag>_SUCCESS</done-flag>
<data-in name="coordInput1" dataset="input1">
<instance>${START_TIME1}</instance>
</data-in>
<workflow>
<app-path>${bDir}/bWorkflow</app-path>
</workflow>
But when I run it, the first coordinator itself gets KILLED, although if I run them individually they run successfully. I don't understand why they are all getting KILLED.
Please help me sort this out.
I found an easy way to do this. I'm sharing the solution, along with coordinator B's coordinator.xml.
1) The dataset instance should be the start time of the second coordinator, not a time instance of the first coordinator; otherwise that coordinator will get KILLED.
2) If you want to run multiple coordinators one after another, you can also include controls in coordinator.xml, e.g. concurrency, timeout or throttle. Detailed information about these controls can be found in chapter 6 of the "Apache Oozie" book.
3) In <instance> I included ${coord:latest(0)}; it will take the latest generated folder in the mentioned output path.
4) For "input-events" it is mandatory to pass its name as an input to ${coord:dataIn('coordInput1')}; otherwise Oozie will not consider the dataset.
<controls>
<timeout>30</timeout>
<concurrency>1</concurrency>
</controls>
<datasets>
<dataset name="input1" frequency="${freq}" initial-instance="${START_TIME1}" timezone="UTC">
<uri-template>${aimDir}/aDimWorkflow/final_data/${date}/aDim</uri-template>
<done-flag>_SUCCESS</done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="coordInput1" dataset="input1">
<instance>${coord:latest(0)}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${bDir}/bWorkflow</app-path>
<configuration>
<property>
<name>input_files</name>
<value>${coord:dataIn('coordInput1')}</value>
</property>
</configuration>
</workflow>
</action>

Jmeter - Getting previous results in mail

I'm using JMeter; it runs automatically every 4 hours (through crontab). I'm sending the results file (CSV) in the mail at the end of the test. I always see the file of the previous test, not the current one (I can tell by the hour).
The structure is this: one 'Test Plan' (I checked 'Run Thread Groups consecutively' and 'Run tearDown Thread Groups after shutdown of main threads'), two 'Thread Groups', at the end of each of which I write results to a CSV file using 'View Results Tree', and at the end a 'tearDown Thread Group' that uses an SMTP sampler to send the files created.
Any help would be appreciated.
EDIT:
[screenshots of the SMTP sampler settings and of the results file configuration were attached here]
This might be due to the autoflush policy, which flushes the buffer's content only when the buffer is full.
As you use a tearDown Thread Group, results are not guaranteed to be fully written, as the test is not really finished.
The fact that you think you are sending the previous test's file might be due to JMeter appending data to the same results file.
So:
1/ Ensure you move or delete the file once it has been sent.
2/ Edit user.properties and add:
jmeter.save.saveservice.autoflush=true
This will make JMeter write any sample result to the file immediately after it is executed.
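To further avoid the next run appending to (or being confused with) the previous run's file, the results file could also be given a per-run name by putting a timestamp function in the listener's Filename field, along the lines of:
results_${__time(yyyyMMdd-HHmmss)}.csv
(the exact location and pattern here are only an illustration); combined with deleting the file after it is mailed, each mail then only contains the current run.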

julia on PBS cluster: what to give to addprocs()?

I'm trying to set up a cluster across machines on a PBS-managed cluster. I'm perfectly able to compute within one node by saying julia -p 12 (after having reserved one node with 12 CPUs).
I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE); on this one, something is going wrong.
You can see everything I'm doing, including submit scripts etc, on this branch of a github repo.
To get a list of machines, I parse the PBS_NODEFILE, which, for a submit script with the option
#PBS -l nodes=2:ppn=12 # give me 2 nodes with 12 processors each
looks something like this:
red0004
red0004
...
red0004
red0347
...
red0347
I parse this file with bind_pe_procs() in sge.jl in the repo and give a vector of machine names to addprocs. When I submit this, I get an error; I put up a gist with the resulting SSH error. I don't know what it means.
Does this have to do with a system setting, i.e. do I have to talk to the sysadmin about SSH between machines? What are the right questions to ask?
I am unsure about what exactly I have to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself?), so I exclude ENV["HOST"] = node001 from my list. But what about all processors with the same name node002? Do I list all of those:
machines = [ "red0347" for i=1:12]
or just once
machines = ["red0347"]
in addprocs(machines)
Thanks!

Restart my delta loading after deleting the InfoPackage in PSA by mistake

Here I have got an issue; can someone please help me to resolve it?
I was trying to extract some data to DS 0FI_AP_6...
Then in the InfoPackage Monitor I can see something like:
-->Requests (messages): Everything OK
-->Extraction (messages): Everything OK
-->Transfer (IDocs and TRFC): Missing messages or warnings
-->Info IDoc 2 : sent, not arrived ; IDoc ready for dispatch (ALE service)
Data Package 1 : 23752 Records arrived in BW
Data Package 2 : 15216 Records arrived in BW
Request IDoc : Application document posted
Info IDoc 1 : Application document posted
Info IDoc 3 : Application document posted
Info IDoc 4 : Application document posted
-->Processing (data packet): Everything OK
Data Package 1 ( 38672 Records ) : Everything OK
In the Status menu I have a message like:
Missing data packages for PSA Table
Diagnosis
Data packets are missing from PSA Table . BI processing does not
return any errors. The data transport from the source system to BI was
probably incorrect.
Procedure
Check the tRFC overview in the source system.
You access this log using the wizard or following the menu path
"Environment -> Transact. RFC -> Source System".
Error handling:
If the tRFC is incorrect, resolve the errors listed there.
Check that the source system is connected properly to BI. In
particular, check the remote user authorizations in BI.
Please suggest how to resolve this issue.
Thanks in advance for your help; a quick reply is much appreciated.
But the worst thing is that I deleted the InfoPackage in the PSA by mistake.
In the normal case, if I repeated the process again the delta load would be OK, but now the delta load remains in error.
So, gurus:
1. How can I restart my delta loading correctly?
2. I want to modify the timestamp in the delta table, but how do I do it?
Go to T-Code RSA7 in the source system. This will tell you the date/timestamp that the delta is set to. If the date was changed to a range that no longer works, then you will need to re-initialize the DataSource on the BW system side. However, the delta date may still be fine, because it may never have been changed when you first tried to do your load because of the connection issues.
You can create a new infopackage and set the update to Initialize Datasource with Data Transfer. This will essentially run a full load from the datasource and then reset the delta pointer date/timestamp to when you ran it. This way you will capture all the data that you needed and anything that was already in the PSA should be overwritten.
Also note that you should delete or set the request status to red on the previous request that may contain bad data in the PSA.
From the original error it seems like you are having an RFC connection issue between the DataSource and BW. Contact your BASIS support and have them check the connection to make sure it is good. To ensure that your DataSource is extracting properly, you can run t-code RSA3 on it in the source system; this will verify that the extraction of data is working properly.