dtrace scripts io counters per NFS mount - performancecounter

I need to measure performance counters per NFS mount I have on my Nexenta machine (Based on OpenSolaris).
iostat is great, but it counts per device, not per NFS mount.
Is there any way I can do that with a DTrace script?
Counters I'm interested in:
(CIFS/NFS) avg_latency per Volumes/Folder
(CIFS/NFS) read_latency per Volumes/Folder
(CIFS/NFS) write_latency per Volumes/Folder
(CIFS/NFS) read_ops/s per Volumes/Folder
(CIFS/NFS) write_ops/s per Volumes/Folder
(CIFS/NFS) read_data/s per Volumes/Folder
(CIFS/NFS) write_data/s per Volumes/Folder
cpu_util%
Thanks!

You can use the following:
#!/usr/sbin/dtrace -qs
/* NFSv3 read/write */
nfsv3:::op-read-done
{
    @nfsrb = sum(args[2]->res_u.ok.data.data_len);
}
nfsv3:::op-write-done
{
    @nfswb = sum(args[2]->res_u.ok.count);
}
dtrace:::END
{
    printa(" read bytes : - %@d\n", @nfsrb);
    printa(" write bytes : - %@d\n", @nfswb);
}

Why flink container vcore size is always 1

I am running flink on yarn(more precisely in AWS EMR yarn cluster).
From the Flink documentation and source code, I understand that by default Flink requests as many vcores per TaskManager container as there are slots per TaskManager when it requests resources from YARN.
I also confirmed this in the source code:
// Resource requirements for worker containers
int taskManagerSlots = taskManagerParameters.numSlots();
int vcores = config.getInteger(ConfigConstants.YARN_VCORES,
    Math.max(taskManagerSlots, 1));
Resource capability = Resource.newInstance(containerMemorySizeMB,
    vcores);
resourceManagerClient.addContainerRequest(
    new AMRMClient.ContainerRequest(capability, null, null,
        priority));
When I use -yn 1 -ys 3 to start Flink, I expect YARN to allocate 3 vcores for the only TaskManager container, but when I check the number of vcores for each container in the YARN ResourceManager web UI, I always see 1 vcore. The YARN ResourceManager logs also show 1 vcore.
I debugged the Flink source code up to the lines pasted above, and I saw that the value of vcores is 3.
This really confuses me. Can anyone help clarify? Thanks.
An answer from Kien Truong
Hi,
You have to enable CPU scheduling in YARN; otherwise, it always shows only 1 CPU allocated per container,
regardless of how many Flink tries to allocate. So you should add (or edit) the following property in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<!-- <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> -->
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Also, the TaskManager memory is, for example, 1400MB, but Flink reserves some of it for off-heap memory, so the actual heap size is smaller.
This is controlled by 2 settings:
containerized.heap-cutoff-min: default 600MB
containerized.heap-cutoff-ratio: default 15% of TM's memory
That's why your TM's heap size is limited to ~800MB (1400 - 600).
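To make the arithmetic above concrete, here is a minimal Python sketch. It assumes the cutoff is the larger of the minimum and the ratio applied to the container memory, and it uses the default values quoted in this answer. The function name is made up for illustration and is not part of Flink.

```python
def heap_after_cutoff(container_mb, cutoff_min_mb=600, cutoff_ratio=0.15):
    """Return the heap size in MB left after the cutoff is subtracted.

    Assumption: cutoff = max(cutoff_min_mb, cutoff_ratio * container_mb).
    """
    cutoff_mb = max(cutoff_min_mb, round(container_mb * cutoff_ratio))
    return container_mb - cutoff_mb


# 15% of 1400MB is only 210MB, so the 600MB minimum applies.
print(heap_after_cutoff(1400))  # 800
```

For small containers the fixed minimum dominates; the ratio only takes over once 15% of the container exceeds 600MB, that is above 4000MB.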
@yinhua.
To start a session with ./bin/yarn-session.sh, you need to add the -s argument:
-s,--slots Number of slots per TaskManager
Details:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/yarn_setup.html
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/cli.html#usage
I finally found the answer.
It's because YARN is using the "DefaultResourceCalculator" allocation strategy, so the YARN RM only counts memory. Even though Flink requested 3 vcores, YARN simply ignores the CPU core count.

Wrong balance between Aerospike instances in cluster

I have an application with a high load for batch read operations. My Aerospike cluster (v 3.7.2) has 14 servers, each one with 7GB RAM and 2 CPUs in Google Cloud.
By looking at Google Cloud Monitoring Graphs, I noticed a very unbalanced load between servers: some servers have almost 100% CPU load, while others have less than 50% (image below). Even after hours of operation, the cluster unbalanced pattern doesn't change.
Is there any configuration that I could change to make this cluster more homogeneous? How to optimize node balancing?
Edit 1
All servers in the cluster have an identical aerospike.conf file:
# Aerospike database configuration file.
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 32
    batch-index-threads 32
    proto-fd-max 15000
    batch-max-requests 200000
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        #address any
        port 3000
    }
    heartbeat {
        mode mesh
        mesh-seed-address-port 10.240.0.6 3002
        mesh-seed-address-port 10.240.0.5 3002
        port 3002
        interval 150
        timeout 20
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
namespace test {
    replication-factor 3
    memory-size 5G
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    ldt-enabled true
    storage-engine device {
        file /data/aerospike.dat
        write-block-size 1M
        filesize 180G
    }
}
Edit 2:
$ asinfo
1 : node
BB90600F00A0142
2 : statistics
cluster_size=14;cluster_key=E3C3672DCDD7F51;cluster_integrity=true;objects=3739898;sub-records=0;total-bytes-disk=193273528320;used-bytes-disk=26018492544;free-pct-disk=86;total-bytes-memory=5368709120;used-bytes-memory=239353472;data-used-bytes-memory=0;index-used-bytes-memory=239353472;sindex-used-bytes-memory=0;free-pct-memory=95;stat_read_reqs=2881465329;stat_read_reqs_xdr=0;stat_read_success=2878457632;stat_read_errs_notfound=3007093;stat_read_errs_other=0;stat_write_reqs=551398;stat_write_reqs_xdr=0;stat_write_success=549522;stat_write_errs=90;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=4;stat_rw_timeout=1862;udf_read_reqs=0;udf_read_success=0;udf_read_errs_other=0;udf_write_reqs=0;udf_write_success=0;udf_write_err_others=0;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=0;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=7021;stat_proxy_reqs_xdr=0;stat_proxy_success=2121;stat_proxy_errs=4739;stat_ldt_proxy=0;stat_cluster_key_err_ack_dup_trans_reenqueue=607;stat_expired_objects=0;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=0;stat_compressed_pkts_received=0;err_tsvc_requests=110;err_tsvc_requests_timeout=0;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=17;err_rw_pending_limit=19;err_rw_cant_put_unique=0;geo_region_query_count=0;geo_region_query_cells=0;geo_region_query_points=0;geo_region_query_falsepos=0;fabric_msgs_sent=58002818;fabric_msgs_rcvd=57998870;paxos_principal=BB92B00F00A0142;migrate_msgs_sent=55749290;migrate_msgs_recv=55759692;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=7228;migrate_num_incoming_refused=0;queue=0;transactions=101978550;reaped_fds=6;scans_active=0;basic_scans_succeeded=0;basic_scans_failed=0;aggr_scans_succeeded=0;aggr_scans_failed=0;udf_bg_scans_succeeded=0;udf_bg_scans_failed=0;batch_index_initiate=40457778;b
atch_index_queue=0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0;batch_index_complete=40456708;batch_index_timeout=1037;batch_index_errors=33;batch_index_unused_buffers=256;batch_index_huge_buffers=217168717;batch_index_created_buffers=217583519;batch_index_destroyed_buffers=217583263;batch_initiate=0;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=7021;proxy_action=5519;proxy_retry=0;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=551089;write_prole=1055431;read_dup_prole=14232;rw_err_dup_internal=0;rw_err_dup_cluster_key=1814;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=1767;rw_err_ack_badnode=0;client_connections=366;waiting_transactions=0;tree_count=0;record_refs=3739898;record_locks=0;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=0;err_storage_queue_full=0;partition_actual=296;partition_replica=572;partition_desync=0;partition_absent=3228;partition_zombie=0;partition_object_count=3739898;partition_ref_count=4096;system_free_mem_pct=61;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_inactivity_dur=0;sindex_gc_activity_dur=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;system_swapping=false;err_replica_null_node=0;err_replica_non_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=0;err_write_fail_generation=0;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=0;err_write_fail_not_found=0;err_write_fail
_key_mismatch=0;err_write_fail_record_too_big=90;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=53184;uptime=1001388;stat_write_errs_notfound=0;stat_write_errs_other=90;heartbeat_received_self=0;heartbeat_received_foreign=145137042;query_reqs=0;query_success=0;query_fail=0;query_abort=0;query_avg_rec_count=0;query_short_running=0;query_long_running=0;query_short_queue_full=0;query_long_queue_full=0;query_short_reqs=0;query_long_reqs=0;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=0;query_lookup_success=0;query_lookup_err=0;query_lookup_abort=0;query_lookup_avg_rec_count=0
3 : features
cdt-list;pipelining;geo;float;batch-index;replicas-all;replicas-master;replicas-prole;udf
4 : cluster-generation
61
5 : partition-generation
11811
6 : edition
Aerospike Community Edition
7 : version
Aerospike Community Edition build 3.7.2
8 : build
3.7.2
9 : services
10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.41:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.23:3000
10 : services-alumni
10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.23:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.41:3000
I have a few comments about your configuration. First, transaction-threads-per-queue should be set to 3 or 4 (don't set it to the number of cores).
The second has to do with your batch-read tuning. You're using the (default) batch-index protocol, and the config params you'll need to tune for batch-read performance are:
batch-max-requests – you have this set very high. This is probably affecting both your CPU load and your memory consumption. Even a slight imbalance in the number of keys you're accessing per node would show up in the graphs you've shown. At least, this is possibly the issue. It's better to iterate over smaller batches than to try to fetch 200K records per node at a time.
batch-index-threads – by default its value is 4, and you set it to 32 (of a max of 64). You should do this incrementally by running the same test and benchmarking the performance. On each iteration adjust higher, then down if it's decreased in performance. For example: test with 32, +8 = 40 , +8 = 48, -4 = 44. There's no easy rule-of-thumb for the setting, you'll need to tune through iterations on the hardware you'll be using, and monitor the performance.
batch-max-buffers-per-queue – this is more directly linked to the number of concurrent batch-read operations the node can support. Each batch-read request will consume at least one buffer (more if the data cannot fit in 128K). If you do not have enough of these allocated to support the number of concurrent batch-read requests you will get exceptions with error code 152 BATCH_QUEUES_FULL. Track and log such events clearly, because it means you need to raise this value. Note that this is the number of buffers per-queue. Each batch response worker thread has its own queue, so you'll have batch-index-threads x batch-max-buffers-per-queue buffers, each taking 128K of RAM. The batch-max-unused-buffers setting caps the memory usage of all these buffers combined, destroying unused buffers until their number is reduced. There's an overhead to allocating and destroying these buffers, so you do not want to set it too low compared to the total. Your current cost is 32 x 256 x 128KB = 1GB.
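The memory estimate at the end of the previous paragraph can be reproduced with a short Python sketch. It only restates the multiplication from the text (threads x buffers per queue x 128K per buffer). The function and constant names are invented for illustration and are not part of any Aerospike tool.

```python
BUFFER_SIZE_KIB = 128  # size of one batch response buffer, as stated in the text


def batch_buffer_memory_bytes(index_threads, buffers_per_queue):
    """Worst-case memory held by batch response buffers, in bytes."""
    return index_threads * buffers_per_queue * BUFFER_SIZE_KIB * 1024


total = batch_buffer_memory_bytes(index_threads=32, buffers_per_queue=256)
print(total / 2**30)  # 1.0, i.e. 1 GiB

# The default of 4 batch-index-threads would need one eighth of that.
print(batch_buffer_memory_bytes(4, 256) / 2**20)  # 128.0 MiB
```

On a node with 7GB of RAM and a 5G namespace memory-size, 1 GiB of potential buffer space is a noticeable share, which is one reason to raise batch-index-threads in small steps.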
Finally, you're storing your data on a filesystem. That's fine for development instances, but not recommended for production. In GCE you can provision either a SATA SSD or an NVMe SSD for your data storage, and those should be initialized, and used as block devices. Take a look at the GCE recommendations for more details. I suspect you have warnings in your log about the device not keeping up.
It's likely that one of your nodes is an outlier with regard to the number of partitions it has (and therefore the number of objects). You can confirm it with asadm -e 'asinfo -v "objects"'. If that's the case, you can terminate that node and bring up a new one. This will force the partitions to be redistributed. This does trigger a migration, which takes considerably longer in the CE server than in the EE one.
For anyone interested, Aerospike Enterprise 4.3 introduced 'uniform-balance', which homogeneously balances data partitions. Read more here: https://www.aerospike.com/blog/aerospike-4-3-all-flash-uniform-balance/

Aerospike: Failed to store record. Error: (13L, 'AEROSPIKE_ERR_RECORD_TOO_BIG', 'src/main/client/put.c', 106)

I get the following error while storing the data to aerospike ( client.put ). I have enough space on the drive.
Aerospike: Failed to store record. Error: (13L, 'AEROSPIKE_ERR_RECORD_TOO_BIG', 'src/main/client/put.c', 106).
Here is my Aerospike server namespace configuration
namespace test {
    replication-factor 1
    memory-size 1G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine device {
        file /opt/aerospike/data/test.dat
        filesize 2G
        data-in-memory true # Store data in memory in addition to file.
    }
}
By default namespaces have a write-block-size of 1 MiB. This is also the maximum configurable size and will limit the max object size the application is able to write to Aerospike.
If you need to go beyond 1 MiB see Large Data Types as a possible solution.
UPDATE 2019/09/06
Since Aerospike 3.16, the write-block-size limit has been increased from 1 MiB to 8 MiB.
Yes, but unfortunately, Aerospike has deprecated LDT (https://www.aerospike.com/blog/aerospike-ldt/). They now recommend using Lists or Maps, but as stated in their post:
"the new implementation does not solve the problem of the 1MB Aerospike database row size limit. A future key feature of the product will be an enhanced implementation that transcends the 1MB limit for a number of types"
In other words, it is still an unsolved problem when storing your data on SSD or HDD. However, you can store larger data in in-memory namespaces.

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 GB RAM
250gb memory, 8 cores, 15 GB RAM
I have created a master and a slave on 8core machine, giving 7 cores to worker. I have created another slave on 4core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for 8core and 4core respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hourly files in day-wise directories in an S3 bucket. Each file is around 100MB, e.g.
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
The get_tags function creates a redis connection on each invocation, queries redis, and yields (user, item, rate) tuples.
(The redis hash is stored on the 4-core machine.)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
        .set("spark.executor.memory", "5g")
        .set("spark.akka.timeout", "10000")
        .set("spark.akka.frameSize", "1000")
        .set("spark.task.cpus", "5")
        .set("spark.cores.max", "10")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.max.mb", "10")
        .set("spark.shuffle.consolidateFiles", "True")
        .set("spark.files.fetchTimeout", "500")
        .set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days' of data (around 2.5hours) and completely gives up on 14 days'.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores? (This is offline and can take hours, but it has to finish in 5 hours or so.)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I not use Python given better Spark APIs in Scala, will that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
from pyspark import StorageLevel

def main(sc):
    f = get_files()
    # Parentheses let the method chain span several lines.
    a = (sc.textFile(f, 15)
         .coalesce(7 * sc.defaultParallelism)
         .map(lambda line: line.split(","))
         .filter(lambda line: len(line) > 0)
         .map(lambda line: (line[18], line[2], line[13], line[15])).map(scoring)
         .map(lambda line: ((line[0], line[1]), line[2]))
         .persist(StorageLevel.MEMORY_ONLY_SER))
    b = a.reduceByKey(lambda x, y: x + y).map(aggregate)
    b.persist(StorageLevel.MEMORY_ONLY_SER)
    c = b.mapPartitions(get_tags)
    c.saveAsTextFile("f")
    a.unpersist()
    b.unpersist()
The get_tags function is
import json
import redis

def get_tags(partition):
    # One redis connection per partition, reused for every element in it.
    rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
    for element in partition:
        user = element[0]
        song = element[1]
        rating = element[2]
        tags = rh.hget(settings['REDIS_HASH'], song)
        if tags:
            tags = json.loads(tags)
        else:
            tags = scrape(song, rh)
        if tags:
            for tag in tags:
                yield (user, tag, rating)
The get_files function is as:
def get_files():
    paths = get_path_from_dates(DAYS)
    base_path = 's3n://acc_key:sec_key@bucket/'
    files = list()
    for path in paths:
        fle = base_path+path+'/file_format.*'
        files.append(fle)
    return ','.join(files)
The get_path_from_dates(DAYS) is
from datetime import date, timedelta

def get_path_from_dates(last):
    today = date.today()
    days = list()
    t = 0
    while t <= last:
        d = today - timedelta(days=t)
        path = d.strftime('%Y-%m')+'/'+d.strftime('%Y-%m-%d')
        days.append(path)
        t += 1
    return days
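To show what the path helper produces, here is a small self-contained variant that takes the reference date as a parameter instead of reading the clock. The name paths_for_last_days and the explicit today argument are my own additions for illustration; the path layout follows the code above.

```python
from datetime import date, timedelta


def paths_for_last_days(last, today):
    """Build 'YYYY-MM/YYYY-MM-DD' prefixes for `today` and the `last` days before it."""
    return [
        (today - timedelta(days=t)).strftime('%Y-%m/%Y-%m-%d')
        for t in range(last + 1)  # inclusive, like `while t <= last` above
    ]


print(paths_for_last_days(2, date(2015, 4, 9)))
# ['2015-04/2015-04-09', '2015-04/2015-04-08', '2015-04/2015-04-07']
```

Note that the range is inclusive, so a value of 14 yields 15 day directories, not 14. That may or may not be intended.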
As a small optimization, I have created two separate tasks: one to read from S3 and compute the additive sum, and a second to read transformations from redis. The first task has a high number of partitions, since there are around 2300 files to read. The second one has far fewer partitions to limit redis connection latency, and there is only one file to read, which is on the EC2 cluster itself. This is only a partial fix; I'm still looking for suggestions to improve ...
I had a similar use case: doing coalesce on an RDD with 300,000+ partitions. The difference is that I was using s3a (the SocketTimeoutException came from S3AFileSystem.waitAysncCopy). The issue was finally resolved by setting a larger fs.s3a.connection.timeout in Hadoop's core-site.xml. Hopefully this gives you a clue.

Using multiple JedisPools

Initially when I used 1 JedisPool to communicate with a Redis server (single instance of redis running on the server) I saw that my 200K queries took 15 sec to complete. For each query I do following:
Jedis jedis = pool.getResource();
Pipeline p = jedis.pipelined();
p.multi();
for (String ViewId : ViewIds) {
    for (Article aR : view.getEntriesList()) {
        p.rpush(ViewId, <article-in-json>);
    }
    p.ltrim(ViewId, -10, -1);
}
p.exec();
p.sync();
pool.returnResource(jedis);
JedisPool created using:
JedisPool pool = new JedisPool(new JedisPoolConfig(), getHostName(), getPort(), 0, "foobared");
However when I used 8 JedisPools to the same server (same redis instance) and used them in a round robin way I saw that 200K queries were processed in around 8-10 secs (earlier with 1 JedisPool it took 15 sec) => considerable improvement.
What is the reason for this, and what is the optimal number of JedisPools to use?
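For readers unsure what "round robin" over several pools means here, the selection logic could be sketched as follows. This is Python rather than Java, the strings stand in for the 8 JedisPool instances, and the RoundRobin class is a made-up helper. It only illustrates the rotation and does not explain the speedup.

```python
import itertools
import threading


class RoundRobin:
    """Hand out items from a fixed list in rotating order, safely across threads."""

    def __init__(self, items):
        self._cycle = itertools.cycle(list(items))
        self._lock = threading.Lock()

    def next(self):
        # The lock keeps concurrent callers from advancing the iterator at once.
        with self._lock:
            return next(self._cycle)


pools = ["pool-%d" % i for i in range(8)]  # placeholders for 8 pool objects
picker = RoundRobin(pools)
picked = [picker.next() for _ in range(10)]
print(picked[8:])  # ['pool-0', 'pool-1'], the rotation wraps after 8 picks
```

Each query would call picker.next() to choose a pool before borrowing a connection from it.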