Using Solr, I want to know how to estimate disk usage - indexing

I uploaded 200 million identical log entries, and each log is 474 bytes.
So I simply assumed that the total disk usage would be about 100 GB (200,000,000 x 474 bytes).
But the result is 60 GB (Solr UI - Cloud - Nodes - Disk Usage).
How is this possible?
Does Solr compress the data if all the logs are the same?
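As a quick check of the arithmetic in the question, here is a minimal sketch that computes the raw payload size and compares it with the reported figure. The 60 GB value is the one read from the Solr UI above; treating "G" as decimal gigabytes (10**9 bytes) is an assumption.

```python
# Raw payload estimate from the question: 200 million logs of 474 bytes each.
NUM_LOGS = 200_000_000
LOG_SIZE_BYTES = 474
REPORTED_INDEX_GB = 60  # value read from the Solr UI in the question


def raw_size_gb(num_docs, doc_size_bytes):
    """Return the raw payload size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_docs * doc_size_bytes / 10**9


raw_gb = raw_size_gb(NUM_LOGS, LOG_SIZE_BYTES)
ratio = REPORTED_INDEX_GB / raw_gb
print(f"raw payload: {raw_gb:.1f} GB, index/raw ratio: {ratio:.2f}")
```

The ratio only relates these two numbers. The on-disk size of an index depends on which fields are stored and indexed, so it is not expected to equal the raw payload size.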


Issues related to checkPointPageBufferSize and WalAutoArchiveAfterInactivity in Ignite

I have set checkPointPageBufferSize to 0. The documentation says that if it is set to 0, the value is calculated automatically. However, the logs print 0, so what is the automatically calculated value of checkPointPageBufferSize? I am using Ignite 2.9.0.
Can anyone help me with this?
If you explicitly set checkPointPageBufferSize to 0, or just leave it unset, it defaults to a value that is a function of the data region size:
less than 1GB - min(256MB, Data_Region_Size)
between 1GB and 8GB - Data_Region_Size / 4
more than 8GB - 2GB
More details can be found in the documentation.
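The rule above can be sketched as a small function. This is only an illustration of the three tiers as quoted, not Ignite's own code; the function name is hypothetical, and placing the exact 1GB and 8GB boundaries in the middle tier is an assumption.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def default_checkpoint_buffer_size(region_size_bytes):
    """Default checkpoint page buffer size per the three tiers quoted above.

    Assumption: a region of exactly 1 GB or 8 GB falls in the middle tier.
    """
    if region_size_bytes < GB:
        return min(256 * MB, region_size_bytes)
    if region_size_bytes <= 8 * GB:
        return region_size_bytes // 4
    return 2 * GB


for size in (512 * MB, 4 * GB, 16 * GB):
    buffer_mb = default_checkpoint_buffer_size(size) // MB
    print(f"{size // MB} MB region -> {buffer_mb} MB checkpoint buffer")
```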

Concurrent testing of login functionality with 50 users/threads is not working

I have set thread count = 50
and ramp-up period = 0.
The test passes for 48 threads; for the remaining 2 threads, no failure is captured in the Selenium log files.
I am expecting a concurrent login of 50 users with a ramp-up period of 0, but I am not able to find the exact reason for the failure. Please suggest fixes to handle this scenario.
Check the jmeter.log file for any suspicious entries.
Add a View Results Tree listener to your test plan - it will allow you to inspect request and response details.
50 real browsers might be too many for a single machine.
As per the WebDriver Sampler documentation:
From experience, the number of browser (threads) that the reader creates should be limited by the following formula:
C = N + 1
C = Number of Cores of the host running the test
and N = Number of Browser (threads).
As per the Firefox 62.0 system requirements:
512MB of RAM / 2GB of RAM for the 64-bit version
So you will need a machine with 51 cores and 100 GB of RAM to ensure there is no JMeter-side bottleneck. If your machine's hardware specifications are lower, you will have to go for Remote Testing.
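The sizing above can be reproduced with a short sketch. It assumes the C = N + 1 rule and the 2 GB per 64-bit browser figure quoted in the answer; the function names are hypothetical.

```python
def required_hardware(num_browsers, ram_per_browser_gb=2.0):
    """Return (cores, ram_gb) needed for num_browsers, using C = N + 1."""
    cores = num_browsers + 1
    ram_gb = num_browsers * ram_per_browser_gb
    return cores, ram_gb


def max_browsers(cores, ram_gb, ram_per_browser_gb=2.0):
    """Return how many browsers a single machine can host under the same rules."""
    by_cores = cores - 1  # N = C - 1
    by_ram = int(ram_gb // ram_per_browser_gb)
    return max(0, min(by_cores, by_ram))


cores, ram_gb = required_hardware(50)
print(f"50 browsers need {cores} cores and {ram_gb:.0f} GB of RAM")
print(f"An 8-core, 16 GB machine can host {max_browsers(8, 16)} browsers")
```

Running the second function for each load generator gives a rough idea of how many machines a remote (distributed) test would need.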

Aerospike cluster does not clean up available blocks

We use Aerospike in our projects and have run into a strange problem.
We have a 3-node cluster, and after some node restarts it stops working.
So we set up a test to demonstrate the problem.
We created a test cluster: 3 nodes, replication factor = 2.
Here is our namespace config:
namespace test {
    replication-factor 2
    memory-size 100M
    high-water-memory-pct 90
    high-water-disk-pct 90
    stop-writes-pct 95
    single-bin true
    default-ttl 0
    storage-engine device {
        cold-start-empty true
        file /tmp/test.dat
        write-block-size 1M
    }
}
We wrote 100 MB of test data, after which we had this situation:
available pct was about 66% and disk usage about 34%.
All good.
But then we stopped one node. After migration, available pct = 49% and disk usage = 50%.
We returned the node to the cluster, and after migration disk usage went back to about 32%, but available pct on the old nodes stayed at 49%.
We stopped the node one more time:
available pct = 31%
Repeating this once more, we got this situation:
available pct = 0%
Our cluster crashed, and clients got AerospikeException: Error Code 8: Server memory error
So how can we clean up available pct?
If your defrag-q is empty (you can check this by grepping the logs), then the issue is likely that your namespace is smaller than your post-write-queue. Blocks on the post-write-queue are not eligible for defragmentation, so you would see avail-pct trending down with no defragmentation to reclaim the space. By default the post-write-queue is 256 blocks, which in your case (write-block-size 1M) equates to 256 MB. If your namespace is smaller than that, you will see avail-pct continue to drop until you hit stop-writes. You can reduce the size of the post-write-queue dynamically (i.e. no restart needed) using the following command; here I suggest 8 blocks:
asinfo -v 'set-config:context=namespace;id=<NAMESPACE>;post-write-queue=8'
If you are happy with this value, you should amend your aerospike.conf to include it so that it persists after a node restart.
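The sizes involved can be checked with a small sketch. It uses only the figures from this thread (256 blocks by default, 8 blocks suggested, write-block-size 1M, 100 MB of test data); reading "1M" as 1 MiB is an assumption, and the function name is hypothetical.

```python
MIB = 1024 ** 2


def post_write_queue_bytes(blocks, write_block_size_bytes):
    """Space held by the post-write-queue, which is not eligible for defrag."""
    return blocks * write_block_size_bytes


default_queue = post_write_queue_bytes(256, 1 * MIB)  # default described above
reduced_queue = post_write_queue_bytes(8, 1 * MIB)    # value suggested above
data_written = 100 * MIB                              # test data from the question

print(f"default queue: {default_queue // MIB} MiB")
print(f"reduced queue: {reduced_queue // MIB} MiB")
print(f"data smaller than default queue: {data_written < default_queue}")
```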

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 RAM
250gb memory, 8 cores, 15 RAM
I have created a master and a slave on 8core machine, giving 7 cores to worker. I have created another slave on 4core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for 8core and 4core respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hour-wise files in day-wise directories in an S3 bucket; each file should be around 100MB.
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
The get_tags function creates a Redis connection on each invocation, calls Redis, and yields a (user, item, rate) tuple.
(The redis hash is stored in the 4core)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
        .set("spark.executor.memory", "5g")
        .set("spark.akka.timeout", "10000")
        .set("spark.akka.frameSize", "1000")
        .set("spark.task.cpus", "5")
        .set("spark.cores.max", "10")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.max.mb", "10")
        .set("spark.shuffle.consolidateFiles", "True")
        .set("spark.files.fetchTimeout", "500")
        .set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days of data (around 2.5 hours) and completely gives up on 14 days.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores (This is offline and can take hours, but it has got to finish in 5 hours or so)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I not use Python given better Spark APIs in Scala, will that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): Read timed out
at Method)
at org.apache.http.conn.BasicManagedEntity.streamClosed(
at org.apache.http.conn.EofSensorInputStream.checkClose(
at org.apache.http.conn.EofSensorInputStream.close(
at org.apache.http.util.EntityUtils.consume(
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$
at org.apache.hadoop.mapred.LineRecordReader.<init>(
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$
I could really use some help, thanks in advance
Here is what my main code looks like
def main(sc):
    # Parentheses let the chained calls span several lines.
    a = (sc.textFile(f, 15)
         .map(lambda line: line.split(","))
         .map(lambda line: (line[18], line[2], line[13], line[15])).map(scoring)
         .map(lambda line: ((line[0], line[1]), line[2]))
         .persist(StorageLevel.MEMORY_ONLY_SER))
    b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
The get_tags function is
def get_tags(partition):
    # One Redis connection per partition
    rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
    for element in partition:
        user = element[0]
        song = element[1]
        rating = element[2]
        tags = rh.hget(settings['REDIS_HASH'], song)
        if tags:
            tags = json.loads(tags)
        else:
            # Fall back to scraping when the hash has no entry for this song
            tags = scrape(song, rh)
        if tags:
            for tag in tags:
                yield (user, tag, rating)
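One possible way to reduce the per-item Redis round trips in a function like get_tags is to look up a whole batch of songs with a single HMGET. The sketch below is a swapped-in technique, not the original code: get_tags_batched, chunked, and FakeRedis are hypothetical names, the client is only assumed to expose hmget(name, keys) as redis-py's Redis does, and the scrape() fallback is left out.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def get_tags_batched(partition, client, hash_name, batch_size=500):
    """Yield (user, tag, rating) tuples, fetching tags one batch at a time."""
    for batch in chunked(partition, batch_size):
        songs = [element[1] for element in batch]
        raw_tags = client.hmget(hash_name, songs)  # one round trip per batch
        for element, raw in zip(batch, raw_tags):
            if raw:  # songs missing from the hash are skipped in this sketch
                for tag in json.loads(raw):
                    yield (element[0], tag, element[2])


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self, data):
        self.data = data

    def hmget(self, name, keys):
        return [self.data.get(name, {}).get(key) for key in keys]


fake = FakeRedis({"tags": {"s1": json.dumps(["rock", "indie"])}})
rows = [("u1", "s1", 5), ("u2", "s2", 3)]
result = list(get_tags_batched(rows, fake, "tags", batch_size=1))
print(result)
```

With a real client, the function could be passed to mapPartitions through a small wrapper that creates the redis.Redis connection inside the partition, as get_tags does.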
The get_files function is as:
def get_files():
    paths = get_path_from_dates(DAYS)
    base_path = 's3n://acc_key:sec_key@bucket/'
    files = list()
    for path in paths:
        fle = base_path+path+'/file_format.*'
        files.append(fle)  # collect each day's glob pattern
    return ','.join(files)
The get_path_from_dates(DAYS) is
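def get_path_from_dates(last):
    # `today` is assumed to be a date defined elsewhere in the module
    days = list()
    t = 0
    while t <= last:
        d = today - timedelta(days=t)
        path = d.strftime('%Y-%m')+'/'+d.strftime('%Y-%m-%d')
        days.append(path)  # e.g. '2015-04/2015-04-09'
        t += 1
    return days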
As a small optimization, I have created two separate tasks: one to read from S3 and compute the additive sum, and a second to read transformations from Redis. The first task has a high number of partitions since there are around 2300 files to read. The second one has far fewer partitions to limit Redis connection latency, and there is only one file to read, which is on the EC2 cluster itself. This is only a partial solution; I am still looking for suggestions to improve it.
I had a similar use case: doing coalesce on an RDD with 300,000+ partitions. The difference is that I was using s3a (SocketTimeoutException from S3AFileSystem.waitAysncCopy). The issue was finally resolved by setting a larger fs.s3a.connection.timeout (in Hadoop's core-site.xml). Hopefully this gives you a clue.
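If editing core-site.xml is inconvenient, the same Hadoop property can usually be passed through Spark, since Spark forwards any `spark.hadoop.*` setting to the Hadoop configuration. The sketch below only builds the configuration pair; the helper name and the 600000 value are assumptions for illustration, the value is interpreted as milliseconds in the Hadoop versions I am aware of, and the setting applies to s3a paths rather than the s3n paths used in the question.

```python
def s3a_timeout_conf(timeout_ms):
    """Return Spark conf pairs that forward the timeout to Hadoop's S3A client."""
    return [("spark.hadoop.fs.s3a.connection.timeout", str(timeout_ms))]


pairs = s3a_timeout_conf(600_000)  # arbitrary example value, not a recommendation
print(pairs)

# Applying it requires pyspark, so it is shown as a comment only:
#   from pyspark import SparkConf
#   conf = SparkConf().setAll(pairs)
```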

Neo4j - cypher query optimization

I am pretty new to Neo4j, and we are trying to use it with our PHP and SQL Server based application. I am using Neo4j 2.0 milestone 6. Some of the relevant configuration variables:
Now, coming to the question: I am trying to write a Cypher query which traverses the graph and calculates allocation amounts. Here is a snapshot of the structure of the graph.
Basically, I have a department as a starting point which has a certain amount. That amount is allocated out to certain Products, which in turn allocate out to other Products, and so on. The allocation can be n levels deep. Now I need to calculate the amount for all the products.
The cypher query that I am using is as below:
MATCH (f:financial) WHERE f.amount <> 0
MATCH f-[r_allocates:Allocates*]->(n_allocation)
RETURN n_allocation.cost_pool_hierarchy_id,
SUM(reduce(totalamt=f.amount, r IN r_allocates| (r.allocate)*totalamt/100 )) as amt
ORDER BY n_allocation.cost_pool_hierarchy_id
LIMIT 1000
This query takes 30+ seconds with warmed up caches. I have tried going back to Neo4j 1.9 as I found on some posts that Neo4j 2.0 is not yet optimized, but the similar query in 1.9 takes 40+ seconds.
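To make the reduce expression in the query concrete, here is a minimal sketch of the same fold outside the database: each relationship on a path applies its allocate percentage to the running amount, and the results are summed per target node. The sample paths and the function names are made up for illustration.

```python
from functools import reduce


def allocated_amount(start_amount, allocate_percentages):
    """Fold the percentages along one path, like reduce(...) in the query."""
    return reduce(
        lambda total, pct: pct * total / 100,
        allocate_percentages,
        start_amount,
    )


def sum_by_target(paths):
    """Sum the allocated amounts per target id, like SUM(...) grouped by id."""
    totals = {}
    for target_id, start_amount, percentages in paths:
        amount = allocated_amount(start_amount, percentages)
        totals[target_id] = totals.get(target_id, 0) + amount
    return totals


# Hypothetical paths: (target id, starting amount, percentages along the path)
paths = [
    ("P1", 1000, [50]),
    ("P2", 1000, [50, 20]),
    ("P2", 400, [25]),
]
print(sum_by_target(paths))
```

Every distinct path contributes one fold, which matches the profiler output below, where about 3.9 million matched paths feed the aggregation.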
Here is the output from profiler:
==> ColumnFilter(symKeys=["n_allocation.cost_pool_hierarchy_id", " INTERNAL_AGGREGATE2db722c4-6400-4803-90a2-b883b0076e8b"], returnItemNames=["n_allocation.cost_pool_hierarchy_id", "amt"], _rows=746, _db_hits=0)
==> Top(orderBy=["SortItem(Cached(n_allocation.cost_pool_hierarchy_id of type Any),true)"], limit="Literal(1000)", _rows=746, _db_hits=0)
==> EagerAggregation(keys=["Cached(n_allocation.cost_pool_hierarchy_id of type Any)"], aggregates=["( INTERNAL_AGGREGATE2db722c4-6400-4803-90a2-b883b0076e8b,Sum(ReduceFunction(r_allocates,r,Divide(Multiply(Product(r,allocate(11),true),totalamt),Literal(100)),totalamt,Product(f,amount(9),true))))"], _rows=746, _db_hits=11680622)
==> Extract(symKeys=["f", "n_allocation", " UNNAMED55", "r_allocates"], exprKeys=["n_allocation.cost_pool_hierarchy_id"], _rows=3906768, _db_hits=3906768)
==> PatternMatch(g="(f)-[' UNNAMED55']-(n_allocation)", _rows=3906768, _db_hits=0)
==> Filter(pred="(NOT(Product(f,amount(9),true) == Literal(0)) AND hasLabel(f:financial(6)))", _rows=9959, _db_hits=34272)
==> NodeByLabel(label="financial", identifier="f", _rows=34272, _db_hits=0)
I would appreciate any help in optimizing this query. Do I need to update some configuration settings? Or do I need to change the structure of graph?
Just to add further - even the below relatively simple query takes 26 seconds:
MATCH p=(f:financial)-[r*2]->(n) RETURN COUNT(p)