Apache Ignite: Long GC pause on Cross Join of two tables

I've set up a 4-node Ignite cluster with the following JVM flags.
ENV JAVA_OPTS="-Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -server -Xms1g \
-Xmx1500m -XX:+AlwaysPreTouch -XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
-XX:+CMSPermGenSweepingEnabled -XX:+CMSScavengeBeforeRemark \
-XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC \
-XX:+UseContainerSupport \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation \
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Xloggc:/opt/ignite/logs/log.txt"
I tried with both the G1 and CMS garbage collectors, and limited the heap to a maximum of 1500 MB and a minimum of 1 GB. I've attached the logs.
One more thing: I've added -XX:+PrintGCDetails and provided a log file path to dump the GC logs, but the file isn't getting generated (the logs folder does exist). What is the issue?
For reference, both tables have 100k rows.
Logs file

In the log, I see
[09:36:45,955][WARNING][long-qry-#31][LongRunningQueryManager] Query execution is too long [globalQueryId=2e3ce071-b1cd-48d9-bb4c-22ec428a8e19_226, duration=48660ms, type=MAP, distributedJoin=false, enforceJoinOrder=false, lazy=false, schema=PUBLIC, sql='SELECT
A__Z0.FIRST_NAME __C0_0,
B__Z1.LAST_NAME __C0_1,
PUBLIC.LEVENSHTEINDISTANCE(A__Z0.FIRST_NAME, B__Z1.LAST_NAME) __C0_2
FROM PUBLIC.LEADS A__Z0
INNER JOIN PUBLIC.CONTACTS B__Z1
ON TRUE
WHERE TRUE', plan=SELECT
A__Z0.FIRST_NAME AS __C0_0,
B__Z1.LAST_NAME AS __C0_1,
PUBLIC.LEVENSHTEINDISTANCE(A__Z0.FIRST_NAME, B__Z1.LAST_NAME) AS __C0_2
FROM PUBLIC.LEADS A__Z0
/* PUBLIC.LEADS.__SCAN_ */
/* WHERE TRUE
*/
/* scanCount: 163 */
INNER JOIN PUBLIC.CONTACTS B__Z1
/* PUBLIC.CONTACTS.__SCAN_ */
ON 1=1
/* scanCount: 4082223 */
WHERE TRUE, reqId=17, segment=0]
Note the lazy=false. What it tells me is that Ignite will try to collect the entire cross join product in memory and then serve it to the client. This is why you run out of memory.
Set lazy=true for this query (actually, set it by default for most of your SQL queries).
See this for details.
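If you are running the SQL through a programmatic client rather than a console, the flag can usually be passed per query. As a rough illustration only, here is a sketch with the pyignite thin client; it assumes your pyignite version's Client.sql() exposes a lazy argument (check its docs), and handle() is a hypothetical per-row consumer:
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)

# lazy=True asks the server to stream result pages instead of materializing
# the whole cross-join product in heap before answering
# (assumption: this keyword exists in your pyignite version).
cursor = client.sql(
    "SELECT a.FIRST_NAME, b.LAST_NAME, LEVENSHTEINDISTANCE(a.FIRST_NAME, b.LAST_NAME) "
    "FROM LEADS a INNER JOIN CONTACTS b ON TRUE",
    page_size=1024,
    lazy=True,
)
for row in cursor:
    handle(row)  # hypothetical per-row consumer

client.close()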

Related

How to use subprocess.run() to run Hive query?

So I'm trying to execute a hive query using the subprocess module, and save the output into a file data.txt as well as the logs (into log.txt), but I seem to be having a bit of trouble. I've looked at this gist as well as this SO question, but neither seems to give me what I need.
Here's what I'm running:
import subprocess
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# note - "hive -e [query]" would normally just print all the results
# to the console after finishing
proc = subprocess.run(["hive" , "-e" '"{}"'.format(query)],
                      stdin=subprocess.PIPE,
                      stdout=data_buff,
                      stderr=log_buff,
                      shell=True)
log_buff.close()
data_buff.close()
I've also looked into this SO question regarding subprocess.run() vs subprocess.Popen, and I believe I want .run() because I'd like the process to block until finished.
The final output should be a file data.txt with the tab-delimited results of the query, and log.txt with all of the logging produced by the hive job. Any help would be wonderful.
Update:
With the above way of doing things I'm currently getting the following output:
log.txt
[ralston#tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties
data.txt
[ralston#tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston#tpsci-gw01-vm tmp]$
And I can verify the java/hive process did run:
[ralston#tpsci-gw01-vm tmp]$ ps -u ralston
PID TTY TIME CMD
14096 pts/0 00:00:00 hive
14141 pts/0 00:00:07 java
14259 pts/0 00:00:00 ps
16275 ? 00:00:00 sshd
16276 pts/0 00:00:00 bash
But it looks like it's not finishing and not logging everything that I'd like.
So I managed to get this working with the following setup:
import subprocess
import time  # needed for time.time() below

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# Remove shell=True from proc, and add "> outfile.txt" to the command
proc = subprocess.Popen(["hive", "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                        stdin=subprocess.PIPE,
                        stdout=data_buff,
                        stderr=log_buff)
# keep track of job runtime and set limit
start, elapsed, finished, limit = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = abs(time.time() - start) / 60.
        if elapsed >= 60:
            print("Job took over 60 mins")
            break
        print("Comm timed out. Continuing")
        continue
print("done")
log_buff.close()
data_buff.close()
Which produced the output as needed. I knew about process.communicate() but that previously didn't work. I believe the issue was related to not adding an output file with > ${outfile} to the hive query.
Feel free to add any details. I've never seen anyone have to loop over proc.communicate() so I'm skeptical that I might be doing something wrong.
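For reference, the hive> prompt captured in data.txt in the first attempt is consistent with the query never reaching hive: when shell=True is combined with a list of arguments, only the first element ("hive") is used as the shell command and the remaining items become arguments to the shell itself, so hive starts interactively. A minimal run()-based sketch that avoids shell=True, extra quoting, and redirection tokens (same hypothetical query and file names as above; assumes hive is on PATH) would be:
import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"

with open("data.txt", "w") as data_buff, open("log.txt", "a") as log_buff:
    # No shell, no redirection tokens, no extra quotes: pass the query
    # straight to `hive -e` and wire stdout/stderr to the file objects.
    proc = subprocess.run(["hive", "-e", query],
                          stdout=data_buff,
                          stderr=log_buff)
print("hive exited with return code", proc.returncode)
Since run() already blocks until the child exits, the communicate() polling loop shouldn't be needed in this form.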

Any specific problems running (linux) BCP on "too many" threads?

Are there any specific problems with running Microsoft's BCP utility (on CentOS 7, https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp?view=sql-server-2017) on multiple threads? Googling could not find much, but am looking at a problem that seems to be related to just that.
Copying a set of large TSV files from HDFS to a remote MSSQL Server with some code of the form
bcpexport() {
    filename=$1
    TO_SERVER_ODBCDSN=$2
    DB=$3
    TABLE=$4
    USER=$5
    PASSWORD=$6
    RECOMMEDED_IMPORT_MODE=$7
    DELIMITER=$8

    echo -e "\nRemoving header from TSV file $filename"
    echo -e "Current head:\n"
    echo $(head -n 1 $filename)
    echo "$(tail -n +2 $filename)" > $filename
    echo "First line of file is now..."
    echo $(head -n 1 $filename)

    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below, throws error if timeout reached
    timeout 30 sleep 5

    echo -e "\nReplacing null literal values with empty chars"
    NULL_WITH_TAB="null\t" # WARN: assumes the first field is prime-key so never null
    TAB="\t"
    sed -i -e "s/$NULL_WITH_TAB/$TAB/g" $filename
    echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" $filename)"

    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below
    timeout 30 sleep 5

    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U $USER -P $PASSWORD \
        -d $DB \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e ${filename}.bcperror.log
}
export -f bcpexport
parallel -q -j 7 bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER \
    ::: $DATAFILES/$TARGET_GLOB
where $DATAFILES/$TARGET_GLOB constructs a glob that lists a set of files in a directory.
When running this code for a set of TSV files, I'm finding that sometimes some (but not all) of the parallel BCP threads fail, i.e. some files successfully copy to MSSQL Server:
Starting copy...
5397376 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 154902 Average : (34843.8 rows per sec.)
while others output an error message:
Starting copy...
BCP copy in failed
Usually I see this pattern: a few successful BCP copy-in operations in the first few threads returned, then a bunch of failing threads return their output until we run out of files (GNU Parallel returns a thread's output only when the whole thread is done, so the output appears the same as if it ran sequentially).
Notice in the code there is the -e option to produce an error file for each BCP copy-in operation (see https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017#e). When examining these files after observing the failing behavior, they are all blank, with no error messages.
I have only seen this with the number of threads >= 10 (and only for certain sets of data; I assume it has something to do with the total number of files and the file sizes, and yet...), and have seen no errors so far when using ~7 threads, which further makes me suspect this has something to do with multi-threading.
Monitoring system resources (via free -mh) shows that generally ~13GB of RAM is always available.
It may be helpful to note that the data bcp is trying to copy in may be ~500000-1000000 records long with an upper limit of ~100 columns per record.
Does anyone have any idea what could be going on here? Note, am pretty new to using BCP as well as GNU Parallel and multi-threading.
No, there are no issues specific to the BCP program being run in multiple threads. You seem to be on the track of what I would say your issue is: system resources. Have you monitored system resources while increasing the number of threads? If anything, there is likely an issue with BCP executing properly when memory/CPU/network resources are low. Regarding the "-e" option, it is meant to output data errors. Login errors, bad table names... many other errors are not reported in the file created with the -e option. When you do get output using the "-e" option, you'll see info like "value truncated" and such... it will give you line numbers and sample data that was at issue.
TLDR: Adding more threads to run concurrently to have bcp copy in files of data seems to have the effect of overwhelming the endpoint MSSQL Server with write instructions, causing the bcp threads to fail (maybe timing out?). When the number of threads becomes "too many" seems to depend on the size of the files getting copied in by bcp (ie. both the number of records in the file as well as the width of each record (ie. the number of columns)).
Long version (more reasons for my theory):
1.
When running a larger number of bcp threads and looking at the processes started on the machine (https://clustershell.readthedocs.io/en/latest/tools/clush.html)
ps -aux | grep bcp
I see a bunch of sleeping processes (notice the S, see https://askubuntu.com/a/360253/760862), as shown below (newlines added for readability):
me 135296 14.5 0.0 77596 6940 ? S 00:32 0:01
/opt/mssql-tools/bin/bcp TABLENAME in /path/to/tsv/1_16_0.tsv -D -S MyMSSQLServer -U myusername -P -d myDB -c -t \t -e /path/to/logfile
These threads appear to sleep for a very long time. Further debugging into why they are sleeping suggests that they may in fact be doing their intended job (which would further imply that the problem may be coming from BCP itself (see https://stackoverflow.com/a/52748660/8236733)). From https://unix.stackexchange.com/a/47259/260742 and https://unix.stackexchange.com/a/36200/260742:
A process in S state is usually in a blocking system call, such as reading or writing to a file or the network, or waiting for another called program to finish.
(eg. writing to the MSSQL Server endpoint destination given to bcp in the ODBCDSN)
Your process will be in S state when it is doing reads and possibly writes that are blocking. Can also happen while waiting on semaphores or other synchronization primitives... This is all normal and expected, and not usually a problem... you don't want it to waste CPU while it's waiting for user input.
2. When running different sets of files with varying record counts per file (eg. ranges of 500000 - 1000000 rows/file) and record widths per file (~10 - 100 columns/row), I found that in cases with either very large data width or row counts, running a fixed set of bcp threads would fail.
Eg. for a set of ~33 TSVs with ~500000 rows each, each row being ~100 columns wide, a set of 30 threads would write the first few OK, but then all the rest would start returning failure messages. Incorporating a bit from #jamie's answer, the fact that the failure messages returned are "BCP copy in failed" errors does not necessarily mean it has to do with the content of the data in question. Since no actual content was being written into the -e errorlog files from my process, note what #jamie's post says:
Regarding the "-e" option, it is meant to output data errors. login errors, bad table names... many other errros are not reported in the file created with the -e option. When you get output using the "-e" option, you'll see info like "value truncated" and such... will give you line numbers and sample data that was at issue.
Meanwhile, a set of ~33 TSVs with ~500000 rows each, each row being ~100 wide, and still using 30 bcp threads would complete quickly and without error (also would be faster when reducing the number of threads or file set). The only difference here being the overall size of the data being bcp copy-in'ed to the MSSQL Server.
All this while
free -mh
still showed that the machine running the threads had ~15GB of free RAM remaining in each case (which is again why I suspect that the problem has to do with the remote MSSQL Server endpoint rather than with the code or the local machine itself).
3. When running some of the tests from (2), I found that manually killing the parallel process (via CTRL+C) and then trying to remotely truncate the testing table being written to with /opt/mssql-tools/bin/sqlcmd -Q "truncate table mytable" on the local machine would take a very long time (as opposed to manually logging into the MSSQL Server and executing a truncate mytable in the DB). Again this makes me think that this has something to do with the MSSQL Server having too many connections and just being overwhelmed.
Anyone with any MSSQL Mgmt Studio experience reading this (I have basically none): if you see anything here that makes you think my theory is incorrect, please let me know your thoughts.
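If the TLDR above is right and the remedy is throttling concurrency to match the data size, one way to experiment outside of GNU Parallel is a small driver that caps the number of simultaneous bcp processes. This is only a hypothetical sketch: the glob, table name, server and credentials are placeholders, and the bcp options are abbreviated from the bcpexport function above:
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 7  # tune this down as row counts / column widths grow
files = glob.glob("/path/to/tsvs/*.tsv")  # placeholder glob

def copy_in(path):
    # Same idea as the bcpexport shell function above, with options abbreviated.
    result = subprocess.run(
        ["/opt/mssql-tools/bin/bcp", "MYTABLE", "in", path,
         "-S", "MyMSSQLServer", "-U", "myuser", "-P", "mypassword",
         "-d", "myDB", "-c", "-t", "\t",
         "-e", path + ".bcperror.log"],
        capture_output=True, text=True)
    return path, result.returncode

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    for path, rc in pool.map(copy_in, files):
        print(path, "OK" if rc == 0 else "FAILED (rc=%d)" % rc)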

Cassandra service cannot start on CentOS 6.4, starting process is mysteriously killed

Cassandra version: 2.2
OS: CentOS 6.4
I just followed the documentation here to run
[user#centOS conf]$ sudo yum install dsc22
[user#centOS conf]$ sudo service cassandra start
Starting Cassandra: OK
But when I check the process status with "ps" there is no Cassandra process running, since the starting process is mysteriously killed. (I run the same thing on my Mac and it all works fine, and on CentOS directly typing 'cassandra' on the command line also works. Update: I tried "yum install dsc20" and that works fine too...)
'/var/log/cassandra/cassandra.log' looks like the below: no error message, no warning. Does anyone have any idea about this? Thanks a lot.
tail: /var/log/cassandra/cassandra.log: file truncated
CompilerOracle: inline org/apache/cassandra/db/AbstractNativeCell.compareTo (Lorg/apache/cassandra/db/composites/Composite;)I
CompilerOracle: inline org/apache/cassandra/db/composites/AbstractSimpleCellNameType.compareUnsigned (Lorg/apache/cassandra/db/composites/Composite;Lorg/apache/cassandra/db/composites/Composite;)I
CompilerOracle: inline org/apache/cassandra/io/util/Memory.checkBounds (JJ)V
CompilerOracle: inline org/apache/cassandra/io/util/SafeMemory.checkBounds (JJ)V
CompilerOracle: inline org/apache/cassandra/utils/AsymmetricOrdering.selectBoundary (Lorg/apache/cassandra/utils/AsymmetricOrdering/Op;II)I
CompilerOracle: inline org/apache/cassandra/utils/AsymmetricOrdering.strictnessOfLessThan (Lorg/apache/cassandra/utils/AsymmetricOrdering/Op;)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare (Ljava/nio/ByteBuffer;[B)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare ([BLjava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compareUnsigned (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/lang/Object;JILjava/lang/Object;JI)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/lang/Object;JILjava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
INFO 23:47:52 Loading settings from file:/etc/cassandra/default.conf/cassandra.yaml
INFO 23:47:52 Node configuration:[authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_snapshot=true; batch_size_fail_threshold_in_kb=50; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_contention_timeout_in_ms=1000; client_encryption_options=<REDACTED>; cluster_name=Test Cluster; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_directory=/data/cassandra/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_large_partition_warning_threshold_mb=100; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[/data/cassandra/data]; disk_failure_policy=stop; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; enable_user_defined_functions=false; endpoint_snitch=SimpleSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=localhost; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=10000; role_manager=CassandraRoleManager; roles_validity_in_ms=2000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=localhost; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/data/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=<REDACTED>; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=false; storage_port=7000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; tracetype_query_ttl=86400; tracetype_repair_ttl=604800; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; windows_timer_interval=1; write_request_timeout_in_ms=2000]
INFO 23:47:52 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO 23:47:52 Global memtable on-heap threshold is enabled at 121MB
INFO 23:47:52 Global memtable off-heap threshold is enabled at 121MB
INFO 23:47:52 Loading settings from file:/etc/cassandra/default.conf/cassandra.yaml
INFO 23:47:52 Node configuration:[authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_snapshot=true; batch_size_fail_threshold_in_kb=50; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_contention_timeout_in_ms=1000; client_encryption_options=<REDACTED>; cluster_name=Test Cluster; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_directory=/data/cassandra/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_large_partition_warning_threshold_mb=100; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[/data/cassandra/data]; disk_failure_policy=stop; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; enable_user_defined_functions=false; endpoint_snitch=SimpleSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=localhost; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=10000; role_manager=CassandraRoleManager; roles_validity_in_ms=2000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=localhost; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/data/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=<REDACTED>; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=false; storage_port=7000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; tracetype_query_ttl=86400; tracetype_repair_ttl=604800; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; windows_timer_interval=1; write_request_timeout_in_ms=2000]
INFO 23:47:52 Hostname: centos-vm.com
INFO 23:47:52 JVM vendor/version: OpenJDK 64-Bit Server VM/1.7.0_45
INFO 23:47:52 Heap size: 509214720/509214720
INFO 23:47:52 Code Cache Non-heap memory: init = 2555904(2496K) used = 636288(621K) committed = 2555904(2496K) max = 50331648(49152K)
INFO 23:47:52 Par Eden Space Heap memory: init = 104071168(101632K) used = 50207568(49030K) committed = 104071168(101632K) max = 104071168(101632K)
INFO 23:47:52 Par Survivor Space Heap memory: init = 12976128(12672K) used = 0(0K) committed = 12976128(12672K) max = 12976128(12672K)
INFO 23:47:52 CMS Old Gen Heap memory: init = 392167424(382976K) used = 0(0K) committed = 392167424(382976K) max = 392167424(382976K)
INFO 23:47:52 CMS Perm Gen Non-heap memory: init = 21757952(21248K) used = 12843096(12542K) committed = 21757952(21248K) max = 174063616(169984K)
INFO 23:47:52 Classpath: /etc/cassandra/conf:/usr/share/cassandra/lib/airline-0.6.jar:/usr/share/cassandra/lib/antlr-runtime-3.5.2.jar:/usr/share/cassandra/lib/cassandra-driver-core-2.2.0-rc2-SNAPSHOT-20150617-shaded.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/commons-math3-3.2.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/share/cassandra/lib/crc32ex-0.1.1.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/ecj-4.4.2.jar:/usr/share/cassandra/lib/guava-16.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.0.6.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.3.0.jar:/usr/share/cassandra/lib/javax.inject.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/jna-4.0.0.jar:/usr/share/cassandra/lib/joda-time-2.4.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.9.2.jar:/usr/share/cassandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/logback-classic-1.1.3.jar:/usr/share/cassandra/lib/logback-core-1.1.3.jar:/usr/share/cassandra/lib/lz4-1.3.0.jar:/usr/share/cassandra/lib/metrics-core-3.1.0.jar:/usr/share/cassandra/lib/metrics-logback-3.1.0.jar:/usr/share/cassandra/lib/netty-all-4.0.23.Final.jar:/usr/share/cassandra/lib/ohc-core-0.3.4.jar:/usr/share/cassandra/lib/ohc-core-j8-0.3.4.jar:/usr/share/cassandra/lib/reporter-config3-3.0.0.jar:/usr/share/cassandra/lib/reporter-config-base-3.0.0.jar:/usr/share/cassandra/lib/sigar-1.6.4.jar:/usr/share/cassandra/lib/slf4j-api-1.7.7.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.0.4.1.jar:/usr/share/cassandra/lib/snappy-java-1.1.1.7.jar:/usr/share/cassandra/lib/ST4-4.0.8.jar:/usr/share/cassandra/lib/stream-2.5.2.jar:/usr/share/cassandra/lib/super-csv-2.1.0.jar:/usr/share/cassandra/lib/thrift-server-0.3.7.jar:/usr/share/cassandra/apache-cassandra-2.2.0.jar:/usr/share/cassandra/apache-cassandra-thrift-2.2.0.jar:/usr/share/cassandra/stress.jar::/usr/share/cassandra/lib/jamm-0.3.0.jar
It turns out to be a memory problem. Since the starting process is mysteriously killed, the fact is it's killed by the kernel.
I'm running it on virtual machine with 1G memory.
When I run "ps -ef | grep cassandra" and then "dmesg | egrep -i -B100 'cassandra-pid'", I got
Out of memory: Kill process 9450 (java) score 218 or sacrifice child
Killed process 9450, UID 0, (java) total-vm:1155600kB, anon-rss:797292kB, file-rss:100956kB
That confirmed it's a memory issue. Then I just modified the memory allocation for Cassandra and everything worked fine.
in cassandra-env.sh
change:
half_system_memory_in_mb=`expr $system_memory_in_mb / 2`
quarter_system_memory_in_mb=`expr $half_system_memory_in_mb / 2`
to be:
half_system_memory_in_mb="300"
quarter_system_memory_in_mb="300"
You can check by typing the command netstat -tunlp at the prompt; there you will find the process ID for port 7199. Then you have to kill it if you want to stop Cassandra.
netstat -tunlp tells us the process ID per port.

Pig local mode, group, or join = java.lang.OutOfMemoryError: Java heap space

Using Apache Pig version 0.10.1.21 (reported),
CentOS release 6.3 (Final), jdk1.6.0_31 (The Hortonworks Sandbox v1.2 on Virtualbox, with 3.5 GB RAM)
$ cat data.txt
11,11,22
33,34,35
47,0,21
33,6,51
56,6,11
11,25,67
$ cat GrpTest.pig
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;
$ pig -x local GrpTest.pig
[Thread-12] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[Thread-13] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#19a9bea3
[Thread-13] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
[Thread-13] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0002
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B
The java.lang.OutOfMemoryError: Java heap space error occurs each time I use GROUP or JOIN in a pig script executed in local mode. There is no error when the script is executed in mapreduce mode on HDFS.
Question 1: How come there is an OutOfMemory error while the data sample is minuscule and local mode is supposed to use less resources than HDFS mode?
Question 2: Is there a solution to run successfully a small pig scripts with GROUP or JOIN in local mode?
Solution: force Pig to allocate less memory for the Java property io.sort.mb.
I set it to 10 MB here and the error disappears. Not sure what would be the best value, but at least this allows practicing Pig syntax in local mode.
$ cat GrpTest.pig
--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;
The reason is you have less memory allocated to Java locally than you do on your Hadoop cluster machines. This is actually a pretty common error in Hadoop. It mostly occurs when you create a really long relation in Pig at any point, and happens because Pig always wants to load an entire relation into memory and doesn't want to lazy load it in any way.
When you do something like GROUP BY where the tuple you're grouping by is non-sparse over many records, you frequently wind up creating single long relations at least temporarily since you're basically taking a whole bunch of individual relations and cramming them all into one single long relation. Either change your code so you don't wind up creating single very long relations at any point (i.e. group by something more sparse), or increase the memory available to Java.

Reset Redis "used_memory_peak" stat

I'm using Redis (2.4.2) and with the INFO command I can read stats about my Redis server.
There are many stats, including some about how much memory is used. And one is "used_memory_peak" that seems to hold the maximum amount of memory Redis has ever taken.
I've deleted a bunch of keys, and I'd like to reset this stat since it affects the scale of my Munin graphs.
There is a CONFIG RESETSTAT command, but it doesn't seem to affect this particular stat.
Any idea how I could do this, without having to export/delete/import my dataset ?
EDIT :
According to #antirez himself (issue 369 on GitHub), this is an intended behavior, but this feature could be improved to be more useful in a future release.
The implementation of CONFIG RESETSTAT is quite simple:
} else if (!strcasecmp(c->argv[1]->ptr,"resetstat")) {
    if (c->argc != 2) goto badarity;
    server.stat_keyspace_hits = 0;
    server.stat_keyspace_misses = 0;
    server.stat_numcommands = 0;
    server.stat_numconnections = 0;
    server.stat_expiredkeys = 0;
    addReply(c,shared.ok);
So it does not initialize the server.stat_peak_memory field used to store the maximum amount of memory ever used by Redis. I don't know if it is a bug or a feature.
Here is a hack to reset the value without having to stop Redis. The idea is to use gdb in batch mode to just change the value of the variable (which is part of a static structure). Normally Redis is compiled with debugging symbols.
# Here we have plenty of things in this instance
> ./redis-cli info | grep peak
used_memory_peak:1363052184
used_memory_peak_human:1.27G
# Let's do some cleaning: everything is wiped out
# don't do this in production !!!
> ./redis-cli flushdb
OK
# Again the same values, while some memory has been freed
> ./redis-cli info | grep peak
used_memory_peak:1363052184
used_memory_peak_human:1.27G
# Here is the magic command: reset the parameter with gdb (output and warnings to be ignored)
> gdb -batch -n -ex 'set variable server.stat_peak_memory = 0' ./redis-server `pidof redis-server`
Missing separate debuginfo for /lib64/libm.so.6
Missing separate debuginfo for /lib64/libdl.so.2
Missing separate debuginfo for /lib64/libpthread.so.0
[Thread debugging using libthread_db enabled]
[New Thread 0x41001940 (LWP 22837)]
[New Thread 0x40800940 (LWP 22836)]
Missing separate debuginfo for /lib64/libc.so.6
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff51ff000
0x00002af0b5eef218 in epoll_wait () from /lib64/libc.so.6
# And now, result is different: great !
> ./redis-cli info | grep peak
used_memory_peak:718768
used_memory_peak_human:701.92K
This is a hack: think twice before applying this trick on a production instance.
Simple trick to clear peak memory:
Step 1:
/home/logproc/redis/bin/redis-cli BGREWRITEAOF
Wait till it finishes rewriting the AOF file.
Step 2:
Restart the Redis DB.
Done. That's it.
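To verify the reset (or to feed the stat into Munin-style graphs), the value can also be read programmatically from INFO. A minimal sketch using the redis-py client, with host and port assumed for a local instance:
import redis  # redis-py client, assumed installed

r = redis.Redis(host="localhost", port=6379)
info = r.info()  # parsed output of the INFO command
print("used_memory_peak:", info["used_memory_peak"])
print("used_memory_peak_human:", info["used_memory_peak_human"])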