JMX issue with DataStax OpsCenter Agent

We're running an 8-node DSE cluster (just Cassandra) divided among two datacenters. Everything is working fine apart from the agent on one node, which stubbornly refuses to cooperate.
Here is the version info:
Cassandra 3.0.9.1346
DSE 5.0.3
OpsCenter 6.0.3
All nodes have upgraded SSTables and have been repaired.
Here is the log:
INFO [async-dispatch-47] 2016-11-08 14:33:13,811 Starting system.
INFO [async-dispatch-47] 2016-11-08 14:33:13,832 Starting DynamicEnvironmentComponent
WARN [async-dispatch-47] 2016-11-08 14:33:13,845 Exception while processing JMX data: java.lang.NullPointerException
ERROR [async-dispatch-47] 2016-11-08 14:33:13,845 Error starting DynamicEnvironmentComponent.
java.lang.NullPointerException
at clojure.java.io$as_relative_path.invoke(io.clj:404)
at clojure.java.io$file.invoke(io.clj:416)
at opsagent.environment.collection$cassandra_yaml_location__GT_install_location.invoke(collection.clj:118)
at opsagent.environment.dynamic$dynamic_env_state.invoke(dynamic.clj:61)
at clojure.core$partial$fn__4527.invoke(core.clj:2492)
at opsagent.jmx$create_jmx_pool_with_config$wrapper__11504.doInvoke(jmx.clj:221)
at clojure.lang.RestFn.invoke(RestFn.java:410)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.RestFn.applyTo(RestFn.java:132)
at clojure.core$apply.invoke(core.clj:630)
at opsagent.environment.dynamic$add_dynamic_state.invoke(dynamic.clj:143)
at opsagent.environment.dynamic.DynamicEnvironmentComponent.start(dynamic.clj:168)
at com.stuartsierra.component$fn__8838$G__8832__8840.invoke(component.clj:4)
at com.stuartsierra.component$fn__8838$G__8831__8843.invoke(component.clj:4)
at clojure.lang.Var.invoke(Var.java:379)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.Var.applyTo(Var.java:700)
at clojure.core$apply.invoke(core.clj:632)
at com.stuartsierra.component$try_action.invoke(component.clj:116)
at clojure.lang.Var.invoke(Var.java:401)
at opsagent.config_service$update_system$fn__20056.invoke(config_service.clj:200)
at clojure.lang.ArraySeq.reduce(ArraySeq.java:114)
at clojure.core$reduce.invoke(core.clj:6518)
at opsagent.config_service$update_system.doInvoke(config_service.clj:194)
at clojure.lang.RestFn.invoke(RestFn.java:425)
at opsagent.config_service$start_system_BANG_.invoke(config_service.clj:219)
at opsagent.config_service$fn__20133$fn__20134$state_machine__4719__auto____20135$fn__20137.invoke(config_service.clj:245)
at opsagent.config_service$fn__20133$fn__20134$state_machine__4719__auto____20135.invoke(config_service.clj:242)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
at clojure.core.async$ioc_alts_BANG_$fn__4884.invoke(async.clj:362)
at clojure.core.async$do_alts$fn__4838$fn__4841.invoke(async.clj:231)
at clojure.core.async.impl.channels.ManyToManyChannel$fn__1215.invoke(channels.clj:262)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
INFO [async-dispatch-47] 2016-11-08 14:33:13,848 Finished starting system.
Any ideas?
Edit: here is the requested additional info.
The node's addresses are as follows:
listen: 10.2.2.22
rpc: 10.1.2.22
All IPs are private and SSL is disabled because the DCs are connected via a VPN.
# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.72 12.14 GB 256 ? a15c57e1-3c53-4d4f-9df9-29945b9f1c88 RAC1
UN 192.168.1.92 11.36 GB 256 ? 9820b96a-a3c6-460f-839b-5dabc89313a0 RAC1
UN 192.168.1.82 11.67 GB 256 ? f9c13cb0-ee44-4ce2-ac7e-14ec1f7c1d23 RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.2.2.32 11.04 GB 256 ? bfe86bb3-d272-4946-ac8a-e176fe9f8e64 RAC3
UN 10.2.2.22 11.29 GB 256 ? c8694e93-0d8a-41b4-82a3-6c450497e8ec RAC2
UN 10.2.2.52 11.46 GB 256 ? e941faf1-ad5b-46a7-8857-bdf9dd2a3459 RAC5
UN 10.2.2.42 10.9 GB 256 ? 7bbd2397-a3bc-4cfe-9c03-334186e7e0dd RAC4
UN 10.2.2.12 5.29 GB 256 ? bf7a0587-2b09-47d6-b6d5-24e1422318b9 RAC1
# address.yaml
stomp_interface: 192.168.1.31
use_ssl: 0
cassandara_conf: /etc/dse/cassandra
root@anc-t2:~# netstat -lnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.1.2.22:9160          0.0.0.0:*               LISTEN
tcp        0      0 10.1.2.22:9042          0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.2.2.22:7000          0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:17500           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:7199          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:17600         0.0.0.0:*               LISTEN
tcp        0      0 10.2.2.22:8609          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:54882         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:17603         0.0.0.0:*               LISTEN
tcp6       0      0 :::6900                 :::*                    LISTEN
tcp6       0      0 :::61621                :::*                    LISTEN
tcp6       0      0 :::9910                 :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
tcp6       0      0 :::17500                :::*                    LISTEN
# opscenterd.conf
[webserver]
port = 8888
interface = 0.0.0.0
# The following settings can be used to enable ssl support for the opscenter
# web application. Change these values to point to the ssl certificate and key
# that you wish to use for your OpsCenter install, as well as the port you would like
# to serve ssl traffic from.
#ssl_keyfile = /var/lib/opscenter/ssl/opscenter.key
#ssl_certfile = /var/lib/opscenter/ssl/opscenter.pem
#ssl_port = 8443
[authentication]
# Set this option to True to enable OpsCenter authentication. A default admin
# account will be created with the username "admin" and password "admin".
# Accounts and roles can then be created and modified from within the web UI.
enabled = False
# To help us better understand the needs of users and to improve OpsCenter, OpsCenter
# reports information about itself and the clusters it manages to a central DataStax
# server. This information is reported anonymously, and potentially sensitive
# information, such as IP addresses, are hashed in a non-reversible way:
# http://www.datastax.com/documentation/opscenter/help/statsReporterProperties.html
[stat_reporter]
# The interval setting determines how often statistics are reported. To disable
# reporting, set to 0
# interval = 86400 # 24 hours
# cluster.conf
[jmx]
username =
password =
port = 7199
[agents]
[cassandra]
username =
seed_hosts = 192.168.1.72
password =
cql_port = 9042
cqlsh:OpsCenter> desc KEYSPACE;
CREATE KEYSPACE "OpsCenter" WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'} AND durable_writes = true;
CREATE TABLE "OpsCenter".rollup_state (
node text,
name text,
res int,
avg float,
histogram blob,
max float,
min float,
ts timestamp,
type int,
value float,
weight float,
PRIMARY KEY (node, name, res)
) WITH CLUSTERING ORDER BY (name ASC, res ASC)
AND bloom_filter_fp_chance = 0.1
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"version": [5, 2, 1], "info": "OpsCenter management data."}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".events_timeline (
key text,
column1 bigint,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '8', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".settings (
key blob,
column1 blob,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '12', 'min_threshold': '8'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 1.0
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".rollups60 (
key text,
timestamp varint,
value blob,
PRIMARY KEY (key, timestamp)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (timestamp ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".backup_reports (
week text,
event_time timestamp,
backup_id text,
type text,
destination text,
deleted_at timestamp,
full_status text,
keyspaces text,
status text,
PRIMARY KEY (week, event_time, backup_id, type, destination)
) WITH CLUSTERING ORDER BY (event_time DESC, backup_id ASC, type ASC, destination ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".rollups86400 (
key text,
timestamp varint,
value blob,
PRIMARY KEY (key, timestamp)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (timestamp ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '8', 'min_threshold': '2'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".bestpractice_results (
key text,
column1 varint,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".pdps (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".rollups7200 (
key text,
timestamp varint,
value blob,
PRIMARY KEY (key, timestamp)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (timestamp ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '8', 'min_threshold': '2'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".events (
key text PRIMARY KEY,
action bigint,
api_source_ip text,
column_family text,
event_source text,
"keyspace" text,
level bigint,
message text,
source_node text,
success boolean,
target_node text,
time bigint,
user text
) WITH COMPACT STORAGE
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '12', 'min_threshold': '8'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';
CREATE TABLE "OpsCenter".rollups300 (
key text,
timestamp varint,
value blob,
PRIMARY KEY (key, timestamp)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (timestamp ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = '{"info": "OpsCenter management data.", "version": [5, 2, 1]}'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '16', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.25
AND speculative_retry = '99PERCENTILE';

Related

Combine 2 different sized arrays element-wise based on index pairing array

Say we have 2 arrays of unique values:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) # any values are possible,
b = np.array([0, 11, 12, 13, 14, 15, 16, 17, 18, 19]) # sorted values are for demonstration
where a[0] corresponds to b[0], a[1] to b[1] (value 11), a[2] to b[2] (value 12), etc.
Then, due to some circumstances, we randomly lost some of these values and picked up noise elements in both a and b. The 'useful data' in a and b is now 'eroded' like this:
a = np.array([0, 1, 313, 2, 3, 4, 5, 934, 6, 8, 9, 730, 241, 521])
b = np.array([112, 514, 11, 13, 16, 955, 17, 18, 112])
The noise elements have a negligible probability of coinciding with any of the 'useful data'. So, by searching for the surviving values, we can define the 'index pairing array':
cor_tab = np.array([[1,2], [4,3], [8,4], [9,7]])
which, if applied, provides pairs of 'useful data' left:
np.column_stack((a[cor_tab[:,0]], b[cor_tab[:,1]]))
array([[1, 11],
[3, 13],
[6, 16],
[8, 18]])
The question: given the 'eroded' a and b, how can they be combined into a numpy array such that:
values indexed in cor_tab are paired in the same row,
lost values are treated as -1,
noise as 'don't care', and
the array looks like this:
[[ -1 112],
[ 0 514],
[ 1 11],
[313 -1],
[ 2 -1],
[ 3 13],
[ 4 -1],
[ 5 -1],
[934 -1],
[ 6 16],
[ -1 955],
[ -1 17],
[ 8 18],
[ 9 -1],
[730 -1],
[241 -1],
[521 112]]
, where 'useful data' is at indices: 2, 5, 9, 12?
Initially I solved this in a dubious way:
import numpy as np

def combine(aa, bb, t):
    c0 = np.empty((0), int)
    c1 = np.empty((0), int)
    # add -1 & 'noise' at the left side:
    if t[0][0] > t[0][1]:
        c0 = np.append(c0, aa[: t[0][0]])
        c1 = np.append(c1, [np.append([-1] * (t[0][0] - t[0][1]), bb[: t[0][1]])])
    else:
        c0 = np.append(c0, [np.append([-1] * (t[0][1] - t[0][0]), aa[: t[0][0]])])
        c1 = np.append(c1, bb[: t[0][1]])
    ind_compenstr = t[0][0] - t[0][1]  # 'index compensator'
    for i, ii in enumerate(t):
        x = ii[0] - ii[1] - ind_compenstr
        # add -1 & 'noise' in the middle:
        if x > 0:
            c0 = np.append(c0, [aa[ii[0]-x:ii[0]]])
            c1 = np.append(c1, [[-1] * x])
        elif x == 0:
            c0 = np.append(c0, [aa[ii[0]-x:ii[0]]])
            c1 = np.append(c1, [bb[ii[1]-x:ii[1]]])
        else:
            x = abs(x)
            c0 = np.append(c0, [[-1] * x])
            c1 = np.append(c1, [bb[ii[1]-x:ii[1]]])
        # add useful elements:
        c0 = np.append(c0, aa[ii[0]])
        c1 = np.append(c1, bb[ii[1]])
        ind_compenstr += x
    # add -1 & 'noise' at the right side:
    l0 = len(aa) - t[-1][0]
    l1 = len(bb) - t[-1][1]
    if l0 > l1:
        c0 = np.append(c0, aa[t[-1][0] + 1:])
        c1 = np.append(c1, [np.append(bb[t[-1][1] + 1:], [-1] * (l0 - l1))])
    else:
        c0 = np.append(c0, [np.append(aa[t[-1][0] + 1:], [-1] * (l1 - l0))])
        c1 = np.append(c1, bb[t[-1][1] + 1:])
    return np.array([c0, c1])
But below I suggest another solution.
It is difficult to understand exactly what the question wants, but IIUC, we first need to find the column size of the expected array, which holds the combined unique values of the two arrays (np.union1d), and then create an array of that size filled with -1 (np.full). Next, np.searchsorted gives the indices of one array's values within the other array, and np.in1d in invert mode identifies the values that are not contained in the other array. So we can achieve the goal by indexing as follows:
union_ = np.union1d(a, b)
# [0 1 2 3 4 5 6 7 8 9]
res = np.full((2, union_.size), -1)
# [[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
# [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]]
arange_row_ids = np.arange(union_.size)
# [0 1 2 3 4 5 6 7 8 9]
col_inds = np.searchsorted(a, b)[np.in1d(b, a, invert=True)]
# np.searchsorted(a, b) ---> [1 3 6 7 7]
# np.in1d(b, a, invert=True) ---> [False False False True False]
# [7]
res[0, np.delete(arange_row_ids, col_inds + np.arange(col_inds.size))] = a
# np.delete(arange_row_ids, col_inds + np.arange(col_inds.size)) ---> [0 1 2 3 4 5 6 8 9]
# [[ 0 1 2 3 4 5 6 -1 8 9]
# [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]]
col_inds = np.searchsorted(b, a)[np.in1d(a, b, invert=True)]
# np.searchsorted(b, a) ---> [0 0 1 1 2 2 2 4 5]
# np.in1d(a, b, invert=True) ---> [ True False True False True True False False True]
# [0 1 2 2 5]
res[1, np.delete(arange_row_ids, col_inds + np.arange(col_inds.size))] = b
# np.delete(arange_row_ids, col_inds + np.arange(col_inds.size)) ---> [1 3 6 7 8]
# [[ 0 1 2 3 4 5 6 -1 8 9]
# [-1 1 -1 3 -1 -1 6 7 8 -1]]
The question is not clear enough to tell whether this is the expected answer, but I think it is a helpful starting point that can be modified further as needed.
Here's a partially vectorized solution:
import numpy as np
# this function is from Divakar's answer at
# https://stackoverflow.com/questions/38619143/convert-python-sequence-to-numpy-array-filling-missing-values
def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())[::-1]
    out = np.full(mask.shape, -1, dtype=int)
    out[mask] = np.concatenate(v)
    return out

# 2 arrays with eroded useful data and the index pairing array:
a = np.array([0, 1, 313, 2, 3, 4, 5, 934, 6, 8, 9, 730, 241, 521])
b = np.array([112, 514, 11, 13, 16, 955, 17, 18, 112])
cor_tab = np.array([[1,2], [4,3], [8,4], [9,7]])
# split each array at the corresponding indices in `cor_tab`:
aa = np.split(a, cor_tab[:,0]+1)
bb = np.split(b, cor_tab[:,1]+1)
# initiate 2 flat empty arrays:
aaa = np.empty((0), int)
bbb = np.empty((0), int)
# loop over the split arrays:
for i, j in zip(aa, bb):
    c = boolean_indexing([i, j])
    aaa = np.append(aaa, c[0])
    bbb = np.append(bbb, c[1])
ccc = np.array([aaa, bbb]).T
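As a quick sanity check (my addition, using the arrays and cor_tab above), printing ccc should reproduce the 17x2 array shown in the question:
print(ccc)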
For other types of data, here is another example. Let's take two arrays of letters:
a = np.array(['y', 'w', 'a', 'e', 'i', 'o', 'u', 'y', 'w', 'a', 'e', 'i', 'o', 'u'])
b = np.array(['t', 'h', 'b', 't', 'c', 'n', 's', 'j', 'p', 'z', 'n', 'h', 't', 's', 'm', 'p'])
and an index pairing array:
cor_tab = np.array([[2,0], [3,2], [4,3], [5,5], [6,6], [9,10], [11,12], [13,13]])
np.column_stack((a[cor_tab[:,0]], b[cor_tab[:,1]]))
array([['a', 't'], # useful data
['e', 'b'],
['i', 't'],
['o', 'n'],
['u', 's'],
['a', 'n'],
['i', 't'],
['u', 's']], dtype='<U1')
The only correction required is dtype='<U1' in boolean_indexing(). Result is:
[['y' '-'],
['w' '-'],
['a' 't'],
['-' 'h'],
['e' 'b'],
['i' 't'],
['-' 'c'],
['o' 'n'],
['u' 's'],
['-' 'j'],
['y' 'p'],
['w' 'z'],
['a' 'n'],
['e' 'h'],
['i' 't'],
['o' '-'],
['u' 's'],
['-' 'm'],
['-' 'p']]
It works for floats as well if the dtype in boolean_indexing() is changed to float.
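If you switch dtypes often, here is a small variation (my own sketch, not from the original answer) that parameterizes boolean_indexing() over the fill value and dtype so nothing has to be edited by hand:
import numpy as np

def boolean_indexing(v, fill=-1, dtype=int):
    # right-align every sequence into a rectangular array, padding with `fill`
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())[::-1]
    out = np.full(mask.shape, fill, dtype=dtype)
    out[mask] = np.concatenate(v)
    return out

boolean_indexing([[0, 1], [112, 514, 11]])                     # integers, fill -1
boolean_indexing([['y', 'w'], ['t']], fill='-', dtype='<U1')   # letters, fill '-'
boolean_indexing([[0.5], [1.5, 2.5]], fill=-1.0, dtype=float)  # floats, fill -1.0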

Remove nan from pandas binner

I have created the following pandas dataframe called train:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
ds = {
'matchKey' : [621062, 622750, 623508, 626451, 626611, 626796, 627114, 630055, 630225],
'og_max_last_dpd' : [10, 10, -99999, 10, 10, 10, 10, 10, 10],
'og_min_last_dpd' : [10, 10, -99999, 10, 10, 10, 10, 10, 10],
'og_max_max_dpd' : [0, 0, -99999, 1, 0, 5, 0, 4, 0],
'Target':[1,0,1,0,0,1,1,1,0]
}
train = pd.DataFrame(data=ds)
The dataframe looks like this:
print(train)
matchKey og_max_last_dpd og_min_last_dpd og_max_max_dpd Target
0 621062 10 10 0 1
1 622750 10 10 0 0
2 623508 -99999 -99999 -99999 1
3 626451 10 10 1 0
4 626611 10 10 0 0
5 626796 10 10 5 1
6 627114 10 10 0 1
7 630055 10 10 4 1
8 630225 10 10 0 0
I have then binned the column called og_max_max_dpd using this code:
def mono_bin(Y, X, char, n=20):
    X2 = X.fillna(-99999)
    r = 0
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X2, "Y": Y, "Bucket": pd.qcut(X2, n, duplicates="drop")})  # ,include_lowest=True
        d2 = d1.groupby("Bucket", as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.min().X, columns=["min_" + X.name])
    d3["max_" + X.name] = d2.max().X
    d3[Y.name] = d2.sum().Y
    d3["total"] = d2.count().Y
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
    pinf = float("+inf")
    array = []
    for i in range(len(d4) - 1):
        array.append(d4["max_" + char].iloc[i])
    return [ninf] + array + [pinf]
binner = mono_bin(train['Target'], train['og_max_max_dpd'], 'og_max_max_dpd')
I have printed out the binner which looks like this:
print(binner)
[-inf, -99999.0, nan, 0.0, nan, nan, 1.0, nan, nan, 4.0, nan, inf]
I want to remove the nan from that list so that the binner looks like this:
[-inf, -99999.0, 0.0, 1.0, 4.0, inf]
Does anyone know how to remove the nan?
You can simply use dropna to remove it from d4:
...
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    d4.dropna(inplace=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
...
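Alternatively (my addition, not part of the answer above), if you would rather leave mono_bin() untouched, you can filter the NaN entries out of the returned list itself; np.isnan is False for -inf and inf, so only the NaNs are dropped:
binner = [x for x in binner if not np.isnan(x)]
print(binner)
# [-inf, -99999.0, 0.0, 1.0, 4.0, inf]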

How to create a multiIndex (hierarchical index) dataframe object from another df's column's unique values?

I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.
Is there an easier way to have this information summarized besides creating this dataframe?
Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...
Original Dataframe:
data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'],
'B': [2001, 2002, 2003, 2004, 2005],
'C': ['F', 'M', 'F', 'F', 'M'],
'D': [0, 0, 0, 1, 0],
'E': [1, 0, 1, 0, 1],
'F': [1, 1, 0, 0, 0]})
A B C D E F
0 case 2001 F 0 1 1
1 case 2002 M 0 0 1
2 case 2003 F 0 1 0
3 case 2004 F 1 0 0
4 case 2005 M 1 1 0
Desired outcome:
unique percent
A case 100
B 2001 20
2002 20
2003 20
2004 20
2005 20
C F 60
M 40
D 0 80
1 20
E 0 40
1 60
F 0 60
1 40
My failed for loop attempt:
def unique_values(df):
    values = {}
    columns = []
    df = pd.DataFrame(values, columns=columns)
    for col in data:
        df2 = data[col].value_counts(normalize=True)*100
        values = values.update(df2.to_dict)
        columns = columns.append(col*len(df2))
    return df
unique_values(data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
11
12
---> 13 unique_values(data)
<ipython-input-84-a341284fb859> in unique_values(df)
5 for col in data:
6 df2 = data[col].value_counts(normalize=True)*100
----> 7 values = values.update(df2.to_dict)
8 columns = columns.append(col*len(df2))
9 return df
TypeError: 'method' object is not iterable
Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.
This is a fairly straightforward application of .melt:
data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)
output
index
variable value
A case 1.0
B 2001 0.2
2002 0.2
2003 0.2
2004 0.2
2005 0.2
C F 0.6
M 0.4
D 0 0.8
1 0.2
E 0 0.4
1 0.6
F 0 0.6
1 0.4
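If you want the exact layout from the question, with a percent column in percentages rather than fractions, a small variation of the same melt idea (my sketch, not part of the original answer) is:
out = (
    data.melt(var_name='variable', value_name='unique')
        .groupby(['variable', 'unique'])
        .size()
        .div(len(data))
        .mul(100)
        .rename('percent')
        .to_frame()
)
print(out)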
I'm sorry! I've written an answer, but it's in JavaScript. I came here thinking I had clicked on a JavaScript question and started coding, but on posting I saw that you're coding in Python.
I will post it anyway; maybe it will help you. Python is not that different from JavaScript ;-)
const data = {
  A: ["case", "case", "case", "case", "case"],
  B: [2001, 2002, 2003, 2004, 2005],
  C: ["F", "M", "F", "F", "M"],
  D: [0, 0, 0, 1, 0],
  E: [1, 0, 1, 0, 1],
  F: [1, 1, 0, 0, 0]
};

const getUniqueStats = (_data) => {
  const results = [];
  for (let row in _data) {
    // create list of unique values
    const s = [...new Set(_data[row])];
    // filter for unique values and count them for percentage, then push
    results.push({
      index: row,
      values: s.map((x) => ({
        unique: x,
        percentage: (_data[row].filter((y) => y === x).length / _data[row].length) * 100
      }))
    });
  }
  return results;
};

const results = getUniqueStats(data);
results.forEach((row) =>
  row.values.forEach((value) =>
    console.log(`${row.index}\t${value.unique}\t${value.percentage}%`)
  )
);
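For reference, here is a rough Python translation of the JavaScript above (my own sketch, assuming data is the pandas DataFrame from the question):
def get_unique_stats(df):
    # percentage of each unique value, per column
    results = []
    for col in df.columns:
        counts = df[col].value_counts(normalize=True) * 100
        results.append({'index': col,
                        'values': [{'unique': val, 'percentage': pct}
                                   for val, pct in counts.items()]})
    return results

for row in get_unique_stats(data):
    for value in row['values']:
        print(f"{row['index']}\t{value['unique']}\t{value['percentage']}%")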

SSL handshake 20s delay to complete process

We are experiencing a delay when accessing an https:// connection to our server.
There is a delay of around 20s before the content of the page is displayed, which we traced to the SSL handshake process.
Using
strace -o /tmp/curl.trace.log -f -tt curl -kvv --trace-time https://10.20.23.7:
we traced this to:
91445 09:22:03.711590 getpeername(3, {sa_family=AF_INET, sin_port=htons(9070), sin_addr=inet_addr("10.20.23.7")}, [16]) = 0
91445 09:22:03.773153 sendto(3, "\26\3\1\0\220\1\0\0\214\3\3\332\3n\204\371\2472v\341E[\2247=2;\336\214\266}\4"..., 149, 0, NULL, 0) = 149
91445 09:22:03.773267 recvfrom(3, 0x1c2b258, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
91445 09:22:03.773315 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 5000) = 0 (Timeout)
91445 09:22:08.778047 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 5000) = 0 (Timeout)
91445 09:22:13.783040 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 5000) = 0 (Timeout)
91445 09:22:18.787090 poll([{fd=3, events=POLLIN|POLLPRI}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
91445 09:22:20.183606 recvfrom(3, "\26\3\3\n\264", 5, 0, NULL, NULL) = 5
91445 09:22:20.183724 recvfrom(3, "\2\0\0M\3\3]\312\300\34\367\363\256%V\25s)\tSVX\317*\272=\205(\311\1<\345"..., 2740, 0, NULL, NULL) = 2740
91445 09:22:20.184242 fcntl(4, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}) = 0
As I understand it, recvfrom is not able to read a response to the request that was sent with sendto(...).
What does this sendto line mean, and where was the request sent to?
91445 09:22:03.773153 sendto(3, "\26\3\1\0\220\1\0\0\214\3\3\332\3n\204\371\2472v\341E[\2247=2;\336\214\266}\4"..., 149, 0, NULL, 0) = 149
Thank you for the info.
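In case it helps narrow this down, here is a minimal sketch (my addition, assuming Python 3 and its standard ssl module; host and port are taken from the getpeername() line in the trace) that times the TCP connect and the TLS handshake separately, roughly what curl -kvv is doing:
import socket
import ssl
import time

host, port = "10.20.23.7", 9070   # from the getpeername() line above

t0 = time.monotonic()
sock = socket.create_connection((host, port), timeout=60)
t1 = time.monotonic()

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE   # equivalent of curl -k
tls = ctx.wrap_socket(sock)       # the TLS handshake happens here
t2 = time.monotonic()

print(f"TCP connect:   {t1 - t0:.2f}s")
print(f"TLS handshake: {t2 - t1:.2f}s")
tls.close()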

How to speed up scrapy

I need to collect a lot (really a lot) of data for statistics; all the necessary information is in <script type="application/ld+json"></script>,
and I wrote a Scrapy parser for it (the script inside the HTML), but parsing is very slow (about 3 pages per second). Is there any way to speed up the process? Ideally I would like to see 10+ pages per second.
#spider.py:
import scrapy
import json

class Spider(scrapy.Spider):
    name = 'scrape'
    start_urls = [
        # about 10000 urls
    ]

    def parse(self, response):
        data = json.loads(response.css('script[type="application/ld+json"]::text').extract_first())
        name = data['name']
        image = data['image']
        path = response.css('span[itemprop="name"]::text').extract()
        yield {
            'name': name,
            'image': image,
            'path': path
        }
        return
#settings.py:
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 0.33
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}
AUTOTHROTTLE_DEBUG = False
LOG_ENABLED = False
My PC specs:
16GB ram, i5 2400, ssd, 1gb ethernet
#Edited
settings.py
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30
RANDOMIZE_DOWNLOAD_DELAY = True
REACTOR_THREADPOOL_MAXSIZE = 128
CONCURRENT_REQUESTS = 256
CONCURRENT_REQUESTS_PER_DOMAIN = 256
CONCURRENT_REQUESTS_PER_IP = 256
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 0.25
AUTOTHROTTLE_TARGET_CONCURRENCY = 128
AUTOTHROTTLE_DEBUG = True
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 401, 403, 404, 405, 406, 407, 408, 409, 410, 429]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 80,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 120,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 130,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 900,
    'scraper.middlewares.ScraperDownloaderMiddleware': 1000
}