I tried to use the cbbackup and cbtransfer CLI tools from Couchbase v4.0.0 to back up a Couchbase v3.0.1 cluster. Both commands fail. Below is the output of the cbbackup command. Why can the higher-version CLI not work with the lower-version Couchbase server? Are the newer tools backward compatible with older servers?
[cb-v4.0.0]$ /opt/couchbase/bin/cbbackup http://<cb-v3.0.1>:8091 /tmp/cbbackup -u 'xxxx' -p '***' -v
2016-02-28 03:05:28,679: mt cbbackup...
2016-02-28 03:05:28,679: mt source : http://<cb-v3.0.1>:8091
2016-02-28 03:05:28,679: mt sink : /tmp/cbbackup
2016-02-28 03:05:28,679: mt opts : {'username': '<xxx>', 'verbose': 1, 'dry_run': False, 'extra': {'max_retry': 10.0, 'rehash': 0.0, 'dcp_consumer_queue_length': 1000.0, 'data_only': 0.0, 'uncompress': 0.0, 'nmv_retry': 1.0, 'cbb_max_mb': 100000.0, 'report': 5.0, 'mcd_compatible': 1.0, 'try_xwm': 1.0, 'backoff_cap': 10.0, 'batch_max_bytes': 400000.0, 'report_full': 2000.0, 'flow_control': 1.0, 'batch_max_size': 1000.0, 'seqno': 0.0, 'design_doc_only': 0.0, 'recv_min_bytes': 4096.0}, 'single_node': False, 'ssl': False, 'vbucket_list': None, 'threads': 4, 'mode': 'full', 'key': None, 'password': '<xxx>', 'id': None, 'silent': False, 'bucket_source': None}
2016-02-28 03:05:28,713: mt bucket: productbucket
2016-02-28 03:05:44,590: w0 source : http://<cb-v3.0.1>:8091(productbucket#172.31.2.154:8091)
2016-02-28 03:05:44,590: w0 sink : /tmp/cbbackup(productbucket#172.31.2.154:8091)
2016-02-28 03:05:44,590: w0 : total | last | per sec
2016-02-28 03:05:44,590: w0 batch : 171 | 171 | 10.8
2016-02-28 03:05:44,590: w0 byte : 69447174 | 69447174 | 4374738.4
2016-02-28 03:05:44,590: w0 msg : 28346 | 28346 | 1785.6
[####################] 100.0% (28346/estimated 28346 msgs)
bucket: productbucket, msgs transferred...
: total | last | per sec
batch : 171 | 171 | 10.7
byte : 69447174 | 69447174 | 4331609.4
msg : 28346 | 28346 | 1768.0
Traceback (most recent call last):
  File "/opt/couchbase/lib/python/cbbackup", line 12, in <module>
    pump_transfer.exit_handler(pump_transfer.Backup().main(sys.argv))
  File "/opt/couchbase/lib/python/pump_transfer.py", line 94, in main
    rv = pumpStation.run()
  File "/opt/couchbase/lib/python/pump.py", line 140, in run
    self.transfer_bucket_index(source_bucket, source_map, sink_map)
  File "/opt/couchbase/lib/python/pump.py", line 267, in transfer_bucket_index
    source_bucket, source_map)
  File "/opt/couchbase/lib/python/pump_dcp.py", line 92, in provide_index
    err, index_server = pump.filter_server(opts, source_spec, 'index')
  File "/opt/couchbase/lib/python/pump.py", line 1057, in filter_server
    if filtor in node["services"] and node["status"] == "healthy":
KeyError: 'services'
It's unfortunately not supported right now. We're currently in the process of making all of the Couchbase tools work across a target set of server releases.
One workaround is to use XDCR from the 3.0.1 cluster to move your data to the 4.0 cluster.
Apart from the accepted solution above, you can use the cbtransfer tool to back up data from a 3.x server bucket and then use that backup to populate a 4.x server bucket:
# ssh to 3.x server
$ cbtransfer -b <bucket> http://<3.x.server.ip>:8091 bucket-backup
# copy the backup data from the 3.x server to the 4.x server using scp or a similar tool
# ssh to the 4.x server
$ cbtransfer bucket-backup http://<4.x.server.ip>:8091 -B <bucket>
The reason for switching servers is to run the cbtransfer executable that matches the version of the server you are talking to.
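Putting the pieces together, an end-to-end run might look like the sketch below. The host names, bucket name, credentials, and paths are placeholders, and the option spelling should be checked against cbtransfer --help for your version.
# on the 3.x server: dump the bucket into a local backup directory
/opt/couchbase/bin/cbtransfer http://<3.x.server.ip>:8091 /tmp/bucket-backup -b <bucket> -u <user> -p <password>
# copy the backup directory across to the 4.x server
scp -r /tmp/bucket-backup <user>@<4.x.server.ip>:/tmp/
# on the 4.x server: load the backup into the target bucket
/opt/couchbase/bin/cbtransfer /tmp/bucket-backup http://<4.x.server.ip>:8091 -B <bucket> -u <user> -p <password>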
Related
I'm running Spark drivers on AKS cluster pods. The Spark job picks up 250 million records from a given location in Azure Storage >> creates a dataframe from them >> selects one column >> builds a Python list with the Pandas tolist() method >> checks the existence of each record from the list in a Redis DB >> returns the results to the driver for further processing.
The input data is split into 50 parquet files, each containing 5 million records.
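For reference, a minimal sketch of the pipeline as described above; the storage path, column name, and Redis host are hypothetical placeholders, and the redis-py client is assumed for the Redis checks.
from pyspark.sql import SparkSession
import redis  # assumption: redis-py is the client used to talk to the Redis DB

spark = SparkSession.builder.getOrCreate()

# ~250 million records spread over 50 parquet files in Azure Storage (placeholder path)
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/input/")

# select one column and pull it back to the driver as a plain Python list
ids = df.select("record_id").toPandas()["record_id"].tolist()

# check the existence of each value in Redis on the driver
r = redis.Redis(host="redis-host", port=6379)
hits = [value for value in ids if r.exists(value)]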
DAG from the failing job: [screenshot]
Failing stage: [screenshot]
Because of the node specification I have limited compute power. For now, this is the Spark configuration I have:
spark.driver.maxResultSize=5G
spark.driver.memory=10G
spark.executor.memory=8G
spark.executor.cores=3
spark.driver.cores=3
spark.executor.instances=10
spark.sql.sources.partitionOverwriteMode=dynamic
spark.driver.memoryOverhead=0
spark.executor.memoryOverhead=2G
and defined in code:
spark_conf.set('spark.default.parallelism', num_executors * executor_cores)
spark_conf.set('spark.sql.shuffle.partitions', num_executors * executor_cores)
Error received on driver Pod:
[Stage 1:======================================================> (38 + 2) / 40]
[Stage 1:=======================================================> (39 + 1) / 40]
[Stage 2:> (0 + 24) / 30]
[Stage 2:> (0 + 25) / 30]
[Stage 2:> (0 + 27) / 30]
[Stage 2:> (0 + 28) / 30]
[Stage 2:> (0 + 29) / 30]
[Stage 2:> (0 + 30) / 30]
[Stage 2:=> (1 + 29) / 30]
[2022-12-28 15:28:45,156: ERROR/ForkPoolWorker-1] Exception while sending command.
2022-12-28T16:28:45+01:00 Traceback (most recent call last):
2022-12-28T16:28:45+01:00 File "/opt/spark/python/pyspark/sql/pandas/conversion.py", line 241, in _collect_as_arrow
2022-12-28T16:28:45+01:00 results = list(_load_from_socket((port, auth_secret), ArrowCollectSerializer()))
2022-12-28T16:28:45+01:00 File "/opt/spark/python/pyspark/sql/pandas/serializers.py", line 60, in load_stream
2022-12-28T16:28:45+01:00 for batch in self.serializer.load_stream(stream):
2022-12-28T16:28:45+01:00 File "/opt/spark/python/pyspark/sql/pandas/serializers.py", line 99, in load_stream
2022-12-28T16:28:45+01:00 for batch in reader:
2022-12-28T16:28:45+01:00 File "pyarrow/ipc.pxi", line 436, in __iter__
2022-12-28T16:28:45+01:00 File "pyarrow/ipc.pxi", line 468, in pyarrow.lib.RecordBatchReader.read_next_batch
2022-12-28T16:28:45+01:00 File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
2022-12-28T16:28:45+01:00 OSError: Expected to be able to read 449936 bytes for message body, got 442208
Running different Spark configurations leads to different errors, but each of them seems to be caused by insufficient memory (SIGKILL, Java heap size, etc.).
Is the Pandas conversion somehow bottlenecking the whole process through its memory usage here?
I was able to run this process for 1.5 million records. Can you advise how to configure and tune Spark to process the amount of data mentioned at the beginning? Is that even possible with the resources available?
I have two GeoPandas dataframes. The first has the schema below.
inventoryid object
dsuid object
basinquantum object
reservoir object
geometry geometry
crs_epsg object
buffer_dist float64
buffer geometry
The second dataframe's schema looks like this:
API12 object
geometry geometry
Basin object
Since the first dataframe has two geometry columns, I am setting the active geometry to the buffer column:
wells1=wells1.set_geometry("buffer")
I am performing an intersection operation:
res_intersection = gpd.overlay(wells2,wells1,how='intersection')
Although the geometry column is present, I am still getting an error like:
"['geometry'] not found in axis"
I created a repeatable section of code that matches your description.
It does fail in my environment, though with a different error when using pygeos: GEOSException: IllegalArgumentException: Argument must be Polygonal or LinearRing.
I will work on this further and, if necessary, raise a bug against GeoPandas.
I will also try against the main branch (not just v0.11) and investigate other versions of pygeos and GEOS.
It does work with rtree:
import geopandas as gpd
pygeos = False
gpd._compat.set_use_pygeos(pygeos)
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
cities = gpd.read_file(gpd.datasets.get_path("naturalearth_cities"))
# construct geodataframes logically equivalent to question
# two geometry columns
wells1 = cities.iloc[[11, 36, 50, 113, 158, 161, 172, 199]].copy()
wells1["buffer"] = (
wells1.to_crs(wells1.estimate_utm_crs()).buffer(1.5 * 10**5).to_crs(wells1.crs)
)
wells1 = wells1.set_geometry("buffer")
wells2 = world.iloc[[43, 54, 55, 58, 82, 129, 130]].copy()
gdf_i = gpd.overlay(wells2, wells1, how="intersection")
gdf_i.explore(height=300, width=500)
Output: [interactive map screenshot]
Environment
gpd.show_versions()
SYSTEM INFO
python : 3.10.6 (main, Aug 30 2022, 05:11:14) [Clang 13.0.0 (clang-1300.0.29.30)]
machine : macOS-11.6.8-x86_64-i386-64bit
GEOS, GDAL, PROJ INFO
GEOS : 3.11.0
GEOS lib : /usr/local/Cellar/geos/3.11.0/lib/libgeos_c.dylib
GDAL : 3.4.1
PROJ : 9.1.0
PYTHON DEPENDENCIES
geopandas : 0.11.1
numpy : 1.23.3
pandas : 1.4.4
pyproj : 3.4.0
shapely : 1.8.4
fiona : 1.8.21
geoalchemy2: None
geopy : None
matplotlib : 3.5.3
mapclassify: 2.4.3
pygeos : 0.13
pyogrio : None
psycopg2 : None
pyarrow : None
rtree : 1.0.0
I'm trying to improve my Keras neural network hyperparameters by optimizing them with the Weights & Biases library (wandb).
Here is my configuration:
method: bayes
metric:
  goal: maximize
  name: Search elo
parameters:
  batch_number:
    distribution: int_uniform
    max: 100
    min: 1
  batch_size:
    distribution: int_uniform
    max: 1024
    min: 1
  epochs:
    distribution: int_uniform
    max: 10
    min: 1
  neural_net_blocks:
    distribution: int_uniform
    max: 5
    min: 1
  num_simulations:
    distribution: int_uniform
    max: 800
    min: 1
  pb_c_base:
    distribution: int_uniform
    max: 25000
    min: 15000
  pb_c_init:
    distribution: uniform
    max: 3
    min: 1
  root_dirichlet_alpha:
    distribution: uniform
    max: 4
    min: 0
  root_exploration_fraction:
    distribution: uniform
    max: 1
    min: 0
program: ../Main.py
However, when I run wandb agent arkleseisure/projectname/sweepcode, I get this error, repeated every time a sweep launches.
2020-09-13 12:15:02,188 - wandb.wandb_agent - INFO - Running runs: ['klawqpqv']
2020-09-13 12:15:02,189 - wandb.wandb_agent - INFO - Cleaning up finished run: klawqpqv
2020-09-13 12:15:03,063 - wandb.wandb_agent - INFO - Agent received command: run
2020-09-13 12:15:03,063 - wandb.wandb_agent - INFO - Agent starting run with config:
batch_number: 75
batch_size: 380
epochs: 10
neural_net_blocks: 4
num_simulations: 301
pb_c_base: 17138
pb_c_init: 1.5509741790555416
root_dirichlet_alpha: 2.7032316257955133
root_exploration_fraction: 0.5768106739703028
2020-09-13 12:15:03,245 - wandb.wandb_agent - INFO - About to run command: python ../Main.py --batch_number=75 --batch_size=380 --epochs=10 --neural_net_blocks=4 --num_simulations=301 --pb_c_base=17138 --pb_c_init=1.5509741790555416 --root_dirichlet_alpha=2.7032316257955133 --root_exploration_fraction=0.5768106739703028
Traceback (most recent call last):
  File "../Main.py", line 3, in <module>
    import numpy
ModuleNotFoundError: No module named 'numpy'
The sweep crashes after three failed attempts, and I was wondering what I am doing wrong. Surely, since W&B is made for machine learning projects, it must be possible to import numpy, so what can I change? My code before that point just imports other files from my project. When I run the code normally, it doesn't crash but executes perfectly ordinarily.
The most likely problem you are running into is that wandb agent is running the python script with a different python interpreter than you were intending.
The solution is to specify the python interpreter by adding something like the following to the sweep configuration (where python3 is the interpreter you wish to use):
command:
  - ${env}
  - python3
  - ${program}
  - ${args}
This feature is documented at: https://docs.wandb.com/sweeps/configuration#command
And there is a FAQ for setting the python interpreter at:
https://docs.wandb.com/sweeps/faq#sweep-with-custom-commands
To understand a bit more about what is going on, you can look at the debug line you posted that says "About to run command:"
python ../Main.py --batch_number=75 --batch_size=380 --epochs=10 --neural_net_blocks=4 --num_simulations=301 --pb_c_base=17138 --pb_c_init=1.5509741790555416 --root_dirichlet_alpha=2.7032316257955133 --root_exploration_fraction=0.5768106739703028
By default wandb agent uses a python interpreter named python. This allows users to customize their environment so python points to their interpreter of choice by using pyenv, virtualenv or other tools.
If you typically run commands with the command-line python2 or python3, you can customize how the agent executes your program by specifying the command key in your configuration file as described above. Alternatively, if your program is executable and your python interpreter is in the first line of your script using #!/usr/bin/env python3 syntax, you can set your command array to be:
command:
  - ${env}
  - ${program}
  - ${args}
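For instance, merged into the sweep file from the question, the result might look like the sketch below; only the command block is new, and the parameters section is unchanged from the question.
program: ../Main.py
method: bayes
command:
  - ${env}
  - python3
  - ${program}
  - ${args}
metric:
  goal: maximize
  name: Search elo
parameters:
  # ... unchanged from the question ...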
Does anyone have experience using pandas UDFs in a local PySpark session running on Windows? I've used them on Linux with good results, but I've been unsuccessful on my Windows machine.
Environment:
python==3.7
pyarrow==0.15
pyspark==2.3.4
pandas==0.24
java version "1.8.0_74"
Sample script:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())
out_df = df.groupby("id").apply(subtract_mean).toPandas()
print(out_df.head())
# +---+----+
# | id| v|
# +---+----+
# | 1|-0.5|
# | 1| 0.5|
# | 2|-3.0|
# | 2|-1.0|
# | 2| 4.0|
# +---+----+
After running for a loooong time (splits the toPandas stage into 200 tasks each taking over a second) it returns an error like this:
Traceback (most recent call last):
  File "C:\miniconda3\envs\pandas_udf\lib\site-packages\pyspark\sql\dataframe.py", line 1953, in toPandas
    tables = self._collectAsArrow()
  File "C:\miniconda3\envs\pandas_udf\lib\site-packages\pyspark\sql\dataframe.py", line 2004, in _collectAsArrow
    sock_info = self._jdf.collectAsArrowToPython()
  File "C:\miniconda3\envs\pandas_udf\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\miniconda3\envs\pandas_udf\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\miniconda3\envs\pandas_udf\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o62.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 69 in stage 3.0 failed 1 times, most recent failure: Lost task 69.0 in stage 3.0 (TID 201, localhost, executor driver): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(Unknown Source)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeSchema(MessageSerializer.java:104)
at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:128)
at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:161)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:290)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.hasNext(ArrowConverters.scala:96)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.foreach(ArrowConverters.scala:94)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.to(ArrowConverters.scala:94)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.toBuffer(ArrowConverters.scala:94)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.toArray(ArrowConverters.scala:94)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Your java.lang.IllegalArgumentException in pandas_udf has to do with pyarrow version, not with OS environment. See this issue for details.
You have two possible routes of action:
Downgrade pyarrow to v0.14, or
Add the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 to SPARK_HOME/conf/spark-env.sh
On Windows, you'll need to have a spark-env.cmd file in the conf directory containing set ARROW_PRE_0_15_IPC_FORMAT=1, as suggested by Jonathan Taws.
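As a concrete illustration of that suggestion, the file could contain nothing more than the following; the location under SPARK_HOME\conf is an assumption based on a default installation.
REM %SPARK_HOME%\conf\spark-env.cmd
set ARROW_PRE_0_15_IPC_FORMAT=1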
Addendum to Sergey's answer:
If you prefer to build your own SparkSession in Python rather than change your config files, you'll need to set both spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT and the local executor's environment variable spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT:
spark_session = SparkSession.builder \
    .master("yarn") \
    .config('spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT', 1) \
    .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
spark = spark_session.getOrCreate()
Hope this helps!
My input is:
test=pd.read_csv("/gdrive/My Drive/data-kaggle/sample_submission.csv")
test.head()
It ran as expected.
But, for
test.to_csv('submitV1.csv', header=False)
The full error message that I got was:
OSError                                   Traceback (most recent call last)
<ipython-input-5-fde243a009c0> in <module>()
      9 from google.colab import files
     10 print(test)'''
---> 11 test.to_csv('submitV1.csv', header=False)
     12 files.download('/gdrive/My Drive/data-kaggle/submission/submitV1.csv')

2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   3018                                      doublequote=doublequote,
   3019                                      escapechar=escapechar, decimal=decimal)
-> 3020             formatter.save()
   3021
   3022         if path_or_buf is None:

/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in save(self)
    155                 f, handles = _get_handle(self.path_or_buf, self.mode,
    156                                          encoding=self.encoding,
--> 157                                          compression=self.compression)
    158                 close = True
    159

/usr/local/lib/python3.6/dist-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    422     elif encoding:
    423         # Python 3 and encoding
--> 424         f = open(path_or_buf, mode, encoding=encoding, newline="")
    425     elif is_text:
    426         # Python 3 and no explicit encoding

OSError: [Errno 95] Operation not supported: 'submitV1.csv'
Additional Information about the error:
Before running this command, if I run
df=pd.DataFrame()
df.to_csv("file.csv")
files.download("file.csv")
It runs properly, but the same code produces the "operation not supported" error if I run it after trying to convert the test data frame to a CSV file.
I am also getting the message "A Google Drive timeout has occurred (most recently at 13:02:43)" just before running the command.
You are currently in a directory in which you don't have write permissions.
Check your current directory with pwd. It might be gdrive or some directory inside it; that's why you are unable to save there.
Now change the current working directory to some other directory where you have write permissions. cd ~ will work fine; it will change the directory to /root.
Now you can use:
test.to_csv('submitV1.csv', header=False)
It will save 'submitV1.csv' to /root
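Put together, the workaround in a notebook cell might look like the sketch below; the read path is taken from the question, and the directory change is the only new step.
import os
import pandas as pd

os.chdir(os.path.expanduser("~"))  # switch to /root, where writes are allowed
print(os.getcwd())                 # confirm the current working directory

test = pd.read_csv("/gdrive/My Drive/data-kaggle/sample_submission.csv")
test.to_csv("submitV1.csv", header=False)  # now written to /root/submitV1.csv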