How to run query with lists and sets in cuDF


I am using cuDF (dask-cudf) to handle social media data at the scale of billions of records. I'm trying to use query to extract only the relevant users from the parent dataset.
However, unlike pandas, cuDF's query raises an error if I pass in a list or set.
The environment is Anaconda with RAPIDS 22.12 and CUDA 11.4.
The error is as follows:
TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Internal error at <numba.core.typeinfer.CallConstraint object at 0x7f381a6097f0>.
Failed in cuda mode pipeline (step: native lowering)
Failed in nopython mode pipeline (step: native lowering)
NRT required but not enabled
During: lowering "$6for_iter.1 = iternext(value=$phi6.0)" at /home/user/.pyenv/versions/anaconda3-2020.11/envs/rapids-22.12/lib/python3.8/site-packages/numba/cpython/listobj.py (664)
During: lowering "$6compare_op.2 = src in __CUDF_ENVREF__test" at <string> (2)
During: resolving callee type: type(CUDADispatcher(<function queryexpr_5ee033e5bcab9f09 at 0x7f381b909ee0>))
During: typing of call at <string> (6)
Enable logging at debug level for details.
File "<string>", line 6:
<source missing, REPL/exec in use?>
The test code is as follows:
df is a cudf.DataFrame holding an edge list with "src" and "dst" columns.
test = list(test_userid)[0:2]
df.query("(src==@test)or(dst==@test)")  # OK if test is a single value, not a list
df.query("src.isin(@test)")  # fails
df.query("src in @test")  # fails
df.query("src==@test")  # fails
Using query is not essential, so if there is another way to do this extraction, I would like to know about it as well.
I have confirmed that the same code extracts the rows correctly in pandas. The cuDF query also works correctly when given a single value instead of a list.
I believe cuDF should work properly when a list is passed as well.
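One possible way to do the extraction without query is a boolean mask built with Series.isin, which both pandas and cuDF provide. The sketch below is only an illustration under some assumptions: the helper name filter_edges and the sample values are made up, and the pandas fallback is there only so the snippet can run on a machine without a GPU.

```python
try:
    import cudf as xdf  # GPU path; needs a RAPIDS install and a CUDA device
except Exception:
    import pandas as xdf  # CPU fallback so the sketch still runs without a GPU


def filter_edges(df, users):
    """Keep the edges whose src or dst is one of the given users.

    filter_edges is a hypothetical helper name, not part of cuDF.
    """
    users = list(users)  # accept either a list or a set
    mask = df["src"].isin(users) | df["dst"].isin(users)
    return df[mask]


# Made-up sample edge list with the "src" and "dst" columns from the question.
df = xdf.DataFrame({"src": [1, 2, 3, 4], "dst": [2, 3, 4, 1]})
test = [1, 2]

filtered = filter_edges(df, test)
print(filtered)
```

The same mask expression should also be usable on a dask-cudf DataFrame, since Dask exposes isin as well, but that has not been checked against RAPIDS 22.12 here.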

Related

Presto - Unable to enable performance tuning

I'm getting the errors below after enabling the performance tuning parameters.
The performance tuning parameters used:
optimizer.join-reordering-strategy=AUTOMATIC
optimizer.join_distribution_type=AUTOMATIC
experimental.enable-dynamic-filtering=TRUE
I'm using Amazon EMR.
Presto version: Presto CLI 0.267-amzn-1
I'm adding these parameters to /etc/presto/conf/config.properties:
`2022-07-11T11:02:36.728Z ERROR main com.facebook.presto.server.PrestoServer Unable to create injector, see the following errors:
Configuration property 'optimizer.join_distribution_type' was not used
at com.facebook.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:244)
1 error
com.google.inject.CreationException: Unable to create injector, see the following errors:
Configuration property 'optimizer.join_distribution_type' was not used
at com.facebook.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:244)
1 error
at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:543)
at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:159)
at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
at com.google.inject.Guice.createInjector(Guice.java:87)
at com.facebook.airlift.bootstrap.Bootstrap.initialize(Bootstrap.java:251)
at com.facebook.presto.server.PrestoServer.run(PrestoServer.java:143)
at com.facebook.presto.server.PrestoServer.main(PrestoServer.java:85)
2022-07-11T11:02:42.674Z INFO main com.facebook.airlift.log.Logging Disabling stderr output`
Any idea how to fix this issue?

Pylint: same pylint and pandas version on 2 machines, 1 fails

I have 2 places running the same linting job:
Machine 1: Ubuntu over SSH
pandas==1.2.3
pylint==2.7.4
python 3.8.10
Machine 2: Gitlab CI Docker image, python:3.8.12-buster
pandas==1.2.3
pylint==2.7.4
Python 3.8.12
The Ubuntu machine is able to lint all the code fine, and it has for many months. Same for the CI job, except it had been running Python 3.7.8. Now that I upgraded the Docker image to Python 3.8.12, it throws several no-member linting errors on some Pandas objects. I've tried clearing CI caches etc.
I wish I could provide something more reproducible. But, to check my understanding of what a linter is doing, is it theoretically possible that a small version difference in python messes up pylint like this? For something like a no-member error on Pandas objects, I would think the dominant factor is the pandas version, but those are equal, so I'm confused!
Update:
I've looked at the Pandas code for pd.read_sql_query, which is what's causing the no-member error. It says:
def read_sql_query(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    chunksize: Optional[int] = None,
) -> Union[DataFrame, Iterator[DataFrame]]:
In Docker, I get E1101: Generator 'generator' has no 'query' member (no-member) (because I'm running .query on the returned dataframe). So it seems Pylint thinks that this function returns a generator. But it does not make this assumption in my other setup. (I've also verified the SHA sum of pandas/io/sql.py matches). This seems similar to this issue, but I am still baffled by the discrepancy in environments.

A fix that worked was to bump a limit like:
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"
in my .pylintrc file, as explained here.
I'm unsure why/if this is connected to my change in Python version, but I'm happy to use this and move on for now. It's probably complex.
(Another hack was to write a function that returns the passed arg if it is a dataframe, and returns a single dataframe if the passed arg is an iterable of dataframes. The ambiguous-type object could then be passed through this wrapper to clarify things for Pylint. While this was more intrusive on our codebase, we had dozens of calls to pd.read_csv and pd.read_sql_query, and only about 3 calls confused Pylint, so we almost used this solution.)
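For reference, the wrapper hack could look roughly like the sketch below. The name as_dataframe is hypothetical, and it assumes that "returns a single dataframe" means concatenating the chunks in order; returning only the first chunk would be another reading.

```python
from typing import Iterable, Union

import pandas as pd


def as_dataframe(obj: Union[pd.DataFrame, Iterable[pd.DataFrame]]) -> pd.DataFrame:
    """Return obj unchanged if it is a DataFrame, else concatenate its chunks.

    Hypothetical helper: the declared return type is always a DataFrame,
    which gives Pylint one unambiguous type to infer.
    """
    if isinstance(obj, pd.DataFrame):
        return obj
    # Assumption: the iterable yields DataFrame chunks that should be
    # stacked in order. pd.concat raises ValueError on an empty iterable.
    return pd.concat(list(obj), ignore_index=True)


# A plain DataFrame passes straight through.
single = pd.DataFrame({"a": [1, 2]})
same = as_dataframe(single)

# An iterator of chunks (as returned when chunksize is set) is combined.
chunks = (pd.DataFrame({"a": [i]}) for i in range(3))
combined = as_dataframe(chunks)
print(combined)
```

A call site would then read something like as_dataframe(pd.read_sql_query(sql, con)).query(...), so the .query call is made on a value with a single declared type.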

Using Optaplanner for VRPPD

I am trying to run the "optaplanner-mixedvrp-experiment" example developed by Geoffrey De Smet, and when I run it I get the following error:
Caused by: java.lang.IllegalStateException: The entity (MY) has a
variable (previousStandstill) with value (MUNO) which has a
sourceVariableName variable (nextVisit) with a value (WERBOMONT) which
is not null. Verify the consistency of your input problem for that
sourceVariableName variable.
I have not made any changes. I only cloned it and ran it: I import a dataset, solve it, and it throws this error.
Do you know what could be happening?
I am applying it to develop a VRP variant with multiple deliveries and pickups, and I get the same error there. I have enabled FULL_ASSERT mode, and nextVisit, previousStandstill and visitIndex are always null.

It's been a long time since I looked at that code, so it's using an old version of OptaPlanner. Our goal is still to clean it up and offer an out-of-the-box example for VRPPD (and probably remove some boilerplate along the way, using the upcoming @CollectionPlanningVariable etc). That being said, we have multiple users and customers who used that optaplanner-mixedvrp-experiment to successfully build VRPPD implementations.
Which dataset did you try?
FWIW, that IllegalStateException says that A.previous = B while B.next is not A. So either the dataset importer didn't build the model correctly before calling solve() (especially likely if it fails before the first construction heuristic (CH) step in FULL_ASSERT), or one of the custom moves corrupted the model.
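To make that invariant concrete, here is a language-neutral sketch in Python of the kind of check one could run on the imported dataset before calling solve(). It is not OptaPlanner code: the Standstill class and find_inconsistencies are hypothetical names, and the three locations are just the ones from the error message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Standstill:
    """Hypothetical stand-in for a depot or visit in the chained model."""
    name: str
    previous: Optional["Standstill"] = None
    next: Optional["Standstill"] = None


def find_inconsistencies(standstills: List[Standstill]) -> List[str]:
    """Report every A where A.previous = B but B.next is not A."""
    problems = []
    for a in standstills:
        b = a.previous
        if b is not None and b.next is not a:
            actual = b.next.name if b.next is not None else None
            problems.append(
                f"{a.name}.previous = {b.name}, but {b.name}.next = {actual}"
            )
    return problems


# Reproduce the shape of the reported error: MY points back at MUNO,
# while MUNO already points forward at WERBOMONT.
muno = Standstill("MUNO")
werbomont = Standstill("WERBOMONT", previous=muno)
muno.next = werbomont
my = Standstill("MY", previous=muno)

problems = find_inconsistencies([muno, werbomont, my])
print(problems)
```

If a check like this already reports problems right after the import, the importer is the likely culprit; if the imported model is clean, the custom moves are the next thing to look at.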

Running TensorFlow label_image example throws access violation

I verified that the path to the binary protobuf file (inception_v3_2016_08_28_frozen.pb) is correct. In the LoadGraph function, ReadBinaryProto appears to succeed (load_graph_status passes the ok check), but the call to Session->Create using the resulting graph_def throws an exception:
"Exception thrown: read access violation
session->_Mypair._Myval2 was nullptr."
If I examine the graph_def object, it doesn't really appear to contain anything (version_ is 0, _cached_size_ is 0, all pointers appear to be NULL, etc.).

Resolving this required adding the Visual Studio /WHOLEARCHIVE flag to a bunch of the TF library files. The ones I ended up whole-archiving (some may not be strictly necessary) were:
/WHOLEARCHIVE:tf_protos_cc.lib
/WHOLEARCHIVE:tf_c.lib
/WHOLEARCHIVE:tf_cc.lib
/WHOLEARCHIVE:tf_cc_framework.lib
/WHOLEARCHIVE:tf_cc_ops.lib
/WHOLEARCHIVE:tf_cc_while_loop.lib
/WHOLEARCHIVE:tf_core_cpu.lib
/WHOLEARCHIVE:tf_core_direct_session.lib
/WHOLEARCHIVE:tf_core_framework.lib
/WHOLEARCHIVE:tf_core_kernels.lib
/WHOLEARCHIVE:tf_core_lib.lib
/WHOLEARCHIVE:tf_core_ops.lib

can't set portTypeName in axistools-maven-plugin

It seems that the portTypeName parameter of the axistools-maven-plugin (version 1.3) cannot be set.
The classOfPortType parameter is required and cannot be omitted, but when setting it alongside portTypeName the following error appears:
Embedded error: Java2WSDL execution failed
invalid parameters, can not use portTypeName and classOfPortType together
I see there is a Jira issue here. Is there a workaround?
Ronen.
