The number of partitions [100000] relevant to the query is too large - sql

I execute the following script in DolphinDB:
select count(*) from pt
It throws an exception: The number of partitions [100000] relevant to the query is too large. Please add more specific filtering conditions on partition columns in WHERE clause, or consider changing the value of the configuration parameter maxPartitionNumPerQuery.
How do I change the configuration parameter maxPartitionNumPerQuery?

The parameter maxPartitionNumPerQuery specifies the maximum number of partitions a single query can involve. It is designed to prevent users from accidentally submitting a very large query. The default value is 65536. Set the new value in cluster.cfg.
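For example, a minimal sketch of the relevant line in cluster.cfg (100000 here is only an illustration chosen to cover the partition count reported in the error; pick a limit that suits your own data layout):
maxPartitionNumPerQuery=100000
The new value typically only takes effect after the data nodes are restarted.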

Related

"Cannot construct data type datetime" when filtering data, but all values filtered DO have valid dates

I am convinced that this question is NOT a duplicate of:
Cannot construct data type datetime, some of the arguments have values which are not valid
In that case the values passed in are explicitly not valid. Whereas in this case, all of the values the function could reasonably be expected to be called on are valid.
I know what the actual problem is, and it's not something that would help most people that find the other question. But it IS something that would be good to be findable on SO.
Please read the answer, and understand why it's different from the linked question before voting to close as dupe of that question.
I've run some SQL that's errored with the error message: Cannot construct data type datetime, some of the arguments have values which are not valid.
My SQL uses DATETIMEFROMPARTS, but it's fine evaluating that function in the select - it's only a problem when I filter on the selected value.
It's also demonstrating weird, can't-possibly-be-happening behaviour w.r.t. other changes to the query.
My query looks roughly like this:
WITH FilteredDataWithDate AS (
SELECT *, DATETIMEFROMPARTS(...some integer columns representing date data...) AS Date
FROM Table
WHERE <unrelated pre-condition filter>
)
SELECT * FROM FilteredDataWithDate
WHERE Date > '2020-01-01'
If I run that query, then it errors with the invalid data error.
But if I omit the final Date > filter, then it happily renders every result record, so clearly none of the values it's filtering on are invalid.
I've also manually examined the contents of Table WHERE <unrelated pre-condition filter> and verified that everything is a valid date.
It also has a wild collection of other behaviours:
If I replace all of ...some integer columns representing date data... with hard-coded numbers then it's fine.
If I replace some parts of that data with hard-coded values, that fixes it, but other parts don't. I can't find any particular pattern in what does or doesn't help.
If I remove most of the * columns from the Table select, then it starts to be fine again.
Specifically, it appears to break any time I include an nvarchar(max) column in the CTE.
If I add an additional filter to the CTE that limits the results to Id values in the following ranges, then the results are:
Between 130,000 and 140,000: Error.
Between 130,000 and 135,000: Fine.
Between 135,000 and 140,000: Fine (!!!).
Filtering by the Date column breaks everything ... but ORDER BY Date is fine. (and confirms that all dates lie within perfectly sensible bounds.)
Adding TOP 1000000 makes it work ... even though there are only about 1000 rows.
... WTAF?!
This took me a while to decode, but it turns out that the SQL Server compiler doesn't necessarily restrict its evaluation of the function to rows that are, or could be, relevant to the result set.
Depending on the execution plan it arrives at, the function could get called on any record in Table, even one that doesn't satisfy WHERE <unrelated pre-condition filter>.
This was found by another user, for another function, over here.
So the fact that it could return all the results without the filter wasn't actually proving that every input into the function was valid. And indeed there were some records in the table that weren't in the result set, but still had invalid data.
That actually means that even if you were to add an explicit WHERE filter to exclude rows containing invalid date-component data ... that isn't actually guaranteed to fix it, because the function may still get called against the 'excluded' rows.
Each of the random other things I did will have been influencing the query plan in one way or another that happened to fix/break things.
The solution is, naturally, to fix the underlying table data.
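If you need to find the rows with bad data before fixing them, a minimal diagnostic sketch is below. The column names YearCol, MonthCol and DayCol are placeholders for whatever integer columns feed DATETIMEFROMPARTS; checking the component ranges directly avoids ever calling the function on the bad rows (the day check is deliberately coarse and ignores month lengths):
SELECT *
FROM Table
WHERE YearCol  NOT BETWEEN 1753 AND 9999
   OR MonthCol NOT BETWEEN 1 AND 12
   OR DayCol   NOT BETWEEN 1 AND 31;
Once those rows are corrected (or their components nulled out, since DATETIMEFROMPARTS returns NULL for NULL arguments), the original query should run without the error, regardless of which execution plan the optimizer picks.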

Conditionally LIMIT in BigQuery

I have read that in Postgres setting LIMIT NULL will effectively not limit the results of the SELECT. However in BigQuery when I set LIMIT NULL based on a condition I see Syntax error: Unexpected keyword NULL.
I'd like to figure out a way to limit or not limit based on a condition (it could be an argument passed into a procedure, a parameter passed in by a query job, anything I can write a CASE or IF statement for). The mechanism for setting the condition doesn't matter; what I'm looking for is whether there is a syntactically valid way in BigQuery to give LIMIT a value that does not limit the results.
The LIMIT clause works differently in BigQuery. It specifies the maximum number of rows to return in the result, and the count in LIMIT count must be a constant INT64.
The LIMIT clause can also help you stay under the limit on cached result size, either by using filters to limit the result set or by using LIMIT to reduce the result set, especially if you are using an ORDER BY clause.
You can see this example:
SELECT
title
FROM
`my-project.mydataset.mytable`
ORDER BY
title DESC
LIMIT
100
This will only return 100 rows.
The best practice is to use it if you are sorting a very large number of values. You can see this document with examples.
If you want to return all rows from a table, you need to omit the LIMIT clause.
SELECT
title
FROM
`my-project.mydataset.mytable`
ORDER BY
title DESC
This example returns all the rows from the table. It is not recommended to omit LIMIT if your tables are very large, as it will consume a lot of resources.
One way to optimize resources is to use clustered tables. This saves cost and query time. You can see this document for a detailed explanation of how it works.
You can write a stored procedure that dynamically builds a query based on input parameters. Once your SQL query string is ready, you can use EXECUTE IMMEDIATE to run it, as sketched below. In this way you control what value, if any, is supplied to the LIMIT clause of your query.
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#execute_immediate
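A minimal sketch of that approach (reusing the table from the earlier examples; row_limit is a hypothetical scripting variable standing in for whatever procedure argument or query parameter drives the condition):
DECLARE row_limit INT64 DEFAULT NULL;  -- set to an INT64 to cap the rows, leave NULL for no limit
EXECUTE IMMEDIATE CONCAT(
  'SELECT title FROM `my-project.mydataset.mytable` ORDER BY title DESC',
  IF(row_limit IS NULL, '', CONCAT(' LIMIT ', CAST(row_limit AS STRING)))
);
When row_limit is NULL the generated statement simply has no LIMIT clause, which is the only way to express "no limit" in BigQuery.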
Hope this answers your query.

Can you get column names from a Select statement in the big query SDK without running it

Given a SELECT statement in Big Query and the Java SDK, what are my options to get the actual column names without fetching the data? I know I can execute the statement and then get the Schema via the TableResult. But is there a way to get the names without fetching data? We have a tool where we run arbitrary queries which are not known upfront and in my code I want to access the result columns by name.
Update: someone flagged this as a duplicate of a 7-year-old entry. However, I am looking for a way to use the Java SDK alone to get the column names, not to do some magic with the query itself or to query some metadata table.
There are a few options, but the easiest is to add LIMIT 0 to your query, for example:
SELECT * FROM projectId.datasetId.tableId limit 0
The query returns no data rows, but the TableResult still carries the schema, so you can read the column names from it without fetching any data.

SSIS Variables and Parameters written to wrong Access Fields

I'm using a Control Flow and Data Flow tasks to record the number of rows read from an Excel data source in Visual Studio SSIS. The data is then processed into good and bad tables, the rows in these counted and the results written into a statistics table via a parameterised SQL statement.
For reasons unknown the data seems to be getting written into the wrong fields in the statistics table and despite recreating the variables and explicitly setting the columns for each variable I can't fix or identify the problem.
Three variables are set:
1. Total rows read from the source Excel via a Row Count task (approx. 28,964 rows)
2. Rows written to the table as 'good' data after processing (most of the source file, approx. 28,540)
3. Rows written to the table as 'bad' data after processing (approx. 424)
The variables are then stored in a separate table via a SQL command that reads parameters set from the variables. A final percentage field is calculated from the total rows and the errors.
However, the results in the table seem to be in the wrong fields (see image).
I've checked this several times and recreated all the tables and variables but get the same result. All tables are Access.
Any ideas?
Any help is much appreciated.
Is that an Access parameterised query?
I've never run one of those from SSIS. I do know that SSIS can be weird about mapping the values from the variables to the query parameters. Have you noticed that the display order of your variables (in the variable-to-parameter mapping) is the same as how they get assigned to parameters?
It looks as though the GoodRows value (28540) is going to P1, BadRows to P2 and TotalRows to P3. That's the order that the variables appear in the mapping.
This is exactly the bizarre, infuriating thing that I've seen SSIS do - though not specifically with Access SQL statements. SSIS sometimes maps your variables to the parameters in the order that they appear in the mapping list, completely ignoring what you specify in the Parameter Name column.
Try deleting all the mappings, and mapping the variables one after another so that they appear in the order P1, P2, P3 in the mapping table.
I recommend that you create a fourth variable for the fourth parameter rather than trying to do math in the ExecuteSQL task.
Instead of using P1, P2 & P3 in the Parameter Names column of the Parameter-mapping tab, try using their zero-based ordinal position.
In the query itself, use question marks for the parameters:
...VALUES ("France", ?, ?, ?, ?)
In other words, for the parameter used first in the query, use 0 for the name. Use 1 for the next parameter, 2 for the next parameter, and so on.
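As an illustration only (the variable names and data types below are hypothetical and depend on your connection type), the Parameter Mapping tab would then look something like this for the four ? placeholders:
Variable Name           Direction   Data Type   Parameter Name
User::TotalRows         Input       LONG        0
User::GoodRows          Input       LONG        1
User::BadRows           Input       LONG        2
User::ErrorPercentage   Input       DOUBLE      3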
If that doesn't work, you can use your variables to build a string variable that holds the entire SQL string that you want to execute, and use the "SQL from Variable" option in the ExecuteSQL task.
Please try replacing the Parameter Names in the Parameter Mapping with 0, 1 and 2.
Just use the numbers in the column order you need. In my SSIS projects this works fine.

JDBC RDD Query Statement without '?'

I am using Spark with Scala and trying to get data from a database using JdbcRDD.
val rdd = new JdbcRDD(sparkContext,
driverFactory,
testQuery,
rangeMinValue.get,
rangeMaxValue.get,
partitionCount,
rowMapper)
.persist(StorageLevel.MEMORY_AND_DISK)
Within the query there are no ? placeholders to set (since the query is quite long I am not including it here). So I get an error saying:
java.sql.SQLException: Parameter index out of range (1 > number of parameters, which is 0).
I have no idea what the problem is. Can someone suggest any kind of solution ?
Got the same problem.
Used this:
SELECT * FROM tbl WHERE ... AND ? = ?
And then call it with lower bound 1, upper bound 1 and 1 partition.
It will always run in just one partition.
Your problem is that Spark expects your query string to contain two ? parameters.
From Spark user list:
In order for Spark to split the JDBC query in parallel, it expects an
upper and lower bound for your input data, as well as a number of
partitions so that it can split the query across multiple tasks.
For example, depending on your data distribution, you could set an
upper and lower bound on your timestamp range, and spark should be
able to create new sub-queries to split up the data.
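As a sketch (the table and column names are hypothetical), the query string handed to JdbcRDD is expected to contain two ? placeholders that bound the partitioning column, something like:
SELECT * FROM tbl WHERE id >= ? AND id <= ?
Spark then splits the range between rangeMinValue and rangeMaxValue into partitionCount sub-ranges and substitutes each pair of bounds into those placeholders, so every partition runs its own sub-query.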
Another option is to load up the whole table using the HadoopInputFormat
class of your database as a NewHadoopRDD.