Case-insensitive / case-preserving MonetDB tables

Can I set MonetDB to be case-insensitive (but case-preserving)?
I am aware that workarounds exist: for example, using ILIKE for searches and LOWER in GROUP BY clauses. But in the latter case especially, the performance degradation is significant on large datasets.
I wonder if there is a way to get case-insensitive, case-preserving tables in MonetDB.
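For reference, the workarounds in question look roughly like this; a sketch with hypothetical table and column names:
-- Case-insensitive search via ILIKE
SELECT * FROM customers WHERE name ILIKE '%smith%';

-- Case-insensitive grouping via LOWER(); this is the variant whose cost hurts on large tables
SELECT LOWER(name) AS name, COUNT(*) AS cnt
FROM customers
GROUP BY LOWER(name);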

Related

How to generate a numeric identifier for entries based on a string

I'm working in Redshift SQL syntax, and want to know a way to convert a string id for each entry in a table to a numeric id (since numeric joins between tables are supposedly much quicker and more efficient than string joins).
Currently the ids look like this - a bunch of strings with both numbers and letters
01r00001ABCDeAAF
01r00001IJKLmAAN
...
01r00001OPQRtAAN
What I would like is to turn this into a purely numeric identifier, using the string id as an input and ensuring that each output is unique and corresponds only to a single input with no collisions (which can be replicated across tables so that accurate joins are possible).
I've tried using some hash functions within SQL like CHECKSUM() and BINARY_CHECKSUM() over the columns, but I'm a little unclear which would be the most applicable here - I understand some are case-sensitive and others aren't, while some generate collisions and others don't.
First, your reference for strings versus integers is based on an entirely different database. I would not generalize from SQL Server performance to other databases, particularly a massively parallel columnar database. There is also a lot of information that is taken out of context and generalized to wrong situations.
Second, you can test on tables in Amazon Redshift. Generating the data and doing the tests should be faster than modifying existing data. You will probably find no need to change anything.
You need to understand what is happening "under the hood" before making a change like this, particularly if you think it is for performance reasons.
Strings can be troublesome for a variety of reasons. First, they can have different collations or character sets -- information that is hidden. Such differences would preclude the use of indexes -- a major hit in a database such as SQL Server. Not using indexes is generally not an issue in Redshift.
Strings can also have variable lengths, which makes indexes slightly less efficient. They also require a wee bit more overhead to compare than numbers, because those collations and character sets need to be taken into account, and they must be compared character by character, whereas most hardware has built-in comparisons for numbers. The extra cycles here are usually minimal compared to the cost of moving data.
When you do a join in Amazon Redshift, the first thing it is going to do is collocate the data, probably by hashing the values and sending the data to the same nodes in the parallel environment. Moving the data is expensive. Hashing the values, much less so.
In Redshift, you should be more concerned about how your data is distributed. Although I haven't tested it, adding a new column that is a number might make the query more expensive, because in a columnar database, the number of columns referenced has an impact on performance.
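That said, if testing does show that a numeric surrogate helps, one common sketch is to hash the string id with MD5 and convert part of the hex digest to a BIGINT with STRTOL (the table and column names below are hypothetical, and truncating the hash leaves a small theoretical collision risk you should verify on your data):
-- Derive a deterministic BIGINT from a string id (Amazon Redshift)
SELECT string_id,
       STRTOL(LEFT(MD5(string_id), 15), 16) AS numeric_id  -- 15 hex chars = 60 bits, fits in a BIGINT
FROM my_table;
-- Redshift's FNV_HASH() is another option that returns a BIGINT directly.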

What's the most efficient way to do a case-insensitive like expression?

In Pervasive v13, is there a "more performant" way to perform a case-insensitive like expression than is shown below?
select * from table_name
where upper(field_name) like '%TEST%'
The UPPER function above has performance cost that I'd like to avoid.
I disagree with those who say that the performance overhead of UPPER is minor; it doubles the execution time compared to the exact same query without UPPER.
Background:
I was very satisfied with the execution time of this wildcard-like-expression until I realized the result set was missing records due to mismatched capitalization.
Then I implemented the UPPER technique (above). This brought in the missing records, but it doubled the execution time of my query.
This UPPER technique for case-insensitive comparison seems outlandishly expensive to me even at a conceptual level. Instead of converting the case of a field for every record in a large table, I'm hoping that the SQL standard provides some kind of syntactical flag that makes the like-expression itself case-insensitive.
From there, behind the scenes, the database engine could generate a compiled regular expression (or some other optimized case-insensitive evaluator) that might well outperform the UPPER technique.
However, I must admit, at some level there still must be a conversion to make the letter-comparisons. And perhaps, this UPPER technique is no worse than any other method that might achieve the same result set.
Regardless, I'm posting this question in hopes someone might reveal a more performant syntax I'm unaware of.
You do not need UPPER when you define the column with the CASE attribute.
The CASE keyword causes PSQL to ignore case when evaluating
restriction clauses involving a string column. CASE can be specified
as a column attribute in a CREATE TABLE or ALTER TABLE statement, or
in an ORDER BY clause of a SELECT statement.
(see: https://docs.actian.com/psql/psqlv13/index.html#page/sqlref%2Fsqlref.CASE_(string).htm )
CREATE TABLE table_name (field_name VARCHAR(100) CASE)
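With the column declared this way, the restriction from the original query should be evaluated case-insensitively without any function call (same hypothetical names as in the question):
SELECT * FROM table_name
WHERE field_name LIKE '%TEST%'   -- matches 'test', 'Test', 'TEST', ... because of the CASE attribute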

SQL: like v. equals performance comparison

I have a large table (100 million rows) which is properly indexed in a traditional RDBMS system (Oracle, MySQL, Postgres, SQL Server, etc.). I would like to perform a SELECT query which can be formulated with either of the following criteria options:
One that can be represented by a single criterion:
LIKE 'T40%'
which only looks for matches at the beginning of the string field due to the wildcard
or
One that requires a list of say 200 exact criteria:
WHERE IN ('T40.x21', 'T40.x32', 'T40.x43')
etc.
All other things being equal. Which should I expect to be more performant?
Assuming that both queries return the same set of rows (i.e. the list of items that you supply in the IN expression is exhaustive) you should expect almost identical performance, perhaps with some advantage for the LIKE query.
RDBMS engines have long used index range scans for begins-with LIKE predicates, so LIKE 'T40%' will produce records via an index search.
Your IN query would be optimized with an index search as well, perhaps giving the RDBMS tighter lower and upper bounds. However, there would be an additional filtering step to eliminate records outside your IN list, which is a waste of CPU cycles under the assumption that all rows would be returned anyway.
If you parameterize your queries, the second one is also harder to pass to the RDBMS from your host program. All other things being equal, I would use LIKE.
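If in doubt, compare the plans for both forms yourself; a sketch using generic EXPLAIN syntax (as in MySQL or Postgres) with hypothetical table and column names:
EXPLAIN SELECT * FROM big_table WHERE code_field LIKE 'T40%';

EXPLAIN SELECT * FROM big_table
WHERE code_field IN ('T40.x21', 'T40.x32', 'T40.x43' /* ... 200 values ... */);

-- Both should show an index range scan on code_field; the IN form adds a filter step.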
I would suggest going with the LIKE operator; if the pattern needs to match a wildcard character literally, the ESCAPE option (with a character such as '\') can be used to keep the match exact.

Cost of logic in a query

I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same thing; my plan was to move the logic for handling the renaming and the xsi:nil attribute to XSL, or maybe to PL/SQL.)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case-when using the decode function instead. Would that perform better (than case-statements)?
Just about anything in your SELECT list, unless it is a user-defined function which reads a table or view, or a nested subselect, can usually be neglected for the purpose of analyzing your query's performance.
Turn SET STATISTICS IO on for your connection and check how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would be astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.
You can use good old tkprof to analyze statistics, after enabling stats gathering with one of the many forms of ALTER SESSION. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.
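For example, a minimal way to produce a trace for tkprof might be (a sketch; the trace file identifier is arbitrary and the file ends up in the server's trace directory):
-- Tag and enable SQL trace for the current session, run the query, then stop tracing
ALTER SESSION SET TRACEFILE_IDENTIFIER = 'xml_query_test';
ALTER SESSION SET SQL_TRACE = TRUE;

SELECT * FROM XH WHERE XH.ID = 'SOMETHINGOROTHER';

ALTER SESSION SET SQL_TRACE = FALSE;
-- Then run tkprof on the generated trace file to see the timed statistics.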

What generic techniques can be applied to optimize SQL queries?

What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table
Partitioned views
Employ indices and constraints
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they affect various query speeds.
The query planner, and how it works for your specific database (for instance, on some systems "select *" is slow, but on modern MS SQL databases the planner can handle it).
The biggest thing you can do is to look for table scans in SQL Server Query Analyzer (make sure you turn on "Show execution plan"). Otherwise there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran the SQL Server query profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. The profiler is far from optimal, but it's a decent start.
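For instance, the same information is available from a query window rather than the GUI; a minimal sketch in SQL Server syntax (the table and column names here are hypothetical):
SET STATISTICS IO ON;    -- report logical and physical reads per table
SET STATISTICS TIME ON;  -- report compile and execution times

SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = 42;   -- check the plan output for Table Scan / Clustered Index Scan operators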
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that Auto Statistics is set to on. SQL Server needs this to help decide the optimal execution plan. See Mike Gunderloy's great post for more info: Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented. Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
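On SQL Server 2005 and later, index fragmentation can also be checked with a dynamic management function instead of DBCC SHOWCONTIG; a sketch (the table name is hypothetical):
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;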
Use a WITH statement to handle query filtering.
Limit each subquery to the minimum number of rows possible, then join the subqueries.
WITH
master AS
(
    SELECT SSN, FIRST_NAME, LAST_NAME
    FROM MASTER_SSN
    WHERE STATE = 'PA'
      AND GENDER = 'M'
),
taxReturns AS
(
    SELECT SSN, RETURN_ID, GROSS_PAY
    FROM MASTER_RETURNS
    WHERE YEAR < 2003
      AND YEAR > 2000
)
SELECT *
FROM master,
     taxReturns
WHERE master.ssn = taxReturns.ssn
Subqueries within a WITH statement may end up being the same as inline views, or automatically generated temp tables. I find in the work I do (retail data) that about 70-80% of the time there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL query analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query
Make sure that you have the right indexes on the table. If you frequently use a column as a way to order or limit your dataset, an index can make a big difference. I saw in a recent article that SELECT DISTINCT can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes, you do need to monitor performance to ensure you don't kill the DB's write performance, but that's where using a good query analysis tool can help you balance things accordingly.
Indexes
Statistics
On the Microsoft stack, the Database Engine Tuning Advisor
Some other points (mine are based on SQL Server; since each DB backend has its own implementations, these may or may not hold true for all databases):
Avoid correlated subqueries in the select part of a statement, they are essentially cursors.
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR conditions (which are slower), you may get better speed using a UNION statement; a sketch of this rewrite appears at the end of this answer.
UNION ALL is faster than UNION if (and only if) the two statements are mutually exclusive and return the same results either way.
NOT EXISTS is usually faster than NOT IN or using a left join with a WHERE clause of ID = null
In an UPDATE query add a WHERE condition to make sure you are not updating values that are already equal. The difference between updating 10,000,000 records and 4 can be quite significant!
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be computed when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 orders in a report. Pre-calculations should be done in triggers so that they are always up to date as the underlying data changes. And it doesn't have to be just numbers, either; we have a calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs, they can be slower than putting the code in line.
Temp tables tend to be faster for large data sets, and table variables faster for small ones. In addition, you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious, but you would not believe how often I end up fixing this. Do not join to tables unless you are using them to filter the records or actually referencing one of their fields in the SELECT part of the statement. Unnecessary joins can be very expensive.
It is a very bad idea to create views that call other views that call other views. You may find you are joining to the same table six times when you only need to join once, and creating 100,000,000 records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting not just the user interface to enter data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables, it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often, use indexes correctly, not too many or too few. And make your WHERE clauses sargable (Able to use indexes).
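As a sketch of the OR-to-UNION rewrite mentioned above (table and column names are hypothetical):
-- OR across two different columns often prevents efficient index use
SELECT OrderID
FROM dbo.Orders
WHERE CustomerID = 42 OR SalesRepID = 7;

-- Rewritten: each branch can use its own index; UNION removes any duplicate rows
SELECT OrderID FROM dbo.Orders WHERE CustomerID = 42
UNION
SELECT OrderID FROM dbo.Orders WHERE SalesRepID = 7;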