I'm looking for an explanation of the concept of binary search in SAP. If my table has duplicates, how is the search done?
Assuming that you have already read the extensive article on binary search as well as the ABAP documentation, you may have overlooked the following paragraph:
If there are multiple hits (due to an incomplete search key or duplicate entries in the table), binary searches (using the BINARY SEARCH addition in standard tables; automatic in sorted tables) also return the first hit in accordance with the order of the rows in the primary index. This is the row with the lowest row number.
I believe that binary search in ABAP works like the standard binary search algorithm, except that after it finds a record matching the key, it moves up the table by index (toward lower row numbers) until the key differs. It then returns the first row in the table that matches your key.
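A minimal ABAP sketch of that behavior (the type, table, and field names here are made up for illustration): with duplicate keys in a sorted standard table, READ TABLE ... BINARY SEARCH lands on the first matching row, and sy-tabix holds its row number.

TYPES: BEGIN OF ty_price,
         matnr TYPE c LENGTH 18,
         price TYPE p LENGTH 8 DECIMALS 2,
       END OF ty_price.

DATA: lt_prices TYPE STANDARD TABLE OF ty_price,
      ls_price  TYPE ty_price.

" BINARY SEARCH requires the standard table to be sorted by the search key
SORT lt_prices BY matnr.

" With duplicate MATNR values, this returns the FIRST matching row,
" i.e. the one with the lowest row number; sy-tabix points to it
READ TABLE lt_prices INTO ls_price
     WITH KEY matnr = '100042' BINARY SEARCH.

IF sy-subrc = 0.
  WRITE: / 'First hit at row', sy-tabix.
ENDIF.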
I have multiple versions of the same article, and I want to make the lookup faster. I think indexing will help. Is there a specific type of index that will work best for getting multiple records from the same table? My understanding is that an index works by creating a sorted list and executing a binary search. But if multiple records have the same value, wouldn't it just return the first one it finds, not all of them?
Use two tables: one with article metadata that doesn't change (indexed by article ID), and one with all the date-specific data, carrying a compound index on article ID and date.
Even if all of your data were in one table (meaning that date-agnostic data is repeated), then a compound index on article ID and date would make searches by article ID faster.
You shouldn't need to worry about specifics like binary searches - modern DBs will take care of those details for you.
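A minimal sketch of that two-table layout (all table and column names here are invented for illustration):

-- Static, date-agnostic metadata: one row per article
CREATE TABLE article (
    article_id INT PRIMARY KEY,
    title      VARCHAR(255)
);

-- Date-specific data: the compound primary key doubles as
-- a compound index on (article_id, revised_on)
CREATE TABLE article_history (
    article_id INT REFERENCES article (article_id),
    revised_on DATE,
    body       VARCHAR(MAX),
    PRIMARY KEY (article_id, revised_on)
);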
Your understanding is wrong. If you execute a query such as:
select h.*
from history h
where article_id = XXX
Then the query returns ALL rows for that article_id, not just the first. If you have an index on history(article_id), then the index will (probably) be used: it will be scanned for all matching entries.
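For instance (the index name is arbitrary):

-- A non-unique index happily holds duplicate article_id values;
-- the engine range-scans all adjacent matching entries, not just one
CREATE INDEX ix_history_article_id ON history (article_id);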
I am looking at SQL Server 2016 full-text indexes, and they are great for searching strings that contain multiple words.
When I try to create the full-text index, it shows Statistical Semantics as a tickbox. What does statistical semantics do?
Moreover, I want to support "did you mean" queries.
For example, let's say I have a record house and the user types hause.
Can I use a full-text index to efficiently return house as the closest match and show the user "did you mean house?" Thank you.
I have tried SOUNDEX, but the results it generates are terrible.
It returns so many unrelated words.
And since there are so many records in my database and I need very fast results, I need something SQL Server natively supports.
Any ideas? Any way to achieve such a thing using indexes?
I know there are multiple algorithms, but they are not efficient enough for me to use online - I mean things like calculating the edit distance against every record. They could be used for offline projects, but I need this efficiency in an online dictionary that will serve thousands of requests constantly.
I already have a plan in mind: storing not-found results in the database, calculating the closest matches offline, and using them as a cache. However, I wonder whether any online/live solution exists? Consider that there will be over 100M nvarchar records.
The short answer is no: Full-Text Search cannot search for words that are similar but different.
Full Text Search uses stemmers and thesaurus files:
The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").
A Full-Text Search thesaurus defines a set of synonyms for a specific language.
Both stemmers and thesaurus files are configurable, and you can easily have FTS match house for a search on hause - but only if you add hause as a synonym for house. This is obviously a non-solution, as it requires you to add every possible typo as a synonym...
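For illustration, assuming a hypothetical dictionary table with a full-text indexed word column, a thesaurus-expanded lookup would look like this:

-- Only matches if 'hause' was manually registered as a synonym of 'house'
-- in the language's thesaurus file (e.g. tsenu.xml for English)
SELECT word
FROM dictionary
WHERE CONTAINS(word, 'FORMSOF(THESAURUS, hause)');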
Semantic search is a different topic: it allows you to search for documents that are semantically close to a given example.
What you want is to find records that have a short Levenshtein distance from a given word (aka "fuzzy" search). I don't know of any technique for creating an index that can answer a Levenshtein search. If you're willing to scan the entire table for each term, T-SQL and CLR implementations of Levenshtein exist.
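As a sketch of that full-scan approach: assuming you have installed some T-SQL or CLR scalar function dbo.Levenshtein (not a built-in - you would have to supply one of the implementations mentioned above), the fuzzy lookup degenerates into a table scan:

-- Scans every row; no index can satisfy this predicate
SELECT TOP (5) word
FROM dictionary                           -- hypothetical table of terms
ORDER BY dbo.Levenshtein(word, 'hause');  -- smallest edit distance first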
I found the following in the MarkLogic documentation:
a range index lets the server map values to fragments, and fragments to values... The former capability is used to support "range predicates"... The latter is used to support fast order by operations...
Can anyone please explain this to me? Some sort of diagram depicting how this mapping is maintained would be very helpful.
Yes, do read Jason's excellent paper for all the detail of the inner workings of MarkLogic.
A simple summary of range indexes is this: A range index is a sorted term list. Term lists are an inverted index of values stored in documents. For word indexes, for example, a list of terms (a term list) is created that contains all the words in all the documents. Each term in the list is a word, say "humdinger", and an associated set of fragment IDs where that word occurs. When you do a word search for "humdinger", ML checks the term lists to find out which fragments that word occurs in. Easy. A more complex search is simply the set intersections of all the matched terms from all applicable term lists.
Most "regular" indexes are not sorted, they're organized as hashes to make matching terms efficient. They produce a set of results, but not ordered (relevance ordering is applied after). A range index on the other hand, is a term list that's sorted by the values of its terms. A range index therefore represents the range of unique values that occur in all instances of an element or attribute in the database.
Because range index term lists are ordered, when you get matches in a search you not only know which fragments they occur in, you also know the sorted order of the possible values for that field. MarkLogic's XQuery is optimized to recognize when you've supplied an "order by" clause that refers to an element or attribute that is range indexed. This lets it sort not by comparing the matched documents, but by iterating down the sorted term list and fetching matched documents in that order. This makes it much faster because the documents themselves need not be touched to determine their sort order.
But wait, there's more. If you're paginating through search results, taking only a slice of the matching results, then fast sorting by a range indexed field helps you there as well. If you're careful not to access any other part of the document (other than the range index element) before applying the page-window selection predicate, then the documents outside that window will never need to be fetched. The combination of pre-sorted selection and fast skip ahead is really the only way you can efficiently step through large, sorted result sets.
Range indexes have one more useful feature. You can access their values as lexicons, enumerating the unique values that occur in a given element or attribute across your entire database, but without ever actually looking inside any documents. This comes in handy for things like auto-suggest and getting counts for facets.
I hope that clarifies what range indexes are.
Take a look at Jason Hunter's writeup in Inside MarkLogic Server. There's a whole section on range indexes.
I'm currently trying to determine how to build out a keyword dimension table. We're tracking visits to our website, and we would like to be able to find the most-used keywords used to reach the site via a search engine, as well as any search terms used during the visit on the site itself (price > $100, review > 4 stars, etc.). Since the keywords are completely dynamic and can be used in an infinite number of combinations, I'm having a hard time determining how to store them.

I have a pageview fact table that includes a record every time a page is viewed. The source I'm pulling from includes all the search terms in a delimited list I am able to parse with a regular expression; I just don't know how to store them in the database, since the number of keywords can vary so widely from pageview to pageview. I'm thinking this may be more suited to a NoSQL solution than trying to cram it into an MSSQL table, but I don't know. Any help is greatly appreciated!
Depending on how you want to analyze the data, there are a few solutions.
But for the amount of data that you are probably analyzing, I'd just create a table that uses the PK of the fact table to store each keyword:
CREATE TABLE FACT_PAGEVIEW_KEYWORD (    -- table name is illustrative
    FACT_PAGEVIEW_ID bigint,   -- surrogate key of the fact table, or natural key if you don't have a surrogate
    KEYWORD varchar(255),      -- or whatever max length the keywords are
    VALUE   varchar(255)
);
The granularity of this table is one row per ID/keyword combination. You may have to include VALUE in that grain as well if you allow the same keyword multiple times in a query string.
This allows you to group the keywords by pageview, or start with the pageview fact, filter it, then join to this to identify keywords.
The other option would be a keyword dimension and a many-to-many bridge table with a "keyword group", but since any number of combinations can be used, the first approach is probably quicker and will likely get you 90% of the way there. Most questions, like "what combination of keywords is used most frequently" and "what keywords are most used by the top 10% of the user base", can be answered with this structure.
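For example, "what keywords are used most frequently" becomes a simple aggregate over that table (names as sketched above):

SELECT KEYWORD, COUNT(*) AS pageviews
FROM FACT_PAGEVIEW_KEYWORD
GROUP BY KEYWORD
ORDER BY pageviews DESC;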
I'm developing a job service that has features like radial search, full-text search, and the ability to combine full-text search with disabling certain job listings (such as un-checking a checkbox and no longer returning full-time jobs).
The developer who is working on Sphinx wants the database information to all be stored as integers with a key (so under the table "Job Type", values might be stored such that 1 = "part-time" and 2 = "full-time")... whereas the other developers want to keep the database as strings (so under the table "Job Type" it says "part-time" or "full-time").
Is there a reason to keep the database as ints? Or should strings be fine?
Choosing your key can have a dramatic performance impact. Whenever possible, use ints instead of strings. This is called using a "surrogate key", where the key presents a unique and quick way to find the data, rather than the data standing on its own.
String comparisons are resource intensive, potentially orders of magnitude worse than comparing numbers.
You can drive your UI off of the surrogate key but show another column (such as job_type). This way, when you hit the database, you pass the int in and avoid scanning the table for a row with a matching string.
When it comes to joining tables in the database, joins will run much faster if you have ints or other numbers as your primary keys.
Edit: In the specific case you have mentioned, if you only have two options for what your field may be, and it's unlikely to change, you may want to look into something like a bit field, and you could name it IsFullTime. A bit or boolean field holds a 1 or a 0, and nothing else, and typically isn't related to another field.
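A quick sketch of both options (table and column names here are mine, not from the question):

-- Option 1: surrogate int key with a lookup table
CREATE TABLE job_type (
    job_type_id int PRIMARY KEY,    -- e.g. 1 = 'part-time', 2 = 'full-time'
    name        varchar(50) NOT NULL
);

CREATE TABLE job (
    job_id      int PRIMARY KEY,
    job_type_id int NOT NULL REFERENCES job_type (job_type_id)
);

-- Option 2: if the field really is binary and unlikely to change
-- ALTER TABLE job ADD IsFullTime bit NOT NULL DEFAULT 0;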
If you are normalizing your structure (I hope you are), then numeric keys will be most efficient.
Aside from the usual reasons to use integer primary keys, the use of integers with Sphinx is essential, as the result set returned by a successful Sphinx search is a list of document IDs associated with the matched items. These IDs are then used to extract the relevant data from the database. Sphinx does not return rows from the database directly.
For more details, see the Sphinx manual, especially 3.5. Restrictions on the source data.