Lucene Query - AND operator failing in Azure Search? - lucene

I have a search index of sandwiches. The index has three fields: id, meat, and bread. Each field is an Edm.String. In this index, here is a subset of my data:
ID | Meat | Bread
-----------------------
1 | Ham | White
2 | Turkey | Hoagie
3 | Tuna | Wheat
4 | Roast Beef | White
5 | Ham | Wheat
6 | Roast Beef | Rye
7 | Turkey | Wheat
I need to write a query that returns all ham or turkey sandwiches on wheat bread. In an attempt to do this, I've created the following:
{
"search":"(meat:(Ham|Turkey) AND bread:\"Wheat\")",
"searchMode":"all",
"select":"id,meat,bread"
}
When I run this query, I get no results. What am I missing? I'm trying to understand full queries. Do field-level queries support the phrase operator?

You need to use "queryType": "full" to request the Lucene syntax. See an example on MSDN.
That said, what you're trying to accomplish is easier and more efficiently done using filters. Assuming you make the relevant fields in your index filterable, you can use the following filter expression for your example: $filter=(meat eq 'Ham' or meat eq 'Turkey') and bread eq 'Wheat'. For more on filters, see this article. Hope this helps!
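As a rough illustration of the two options above, here is a minimal Python sketch that only builds the two request bodies; it does not call the service. It assumes the POST form of the search API, where the filter is passed as a "filter" property in the JSON body and "search": "*" matches every document. The function names are made up for this example.

```python
import json


def lucene_query_body() -> dict:
    """The asker's original query, plus the queryType switch from the answer."""
    return {
        "search": '(meat:(Ham|Turkey) AND bread:"Wheat")',
        "queryType": "full",  # request the full Lucene parser instead of the simple one
        "searchMode": "all",
        "select": "id,meat,bread",
    }


def filter_query_body() -> dict:
    """The filter-based alternative; meat and bread must be filterable fields."""
    return {
        "search": "*",  # assumption: match everything, let the filter narrow it down
        "filter": "(meat eq 'Ham' or meat eq 'Turkey') and bread eq 'Wheat'",
        "select": "id,meat,bread",
    }


if __name__ == "__main__":
    print(json.dumps(lucene_query_body(), indent=2))
    print(json.dumps(filter_query_body(), indent=2))
```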

Best data structure for finding tags of nested locations

Somebody pointed out that my data structure architecture sucks.
The task
I have a locations table that stores the name of each location, and a tags table that stores information about those locations. The locations form a hierarchy, which I want to use to collect all tags.
Example
Locations:
USA <- California <- San Francisco <- Mission St
Tags:
USA: English
California: Sunny
California: West coast
San Francisco: Sea side
Mission St: Cable car station
If somebody requests information about Mission St, I want to deliver all of its tags and those of its ancestors (["English", "Sunny", "West coast", "Sea side", "Cable car station"]). If I request all tags of California, the answer would be ["English", "Sunny", "West coast"].
I'm looking for the best read performance! I don't care about write performance. This data is not changed very often. And I don't care about table sizes either. If I need more or larger tables to solve this quicker so be it.
The tables
So currently I'm thinking about setting up these tables:
locations
id | name
---|--------------
1 | USA
2 | California
3 | San Francisco
4 | Mission St
tags
id | location_id | name
---|-------------|------------------
1 | 1 | English
2 | 2 | Sunny
3 | 2 | West coast
4 | 3 | Sea side
5 | 4 | Cable car station
ancestors
I added a position field to store the hierarchy.
| id | location_id | ancestor_id | position |
|----|-------------|-------------|----------|
| 1 | 2 | 1 | 1 |
| 2 | 3 | 2 | 1 |
| 3 | 3 | 1 | 2 |
| 4 | 4 | 3 | 1 |
| 5 | 4 | 2 | 2 |
| 6 | 4 | 1 | 3 |
Question
Is this a good solution to the problem, or is there a better one? I want to select, as fast as possible, all tags of any given location, including all the tags of its ancestors. I'm using a PostgreSQL database, but I think this is a pure SQL architecture problem.
Your problem seems to consist of two challenges. The most interesting is "how do I store hierarchies in a relational database". There are lots of answers to that - the one you've proposed is the most common.
There's an alternative called "nested set" which is faster for reading (in your example, finding all locations within a particular hierarchy would be "between x and y").
Postgres has dedicated support for hierarchies; I'd assume this would also provide great performance.
The second part of your question is "given a path in my hierarchy, retrieve all matching tags". The easiest option is to join to the tags table as you suggest.
The final aspect is "should you denormalize/precalculate". I usually recommend building and optimizing the "normalized" solution and only denormalize when you need to.
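To make the "join to the tags table" option concrete, here is a small sketch that loads the question's three tables into an in-memory SQLite database (used only so the example is self-contained; the question targets PostgreSQL) and fetches the tags of a location together with those of its ancestors. The helper names build_db and tags_for are invented for this example, and the ordering from root to leaf is an assumption based on the expected output in the question.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create the locations, tags and ancestors tables with the question's rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, location_id INTEGER, name TEXT);
        CREATE TABLE ancestors (id INTEGER PRIMARY KEY, location_id INTEGER,
                                ancestor_id INTEGER, position INTEGER);
        INSERT INTO locations VALUES
            (1, 'USA'), (2, 'California'), (3, 'San Francisco'), (4, 'Mission St');
        INSERT INTO tags VALUES
            (1, 1, 'English'), (2, 2, 'Sunny'), (3, 2, 'West coast'),
            (4, 3, 'Sea side'), (5, 4, 'Cable car station');
        INSERT INTO ancestors VALUES
            (1, 2, 1, 1), (2, 3, 2, 1), (3, 3, 1, 2),
            (4, 4, 3, 1), (5, 4, 2, 2), (6, 4, 1, 3);
    """)
    return conn


def tags_for(conn: sqlite3.Connection, location_id: int) -> list:
    """Tags of the location and all its ancestors, farthest ancestor first."""
    rows = conn.execute("""
        SELECT t.name
        FROM (SELECT ancestor_id AS loc, position
              FROM ancestors WHERE location_id = ?
              UNION ALL
              SELECT ?, 0) AS chain          -- the location itself, at position 0
        JOIN tags t ON t.location_id = chain.loc
        ORDER BY chain.position DESC, t.id
    """, (location_id, location_id))
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_db()
    print(tags_for(db, 4))
    print(tags_for(db, 2))
```

The query never walks the tree: the ancestors table already lists every ancestor of a location, so a single join is enough regardless of depth.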
If you want to deliver all tags for a particular location, then I would recommend replicating the data and storing the tags in a tags array on a row for each location.
You say that the locations don't change very much. So, I would simply batch create the entire table, when any underlying data changes.
Modifying the data in situ is rather problematic. A single update could end up affecting a zillion different rows -- consider a tag change on USA. Recalculating the entire table is going to be more efficient.
If you need to search on the tags as well as return them, then I would go for a more traditional structure of a table with two important columns, location and tag. Then you can have indexes on both (location) and (tag) to facilitate searching in either direction.
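The "rebuild the whole table in one batch" idea can be sketched as follows. This is a plain-Python illustration, not database code: PARENT_OF and TAGS_OF are hypothetical in-memory stand-ins for the locations and tags tables, and the resulting dictionary plays the role of the denormalized table with one tags array per location.

```python
# Hypothetical in-memory stand-ins for the question's data.
PARENT_OF = {
    "USA": None,
    "California": "USA",
    "San Francisco": "California",
    "Mission St": "San Francisco",
}
TAGS_OF = {
    "USA": ["English"],
    "California": ["Sunny", "West coast"],
    "San Francisco": ["Sea side"],
    "Mission St": ["Cable car station"],
}


def rebuild_tag_arrays(parent_of: dict, tags_of: dict) -> dict:
    """Recompute, for every location, the tags of its whole ancestor chain."""
    result = {}
    for location in parent_of:
        chain = []
        node = location
        while node is not None:      # walk up to the root
            chain.append(node)
            node = parent_of[node]
        # Root first, so inherited tags come before the location's own tags.
        result[location] = [tag for n in reversed(chain) for tag in tags_of.get(n, [])]
    return result


if __name__ == "__main__":
    for location, tags in rebuild_tag_arrays(PARENT_OF, TAGS_OF).items():
        print(location, tags)
```

Reads then become a single-row lookup; the cost is that any change to a tag or to the hierarchy triggers a full rebuild, which matches the "I don't care about write performance" requirement.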
If write performance is not crucial, I would go for denormalization of the database. That means you use the above structure for your write operations and fill a table for your read operations with a trigger or some async job, if you are afraid of triggers. Then the read performance is optimal, but you have to invest a bit more in the write logic.
Using the above structure for read operations is indeed not a smart solution, because you don't know how deep the tree can get.

Oracle SQL - Give each row in a result set a unique identifier depending on a value in a column

I have a result set, being returned from a view, that returns a list of items and the country they originated from, an example would be:
ID | Description | Country_Name
------------------------------------
1 | Item 1 | United Kingdom
2 | Item 2 | France
3 | Item 3 | United Kingdom
4 | Item 4 | France
5 | Item 5 | France
6 | Item 6 | Germany
I want to query this data, returning all columns (there are more columns than ID, Description and Country_Name; I've omitted them for brevity's sake) plus an extra one giving a unique value for each distinct value in the Country_Name field:
ID | Description | Country_Name | Country_Relation
---------------------------------------------------------
1 | Item 1 | United Kingdom | 1
2 | Item 2 | France | 2
3 | Item 3 | United Kingdom | 1
4 | Item 4 | France | 2
5 | Item 5 | France | 2
6 | Item 6 | Germany | 3
The reason behind this is that we're using a Jasper report and need to show each item with an asterisk (or, in this case, a number) next to it, pointing to some details about the country. The report would look like this:
Desc. Country
Item 1 United Kingdom(1)
Item 2 France(2)
Item 3 United Kingdom(1)
Item 4 France(2)
Item 5 France(2)
Item 6 Germany(3)
And then further down the report would be a field stating:
1: Here are some details about the UK
2: Here are some details about France
3: Here are some details about Germany
I'm having difficulty generating a unique number to go alongside each country, starting at one each time the report is run, incrementing it when a new country is found, and keeping track of where to assign it. I would hazard a guess at using temporary tables to do such a thing, but I feel that's overkill.
Question
Is this kind of thing possible in Oracle SQL or am I attempting to do something that is rather large and cumbersome?
Are there better ways of doing this inside of a Jasper report?
At the moment, I'm looking at just having the subtext underneath each individual item and repeating the same information several times, just to avoid this situation, rather than having them aggregated and having the subtext once. It's not clean, but it saves this rather odd hassle.
You are looking for dense_rank():
select t.*, dense_rank() over (order by country_name) as country_relation
from t;
I don't know if this can be done inside Jasper reports. However, it is easy enough to set up a view to handle this in Oracle.
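To show what dense_rank() over (order by country_name) computes, here is a small plain-Python imitation of it on the sample rows (it is not Oracle code, and the function name is invented). Note one consequence of ordering by country name: the numbers follow alphabetical order rather than order of first appearance, so they differ from the sample in the question while still being one unique number per country.

```python
ROWS = [
    {"id": 1, "description": "Item 1", "country_name": "United Kingdom"},
    {"id": 2, "description": "Item 2", "country_name": "France"},
    {"id": 3, "description": "Item 3", "country_name": "United Kingdom"},
    {"id": 4, "description": "Item 4", "country_name": "France"},
    {"id": 5, "description": "Item 5", "country_name": "France"},
    {"id": 6, "description": "Item 6", "country_name": "Germany"},
]


def with_country_relation(rows: list) -> list:
    """Add a dense rank per distinct country_name, ordered by name."""
    distinct_sorted = sorted({row["country_name"] for row in rows})
    rank_of = {name: rank for rank, name in enumerate(distinct_sorted, start=1)}
    return [dict(row, country_relation=rank_of[row["country_name"]]) for row in rows]


if __name__ == "__main__":
    for row in with_country_relation(ROWS):
        print(row["description"], f'{row["country_name"]}({row["country_relation"]})')
```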

Creating an SSIS job to split a column and insert into database

I have a column called Description:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Description/Title |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka) |
| Beethoven, Piano Sonatas 8, 23 & 26. (Justus Frantz) |
| Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro) |
| Puccini, Ponchielli, Bizet, Tchaikovsky, Donizetti, Verdi: Arias from Boheme, Manon Lescaut, Tosca, Gioconda, Carmen, Eugen Onegin, Favorita, Rigoletto, Luisa Miller, Ballo, Aida. (Peter Dvorsky, ten. w.Hungarian State Opera Orch./ Mihaly) |
| Thomas, Leslie: 'The Virgin Soldiers' (Hywel Bennett reads abridged version. Listening time app. 2 hrs. 45 mins. DOLBY) |
| Katalsky, A. {1856-1926}: Liturgy for A Cappella Chorus. Rachmaninov, 6 Choral Songs w.Piano. (Bolshoi Theater Children's Choir/ Zabornok. DOLBY) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Please note that above I'm only showing 1 field.
Also, the output that I would like is:
+-------+-------+
| Word | Count |
+-------+-------+
| Arias | 3 |
| Duets | 2 |
| Liszt | 10 |
| Tosca | 1 |
+-------+-------+
I want this output to encompass EVERY record. I do not want a separate one of these for each record, just one global one.
I am choosing to use SSIS to do this job. I'd like your input on which controls to use to help with this task.
I'm not looking for a solution, but simply some direction on how to get started with this. I understand this can be done many different ways, but I cannot seem to think of a way to do this most efficiently. Thank you for any guidance.
FYI:
This script does an excellent job of concatenating everything:
select description + ', ' as 'data()'
from [BroincInventory]
for xml path('')
But I need guidance on how to work with this result to create the required output. How can this be done with c# or with one of the SSIS components?
edit: As siyual points out below I need a script task. The script above obviously will not work since there's a limit to the size of a data point.
I think term extraction might be the component you are looking for. Check this out: http://www.mssqltips.com/sqlservertip/3194/simple-text-mining-with-the-ssis-term-extraction-component/
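Independently of which SSIS component ends up doing the work, the aggregation being asked for is a global word count over all rows. The following Python sketch only illustrates that target logic; it is not SSIS or C# code, the tokenization rule (runs of letters, optionally with an inner apostrophe) is an assumption, and the two sample strings are shortened stand-ins for the real descriptions.

```python
import re
from collections import Counter

# Assumed token rule: letters, optionally followed by an apostrophe and more letters.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def word_counts(descriptions: list) -> Counter:
    """Count every word across all descriptions, one global tally."""
    counts = Counter()
    for text in descriptions:
        counts.update(WORD.findall(text))
    return counts


if __name__ == "__main__":
    sample = [
        "Arias & Duets from Butterfly, Tosca",
        "Arias from Boheme, Tosca",
    ]
    for word, count in sorted(word_counts(sample).items()):
        print(word, count)
```

Because the tally is updated row by row, nothing requires concatenating every description into one value first, which sidesteps the size limit mentioned in the edit above.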

Need a feedback for matrix question table design

The survey has a question type called Matrix, which looks like this:
| Is Friendly | Weather | Comments
===========================================
Sydney | Y | 5 | 'bla'
-------------------------------------------
Singapore | Y | 10 | 'test'
-------------------------------------------
Jakarta | N | 0 | 'test2'
-------------------------------------------
I'd like feedback on designing the SQL tables for the questions and answers. One design would allow only 3 label sets (Is Friendly, Weather, Comments), or perhaps up to 10 to be safe, which means 10 columns.
What do you think of this approach? I know it isn't properly relational, but at least it makes the answers easy to pull out with a query.
Your thoughts?
In SQL Server you can make use of PIVOT.
This will allow you to design the table differently.
You would then have a table with columns
EntryType (eg. IsFriendly, Weather, Comment)
City_Region (eg. Sydney, Singapore, Jakarta)
EntryValue (eg. Y, 5, bla)
This will basically give you the functionality to have "dynamic" columns.
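To illustrate what the pivot does with such a three-column table, here is a plain-Python sketch (not T-SQL) that turns (EntryType, City_Region, EntryValue) rows back into one record per city. The row data comes from the question's matrix; the function name is invented for the example.

```python
ROWS = [
    ("IsFriendly", "Sydney", "Y"), ("Weather", "Sydney", "5"), ("Comment", "Sydney", "bla"),
    ("IsFriendly", "Singapore", "Y"), ("Weather", "Singapore", "10"), ("Comment", "Singapore", "test"),
    ("IsFriendly", "Jakarta", "N"), ("Weather", "Jakarta", "0"), ("Comment", "Jakarta", "test2"),
]


def pivot(rows: list) -> dict:
    """Group narrow (entry_type, city, value) rows into {city: {entry_type: value}}."""
    table = {}
    for entry_type, city, value in rows:
        table.setdefault(city, {})[entry_type] = value
    return table


if __name__ == "__main__":
    for city, columns in pivot(ROWS).items():
        print(city, columns)
```

Adding an eleventh label is then just another EntryType value in new rows, with no schema change. One trade-off of this layout is that every value is stored as text.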

How to represent and insert into an ordered list in SQL?

I want to represent the list "hi", "hello", "goodbye", "good day", "howdy" (with that order), in a SQL table:
pk | i | val
------------
1 | 0 | hi
0 | 2 | hello
2 | 3 | goodbye
3 | 4 | good day
5 | 6 | howdy
'pk' is the primary key column. Disregard its values.
'i' is the "index" that defines that order of the values in the 'val' column. It is only used to establish the order and the values are otherwise unimportant.
The problem I'm having is with inserting values into the list while maintaining the order. For example, if I want to insert "hey" and I want it to appear between "hello" and "goodbye", then I have to shift the 'i' values of "goodbye" and "good day" (but preferably not "howdy") to make room for the new entry.
So, is there a standard SQL pattern to do the shift operation, but only shift the elements that are necessary? (Note that a simple "UPDATE table SET i=i+1 WHERE i>=3" doesn't work, because it violates the uniqueness constraint on 'i', and also it updates the "howdy" row unnecessarily.)
Or, is there a better way to represent the ordered list? I suppose you could make 'i' a floating point value and choose values between, but then you have to have a separate rebalancing operation when no such value exists.
Or, is there some standard algorithm for generating string values between arbitrary other strings, if I were to make 'i' a varchar?
Or should I just represent it as a linked list? I was avoiding that because I'd like to also be able to do a SELECT .. ORDER BY to get all the elements in order.
As I read your post, I kept thinking 'linked list', and at the end I still think that's the way to go.
If you are using Oracle, and the linked list is a separate table (or even the same table with a self-referencing id, which I would avoid), then you can use a CONNECT BY query and the pseudo-column LEVEL to determine sort order.
You can easily achieve this by using a cascading trigger that updates any 'index' entry equal to the new one on the insert/update operation to the index value +1. This will cascade through all rows until the first gap stops the cascade - see the second example in this blog entry for a PostgreSQL implementation.
This approach should work independent of the RDBMS used, provided it offers support for triggers to fire before an update/insert. It basically does what you'd do if you implemented your desired behavior in code (increase all following index values until you encounter a gap), but in a simpler and more effective way.
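The "shift until the first gap" behaviour that the cascading trigger produces can be sketched in plain Python as follows. This models the table as a dictionary from the unique index i to val; it illustrates the logic only and is not the trigger from the linked blog entry.

```python
def insert_at(items: dict, index: int, value: str) -> dict:
    """Insert value at index, pushing occupants up by one until a gap absorbs the shift."""
    result = dict(items)
    carried = value
    position = index
    while position in result:
        # Put the carried value here and pick up the one it displaces.
        result[position], carried = carried, result[position]
        position += 1
    result[position] = carried  # first free index: the cascade stops here
    return result


if __name__ == "__main__":
    table = {0: "hi", 2: "hello", 3: "goodbye", 4: "good day", 6: "howdy"}
    for i, val in sorted(insert_at(table, 3, "hey").items()):
        print(i, val)
```

With the question's data, inserting "hey" at index 3 moves "goodbye" to 4 and "good day" to 5, and the gap at 5 means "howdy" at 6 is never touched.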
Alternatively, if you can live with a restriction to SQL Server, check the hierarchyid type. While mainly geared at defining nested hierarchies, you can use it for flat ordering as well. It somewhat resembles your approach using floats, as it allows insertion between two positions by assigning fractional values, thus avoiding the need to update other entries.
If you don't use numbers, but Strings, you may have a table:
pk | i | val
------------
1 | a0 | hi
0 | a2 | hello
2 | a3 | goodbye
3 | b | good day
5 | b1 | howdy
You may insert a4 between a3 and b, a21 between a2 and a3, a1 between a0 and a2, and so on. You would need a clever function to generate an i for a new value v between its predecessor p and successor n. The index strings can grow longer and longer, so you may need a big rebalancing from time to time.
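One possible shape for such a "clever function" is sketched below. It is only one of many ways to do this and rests on stated assumptions: keys use the characters 0-9 and a-z (which already sort in that order as plain strings), and the upper key must not end in "0", because no string fits between a key and the same key with "0" appended. The name key_between is invented for this example.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # already in plain string sort order
BASE = len(ALPHABET)


def key_between(lo: str, hi: str) -> str:
    """Return a key k with lo < k < hi under ordinary string comparison."""
    if not lo < hi or hi.endswith(ALPHABET[0]):
        raise ValueError("need lo < hi, and hi must not end in the smallest character")
    out = []
    i = 0
    # Copy the shared prefix; a missing character in lo counts as the smallest one.
    while True:
        a = ALPHABET.index(lo[i]) if i < len(lo) else 0
        b = ALPHABET.index(hi[i])
        if a != b:
            break
        out.append(ALPHABET[a])
        i += 1
    if b - a > 1:
        # Room at this position: pick a character strictly between the two.
        out.append(ALPHABET[(a + b) // 2])
        return "".join(out)
    # Adjacent characters: keep lo's character and extend past the rest of lo.
    out.append(ALPHABET[a])
    i += 1
    while True:
        a = ALPHABET.index(lo[i]) if i < len(lo) else 0
        if a < BASE - 1:
            out.append(ALPHABET[(a + BASE) // 2])  # strictly greater than lo's character
            return "".join(out)
        out.append(ALPHABET[a])  # lo has the largest character here; keep going
        i += 1


if __name__ == "__main__":
    for lo, hi in [("a0", "a2"), ("a2", "a3"), ("a3", "b")]:
        print(lo, key_between(lo, hi), hi)
```

The generated keys differ from the hand-picked ones in the answer (any key strictly between the neighbours is valid), and they show the growth problem directly: inserting between adjacent keys such as a2 and a3 yields a three-character key.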
Another approach could be to implement a (doubly) linked list in the table, where you don't save indexes but links to the previous (and next) element, which would mean that an insert normally has to update only 1-2 elements:
pk | prev | val
------------
1 | NULL | hi
0 | 1 | hello
2 | 0 | goodbye
3 | 2 | good day
5 | 3 | howdy
To insert hey between hello and goodbye, hey gets pk 6:
pk | prev | val
------------
1 | NULL | hi
0 | 1 | hello
6 | 0 | hey <- ins
2 | 6 | goodbye <- upd
3 | 2 | good day
5 | 3 | howdy
The previous element of hey is hello with pk=0, and goodbye, which used to link to hello, now has to link to hey.
But I don't know whether an 'order by' mechanism for this exists in many DB implementations.
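As a sketch of how such a prev-linked table can be inserted into and read back in order, here is a plain-Python model in which each row is a (pk, prev, val) tuple. It assumes the first element is marked by an empty prev (None here, NULL in SQL), since 0 is a real pk in this data. In a database the ordered read would typically be a hierarchical or recursive query (for example Oracle's CONNECT BY, mentioned in another answer, or WITH RECURSIVE in PostgreSQL) rather than a plain ORDER BY.

```python
def insert_after(rows: list, new_pk: int, after_pk: int, val: str) -> list:
    """Insert a new row after after_pk; only the old successor's prev link changes."""
    relinked = [(pk, new_pk if prev == after_pk else prev, v) for pk, prev, v in rows]
    return relinked + [(new_pk, after_pk, val)]


def in_order(rows: list) -> list:
    """Follow the prev links from the head (prev is None) and return vals in list order."""
    successor_of = {prev: (pk, val) for pk, prev, val in rows}
    ordered = []
    current = None  # the head is the row whose prev is None
    while current in successor_of:
        current, val = successor_of[current]
        ordered.append(val)
    return ordered


if __name__ == "__main__":
    table = [(1, None, "hi"), (0, 1, "hello"), (2, 0, "goodbye"),
             (3, 2, "good day"), (5, 3, "howdy")]
    print(in_order(insert_after(table, 6, 0, "hey")))
```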
Since I had a similar problem, here is a very simple solution:
Make your i column floats, but insert integer values for the initial data:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
Then, if you want to insert something in between, just compute a float value in the middle between the two surrounding values:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
6 | 2.5 | hey
This way the number of inserts between the same two values is limited by the resolution of float values, but for almost all cases that should be more than sufficient.
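A minimal sketch of the midpoint computation, with a check for the point where the float resolution runs out, might look like this. It uses Python floats (double precision) as a stand-in for the database column type, and both function names are invented for the example. With doubles, always inserting into the same shrinking gap next to 2.0 works roughly fifty times before no value is left in between; a single-precision column would run out sooner.

```python
def midpoint_index(before: float, after: float) -> float:
    """Index for a new row placed between two neighbouring rows."""
    mid = (before + after) / 2
    if not before < mid < after:
        # The two neighbours are adjacent floats: time to renumber the list.
        raise ValueError("no float left between the neighbours; rebalance the indexes")
    return mid


def inserts_until_exhausted(lo: float = 2.0, hi: float = 3.0) -> int:
    """Count how often a new row can be inserted directly after lo."""
    count = 0
    while True:
        try:
            hi = midpoint_index(lo, hi)  # the new row becomes lo's next neighbour
        except ValueError:
            return count
        count += 1


if __name__ == "__main__":
    print(midpoint_index(2.0, 3.0))   # index for "hey" between "hello" and "goodbye"
    print(inserts_until_exhausted())  # how many such inserts fit next to 2.0
```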