I am interested in how the search engine on autoscout24.de is implemented. It is a platform where you can buy and sell cars. Every car advert has properties such as make, price, kilometers, color, etc. (over 50 different properties in total) that can be searched for.
I am specifically interested in the detail search, which works like this: every possible property is displayed on the page. In brackets after each property is the number of cars that will match the new search if that property is selected.
Example: I'll start with empty search criteria.
Property make:
BMW (100.000)
Volkswagen (200.000)
Ford (150.000)
...
Property color:
black (210.000)
silver (50.000)
white (100.000)
...
and so on for the other properties.
I'd like to know:
How would you implement this kind of search with SQL?
How would you implement it with an in-memory data structure?
Range queries should be supported, too (all cars with price from X to Y)
Update:
The numbers in brackets show the number of results after the addition of the search criteria. So it changes each time a property is added / removed...
So a naive algorithm would work like this:
find all cars with current search criteria (e.g. make Ford)
for each property do: find all cars that match the previous search criteria ("Ford") AND the search criteria for the chosen property. Write the count in brackets behind the property.
This algorithm is naive because it would execute 1 + N queries (N=#properties). Nobody wants to do that ;-)
I believe that this is referred to as "faceted search". The Apache Solr project might be worth looking at.
The basic code is simple:
Create a result object with one counter for each property that the cars have.
Check all cars one by one; if a car matches the filter, add one to each of its counters.
...But it's blazingly fast!
I think they do it on several computers, sharding the data across them. Each computer processes part of the data (say 5%) and sends its result to a front-end computer, which sums all the counts.
There are tools for that: look for "map reduce", "Elasticsearch", "Storm"...
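A minimal in-memory sketch of the single-pass counting described above, in Python. The function names (matches, facet_counts, merge_shards), the dict-based car records and the (low, high) tuple used for range criteria are all assumptions made for illustration; this is not how autoscout24.de is known to work.

```python
from collections import Counter


def matches(car, filters):
    """Return True if the car satisfies every criterion.

    A criterion is either an exact value or a (low, high) tuple for a range.
    """
    for prop, wanted in filters.items():
        value = car.get(prop)
        if isinstance(wanted, tuple):
            low, high = wanted
            if value is None or not (low <= value <= high):
                return False
        elif value != wanted:
            return False
    return True


def facet_counts(cars, filters, facet_props):
    """One pass over the cars: count matching cars per (property, value)."""
    counts = Counter()
    for car in cars:
        if matches(car, filters):
            for prop in facet_props:
                if prop in car:
                    counts[(prop, car[prop])] += 1
    return counts


def merge_shards(shard_counts):
    """Sum the per-shard counters, as a front-end node would."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)
    return total


# Hypothetical sample data.
cars = [
    {"make": "Ford", "color": "black", "price": 9000},
    {"make": "Ford", "color": "white", "price": 15000},
    {"make": "BMW", "color": "black", "price": 30000},
    {"make": "Ford", "color": "black", "price": 22000},
]

counts = facet_counts(cars, {"make": "Ford"}, ["color"])
print(counts[("color", "black")])  # 2
print(counts[("color", "white")])  # 1
```

Because the counters are simply added together, each shard can run facet_counts on its own slice of the data and the front end only has to call merge_shards on the results.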
Have a properties table:
+Properties
id
title
value
count
The count field lets you save an extra query: instead of checking how many cars have a certain property, you can just update this field when adding new cars.
Example of rows in this table:
1 'color' 'white' 1000
2 'color' 'black' 122
3 'km' '5000' 1233
4 'km' '30000' 54
And in the cars table, add a field for each property.
+Cars
id
color
km
The color and km fields hold the IDs of the corresponding rows in the Properties table.
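As a rough sketch of the two tables above, using Python's built-in sqlite3 module: the helper names property_id and add_car are hypothetical, and the UNIQUE constraint on (title, value) is an assumption added so that each property value gets exactly one row. Note that a stored count like this only reflects the total per value, not counts under a combination of filters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE properties (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        value TEXT NOT NULL,
        count INTEGER NOT NULL DEFAULT 0,
        UNIQUE (title, value)
    );
    CREATE TABLE cars (
        id    INTEGER PRIMARY KEY,
        color INTEGER REFERENCES properties(id),
        km    INTEGER REFERENCES properties(id)
    );
""")


def property_id(title, value):
    """Return the id of the (title, value) row, creating it if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO properties (title, value) VALUES (?, ?)",
        (title, value),
    )
    row = conn.execute(
        "SELECT id FROM properties WHERE title = ? AND value = ?",
        (title, value),
    ).fetchone()
    return row[0]


def add_car(color, km):
    """Insert a car and increment the counters of its property rows."""
    color_id = property_id("color", color)
    km_id = property_id("km", km)
    conn.execute("INSERT INTO cars (color, km) VALUES (?, ?)", (color_id, km_id))
    conn.execute(
        "UPDATE properties SET count = count + 1 WHERE id IN (?, ?)",
        (color_id, km_id),
    )


add_car("white", "5000")
add_car("white", "30000")
add_car("black", "5000")

rows = conn.execute(
    "SELECT title, value, count FROM properties ORDER BY title, value"
).fetchall()
for row in rows:
    print(row)
```

Removing or updating a car would need the matching decrement, which is the maintenance cost of keeping the count column.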
EDIT: if you're not planning to use a MySQL database, you might consider using XML files to hold the properties data. But once again, you have to update the count value any time you add, remove or update a car.
<Properties>
<Property>
<Type>Color</Type>
<Value>White</Value>
<Count>1000</Count>
</Property>
</Properties>
This is not really an issue that affects the code but rather a question of the table's appearance.
So, the table is the summary of records for income and expenses of different business departments. Let's call each department a type of the record. Each of those types has subtype1. Each subtype1 has subtypes2 and each subtype2 has subtypes3.
So the sample data would be something like this.
1, Type1, sum of subtypes1
1.1, Subtype1, sum of subtypes2
1.1.1, Subtype2, sum of subtypes3
1.1.1.1, Subtype3, amount
1.1.1.2, Subtype3, amount
1.2, Subtype1, sum of subtypes2
1.2.1, Subtype2, sum of subtypes3
1.2.1.1, Subtype3, amount
Each subtype can have a different number of "children subtypes". Children subtypes can't go deeper than subtype3.
I then use a VBA script to group the records of the same subtype under their direct parent, all the way up to the main type. Everything works fine: I can expand or hide every single level of this structure.
However, logically the group outline on the left side of the table for rows should show 4 levels. Instead it shows 8 levels of groups. First 4 do exactly what you would expect, show or hide respective subtypes while the other 4 levels do absolutely nothing which is also expected because I don't see a reason for them to be there.
Any ideas why extra levels have been created and how to get rid of them?
I might have explained this in a not very clear way so feel free to ask for further information.
Try stepping through your code in trace mode to watch the groups being set up. (Open the VBA window and use the F8 key to step one line at a time.)
This may reveal why the extra groups are being defined and suggest what to change.
How can I re-use a single complex dataset across a number of tables?
The dataset has a number of computed columns that needs to be reported both in detail and in summary. Here's a very simplified example dataset:
is_food  sale_association  food_type  total_sold  total_associations  percent_total
1        Before Movie      Popcorn    50          3                   x BirtMath.safeDivide(...)
0        Before Movie      Soda       10          2                   x BirtMath.safeDivide(...)
1        During Movie      Jujubee    10          1                   x BirtMath.safeDivide(...)
0        After Movie       Soda       15          2                   x BirtMath.safeDivide(...)
From this one dataset, I'd want to create a detailed summary of all food types while rolling up non food (using the 'is_food' column), another summary of all food types, another detailed summary of food with rolled up non-food by sale_association, etc. etc.
The report would also contain a number of percentages (6 in the most complex table) that need to be calculated (some across a row, others across all rows in a given group), all of which can have a zero value for the denominator and so need to be guarded against with safeDivide (which is a PITA to do in the source SQL query which itself is doing aggregation -- checking for divide by zero when both the numerator and denominator are sums leads to hairy queries).
Obviously I can do this by tailoring the SQL query as appropriate, but it seems like a waste of time and effort to create 12 or 15 queries that are very similar when I've already managed to create the monster query for the most detailed table.
What doesn't seem straightforward is how to perform the rollups in a table. I managed to hack something together by hiding rows that would later be summed up (e.g. "is_food == 0" in the example) and then creating custom data bindings that are displayed in a footer row. Not only does it feel like a hack, it also interferes with the ability to naturally order rows. Again, going back to the example, if I was ordering by total_sold and summarizing rows with is_food == 0, the natural order should be Popcorn, Non-food, Jujubee.
There's nothing in the BIRT wiki about this, nor does "BIRT: A Field Guide, 3rd E." really delve into the topic.
This seems like a fairly open-ended question (although I agree that re-using a single dataset makes much more sense than having multiple queries retrieving the same data in slightly different ways). A few general suggestions:
Use the most detailed version of the data required as a common dataset for each BIRT report item (typically BIRT tables)
Where summary-only level reporting is required, add groups to the BIRT table at the desired level, add data items as required to the group headers/footers and delete the detail level row(s) from the BIRT table.
Where detail-level reporting is required in some cases (eg. for food items but not for non-food items), add groups to the BIRT table as above, and set the visibility of the detail row (in Property Editor - Properties - Visibility) to check Hide Element, then specify the appropriate expression to suppress the non-required rows (non-food items, in this example).
Aggregations (ie. summary expressions) can be added to tables by selecting the whole table, selecting the Binding tab within the Property Editor and clicking the Add Aggregation... button.
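To make the intended result concrete, here is the rollup from the question's example expressed in plain Python rather than BIRT. The names safe_divide and rollup are hypothetical (safe_divide only mimics the idea of a guarded division; it is not the BIRT API), and the share-of-total column is just one illustrative percentage.

```python
def safe_divide(numerator, denominator, default=0.0):
    """Divide, returning a default instead of failing on a zero denominator."""
    return numerator / denominator if denominator else default


# The example rows from the question, reduced to the columns used here.
rows = [
    {"is_food": 1, "food_type": "Popcorn", "total_sold": 50},
    {"is_food": 0, "food_type": "Soda", "total_sold": 10},
    {"is_food": 1, "food_type": "Jujubee", "total_sold": 10},
    {"is_food": 0, "food_type": "Soda", "total_sold": 15},
]


def rollup(rows):
    """Keep food rows in detail; collapse all non-food rows into one line."""
    summary = {}
    for row in rows:
        label = row["food_type"] if row["is_food"] else "Non-food"
        summary[label] = summary.get(label, 0) + row["total_sold"]
    grand_total = sum(summary.values())
    table = [
        (label, sold, safe_divide(sold, grand_total))
        for label, sold in summary.items()
    ]
    # Order by total_sold, largest first, with the rolled-up line included.
    return sorted(table, key=lambda line: line[1], reverse=True)


for label, sold, share in rollup(rows):
    print(label, sold, round(share, 3))
```

The rolled-up "Non-food" line (10 + 15 = 25) sorts between Popcorn and Jujubee, which is the natural order the question asks for.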
Here is my dilemma: I am building a POS system for a pretty large retailer. They have different products which have different attributes (size, color, etc.).
When they receive the merchandise from the supplier they want to do their own labeling with their own UPC Bar Codes but they also want to differentiate between the different sizes using the code on the article.
Say they received Brand A shirts with 4 sizes S,M,L,XL then they should have different bar codes for each size.
So I thought of having a base code for the article and then concatenating numbers that depend on the attributes to get different codes. If no attributes are available, I would just add 0s.
I am storing the sizes and colors as attributes in the database using an Entity-Attribute-Value model. Is there a better way than concatenating numbers from the attributes to come up with the full code?
Thanks for your help!
edit-------------------------------------------------------
I am making the example a bit clearer
so the base code for the shirts is: 9 123456
Then for Color blue is: 789
and then for size S: 012
so the full code is 9 123456 789012
for another article that doesn't have size or color or actually any attribute
the base code would be 9 654321
plus 000000 for the attributes part
This is just for simplicity's sake, as I could use only one digit per attribute.
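The concatenation scheme in this example could be sketched as follows in Python. The function name build_code, the fixed order (color, then size) and the three-digit width are assumptions taken from the example above; the sketch writes the code without spaces and does not compute any check digit.

```python
def build_code(base, color=None, size=None, width=3):
    """Concatenate a base article code with fixed-width attribute codes.

    A missing attribute is filled with zeros so every code has the same length.
    """
    parts = []
    for code in (color, size):
        text = "" if code is None else str(code)
        if text and not text.isdigit():
            raise ValueError(f"attribute code must be numeric: {text!r}")
        if len(text) > width:
            raise ValueError(f"attribute code longer than {width} digits: {text!r}")
        parts.append(text.zfill(width))
    return base + "".join(parts)


print(build_code("9123456", color="789", size="012"))  # 9123456789012
print(build_code("9654321"))                           # 9654321000000
```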
The other issue is when linking to the OrderDetails table I need to reference all the attributes to know that the customer actually bought Size S in Blue
One possible option is to create a table that stores the bar code as the key. Then have an attribute for the size and the color.
Actually @jzd, your answer is pretty close, but I would like to keep the attributes as key-value pairs.
The idea is to use an attribute set and have a bar code associated with each set. Here is a rough schema:
AttributeSet Table:
AttributeSetId
ProductId
AttributeSetName
BarCode
AttributeUse Table:
AttributeSetId
AttributeId
AttributeSetInstance Table:
AttributeSetId
AttributeId
AttributeValueId
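A small sketch of how this schema might be used, with Python's built-in sqlite3 module. The Attribute and AttributeValue lookup tables, the sample rows and the helper attributes_for are assumptions added for illustration; the AttributeUse table is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed lookup tables, not part of the schema above.
    CREATE TABLE Attribute (AttributeId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE AttributeValue (
        AttributeValueId INTEGER PRIMARY KEY,
        AttributeId      INTEGER REFERENCES Attribute(AttributeId),
        Value            TEXT
    );
    CREATE TABLE AttributeSet (
        AttributeSetId   INTEGER PRIMARY KEY,
        ProductId        INTEGER,
        AttributeSetName TEXT,
        BarCode          TEXT UNIQUE
    );
    CREATE TABLE AttributeSetInstance (
        AttributeSetId   INTEGER REFERENCES AttributeSet(AttributeSetId),
        AttributeId      INTEGER REFERENCES Attribute(AttributeId),
        AttributeValueId INTEGER REFERENCES AttributeValue(AttributeValueId),
        PRIMARY KEY (AttributeSetId, AttributeId)
    );
    -- Hypothetical sample data: one shirt variant, blue in size S.
    INSERT INTO Attribute VALUES (1, 'Color'), (2, 'Size');
    INSERT INTO AttributeValue VALUES (1, 1, 'Blue'), (2, 2, 'S');
    INSERT INTO AttributeSet VALUES (1, 100, 'Blue / S', '9123456789012');
    INSERT INTO AttributeSetInstance VALUES (1, 1, 1), (1, 2, 2);
""")


def attributes_for(barcode):
    """Resolve a scanned bar code to its attribute name/value pairs."""
    rows = conn.execute(
        """
        SELECT a.Name, v.Value
        FROM AttributeSet s
        JOIN AttributeSetInstance i ON i.AttributeSetId = s.AttributeSetId
        JOIN Attribute a ON a.AttributeId = i.AttributeId
        JOIN AttributeValue v ON v.AttributeValueId = i.AttributeValueId
        WHERE s.BarCode = ?
        ORDER BY a.AttributeId
        """,
        (barcode,),
    ).fetchall()
    return dict(rows)


print(attributes_for("9123456789012"))  # {'Color': 'Blue', 'Size': 'S'}
```

With this layout an order line would only need to store the AttributeSetId (or the bar code) to record that the customer bought size S in blue.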
If you forget the barcode for a minute...
Do you have a database to track this inventory?
Are the items stored discretely in this database?
If so, then just add a unique number to the item, called the UPC value. I recommend not trying to build an intelligent key such as the one you are describing.
I have a problem naming the elements in my application's data model.
In the application, users can create their own metamodel. They do so by creating entity types, where a type defines which properties an entity has. However, there are three kinds of entity types:
There is always exactly one instance of the type.
For instance, I want to model the company I am working for. It has a name, a share price and a number of employees. These values change over time, but there is always exactly one company.
There are different instances of the type, each is unique.
Example: Cities. A city has a name and a population count, there are different cities and each city exists exactly once.
Each instance of the type defines multiple entities.
Example: Cars. A car has a color and a manufacturer. But there is not only one red mercedes. And even though they are similar, red mercedes #1 is different from red mercedes #2.
So let's say you are a user of this tool and you have understood the concept of these three flavors. You want to create a new entity type and are prompted to choose between options 1, 2 and 3. How would you name these options?
Edit:
Documentation and help are available to the user. The user can also be expected to have a technical/programming background, so understanding these three concepts should be no problem.
First of all let me make sure I understand the problem,
Here's what you have (correct me if I'm wrong):
#of instances , is/are Unique
(1,true)
(n,true)
(n,false)
If so,
for the number of instances I would use single / plural,
and for is/are unique (or not unique) I would use unique / ununique.
so you'll get:
singleUnique
pluralUnique
pluralUnunique
That's the best I could think of. I don't know exactly who your users are or what the environment is, but if you have the option of adding tips (or documentation), you should definitely use it.
I'm developing a website with a custom search function and I want to collect statistics on what the users search for.
It is not a full text search of the website content, but rather a search for companies with search modes like:
by company name
by area code
by provided services
...
How to design the database for storing statistics about the searches?
What information is most relevant and how should I query for them?
Well, it's dependent on how the different search modes work, but generally I would say that a table with 3 columns would work:
SearchType SearchValue Count
Whenever someone does a search, say they search for "Company Name: Initech", first query to see if there are any rows in the table with SearchType = "Company Name" (or whatever enum/id value you've given this search type) and SearchValue = "Initech". If there is already a row for this, UPDATE the row by incrementing the Count column. If there is not already a row for this search, insert a new one with a Count of 1.
By doing this, you'll have a fair amount of flexibility for querying it later. You can figure out what the most popular searches for each type are:
... WHERE SearchType = 'Some Search Type' ORDER BY Count DESC
You can figure out the most popular search types:
... GROUP BY SearchType ORDER BY SUM(Count) DESC
Etc.
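The update-or-insert logic and the two queries above could look roughly like this with Python's built-in sqlite3 module. The table name SearchStats and the helper functions are assumptions for illustration; the three columns are the ones suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SearchStats (
        SearchType  TEXT NOT NULL,
        SearchValue TEXT NOT NULL,
        Count       INTEGER NOT NULL,
        PRIMARY KEY (SearchType, SearchValue)
    )
""")


def record_search(search_type, search_value):
    """Increment the counter for this search, inserting the row if missing."""
    cur = conn.execute(
        "UPDATE SearchStats SET Count = Count + 1 "
        "WHERE SearchType = ? AND SearchValue = ?",
        (search_type, search_value),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO SearchStats VALUES (?, ?, 1)",
            (search_type, search_value),
        )


def top_searches(search_type, limit=10):
    """Most popular search values for one search type."""
    return conn.execute(
        "SELECT SearchValue, Count FROM SearchStats "
        "WHERE SearchType = ? ORDER BY Count DESC, SearchValue LIMIT ?",
        (search_type, limit),
    ).fetchall()


def top_search_types():
    """Search types ordered by their total number of searches."""
    return conn.execute(
        "SELECT SearchType, SUM(Count) AS Total FROM SearchStats "
        "GROUP BY SearchType ORDER BY Total DESC"
    ).fetchall()


for _ in range(3):
    record_search("Company Name", "Initech")
record_search("Company Name", "Initrode")
for _ in range(2):
    record_search("Area Code", "212")

print(top_searches("Company Name"))  # [('Initech', 3), ('Initrode', 1)]
print(top_search_types())            # [('Company Name', 4), ('Area Code', 2)]
```

Under concurrent writers the separate UPDATE and INSERT would need to run inside one transaction, or be replaced by the database's own upsert statement.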
This is a pretty general question but here's what I would do:
Option 1
If you want to strictly separate all three search types, then create a table for each. For company name, you could simply store the CompanyID (assuming your website is maintaining a list of companies) and a search count. For area code, store the area code and a search count. If the area code doesn't exist, insert it. Provided services is most dependent on your setup. The most general way would be to store key words and a search count, again inserting if not already there.
Optionally, you could store search date information as well. As an example, you'd have a table with Provided Services Keyword and a unique ID. You'd have another table with an FK to that ID and a SearchDate. That way you could make sense of the data over time while minimizing storage.
Option 2
Treat all searches the same. One table with a Keyword column and a count column, incorporating SearchDate if needed.
You may want to check this:
http://www.microsoft.com/sqlserver/2005/en/us/express-starter-schemas.aspx