I am trying to design my app to find database entries which are similar.
Let's take the table car as an example (everything in one table to keep it simple):
CarID | Car Name | Brand | Year | Top Speed | Performance | Displacement | Price
1     | Z3       | BMW   | 1990 | 250       | 5.4         | 123          | 23456
2     | 3er      | BMW   | 2000 | 256       | 5.4         | 123          | 23000
3     | Mustang  | Ford  | 2000 | 190       | 9.8         | 120          | 23000
Now I want to run queries like this:
"Search for Cars similar to Z3 (all brands)" (ignore "Car Name")
Similar in this context means that the row with the most columns exactly matching is the most similar.
In this example it would be the "3er BMW", since two columns (Performance and Displacement) are the same.
Can you give me hints on how to design database queries/an application like that? The application is going to be really big, with a lot of entries.
I would also really appreciate useful links or books. (It's no problem for me to investigate further if I know where to search or what to read.)
You could try to give each record a 'score' depending on its fields.
You could weight a column's score depending on how important the property is for the comparison (for instance, top speed could be more important than brand).
You'll end up with a score for each record, and you will be able to find similar records by comparing scores and selecting the records within +/- 5% (for example) of the record you're looking at.
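For instance, a weighted match score can be computed directly in SQL. This is only a minimal sketch assuming a MySQL-style car table as in the question; the space-free column names (TopSpeed, etc.) and the weights are assumptions you would adjust yourself:

    -- Find the cars most similar to car 1 (the Z3), ignoring Car Name.
    -- Each column that matches exactly adds its weight to the score
    -- (in MySQL a comparison evaluates to 1 or 0; NULL handling is omitted here).
    SELECT c.CarID,
           c.Brand,
           (c.Brand        = ref.Brand)        * 1.0 +
           (c.Year         = ref.Year)         * 1.0 +
           (c.TopSpeed     = ref.TopSpeed)     * 2.0 +  -- e.g. weight top speed higher than brand
           (c.Performance  = ref.Performance)  * 1.0 +
           (c.Displacement = ref.Displacement) * 1.0 +
           (c.Price        = ref.Price)        * 1.0 AS similarity_score
    FROM car AS c
    JOIN car AS ref ON ref.CarID = 1   -- the reference car (the Z3)
    WHERE c.CarID <> ref.CarID
    ORDER BY similarity_score DESC;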
The field of finding relationships and similarities in data is called data mining. In your case you could already try clustering, classifying your data to see which different groups show up.
I think this book is a good start for an introduction to data mining. Hope this helps.
To solve your problem, you have to use a clustering algorithm. First, you need to define a similarity metric, then you need to compute the similarity between your input tuples (all Z3s) and the rest of the database. You can speed up the process with algorithms such as k-means. Please take a look at this question, where you will find a discussion of a problem similar to yours - Finding groups of similar strings in a large set of strings.
This link is very helpful as well: http://matpalm.com/resemblance/.
Regarding the implementation: if you have a lot of tuples (and more than a few machines), you can use http://mahout.apache.org/. It is a machine learning framework based on Hadoop. You will need a lot of computation power, because clustering algorithms are complex.
Have a look at one of the existing search engines like Lucene. They implement a lot of things like that.
This paper might also be useful: Supporting developers with natural language queries
Not really an answer to your question, but since you say you have a lot of entries, you should consider normalizing your car table: move Brand to a separate table and "Car name"/model to a separate table. This will reduce the amount of data to compare during lookups.
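As a rough sketch of what that could look like (all table and column names here are illustrative, not prescribed):

    -- Hypothetical normalised layout: Brand and model move to lookup tables.
    CREATE TABLE brand (
        BrandID INT AUTO_INCREMENT PRIMARY KEY,
        Name    VARCHAR(50) NOT NULL UNIQUE
    );

    CREATE TABLE model (
        ModelID INT AUTO_INCREMENT PRIMARY KEY,
        BrandID INT NOT NULL,
        Name    VARCHAR(50) NOT NULL,
        FOREIGN KEY (BrandID) REFERENCES brand (BrandID)
    );

    CREATE TABLE car (
        CarID        INT AUTO_INCREMENT PRIMARY KEY,
        ModelID      INT NOT NULL,
        Year         INT,
        TopSpeed     INT,
        Performance  DECIMAL(4,1),
        Displacement INT,
        Price        INT,
        FOREIGN KEY (ModelID) REFERENCES model (ModelID)
    );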
Related
I am trying to build a product catalog with data from several e-commerce websites. The goal is to build a catalog where every product is specified as well as possible, leveraging data across multiple sources.
This seems to be a highly complex task, as there is sometimes misinformation, and in some cases the unique identifier is misspelled or not even present.
The current approach is to transform the extracted data into our format and then load it into a MySQL database. In this process obvious duplicates get removed, and I end up with about 250,000 datasets.
Now I am facing the problem of how to break this down even further, as there are thousands of duplicates, but I cannot say which ones, since some info might not be accurate.
e.g.
ref_id | title | img | color_id | size | length | diameter | dial_id
Any one dataset might be incomplete or might even contain wrong values.
Looking more into the topic, this seems to be a common use case for deep learning with e.g. TensorFlow.
I am looking for an answer that will help me create a process for doing this. Is TensorFlow the right tool? Should I write all datasets to the DB and keep the records? What could such a process look like, etc.?
I have identified 3 dimension tables and 1 measure (fact) table.
It will be Star schema.
My measure group would have Count(A/C number).
Each dimension table has a lookup table tied to the A/c number in a kind of one-to-one relationship.
Dim1: ID1, Cat1
Dim2: ID2, Cat2
Dim3: ID3, Cat3
Fact: A/c number, Count(A/c), ID1, ID2, ID3
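Expressed as DDL, that structure might look like the sketch below (AcNumber stands in for "A/c number", the types are assumptions, and the Count(A/c) measure is computed at query time rather than stored):

    CREATE TABLE Dim1 (ID1 INT PRIMARY KEY, Cat1 VARCHAR(50));
    CREATE TABLE Dim2 (ID2 INT PRIMARY KEY, Cat2 VARCHAR(50));
    CREATE TABLE Dim3 (ID3 INT PRIMARY KEY, Cat3 VARCHAR(50));

    CREATE TABLE Fact (
        AcNumber VARCHAR(20) NOT NULL,  -- the A/c number; see question 1 below
        ID1      INT NOT NULL,
        ID2      INT NOT NULL,
        ID3      INT NOT NULL,
        FOREIGN KEY (ID1) REFERENCES Dim1 (ID1),
        FOREIGN KEY (ID2) REFERENCES Dim2 (ID2),
        FOREIGN KEY (ID3) REFERENCES Dim3 (ID3)
    );

    -- The measure Count(A/c number) is then:
    SELECT COUNT(DISTINCT AcNumber) FROM Fact;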
The above is just an example.
Of course, in reality there are 15 dimension tables (in a one-to-one relation with the fact table) and data close to a million records; that's why we need to come up with the best design for performance.
I know a fact/measure is always an aggregate or a measure of the business, and in this case the measure is Count(A/c number).
Questions:
1. Do I need to add the A/c number to the fact table?
Remember that adding the A/c number to the fact table would make the fact table huge.
Good or bad, performance-wise?
2. Do I create an additional factless fact table, similar to the fact table, where the fact table has only Count(A/c number) and the factless fact table has the actual A/c numbers along with the dimension values? This would be a big table.
Good or bad, performance-wise?
3. Do I create an additional column (A/c number) along with the lookup values on the dimension tables, so that the fact table holds only facts? Good or bad, performance-wise?
Also I need to know whether dimension processing/deployment is (or should be) faster, or fact processing/deployment is (or should be) faster, and which is preferred in practice.
I want to know which option to select in practice, or whether there is a better solution.
Please let me know!
If I understood correctly, you are talking about a degenerate dimension.
That's a common practice and, in my opinion, the correct way to tackle your issue.
For instance, let's say that we have an order details table with a granularity of one row per order line, something like this:
Please visit this link to see the image, because I'm not yet able to post images in the forum:
http://i623.photobucket.com/albums/tt313/pauldj54/degeneratedDimension.jpg
If your measure is the count of orders, the result from the example above is 2.
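In case the image is no longer available, here is a minimal sketch of such an order details fact table; all names and sample rows are illustrative only, not taken from the original image:

    -- Order-line grain; OrderNumber lives on the fact itself as a degenerate
    -- dimension (no separate order dimension table is needed).
    CREATE TABLE FactOrderDetail (
        OrderNumber VARCHAR(20)   NOT NULL,  -- degenerate dimension
        OrderLine   INT           NOT NULL,
        ProductKey  INT           NOT NULL,  -- FK to a product dimension
        DateKey     INT           NOT NULL,  -- FK to a date dimension
        Quantity    INT           NOT NULL,
        Amount      DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (OrderNumber, OrderLine)
    );

    INSERT INTO FactOrderDetail VALUES
        ('SO-1001', 1, 10, 20140101, 2,  40.00),
        ('SO-1001', 2, 11, 20140101, 1,  15.00),
        ('SO-1002', 1, 10, 20140102, 5, 100.00);

    -- Count of orders: two distinct order numbers, so the result is 2.
    SELECT COUNT(DISTINCT OrderNumber) AS OrderCount
    FROM FactOrderDetail;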
Please check the following link:
Creating Degenerated dimension
Let me know if you have further questions.
Kind Regards,
Paul
You described your volume as "close to a million records". That sounds trivial to process on any server (or even a desktop or laptop) built in the last 5 years.
Therefore I would not limit the design to solve an imagined performance issue.
I have recently been given the assignment of modelling a database fit to store stock prices for over 140 companies. The data will be collected every 15 minutes for 8.5 hours each day from all these companies. The problem I'm facing right now is how to set up the database to achieve fast searches/fetches on this data.
One solution would be to store everything in one table with the following columns:
| Company name | Price | Date | Etc... |
Or I could create a table for each company and just store the price and the date when the data was collected (and other parameters not known at the moment).
What are your thoughts on these kinds of solutions? I hope the problem was explained in sufficient detail; if not, please let me know.
Any other solution would be greatly appreciated!
I take it you're concerned about performance given the large number of records you're likely to generate: 140 companies * 4 data points/hour * 8.5 hours/day * 250 trading days/year means you're looking at around 1.2 million data points per year.
Modern relational database systems can easily handle that number of records in a single table, subject to some important considerations; I don't see an issue with storing 100 years of data points.
So, yes, your initial design is probably the best:
Company name | Price | Date | Etc... |
Create indexes on Company name and date; that will allow you to answer questions like:
what was the highest share price for company x
what was the share price for company x on date y
on date y, what was the highest share price
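A rough sketch of those indexes and queries, assuming a single table named stock_prices with company_name, price and price_date columns (names and dates are illustrative):

    -- Indexes to support lookups by company and by date.
    CREATE INDEX idx_prices_company_date ON stock_prices (company_name, price_date);
    CREATE INDEX idx_prices_date         ON stock_prices (price_date);

    -- Highest share price for company X:
    SELECT MAX(price) FROM stock_prices WHERE company_name = 'X';

    -- Share prices for company X on date Y (all intraday points that day):
    SELECT price_date, price
    FROM stock_prices
    WHERE company_name = 'X'
      AND price_date >= '2014-01-02' AND price_date < '2014-01-03';

    -- Highest share price on date Y across all companies:
    SELECT company_name, MAX(price) AS high
    FROM stock_prices
    WHERE price_date >= '2014-01-02' AND price_date < '2014-01-03'
    GROUP BY company_name
    ORDER BY high DESC
    LIMIT 1;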
To help prevent performance problems, I'd build a test database, and populate it with sample data (tools like dbMonster make this easy), and then build the queries you (think you) will run against the real system; use the tuning tools for your database system to optimize those queries and/or indices.
On top of what has already been said, I'd like to add the following: don't use "Company name" or something like "Ticker Symbol" as your primary key. As you're likely to find out, stock prices have two important characteristics that are often ignored:
some companies can be quoted on multiple stock exchanges, and therefore have different quote prices on each stock exchange.
some companies are quoted multiple times on the same stock exchange, but in different currencies.
As a result, a properly generic solution should use the (ISIN, currency, stock exchange) triplet as identifier for a quote.
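A hedged sketch of a quote table keyed that way (column names and types are assumptions):

    -- A quote is identified by the (ISIN, currency, exchange) triplet.
    CREATE TABLE quote (
        isin       CHAR(12)      NOT NULL,
        currency   CHAR(3)       NOT NULL,  -- ISO 4217 code
        exchange   VARCHAR(10)   NOT NULL,  -- e.g. a MIC code
        price_time DATETIME      NOT NULL,
        price      DECIMAL(18,4) NOT NULL,
        PRIMARY KEY (isin, currency, exchange, price_time)
    );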
The first, more important question is: what are the types and usage patterns of the queries that will be executed against this table? Is this an Online Transactional Processing (OLTP) application, where the great majority of queries are against a single record, or at most a small set of records? Or is it an Online Analytical Processing (OLAP) application, where most queries will need to read, and process, significantly large sets of data to generate aggregations and do analysis? These two very different types of systems should be modeled in different ways.
If it is the first type of app (OLTP), your first option is the better one, but the usage patterns and types of queries would still be important in determining the types of indices to place on the table.
If it is an OLAP application (and a system storing billions of stock prices sounds more like an OLAP app), then the data structure you set up might be better organized to store pre-aggregated data values, or even go all the way and use a multi-dimensional database like an OLAP cube, based on a star schema.
Put them into a single table. Modern DB engines can easily handle the volumes you specified.
rowid | StockCode | priceTimeInUTC | PriceCode | AskPrice | BidPrice | Volume
rowid: Identity UniqueIdentifier.
StockCode instead of Company: companies have multiple types of stocks.
PriceTimeInUTC is to standardize any datetime into a specific timezone.
Also use datetime2 (it is more accurate).
PriceCode is used to identify what kind of price it is: options/futures/common stock, preferred stock, etc.
AskPrice is the buying price.
BidPrice is the selling price.
Volume (for buy/sell) might be useful for you.
Separately, have a StockCode table and a PriceCode table.
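A sketch of that layout in SQL Server syntax (chosen because datetime2 and UniqueIdentifier are mentioned above); the exact types and lengths are assumptions:

    CREATE TABLE StockPrice (
        rowid          UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
        StockCode      VARCHAR(12)      NOT NULL,  -- FK to the StockCode table
        PriceTimeInUTC DATETIME2        NOT NULL,
        PriceCode      CHAR(2)          NOT NULL,  -- FK to the PriceCode table
        AskPrice       DECIMAL(18,4)    NULL,
        BidPrice       DECIMAL(18,4)    NULL,
        Volume         BIGINT           NULL
    );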
That is a brute-force approach. The second you add searchable factors, it can change everything. A more flexible and elegant option is a star schema, which can scale to any amount of data. I am a private party working on this myself.
I have the following locations table:
----------------------------------------------------------
| ID | zoneID | storeID | address | latitude | longitude |
----------------------------------------------------------
and the phones table:
-----------------------
| locationID | number |
-----------------------
Now, keep in mind that any given store can have up to five phone numbers, tops. Order doesn't matter.
Recently we needed to add another table which would contain store-related info, which would also include phone numbers.
Now, locationID doesn't apply to this new table, so we can't store the phones in the previous phones table.
Keeping the DB normalized would require, in the end, 2 new tables and a total of 4 joins to retrieve the data. Denormalizing it would leave the old table looking like:
----------------------------------------------------------------------------------
| ID | zoneID | storeID | address | latitude | longitude | phone1 | ... | phone5 |
----------------------------------------------------------------------------------
and having a total of 2 tables and 2 joins.
I'm not a fan of having data1, data2, data3 fields, as they can be a huge pain. So, what's your opinion?
My opinion, for what it's worth, is that de-normalisation is something you do to gain performance if, and only if, you actually have a performance problem. I always design for 3NF and only revert if absolutely necessary.
It's not something you do to make your queries look nicer. Any decent database developer would not fear a moderately complex SQL statement although I do have to admit I've seen some multi-hundred-line statements that gave me the shivers - mind you, these were from customers who had no control over the schema: a DBA would have first re-engineered the schema to avoid such a monstrosity.
But, as long as you're happy with the limitations imposed by de-normalisation, you can do whatever you want. It's not as if there's a band of 3NF police roaming the planet looking for violators :-)
The immediate limitations (there may be others) that I can see are:
You'll be limited (initially, without a schema change) to five phone numbers per location. From your description, it doesn't appear you see this as a problem.
You'll waste space storing data that doesn't have to be there. In other words, every row uses space for five numbers regardless of what they actually have, although this impact is probably minimal (e.g., if they're varchar and nullable).
Your queries to look up a phone number will be complicated since you'll have to check five different columns. Whether that's one of your use cases, I don't know, so it may be irrelevant.
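For instance, the lookup would differ like this (a sketch using the table and column names from the question; the phone number is made up):

    -- Denormalised: find the store that owns a given phone number.
    SELECT *
    FROM locations
    WHERE '555-1234' IN (phone1, phone2, phone3, phone4, phone5);

    -- Normalised: the same lookup stays a simple join.
    SELECT l.*
    FROM locations AS l
    JOIN phones AS p ON p.locationID = l.ID
    WHERE p.number = '555-1234';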
You should probably choose one way or the other though (I'm not sure if that's your intent here). I'd be particularly annoyed if I came across a schema that had phone numbers in both the store table and a separate phone numbers table, especially if they disagreed with each other. Even when I de-normalise, I tend to use insert/update triggers to ensure data consistency is maintained.
I think your problem stems from an erroneous model.
Why do you have a location id and a store id? Can a store occupy more than one location?
Is the phone number tied to a geographic location?
Just key everything by StoreId and your problems will disappear.
Just try to relate your new table to the old locations table; as both tables represent the store, you should be able to find some way to relate them. If you can do that, your problem is solved, because then you can keep using the phones table as before.
Relating the new table to the old locations table will help you beyond getting phone numbers.
I am creating a database schema to be used for technical analysis like top-volume gainers, top-price gainers, etc. I have checked answers to questions here, like the design question. Having taken the hint from boe100's answer there, I have a schema modeled pretty much on it, thusly:
Symbol - char 6 //primary
Date - date //primary
Open - decimal 18, 4
High - decimal 18, 4
Low - decimal 18, 4
Close - decimal 18, 4
Volume - int
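Spelled out as MySQL-style DDL (the table name eod is an assumption; the two //primary markers become a composite primary key):

    CREATE TABLE eod (
        Symbol CHAR(6)        NOT NULL,
        Date   DATE           NOT NULL,
        Open   DECIMAL(18, 4) NULL,
        High   DECIMAL(18, 4) NULL,
        Low    DECIMAL(18, 4) NULL,
        Close  DECIMAL(18, 4) NULL,
        Volume INT            NULL,
        PRIMARY KEY (Symbol, Date)
    );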
Right now this table, containing end-of-day (EOD) data, will hold about 3 million rows for 3 years. Later, when I get/need more data, it could be 20 million rows.
The front end will be making requests like "give me the top price gainers on date X over Y days". That request is one of the simpler ones, and as such is not too costly, time-wise, I assume.
But a request like "give me the top volume gainers for the last 10 days, with the previous 100 days acting as a baseline" could prove 10-100 times costlier. The result of such a request would be a float which signifies how many times the volume has grown, etc.
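For what it's worth, such a request can be computed on the fly from the table above. This is only a rough MySQL-style sketch, assuming the table is called eod, using calendar-day windows for simplicity and an arbitrary reference date:

    -- Volume gain per symbol: average volume over the last 10 days
    -- divided by the average volume over the 100 days before that.
    SELECT recent.Symbol,
           recent.avg_vol / baseline.avg_vol AS volume_gain
    FROM (
        SELECT Symbol, AVG(Volume) AS avg_vol
        FROM eod
        WHERE Date >  DATE_SUB('2010-06-01', INTERVAL 10 DAY)
          AND Date <= '2010-06-01'
        GROUP BY Symbol
    ) AS recent
    JOIN (
        SELECT Symbol, AVG(Volume) AS avg_vol
        FROM eod
        WHERE Date >  DATE_SUB('2010-06-01', INTERVAL 110 DAY)
          AND Date <= DATE_SUB('2010-06-01', INTERVAL 10 DAY)
        GROUP BY Symbol
    ) AS baseline ON baseline.Symbol = recent.Symbol
    ORDER BY volume_gain DESC
    LIMIT 20;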
One option I have is adding a column for each such result. And if the user asks for the volume gain over 10 days against a 20-day baseline, that would require yet another column. The total number of such columns could easily cross 100, especially if I start adding other results as columns, like MACD-10 and MACD-100, each of which would require its own column.
Is this a feasible solution?
Another option is to keep the results in cached HTML files and present them to the user. I don't have much experience in web development, so to me it looks messy; but I could be wrong (of course!). Is that an option too?
Let me add that I am/will be using mod_perl to present the response to the user, with much of the work on the MySQL database being done using Perl. I would like a response time of 1-2 seconds.
You should keep your data normalised as much as possible, and let the RDBMS do its work: efficiently performing queries based on the normalised data.
Don't second-guess what will or will not be efficient; instead, only optimise in response to specific, measured inefficiencies as reported by the RDBMS's query explainer.
Valid tools for optimisation include, in rough order of preference:
Normalising the data further, to allow the RDBMS to decide for itself how best to answer the query.
Refactoring the specific query to remove the inefficiencies reported by the query explainer. This will give good feedback on how the application might be made more efficient, or might lead to a better normalisation of relations as above.
Creating indexes on attributes that turn out, in practice, to be used in a great many transactions. This can be quite effective, but it is a trade-off of slowdown on most write operations as indexes are maintained, to gain speed in some specific read operations when the indexes are used.
Creating supplementary tables to hold intermediary pre-computed results for use in future queries. This is rarely a good idea, not least because it totally breaks the DRY principle; you now have to come up with a strategy for keeping the duplicate information (the original data and the derived data) in sync, whereas the RDBMS does its job best when there is no duplicated data.
None of those involve messing around inside the tables that store the primary data.