How to do product catalog building with MySQL and TensorFlow?

I am trying to build a product catalog with data from several e-commerce websites. The goal is to build a catalog in which every product is described as completely and accurately as possible by combining the data available across multiple sources.
This is a complex task, because some of the source data is wrong, and in some cases the unique identifier is misspelled or missing entirely.
The current approach is to transform the extracted data into our format and then load it into a MySQL database. In this process obvious duplicates are removed, and I end up with about 250,000 records.
Now I am facing the problem of how to break this down further: there are thousands of duplicates left, but I cannot say which, since some of the information may be inaccurate.
For example, a single record with the columns
ref_id | title | img | color_id | size | length | diameter | dial_id
might be incomplete or even contain wrong values.
Looking further into the topic, this seems to be a common use case for deep learning with, e.g., TensorFlow.
I am looking for guidance on how to set up such a process. Is TensorFlow the right tool? Should I write all records to the database and keep them? What could the overall process look like?
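One possible first step, sketched below purely as an illustration, is a candidate-pair "blocking" query: instead of comparing all 250,000 records against each other, generate pairs of records that look alike and count how many attributes agree, then hand only those pairs to a scoring or machine-learning step. The table name products and the exact matching rules are assumptions.

-- Hypothetical blocking query over an assumed `products` table (MySQL):
-- pair up records whose normalized titles share a prefix, then count how
-- many other attributes agree (<=> is MySQL's NULL-safe equality, 1 or 0).
-- Assumes each row has a distinct ref_id; otherwise compare on a surrogate id.
SELECT a.ref_id AS ref_a,
       b.ref_id AS ref_b,
       (a.color_id <=> b.color_id)
     + (a.size     <=> b.size)
     + (a.length   <=> b.length)
     + (a.diameter <=> b.diameter) AS matching_attributes
FROM products a
JOIN products b
  ON a.ref_id < b.ref_id
 AND LEFT(LOWER(a.title), 12) = LEFT(LOWER(b.title), 12)
ORDER BY matching_attributes DESC;

The pairs ranked this way could then be labeled and fed to whatever classifier (TensorFlow or otherwise) decides which pairs are true duplicates.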

Related

Process Mining algorithm

If I have a window-usage data table like
StartTime | EndTime | Window | Value
that records a history of window usage, how can I mine this data for repetitive patterns, e.g. wnd1->wnd2->wnd3 (sets of records that consistently occur together, while records in other patterns may vary)?
Which algorithm is best suited for this? Are there any implementations for Excel, Python or Delphi?
It seems that your data is not suitable for process mining. Process mining requires a mandatory "case ID" field; without it, it is almost impossible to benefit from most process mining techniques.
It would be great if you could provide a case ID, or use sequence mining techniques instead.
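As a rough illustration of the sequence-mining idea, assuming the usage log were loaded into a SQL table called window_usage with the columns from the question (any database with window functions will do), counting how often one window directly follows another is the simplest version:

-- Sketch: count direct window-to-window transitions in chronological order
SELECT prev_window,
       Window AS next_window,
       COUNT(*) AS occurrences
FROM (
    SELECT Window,
           LAG(Window) OVER (ORDER BY StartTime) AS prev_window
    FROM window_usage
) AS transitions
WHERE prev_window IS NOT NULL
GROUP BY prev_window, Window
ORDER BY occurrences DESC;

Longer patterns such as wnd1->wnd2->wnd3 can be found the same way by chaining additional LAG() calls, or by using a dedicated sequence-mining library.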

How to perform a nasty insert/update with lots of data, possible key violations and error list return

I am looking for a strategy for this complicated operation to perform.
We used to have a solution using NHibernate; it was fine for small amounts of data, but now it is just far too slow.
The client uploads a file of card IDs into the system. The file can contain roughly 150k+ card IDs, which are assigned to a customer group.
Each record is either inserted or has its state changed.
Table eg.
+----------+----------------+-------+
| card_id* | card_group_id* | state |
+----------+----------------+-------+
| 11112    | meow           | 0     |
| 11131    | meow           | 1     |
+----------+----------------+-------+
Some IDs may violate the PK or one of three FKs, and those IDs should be returned to the client.
So I basically need to perform a somewhat one-at-a-time insert: "did it work? OK. If not, add it to the failed-ID list."
Can you suggest a strategy for achieving this?
I cannot tolerate one bad ID failing the operation for another 1000 in the same statement, and it needs to be reasonably fast so the client is not left waiting endlessly for a result notification.
The backend is ASP.NET-based, with database in SQL Server 2012.
Use the right tool for the job. In this case that is SSIS, probably with a Merge transformation. From two sources (one the table, the other the file) create a merge, then take the two outputs: insert the rows that did not match, and write the rows that did match into an errors table. SSIS will give you batching, bulk insert, input-file parsing and much more.
Do not attempt to do this in straight SQL; handling 150k input records is far from trivial, and you'll end up writing a single-use app instead of using an off-the-shelf component.
BULK INSERT your data into a #TEMPTABLE, then iterate over it with a CURSOR.
Problems:
Quick and dirty
Cursor performance!
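As a rough, set-based variation on the #TEMPTABLE suggestion (table, column and file names are assumptions, and only one of the three FK checks is shown), the validation and the upsert can also be done without a cursor:

-- Sketch in T-SQL: stage the file, collect the IDs that would violate a
-- constraint, upsert only the clean rows, and return the rejected IDs.
CREATE TABLE #staging (card_id INT, card_group_id VARCHAR(20), state TINYINT);

BULK INSERT #staging
FROM 'C:\uploads\cards.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- rows whose group does not exist would violate the FK (the other FK checks
-- would look the same)
SELECT s.card_id
INTO   #failed
FROM   #staging s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.card_groups g
                   WHERE g.card_group_id = s.card_group_id);

-- upsert only the clean rows in one set-based statement
MERGE dbo.cards AS t
USING (SELECT s.* FROM #staging s
       WHERE NOT EXISTS (SELECT 1 FROM #failed f WHERE f.card_id = s.card_id)) AS src
   ON t.card_id = src.card_id
WHEN MATCHED THEN
    UPDATE SET t.state = src.state
WHEN NOT MATCHED THEN
    INSERT (card_id, card_group_id, state)
    VALUES (src.card_id, src.card_group_id, src.state);

-- the rejected IDs go back to the client
SELECT card_id FROM #failed;

Duplicate card_ids inside the uploaded file would still need to be handled (e.g. de-duplicated in the staging table) before the MERGE.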

Database modeling for stock prices

I have recently been given the assignment of modelling a database to store stock prices for over 140 companies. The data will be collected every 15 minutes for 8.5 hours each day from all of these companies. The problem I'm facing right now is how to set up the database to achieve fast search/fetch given this data.
One solution would be to store everything in one table with the following columns:
| Company name | Price | Date | Etc... |
Or I could create a table for each company and just store the price and the date when the data was collected (and other parameters not known at the moment).
What are your thoughts on these kinds of solutions? I hope the problem was explained in sufficient detail; if not, please let me know.
Any other solution would be greatly appreciated!
I take it you're concerned about performance given the large number of records you're likely to generate: 140 companies * 4 data points/hour * 8.5 hours * 250 trading days/year means you're looking at around 1.2 million data points per year.
Modern relational database systems can easily handle that number of records in a single table, subject to some important considerations; I don't see an issue with storing 100 years of data points.
So, yes, your initial design is probably the best:
Company name | Price | Date | Etc... |
Create indexes on Company name and date; that will allow you to answer questions like:
what was the highest share price for company x
what was the share price for company x on date y
on date y, what was the highest share price
To help prevent performance problems, I'd build a test database, and populate it with sample data (tools like dbMonster make this easy), and then build the queries you (think you) will run against the real system; use the tuning tools for your database system to optimize those queries and/or indices.
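A minimal sketch of that single-table layout with the suggested indexes; the table, column names and types are assumptions:

-- Single table plus indexes on company name and date
CREATE TABLE stock_price (
    company_name VARCHAR(100)  NOT NULL,
    price        DECIMAL(12,4) NOT NULL,
    price_date   DATETIME      NOT NULL
);
CREATE INDEX ix_stock_price_company_date ON stock_price (company_name, price_date);
CREATE INDEX ix_stock_price_date         ON stock_price (price_date);

-- "what was the highest share price for company x"
SELECT MAX(price) FROM stock_price WHERE company_name = 'x';

-- "on date y, what was the highest share price"
SELECT MAX(price) FROM stock_price
WHERE price_date >= '2014-01-02' AND price_date < '2014-01-03';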
On top of what has already been said, I'd like to say the following thing: Don't use "Company name" or something like "Ticker Symbol" as your primary key. As you're likely to find out, stock prices have two important characteristics that are often ignored:
some companies can be quoted on multiple stock exchanges, and therefore have different quote prices on each stock exchange.
some companies are quoted multiple times on the same stock exchange, but in different currencies.
As a result, a properly generic solution should use the (ISIN, currency, stock exchange) triplet as the identifier for a quote.
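A hypothetical quote table keyed by that triplet (plus the quote timestamp) might look like this; names and types are assumptions:

-- Quote identified by ISIN + currency + exchange, one row per timestamp
CREATE TABLE quote (
    isin      CHAR(12)      NOT NULL,
    currency  CHAR(3)       NOT NULL,
    exchange  VARCHAR(10)   NOT NULL,
    quoted_at DATETIME      NOT NULL,
    price     DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (isin, currency, exchange, quoted_at)
);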
The first, more important question is: what are the types and usage patterns of the queries that will be executed against this table? Is this an Online Transaction Processing (OLTP) application, where the great majority of queries are against a single record, or at most a small set of records? Or is it an Online Analytical Processing (OLAP) application, where most queries will need to read and process significantly larger sets of data to generate aggregations and do analysis? These two very different types of systems should be modeled in different ways.
If it is the first type of app (OLTP), your first option is the better one, but the usage patterns and types of queries are still important for determining the indexes to place on the table.
If it is an OLAP application (and a system storing billions of stock prices sounds more like an OLAP app), then the data might be better organized to store pre-aggregated values, or you could even go all the way and use a multi-dimensional database such as an OLAP cube based on a star schema.
Put them into a single table. Modern DB engines can easily handle the volumes you specified.
rowid | StockCode | priceTimeInUTC | PriceCode | AskPrice | BidPrice | Volume
rowid: an identity / uniqueidentifier surrogate key.
StockCode instead of Company, because companies can have multiple types of stock.
PriceTimeInUTC standardizes every timestamp to a single time zone. Also use datetime2 (it is more precise).
PriceCode identifies what kind of price it is: options/futures/common stock, preferred stock, etc.
AskPrice is the buying price.
BidPrice is the selling price.
Volume (for buy/sell) might be useful for you.
Separately, have a StockCode table and a PriceCode table.
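A sketch of that layout in T-SQL; the exact types are assumptions:

-- Lookup tables plus the price table described above
CREATE TABLE Stock     (StockCode VARCHAR(12) PRIMARY KEY, CompanyName VARCHAR(100) NOT NULL);
CREATE TABLE PriceCode (PriceCode VARCHAR(10) PRIMARY KEY, Description VARCHAR(50)  NOT NULL);

CREATE TABLE StockPrice (
    rowid          UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    StockCode      VARCHAR(12)   NOT NULL REFERENCES Stock (StockCode),
    priceTimeInUTC DATETIME2     NOT NULL,
    PriceCode      VARCHAR(10)   NOT NULL REFERENCES PriceCode (PriceCode),
    AskPrice       DECIMAL(12,4) NULL,
    BidPrice       DECIMAL(12,4) NULL,
    Volume         BIGINT        NULL
);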
That is a brute-force approach; the second you add searchable factors, everything can change. A more flexible and elegant option is a star schema, which can scale to any amount of data. I am a private party working on this myself.

MySQL summary tables for web app advice

I have a database where the data is held in a number of related tables, for example:
TABLE Cars (stock)
-------------------------
Model | colourid | Doors
xyz   | 0        | 2
xyz   | 1        | 4

TABLE Colour
-------------------------
Colourid | Name
0        | Red
1        | Green
I need to produce several regular summaries, for example a summary in the following format:
         | colour           | Num Doors
Model    | red  green  blue | 2  4  5  6
---------|------------------|------------------
XYZ      | 1    2      3    | 4  5  3  5   <<< numbers in stock
UPDATE - "A car can have a number of doors, for example 2-door cars or cars with 4 doors. The summary shows the number of cars in stock with each door configuration for a particular model, e.g. there are 4 cars of model xyz with 2 doors. Please bear in mind that this is only an example; cars may not be the best example, it's just all I could come up with at the time."
Unfortunately, rearranging tables may make them better for summaries but worse for day-to-day operations.
I can think of several ways to produce these summaries, e.g. multiple SQL queries with the table assembled at presentation level, a SQL-level UNION of multiple queries, VIEWs with multiple nested queries, or lastly cron jobs or trigger code that maintains a summary table with the data arranged suitably for summary queries and reporting.
I wonder if anyone could give me some guidance, considering that these methods aren't very efficient, are made worse in multi-user environments, and that regular summaries may be required.
I think you need a data warehousing solution: basically, build a new schema just for reporting purposes and populate its tables periodically.
There can be several update mechanisms for the summary tables -
Background job scheduled to do this periodically. This is best if up-to-date information is not needed.
Update the summary table using triggers on the main transaction tables. This could get somewhat complicated, but it might be warranted if you need up-to-date information (a sketch follows below).
Update the report tables just before showing a report, whenever one is drawn. You can use anchor values to ensure you are not recalculating the entire report too frequently; just consider the rows added or updated since the last time the report was drawn.
The only problem is that you will need to alter the summary table whenever new values appear in the pivoted columns.
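As a sketch of the trigger-based mechanism (the second option above), using the example Cars table from the question and an assumed car_summary layout, in MySQL:

-- Summary keyed by model/colour/doors, kept in step by a trigger on Cars
CREATE TABLE car_summary (
    Model       VARCHAR(20) NOT NULL,
    colourid    INT         NOT NULL,
    Doors       INT         NOT NULL,
    stock_count INT         NOT NULL DEFAULT 0,
    PRIMARY KEY (Model, colourid, Doors)
);

DELIMITER //
CREATE TRIGGER cars_after_insert AFTER INSERT ON Cars
FOR EACH ROW
BEGIN
    INSERT INTO car_summary (Model, colourid, Doors, stock_count)
    VALUES (NEW.Model, NEW.colourid, NEW.Doors, 1)
    ON DUPLICATE KEY UPDATE stock_count = stock_count + 1;
END//
DELIMITER ;

A matching AFTER DELETE (and, if stock rows are edited, AFTER UPDATE) trigger would be needed to keep the counts correct.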
Just a small variation on Roopesh's answer
Depending on the size of the database, the available server resources, how often you would run these reports, and particularly whether you can afford stale reports, you might do conceptually the same as above, but using views instead of real tables.
Here are two links that should get you started
Pivot in MySQL
MySQL Wizardry
Notes:
you don't have to run any DDL (you can even skip CREATE VIEW and use straight dynamic SQL), as compared to maintaining materialized results
the complexity is comparable, but a little lower (adding a new value in the materialized scenario requires 1) ALTER TABLE ADD COLUMN and 2) INSERT; with this approach you only modify the SELECT to handle one more case, so the complexity is basically the same as the INSERT alone)
performance can be much worse if users run the reports directly against the database many times, but, as stated before, this approach also guarantees that the data is fresh
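For illustration, a rough pivot over the example Cars/Colour tables might look like the following; the colour and door values are hard-coded, so adding a new value means editing (or dynamically generating) the SELECT:

-- Pivot sketch in MySQL: counts per model by colour and by door configuration
CREATE OR REPLACE VIEW car_colour_summary AS
SELECT c.Model,
       SUM(cl.Name = 'Red')   AS red,
       SUM(cl.Name = 'Green') AS green,
       SUM(cl.Name = 'Blue')  AS blue,
       SUM(c.Doors = 2)       AS doors_2,
       SUM(c.Doors = 4)       AS doors_4
FROM Cars c
JOIN Colour cl ON cl.Colourid = c.colourid
GROUP BY c.Model;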

Find Similar Rows in Database

I am trying to design my app to find database entries which are similar.
Let's take the table car as an example (everything in one table to keep the example simple):
CarID | Car Name | Brand | Year | Top Speed | Performance | Displacement | Price
1     | Z3       | BMW   | 1990 | 250       | 5.4         | 123          | 23456
2     | 3er      | BMW   | 2000 | 256       | 5.4         | 123          | 23000
3     | Mustang  | Ford  | 2000 | 190       | 9.8         | 120          | 23000
Now I want to run queries like this:
"Search for cars similar to the Z3 (all brands)" (ignoring "Car Name").
Similar in this context means that the row where the most columns are exactly the same is the most similar.
In this example it would be the "3er BMW", since two columns (Performance and Displacement) are the same.
Can you give me hints on how to design database queries and an application like that? The application is going to be really big, with a lot of entries.
I would also really appreciate useful links or books. (It's no problem for me to investigate further if I know where to search or what to read.)
You could try to give each record a 'score' depending on its fields
You could weigh a column's score depending on how important the property is for the comparison (for instance top speed could be more important than brand)
You'll end up with a score for each record, and you will be able to find similar records by comparing scores and finding the records that are +/- 5% (for example) of the record you're looking at
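As a small sketch of that idea directly in SQL, assuming the car table from the question and arbitrary example weights (the <=> operator is MySQL's NULL-safe equality, returning 1 or 0):

-- Rank cars by a weighted count of columns matching car 1 (the Z3)
SELECT c.CarID, c.`Car Name`,
       3 * (c.`Top Speed`    <=> z.`Top Speed`)    +
       2 * (c.Performance    <=> z.Performance)    +
       2 * (c.Displacement   <=> z.Displacement)   +
       1 * (c.`Year`         <=> z.`Year`)         AS similarity_score
FROM car c
CROSS JOIN (SELECT * FROM car WHERE CarID = 1) z   -- car 1 is the Z3
WHERE c.CarID <> 1
ORDER BY similarity_score DESC;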
The field of finding relationships and similarities in data is called data mining. In your case you could start by clustering and classifying your data in order to see which groups show up.
I think this book is a good start for an introduction to data mining. Hope this helps.
To solve your problem, you need a clustering algorithm. First, define a similarity metric; then compute the similarity between your input tuples (all Z3s) and the rest of the database. You can speed up the process using algorithms such as k-means. Please take a look at this question, where you will find a discussion of a problem similar to yours: Finding groups of similar strings in a large set of strings.
This link is very helpful as well: http://matpalm.com/resemblance/.
Regarding the implementation: if you have a lot of tuples (and more than a few machines) you can use http://mahout.apache.org/. It is a machine learning framework based on Hadoop. You will need a lot of computing power, because clustering algorithms are computationally expensive.
Have a look at one of the existing search engines like Lucene. They implement a lot of things like that.
This paper might also be useful: Supporting developers with natural language queries
Not really an answer to your question, but since you say you have a lot of entries, you should consider normalizing your car table: move Brand to a separate table and "Car name"/model to a separate table. This will reduce the amount of data to compare during lookups.
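A minimal sketch of that normalization (MySQL syntax; table and column names are assumptions):

-- Brand and model pulled out into lookup tables
CREATE TABLE brand (
    brand_id INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE model (
    model_id INT AUTO_INCREMENT PRIMARY KEY,
    brand_id INT NOT NULL,
    name     VARCHAR(50) NOT NULL,
    FOREIGN KEY (brand_id) REFERENCES brand (brand_id)
);
-- the car table would then store model_id instead of the Brand and "Car Name" text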