Synchronizing millions of products - Shopware 6

My team and I are trying to update a huge number (millions) of products through an integration with our ERP.
We want to use the sync api.
https://forum.shopware.com/t/sync-api-upsert-mit-productnumber-als-unique-key/68556
This thread explains what we want to do. Our ERP system is not aware of the product ID (UUID) from Shopware and only knows the product SKU. This leaves us having to do a product lookup in Shopware for each product number to get the product ID, and only then update the product data.
Is there a workaround so we can upsert by product number, or are there other ideas for speeding things up?
Kind regards
One idea is to generate our own product UUID based on an MD5 hash of the product number. That way we always know the UUID without having to do a lookup in the database.

There are several approaches to solving this. These are some that I use in my daily business.
API-only
If you can only work with the Sync API, then I would recommend building a separate storage just for product synchronization. It would contain the following information:
SKU
Shopware-UUID
hash
Every synchronized product will have an existing UUID. If there is no UUID, you know that the product has not been synchronized to Shopware yet. The hash can be used internally to check for changes in the data; if there is a change, only that specific entity is updated. That way a simple and fast lookup is possible without hitting the API.
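For illustration, a minimal sketch of such a mapping store, assuming SQLite and a content hash over the ERP payload; the table and column names are made up for this example:

```python
import hashlib
import json
import sqlite3

# Illustrative local mapping store: SKU -> Shopware UUID + content hash.
conn = sqlite3.connect("product_sync.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS product_sync (
        sku          TEXT PRIMARY KEY,
        shopware_id  TEXT,          -- NULL means "never synchronized"
        content_hash TEXT
    )
""")

def content_hash(product: dict) -> str:
    # Stable hash of the ERP payload so unchanged products can be skipped.
    return hashlib.sha256(json.dumps(product, sort_keys=True).encode()).hexdigest()

def needs_sync(sku: str, product: dict) -> bool:
    row = conn.execute(
        "SELECT shopware_id, content_hash FROM product_sync WHERE sku = ?", (sku,)
    ).fetchone()
    if row is None or row[0] is None:
        return True                          # unknown or never pushed to Shopware
    return row[1] != content_hash(product)   # changed since the last sync

def mark_synced(sku: str, shopware_id: str, product: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO product_sync (sku, shopware_id, content_hash) VALUES (?, ?, ?)",
        (sku, shopware_id, content_hash(product)),
    )
    conn.commit()
```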
Direct access to the db
This is not my recommendation if you do not have a team of developers. Direct access to the database gives you the fastest way to access that information. We actually do this for merchants with lots of product data, but you lose some benefits of the Sync API, such as the automatically queued events for updating entities in the catalog, prices, categories, etc. So I only recommend it for read access, which is what you want here.
About the UUID
The approach you described is essentially what UUID v5 is. Shopware uses UUID v4 internally, which is effectively a random 128-bit value, while v5 uses a SHA-1 hash of a namespace and a name to generate the ID deterministically. So this is good practice.
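As a small, hedged example, a deterministic ID can be derived from the SKU with a standard UUID v5 implementation; the namespace UUID below is an arbitrary constant you would pick once and never change:

```python
import uuid

# Arbitrary, fixed namespace for "our ERP product numbers" (pick once, keep forever).
PRODUCT_NAMESPACE = uuid.UUID("d94e9f3a-3c0e-4f0b-9a7e-2f6f6f1c1a10")

def product_id(sku: str) -> str:
    # UUID v5 = SHA-1 over (namespace + name); always the same for the same SKU.
    # Shopware 6 represents IDs as 32 lowercase hex characters without dashes.
    return uuid.uuid5(PRODUCT_NAMESPACE, sku).hex

print(product_id("SW-10001"))  # deterministic, no database lookup needed
```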

Shopware's API only allows updates by the primary key/ID, so updates by other (unique) fields are not supported, neither in the Sync API nor in the default CRUD API.
But providing the IDs for the products yourself instead of letting Shopware auto-generate them is totally valid, so your idea of generating the ID based on the product number sounds reasonable and feasible to me.
Another solution would be to store the mapping between the Shopware IDs and the SKUs somewhere you can access faster; this would probably mean storing the IDs either directly in the ERP (in some form of custom field, etc.) or in a separate mapping service. But I would prefer having a fast and consistent way of regenerating the IDs without the need to store the mapping (see the sketch below).
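Putting that together, a rough sketch of a batched upsert against the Sync API endpoint /api/_action/sync, assuming the deterministic product_id() helper from the UUID v5 sketch above and a valid OAuth access token; the payload fields are reduced to a minimum and would need whatever your catalog actually requires (prices, tax, categories, etc.):

```python
import requests

SHOPWARE_URL = "https://shop.example.com"   # illustrative
ACCESS_TOKEN = "..."                        # obtained via the integration OAuth flow

def upsert_products(erp_products: list[dict], batch_size: int = 500) -> None:
    headers = {
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    }
    for i in range(0, len(erp_products), batch_size):
        batch = erp_products[i : i + batch_size]
        payload = {
            "write-products": {
                "entity": "product",
                "action": "upsert",
                "payload": [
                    {
                        # Deterministic ID derived from the SKU (see the uuid5 sketch
                        # above), so no lookup by productNumber is needed.
                        "id": product_id(p["sku"]),
                        "productNumber": p["sku"],
                        "name": p["name"],
                        "stock": p["stock"],
                        # ... prices, tax, categories, etc. as required by your setup
                    }
                    for p in batch
                ],
            }
        }
        response = requests.post(
            f"{SHOPWARE_URL}/api/_action/sync",
            json=payload,
            headers=headers,
            timeout=60,
        )
        response.raise_for_status()
```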

You could use the Shopware CSV import, which has allowed mapping by product number for a while now. The CSV import can also be driven via the API.

Related

Two shops and sync clients between them with passwords

Is it possible to sync customers between two separate PrestaShop 1.7 shops? I don't want to use the multistore option. Is there a module for that, or maybe some database operations?
Customers are stored in a single database table (ps_customer), so if you can write a synchronization routine between the two tables, you should be able to achieve that.
There are several additional considerations, though:
Both stores must have the same "cookie_key" set in the site parameters for the same passwords to be valid in both shops, so you'll have to start with at least one empty store.
Customers are related to other tables through their id_customer auto_increment values (addresses, orders, third-party modules, etc.), so you'll need to know what you're doing and make sure the two shops can't have conflicting customer IDs (e.g. start one of the two shops with a very high id_customer). I'm also not sure whether you need to handle address synchronization as well; that would add some complexity.
I hope I've given you some good starting points, but I would stick with the native PrestaShop "multishop" feature for this; it would be far easier, despite still having a lot of bugs. :)
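For what it's worth, a rough sketch of such a one-way synchronization routine (Python with pymysql, purely for illustration). It assumes both shops already share the same cookie_key and use non-overlapping id_customer ranges as described above, and it copies rows generically so that columns other than id_customer and email don't have to be spelled out:

```python
import pymysql

# Connection settings are illustrative.
SRC = dict(host="shop-a.example", user="sync", password="secret", database="shop_a")
DST = dict(host="shop-b.example", user="sync", password="secret", database="shop_b")

def fetch_customers(conn):
    # Map customers by email so the same person is not inserted twice.
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SELECT * FROM ps_customer")
        return {row["email"]: row for row in cur.fetchall()}

def sync_customers_one_way():
    src = pymysql.connect(**SRC)
    dst = pymysql.connect(**DST)
    try:
        source = fetch_customers(src)
        existing = fetch_customers(dst)
        with dst.cursor() as cur:
            for email, row in source.items():
                if email in existing:
                    continue  # already present; update/conflict handling omitted
                # id_customer is copied as-is, which only works if the two shops use
                # non-overlapping auto_increment ranges (e.g. one shop started at a
                # very high id_customer, as suggested above).
                cols = ", ".join(f"`{c}`" for c in row)
                marks = ", ".join(["%s"] * len(row))
                cur.execute(
                    f"INSERT INTO ps_customer ({cols}) VALUES ({marks})",
                    list(row.values()),
                )
        dst.commit()
    finally:
        src.close()
        dst.close()
```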

Will Redis Lua scripting increase read performance in this case?

Let's suppose we have a simplified online store with the following entities: products, related products, and categories. The relations are pretty simple:
Products are linked to categories
Products have related products
Product prices are region-specific
All entities are stored as JSON strings in Redis, something like:
Products (and related products): {"productId":1, "title": "Product", "relatedProductIds":[4,5,6], "categoryIds":[1,2,3]}
Product prices: {"price": 555.5, "region": "eu"}
Category: {"categoryId": 12, "title": "Category"}
Our task is to fetch the whole tree from Redis for some product IDs: the products, their categories, their prices, their related products, the related products' categories, and the related products' prices. Doing this from the client is okay, but it requires multiple requests to Redis. The whole procedure can be described like this:
Fetch all required products and their prices and JSON-decode them. Now we know the primary products' category IDs and related product IDs.
Fetch the categories, the related products, and the related products' prices and JSON-decode them. Now we know the related products' category IDs.
Fetch the related products' categories.
It works, and it works really fast, but could this be done with Redis Lua scripting? I'm planning to do this job in Lua, combine one big JSON document (or several small ones), and return it to the client. Is it worth doing, and why?
Thank you, and sorry for my English.
Yes, a Lua script should probably get you much better latency than a couple of network hops.
Especially if you don't need to retrieve all the fields: this way you reduce the load on the network and return only the relevant fields.
By the way, you might want to check https://redisgears.io and https://redisjson.io in case you have more complex use cases that require Redis Cluster support or more complex JSON documents.
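As a hedged illustration of what such a script could look like: the key layout (product:&lt;id&gt;, price:&lt;id&gt;:&lt;region&gt;, category:&lt;id&gt;) is an assumption based on the JSON shapes in the question, and redis-py is used to register and call the script in one round trip:

```python
import json
import redis

# Lua executed atomically on the Redis server. ARGV[1] = region, ARGV[2..] = product ids.
FETCH_TREE_LUA = """
local region = ARGV[1]
local result = { products = {}, categories = {} }
local category_ids = {}

local function load_product(id)
  local raw = redis.call('GET', 'product:' .. id)
  if not raw then return nil end
  local product = cjson.decode(raw)
  local praw = redis.call('GET', 'price:' .. id .. ':' .. region)
  if praw then product.price = cjson.decode(praw) end
  for _, cid in ipairs(product.categoryIds or {}) do category_ids[cid] = true end
  return product
end

for i = 2, #ARGV do
  local product = load_product(ARGV[i])
  if product then
    product.related = {}
    for _, rid in ipairs(product.relatedProductIds or {}) do
      local rel = load_product(rid)
      if rel then table.insert(product.related, rel) end
    end
    table.insert(result.products, product)
  end
end

for cid, _ in pairs(category_ids) do
  local craw = redis.call('GET', 'category:' .. cid)
  if craw then table.insert(result.categories, cjson.decode(craw)) end
end

return cjson.encode(result)
"""

r = redis.Redis()
fetch_tree = r.register_script(FETCH_TREE_LUA)

# One round trip: region plus the primary product ids.
tree = json.loads(fetch_tree(keys=[], args=["eu", 1, 2, 3]))
```

Keep in mind the second answer's point: the script runs on Redis's single command-executing thread, so it should stay short and touch only the keys it needs.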
Lua script performance depends on what you do inside the script. Since Redis uses a single-threaded command executor, if your script runs for a long time it will block other commands and slow down Redis's responses.
For your case I would recommend querying the different pieces of data directly from Redis and handling their relations in your application. That way Redis is used only for data storage, not for logic, which keeps it simple and fast.

How to store persistent data without a database in a .NET Core API

I have a .NET Core 2.0 API. I want to store the total number of requests and also the number of requests made per user. I can't store this information in the database, because the API connects to many different databases dynamically depending on what the user needs. Also, I can't just use logging, because I want to retrieve these numbers through a request to the API.
The only thing I can think of is using a custom JSON file and continually updating it from middleware. But this seems cumbersome, and I feel like there's got to be an easier way to store small amounts of permanent data. Maybe there's a NuGet package someone can recommend?
I assume you could use an in-memory cache, depending on how long you want this data to be stored.
Otherwise, as suggested, your only choices are a file or a database.
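The JSON-file-plus-middleware idea from the question can work for small amounts of data as long as writes are serialized; a language-neutral sketch of that counter logic (written in Python here for brevity, with a lock standing in for whatever synchronization your middleware pipeline needs) might look like this:

```python
import json
import threading
from pathlib import Path

COUNTS_FILE = Path("request_counts.json")   # illustrative location
_lock = threading.Lock()                    # serialize writes from concurrent requests

def _load() -> dict:
    if COUNTS_FILE.exists():
        return json.loads(COUNTS_FILE.read_text())
    return {"total": 0, "per_user": {}}

def record_request(user_id: str) -> None:
    # Called once per request, e.g. from a middleware component.
    with _lock:
        counts = _load()
        counts["total"] += 1
        counts["per_user"][user_id] = counts["per_user"].get(user_id, 0) + 1
        COUNTS_FILE.write_text(json.dumps(counts))

def current_counts() -> dict:
    # Served by the stats endpoint the question mentions.
    with _lock:
        return _load()
```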

Data modeling Issue for Moqui custom application

We are working on a custom project management application on top of the Moqui framework. Our requirement is that any change to a ticket must be emailed to the developers associated with the project.
Currently we are using the WorkEffortParty entity to store all parties associated with the project and the PartyContactMech entity to store their email addresses. We have to iterate through WorkEffortParty and PartyContactMech every time to fetch all the email addresses to which ticket-change notifications should be sent.
To avoid these iterations, we are now thinking of adding a feature to store comma-separated email addresses at the project level. The project admin could add the email addresses of the associated parties, or a mailing list address, to which ticket-change notifications should be sent.
For this requirement we studied the data model, but we did not find the right place to store this information. Do we need to extend an entity for this, or is there a best practice? This requirement is useful in any project management application. We'd appreciate any help with this data modeling problem.
The best practice is to use existing data model elements as they are available. Having a normalized data model involves more work in querying data, but also more flexibility in addressing a wide variety of requirements without changes to the data structures.
In this case with a joined query you can get a list of email addresses in a single query based on the project's workEffortId. If you are dealing with massive data and message volumes there are better solutions than denormalizing source data, but I doubt that's the case... unless you're dealing with more than thousands of projects and millions of messages per day the basic query and iterate approach will work just fine.
If you need to go beyond that, the easiest approach with Moqui is to use a DataDocument and DataFeed to push updates on the fly to ElasticSearch, and then use that for your high-volume queries and filtering (with arbitrarily complex filtering requirements, etc.).
Your question is too open-ended to answer directly; data modeling is a complex topic, and without a good understanding of the context and intended usage there are no good answers. In general, it's best to start with a data model based on decades of experience and used in a large number of production systems. The Mantle UDM is one such model.

How to manage multiple versions of the same record

I am doing short-term contract work for a company that is trying to implement a check-in/check-out type of workflow for their database records.
Here's how it should work...
A user creates a new entity within the application. There are about 20 related tables that will be populated in addition to the main entity table.
Once the entity is created the user will mark it as the master.
Another user can make changes to the master only by "checking out" the entity. Multiple users can check out the entity at the same time.
Once the user has made all the necessary changes to the entity, they put it in a "needs approval" status.
After an authorized user reviews the entity, they can promote it to master which will put the original record in a tombstoned status.
The way they are currently accomplishing the "check out" is by duplicating the entity records in all the tables. The primary keys include EntityID + EntityDate, so they duplicate the entity records in all related tables with the same EntityID and an updated EntityDate and give it a status of "checked out". When the record is put into the next state (needs approval), the duplication occurs again. Eventually it will be promoted to master at which time the final record is marked as master and the original master is marked as dead.
This design seems hideous to me, but I understand why they've done it. When someone looks up an entity from within the application, they need to see all current versions of that entity. This was a very straightforward way for making that happen. But the fact that they are representing the same entity multiple times within the same table(s) doesn't sit well with me, nor does the fact that they are duplicating EVERY piece of data rather than only storing deltas.
I would be interested in hearing your reaction to the design, whether positive or negative.
I would also be grateful for any resources you can point me to that might be useful for seeing how someone else has implemented such a mechanism.
Thanks!
Darvis
I've worked on a system like this, which supported the static data for trading at a very large bank. The static data in this case is things like the details of counterparties, standard settlement instructions, currencies (not FX rates), etc. Every entity in the database was versioned, and changing an entity involved creating a new version, changing that version, and getting the version approved. They did not, however, let multiple people create versions at the same time.
This led to a horribly complex database, with every join having to take version and approval state into account. In fact, the software I wrote for them was middleware that abstracted this complex, versioned data into something that end-user applications could actually use.
The only thing that could have made it any worse was to store deltas instead of complete versioned objects. So the point of this answer is - don't try to implement deltas!
This looks like an example of a temporal database schema. Often, in cases like this, a distinction is made between an entity's key (EntityID, in your case) and the row's primary key in the database (in your case {EntityID, date}, but often a simple integer). You have to accept that the same entity is represented multiple times in the database, at different points in its history. Every database row still has a unique ID; it's just that your database is tracking versions rather than entities.
You can manage data like that, and it can be very good at tracking changes and providing accountability if that is required, but it makes all of your queries quite a bit more complex.
You can read about the rationale behind, and the design of, temporal databases on Wikipedia.
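As a small illustration of the versioned-row idea (SQLite here, purely for the sake of example): the business key EntityID is separate from the version discriminator, every change adds a full new row, and "the master" is simply the row whose status says so. Note how even the simplest read has to filter on status, which is the extra query complexity mentioned above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_version (
        entity_id    TEXT NOT NULL,   -- business key, shared by all versions
        entity_date  TEXT NOT NULL,   -- version discriminator (part of the PK)
        status       TEXT NOT NULL,   -- 'master', 'checked_out', 'needs_approval', 'dead'
        payload      TEXT NOT NULL,   -- full copy of the data, not a delta
        PRIMARY KEY (entity_id, entity_date)
    );
""")

# One entity, three versions: the original (now tombstoned) master, an edit that
# went through approval, and the promoted version that replaced the original.
conn.executemany(
    "INSERT INTO entity_version VALUES (?, ?, ?, ?)",
    [
        ("E-100", "2024-01-01", "dead",           '{"name": "v1"}'),
        ("E-100", "2024-02-01", "needs_approval", '{"name": "v2-draft"}'),
        ("E-100", "2024-03-01", "master",         '{"name": "v2"}'),
    ],
)

# Fetching the current master: every query (and every join) has to carry this
# kind of version/status filter, which is where the complexity creeps in.
row = conn.execute(
    "SELECT entity_date, payload FROM entity_version "
    "WHERE entity_id = ? AND status = 'master'",
    ("E-100",),
).fetchone()
print(row)
```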
You are describing a homebrew Content Management System which was probably hacked together over time, is - for the reasons you state - redundant and inefficient, and given the nature of such systems in firms is unlikely to be displaced without massive organizational effort.