I have 5 distinct tables, namely:
deposits
withdrawals
payments
transfers
exchanges
I need to merge these tables and return a paginated result from a Node.js API to clients. I don't know how to do this efficiently, given that the tables are really big (>100K rows each).
I was thinking of 2 approaches that I can take.
Stream individual query results from each table to the Node.js backend and implement some merging logic there.
OR
Somehow perform the merging at the DB level, by creating some sort of virtual table or persisted view, and run the query against that.
What do you guys think?
Thank you for your time!
If you plan to manipulate the data at the DB level, then from my perspective the most efficient way to accomplish this is to do the merging there: compose a dedicated SQL view, or prepare a query that combines the tables via UNION (with JOINs or subqueries as needed), and paginate over that (see the sketch at the end of this answer). That said, when pulling large datasets into Node.js, it is always good practice to adopt streaming. With the mssql package, for example:
const sql = require('mssql')
await sql.connect(config) // inside an async function; config holds your connection settings
const request = new sql.Request()
request.stream = true // you can set streaming differently for each request
request.query('select * from verylargetable')
request.on('row', row => { /* each row arrives here as it streams in */ })
If you expect to further use this data in other client applications, you may consider a CSV stringifier Node.js package, converting the SQL output into CSV text format.
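For the DB-level merge itself, here is a minimal sketch, assuming each of the five tables exposes comparable id, amount, and created_at columns (those column names are assumptions; adjust to your schema). T-SQL syntax, to match the mssql example above:

CREATE VIEW all_transactions AS
SELECT id, amount, created_at, 'deposit' AS kind FROM deposits
UNION ALL
SELECT id, amount, created_at, 'withdrawal' AS kind FROM withdrawals
UNION ALL
SELECT id, amount, created_at, 'payment' AS kind FROM payments
UNION ALL
SELECT id, amount, created_at, 'transfer' AS kind FROM transfers
UNION ALL
SELECT id, amount, created_at, 'exchange' AS kind FROM exchanges;
GO

-- Paginate over the merged view:
SELECT * FROM all_transactions
ORDER BY created_at DESC, id
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY;

With an index on created_at in each underlying table this paginates reasonably; for deep pages, keyset pagination (WHERE created_at < @lastSeen) scales better than a growing OFFSET.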
It is difficult to provide a solution without knowing the structure of the tables. You can create a view using INNER JOIN and select the attributes from that view to serve data to API requests, or you can use the INNER JOIN directly in the query, or return the data from a stored procedure.
Related
I am currently creating a webshop for my local parts store using PrestaShop 1.7.8.6.
I developed the scripts myself and have successfully made the website work correctly.
But with 2 million product rows, each with 30 columns, and multiple joins, I can't get a decent loading time on the getProducts query.
Even with indexes and cache...
I use a simple query on the PrestaShop product table, joining product ids from the car filter table to match against ps_product.
I would like to know if it would be better to create a table for each vehicle, keyed by an id and filled with ps_product data, and query that table alone instead of using multiple joins.
I'm using InnoDB as the engine.
Thanks
PrestaShop does not perform well with such a huge amount of data/products, as you have seen using the native methods, so the best option is to strengthen your MySQL setup.
Consider using one or more dedicated machines for MySQL (with replication), saving your data in external tables, or storing it in a distributed NoSQL/search system built to deal with large amounts of data (like Elasticsearch or similar) so you can scale it easily and write your own code/module to retrieve what you need.
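As a minimal MySQL sketch of the external/denormalized table idea from the question (table and column names here are assumptions, not PrestaShop's):

CREATE TABLE vehicle_product (
  id_vehicle INT NOT NULL,
  id_product INT NOT NULL,
  -- copy only the ps_product columns the listing actually needs
  name VARCHAR(255) NOT NULL,
  price DECIMAL(20,6) NOT NULL,
  PRIMARY KEY (id_vehicle, id_product)
) ENGINE=InnoDB;

-- The listing query then needs no joins at all:
SELECT id_product, name, price
FROM vehicle_product
WHERE id_vehicle = 42
LIMIT 0, 50;

One wide mapping table with a composite primary key avoids creating a table per vehicle while giving the same join-free reads; the cost is keeping the copied columns in sync with ps_product.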
For example, we have a web application that uses PostgreSQL. The application has an AuthorService that implements CRUD operations for the Author entity. AuthorService uses the "authors" table in the database.
Now we need to implement a BookService, which should fetch data from the "books" table. BookService must join in the Author entity.
If we use a SQL JOIN in the BookService, then we need to repeat some logic (code) from the AuthorService in the BookService, since the AuthorService contains the access control logic for the Author entity and the logic for generating the URLs of the author's photos (S3 signed URLs).
OR we can use the AuthorService inside the BookService to fetch the data and then join this data in the application instead of PostgreSQL (we can write a loop that joins entities), but in this case we may have performance problems.
Which option is better?
I feel the right place to do the JOIN is in the database, even if it means some extra code on the application side, as you said.
Joining in the application layer forfeits the optimizations the database optimizer could have applied had the join been done inside the DB. The optimizer chooses how to return records based on statistics on the tables, columns, and histogram values, plus a whole lot of other optimizations.
Take looping logic, for example. If we have a small table called dept and a large table called emp, and we perform the join between the two in the DB, it is most likely going to use a nested loop, which can be efficient since the large table needs to be traversed just once to get all matching records. And if the dept table is wide (many columns), the optimizer can choose to use an index and produce the same output efficiently.
In case both tables are large, the optimizer may choose a hash join or a sort-merge join instead.
Consider the alternative: if you were to join in your application, you would be using the same looping logic all the time (mostly a nested loop), or, if you implemented a sophisticated "join" algorithm yourself, you would be duplicating all the effort that has gone into building the database.
So the best option, in my humble opinion: use the DB for any set-related operations (JOIN, FILTER, AGGREGATION).
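For the BookService example above, a hedged PostgreSQL sketch of the DB-side join (table and column names beyond "authors" and "books" are assumptions); the AuthorService-specific steps, access-control checks and S3 URL signing, then run once over the joined rows in a single place:

SELECT b.id, b.title,
       a.id AS author_id, a.name AS author_name, a.photo_key
FROM books b
JOIN authors a ON a.id = b.author_id
WHERE b.id = $1; -- per-entity access control stays in the service layer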
We query a relational database using standardized SQL. The result of a query is a two-dimensional table: rows and columns.
I really like the clean structure of an RDBMS (honestly, I have never worked professionally with other DB systems). But the query language, or more exactly the result set SQL produces, is quite a limitation that affects performance in general.
Let's create a simple example: Customer - Order (1 - n)
I want to query all customers whose name starts with the letter "A" and who have an order this year, and display each of them with all of his/her orders.
I have two options to query this data.
Option 1
Load data with a single query with a join between both tables.
Downside: The result transferred to the client contains duplicated customer data, which is overhead.
Option 2
Query the customers, then run a second query to load their orders.
Downsides: two queries mean twice the network latency; the WHERE ... IN clause of the second query can potentially be very big, which could violate query length limits; and performance is not optimal because both queries perform a join/filter against orders.
There would of course be an option three, where we start the query from the orders table. For concreteness, the first two options are sketched below.
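A minimal sketch of the two options (schema and the date literal for "this year" are assumptions):

-- Option 1: one query; customer columns are repeated on every order row
SELECT c.*, o.*
FROM Customer c
JOIN "Order" o ON o.customer_id = c.id
WHERE c.name LIKE 'A%'
  AND o.created_at >= '2024-01-01';

-- Option 2: two queries; no duplication, but twice the latency
SELECT c.*
FROM Customer c
WHERE c.name LIKE 'A%'
  AND EXISTS (SELECT 1 FROM "Order" o
              WHERE o.customer_id = c.id
                AND o.created_at >= '2024-01-01');

SELECT o.*
FROM "Order" o
WHERE o.customer_id IN (1, 2, 3); -- ids returned by the first query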
So generally the problem is that we have to estimate, based on the specific situation, which trade-off is better: a single query with data overhead, or multiple queries with worse execution time. Both strategies can be bad in complex situations where a lot of well-normalized data has to be queried.
So ideally SQL would be able to return the result of a query in the form of an object structure. Imagine the result of the query being structured as XML or JSON instead of a table. If you have ever worked with an ORM like Entity Framework, you may know the "Include" command. With support for an "Include"-like command in SQL, returning the result not as a join but structured like an object, the world would be a better place. Another scenario would be an include-like query but without duplicates, so basically two tables in one result. To visualize it, results could look like:
{
  { customer 1 { order 1, order 2 } }
  { customer 2 { order 3, order 4 } }
}
or
{
  { customer1, customer2 }
  { order1, order2, order3, order4 }
}
MS SQL Server has a feature called "Multiple Result Sets" which I think comes quite close. But it is not part of standard SQL. Also, I am unsure whether ORM mappers really use that feature, and I assume it still executes two queries (but in one client-to-server request), instead of something like "select customers include orders From customers join orders where customers starts with 'A' and orders...".
Do you generally face the same problem? If so, how do you solve it? Do you know a database query language that can do this, maybe even with an existing ORM mapper supporting it (probably not)? I have no real working experience with other database systems, but I don't think the newer database systems address this problem (they address other problems, of course). What is interesting is that in graph databases, joins are basically free, as far as I understand.
I think you can alter your application workflow to solve this issue.
New application workflow:
Query the Customer table for customers starting with the letter 'A' and send the result to the client for display.
The user selects a customer on the client, which sends that customer's id back to the server.
Query the Order table by the customer id and send the result to the client for display.
Some SQL servers can return JSON. If a table B relates to a table A and every entry in B points to at most one entry in A, then you can reduce the traffic overhead you described. An example could be an address and its contacts.
SELECT * FROM Address
JOIN Contact ON Address.AddressId = Contact.AddressId
FOR JSON AUTO
The SQL result returned would be smaller:
"AddressId": "3396B2F8",
"Contact": [{
"ContactId": "05E41746",
... some other information
}, {
"ContactId": "025417A5",
... some other information
}, {
"ContactId": "15E417D5",
... some other information
}
}
]
But actually, I don't know of any ORM that processes JSON for traffic reduction.
If the same contacts appeared under several different addresses, it could be counterproductive.
Don't forget that JSON also has some overhead of its own and needs to be serialized and deserialized.
The optimum for traffic reduction would be if the SQL server split the joined result into multiple result sets and the client, or rather the object-relational mapper, mapped them back together. I would be interested to hear if you find a solution to your problem.
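Applied to the Customer/Order example from the question, a hedged T-SQL sketch (table and column names are assumptions) that avoids the duplicated customer columns could look like this:

SELECT c.CustomerId, c.Name,
       (SELECT o.OrderId, o.OrderDate
        FROM [Order] o
        WHERE o.CustomerId = c.CustomerId
        FOR JSON PATH) AS Orders
FROM Customer c
WHERE c.Name LIKE 'A%'
FOR JSON PATH;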
Another train of thought would be to use a graph database.
I am trying to do this:
Concatenate many rows into a single text string?
And I want to join the query results against other tables, so I want the CSV queries to be an indexed view.
I tried the CTE and XML queries to get the CSV results and created views using these queries. But SQL Server prevented me from creating an index on those views, because CTEs and subqueries are not allowed in indexed views.
Are there any other good ways to join a large CSV result set against other tables and still get fast performance? Thanks
Another way is to do the materialization yourself. Create a table with the required structure and fill it with the content of your SELECT. After that, track changes manually and keep the data in your "cache" table current. You can do this with triggers on ALL the tables involved in the base SELECT (synchronous, but a LOT of pain in complex systems) or by asynchronous processing (jobs, a self-written service, analysis of CDC logs, etc.).
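A minimal T-SQL sketch of the synchronous (trigger) variant, assuming a child table dbo.OrderTags whose rows are concatenated per OrderId (all names here are assumptions; STRING_AGG needs SQL Server 2017+, older versions can use the FOR XML PATH trick instead):

-- The manually maintained "cache" table
CREATE TABLE dbo.OrderTagsCsv (
    OrderId INT NOT NULL PRIMARY KEY,
    Tags    NVARCHAR(MAX) NOT NULL
);
GO
CREATE TRIGGER dbo.trg_OrderTags_Sync ON dbo.OrderTags
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Recompute the CSV only for parents touched by this statement
    MERGE dbo.OrderTagsCsv AS tgt
    USING (
        SELECT t.OrderId, STRING_AGG(t.Tag, ',') AS Tags
        FROM dbo.OrderTags t
        WHERE t.OrderId IN (SELECT OrderId FROM inserted
                            UNION SELECT OrderId FROM deleted)
        GROUP BY t.OrderId
    ) AS src
    ON tgt.OrderId = src.OrderId
    WHEN MATCHED THEN UPDATE SET Tags = src.Tags
    WHEN NOT MATCHED THEN INSERT (OrderId, Tags) VALUES (src.OrderId, src.Tags);
    -- Note: deleting a parent's last tag is not handled here, kept short on purpose
END

The cache table has a plain primary key, so joins against it stay fast; the price is the extra write cost on every change to the base table.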
I'm working with an architecture that requires mixing large (static) tables joined with very frequently changing data.
To emphasize the point, imagine the SO website's online-user data inner joined to its user database:
SELECT * FROM UserProfile INNER JOIN OnlineUser ON UserProfile.id = OnlineUser.id
where UserProfile resides in a large SQL table and OnlineUser is dynamic data on the web server.
Putting everything in memory takes up a lot of room, and putting everything in the database would really tax the server (I shudder to think of it). Is there a better way of doing this?
God Jon Skeet says LINQ can't cope with doing a join between an in-memory collection and a database table. He suggests a Contains clause or a list, neither of which would be appropriate in this case.
Edit:
An in-memory table (DBCC PINTABLE) in SQL Server could do this. Since that feature has been deprecated, is it safe to assume SQL Server 2008 will figure out on its own to keep the table in memory to reduce IO?
You have to do the actual join operation somewhere: either you bring the OnlineUser data to the SQL Server and perform the join there, or you bring the UserProfile data into memory and perform the join there.
Usually it's better to let the database server sort out the data needed and only read that into memory. One way to solve the problem is to create a temp table on the SQL Server, into which you put the data fields required for the join operation (in your example, OnlineUser.id).
Then execute a SQL query to get the required data into memory. Make sure you have indexes that speed up the filtering. Once retrieved into memory, match the two collections (e.g. by having them sorted on the same key and then using the LINQ Zip operator).
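A minimal T-SQL sketch of that temp-table approach (names follow the question's example; how the ids get into the temp table, e.g. a bulk insert from the web server, is left out):

-- Push just the join keys of the in-memory OnlineUser data to the server
CREATE TABLE #OnlineUserIds (id INT NOT NULL PRIMARY KEY);

-- (bulk-insert the online-user ids here)

-- Let the server do the join and return only the matching profiles
SELECT p.*
FROM UserProfile AS p
INNER JOIN #OnlineUserIds AS o ON o.id = p.id;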