SQL -> MongoDB Export Performance Issues - sql

I'm trying to set up an automated process to regularly transform and export a large MS SQL 2008 database to MongoDB.
There is not a 1-1 correspondence between tables in SQL and collections in MongoDB -- for example the Address table in SQL is translated into an array embedded in each customer's record in Mongo and so on.
Right now I have a 3-step process:
1. Export all the relevant portions of the database to XML using a FOR XML query.
2. Translate the XML to mongoimport-friendly JSON using XSLT.
3. Import into MongoDB using mongoimport.
The bottleneck right now seems to be #2. XML->JSON conversion for 3 million customer records (each with demographic info and embedded address and order arrays) takes hours with libxslt.
It seems hard to believe that there's not already some pre-built way to do this, but I can't seem to find one anywhere.
Questions:
A) Are there any pre-existing utilities I could use to do this?
B) If no, is there a way I could speed up my process?
C) Am I approaching the whole problem the wrong way?

Another approach is to go through each table and add information to MongoDB on a record-by-record basis, letting MongoDB do the denormalizing. For instance, to add each phone number, just go through the phone number table and do a '$addToSet' for each phone number on the matching customer record.
You can also process the tables separately and in parallel. This may speed things up but may 'fragment' the MongoDB database more.
You may want to add any required indexes before you start; otherwise, building the indexes at the end may cause a long delay.
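The record-by-record idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample phone rows, the `customer_id` and `number` field names, and the `phones` array name are all hypothetical, and the operations are built as plain dictionaries so the sketch runs without a database. With a driver such as PyMongo, each pair could be wrapped in `UpdateOne(filter, update, upsert=True)` and sent in batches through `collection.bulk_write`.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Operation = Tuple[Dict[str, Any], Dict[str, Any]]

# Hypothetical rows as they might come out of a SQL phone number table.
PHONE_ROWS: List[Dict[str, Any]] = [
    {"customer_id": 1, "number": "555-0100"},
    {"customer_id": 1, "number": "555-0101"},
    {"customer_id": 2, "number": "555-0200"},
]


def phone_updates(rows: Iterable[Dict[str, Any]]) -> Iterator[Operation]:
    """Yield one (filter, update) pair per SQL row.

    $addToSet adds the number to the customer's array only if it is
    not already there, so re-running the export does not duplicate it.
    """
    for row in rows:
        yield (
            {"_id": row["customer_id"]},
            {"$addToSet": {"phones": row["number"]}},
        )


def batched(ops: Iterable[Operation], size: int) -> Iterator[List[Operation]]:
    """Group operations into lists of at most `size` for bulk submission."""
    batch: List[Operation] = []
    for op in ops:
        batch.append(op)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Batching matters here because sending one update per round trip for millions of rows would likely be dominated by network latency rather than by the database itself.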

Related

Google BigQuery move to SQL Server, Big Data table optimisation

I have a curious question and, as my name suggests, I am a novice, so please bear with me. Hi to you all; I have learned so much using this site already.
I have an MSSQL database for customers where I am trying to track their status on a daily basis, with various attributes being recorded in several tables, which are then joined together using a data table to create a master table which yields approximately 600 million rows.
As you can imagine, querying this beast on a middling server (Intel i5, SSD for the OS, 2 TB 7200 rpm HDD, SQL Server 2017 Standard) is really slow. I was using Google BigQuery, but that got expensive very quickly. I have implemented indexes, which have sped up the process somewhat, but it is still not fast enough. A simple select distinct on customer id for a given attribute still takes 12 minutes on average for a first run.
The whole point of having a daily view is to let something like Tableau or Qlik connect to a single table, so that the end user can create reports by just dragging the required columns. I have thought of taking the main query that creates the master table and parameterizing it, but visualization tools aren't great at passing many variables.
This is a snippet of the table, there are approximately 300,000 customers and a row per day is created for customers who join between 2010 and 2017. They fall off the list if they leave.
My questions are:
1) Should I even bother creating a flat file, or should I just parameterize the query?
2) Are there any techniques I can use, aside from setting the smallest data types for each column, to keep the DB size to a minimum?
3) There are in fact over a hundred attribute columns, and a lot of them seldom change once they are set to either 0 or 1. Is there another way to achieve this and save space?
4) What types of indexes should I have on the master table if many of the attributes are binary?
Any ideas would be gratefully received.

Is SQL Server Express 14 appropriate to use for conducting calculations over tables?

I am building a factory QC fixture that measures, analyzes, and stores data on the physical dimensions of products leaving a factory. The raw data for each measured product starts off as a table with 5 columns, and up to 86000 rows. To get useful information, this table must undergo some processing. This data is collected in LabVIEW, but stored in an SQL server database. I want to ask whether it's best to pass the data to the server and process it in there (via stored procedure), or process it outside the server and then add it in?
Let me tell you about the processing to be done:
To get meaningful information from the raw data, each record in the table needs to be passed into a function that calculates parameters of interest. The function also uses other records (I'll call them secondary records) from the raw table. The contents of the record originally passed into the function dictate what secondary records the function uses to perform calculations. The function also utilizes some trigonometric operators. My concern is that SQL will be very slow or buggy when doing calculations over a big table. I am not sure if this sort of task is something SQL is efficient at doing, or if I'm better off trying to get it done through the GPU using CUDA.
Hopefully I'm clear enough on what I need, please let me know if not.
Thanks in advance for your answers!
Generally we need SQL Server to help us sort, search, index, and update data, and to share that data with multiple users (according to their privileges) at the same time. I see no work for SQL Server in the task you've described. It looks like no one needs any of those 86000 rows before they've been processed.

Efficiency of analysing exported SQL data sheet in SQL server

Apologies for the long title.
I have been given a large flat file to be analysed in SQL server, which was generated by another SQL database which I do not have direct access to. Due to the way the query had been generated, there are over 5000 different rows, for only 900 unique objects.
My question is a straightforward one: I am not attempting to create a long-term database. Would it be more time-efficient to split this back into separate tables to re-run queries, or would it be easier to analyse it as is?
5,000 rows is not a large table. The time and effort to reverse engineer a "normalised" database, and write and debug the ETL, will not be repaid in reduced runtime. Index it well for your queries and remember to put a DISTINCT in where it's required. You'll be fine.
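To make the DISTINCT advice concrete, here is a small sketch. It uses an in-memory SQLite database purely as a stand-in for SQL Server, and the `flat_export` table, its columns, and the sample rows are all hypothetical; the point is only that a repeated-object flat table can be queried directly once DISTINCT is applied.

```python
import sqlite3

# SQLite stands in for SQL Server; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flat_export (object_id INTEGER, object_name TEXT, event TEXT)"
)
conn.executemany(
    "INSERT INTO flat_export VALUES (?, ?, ?)",
    [
        (1, "pump", "installed"),
        (1, "pump", "serviced"),
        (2, "valve", "installed"),
        (2, "valve", "replaced"),
        (2, "valve", "serviced"),
    ],
)

# Index the column the analysis queries filter or group on.
conn.execute("CREATE INDEX ix_flat_export_object ON flat_export (object_id)")

# A plain row count overstates the number of objects,
# because each object repeats once per joined detail row.
row_count = conn.execute("SELECT COUNT(*) FROM flat_export").fetchone()[0]

# DISTINCT collapses the repeats back to one entry per object.
object_count = conn.execute(
    "SELECT COUNT(DISTINCT object_id) FROM flat_export"
).fetchone()[0]
objects = conn.execute(
    "SELECT DISTINCT object_id, object_name FROM flat_export ORDER BY object_id"
).fetchall()
```

In this toy data, the plain count sees five rows while the DISTINCT queries see two objects, which is the same kind of gap as 5000 rows versus 900 objects in the question.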

How to model data for a CouchDB geocoder

I am working on a CouchDB-based geocoding application using a large national dataset that is supplied relationally. There are some 250 million records split over 9 tables (The ER Diagram can be viewed at http://bit.ly/1dlgZBt). I am quite new to NoSQL document databases, and CouchDB in particular, and am considering how to model this. I have currently loaded the data into a CouchDB database per table with a type field indicating which kind of record it is. The _id attribute is set to be the primary key for tables [A] and [C]; for everything else it is auto-generated by Couch. I plan on setting up Lucene with Couch for indexing and full-text search. The X and Y point coordinates are all stored in table [A], but to find these I will need to search using data in [Table E], [Tables B, C & D combined] and/or [Table I], with the option of filtering results based on data in [Table F].
My original intention was to create a single CouchDB database which would combine all of these tables into a single structure with [Table A] as the root and all related tables nested under this. I would then build my various search indexes on this and also set up a spatial index using GeoCouch for reverse geocoding. However, I have read articles that suggest view collation as an alternative approach.
An important factor here I guess is reads vs writes. The plan is that this data will never be updated, only read. Data is released every quarter at which time the existing DB would be blown away and a new DB created.
I would welcome any suggestions for how best to set up and organise this from any experienced Couch or related document database users.
Many thanks in advance for any assistance.
guygrange,
While I am far from an expert in document database design, the key thing to recognize about document DBs is that everything is about making your queries fast by keeping all of the necessary information in a single document. Hence, you need to look at your queries and how you expect to access this data. For example, I can easily imagine that a geocoding application would not need access to everything in each table for its most frequent queries. Hence, to save on bandwidth, you would make a main document that holds the information you most frequently care about, along with a key for the rest of the appropriate data. Then you could fetch the remaining data with that key and merge the dictionaries for easy management in your client code.
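A rough sketch of that main-document-plus-key pattern follows. Plain in-memory dictionaries stand in for the database here, and every document id, field name, and value is hypothetical; in a real client the two lookups would be document fetches by id.

```python
from typing import Any, Dict

Document = Dict[str, Any]

# In-memory stand-ins for the database; all ids and fields are hypothetical.
MAIN_DOCS: Dict[str, Document] = {
    "addr:1001": {
        "_id": "addr:1001",
        "type": "address",
        "x": 530000.0,
        "y": 180000.0,
        "detail_id": "detail:1001",  # key pointing at the rarely needed data
    },
}
DETAIL_DOCS: Dict[str, Document] = {
    "detail:1001": {
        "_id": "detail:1001",
        "type": "address_detail",
        "street": "High Street",
        "classification": "residential",
    },
}


def fetch_main(doc_id: str) -> Document:
    """Cheap, frequent lookup: only the small main document is read."""
    return MAIN_DOCS[doc_id]


def fetch_full(doc_id: str) -> Document:
    """Occasional lookup: follow the key and merge both dictionaries."""
    main = fetch_main(doc_id)
    detail = DETAIL_DOCS[main["detail_id"]]
    # Drop the detail document's own identity fields so they do not
    # overwrite the main document's _id and type during the merge.
    extra = {k: v for k, v in detail.items() if k not in ("_id", "type")}
    return {**main, **extra}
```

The frequent query path touches only `fetch_main`, so the bandwidth saving comes from how small the main document can be kept.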
Anon,
Andrew

Improving query performance of database table with large number of columns and rows (50 columns, 5mm rows)

We are building a caching solution for our user data. The data is currently stored in Sybase and is distributed across 5 - 6 tables, but the query service built on top of it uses Hibernate and we are getting very poor performance. Loading the data into the cache would take in the range of 10 - 15 hours.
So we have decided to create a denormalized table of 50 - 60 columns and 5mm rows in another relational database (UDB), populate that table first, and then populate the cache from the new denormalized table using JDBC so that the time to build the cache is lower. This gives us much better performance and we can now build the cache in around an hour, but this still does not meet our requirement of building the cache within 5 mins. The denormalized table is queried using the following query
select * from users where user_id in (...)
Here user_id is the primary key. We also tried the query
select * from users where user_location in (...)
and created a non-unique index on user_location, but that did not help either.
So is there a way we can make the queries faster? If not, we are also open to considering some NoSQL solutions.
Which NoSQL solution would be suited to our needs? Apart from the large table, we would be making around 1mm updates to the table on a daily basis.
I have read about MongoDB and it seems that it might work, but no one has posted any experience with MongoDB with so many rows and so many daily updates.
Please let us know your thoughts.
The short answer here, relating to MongoDB, is yes - it can be used in this way to create a denormalized cache in front of an RDBMS. Others have used MongoDB to store datasets of similar (and larger) sizes to the one you described, and can keep a dataset of that size in RAM. There are some details missing here in terms of your data, but it is certainly not beyond the capabilities of MongoDB and is one of the more frequently used implementations:
http://www.mongodb.org/display/DOCS/The+Database+and+Caching
The key will be the size of your working data set and therefore your available RAM (MongoDB maps data into memory). For larger solutions, write-heavy scaling, and similar issues, there are several approaches (sharding, replica sets) that can be employed.
With the level of detail given it is hard to say for certain that MongoDB will meet all of your requirements, but given that others have already done similar implementations and based on the information given there is no reason it will not work either.