Does anyone have an experience with large data sets in Data Studio?
I want to use a data that is close to 40 million rows and a dozen columns. I was trying to check it by myself but after connection to BigQuery query a Configuration error occurred.
If you have a dataset stored in BigQuery, Data Studio should have no problem handing it - through BigQuery. Size shouldn't really be a problem.
I've noticed that when Data Studio accesses a BigQuery table, it is limited to 20,000,000 rows.
Specifically, LIMIT 20000000 is applied to the actual query in BQ and there is no way to configure / change that (to the best of my knowledge).
Related
We are working building a new data pipeline for our project and we have to move incremental updates that happen throughout the day on our SQL servers into Azure synapse for some number crunching.
We have to get updates which occur across 60+ tables ( 1-2 million updates a day ) into synapse to crunch some aggregates and statistics as they happen throughout the day.
One of the requirements is being near real time and doing a bulk import into synapse is not ideal because it takes more than 10 mins to do full compute on all data.
I have been reading about CDC feed into synapse https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-data-capture-feature-portal and it is one possible solution.
Wondering if there are other alternatives to this or suggestions for achieving the end goal of data crunching near real time for DB updates.
Change Data Capture (CDC) is the suited way to capture the changes and add to the destination location (storage/database).
Apart from that, you can also use watermark column to capture the changes in multiple tables in SQL Server.
Select one column for each table in the source data store, which you
can identify the new or updated records for every run. Normally, the
data in this selected column (for example, last_modify_time or ID)
keeps increasing when rows are created or updated. The maximum value
in this column is used as a watermark.
Here is the high-level solution diagram for this approach:
Step-by-Step approach is given in this official document Incrementally load data from multiple tables in SQL Server to Azure SQL Database using PowerShell.
Our current data set is not friendly in terms of looking at historic records. I can see what a value for an account is at the time of execution but if I want to look up last month's counts and values that's often lost. To fix this I want to take a "snapshot" of our data by running it at specific times and storing the results in the cloud. We're looking at just over 30,000 records and I'd only run it at the end of the month keeping 12 separate months at a time so the count doesn't get too high.
I can't seem to find anything about how I could do this so I'm hoping someone has experience or knowledge and would like to share.
FYI we're using an on premise oracle DB.
Thanks!
You can use Azure Data Factory (ADF) to schedule a monthly run for a pipe that execute a query/stored procedure against your Azure SQL and writes the data to Azure Storage.
Problem:
I need to get data sets from CSV files into SQL Server Express (SSMS v17.6) as efficiently as possible. The data sets update daily into the same CSV files on my local hard drive. Currently using MS Access 2010 (v14.0) as a middleman to aggregate the CSV files into linked tables.
Using the solutions below, the data transfers perfectly into SQL Server and does exactly what I want. But I cannot figure out how to refresh/update/sync the data at the end of each day with the newly added CSV data without having to re-import the entire data set each time.
Solutions:
Upsizing Wizard in MS Access - This works best in transferring all the tables perfectly to SQL Server databases. I cannot figure out how to update the tables though without deleting and repeating the same steps each day. None of the solutions or links that I have tried have panned out.
SQL Server Import/Export Wizard - This works fine also in getting the data over to SSMS one time. But I also cannot figure out how to update/sync this data with the new tables. Another issue is that choosing Microsoft Access as the data source through this method requires a .mdb file. The latest MS Access file formats are .accdb files so I have to save the database in an older .mdb version in order to export it to SQL Server.
Constraints:
I have no loyalty towards MS Access. I really am just looking for the most efficient way to get these CSV files consistently into a format where I can perform SQL queries on them. From all I have read, MS Access seems like the best way to do that.
I also have limited coding knowledge so more advanced VBA/C++ solutions will probably go over my head.
TLDR:
Trying to get several different daily updating local CSV files into a program where I can run SQL queries on them without having to do a full delete and re-import each day. Currently using MS Access 2010 to SQL Server Express (SSMS v17.6) which fulfills my needs, but does not update daily with the new data without re-importing everything.
Thank you!
You can use a staging table strategy to solve this problem.
When it's time to perform the daily update, import all of the data into one or more staging tables. Execute SQL statement to insert rows that exist in the imported data but not in the base data into the base data; similarly, delete rows from the base data that don't exist in the imported data; similarly, update base data rows that have changed values in the imported data.
Use your data dependencies to determine in which order tables should be modified.
I would run all deletes first, then inserts, and finally all updates.
This should be a fun challenge!
EDIT
You said:
I need to get data sets from CSV files into SQL Server Express (SSMS
v17.6) as efficiently as possible.
The most efficient way to put data into SQL Server tables is using SQL Bulk Copy. This can be implemented from the command line, an SSIS job, or through ADO.Net via any .Net language.
You state:
But I cannot figure out how to refresh/update/sync the data at the end
of each day with the newly added CSV data without having to re-import
the entire data set each time.
It seems you have two choices:
Toss the old data and replace it with the new data
Modify the old data so that it comes into alignment with the new data
In order to do number 1 above, you'd simply replace all the existing data with the new data, which you've already said you don't want to do, or at least you don't think you can do this efficiently. In order to do number 2 above, you have to compare the old data with the new data. In order to compare two sets of data, both sets of data have to be accessible wherever the comparison is to take place. So, you could perform the comparison in SQL Server, but the new data will need to be loaded into the database for comparison purposes. You can then purge the staging table after the process completes.
In thinking further about your issue, it seems the underlying issue is:
I really am just looking for the most efficient way to get these CSV
files consistently into a format where I can perform SQL queries on
them.
There exist applications built specifically to allow you to query this type of data.
You may want to have a look at Log Parser Lizard or Splunk. These are great tools for querying and digging into data hidden inside flat data files.
An Append Query is able to incrementally add additional new records to an existing table. However the question is whether your starting point data set (CSV) is just new records or whether that data set includes records already in the table.
This is a classic dilemma that needs to be managed in the Append Query set up.
If the CSV includes prior records - then you have to establish the 'new records' data sub set inside the CSV and append just those. For instance if you have a sequencing field then you can use a > logic from the existing table max. If that is not there then one would need to do a NOT compare of the table data with the csv data to identify which csv records are not already in the table.
You state you seek something 'more efficient' - but in truth there is nothing more efficient than a wholesale delete of all records and write of all records. Most of the time one can't do that - but if you can I would just stick with it.
I am new to Tableau, and having performance issues and need some help. I have a hive query result in Azure Blob Storage named as part-00000.
The issue having this performance is I want to execute the custom query in Tableau and generates the graphical reports at Tableau.
So can I do this? How ?
I have 7.0 M Data in Hive table.
you can find custom query in data source connection check linked image
You might want to consider creating an extract instead of a live connection. Additional considerations would include hiding unused fields and using filters at the data source level to limit data as per requirement.
I have a table in a SQL Server database which contains large amount of data, around 2 million records (~approx 20 columns for each row). The data in this table gets overridden at the end of each day with new data.
Once the new data is available I need to copy this data from the SQL Server database to a MongoDB table.
The question is on the way by which it can be achieved the fastest?
Some options :
A simple application that reads and writes
Some sort of export/import tool.
Generating a\multiple file\s from SQL and then reading concurrently to import in MongoDB
From my expirience:
A simple application that reads and writes.
Will be the slowest.
Some sort of export/import tool.
Should be much faster than the first option. Take a look at the bcp utility to export data from SQL and then import data with mongoimport. However, the way you store data in mongo might differ a lot from the SQL one so it might be quite a challenge to do the mapping with export/import tools.
Generating a\multiple file\s from SQL and then reading concurrently to
import in MongoDB
Paralleling might speed up the procees a bit but I don't think you will be satisfied with the results.
From your question the data gets overriden at the end of each day. Not sure how you do it now but I think it makes sense to write data to both SQL and Mongo at that time. This way you won't have to query the data from SQL again to update Mongo. You will be just writing to mongo at the same time you are updating SQL.
Hope it helps!