Simple Question about how Tableau Desktop talks to a very large database - sql

I am just curious as to how Tableau talks to a large data source. For example, if I have a data source that has 1.4 million records and I make a simple table with this data, maybe a graph etc., how does Tableau get this data? Does it query the data source, ask the data source how much it has, pull in the first 10,000, then go back and retrieve the next 10k, and so on? Or does it do it in one go? I also want to know where Tableau stores the data it receives.
Hope my question makes sense - Just trying to understand the underlying mechanisms.
Thank you!

Tableau can work with external data sources in more than one way. You can extract the entire DB content to a local file (called an extract) or you can have a live connection to the database.
If the connection is live, then Tableau sends the DB queries designed to return the data you want, not the entire content of the DB. So if you have 1.4m records containing, say, a full year's sales information and you want monthly totals, Tableau will send a query asking the DB to return the monthly totals. This results in just 12 numbers being returned to Tableau: the DB itself does the work, and Tableau doesn't need to pull 1.4m numbers and add them up. This is how most data sources work: the user requests a result (using SQL queries) and the DB works out how to return that result. This means you don't need to copy the entire database every time you want to add some numbers up.
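To give a rough idea, for a live connection the generated query would look something like the sketch below (the table and column names are made up for illustration, and the exact SQL Tableau emits depends on the data source):

SELECT EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount)                   AS monthly_total
FROM   sales
GROUP  BY EXTRACT(MONTH FROM sale_date);   -- only 12 rows come back, one per month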
Live queries won't sample the database: the answers you get will usually be the correct totals (though some sources like Google's BigQuery will use sampling for some statistical aggregates unless told otherwise).
Both Tableau and many databases cache the results of recent queries so that repeated requests are faster. Tableau's cache is held locally.

Related

Azure SQL reporting on last READ records

Is it possible to generate a report from Azure MS SQL Server which shows which records in a table were last read from?
We have a table which we would like to begin cleaning records out of and it would be useful to know which data it contains that is no longer used by the client application. Unfortunately, it does not contain a datetime field which shows when the records were last accessed.
It is not a feature in SQL Server. The reason is that it would make the database a lot slower if we turned every read into a write: since we have to log everything, we'd generate tons of log write traffic. There is a feature called Temporal Tables which doesn't quite do what you ask, but it does have start/end dates for rows. You could track when you no longer want to see a row, and then it would go into the history table; you can then remove rows from the history table after some period of non-use. The retention feature can be seen here, and you can read a conceptual overview of temporal tables here.
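For reference, a minimal sketch of a system-versioned temporal table with a retention policy (the table, column and history-table names here are just examples):

CREATE TABLE dbo.Orders
(
    OrderId   INT          NOT NULL PRIMARY KEY CLUSTERED,
    Status    NVARCHAR(20) NOT NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (
    HISTORY_TABLE = dbo.OrdersHistory,
    HISTORY_RETENTION_PERIOD = 6 MONTHS   -- old history rows are cleaned up automatically
));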

Google BigQuery move to SQL Server, Big Data table optimisation

I have a curious question, and as my name suggests I am a novice, so please bear with me. Oh, and hi to you all - I have learned so much using this site already.
I have an MSSQL database for customers where I am trying to track their status on a daily basis, with various attributes recorded in several tables, which are then joined together using a data table to create a master table that yields approximately 600 million rows.
As you can imagine, querying this beast on a middling server (Intel i5, OS on an SSD, 2 TB 7200 rpm HDD, SQL Server 2017 Standard) is really slow. I was using Google BigQuery, but that got expensive very quickly. I have implemented indexes, which have somewhat sped up the process, but it is still not fast enough: a simple SELECT DISTINCT on customer id for a given attribute still takes 12 minutes on average for a first run.
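To illustrate, the kind of query I mean looks roughly like this (the table and column names are made up):

SELECT DISTINCT CustomerId
FROM   dbo.MasterDaily        -- the ~600 million row master table
WHERE  SomeAttribute = 1;     -- one of the many 0/1 attribute columns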
The whole point of having a daily view is to make it easier to have something like Tableau or QLIK connect to a single table, so the end user can create reports by just dragging in the required columns. I have thought of taking the main query that creates the master table and parameterizing it, but visualization tools aren't great for passing many variables.
This is a snippet of the table; there are approximately 300,000 customers, and a row per day is created for customers who joined between 2010 and 2017. They fall off the list if they leave.
My questions are:
1) Should I even bother creating a flat file, or should I just parameterize the query?
2) Are there any techniques I can use, aside from setting the smallest data types for each column, to keep the DB size to a minimum?
3) There are in fact over a hundred attribute columns; a lot of them, once they are set to either 0 or 1, seldom change. Is there another way to store these and save space?
4) What types of indexes should I have on the master table if many of the attributes are binary?
Any ideas would be gratefully received.

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop. We want to keep a record of how many views these products had over a 1-year period, and we want to record the views at least every 2 hours. The question is what structure to use for this task?
Right now we have tried keeping stats for the last 30 days in records that have 2 columns, classified_id and stats, where stats is a stripped-down JSON with the format date:views,date:views... For example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back, since if you want to get the sum of views of, say, 1000 products, you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are thinking of going right now is just to have a massive table with 3 columns, classified_id, date and views, and store each recording on its own row. This of course will result in a huge table with hundreds of millions of rows; for example, if we have 1.8 million classifieds and keep records 24/7, every 2 hours, for one year, we need
1,800,000 * 365 * 12 = 7,884,000,000 (billions with a B) rows, which, while way inside the theoretical limit of Postgres, means that queries on it (say for updating the views), even with the correct indices, will probably take some time.
Any suggestions? I can't even imagine how Google Analytics stores the stats...
This number is not as high as you think. In my current work we store metrics data for websites and the total number of rows we have is much higher. And in a previous job I worked with a pg database which collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data - most probably by day. With this amount of data you may find indexes quite useless; it depends on the plans you see in the EXPLAIN command output. For example, that telco app did not use any indexes at all because they would just slow down the whole engine.
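A minimal sketch of daily partitioning, using PostgreSQL 10+ declarative partitioning and the 3-column layout proposed in the question (on older versions you would use inheritance plus triggers instead):

CREATE TABLE views_log (
    classified_id BIGINT      NOT NULL,
    recorded_at   TIMESTAMPTZ NOT NULL,
    views         INTEGER     NOT NULL
) PARTITION BY RANGE (recorded_at);

-- one partition per day, typically created by a scheduled script or a tool like pg_partman
CREATE TABLE views_log_2016_05_12 PARTITION OF views_log
    FOR VALUES FROM ('2016-05-12') TO ('2016-05-13');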
Another question is how quickly you need queries to respond, and which steps of granularity (sums over hours/days/weeks etc.) you will allow users to query. You may even need to pre-compute some aggregations for granularities like week, month or quarter.
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserts of ~23,000 records per second, done as bulk inserts with the COPY command; every batch was several thousand records. Raw data were partitioned by minute. To avoid disk waits the DB had 4 tablespaces on 4 different disks/arrays and partitions were distributed over them. PostgreSQL was able to handle it all without any problems, so you should think about proper HW configuration too.
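A sketch of the kind of bulk load meant here (the file path, table and column names are assumptions):

-- server-side bulk insert of one batch of raw data
COPY views_log_2016_05_12 (classified_id, recorded_at, views)
FROM '/data/incoming/views_20160512.csv' WITH (FORMAT csv);
-- from a client, psql's \copy does the same thing but streams the file over the connection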
It is also a good idea to move the pg_xlog directory to a separate disk or array. Not just a different filesystem - it really must be separate hardware. I can recommend SSDs only in arrays with proper error checking; lately we had problems with a corrupted database on a single SSD.
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in a column-oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as #JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables, but read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
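For example, a hedged sketch (note that an UNLOGGED table skips the write-ahead log, so its contents are lost after a crash and are not replicated):

CREATE UNLOGGED TABLE views_log_raw (
    classified_id BIGINT      NOT NULL,
    recorded_at   TIMESTAMPTZ NOT NULL,
    views         INTEGER     NOT NULL
) WITH (fillfactor = 100);   -- pack pages completely; fine for append-only data never updated in place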
Just to raise another non-RDBMS option for you (so a little off topic), you could send files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query them directly using SQL.
Since it can query plain text files, you may be able to just send it your unfiltered web logs and query them through JDBC.
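A minimal Athena DDL sketch for CSV logs in S3 (the bucket, path and columns are assumptions):

CREATE EXTERNAL TABLE views_log (
    classified_id BIGINT,
    recorded_at   STRING,
    views         INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/views-logs/';   -- Athena then queries the files in place with ordinary SQL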

How do I create an Excel Pivot connected to an Access DB that downloads only the queried data?

I have a table of around 60 columns and 400,000 rows and increasing. Our company laptops and MS Excel cannot handle this much data in RAM. So I decided to store the data in MS Access and link it to Excel.
However, the pivot in Excel still downloads all the data into Excel and then performs the filters and operations on it. That worked with less data, but with more data it has now started giving memory errors. Also, even though the data in the pivot might be only 50 cells, the file size is 30+ MB...
So is it possible to create a connection to Access in such a way that it downloads only the data that is queried, does the operations beforehand and then sends the revised data to Excel?
I saw this setup in my previous company (where the Excel pivot would only download what it needed). But it was querying an SQL DB as far as I remember. (Sadly couldn't learn more about it since the IT director was intent on being the only guy who knew core operations (He basically had the company's IT operations hostage in exchange for his job security))... But I digress.
I've tried searching for this on the internet for a few days, but it's a very specific problem that I can't find in Google :/
Any help or even pointers would be highly appreciated!
Edit: I'd just like to point out that I'm trying to create an OLAP connection for analysis, so the pivot would be changing fields. My understanding of how pivots work was that when we select the fields in the pivot, Excel would design a query (based on the selected fields) and send it to the connected DB to retrieve the data requested. If this is not how it happens, how do I make something like this happen? Hope that elaborates.
I suppose that you created a single massive table in Access to store all your data, so if you just link that table as the data source, Excel won't know which particular bit of data is relevant and will most probably have to go through all of it itself.
Instead, you can try a combination of different approaches:
Create a query that pre-filters the data in Access and link that query to Excel (see the example query after this list).
Use a SQL Command Type for your Connection Properties instead of a Table.
Test that query in Access to make sure it runs well and is fast enough.
Make sure that all important fields have indexes: fields you filter on, fields you group by, and any field that Excel has to go through to decide whether a row should be included in the pivot should have a sensible index.
Make sure that you have set a Primary Key in your table(s) in Access. Just use the default auto-increment ID if it's not already used.
If all else fails, break down that huge table: it's not so much the number of records that's the problem, it's more the high number of columns.
If you use calculated fields in your pivot or filter data based on some criteria, consider adding columns to your table(s) in Access that contain pre-calculated data. For instance, you could run an update query in Access to populate these additional fields, or add some VBA to do that.
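For example, the pre-filtering query from the first point could be saved in Access and linked to Excel, so that Excel only ever receives the reduced result set (the table, column names and criteria here are made up):

SELECT Region, ProductCategory, SUM(Amount) AS TotalAmount
FROM   tblSales
WHERE  SaleDate >= #01/01/2015#
GROUP  BY Region, ProductCategory;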
It should work pretty well though: to give you an idea, I've run some tests with Excel 2013 linked to a 250 MB ACCDB containing 33 fields and 392,498 rows (a log of stock operations). Most operations on the pivot in Excel take only a fraction of a second, maybe a couple of seconds for the most data-intensive ones.
Another thing: Access supports pivot tables and pivot charts. Maybe you don't need Excel if Access is enough. You can use the Access Runtime 2013 (it's free) as a front-end on each machine that needs access to the data. Each front-end can then be linked to the back-end database that holds the data on a network share. The tools are a bit more clunky than in Excel, but they work.
Another possible solution, to avoid creating queries in the Access DB, is to use the PowerPivot add-in in Excel and implement the queries and normalizations there.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
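And, spelled out a little more, the copy-plus-timestamp query mentioned above might look like this (legacy SQL, with placeholder field and table names; you would set a destination table and allow_large_results when running it):

SELECT
  field1,
  field2,
  CURRENT_TIMESTAMP() AS load_time   -- every row gets the same load timestamp
FROM [mydataset.temp_table]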
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
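For reference, a legacy SQL sketch of a time range decorator restricting a query to data added in the last 7 days (the table name is a placeholder; the offset is in milliseconds):

SELECT COUNT(*) AS rows_added_last_week
FROM [mydataset.mytable@-604800000-]   -- from 7 days ago (604800000 ms) up to now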