Storage of website analytical data - relational or time series? - sql

We have a requirement to store website analytical data (think: views on a page, interactions, etc.). Note: this is separate from Google Analytics data, as we want to own the data and enrich it as we see fit.
Storage requirements:
each 'event' will have a timestamp, event type and some other metadata (user id, etc)
the storage is append only. No updates or deletes
writes are consistent, but not IoT scale: maybe 50/sec
estimating growth of about 100 million rows a year
Query requirements:
graphing data cumulatively over a period of time
slice/filter data by all the metadata as well as day/week/month/year slices
will likely need to be integrated into a larger data warehouse
Question: Is this a no-brainer for a time series DB like InfluxDB, or can I get away with a well-tuned SQL Server table?
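To make the query requirements concrete, here is a minimal sketch of the relational shape being asked about: an append-only events table, a filter on one metadata column, a daily slice, and a cumulative series for graphing. It uses SQLite through Python purely as a stand-in for SQL Server; the table name `events`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3
from itertools import accumulate

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " event_time TEXT NOT NULL,"
    " event_type TEXT NOT NULL,"
    " user_id INTEGER)"
)

# Append-only: rows are only ever inserted, never updated or deleted.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01 08:00:00", "view", 1),
        ("2024-01-01 09:30:00", "click", 1),
        ("2024-01-02 10:00:00", "view", 2),
        ("2024-01-03 11:00:00", "view", 1),
        ("2024-01-03 12:00:00", "view", 3),
    ],
)

# Filter by event type and slice by day.
daily = conn.execute(
    "SELECT date(event_time) AS day, COUNT(*) AS n"
    " FROM events WHERE event_type = ?"
    " GROUP BY day ORDER BY day",
    ("view",),
).fetchall()

# Running total per day, ready to plot as a cumulative graph.
days = [day for day, _ in daily]
cumulative = list(accumulate(count for _, count in daily))
```

Swapping `date(event_time)` for a week, month, or year expression gives the other slices; whether this stays fast at 100 million rows a year depends on indexing and partitioning, which the sketch does not address.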

Related

Is there an option to limit the number of columns in a sink export into BigQuery?

I created a sink export to load audit logs into BigQuery. However, there are a large number of columns that I don't need from the audit log. Is there a way to pick and choose the columns in the sink export?
We need to define our reason for wanting to reduce the number of columns. My thinking is that you are concerned about costs. If we look at active storage, we find that the current price is $0.02/GB with the first 10 GB free each month. If the data is untouched for 90 days, that storage cost drops to $0.01/GB. Next we have to estimate how much storage is used for recording all columns for a month vs recording just the columns you want to keep. If we can make some projections, then we can make a call on how much the cost might change if we reduced storage usage. What we will want to estimate is the number of log records to be exported per month and the size of the average log record if written as-is today vs a log record with only the minimally needed fields.
If we do find that there is a difference that makes for a significant cost saving, one further thought would be to export the log entries to Pub/Sub and have them trigger a Cloud Function that writes only the needed fields to BigQuery. However, I suspect we might end up finding that the savings on BQ storage are then lost to the cost of Pub/Sub and Cloud Functions (and possibly BQ streaming inserts).
Another thought might be to realize that the BQ log records are written to tables named by "day". We could have a batch job that runs after a day's worth of records is written and copies only the columns of interest to a new table. Again, we are going to have to watch that we don't end up with higher costs elsewhere in our attempt to reduce storage costs.
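The projection described above can be sketched as a few lines of arithmetic. The prices and free tier are the ones quoted in this answer; the record counts and average record sizes are made-up inputs, 1 GB is taken as 10^9 bytes, and only one month of newly written data is costed (accumulated history and the 90-day long-term rate are ignored).

```python
FREE_GB_PER_MONTH = 10.0    # free tier quoted in the answer
ACTIVE_PRICE_PER_GB = 0.02  # active storage price in USD quoted in the answer


def monthly_storage_cost(records_per_month: int, avg_record_bytes: float) -> float:
    """Estimate active storage cost in USD for one month of exported records."""
    gigabytes = records_per_month * avg_record_bytes / 1e9  # assumes 1 GB = 10^9 bytes
    billable = max(0.0, gigabytes - FREE_GB_PER_MONTH)
    return billable * ACTIVE_PRICE_PER_GB


# Hypothetical inputs: 50 million records a month, 2,000 bytes with all
# columns vs 400 bytes with only the needed fields.
full_cost = monthly_storage_cost(50_000_000, 2_000)
trimmed_cost = monthly_storage_cost(50_000_000, 400)
monthly_saving = full_cost - trimmed_cost
```

With these invented numbers the difference is under two dollars a month, which is the kind of result that would argue against building extra pipeline pieces just to drop columns.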
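As a rough sketch of that batch job, the helper below only builds the SQL statement for one day; actually running it (for example from a scheduled job) is left out. The dataset name, table prefixes, and column list are hypothetical placeholders, and the `_YYYYMMDD` suffix is an assumption about how the day tables are named.

```python
from datetime import date
from typing import Sequence


def build_copy_sql(
    dataset: str,
    source_prefix: str,
    target_prefix: str,
    day: date,
    columns: Sequence[str],
) -> str:
    """Build a CREATE TABLE ... AS SELECT that keeps only the chosen columns."""
    suffix = day.strftime("%Y%m%d")  # assumed day-table naming convention
    column_list = ", ".join(columns)
    return (
        f"CREATE TABLE `{dataset}.{target_prefix}_{suffix}` AS\n"
        f"SELECT {column_list}\n"
        f"FROM `{dataset}.{source_prefix}_{suffix}`"
    )


# Hypothetical names throughout; substitute your own dataset and columns.
sql = build_copy_sql(
    dataset="audit",
    source_prefix="cloudaudit_googleapis_com_activity",
    target_prefix="activity_slim",
    day=date(2020, 1, 15),
    columns=["timestamp", "severity", "logName"],
)
```

Note that the query itself scans the source table, so its cost belongs in the same comparison as the storage saving.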

Multi-Date data Load into BigQuery Partitioned table

I am trying to explore BigQuery's ability to load CSV files (DoubleClick impression data) into a BigQuery partitioned table. My use case includes:
1. Reading daily (nightly load) dumps (CSV) from Google Cloud Storage for my customer's (an ad agency's) 30 different clients into BQ. A daily dump may contain data from the previous day/week. All data should be loaded into the respective daily partition in BQ so as to provide daily reporting to individual clients.
2. The purpose here is to build an analytical system that gives the ad agency the ability to run "Trends & Pattern over time and across clients".
I am new to BQ and thus trying to understand its Schema layout.
Should I create a single table with daily partitions (holding data from all 50 clients/50 daily load CSV files)? Do the partitions need to be created well in advance?
Should I create 50 different tables (partitioned by date), one for each client, so as NOT to run into any data sharing/security concerns of the single-table option?
My customer wants a simple solution with min cost.
If you are going to use the transfer service (as mentioned in the comment), you don't need to create tables by hand; the transfer service will do that for you. It will schedule daily jobs and load the data into the matching partitions. Also, if there is a short delay (2-3 days), the transfer service will still pick up the data.

Using PowerBI to visualize large amounts of data on a SQL Data Warehouse

I have a SQL DW which is about 30 GB. I want to use PowerBI to visualize this data, but I noticed PowerBI desktop only supports file size up to 250MB. What is the best way to connect to PowerBI to visualize this data?
You have a couple of choices depending on your use case:
Direct query of the source data
View based aggregations of the source data
Direct Query
For smaller datasets (think in the thousands of rows), you can simply connect PowerBI directly to Azure SQL Data Warehouse and use the table view to pull in the data as necessary.
View Based Aggregations
For larger datasets (think millions, billions, even trillions of rows) you're better served by running the aggregations within SQL Data Warehouse. This can take the shape of a view that creates the aggregations (think sales by hour instead of every individual sale), or you can create a permanent table at data loading time through a CTAS operation that contains the aggregations your users commonly query against. With this latter CTAS model, the user's query becomes a simple select with a filter (say, Aggregated Sales greater than today - 90 days). Once the view or reporting table is created, you can simply connect to PowerBI as you normally would.
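A minimal sketch of the "sales by hour" reporting table follows. It uses SQLite through Python only as a runnable stand-in: the `sales` table, its columns, and the sample rows are invented, and the CTAS syntax in Azure SQL Data Warehouse differs from the plain `CREATE TABLE ... AS SELECT` shown here.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2024-01-01 09:05:00", 10.0),
        ("2024-01-01 09:40:00", 5.0),
        ("2024-01-01 10:15:00", 7.5),
    ],
)

# Materialise one row per hour instead of one row per sale.
conn.execute(
    """
    CREATE TABLE sales_by_hour AS
    SELECT strftime('%Y-%m-%d %H:00', sold_at) AS sale_hour,
           COUNT(*) AS sale_count,
           SUM(amount) AS total_amount
    FROM sales
    GROUP BY sale_hour
    """
)

# The reporting tool then only needs a simple select with a filter.
hourly_rows = conn.execute(
    "SELECT sale_hour, sale_count, total_amount"
    " FROM sales_by_hour ORDER BY sale_hour"
).fetchall()
```

The reporting table holds far fewer rows than the source, which is what keeps the amount of data pulled into the visualization layer small.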
The PowerBI team has a blog post - Exploring Azure SQL Data Warehouse with PowerBI - that covers this as well.
You could also create a query (Power Query, M) that retrieves only the required level of data (i.e., with groups, joins, filters, etc.). If done right, the queries are translated to T-SQL and only a limited amount of data is downloaded into Power BI Designer.

Retrieve about 100 000 records from BigQuery or Datastore

We would like to cache some data on Google Compute Engine (about 100 000 rows of data). Each row has 3-4 columns. Would you recommend loading this data from Google Cloud Datastore or BigQuery?
BigQuery does the job of "creating" this data. However, we are not sure it is a good practice to read a medium amount of data from it remotely.
BigQuery is really focused on analytical queries (e.g., SELECT user_agent, SUM(request_cost) FROM my_table_of_requests WHERE user_agent IS NOT NULL GROUP BY user_agent) rather than exporting hundreds of thousands of rows.
Datastore is focused on application-level data retrieval (i.e., "get these exact rows") rather than analytical queries, but it provides secondary indexing (aka, filtering) as well as other fancy OLTP things (like ACID transactions, automatic data replication, etc.). For your 100k rows, you'll be paying $0.06 just to retrieve them all once.
If you just want to dump 100k rows of data into something and then read it back (without any filtering on the server side, or need for transactions or replication), neither of these seems like the right choice. You might want to consider just storing a CSV output file of the data in Google Cloud Storage and calling it a day.
If you do need advanced querying, transactions, etc., Datastore will get the job done, but might be more costly than you'd expect. You might want to consider loading this data into a SQL database (e.g., PostgreSQL or MySQL), which should easily handle 100k rows.

Quickly Large Data Pivoting

We are developing a product which can be used for developing predictive models and the slicing and dicing of the data in order to provide BI.
We have two kinds of data access requirements.
For predictive modeling, we need to read data on a daily basis, row by row. For this, a normal SQL Server database is sufficient and we are not having any issues.
For slicing and dicing large data sets, say 1 GB of data with 300 M rows, we want to pivot the data easily with minimal response time.
The current SQL database has response time issues with this.
We would like our product to run on any normal client machine with 2 GB of RAM and a Core 2 Duo processor.
I would like to know how I should store this data and how I can then create a pivoting experience for each of the dimensions.
Ideally we will have data such as daily sales by salesperson, by region, by product for a large corporation. We would then like to slice and dice it by any dimension and also be able to compute aggregations, unique values, maximum, minimum, and average values, and some other statistical functions.
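To illustrate the kind of operation meant here, the sketch below groups sales rows by one chosen dimension and computes a few of the listed statistics. The field names (`region`, `product`, `sales_person`, `amount`) and the sample rows are hypothetical, and this plain in-memory loop says nothing about achieving the required response time at 300 M rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def pivot(rows: Iterable[Mapping], dimension: str, measure: str) -> Dict[str, dict]:
    """Group rows by one dimension and summarise one numeric measure."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {
        key: {
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        }
        for key, values in groups.items()
    }


# Hypothetical daily sales rows.
sales = [
    {"region": "North", "product": "A", "sales_person": "Ann", "amount": 100.0},
    {"region": "North", "product": "B", "sales_person": "Bob", "amount": 50.0},
    {"region": "South", "product": "A", "sales_person": "Ann", "amount": 30.0},
]

by_region = pivot(sales, "region", "amount")
by_person = pivot(sales, "sales_person", "amount")
```

The same function pivots on any dimension by changing the `dimension` argument, which is the "slice by any dimension" behaviour described above.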
I would build an in-memory cube on top of that data. To give you an example, icCube has sub-second response times for 3-4 measures over 50M rows on a single-core i5, without any cache or pre-aggregation (i.e., this response time is constant across all the dimensions).
Contact us directly for more details about how to integrate it into your product.
You could also use PowerPivot to do this. This is a free add-in for Excel 2010 that allows large data sets to be handled, sliced and diced, etc.
If you want to code around it, you can connect to the PowerPivot database (effectively an SSAS cube) using the SSAS database connector.
Hope that is of some use..