Visualising data in Tableau when connected to BigQuery taking an eternity - google-bigquery

I have a dataset that I've loaded into BigQuery, it consists of 3 separate tables with a common identifier in each of the files.
When I set up my project in Tableau I performed an inner join on two of the tables. I set the connection up as an extract and not live.
There's some geo info in my file, lats and longs. When I drag lat to the rows section on my worksheet it's taking an eternity to perform that task, currently it's taken 18 mins and counting to just process whatever it's doing when I drag the lat to the row section.
Is there some other way that I can take a random sample of my data for working on it rather than having to wait for each query to process? My data is not even that big, it's around 1M rows.

I've found Tableau to bog down quite a bit long before 1 million rows, and I supsect the join compounds the problem for you.
Aggregating as much as possible in BigQuery itself, before making the extract, is your friend. The random excerpt is a good idea, too. You could try:
SELECT
*
FROM
([subquery joining your tables])
WHERE RAND() < 0.05 # or whatever gives an acceptable sample size

Related

How can I optimize my varchar(max) column?

I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was given advice to store the profile image in the database. This seemed OK and worked fine, but now I'm dealing with real data and querying more rows the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?
If takes time to return data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data, if retrieval time is so important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
That said, you should not be using select * unless you specifically want the image column. If you are using this in places where it is not necessary, that can improve performance. If you want to make this save for other users, you can add a view without the image column and encourage them to use the view.
If it is still possible to take one step back.Drop the idea of Storing images in table. Instead save path in DB and image in folder.This is the most efficient .
SELECT *
FROM userProfile
ORDER BY id
Do not use * and why are you using order by ? You can order by AT UI code

Loading a spark DF from S3, multiple files. Which of these approaches is best?

I have a s3 bucket with partitioned data underlying Athena. Using Athena I see there are 104 billion rows in my table. This about 2 years of data.
Let's call it big_table.
Partitioning is by day, by hour so 07-12-2018-00,01,02 ... 24 for each day. Athena field is partition_datetime.
In my use case I need the data from 1 month only, which is about 400 million rows.
So the question has arisen - load directly from:
1. files
spark.load(['s3://my_bucket/my_schema/my_table_directory/07-01-2018-00/file.snappy.parquet',\
's3://my_bucket/my_schema/my_table_directory/07-01-2018-01/file.snappy.parquet' ],\
.
.
.
's3://my_bucket/my_schema/my_table_directory/07-31-2018-23/file.snappy.parquet'])
or 2. via pyspark using SQL
df = spark.read.parquet('s3://my_bucket/my_schema/my_table_directory')
df = df.registerTempTable('tmp')
df = spark.sql("select * from my_schema.my_table_directory where partition_datetime >= '07-01-2018-00' and partition_datetime < '08-01-2018-00'")
I think #1 is more efficient because we are only bringing in the data for the period in question.
2 seems inefficient to me because the entire 104 billion rows (or more accurately partition_datetime fields) have to be traversed to satisfy the SELECT. I'm counseled that this really isn't an issue because of lazy execution and there is never a df with all 104 billion rows. I still say at some point each partition must be visited by the SELECT, therefore option 1 is more efficient.
I am interested in other opinions on this. Please chime in
What you are saying might be true, but it is not efficient as it will never scale. If you want data for three months, you cannot specify 90 lines of code in your load command. It is just not a good idea when it comes to big data. You can always perform operations on a dataset that big by using a spark standalone or a YARN cluster.
You could use wildcards in your path to load only files in a given range.
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-{01,02,03}-2018-*/')
or
spark.read.parquet('s3://my_bucket/my_schema/my_table_directory/07-*-2018-*/')
Thom, you are right. #1 is more efficient and the way to do it. However, you can create a collection of list of files to read and then ask spark to read those files only.
This blog might be helpful for your situation.

Optimizing Geometry.STIntersect Query

I am attempting to move simple Geoprocessing routines from ESRI based processes to SQL Server. My assumption is that it will be far more efficient. For my initial test I am working on an intersection routine to associate overlapping linear data.
In my WCASING table I have 1610 records. I am trying to associate these Casings with their associated mains. I have ~277,000 Mains.
I am running the query below to get a general sense of how long it will take to find individual matches. This query returned 5 valid intersections in 40 seconds.
SELECT Top 5 [WCASING].[OBJECTID] As CasingOBJECTID,
[WPUMPPRESSUREMAIN].[OBJECTID] AS MainObjectID, [WCASING].[Shape]
FROM [dbo].[WPUMPPRESSUREMAIN]
JOIN [WCASING]
ON [WCASING].[Shape].STIntersects([WPUMPPRESSUREMAIN].[Shape]) = 1
My Primary questions;
Will this process faster depending on the search order
Finding 'A' inside of 'B' vs
Finding 'B' inside of 'A'
Initial return on 5 records from these datasets is that it does not matter
Will this process faster, if I first buffer to limit to a smaller main set and then search
Can I use SQL Server Tuning to work with Geometry based queries
I will be working on these processes over the next few weeks. In the meantime I would greatly appreciate insight and pointers to white papers associated with these tuning options. Thus far I have not found a great resource.
Thank You,
Rick

Optimizing R code for ETL

I have both an R script and a Pentaho (PDI) ETL transformation for loading data from a SQL database and performing a calculation. The initial data set has 1.28 million rows of 21 variables and is equivalent in both R and PDI. In fact, I originally wrote the R code and then subsequently "ported" to a transformation in PDI.
The PDI transformation runs in 30s (and includes an additional step of writing the output to a separate DB table). The R script takes between 45m and one hour total. I realize that R is a scripting language and thus interpreted, but it seems like I'm missing some optimization opportunities here.
Here's an outline of the code:
Read data from a SQL DB into a data frame using sqlQuery() from the RODBC package (~45s)
str_trim() two of the columns (~2 - 4s)
split() the data into partitions to prepare for performing a quantitative calculation (separate function) (~30m)
run the calculation function in parallel for each partition of the data using parLapply() (~15-20m)
rbind the results together into a single resulting data frame (~10 - 15m)
I've tried using ddply() instead of split(), parLapply() and rbind(), but it ran for several hours (>3) without completing. I've also modified the SQL select statement to return an artificial group ID that is the dense rank of the rows based on the unique pairs of two columns, in an effort to increase performance. But it didn't seem to have the desired effect. I've tried using isplit() and foreach() %dopar%, but this also ran for multiple hours with no end.
The PDI transformation is running Java code, which is undoubtedly faster than R in general. But it seems that the equivalent R script should take no more than 10 minutes (i.e. 20X slower than PDI/Java) rather than an hour or longer.
Any thoughts on other optimization techniques?
update: step 3 above, split(), was resolved by using indexes as suggested here Fast alternative to split in R
update 2: I tried using mclapply() instead of parLapply(), and it's roughly the same (~25m).
update 3: rbindlist() instead of rbind() runs in under 2s, which resolves step 5

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data which are values corresponding to a specific second?
1 day = 86.400 seconds
1 month = 2.592.000 seconds
Around 1000 values to keep track of every seconds.
Currently there are 50 tables grouping the trend data for 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool - it provides a round robin database, or circular buffer, for time series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function, for example (sum, min, max, avg) for a given period, 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been agregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data saving approach and instead of saving 'raw' data as values I would save 5-20 minutes of data in an array (Memory, BL side), compress that array using LZ based algorithm and then store the data in the database as binary data. Also, it would be nice to save Max/Min/Avg/etc.. info for that binary chunk.
When you want to process the data you can process the data chunk after chunk and by that you keep a low memory profile for your application. this approach is a little more complex but very scalable in terms of memory/processing.
hope this helps.
Is the problem the database schema?
1 second to many trends obviously first shows you a separate table with a seconds-table foreign key. Alternatively, if the "many trend values" is represented by the columns and not rows you can always append the columns to the seconds table and incur null values.
Have you tried that? Was performance poor?