Disable result set caching for BigQuery - google-bigquery

I am trying to decide which tool works best for my organization. To that end, I am testing the performance of Power BI, Looker, and Tableau against BigQuery. Since this is a benchmarking exercise and I plan to run it for multiple iterations, I want to disable BigQuery's result set caching. In the official documentation, the cache is disabled by passing use_query_cache=False in the query job config.
Since I am connecting from front-end tools, I am not really sure how to pass this parameter. Can someone help me achieve this, or suggest alternative options if available?

I've not tried this, but I'd think it could be passed like other options.
In this documentation, they have code that looks like this:
Source = GoogleBigQuery.Database(
[BillingProject="Include-Billing-Project-Id-Here", UseStorageApi=false])
I'd expect your parameter to look similar. I.e.
Source = GoogleBigQuery.Database(
[BillingProject="Include-Billing-Project-Id-Here", UseQueryCache=false])
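If any part of the benchmark can be driven from a script rather than a front-end tool, the flag from the official documentation is passed through the query job config of the BigQuery Python client. A minimal sketch, with a placeholder project id and query:

from google.cloud import bigquery

client = bigquery.Client(project="your-billing-project-id")   # placeholder project id
job_config = bigquery.QueryJobConfig(use_query_cache=False)   # bypass the result set cache
sql = "SELECT COUNT(*) AS n FROM `bigquery-public-data.samples.shakespeare`"

job = client.query(sql, job_config=job_config)
job.result()           # wait for the query to finish
print(job.cache_hit)   # False when the cache was bypassed

The same cache_hit statistic is recorded for every job, so the BigQuery job history can also be used to confirm whether queries issued by the front-end tools were served from cache.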

Related

Databricks: Best practice for creating queries for data transformation?

My apologies in advance for sounding like a newbie. This is really just a curiosity question I have as an outsider observing my team clash with our client. Please ask any questions you have, and I will try my best to answer them.
Currently, we are storing our transformation queries in a DynamoDB table. When needed, we pull them into Databricks and execute the query. Simple as that. Our client has called this out as “hard coding” (more on that soon).
Our client has come up with an alternative that involves creating JSON config files containing the transformation rules (all required tables/attributes, target table names, alias names, join keys, etc.). From there, the SQL query is generated dynamically. This approach is still “hard coding”, since the config files would need to be manually edited any time the rules change.
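To make that proposal concrete, a config-driven generator might look roughly like the Python sketch below; every table, column, and key name here is invented for illustration, and in Databricks the generated string would typically be handed to spark.sql().

import json

# Invented example of the kind of transformation config being proposed.
config = json.loads("""
{
  "target_table": "silver.orders_enriched",
  "source_table": "bronze.orders o",
  "joins": [
    {"table": "bronze.customers c", "on": "o.customer_id = c.customer_id"}
  ],
  "columns": [
    {"expr": "o.order_id", "alias": "order_id"},
    {"expr": "c.customer_name", "alias": "customer"}
  ]
}
""")

select_list = ", ".join(f"{c['expr']} AS {c['alias']}" for c in config["columns"])
join_clause = " ".join(f"JOIN {j['table']} ON {j['on']}" for j in config["joins"])

sql = (
    f"CREATE OR REPLACE TABLE {config['target_table']} AS "
    f"SELECT {select_list} FROM {config['source_table']} {join_clause}"
)
print(sql)   # in Databricks: spark.sql(sql)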
The way I see this: storing the transformation rules in JSON is more business-user friendly, but that’s about where I see the pros end. It brings much more complexity into the code and will likely need continuous development to support new queries. Also, I don’t see any way to prevent “hard coding”. The client’s business leads seem to think there is some magical tool to convert plain English text into complex SQL queries.
I just wanted to get some experts thoughts on this. Which solution is better, or is there another approach that should be taken?

Can I take advantage of Yugabyte's compatibility?

Yugabyte seems to support Redis, Cassandra, and SQL queries. Do they work with each other? For example, can I write data with the Cassandra API and later perform SQL queries against it?
These APIs do not work with each other as is, meaning you would not be able to query YCQL data from YSQL. This is because the data types are not always present in the other APIs, and they often have different semantics.
That said, we get asked this a lot and the plan is to enable this scenario using a foreign data wrapper. So, in effect, you would be able to "import" the YCQL table into the YSQL side and use it there. Note that PostgreSQL already has a bunch of these wrappers (for example, see this generic list of PG FDWs here - it has entries for Cassandra and Redis). The idea is to re-use/enhance these and get them to work out of the box.
If you're interested, please open a GitHub issue and we can continue there. Would love to understand your use-case better to make sure we are able to address it and work with you closely on this.

Is there a way to let non-technical individuals utilize BigQuery reports?

I want to have an access portal for non-tech-savvy individuals in which they could build reports of their own without needing to know any SQL whatsoever.
It would be best if I could create custom fields myself, and then just let the users in the access portal pick and choose whichever they like, with a custom date range.
I've explored the options Google Data Studio offers, but it looks to me like it mostly puts an emphasis on data visualization.
In addition, my attempts to make custom queries with it were not successful, since the platform is rigid in terms of deciding which field is a metric and which is a dimension (and it does so inaccurately). This makes it hard to query reports as you normally would using BigQuery, which doesn't have these somewhat arbitrary limitations.
Perhaps I've misunderstood something about the platform due to my limited experience with it, but it looks like Data Studio isn't going to fit the bill for me.
EDIT: In addition, the platform should have a way of exporting said reports as CSV files, a feature that Data Studio doesn't have as far as I know.
It would be great to receive suggestions for a different platform which would better fit my needs, or even suggestions on how to make better use of Data Studio.
Have you looked at using a tool like Redash (https://redash.io)? Assuming your GA360 data is in BigQuery, you can connect Redash to BQ. Then you can author queries and visualize them.
You can also use the Google Cloud SDK to connect to BQ and run custom queries to generate new tables in BQ based on the GA360 session data. Then use Redash, or any other tool, to report and visualize.
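For the scripted route, a rough sketch with the BigQuery Python client is below; the project, dataset, and table names are placeholders, the query assumes the standard GA360 export schema, and the CSV step needs pandas (and db-dtypes) installed.

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")   # placeholder project

sql = """
    SELECT trafficSource.medium AS channel, COUNT(*) AS sessions
    FROM `your-project-id.ga360.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
    GROUP BY channel
"""

# Materialize the report as a new table that Redash (or any other tool) can read.
job_config = bigquery.QueryJobConfig(
    destination="your-project-id.reports.sessions_by_channel",   # placeholder table
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()

# Or pull the same result down as a CSV file for non-technical users.
client.query(sql).to_dataframe().to_csv("sessions_by_channel.csv", index=False)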

Trigger action realtime based on keyword in Logs

I have a requirement to trigger an action (like calling a RESTful service) whenever a keyword is found in the logs. The trigger would have to be fairly real-time. I was evaluating open source solutions like Graylog2, the ELK stack (which I believe can't analyse in real time), fluentd, etc., but wanted to know your opinion on them. It would be great if the tool also allowed setting up rules against keywords to eliminate false positives and was easy to set up.
I hope this makes sense and apologies if this has been discussed elsewhere!
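For what it's worth, the core of the requirement (tail the log, match a pattern, call a RESTful endpoint) can be sketched in a few lines of Python for a single host; the log path, keyword patterns, and endpoint below are placeholders, and this is not a substitute for the tools mentioned above.

import re
import time

import requests  # third-party HTTP client: pip install requests

KEYWORDS = re.compile(r"FATAL|OutOfMemoryError")   # example rules; tune these to cut false positives
WEBHOOK = "https://example.com/alert"              # placeholder RESTful endpoint
LOGFILE = "/var/log/app/app.log"                   # placeholder log file

def follow(path):
    """Yield lines as they are appended to the file, tail -f style."""
    with open(path) as f:
        f.seek(0, 2)   # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)

for line in follow(LOGFILE):
    if KEYWORDS.search(line):
        requests.post(WEBHOOK, json={"matched_line": line.strip()}, timeout=5)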
You can try Massalyzer. It's a real-time analyzer too, very fast (up to 10 million lines per second), and you can analyze logs of unlimited size with the free demo version.
So, I tried Logstash+Graylog2 combination for the scenario I described in the question and it works quite well. I had to tweak a few things to make Logstash work with Graylog2, especially around capturing the right log levels. I will try this out on a highly loaded clustered environment and update my findings here.

TSQL Dynamically determine parameter list for SP/Function

I want to write a generic logging snippet into a collection of stored procedures. I'm writing this to have a quantitative measure of our front-end user experience, as I know which SPs are used by the front-end software and how they are used. I'd like to use this to gather a baseline before we commence performance tuning, and afterward to show the outcome of the tuning.
I can dynamically pull the object name from @@PROCID, but I've been unable to determine all the parameters passed and their values. Does anyone know if this is possible?
EDIT: Marking my own response as the answer to close this question. It appears Extended Events are the least intrusive option in terms of performance; however, I'm not sure if there is any substantial difference between minimal profiling and Extended Events. Perhaps something for a rainy day.
I can get the details of the parameters taken by the proc without parsing its text (at least in SQL Server 2005).
SELECT * FROM INFORMATION_SCHEMA.PARAMETERS
WHERE SPECIFIC_NAME = OBJECT_NAME(@@PROCID)
And I guess that this means that I could, with some appropriately madcap dynamic SQL, also pull out their values.
I don't know how to do this off the top of my head, but I would consider running a trace instead if I were you. You can use SQL Server Profiler to gather only information for the stored procedures that you specify (using filters). You can send the output to a table and then query the results to your heart's content. The output can include IO information, what parameters were passed, the client userid and machine, and much much more.
After running the trace you can aggregate the results into reports that would show how many times a procedure was called, what parameters were used, etc...
Here is a link that might help:
http://msdn.microsoft.com/en-us/library/ms187929.aspx
It appears the best solution for my situation is to run a trace gathering only SP:Starting and SP:Completed events and to write some T-SQL to iterate through the data and populate a tracking table.
I personally preferred code generation for this, but politically, where I'm working, they preferred this solution. We lost some granularity in the logging, but this is a sufficient solution to my problem.
EDIT: This ended up being an OK solution. Even profiling just these two events degrades performance to a noticeable degree. :( I wish we had an MSFT-provided way to profile a workload that didn't degrade production performance. Oracle has a nice solution for this, but it has its tradeoffs as well. I'd love to see MSFT implement something similar. The new DMVs and Extended Events help to correlate items. Thanks again for the link, Martin.