I am getting data into Splunk from Snowflake using Splunk DB Connect. This is just simple orders data. In Splunk Search & Reporting, I run the following query on my table to build a visualization.
source="big_data_table_inner_join" "UNITS_SOLD" | top COUNTRY
What I am seeing is that each time I run the query, the number of events in Splunk increases considerably. For example, after the first run there were 342000 events, and when I ran the same query again there were 67445 events. Any idea why this is happening?
Hope I can explain the problem I'm having trouble with.
I have to write a stepwise methodology, using pseudocode or an SQL query, to automatically generate a list of products/items with low stock or near expiry from the inventory database. The list must be updated at 12 a.m. daily.
I tried this
CREATE EVENT IF NOT EXISTS update_table
ON SCHEDULE EVERY 1 DAY STARTS '2022-05-22 00:00:00'
ON COMPLETION PRESERVE ENABLE
DO
SELECT inventory.products FROM inventory WHERE inventory.stocks < inventory.required_stocks;
Your stated requirement is to run some sort of report very soon after the beginning of each calendar day.
The next question you must answer is this: What will you do with that report? Will you simply drop it into a "low_stock" table someplace in your database? Will you format it into an email message and send it to your purchasing department? It will be difficult to write "pseudocode" for your requirement without first analyzing the overall business process you intend to enhance.
Various RDBMS systems have ways of doing scheduled things at particular times of day. You've shown the EVENT setup provided by MariaDB / MySQL. SQL Server has its "Jobs" system. PostgreSQL has the pg_cron extension.
The thing is, you can't just do SELECT operations from within these scheduled database actions: the result sets have nowhere to go from that context. You can do CREATE TABLE midnight_run AS SELECT whatever ... to place the results in a table. But then the results are just in another table.
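To illustrate the "place the results in a table" idea, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for the real DBMS. The column names follow the question; the sample rows and the table name low_stock are assumptions made up for the example, and in MySQL / MariaDB the same statements would go in the body of the scheduled EVENT instead.

```python
import sqlite3

# In-memory database standing in for the real inventory database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (products TEXT, stocks INTEGER, required_stocks INTEGER);
    INSERT INTO inventory VALUES ('bolts', 5, 20), ('nuts', 50, 20), ('washers', 0, 10);
""")


def refresh_low_stock(conn):
    """Rebuild the low_stock table from the current inventory contents."""
    conn.execute("DROP TABLE IF EXISTS low_stock")
    conn.execute("""
        CREATE TABLE low_stock AS
        SELECT products, stocks, required_stocks
        FROM inventory
        WHERE stocks < required_stocks
    """)
    conn.commit()


refresh_low_stock(conn)
low = [row[0] for row in conn.execute("SELECT products FROM low_stock ORDER BY products")]
print(low)
```

Dropping and recreating the table on every run keeps the list a snapshot of the current state; appending rows with a run date instead would keep a history.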
If you want to get the results out of the DBMS, you'll need a UNIXish cron job or a Windowsish scheduled task running an appropriate application at midnight each day.
Pro tip: Do your best to avoid scheduling stuff for precisely midnight. Many things run then. If you wait until a couple of minutes after the hour, your code is less likely to contend with other midnight code.
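As a rough sketch of the "application run by cron" approach, the Python script below queries the low-stock rows and writes them to a CSV file. It uses sqlite3 with made-up sample data so that it runs on its own; the function name export_low_stock, the output file name, and the sample rows are all hypothetical, and a real script would connect to the production database instead. A crontab entry starting with `5 0 * * *` would run such a script at 00:05 each day, in line with the tip about avoiding exactly midnight.

```python
import csv
import os
import sqlite3
import tempfile


def export_low_stock(conn, out_path):
    """Write all low-stock rows to a CSV file and return how many rows were written."""
    rows = conn.execute(
        "SELECT products, stocks, required_stocks FROM inventory "
        "WHERE stocks < required_stocks ORDER BY products"
    ).fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["products", "stocks", "required_stocks"])  # header row
        writer.writerows(rows)
    return len(rows)


# Demo with an in-memory database standing in for the inventory database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (products TEXT, stocks INTEGER, required_stocks INTEGER);
    INSERT INTO inventory VALUES ('bolts', 5, 20), ('nuts', 50, 20), ('washers', 0, 10);
""")
out_path = os.path.join(tempfile.mkdtemp(), "low_stock.csv")
written = export_low_stock(conn, out_path)
print(written, "rows written to", out_path)
```

From here the file could be attached to an email or dropped in a shared folder, depending on what the business process needs.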
I am querying multiple tables and I am able to see the cost of each query for my personal use. As I view the Query History I only see the queries I ran on my account.
So my question is: is it possible to see the queries that have been run by others in a project (as well as the cost of each query) from the query history?
You can use the Jobs information schema:
SELECT query, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE project_id = 'your_project_id' AND user_email = 'my@email.com'
According to the documentation, there is not a direct method of getting costs by job and user. However, there is a way of doing it.
For a detailed billing analysis, I would advise you to export the logs to BigQuery with a custom filter and from there analyse the billing for each user and query job.
So, you can create an export using the Logs Viewer or the API. While creating your sink use the following custom filter:
resource.type="bigquery_resource"
logName="projects/<your_project>/logs/cloudaudit.googleapis.com%2Fdata_access"
protoPayload.methodName="jobservice.jobcompleted"
The filter above retrieves completed query jobs; the data access logs are a comprehensive audit of every query run in BigQuery, along with the total bytes scanned. Note that you have to make sure that data_access logs are enabled.
From the log entries you will get the fields:
protoPayload.authenticationInfo.principalEmail
protoPayload.serviceData.jobCompletedEvent.job.jobName.jobId
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.query.query
protoPayload.serviceData.jobCompletedEvent.job.jobStatistics.totalBilledBytes
In BigQuery, you can use a query as follows:
SELECT
protopayload_auditlog.authenticationInfo.principalEmail AS email,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes AS total_billed_bytes,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobName.jobId as job_id
FROM
`<myproject>.<mydataset>.cloudaudit_googleapis_com_data_access`
WHERE
protopayload_auditlog.methodName = 'jobservice.jobcompleted';
Afterwards, to get an estimate of the price of each query, you can combine totalBilledBytes with the Pricing summary and add a new column containing a price estimate for each query. You then have a final table with the user's email, the query text, total bytes billed, job id and an estimated price.
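The price column can be sketched as follows in Python. This is only an illustration: the rate of 5.0 USD per TiB is a placeholder assumption and not a quoted price, the billing unit is assumed to be 2^40 bytes, and the job rows are invented. Take the real rate for your region and pricing model from the Pricing summary.

```python
BYTES_PER_TIB = 2 ** 40  # assumption: billing unit is one TiB


def estimate_cost_usd(total_billed_bytes, usd_per_tib):
    """Estimate the on-demand cost of one query job from its billed bytes."""
    return total_billed_bytes / BYTES_PER_TIB * usd_per_tib


# Hypothetical rows, shaped like the output of the audit-log query above.
jobs = [
    {"email": "a@example.com", "job_id": "job_1", "total_billed_bytes": 2 * BYTES_PER_TIB},
    {"email": "b@example.com", "job_id": "job_2", "total_billed_bytes": BYTES_PER_TIB // 2},
]

ASSUMED_USD_PER_TIB = 5.0  # placeholder rate, NOT an official price

for job in jobs:
    job["estimated_usd"] = estimate_cost_usd(job["total_billed_bytes"], ASSUMED_USD_PER_TIB)
    print(job["email"], job["job_id"], job["estimated_usd"])
```

The same arithmetic can be written directly in the BigQuery SELECT as an extra computed column, which avoids exporting the rows.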
I'm working on a way to stream the status of some jobs that are running on an HPC resource (sort of like trying to create a dashboard to look at real-time flight status). I generate and push data every 60 seconds. Unfortunately, this way I end up with a lot of repeated data, as the status of each 'job' changes unpredictably. I need a way to keep only the latest data. I'm not an SQL pro and do this work in my free time, so any help will be appreciated!
Here is my query:
SELECT
Job, Ref, Location, Queue, Description, Status, ElapTime, cast (Time as datetime) as Time
INTO
output_source
FROM
input_source
Here is what my output looks like when I test the query:
Query Test Result
As you can see in the image, there are two sets of data with two different timestamps. I would like the query to return all the columns associated with only the last timestamp. How do I do this? Any ideas? Apologies if this is a repeated question. I have not found an answer that has helped me solve this problem.
Thanks for all your help!
I am relatively new to Splunk and I am trying to create a report that will display a hostname and the number of times that host failed to log in within the past five minutes, when it failed 3 or more times. The only way I was able to get the initial search results I want is to look only within the past 5 minutes, as you can see in my query:
index="wineventlog" EventCode=4625 earliest=-5min | stats count by host,_time | stats count by host | search count > 2
This returns the host and the count. The issue is that if I use this query in my report, it can run every five minutes, but the hosts that were listed previously get removed, as they are no longer included in the search results.
I found ways to generate logs that I can then search for separately (http://docs.splunk.com/Documentation/Splunk/6.6.2/Alert/LogEvents) but it didn't work the way I expected.
I am looking for an answer to any of these questions that can help me get the intended results:
Can my original search be improved to still only get results where the failed logins were within 5 minutes but be able to search over any time period?
Is there a way to send the results from the query I already have to a report, where the results will not be cleared out when the search is run again?
Is there any other option I haven't considered to achieve the desired result?
If you only care about the last 5 minutes then search only the last 5 minutes. Searching more is just wasting resources.
Consider writing your results to a summary index (using collect) with a scheduled search and have your report/dashboard display values from the summary index.
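The summary-index idea can be modeled outside Splunk with a few lines of Python. This is not SPL and not Splunk's API; it is only a sketch of the pattern, in which each scheduled run looks at the last five minutes and appends its hits to a persistent store that the report reads. The function names, the in-memory summary list, and the sample events are all assumptions made for the example.

```python
from collections import Counter


def failed_hosts(events, now, window=300, threshold=3):
    """Return hosts with at least `threshold` failed logins in the last `window` seconds."""
    counts = Counter(host for ts, host in events if now - window < ts <= now)
    return {host: n for host, n in counts.items() if n >= threshold}


summary = []  # stand-in for the summary index: results are appended, never cleared


def scheduled_run(events, now):
    """One run of the scheduled search: store this window's hits in the summary."""
    for host, n in sorted(failed_hosts(events, now).items()):
        summary.append({"run_time": now, "host": host, "count": n})


# Hypothetical (timestamp_in_seconds, host) pairs for failed logins.
events = [(10, "hostA"), (20, "hostA"), (30, "hostA"), (40, "hostB"),
          (400, "hostC"), (410, "hostC"), (420, "hostC")]

scheduled_run(events, now=300)  # sees hostA with 3 failures
scheduled_run(events, now=600)  # sees hostC with 3 failures; hostA stays in the summary

for row in summary:
    print(row)
```

The report then reads from the accumulated summary rather than from the live five-minute search, so earlier hosts do not disappear when the search runs again.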
I connected Tableau to BigQuery and was working on dashboards. The issue here is that BigQuery charges for the data a query scans every time it runs.
My table holds 200 GB of data. When someone opens the dashboard in Tableau, the query runs against the whole table. When any filter on the dashboard is used, it runs against the whole table again.
On 200 GB of data, if someone applies 5 filters for different analyses, BigQuery bills 200 * 5 = 1 TB (nearly). For one day of testing the analysis we were charged for 30 TB of analysis, but the table behind it is only 200 GB. Is there any way I can stop Tableau from running against the whole table in BigQuery every time there is a change?
The extract in Tableau is indeed one valid strategy, but only when you are using a custom query. If you access the table directly it won't work, as that will download 200 GB to your machine.
Other options to limit the amount of data are:
Not calling any columns that you don't need. Do this by hiding unused fields in Tableau. It will not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and then you pay for the full 200 GB even if you don't use those fields.
Another option that we use a lot is partitioning our tables. For instance, a partition per day of data if you have a date field. Using the TABLE_DATE_RANGE and TABLE_QUERY functions (legacy SQL table wildcard functions) you can then smartly limit the number of partitions, and hence rows, that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view, and then I use the view in Tableau. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
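A back-of-the-envelope Python sketch shows why limiting the queried partitions matters. It assumes the 200 GB from the question is spread evenly over daily partitions; the 100-day span and the 7-day filter are hypothetical numbers chosen for illustration, and real tables are rarely this uniform.

```python
def scanned_gb(table_gb, total_days, days_selected):
    """Approximate GB scanned when only `days_selected` daily partitions are read.

    Assumes the data is spread evenly across `total_days` partitions.
    """
    return table_gb * days_selected / total_days


TABLE_GB = 200        # table size from the question
TOTAL_DAYS = 100      # hypothetical: 100 daily partitions
INTERACTIONS = 5      # five filter changes on the dashboard

full_scan_total = TABLE_GB * INTERACTIONS
pruned_total = scanned_gb(TABLE_GB, TOTAL_DAYS, days_selected=7) * INTERACTIONS

print("full scans:", full_scan_total, "GB")
print("7-day partition filter:", pruned_total, "GB")
```

Under these assumptions, five dashboard interactions scan about 70 GB instead of 1000 GB. Hiding unused columns reduces the figure further, because BigQuery bills only for the columns a query reads.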
1) Right now I am learning BQ + Tableau too, and I found that using "Extract" is a must for BQ in Tableau. With this option you can also save time when building dashboards. So my current pipeline is "Build query > Add it to Tableau > Make dashboard > Upload dashboard to Tableau Online > Schedule update for Extract".
2) You can send a custom quota request to Google and set up limits per project or per user.
3) If each of your queries touches 200 GB every time, consider optimizing these queries (don't use SELECT *, query only the dates you need, etc.).
The best approach I found was to partition the table in BQ based on a date (day) field which has no timestamp. BQ allows you to partition a table by a day level field. The important thing here is that even though the field is day/date with no timestamp it should be a TIMESTAMP datatype in the BQ table. i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reason the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is that when you create a viz in Tableau, it generates SQL to run against BQ, and for the partitioned field to be used by the Tableau-generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype and looked up the logs in GCP and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
The thing about Tableau and BigQuery is that Tableau calculates the filter values using your query (live query). What I have seen in my project logs is that it builds the filters from your own query:
select 'Custom SQL Query'.filtered_column from ( your_actual_datasource_query ) as 'Custom SQL Query' group by 'Custom SQL Query'.filtered_column
Instead, try to create the Tableau data source with incremental extracts, and also try to have your query date partitioned (BigQuery only supports date partitioning) so that you can limit the data used.