How to run Splunk stats command to get answers - splunk

Can anyone tell me how to use the stats command to produce a report on the number of times GAMES equals FOOTBALL?

index=... sourcetype=... GAMES=FOOTBALL | stats count

Related

How to get stats from combined aggregated bin data in AWS Cloudwatch Logs Insights

I have some AWS CloudWatch logs which output values every 5 seconds. I'd like to get the max over a rolling 10 minute interval and then get the average value per day based on that. Using the CloudWatch Logs Insights QuerySyntax I cannot seem to get the result of the first bin aggregation to use in the subsequent bin. I tried:
fields @timestamp, @message
| filter @logStream like /mylog/
| parse @message '*' as threadCount
| stats max(threadCount) by bin(600s) as maxThreadCount
| stats avg(maxThreadCount) by bin(24h) as avgThreadCount
But the query syntax is invalid with multiple stats commands. Combining the last two lines into one, like:
| stats avg(max(threadCount) by bin(600s)) by bin(24h) as threadCountAvg
is also invalid. I can't seem to find much in the AWS docs. Am I out of luck? Does anyone know a trick?
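If the two-stage aggregation cannot be expressed in a single query, one possible workaround is to export the raw values and aggregate them client-side. The sketch below assumes the samples are already available as (epoch seconds, value) pairs; the function name daily_avg_of_bin_max and that input shape are hypothetical, and the bins are fixed 10-minute buckets in UTC (as bin(600s) would produce), not a true rolling window.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_avg_of_bin_max(samples, bin_seconds=600):
    """samples: iterable of (epoch_seconds, value) pairs.

    Returns {date: average of the per-bin maxima that fall on that UTC date}.
    """
    # Stage 1: maximum value inside each fixed-size bin.
    bin_max = {}
    for ts, value in samples:
        bin_start = ts - ts % bin_seconds
        if bin_start not in bin_max or value > bin_max[bin_start]:
            bin_max[bin_start] = value

    # Stage 2: average those maxima per calendar day (UTC).
    per_day = defaultdict(list)
    for bin_start, value in bin_max.items():
        day = datetime.fromtimestamp(bin_start, tz=timezone.utc).date()
        per_day[day].append(value)
    return {day: sum(values) / len(values) for day, values in per_day.items()}


# Made-up sample data: two bins on the first day, one bin on the second.
result = daily_avg_of_bin_max([(0, 3), (5, 7), (600, 1), (605, 2), (86400, 10)])
for day in sorted(result):
    print(day, result[day])
```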

How do I parse by regular expressions only on filtered lines on Cloudwatch log insights?

Is there a way to restructure this cloudwatch insights query so that it runs faster?
fields @timestamp, @message
| filter @message like /NewProductRequest/
| parse @message /.*"productType":\s*"(?<productType>\w+)"/
| stats count(*) by productType
I am running it over a limited period (1 day's worth of logs). It is taking very long to run.
When I remove the parse command and just count(*) the filtered lines, there are only 2,500 matches out of 20,000,000 lines, and the query returns in several seconds.
With the parse command, the query takes >15 minutes. I can see the throughput drop from ~1GBps to ~2MBps.
Running a parse regexp on 2,500 filtered lines should be negligible. It takes less than 2 seconds if I download the filtered results to my MacBook and run the regexp in Python.
This leads me to believe that cloudwatch is running the parse command on every line in the log, and not just the filtered lines.
Is there a way to restructure my query so that the parse command will run after my filter command? ( Effectively parsing 2.5k lines instead of 20 million lines)
Removing the .* at the beginning of the expression increases performance. If your pattern only uses the leading .* to skip over an arbitrary prefix, this solution will work for you. It does not help if your regexp begins with anything other than .*
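A quick way to check that dropping the leading .* does not change what gets captured is to compare both patterns on a sample line. The sketch below uses Python's re module, so it only illustrates the regex semantics and says nothing about how CloudWatch's engine executes them. The sample log line is made up, and Python spells named groups (?P<name>...) where the query above uses (?<name>...).

```python
import re

# Same expression with and without the leading ".*".
WITH_PREFIX = re.compile(r'.*"productType":\s*"(?P<productType>\w+)"')
WITHOUT_PREFIX = re.compile(r'"productType":\s*"(?P<productType>\w+)"')


def product_type(pattern, line):
    """Return the captured productType, or None if the line does not match."""
    match = pattern.search(line)
    return match.group("productType") if match else None


# Hypothetical log line, for illustration only.
line = 'NewProductRequest {"id": 7, "productType": "book"}'
print(product_type(WITH_PREFIX, line), product_type(WITHOUT_PREFIX, line))
```

One difference to keep in mind: when a line contains the key more than once, the greedy .* version captures the last occurrence, while the unanchored version captures the first.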

Apache Spark: count vs head(1).isEmpty

For a given spark df, I want to know if a certain column has null value or not. The code I had was -
if (df.filter(col(colName).isNull).count() > 0) { /* throw exception */ }
This was taking long and was being called 2 times for 1 df since I was checking for 2 columns. Each time it was called, I saw a job for count, so 2 jobs for 1 df.
I then changed the code to look like this -
if (!df.filter(col(colName).isNull).head(1).isEmpty) { /* throw exception */ }
With this change, I now see 4 head jobs compared to the 2 count jobs before, increasing the overall time.
Can you experts please help me understand why the number of jobs doubled? The head function should be called only 2 times.
Thanks for your help!
Update: added screenshot showing the jobs for both cases. The left side shows the one with count and right side is the head. That's the only line that is different between the 2 runs.
dataframe.head(1) does 2 things -
1. Executes the action behind the dataframe on executor(s).
2. Collects 1st row of the result from executor(s) to the driver.
dataframe.count() does 2 things -
1. Executes the action behind the dataframe on executor(s). If there are no transformations on the file and the Parquet format is used, then it is basically just scanning the statistics of the file(s).
2. Collects count from executor(s) to the driver.
So when the source of the dataframe is a file format that stores statistics and no transformations are applied, count() can run faster than head.
I am not 100% sure why there are 2 jobs vs 4. Can you please paste the screenshot.
It is hard to say from this one line of code, but there is one reason head can take more time. head has to return a deterministic result: if there is a sort or orderBy anywhere in the plan, a shuffle is required so that it always returns the same first row. With count you don't need the result ordered, so there is no need to shuffle; it is basically a simple map-reduce step. That is probably why your head is taking more time.
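Another possible explanation for the doubled job count, offered as an assumption and not as a diagnosis of this particular run: Spark evaluates head(n) incrementally, scanning one partition first and launching further jobs over more partitions only if too few rows came back. When the filter matches nothing, a single head(1) can then appear as more than one job. The sketch below is a plain-Python toy model of that behaviour, not Spark code; take_incrementally and scale_up are hypothetical names, and the factor of 4 is only an illustrative value.

```python
def take_incrementally(partitions, n, scale_up=4):
    """Toy model of an incremental take: returns (rows, simulated_job_count).

    partitions: list of lists, each inner list holding the rows of one partition.
    """
    rows = []
    jobs = 0
    scanned = 0
    batch = 1  # the first simulated job looks at a single partition
    while len(rows) < n and scanned < len(partitions):
        jobs += 1
        for part in partitions[scanned:scanned + batch]:
            rows.extend(part)
        scanned += batch
        batch *= scale_up  # later jobs look at more partitions at once
    return rows[:n], jobs


# No null rows anywhere: the first partition is empty, so a second job runs.
print(take_incrementally([[], [], []], 1))
# A matching row in the first partition: one job is enough.
print(take_incrementally([["row"], [], []], 1))
```

Under this model, two head(1) checks on columns with no nulls would show up as four jobs, while two count() calls stay at two.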

How do you run a saved query from Big Query cli and export result to CSV?

I have a saved query in Big Query but it's too big to export as CSV. I don't have permission to export to a new table so is there a way to run the query from the bq cli and export from there?
From the CLI you can't directly access your saved queries, as that is a UI-only feature for now; as explained here, there is a feature request for it.
If you just want to run it once to get the results you can copy the query from the UI and just paste it when using bq.
Using the docs example query you can try the following with a public dataset:
QUERY="SELECT word, SUM(word_count) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
bq query "$QUERY" > results.csv
The output of cat results.csv should be:
+---------------+-------+
|     word      | count |
+---------------+-------+
| dispraisingly |     1 |
| praising      |     8 |
| Praising      |     4 |
| raising       |     5 |
| dispraising   |     2 |
| raisins       |     1 |
+---------------+-------+
Just replace the QUERY variable with your saved query.
Also, keep in mind whether your query is written in Standard or Legacy SQL, and set the --use_legacy_sql flag accordingly.
Reference docs here.
Despite what you may have understood from the official documentation, you can get large query results from bq query, but there are multiple details you have to be aware of.
To start, here's an example. I got all of the rows of the public table usa_names.usa_1910_2013 from the public dataset bigquery-public-data by using the following commands:
total_rows=$(bq query --use_legacy_sql=false --format=csv "SELECT COUNT(*) AS total_rows FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" | xargs | awk '{print $2}');
bq query --use_legacy_sql=false --max_rows=$((total_rows + 1)) --format=csv "SELECT * FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" > output.csv
The result of this command was a CSV file with 5552454 lines, with the first two containing header information. The number of rows in this table is 5552452, so it checks out.
Here's where the caveats come into play:
Regardless of what the documentation might seem to say when it comes to query download limits specifically, those limits seem to only apply to the Web UI, meaning bq is exempt from them;
At first, I was using the Cloud Shell to run this bq command, but the number of rows was so big that streaming the result set into it killed the Cloud Shell instance! I had to use a Compute instance with at least the resources of an n1-standard-4 (4vCPUs, 16GiB RAM), and even with all of this RAM, the query took 10 minutes to finish (note that the query itself runs server-side; the problem is only buffering the results);
I'm manually copy-pasting the query itself, as there doesn't seem to be a way to reference saved queries directly from bq;
You don't have to use Standard SQL, but you have to specify max_rows, because otherwise it'll only return 100 rows (100 is the current default value of this argument);
You'll still be facing the usual quotas & limits associated with BigQuery, so you might want to run this as a batch job or not, it's up to you. Also, don't forget that the maximum response size for a query is 128 MiB, so you might need to break the query into multiple bq query commands in order to not hit this size limit. If you want a public table that's big enough to hit this limitation during queries, try the samples.wikipedia one from bigquery-public-data dataset.
I think that's about it! Just make sure you're running these commands on a beefy machine and after a few tries it should give you the result you want!
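For scripting the export, the command above can be assembled programmatically. This is a minimal sketch that only builds the argument list, using the flags already shown in this answer (--use_legacy_sql, --max_rows, --format=csv); the helper name build_bq_csv_command is hypothetical, and nothing here contacts BigQuery.

```python
import shlex


def build_bq_csv_command(query, max_rows, use_legacy_sql=False):
    """Build the argv list for a `bq query` call that prints CSV to stdout."""
    return [
        "bq",
        "query",
        f"--use_legacy_sql={'true' if use_legacy_sql else 'false'}",
        f"--max_rows={max_rows}",
        "--format=csv",
        query,
    ]


cmd = build_bq_csv_command(
    "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    max_rows=5552453,  # total_rows + 1, as in the shell example above
)
# Show the command as it could be pasted into a shell.
print(shlex.join(cmd))
```

To actually run it, the list could be passed to subprocess.run(cmd, stdout=some_open_file, check=True) on a machine where bq is installed and authenticated.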
P.S.: There's currently a feature request to increase the size of CSVs you can download from the Web UI. You can find it here.

Hive query records processed count

I want to know how many records were processed, or what percentage of records was processed, by a query to fetch its result in Hive.
I tried describe formatted on the query, but could not get it to work.
describe formatted (select * from sample)
Use the explain command:
explain extended select * from sample
But the number of rows in the plan is taken from statistics, because the query has not actually been executed yet. The number of processed rows becomes known only after execution.
See manual here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
The counters in the log after the command has finished look like this:
Counters=FileSystemCounters.FILE_BYTES_READ:165364556525,
FileSystemCounters.FILE_BYTES_WRITTEN:398475913171,
FileSystemCounters.FILE_READ_OPS:0,
FileSystemCounters.FILE_LARGE_READ_OPS:0,
FileSystemCounters.FILE_WRITE_OPS:0,
FileSystemCounters.HDFS_BYTES_READ:2403609087417,
FileSystemCounters.HDFS_BYTES_WRITTEN:2401487507859,
FileSystemCounters.HDFS_READ_OPS:185667,
FileSystemCounters.HDFS_LARGE_READ_OPS:0 HIVE.RECORDS_IN:204428194,
HIVE.RECORDS_OUT_0:63070586,
HIVE.RECORDS_OUT_1_schema.table_name:39980068,
HIVE.RECORDS_OUT_INTERMEDIATE:126141195,
HIVE.SKEWJOINFOLLOWUPJOBS:0,
Shuffle Errors.BAD_ID:0,Shuffle
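To pull the record counts out of a counters line like the one above, a small parser is enough. This is a sketch that assumes the name:value layout shown in the excerpt; parse_counters is a hypothetical helper, and the sample string reuses a few values from the log above.

```python
import re


def parse_counters(log_text):
    """Extract name:integer pairs from a counters log line into a dict.

    Names containing spaces (e.g. "Shuffle Errors.BAD_ID") are captured
    from the last space onward, which is fine for the HIVE.* counters.
    """
    pairs = re.findall(r"([A-Za-z_][\w.]*):(\d+)", log_text)
    return {name: int(value) for name, value in pairs}


sample = (
    "FileSystemCounters.HDFS_LARGE_READ_OPS:0 HIVE.RECORDS_IN:204428194,"
    "HIVE.RECORDS_OUT_0:63070586,"
    "HIVE.RECORDS_OUT_INTERMEDIATE:126141195"
)
counters = parse_counters(sample)
print("records read:", counters["HIVE.RECORDS_IN"])
print("records written to output 0:", counters["HIVE.RECORDS_OUT_0"])
```

Whether a ratio of these counters is a meaningful "percent processed" depends on the query, so the sketch only reports the raw values.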