How do I parse by regular expressions only on filtered lines on Cloudwatch log insights?

Is there a way to restructure this cloudwatch insights query so that it runs faster?
fields @timestamp, @message
| filter @message like /NewProductRequest/
| parse @message /.*"productType":\s*"(?<productType>\w+)"/
| stats count(*) by productType
I am running it over a limited period (1 day's worth of logs). It is taking very long to run.
When I remove the parse command and count(*) the filtered lines, there are only 2,500 matches out of 20,000,000 lines, and the query returns in several seconds.
With the parse command, the query takes >15 minutes. I can see the throughput drop from ~1GBps to ~2MBps.
Running a parse regexp on 2,500 filtered lines should be negligible. It takes less than 2 seconds if I download the filtered results to my MacBook and run the regexp in Python.
This leads me to believe that CloudWatch is running the parse command on every line in the log, and not just the filtered lines.
Is there a way to restructure my query so that the parse command will run after my filter command (effectively parsing 2.5k lines instead of 20 million lines)?

Removing the .* at the beginning of the expression improves performance. If your pattern only uses the leading .* to skip an arbitrary sequence of characters before the string you are searching for, this solution will work for you. It does not help if your regexp begins with anything other than .*
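As a rough illustration of why the leading .* can be dropped, here is a minimal Python sketch. The log line is made up for the example, and Python spells named groups (?P<name>...) where CloudWatch uses (?<name>...). With an unanchored search, both patterns capture the same value as long as the field appears once per line; if it appeared several times, the greedy .* version would capture the last occurrence and the shorter pattern the first.

```python
import re

# Hypothetical log line, shaped like the ones described in the question.
SAMPLE_LINE = '2021-01-01 INFO NewProductRequest {"id": 7, "productType": "book"}'

# Pattern from the question, and the same pattern without the leading ".*".
WITH_PREFIX = r'.*"productType":\s*"(?P<productType>\w+)"'
WITHOUT_PREFIX = r'"productType":\s*"(?P<productType>\w+)"'


def extract_product_type(pattern, line):
    """Return the captured productType, or None if the line does not match."""
    match = re.search(pattern, line)  # unanchored search, like a substring match
    return match.group("productType") if match else None


if __name__ == "__main__":
    print(extract_product_type(WITH_PREFIX, SAMPLE_LINE))
    print(extract_product_type(WITHOUT_PREFIX, SAMPLE_LINE))
```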

Related

Django - Iterating over Raw Query is slow

I have a query which uses a window function. I am using a raw query to filter over that new field, since django doesn't allow filtering over that window function (at least in the version I am using).
So it would look something like this (simplified):
# Returns 440k lines
user_files = Files.objects.filter(file__deleted=False).filter(user__terminated__gte=today).annotate(
row_number=Window(expression=RowNumber(), partition_by=[F("user")], order_by=[F("creation_date").desc()]))
I am basically trying to get the last not deleted file from each user which is not terminated.
Afterwards I use following raw query to get what I want:
# returns 9k lines
sql, params = user_files.query.sql_with_params()
latest_user_files = Files.objects.raw(f'select * from ({sql}) sq where row_number = 1', params)
If I run these queries directly in the database, they run quite quickly (300ms). But once I try to iterate over them, or even just print them, they take a very long time to execute:
anywhere from 100 to 200 seconds, even though the query itself takes a little less than half a second. Is there anything I am missing? Is the extra field row_number in the raw query an issue?
Thank you for any hint/answers.
(Using Django 3.2 and Python 3.9)

How do you run a saved query from Big Query cli and export result to CSV?

I have a saved query in Big Query but it's too big to export as CSV. I don't have permission to export to a new table so is there a way to run the query from the bq cli and export from there?
From the CLI you can't directly access your saved queries, as that is a UI-only feature as of now, but, as explained here, there is a feature request for it.
If you just want to run it once to get the results you can copy the query from the UI and just paste it when using bq.
Using the docs example query you can try the following with a public dataset:
QUERY="SELECT word, SUM(word_count) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
bq query "$QUERY" > results.csv
The output of cat results.csv should be:
+---------------+-------+
| word | count |
+---------------+-------+
| dispraisingly | 1 |
| praising | 8 |
| Praising | 4 |
| raising | 5 |
| dispraising | 2 |
| raisins | 1 |
+---------------+-------+
Just replace the QUERY variable with your saved query.
Also, take into account whether you are using Standard or Legacy SQL, and set the --use_legacy_sql flag accordingly.
Reference docs here.
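Note that the redirect above stores bq's default pretty-printed table, not comma-separated values; passing --format=csv to bq query avoids that. If you already have a file in the table layout, a minimal Python sketch for converting it could look like the following. The function name is made up for this example, and it assumes that no cell value contains a | character.

```python
import csv
import io


def table_to_csv(table_text):
    """Convert a pretty-printed ASCII table (as shown above) to CSV text.

    Assumes cells never contain the '|' character.
    """
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines and any blank lines
        # Drop the outer pipes, then split the remaining text into cells.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = """
    +----------+-------+
    | word     | count |
    +----------+-------+
    | praising | 8     |
    | raisins  | 1     |
    +----------+-------+
    """
    print(table_to_csv(sample), end="")
```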
Despite what you may have understood from the official documentation, you can get large query results from bq query, but there are multiple details you have to be aware of.
To start, here's an example. I got all of the rows of the public table usa_names.usa_1910_2013 from the public dataset bigquery-public-data by using the following commands:
total_rows=$(bq query --use_legacy_sql=false --format=csv "SELECT COUNT(*) AS total_rows FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" | xargs | awk '{print $2}');
bq query --use_legacy_sql=false --max_rows=$((total_rows + 1)) --format=csv "SELECT * FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" > output.csv
The result of this command was a CSV file with 5552454 lines, with the first two containing header information. The number of rows in this table is 5552452, so it checks out.
Here's where the caveats come into play:
Regardless of what the documentation might seem to say when it comes to query download limits specifically, those limits seem to only apply to the Web UI, meaning bq is exempt from them;
At first, I was using the Cloud Shell to run this bq command, but the number of rows was so big that streaming the result set into it killed the Cloud Shell instance! I had to use a Compute instance with at least the resources of an n1-standard-4 (4 vCPUs, 16 GiB RAM), and even with all of this RAM, the query took 10 minutes to finish (note that the query itself runs server-side, it's just a problem of buffering the results);
I'm manually copy-pasting the query itself, as there doesn't seem to be a way to reference saved queries directly from bq;
You don't have to use Standard SQL, but you have to specify max_rows, because otherwise it'll only return you 100 rows (100 is the current default value of this argument);
You'll still be facing the usual quotas & limits associated with BigQuery, so you might want to run this as a batch job or not, it's up to you. Also, don't forget that the maximum response size for a query is 128 MiB, so you might need to break the query into multiple bq query commands in order to not hit this size limit. If you want a public table that's big enough to hit this limitation during queries, try the samples.wikipedia one from bigquery-public-data dataset.
I think that's about it! Just make sure you're running these commands on a beefy machine and after a few tries it should give you the result you want!
P.S.: There's currently a feature request to increase the size of CSVs you can download from the Web UI. You can find it here.

how to handle query execution time (performance issue ) in oracle

I need to run a patch script against a million rows of data. The current execution time does not meet expectations: a test run on only 18,000 rows (test data, before deploying to live) takes around 4 hours.
The patch script selects the rows in a loop and updates each one according to the specification. I wonder how long it would take for a million rows, given that it takes around 4 hours for just 18,000 rows.
To work around this problem, I decided to create a temp table that holds the result of the entire select statement and to run the patch process against that temp table, which should be somewhat faster than selecting and updating in the loop.
Are there any other ways I can handle this situation? Any suggestions are welcome.
(Due to company policy I am unable to post the PL/SQL script here.)
Since nobody has answered my question, I am posting my own answer.
Oracle has Parallel Execution, which allows spreading the processing of a single SQL statement across multiple threads.
Using this method, I cut my long-running query from 4 hours to 6 minutes.
For more information :
https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
http://www.oracle.com/technetwork/articles/database-performance/geist-parallel-execution-1-1872400.html
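To put the figures from this thread in perspective, here is a back-of-the-envelope extrapolation in Python. It assumes that run time grows linearly with the number of rows and that the 6-minute figure refers to the same 18,000-row test run; neither assumption is confirmed by the thread, so treat the results as rough estimates only.

```python
def extrapolate(duration, rows_done, rows_target):
    """Scale a measured duration linearly to a larger row count."""
    return duration * rows_target / rows_done


if __name__ == "__main__":
    rows_tested = 18_000       # rows in the test run (from the question)
    target_rows = 1_000_000    # rows the patch must eventually process

    # Before parallel execution: 4 hours for the test run.
    hours_before = extrapolate(4.0, rows_tested, target_rows)
    # After parallel execution: 6 minutes (assumed to be for the same test run).
    minutes_after = extrapolate(6.0, rows_tested, target_rows)

    print(f"before: about {hours_before:.0f} hours ({hours_before / 24:.1f} days)")
    print(f"after:  about {minutes_after:.0f} minutes ({minutes_after / 60:.1f} hours)")
```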

SQL Wildcard Search - Efficiency?

There has been a debate at work recently about the most efficient way to search a MS SQL database using LIKE and wildcards. We are comparing %abc%, %abc, and abc%. One person has said that you should always have the wildcard at the end of the term (abc%). So, according to them, if we wanted to find something that ended in "abc", it would be most efficient to use reverse(column) LIKE reverse('%abc').
I set up a test using SQL Server 2008 (R2) to compare each of the following statements:
select * from CLMASTER where ADDRESS like '%STREET'
select * from CLMASTER where ADDRESS like '%STREET%'
select * from CLMASTER where ADDRESS like reverse('TEERTS%')
select * from CLMASTER where reverse(ADDRESS) like reverse('%STREET')
CLMASTER holds about 500,000 records; there are about 7,400 addresses that end in "Street", and about 8,500 addresses that have "Street" in them, but not necessarily at the end. Each test run took 2 seconds and they all returned the same number of rows except for %STREET%, which found an extra 900 or so results because it picked up addresses that had an apartment number on the end.
Since the SQL Server test didn't show any difference in execution time I moved into PHP where I used the following code, switching in each statement, to run multiple tests quickly:
<?php
require_once("config.php");
// $connection_string, $U and $P are presumably defined in config.php.
$connection = odbc_connect($connection_string, $U, $P);
$totaltime = array();
for ($i = 0; $i < 500; $i++) {
    // microtime(true) returns the current Unix time in seconds as a float.
    $starttime = microtime(true);
    $Message = odbc_exec($connection, "select * from CLMASTER where ADDRESS like '%STREET%'");
    // Read the first column of the first row so that a result is actually fetched.
    $Message = odbc_result($Message, 1);
    $endtime = microtime(true);
    $totaltime[] = ($endtime - $starttime);
}
odbc_close($connection);
echo "<b>Test took an average of:</b> ".round(array_sum($totaltime)/count($totaltime),8)." seconds per run.<br>";
echo "<b>Test took a total of:</b> ".round(array_sum($totaltime),8)." seconds to run.<br>";
?>
The results of this test were about as ambiguous as the results when testing in SQL Server.
%STREET completed in 166.5823 seconds (.3331 average per query), and averaged 500 results found in .0228 seconds.
%STREET% completed in 149.4500 seconds (.2989 average per query), and averaged 500 results found in .0177 seconds. (Faster time per result because it finds more results than the others, in similar time.)
reverse(ADDRESS) like reverse('%STREET') completed in 134.0115 seconds (.2680 average per query), and averaged 500 results found in .0183 seconds.
reverse('TEERTS%') completed in 167.6960 seconds (.3354 average per query), and averaged 500 results found in .0229 seconds.
We expected this test to show that %STREET% would be the slowest overall, but it was actually the second fastest to run and had the best average time to return 500 results. The suggested reverse(ADDRESS) like reverse('%STREET') was the fastest to run overall, but was a little slower in time to return 500 results.
Extra fun: A coworker ran Profiler on the server while we were running the tests and found that the double wildcard produced a significant increase in CPU usage, while the other tests were within 1-2% of each other.
Are there any SQL efficiency experts out there who can explain why having the wildcard at the end of the search string would be better practice than having it at the beginning, and perhaps why searching with wildcards at the beginning and end of the string was faster than having the wildcard just at the beginning?
Having the wildcard at the end of the string, like 'abc%', would help if that column were indexed, as it would be able to seek directly to the records which start with 'abc' and ignore everything else. Having the wild card at the beginning means it has to look at every row, regardless of indexing.
Good article here with more explanation.
Only wildcards at the end of a Like character string will use an index.
You should look at using FTS Contains if you want to improve speed of wildcards at the front and back of a character string. Also see this related SO post regarding Contains versus Like.
According to Microsoft, it is more efficient to use only the closing wildcard, because the query can then use an index, if one exists, rather than performing a scan. Think about how the search might work: if you have no idea what comes before the term, you have to scan everything, but if only the tail end is a wildcard, you can order the rows and even possibly (depending on what you're looking for) do a quasi-binary search.
Some operators in joins or predicates tend to produce resource-intensive operations. The LIKE operator with a value enclosed in wildcards ("%a value%") almost always causes a table scan. This type of table scan is a very expensive operation because of the preceding wildcard. LIKE operators with only the closing wildcard can use an index because the index is part of a B+ tree, and the index is traversed by matching the string value from left to right.
So, the above quote also explains why there was a huge processor spike when running two wildcards. It completed faster only by happenstance, because there is enough horsepower to cover up the inefficiency. When trying to determine the performance of a query, you want to look at the execution of the query rather than the resources of the server, because those can be misleading. If I have a server with enough horsepower to serve a weather vane and I'm running queries on tables as small as 500,000 rows, the results are going to be misleading.
Apart from the fact that the Microsoft quote answers your question, when doing performance analysis, consider taking the dive into learning how to read the execution plan. It's an investment and very dry, but it will be worth it in the long run.
In short though, whoever was indicating that the trailing wildcard only is more efficient, is correct.
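The difference between seeking and scanning can be sketched in Python, with a sorted list standing in for an index. This is only an analogy for how a B+ tree is traversed, not SQL Server's implementation; the function names and sample addresses are invented for the example. A prefix search can binary-search to the first candidate and stop at the first non-match, while a suffix search must look at every value unless a second index over the reversed strings exists, which is the idea behind the reverse() suggestion in the question.

```python
import bisect


def prefix_seek(sorted_values, prefix):
    """Like "col LIKE 'abc%'" with an index: binary-search, then read a contiguous run."""
    i = bisect.bisect_left(sorted_values, prefix)  # O(log n) jump to first candidate
    matches = []
    while i < len(sorted_values) and sorted_values[i].startswith(prefix):
        matches.append(sorted_values[i])
        i += 1
    return matches


def suffix_scan(values, suffix):
    """Like "col LIKE '%abc'": sort order does not help, so every value is checked."""
    return [v for v in values if v.endswith(suffix)]


def suffix_seek(sorted_reversed_values, suffix):
    """Suffix search through a hypothetical index over the reversed strings."""
    hits = prefix_seek(sorted_reversed_values, suffix[::-1])
    return sorted(v[::-1] for v in hits)


if __name__ == "__main__":
    addresses = ["12 MAIN STREET", "STREET VIEW 4", "9 OAK AVENUE", "1 HIGH STREET APT 2"]
    index = sorted(addresses)
    reversed_index = sorted(a[::-1] for a in addresses)
    print(prefix_seek(index, "STREET"))
    print(suffix_scan(addresses, "STREET"))
    print(suffix_seek(reversed_index, "STREET"))
```

Note that the reversed index has to be built and kept sorted ahead of time; calling reverse() on the column at query time, as in the test above, still has to touch every row.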
In MS SQL, square brackets in a LIKE pattern match any single character from the listed set, so [ABC] means 'A', 'B' or 'C', not the string 'ABC'. Suppose the table name is student:
1) If you want the names that end with 'A', 'B' or 'C':
select * from student where student_name like '%[ABC]'
2) If you want the names that start with 'A', 'B' or 'C':
select * from student where student_name like '[ABC]%'
3) If you want the names that contain 'A', 'B' or 'C' anywhere:
select * from student where student_name like '%[ABC]%'
To match the literal string 'ABC' instead, drop the brackets and use '%ABC', 'ABC%' or '%ABC%'.
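The bracket syntax in these patterns behaves like a character class in a regular expression: it matches one character out of the set, not the sequence. A small Python sketch of the difference, using regular expressions as a stand-in for LIKE (the sample names are invented, and unlike many SQL Server collations, Python's matching is case-sensitive):

```python
import re


def ends_with_one_of_abc(name):
    """Rough equivalent of: student_name LIKE '%[ABC]'."""
    return re.fullmatch(r".*[ABC]", name) is not None


def ends_with_literal_abc(name):
    """Rough equivalent of: student_name LIKE '%ABC'."""
    return name.endswith("ABC")


if __name__ == "__main__":
    names = ["ERICA", "JACOB", "ABCDE", "ZABC", "DAVID"]
    print([n for n in names if ends_with_one_of_abc(n)])
    print([n for n in names if ends_with_literal_abc(n)])
```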

mysqldumpslow: What does these fields indicate..?

Recently we started optimizing slow queries on our live system. As part of that, we decided to use mysqldumpslow to prioritize the slow queries. I am new to this tool. I understand some of the basic information, but I would like to know exactly what the fields below in the output tell us.
OUTPUT: Count: 6 Time=22.64s (135s) Lock=0.00s (0s) Rows=1.0 (6)
What about the fields below?
Time: Is it the average time taken over all 6 occurrences?
135s: What is this 135 seconds?
Rows=1.0 (6): Again, what does this mean?
I couldn't find a good explanation. Many thanks in advance.
Regards,
UDAY
I did some research on this because I wanted to know it too.
I have a log from a fairly heavily used DB server.
The mysqldumpslow command has several optional parameters (https://dev.mysql.com/doc/refman/5.7/en/mysqldumpslow.html), including sort by (-s).
With the many queries I have to work with, I can tell that:
the value before the brackets is the average over all occurrences of the same query within the group (Count in total), and the value within the brackets is the total summed over all of those occurrences. In your case:
you have a query that was called 6 times and took 22.64 seconds on average, which is about 135 seconds in total (6 × 22.64 ≈ 135.8, printed as a whole number). The same applies to locks (if provided) and rows: each execution returned one row on average, 6 rows in total.
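One way to check how the bracketed values relate to the averages is to parse the summary line and multiply each average by the count. Below is a minimal Python sketch; the regular expression is written for the sample line in the question and is an assumption about the layout, not something taken from the mysqldumpslow documentation.

```python
import re

# Layout assumed from the sample line in the question.
SUMMARY = re.compile(
    r"Count:\s*(?P<count>\d+)\s+"
    r"Time=(?P<time_avg>[\d.]+)s\s+\((?P<time_total>\d+)s\)\s+"
    r"Lock=(?P<lock_avg>[\d.]+)s\s+\((?P<lock_total>\d+)s\)\s+"
    r"Rows=(?P<rows_avg>[\d.]+)\s+\((?P<rows_total>\d+)\)"
)


def parse_summary(line):
    """Return the numeric fields of a mysqldumpslow summary line as a dict."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a mysqldumpslow summary line")
    # Values with a decimal point become floats, the rest become ints.
    return {k: float(v) if "." in v else int(v) for k, v in match.groupdict().items()}


if __name__ == "__main__":
    s = parse_summary("Count: 6 Time=22.64s (135s) Lock=0.00s (0s) Rows=1.0 (6)")
    # Average times count lands on the bracketed value once the fraction is dropped.
    print(s["time_avg"] * s["count"], s["time_total"])
    print(s["rows_avg"] * s["count"], s["rows_total"])
```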