How do you run a saved query from the BigQuery CLI and export the result to CSV? - google-bigquery

I have a saved query in BigQuery, but its result is too big to export as CSV. I don't have permission to export to a new table, so is there a way to run the query from the bq CLI and export from there?

From the CLI you can't directly access your saved queries, as that is a UI-only feature as of now, but, as explained here, there is a feature request for it.
If you just want to run it once to get the results, you can copy the query from the UI and paste it into a bq command.
Using the docs' example query, you can try the following with a public dataset:
QUERY="SELECT word, SUM(word_count) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
bq query "$QUERY" > results.csv
The output of cat results.csv should be the following (by default bq prints a formatted table; add --format=csv to get actual CSV):
+---------------+-------+
|     word      | count |
+---------------+-------+
| dispraisingly |     1 |
| praising      |     8 |
| Praising      |     4 |
| raising       |     5 |
| dispraising   |     2 |
| raisins       |     1 |
+---------------+-------+
Just replace the value of the QUERY variable with the text of your saved query.
Also, set the --use_legacy_sql flag to match whether the query is written in Standard or Legacy SQL.
Reference docs here.
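If you already have a results file containing the formatted table shown above rather than real CSV, it can be converted after the fact. The following is a minimal Python sketch, not part of the bq tooling: the function name pretty_table_to_csv and the sample text are made up for illustration, and it assumes no cell contains a "|" character.

```python
import csv
import io


def pretty_table_to_csv(text):
    """Convert a '+---+' bordered, '|'-separated table into CSV text."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Border lines ('+-----+') and blank lines carry no data.
        if not line.startswith("|"):
            continue
        # Drop the outer pipes, then split the remaining cells on '|'.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample in the same shape as the table above.
sample = """
+---------------+-------+
| word          | count |
+---------------+-------+
| dispraisingly | 1     |
| praising      | 8     |
+---------------+-------+
"""
csv_text = pretty_table_to_csv(sample)
print(csv_text)
```

For new runs it is simpler to ask bq for CSV directly with --format=csv, so this is only worth using on output that was already saved.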

Despite what you may have understood from the official documentation, you can get large query results from bq query, but there are multiple details you have to be aware of.
To start, here's an example. I got all of the rows of the public table usa_1910_2013, in the usa_names dataset of the bigquery-public-data project, by using the following commands:
total_rows=$(bq query --use_legacy_sql=false --format=csv "SELECT COUNT(*) AS total_rows FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" | xargs | awk '{print $2}');
bq query --use_legacy_sql=false --max_rows=$((total_rows + 1)) --format=csv "SELECT * FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" > output.csv
The result of this command was a CSV file with 5552454 lines, with the first two containing header information. The number of rows in this table is 5552452, so it checks out.
Here's where the caveats come into play:
- Regardless of what the documentation might seem to say about query download limits, those limits seem to apply only to the Web UI, meaning bq is exempt from them;
- At first I ran this bq command in Cloud Shell, but the result set was so large that streaming it killed the Cloud Shell instance! I had to use a Compute Engine instance with at least the resources of an n1-standard-4 (4 vCPUs, 16 GiB RAM), and even with that much RAM the query took 10 minutes to finish (the query itself runs server-side; the bottleneck is buffering the results on the client);
- I'm manually copy-pasting the query itself, as there doesn't seem to be a way to reference saved queries directly from bq;
- You don't have to use Standard SQL, but you do have to specify --max_rows, because otherwise bq returns only 100 rows (100 is the current default value of this flag);
- You'll still face the usual BigQuery quotas and limits, so whether to run this as a batch job is up to you. Also, don't forget that the maximum response size for a query is 128 MiB, so you might need to split the query into multiple bq query commands to stay under that limit. If you want a public table big enough to hit this limit, try samples.wikipedia from the bigquery-public-data project.
I think that's about it! Just make sure you're running these commands on a beefy machine and after a few tries it should give you the result you want!
P.S.: There's currently a feature request to increase the size of CSVs you can download from the Web UI. You can find it here.
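For anyone scripting the two-step approach above from Python rather than the shell, here is a rough sketch. It only builds the command line and does not call bq itself; the helper names parse_total_rows and build_export_command are hypothetical, and the only flags used are the ones already shown in this answer.

```python
import shlex
from typing import List

TABLE = "`bigquery-public-data.usa_names.usa_1910_2013`"


def parse_total_rows(csv_output: str) -> int:
    """Read the count from 'bq query --format=csv' output (header line, then value)."""
    lines = [ln.strip() for ln in csv_output.splitlines() if ln.strip()]
    return int(lines[-1])


def build_export_command(sql: str, max_rows: int) -> List[str]:
    """Assemble the bq invocation as an argument list (no shell quoting issues)."""
    return [
        "bq", "query",
        "--use_legacy_sql=false",
        f"--max_rows={max_rows}",
        "--format=csv",
        sql,
    ]


# Assumed output of the COUNT(*) query, using the row count reported above.
total_rows = parse_total_rows("total_rows\n5552452\n")
command = build_export_command(f"SELECT * FROM {TABLE}", total_rows + 1)

# Print a shell-safe version; redirect it to a file when running for real.
print(shlex.join(command), "> output.csv")
```

Passing the list to subprocess.run with stdout pointed at an open file would stream the result to disk, but the memory caveat above still applies to the bq process itself.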

Related

How do I parse by regular expressions only on filtered lines in CloudWatch Logs Insights?

Is there a way to restructure this CloudWatch Logs Insights query so that it runs faster?
fields @timestamp, @message
| filter @message like /NewProductRequest/
| parse @message /.*"productType":\s*"(?<productType>\w+)"/
| stats count(*) by productType
I am running it over a limited period (1 day's worth of logs). It is taking very long to run.
When I remove the parse command and count(*) the filtered lines, there are only 2,500 matches out of 20,000,000 lines, and the query returns in several seconds.
With the parse command, the query takes >15 minutes. I can see the throughput drop from ~1GBps to ~2MBps.
Running a parse regexp on 2,500 filtered lines should be negligible. It takes less than 2 seconds if I download the filtered results to my MacBook and run the regexp in Python.
This leads me to believe that CloudWatch is running the parse command on every line in the log, and not just the filtered lines.
Is there a way to restructure my query so that the parse command runs after my filter command (effectively parsing 2.5k lines instead of 20 million lines)?
Removing the .* at the beginning of the expression increases performance. If your pattern only uses the leading .* to allow the match to start anywhere in the line, this solution will work for you. It does not help if your regexp begins with anything other than .*
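To see why dropping the leading .* is safe for a pattern like this, here is a small Python check. It is only an illustration: the log line is invented, and Python spells named groups (?P<name>...) rather than the (?<name>...) form used in the query.

```python
import re

# Hypothetical log line; real messages will look different.
line = '2020-01-01 INFO NewProductRequest {"productType": "widget", "qty": 3}'

# The original pattern, with the leading '.*'.
with_prefix = re.compile(r'.*"productType":\s*"(?P<productType>\w+)"')
# The same pattern without it; an unanchored search can start anywhere anyway.
without_prefix = re.compile(r'"productType":\s*"(?P<productType>\w+)"')

captured_with = with_prefix.search(line).group("productType")
captured_without = without_prefix.search(line).group("productType")

print(captured_with, captured_without)
```

Both forms capture the same value when the key appears once per line. If it appears several times, the greedy .* version picks the last occurrence and the shorter version picks the first, so check which one you need before switching.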

How to get cost for a query in BQ

In BigQuery, how can we get the cost for a given query? We are doing a lot of high-compute queries -- https://cloud.google.com/bigquery/pricing#high-compute -- which often multiplies the data processed by 2 or more.
Is there a way to get the "Cost" of a query with the result set?
For the API or the CLI you could use the --dry_run flag, which validates the query instead of running it, like so:
cat ../query.sql | bq query --use_legacy_sql=False --dry_run
Output:
Query successfully validated. Assuming the tables are not modified,
running this query will process 9614741466 bytes of data.
For the cost, divide the total bytes by 1024 ^ 4, multiply the result by 5, and then multiply by the Billing Tier you are in to get the expected cost (about $0.044 at tier 1 in this example).
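The arithmetic can be wrapped in a few lines of Python. This is just a sketch of the formula described here: the function name is made up, and the default of 5 per 1024^4 bytes is the figure used in this answer, so adjust it if your pricing differs.

```python
TIB = 1024 ** 4  # bytes per unit used in the formula above


def estimate_query_cost(bytes_processed, usd_per_tib=5.0, billing_tier=1):
    """Estimated cost = bytes / 1024^4 * price * billing tier."""
    return bytes_processed / TIB * usd_per_tib * billing_tier


# Byte count taken from the dry-run output above.
cost = estimate_query_cost(9614741466)
print(f"${cost:.4f}")  # prints $0.0437
```

For a high-compute query, pass the tier reported for the job, for example estimate_query_cost(9614741466, billing_tier=2).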
If you already ran the query and want to know how much it processed, you can run:
bq show -j (job_id of your query)
And it'll return Bytes Billed and Billing Tier (looks like you still have to do the math for cost computation).
For WebUI, you can install BQMate and it already estimates costs for you (but you still have to adapt for your Billing Tier).
As a final recommendation, it is sometimes possible to greatly improve the performance of analyses just by optimizing how the query processes data (at our company we had several high-compute queries that now process data at the normal tier just by using features such as ARRAYs and STRUCTs, for instance).

Where clause searches everything if it has character `s` at the end

I'm trying to run a simple select command in sqlite3 and getting a strange result. I want to search a column and display all rows that have the string dockerhosts in it, but the result shows rows without the dockerhosts string in them.
For example search for dockerhosts:
sqlite> SELECT command FROM history WHERE command like '%dockerhosts%' ORDER BY id DESC limit 50;
git status
git add --all v1 v2
git status
If I remove s from the end I get what I need:
sqlite> SELECT command FROM history WHERE command like '%dockerhost%' ORDER BY id DESC limit 50;
git checkout -b hotfix/collapse-else-if-in-dockerhost
vi opt/dockerhosts/Docker
aws s3 cp dockerhosts.json s3://xxxxx/dockerhosts.json --profile dev
aws s3 cp dockerhosts.json s3://xxxxx/dockerhosts.json --profile dev
history | grep dockerhost | grep prod
history | grep dockerhosts.json
What am I missing?
I see a note here that there are configurable limits for a LIKE pattern - sqlite.org/limits.html ... 10 seems pretty short but maybe that's what you are running into.
The pattern matching algorithm used in the default LIKE and GLOB
implementation of SQLite can exhibit O(N²) performance (where N is the
number of characters in the pattern) for certain pathological cases.
To avoid denial-of-service attacks from miscreants who are able to
specify their own LIKE or GLOB patterns, the length of the LIKE or
GLOB pattern is limited to SQLITE_MAX_LIKE_PATTERN_LENGTH bytes. The
default value of this limit is 50000. A modern workstation can
evaluate even a pathological LIKE or GLOB pattern of 50000 bytes
relatively quickly. The denial of service problem only comes into play
when the pattern length gets into millions of bytes. Nevertheless,
since most useful LIKE or GLOB patterns are at most a few dozen bytes
in length, paranoid application developers may want to reduce this
parameter to something in the range of a few hundred if they know that
external users are able to generate arbitrary patterns.
The maximum length of a LIKE or GLOB pattern can be lowered at
run-time using the
sqlite3_limit(db,SQLITE_LIMIT_LIKE_PATTERN_LENGTH,size) interface.
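One way to narrow this down is to run the same two LIKE patterns against a fresh in-memory database. The sketch below uses Python's built-in sqlite3 module with a few invented rows modelled on the question's history table; it shows what LIKE normally returns, not what went wrong in the asker's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, command TEXT)")
conn.executemany(
    "INSERT INTO history (command) VALUES (?)",
    [
        ("git status",),
        ("vi opt/dockerhosts/Docker",),
        ("git checkout -b hotfix/collapse-else-if-in-dockerhost",),
        ("history | grep dockerhosts.json",),
    ],
)


def search(pattern):
    """Run the question's query with the given LIKE pattern."""
    rows = conn.execute(
        "SELECT command FROM history WHERE command LIKE ? "
        "ORDER BY id DESC LIMIT 50",
        (pattern,),
    )
    return [command for (command,) in rows]


with_s = search("%dockerhosts%")
without_s = search("%dockerhost%")
print(with_s)
print(without_s)
```

If a clean database behaves like this while the real one does not, the difference is more likely in the data or the shell environment than in the pattern length, since a 13-byte pattern is far below the 50000-byte default quoted above.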

Get top 3 results

I have a query using the analysis type count, I've got it grouped by type and it is returning me 12 different groups with varying values.
Would it be possible to get only the 3 groups with the highest count from that query?
The Keen API doesn't (as of October 2015) support this directly, although it is a commonly requested feature. It may be added in the future but there is currently no timeline for that.
The best workaround is to do the sorting and trimming on the client side once the response has been received. This should only take a few lines of code in most programming languages. If you're working from a command line (e.g. via curl) then you could use jq to do it:
curl "https://api.keen.io/3.0/projects/...<insert your query URL>..." > result.json
cat result.json | jq '.result | sort_by(.result) | reverse | .[:3]'
Hope that helps! (Disclosure: I'm a platform engineer at Keen.)
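The same client-side sort and trim in Python might look like the sketch below. The response shape is an assumption inferred from the jq filter above (a top-level "result" list whose items each carry their own "result" count), and the group names and numbers are invented.

```python
import json

# Invented response body in the shape the jq filter above expects.
response_text = """
{"result": [
    {"type": "signup", "result": 14},
    {"type": "login", "result": 52},
    {"type": "purchase", "result": 7},
    {"type": "logout", "result": 31},
    {"type": "refund", "result": 2}
]}
"""


def top_groups(response, n=3):
    """Return the n groups with the highest count, largest first."""
    groups = response["result"]
    return sorted(groups, key=lambda group: group["result"], reverse=True)[:n]


top3 = top_groups(json.loads(response_text))
for group in top3:
    print(group["type"], group["result"])
```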

Doing multiple queries in Postgresql - conditional loop

Let me first start by stating that in the last two weeks I have received ENORMOUS help from just about all of you (ok ok not all... but I think perhaps two dozen people commented, and almost all of these comments were helpful). This is really amazing and I think it shows that the stackoverflow team really did something GREAT altogether. So thanks to all!
Now as some of you know, I am working at a campus right now and I have to use a windows machine. (I am the only one who has to use windows here... :( )
Now I managed to set up (ok, the IT department did that for me) and populate a Postgres database (this I did on my own) with about 400 MB of data. That is perhaps not much for most of you heavy Postgres users, but I was more used to SQLite databases for personal use, which rarely exceeded 2 MB.
Anyway, sorry for being so chatty - now the queries from that database work nicely. I use ruby to do queries actually.
The entries in the Postgres database are interconnected, in as far as they are like "pointers" - they have one field that points to another entry.
Example:
entry 3667 points to entry 35785 which points to entry 15566. So it is quite simple.
The main entry is 1, so the end of all these queries is 1. So, from any other number, we can reach 1 in the end as the last result.
I am using ruby to make as many individual queries to the database as needed until the last result returned is 1. This can take up to 10 individual queries. I do this by logging into psql with my password and data, and then performing the SQL query via -c. This is probably not ideal: it takes a little time to do these logins and queries, and ideally I would log in only once, perform ALL queries in Postgres, then exit with a result (all these entries as the result).
Now here comes my question:
- Is there a way to make conditional queries all inside of Postgres?
I know how to do it in a shell script and in ruby but I do not know if this is available in postgresql at all.
I would need to make the query, in literal english, like so:
"Please give me all the entries that point to the parent entry, until the last found entry is eventually 1, then return all of these entries."
I already "solved" it by using ruby to make several queries until 1 is eventually returned, but this strikes me as fairly inelegant and possibly not efficient.
Any information is very much appreciated - thanks!
Edit (argh, I fail at pasting...):
Example dataset, the table would be like this:
 id | parent
----+--------
  1 |      1
  2 | 131567
  6 | 335928
  7 |      6
  9 |      1
 10 | 135621
 11 |      9
I hope that works, I tried to narrow it down solely on example.
For instance, id 11 points to id 9, and id 9 points to id 1.
It would be great if one could use SQL to return:
11 -> 9 -> 1
Without example table definitions it is hard to be specific, but what you're asking for resembles a tree structure, which can be traversed with recursive queries: http://www.postgresql.org/docs/8.4/static/queries-with.html .
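As a rough sketch of what such a recursive query could look like for the sample data in the question: the example below runs WITH RECURSIVE through Python's built-in sqlite3 module purely so that it is self-contained, on the assumption that the same query text works in PostgreSQL 8.4 or later (with %s instead of ? as the placeholder in most Postgres drivers). The table name entries and the helper path_to_root are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, parent INTEGER)")
conn.executemany(
    "INSERT INTO entries (id, parent) VALUES (?, ?)",
    [(1, 1), (2, 131567), (6, 335928), (7, 6), (9, 1), (10, 135621), (11, 9)],
)

# Start at one entry and keep following 'parent' until an entry points
# to itself (the root, id 1) or the parent is missing from the table.
CHAIN_SQL = """
WITH RECURSIVE chain(id, parent, depth) AS (
    SELECT id, parent, 0 FROM entries WHERE id = ?
    UNION ALL
    SELECT e.id, e.parent, c.depth + 1
    FROM entries e
    JOIN chain c ON e.id = c.parent
    WHERE c.id <> c.parent
)
SELECT id FROM chain ORDER BY depth
"""


def path_to_root(start_id):
    """Return the ids visited from start_id up to the root, in order."""
    return [row[0] for row in conn.execute(CHAIN_SQL, (start_id,))]


path = path_to_root(11)
print(" -> ".join(str(entry_id) for entry_id in path))  # prints 11 -> 9 -> 1
```

This replaces the loop of up to 10 separate psql logins with a single query. The c.id <> c.parent condition is what stops the recursion at the self-referencing root row.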