Way to run Kusto query on command line against JSON - kql

Is there a Linux command line tool, like jq, but that runs Kusto queries?
So I could do something like this:
$ cat somedata.json | kql 'where user="ml" | count'
42
I do mean run it on my own server, not using an Azure service.

Yes, there is a command-line tool, but it runs against a Kusto cluster; see here: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/tools/kusto-cli
You can run it against the "help" cluster (https://help.kusto.windows.net), which is a public cluster.
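That won't evaluate KQL against a local JSON file, though. For the specific example in the question, jq itself (which the question already mentions) can express the equivalent filter-and-count; a rough sketch, assuming somedata.json holds an array of objects with a user field:
$ jq 'map(select(.user == "ml")) | length' somedata.json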

Related

How to find a Hive Table is ACID enabled?

What are all the possible ways to find out whether a given Hive table is ACID or non-ACID?
As mentioned here, one way to achieve this is to run the below command in a shell script and check the output:
hive -e "describe extended <Database>.<tablename>;" | grep "transactional=true"
What are the other possible ways to achieve this? The solution can be in a shell script/Apache Pig/Java, and will be invoked via an Oozie workflow.
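For reference, a minimal sketch of that grep-based check wrapped in a shell script (the database and table names are placeholders; grep -q exits 0 only when the table properties contain transactional=true):
#!/bin/sh
if hive -e "describe extended mydb.mytable;" | grep -q "transactional=true"; then
    echo "mydb.mytable is ACID"
else
    echo "mydb.mytable is non-ACID"
fi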

Automatically updating data from an API to an SQL database?

I have data coming from an API in JSON format, which I run through a few functions/transformations in Python and then insert into an SQL database through pandas & SQLAlchemy.
Now, how do I automatically do this at the end of every day, without having to open up the script and run it manually?
You can use crontab on a server (or your Linux/Mac laptop, though of course it will not run the script while the machine is turned off).
You can run crontab -e to edit the crontab file. Add something like the following to run your script every day at 11 PM:
0 23 * * * ~/myscript.py
crontab guru is a useful resource to try different schedule expressions.
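A slightly more defensive variant of that entry (a sketch; the interpreter and script paths are assumptions): invoking the interpreter explicitly avoids depending on a shebang line and the executable bit, and redirecting output gives you a log to check when the job fails silently:
0 23 * * * /usr/bin/python3 /home/me/myscript.py >> /home/me/cron.log 2>&1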

Can't access external Hive metastore with Pyspark

I am trying to run some simple code to show the databases that I created previously on my hive2 server. (Note: I tried this in both Python and Scala, with the same results.)
If I log into a hive shell and list my databases, I see a total of 3 databases.
When I start the Spark shell (2.3) via pyspark, I do the usual and add the following property to my SparkSession:
sqlContext.setConf("hive.metastore.uris","thrift://*****:9083")
And restart the SparkContext within my session.
If I run either of the following lines to see all the configs:
pyspark.conf.SparkConf().getAll()
spark.sparkContext._conf.getAll()
I can indeed see the parameter has been added. I then start a new HiveContext:
hiveContext = pyspark.sql.HiveContext(sc)
But if I list my databases:
hiveContext.sql("SHOW DATABASES").show()
it does not show the same results as the hive shell.
I'm a bit lost; for some reason it looks like it is ignoring the config parameter. I am sure the address I'm using is my metastore, because the address I get from running:
hive -e "SET" | grep metastore.uris
is the same address I also get if I run:
ses2 = spark.builder.master("local").appName("Hive_Test").config('hive.metastore.uris','thrift://******:9083').getOrCreate()
ses2.sql("SET").show()
Could it be a permissions issue? For example, some tables might not be set to be visible outside the hive shell/user.
Thanks
Managed to solve the issue. It was a communication problem: Hive was not hosted on that machine. I corrected the code and now everything works fine.
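For anyone else hitting this symptom: hive.metastore.uris generally has to be in place before the SparkSession is created, since the Hive client connects to the metastore at initialization and a later setConf will not reconnect. One hedged way to guarantee that from the shell (the host is a placeholder; spark.hadoop.* properties are forwarded into the Hadoop configuration that Hive reads):
pyspark --conf spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083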

Google Bigquery BQ command line execute query from a file

I use the bq command line tool to run queries, e.g:
bq query "select * from table"
What if I store the query in a file and run the query from that file? Is there a way to do that?
The other answers seem to be either outdated or needlessly brittle. As of 2019, bq query reads from stdin, so you can just redirect your file into it:
bq query < myfile.sql
Query parameters are passed like this:
bq query --parameter name:type:value < myfile.sql
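As a concrete illustration (the file name and parameter are hypothetical; inside standard SQL a named parameter is referenced as @name, which is why --use_legacy_sql=false is needed):
$ echo 'SELECT @n + 1 AS answer' > add_one.sql
$ bq query --use_legacy_sql=false --parameter=n:INT64:41 < add_one.sql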
There is another way.
Try this:
bq query --flagfile=[your file with absolute path]
Ex:
bq query --flagfile=/home/user/abc.sql
You can run a query from a text file with a little bit of shell magic:
$ echo "SELECT 17" > qq.txt
$ bq query "$(cat qq.txt)"
Waiting on bqjob_r603d91b7e0435a0f_00000150c56689c6_1 ... (0s) Current status: DONE
+-----+
| f0_ |
+-----+
| 17 |
+-----+
Note this works on any unix variant (including Mac). If you're using Windows, this should work under PowerShell but not the default cmd prompt.
If you are using standard SQL (not legacy SQL):
Steps:
1. Create a .sql file (you can use any extension).
2. Put your query in it. Make sure there is a semicolon (;) at the end of the query.
3. Go to the command line and execute the below commands.
4. If you want to add parameters, you have to specify them sequentially.
Example:
bq query --use_legacy_sql=False "$(cat /home/airflow/projects/bql/query/test.sql)"
For a parameter:
bq query --use_legacy_sql=False --parameter=country::USA "$(cat /home/airflow/projects/bql/query/test.sql)"
cat >/home/airflow/projects/bql/query/test.sql
select * from l1_gcb_trxn.account where country=@country;
This thread offers a good solution:
bq query `cat my_query.sql`
bq query --replace --use_legacy_sql=false \
  --destination_table=syw-analytics:store_ranking.SHC_ENGAGEMENT_RANKING_TEST \
  "SELECT RED,
          DEC,
          REDEM
   from \`syw.abc.xyz\`"

How to save the results of an impala query

I've loaded a large set of data from S3 into hdfs, and then inserted the data to a table in impala.
I then ran a query against this data, and I'm looking to get these results back into S3.
I'm using Amazon EMR, with Impala 1.2.4. If it's not possible to get the results of the query back to S3 directly, are there options to get the data back to HDFS and then somehow send it back to S3 from there?
I have messed around with the impala-shell -o filename option, but that appears to only work on the local Linux file system.
I thought this would have been a common scenario, but I'm having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the knowledge above, I am including the command that writes the query results to a file with a delimiter of our choice via the option --output_delimiter, together with the option --delimited, which switches the output from pretty-printed tables to delimited text (tab-delimited by default):
impala-shell -q "query " --delimited --output_delimiter='\001' --print_header -o 'filename'
What I usually do if it's a smallish result set is run the query from the command line and then upload the file to S3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
An alternative is to use Hive as the last step in your data pipeline after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
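A sketch of wiring those two steps together from the shell, reusing the placeholder names above:
impala-shell -q "create table processed_data as select blah from raw_data1 join raw_data2 on a=b"
hive -e "create external table export like processed_data location 's3://mybucket/export/'; insert into table export select * from processed_data;"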
If you have the AWS CLI installed, you can pipe Impala shell's standard output straight to S3 using a unix pipe and the stream operand (-); the -B flag gives plain delimited output:
impala-shell -B -q "your query" | aws s3 cp - s3://mybucket/outputfilename