For instance, a simulated device stores humidity readings to blob storage. How can I write a query (humidity > X) so that X is picked from another file in blob storage? This X can vary.
You can use reference data to store the thresholds (X in the example above). Reference data can be either static or dynamic.
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
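To make the idea concrete, here is a minimal Python sketch of what a reference-data join does conceptually. It is not Stream Analytics code: the function `over_threshold`, the `deviceId` key, and the sample values are all hypothetical, and the dictionary simply stands in for the reference file that holds X.

```python
from typing import Any, Iterable, Iterator, Mapping


def over_threshold(
    readings: Iterable[Mapping[str, Any]],
    thresholds: Mapping[str, float],
) -> Iterator[Mapping[str, Any]]:
    """Yield readings whose humidity exceeds the threshold for their device.

    `thresholds` plays the role of the reference data: it maps a
    (hypothetical) deviceId to the current value of X.
    """
    for reading in readings:
        limit = thresholds.get(reading["deviceId"])
        # Like an inner join, readings without a reference row are dropped.
        if limit is not None and reading["humidity"] > limit:
            yield reading


readings = [
    {"deviceId": "dev-1", "humidity": 71.0},
    {"deviceId": "dev-1", "humidity": 42.5},
    {"deviceId": "dev-2", "humidity": 55.0},
]
# Reloading this mapping changes X without touching the filter logic.
thresholds = {"dev-1": 60.0}
alerts = list(over_threshold(readings, thresholds))
```

The point of the design is that the comparison never hard-codes X; only the lookup table changes when the threshold does.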
I need to query a Google BigQuery table and export the results to a gzipped file.
This is my current code. The requirement is that each row's data should be newline (\n) delimited.
import gzip
import json

from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials

def batch_job_handler(args):
    credentials = Credentials.from_service_account_info(service_account_info)
    client = Client(project=service_account_info.get("project_id"),
                    credentials=credentials)
    query_job = client.query(QUERY_STRING)
    results = query_job.result()  # Result's total_rows is 1300000 records
    with gzip.open("data/query_result.json.gz", "wb") as file:
        data = ""
        for res in results:
            data += json.dumps(dict(list(res.items()))) + "\n"
            break  # as posted; stops after the first row (likely a debugging leftover)
        file.write(bytes(data, encoding="utf-8"))
The above solution works perfectly fine for a small number of results, but gets too slow when the result has 1300000 records.
Is it because of this line: json.dumps(dict(list(res.items()))) + "\n", since I am constructing a huge string by concatenating every record with a newline?
As I am running this program in AWS Batch, it is consuming too much time. I need help iterating over the result and writing millions of records to a file in a faster way.
Check out the new BigQuery Storage API for quick reads:
https://cloud.google.com/bigquery/docs/reference/storage
For an example of the API at work, see this project:
https://github.com/GoogleCloudPlatform/spark-bigquery-connector
It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:
Direct Streaming
It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.
Filtering
The new API allows column and limited predicate filtering to only read the data you are interested in.
Column Filtering
Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.
Predicate Filtering
The Storage API supports limited pushdown of predicate filters. It supports a single comparison to a literal
You should (in most cases) point the output of your BigQuery query to a temp table and export that temp table to a Google Cloud Storage bucket. From that bucket, you can download the files locally. This is the fastest route to having the results available locally. Everything else will be painfully slow, especially iterating over the results, as BQ is not designed for that.
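If iterating over the rows client-side cannot be avoided, one thing that can at least be removed is the single large string built in the question. The following is a minimal sketch, not a benchmarked fix: `write_ndjson_gz` is a hypothetical helper that accepts any iterable of mappings and writes each row as it arrives. Fetching the rows from BigQuery may still dominate the run time.

```python
import gzip
import json
from typing import Any, Iterable, Mapping


def write_ndjson_gz(rows: Iterable[Mapping[str, Any]], path: str) -> int:
    """Write one JSON object per line to a gzip file; return the row count."""
    count = 0
    # Text mode lets gzip handle the UTF-8 encoding. Each row is written as
    # soon as it is produced, so no large intermediate string is built.
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for row in rows:
            # default=str is a fallback for values json cannot encode natively
            # (for example dates or decimals).
            out.write(json.dumps(dict(row), default=str))
            out.write("\n")
            count += 1
    return count
```

With the question's code, the call would look something like `write_ndjson_gz((dict(r.items()) for r in results), "data/query_result.json.gz")`, assuming `results` is the iterator returned by `query_job.result()`.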
I am processing a huge number of small JSON files with Azure Data Lake Analytics and I want to save the result into multiple JSON files (if needed) with a maximum size (e.g. 128 MB).
Is this possible?
I know that there is an option to write a custom outputter, but it writes row by row only, so I have no information about the whole file size (I guess).
There is a FILE.LENGTH() property in U-SQL, which gives me the size of each extracted file. Is it possible to use it to repeatedly call OUTPUT with different files and pass to it only the files that fit my size limit?
Thank you for your help.
Here is an example of what you can do with FILE.LENGTH.
@yourData =
    EXTRACT
        // ... columns to extract
        , file_size = FILE.LENGTH()
    FROM "/mydata/{*}" // input files path
    USING Extractors.Csv();

@res =
    SELECT *
    FROM @yourData
    WHERE file_size < 100000; // your file size limit in bytes
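The filter above keeps individual files under a limit. If the goal is instead to group inputs so that each group stays under a total size, the grouping logic could look like the Python sketch below. This is an illustration of the batching idea only, not U-SQL: `batch_by_size` and the sample file names are hypothetical, and input sizes are just a proxy for the size of whatever gets written out.

```python
from typing import Iterable, List, Tuple


def batch_by_size(
    files: Iterable[Tuple[str, int]], max_bytes: int
) -> List[List[str]]:
    """Greedily group (path, size) pairs so each batch stays within max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the running batch if adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


batches = batch_by_size(
    [("a.json", 60), ("b.json", 50), ("c.json", 30), ("d.json", 200)],
    max_bytes=100,
)
```

The greedy pass preserves input order and is not an optimal packing; sorting by size first would usually give fuller batches at the cost of reordering.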
In my Azure Stream Analytics job, I am attempting to geolocate IP addresses. The reference data that I am using is around 165 MB. Reference data blobs are limited to 100 MB each, but the documentation states the following:
Stream Analytics has a limit of 100 MB per blob but jobs can process multiple reference blobs by using the path pattern property.
How would I go about taking advantage of this? I have split my data into two 85 MB files, iplookup1.csv and iplookup2.csv, but cannot figure out how to get the Reference Data input to pick up both as one large dataset.
As a stop-gap, I may try to create two reference data inputs, then do a left-join across both and pull the value that is not null.
As I understand it, for reference data you can either specify static data (e.g. products/products.csv) in the Path Pattern property, or include one or more instances of the {date} and {time} variables, as in products/{date}/{time}/products.csv, to refresh your reference data.
For your scenario, I assume you need to create two reference data inputs; you can then use the UNION operation to combine the results of two or more queries into a single result. For the join itself, see the documentation on Reference Data JOIN.
UPDATE:
SELECT I1.propertyName, ip01.propertyName
FROM Input1 I1
JOIN iplookup1 ip01
    ON I1.address = ip01.address
UNION
SELECT I1.propertyName, ip02.propertyName
FROM Input1 I1
JOIN iplookup2 ip02
    ON I1.address = ip02.address
I need to run a join query on BigQuery in one project that may return a large amount of data (possibly more than fits in the VM's memory), and then save the results in BigQuery in another project.
Is there an easy way to do this without loading the data into the VM, given that the data size can vary and the VM may not have enough memory to hold it?
One method is to bypass the VM for this operation and use Google Cloud Storage instead.
The process looks like the following:
Create a GS bucket that both projects have access to.
Source project: export the table to the GS bucket (this is possible from the web interface, and I am pretty sure the CLI tools can do it too).
Destination project: create a new table from the files in the GS bucket.
To save the result of a query to a table in any project, you do not need to bring it to the VM first. Just set the destination property correctly; of course, you also need write permission on the dataset that contains that table.
The destination property varies depending on the client tool you use.
For example, if you are using the REST API's jobs.insert, you should set the property below:
configuration.query.destinationTable nested object [Optional]
Describes the table where the query results should be stored. If not present, a new table will be created to store the results. This property must be set for large results that exceed the maximum response size.
configuration.query.destinationTable.datasetId string [Required]
The ID of the dataset containing this table.
configuration.query.destinationTable.projectId string [Required]
The ID of the project containing this table.
configuration.query.destinationTable.tableId string [Required]
The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.
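As a rough illustration of how those properties nest, here is a Python sketch that builds a jobs.insert request body as a plain dictionary. The helper `query_job_body` and the project, dataset, and table names are hypothetical; sending the request (authentication, HTTP call) is left out, and the tableId check only mirrors the rule quoted above.

```python
import re
from typing import Any, Dict

# Mirrors the quoted rule: letters, digits, underscores, at most 1,024 characters.
_TABLE_ID = re.compile(r"[A-Za-z0-9_]{1,1024}")


def query_job_body(
    sql: str, project_id: str, dataset_id: str, table_id: str
) -> Dict[str, Any]:
    """Build a jobs.insert body whose query writes to a destination table."""
    if not _TABLE_ID.fullmatch(table_id):
        raise ValueError(f"invalid tableId: {table_id!r}")
    return {
        "configuration": {
            "query": {
                "query": sql,
                "destinationTable": {
                    "projectId": project_id,  # may differ from the source project
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        }
    }


body = query_job_body(
    "SELECT 1 AS x",      # placeholder query
    "dest-project",       # hypothetical destination project
    "reports",            # hypothetical dataset
    "join_result",        # hypothetical table
)
```

Because the destination is part of the job configuration, the result rows go straight into the target table and never pass through the machine that submits the job.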
I need to perform funnel analysis on data with the following schema:
A(int X) Matched_B(int[] Y) Filtered_C(int[] Z)
Where,
A refers to the client ID; a client can send multiple requests. Instead of a request ID, only the client ID is stored per request in the data pipeline. (I don't know why.)
Matched_B refers to a list of items returned for a query.
Filtered_C is a subset of Matched_B and refers to the items which successfully passed the filter.
All the data is stored in avro files in HDFS. The QPS with which data is being stored in HDFS is around 12000.
I need to prepare the following reports:
For each combination of (X,Y[i]), the number of times Y[i] appears in Matched_B.
For each combination of (X,Y[i]), the number of times Y[i] appears in Filtered_C.
Basically I would like to know whether this task can be performed using Hive only?
Currently, I am thinking of the following architecture.
HDFS(avro_schema)--> Hive_Script_1 --> HDFS(avro_schema_1) --> Java Application --> HDFS(avro_schema_2) --> Hive_Script_2(external_table) --> result
Where,
avro_schema is the schema described above.
avro_schema_1 is generated by Hive_Script_1 by transforming (using Lateral View explode(Matched_B)) avro_schema and is described as follows:
A(int X) Matched_B_1(int Y) Filtered_C(int[] Z)
avro_schema_2 is generated by the Java Application and is described as follows:
A(int X) Matched_B(int Y) Matched_Y(1 if Y is matched, else 0) Filtered_Y(1 if Y is filtered, 0 otherwise)
Finally we can run a Hive script to process this data for events generated each day.
The other architecture could be that we remove avro_schema_1 generation and directly process avro_schema from the Java application and generate the result.
However, I would like to avoid writing a Java application for this task. Could someone point me to a Hive solution to the above problem?
I would also like an architectural point of view on an efficient solution to this problem.
Note: Kindly suggest a solution that takes the QPS (12000) into account.
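To pin down what the two reports should contain, here is a small Python sketch of the aggregation: explode each array into (X, Y) pairs, then count the pairs. It follows the same explode-then-group shape the question already describes for Hive_Script_1, but it is not a Hive solution and says nothing about handling 12000 QPS. The record layout and sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Record = Tuple[int, List[int], List[int]]  # (X, Matched_B, Filtered_C)


def funnel_counts(records: Iterable[Record]) -> Tuple[Counter, Counter]:
    """Count how often each (X, Y) pair occurs in Matched_B and in Filtered_C."""
    matched: Counter = Counter()
    filtered: Counter = Counter()
    for x, matched_b, filtered_c in records:
        # "Explode" each array: one (X, Y) pair per element.
        matched.update((x, y) for y in matched_b)
        filtered.update((x, y) for y in filtered_c)
    return matched, filtered


records = [
    (1, [10, 20], [10]),
    (1, [10, 30], [10, 30]),
    (2, [10], []),
]
matched, filtered = funnel_counts(records)
```

Both counters are keyed by (X, Y), so the two reports can be lined up per pair; a pair that never passed the filter simply reads as 0 in `filtered`.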