Stream Analytics query and output - azure-stream-analytics

I am learning Azure Stream Analytics and have run into an issue I cannot resolve on my own, so I would appreciate some help.
I wrote a query in Stream Analytics and it passes the test (I upload a JSON file, run the test, and the results are shown on the page).
But when I run the Stream Analytics job (output to a table), with the input coming from Event Hub, nothing is written to the table.
I then ran several tests, and so far my conclusion is:
When I write the query like this:
SELECT * INTO [TABLE] FROM INPUT
The results appear in the table (all contents are written to the table).
However if I write the query like this:
SELECT ID, VALUE INTO [Table] FROM INPUT
Nothing is written to the table.
I am confused about the reason and hope someone can explain it, because my original query that handles the data is more complex (it includes a join).

Related

Need to simulate resourceName with full table path in Log Explorer

I need to understand under what circumstances protoPayload.resourceName appears in the Log Explorer with the full table path, i.e. projects/<project_id>/datasets/<dataset_id>/tables/<table_id>, as shown in the example below.
The entries below were generated by a Composer DAG running a KubernetesPodOperator that executes some dbt commands on some models. Based on this, I have a sink linked to Pub/Sub for further processing.
As seen in the image, the resourceName value appears as:
projects/gcp-project-name/datasets/dataset-name/tables/table-name
I have shaded the actual values of the project ID, dataset ID, and table name.
I can't run a similar DAG job with the KubernetesPodOperator on test tables owing to environment restrictions, so I tried running some update and insert queries using the BigQuery editor. Here is how the value of protoPayload.resourceName comes out:
projects/gcp-project-name/jobs/bxuxjob_
I tried the same queries from a Composer DAG using BigQueryInsertJobOperator. Here is how the value of protoPayload.resourceName comes out:
projects/gcp-project-name/jobs/airflow_<>_
Here is my question: what operation or operations in BigQuery will give me a protoPayload.resourceName in the form I am expecting, i.e.
projects/<project_id>/datasets/<dataset_id>/tables/<table_id>

AWS QuickSight custom SQL query on Athena

I have a pipeline where AWS Kinesis Firehose receives data, converts it to Parquet format based on an Athena table, and stores it in an S3 bucket using a date partition (date_int: YYYYMMdd). Whenever new data is added to the bucket, a Lambda is triggered to check whether Athena already knows about the partition. Everything seems to be working fine; in Athena I can run a query (see below) and the newest data is returned.
Athena query: SELECT * FROM "my_table" WHERE "date_int" >= 20210308
(On the left-hand side of the screen the correct Data Source and Database are selected)
Now I want to visualise the data in QuickSight. I can use either SPICE or a direct query; again, all seems to be working fine. However, I have the data partitioned because I only need data points of, say, the last month. In QuickSight I create a new dataset, choose the correct catalog/database/table, and click 'Use custom SQL'. Then, when I run the query, I always get an error from the Athena client saying the table couldn't be found. When I look in the network tab, I see the query being performed is:
/* QuickSight */SELECT ds.* FROM ( SELECT * FROM "my_table" ) ds LIMIT 0
Then the error message says:
Table awsdatacatalog.default.my_table does not exist
The strange part is, I didn't say it should be looking at the 'default' database. I selected 'awsdatacatalog' as the data source and 'my_database' as the database. When I try to be more precise and specify the data source and database in the SELECT statement ("awsdatacatalog.my_database.my_table"), the error message says "awsdatacatalog.default.awsdatacatalog.my_database.my_table".
Anyone else having the same problem? Is this a bug, or am I just missing something?
It worked for me using datasource.database_name.table_name.
Try using SELECT * FROM awsdatacatalog.my_database.my_table
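For instance, a sketch of the custom SQL with the fully qualified table name plus the date_int partition filter from the question (the names are the placeholders used above):
/* QuickSight custom SQL against Athena: qualify the table with the
   data source and database, and keep the partition filter so only
   the needed partitions are scanned */
SELECT *
FROM awsdatacatalog.my_database.my_table
WHERE date_int >= 20210308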

Can I use the BigQuery EXPORT DATA statement and schedule the query?

I have a question similar to the one asked in this link: BigQuery - Export query results to local file/Google storage.
I need to extract data from 2 BigQuery tables using joins and WHERE conditions. The extracted data has to be placed in a file on Cloud Storage, mostly as a CSV file. I want to go with a simple solution. Can I use the BigQuery EXPORT DATA statement in standard SQL and schedule it? Does it have a limitation of 1 GB per export? If yes, what is the best possible way to implement this? Creating another temp table to save the query results and using a Dataflow job to extract the data from the temp table? Please advise.
Basically, Google Cloud now supports this.
Please see the code snippet in the Cloud documentation:
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#exporting_data_to_csv_format
I'm thinking I can use the above statement to export the data into a file, with a SELECT query that joins the 2 tables and applies the other conditions.
This query would be a scheduled query in BigQuery.
Any inputs, please?
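For illustration, a minimal sketch of what such an EXPORT DATA statement could look like, following the pattern in the linked documentation (the project, dataset, table, column, and bucket names are hypothetical placeholders):
-- Export the joined result to CSV files on Cloud Storage;
-- the uri includes a '*' wildcard so the output can be split across multiple files
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/result-*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT a.id, a.load_date, b.value
FROM `my_project.my_dataset.table_a` AS a
JOIN `my_project.my_dataset.table_b` AS b
  ON a.id = b.id
WHERE a.load_date >= DATE '2021-01-01';
Since this is a standard SQL statement, it should be possible to save it as a scheduled query, which is what the question is aiming for.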

Hive Data Flow Issues

I am using Hive on an HDInsight/Azure Spark 2.2 cluster, submitting my queries through Ambari; the data is stored in external tables on Azure Data Lake. The staging and target tables are partitioned.
I've been working on loading data into Hive today. The data flows from a .gz file -> staging table -> target table. It's an incremental load: a left join from target to landing to preserve the old data, then a UNION ALL with the new data for the full set.
I've noticed some behaviors that seem odd to me and was hoping to gather more insight.
Observation 1: After running the script through, I notice the new data from the original table/.gz file is not present in the staging or the target table. I wouldn't expect that, since there's a UNION ALL present.
Observation 2: In one step, I manually loaded data into my staging table from the .gz file/table. I ran a simple count(*) on it and it returned 39k, great. I tried running a SELECT * WHERE val = XYZ and it returned records, great again. I then put a count(*) on that expression, and it started returning 0 records.
Apologies if my thoughts are jumbled, but I wanted to know if anybody out there has experienced similar occurrences and how to overcome them. Let me know if any clarification is needed.
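For context, the incremental-load pattern described above usually looks something like the following sketch (the table and column names here are hypothetical, not taken from the actual job):
-- Full set = old target rows that are NOT in the new load, plus all newly staged rows;
-- the result of this query is then written back to the target (e.g. via INSERT OVERWRITE)
SELECT t.id, t.val, t.load_date
FROM target_table t
LEFT JOIN staging_table s
  ON t.id = s.id
WHERE s.id IS NULL
UNION ALL
SELECT s.id, s.val, s.load_date
FROM staging_table s;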
Are you sure you don't have spaces in your key? Have you tried trim(val)?
Observation 2 is really surprising: with the same WHERE predicates, you get rows back from SELECT * but nothing from count(*)?
Could you include the SQL queries and some rows of data?
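A sketch of the diagnostic queries being suggested here (staging_table and val stand in for the actual names from the question):
-- The pattern reported in Observation 2:
SELECT * FROM staging_table WHERE val = 'XYZ';          -- returns rows
SELECT count(*) FROM staging_table WHERE val = 'XYZ';   -- reportedly returns 0
-- Check whether leading/trailing spaces in the key explain the mismatch:
SELECT count(*) FROM staging_table WHERE trim(val) = 'XYZ';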

Newly inserted or updated row count in pentaho data integration

I am new to Pentaho Data Integration; I need to integrate data from one database into another location as an ETL job. I want to count the number of inserts/updates during the ETL job and insert that count into another table. Can anyone help me with this?
I don't think there's built-in functionality for returning the number of affected rows of an Insert/Update step in PDI to date.
Nevertheless, most database vendors are able to provide the number of affected rows for a given operation.
In PostgreSQL, for instance, it would look like this:
/* Count affected rows from INSERT */
WITH inserted_rows AS (
INSERT INTO ...
VALUES
...
RETURNING 1
)
SELECT count(*) FROM inserted_rows;
/* Count affected rows from UPDATE */
WITH updated_rows AS (
UPDATE ...
SET ...
WHERE ...
RETURNING 1
)
SELECT count(*) FROM updated_rows;
However, you're aiming to do that from within a PDI job, so I suggest that you try to get to a point where you control the SQL script.
Suggestion: Save the source data in a file on the target DB server, then use it, perhaps with a bulk loading functionality, to insert/update, then save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step in the Job's scope.
EDIT: the implementation is a matter of chosen design, so the suggested solution is one of many. On a very high level, you could do something like the following.
Transformation I - extract data from source
Get the data from the source, be it a database or anything else
Prepare it for output in a way that it fits the target DB's structure
Save a CSV file using the text file output step on the file system
Parent Job
If the PDI server is the same as the target DB server:
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
If the PDI server is NOT the same as the target DB server:
Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table (see the SQL sketch below)
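As a rough illustration of that Execute SQL Script step, here is a sketch assuming a PostgreSQL target (as in the example above); the file path, table names, unique key, and the etl_row_counts logging table are all hypothetical:
-- Load the CSV produced by the transformation into a temporary staging table
CREATE TEMP TABLE staging_input (id integer, val text);
COPY staging_input FROM '/tmp/source_extract.csv' WITH (FORMAT csv, HEADER true);
-- Upsert into the target (assumes a unique constraint on id) and write
-- the number of affected rows, with a timestamp, into a logging table
WITH affected AS (
    INSERT INTO target_table (id, val)
    SELECT id, val FROM staging_input
    ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val
    RETURNING 1
)
INSERT INTO etl_row_counts (affected_rows, loaded_at)
SELECT count(*), now() FROM affected;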
EDIT 2: another suggested solution
As suggested by #user3123116, you can use the Compare Fields step (if not part of your environment, check the marketplace for it).
The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.
Eventually it could look like this (note that this covers just the comparison and counting part): the "Compare Fields" step takes 2 streams as input for comparison, and its output is 4 distinct streams for "Identical", "Changed", "Added", and "Removed" records. You can count those 4 streams and then process the "Changed", "Added", and "Removed" records with an Insert/Update step.
Also note that you can split the input of the source data stream (COPY, not DISTRIBUTE) and do your insert/update, but this stream must wait for the field-comparison stream to finish its query on the target database, otherwise you might end up with the wrong statistics.
You can do it from the Logging option inside the transformation settings. Please follow these steps:
Click on the Edit menu --> Settings
Switch to the Logging tab
Select Step from the left menu
Provide the log connection and log table name (say, StepLog)
Select the required fields for logging (LINES_OUTPUT for the inserted count and LINES_UPDATED for the updated count)
Click on the SQL button and create the table by clicking on the Execute button
Now all the steps will be logged into the log table (StepLog), and you can use it for further actions, for example with a query like the one below.
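A sketch of such a query, once the logging is in place (the column names follow the standard PDI step-log table layout; the step name is hypothetical):
-- Inserted/updated row counts per step, most recent runs first
SELECT TRANSNAME, STEPNAME, LINES_OUTPUT, LINES_UPDATED, LOG_DATE
FROM StepLog
WHERE STEPNAME = 'Insert / Update'
ORDER BY LOG_DATE DESC;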
Enjoy