I have a pipeline where AWS Kinesis Firehose receives data, converts it to Parquet format based on an Athena table, and stores it in an S3 bucket under a date partition (date_int: YYYYMMdd). Whenever new data is added to the bucket, a Lambda is triggered to check whether Athena already knows about the partition. Everything seems to be working fine; in Athena I can run a query (see below) and the newest data is returned.
Athena query: SELECT * FROM "my_table" WHERE "date_int" >= 20210308
(On the left-hand side of the screen the correct Data Source and Database are selected)
Now I want to visualise the data in QuickSight. I can use either SPICE or direct query; again, all seems to be working fine. However, I have the data partitioned because I only need data points of, say, the last month. In QuickSight I create a new dataset, choose the correct catalog/database/table and click 'Use custom SQL'. Then, when I run the query, I always get an error from the Athena client saying the table couldn't be found. When I look in the network tab, I see the query performed is:
/* QuickSight */SELECT ds.* FROM ( SELECT * FROM "my_table" ) ds LIMIT 0
Followed by the error message:
Table awsdatacatalog.default.my_table does not exist
The strange part is, I didn't say it should be looking at the 'default' database. I selected 'awsdatacatalog' as the data source and 'my_database' as the database. When I try to be more specific and qualify the table in the SELECT statement ("awsdatacatalog.my_database.my_table"), the error message says "awsdatacatalog.default.awsdatacatalog.my_database.my_table".
Anyone else having the same problem? Is this a bug, or am I just missing something?
It worked for me by using datasource.database_name.table_name.
Try using SELECT * FROM awsdatacatalog.my_database.my_table
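For example, the full custom SQL with the partition filter from the question would then look like this (just a sketch, reusing the catalog, database and table names above):

-- Custom SQL for the QuickSight dataset, fully qualified so it does not fall back to 'default'
SELECT *
FROM awsdatacatalog.my_database.my_table
WHERE date_int >= 20210308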
I've got data buckets set up in GCS and I'm using BigQuery to build a table from all the .csv files in that bucket. That works flawlessly. I made a simple deduplication query that, when run manually, selects only distinct rows and creates a new table with "DeDuped" appended (code below). That also runs flawlessly.
CREATE OR REPLACE TABLE
  `project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
SELECT DISTINCT *
FROM
  `project-name-123456.dataset_2022.dataset 2022`
The issue I am having is with scheduling that query. Every time it tries to run I get the error "Error status: Not found: Dataset project-name-123456:dataset_2022 was not found in location US; JobID: project-name-123456:628d7766-0000-2d36-a82f-94eb2c0a664a"
The only thing I can figure is that my data location for the dataset is "us-central1", since it has a free tier. When I go to my scheduled query, whether I select the same data location or "Default", it always changes to "US Multiple".
Is there a way to fix this?
Or do I need to create my dataset in "US Multiple"?
I'm trying to cut down on costs as much as possible by keeping it in us-central1.
EDIT: Seems like I just needed to delete and recreate the scheduled query. Chatted with Google Support and they sorted it. Sorry all!
The problem I'm trying to tackle is inserting and/or updating dynamic tables in a sink within an Azure Data Factory data flow. I've managed to get the source data, transform it how I want, and send it to a sink. The pipeline ran successfully and said it copied 37 rows (as expected), but investigation showed that no data was actually deposited in the target table. This was because the Table Action on the sink was set to 'None'. In trying to fix this last part, it seems I don't have a 'Create' option but do have a 'Recreate' option (see the screenshot of the sink below), which is not what I want, as the data source will eventually contain only changed data. I need the process to create the table if it doesn't exist and then upsert the data. (Recreate drops the table and then creates it.)
If I change the sink type from Inline to Dataset, I can select the Insert, Upsert, etc. options, but this is then not dynamic, as I need to select a specific dataset.
So, has anyone come across the same issue, and have you managed to build dynamic sinks in your data flow where the table is created if it doesn't exist and the data is then upserted?
I guess I can add a Pre SQL script which takes care of the 'create the table if it doesn't exist' part, but I still can't select the Upsert option with inline tables.
For the CREATE TABLE IF NOT EXISTS issue, I would recommend a Stored Procedure that is executed in the pipeline prior to the Data Flow.
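As a rough sketch (T-SQL, with a hypothetical procedure name and illustrative columns, since the real schema isn't shown), it could look like this:

-- Hypothetical sketch: create the target table only if it does not exist yet.
-- The table name comes in from the pipeline; the columns are illustrative only.
CREATE OR ALTER PROCEDURE dbo.EnsureTargetTable
    @TableName sysname
AS
BEGIN
    IF OBJECT_ID(@TableName, 'U') IS NULL
    BEGIN
        DECLARE @sql nvarchar(max) = N'
            CREATE TABLE ' + QUOTENAME(@TableName) + N' (
                Id       int           NOT NULL PRIMARY KEY,
                Payload  nvarchar(max) NULL,
                LoadedAt datetime2     NOT NULL DEFAULT sysutcdatetime()
            );';
        EXEC sp_executesql @sql;
    END
END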
For Inline vs Dataset, you can make the Dataset very flexible: it is still based on your runtime table name with no schema defined, so there is no need to target a specific table.
For the UPSERT issue, make sure you have an AlterRow activity before the Sink.
I can create a materialised view in RDS (PostgreSQL) to keep track of the 'latest' data output from a SQL query, and then visualise this in QuickSight. This process is also very 'quick' because it doesn't call additional AWS services and/or re-process all the data again (through the SQL query). My assumption about how this works is that it runs the SQL, then re-runs it but not over the whole dataset again, so that if you structure the query correctly you can end up with, for example, a 'real-time running total' metric.
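For reference, this is roughly what that looks like in plain PostgreSQL (a sketch only, assuming a hypothetical events(datetime, customer_id, revenue) table):

-- Hypothetical sketch: a materialised view over an events table.
CREATE MATERIALIZED VIEW revenue_per_customer AS
SELECT customer_id, SUM(revenue) AS total_revenue
FROM events
GROUP BY customer_id;

-- A unique index is required to allow a concurrent (non-blocking) refresh.
CREATE UNIQUE INDEX ON revenue_per_customer (customer_id);

-- Re-run whenever new data lands to pick up the latest rows.
REFRESH MATERIALIZED VIEW CONCURRENTLY revenue_per_customer;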
The issue is that creating materialised views (every 5 seconds) for hundreds of queries, and having them all stored in one database, is not scalable. Imagine a DB with 1 TB of data: creating an incremental/materialised view seems much less painful than using other AWS services, but eventually it won't be optimal for processing time, cost, etc.
I have explored various AWS services, none of which seem to solve this problem.
I tried using AWS Glue. You would need to create one script per query and output the result to a DB. The lag between reading and writing the incremental data is larger than with a materialised view, because you can process the data incrementally, but appending it to the current 'total' metric is another process.
I explored using AWS Kinesis followed by a Lambda to run SQL on the 'new' data in the stream and store the value in S3 or RDS. Again, this adds latency and doesn't work as well as a materialised view.
I read that AWS Redshift does not have materialised views therefore stuck to RDS (PostgreSQL).
Any thoughts?
[A similar issue: incremental SQL query - except I want to avoid running the SQL on "all" data to avoid massive processing costs.]
Edit (example):
table1 has schema (datetime, customer_id, revenue)
I run this query: select sum(revenue) from table1.
This would scan the whole table to come up with the metric (or, with a GROUP BY, a metric per customer_id).
table1 now gets updated with new data as time progresses, e.g. one extra hour of data.
If I run select sum(revenue) from table1 again, it scans all the data again.
A more efficient way is to just compute the query on the new data, and append the result.
Also, I want the query to run whenever there is a change in the data, rather than on a schedule, so that my front-end dashboards basically auto-update without the customer doing much.
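For concreteness, one way this could look in plain PostgreSQL is a summary table maintained by a trigger, so only the new rows are processed (a sketch only, assuming the table1(datetime, customer_id, revenue) schema above and a hypothetical revenue_totals table; PostgreSQL 11+ syntax):

-- Hypothetical running-total table, updated incrementally.
CREATE TABLE revenue_totals (
    customer_id   bigint PRIMARY KEY,
    total_revenue numeric NOT NULL DEFAULT 0
);

-- Fold each newly inserted row into the running total, so the full table is never re-scanned.
CREATE OR REPLACE FUNCTION apply_new_revenue() RETURNS trigger AS $$
BEGIN
    INSERT INTO revenue_totals (customer_id, total_revenue)
    VALUES (NEW.customer_id, NEW.revenue)
    ON CONFLICT (customer_id)
    DO UPDATE SET total_revenue = revenue_totals.total_revenue + EXCLUDED.total_revenue;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fires on every insert, so the dashboard query becomes a cheap lookup on revenue_totals.
CREATE TRIGGER table1_revenue_totals
AFTER INSERT ON table1
FOR EACH ROW EXECUTE FUNCTION apply_new_revenue();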
I am using Hive on an HDInsight/Azure Spark 2.2 cluster, submitting my queries through Ambari; the data is stored in external tables on Azure Data Lake. The staging and target tables are partitioned.
I've been working on loading data into Hive today. The flow of data goes from .gz file -> staging table -> target table. It's an incremental load: a left join from target to landing to preserve old data, then a UNION ALL with the new data for the full set.
I've noticed some behaviors that seem odd to me, was hoping to gather more insight.
Observation 1: After running the script through, I notice the new data from the original table/.gz file is not present in the staging or the target table. I wouldn't expect that, since there's a UNION ALL present.
Observation 2: I did one step manually, loading data into my staging table from the .gz file/table. I run a simple count(*) on it; it returns 39k, great. I try running a select * where val = XYZ; it returns records, great again. I put a count(*) on that expression, and it starts returning 0 records.
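In other words (table and column names taken from the description above, just to make the three checks concrete):

-- Simple count on the freshly loaded staging table: returns ~39k rows.
SELECT count(*) FROM staging_table;
-- Filtering on the key returns records.
SELECT * FROM staging_table WHERE val = 'XYZ';
-- But counting with the same predicate returns 0.
SELECT count(*) FROM staging_table WHERE val = 'XYZ';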
Apologies if my thoughts are jumbled, but I wanted to know if there's anybody out there who's experienced similar occurrences and how they overcame them. Let me know if any clarification is needed.
Are you sure you don't have spaces in your key? Have you tried trim(val)?
Observation 2 is really surprising: with the same WHERE predicates, you get rows back with a select * but nothing with a count(*)?
Could you include SQL queries and some rows of data ?
I am using Spark 1.4. HiveContext is used to connect to Hive. I did the following:
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
Then I inserted a few records into tab from the beeline console and ran:
hx.refreshTable("tab")
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why didn't the HiveContext retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) - this will refresh only the metadata of the table (not the actual data).
Notes from the official documentation (credits: https://spark.apache.org):
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.
To retrieve newly inserted records: uncache first and then cache again, using uncacheTable(String tableName) and cacheTable(String tableName).
If the target table is partitioned, you need to insert with the 'partition' option. If you leave out the partition, the data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
In a slightly different case, I had an RDD coming from a Spark SQL statement via HiveContext. The solution that worked for me, after some experimenting, was to regenerate the RDD itself.
It does not matter whether you are using the DDL of Spark SQL or sending SQL statements directly via hiveContext.sql.
I have seen people using a "count trick" to force recomputation of a dataset, but at least in my attempts I couldn't get to see the new data this way.
Anyway, trying caching, refreshing and friends did not work for me; if somebody has a proper pattern here, please share.