Query Hive view with Redshift Spectrum

I'm trying to query a Hive view with Redshift Spectrum but it gives me this error:
SQL Error [500310] [XX000]: [Amazon](500310) Invalid operation: Assert
Details:
-----------------------------------------------
error: Assert
code: 1000
context: loc->length() > 5 && loc->substr(0, 5) == "s3://" -
query: 12103470
location: scan_range_manager.cpp:272
process: padbmaster [pid=1769]
-----------------------------------------------;
Is it possible to query Hive views from Redshift Spectrum? I'm using Hive Metastore (not Glue Data Catalog).
I wanted to have a view to restrict access to the original table, with a limited set of columns and partitions. Also, my original table (Parquet data) has some Map fields, so I wanted a view like the one below to make it easier to query from Redshift, because Map fields are a bit complicated to deal with there:
CREATE VIEW my_view AS
SELECT event_time,
       event_properties['user-id'] AS user_id,
       event_properties['product-id'] AS product_id,
       year, month, day
FROM my_events
WHERE event_type = 'my-event' -- partition
I can query the table my_events from Spectrum, but it's a mess because event_properties is a Map field, not a Struct, so I have to explode it into several rows in Redshift.
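For illustration, the kind of row explosion I mean looks roughly like this with Spectrum's nested-data syntax (just a sketch, assuming those extensions work against my Hive Metastore table; spectrum_schema stands in for my external schema):
SELECT e.event_time, p.key, p.value
FROM spectrum_schema.my_events e, e.event_properties p
WHERE e.event_type = 'my-event'
AND p.key IN ('user-id', 'product-id')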
Thanks

Looking at the error, it seems Spectrum always looks for an S3 path when external tables and views are queried.
This is valid for external tables because those always have a location, but views never have an explicit S3 location.
Error type -> Assert
Error context -> context: loc->length() > 5 && loc->substr(0, 5) == "s3://"
In the case of a Hive view,
loc->length() returns 0, so the whole expression evaluates to false and the assertion fails.
The second clause confirms this:
loc->substr(0, 5) == "s3://"
It expects the location to be an S3 path, and "s3://" is exactly 5 characters long, which also explains the first clause:
loc->length() > 5
So it looks like Spectrum does not support Hive views (or, in general, any object without an explicit S3 location).
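One possible workaround (a sketch; spectrum_schema is a placeholder for the external schema name): keep the base table in the Hive Metastore and do the column/partition restriction in a late-binding view on the Redshift side, since Redshift only allows views over external tables when they are created WITH NO SCHEMA BINDING:
CREATE VIEW my_view AS
SELECT event_time, year, month, day
FROM spectrum_schema.my_events
WHERE event_type = 'my-event'
WITH NO SCHEMA BINDING;
That doesn't solve the Map problem, but it covers the access-restriction part of the question.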

Related

Unnesting a json in Redshift causing nested loop in the query plan

I have a column in my tables called 'data' with JSONs in it like below:
{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
I have written code to unnest it into separate columns like tr, r, s.
Below is the code:
with raw as (
    SELECT json_extract_path_text(B.Data, 'records', true) as items
    FROM tableB as B
    where B.date::timestamp between
        to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
        to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
    UNION ALL
    SELECT json_extract_path_text(C.Data, 'records', true) as items
    FROM tableC as C
    where C.date-5 between
        to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
        to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
),
numbers as (
    SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer - 1 as ordinal
    FROM <any_random_table> limit 1000
),
joined as (
    select raw.*,
        json_array_length(raw.items, true) as number_of_items,
        json_extract_array_element_text(
            raw.items,
            numbers.ordinal::int,
            true
        ) as item
    from raw
    cross join numbers
    where numbers.ordinal <
        json_array_length(raw.items, true)
),
parsed as (
    SELECT J.*,
        json_extract_path_text(J.item, 'tr', true) as tr,
        json_extract_path_text(J.item, 'r', true) as r,
        json_extract_path_text(J.item, 's', true)::float8 as s
    from joined J
)
select * from parsed
The above code works when there is a small number of records, but it takes more than a day to run, CPU utilization (in Redshift) reaches 100%, and the disk space used also reaches 100% when I set the date range to the last two years or the number of records is otherwise large.
Can anyone please suggest an alternative way to unnest JSON objects like this in Redshift?
My query plan is saying:
Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products
Goal: To Unnest without using any cross joins
Input: data column having JSON
"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
Output should be, for example, the tr, r, s columns from the above JSON.
You want to unnest up to 1000 JSON records stored in a JSON array, but the nested loop join is taking too long.
The root issue is likely your data model. You have stored structured records (called "records") inside a semi-structured text element (JSON), within a column of a structured columnar database. You want to perform some operation on these buried records that you haven't described, but here's the problem: columnar databases are optimized for read-centric analytic queries, yet you need to expand these internal JSON records into Redshift rows (records), which is fundamentally a write operation. This works against the optimizations of the database.
The size of this expanded data is also large compared to the disk storage on your cluster, which is why the disks are filling up. Your CPUs are likely spinning unpacking the JSON and managing overloaded disk and memory capacity. When the disks are nearly full, Redshift shifts to a mode that optimizes disk space utilization at the expense of execution speed. A larger cluster may give you significantly faster execution if it lets you avoid this effect, but that costs money you may not have budgeted. Not an ideal solution.
One thing that would improve the speed of your query is not carrying all the data along. You keep raw.* and J.* all through the query, but it is not clear you need them. Since part of the issue is data size during execution, and that execution includes loop joining, you are making it much harder than it needs to be by carrying all this data (including the original JSON).
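For example, the tail of the query could be slimmed down to carry only the extracted fields (a sketch; the raw and numbers CTEs are unchanged from the question and elided here):
with raw as (...),        -- as in the question
numbers as (...),         -- as in the question
joined as (
    select json_extract_array_element_text(raw.items, numbers.ordinal::int, true) as item
    from raw
    cross join numbers
    where numbers.ordinal < json_array_length(raw.items, true)
)
select json_extract_path_text(item, 'tr', true) as tr,
       json_extract_path_text(item, 'r', true) as r,
       json_extract_path_text(item, 's', true)::float8 as s
from joined;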
The best way out of this situation is to change your data model and expand these JSON internal records into Redshift records on ingestion. JSON data is fine for seldom-used information, or information that is only needed at the end of a query where the data is small. Needing the expanded JSON at the input end of the query, for such a large amount of data, is not a good use case for JSON in Redshift. Each of these "records" inside the JSON is a record and needs to be stored as such if you need to work across them as query input.
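If you do go that route, the target of the ingestion-time expansion might look something like this (table and column names are hypothetical, types guessed from your sample JSON):
-- one row per inner "record", keyed back to its parent row
CREATE TABLE record_items (
    parent_id  bigint,        -- key of the originating row in tableB / tableC
    r          varchar(32),
    t          varchar(32),
    s          float8,
    d          float8,
    c          int
);
Queries then read plain columns and the nested loop join disappears.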
Now, you want to know if there is some slick way to get around this issue in your case, and the answer is "unlikely but maybe". Can you describe how you are using the final values in your query (t, r, and s)? If you are just using some aspect of this data (max value, or sum, or ...) then there may be a way to get to the answer without the large nested loop join. But if you need all the values then there is no other way to get them, AFAIK. A description of what comes next in the data process could open up such an opportunity.

Pig ParquetLoader : Column Pruning

I read Parquet files which have a schema of 12 columns.
I do a group by and a sum aggregation over a single long column,
then join on another dataset. After the join I only take a single column (the sum) from the Parquet dataset.
But Pig keeps giving me this error:
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2000: Error processing rule ColumnMapKeyPrune. Try -t ColumnMapKeyPrune"
Does the Pig Parquet loader not support column pruning?
If I run with column pruning disabled, the job works.
Pseudo code for what I am trying to achieve:
REGISTER /<path>/parquet*.jar;
res1 = load '<path>' using parquet.pig.ParquetLoader() as (c1:chararray,c2:chararray,c3:int, c4:int, c5:chararray, c6:chararray, c7:chararray, c8:chararray, c9:chararray, c10:chararray, c11:chararray, c12:long);
res2 = group res1 by (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11);
res3 = foreach res2 generate flatten(group) as (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11),SUM(res1.c12) as counts;

Show tblproperties command in Hive gives incorrect results

When I run show tblproperties sometblname, I get:
numRows = -1
rawDataSize = -1
totalSize = 0
COLUMN_STATS_ACCURATE = false
But my table has data in it. Is there a reason tblproperties shows something different?
Just run ANALYZE TABLE. The syntax is:
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
When the user issues that command but doesn't specify any partition specs, statistics are gathered for the table as well as all the partitions (if any).
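For the table in the question, that could be as simple as the following (add a PARTITION clause if the table is partitioned):
ANALYZE TABLE sometblname COMPUTE STATISTICS;
ANALYZE TABLE sometblname COMPUTE STATISTICS FOR COLUMNS;
-- the counters should now be populated
SHOW TBLPROPERTIES sometblname;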
Refer: Existing Tables – ANALYZE

BigQuery NEST() returns 'Error: An internal error occurred' [duplicate]

This question already has answers here:
Internal error on NEST when not flattening results
(3 answers)
Closed 6 years ago.
I'm trying to nest a field in the BigQuery UI (not the API) and continually get hit with an error when trying to output to a table without flattening:
Error: An internal error occurred and the request could not be completed.
I'm using the NEST() function and I've tried this on the public Shakespeare dataset and continue to get the same error.
SELECT corpus, NEST(word) FROM [publicdata:samples.shakespeare] GROUP BY 1
My Job ID is: realself-main:bquijob_1bfb8310_153583ecbc2
There are tons of questions on SO related to how to generate repeated fields/records in BigQuery.
And there are many different answers, ranging from "NEST is not compatible with unflattened results", as in:
Internal error on NEST when not flattening results
to solutions that address this issue with a JS UDF, as in:
Nest multiple repeated fields in BigQuery
Create a table with Record type column
create a table with a column type RECORD
There are more - you can search.
But surprisingly enough, I recently found how to make NEST() work almost as it is supposed to!
Try the below to see the trick:
SELECT corpus, words
FROM (
SELECT corpus, NEST(word) AS words
FROM [publicdata:samples.shakespeare]
GROUP BY 1
) AS a
CROSS JOIN (SELECT 1) AS b
Note: you have to write the result to a destination table with Allow Large Results on and Flatten Results off.
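If you can use standard SQL instead of legacy SQL, ARRAY_AGG builds the repeated field directly and the flattening problem doesn't arise at all:
SELECT corpus, ARRAY_AGG(word) AS words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus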

Qlikview - calculate and use calculated variable in script

As a new Qlikview user, I'm looking for the best way to create calculated variables, and variables based on calculated variables, in my data and use them in displays. My data is connected via ODBC.
For example, let's say I want a variable Rating based on the "Risk" variable in my dataset. The raw data contains a Risk variable that is "L" or "H". I would like to create an indicator, like Risk_H, that is 0 or 1 (if Risk='H'). Then I would like to create the Rating like "Rating = 1 + Risk_H*2". Can I do all of this in a script and have the variable Rating in my dataset?
When I try the above, I can create the Risk_H variable, but then I am not sure how to reference it in the script to calculate the Rating variable. I have read other posts that address using the load statement (Qlikview Calculated Fields with Load Script) but have been unsuccessful using calculated variables to create new variables.
Example code (which works):
SQL SELECT *,
case when (Risk = 'H') then 1
else 0
end as Risk_H
FROM [Data];
How can I create Risk_H in order to use it in the same script, like the below? In other settings, I would use something like "calculated Risk_H" to refer to it.
SQL SELECT *,
case when (Risk = 'H') then 1
else 0
end as Risk_H,
(10 + Risk_H*2) as Rating // Qlikview says it can't find Risk_H
FROM [Data];
I've tried creating Risk_H in a load script, but Qlikview doesn't recognize Risk_H in a later SQL statement. I've also tried creating a table with Risk_H, and pulling the data from that table. And in reality I'm trying to create 10+ indicators, not just one, so nested case statements aren't the answer.
EDIT: I'm told that resident tables may be the answer to performing calculations. If you can provide syntax for this using tables connected via ODBC that may answer the question.
It appears that your second SELECT statement is not valid SQL; most databases won't let you reference a column alias such as Risk_H from another expression in the same SELECT list, so QlikView reports that it cannot find Risk_H. You could resolve this on the SQL side with a sub-query (sketched after the script below), or you could use a resident load in QlikView as follows:
Source_Data:
SQL SELECT *,
case when (Risk = 'H') then 1
else 0
end as Risk_H
FROM [Data];
Calculated_Data:
NOCONCATENATE
LOAD
*,
(10 + Risk_H*2) as Rating
RESIDENT Source_Data;
DROP TABLE Source_Data;
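For completeness, the sub-query route mentioned above would look roughly like this (a sketch, assuming your ODBC source accepts derived tables):
Source_Data:
SQL SELECT T.*,
(10 + T.Risk_H*2) as Rating
FROM (
SELECT *,
case when (Risk = 'H') then 1
else 0
end as Risk_H
FROM [Data]
) as T;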
You also mentioned that you have around 10 indicators that you wish to use, so I agree that a case statement would probably not be a good idea. You can move this part into QlikView as well, using a MAPPING load and the ApplyMap function as follows:
Indicator_Map:
MAPPING
LOAD
*
INLINE [
Risk, Value
H, 1
I, 2
J, 3
];
Source_Data:
SQL SELECT *,
case when (Risk = 'H') then 1
else 0
end as Risk_H
FROM [Data];
Calculated_Data:
NOCONCATENATE
LOAD
*,
(10 + (ApplyMap('Indicator_Map',Risk, 0) * 2)) as Rating
RESIDENT Source_Data;
DROP TABLE Source_Data;
I added a couple of extra entries for your Risk "indicators" to give you an idea. Of course, the table doesn't need to be inline, it could come from another SQL statement, other file etc.
In the above example, what happens is that the Risk field's value is supplied as a parameter to the mapping table Indicator_Map which then returns the associated value. If no risk value is found, it returns 0 (the third parameter).