When I run show tblproperties sometblname, I get:
numRows = -1
rawDataSize = -1
totalSize = 0
COLUMN_STATS_ACCURATE = false
But my table has data in it. Is there a reason tblproperties shows something different?
Just run ANALYZE TABLE. The syntax is:
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
When you issue that command without specifying any partition spec, statistics are gathered for the table as well as for all of its partitions (if any).
Refer: Existing Tables – ANALYZE
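For the non-partitioned table from the question, a minimal sketch (the table name is taken from the question; the FOR COLUMNS step is optional):
ANALYZE TABLE sometblname COMPUTE STATISTICS;
-- Optionally gather column-level statistics as well (Hive 0.10.0 and later):
ANALYZE TABLE sometblname COMPUTE STATISTICS FOR COLUMNS;
-- numRows, rawDataSize and totalSize should now be populated:
SHOW TBLPROPERTIES sometblname;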
I was thinking of adding a LIMIT 0 to my BigQuery queries and using this together with dbt to check that the whole DAG, its dependencies and so on are correct, without incurring costs. I can't find any official documentation stating whether LIMIT 0 queries are free.
Are those queries not billed?
Correct: a LIMIT 0 query will not bill any data. You can do a dry run to verify:
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 0'
Query successfully validated. Assuming the tables are not modified, running this query will process 0 bytes of data.
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 1'
Query successfully validated. Assuming the tables are not modified, running this query will process 254787 bytes of data.
Above you can see that LIMIT 0 processes 0 bytes, while LIMIT 1 scans the whole table.
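For completeness, the validation pattern from the question looks like this (same public table as above). Since BigQuery bills by bytes processed rather than rows returned, LIMIT 1 still scans the referenced columns in full, while LIMIT 0 does not:
-- Returns the result schema and an empty result set; processes 0 bytes
SELECT *
FROM `bigquery-public-data.austin_311.311_service_requests`
LIMIT 0;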
This is the partition format of the Hive table:
year=2021/month=09/day=15
year=2021/month=09/day=16
year=2021/month=09/day=17
year=2021/month=09/day=18
year=2021/month=09/day=19
year=2021/month=09/day=20
This is the SQL:
SELECT * FROM table_name
/*+ OPTIONS(
  'streaming-source.enable'='true',
  'streaming-source.monitor-interval'='1 min',
  'streaming-source.partition.include'='all',
  'stream-source.consume-order'='partition-name',
  'streaming-source.consume-start-offset'='year=2021/month=09/day=15'
) */
The result: I set the start offset to 'year=2021/month=09/day=15', but only the data of partition year=2021/month=09/day=20 flows into Flink. Why?
I am reading Parquet files that have a schema of 12 columns.
I do a GROUP BY and a SUM aggregation over a single long column,
then join with another dataset. After the join I only take a single column (the sum) from the Parquet dataset.
But Pig keeps giving me this error:
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2000: Error processing rule ColumnMapKeyPrune. Try -t ColumnMapKeyPrune"
Does the Pig Parquet loader not support column pruning?
If I run with column pruning disabled, the job works.
Pseudo code for what I am trying to achieve:
REGISTER /<path>/parquet*.jar;
res1 = load '<path>' using parquet.pig.ParquetLoader() as (c1:chararray,c2:chararray,c3:int, c4:int, c5:chararray, c6:chararray, c7:chararray, c8:chararray, c9:chararray, c10:chararray, c11:chararray, c12:long);
res2 = group res1 by (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11);
res3 = foreach res2 generate flatten(group) as (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11),SUM(res1.c12) as counts;
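As the error message itself suggests, the pruning rule can be disabled for a single run while debugging (the script name here is hypothetical):
pig -t ColumnMapKeyPrune myscript.pig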
I'm trying to query a Hive view with Redshift Spectrum but it gives me this error:
SQL Error [500310] [XX000]: [Amazon](500310) Invalid operation: Assert
Details:
-----------------------------------------------
error: Assert
code: 1000
context: loc->length() > 5 && loc->substr(0, 5) == "s3://" -
query: 12103470
location: scan_range_manager.cpp:272
process: padbmaster [pid=1769]
-----------------------------------------------;
Is it possible to query Hive views from Redshift Spectrum? I'm using the Hive Metastore (not the Glue Data Catalog).
I wanted a view to restrict access to the original table, with a limited set of columns and partitions. Also, my original table (Parquet data) has some Map fields, so I wanted a view like the one below to make it easier to query from Redshift, since Map fields are a bit complicated to deal with in Redshift:
CREATE VIEW my_view AS
SELECT event_time, event_properties['user-id'] as user_id, event_properties['product-id'] as product_id, year, month, day
FROM my_events
WHERE event_type = 'my-event' -- partition
I can query the table my_events from Spectrum, but it's a mess because event_properties is a Map field, not a Struct, so I have to sort of explode it into several rows in Redshift.
Thanks
Looking at the error, it seems Spectrum always looks for an S3 path when external tables and views are queried.
That works for external tables, because those always have a location, but views never have an explicit S3 location.
Error type -> Assert
Error context -> context: loc->length() > 5 && loc->substr(0, 5) == "s3://"
In the case of a Hive view, loc->length() returns 0, so the whole expression evaluates to false and the assertion fails.
The second clause supports this reading:
loc->substr(0, 5) == "s3://"
It expects the location to be an S3 path, and "s3://" is exactly 5 characters long, which also explains the first clause:
loc->length() > 5
It looks like Spectrum does not support Hive views (or, in general, any object without an explicit S3 location).
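A possible workaround, sketched under the assumption that you can materialize the view's query on the Hive side and that the target location is on S3 (the table name and bucket path below are hypothetical): create a real table from the view's query, which gives Spectrum the explicit S3 location it asserts on.
-- Run in Hive: materialize the flattened query as a Parquet table on S3
CREATE TABLE my_events_flat
STORED AS PARQUET
LOCATION 's3://my-bucket/my_events_flat/'  -- hypothetical bucket/path
AS
SELECT event_time,
       event_properties['user-id']    AS user_id,
       event_properties['product-id'] AS product_id,
       year, month, day
FROM my_events
WHERE event_type = 'my-event';
The table then has to be refreshed whenever the underlying data changes, so this trades the convenience of a view for Spectrum compatibility.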
I am currently working on a migration project and found the query below in a procedure. I can get the size of a database from sys.master_files, but the WHERE condition uses segmap, and I cannot find a similar column in sys.master_files. Please help me with this.
SELECT sum(size) * 2
FROM master..sysusages U
WHERE U.segmap = 3
AND U.dbid = db_id(@db_name)
Sybase and SQL Server used to share the same code base, so per the Sybase docs, below is the definition of segmap.
The values of master..sysusages.segmap mean the following:
3: Data stored on this segment
4: Log stored on this segment
7: Since 7=4+3, both log and data stored on this segment
So the equivalent in sys.master_files would be type = 0, which means get only data space.
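A minimal sketch of the equivalent query against sys.master_files (note the unit change: the Sybase query multiplies by 2 because sysusages.size is counted in 2 KB pages, while sys.master_files.size is counted in 8 KB pages):
SELECT SUM(size) * 8                -- size is in 8 KB pages, so this yields KB
FROM sys.master_files
WHERE type = 0                      -- 0 = data files (type_desc = 'ROWS')
  AND database_id = DB_ID(@db_name);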