Kinesis Firehose to S3: data delivered to wrong hour in S3 path - Hive

I'm using Kinesis Firehose to buffer IoT data and write it to S3. Firehose writes its buffers to S3 using the path format s3://bucket.me.com/YYYY/MM/DD/HH.
Data that comes in at 10:59 a.m. may get buffered by Firehose and not written out until 11:00 a.m. (to s3://bucket.me.com/2017/03/09/11).
The problem is that when creating partitions for Athena, the partition for hour 10 will not contain all of the data for hour 10, because some of it lands in the hour 11 path.
Here is an example that illustrates this better:
An IoT device sends the following data to Firehose, which at 2:00 a.m. writes it to s3://bucket.me.com/2017/03/24/02/file-0000. The file contents look like this:
{"id":1,"dt":"2017-03-24 01:59:40"}
{"id":2,"dt":"2017-03-24 01:59:41"}
{"id":3,"dt":"2017-03-24 01:59:42"}
I then create an Athena table:
CREATE EXTERNAL TABLE sensor_data (
id string,
dt timestamp)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket.me.com/';
ALTER TABLE sensor_data ADD PARTITION (year=2017,month=3,day=24,hour=1) location 's3://bucket.me.com/2017/03/24/01/';
ALTER TABLE sensor_data ADD PARTITION (year=2017,month=3,day=24,hour=2) location 's3://bucket.me.com/2017/03/24/02/';
When I run select * from sensor_data where hour = 1, I won't get the 3 records above returned because it will only read from the s3 path defined for partition hour=1 (and the 3 records are really in the hour=2 partition).
How do I avoid this problem?

You can't avoid it entirely, but writing more often (a shorter Firehose buffer interval) will put more of the data into the appropriate hour.

I think you're going to want to query more broadly and then re-filter on the event timestamp:
select * from sensor_data
where (hour = 1 or hour = 2)
and extract(hour from dt) = 1
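A fuller version against the partitioned table from the question might look like this (a sketch; it assumes late-arriving data never lands more than one partition-hour after its event time, so reading the target hour plus the following one is enough):
select *
from sensor_data
where year = 2017 and month = 3 and day = 24
and hour in (1, 2)              -- read the target hour's partition plus the next one
and extract(hour from dt) = 1   -- keep only rows whose event time is actually in hour 1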

Related

Automatically add partition conditions to WHERE clause

I have a columnar table that is partitioned by day and hour. It is stored on S3 in parquet files to be queried by Athena. Here is the CREATE TABLE:
CREATE EXTERNAL TABLE foo (
-- other columns here
dt timestamp
)
PARTITIONED BY (day string, hh string)
STORED AS parquet
LOCATION 's3://foo/foo'
And the layout on S3 is like:
s3://foo/foo/day=2021-10-10/hh=00/*.parquet
s3://foo/foo/day=2021-10-10/hh=01/*.parquet
...etc
s3://foo/foo/day=2021-10-10/hh=23/*.parquet
So a query like the following will be fast, because the partition columns filter it down to a single hour of parquet files:
-- fast, easy to write
SELECT * FROM foo WHERE day = '2021-10-10' AND hh = '00'
However, the table also includes the full datetime dt. Usually we want to write queries for ranges that don't align to a day/hour boundary, and/or are in a different timezone.
For example, this will scan ALL parquet files and be very slow:
-- slow, easy to write
SELECT * FROM foo WHERE dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
It can be improved by manually calculating the day and hour that minimally enclose the time period:
-- fast, painful to write
SELECT * FROM foo
WHERE
((day, hh) IN (('2021-10-09', '23'), ('2021-10-11', '00')) OR day = '2021-10-10')
AND
dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
Ideally this extra condition could be added transparently by the database so as to avoid having to manually add the ((day,hh) IN (...)).
Is this possible somehow with Athena?
I've wished for this feature many times, but unfortunately Athena does not support it. You have to include both the predicate on the dt column and the predicates on the day and hh partition keys.
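If generating the exact (day, hh) list is too painful, a common compromise (just a sketch against the layout above) is to prune at the day level only; it scans a few more hours than the hand-written IN list, but the day range is trivial to derive from the dt range:
-- fast enough, much easier to write
SELECT * FROM foo
WHERE day BETWEEN '2021-10-09' AND '2021-10-11'
AND dt > timestamp '2021-10-09 23:05:00' AND dt < timestamp '2021-10-11 01:00:00'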

Athena queries change on the fly according to the partitions introduced in S3

I have partitioned my DATA_BUCKET in S3 with the structure
S3/DATA_BUCKET/table_1/YYYY/MM/DD/files.parquet
Now I have three additional columns in table_1 which are visible in Athena as "partition_0", "partition_1" and "partition_2" (for year, month and day respectively).
Until now my apps have been making time-related queries based on the "time_stamp" column in the table:
select * from table_1 where time_stamp like '2023-01-17%'
Now, to leverage the partitions for better performance, the corresponding new query is:
select * from table_1 where partition_0 = '2023' and partition_1 = '01' and partition_2 = '17'
Problem:
Since my apps already make many queries on time_stamp, I do not want to change them, but I would still like those queries to be transformed somehow into partition-based queries like the one above.
Is there any way to do this internally in Athena, or some other approach?
TIA
You can create a view over the original table with a new "time_stamp" column.
This column calculates a date from the date parts:
CREATE OR REPLACE VIEW my_view AS
SELECT mytable.col1,
mytable.col2,
cast(date_add('day', trans_day - 1, date_add('month', trans_month - 1, date_add('year', trans_year - 1970, from_unixtime(0)))) as Date) as time_stamp
FROM my_db.my_table mytable
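Adapted to the partition column names from the question (a sketch, untested; col1/col2 stand in for the real columns, and it assumes partition_0/1/2 hold the 'YYYY'/'MM'/'DD' strings from the S3 path), a varchar version that keeps the existing LIKE '2023-01-17%' filters working could be:
CREATE OR REPLACE VIEW table_1_view AS
SELECT t.col1,
       t.col2,
       -- rebuild the date prefix the apps already filter on from the partition path parts
       t.partition_0 || '-' || t.partition_1 || '-' || t.partition_2 AS time_stamp
FROM my_db.table_1 t
Because this time_stamp is built only from partition keys, Athena can usually prune down to the matching partitions when the apps query the view instead of the base table; checking the bytes scanned will confirm it.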

Can I replace an interval of partitions of a BigQuery partitioned table at once?

I'm working on BigQuery tables with the Python SDK and I want to achieve something that seems doable, but can't find anything in the documentation.
I have a table T partitioned by date, and I have a SELECT request that computes values over the X last days. In T, I would like to replace the partitions of the X last days with these values, without affecting the partitions older than X days.
Here is how we do it when replacing one partition only:
job_config = bigquery.QueryJobConfig()
job_config.destination = dataset.table("{}${}".format(table, date.strftime("%Y%m%d")))
job_config.use_legacy_sql = False
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
query_job = bigquery.job.QueryJob(str(uuid.uuid4()), query, client, job_config)
query_job.result()
I tried going like this:
job_config.destination = dataset.table(table)
But it truncates all partitions, even those older than X days.
Is there a way to do this easily, or do I have to loop over each partition of the interval?
Thanks
I don't think you can achieve it by playing with the destination table.
Cost considerations aside, what you can do with SQL is:
DELETE FROM your_ds.your_table WHERE partition_date > DATE_SUB(CURRENT_DATE(), INTERVAL X DAY);
Then
INSERT INTO your_ds.your_table SELECT (...)
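Putting the two statements together for a concrete window (a sketch only, with X = 7; recomputed_results is a hypothetical table or view holding the freshly computed rows):
-- 1. drop the rows in the partitions being recomputed
DELETE FROM your_ds.your_table WHERE partition_date > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
-- 2. re-insert the recomputed rows for the same window
INSERT INTO your_ds.your_table
SELECT * FROM your_ds.recomputed_results
WHERE partition_date > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);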
Cost
The first DELETE will cost:
The sum of bytes processed for all the columns referenced in all partitions for the tables scanned by the query
+ the sum of bytes for all columns in the modified or scanned partitions for the table being modified (at the time the DELETE starts).
The second INSERT INTO should cost the same as your current query.

Sorting on partition keys during INSERT INTO (Parquet) TABLE with Impala

I have an ETL job where I want to append data from a .csv file into an Impala table. Currently, I do this by creating a temporary external .csv table with the new data (in .csv.lzo format), after which it is inserted into the main table.
The query I use looks like this:
INSERT INTO TABLE main_table
PARTITION(yr, mth)
SELECT
*,
CAST(extract(ts, "year") AS SMALLINT) AS yr,
CAST(extract(ts, "month") AS TINYINT) AS mth
FROM csv_table
where main_table is defined as follows (several columns truncated):
CREATE TABLE IF NOT EXISTS main_table (
tid INT,
s1 VARCHAR,
s2 VARCHAR,
status TINYINT,
ts TIMESTAMP,
n1 DOUBLE,
n2 DOUBLE,
p DECIMAL(3,2),
mins SMALLINT,
temp DOUBLE
)
PARTITIONED BY (yr SMALLINT, mth TINYINT)
STORED AS PARQUET
The data is on the order of a few GB (55 million rows with some 30 columns), and this takes over an hour to run. I was curious as to why this was the case (as this seems rather long for something that is essentially an append operation), and came across this in the query plan:
F01:PLAN FRAGMENT [HASH(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))] hosts=2 instances=2
| Per-Host Resources: mem-estimate=1.01GB mem-reservation=12.00MB thread-reservation=1
WRITE TO HDFS [default.main_table, OVERWRITE=false, PARTITION-KEYS=(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))]
| partitions=unavailable
| mem-estimate=1.00GB mem-reservation=0B thread-reservation=0
|
02:SORT
| order by: CAST(extract(ts, 'year') AS SMALLINT) ASC NULLS LAST, CAST(extract(ts, 'month') AS TINYINT) ASC NULLS LAST
| materialized: CAST(extract(ts, 'year') AS SMALLINT), CAST(extract(ts, 'month') AS TINYINT)
| mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB thread-reservation=0
| tuple-ids=1 row-size=1.29KB cardinality=unavailable
| in pipelines: 02(GETNEXT), 00(OPEN)
|
01:EXCHANGE [HASH(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))]
| mem-estimate=2.57MB mem-reservation=0B thread-reservation=0
| tuple-ids=0 row-size=1.28KB cardinality=unavailable
| in pipelines: 00(GETNEXT)
|
Apparently, most of the time and resources are spent sorting on the partition keys:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
02:SORT 2 17m16s 30m50s 55.05M -1 25.60 GB 12.00 MB
01:EXCHANGE 2 9s493ms 12s822ms 55.05M -1 26.98 MB 2.90 MB HASH(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))
00:SCAN HDFS 2 51s958ms 1m10s 55.05M -1 76.06 MB 704.00 MB default.csv_table
Why does Impala have to do this? Is there any way to partition the table without having to sort on the partition keys, or a way to speed it up in my case, where the entirety of the .csv files I'm trying to append has only 1 or 2 partition keys?
EDIT: It turns out that this is most likely because I'm using the Parquet file format. My question still applies though: is there a way to speed up the sort when I know there is little to no sorting actually required?
By comparison, an operation like SELECT COUNT(*) FROM csv_table WHERE extract(ts, "year") = 2018 AND extract(ts, "month") = 1 takes around 2-3 minutes, whereas the ORDER BY (as done during the insert) takes over an hour. This example only had the keys (2018,1) and (2018,2).
You can add a hint to disable the sort stage.
INSERT INTO TABLE main_table
PARTITION(yr, mth) /* +NOCLUSTERED */
SELECT
*,
CAST(extract(ts, "year") AS SMALLINT) AS yr,
CAST(extract(ts, "month") AS TINYINT) AS mth
FROM csv_table
As explained here: Optimizer Hints
/* +CLUSTERED */ and /* +NOCLUSTERED */ Hints
/* +CLUSTERED */ sorts data by the partition columns before inserting to
ensure that only one partition is written at a time per node. Use this
hint to reduce the number of files kept open and the number of buffers
kept in memory simultaneously. This technique is primarily useful for
inserts into Parquet tables, where the large block size requires
substantial memory to buffer data for multiple output files at once.
This hint is available in Impala 2.8 or higher. Starting in Impala 3.0,
/* +CLUSTERED */ is the default behavior for HDFS tables.
/* +NOCLUSTERED */ does not sort by primary key before insert. This
hint is available in Impala 2.8 or higher. Use this hint when
inserting to Kudu tables.
In the versions lower than Impala 3.0, /* +NOCLUSTERED */ is the
default in HDFS tables.
Impala does the sorting because you use dynamic partitioning. Especially on tables without computed stats, Impala does not handle dynamic partitioning very well. I advise you to use Hive for dynamic partitions. If you are not about to use Hive, my advice is:
Run COMPUTE STATS on the csv table before each INSERT INTO statement.
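For example (assuming the staging table keeps the name csv_table from the question):
-- gather table and column stats so Impala can plan the partitioned insert better
COMPUTE STATS csv_table;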
If the first step does not help enough, use static partitioning for the few partitions you know are present, and run the dynamic-partition insert only for the rows outside those ranges. For example, if there is only one possible year and month:
INSERT INTO TABLE main_table
PARTITION(yr=2019, mth=2)
SELECT
*
FROM csv_table where CAST(extract(ts, "year") AS SMALLINT)=2019 and CAST(extract(ts, "month") AS TINYINT)=2;
INSERT INTO TABLE main_table
PARTITION(yr, mth)
SELECT
*,
CAST(extract(ts, "year") AS SMALLINT),
CAST(extract(ts, "month") AS TINYINT)
FROM csv_table where NOT (CAST(extract(ts, "year") AS SMALLINT)=2019 and CAST(extract(ts, "month") AS TINYINT)=2);
These statements shrink the set of rows that the dynamic-partition insert has to deal with, which should decrease the total time spent.

Query - find empty interval in series of timestamps

I have a table that stores historical data. A row gets inserted into this table every 30 seconds from different types of sources, and obviously there is a timestamp associated with it.
Let's set my disservice parameter to 1 hour.
Since I charge for my services based on time, I need to know, for a specific month for example, whether there is a gap between consecutive timestamps within that month that equals or exceeds my 1-hour interval.
A simplified structure of the table would be like:
tid serial primary key,
tunitd id int,
tts timestamp default now(),
tdescr text
I don't want to write a function that loops through all the records comparing them one by one, as I suppose that would be time- and memory-consuming.
Is there any way to do this directly from SQL maybe using the interval type in PostgreSQL?
Thanks.
This small SQL query will display all gaps with a duration of more than one hour:
select tts, next_tts, next_tts-tts as diff from
(select a.tts, min(b.tts) as next_tts
from test1 a
inner join test1 b ON a.tts < b.tts
GROUP BY a.tts) as c
where next_tts - tts > INTERVAL '1 hour'
order by tts;
SQL Fiddle
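If the table grows large, the self-join above can get slow; on PostgreSQL 8.4 or later the same result can be obtained with the lag() window function instead (a sketch against the simplified table structure above, assuming it is named test1 as in the answer):
-- compare each timestamp with the previous one and keep only gaps longer than an hour
select prev_tts as gap_start,
       tts as gap_end,
       tts - prev_tts as diff
from (
    select tts,
           lag(tts) over (order by tts) as prev_tts
    from test1
) g
where tts - prev_tts > INTERVAL '1 hour'
order by gap_start;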