Dynamic partition cannot be the parent of a static partition '3' - hive

While inserting data into a table, Hive threw the error "Dynamic partition cannot be the parent of a static partition '3'" for the query below:
INSERT INTO TABLE student_partition PARTITION(course , year = 3)
SELECT name, id, course FROM student1 WHERE year = 3;
Please explain the reason.

The reason for this exception is that partitions are hierarchical folders: the course folder is the upper level and the year folders are nested inside it, one per year.
When partitions are created dynamically, the upper folder (course) must be created first, and only then the nested year=3 folder.
Here you are providing the year=3 partition in advance (statically), before course is even known.
The other way around is possible: a static parent partition with a dynamic child partition:
INSERT INTO TABLE student_partition PARTITION(course='chemistry' , year) --static course partition
SELECT name, id, 3 as year --or just simply year
FROM student1 WHERE year = 3;
In HDFS the partition folders look like this:
/student_table/course=chemistry/year=3
/student_table/course=chemistry/year=4
/student_table/course=philosophy/year=3
A static partition folder has to exist up front, but it cannot exist while its parent folder is not defined yet.
Alternatively, you can make the year partition dynamic as well:
INSERT INTO TABLE student_partition PARTITION(course , year)
SELECT name, id, course, 3 as year --or just simply year
FROM student1 WHERE year = 3;
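Note that a fully dynamic insert like this last one usually also requires dynamic partitioning to be enabled in non-strict mode; a minimal sketch of the session settings, assuming your cluster still has the strict defaults:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict; -- strict mode insists on at least one static partition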

Create a BigQuery view to get the latest rows from a partitioned (and clustered) table

The issue
I'm trying to create a view to get the latest rows from a partitioned table, filtered on the date partition _LOCALDATETIME and zero or more cluster fields. I can create a view which uses a partition and I can create a view which handles some filters, but I can't work out the syntax to achieve both.
An example query requirement
SELECT fieldA, fieldB, fieldC FROM theView
WHERE date between '2021-01-01' and '2021-12-31' AND
_CLUSTERFIELD1 = 'foo'
GROUP BY _CLUSTERFIELD2
ORDER BY _CLUSTERFIELD3
Table schema
_LOCALDATETIME
_id
_CLUSTERFIELD1
_CLUSTERFIELD2
_CLUSTERFIELD3
_CLUSTERFIELD4
...other fields
Based on what I understand from your case, I came up with this approach.
I created a table partitioned on _LOCALDATETIME with clustered fields, and then a view that returns the data from a defined date range, keeping only the latest element per partition based on _id. That gives a view containing the last items of a partitioned table within a fixed date range.
view
CREATE VIEW `<my-project-id>.<dataset>.<view>` AS
WITH range_id AS (
  SELECT MAX(_id) AS last_id_partition, _localdatetime AS partition_
  FROM `<my-project-id>.<dataset>.<table>`
  WHERE _localdatetime BETWEEN "2020-01-01" AND "2022-01-01"
  GROUP BY _localdatetime
)
SELECT s.*
FROM `<my-project-id>.<dataset>.<table>` s
INNER JOIN range_id r
  ON s._id = r.last_id_partition AND s._localdatetime = r.partition_
WHERE _localdatetime BETWEEN "2020-01-01" AND "2022-01-01"
GROUP BY _id, _localdatetime, _name, _location
The view returns the last ids of a partitioned, clustered table, together with the clustered fields, for the date range the view covers (years 2020 and 2021).
query
select * from `<my-project-id>.<dataset>.<view>`
WHERE _localdatetime between '2021-12-21' and '2021-12-22'
and <clusteredfield> = 'Venezuela'
It returns the records available for that filter, since the data is already defined in the view.
What you can't do is have a view without the partition field, as that field must be present to query the partitioned table. You can also wrap the queries in a table function to further customize your outputs.
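To sketch that last point: a BigQuery table function lets you parameterize the date range instead of hard-coding it in the view. The function name, parameter types, and call below are illustrative, assuming _localdatetime compares against DATE values the way it does against the date strings above:
CREATE OR REPLACE TABLE FUNCTION `<my-project-id>.<dataset>.latest_rows`(start_date DATE, end_date DATE)
AS (
  WITH range_id AS (
    SELECT MAX(_id) AS last_id_partition, _localdatetime AS partition_
    FROM `<my-project-id>.<dataset>.<table>`
    WHERE _localdatetime BETWEEN start_date AND end_date
    GROUP BY _localdatetime
  )
  SELECT s.*
  FROM `<my-project-id>.<dataset>.<table>` s
  INNER JOIN range_id r
    ON s._id = r.last_id_partition AND s._localdatetime = r.partition_
);
-- illustrative call:
SELECT * FROM `<my-project-id>.<dataset>.latest_rows`(DATE '2021-12-21', DATE '2021-12-22');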

Data Loaded wrongly into Hive Partitioned table after adding a new column using ALTER

I already have a Hive partitioned table. I needed to add a new column to the table, so I used ALTER to add the column, like below.
ALTER TABLE TABLE1 ADD COLUMNS(COLUMN6 STRING);
I have my final table load query like this:
INSERT OVERWRITE TABLE Final_table PARTITION(COLUMN4, COLUMN5)
select
    stg.Column1,
    stg.Column2,
    stg.Column3,
    stg.Column4, -- partition column; field: Code, sample value: YAHOO.COM
    stg.Column5, -- partition column; field: Date, sample value: 2021-06-25
    stg.Column6  -- new column; field: reason, sample value: Adjustment
from (
    select fee.* from (
        select
            fees.*,
            ROW_NUMBER() OVER (PARTITION BY fees.Column1 ORDER BY fees.Column3 DESC) as RNK
        from Stage_table fees
    ) fee
    where RNK = 1
) stg
left join (
    select Column1 from Final_table
    where Column5 in (select distinct Column5 from Stage_table) -- Column5 is the date partition column
) TGT
    on tgt.Column1 = stg.Column1 -- Column1 is the id
where tgt.Column1 is null
UNION
select
    tgt.Column1, -- id
    tgt.Column2,
    tgt.Column3,
    tgt.Column4, -- partition column
    tgt.Column5, -- partition column (date)
    tgt.Column6  -- new column
from Final_table TGT
WHERE TGT.Column5 in (select distinct Column5 from Stage_table);
Now when my job ran today and I tried to query the final table, I got the error below:
Invalid partition value 'Adjustment' for DATE partition key: Code=2021-06-25/date=Adjustment
I can tell that something went wrong around the partition columns, but I am unable to figure out what. Can someone help?
Partition columns should be the last ones in the select list. When you add a new column, it is added as the last non-partition column; the partition columns remain last. They are not stored in the data files, only the metadata contains information about partitions. The order of all other columns also matters and must match the table DDL; check it using DESCRIBE FORMATTED table_name.
INSERT OVERWRITE TABLE Final_table PARTITION(COLUMN4, COLUMN5)
select
    stg.Column1,
    stg.Column2,
    stg.Column3,
    stg.Column6, -- new column: last non-partition column
    stg.Column4, -- partition columns go last
    stg.Column5
...
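To confirm the expected order before running the insert, you can inspect the table layout (a minimal check; the comments describe what to look for rather than exact output):
DESCRIBE FORMATTED Final_table;
-- the regular columns are listed first (Column1, Column2, Column3, Column6),
-- followed by a "# Partition Information" section listing COLUMN4 and COLUMN5;
-- the select list must match the regular-column order, with the partition columns last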

Is there a way to check multiple columns using "IN" condition in Redshift Spectrum?

I have a Redshift Spectrum table named customer_details_table where the column id is not unique. I have another column, hierarchy, which determines which record should be given priority when rows share the same id. Here's an example:
If we encounter the id 28846 multiple times, we choose John as the qualified record, since he has the maximum hierarchy.
I'm trying to create this eligibility column using a group by on id and then selecting the record corresponding to maximum hierarchy. Here's my SQL code:
SELECT *,
CASE WHEN (
(id , hierarchy) IN
(SELECT id , max(hierarchy)
FROM
customer_details_table
GROUP BY id
)
) THEN 'Qualified' ELSE 'Disqualified' END as eligibility
FROM
customer_details_table
Upon running this I get the following error:
SQL Error [500310] [XX000]: [Amazon](500310) Invalid operation: This type of IN/NOT IN query is not supported yet;
The above code works fine when my table (customer_details_table) is a regular Redshift table, but fails when the same table is an external spectrum table. Can anyone please suggest a good solution/alternative to achieve the same logic in spectrum tables?
You can use window functions to generate the eligibility column:
Basically you need to partition the rows by id, and rank by descending hierarchy within each group.
select
*,
case when row_number() over(partition by id order by hierarchy desc) = 1
then 'Qualified' else 'Disqualified'
end eligibility
from customer_details_table
You can use window functions:
select cdt.*
from (select cdt.*,
row_number() over (partition by id order by hierarchy desc) as seqnum
from customer_details_table cdt
) cdt
where seqnum = 1;

Partition by week/month/quarter/year to get over the partition limit?

I have 32 years of data that I want to put into a partitioned table. However BigQuery says that I'm going over the limit (4000 partitions).
For a query like:
CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate
AS
SELECT *
FROM `flights.original`
I'm getting an error like:
Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions
How can I get over this limit?
Instead of partitioning by day, you could partition by week/month/year.
In my case each year of data contains around ~3GB of data, so I'll get the most benefits from clustering if I partition by year.
For this, I'll create a year date column, and partition by it:
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`
Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.
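If you need something finer than a year, the same trick works at month granularity (32 years is only about 384 monthly partitions, well under the limit); a minimal sketch, with a hypothetical table name:
CREATE TABLE `fh-bigquery.flights.ontime_by_month`
PARTITION BY FlightDate_month
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, MONTH) FlightDate_month
FROM `fh-bigquery.flights.raw_load_fixed`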
Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (year) as a filter:
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
Predicted cost: 83.4 GB
Actual cost: 3.2 GB
As an alternative example, I created a NOAA GSOD summary table clustered by station name - and instead of partitioning by day, I didn't partition it at all.
Let's say I want to find the hottest days since 1980 for all stations with a name like SAN FRAN%:
SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
Note that I got the results after processing only 55.2MB of data.
The equivalent query on the source tables (without clustering) processes 4GB instead:
# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC
I also added a geo clustered table, to search by location instead of station name. See details here: https://stackoverflow.com/a/34804655/132438

Selecting the last entry in sql database for each id field

Hi all, I am using SQL Server.
I have one table that has a whole list of details on cars and events that have happened with those cars.
What I need is to be able to pick out the last entry for each vehicle based on their (Reg_No) registration number.
I have the following to work with
Table name = UnitHistory
Columns = indx (this is just the primary key, with auto-increment),
Transdate (this is my datetime column), and Reg_No (unique to each vehicle).
There are about 45 vehicles with registration numbers, if that helps.
I have looked at different examples but they all seem to have another table to work with.
Please help me. Thanks in advance for the help
WITH cte
AS
(
SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY Reg_No
ORDER BY Transdate DESC
) AS RowNumber
FROM unithistory
)
SELECT *
FROM cte
WHERE RowNumber = 1
If you only need the index and the Transdate, and they are both incremental (I am assuming that a later date corresponds to a higher index number), then the simplest query would be:
SELECT Reg_No, MAX(indx), MAX(Transdate)
FROM UnitHistory
GROUP BY Reg_No
If you want all data for a known Reg_No, you can use Dd2's answer.
If you want a list of all Reg_No's with their data, you will need a subquery, such as the one sketched below.
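A minimal sketch of that subquery, again assuming indx increases along with Transdate:
SELECT u.*
FROM UnitHistory u
JOIN (
    -- latest (highest) indx per registration number
    SELECT Reg_No, MAX(indx) AS max_indx
    FROM UnitHistory
    GROUP BY Reg_No
) latest
    ON latest.Reg_No = u.Reg_No
   AND latest.max_indx = u.indx;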