In Apache Hive, I'm trying to copy specific rows from one table to a second table that's identical apart from an additional string column (which I'm calling "report-type") at the end of the second table. Both tables are partitioned by a string field called 'dt' which has a date e.g. "2022-08-04". When I try and copy a row from table 1 to table 2, the data is inserted into table 2 with report-type and dt swapped, because the partition column seems to be forcibly listed last.
E.g. INSERT INTO table2 SELECT *, 'some_report_type' FROM table1 WHERE <some criteria>;
gives all the data in table2 in the correct columns, except that report-type is e.g. "2022-08-04" and dt is e.g. "some_report_type".
Is there any way around this?
Two solutions I can see: recreate the table without the partitioning and just have dt as a regular non-partition column (ideally I want to avoid this), or specify each of the columns in a column list in the query. I'm not sure whether the latter would stop "dt" from being forced into last position, and the main issue with it is that I have 830 columns to specify individually.
Thanks
I have a Hive table (A) with two columns, both sorted in a particular order but containing duplicates. I need to insert only the unique values from table A into table B while preserving the order. I am unable to do this with DISTINCT because the sort order gets changed.
If the question is about a sorted, bucketed table, read this answer: https://stackoverflow.com/a/41249147/2700344 You should add CLUSTER BY, or DISTRIBUTE BY + SORT BY, when inserting the data.
If you are selecting from such a sorted table and expect the returned dataset to be sorted without an ORDER BY clause, read this answer: https://stackoverflow.com/a/47416027/2700344
In short: a SELECT without ORDER BY does not guarantee the order.
We have an Avro partitioned table in Hive. When we query the table, the partition column is displayed at the end. Is there any way to display the partition column first?
Eg: select * from tablea
Output:
Col1 col2 partition_column
Expected output:
Partition_column col1 col2
The partition column is not stored in the data files, so whether the table is Avro or not does not matter in this context. The partition column corresponds to a partition sub-folder within the table folder and is stored in the metadata.
Historically, the partition column is the last one. Dynamic partitioning using INSERT OVERWRITE TABLE ... PARTITION (partition_column) SELECT * FROM ... is a rather common scenario, and Hive knows that the partition is the last column.
The dynamic partition columns must be specified last among the columns
in the SELECT statement and in the same order in which they appear in
the PARTITION() clause.
You can change the order of columns displayed when running SELECT * only by creating a view in which you list all columns in the required order, OR select columns explicitly in your select.
Also, according to Codd's relational theory, column and row order is immaterial: you should always specify the desired column order explicitly in the SELECT and the row order using ORDER BY, instead of relying on the column and row order in the table or view. But in Hive the partitioning column is the last one in the table.
Consider this as well: you may not even know what you are selecting from, a table or a view. And you may not be notified if an upstream system eventually decides to change the table or view, which can change the order of columns. Treat a view the same as a table when doing selects; it is just an abstraction level. Use an explicit column list so that your program works reliably and has no strong dependency on the column order in the underlying table or view, which is immaterial.
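For a table with hundreds of columns, the explicit column list does not have to be typed by hand. The following is a minimal sketch, not a definitive implementation: `build_partitioned_insert` is a hypothetical helper, the column names and the WHERE condition are made up, and the real column list would have to come from the table definition (for example from DESCRIBE output). It puts the extra constant column before the partition column, so that the dynamic partition column is last in the SELECT as the rule quoted above requires. Depending on configuration, Hive may also need hive.exec.dynamic.partition.mode set to nonstrict for a fully dynamic partition insert.

```python
def build_partitioned_insert(target, source, data_columns, extra_exprs,
                             partition_col, where):
    """Build a Hive INSERT statement that lists every column explicitly.

    data_columns : non-partition columns shared by both tables, in table order
    extra_exprs  : (sql_expression, alias) pairs for columns only the target has
    partition_col: dynamic partition column; it is placed last in the SELECT
    """
    def quote(name):
        # Backticks let Hive accept names such as report-type.
        return "`" + name + "`"

    select_items = [quote(c) for c in data_columns]
    select_items += [f"{expr} AS {quote(alias)}" for expr, alias in extra_exprs]
    select_items.append(quote(partition_col))  # partition column goes last

    return (
        f"INSERT INTO TABLE {quote(target)} PARTITION ({quote(partition_col)})\n"
        f"SELECT {', '.join(select_items)}\n"
        f"FROM {quote(source)}\n"
        f"WHERE {where}"
    )


# Hypothetical column names and filter; the real table has 830 columns.
columns = ["col1", "col2", "col3"]
sql = build_partitioned_insert(
    "table2", "table1", columns,
    [("'some_report_type'", "report-type")],
    "dt", "col1 = 42",
)
print(sql)
```

Hive matches INSERT ... SELECT columns by position, so the aliases are only there for readability; what matters is that the constant comes before dt.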
I have a SQLite table sorted by column ID. But I need to sort it by another numerical field called RunTime.
CREATE TABLE Pass_2 AS
SELECT RunTime, PosLevel, PosX, PosY, Speed, ID
FROM Pass_1
The table Pass_2 looks good, but I need to renumber the ID column from 1 .. n without resorting the records.
It is a principle of SQL databases that the underlying tables have no natural or guaranteed order to their records. You must specify the order in which you want to see the records when SELECTing from a table using an ORDER BY clause.
You can obtain the records you want using SELECT * FROM your_table ORDER BY RunTime, and that is the correct and reliable way to do this in any SQL database.
If you want to attempt to get the records in Pass_2 to "be" in RunTime order, you can add the ORDER BY clause to the SELECT you use to create the table but remember: you are not guaranteed to get the records back in the order in which they were added to the table.
When might you get the records back in a different order? This is most likely to happen when your query can be answered using columns in a covering index. In that case the records are more likely to be returned in index order than in any "natural" order (but again, there are no guarantees without an ORDER BY clause).
If you want a new ID column starting at 1, use the ROW_NUMBER() function. Instead of ID in your query, use ROW_NUMBER() OVER (ORDER BY RunTime) AS ID. This replaces the old ID column with a freshly calculated column.
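The ROW_NUMBER() suggestion can be tried out with Python's built-in sqlite3 module. This is a small sketch under assumptions: the sample rows are made up, only some of the question's columns are kept, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pass_1 (ID INTEGER, RunTime REAL, Speed REAL)")
conn.executemany(
    "INSERT INTO Pass_1 VALUES (?, ?, ?)",
    [(1, 30.5, 1.0), (2, 10.2, 2.0), (3, 20.7, 3.0)],  # made-up sample rows
)

# Replace the old ID with a number computed from the RunTime ordering.
conn.execute(
    """
    CREATE TABLE Pass_2 AS
    SELECT RunTime, Speed,
           ROW_NUMBER() OVER (ORDER BY RunTime) AS ID
    FROM Pass_1
    """
)

# Read back with an explicit ORDER BY; the table itself has no guaranteed order.
rows = conn.execute("SELECT ID, RunTime FROM Pass_2 ORDER BY ID").fetchall()
print(rows)
```

The new ID follows RunTime regardless of how the rows happen to be stored, so later queries can use ORDER BY ID to get the RunTime order.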
I need the SQL update statement to assign consecutive sequence numbers to subsets of records in a table. I'm using MS access.
Let's say the current table has records like:
notebook,blue
notebook,yellow
pencil,yellow
chair,blue
desk,green
desk,blue
I would like to add another field to the table and populate it as follows:
notebook,blue,1
notebook,yellow,1
pencil,yellow,2
chair,blue,2
desk,green,1
desk,blue,3
As you can see, I have assigned consecutive numbers based on a certain set of criteria. In this example, the criterion is the distinct value in the second field. (In real life, the criterion will be a distinct combination of values from several fields, but all the relevant fields are within the same table, so no join is needed to get the criteria.) Since there are three records with blue in field 2, these are numbered 1, 2, 3. And since there are two records with yellow, they are numbered 1, 2.
So I can't derive the numbering from the row number, since I have several numbering series in the same table all starting with 1.
Also, I need it to be a query where I don't have to explicitly specify the value in the second field. I just want each unique value in the second field to get its own numbering series. that is, I don't want to have to explicitly write one query to generate the numbers for "blue", and write a separate query to generate the numbers for "yellow"
The maximum number of records in a series is under 1000. So I don't mind if I need to create an auxiliary table with 1000 records, with a field containing the values 1 to 1000. Then the update statement to the primary table could pull in the next value from the auxiliary table.
But I don't know the SQL syntax to use for this update statement, or for the update statement for any other approach. So I need your advice.
I'm not sure how to do this with a single SQL statement, but here are 2 SQL statements that could be used to handle each case:
insert into table (field1, field2, field3)
select 'desk', 'blue', 1
where not exists (select field3 from table where field1 = 'desk' and field2 = 'blue');
insert into table (field1, field2, field3)
select field1, field2, count(1) + 1
from table
where field1 = 'desk'
and field2 = 'blue'
group by field1, field2;
Create Table #TableAutoIncrement (ID int identity(1000 , 1) , item varchar(20), COLOR varchar(20) )
Insert INTO #TableAutoIncrement
(item, COLOR )
SELECT item, COLOR FROM YOURTABLE
--- GETTING all the values from the temporary table
SELECT * FROM #TableAutoIncrement
A colleague of mine worked out the necessary SQL. Here's the generalized solution. (Note that I really needed to number the multiple series in my data set based on a combination of two fields. In my simplified example in the original post I was using only one field, color, but since I really need two fields, that's what I show in this solution.)
SELECT *,
(SELECT COUNT(T1.ID)
FROM
[TableName] AS T1
WHERE T1.ID >= T2.ID and t1.[NameCriteriaField1] = t2.[NameCriteriaField1]
and t1.[NameCriteriaField2]= t2.[NameCriteriaField2])
AS Sequence into OutputResultsTableName
FROM
[TableName] AS T2
ORDER BY [NameCriteriaField2] , [NameCriteriaField1]
The source table is set up with "ID" as field with an integer value. Every record has a unique value of ID, but it does not matter if there are gaps in the ID or how the records are sequenced against the ID. (e.g., the typical MS access auto numbered primary key field serves this purpose)
This query is set up to assume that there are two fields in your data set that you want to use to group your records and assign a numerical series count to each record within each group. (Thus your table may contain multiple groups, and each group has its own numbering series starting with 1. But the way the query is formulated, there are exactly 2 criteria that define the group.) You cannot use any where clauses to further filter the records that get counted. Through experimentation, I found that adding where clauses gives unreliable results where records can get omitted. So if you need the results to be filtered so that some records are not to be included in the numerical series for a particular group, then do one of the following before running my query:
run a query to delete the undesired records from the source table
first copy all records from the source table into a new table and delete the records from the new table that should not be numbered, and run my query on the new table
Deleting extraneous records before running this query is needed only if those records qualify as members of a group defined by criteria 1 and criteria 2. If there are extraneous records that don't match those two criteria, you can leave them in the table, because they will not impact the numbering of the records within the groups that you care about. They will just get their own independent numbering, which you can ignore.
The numbering of each group starts at 1, and the query dynamically defines the groups based on the distinct combinations of criteria1 and criteria2. However, records that do not belong to any group are all numbered 0. Criteria1 and criteria2 were non-null values in all of my testing. (In theory, at least in Microsoft Access, an empty string is different from Null, but I did not test this with empty strings either.) If you have records with Null in the criteria1 or criteria2 fields, MS Access considers them as not belonging to any group and numbers them 0. That is, the groups need to be defined by non-null values for criteria1 and criteria2, which differs from the way the SQL DISTINCT keyword works.
If you need to have NULL as a valid criteria for defining the group (and thus to have groups defined by NULL numbered), it's very simple. Prior to running my query, first run an update statement that changes all instances of null values in criteria1 or criteria2 to the phrase "placeholder for null field". Then run my query. On the result set (after the numbering has been assigned to the groups), run another update to change all occurrences of the placeholder phrase back to null.
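The placeholder workaround can be illustrated outside Access. The sketch below uses Python's sqlite3 module rather than MS Access, with a made-up table `items` and a single criteria field `color`, so it only shows the idea: a comparison against NULL never matches, so those rows count 0 until the NULLs are swapped for the placeholder phrase. Here the placeholder is swapped back in the source table after numbering, whereas the description above does it on the result table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (ID INTEGER PRIMARY KEY, item TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO items (item, color) VALUES (?, ?)",
    [("desk", "blue"), ("lamp", None), ("chair", "blue"), ("rug", None)],
)

# Same correlated-count pattern as the posted query, with one criteria field.
NUMBERING = """
    SELECT T2.ID,
           (SELECT COUNT(T1.ID) FROM items AS T1
             WHERE T1.ID >= T2.ID AND T1.color = T2.color) AS Sequence
    FROM items AS T2
    ORDER BY T2.ID
"""
PLACEHOLDER = "placeholder for null field"

# NULL colors match nothing, so rows 2 and 4 are numbered 0.
before = conn.execute(NUMBERING).fetchall()

# Swap NULL for the placeholder, number again, then restore the NULLs.
conn.execute("UPDATE items SET color = ? WHERE color IS NULL", (PLACEHOLDER,))
after = conn.execute(NUMBERING).fetchall()
conn.execute("UPDATE items SET color = NULL WHERE color = ?", (PLACEHOLDER,))

print(before)
print(after)
```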
Adjustment to syntax if your group is defined by only one field criteria
SELECT *,
(SELECT COUNT(T1.ID)
FROM
[TableName] AS T1
WHERE T1.ID >= T2.ID and t1.[NameCriteriaField1] = t2.[NameCriteriaField1] )
AS Sequence into OutputResultsTableName
FROM
[TableName] AS T2
ORDER BY [NameCriteriaField1]
Adjustment to syntax if your group is defined by combination of 3 field criteria
SELECT *,
(SELECT COUNT(T1.ID)
FROM
[TableName] AS T1
WHERE T1.ID >= T2.ID and t1.[NameCriteriaField1] = t2.[NameCriteriaField1]
and t1.[NameCriteriaField2]= t2.[NameCriteriaField2]
and t1.[NameCriteriaField3]= t2.[NameCriteriaField3])
AS Sequence into OutputResultsTableName
FROM
[TableName] AS T2
ORDER BY [NameCriteriaField2] , [NameCriteriaField1]
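To see what the two-field version of the query produces, here is a minimal sketch run against SQLite through Python's sqlite3 module instead of MS Access. The table `src`, the fields `crit1` and `crit2`, and the rows are all made up, and CREATE TABLE ... AS stands in for Access's SELECT ... INTO. Note that with T1.ID >= T2.ID, as posted, the record with the highest ID in each group gets 1; using <= instead would number each group upward from its lowest ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (ID INTEGER PRIMARY KEY, crit1 TEXT, crit2 TEXT)")
conn.executemany(
    "INSERT INTO src (crit1, crit2) VALUES (?, ?)",
    [
        ("notebook", "blue"),    # ID 1
        ("notebook", "yellow"),  # ID 2
        ("notebook", "blue"),    # ID 3
        ("desk", "blue"),        # ID 4
        ("desk", "blue"),        # ID 5
        ("notebook", "blue"),    # ID 6
    ],
)

# For each row, count the rows of the same (crit1, crit2) group
# whose ID is greater than or equal to this row's ID.
conn.execute(
    """
    CREATE TABLE numbered AS
    SELECT T2.*,
           (SELECT COUNT(T1.ID) FROM src AS T1
             WHERE T1.ID >= T2.ID
               AND T1.crit1 = T2.crit1
               AND T1.crit2 = T2.crit2) AS Sequence
    FROM src AS T2
    """
)

rows = conn.execute("SELECT ID, Sequence FROM numbered ORDER BY ID").fetchall()
print(rows)
```

Each (crit1, crit2) combination gets its own series without being named in the query, which is what the question asked for.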