I have a file:
id,name,address
001,adam,1-A102,mont vert
002,michael,57-D,costa rica
I have to create a Hive table with three columns (id, name and address) from this comma-delimited file, but the address column itself contains commas. How can I handle this?
One possible solution is using RegexSerDe:
CREATE TABLE my_table (
id string,
name string,
address string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(.*?),(.*?),(.*?)$')
location 'put location here'
;
Replace the location value with your table location and put the file(s) into that location.
The first group (.*?) matches everything before the first comma, the second group matches everything between the first and second commas, and the third group matches everything after the second comma.
Also add TBLPROPERTIES("skip.header.line.count"="1") if you need to skip the header and it always exists in the file. If the header can be absent, you can filter header rows out instead using where id != 'id'
You can also easily test the regex for extracting columns without creating a table, like this:
select regexp_replace('002,michael,57-D,costa rica','^(.*?),(.*?),(.*?)$','$1|$2|$3');
Result:
002|michael|57-D,costa rica
In this example the query returns the three groups separated by |. This way you can test your regular expression and check that the groups are defined correctly before creating the table with it.
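The same pattern can also be checked outside Hive. This is a minimal sketch in Python, assuming that Python's re module treats this particular pattern the same way as the Java regex engine that RegexSerDe uses. The helper split_row is hypothetical and not part of Hive.

```python
import re

# Same pattern as in the DDL: two lazy groups, then everything that is left.
ROW_PATTERN = re.compile(r'^(.*?),(.*?),(.*?)$')

def split_row(line):
    """Return (id, name, address) for one input line, or None if it does not match."""
    match = ROW_PATTERN.match(line)
    return match.groups() if match else None

for line in ['001,adam,1-A102,mont vert', '002,michael,57-D,costa rica']:
    print(split_row(line))
```

Lines with fewer than two commas do not match at all, so split_row returns None for them.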
Answering the question in the comment: you can have an address with a comma and one more column without a comma, like this:
select regexp_replace('001,adam,1-A102, mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102, mont vert|sydney
Checking that the comma is optional in the address column:
select regexp_replace('001,adam,1-A102 mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
Returns:
001|adam|1-A102 mont vert|sydney
Read this article for better understanding: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562
[^,] means "not a comma", so the last column can contain anything except a comma.
And of course, add one more column to the DDL.
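The four-column variant can be checked the same way. This is a sketch in Python under the assumption that its re module matches this pattern like Java's engine does; parse is a hypothetical helper name.

```python
import re

# Four groups: the third may contain commas, the last one ([^,]*?) may not.
PATTERN = re.compile(r'^(.*?),(.*?),(.*?),([^,]*?)$')

def parse(line):
    """Return the four captured fields, or None when the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None

print(parse('001,adam,1-A102, mont vert,sydney'))
print(parse('001,adam,1-A102 mont vert,sydney'))
```

Because only the last group excludes commas, any extra commas in a line end up in the third (address) field.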
I ran this in AWS Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS `nina-nba-database`.`nina_nba_test` (
`Data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 'nina'
) LOCATION 's3://nina-gray/'
TBLPROPERTIES ('has_encrypted_data'='false');
However, when I try to select from the table using the query below:
SELECT * FROM "nina-nba-database"."nina_nba_table" limit 10;
It gives me this error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
This query ran against the "layla-nba-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: b96e4344-5bbe-4eca-9da4-70be11f8e87d
Would anyone be able to help?
The input.regex in your query doesn't look valid for this table. Each capturing group in the regex becomes a column, and 'nina' contains no capturing groups while the table declares one column, which is why the number of matching groups doesn't match the number of columns. If you want to extract parts of each line into columns, specify a regex with one group per column; to understand more, you can refer to the Regex SerDe examples in the AWS documentation. Or, if you just need to read delimited columnar data, you can create the table with the proper delimiter. For example, if your data is comma-separated, you can specify the delimiter as:
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
...
Have a look at this example for more details.
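The group-count rule behind the error can be illustrated with a short sketch in Python. It assumes that Python's re module counts capturing groups the same way as Java's regex engine for these simple patterns. check_regex is a hypothetical helper, and '(.*)' is just one possible regex that captures the whole line into a single column.

```python
import re

def check_regex(input_regex, columns):
    """Return True when the number of capturing groups equals the number of columns."""
    return re.compile(input_regex).groups == len(columns)

print(check_regex('nina', ['Data']))   # no capturing group, one column
print(check_regex('(.*)', ['Data']))   # one capturing group, one column
```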
I am trying to compare names in two different tables.
In Table1 the field is called Name1 and has values like Lynn Smith.
In Table2 the field is called Name2 and has values like Lynn Smith (Extra).
How can I compare the two name values, ignoring the text in the brackets?
I want to write a query that returns some other fields for rows where the main name is the same.
One method would use like:
select . . .
from t1 join
t2
on t2.name2 like t1.name1 + ' (%)';
However, this is probably not efficient. If you want performance, you can extract the name into a separate computed column in the second table and create an index on it:
alter table t2 add name_cleaned as
(left(name2, charindex(' (', name2 + ' (') - 1));
create index idx_t2_name_cleaned on t2(name_cleaned);
Then you can phrase the query as:
select . . .
from t1 join
t2
on t2.name_cleaned = t1.name1;
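The cleaning expression can be tried out on a small in-memory database. This is a sketch using Python's sqlite3 module, where SQLite's substr and instr stand in for left and charindex; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE t1 (name1 TEXT);
CREATE TABLE t2 (name2 TEXT);
INSERT INTO t1 VALUES ('Lynn Smith'), ('Bob Jones');
INSERT INTO t2 VALUES ('Lynn Smith (Extra)'), ('Bob Jones'), ('Ann Lee (Other)');
""")

# Appending ' (' guarantees a match, so names without brackets are kept whole.
rows = conn.execute("""
SELECT t1.name1, t2.name2
FROM t1 JOIN t2
  ON substr(t2.name2, 1, instr(t2.name2 || ' (', ' (') - 1) = t1.name1
ORDER BY t1.name1
""").fetchall()

for row in rows:
    print(row)
```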
Another way is to compare the names directly after cleaning up one side.
Unlike Gordon's answer, I'd do this with a separate table that holds the data to compare from Table2.
SELECT Table2Id, Name2, CAST(NULL AS varchar(255)) AS cleanedName INTO NewTable FROM Table2
The CAST gives cleanedName a string type (the length 255 is an assumption; match it to Name2). Now update the cleanedName column to strip the extra information from the Name2 column, as below. You may also create an index on this table.
UPDATE NewTable
SET cleanedName = RTRIM(LEFT(Name2, CHARINDEX('(', Name2 + '(') - 1))
Now drop and re-create the index on the cleanedName column, and then compare it with the Table1.Name1 column.
If all the values in Table2.name2 have a space between the end of the name and the opening bracket, then you could use this:
SELECT SUBSTRING('Lynn Smith (Extra)',1,PATINDEX('%(%','Lynn Smith (Extra)')-2)
If you replace the constant 'Lynn Smith (Extra)' with the column name (unquoted, because 'name2' in quotes would be a string constant):
SELECT SUBSTRING(name2,1,PATINDEX('%(%',name2)-2) FROM Table2
then it shows a list of the values in name2 without the text in the brackets, in other words in the same format as the names in name1 in Table1.
SUBSTRING and PATINDEX are String functions.
SUBSTRING asks for three 'arguments': (1) expression (2) start and (3) length.
(1) As you can see above the first argument can be (amongst other things)
either a constant - 'Lynn Smith (Extra)' or a column - name2
(2) the start of the result you want so, in this example, the first (or left)
character in the string in the column or constant is signified by the number 1.
(3) how many characters do you want to see in the result? In this example I have used PATINDEX to create a number (see below).
PATINDEX asks for two arguments: (1) %pattern% and (2) expression
(1) is the character or group of characters (shape or 'pattern') you are looking
to locate, the reason for the wildcard characters %% either side of the
pattern is because there may be characters either side of the pattern
(2) is (amongst other things) the constant or column that contains the pattern
from argument 1.
Whilst SUBSTRING returns character data (part of the string), PATINDEX returns a number: the position at which the pattern first occurs in the expression, counting from 1 at the left (or 0 if the pattern is not found).
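The position arithmetic can be followed step by step in a short Python sketch. The functions patindex and substring are hypothetical stand-ins that mimic the 1-based behaviour described above for a plain one-character pattern; they are not the SQL functions themselves.

```python
def patindex(pattern, expression):
    """1-based position of the first occurrence of pattern, or 0 when absent."""
    return expression.find(pattern) + 1

def substring(expression, start, length):
    """SUBSTRING-like slice with a 1-based start; assumes start >= 1 and length >= 0."""
    return expression[start - 1:start - 1 + length]

name2 = 'Lynn Smith (Extra)'
position = patindex('(', name2)            # the bracket is the 12th character
print(position)
print(substring(name2, 1, position - 2))   # drop the bracket and the space before it
```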
Currently, I'm trying to execute an FTS5 query via libsqlite, and need to restrict the query to a specific column. In FTS4, this was possible by doing:
SELECT foo, bar FROM tableName WHERE columnName MATCH ?
and then binding the search string to the statement. However, with FTS5, the LHS of the MATCH operator must be the FTS table name itself, and the column name must be a part of the query:
SELECT foo, bar FROM tableName WHERE tableName MATCH 'columnName:' || ?
This works when the bound string is a single phrase. However, consider the search text pizza is great. The query then becomes:
SELECT foo, bar FROM tableName WHERE tableName MATCH 'columnName:pizza is great';
Only pizza is restricted to columnName; the rest of the phrase is matched against all columns.
How can I work around this?
The documentation says:
A single phrase … may be restricted to matching text within a specified column of the FTS table by prefixing it with the column name followed by a colon character.
So the column name applies only to a single phrase.
If you have three phrases, you need to specify the column name three times:
tableName MATCH 'columnName:pizza columnName:is columnName:great'
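Building that string by hand gets tedious, so it can be generated before binding. This is a minimal sketch in Python, assuming the search text consists of plain words without FTS5 special characters or quotes; column_match is a hypothetical helper name.

```python
def column_match(column, text):
    """Prefix every whitespace-separated word with 'column:'."""
    return ' '.join(f'{column}:{word}' for word in text.split())

query = column_match('columnName', 'pizza is great')
print(query)
```

The resulting string can then be bound as the whole parameter, as in WHERE tableName MATCH ?.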
One field of my table is made up of many values separated by commas.
For example, one record of this field is:
598423,4803510,599121,98181856,1666529,106317962,4061964,7828860,598752,728067,599809,8799578,1666528,3253720,601990,601235
I want to spread out the values in every record of this field in Hive.
Which function or method can I use to do this?
Thanks.
I'm not entirely sure what you mean by "spread".
If you want an output table that has a value in every row like:
598423
4803510
599121
Then you could use explode(split(data,','))
Otherwise, if each input row has exactly 16 numbers and you want each of the numbers to reside in a different column, you have two options:
(1) Define the comma as a delimiter for the input table:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
(2) Split a single column into 16 columns using the split UDF:
SELECT split(data,',')[0] as col1, split(data,',')[1] as col2, ...
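The difference between the two output shapes can be seen in a small Python sketch. It only mirrors what explode(split(...)) and indexed split(...) produce and does not run in Hive; the sample uses the first three values from the question.

```python
data = '598423,4803510,599121'

# One value per output row, like explode(split(data, ',')).
rows = [(value,) for value in data.split(',')]

# One value per column of a single row, like split(data, ',')[0], split(data, ',')[1], ...
columns = tuple(data.split(','))

print(rows)
print(columns)
```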
When I insert a value into a varchar column, I get extra spaces. For example, if I have a table Student with name varchar(10) and I do:
INSERT INTO Student (id,name) VALUES (1,'student')
SELECT name FROM Student WHERE id=1
I get student[space][space][space].
How can I fix this without changing the type to text?
Most databases output results from a query in a fixed tabular format. To see where a string really begins and ends, you need to concatenate another character. Here are three ways, depending on the database:
select '"'+name+'"'
select '"'||name||'"'
select concat('"', name, '"')
You'll probably find that the spaces are an artifact of the query tool.
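The quoting trick can be tried with Python's sqlite3 module. This is a sketch only: SQLite never pads VARCHAR values, so it shows the technique rather than reproducing the behaviour of any particular database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Student (id INTEGER, name VARCHAR(10))')
conn.execute("INSERT INTO Student (id, name) VALUES (1, 'student')")

# Wrap the value in double quotes so any stored padding would become visible.
quoted = conn.execute(
    "SELECT '\"' || name || '\"' FROM Student WHERE id = 1"
).fetchone()[0]
print(quoted)
```

If the closing quote follows the last letter directly, the value itself has no trailing spaces.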