How to spread the values from a column in Hive?

One field of the table is made up of many values separated by commas;
for example, a record of this field is:
598423,4803510,599121,98181856,1666529,106317962,4061964,7828860,598752,728067,599809,8799578,1666528,3253720,601990,601235
I want to spread the values in every record of this field in Hive.
Which function or method can I use to achieve this?
Thanks.

I'm not entirely sure what you mean by "spread".
If you want an output table that has one value per row, like:
598423
4803510
599121
Then you could use explode(split(data,',')).
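For example, a minimal sketch, assuming the table is called t and the string column is called data (both hypothetical names):
SELECT num
FROM t
LATERAL VIEW explode(split(data, ',')) x AS num;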
Otherwise, if each input row has exactly 16 numbers and you want each of the numbers to reside in a different column, you have two options:
Define the comma as a delimiter for the input table (see the DDL sketch after this list):
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
Split a single column into 16 columns using the split UDF: SELECT split(data,',')[0] as col1, split(data,',')[1] as col2, ...
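A sketch of the first option, with a hypothetical table name and only the first few of the 16 columns spelled out:
CREATE TABLE numbers_wide (
  col1 BIGINT,
  col2 BIGINT,
  col3 BIGINT,
  col4 BIGINT
  -- ...continue through col16 for all 16 values
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;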

Related

Select if comma separated string contains a value

I have a table:
raw TABLE
=========
id  class_ids
----------------------------
1   1234,12334,12341,1228
2   12281,12341,12283
3   1234,34221,31233,43434,1123
How do I define a regex to select rows whose class_ids contains a specific id?
If we select rows with '1234' in class_ids, the result list should not contain rows with '12341' in class_ids.
The IDs in the class_ids column are separated by commas.
SELECT * FROM raw re WHERE re.class_ids LIKE (regex)
You shouldn't be storing comma-separated values in a single column.
However, this is better done using string_to_array() in Postgres instead of a regex:
SELECT *
FROM raw
WHERE '1234'= any(string_to_array(class_ids, ','));
If you really want to de-normalize your data, it's better to store those numbers in a proper integer array instead of a comma-separated list of strings.
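A minimal sketch of that alternative (the class_id_array column name is an assumption):
ALTER TABLE raw ADD COLUMN class_id_array integer[];

UPDATE raw
SET class_id_array = string_to_array(class_ids, ',')::integer[];

-- the containment check then works directly on integers:
SELECT *
FROM raw
WHERE 1234 = ANY (class_id_array);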
A simple way uses like:
select *
from raw re
where ',' || re.class_ids || ',' like '%,1234,%';
However, this is not the real issue. You should not be storing lists of ids in a string. The SQLish way of storing them would be a separate table with one row per id/class_id pair. This is called a junction table.
Even if you don't use a separate table, you should at least use Postgres's built-in mechanisms, such as an array. However, a separate table is much the preferred method, because you can explicitly declare foreign key relationships.
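A sketch of that junction table, assuming raw.id is the primary key (table and column names are assumptions):
CREATE TABLE raw_class (
    raw_id   integer NOT NULL REFERENCES raw (id),
    class_id integer NOT NULL,
    PRIMARY KEY (raw_id, class_id)
);

-- finding rows with a given class id becomes a plain join:
SELECT r.*
FROM raw r
JOIN raw_class rc ON rc.raw_id = r.id
WHERE rc.class_id = 1234;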
If you really want to do this with regular expressions, you can use the ~ operator:
SELECT * FROM raw re WHERE re.class_ids ~ '(^|,)1234(,|$)';
But I prefer a_horse_with_no_name's answer that uses arrays.

importing data with commas in numeric fields into redshift

I am importing data into Redshift using the SQL COPY statement. The data has comma thousands separators in the numeric fields, which the COPY statement rejects.
The COPY statement has a number of options to specify field separators, date and time formats, and NULL values. However, I do not see anything to specify number formatting.
Do I need to preprocess the data before loading, or is there a way to get Redshift to parse the numbers correctly?
Import the columns as a TEXT data type into a temporary table.
Then insert from the temporary table into your target table. Have the SELECT statement for the INSERT replace the commas with empty strings, and cast the values to the correct numeric type.
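A minimal sketch of that two-step load (table and column names are assumptions):
-- step 1: stage the raw field as text
CREATE TEMP TABLE staging (amount_raw VARCHAR(32));
-- COPY loads the file into staging here, with the numeric field arriving as text

-- step 2: strip the thousands separators and cast on the way in
INSERT INTO target_table (amount)
SELECT CAST(REPLACE(amount_raw, ',', '') AS DECIMAL(18,2))
FROM staging;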

How to split a column which contains combined text and numbers?

I have a column in a table which consists of data like name50, somename20, other40, some65.
I want to split the text part from the number part and add the number part to an empty column in another table, which already contains a column with the text parts. Each number has to go to the row with the corresponding name.
For example, the second table has a column called Textpart holding the same text parts as the first table's column (the one I want to split), with the names repeated several times in random order, and another column called Numberpart which is empty.
Now I have to fill Numberpart with the corresponding numbers from the first table.
Thank you.
You can use a combination of substring and patindex.
First extract the numeric part. To get the text part, replace the previously found numeric part with an empty string.
select substring(data, patindex('%[0-9]%', data), len(data)) as numeric_part,
       replace(data, substring(data, patindex('%[0-9]%', data), len(data)), '') as text_part
from tablename
To update the other table with the numeric part, use the text_part column to join.
Note that this will only work well if the numbers are towards the end.
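For example, the join-based update might look like this (source_table and target_table are hypothetical names, using the same T-SQL functions as above):
UPDATE t
SET t.Numberpart = s.numeric_part
FROM target_table t
JOIN (
    select substring(data, patindex('%[0-9]%', data), len(data)) as numeric_part,
           replace(data, substring(data, patindex('%[0-9]%', data), len(data)), '') as text_part
    from source_table
) s ON t.Textpart = s.text_part;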

Extract alphanumeric value from varchar column

I have a table which contains a column of alphanumeric values stored as strings, with values such as F4737, 00Y778, PP0098, XXYYYZ etc.
I want to extract the values starting with F followed by numeric values.
The alphanumeric column is the unique column holding unique values, but the rest of the columns in my table contain duplicates.
Furthermore, once these values are extracted I would like to pick the max value among the duplicate rows, e.g.:
Suppose I have F4737 and F4700 as unique alphanumeric values; then F4737 must be extracted.
I have written a query like this, but the numeric values are not getting extracted:
select max(Alplanumeric)
from Customers
where Alplanumeric like '%[F0-9]%'
or
select max(Alplanumeric)
from Customers
where Alplanumeric like '%[0-9]%'
and Alplanumeric like 'F%'
When I run the above queries I only get the F series if I remove the numeric part. How do I extract both the F-starting series and the numeric values included in those rows?
Going out on a limb, you might be looking for a query like this:
SELECT *, substring(alphanumeric, '^F(\d+)')::int AS nr
FROM customers
WHERE alphanumeric ~ '^F\d+'
ORDER BY nr DESC NULLS LAST, alphanumeric
LIMIT 1;
The WHERE condition is a regular expression match; the expression is anchored to the start, so it can use an index. Ideally:
CREATE INDEX customers_alphanumeric_pattern_ops_idx ON customers
(alphanumeric text_pattern_ops);
This returns the one row with the highest (extracted) numeric value in alphanumeric among rows starting with 'F' followed by one or more digits.
About the index:
PostgreSQL LIKE query performance variations
About pattern matching:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Ideally, you should store the leading text and the following numeric value in separate columns to make this more efficient. You don't necessarily need more tables, as has been suggested.
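A minimal sketch of that split, assuming every value is leading text followed by digits (the prefix and nr column names are hypothetical):
ALTER TABLE customers
  ADD COLUMN prefix text,
  ADD COLUMN nr integer;

UPDATE customers
SET    prefix = substring(alphanumeric, '^\D+'),
       nr     = substring(alphanumeric, '\d+$')::int
WHERE  alphanumeric ~ '^\D+\d+$';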

Select number of comma delimited list items

I have a column that has comma-separated values and I need to select all rows that have 13 commas. They separate numbers, so I don't need to worry about any strings that contain commas. How would I do this?
An alternative to like (I do not like like, and the like pattern shown below will also match strings that contain 14 commas or more):
select * from table
where length(replace(your_column, ',', ''))=length(your_column)-13;
To make better use of an index, you should normalize your table.
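A sketch of what that normalization enables, assuming a hypothetical one-row-per-item table list_items keyed by list_id: finding lists with exactly 14 items (13 commas) becomes a plain aggregate.
SELECT list_id
FROM list_items
GROUP BY list_id
HAVING COUNT(*) = 14;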
If you're using PostgreSQL, you could also use regular expressions.
However, a better question might be why you have a single column with comma-separated values instead of multiple columns.
If you count a string with 14 commas as having 13 commas, then this will work:
SELECT * FROM table WHERE column LIKE '%,%,%,%,%,%,%,%,%,%,%,%,%,%'
% matches any string (including zero length).
In PostgreSQL you can do:
select col from table where length(regexp_replace(col, '[^,]', '', 'g')) = 13;