I need to extract the 5th value from a string array in Hive:
arr = ("abc", "123-4567", "10", "ax", "cdpp asd", "00", "q", "na", "avail", "n", "n", "na")
How can I extract "cdpp asd", i.e. the 5th value?
We can use SUBSTR and INSTR, but is there any other way to achieve this?
If your array is stored in a string column, you can remove the brackets and double quotes using regexp_replace and then split the resulting string into an array using split():
select split(regexp_replace('("abc", "123-4567", "10", "ax", "cdpp asd", "00", "q", "na", "avail", "n", "n", "na")','^\\(|\\)$|"',''),', *')[4];
OK
cdpp asd
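For readers who want to check the pattern outside Hive, the same two steps can be sketched in Python. This is only an illustration of the regex logic, not Hive code: the variable names are made up, and Python's re module takes single backslashes where the Hive string literal needs double ones.

```python
import re

# The array as it would appear in the string column.
raw = '("abc", "123-4567", "10", "ax", "cdpp asd", "00", "q", "na", "avail", "n", "n", "na")'

# Step 1: drop the leading "(", the trailing ")" and every double quote.
cleaned = re.sub(r'^\(|\)$|"', '', raw)

# Step 2: split on a comma followed by optional spaces.
values = re.split(r', *', cleaned)

# Index 4 is the 5th element, just like [4] in the Hive query.
fifth = values[4]
print(fifth)
```

As in Hive, the index is zero-based, so the 5th value sits at position 4.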
arr = ("abc", "123-4567", "10", "ax", "cdpp asd", "00", "q", "na", "avail", "n", "n", "na")
Select arr[4] from tablename;
Output:
cdpp asd
1. You could write a UDF that casts this string to an array arr; then you can use arr[4] to access the 5th value.
2. Or you can get the 5th value as follows:
select tf.* from (
select regexp_replace('("abc", "123-4567", "10", "ax", "cdpp asd", "00", "q", "na", "avail", "n", "n", "na")','\\(|\\)|"','') as str
) t lateral view posexplode(split(str,', ')) tf as pos,val
where tf.pos = 4;
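Note: this approach requires that the values themselves contain no brackets or double quotes, because the regexp_replace pattern strips every one of them, not only the outer pair.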
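The effect of stripping brackets everywhere, as opposed to only at the start and end of the string, can be seen in a small Python sketch. The sample string is hypothetical and chosen so that one value contains brackets; Python's re module stands in for Hive's regexp_replace here.

```python
import re

# Hypothetical input in which the first value itself contains brackets.
raw = '("a(b)", "c")'

# Unanchored pattern: removes every bracket and quote, including the
# brackets inside the value.
strip_all = re.sub(r'\(|\)|"', '', raw)

# Anchored pattern: removes only the leading "(" and the trailing ")".
strip_outer = re.sub(r'^\(|\)$|"', '', raw)

print(strip_all)
print(strip_outer)
```

With the unanchored pattern the first value loses its inner brackets; anchoring the bracket alternatives with ^ and $ keeps them.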
Is the following a full list of all value types as they're passed to JSON in BigQuery? I got this by trial and error but haven't been able to find it in the documentation:
select
NULL as NullValue,
FALSE as BoolValue,
DATE '2014-01-01' as DateValue,
INTERVAL 1 year as IntervalValue,
DATETIME '2014-01-01 01:02:03' as DatetimeValue,
TIMESTAMP '2014-01-01 01:02:03' as TimestampValue,
"Hello" as StringValue,
B"abc" as BytesValue,
123 as IntegerValue,
NUMERIC '3.14' as NumericValue,
3.14 as FloatValue,
TIME '12:30:00.45' as TimeValue,
[1,2,3] as ArrayValue,
STRUCT('Mark' as first, 'Thomas' as last) as StructValue,
[STRUCT(1 as x, 2 as y), STRUCT(5 as x, 6 as y)] as ArrayStructValue,
STRUCT(1 as x, [1,2,3] as y, ('a','b','c') as z) as StructNestedValue
{
"NullValue": null,
"BoolValue": "false", // why not just false without quotes?
"DateValue": "2014-01-01",
"IntervalValue": "1-0 0 0:0:0",
"DatetimeValue": "2014-01-01T01:02:03",
"TimestampValue": "2014-01-01T01:02:03Z",
"StringValue": "Hello",
"BytesValue": "YWJj",
"IntegerValue": "123",
"NumericValue": "3.14",
"FloatValue": "3.14",
"TimeValue": "12:30:00.450000",
"ArrayValue": ["1", "2", "3"],
"StructValue": {
"first": "Mark",
"last": "Thomas"
},
"ArrayStructValue": [
{"x": "1", "y": "2"},
{"x": "5", "y": "6"}
],
"StructNestedValue": {
"x": "1",
"y": ["1", "2", "3"],
"z": {"a": "a", "b": "b", "c": "c"}
}
}
Honestly, it seems to me that other than the null value and the array [] or struct {} container, everything is string-enclosed, which seems a bit odd.
According to this document, JSON is built on two structures:
A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
The result of the SELECT query is in JSON format, where [] denotes an array, {} denotes an object, and double quotes (" ") denote a string value, just as in the query itself.
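If the string-enclosed scalars get in the way on the client side, they can be converted back after parsing. The following is a minimal Python sketch, assuming the consumer already knows each column's type; the converters mapping is hypothetical and is not something BigQuery provides.

```python
import json

# A fragment of the result shown in the question.
raw = '''{"NullValue": null, "BoolValue": "false", "IntegerValue": "123",
          "FloatValue": "3.14", "ArrayValue": ["1", "2", "3"],
          "StructValue": {"first": "Mark", "last": "Thomas"}}'''

# json.loads maps {} to dict, [] to list, null to None and "..." to str.
row = json.loads(raw)

# Hypothetical per-column converters (an assumption for this sketch).
converters = {
    "BoolValue": lambda s: s == "true",
    "IntegerValue": int,
    "FloatValue": float,
    "ArrayValue": lambda xs: [int(x) for x in xs],
}

# Columns without a converter are passed through unchanged.
typed = {k: converters.get(k, lambda v: v)(v) for k, v in row.items()}
print(typed)
```

The containers and null survive parsing as native types; only the quoted scalars need an explicit conversion.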
Please help me with the following conversion. I have a pandas DataFrame in the following format:
id           location
{ "0": "5",  "0": "Charlotte, North Carolina",
"1": "5",    "1": "N/A",
"2": "5",    "2": "Portland, Oregon",
"3": "5",    "3": "Jonesborough, Tennessee",
"4": "5",    "4": "Rockville, Indiana",
"5": "5",}   "5": "Dallas, Texas",
and would like to convert this into the following format:
A header  Another header
"5"       "Charlotte, North Carolina"
"5"       "N/A"
"5"       "Portland, Oregon"
"5"       "Jonesborough, Tennessee"
"5"       "Rockville, Indiana"
"5"       "Dallas, Texas"
Please help
You can try this:
import re
import pandas as pd

df = pd.DataFrame([['{ "0": "5",', '"0": "Charlotte, North Carolina",'], ['"1": "5",','"1": "N/A",']], columns=['id', 'location'])
# Use a regex to find all integers and keep the second one (the value, not the key)
df['id'] = df['id'].apply(lambda x: re.findall(r'\d+', x)[1])
# Split on the first ':' only, keep the value part, and drop surrounding whitespace and the trailing comma
df['location'] = df['location'].apply(lambda x: x.split(':', 1)[1].strip().rstrip(','))
print(df)
Output:
id location
0 5 "Charlotte, North Carolina"
1 5 "N/A"
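A single helper can also cover both columns, including the cells that carry a stray "{" or "}". This is a sketch under the assumption that every cell has the shape key, colon, quoted value; the name cell_value is made up for illustration. It uses only the standard library, so it can be passed to Series.apply for either column.

```python
import re

# Capture the quoted value that follows the first "key": part of a cell.
_CELL = re.compile(r':\s*(".*")')

def cell_value(cell: str) -> str:
    """Return the quoted value of a cell, or the cell unchanged if it does not match."""
    match = _CELL.search(cell)
    return match.group(1) if match else cell

# Cells taken from the question, including the ones with stray braces.
samples = ['{ "0": "5",', '"0": "Charlotte, North Carolina",', '"5": "5",}', '"1": "N/A",']
cleaned = [cell_value(s) for s in samples]
print(cleaned)
```

With a DataFrame like the one in the question, this could be applied as df['id'].apply(cell_value) and df['location'].apply(cell_value).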
I have a list containing the keys and another list containing the values (obtained by splitting a log line). How can I combine the two to make a property bag in Kusto?
let headers = pack_array("A", "B", "C");
datatable(RawData:string)
[
"1,2,3",
"4,5,6",
]
| extend fields = split(RawData, ",")
| extend dict = ???
Expected:
dict
-----
{"A": 1, "B": 2, "C": 3}
{"A": 4, "B": 5, "C": 6}
Here's one option that uses a combination of:
mv-apply: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/mv-applyoperator
pack(): https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/packfunction
make_bag(): https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/make-bag-aggfunction
let keys = pack_array("A", "B", "C");
datatable(RawData:string)
[
"1,2,3",
"4,5,6",
]
| project values = split(RawData, ",")
| mv-apply with_itemindex = i key = keys to typeof(string) on (
summarize dict = make_bag(pack(key, values[i]))
)
values             dict
[ "1", "2", "3"]   { "A": "1", "B": "2", "C": "3"}
[ "4", "5", "6"]   { "A": "4", "B": "5", "C": "6"}
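The pairing that mv-apply, pack() and make_bag() perform here, matching each key with the value at the same index and collecting the pairs into one bag per row, can be sketched in plain Python. This only illustrates the logic and is not Kusto code; the variable names are illustrative.

```python
# Keys and raw lines taken from the example above.
keys = ["A", "B", "C"]
raw_rows = ["1,2,3", "4,5,6"]

bags = []
for raw in raw_rows:
    values = raw.split(",")               # like split(RawData, ",")
    bags.append(dict(zip(keys, values)))  # pair key i with values[i], one bag per row

print(bags)
```

As in the Kusto output, the values stay strings, because splitting a string yields strings.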
I have a JSON string with n objects in a DB, like [{"key1": "A", "key2": 22}, {"key1": "B", "key2": 32}, {"key1": "C", "key2": 42}, ....]
I need to join all n objects into a single string of the format A22 B32 C42 ...
How can I achieve this using a SQL function?
Version: 2016
One way to do it in the 2016 version, based on Jeroen Mostert's suggestion:
SELECT
CAST(t.str as varchar) + ' ' AS 'data()'
FROM
( SELECT CONCAT(key1, key2) as [str]
FROM OPENJSON('[{"key1": "A", "key2": 22}, {"key1": "B", "key2": 32}, {"key1": "C", "key2": 42}]')
WITH (key1 NVARCHAR(MAX), key2 INT)
) t
FOR XML PATH('')
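To make the intended result concrete, the same transformation can be sketched in Python: parse the JSON array, concatenate key1 and key2 for each object, and join the pieces with spaces. This is only a reference for the expected output, not a replacement for the SQL; the variable names are illustrative.

```python
import json

# The JSON array from the question, without the elided tail.
raw = '[{"key1": "A", "key2": 22}, {"key1": "B", "key2": 32}, {"key1": "C", "key2": 42}]'

# Concatenate key1 and key2 per object, then join the pieces with a space.
joined = " ".join(f"{obj['key1']}{obj['key2']}" for obj in json.loads(raw))
print(joined)
```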
I have a gigantic script that I would like to generate iteratively (with a while or for loop), so that it becomes much shorter and easier to read. It seems like this should be doable in SQL, but so far I have not succeeded. What I do now to make it work is write a lot of SELECT statements and UNION them together into one table.
I want to iterate through the years: starting from 1995, while the year is lower than 2017, execute the query with the year as a variable.
So what I need is an iterative function that fills in each year in the following lines of code and combines all results in one table. I will keep trying myself and update the code if I make progress.
SELECT
regio, 1995 as year, sum("0") as "0", sum("1") as "1", sum("2") as "2", sum("3") as "3", sum("4") as "4", sum("5") as "5", sum("6") as "6", sum("7") as "7", sum("8") as "8", sum("9") as "9", sum("10") as "10"
FROM
source
where
year = 1995 OR "year-1" = 1995 OR "year-2" = 1995 OR "year-3" = 1995 OR "year-4" = 1995
group by
regio
UNION
SELECT
regio, 1996 as year, sum("0") as "0", sum("1") as "1", sum("2") as "2", sum("3") as "3", sum("4") as "4", sum("5") as "5", sum("6") as "6", sum("7") as "7", sum("8") as "8", sum("9") as "9", sum("10") as "10"
FROM
source
where
year = 1996 OR "year-1" = 1996 OR "year-2" = 1996 OR "year-3" = 1996 OR "year-4" = 1996
group by
regio
You would seem to want:
SELECT regio, g.yyyy as year, sum("0") as "0", sum("1") as "1",
sum("2") as "2", sum("3") as "3", sum("4") as "4",
sum("5") as "5", sum("6") as "6", sum("7") as "7",
sum("8") as "8", sum("9") as "9", sum("10") as "10"
FROM source CROSS JOIN
generate_series(1995, 2017) g(yyyy)
WHERE g.yyyy IN (year, "year-1", "year-2", "year-3", "year-4")
GROUP BY regio, g.yyyy;
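The logic of the cross join can be followed step by step in a small Python sketch. The sample rows are hypothetical, and only two year columns and two value columns are included to keep it short; the real table has more of both.

```python
from collections import defaultdict

# Hypothetical sample rows (assumed data, reduced set of columns).
rows = [
    {"regio": "north", "year": 1996, "year-1": 1995, "0": 1, "1": 10},
    {"regio": "north", "year": 1995, "year-1": 1994, "0": 2, "1": 20},
    {"regio": "south", "year": 1996, "year-1": 1995, "0": 4, "1": 40},
]
YEAR_COLS = ("year", "year-1")
VALUE_COLS = ("0", "1")

# One accumulator per (regio, yyyy) group, like GROUP BY regio, g.yyyy.
totals = defaultdict(lambda: dict.fromkeys(VALUE_COLS, 0))

for yyyy in range(1995, 2018):      # generate_series(1995, 2017) includes 2017
    for row in rows:                # CROSS JOIN: every row meets every year
        if yyyy in (row[c] for c in YEAR_COLS):   # WHERE g.yyyy IN (...)
            for col in VALUE_COLS:
                totals[(row["regio"], yyyy)][col] += row[col]

for key in sorted(totals):
    print(key, totals[key])
```

Each source row contributes to every generated year that appears in one of its year columns, which is what the repeated UNION branches did one year at a time.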