How to query differently-structured files? - azure-data-lake

Is it possible to execute a query against files that have different schemas?
I have 2 sets of files in the same directory. The second type has an extra field.
Type 1
id, first, last
1, liza, smith
Type 2
id, first, last, state
4, alex, gordon, CT
Desired Result
1, liza
4, alex
How do we query files with different schemas but still produce the same output fields?
Here's what I have:
@interestingRows =
    EXTRACT id string,
            first string,
            last string
    FROM "/one 1300/{files}.csv"
    USING Extractors.Csv();
@result =
    SELECT id, first
    FROM @interestingRows;
OUTPUT @result
TO @uriPrefix + "/one 1300/output/output.csv"
USING Outputters.Csv();

The built-in CSV extractor will not solve your problem.
You will need a custom extractor for this. I recommend the Flexible Schema Extractor; see:
https://github.com/Azure/usql/tree/master/Examples/FlexibleSchemaExtractor
https://blogs.msdn.microsoft.com/mrys/2016/08/15/how-to-deal-with-files-containing-rows-with-different-column-counts-in-u-sql-introducing-a-flexible-schema-extractor/
The other solution is to extract the data with the different schemas separately.
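A minimal sketch of that separate-extraction approach, assuming the two file types can be told apart by a filename pattern (the type1/type2 patterns below are assumptions, not from the question):
@type1 =
    EXTRACT id string,
            first string,
            last string
    FROM "/one 1300/type1_{*}.csv"
    USING Extractors.Csv();
@type2 =
    EXTRACT id string,
            first string,
            last string,
            state string
    FROM "/one 1300/type2_{*}.csv"
    USING Extractors.Csv();
// Both branches project the same two columns, so they can be combined.
@interestingRows =
    SELECT id, first FROM @type1
    UNION ALL
    SELECT id, first FROM @type2;
OUTPUT @interestingRows
TO "/one 1300/output/output.csv"
USING Outputters.Csv();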

How about importing each row as a single column, using a delimiter you know does not exist in the data, and then splitting it afterwards using the Split method of the string class? Something like this:
@working =
    EXTRACT wholeRow string
    FROM "/one 1300/{*}.csv"
    USING Extractors.Text(delimiter : '|'); // '|' never occurs in the data, so each line arrives as one string
@split =
    SELECT wholeRow.Split(',')[0] AS id,
           wholeRow.Split(',')[1] AS first,
           wholeRow.Split(',')[2] AS last
    FROM @working;
OUTPUT @split
TO "/output/output.csv"
USING Outputters.Csv(quoting : false);

Since you said that these two types are actually in the same file, you can just extract it with all the columns and set quoting to false:
// Extract the data
@extractedData =
    EXTRACT id int,
            first string,
            last string,
            state string
    FROM "data.csv"
    USING Extractors.Csv(skipFirstNRows : 1, quoting : false);
Then you just select the fields you need and output them:
// Select the fields
@finalData = SELECT id, first FROM @extractedData;
// Output the data
OUTPUT @finalData
TO "/Desired Result.csv"
USING Outputters.Csv(quoting : false);
The desired result matches the question's expected output (1, liza and 4, alex).

Related

Hive SQL: extract one-to-many values from key-value pairs

I have a column that looks like:
[{"key_1":true,"key_2":true,"key_3":false},{"key_1":false,"key_2":false,"key_3":false},...]
There can be one to many items, each described by parameters in {}, in the column.
I would like to extract only the values of the parameters named key_1. Is there a function for that? So far I have tried the JSON-related functions (json_tuple, get_json_object), but each time I received null.
Consider the JSON path below.
WITH sample_data AS (
    SELECT '[{"key_1":true,"key_2":true,"key_3":false},{"key_1":false,"key_2":false,"key_3":false}]' AS json
)
SELECT get_json_object(json, '$[*].key_1') AS key1_values
FROM sample_data;
Query results: [true,false] for the sample row, i.e. the key_1 value from each object.
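If you want one row per array element instead of a single array string, here is a sketch of a common Hive workaround (splitting on '},{' assumes the values themselves contain no braces):
WITH sample_data AS (
    SELECT '[{"key_1":true,"key_2":true,"key_3":false},{"key_1":false,"key_2":false,"key_3":false}]' AS json
)
SELECT get_json_object(elem, '$.key_1') AS key_1
FROM sample_data
LATERAL VIEW explode(
    split(
        regexp_replace(regexp_replace(json, '^\\[|\\]$', ''), '\\},\\{', '}||{'),
        '\\|\\|'
    )
) t AS elem;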

SQL group by middle part of string

I have a string column that usually looks approximately like this:
https://mapy.cz/zakladni?x=16.3360208&y=49.6718038&z=8&source=firm&id=13123554
https://mapy.cz/turisticka?x=15.9380354&y=50.1990211&z=11&source=base&id=2197
https://mapy.cz/turisticka?x=12.8611357&y=49.8051338&z=16&source=base&id=1703157
I would like to group the data by the source, which is part of the string (the letters after "source=", in the case above: firm), and then simply count them. Is there a way to achieve this directly in SQL? I am using Hadoop.
The data is a set of strings like the ones above. My expected result is a summary table with two columns: 1) each type of source (there are about 20 possible values and their lengths differ, so I cannot use a simple substring; ideally I am looking for a solution that says: for the grouping, use the characters that come after "source="), and 2) the count of their occurrences across all the strings.
There is just one source type in each string.
You can use regexp_extract():
select substr(regexp_extract(url, 'source[^&]+', 0), 8)
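A minimal sketch of the full grouping query built on that expression, assuming a table urls with a column url (both names are assumptions):
select substr(regexp_extract(url, 'source[^&]+', 0), 8) as source,
       count(*) as cnt
from urls
group by substr(regexp_extract(url, 'source[^&]+', 0), 8);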
You can use charindex in MSSQL to get the position of the string and extract the value:
;with cte as (
    SELECT SUBSTRING('https://mapy.cz/zakladni?x=16.3360208&y=49.6718038&z=8&source=firm&id=13123554',
                     charindex('&source=', 'https://mapy.cz/zakladni?x=16.3360208&y=49.6718038&z=8&source=firm&id=13123554') + 8,
                     4) AS ExtractString
)
select ExtractString, count(ExtractString) as count from cte group by ExtractString;
HiveQL has the equivalent function LOCATE for charindex.
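A hedged HiveQL translation of the same idea, keeping the fixed length of 4 from the MSSQL version (the urls table and url column are assumptions):
SELECT source, COUNT(*) AS cnt
FROM (
    SELECT SUBSTR(url, LOCATE('&source=', url) + 8, 4) AS source
    FROM urls
) t
GROUP BY source;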

Convert strings into table columns in BigQuery

I would like to convert a table whose rows contain a long string of comma-separated key-value pairs (for example test1-val1,test2-val2,test3-val3) into something with the keys and values broken out.
The long string can be dynamic, so it's important to me that it's not a fixed solution for these values specifically.
Please help, I'm using BigQuery.
You could start by using SPLIT(value[, delimiter]) to convert your long string into an array of key-value pairs.
This will be sensitive to commas appearing inside your values.
SPLIT(session_experiments, ',')
Then you could either flatten that array or access each element, and then use some regexes to separate the key from the value.
If you share more context on your restrictions and intended result I could try and put together a query for you that does exactly what you want.
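A minimal sketch of that approach in standard SQL, assuming a table t with the column session_experiments (both names are assumptions):
SELECT
    REGEXP_EXTRACT(kv, r'^([^-]+)-') AS experiment, -- text before the first '-'
    REGEXP_EXTRACT(kv, r'-(.*)$') AS value          -- text after the first '-'
FROM t, UNNEST(SPLIT(session_experiments, ',')) AS kv;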
What you want is not directly possible; however, there is a better practice for BigQuery.
You can use arrays of structs to store that information in a table.
Let's say you have a table like the rawdata below; this sample query shows how to use them.
with rawdata AS
(
    SELECT 1 as id, 'test1-val1,test2-val2,test3-val3' as experiments union all
    SELECT 1 as id, 'test1-val1,test3-val3,test5-val5' as experiments
)
select
    id,
    (select array_agg(struct(split(param, '-')[offset(0)] as experiment,
                             split(param, '-')[offset(1)] as value))
     from unnest(split(experiments)) as param) as experiments
from rawdata
The output will be one row per input row, with experiments turned into an array of (experiment, value) structs.
After having that output, it's more convenient to manipulate the data.
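For example, reading a single experiment's value back out of the array, assuming the result above has been saved as a table named converted (the table name is an assumption):
SELECT
    id,
    (SELECT value
     FROM UNNEST(experiments)
     WHERE experiment = 'test3') AS test3_value -- NULL when the row has no test3 entry
FROM converted;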

SQL select specific (4th) part of column BLOB data, separated by specific pattern

There is a BLOB column that contains data like:
{{Property1 {property1_string}} {Property2 {property2_string}} {Property3 {property3_string}} {Property4 {property4_string}} {Property5 {property5_string}}}
I select the above column to display the BLOB data, as follows:
utl_raw.cast_to_varchar2(dbms_lob.substr(blobColumn))
I need to display only the data of the 4th property of the BLOB column, i.e. the following:
{Property4 {property4_string}}
So, I need help creating the necessary select for this purpose.
Thank you.
This will work (note that a BLOB cannot be CAST to VARCHAR2 directly, so the text conversion from the question is reused):
select substr(s,
              instr(s, '{', 1, 8),
              instr(s, '}', 1, 8) - instr(s, '{', 1, 8) + 1)
from (select utl_raw.cast_to_varchar2(dbms_lob.substr(blobfieldname)) as s
      from tablename);
You may use REGEXP_SUBSTR.
SELECT REGEXP_SUBSTR(s, '[^{} ]+', 1, 2 * :n) FROM t;
Where n is the nth property string you want to extract from your data.
n = 1 gives property1_string, n = 2 gives property2_string, and so on.
Note that s should be the output of utl_raw.cast_to_varchar2
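For example, with the sample value from the question inlined and n = 4:
SELECT REGEXP_SUBSTR(
           '{{Property1 {property1_string}} {Property2 {property2_string}} {Property3 {property3_string}} {Property4 {property4_string}} {Property5 {property5_string}}}',
           '[^{} ]+', 1, 2 * 4) AS p4
FROM dual;
-- returns property4_string, the 8th run of characters that are not braces or spaces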

Searching a column containing CSV data in a MySQL table for existence of input values

I have a table say, ITEM, in MySQL that stores data as follows:
ID FEATURES
--------------------
1 AB,CD,EF,XY
2 PQ,AC,A3,B3
3 AB,CDE
4 AB1,BC3
--------------------
As an input, I will get a CSV string, something like "AB,PQ". I want to get the records that contain AB or PQ. I realized that we have to write a MySQL function to achieve this. So, if we had this magical function MATCH_ANY defined in MySQL that does this, I would then simply execute an SQL statement as follows:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0
The above query would return the records 1, 2 and 3.
But I'm running into all sorts of problems while implementing this function as I realized that MySQL doesn't support arrays and there's no simple way to split strings based on a delimiter.
Remodeling the table is the last option for me as it involves a lot of issues.
I might also want to execute queries containing multiple MATCH_ANY functions such as:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0 and MATCH_ANY(FEATURES, "CDE") = 0
In the above case, we would get an intersection of records (1, 2, 3) and (3) which would be just 3.
Any help is deeply appreciated.
Thanks
First of all, the database should of course not contain comma-separated values, but you are hopefully aware of this already. If the table were normalised, you could easily get the items using a query like:
select distinct i.ItemId
from Item i
inner join ItemFeature f on f.ItemId = i.ItemId
where f.Feature in ('AB', 'PQ')
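A minimal sketch of the normalised schema that query assumes (the column types and sizes are assumptions):
CREATE TABLE ItemFeature (
    ItemId  INT         NOT NULL, -- references Item.ID
    Feature VARCHAR(16) NOT NULL, -- one feature code per row
    PRIMARY KEY (ItemId, Feature)
);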
You can match the strings in the comma separated values, but it's not very efficient:
select Id
from Item
where
instr(concat(',', Features, ','), ',AB,') <> 0 or
instr(concat(',', Features, ','), ',PQ,') <> 0
For all you REGEXP lovers out there, I thought I would add this as a solution:
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]';
and for case sensitivity:
SELECT * FROM ITEM WHERE FEATURES REGEXP BINARY '[[:<:]](AB|PQ)[[:>:]]';
For the second query:
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]' AND FEATURES REGEXP '[[:<:]]CDE[[:>:]]';
Cheers!
select *
from ITEM
where CONCAT(',', FEATURES, ',') LIKE '%,AB,%'
   or CONCAT(',', FEATURES, ',') LIKE '%,PQ,%'
or create a custom function to do your MATCH_ANY
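A minimal sketch of such a MATCH_ANY function, built on MySQL's FIND_IN_SET and SUBSTRING_INDEX (the 255-character sizes are assumptions; it returns 0 on a match so that the question's MATCH_ANY(...) = 0 convention works):
DELIMITER //
CREATE FUNCTION MATCH_ANY(features VARCHAR(255), candidates VARCHAR(255))
RETURNS INT DETERMINISTIC
BEGIN
    DECLARE candidate VARCHAR(255);
    WHILE LENGTH(candidates) > 0 DO
        -- take the first candidate from the comma-separated list
        SET candidate = SUBSTRING_INDEX(candidates, ',', 1);
        IF FIND_IN_SET(candidate, features) > 0 THEN
            RETURN 0; -- found: 0 means a match, per the question's convention
        END IF;
        -- drop the first candidate (and its trailing comma) from the list
        IF LOCATE(',', candidates) > 0 THEN
            SET candidates = SUBSTRING(candidates, LOCATE(',', candidates) + 1);
        ELSE
            SET candidates = '';
        END IF;
    END WHILE;
    RETURN 1; -- no candidate matched
END//
DELIMITER ;
With that in place, select * from ITEM where MATCH_ANY(FEATURES, 'AB,PQ') = 0 returns records 1, 2 and 3 as described.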
Alternatively, consider using RLIKE:
select *
from ITEM
where CONCAT(',', FEATURES, ',') RLIKE ',AB,|,PQ,';
Just a thought:
Does it have to be done in SQL? This is the kind of thing you might normally expect to write in PHP or Python or whatever language you're using to interface with the database.
This approach means you can build your query string using whatever complex logic you need and then just submit a vanilla SQL query, rather than trying to build a procedure in SQL.
Ben