I'm querying a table in a database with SQL like this:
Select col1, col2 from table_name
For reference, col2 will be an integer value and col1 will be the name of an element, e.g.
FOO, 3
BAR, 10
I want a data structure where the values can be addressed by the value of col1, so that $vars->{valueofcol1} returns the value of col2.
So
$vars->{FOO}
would return 3
Basically, I don't know how to get the SQL results into a data structure I can address like this.
You need to fetch each row and build that hashref yourself.
my $vars; # declare the variable for the hash ref outside the loop
my $sth = $dbh->prepare(q{select col1, col2 from table_name});
$sth->execute;
while ( my $res = $sth->fetchrow_hashref ) { # fetch row by row
    $vars->{ $res->{col1} } = $res->{col2}; # build up the data structure
}
print $vars->{FOO};
__END__
3
You may want to read up on DBI, especially how to fetch stuff.
Related
My ELT tool imports my data into BigQuery and automatically generates/extends the schema for dynamic nested keys (in the schema below, under properties).
It looks like this
How can I get the list of nested keys of a repeated record, so that, for example, I can group by a property when those items have said property non-null?
I have tried
select column_name
from my_schema.INFORMATION_SCHEMA.COLUMNS
where table_name = 'my_table'
But it only lists first-level keys.
From the picture above, I want, as a first step, a SQL query that returns
message
user_id
seeker
liker_id
rateable_id
rateable_type
from_organization
likeable_type
company
existing_attempt
...
My real goal, though, is to group/count my data based on a non-null value of a second-level nested property, properties.filters.[filter_type].
The schema may evolve when our application adds more filters, so this needs to be generated dynamically; I can't just hard-code the list of nested keys.
Note: this is very similar to the question How to extract all the keys in a JSON object with BigQuery, but in my case my data is already in a schema and is not a JSON object.
EDIT:
Suppose I have a list of such records with nested properties. How do I write a SQL query that adds a field "enabled_filters" which aggregates, for each item, the list of properties for which said property is not null?
Example input (properties.x are dynamic and not known by the programmer)
search_id | properties.filters.school | properties.filters.type
1         | MIT                       | master
2         | Princetown                | null
3         | null                      | master
Example output
search_id | enabled_filters
1         | ["school", "type"]
2         | ["school"]
3         | ["type"]
Have you looked at COLUMN_FIELD_PATHS? It should give you the paths for all columns.
select field_path from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS where table_name = '<table>'
https://cloud.google.com/bigquery/docs/information-schema-column-field-paths
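If you only want the second-level keys under properties.filters, you can filter on the field_path prefix. A minimal sketch, assuming the table is called my_table as in the question:

-- keep only paths below properties.filters and strip the prefix,
-- leaving just the filter names (school, type, ...)
select replace(field_path, 'properties.filters.', '') as filter_key
from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
where table_name = 'my_table'
  and field_path like 'properties.filters.%'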
The field properties is not nested in arrays, only in structures, so a JavaScript UDF that parses this field should work fast enough.
CREATE TEMP FUNCTION jsonObjectKeys(input STRING, shownull BOOL, fullname BOOL)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
// Recursively collect key names; shownull and fullname are the UDF arguments.
function test(input, old) {
  var out = [];
  for (let x in input) {
    let te = input[x];
    out = out.concat(
      te == null ? (shownull ? [x + '==null'] : [])      // null leaf
      : typeof te == 'object' ? test(te, old + x + '.')  // recurse into nested structs
      : [fullname ? old + x : x]                         // scalar leaf
    );
  }
  return out;
}
return test(JSON.parse(input), "");
""";
with tbl as (
  select struct(1 as alpha, struct(2 as x, 3 as y, [1,2,3] as z) as B) A
  from unnest(generate_array(1, 10*1))
  union all
  select struct(null, struct(null, 1, [999]))
)
select *,
  TO_JSON_STRING(A) as string_output,
  jsonObjectKeys(TO_JSON_STRING(A), true, false) as output1,
  jsonObjectKeys(TO_JSON_STRING(A), false, true) as output2,
  concat('["', array_to_string(jsonObjectKeys(TO_JSON_STRING(A), false, true), '","'), '"]') as output_string,
  jsonObjectKeys(TO_JSON_STRING(A.B), false, true) as output3
from tbl
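For the enabled_filters output asked for in the edit, a UDF-free variant of the same idea may be enough. This is a hedged sketch, assuming properties.filters is a flat STRUCT whose values serialize as JSON strings, numbers, or null; it relies on the fact that in TO_JSON_STRING output only the literal null starts with the letter n (string values are quoted):

select
  search_id,
  -- capture each key whose serialized value does not start with 'n', i.e. is not null
  regexp_extract_all(to_json_string(properties.filters), r'"([^"]+)":[^n]') as enabled_filters
from my_table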
I am working with the JSON_VALUE function and I need a kind of dynamic query.
I have a column called Criteria; sometimes it has one value, but sometimes it has two or three values, like:
Example of 1 value: $.IRId = 1
Example of 2 values: $.IROwner = 'james.jonson@domain.com' AND DaysTillDue < 10
So, in order to read the values from a JSON column using the Criteria column, I am using this logic:
DECLARE @CriteriaValue INT
       ,@CriteriaStatement VARCHAR(50)

SELECT @CriteriaValue = SUBSTRING(Criteria, CHARINDEX('=', Criteria) + 1, LEN(Criteria)) FROM #SubscriptionCriteria;
SELECT @CriteriaStatement = SUBSTRING(Criteria, 0, CHARINDEX('=', Criteria)) FROM #SubscriptionCriteria;

SELECT @CriteriaValue, @CriteriaStatement

SELECT *
FROM [SAAS].[ObjectEvent]
WHERE JSON_VALUE(JSONMessageData, @CriteriaStatement) = @CriteriaValue
That SQL code only handles a Criteria column with a single value ($.IRId = 1), but the idea is to have something that reads the criteria no matter how many filters it contains and applies them all to the final query. The idea I have is that the query would look like this:
SELECT *
FROM [SAAS].[ObjectEvent]
WHERE JSON_VALUE(JSONMessageData, @CriteriaStatement1) = @CriteriaValue1
  AND JSON_VALUE(JSONMessageData, @CriteriaStatement2) = @CriteriaValue2
  AND JSON_VALUE(JSONMessageData, @CriteriaStatement3) = @CriteriaValue3
  etc.
Any suggestions?
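One option is to rewrite each stored criteria string into a WHERE clause over JSON_VALUE and run it with sp_executesql. A hedged sketch, not a drop-in solution: it assumes clauses are separated by ' AND ', each clause has the form <json_path> <op> <value> with a single-character operator (=, <, >), that '|' never appears in the criteria text, and SQL Server 2017+ for STRING_AGG:

DECLARE @where NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- split each criteria string on ' AND ' and rebuild every clause
-- as JSON_VALUE(JSONMessageData, '<path>') <op> <value>
SELECT @where = STRING_AGG(
    'JSON_VALUE(JSONMessageData, '''
      + LTRIM(RTRIM(LEFT(s.value, PATINDEX('%[=<>]%', s.value) - 1)))
      + ''') ' + SUBSTRING(s.value, PATINDEX('%[=<>]%', s.value), 1) + ' '
      + LTRIM(RTRIM(SUBSTRING(s.value, PATINDEX('%[=<>]%', s.value) + 1, LEN(s.value)))),
    ' AND ')
FROM #SubscriptionCriteria c
CROSS APPLY STRING_SPLIT(REPLACE(c.Criteria, ' AND ', '|'), '|') s;

SET @sql = N'SELECT * FROM [SAAS].[ObjectEvent] WHERE ' + @where;
EXEC sp_executesql @sql;

Note that this executes whatever is stored in Criteria, so the criteria text has to be trusted (or validated) exactly like code.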
I'm reading and executing SQL queries from a file and I need to inspect the result sets to count all the null values across all columns. Because the SQL is read from a file, I don't know the column names and thus can't reference the columns by name when trying to find the null values.
I think using a CTE is the best way to do this, but how can I reference the columns when I don't know what the column names are?
WITH query_results AS
(
  <sql_read_from_file_here>
)
select count_if(<column_name> is null) FROM query_results
If you are using Python to read the file of SQL statements, you can do something like this, which uses pglast to parse the SQL query and get the columns for you:
import pglast

sql_read_from_file_here = "SELECT 1 foo, 1 bar"

# parse the query and grab the select list (the output columns)
ast = pglast.parse_sql(sql_read_from_file_here)
cols = ast[0]['RawStmt']['stmt']['SelectStmt']['targetList']

# build one sum(iff(... is null,1,0)) term per column and join them with '+'
sum_stmt = "sum(iff({col} is null,1,0))"
sums = [sum_stmt.format(col=col['ResTarget']['name']) for col in cols]
print(f"select {' + '.join(sums)} total_null_count from query_results")
# outputs: select sum(iff(foo is null,1,0)) + sum(iff(bar is null,1,0)) total_null_count from query_results
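If you'd rather stay in SQL end to end (count_if and iff in the question suggest Snowflake), Snowflake's object functions can count nulls without naming any column. A sketch, assuming Snowflake: OBJECT_CONSTRUCT(*) drops keys with NULL values while OBJECT_CONSTRUCT_KEEP_NULL(*) keeps them, so the per-row difference in key counts is the number of NULL columns:

WITH query_results AS
(
  SELECT 1 AS foo, NULL AS bar  -- stand-in for the SQL read from file
)
SELECT SUM(
         ARRAY_SIZE(OBJECT_KEYS(OBJECT_CONSTRUCT_KEEP_NULL(*)))
       - ARRAY_SIZE(OBJECT_KEYS(OBJECT_CONSTRUCT(*)))
       ) AS total_null_count
FROM query_results;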
I have a column col1 in file 1:
00SPY58KHT5
00SPXB2BD0J
00SPXB2DXH6
00SPXDQ02S1
00SPXDY91JI
00SPXFG88L6
00SPXF1AQ4Z
00SPXF5UKS3
00SPXGL9IV6
I have a column col2 in file 2:
0SPY58KHT5
0SPXB2BD0J
0SPXB2DXH6
0SPXDQ02S1
0SPXDY91JI
0SPXFG88L6
0SPXF1AQ4Z
0SPXF5UKS3
0SPXGL9IV6
As you can see, the values in the first file have an extra leading 0 at the beginning.
I need to do a JOIN operation between the two files on these columns, so I need to use SUBSTRING, like this:
JOIN_FILE1_FILE2 = JOIN FILE1 BY TRIM(SUBSTRING(col1,1,10)), FILE2 BY TRIM(col2);
DUMP JOIN_FILE1_FILE2;
But I get an empty result:
Input(s):
Successfully read 914493 records from: "/hdfs/data/adhoc/PR/02/RDO0/GUIDES/GUIDE_CONTRAT_USINE.csv"
Successfully read 102851809 records from: "/hdfs/data/adhoc/PR/02/RDO0/BB0/MGM7X007-2019-09-11.csv"
Output(s):
Successfully stored 0 records in: "hdfs://ha-manny/hdfs/hadoop/pig/tmp/temp964914764/tmp1220183619"
How can I do this join, please?
As a solution, I first generated new data applying the SUBSTRING function to col1.
Then I did the filtering using TRIM, and finally used CONCAT('0', col1) in another GENERATE to restore the leading 0.
In other words:
DATA1 = FOREACH DATA_SOURCE GENERATE
    SUBSTRING(col1, 1, 11) AS col1; -- drop only the leading 0 (the stop index is exclusive)
JOINED_DATA = JOIN DATA1 BY col1, ...
FINAL_DATA = FOREACH JOINED_DATA GENERATE
    CONCAT('0', col1) AS col1, -- restore the leading 0 after the join
    ...
And this works without problem.
I have files of the format test_YYYYMM.txt. I am using '-tagFile' and SUBSTRING() to extract the year and month for use in my pig script.
The file name gets added as a pseudo-column at the beginning of the tuple.
Before I do a DUMP I would like to remove that column. Doing a FOREACH ... GENERATE with only the columns I need does not work; it still retains the pseudo-column.
Is there a way to remove this column?
My sample script is as follows
raw_data = LOAD 'test_201501.txt' using PigStorage('|', '-tagFile') as
col1: chararray, col2: chararray;
data_with_yearmonth = FOREACH raw_data GENERATE
SUBSTRING($0,5,11) as yearmonth,
'TEST_DATA' as test,
col1,
col2;
DUMP data_with_yearmonth;
Expected Output:
201501, TEST_DATA, col1, col2
Current Output:
201501, TEST_DATA, test_YYYYMM.txt, col1, col2
First of all, if col1 and col2 are strings, then you should declare them as CHARARRAY in Pig.
Plus, I guess your current output is actually: 201501, TEST_DATA, test_YYYYMM.txt, col1.
Tell me if I'm wrong, but since you used '-tagFile', the first column is the file title; that is why you access it with $0 in your SUBSTRING.
You can try this code:
raw_data = LOAD 'text_201505.txt'
    USING PigStorage('|', '-tagFile')
    AS (title: CHARARRAY, col1: CHARARRAY, col2: CHARARRAY);

data_with_yearmonth = FOREACH raw_data GENERATE
    SUBSTRING(title, 5, 11) AS yearmonth, -- title is the '-tagFile' pseudo-column
    'TEST_DATA' AS test,
    col1,
    col2;
DUMP data_with_yearmonth;