Average count of elements in a set in Hive? - hive

I have two columns, id and segment. Segment is a comma-separated set of strings. I need to find the average number of segments across the whole table. One way to do it is with two separate queries -
A - select count(*) from table_name;
B - select count(*) from table_name LATERAL VIEW explode(split(segment, ',')) lTable AS singleSegment where segment != "";
avg = B/A
Answer would be 8/4 = 2 in the above case.
Is there a better way to achieve this ?

Try:
select sum(CASE segment
             WHEN '' THEN 0
             ELSE size(split(segment, ','))
           END) * 1.0 / count(*)
from table_name;
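The CASE for the empty string is what keeps the average honest: in Hive, splitting an empty string still yields one (empty) element, so size() would return 1 rather than 0. A quick illustration (hedged - verify on your Hive version):
select size(split('', ','));     -- returns 1, hence the explicit THEN 0
select size(split('a,b', ','));  -- returns 2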
If your id field is unique, and you want to add a filter on the segment values, or protect against other malformed segment values like 'a,b,' and 'a,,b', you could do:
SELECT SUM(seg_size) * 1.0 / count(*) FROM (
  SELECT count(*) as seg_size from table_name
  LATERAL VIEW explode(split(segment, ',')) lTable AS singleSegment
  WHERE trim(singleSegment) != ""
  GROUP BY id
) sizes
Then you can add other stuff into the where clause.
But this query takes two Hive jobs to run, compared to one for the simpler query, and requires the id field to be unique.

Related

Loop through table and update a specific column

I have the following table:
Id | Category
---+-----------
1  | some thing
2  | value
This table contains a lot of rows, and what I'm trying to do is update all the Category values to capitalize the first letter of every word. For example, some thing should become Some Thing.
At the moment this is what I have:
UPDATE MyTable
SET Category = (SELECT UPPER(LEFT(Category, 1)) + LOWER(SUBSTRING(Category, 2, LEN(Category)))
                FROM MyTable WHERE Id = 1)
WHERE Id = 1;
But there are two problems. The first is that changing the Category value to upper only works for one-word values (hello => Hello, but hello world => Hello world), and the second is that I'd need to run this query X times following the Where Id = X logic. So my question is: how can I update X rows? I was thinking of a cursor, but I don't have much experience with them.
Here is a fiddle to play with.
You can split the words apart, apply the capitalization, then munge the words back together. No, you shouldn't be worrying about subqueries and Id because you should always approach updating a set of rows as a set-based operation and not one row at a time.
;WITH cte AS
(
    SELECT Id,
           NewCat = STRING_AGG(
                        CONCAT(UPPER(LEFT(value, 1)), SUBSTRING(value, 2, 57)), ' ')
                    WITHIN GROUP (ORDER BY CHARINDEX(value, Category))
    FROM
    (
        SELECT t.Id, t.Category, s.value
        FROM dbo.MyTable AS t
        CROSS APPLY STRING_SPLIT(Category, ' ') AS s
    ) AS x
    GROUP BY Id
)
UPDATE t
SET t.Category = cte.NewCat
FROM dbo.MyTable AS t
INNER JOIN cte ON t.Id = cte.Id;
This assumes your category doesn't have non-consecutive duplicates within it; for example, bora frickin bora would get messed up (meanwhile bora bora frickin would be fine). It also assumes a case insensitive collation (which could be catered to if necessary).
In Azure SQL Database you can use the new enable_ordinal argument to STRING_SPLIT() but, for now, you'll have to rely on hacks like CHARINDEX().
Updated db<>fiddle (thank you for the head start!)
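For completeness, here is a minimal sketch of what the ordinal-based version might look like, assuming Azure SQL Database's STRING_SPLIT(..., 1), which adds an ordinal output column:
;WITH cte AS
(
    SELECT Id,
           NewCat = STRING_AGG(CONCAT(UPPER(LEFT(value, 1)), SUBSTRING(value, 2, 57)), ' ')
                    WITHIN GROUP (ORDER BY ordinal)  -- real position, no CHARINDEX hack
    FROM dbo.MyTable
    CROSS APPLY STRING_SPLIT(Category, ' ', 1)
    GROUP BY Id
)
UPDATE t
SET t.Category = cte.NewCat
FROM dbo.MyTable AS t
INNER JOIN cte ON t.Id = cte.Id;
This sidesteps both the CHARINDEX() hack and the non-consecutive-duplicates caveat.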

How to select the nth column, and order columns' selection in BigQuery

I have this huge table upon which I apply a lot of processing (using CTEs), and I want to perform a UNION ALL on 2 particular CTEs.
SELECT *
, 0 AS orders
, 0 AS revenue
, 0 AS units
FROM secondary_prep_cte WHERE purchase_event_flag IS FALSE
UNION ALL
SELECT *
FROM results_orders_and_revenues_cte
I get a "Column 1164 in UNION ALL has incompatible types : STRING,DATE at [97:5]
Obviously I don't know the name of the column, and I'd like to debug this, but I feel like I'm going to waste a lot of time if I can't pinpoint which column is 1164.
I also think this is a problem of the order of columns between the CTEs, so I have 2 questions:
How do I identify the 1164th column?
How do I order my columns before performing the UNION ALL?
I found this similar question, but it is for MSSQL; I am using BigQuery.
You can get information from INFORMATION_SCHEMA.COLUMNS but you'll need to create a table or view from the CTE:
CREATE OR REPLACE VIEW `project.dataset.secondary_prep_view` AS
SELECT * FROM (SELECT 1 AS id, "a" AS name, "b" AS value);
Then:
SELECT * FROM dataset.INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'secondary_prep_view';
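Since INFORMATION_SCHEMA.COLUMNS includes an ordinal_position column, a sketch like this should pin down column 1164 directly:
SELECT column_name, data_type, ordinal_position
FROM dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'secondary_prep_view'
AND ordinal_position = 1164;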

Find if a string is in or not in a database

I have a list of IDs
'ACE', 'ACD', 'IDs', 'IN','CD'
I also have a table with a structure similar to the following:

ID  | value
----+------
ACE | 2
CED | 3
ACD | 4
IN  | 4
IN  | 4
I want a SQL query that returns a list of IDs that exist in the database and a list of IDs that do not exist in the database.
The return should be:
1. ACE, ACD, IN (exist)
2. IDs, CD (not exist)
My code is like this:
select
  ID,
  value
from db
where ID in ('ACE', 'ACD', 'IDs', 'IN', 'CD')
However, the return is 1) super slow with all kinds of IDs and 2) returns multiple rows with the same ID. Is there any way, using PostgreSQL, to 1) return unique IDs and 2) make it run faster?
Assuming no duplicates in table nor input, this query should do it:
SELECT t.id IS NOT NULL AS id_exists
, array_agg(ids.id)
FROM unnest(ARRAY['ACE','ACD','IDs','IN','CD']) ids(id)
LEFT JOIN tbl t USING (id)
GROUP BY 1;
Else, please define how to deal with duplicates on either side.
If the LEFT JOIN finds a matching row, the expression t.id IS NOT NULL is true. Else it's false. GROUP BY 1 groups by this expression (1st in the SELECT list), array_agg() forms arrays for each of the two groups.
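With the sample input above (and no duplicates), the output would look something like this sketch:
id_exists | array_agg
----------+----------------
f         | {IDs,CD}
t         | {ACE,ACD,IN}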
Related:
Select rows which are not present in other table
Hmmm . . . Is this sufficient:
select ids.id,
       exists (select 1 from tbl t where t.id = ids.id) as id_exists
from unnest(array['ACE', 'ACD', 'IDs', 'IN', 'CD']) ids(id);
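Unlike the aggregated version above, this keeps one row per input ID with a boolean flag; roughly (sketch):
id  | id_exists
----+----------
ACE | t
ACD | t
IDs | f
IN  | t
CD  | f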

How to efficiently select records matching substring in another table using BigQuery?

I have a table of several million strings that I want to match against a table of about twenty thousand strings like this:
#standardSQL
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')
Unfortunately this is taking an awful long time.
Considering that the fragment table is only 20k records, can I load it into a JavaScript array using a UDF and match it that way? I'm trying to figure out how to do this right now, but perhaps there's already some magic I could do here to make this faster. I tried a CROSS JOIN and got resource exceeded fairly quickly. I've also tried using EXISTS, but I can't reference the record.name inside that subquery's WHERE without getting an error.
Example using Public Data
This seems to reflect about the same amount of data ...
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT LOWER(name) AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT record.* FROM `record`
JOIN `fragment` ON record.name
LIKE CONCAT('%', fragment.name, '%')
Below is for BigQuery Standard SQL
#standardSQL
WITH record AS (
SELECT LOWER(text) AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT DISTINCT LOWER(name) AS name
FROM `bigquery-public-data.usa_names.usa_1910_current`
), temp_record AS (
SELECT record, TO_JSON_STRING(record) id, name, item
FROM record, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
), temp_fragment AS (
SELECT name, item FROM fragment, UNNEST(REGEXP_EXTRACT_ALL(name, r'\w+')) item
)
SELECT AS VALUE ANY_VALUE(record) FROM (
SELECT ANY_VALUE(record) record, id, r.name name, f.name fragment_name
FROM temp_record r
JOIN temp_fragment f
USING(item)
GROUP BY id, name, fragment_name
)
WHERE name LIKE CONCAT('%', fragment_name, '%')
GROUP BY id
The above completed in 375 seconds, while the original query is still running at 2740 seconds and keeps running, so I will not even wait for it to complete.
Mikhail's answer appears to be faster - but let's have one that doesn't need to SPLIT nor separate the text into words.
First, compute a regular expression with all the words to be searched:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
)
SELECT FORMAT('(%s)',STRING_AGG(name,'|'))
FROM fragment
Now you can take that resulting string, and use it in a REGEX ignoring case:
#standardSQL
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
), largestring AS (
SELECT '(?i)(mary|margaret|helen|more_names|more_names|more_names|josniel|khaiden|sergi)'
)
SELECT record.* FROM `record`
WHERE REGEXP_CONTAINS(record.name, (SELECT * FROM largestring))
(~510 seconds)
As alluded to in my question, I worked on a version using a JavaScript UDF which solves this, albeit more slowly than the answer I accepted. For completeness, I'm posting it here because perhaps someone (like myself in the future) may find it useful.
CREATE TEMPORARY FUNCTION CONTAINS_ANY(str STRING, fragments ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """
for (var i in fragments) {
if (str.indexOf(fragments[i]) >= 0) {
return fragments[i];
}
}
return null;
""";
WITH record AS (
SELECT text AS name
FROM `bigquery-public-data.hacker_news.comments`
WHERE text IS NOT NULL
), fragment AS (
SELECT name AS name, COUNT(*)
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name IS NOT NULL
GROUP BY name
), fragment_array AS (
SELECT ARRAY_AGG(name) AS names, COUNT(*) AS count
FROM fragment
GROUP BY LENGTH(name)
), records_with_fragments AS (
SELECT record.name,
CONTAINS_ANY(record.name, fragment_array.names)
AS fragment_name
FROM record INNER JOIN fragment_array
ON CONTAINS_ANY(name, fragment_array.names) IS NOT NULL
)
SELECT * EXCEPT(rownum) FROM (
SELECT record.name,
records_with_fragments.fragment_name,
ROW_NUMBER() OVER (PARTITION BY record.name) AS rownum
FROM record
INNER JOIN records_with_fragments
ON records_with_fragments.name = record.name
AND records_with_fragments.fragment_name IS NOT NULL
) WHERE rownum = 1
The idea is that the list of fragments is relatively small enough that it can be processed in an array, similar to Felipe's answer using regular expressions. The first thing I do is create a fragment_array table which is grouped by the fragment lengths ... a cheap way of preventing an over-sized array which I found can cause UDF timeouts.
Next I create a table called records_with_fragments that joins those arrays to the original records, finding only those which contain a matching fragment using the JavaScript UDF CONTAINS_ANY(). This will result in a table containing some duplicates since one record may match multiple fragments.
The final SELECT then pulls in the original record table, joins to records_with_fragments to determine which fragment matched, and also uses the ROW_NUMBER() function to prevent duplicates, i.e. only showing the first row of each record as uniquely identified by its name.
Now, the reason I do the join in the final query is that in my actual data there are more fields I want besides just the string being matched. Earlier on in my actual data, I create a table of DISTINCT strings which then needs to be re-joined later.
Voila! Not the most elegant but it gets the job done.

Redshift - CASE statement checking whether a column EXISTS or not

I am dynamically querying tables, some of which might not have a specific column. My intention is to check for the existence of the column and dynamically assign a value. If all the tables contained the field, I would simply write:
select name, count(k_val) from tbl GROUP by 1
But in my case I need to do something like this:
select name,
       SUM(CASE when (select EXISTS (SELECT * FROM pg_table_def
                                     WHERE tablename = 'tbl' and "column" = 'k_val'))
                then 1 else 0 end) as val
from tbl GROUP by 1
I am getting the error:
SQL Error [500310] [0A000]: Amazon Invalid operation:
Specified types or functions (one per INFO message) not supported on
Redshift tables.;
The following is a trick that works on most databases to handle missing columns.
select t.*,
       (select k_val -- intentionally not qualified
        from tbl t2
        where t2.pk = t.pk
       ) new_k_val
from tbl t cross join
     (select NULL as k_val) k;
pk is the primary key column for the table. This uses scoping rules to find a value for k_val. If k_val is in the table, then the subquery will use the value from that row. If not, then the scope will "reach out" and take the value from k. There is no confusion in this case, because k_val is not in tbl.
If you don't want a constant subquery for some reason, you can always use:
(select NULL as k_val from tbl limit 1) k
You can then use this as a subquery or CTE for your aggregation purposes.
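For example, a sketch of the original aggregation wrapped around the trick (names as in the question):
with src as (
  select t.*,
         (select k_val -- intentionally not qualified
          from tbl t2
          where t2.pk = t.pk) as new_k_val
  from tbl t cross join
       (select NULL as k_val) k
)
select name, count(new_k_val)
from src
group by 1;
count() skips NULLs, so a missing k_val column simply contributes zero to each group.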
Having said all that, I am wary of handling missing columns this way.