How to fix seed in hive

How to fix seed in hive - hive

I am running an experiment and need to fix the audience for test and control group. Here is the query I am using:
select consumer_id,
case when rand(5555)<0.5 then 'control'
else 'experiment'
end as groups
from my_table
If I create two tables using the same query, and join them, they have the same split but if I do join together in the same query it gives different split for each.
select a.groups,b.groups,count(*) from
(select consumer_id,
case when rand(5555)<0.5 then 'control'
else 'experiment'
end as groups
from my_table) a
left join
(select consumer_id,
case when rand(5555)<0.5 then 'control'
else 'experiment'
end as groups
from my_table) b on a.consumer_id = b.consumer_id
group by a.groups,b.groups;
Any idea why is this and which function I can use for seeding in hive

I found the same bug querying a table twice (nothing related to the join itself). Try setting
hive.optimize.index.filter=false;
Let me know if it works, if it does not I think there is one additional property with a bug querying a table twice.

Related

Non null value associated with least value in other column

My data is like this:
Desired output:
I have tried using following SQL:
CASE
WHEN (MINDAY_DIFF > 0) AND (MINDAY_DIFF IS NOT NULL)
THEN FIRST_VALUE(BP_MED) OVER (PARTITION BY ID ORDER BY MINDAY_DIFF ASC)
END AS DRUG
This returns NULL.
I also tried
CASE
WHEN (MINDAY_DIFF > 0)
THEN BP_MED
ELSE NULL
END AS DRUG
It returns both non-null values of BP_MED.
I also tried NVL but that didn't work either.
Since it is in Netezza. There are fewer solutions online. Please help.

The concept here is the following working inside out:
We didn't always have analytics to make things easier: So not knowing Netezza I took a more... antiquated approach. There may be better/more efficient ways; which I would look for given a place to play around with; but I think this would work in most any RDBMS as I tried to avoid any RDBMS Specific aspects unless we're dealing with pre left join supported RDBMS.
Find the Min ID and the day difference for that ID in a result set (MinAndID)
LEFT Join back to the baseSet to get all possible values specifically to get BP_Med
Then Join back to base Table to ensure we get ALL records and then populate only the BP_Med which links to the minDay_Diff. Since it's a left join only 1 record per ID should return.
UNTESTED:
SELECT A.*, DesiredDrug.BP_MED
FROM TABLE A
LEFT JOIN (SELECT ID, Fill_Date, BP_MED, MinDay_Diff
FROM TABLE BaseSet
INNER JOIN (SELECT MIN(MINDAY_DIFF) MDD, ID
FROM TABLE
WHERE ID is not null
GROUP BY ID) MinAndID
on BaseSet.ID = MinAndID.ID
and BaseSet.MinDay_Diff = MinAndID.MDD) DesiredDrug
on A.ID = DesiredDrug
and A.Fill_Date = DesiredDrug.Fill_Date
and A.BP_Med= DesiredDrug.BP_Med
and A.MinDay_Diff = DesiredDrug.MinDay_Diff

Why do I get a duplicate column name error only when I SELECT FROM (SELECT)

I imagine this is a really basic oversight on my part but I have an SQL query which works fine. But I when I SELECT from that result (SELECT FROM (SELECT))
I get a 'duplicate column' error. There are duplicate column names, for sure, in two tables where I compare them but they do not cause a problem in the initial result. For example:
SELECT _dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id
Works fine but when I try to select from it, I get the error:
SELECT DISTINCT tag FROM
(SELECT _dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id) a
Regardless of the DISTINCT. Ok, I can change the column names to be unique but the question really is - why do i get the error when I SELECT FROM (SELECT) and not in the initial query?
Thanks
Solution:
SELECT DISTINCT tag_id, tag FROM (SELECT _dia_tagsrel.tag_id, _dia_tagsrel.article_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id) a
I only needed to SELECT one of the duplicate columns, even though I was comparing the both of them. Provided by answer below.

In you are second query i.e., the sub query, you are selecting tag_id twice. Though it is from two different tables, it works out whey you are selecting the data. But when you select the columns with same name twice, it provides you duplicate error. Below is the way you have selected the column which is incorrect
_dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
While using sub queries, merge, in or exists clause, avoid using the same column names multiple times.
Simple join works out no need of having subquery,
SELECT _dia_tagsrel.tag_id,_dia_tagsrel.article_id, _dia_tags.tag_id, _dia_tags.tag
FROM _dia_tagsrel
JOIN _dia_tags
ON _dia_tagsrel.tag_id = _dia_tags.tag_id

Your first query returns four columns:
tag_id
article_id
tag_id
tag
Duplicate column names are allowed in a result set, but are not allowed in a table -- or derived table, view, CTE, or most subqueries (an exception are EXISTS subqueries).
I hope you can see the duplicate. There is no need to select tag_id twice, because the JOIN requires that the values are the same. So just select three columns:
SELECT tr.tag_id, tr.article_id, t.tag
FROM _dia_tagsrel tr JOIN
_dia_tags t
ON tr.tag_id = t.tag_id

Your subquery has two tag_ids, so how database engine decide which one you want to use.
So, either use one (join requires tag_ids to be same) or re-name it :
If _dia_tag has unique tags then you can use EXISTS instead of INNER JOIN:
SELECT t.tag
FROM _dia_tags t
WHERE EXISTS (SELECT 1 FROM _dia_tagsrel tr WHERE tr.tag_id = t.tag_id);

Issue with joins in a SQL query

SELECT
c.ConfigurationID AS RealflowID, c.companyname,
c.companyphone, c.ContactEmail, COUNT(k.caseid)
FROM
dbo.Configuration c
INNER JOIN
dbo.cases k ON k.SiteID = c.ConfigurationId
WHERE
EXISTS (SELECT * FROM dbo.RepairEstimates
WHERE caseid = k.caseid)
AND c.AccountStatus = 'Active'
AND c.domainid = 46
GROUP BY
c.configurationid,c.companyname, c.companyphone, c.ContactEmail
I have this query - I am using the configuration table to get the siteid of the cases in the cases table. And if the case exists in the repair estimates table pull the company details listed and get a count of how many cases are in the repair estimator table for that siteid.
I hope that is clear enough of a description.
But the issue here is the count is not correct with the data that is being pulled. Is there something I could do differently? Different join? Remove the exists add another join? I am not sure I have tried many different things.

Realized I was using the wrong table. The query was correct.

Create table that is table 1 minus table 2 based on three criteria

I have a table of LoggedDischarges and another table of ActualDischarges.
I am trying to generate a query that will give me all the fields from ActualDischarges excluding those already in LoggedDischarges based on AgencyID, Program and ActivityEndDate
A client can be in multiple programs and be discharged from multiple on the same day. I need to make sure I get LoggedDischarges from each program.
This is what I have but am not sure how to add the other criteria.
select * from ActualDischarges
where (agencychildid ) not in
(select agencyid from LoggedDischarges)
Thank you,
Steve Hathaway

Even if your DBMS supports multiple columns in a subquery like
where (AgencyID, Program, ActivityEndDate) not in
( select AgencyID, Program, ActivityEndDate
from ... )
you better switch to a NOT EXISTS (in case of any NULLs):
select * from ActualDischarges as aD
where NOT EXISTS
(select * from LoggedDischarges as lD
where aD.AgencyID = lD.AgencyID
and aD.Program = lD. Program
and aD.ActivityEndDate= lD.ActivityEndDate)

For this type of match, I would recommend a LEFT JOIN with an IS NULL at the end to determine that the second table does not have the record:
SELECT a.*
FROM ActualDischarges AS a
LEFT JOIN LoggedDischarges AS l
ON agencyid=agencychildid
AND a.program=l.program
AND a.ActivityEndDate=l.ActivityEndDate
WHERE l.agencyid IS NULL
As a side note, definitely avoid using multiple IN statements for situations like this WHERE NOT IN (...) AND NOT IN (...) etc. as you end up excluding records which match different records in LoggedDischarges for different reasons, which is rarely the desired result.

t-sql include inner join on a table only if records present

I have a very big table with tons and tons of records.
[HugeTable](id, col1, col2, col3...)
There is a page on the front end application showing this [HugeTable] data based on many filters. One of the filters will give a subset of [HugeTable], if not null
#HugeTable_subset(id)
if this filter is present, #HugeTable_subset would have records. I would like to narrow down [HugeTable] data to only matching records in #HugeTable_subset.
so right now, in the t-sql, I am doing an if-else kind of query
IF (SELECT Count(*) FROM #HugeTable_subset) > 0
BEGIN
SELECT HugeTable.* FROM [HugeTable] h
JOIN #HugeTable_subset t
ON h.id = t.id
WHERE h.params = #searchParams
END
ELSE
BEGIN
SELECT * FROM [HugeTable] h
WHERE h.params = #searchParams
END
Is there a way I could merge these two selects into one?

To join the two selects in one you can just use a LEFT OUTTER JOIN instead of a INNER JOIN.
You probably already know that, yes, maybe you don't knows you already doing it in the most optimized way. sql-server ill create two sub-query plans for each select inside the IF-ELSE and use each properly.
You can acid teste it to see if there are any difference and if the IF-ELSE really beats up the LEFT JOIN option
Also there's still two point I can point out.
1) Good Indexes over the filters can really improve your performance.
2) You can use pagination to return just a few results, improving performance and user experience when the result returns a ton of records

SELECT HugeTable.* FROM [HugeTable] h
WHERE ((SELECT Count(*) FROM #HugeTable_subset) = 0) OR
h.id IN (SELECT t.id from #HugeTable_subset t))

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to fix seed in hive - hive

I found the same bug querying a table twice (nothing related to the join itself). Try setting hive.optimize.index.filter=false; Let me know if it works, if it does not I think there is one additional property with a bug querying a table twice.

Related

Non null value associated with least value in other column

Why do I get a duplicate column name error only when I SELECT FROM (SELECT)

Issue with joins in a SQL query

Create table that is table 1 minus table 2 based on three criteria

t-sql include inner join on a table only if records present

Categories

Resources