How do I implement a join with a BETWEEN in Hive?

I have a Hive table with the numeric version of an IP address. I have another table with start, end, location where start and end define a range of numeric IPs associated with a location.
Example
Numeric: 29
start | end | location
----------------------
1 | 11 | 666
12 | 30 | 777
31 | 40 | 888
Output: 29 - 777
I need to use the IP from table 1 to lookup the location from table 2. I'm new to Hive and have discovered that I can't use BETWEEN or < > in join statements. I've been trying to figure out some way of making this happen using Hive SQL and can't figure it out. Is there a way? I'm somewhat familiar with UDFs as well if one of those is needed. I'm open to the idea that this isn't possible in Hive and I need to do with Pig or a Java Map/Reduce job, I just don't know enough about things at this point to say.
Any help is appreciated. Thanks.

Hive and Pig do not support inequality joins like this. You can use a cross join plus a WHERE clause to do it, but it's inefficient. A simple example:
SELECT t1.ip, t2.location_ip FROM t1 JOIN t2
WHERE t1.ip >= t2.start_ip AND t1.ip <= t2.end_ip;
However, it seems you want to cross join a big table with a small table. If so, the following statement, which forces a map-side join on the small table, may be more efficient:
SELECT /*+ MAPJOIN(t2) */ t1.ip, t2.location_ip FROM t1 JOIN t2
WHERE t1.ip >= t2.start_ip AND t1.ip <= t2.end_ip;
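If your Hive version supports it, you can also let Hive convert the join to a map join automatically rather than spelling out the hint. A minimal sketch, assuming the table and column names from the example above (hive.auto.convert.join is a standard Hive setting):
set hive.auto.convert.join=true;
SELECT t1.ip, t2.location_ip FROM t1 JOIN t2
WHERE t1.ip >= t2.start_ip AND t1.ip <= t2.end_ip;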

Related

Convert DECODE into mapping / reference table

I want to make sure that this conversion of a DECODE function into a SELECT statement joined to a mapping table would run properly, and that I'm not using syntax that works in SQL Server but is different in Oracle SQL.
About the code: it uses the DECODE function to map a series of four-digit medical taxonomy codes to two-digit provider specialty codes. The source column is PRVDR.TXNMY_CD, and the outcome should be a column PRFRM_PRVDR_SPCLTY_CD.
Original code:
SELECT
DECODE (SUBSTR(PRVDR.TXNMY_CD, 1, 4),
        '261Q', '70', '347E', '59', '332H', '96', '332B', 'A6', '1711', 'Y9',
        '2257', 'Y9', '106H', '62', '103K', '26', '101Y', '26', '367A', '42',
        '207K', '03', '3416', '59', '367H', '32', '207L', '05', '211D', '48',
        '231H', '64', '2376', '64', '111N', '35', '291U', '69', '103G', '86',
        '364S', '89', '208C', '28', '172V', '60', '251S'
) AS PRFRM_PRVDR_SPCLTY_CD
FROM
NPS_CLM_HDR
My conversion attempt:
First, I'd separately create this table called MAPPING with the following columns
| TXNMY_CD_MAP | PRFRM_PRVDR_SPCLTY_CD |
| 1711 | Y9 |
| 2257 | Y9 |
| 106H | 62 |
| 367A | 42 |
etc.
Then I would use the following query:
SELECT PRFRM_PRVDR_SPCLTY_CD
FROM REF.MAPPING AS M
JOIN PRVDR.TXNMY_CD AS P ON P.TXNMY_CD = M.TXNMY_CD_MAP
Does this look correct or have I used terminology from SQL Server that does not work with Oracle SQL?
Hmmm . . . I am expecting the two columns to be:
TXNMY_CD4 PRFRM_PRVDR_SPCLTY_CD
'261Q' '70'
'347E' '59'
'332H' '96'
. . .
This may be what your table already looks like; in any case, these are the values from the beginning of the DECODE list.
And then:
SELECT m.PRFRM_PRVDR_SPCLTY_CD
FROM PRVDR.TXNMY_CD P LEFT JOIN
REF.MAPPING M
ON LEFT(P.TXNMY_CD, 4) = M.TXNMY_CD_MAP
Except for the LEFT() vs. SUBSTR(), this should work in both databases.
Note that this uses LEFT JOIN to ensure that no rows are lost, even if there are no matches.
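In Oracle, which has SUBSTR() rather than LEFT(), the same join condition would be written as:
ON SUBSTR(P.TXNMY_CD, 1, 4) = M.TXNMY_CD_MAP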
decode() is an Oracle function that is not available in SQL Server. If you wanted to translate your decode to SQL Server, you would use case:
case left(prvdr.txnmy_cd, 4)
when '261Q' then '70'
when '347E' then '59'
...
end as prfrm_prvdr_spclty_cd
from nps_clm_hdr
Note that substr() is not supported in SQL Server, so here we use left() instead.
That said, using a mapping table is a better approach: it scales better, and makes it easy to maintain the mapping (there is no need to modify the code of the query, just the data).
You would phrase the query as:
select prfrm_prvdr_spclty_cd
from prvdr.txnmy_cd as p
left join ref.mapping as m on left(p.txnmy_cd, 4) = m.txnmy_cd_map
The left join allows unmapped values.
Not all databases support left() or substr() (the latter is sometimes called substring(), as in SQL Server). I think the most portable approach uses like and concat():
left join ref.mapping as m on p.txnmy_cd like concat(m.txnmy_cd_map, '%')
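For completeness, a minimal sketch of how the mapping table might be created and populated in Oracle (the column sizes are assumptions; the values are the first few pairs from the original DECODE):
CREATE TABLE ref.mapping (
    txnmy_cd_map          VARCHAR2(4),  -- first 4 characters of the taxonomy code
    prfrm_prvdr_spclty_cd VARCHAR2(2)   -- mapped specialty code
);
INSERT INTO ref.mapping VALUES ('261Q', '70');
INSERT INTO ref.mapping VALUES ('347E', '59');
INSERT INTO ref.mapping VALUES ('332H', '96');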

Why are query results across linked servers of different versions completely wrong?

I have a SQL Server 2005 running a stored procedure which hits other servers running 2008.
A very straightforward query is returning utterly incorrect results.
SELECT
c.acctno, c.provcode, p.provcode, p.provname, c.poscode AS ChargePOS,
pos.poscode, pos.posdesc
FROM
Server2008.charge_t as c
inner join
Server2008.provcode_t as p on c.provcode = p.provcode
inner join
Server2008.poscode_t as pos on c.poscode = pos.poscode
inner join
Server2008.patdemo_t as pat on c.acctno = pat.acctno
left join
Server2008.billareacode_t as b on c.billingarea = b.billareacode
Where
c.proccode in ('G0438', 'G0439', '99420')
and c.correction = 'N'
and (c.priinscode in ('0001', '001A', '001B')
or c.secinscode in ('0001', '001A', '001B'))
and year(c.dateofservice) = year(getdate())
Note the INNER JOIN from poscode_t to charge_t table (second inner join) where c.poscode = pos.poscode. This is very simple, standard stuff here.
When this is executed on the 2005 server, the results are just wrong. I get the following:
acctno | patlname | patfname | ChargeProv | ProvProv | provname | ChargePOS | poscode | posdesc
---------------------------------------------------------------------------------------------------------------------------------------------------------
1 | person1 | Person1 | 28 | 28 | Doctor28 | 07 | 323 | Site323
2 | person2 | person2 | 24 | 24 | Doctor24 | 07 | 323 | Site323
In both examples, the ChargePOS (07) and the poscode (323) are clearly not equal, even though the join should ensure they are.
When I run this query on Server2008 itself, the results are correct. When I run it on a 2012 server, the results are correct. It's only wrong when I run it on the 2005 server, and it makes no difference which version of SSMS I use.
I've broken the query down to run piece by piece adding in the joins one at a time. If I specify an acctno in the WHERE, the results are correct.
Has anyone seen anything like this? It's like the link itself is bad or there's some sort of junk in a hung transaction out there that's messing with things only on this server. Any ideas where to look are helpful.
Thanks for your time.
This is not a solution to the overall issue, but I've discovered a couple of things that affect the results.
Changing the joins to LEFT corrects the problem when running it on the 2005 server.
Adding the hint OPTION ( MERGE JOIN ) corrects it as well.
None of this explains why it runs properly on all other servers but horribly wrong on this one server. Changing the joins to LEFT didn't alter the execution plan's structure but adding the hint did.
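For reference, a query hint like this goes at the very end of the statement; a sketch against an abbreviated form of the query above:
SELECT c.acctno, p.provcode, p.provname
FROM Server2008.charge_t as c
inner join Server2008.provcode_t as p on c.provcode = p.provcode
WHERE c.correction = 'N'
OPTION (MERGE JOIN);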
We're to a point where we need to bring in an expert because working around the problem isn't acceptable in this case. I still welcome any ideas for what might be happening here.

SQL: SUM of MAX values WHERE date1 <= date2 returns "wrong" results

Hi stackoverflow users
I'm having a bit of a problem trying to combine SUM, MAX and WHERE in one query, and after an intense Google search (my search engine skills usually don't fail me), you are my last hope for understanding and fixing the following issue.
My goal is to count people in a certain period of time, and because a person can visit more than once in said period, I'm using MAX. Since I record people as male (m) or female (f) in a string (for statistical purposes), CHAR_LENGTH returns the head counts I need.
SELECT SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id")
So far, so good. But now, as stated before, I'd like to only count the guests which visited in a certain time interval (for statistic purposes as well).
SELECT "statistic"."id", SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id"),
"statistic", "guests"
WHERE ( "guests"."arrival" <= "statistic"."from" AND "guests"."departure" >= "statistic"."to")
GROUP BY "statistic"."id"
This query returns x * (x+1), where x is the desired result. So if the result should be 3, I get 12; if it should be 5, I get 30, and so on.
I probably could solve this algebraic but I'd rather understand what I'm doing wrong and learn from it.
Thanks in advance and I'm certainly going to answer all further questions.
PS: I'm using LibreOffice Base.
EDIT: An example
guests table:
ID | arrival | departure | gender |
10 | 1.1.14 | 10.1.14 | mf |
10 | 15.1.14 | 17.1.14 | m |
11 | 5.1.14 | 6.1.14 | m |
12 | 10.2.14 | 24.2.14 | f |
13 | 27.2.14 | 28.2.14 | mmmmmf |
statistic table:
ID | from | to | name |
1 | 1.1.14 | 31.1.14 |January | expected result: 3
2 | 1.2.14 | 28.2.14 |February| expected result: 7
MAX(...) is the wrong function: You want COUNT(DISTINCT ...).
Add proper join syntax, simplify (and remove unnecessary quotes) and this should work:
SELECT s.id, COUNT(DISTINCT g.id) AS People
FROM statistic s
LEFT JOIN guests g ON g.arrival <= s."from" AND g.departure >= s."to"
GROUP BY s.id
Note: Using a LEFT join means you'll get a result of zero for statistic ids that have no guests. If you would rather have no row at all, remove the LEFT keyword.
You have a very strange data structure. In any case, I think you want:
SELECT s_id, sum(numpersons) AS People
FROM (select s.id as s_id, max(char_length(g.gender)) as numpersons
      from guests g join
           statistic s
           on g.arrival <= s."from" and g.departure >= s."to"
      group by s.id, g.id
     ) gs
GROUP BY s_id;
Thanks for all your inputs. I wasn't familiar with JOIN, but it was necessary to solve my problem.
Since my database is designed in German, I made quite a big mistake while translating it, and I'm sorry if this caused confusion.
Selecting guests.id and later grouping by guests.id wouldn't make any sense, since the id is unique. What I actually wanted to do was select and group by guests.adr_id, which links a visiting guest to an address table.
The correct solution to my problem is the following code:
SELECT statname, SUM(numpers) FROM (
SELECT statistic.name AS statname, guests.adr_id, MAX(CHAR_LENGTH(guests.gender)) AS numpers
FROM guests
JOIN statistic ON (guests.arrival <= statistic."to" AND guests.departure >= statistic."from")
GROUP BY guests.adr_id, statistic.name )
GROUP BY statname
I also know that my database structure is a mess, but I built it while learning by doing and haven't found time to rewrite it yet. I'll try to do better next time I post.

SQL: Select distinct based on regular expression

Basically, I'm dealing with a horribly set up table that I'd love to rebuild, but am not sure I can at this point.
So, the table is of addresses, and it has a ton of similar entries for the same address. But there are sometimes slight variations in the address (e.g., a room # is tacked on IN THE SAME COLUMN, ugh).
Like this:
id | place_name | place_street
1 | Place Name One | 1001 Mercury Blvd
2 | Place Name Two | 2388 Jupiter Street
3 | Place Name One | 1001 Mercury Blvd, Suite A
4 | Place Name, One | 1001 Mercury Boulevard
5 | Place Nam Two | 2388 Jupiter Street, Rm 101
What I would like to do is in SQL (this is mssql), if possible, is do a query that is like:
SELECT DISTINCT place_name, place_street where [the first 4 letters of the place_name are the same] && [the first 4 characters of the place_street are the same].
to, I guess at this point, get:
Plac | 1001
Plac | 2388
Basically, then I can figure out what are the main addresses I have to break out into another table to normalize this, because the rest are just slight derivations.
I hope that makes sense.
I've done some research and I see people using regular expressions in SQL, but a lot of them seem to be using C scripts or something. Do I have to write regex functions and save them into the SQL Server before executing any regular expressions?
Any direction on whether I can just write them in SQL or if I have another step to go through would be great.
Or on how to approach this problem.
Thanks in advance!
Use the SQL function LEFT:
SELECT DISTINCT LEFT(place_name, 4)
I don't think you need regular expressions to get the results you describe. You just want to trim the columns and group by the results, which will effectively give you distinct values.
SELECT left(place_name, 4), left(place_street, 4), count(*)
FROM AddressTable
GROUP BY left(place_name, 4), left(place_street, 4)
The count(*) column isn't necessary, but it gives you some idea of which values might have the most (possibly) duplicate address rows in common.
I would recommend you look into Fuzzy Search Operations in SQL Server. You can match the results much better than what you are trying to do. Just google sql server fuzzy search.
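For example, a minimal sketch using SQL Server's built-in SOUNDEX/DIFFERENCE functions (the table name follows the other answers; the threshold of 3 is an assumption you would tune):
SELECT a.id, b.id
FROM AddressTable a
JOIN AddressTable b
  ON a.id < b.id                                    -- compare each pair only once
WHERE DIFFERENCE(a.place_name, b.place_name) >= 3   -- DIFFERENCE returns 0-4; 4 is the strongest match
  AND DIFFERENCE(a.place_street, b.place_street) >= 3;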
Assuming at least SQL Server 2005 for the CTE:
;with cteCommonAddresses as (
select left(place_name, 4) as LeftName, left(place_street,4) as LeftStreet
from Address
group by left(place_name, 4), left(place_street,4)
having count(*) > 1
)
select a.id, a.place_name, a.place_street
from cteCommonAddresses c
inner join Address a
on c.LeftName = left(a.place_name,4)
and c.LeftStreet = left(a.place_street,4)
order by a.place_name, a.place_street, a.id

SQL magic - query shouldn't take 15 hours, but it does

OK, so I have one really monstrous MySQL table (900k records, 180 MB total), and I want to extract the records with the highest date_updated from each subgroup and calculate a weighted average within each group. The calculation runs for ~15 hours, and I have a strong feeling I'm doing it wrong.
First, monstrous table layout:
category
element_id
date_updated
value
weight
source_prefix
source_name
The only key here is on element_id (BTREE, ~8k unique elements).
And the calculation process:
Make a hash for each group and subgroup.
CREATE TEMPORARY TABLE `temp1` (INDEX ( `ds_hash` ))
SELECT `category`,
`element_id`,
`source_prefix`,
`source_name`,
`date_updated`,
`value`,
`weight`,
MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28'
I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, I presume.
Find maximum date for each subgroup
CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
SELECT MAX(`date_updated`) AS `maxdate` , `subcat_hash`
FROM `temp1`
GROUP BY `subcat_hash`;
Join temp1 with temp2 to find weighted average values for categories
CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
SELECT `temp1`.`element_id`,
`temp1`.`category`,
`temp1`.`source_prefix`,
`temp1`.`source_name`,
`temp1`.`date_updated`,
AVG(`temp1`.`value`) AS `avg_value`,
SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
AND `temp1`.`date_updated` = `temp2`.`maxdate`
GROUP BY `temp1`.`cat_hash`;
(Now that I've looked through it and written it all down, it seems to me that I should use an INNER JOIN in that last query, to avoid a 900k × 900k temp table.)
Still, is there a normal way to do so?
UPD: some picture for reference:
removed dead ImageShack link
UPD: EXPLAIN for proposed solution:
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | cur | ALL | NULL | NULL | NULL | NULL | 893085 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | next | ref | prefix | prefix | 1074 | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id | 1 | 100.00 | Using where |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join, and this certainly doesn't look like one of those cases: a 900k-row table with some aggregates.
Based on your comment, this query might do what you are looking for:
SELECT cur.source_prefix,
cur.source_name,
cur.category,
cur.element_id,
MAX(cur.date_updated) AS DateUpdated,
AVG(cur.value) AS AvgValue,
SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
ON next.date_updated < '2009-05-01'
AND next.source_prefix = cur.source_prefix
AND next.source_name = cur.source_name
AND next.element_id = cur.element_id
AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name,
cur.category, cur.element_id
The GROUP BY performs the calculations per source+category+element.
The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE statement filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
There are many ways of filtering out old entries, but this one tends to perform reasonably well.
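A sketch of that index (the name is illustrative; the EXPLAIN output above shows an index named prefix being used for this join):
CREATE INDEX prefix ON bigbigtable (source_prefix, source_name, element_id, date_updated);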
OK, so 900k rows isn't a massive table; it's reasonably big, but your queries really shouldn't be taking that long.
First things first, which of the 3 statements above is taking the most time?
The first problem I see is with your first query: its WHERE clause doesn't include an indexed column, which means it has to do a full scan of the entire table.
Create an index on the date_updated column, then run the query again and see what that does for you.
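A minimal sketch of that index (the index name is illustrative):
CREATE INDEX idx_date_updated ON bigbigtable (date_updated);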
If you don't need the hashes and are only using them to avail of the dark magic, then remove them completely.
Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.
Edit: My SQL is a little rusty, but are you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be:
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
or
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1, temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;