Pig Latin: Filter number <5 and >= 5 in chararray (text and numbers) - apache-pig

How can I filter or group the records into those with less than 5 years of experience and those with 5 or more years? I am very new to Pig Latin. The ID, e.g. BUS2003, should be left as is.
Input Data
ID,Experience
BUS2003,More than 17 years teaching experience
BUS1303,2 years teaching experience
BUS4543,13 plus years of teaching experience; 4 plus years of corporate experience
BUS2103,4 year + 6 years in business
BUS2913,8 yrs teaching experience
I know how to load the data with PigStorage or CSVLoader; however, I am having a hard time parsing the Experience field because words and numbers are mixed together.
Desired result:
**Less than five years**
BUS1303,2 years teaching experience
BUS2103,4 year + 6 years in business
**Equal or greater than five years**
BUS2003,More than 17 years teaching experience
BUS4543,13 plus years of teaching experience; 4 plus years of corporate experience
BUS2913,8 yrs teaching experience
Thanks in advance.

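You'll have to extract the number and then split on it. This should get you what you are looking for; it uses the first number found in each Experience value, which is why BUS2103 ("4 year + 6 years in business") lands in the less-than-five group:
A = LOAD 'input.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray);
-- Take the first run of digits (\d+, not \d*, which can match an empty string) and cast it to int
B = FOREACH A GENERATE a1, a2, (int)REGEX_EXTRACT(a2, '(\\d+)', 1) AS years;
-- SPLIT is a statement of its own; rows with no number satisfy neither condition and are dropped
SPLIT B INTO C1 IF years < 5, C2 IF years >= 5;
DUMP C1;
DUMP C2;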
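To check the extract-then-split logic outside of Pig, here is a minimal Python sketch of the same idea. It is only an illustration: the function name first_number and the two result lists are made up for this example, and the rows are the sample data from the question.

```python
import re

# Sample rows from the question: (ID, Experience)
rows = [
    ("BUS2003", "More than 17 years teaching experience"),
    ("BUS1303", "2 years teaching experience"),
    ("BUS4543", "13 plus years of teaching experience; 4 plus years of corporate experience"),
    ("BUS2103", "4 year + 6 years in business"),
    ("BUS2913", "8 yrs teaching experience"),
]


def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


less_than_five = []
five_or_more = []
for record_id, experience in rows:
    years = first_number(experience)
    if years is None:
        continue  # no number found, so the row belongs to neither group
    if years < 5:
        less_than_five.append(record_id)
    else:
        five_or_more.append(record_id)

print(less_than_five)
print(five_or_more)
```

Only the first number in the text is considered, so a value such as "4 year + 6 years in business" is classified by the 4.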

Related

Is there a way to set a condition on the SQL code below? I only want to show up until the end year but am showing the start year + 5 years right now

Good morning everyone. I have this SQL code that I am using in Access to show a table of total hours per year on a form. The user has the option of choosing up to 6 years, which means they could also choose just 1, 2, 3, 4 or 5 years. I only want to show the years up to the end year, not the full 6 if fewer than 6 are selected. The way the SQL is written right now, it takes the start year and adds 1 to it until it reaches year 6, and displays them all. I have a variable for the end year as well, and was thinking there might be a way to say "if start year + number is greater than end year, then stop", but I am not sure how to achieve that. Any help would be appreciated.
code:
SELECT
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].EstimateID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].FY,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].YearID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].SumOfLaborHours
FROM qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear
WHERE (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.EstimateID)=Forms!frm_Home.Form.cboEstimate.Value))
And (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.YearID)=Forms!frm_Home.Form.cboStartYear.Value))
Or (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.EstimateID)=Forms!frm_Home.Form.cboEstimate.Value))
And (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.YearID)=Forms!frm_Home.Form.cboStartYear.Value+1))
Or (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.EstimateID)=Forms!frm_Home.Form.cboEstimate.Value))
And (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.YearID)=Forms!frm_Home.Form.cboStartYear.Value+2))
Or (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.EstimateID)=Forms!frm_Home.Form.cboEstimate.Value))
And (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.YearID)=Forms!frm_Home.Form.cboStartYear.Value+3))
Or (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.EstimateID)=Forms!frm_Home.Form.cboEstimate.Value))
And (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.YearID)=Forms!frm_Home.Form.cboStartYear.Value+4))
Or (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.EstimateID)=Forms!frm_Home.Form.cboEstimate.Value))
And (((qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.YearID)=Forms!frm_Home.Form.cboStartYear.Value+5))
ORDER BY [qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].EstimateID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].FY,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].YearID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].SumOfLaborHours;
Try something like this:
SELECT
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].EstimateID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].FY,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].YearID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].SumOfLaborHours
FROM
qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear
WHERE
qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.EstimateID = Forms!frm_Home!cboEstimate.Value
AND
(qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear.YearID Between
Forms!frm_Home!cboStartYear.Value And
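Forms!frm_Home!cboStartYear.Value + Forms!frm_Home!YearsToSelect.Value - 1)
ORDER BY
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].EstimateID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].FY,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].YearID,
[qry_LaborLineItems_TotalLaborHoursPerEstimatePerYear].SumOfLaborHours;
where Forms!frm_Home!YearsToSelect.Value holds the number of years the user chose. Between is inclusive at both ends, hence the - 1: a start YearID of 1 with 3 years selected returns YearID 1 through 3. If the form already holds the end year in a control, use that control as the upper bound of Between instead.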
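The bounds arithmetic is easy to get wrong by one, so here is a small Python sketch of it. Because Between includes both endpoints, a selection of N years starting at a given YearID ends at start + N - 1. The function name year_id_bounds and the hours figures are invented for this illustration.

```python
def year_id_bounds(start_year_id, years_to_select):
    """Return the inclusive (first, last) YearID bounds for the user's selection."""
    if years_to_select < 1:
        raise ValueError("years_to_select must be at least 1")
    return start_year_id, start_year_id + years_to_select - 1


# Made-up totals per YearID, standing in for the query's rows
hours_by_year_id = {1: 120, 2: 95, 3: 140, 4: 80, 5: 60, 6: 30}

first, last = year_id_bounds(2, 3)
# Same test as "YearID Between first And last"
selected = {y: h for y, h in hours_by_year_id.items() if first <= y <= last}
print(first, last, selected)
```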

how to calculate prevalence with given data [duplicate]

This question already has answers here:
how to calculate prevalence using sql code
(3 answers)
Closed 3 years ago.
I have a dataset called 'disease' and I would like to calculate the prevalence of a certain disease from it.
I have two columns, 'disease_id' and 'person_id'. I know the total number of samples is 1453477, and I am going to filter for a particular disease with a 'where' clause on 'disease_id'.
Say I want to calculate the prevalence of diabetes, and 851415 people have that condition.
The answer should be 851415/1453477=0.58577, but when I run the query below I keep getting 1.
select count(disease_id) / count(person_id) as prevalence
from disease
where disease_id=12345;
I know that the disease_id for diabetes is 12345 and that 851415 people have diabetes, so 'count(disease_id)' should be 851415 and 'count(person_id)' should be 1453477.
Can someone please help me with this?
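The where clause filters the rows before either count is computed, so both counts see the same 851415 rows and the ratio is 1. Compute the two counts in separate subqueries, and cast one of them to decimal so the division is not done in integer arithmetic:
select
cast (
(
select
count(disease_id)
from
disease
where
disease_id = 12345
) as decimal
) / (select count(person_id) from disease) as prevalence;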
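To see why the filtered query always yields 1, here is a small sketch using Python's sqlite3 module with a handful of made-up rows (8 people, 3 of them with disease_id 12345). It is an illustration under assumptions, not the asker's database: the data is invented, and the cast uses REAL because in SQLite a cast to DECIMAL keeps an integer value as an integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disease (person_id INTEGER, disease_id INTEGER)")
# Made-up sample: persons 1-3 have disease 12345, persons 4-8 have something else
conn.executemany(
    "INSERT INTO disease VALUES (?, ?)",
    [(1, 12345), (2, 12345), (3, 12345), (4, 111), (5, 111), (6, 222), (7, 222), (8, 333)],
)

# The original shape: WHERE filters first, so both counts see the same rows
naive = conn.execute(
    "SELECT COUNT(disease_id) / COUNT(person_id) FROM disease WHERE disease_id = 12345"
).fetchone()[0]

# Separate subqueries: filtered count over unfiltered count, with a floating-point cast
prevalence = conn.execute(
    """
    SELECT CAST((SELECT COUNT(disease_id) FROM disease WHERE disease_id = 12345) AS REAL)
           / (SELECT COUNT(person_id) FROM disease)
    """
).fetchone()[0]

print(naive, prevalence)
```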

Select a range of numbers not in a table in SQL [duplicate]

This question already has answers here:
What is the best way to create and populate a numbers table?
(12 answers)
Closed 4 years ago.
I am wondering if it is possible to get a query that will take a range of numbers, in this case 8 to 17, compare it against a field in a table and remove the ones that do appear in the table and return the rest?
I assume the pseudo code would look something like
Select nums from range(8-17) where nums not in (select column from table)
Is this possible at all?
Edit
To clarify my question.
In the table I might have the following:
Intnumber
9
10
16
I would like to have the numbers between 8-17 that do not appear in this table, so 8,11,12,13,14,15,17
Kind regards
Matt
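This needs a source of numbers to select from. Assuming a helper numbers table numbers(n) holding consecutive integers, as in the linked question (the names numbers, n and your_table are placeholders, not part of the original schema):
select n from numbers where n between 8 and 17 and n not in (select Intnumber from your_table);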
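One way to get the missing numbers without a permanent numbers table is to generate the range 8 to 17 on the fly. The sketch below does this with a recursive common table expression in SQLite, driven from Python's sqlite3 module. The table name used_numbers is a stand-in for the asker's table, and recursive CTE syntax varies between database engines, so treat it as an illustration rather than a drop-in query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE used_numbers (Intnumber INTEGER)")
# Values from the question's example
conn.executemany("INSERT INTO used_numbers VALUES (?)", [(9,), (10,), (16,)])

query = """
WITH RECURSIVE nums(n) AS (
    SELECT 8                               -- start of the range
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 17    -- stop after 17
)
SELECT n
FROM nums
WHERE n NOT IN (SELECT Intnumber FROM used_numbers)
ORDER BY n
"""

missing = [row[0] for row in conn.execute(query)]
print(missing)  # [8, 11, 12, 13, 14, 15, 17]
```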

Db restructure due to performance issue

Currently I have around 250 clients, each with 5 years of data, and the tables are split by year. For example, for a client named XX:
T00_XX_2011, T00_XX_2012, T00_XX_2013, T00_XX_2014. Each table contains 220 columns and roughly 10 million records, and 12 of the columns already have indexes.
The issue is that a single select query takes around 5 to 10 minutes. Can anyone help me tune the performance?

Group by on table, list all of one attribute for each of another attribute [duplicate]

This question already has an answer here:
Is there any function in oracle similar to group_concat in mysql? [duplicate]
(1 answer)
Closed 9 years ago.
I'm attempting to group this data by name so that instead of current query output giving me:
Name Number
Nice guy 1
Nice guy 2
Nice guy 4
Nice guy 5
Nice guy 6
Nice guy 7
Nice guy 8
Nice guy 9
Nice guy 10
Nice guy 11
Nice guy 12
Frank 3
Frank 4
I would get this:
Name Number
Nice guy 1,2,4,...
Frank 3,4
Here is my current query:
select distinct name, number
from patterns,numbers,people
where patterns.index=numbers.index
AND patterns.id=people.id
order by name, charge;
What I have tried is this, but it fails:
select distinct name, number
from patterns,numbers,people
where patterns.index=numbers.index
AND patterns.id=people.id
group by name
order by name, number;
Any help would be greatly appreciated!
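UPDATED: Try it this way. Note that WM_CONCAT is an undocumented Oracle function and is no longer available from Oracle 12c on; on 11g Release 2 and later, LISTAGG(number, ',') WITHIN GROUP (ORDER BY number) is the documented alternative.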
SELECT name, WM_CONCAT(number) number
FROM
(
SELECT DISTINCT name, number
FROM patterns t JOIN numbers n
ON t.index = n.index JOIN people p
ON t.id = p.id
) q
GROUP BY name
ORDER BY name
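The same shape (de-duplicate in a subquery, then aggregate the values per name into one string) can be tried out with Python's sqlite3 module, where GROUP_CONCAT plays the role of the Oracle aggregate. This is a sketch under assumptions: the single pairs table stands in for the three joined tables, its rows are made up, and SQLite does not promise a particular order of values inside the concatenated string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (name TEXT, number INTEGER)")
# Made-up rows, including duplicates that DISTINCT should remove
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [
        ("Nice guy", 1), ("Nice guy", 2), ("Nice guy", 2), ("Nice guy", 4),
        ("Frank", 3), ("Frank", 4), ("Frank", 4),
    ],
)

query = """
SELECT name, GROUP_CONCAT(number) AS numbers
FROM (
    SELECT DISTINCT name, number FROM pairs   -- one row per (name, number)
) q
GROUP BY name
ORDER BY name
"""

# Maps each name to its comma-separated list of numbers
result = {name: numbers for name, numbers in conn.execute(query)}
for name, numbers in result.items():
    print(name, numbers)
```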