Calculate percentage on boolean column - apache-pig

Assuming my data has the following structure :
Year | Location | New_client
2018 | Paris | true
2018 | Paris | true
2018 | Paris | false
2018 | London | true
2018 | Madrid | true
2018 | Madrid | false
2017 | Paris | true
I'm trying to calculate for each year and location the percentage of true value for New_client, so an example taking the records from the structure example would be
2018 | Paris | 66
2018 | London | 100
2018 | Madrid | 50
2017 | Paris | 100
Adapting from https://stackoverflow.com/a/13484279/2802552, my current script is below. The difference is that I need to group by 2 columns (Year and Location) instead of 1.
inpt = load...
grp = group inpt by Year; -- creates bags for each value in col1 (Year)
result = FOREACH grp {
total = COUNT(inpt);
t = FILTER inpt BY New_client == 'true'; --create a bag which contains only T values
GENERATE FLATTEN(group) AS Year, total AS TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};
The problem is that this groups by Year only, while I need it to group by Year AND Location.
Thanks for your help.

You need to group by both Year and Location, which will require two modifications. First, add Location to the group by statement. Second, change FLATTEN(group) AS Year to FLATTEN(group) AS (Year, Location) since group is now a tuple with two fields.
grp = group inpt by (Year, Location);
result = FOREACH grp {
total = COUNT(inpt);
t = FILTER inpt BY New_client == 'true';
GENERATE
FLATTEN(group) AS (Year, Location),
total AS TOTAL_ROWS_IN_INPUT_TABLE,
100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};

I tested this code and it works for me:
A = LOAD ...
B = GROUP A BY (year, location);
C = FOREACH B {
TRUE_CNT = FILTER A BY (chararray)new_client == 'true';
GENERATE group.year, group.location, (int)((float)COUNT(TRUE_CNT) / COUNT(A) * 100);
};
DUMP C;
(2017,Paris,100)
(2018,Paris,66)
(2018,London,100)
(2018,Madrid,50)
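For readers who want to check the arithmetic outside Pig, here is a minimal Python sketch of the same group-and-divide logic. It is not part of either answer: the `rows` list simply restates the sample data from the question, and `percent_true` is a hypothetical helper name chosen for illustration.

```python
from collections import defaultdict

# Sample rows from the question: (year, location, new_client)
rows = [
    (2018, "Paris", True),
    (2018, "Paris", True),
    (2018, "Paris", False),
    (2018, "London", True),
    (2018, "Madrid", True),
    (2018, "Madrid", False),
    (2017, "Paris", True),
]

def percent_true(rows):
    """Return {(year, location): truncated integer percentage of True values}."""
    totals = defaultdict(int)  # number of rows per (year, location)
    trues = defaultdict(int)   # number of rows with new_client == True per group
    for year, location, new_client in rows:
        key = (year, location)
        totals[key] += 1
        if new_client:
            trues[key] += 1
    # Integer division truncates, like the (int) cast in the Pig script above
    return {key: 100 * trues[key] // totals[key] for key in totals}

result = percent_true(rows)
```

The key is the (year, location) pair, which plays the same role as the two-field `group` tuple in the Pig scripts.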

Related

Distinct values between two SQL queries

I want to be able to find any potential differences between data inputted on one day to another.
The relevant col. from the table are seen below:
Name | Size | DateSale | Location | Comments | Date
The two current queries are:
Select Name, Size, DateSale, Location, Comments from [Table] where Date = '06/02/2022'
Select Name, Size, DateSale, Location, Comments from [Table] where Date = '06/01/2022'
How would I come up with a list of values that are different from these two lists? Tried working with select distinct but could not figure it out.
Sample Data:
Name | Size | DateSale | Location | Comments | Date
john | 100 |06/05/2022| Houston | proj. | 06/02/2022
john | 100 |06/04/2022| Dallas | | 06/01/2022
jake | 90 |06/04/2022| Houston | proj. | 06/02/2022
jake | 90 |06/04/2022| Houston | proj. | 06/01/2022
Desired Result:
john | 100 |06/05/2022| Houston | proj. | 06/02/2022
Since the keys (Name + Size) are the same, but there are differences in the other columns (DateSale, Location, or Comments), it will return the row from the first query (most recent date).
SELECT y.*
FROM (SELECT * FROM [Table] WHERE Date = '06/01/2022') AS x,
(SELECT * FROM [Table] WHERE Date = '06/02/2022') AS y
WHERE x.Name = y.Name
AND x.Size = y.Size
AND (x.DateSale != y.DateSale OR x.Location != y.Location OR x.Comments != y.Comments)
This solution worked for me
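A quick way to try this self-join on the sample data is an in-memory SQLite database driven from Python. This is only a sketch under assumptions: the table name `sales` is made up, the two dates are the ones from the question, and the empty Comments value is stored as an empty string rather than NULL (with NULL, the `!=` comparison on Comments would not evaluate to true by itself).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; columns follow the question
conn.execute("""
    CREATE TABLE sales (
        Name TEXT, Size INTEGER, DateSale TEXT,
        Location TEXT, Comments TEXT, Date TEXT
    )
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("john", 100, "06/05/2022", "Houston", "proj.", "06/02/2022"),
        ("john", 100, "06/04/2022", "Dallas", "", "06/01/2022"),
        ("jake", 90, "06/04/2022", "Houston", "proj.", "06/02/2022"),
        ("jake", 90, "06/04/2022", "Houston", "proj.", "06/01/2022"),
    ],
)

# Pair each older row (x) with the newer row (y) that shares Name and Size,
# then keep the newer row only when some other column changed.
changed = conn.execute("""
    SELECT y.*
    FROM sales AS x
    JOIN sales AS y
      ON x.Name = y.Name AND x.Size = y.Size
    WHERE x.Date = '06/01/2022'
      AND y.Date = '06/02/2022'
      AND (x.DateSale != y.DateSale
           OR x.Location != y.Location
           OR x.Comments != y.Comments)
""").fetchall()
```

With the sample rows, only john's 06/02/2022 record differs from the previous day, so it is the only row in `changed`.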

Dynamic Column Names in BigQuery SQL Query

I have a BigQuery table in which every row is a visit of a user in a country. The schema is something like this:
UserID | Place | StartDate | EndDate | etc ...
---------------------------------------------------------------
134 | Paris | 234687432 | 23648949 | etc ...
153 | Bangkok | 289374897 | 2348709 | etc ...
134 | Paris | 9287324892 | 3435438 | etc ...
The values of the "Place" columns can be no more than tens of options, but I don't know them all in advance.
I want to query this table so that in the resulted table the columns are named as all the possibilities of the Place column, and the values are the total number of visits per user in this place.
The end result should look like this:
UserID | Paris | Bangkok | Rome | London | Rivendell | Alderaan
----------------------------------------------------------------
134 | 2 | 0 | 0 | 0 | 0 | 0
153 | 0 | 1 | 0 | 0 | 0 | 0
I guess I can select all the possible values of "Place" with SELECT DISTINCT but how can I achieve this structure of result table?
Thanks
Below is for BigQuery Standard SQL
Step 1 - dynamically assemble proper SQL statement with all possible values of "place" field
#standardSQL
SELECT '''
SELECT UserID,''' || STRING_AGG(DISTINCT
' COUNTIF(Place = "' || Place || '") AS ' || REPLACE(Place, ' ', '_')
) || ''' FROM `project.dataset.table`
GROUP BY UserID
'''
FROM `project.dataset.table`
Note: you will get a one-row output with text like the one below (already split into multiple rows for better reading)
SELECT UserID,
COUNTIF(Place = "Paris") AS Paris,
COUNTIF(Place = "Los Angeles") AS Los_Angeles
FROM `project.dataset.table`
GROUP BY UserID
Note: I replaced Bangkok with Los Angeles so you can see why it is important to replace possible spaces with underscores
Step 2 - just copy output text of Step 1 and simply run it
Obviously you can automate above two steps using any client of your choice
If you just want to count the places, you can use countif():
select userid,
countif(place = 'Paris') as paris,
countif(place = 'Bangkok') as bangkok,
countif(place = 'Rome') as rome,
. . .
from t
group by userid;
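The two-step idea (build the statement from the distinct places, then run it) can be automated from a client. Below is a hedged Python sketch that uses an in-memory SQLite database as a stand-in for BigQuery. SQLite has no COUNTIF, so `SUM(Place = '...')` is swapped in; the table name `visits` and the helpers `quote_literal` and `quote_ident` are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (UserID INTEGER, Place TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(134, "Paris"), (153, "Los Angeles"), (134, "Paris")],
)

def quote_literal(value):
    # Escape single quotes so the place can be embedded as a SQL string literal
    return "'" + value.replace("'", "''") + "'"

def quote_ident(name):
    # Quote the generated column alias as an identifier
    return '"' + name.replace('"', '""') + '"'

# Step 1: read the distinct places and assemble one counting column per place
places = [row[0] for row in conn.execute(
    "SELECT DISTINCT Place FROM visits ORDER BY Place")]
columns = ", ".join(
    f"SUM(Place = {quote_literal(p)}) AS {quote_ident(p.replace(' ', '_'))}"
    for p in places
)
pivot_sql = f"SELECT UserID, {columns} FROM visits GROUP BY UserID ORDER BY UserID"

# Step 2: run the generated statement
cursor = conn.execute(pivot_sql)
header = [d[0] for d in cursor.description]
pivot_rows = cursor.fetchall()
```

Because the place names end up inside generated SQL text, the quoting helpers matter; a place containing a quote character would otherwise break the statement.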

Access sql to retrieve counts of values meeting a condition

I'm trying to write a query in Access that will return a count of values for each site in a table where the value exceeds a specified level, but also, for sites that have no values exceeding that level, return a specified value, such as "NA".
I've tried Iif, Switch, Union, sub queries, querying a different query, but no luck. I can get all the counts exceeding the level, or all sites with "NA" correct but showing total count for the rest, not just count above the level.
For example, in the table below, assuming level > 10, Houston = "NA", Detroit = 2, Pittsburgh PA = 3. I just can't get both sides of the query to work.
Apologize in advance for poor formatting.
+------------+-------+
| Site       | Value |
+------------+-------+
| Houston    | 10    |
| Houston    | 3     |
| Houston    | 0     |
| Detroit    | 15    |
| Detroit    | 7     |
| Detroit    | 4     |
| Detroit    | 12    |
| Pittsburgh | 23    |
| Pittsburgh | 2     |
| Pittsburgh | 18    |
| Pittsburgh | 12    |
+------------+-------+
Another solution is to use conditional aggregation, as follows :
SELECT site, SUM(IIf(value > 10, 1, 0)) AS value
FROM mytable
GROUP BY site
This approach should be more efficient than self-joining the table, since it only needs to scan the table once.
The SUM(IIf ...) is a handy construct to count how many records satisfy a given condition.
NB: it is generally not a good idea to return two different data types in the same column (in your use case, either a number or the string 'NA'). Most RDBMS do not allow that. So I provided a query that returns 0 when there are no matches, instead of NA. If you really want 'NA', you can try:
IIF(
SUM(IIf(value > 10, 1, 0)) = 0,
'NA',
STR(SUM(IIf(value > 10, 1, 0)))
) AS value
This demo on DB Fiddle, with your sample data, returns:
site | value
:--------- | ----:
Detroit | 2
Houston | 0
Pittsburgh | 3
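To experiment with conditional aggregation without Access, the same idea can be run against an in-memory SQLite database from Python. This is a sketch, not the answerer's demo: SQLite is a stand-in engine, `CASE WHEN` replaces Access's `IIf`, and `CAST(... AS TEXT)` replaces `STR`. The table name `mytable` follows the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (site TEXT, value INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [
    ("Houston", 10), ("Houston", 3), ("Houston", 0),
    ("Detroit", 15), ("Detroit", 7), ("Detroit", 4), ("Detroit", 12),
    ("Pittsburgh", 23), ("Pittsburgh", 2), ("Pittsburgh", 18), ("Pittsburgh", 12),
])

# Numeric version: every site appears, with 0 when nothing exceeds the level
counts = conn.execute("""
    SELECT site, SUM(CASE WHEN value > 10 THEN 1 ELSE 0 END) AS cnt
    FROM mytable
    GROUP BY site
    ORDER BY site
""").fetchall()

# Text version: the count is cast to text so 'NA' can share the column
with_na = conn.execute("""
    SELECT site,
           CASE WHEN SUM(CASE WHEN value > 10 THEN 1 ELSE 0 END) = 0 THEN 'NA'
                ELSE CAST(SUM(CASE WHEN value > 10 THEN 1 ELSE 0 END) AS TEXT)
           END AS cnt
    FROM mytable
    GROUP BY site
    ORDER BY site
""").fetchall()
```

Houston's largest value is exactly 10, which does not exceed the level, so it gets 0 in the first query and 'NA' in the second.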
Get a list of all sites independent of the counts (the SiteList derived table below).
LEFT JOIN this back to your base table (SiteValues) to get the counts for each site where the value meets the threshold. Note: the join should be on the table's key, which I'm not sure of here; Site alone isn't enough.
Count the values from the SiteValues side; NULLs are not counted, so sites with no match get 0.
WORKING DEMO:
SELECT SiteList.Site, Count(Sitevalues.Site)
FROM (SELECT site, value
FROM TableName) SiteList
LEFT JOIN TableName SiteValues
on SiteList.Site = SiteValues.Site
and SiteValues.Value > 10
and SiteValues.Value = SiteList.value
GROUP BY SiteList.Site
GIVING US:
+----+------------+------------------+
| | Site | (No column name) |
+----+------------+------------------+
| 1 | Detroit | 2 |
| 2 | Houston | 0 |
| 3 | Pittsburgh | 3 |
+----+------------+------------------+
Or if you need the NA you have to cast the count to a varchar
SELECT SiteList.Site, case when Count(Sitevalues.Site) = 0 then 'NA' else cast(count(Sitevalues.site) as varchar(10)) end as SitesMeetingThreshold
FROM (SELECT site, value
FROM TableName) SiteList
LEFT JOIN TableName SiteValues
on SiteList.Site = SiteValues.Site
and SiteValues.Value > 10
and SiteValues.Value = SiteList.value
GROUP BY SiteList.Site
Just use conditional aggregation:
select site,
sum(iif(value > 10, 1, 0)) as cnt_11plus
from t
group by site;
I think 0 is better than N/A. But if you want that you'll need to convert the results to a string.
select site,
iif(sum(iif(value > 10, 1, 0)) > 0,
str(sum(iif(value > 10, 1, 0))),
"N/A"
) as cnt_11plus
from t
group by site;
You can use UNION like this:
SELECT site, count(value) AS counter
FROM sites
WHERE value > 10
GROUP BY site
UNION
SELECT s.site, 'NA' AS counter
FROM sites AS s
WHERE value <= 10
AND NOT EXISTS (
SELECT 1 FROM sites WHERE site = s.site AND value > 10
)
GROUP BY site
Results:
site counter
Detroit 2
Houston NA
Pittsburgh 3
There is no need to convert the integer counter to Text, because Access does this implicitly for you.

Remove all records with opposite sign

I'm looking for a SQL query (or even better a LINQ query) to remove people who have cancelled their leave, i.e. remove all records with the same NAME and same START and END and the DAYS_TAKEN values differ only in the sign.
How to get from this
NAME |DAYS_TAKEN |START |END |UNIQUE_LEAVE_ID
--------|-----------|-----------|-----------|-----------
Alice | 2 | 1 June | 3 June | 1 --remove because cancelled
Alice | -2 | 1 June | 3 June | 2 --cancelled
Alice | 3 | 5 June | 8 June | 3 --keep
Bob | 10 | 4 June | 14 June | 4 --keep
Charles | 12 | 2 June | 14 June | 5 --remove because cancelled
Charles | -12 | 2 June | 14 June | 6 --cancelled
David | 5 | 3 June | 8 June | 7 --keep
To this?
NAME |DAYS_TAKEN |START |END |UNIQUE_LEAVE_ID
--------|-----------|-----------|-----------|-----------
Alice | 3 | 5 June | 8 June | 3 --keep
Bob | 10 | 4 June | 14 June | 4 --keep
David | 5 | 3 June | 8 June | 7 --keep
What I've tried
Query1 to find all the cancelled records (not sure if this is correct)
SELECT L1.UNIQUE_LEAVE_ID
FROM LEAVE L1
INNER JOIN LEAVE L2 ON L2.DAYS_TAKEN > 0 AND ABS(L1.DAYS_TAKEN) = L2.DAYS_TAKEN AND L1.NAME= L2.NAME AND L1.START = L2.START AND L1.END = L2.END
WHERE L1.DAYS_TAKEN < 0
Then I use Query1 twice in an inner select like so
SELECT L.* FROM LEAVE L WHERE
L.UNIQUE_LEAVE_ID NOT IN (Query1)
AND L.UNIQUE_LEAVE_ID NOT IN (Query1)
Is there a way to use the inner query only once?
(It's an Oracle database, being called from .NET/C#)
You can use a query like the following:
SELECT NAME, START, END
FROM LEAVE
GROUP BY NAME, START, END
HAVING SUM(DAYS_TAKEN) = 0
in order to get NAME, START, END groups that have been cancelled (assuming DAYS_TAKEN of the cancellation record negates the days of the initial record).
Output:
NAME |START |END
--------|-----------|----------
Alice | 1 June | 3 June
Charles | 2 June | 14 June
Using the above query as a derived table you can get records not being related to 'cancelled' groups:
SELECT L1.NAME, L1.DAYS_TAKEN, L1.START, L1.END, L1.UNIQUE_LEAVE_ID
FROM LEAVE L1
LEFT JOIN (
SELECT NAME, START, END
FROM LEAVE
GROUP BY NAME, START, END
HAVING SUM(DAYS_TAKEN) = 0
) L2 ON L1.NAME = L2.NAME AND L1.START = L2.START AND L1.END = L2.END
WHERE L2.NAME IS NULL
Output:
NAME |DAYS_TAKEN |START |END |UNIQUE_LEAVE_ID
--------|-----------|-----------|-----------|-----------
Alice | 3 | 5 June | 8 June | 3
Bob | 10 | 4 June | 14 June | 4
David | 5 | 3 June | 8 June | 7
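The derived-table approach can be tried on the sample data with an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite stands in for Oracle, and the columns are renamed to `start_date` and `end_date` (names invented here) to avoid quoting the reserved words START and END.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE leave (
        name TEXT, days_taken INTEGER,
        start_date TEXT, end_date TEXT, unique_leave_id INTEGER
    )
""")
conn.executemany("INSERT INTO leave VALUES (?, ?, ?, ?, ?)", [
    ("Alice", 2, "1 June", "3 June", 1),
    ("Alice", -2, "1 June", "3 June", 2),
    ("Alice", 3, "5 June", "8 June", 3),
    ("Bob", 10, "4 June", "14 June", 4),
    ("Charles", 12, "2 June", "14 June", 5),
    ("Charles", -12, "2 June", "14 June", 6),
    ("David", 5, "3 June", "8 June", 7),
])

# The inner query finds (name, start, end) groups whose days sum to zero,
# i.e. cancelled leave; the LEFT JOIN ... IS NULL keeps everything else.
kept = conn.execute("""
    SELECT l1.name, l1.days_taken, l1.unique_leave_id
    FROM leave AS l1
    LEFT JOIN (
        SELECT name, start_date, end_date
        FROM leave
        GROUP BY name, start_date, end_date
        HAVING SUM(days_taken) = 0
    ) AS l2
      ON l1.name = l2.name
     AND l1.start_date = l2.start_date
     AND l1.end_date = l2.end_date
    WHERE l2.name IS NULL
    ORDER BY l1.unique_leave_id
""").fetchall()
```

The inner query appears only once, which is what the question asked for.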
You can use not exists:
select l.*
from leave l
where not exists (select 1
from leave l2
where l2.name = l.name and l2.start = l.start and
l2.end = l.end and l2.days_taken = - l.days_taken
);
This query can take advantage of an index on leave(name, start, end, days_taken).
Here is a variation with SUM() OVER:
SELECT x.*
FROM (SELECT l.*, SUM (days_taken) OVER (PARTITION BY name, "START", "END", ABS (days_taken) ORDER BY NULL) s
FROM leave l) x
WHERE s <> 0
And if you have Oracle 12, this gives you the cancelled records:
SELECT l.*
FROM leave l,
LATERAL (SELECT days_taken
FROM leave l2
WHERE l2.name = l.name
AND l2."START" = l."START"
AND l2."END" = l."END"
AND l2.days_taken = -l.days_taken) x
and this gives what should remain:
SELECT l.*
FROM leave l
OUTER APPLY (SELECT days_taken
FROM leave l2
WHERE l2.name = l.name
AND l2."START" = l."START"
AND l2."END" = l."END"
AND l2.days_taken = -l.days_taken) x
WHERE x.days_taken IS NULL
A note about the column names: using reserved words as identifiers in Oracle SQL is not recommended, but if you must, wrap them in double quotes (") as shown here.
I used Giorgos's answer to come up with this LINQ solution. This solution also considers people who cancel / apply for their leave multiple times. See Alice and Edgar below.
Sample data
int id = 0;
List<Leave> allLeave = new List<Leave>()
{
new Leave() { UniqueLeaveID=id++, Name="Alice", Start=new DateTime(2016,6,1), End=new DateTime(2016,6,3), Taken=-2 },
new Leave() { UniqueLeaveID=id++,Name="Alice", Start=new DateTime(2016,6,1), End=new DateTime(2016,6,3), Taken=2 },
new Leave() { UniqueLeaveID=id++, Name="Alice", Start=new DateTime(2016,6,1), End=new DateTime(2016,6,3), Taken=2 },
new Leave() { UniqueLeaveID=id++,Name="Alice", Start=new DateTime(2016,6,3), End=new DateTime(2016,6,5), Taken=3 },
new Leave() { UniqueLeaveID=id++,Name="Bob", Start=new DateTime(2016,6,4), End=new DateTime(2016,6,14), Taken=10 },
new Leave() { UniqueLeaveID=id++,Name="Charles", Start=new DateTime(2016,6,2), End=new DateTime(2016,6,14), Taken=12 },
new Leave() { UniqueLeaveID=id++,Name="Charles", Start=new DateTime(2016,6,2), End=new DateTime(2016,6,14), Taken=-12 },
new Leave() { UniqueLeaveID=id++,Name="David", Start=new DateTime(2016,6,3), End=new DateTime(2016,6,8), Taken=5 },
new Leave() { UniqueLeaveID=id++,Name="Edgar", Start=new DateTime(2016,6,3), End=new DateTime(2016,6,8), Taken=5 },
new Leave() { UniqueLeaveID=id++,Name="Edgar", Start=new DateTime(2016,6,3), End=new DateTime(2016,6,8), Taken=5 },
new Leave() { UniqueLeaveID=id++,Name="Edgar", Start=new DateTime(2016,6,3), End=new DateTime(2016,6,8), Taken=5 },
new Leave() { UniqueLeaveID=id++,Name="Edgar", Start=new DateTime(2016,6,3), End=new DateTime(2016,6,8), Taken=5 }
};
Linq Query (watch out for Oracle version 11 vs 12)
var filteredLeave = allLeave
.GroupBy(a => new { a.Name, a.Start, a.End })
.Select(a => new { Group = a.OrderByDescending(b=>b.Taken), Count = a.Count() })
.Where(a => a.Count % 2 != 0)
.Select(a => a.Group.First());
"OrderByDescending" ensures only positive days taken are returned.
Oracle SQL
SELECT
*
FROM
(
SELECT
L1.NAME, L1.START, L1.END, MAX(TAKEN) AS TAKEN, COUNT(*) AS CNT
FROM LEAVE L1
GROUP BY L1.NAME, L1.START, L1.END
) L2
WHERE MOD(L2.CNT,2)<>0 -- replace MOD with % for Microsoft SQL
The condition "WHERE MOD(L2.CNT,2)<>0" (or in Linq "a.Count % 2 != 0") only returns people who applied once or odd number of times (e.g. apply - cancel - apply). But people who apply - cancel - apply - cancel are filtered out.
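The parity rule described above can be restated as a small Python sketch for readers who want to trace it by hand. It mirrors the LINQ query rather than replacing it: the tuples restate the sample data, and the variable names `all_leave`, `groups` and `filtered` are chosen here for illustration.

```python
from collections import defaultdict
from datetime import date

# (name, start, end, taken), restating the sample data above
all_leave = [
    ("Alice", date(2016, 6, 1), date(2016, 6, 3), -2),
    ("Alice", date(2016, 6, 1), date(2016, 6, 3), 2),
    ("Alice", date(2016, 6, 1), date(2016, 6, 3), 2),
    ("Alice", date(2016, 6, 3), date(2016, 6, 5), 3),
    ("Bob", date(2016, 6, 4), date(2016, 6, 14), 10),
    ("Charles", date(2016, 6, 2), date(2016, 6, 14), 12),
    ("Charles", date(2016, 6, 2), date(2016, 6, 14), -12),
    ("David", date(2016, 6, 3), date(2016, 6, 8), 5),
    ("Edgar", date(2016, 6, 3), date(2016, 6, 8), 5),
    ("Edgar", date(2016, 6, 3), date(2016, 6, 8), 5),
    ("Edgar", date(2016, 6, 3), date(2016, 6, 8), 5),
    ("Edgar", date(2016, 6, 3), date(2016, 6, 8), 5),
]

# Group the Taken values by (name, start, end), like GroupBy in the LINQ query
groups = defaultdict(list)
for name, start, end, taken in all_leave:
    groups[(name, start, end)].append(taken)

# Keep groups with an odd number of records and report the largest Taken value,
# which corresponds to OrderByDescending(...).First() / MAX(TAKEN)
filtered = [
    (name, start, end, max(taken_values))
    for (name, start, end), taken_values in groups.items()
    if len(taken_values) % 2 != 0
]
```

Note that the rule looks only at how many records a group has, not at their signs, so Edgar's four records are dropped even though all of them are positive in this sample.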

SQLite strftime function to get grouped data by months

I have a table with the following structure and data:
I would like to get data grouped by month in a given date range (for example, from 2014-01-01 to 2014-12-31). Data for some months may not be available, but I still need the result to show 0 for those months.
Result should have following format:
MONTH | DIALS_CNT | APPT_CNT | CONVERS_CNT | CANNOT_REACH_CNT |
2014-01 | 100 | 50 | 20 | 30 |
2014-02 | 100 | 40 | 30 | 30 |
2014-03 | 0 | 0 | 0 | 0 |
etc..
WHERE
APPT_CNT = WHERE call.result = APPT
CONVERS_CNT = WHERE call.result = CONV_NO_APPT
CANNOT_REACH_CNT = WHERE call.result = CANNOT_REACH
How can I do this using the strftime function?
Many thanks for any help or example.
SELECT Month,
(SELECT COUNT(*)
FROM MyTable
WHERE date LIKE Month || '%'
) AS Dials_Cnt,
(SELECT SUM(Call_Result = 'APPT')
FROM MyTable
WHERE date LIKE Month || '%'
) AS Appt_Cnt,
...
FROM (SELECT '2014-01' AS Month UNION ALL
SELECT '2014-02' UNION ALL
SELECT '2014-03' UNION ALL
...
SELECT '2014-12')
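Since the question asks about strftime specifically, here is a hedged variation that can be run from Python against an in-memory SQLite database. It swaps the correlated subqueries for a LEFT JOIN on `strftime('%Y-%m', ...)`. The original table layout is not shown, so the table name `calls`, the columns `call_date` and `result`, and the sample rows are all assumptions; dates are assumed to be stored as ISO 'YYYY-MM-DD' text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per dial, with its date and result code
conn.execute("CREATE TABLE calls (call_date TEXT, result TEXT)")
conn.executemany("INSERT INTO calls VALUES (?, ?)", [
    ("2014-01-05", "APPT"),
    ("2014-01-17", "CANNOT_REACH"),
    ("2014-02-03", "CONV_NO_APPT"),
    ("2014-02-20", "APPT"),
    ("2014-02-21", "APPT"),
])

# Build the month list for the requested range (all of 2014 here)
months = [f"2014-{m:02d}" for m in range(1, 13)]
month_list = " UNION ALL ".join(f"SELECT '{m}' AS month" for m in months)

# LEFT JOIN keeps months without calls; COALESCE turns their NULL sums into 0
report = conn.execute(f"""
    SELECT m.month,
           COUNT(c.call_date) AS dials_cnt,
           COALESCE(SUM(c.result = 'APPT'), 0) AS appt_cnt,
           COALESCE(SUM(c.result = 'CONV_NO_APPT'), 0) AS convers_cnt,
           COALESCE(SUM(c.result = 'CANNOT_REACH'), 0) AS cannot_reach_cnt
    FROM ({month_list}) AS m
    LEFT JOIN calls AS c
      ON strftime('%Y-%m', c.call_date) = m.month
    GROUP BY m.month
    ORDER BY m.month
""").fetchall()
```

Every month in the list produces a row, and months with no matching calls come back with zeros in all four count columns.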