Pig data cleaning based on date - apache-pig

I have 2 datasets as shown below:
1. ID and location
{ID, beginning year, ending year, location}.
sample:
(1001, 2010, 2012, CA)
(1001, 2013, 2015, WA)
(1002, 2009, 2015, AZ)
(1003, 2014, 2015, FL)
2. ID and connection
{ID1, ID2, connection creation date}
sample:
(1001, 1002, 2013)
(1001, 1003, 2014)
I want to count the number of connections by location pair and year. I assume that once a connection is created, it never expires. The result I am looking for is below:
{Location1, Location2, year, number of connections}
In the example above, it should be:
(WA, AZ, 2013, 1)
(WA, AZ, 2014, 1)
(WA, AZ, 2015, 1)
(WA, FL, 2014, 1)
(WA, FL, 2015, 1)
Does anyone know how to accomplish that in Apache Pig?

As mentioned in your comment, we will at some point need to expand the data to yearly information. To minimize the impact of the resulting data-size blowup, we should push that step as far down in the Pig script as possible.
The first thing we need to do is the following data translation:
{ID1, ID2, connection creation date} -> {Location1, Location2, start_year, end_year}
This can be achieved with the following Pig script statements:
locationData = LOAD 'path1' USING PigStorage('\t') AS (ID:chararray, beginning_year:long, ending_year:long, location:chararray);
connectionData = LOAD 'path2' USING PigStorage('\t') AS (ID1:chararray, ID2:chararray, connection_year:long);
partialJoin = JOIN connectionData BY ID1, locationData BY ID;
partialExtracted = FOREACH partialJoin GENERATE
    ID2,
    connection_year,
    location AS location1,
    (beginning_year > connection_year ? beginning_year : connection_year) AS start_year,
    ending_year AS end_year;
fullJoin = JOIN partialExtracted BY ID2, locationData BY ID;
fullExtracted = FOREACH fullJoin GENERATE
    location1,
    location AS location2,
    (beginning_year > start_year ? beginning_year : start_year) AS start_year,
    (ending_year < end_year ? ending_year : end_year) AS end_year;
fullFiltered = FILTER fullExtracted BY (start_year <= end_year); -- keep only rows where the two residence periods actually overlap
We are now ready to explode the data to get yearly information. Essentially, the following data translation needs to happen:
{Location1, Location2, start_year, end_year} -> {Location1, Location2, year}
e.g.
WA, AZ, 2013, 2015
->
WA, AZ, 2013
WA, AZ, 2014
WA, AZ, 2015
Here a UDF is unavoidable. We will need a UDF which takes a start year and an end year and returns a bag containing the range of years. You should be able to follow an online tutorial to write your UDF. Let's say this UDF is called getYearRange(). Your script will look as follows:
fullExploded = FOREACH fullFiltered GENERATE
    location1, location2,
    FLATTEN(getYearRange(start_year, end_year)) AS year;
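For illustration, getYearRange() could be a small Jython UDF along these lines (a minimal sketch, untested; the file name and the myudfs namespace are placeholders):

# years_udf.py
# Register it in the Pig script with: REGISTER 'years_udf.py' USING jython AS myudfs;
# and then call it as myudfs.getYearRange(start_year, end_year).
@outputSchema("years:bag{t:tuple(year:long)}")
def getYearRange(start_year, end_year):
    # Emit one tuple per year in the inclusive range [start_year, end_year].
    if start_year is None or end_year is None:
        return []
    return [(y,) for y in range(int(start_year), int(end_year) + 1)]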
All that remains is a GROUP BY to get your final counts:
fullGrouped = GROUP fullExploded BY (location1, location2, year);
finalOutput = FOREACH fullGrouped GENERATE
    FLATTEN(group) AS (location1, location2, year),
    COUNT(fullExploded) AS count;
The above describes the data flow. You might need to add additional steps to take care of edge cases and ensure data sanity.

Related

How to write SQL in DAX?

I am trying to write the following SQL code in DAX.
select SUM(amount) from TABLE
where BusinessDate = '2023-02-11'
OR (Segment = 'General' and Days_Without_Payments < 210)
OR (Segment = 'Enterprise' and Days_Without_Payments < 1215)
OR (Segment = 'House' and Days_Without_Payments < 1945)
I got the first part (i.e. the SUM) right, but can't get my head around the filters.
What I tried:
Measure = CALCULATE([Total Outstanding],
FILTER(VALUES(BILBL_WO[Date]),BILBL_WO[Date]=MAX(BILBL_WO[Date])),
AND(BILBL_WO[Days_Without_Payments] < 210, BILBL_WO[Segment] = "General") ||
AND(BILBL_WO[Days_Without_Payments] < 1215, BILBL_WO[Segment] = "Enterprise") ||
AND(BILBL_WO[Days_Without_Payments] < 1945, BILBL_WO[Segment] = "House")
)
Where Total Outstanding is another Measure which I created for summation of Amount.
Please help, as I couldn't find anything useful on the internet and I am new at this. Thanks!
In general, when people start learning DAX, most assume it works like a SQL query, but that is not true: SQL is a structured query language, whereas DAX is a formula language used for analysis, similar to Excel formulas.
I will give you a few pointers on how to convert SQL to DAX in simple terms, assuming you have one table imported into Power BI or an SSAS tabular model. You need to break the query down as follows:
The aggregation function, SUM(amount) in your SQL query, becomes the following measure in DAX:
Total Outstanding = sum('BILBL_WO'[amount])
The WHERE condition on the date column:
I usually create a date dimension and relate it to the table that has the date column; you can explore that once you get into evaluation context and star schema architecture in Power BI. In your case, if the column's data type is a string, convert it to a date type, then use the date column as a filter on the Power BI report page, allowing report users to select different dates, not just '2023-02-11'.
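If you do want the fixed date from your SQL hard-coded in the measure instead, a simple boolean filter inside CALCULATE works (a sketch, assuming BILBL_WO[BusinessDate] is a date-typed column):

Outstanding On Date =
CALCULATE (
    [Total Outstanding],
    BILBL_WO[BusinessDate] = DATE ( 2023, 2, 11 )
)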
For your conditions on Segment and Days_Without_Payments: before converting them to DAX, I think your original query is missing parentheses and should read as follows. If I have misunderstood, please let me know.
select SUM(amount)
from TABLE
where BusinessDate = '2023-02-11'
and
(
(Segment = 'General' and Days_Without_Payments < 210)
OR
(Segment = 'Enterprise' and Days_Without_Payments < 1215)
OR
(Segment = 'House' and Days_Without_Payments < 1945)
)
Altered SQL script
If my assumption above is valid and you need a DAX measure with the Segment and Days_Without_Payments conditions, you can write it as follows:
Measure = CALCULATE([Total Outstanding] ,
KEEPFILTERS(
FILTER(
ALL(BILBL_WO[Segment] ,BILBL_WO[Days_Without_Payments]),
(BILBL_WO[Segment] = "General" && BILBL_WO[Days_Without_Payments] < 210)
||
(BILBL_WO[Segment] = "Enterprise" && BILBL_WO[Days_Without_Payments] < 1215)
||
(BILBL_WO[Segment] = "House" && BILBL_WO[Days_Without_Payments] < 1945))
)
)
For more on filtering by multiple columns in DAX, see this YouTube video https://www.youtube.com/watch?v=kQjYG6TJVp8 and follow the SQLBI channel. It will help you a lot.
I hope I helped in a way; if so, please mark this as an answer and vote for it :)

Using the sum of previous payslips of the same year in salary calculation in Odoo

I am using Odoo Payroll 15 with some custom python structures.
Some taxes have a maximum amount that must be paid per employee per year. I want to sum a specific line over that year's payslips so that I can correctly calculate the amount to be paid on the current payslip.
I have tried this previously, but it seems like it uses only the current payslip in its calculation:
payslip.sum('THE_TAX', '2022-01-01', '2022-12-31')
How can I access previous payslips within the payslip rules? A solution that requires Odoo 16 would work for me too.
The main difference between your example and the sum implementation is that your code does not check the payslip state.
This means it also counts cancelled and unfinished payslips, whereas the sum implementation counts only payslips whose state is 'done'.
You can add a modified version of the sum function to the payslip model and call that if you need different behavior.
The implementation below is from v12; hr_payslip now lives in the Enterprise edition.
class Payslips(BrowsableObject):
    def sum(self, code, from_date, to_date=None):
        if to_date is None:
            to_date = fields.Date.today()
        self.env.cr.execute("""SELECT sum(pl.total) -- this line is different in v12 and v15
                    FROM hr_payslip as hp, hr_payslip_line as pl
                    WHERE hp.employee_id = %s
                    AND hp.state = 'done'
                    AND hp.date_from >= %s
                    AND hp.date_to <= %s
                    AND hp.id = pl.slip_id
                    AND pl.code = %s""",
                            (self.employee_id, from_date, to_date, code))
        res = self.env.cr.fetchone()
        return res and res[0] or 0.0
https://github.com/odoo/odoo/blob/c53081f10befd4f1c98e46a450ed3bc71a6246ed/addons/hr_payroll/models/hr_payslip.py#L300
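A sketch of such a modification (untested; the name sum_with_states is hypothetical) could copy the query above and make the accepted states a parameter:

class Payslips(BrowsableObject):
    def sum_with_states(self, code, from_date, to_date=None, states=('done', 'paid')):
        # Same query as the stock sum() above, but with a configurable state filter.
        if to_date is None:
            to_date = fields.Date.today()
        self.env.cr.execute("""SELECT sum(pl.total)
                    FROM hr_payslip as hp, hr_payslip_line as pl
                    WHERE hp.employee_id = %s
                    AND hp.state in %s
                    AND hp.date_from >= %s
                    AND hp.date_to <= %s
                    AND hp.id = pl.slip_id
                    AND pl.code = %s""",
                            (self.employee_id, states, from_date, to_date, code))
        res = self.env.cr.fetchone()
        return res and res[0] or 0.0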
Edit:
I think it is an Odoo bug: the 'paid' state should be included, but it is not. You can use a variation of your code; the filtered and mapped functions should make it faster.
I can't test the code:
already_paid = sum(
employee.mapped("slip_ids")
.filtered(lambda s: s.date_from.year == 2022 and s.state in ("done", "paid"))
.mapped("line_ids")
.filtered(lambda l: l.code == "THE_TAX")
.mapped("total")
)
I think I found an ugly workaround that does give the correct answer:
already_paid = 0.0
for slip in employee.slip_ids:
    for line in slip.line_ids:
        if line.code != "THE_TAX":
            continue
        if line.date_from.year != 2022:
            continue
        already_paid += line.total

SQL Server Exception Reporting Script

Seeing this is my first post of 2016, Happy New Year to all.
Once again, I'm stuck with a tricky report I must pull from SQL Server.
I have two tables called Documents and Doctype.
Columns in Documents:
File_Name, Date, Region, DoctypeID
The data inside these columns is as follows.
00001,2016-01-06,JHB,1d187ecc
00001,2016-01-06,JHB,bccc05f9
00001,2016-01-06,JHB,fe697be0
00001,2016-01-06,JHB,bbae8c73
00002,2016-01-06,JHB,1d187ecc
00002,2016-01-06,JHB,bccc05f9
00002,2016-01-06,JHB,fe697be0
Columns in Doctype:
DoctypeID, Document_Type
The data inside these columns is as follows.
1d187ecc, Collection
bccc05f9, Image
fe697be0, Log
bbae8c73, Sent to warehouse.
My query needs to give me the below result, using the data above.
File_Name,Collection, Image, Log, Sent to Warehouse,Region
00001, 1, 1, 1, 1, JHB
00002, 1, 1, 1, 0, JHB
I hope the above makes sense. How would I go about doing this?
Thank you all in advance.
You can try this:
SELECT FILE_NAME,
       ISNULL([Collection], 0) AS [Collection],
       ISNULL([Image], 0) AS [Image],
       ISNULL([Log], 0) AS [Log],
       ISNULL([Sent to warehouse], 0) AS [Sent to warehouse],
       Region
FROM (
    SELECT FILE_NAME, Document_Type, COUNT(Document_Type) AS Frequency, Region
    FROM documents d
    INNER JOIN doctype dt ON d.DoctypeID = dt.DoctypeID
    GROUP BY Document_Type, FILE_NAME, Region
) AS s
PIVOT
(
    SUM(Frequency)
    FOR [Document_Type] IN ([Collection], [Image], [Log], [Sent to warehouse])
) AS pvt

SubQuery Aggregates in ActiveRecord

I'm trying to avoid using straight-up SQL in my Rails app, but I need to do a much larger version of this:
SELECT ds.product_id,
( SELECT SUM(units) FROM daily_sales WHERE (date BETWEEN '2015-01-01' AND '2015-01-08') AND service_type = 1 ) as wk1,
( SELECT SUM(units) FROM daily_sales WHERE (date BETWEEN '2015-01-09' AND '2015-01-16') AND service_type = 1 ) as wk2
FROM daily_sales as ds group by ds.product_id
I'm sure it can be done, but I'm struggling to write this as an ActiveRecord statement. Can anyone help?
If you must do this in a single query, you'll need to write some SQL for the CASE statements. The following is what you need:
ranges = [ # ordered array of all your date-ranges
Date.new(2015, 1, 1)..Date.new(2015, 1, 8),
Date.new(2015, 1, 9)..Date.new(2015, 1, 16)
]
overall_range = (ranges.first.min)..(ranges.last.max)
grouping_sub_str = \
  ranges.map.with_index do |range, i|
    "WHEN (date BETWEEN '#{range.min}' AND '#{range.max}') THEN 'week#{i + 1}'"
  end.join(' ')
grouping_condition = "CASE #{grouping_sub_str} END"
grouping_columns = ['product_id', grouping_condition]
DailySale.where(date: overall_range).group(grouping_columns).sum(:units)
That will produce a hash with array keys and numeric values. A key will be of the form [product_id, 'week1'] and the value will be the corresponding sum of units for that week.
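For example, to reshape that hash into one row per product (product ids and sums here are purely illustrative):

sums = DailySale.where(date: overall_range).group(grouping_columns).sum(:units)
# e.g. { [101, 'week1'] => 40, [101, 'week2'] => 55, [102, 'week1'] => 12 }

by_product = Hash.new { |h, k| h[k] = {} }
sums.each do |(product_id, week), units|
  by_product[product_id][week] = units
end
# => { 101 => { 'week1' => 40, 'week2' => 55 }, 102 => { 'week1' => 12 } }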
Simplify your SQL to the following and try converting it:
SELECT ds.product_id
     , SUM(CASE WHEN date BETWEEN '2015-01-01' AND '2015-01-08' AND service_type = 1
                THEN units
           END) AS wk1
     , SUM(CASE WHEN date BETWEEN '2015-01-09' AND '2015-01-16' AND service_type = 1
                THEN units
           END) AS wk2
FROM daily_sales AS ds
GROUP BY ds.product_id
Every Rails developer sooner or later hits their head against the walls of the ActiveRecord query interface, only to find the solution in Arel.
Arel gives you the flexibility you need to create your query without loops, etc. I am not going to give runnable code, rather some hints on how to do it yourself:
1. We are going to use Arel tables to create our query. For a model called, for example, Product, getting the Arel table is as easy as products = Product.arel_table
2. Getting the sum of a column looks like daily_sales.project(daily_sales[:units].sum).where(daily_sales[:date].gt(BEGIN_DATE)).where(daily_sales[:date].lt(END_DATE)). You can chain as many wheres as you want and they will be translated into SQL ANDs.
3. Since we need multiple sums in the end result, you need to make use of Common Table Expressions (CTEs). Take a look at the docs and this answer for more info on this.
4. You can use those CTEs from step 3 in combination with group and you are done! A rough sketch follows below.
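As a rough sketch of hints 1, 2 and 4 for a single date range (untested; the DailySale model, column names and dates are taken from the question):

sales = DailySale.arel_table

# Units summed per product for the first week (hint 2 plus the grouping from hint 4).
wk1 = sales
  .project(sales[:product_id], sales[:units].sum.as('wk1'))
  .where(sales[:date].gteq(Date.new(2015, 1, 1)))
  .where(sales[:date].lteq(Date.new(2015, 1, 8)))
  .where(sales[:service_type].eq(1))
  .group(sales[:product_id])

rows = DailySale.connection.select_all(wk1.to_sql)

Combining several such selects into one result set is where the CTEs from hint 3 come in.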

How would I translate this SQL query into a Raven Map/Reduce query?

Following on from my previous question at When is a groupby query evaluated in RavenDB? I decided to completely restructure the data into a format that is theoretically easier to query on.
Having now created the new data structure, I am struggling to find how to query it.
It took me 30 seconds to write the following SQL query which gives me exactly the results I need:
SELECT GroupCompanyId, AccountCurrency, AccountName, DATEPART(year, Date) AS Year,
    (SELECT SUM(Debit)
     FROM Transactions AS T2
     WHERE T1.GroupCompanyId = GroupCompanyId
       AND T1.AccountCurrency = AccountCurrency
       AND T1.AccountName = AccountName
       AND DATEPART(year, Date) < DATEPART(year, T1.Date)) AS OpeningDebits,
    (SELECT SUM(Credit)
     FROM Transactions AS T2
     WHERE T1.GroupCompanyId = GroupCompanyId
       AND T1.AccountCurrency = AccountCurrency
       AND T1.AccountName = AccountName
       AND DATEPART(year, Date) < DATEPART(year, T1.Date)) AS OpeningCredits,
    SUM(Debit) AS Db, SUM(Credit) AS Cr
FROM Transactions AS T1
WHERE DATEPART(year, Date) = 2011
GROUP BY GroupCompanyId, AccountCurrency, AccountName, DATEPART(year, Date)
ORDER BY GroupCompanyId, AccountCurrency, Year, AccountName
So far I have the following Map/Reduce index, which from the Studio appears to give the correct results, i.e. it breaks down and groups the data by date.
public class Transactions_ByDailyBalance : AbstractIndexCreationTask<Transaction, Transactions_ByDailyBalance.Result>
{
    public Transactions_ByDailyBalance()
    {
        Map = transactions => from transaction in transactions
                              select new
                              {
                                  transaction.GroupCompanyId,
                                  transaction.AccountCurrency,
                                  transaction.Account.Category,
                                  transaction.Account.GroupType,
                                  transaction.AccountId,
                                  transaction.AccountName,
                                  transaction.Date,
                                  transaction.Debit,
                                  transaction.Credit,
                              };
        Reduce = results => from result in results
                            group result by new
                            {
                                result.GroupCompanyId,
                                result.AccountCurrency,
                                result.Category,
                                result.GroupType,
                                result.AccountId,
                                result.AccountName,
                                result.Date,
                            }
                            into g
                            select new
                            {
                                GroupCompanyId = g.Select(x => x.GroupCompanyId).FirstOrDefault(),
                                AccountCurrency = g.Select(x => x.AccountCurrency).FirstOrDefault(),
                                Category = g.Select(x => x.Category).FirstOrDefault(),
                                GroupType = g.Select(x => x.GroupType).FirstOrDefault(),
                                AccountId = g.Select(x => x.AccountId).FirstOrDefault(),
                                AccountName = g.Select(x => x.AccountName).FirstOrDefault(),
                                Date = g.Select(x => x.Date).FirstOrDefault(),
                                Debit = g.Sum(x => x.Debit),
                                Credit = g.Sum(x => x.Credit)
                            };
        Index(x => x.GroupCompanyId, FieldIndexing.Analyzed);
        Index(x => x.AccountCurrency, FieldIndexing.Analyzed);
        Index(x => x.Category, FieldIndexing.Analyzed);
        Index(x => x.AccountId, FieldIndexing.Analyzed);
        Index(x => x.AccountName, FieldIndexing.Analyzed);
        Index(x => x.Date, FieldIndexing.Analyzed);
    }
}
However, I can't work out how to query the data in one go.
I need the opening balance as well as the period balance, so I ended up writing the query below, which takes the account as a parameter. Following on from Oren's comments to my previous question, that I was mixing a Linq query with a Lucene query, I rewrote it, but I have basically ended up with a mixed query again.
Even though the SQL query above filters by year, in fact I need to be able to determine the current balance as of any day.
private LedgerBalanceDto GetAccountBalance(BaseAccountCode account, DateTime periodFrom, DateTime periodTo, string queryName)
{
    using (var session = MvcApplication.RavenSession)
    {
        var query = session.Query<Transactions_ByDailyBalance.Result, Transactions_ByDailyBalance>()
            .Where(c => c.AccountId == account.Id && c.Date >= periodFrom && c.Date <= periodTo)
            .OrderBy(c => c.Date)
            .ToList();
        var debits = query.Sum(c => c.Debit);
        var credits = query.Sum(c => c.Credit);
        var ledgerBalanceDto = new LedgerBalanceDto
        {
            Account = account,
            Credits = credits,
            Debits = debits,
            Currency = account.Currency,
            CurrencySymbol = account.CurrencySymbol,
            Name = queryName,
            PeriodFrom = periodFrom,
            PeriodTo = periodTo
        };
        return ledgerBalanceDto;
    }
}
Required result:
GroupCompanyId AccountCurrency AccountName Year OpeningDebits OpeningCredits Db Cr
Groupcompanies-2 EUR Customer 1 2011 148584.2393 125869.91 10297.6891 28023.98
Groupcompanies-2 EUR Customer 2 2011 236818.0054 233671.55 50959.85 54323.38
Groupcompanies-2 USD Customer 3 2011 69426.11761 23516.3776 10626.75 0
Groupcompanies-2 USD Customer 4 2011 530587.9223 474960.51 97463.544 131497.16
Groupcompanies-2 USD Customer 5 2011 29542.391 28850.19 4023.688 4231.388
Any suggestions would be greatly appreciated
Jeremy
In answer to the comment
I basically ended up doing pretty much the same thing. Actually, I wrote an index that does it in only two hits: once for the opening balance and again for the period balance. This is almost instantaneous for grouping by account name, category, etc.
However, my problem now is getting a daily running balance for an individual account. If I bring down all the data for the account and the period, it's not a problem; I can sum the balance on the client. However, when the data is paged, and the debits and credits are grouped by Date and Id, the paging cuts across dates, so the opening/closing balance is not correct.
Page 1
Opening balance until 26/7/12 = 0
25/7/12 Acct1 Db 100 Cr 0 Bal +100 Runn Bal +100
26/7/12 Acct1 Db 100 Cr 0 Bal +100 Runn Bal +200
26/7/12 Acct1 Db 200 Cr 0 Bal +200 Runn Bal +400
Closing balance until 26/7/12 = +400
Page 2
Opening balance until 26/7/12 = +450 (this is wrong - it should be the balance at the end of Page 1, but it is the balance until the 26/7/12 - i.e. includes the first item on Page 2)
26/7/12 Acct1 Db 50 Cr 0 Bal +50 Runn Bal +500 (should be +450)
27/7/12 Acct1 Db 60 Cr 0 Bal +60 Runn Bal +560 (should be +510)
I just can't think up an algorithm to handle this.
Any ideas?
Hi, this is a problem I have also faced recently with RavenDB, when I needed to retrieve rolling balances as at any imaginable date. I never found a way of doing it all in one go, but I managed to reduce the number of documents I needed to pull back in order to calculate the rolling balance.
I did this by writing multiple map-reduce indexes that summed up the value of transactions within specific periods:
My first index summed up the value of all transactions grouped at the year level.
My second index summed up the value of all transactions at the day level.
So if someone wanted their account balance as at 1st June 2012, I would:
Use the year-level map-reduce index to get the value of transactions for the years before 2012 and sum them together (so if transactions started being captured in 2009, I should be pulling back 3 documents)
Use the day-level map-reduce index to get all documents between the start of the year and the 1st of June
Then add the day totals to the year totals for the final rolling balance (I could have had a monthly map-reduce as well, but didn't bother). A sketch of this lookup follows below.
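Roughly, that lookup could look like the following (a sketch only; the index names, Result shapes and the accountId/asAt variables are hypothetical, following the pattern of the index above):

// Year-level totals for the full years before the target year.
var yearTotals = session.Query<Balance_ByYear.Result, Balance_ByYear>()
    .Where(r => r.AccountId == accountId && r.Year < asAt.Year)
    .ToList();

// Day-level totals from the start of the target year up to the target date.
var dayTotals = session.Query<Balance_ByDay.Result, Balance_ByDay>()
    .Where(r => r.AccountId == accountId
             && r.Date >= new DateTime(asAt.Year, 1, 1)
             && r.Date <= asAt)
    .ToList();

// Rolling balance as at asAt: sum both sets of pre-aggregated documents.
var balance = yearTotals.Sum(r => r.Debit - r.Credit)
            + dayTotals.Sum(r => r.Debit - r.Credit);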
Anyway, it is not as quick as in SQL, but it was the best alternative I could come up with to avoid bringing back every single transaction.