Hive: Create rows with summed data, by date (unknown number of dates)


I am currently working with a Hive table that contains transaction data. I need to compute some basic statistics on this data and put the results in a new table.
EDIT: I'm using Hive 0.13 on Hadoop 2.4.1.
CONTEXT
First, let me try to present the input table: here's a table with 3 columns, an ID, a date (month/year), and an amount:
<ID> <Date> <Amount>
1 11.2014 5.00
2 11.2014 10.00
3 12.2014 15.00
1 12.2014 7.00
1 12.2014 15.00
2 01.2015 20.00
3 01.2015 30.00
3 01.2015 45.00
... ... ...
The desired output is a table grouped by ID, where each row holds the summed amounts for each month:
<ID> <11.2014> <12.2014> <01.2015> <...>
1 5.00 22.00 0.00 ...
2 10.00 0.00 20.00 ...
3 0.00 15.00 75.00 ...
... ... ... ... ...
The original table has more than 4 million IDs and more than 500 million rows, covering more than 2 years, so hardcoding the columns by hand seems impractical, since I don't know in advance how many columns I should create.
(I know how many different dates I have today, but if the original table grows over 5, 10, or 15 years, there will be a lot to maintain by hand, and that's risky.)
THE CHALLENGE
I know how to do basic manipulations and GROUP BYs, and I can even write some CASE WHEN expressions, but the tricky part of my problem is that I cannot create the columns like this (as mentioned above)...
SUM(CASE WHEN Date = '11.2014' THEN Amount ELSE 0 END) AS `11.2014`,
SUM(CASE WHEN Date = '12.2014' THEN Amount ELSE 0 END) AS `12.2014`,
SUM(CASE WHEN Date = '01.2015' THEN Amount ELSE 0 END) AS `01.2015`,
SUM(CASE WHEN Date = ??? THEN Amount ELSE 0 END) AS ???
... because I don't know how many different dates I'll eventually have, so I would need something like this:
SUM (CASE WHEN Date = [loop over each dates] THEN Amount ELSE 0 END)
AS [the date selected in the loop]
THE QUESTION
Do you have any suggestions for the following:
How can I loop over all the dates?
How can I create a column for every date without specifying the name of each new column myself?
Is it doable in a single HiveQL script? (Not required, but it would be really nice.)
I would like to avoid a UDF, but at this point I'm not sure that is avoidable, since I haven't found any case that resembles mine.
Thanks in advance and don't hesitate to ask for more info.

This is too long for a comment.
You cannot do exactly what you want in Hive, because a SQL query has to have a fixed number of columns when it is defined.
What can you do?
The easiest thing is simply to change what you want. Produce multiple rows instead of multiple columns:
select id, date, sum(amount)
from table t
group by id, date;
You can then load the data into your favorite spreadsheet and pivot it there.
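If a spreadsheet is not convenient, the same pivot can be done in a few lines of client code. The following is only a sketch in Python, assuming the grouped result has already been fetched as (id, date, total) tuples; `pivot_rows` is a hypothetical helper name, and months with no transactions are filled with 0.0.

```python
from collections import defaultdict

def pivot_rows(rows):
    """Pivot (id, date, total) rows into one list of totals per id."""
    # Keep dates in first-seen order; dict keys preserve insertion order.
    dates = list(dict.fromkeys(date for _, date, _ in rows))
    totals = defaultdict(dict)
    for id_, date, total in rows:
        totals[id_][date] = totals[id_].get(date, 0.0) + total
    # Fill months that have no transactions for an id with 0.0.
    table = {id_: [by_date.get(d, 0.0) for d in dates]
             for id_, by_date in totals.items()}
    return dates, table

# Grouped output for the sample data in the question.
rows = [
    (1, "11.2014", 5.0), (2, "11.2014", 10.0),
    (3, "12.2014", 15.0), (1, "12.2014", 22.0),
    (2, "01.2015", 20.0), (3, "01.2015", 75.0),
]
dates, table = pivot_rows(rows)
for id_, values in sorted(table.items()):
    print(id_, values)
```

For hundreds of millions of input rows the grouping itself should still happen in Hive; only the much smaller grouped result would be pivoted this way.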
Other alternatives. You can write a query that will write the appropriate query. This would go through the table, identify the possible dates, and construct a SQL statement. You can then run the SQL statement.
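A rough sketch of that "query that writes the query" idea, in Python: it assumes the distinct dates have already been read (for example with a SELECT DISTINCT on the date column) and builds one SUM(CASE ...) column per date. The table name `transactions`, the helper `build_pivot_query`, and the `amt_MM_YYYY` alias scheme are all assumptions made for illustration.

```python
import re

def build_pivot_query(dates, table="transactions"):
    """Build a pivot query with one SUM(CASE ...) column per date string."""
    columns = []
    for d in dates:
        # Accept only MM.YYYY strings so nothing unexpected ends up in the SQL.
        if not re.fullmatch(r"\d{2}\.\d{4}", d):
            raise ValueError(f"unexpected date format: {d!r}")
        alias = "amt_" + d.replace(".", "_")
        columns.append(
            f"  SUM(CASE WHEN `date` = '{d}' THEN amount ELSE 0 END) AS {alias}"
        )
    return "SELECT id,\n" + ",\n".join(columns) + f"\nFROM {table}\nGROUP BY id;"

query = build_pivot_query(["11.2014", "12.2014", "01.2015"])
print(query)
```

The generated text can then be saved to a file and submitted to Hive as a second step, so the column list always matches the dates present in the data.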
Or, you could use some other data types, such as a list or JSON to store the aggregated values in one row.

Related

SQL Query to return all of the same numbers ignoring if the number is positive or not

I am looking for an SQL query to return all records based on a number. The trick is that the number can be positive or negative, but as long as the number is the same it should return.
So I have a table like this:
1 amt1 100.00
2 amt2 100.00
3 amt3 -100.00
4 amt4 120.00
A query such as this:
Select * from table where amount = '100.00'
Only returns the first 2 rows.
I want to return the first 3 rows, thus ignoring the minus sign but still matching the amount.
Any ideas?

Use in:
where amount in (100.00, -100.00)
Note the single quotes are not necessary.
You can also use abs():
where abs(amount) = 100.00
if you don't care much about indexes or query optimization.

If amount is a number:
select * from table where abs(amount) = 100.00;
If amount is text:
select * from table where amount in ('100.00', '-100.00');
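To check that both numeric forms return the same three rows, here is a small sketch that runs them against an in-memory SQLite database from Python. SQLite is only a stand-in for the asker's database, and the table name `payments` and its column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, "amt1", 100.0), (2, "amt2", 100.0), (3, "amt3", -100.0), (4, "amt4", 120.0)],
)

# Variant 1: list both signs explicitly.
ids_in = [row[0] for row in conn.execute(
    "SELECT id FROM payments WHERE amount IN (100.00, -100.00) ORDER BY id")]

# Variant 2: compare the absolute value.
ids_abs = [row[0] for row in conn.execute(
    "SELECT id FROM payments WHERE abs(amount) = 100.00 ORDER BY id")]

print(ids_in, ids_abs)
```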

Cast decimal as int when a specific value is found

I have a query that compares data between weeks using CTEs. One of the things it stores is percentage data. I am trying to convert or cast any value that is equal to 100.00.
For example, here is what the result would look like
Name Percentage Other
App1 99.56 5.5
App2 100.00 6
I am hoping to remove the zeros from the 100. I have tried some CASE statements but I never get the results I am looking for. Here is what I have now.
SELECT
f.Application,
f.Percentage,
f.Other,
CASE
WHEN f.Percentage = 100.00
THEN 100
END AS Percentage
FROM Table1 f
Currently, the values that are 100.00 are changed to 100. However, everything not equal to 100.00 is nulled out. I can get rid of the NULLs by adding
ELSE f.Percentage
But then 100.00 is not changed.
Any help would be appreciated.

As I mentioned in the comments, you can't have a single SQL column that contains multiple data types.
That being said, whether or not it's useful in your case remains to be seen, but a string-type like nvarchar can handle both, though your code would need to be able to handle them as strings instead of numbers.
An example would be something like this:
SELECT
f.Application,
f.Percentage,
f.Other,
CASE
WHEN f.Percentage = 100.00
THEN '100'
ELSE CAST(f.Percentage as nvarchar(10))
END AS Percentage
FROM Table1 f
The Percentage column would now be of type nvarchar, but it would return values in this fashion:
Name Percentage Other
App1 99.56 5.5
App2 100 6
App3 91.23 3.5
App4 94.41 4.8
App5 100 6
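If the formatting can happen in the application instead of in SQL, the same rule is easy to express there. This is only a sketch in Python, assuming the percentages arrive as `Decimal` values; `format_percentage` is a hypothetical helper.

```python
from decimal import Decimal

def format_percentage(value):
    """Return '100' for exactly 100, otherwise the value's normal text form."""
    if value == Decimal("100"):
        return "100"
    return str(value)

for name, pct in [("App1", Decimal("99.56")), ("App2", Decimal("100.00"))]:
    print(name, format_percentage(pct))
```

As with the nvarchar version, the result is text, so any later arithmetic has to use the original numeric column.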

UPDATE with HAVING in duplicate values in Excel

I need help with this issue. I have a development task: I need to find the duplicate values in SQL, then sum the INVOICE_AMOUNT and divide each individual amount by that total. Example:
FA-0001 $25.00 BILL-0001
FA-0001 $75.00 BILL-0002
I need the total for this invoice, SUM(AMOUNT_INVOICE) = $100.00, and then each individual amount as a share of that total, for example 25.00/100.00 = 0.25, and so on. That share is then multiplied by DET_SOL_AMOUNT.
I need to apply this query only to the duplicate values.
I tried this query:
UPDATE [T4DET] SET [DET_SOL]=(([LOC_AMOUNT]/SUM([LOC_AMOUNT]))*[DET_SOL_CALC]) FROM [1WEB] WHERE [1WEB].[INVOICE] IN (SELECT [T4DET].[ASSIGNMENT] FROM [T4DET] GROUP BY [T4DET].[ASSIGNMENT] HAVING COUNT(*) > 1)
Thanks for your Help.

If I understood what you want to do correctly, it is easy with Excel. You need to write formulas in 2 columns only, for example:
Group Amount Bill No DET_SOL_CALC Sum of Group Result
FA-0001 $25.00 BILL-0001 2 100 0.5
FA-0001 $75.00 BILL-0002 2 100 1.5
FA-0002 $200.00 BILL-0001 5 600 1.666666667
FA-0002 $100.00 BILL-0002 5 600 0.833333333
FA-0002 $300.00 BILL-0003 5 600 2.5
Put your data in columns A, B and C.
Column D: DET_SOL_CALC
Column E formula should be: =SUMIF($A$2:$A$6,A2,$B$2:$B$6)
Column F formula should be: =B2/E2*D2
Row 1 holds the headers of your data.
Put these formulas in row 2 and drag them down to the last row of your data; your numbers should then be calculated correctly.
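The same two-column calculation can be sketched outside Excel. The Python below assumes each row is a (group, amount, det_sol_calc) tuple, mirroring columns A, B and D of the example table; `allocate` is a hypothetical helper, and a group whose amounts sum to zero would need separate handling.

```python
from collections import defaultdict

def allocate(rows):
    """Return amount / group total * det_sol_calc for every row."""
    # First pass: total amount per group (the SUMIF column).
    group_totals = defaultdict(float)
    for group, amount, _ in rows:
        group_totals[group] += amount
    # Second pass: each row's share of its group, scaled by DET_SOL_CALC.
    return [amount / group_totals[group] * det for group, amount, det in rows]

rows = [
    ("FA-0001", 25.0, 2.0),
    ("FA-0001", 75.0, 2.0),
    ("FA-0002", 200.0, 5.0),
    ("FA-0002", 100.0, 5.0),
    ("FA-0002", 300.0, 5.0),
]
results = allocate(rows)
print([round(r, 9) for r in results])
```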
Please hit the check mark if this is your answer!

The alternative solution is to create a temporary table with SUM and GROUP BY, and to add three columns for the calculations.
Example:
DET4TEMP
ASSINGMENT NVARCHAR
DOC_AMOUNT MONEY
INSERT INTO 4DETTEMP (ASSINGNMENT,[TOTAL]) SELECT ASSIGNMENT, SUM(DOC_AMOUNT) FROM FBL5N GROUP BY ASSIGNMENT
After that, the queries are:
Obtain the DET SOL amount from the other table.
UPDATE 4BET SET DET_SOL_CAL=T2.INCOMING_AMOUNT FROM FBL5N T2 WHERE ASSIGNMENT=T2.INV_CON
Obtain DOC AMOUNT TOTAL of the temporal table.
UPDATE 4BET SET DOC_AMNT_TOTAL=T2.[TOTAL] FROM 4DETTEMP T2 WHERE ASSIGNMENT=T2.ASSIGNMENT
Obtain the Calculation Percentage.
UPDATE 4BET SET PERC_CAL_AMNT=(DOC_AMNT_TOTAL/DOC_AMNT), DET_SOL=(PERC_CAL_AMNT*DET_SOL_CALC)
Afterwards, delete the temp tables and finish.
This is my solution. Is it viable?

SQL with two subqueries - MS Access

I have two tables (tblAAA and tblBBB). I am trying to get a query that will sum the total from one column based on criteria from another column (1) and the sum of that same column based on different criteria from the same column I got the first criteria from (2). Then I need to do the same one more time from another table (3). The last thing I need to do is perform a calculation that will yield one result (3 + 1 - 2). Here is what I have so far that I got somewhere else tailored to my situation. It works but only on two of the above. Any help is appreciated.
SELECT tblAAA.ID,
tblAAA.Type, SUM(tblAAA.Amount),
SUM(tblBBB.CurTotal),
(SUM(tblBBB.CurTotal)) - SUM(tblAAA.Amount)
FROM tblAAA INNER JOIN tblBBB ON tblAAA.ID = tblBBB.ID
GROUP BY tblAAA.ID, tblAAA.Type
HAVING tblAAA.ID=15 AND tblAAA.Type="Credit";
Table1
Type Amount
Debit 15.00
Debit 15.00
Debit 10.00
Credit 7.00
Credit 13.00
Table2
CurTotal
5.00
10.00
15.00
Expected Output (30.00 + 20.00 - 40.00)
10.00
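One possible way to reach the expected 10.00 is to aggregate each table on its own instead of joining them, since a join between several rows on each side repeats rows before they are summed. The sketch below runs on SQLite from Python purely as a stand-in; the ID value 15 on every sample row is an assumption, and Access itself would need IIf(...) in place of CASE WHEN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblAAA (ID INTEGER, Type TEXT, Amount REAL)")
conn.execute("CREATE TABLE tblBBB (ID INTEGER, CurTotal REAL)")
conn.executemany("INSERT INTO tblAAA VALUES (15, ?, ?)", [
    ("Debit", 15.0), ("Debit", 15.0), ("Debit", 10.0),
    ("Credit", 7.0), ("Credit", 13.0),
])
conn.executemany("INSERT INTO tblBBB VALUES (15, ?)", [(5.0,), (10.0,), (15.0,)])

# Sum of CurTotal (3) plus credits (1) minus debits (2), each aggregated separately.
query = """
SELECT (SELECT SUM(CurTotal) FROM tblBBB WHERE ID = 15)
     + SUM(CASE WHEN Type = 'Credit' THEN Amount ELSE 0 END)
     - SUM(CASE WHEN Type = 'Debit'  THEN Amount ELSE 0 END)
FROM tblAAA
WHERE ID = 15
"""
result = conn.execute(query).fetchone()[0]
print(result)
```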

SQL Query to separate data into two fields

I have data in one column that I want to separate into two columns. The values are separated by a comma, if present. The field can hold no data, one value, or two values separated by a comma. Currently I pull the data and save it as a comma-delimited file, use FoxPro to load it into a table and process it as needed, and then re-insert the data into a different SQL table for my use. I would like to drop the FoxPro step and have the SQL query separate the data for me. Below is a sample of what the data looks like.
Store Amount Discount
1 5.95
1 5.95 PO^-479^2
1 5.95 PO^-479^2
2 5.95
2 5.95 PO^-479^2
2 5.95 +CA8A09^-240^4,CORDRC^-239^7
3 5.95
3 5.95 +CA8A09^-240^4,CORDRC^-239^7
3 5.95 +CA8A09^-240^4,CORDRC^-239^7
In the data above, I want to sum the Amount field to get a gross amount, then pull out each discount amount, which is located between the caret characters, and sum those to get the total discount amount. Adding the two together gives the total net amount. The query I want to write will split the Discount field as needed (see store 2, line 3, where two discounts are applied) and then pull out the value between the caret characters.

For SQL Server:
You can use CHARINDEX(',', fieldname) in a SQL statement to find the location of the comma, and then SUBSTRING to parse out the first and second fields.

For Oracle you can use a case statement like this in your select clause. Use one for each of the two discounts:
CASE WHEN LENGTH(foo.discount) > 0 AND INSTR(foo.discount,',') > 0 THEN
SUBSTR(foo.discount,1,INSTR(foo.discount,',',1,1)-1) ELSE foo.discount END AS discount_column_1

I finally figured out exactly how to separate the fields as I need them. Below is the code that breaks the discount field into two. I can now separate the fields as needed, insert the separated data into a temp table, and then use a similar set of code to pull out the exact amount enclosed by the caret characters. Thanks for the help in the two answers above. I used a combination of both to get exactly what I needed.
CASE LEN(X.DISCOUNT)-LEN(REPLACE(X.DISCOUNT,',',''))
WHEN 1 THEN SUBSTRING(X.DISCOUNT,1,CHARINDEX(',',X.DISCOUNT)-1)
ELSE X.DISCOUNT
END 'FIRST_DISCOUNT',
CASE LEN(X.DISCOUNT)-LEN(REPLACE(X.DISCOUNT,',',''))
WHEN 1 THEN SUBSTRING(X.DISCOUNT,CHARINDEX(',',X.DISCOUNT)+1,LEN(X.DISCOUNT)-CHARINDEX(',',X.DISCOUNT)+1)
ELSE ''
END 'SECOND_DISCOUNT'

This alternative solution uses the LEFT and RIGHT functions to split the column.
select Store, Amount,
Discount1 = CASE
WHEN CHARINDEX(',',Discount) > 0 THEN LEFT(Discount, CHARINDEX(',',Discount)-1 )
ELSE Discount END,
Discount2 = CASE
WHEN CHARINDEX(',',Discount) > 0 THEN RIGHT(Discount, LEN(Discount) - CHARINDEX(',',Discount) )
END
from #Temp
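For the remaining step of pulling out the value between the caret characters, the parsing logic can be sketched as follows. This is Python rather than SQL, and it assumes the layout seen in the sample data (code^amount^number, with entries separated by commas); `discount_values` is a hypothetical helper.

```python
from decimal import Decimal

def discount_values(discount):
    """Return the amount between the first two carets of each comma-separated entry."""
    values = []
    for entry in (discount or "").split(","):
        fields = entry.split("^")
        # An empty field or an entry without carets contributes nothing.
        if len(fields) >= 2 and fields[1]:
            values.append(Decimal(fields[1]))
    return values

samples = ["", "PO^-479^2", "+CA8A09^-240^4,CORDRC^-239^7"]
for s in samples:
    print(repr(s), sum(discount_values(s), Decimal("0")))
```

Summing the returned values per row gives the total discount, which can then be added to the summed Amount to get the net figure.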
