Oracle Creating Table using Data from Previous Tables with Calculations - sql

I currently have two tables, cts(time, symbol, open, close, high, low, volume) and dividends(time, symbol, dividend). I am attempting to make a third table named dividend_percent with columns Time, Date and Percent. To get the percentage for the dividend I believe the formula to be ((close - (open + dividend)) / open) * 100.
The request, however, exceeded the size allowed by Oracle XE and thus failed, but I don't believe my request should have been that big.
SQL> create table dividend_percent
  2  as (select c.Time, c.Symbol, (((c.close-(c.open+d.dividend))/c.open)*100) PRCNT
  3  from cts c inner join dividend d
  4  on c.Symbol=d.Symbol);
from cts c inner join dividend d
*
ERROR at line 3:
ORA-12953: The request exceeds the maximum allowed database size of 11 GB
Am I writing the query wrong, or in such a way that's really inefficient? The two tables are big, but I don't think too big.

Perhaps you could make a view which combines the two tables and performs the necessary calculations when needed:
CREATE VIEW DIVIDEND_PERCENT_VIEW AS
SELECT c.TIME,
       c.SYMBOL,
       ((c.CLOSE - (c.OPEN + d.DIVIDEND)) / c.OPEN) * 100 AS PRCNT
FROM CTS c
INNER JOIN DIVIDEND d
    ON c.SYMBOL = d.SYMBOL
   AND c.TIME = d.TIME
WHERE c.OPEN <> 0;
This avoids duplicating the data (nothing is stored twice) and performs the PRCNT calculation both for pre-existing data and for data added after the view is created.

Perhaps you could use a materialized view if you intend to perform DML operations while keeping the data in sync.
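For example, a minimal sketch of what that might look like (the BUILD/REFRESH options here are assumptions to adjust to your needs; ON COMMIT refresh would additionally require materialized view logs on both base tables):
CREATE MATERIALIZED VIEW DIVIDEND_PERCENT_MV
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
AS
SELECT c.TIME,
       c.SYMBOL,
       ((c.CLOSE - (c.OPEN + d.DIVIDEND)) / c.OPEN) * 100 AS PRCNT
FROM CTS c
INNER JOIN DIVIDEND d
    ON c.SYMBOL = d.SYMBOL
   AND c.TIME = d.TIME
WHERE c.OPEN <> 0;
-- Refresh on demand (hypothetical name above), e.g. from a scheduler job:
-- EXEC DBMS_MVIEW.REFRESH('DIVIDEND_PERCENT_MV');
Unlike a plain view, a materialized view does store the computed rows, so a very large result set would run into the same XE storage cap as the original CREATE TABLE attempt.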

Related

How would I optimize this? I need to pull data from different tables and use results of queries in queries

Please propose an approach I should follow, since I am obviously missing the point. I am new to SQL and still think in terms of MS Access. Here's an example of what I'm trying to do; don't worry about the detail, I just want to know how I would do this in SQL.
I have the following tables:
Hrs_Worked (staff, date, hrs) (200 000+ records)
Units_Done (team, date, type) (few thousand records)
Rate_Per_Unit (date, team, RatePerUnit) (few thousand records)
Staff_Effort (staff, team, timestamp) (eventually 3 - 4 million records)
So I need to do the following:
1) Calculate what each team earned by multiplying their units with RatePerUnit and Grouping on Team and Date. I create a view TeamEarnPerDay:
Create View teamEarnPerDay AS
SELECT
    Units_Done.Date,
    Units_Done.TeamID,
    Sum([Units_Done]*[Rate_Per_Unit.Rate]) AS Earn
FROM Units_Done INNER JOIN Rate_Per_Unit
ON (Units_Done.quality = Rate_Per_Unit.quality)
AND (Units_Done.type = Rate_Per_Unit.type)
AND (Units_Done.TeamID = Rate_Per_Unit.TeamID)
AND (Units_Done.Date = Rate_Per_Unit.Date)
GROUP BY
Units_Done.Date,
Units_Done.TeamID;
2) Count the TEAM's effort by Grouping Staff_Effort on Team and Date and counting records. This table has a few million records.
I have to cast the timestamp as a date....
CREATE View team_effort AS
SELECT
TeamID
,CAST([Timestamp] AS Date) AS TeamDate
,Count(Staff_EffortID) AS TeamEffort
FROM Staff_Effort
GROUP BY
TeamID
,CAST([Timestamp] AS Date);
3) Calculate the Team's Rate_of_pay: (1) Team_earnings / (2) Team_effort
I use the 2 views I created above. This view's performance drops but is still acceptable to me.
Create View team_rate_of_pay AS
SELECT
tepd.Date
,tepd.TeamID
,tepd.Earn
,te.TeamEffort
,tepd.Earn / te.TeamEffort AS teamRate
FROM teamEarnPerDay tepd
INNER JOIN team_effort te
ON (tepd.Date = te.TeamDate)
AND (tepd.TeamID = te.TeamID);
4) Group Staff_Effort on Date and Staff and count records to get each individual's effort (share of the team effort).
I have to cast the Timestamp as a date....
Create View staff_effort AS
SELECT
TeamID
,StaffID
,CAST([Timestamp] AS Date) as StaffDate
,Count(Staff_EffortID) AS StaffEffort
FROM Staff_Effort
GROUP BY
TeamID
,StaffID
,CAST([Timestamp] AS Date);
5) Calculate Staff earnings by: (4) Staff_Effort x (3) team_rate_of_pay
Multiply the individual's effort by the team rate he worked at on the day.
This one is ridiculously slow. In fact, it's useless.
CREATE View staff_earnings AS
SELECT
staff_effort.StaffDate
,staff_effort.StaffID
,sum(staff_effort.StaffEffort) AS StaffEffort
,sum([StaffEffort]*[TeamRate]) AS StaffEarn
FROM staff_effort INNER JOIN team_rate_of_pay
ON (staff_effort.TeamID = team_rate_of_pay.TeamID)
AND (staff_effort.StaffDate = team_rate_of_pay.Date)
Group By
staff_effort.StaffDate,
staff_effort.StaffID;
So you see what I mean.... I need various results and subsequent queries are dependent on those results.
What I tried to do is to write a view for each of the above steps and then just use each view in the next step, and so on. They work fine, but view no. 3 runs slower than the rest, though still acceptably. View no. 5 is just ridiculously slow.
I actually have another view after no. 5 which brings hours worked into play as well, but that just takes forever to produce a few rows.
I want a single line for each staff member, showing what he earned each day calculated as set out above, with his hours worked each day.
I also tried to reduce the number of views by using sub-queries instead but that took even longer.
A little guidance / direction will be much appreciated.
Thanks in advance.
--EDIT--
Taking the query posted in the comments, with some formatting, aliases, and a little cleanup, it would look like this:
SELECT epd.CompanyID
,epd.DATE
,epd.TeamID
,epd.Earn
,tb.TeamBags
,epd.Earn / tb.TeamBags AS RateperBag
FROM teamEarnPerDay epd
INNER JOIN teamBags tb ON epd.DATE = tb.TeamDate
AND epd.TeamID = tb.TeamID;
I eventually did 2 things:
1) Managed to reduce the number of nested views by using sub-queries. This did not improve performance by much, but it seems simpler with fewer views.
2) The actual improvement came from using LEFT JOIN instead of INNER JOIN.
The final view ran for 50 minutes with the Inner Join without producing a single row yet.
With LEFT JOIN, it produced all the results in 20 seconds!
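For illustration only, a rough sketch of what the final step might have looked like after that change (view, alias and column names are assumed from the steps above):
CREATE VIEW staff_earnings AS
SELECT
     se.StaffDate
    ,se.StaffID
    ,SUM(se.StaffEffort) AS StaffEffort
    ,SUM(se.StaffEffort * trp.teamRate) AS StaffEarn
FROM staff_effort se
-- LEFT JOIN instead of INNER JOIN: staff rows are kept even when no matching
-- team rate exists for that day (those rows simply get a NULL StaffEarn).
LEFT JOIN team_rate_of_pay trp
    ON se.TeamID = trp.TeamID
   AND se.StaffDate = trp.[Date]
GROUP BY
     se.StaffDate
    ,se.StaffID;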
Hope this helps someone.

select query showing decimal places on some fields but not others

I have two tables, A & B.
Table A has a column called Nominal which is a float.
Table B has a column called Units which is also a float.
I have a simple select query that highlights any differences between Nominals in table A & Units in table B.
select coalesce(A.Id, B.Id) Id, A.Nominal, B.Units, isnull(A.Nominal, 0) - isnull(B.Units, 0) Diff
from tblA A full outer join tblB B
on A.Id = B.Id
where isnull(A.Nominal, 0) - isnull(B.Units, 0) <> 0
This query works. However, this morning I have a slight problem.
The query is showing one line as having a difference:
Id Nominal Units Diff
FJLK 100000 100000 1.4515E-11
So obviously one or both of the figures are not 100,000 exactly. However, when I run a select query on each table individually for this Id, both of them return 100,000, so I can't see which one has decimal places. Why is this? Is this some sort of default display in SQL Server?
You will find this kind of behavior in Excel as well.
It's the standard way to represent very small numbers: 1.4515E-11 is the same as 1.4515 * 10^(-11), i.e. 0.000000000014515, so at least one of the stored values is not exactly 100,000.
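If you just want to keep those floating-point residues out of the difference report, one option (a sketch, assuming SQL Server since ISNULL is used above) is to compare against a small tolerance instead of against zero:
-- Treat differences smaller than a chosen tolerance as "no difference".
-- The 0.0001 threshold is an assumption; pick whatever precision is meaningful
-- for Nominal/Units in your data (or cast both columns to DECIMAL instead).
SELECT COALESCE(A.Id, B.Id) AS Id,
       A.Nominal,
       B.Units,
       ISNULL(A.Nominal, 0) - ISNULL(B.Units, 0) AS Diff
FROM tblA A
FULL OUTER JOIN tblB B
    ON A.Id = B.Id
WHERE ABS(ISNULL(A.Nominal, 0) - ISNULL(B.Units, 0)) > 0.0001;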

Migrating from non-partitioned to Partitioned tables

In June the BQ team announced support for date-partitioned tables. But the guide is missing how to migrate old non-partitioned tables into the new style.
I am looking for a way to update several or if not all tables to the new style.
Also, outside of DAY-type partitioning, what other options are available? Does the BQ UI expose this? I wasn't able to create such a partitioned table from the BQ Web UI.
What works for me is the following set of queries, applied directly in BigQuery (create a new query in the BigQuery console).
CREATE TABLE (new?)dataset.new_table
PARTITION BY DATE(date_column)
AS SELECT * FROM dataset.table_to_copy;
Then as the next step I drop the table:
DROP TABLE dataset.table_to_copy;
I got this solution from https://fivetran.com/docs/warehouses/bigquery/partition-table, using only step 2.
From Pavan's answer: Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it.
From Pentium10's comments: So suppose I have several years of data; I need to prepare a different query for each day and run all of them, and suppose I have 1000 days in history - do I need to pay 1000 times the full query price from the source table?
As we can see, the main problem here is having a full scan for each and every day. The rest is less of a problem and can easily be scripted in any client of choice.
So, below addresses: how to partition a table while avoiding a full table scan for each and every day?
The step-by-step below shows the approach.
It is generic enough to extend/apply to any real use case - meantime I am using bigquery-public-data.noaa_gsod.gsod2017 and limiting the "exercise" to just 10 days to keep it readable.
Step 1 – Create Pivot table
In this step we
a) compress each row's content into a record/array, and
b) put them all into the respective "daily" column
#standardSQL
SELECT
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170101' THEN r END) AS day20170101,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170102' THEN r END) AS day20170102,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170103' THEN r END) AS day20170103,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170104' THEN r END) AS day20170104,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170105' THEN r END) AS day20170105,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170106' THEN r END) AS day20170106,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170107' THEN r END) AS day20170107,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170108' THEN r END) AS day20170108,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170109' THEN r END) AS day20170109,
ARRAY_CONCAT_AGG(CASE WHEN d = 'day20170110' THEN r END) AS day20170110
FROM (
SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
FROM (
SELECT
stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
GROUP BY stn, d
)
)
GROUP BY line
Run the above query in the Web UI with pivot_table (or whatever name is preferred) as the destination.
As we can see, we get a table with 10 columns - one column for each day - and the schema of each column is a copy of the schema of the original table.
Step 2 – Processing partitions one-by-one ONLY scanning respective column (no full table scan) – inserting into respective partition
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170101) AS r
Run the above query from the Web UI with the destination table named mytable$20170101
You can run the same for the next day:
#standardSQL
SELECT r.*
FROM pivot_table, UNNEST(day20170102) AS r
Now you should have a destination table named mytable$20170102, and so on.
You should be able to automate/script this step with any client of your choice
There are many variations of how you can use the above approach - it is up to your creativity.
Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of one year is definitely not a problem here :o)
One caveat is a possible limitation on how far back you can go with new partitions - I heard (but haven't had a chance to check yet) that it is currently no more than 90 days back.
Update
Please note:
The above version has a little extra logic to pack all aggregated cells into as few final rows as possible.
ROW_NUMBER() OVER(PARTITION BY d) AS line
and then
GROUP BY line
along with
ARRAY_CONCAT_AGG(…)
does this
This works well when the row size in your original table is not that big, so the final combined row size will still be within the row size limit that BigQuery has (which I believe is 10 MB as of now).
If your source table already has a row size close to that limit, use the adjusted version below.
In this version, grouping is removed so that each row has a value for only one column.
#standardSQL
SELECT
CASE WHEN d = 'day20170101' THEN r END AS day20170101,
CASE WHEN d = 'day20170102' THEN r END AS day20170102,
CASE WHEN d = 'day20170103' THEN r END AS day20170103,
CASE WHEN d = 'day20170104' THEN r END AS day20170104,
CASE WHEN d = 'day20170105' THEN r END AS day20170105,
CASE WHEN d = 'day20170106' THEN r END AS day20170106,
CASE WHEN d = 'day20170107' THEN r END AS day20170107,
CASE WHEN d = 'day20170108' THEN r END AS day20170108,
CASE WHEN d = 'day20170109' THEN r END AS day20170109,
CASE WHEN d = 'day20170110' THEN r END AS day20170110
FROM (
SELECT
stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
FROM `bigquery-public-data.noaa_gsod.gsod2017` AS t
GROUP BY stn, d
)
WHERE d BETWEEN 'day20170101' AND 'day20170110'
As you can now see, the pivot table (sparse_pivot_table) is sparse enough (same 21.5 MB, but now 114,089 rows vs. 11,584 rows in pivot_table), so its average row size is 190 B vs. 1.9 KB in the initial version - obviously about 10 times smaller, in line with the number of columns in the example.
So before using this approach, some math needs to be done to project/estimate what can be done and how!
Still, each cell in the pivot table is a sort of JSON representation of a whole row in the original table: it holds not just the values, as the rows in the original table did, but also the schema.
As such it is quite verbose, so the size of a cell can be multiple times bigger than the original size [which limits the usage of this approach ... unless you get even more creative :o) ... there are still plenty of areas to apply it :o)]
As of today, you can now create a partitioned table from a non-partitioned table by querying it and specifying the partition column. You'll pay for one full table scan on the original (non-partitioned) table. Note: this is currently in beta.
https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result
To create a partitioned table from a query result, write the results to a new destination table. You can create a partitioned table by querying either a partitioned table or a non-partitioned table. You cannot change an existing standard table to a partitioned table using query results.
Until the new feature is rolled out in BigQuery, there is another (much cheaper) way to partition the table(s) by using Cloud Dataflow. We used this approach instead of running hundreds of SELECT * statements, which would have cost us thousands of dollars.
1) Create the partitioned table in BigQuery using the normal partition command
2) Create a Dataflow pipeline and use a BigQuery.IO.Read source to read the table
3) Use a Partition transform to partition each row
4) Using a max of 200 shards/sinks at a time (any more than that and you hit API limits), create a BigQuery.IO.Write sink for each day/shard that will write to the corresponding partition using the partition decorator syntax - "$YYYYMMDD"
5) Repeat N times until all data is processed.
Here's an example on Github to get you started.
You still have to pay for the Dataflow pipeline(s), but it's a fraction of the cost of using multiple SELECT * in BigQuery.
If you have date sharded tables today, you can use this approach:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_dated_tables_into_a_partitioned_table
If you have a single non-partitioned table to be converted to a partitioned table, you can try the approach of running a SELECT * query with allow large results and using the table's partition as the destination (similar to how you'd restate data for a partition):
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
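As a rough illustration only (the dataset, table and column names are placeholders, and the partition decorator on the destination is configured in the query job settings, not in the SQL text), each partition would be restated with something like:
#standardSQL
-- Hypothetical sketch: set the destination table to the target partition
-- (e.g. mydataset.new_table$20170101) in the job settings and let it overwrite;
-- repeat per day, which is what incurs the repeated scan cost noted below.
SELECT *
FROM `mydataset.old_table`
WHERE DATE(date_column) = '2017-01-01';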
Please note that this approach will charge you the scan cost of the source table for the query as many times as you query it.
We are working on something to make this scenario significantly better in the next few months.
CREATE TABLE `dataset.new_table`
PARTITION BY DATE(date_column)
AS SELECT * FROM `dataset.old_table`;
drop table `dataset.old_table`;
ALTER TABLE `dataset.new_table`
RENAME TO old_table;

Speed discrepancies when selecting different records in same table

I've got a pretty basic SQL query that's become a bottleneck in my processing. It's selecting a large varchar(999) column that's slowing it down. Removing that column from the select speeds it up considerably, so I know it's the column that's causing the problem.
I was experimenting with breaking it up into smaller 300-record batches to see if that helped, and I saw something weird happening. Some of the batches were taking almost 30 seconds, and some were taking 0.012 seconds. I don't know what's causing this discrepancy.
I have a reproducible scenario where the first query runs many times faster than the second:
select r.ID, r.FileID, r.Data
from Calls c
join RawData r on r.ID = c.ID
join DataFiles f on f.ID = r.FileID
where r.ID between 1118482415 and 1118509835
0.3 seconds
select r.ID, r.FileID, r.Data
from Calls c
join RawData r on r.ID = c.ID
join DataFiles f on f.ID = r.FileID
where r.ID between 1115330220 and 1118482415
8 seconds
I see no visible differences in the returned data. They both return 300 records, and all of the records' "Data" column values are about 170 characters long. I'm running this directly from the SqlStudio client. Also there's no other traffic in this database.
Does anybody know what could be causing this problem or have any suggestions to try? I can't decrease the size of the column because there are some bigger records in there, just not in this example. I do have indexes on all the columns used in the joins (Calls.ID, RawData.ID, RawData.FileID, DataFiles.ID).

Filtering out random rows in a table by using the sum function

I have an expense table designed using SQLite. I would like to construct a query to filter out some random rows using the sum function on the amount column of the table.
Sample Expense table
Clients Amounts
A 1000
B 3000
C 5000
D 2000
E 6000
Assuming I would like the total sum to be 10,000, I would like to construct a query which would return any number of random rows from the table above that add up to 10,000.
So far I tried:
SELECT *
FROM Expense
GROUP BY Clients
HAVING SUM(Amounts) = 10000
but I got nothing generated.
I have also had a go with the random function, but I'm assuming I need to specify a LIMIT.
SQLite does not support CTEs (specifically recursive ones), so I can't think of an easy way of doing this. Perhaps you would be better off doing this in your presentation logic.
One option via SQL would be to string together a number of UNION statements. Using your above sample data, you would need to string 3 UNIONs to get your results:
select clients
from expense
where amounts = 10000
union
select e.clients || e2.clients
from expense e
inner join expense e2 on e2.rowid > e.rowid
where e.amounts + e2.amounts = 10000
union
select e.clients || e2.clients || e3.clients
from expense e
inner join expense e2 on e2.rowid > e.rowid
inner join expense e3 on e3.rowid > e2.rowid
where e.amounts + e2.amounts + e3.amounts = 10000
Resulting in ABE and BCD. This would work for any group of clients, 1 to 3, whose sum is 10000. You could string more unions to get more clients -- this is just an example.
SQL Fiddle Demo
(Here's a sample with up to 4 clients - http://sqlfiddle.com/#!7/b01cf/2).
You can probably use dynamic SQL to construct your endless query if needed; however, I do think this is better suited to the presentation side.
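As an aside, newer SQLite releases (3.8.3 and later) did add recursive CTE support, so on a current version the same idea could be generalized without stacking UNIONs. A rough sketch, assuming the expense table above with non-negative amounts:
-- Hypothetical sketch; requires SQLite 3.8.3+ (recursive CTEs).
-- Combinations are built in rowid order so each subset is generated only once.
WITH RECURSIVE subsets(clients, total, last_id) AS (
    SELECT clients, amounts, rowid FROM expense
    UNION ALL
    SELECT s.clients || e.clients, s.total + e.amounts, e.rowid
    FROM subsets s
    JOIN expense e ON e.rowid > s.last_id
    WHERE s.total + e.amounts <= 10000  -- prune branches that already overshoot
)
SELECT clients FROM subsets WHERE total = 10000;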
What you are describing is the knapsack problem (in your case, the value is equal to the weight).
This can be solved in SQL (see sgeddes's answer), but due to SQL's set-oriented design, the computation is rather complex and very slow.
You would be better off by reading the amounts into your program and solving the problem there (see the pseudocode on the Wikipedia page).