Need to load huge dataset (32 Million) into table using SSIS

I have a huge dataset to return in SQL Server (about 32 million rows). It is implemented in a view, and the source code is as follows:
SELECT Idenitifier = ISNULL(mle.MIdeer, mle.Ider) + em.MemberId,
EffectiveDate = ISNULL(em.EffectiveDate,
(SELECT TOP 1 EffectiveDate
FROM c
WHERE SourceType = em.SourceType
AND GroupNumber = em.GroupNumber
AND ISNULL(GroupDivision, '') =
ISNULL(em.GroupDivision, '')))
FROM a em
JOIN b mle
ON mle.Identifier = em.GroupNumber + ISNULL('-' + em.GroupDivision, '')
-- Filter invalid legal entities
AND ISNULL(mle.Filter, 0) = 0
--- Gets a resultset of 531798 rows
CROSS JOIN -- this returns 63 rows , so
-- I am presuming 531798*63 rows here.
(SELECT *
FROM map
WHERE domaintype = 'MC')b;
I need to load this dataset into a table using SSIS. After about 16 million rows, I get a System.OutOfMemoryException in SQL Server when I run select * from <<view>>. How can I load this dataset into a table using SSIS while avoiding this exception?
Is there a better way to run this query efficiently? It currently takes more than 30 minutes.

I'm still thinking through this, but you might need to separate the CROSS JOIN:
;WITH cte AS (SELECT ISNULL(mle.MIdeer, mle.Ider) + em.MemberId AS Idenitifier
, ISNULL(em.EffectiveDate,
( SELECT TOP 1
EffectiveDate
FROM c
WHERE SourceType = em.SourceType
AND GroupNumber = em.GroupNumber
AND ISNULL(GroupDivision, '') = ISNULL(em.GroupDivision,
'')
)) AS EffectiveDate
FROM a em
JOIN b mle ON mle.Identifier = em.GroupNumber + ISNULL('-'+ em.GroupDivision,'')
AND ISNULL(mle.Filter, 0) = 0)
SELECT *
FROM cte
CROSS JOIN ( SELECT *
FROM map
WHERE domaintype = 'MC') b;
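On the memory side, the general idea is to avoid holding the full cross join in memory at once. The following is only a minimal Python sketch of that idea, not SSIS itself: the joined rows and the map rows are combined lazily and handed on in fixed-size batches. The function name, the batch size, and the sample rows are hypothetical.

```python
from itertools import islice, product


def cross_join_batches(left_rows, right_rows, batch_size):
    """Yield the cross join of two row sets in lists of at most batch_size."""
    pairs = product(left_rows, right_rows)  # lazy: pairs are produced on demand
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-ins for the joined a/b rows and the filtered map rows.
joined = [("id1", "2020-01-01"), ("id2", "2020-02-01")]
map_rows = [("MC", 1), ("MC", 2), ("MC", 3)]

batches = list(cross_join_batches(joined, map_rows, batch_size=4))

# Row count estimated in the question: joined rows times map rows.
expected_rows = 531798 * 63
```

Each batch could be written to the target table before the next one is produced, so only batch_size combined rows are held at any time.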

Related

Out of range integer: infinity

I'm trying to work through a problem that is a bit hard to explain, and I can't expose any of the data I'm working with. What I'm trying to understand is the error below, which I get when running the query below. I've renamed some of the tables and columns for sensitivity reasons, but the structure should be the same.
"Error from Query Engine - Out of range for integer: Infinity"
WITH accounts AS (
SELECT t.user_id
FROM table_a t
WHERE t.type like '%Something%'
),
CTE AS (
SELECT
st.x_user_id,
ad.name as client_name,
sum(case when st.score_type = 'Agility' then st.score_value else 0 end) as score,
st.obs_date,
ROW_NUMBER() OVER (PARTITION BY st.x_user_id,ad.name ORDER BY st.obs_date) AS rn
FROM client_scores st
LEFT JOIN account_details ad on ad.client_id = st.x_user_id
INNER JOIN accounts on st.x_user_id = accounts.user_id
--WHERE st.x_user_id IN (101011115,101012219)
WHERE st.obs_date >= '2020-05-18'
group by 1,2,4
)
SELECT
c1.x_user_id,
c1.client_name,
c1.score,
c1.obs_date,
CAST(COALESCE (((c1.score - c2.score) * 1.0 / c2.score) * 100, 0) AS INT) AS score_diff
FROM CTE c1
LEFT JOIN CTE c2 on c1.x_user_id = c2.x_user_id and c1.client_name = c2.client_name and c1.rn = c2.rn +2
I know the query works, because when I remove the first CTE and hard-code 2 ids into the WHERE clause I commented out, it returns the data I want. But I also need it to run based on the first CTE, which has ~5k unique ids.
Here is a sample output if I try with 2 ids:
Based on the above number of rows returned per id, I would expect it to return 5000 * 3 = 15,000 rows.
What could be causing the out of range for integer error?

This line is likely your problem:
CAST(COALESCE (((c1.score - c2.score) * 1.0 / c2.score) * 100, 0) AS INT) AS score_diff
When c2.score is 0, the division by c2.score yields infinity, which does not fit into the integer type you are casting to.
The query works for the two users in your example because neither of them has a 0 value for c2.score.
You might be able to fix this by wrapping the divisor in NULLIF, so that a zero divisor becomes NULL and COALESCE then returns 0:
CAST(COALESCE (((c1.score - c2.score) * 1.0 / NULLIF(c2.score, 0)) * 100, 0) AS INT) AS score_diff
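To make the logic of that expression concrete, here is a minimal Python sketch of the same calculation. The function name score_diff is hypothetical, and Python's int() truncates toward zero, which may differ from how your engine rounds when casting to INT.

```python
import math


def score_diff(current, previous):
    """Percentage change as an int; 0 when it cannot be computed.

    Mirrors CAST(COALESCE((c1 - c2) * 1.0 / NULLIF(c2, 0) * 100, 0) AS INT).
    """
    if current is None or previous is None or previous == 0:
        # NULLIF(previous, 0) turns a zero divisor into NULL, the division
        # then yields NULL, and COALESCE replaces that NULL with 0.
        return 0
    return int((current - previous) * 1.0 / previous * 100)


# Converting an infinite float to an integer fails, much like the reported error.
try:
    int(math.inf)
    overflowed = False
except OverflowError:
    overflowed = True
```

Note that a missing previous row (the LEFT JOIN finding no match) and a zero previous score both end up as 0 here, just as they do with COALESCE in the query.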

postgresql Multiple identical conditions are unified into one parameter

I have a SQL query that needs to convert a string column to an array, and I have to filter on this column. The query looks like this:
select
parent_line,
string_to_array(parent_line, '-')
from
bx_crm.department
where
status = 0 and
'851' = ANY(string_to_array(parent_line, '-')) and
array_length(string_to_array(parent_line, '-'), 1) = 5;
parent_line is a varchar(50) column; the data in it looks like 0-1-851-88.
Questions:
string_to_array(parent_line, '-') appears many times in my SQL.
How many times is string_to_array(parent_line, '-') evaluated for each row: once or three times?
How can I turn string_to_array(parent_line, '-') into a parameter? In the end, my SQL might look like this:
depts = string_to_array(parent_line, '-')
select
parent_line,
depts
from
bx_crm.department
where
status = 0 and
'851' = ANY(depts) and
array_length(depts, 1) = 5;

Postgres supports lateral joins which can simplify this logic:
select parent_line, v.parents, status, ... other columns ...
from bx_crm.department d cross join lateral
(values (string_to_array(parent_line, '-'))) v(parents)
where d.status = 0 and
cardinality(v.parents) = 5 and
'851' = any(v.parents)

Use a derived table:
select *
from (
select parent_line,
string_to_array(parent_line, '-') as parents,
status,
... other columns ...
from bx_crm.department
) x
where status = 0
and cardinality(parents) = 5
and '851' = any(parents)
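The point of the derived table is that the split result gets a name and is written once instead of three times. A rough Python sketch of the same filter, with the split stored in a variable and reused (the function name and the sample rows are hypothetical):

```python
def matching_departments(rows, wanted="851", depth=5):
    """Return (parent_line, parents) for active rows whose path matches."""
    result = []
    for parent_line, status in rows:
        parents = parent_line.split("-")  # computed once per row, then reused
        if status == 0 and len(parents) == depth and wanted in parents:
            result.append((parent_line, parents))
    return result


# Hypothetical sample rows: (parent_line, status)
rows = [
    ("0-1-851-88-90", 0),   # active, five levels, contains 851
    ("0-1-851-88", 0),      # only four levels
    ("0-1-852-88-90", 0),   # does not contain 851
    ("0-1-851-88-91", 1),   # status is not 0
]
matches = matching_departments(rows)
```

Only the first sample row passes all three conditions.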

Add rows based on a column value

I am having issues with creating additional rows based on a column value.
If my PageCount = 3 then I would need to have 2 additional rows where PONo is repeated but the ImagePath is incremented by 1 for each new row.
I am able to get the first row but, creating the additional rows with the ImagePath incremented by 1 is where I am stuck.
My result:
Expected result:
Finished: finished values
Current Select statement:
SELECT PO, CASE WHEN LEFT(u.Path,3)= 'M:\' THEN '\\ServerName\'+RIGHT(u.Path,LEN(u.Path)-3) ELSE u.Path END AS [Imagepath],PAGECOUNT
FROM OPENQUERY([LinkedServer],'select * from data.vw_purchasing_docs_unc') AS u INNER JOIN
OPENQUERY([LinkedServer],'select * from data.purchasing_docs') AS d ON u.docid=d.docid
WHERE (CONVERT(VARCHAR(10),d.STATUS_DATE,120)=CONVERT(VARCHAR(10),GETDATE(),120))
batch file:
bcp "select d.DOCID,DOC_TYPE,PO,d.STATUS, CASE WHEN LEFT(Path,3)= 'M:\' THEN '\\ServerName'+RIGHT(DWPath,LEN(Path)-3) ELSE Path END AS ImagePath, STATUS_DATE,'No' AS dwimport from openquery([LinkedServer],'select * from data.vw_purchasing_docs_unc') as u INNER JOIN openquery([LinkedServer],'select * from dwdata.purchasing_docs') as d ON u.docid=d.docid WHERE (CONVERT(varchar(10),STATUS_DATE,120)=CONVERT(varchar(10),GETDATE(),120)) AND d.STATUS IN ('FILED - Processing Complete','FILED - Partial Payment','FILED - Confirming') AND DOC_TYPE IN ('CO = Change Order','Purchase Order','CP = Capital Projects','Change Order','PO = Purchase Order','PO','PR = General Operating')" queryout "E:\Data\PO Trigger CSV\PO_Trigger_Doc.csv" -r \n -T -c -t"," -Umv -Smtvwrtst -Pm -q -k

Select ponumber,b.rplc,pagecount
from table t
cross apply
(select replace(imagepath,'f'+cast((n-1) as varchar(100)),'f0') as rplc from numbers n where n<=t.pagecount)b
To create the numbers table (if you are wondering why you need it, look here):
CREATE TABLE Numbers (N INT IDENTITY(1,1) PRIMARY KEY NOT NULL);
GO
INSERT INTO Numbers DEFAULT VALUES;
GO 10000
Using your select statement after update:
;With cte(ponumber,imagepath,pagecount)
as (
SELECT PO, CASE WHEN LEFT(u.Path,3)= 'M:\' THEN '\\ServerName\'+RIGHT(u.Path,LEN(u.Path)-3) ELSE u.Path END AS [Imagepath],PAGECOUNT
FROM OPENQUERY([LinkedServer],'select * from data.vw_purchasing_docs_unc') AS u INNER JOIN
OPENQUERY([LinkedServer],'select * from data.purchasing_docs') AS d ON u.docid=d.docid
WHERE (CONVERT(VARCHAR(10),d.STATUS_DATE,120)=CONVERT(VARCHAR(10),GETDATE(),120))
)
select Ponumber,b.rplc,pagecount from cte c
cross apply
(select replace(imagepath,'f'+cast((n-1) as varchar(100)),'f0') as rplc from numbers n where n<=c.pagecount)b

If you would like to avoid an additional table, you can use a CTE:
WITH Images AS
(
SELECT * FROM (VALUES
('C:\Folder', 2),
('D:\Folder', 3)) T(ImagePath, Val)
), Numbers AS
(
SELECT * FROM (VALUES (1),(2),(3),(4)) T(N)
UNION ALL
SELECT N1.N*4+T.N N FROM (VALUES(1),(2),(3),(4)) T(N) CROSS JOIN Numbers N1
WHERE N1.N*4+T.N<=100
)
SELECT ImagePath + '\f' + CONVERT(nvarchar(10) ,ROW_NUMBER() OVER (PARTITION BY ImagePath ORDER BY (SELECT 1))) NewPath
FROM Images
CROSS APPLY (SELECT TOP(Val) * FROM Numbers) T(N)
Images is your source table. It can be anything, e.g. an OPENQUERY. It produces:
NewPath
-------
C:\Folder\f1
C:\Folder\f2
D:\Folder\f1
D:\Folder\f2
D:\Folder\f3
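For reference, the row expansion that the query performs can be sketched in a few lines of Python. This is only an illustration of the logic, with a hypothetical function name and sample data, and it assumes paths are numbered from f1 as in the output above:

```python
def expand_pages(rows):
    """Yield (po, image_path) once per page, numbering paths from f1."""
    for po, base_path, page_count in rows:
        for page in range(1, page_count + 1):
            yield po, f"{base_path}\\f{page}"


# Hypothetical sample data: (PO, base image path, PageCount)
source = [("PO1", "C:\\Folder", 2), ("PO2", "D:\\Folder", 3)]
expanded = list(expand_pages(source))
```

Each source row is repeated PageCount times, with the PO carried along unchanged and only the path suffix incremented.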

Why CTE calculation is duplicated in query plan and how to optimize it without duplicating code?

Calculation of grp_set is duplicated 4 times in query plan of this query (distinct sort takes 23% each time, so it takes 23 * 4 = 92% of all resources):
with
grp_set as (select distinct old_num,old_tbl,old_db,old_val_num from err_calc)
,grp as (select id = row_number() over (order by old_num),* from grp_set)
,leaf as (select grp.id ,c.* ,sort = convert(varchar(max),old_col) + " - " + severity + " - " + err
from grp
join err_calc c on
c.old_num = grp.old_num
and c.old_tbl = grp.old_tbl
and c.old_db = grp.old_db
and c.old_val_num = grp.old_val_num
)
select old_num,old_tbl,old_db,old_val_num,conc.*
from (select sep=",") sep
cross join grp
cross apply (select
old_col = stuff((select sep + old_col from leaf where leaf.id = grp.id order by leaf.sort FOR XML PATH("")),1,len(sep),"")
,old_val = stuff((select sep + old_val from leaf where leaf.id = grp.id order by leaf.sort FOR XML PATH("")),1,len(sep),"")
,severity = stuff((select sep + severity from leaf where leaf.id = grp.id order by leaf.sort FOR XML PATH("")),1,len(sep),"")
,err = stuff((select sep + err from leaf where leaf.id = grp.id order by leaf.sort FOR XML PATH("")),1,len(sep),"")
) conc
Table err_calc contains about 350K records and it has only one index by old_db,old_tbl,new_tbl,severity,err,old_col,new_col,old_val_num,old_val,old_num,new_num.
The purpose of this query is to concatenate 4 string fields per group due to lack of concatenation aggregate in SQL.
Equivalent and desired query if concatenation aggregate existed or was implemented with CLR and if order by could be applied to source of aggregation and if all grouping fields could be referenced by grouping.* would be:
select grouping.*
,severity =conc(sep+severity)
,err =conc(sep+err)
,old_col =conc(sep+old_col)
,old_val =conc(sep+old_val)
from err_calc
cross join (select sep=',') sep
group by old_num,old_tbl,old_db,old_val_num
order by old_col,severity,err

Because the CTE behaves like a subquery: it is not materialized, so it is re-evaluated each time it is referenced. See "Calling CTE multiple times in same query".
You should rewrite your query to JOIN to your CTE instead of using CROSS APPLY, and put the string concatenation logic in the SELECT part of your query; the CTE will then be evaluated once.
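The underlying idea is to compute the grouping once and derive all four concatenated columns from that single result. A minimal Python sketch of that shape, with a hypothetical function name and sample rows; the sort key only approximates the query's sort column:

```python
from collections import defaultdict


def concat_per_group(rows, sep=","):
    """Group rows once, then build every concatenated column from that grouping."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["old_num"], row["old_tbl"], row["old_db"], row["old_val_num"])
        groups[key].append(row)

    result = {}
    for key, members in groups.items():
        # Approximates the query's sort column: old_col, then severity, then err.
        members.sort(key=lambda r: (r["old_col"], r["severity"], r["err"]))
        result[key] = {
            col: sep.join(r[col] for r in members)
            for col in ("old_col", "old_val", "severity", "err")
        }
    return result


# Hypothetical sample rows shaped like err_calc.
base = {"old_tbl": "t", "old_db": "d", "old_val_num": 1}
rows = [
    {**base, "old_num": 1, "old_col": "b", "old_val": "2", "severity": "high", "err": "e2"},
    {**base, "old_num": 1, "old_col": "a", "old_val": "1", "severity": "low", "err": "e1"},
    {**base, "old_num": 2, "old_col": "c", "old_val": "3", "severity": "low", "err": "e3"},
]
concatenated = concat_per_group(rows)
```

The grouping pass runs once regardless of how many columns are concatenated, which is the behavior the four-times-duplicated plan lacks.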

Why is this query with a nested select faster when I include the where clause twice

I had a large sql query that had a nested select in the from clause.
Similar to this:
SELECT * FROM
( SELECT * FROM SOME_TABLE WHERE some_num = 20)
WHERE some_num = 20
In my SQL query, if I remove the outer some_num = 20 condition, it takes 5 times as long. Shouldn't these queries run in almost exactly the same time? If anything, wouldn't the additional WHERE slow it down slightly?
What am I not understanding about how SQL queries work?
Here is the original query in question
SELECT a.ITEMNO AS Item_No,
a.DESCRIPTION AS Item_Description,
UNITPRICE / 100 AS Retail_Price,
b.UNITSALES AS Units_Sold,
( Dollar_Sales ) AS Dollar_Sales,
( Dollar_Cost ) AS Dollar_Cost,
( Dollar_Sales ) - ( Dollar_Cost ) AS Gross_Profit,
( Percent_Page * c.PAGECOST ) AS Page_Cost,
( Dollar_Sales - Dollar_Cost - ( Percent_Page * c.PAGECOST ) ) AS Net_Profit,
Percent_Page * 100 AS Percent_Page,
( CASE
WHEN UNITPRICE = 0 THEN NULL
WHEN Percent_Page = 0 THEN NULL
WHEN ( Dollar_Sales - Dollar_Cost - ( Percent_Page * c.PAGECOST ) ) > 0 THEN 0
ELSE ( ceiling(abs(Dollar_Sales - Dollar_Cost - ( Percent_Page * c.PAGECOST )) / ( UNITPRICE / 100 )) )
END ) AS Break_Even,
b.PAGENO AS Page_Num
FROM (SELECT PAGENO,
OFFERITEM,
UNITSALES,
UNITPRICE,
( DOLLARSALES / 100 ) AS Dollar_Sales,
( DOLLARCOST / 10000 ) AS Dollar_Cost,
(( CAST(STUFF(PERCENTPAGE, 2, 0, '.') AS DECIMAL(9, 6)) )) AS Percent_Page
FROM OFFERITEMS
WHERE LEFT(OFFERITEM, 6) = 'CH1301'
AND PERCENTPAGE > 0) AS b
INNER JOIN ITEMMAST a
ON a.EDPNO = 1 * RIGHT(OFFERITEM, 8)
LEFT JOIN OFFERS c
ON c.OFFERNO = 'CH1301'
WHERE LEFT(OFFERITEM, 6) = 'CH1301'
ORDER BY Net_Profit DESC
Notice the two occurrences of:
WHERE left(OFFERITEM,6) = 'CH1301'
If I remove the outer WHERE, the query takes 5 times as long.
As requested, here is the execution plan (excuse the poor upload):
http://i.imgur.com/1PqmpVf.png

Is the column OFFERITEM in an index but PERCENTPAGE is not?
In your inner query you reference both these columns, in the outer query you only reference OFFERITEM.
Difficult to say without seeing the execution plan, but it could be that the outer query is causing the optimizer to run an 'index scan' whereas the inner query would cause a full table scan.
On a separate note, you should definitely modify:
WHERE left(OFFERITEM,6) ='CH1301'
to:
where offeritem like 'CH1301%'
As this will allow an index seek if there is an index on offeritem.
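The effect of wrapping an indexed column in a function can be observed in a small experiment. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in, so it is an analogy rather than SQL Server's actual plan. The table and index names are hypothetical, and the prefix test is written as a range because SQLite only optimizes LIKE under certain collation settings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offeritems (offeritem TEXT, percentpage INTEGER)")
conn.execute("CREATE INDEX ix_offeritem ON offeritems (offeritem)")


def plan(sql):
    """Return the plan detail strings SQLite reports for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Function applied to the column: the index cannot be used for a seek.
wrapped = plan(
    "SELECT * FROM offeritems WHERE substr(offeritem, 1, 6) = 'CH1301'"
)

# Equivalent prefix test written as a range on the bare column.
ranged = plan(
    "SELECT * FROM offeritems "
    "WHERE offeritem >= 'CH1301' AND offeritem < 'CH1302'"
)
```

In SQLite the first plan is reported as a scan and the second as a search using the index. The real query should still be checked against its own execution plan in SQL Server.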
