How to optimize a multi-join job - azure-data-lake

How can I speed up my joins against CSV files?
I've got a query that is joining 8 files:
//please note this is the simplified query
DECLARE ... //summarizing here
FROM ...
USING Extractors.Text(delimiter : '|');
//8 more statements like the above omitted
SELECT one.an_episode_id,
one.id2_enc_kgb_id,
one.suffix,
two.suffixa,
three.suffixThree,
four.suffixFour,
five.suffixFive,
six.suffixSix,
seven.suffixSeven,
eight.suffixEight,
nine.suffixNine,
ten.suffixTen
FROM @one_files AS one
JOIN @two_files AS two
ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
JOIN @three_files AS three
ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @four_files AS four
ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @five_files AS five
ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
JOIN @six_files AS six
ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @seven_files AS seven
ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @eight_files AS eight
ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @nine_files AS nine
ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
JOIN @ten_files AS ten
ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
I submitted the job to Azure, and had to cancel it after a few hours and $80 in cost!
It was my understanding that Data Lake is meant exactly for this type of job! I've got perhaps 100 files total, totaling maybe 20 MB of data.
How can I speed up my joins?

What is important to note is that small files are suboptimal in every scenario. The solution suggested by Michal Rys for smaller files is to concatenate them into large files using one of these alternatives:
Offline outside of Azure
Event Hubs Capture
Stream Analytics
or ADLA fast file sets to compact most recent deltas
Note:
fast file set allows you to consume hundreds of thousands of such files in bulk in a single EXTRACT.
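For illustration, a minimal sketch of such a file set (the /input/ path and the {fileName} pattern are assumptions; the columns are taken from the question):
@allFiles =
    EXTRACT an_episode_id string,
            id2_enc_kgb_id string,
            suffix string,
            fileName string // virtual column bound from the {fileName} pattern
    FROM "/input/{fileName}.csv"
    USING Extractors.Text(delimiter : '|');
A single EXTRACT like this reads every file matching the pattern, so one statement replaces the ten separate EXTRACTs in the question.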
I would use INNER JOIN instead of JOIN to be sure you know which join you are really using.
It is rather important to see how you have EXTRACTed the information from the CSV files. The JOINed result should be OUTPUTed into a TSV (tab-separated values) file. (Not to be confused with TVFs, table-valued functions, which are for U-SQL code reuse.)
The TSV structure:
TSV = Tab-Separated-Value
It has no header row
Each row has the same number of columns
This format should be very efficient for U-SQL (I have not yet measured it myself).
For completeness: there are three different built-in outputter types, .Text(), .Csv(), and .Tsv().
Your example is missing the variable declarations, so I'll try to guess them:
USE DATABASE <your_database>;
USE SCHEMA <your_schema>;
DECLARE @FirstCsvFile string = "/<path>/first.csv";
@firstFile = EXTRACT an_episode_id string, id2_enc_kgb_id string, suffix string
FROM @FirstCsvFile USING Extractors.Text(delimiter : '|');
// probably 8 more statements which were omitted in the OP
@encode = SELECT one.an_episode_id,
one.id2_enc_kgb_id,
one.suffix,
two.suffixa,
three.suffixThree,
four.suffixFour,
five.suffixFive,
six.suffixSix,
seven.suffixSeven,
eight.suffixEight,
nine.suffixNine,
ten.suffixTen
FROM @firstFile AS one
INNER JOIN @two_files AS two
ON one.id3_enc_kgb_id == two.id3_enc_kgb_id
INNER JOIN @three_files AS three
ON three.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @four_files AS four
ON four.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @five_files AS five
ON five.id3_enc_kgb_id == one.id3_enc_kgb_id
INNER JOIN @six_files AS six
ON six.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @seven_files AS seven
ON seven.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @eight_files AS eight
ON eight.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @nine_files AS nine
ON nine.id2_enc_kgb_id == one.id2_enc_kgb_id
INNER JOIN @ten_files AS ten
ON ten.id2_enc_kgb_id == one.id2_enc_kgb_id;
OUTPUT @encode TO "/outputs/encode_joins.tsv" USING Outputters.Tsv();

Related

SQL Left Fuzzy Join with Levenshtein Distance

I have two data sets from two different systems being merged together within SQL, however, there is a slight difference within the naming conventions on the two systems. The change in convention is not consistent across the larger data sample but normally requires one modification to match.
System 1 data    System 2 data
AA0330           AA_330
AA0340           AA_340
AA0331           AA_331
AA0341           AA-341
I have been using the Levenshtein distance SQL function below to fuzzy match, getting the result below, however I end up with duplicate joins. How could I modify my code to mitigate this?
SELECT [System1].[mvmt1],
[System2].[mvmt2]
FROM [System1]
left join [System2]
ON dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]) < 2;
http://www.artfulsoftware.com/infotree/qrytip.php?id=552&m=0
Current output:
System 1 data    System 2 data
AA0330           AA_330
AA0330           AA_340
AA0340           AA_340
AA0331           AA_331
AA0341           AA-341
How can I make sure I only get one outcome from the join?
Not the best solution, but you can compare the first 2 characters and the last 3 characters, if all the codes follow the same pattern (2 characters at the start and 3 digits at the end):
SELECT [System1].[mvmt1],
[System2].[mvmt2]
FROM [System1]
inner join [System2]
ON left(mvmt1,2) = left(mvmt2,2)
and right(mvmt1,3) = right(mvmt2,3)
What about something like this (sorry about the poor formatting):
WITH Initial_Fuzzy_Join as(
SELECT [System1].[mvmt1],
[System2].[mvmt2] ,
dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]) as StringDistanceMetric
FROM [System1]
left outer join [System2]
ON dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]) < 2
)
SELECT mvmt1, mvmt2, MAX(StringDistanceMetric)
FROM Initial_Fuzzy_Join
GROUP BY mvmt1, mvmt2;
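If you need exactly one match per mvmt1, a window function is a more direct fix than GROUP BY. A sketch that keeps only the closest match per mvmt1 (ufn_levenshtein and the table names come from the question; the tie-breaker on mvmt2 is an assumption):
WITH Scored AS (
    SELECT [System1].[mvmt1],
           [System2].[mvmt2],
           ROW_NUMBER() OVER (
               PARTITION BY [System1].[mvmt1]
               ORDER BY dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]),
                        [System2].[mvmt2]
           ) AS rn
    FROM [System1]
    LEFT JOIN [System2]
        ON dbo.ufn_levenshtein([System1].[mvmt1], [System2].[mvmt2]) < 2
)
SELECT mvmt1, mvmt2
FROM Scored
WHERE rn = 1;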

LEFT JOIN in MS-Access with multiple match criteria (AND(OR)) bogging down

I'm working in MS-Access from Office 365.
t1 is a table with about 1,000 rows. I'm trying to LEFT JOIN t1 with t2 where t2 has a little under 200k rows. I'm trying to match up rows using Short Text strings in multiple fields, and all the relevant fields are indexed. The strings are relatively short, with the longest fields (the street fields) being about 15 characters on average.
Here is my query:
SELECT one.ID, two.ACCOUNT
FROM split_lct_2 AS one LEFT JOIN split_parcel AS two
ON (
nz(one.mySTREET) = nz(two.pSTREET)
OR nz(one.mySTREET_2) = nz(two.pSTREET)
OR nz(one.mySTREET_3) = nz(two.pSTREET)
)
AND (nz(one.myDIR) = nz(two.pDIR))
AND (nz(one.myHOUSE) = nz(two.pHOUSE));
The query works, however it behaves like a 3-year-old. The query table appears after several seconds, but remains sluggish indefinitely. For example, selecting a cell in the table takes 3-7 seconds. Exporting the query table as a .dbf takes about 8 minutes.
My concern is that this is just a sample file to build the queries, the actual t1 will have over 200k rows to process.
Is there a way to structure this query that will significantly improve performance?
I don't know if it will help but
(
nz(one.mySTREET) = nz(two.pSTREET)
OR nz(one.mySTREET_2) = nz(two.pSTREET)
OR nz(one.mySTREET_3) = nz(two.pSTREET)
)
is the same as
nz(two.pSTREET) IN (nz(one.mySTREET),nz(one.mySTREET_2),nz(one.mySTREET_3))
it might be the optimizer can handle this better.
Definitely, joining tables using text fields is not something you are hoping for.
But, life is life.
If there is no possibility to convert text strings into integers (for example additional table with street_name and street_id), try this:
SELECT one.ID, two.ACCOUNT
FROM split_lct_2 AS one LEFT JOIN split_parcel AS two
ON (nz(one.mySTREET) = nz(two.pSTREET))
AND (nz(one.myDIR) = nz(two.pDIR))
AND (nz(one.myHOUSE) = nz(two.pHOUSE))
UNION
SELECT one.ID, two.ACCOUNT
FROM split_lct_2 AS one LEFT JOIN split_parcel AS two
ON (nz(one.mySTREET_2) = nz(two.pSTREET))
AND (nz(one.myDIR) = nz(two.pDIR))
AND (nz(one.myHOUSE) = nz(two.pHOUSE))
UNION
SELECT one.ID, two.ACCOUNT
FROM split_lct_2 AS one LEFT JOIN split_parcel AS two
ON (nz(one.mySTREET_3) = nz(two.pSTREET))
AND (nz(one.myDIR) = nz(two.pDIR))
AND (nz(one.myHOUSE) = nz(two.pHOUSE));
I suppose using Nz() does not allow the engine to use an index. Try to avoid them. If the data has no NULLs in the join key fields, then Nz() can safely be removed from the query, and that should help. But if the data has NULLs, you probably need to change this - for example, replace all NULLs with empty strings to make the fields join-able without Nz(). That's additional data processing outside of this query, as sketched below.
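For example, a sketch of that pre-processing (assuming an empty string is an acceptable stand-in for NULL in these fields):
UPDATE split_parcel SET pSTREET = "" WHERE pSTREET IS NULL;
Repeat for pDIR and pHOUSE, and for the mySTREET/myDIR/myHOUSE fields in split_lct_2; after that the ON clause can compare the columns directly and the indexes become usable.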

Ibis Impala JOIN problem with relabel/name 'column AS newName'

When you use the Ibis API to query Impala, for some reason Ibis forces the query to become a subquery (and when you join 4-5 tables it suddenly becomes super slow). It simply won't join normally, due to the column name overlap problem on joins. I want a way to quickly rename the columns; isn't that how SQL usually works?
i0 = impCon.table('shop_inventory')
s0 = impCon.table('shop_expenditure')
s0 = s0.relabel({'element_date': 'spend_element_date', 'element_shop_item': 'spend_shop_item'})
jn = i0.inner_join(s0, [i0['element_date'] == s0['spend_element_date'], i0['element_shop_item'] == s0['spend_shop_item']])
jn.materialize()
jn.execute(limit=900)
Then Ibis generates SQL that subqueries it, without me suggesting it:
SELECT *
FROM (
SELECT `element_date`, `element_shop_item`, `element_address`, `element_expiration`,
`element_category`, `element_description`
FROM dbp.`shop_inventory`
) t0
INNER JOIN (
SELECT `element_shop_item` AS `spend_shop_item`, `element_comm` AS `spend_comm`,
`element_date` AS `spend_date`, `element_amount`,
`element_spend_type`, `element_shop_item_desc`
FROM dbp.`shop_spend`
) t1
ON (`element_shop_item` = t1.`spend_shop_item`) AND
(`element_category` = t1.`spend_category`) AND
(`element_subcategory` = t1.`spend_subcategory`) AND
(`element_comm` = t1.`spend_comm`) AND
(`element_date` = t1.`spend_date`)
LIMIT 900
Why is this so difficult?
Ideally it should be as simple as:
jn = i0.inner_join(s0, [s0['element_date'].name('spend_date') == i0['element_date']])
to generate a single statement like: SELECT i0.element_date, s0.element_date AS spend_date FROM dbp.shop_inventory i0 INNER JOIN dbp.shop_spend s0 ON s0.element_date = i0.element_date
right?
Are we never allowed to have the same column names on tables that are being joined? I am pretty sure that in raw SQL you can just use "X AS Y" without needing a subquery.
I spent the last few hours struggling with this same issue. A better solution I found is to do the following. Join keeping the variable names the same. Then, before you materialize, only select a subset of the variables such that there isn't any overlap.
So in your code it would look something like this:
jn = i0.inner_join(s0, [i0['element_date'] == s0['element_date'], i0['element_shop_item'] == s0['element_shop_item']])
expr = jn[i0, s0['variable_of_interest_1'],s0['variable_of_interest_2']]
expr.materialize()
See here for more resources
https://docs.ibis-project.org/sql.html
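For reference, the flat SQL that this projection-before-materialize approach should let Ibis emit looks roughly like this (table and column names are taken from the question and the code above; the exact output depends on your Ibis version):
SELECT t0.*, t1.`variable_of_interest_1`, t1.`variable_of_interest_2`
FROM dbp.`shop_inventory` t0
INNER JOIN dbp.`shop_expenditure` t1
    ON t0.`element_date` = t1.`element_date`
   AND t0.`element_shop_item` = t1.`element_shop_item`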

Optimize SQL query with many left join

I have a SQL query with many left joins
SELECT COUNT(DISTINCT po.o_id)
FROM T_PROPOSAL_INFO po
LEFT JOIN T_PLAN_TYPE tp ON tp.plan_type_id = po.Plan_Type_Fk
LEFT JOIN T_PRODUCT_TYPE pt ON pt.PRODUCT_TYPE_ID = po.cust_product_type_fk
LEFT JOIN T_PROPOSAL_TYPE prt ON prt.PROPTYPE_ID = po.proposal_type_fk
LEFT JOIN T_BUSINESS_SOURCE bs ON bs.BUSINESS_SOURCE_ID = po.CONT_AGT_BRK_CHANNEL_FK
LEFT JOIN T_USER ur ON ur.Id = po.user_id_fk
LEFT JOIN T_ROLES ro ON ur.roleid_fk = ro.Role_Id
LEFT JOIN T_UNDERWRITING_DECISION und ON und.O_Id = po.decision_id_fk
LEFT JOIN T_STATUS st ON st.STATUS_ID = po.piv_uw_status_fk
LEFT OUTER JOIN T_MEMBER_INFO mi ON mi.proposal_info_fk = po.O_ID
WHERE 1 = 1
AND po.CUST_APP_NO LIKE '%100010233976%'
AND 1 = 1
AND po.IS_STP <> 1
AND po.PIV_UW_STATUS_FK != 10
The performance is not good and I would like to optimize the query.
Any suggestions please?
Try this one -
SELECT COUNT(DISTINCT po.o_id)
FROM T_PROPOSAL_INFO po
WHERE PO.CUST_APP_NO LIKE '%100010233976%'
AND PO.IS_STP <> 1
AND po.PIV_UW_STATUS_FK != 10
First, check your indexes. Are they old? Did they get fragmented? Do they need rebuilding?
Then, check your "execution plan" (varies depending on the SQL Engine): are all joins properly understood? Are some of them 'out of order'? Do some of them transfer too many data?
Then, check your plan and indexes: are all important columns covered? Are there any outstandingly lengthy table scans or joins? Are the columns in indexes IN ORDER with the query?
Then, revise your query:
- can you extract some parts that normally would quickly generate small rowset?
- can you add new columns to indexes so join/filter expressions will get covered?
- or reorder them so they match the query better?
And, supporting the solution from @Devart:
Can you eliminate some tables along the way? Does the WHERE clause touch the other tables at all? Does the data in the other tables modify the count significantly? If neither SELECT nor WHERE ever touches the other joined columns, and if the exact COUNT value is not that important (i.e. does that T_PROPOSAL_INFO row exist?), then you might remove all the joins completely, as Devart suggested. LEFT JOINs never reduce the number of rows; they only copy/expand/multiply the rows.
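If you keep the joins, a covering index on the driving table is the usual next step. A sketch in SQL Server syntax (the column choice is an assumption read off the WHERE clause, so verify it against your actual execution plan):
CREATE INDEX IX_PROPOSAL_FILTER
    ON T_PROPOSAL_INFO (IS_STP, PIV_UW_STATUS_FK)
    INCLUDE (CUST_APP_NO, O_ID);
Note that the leading-wildcard LIKE '%100010233976%' can never seek an index; an index like this only gives the engine a narrower structure to scan than the full table.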

What would be a reasonably fast way to code this sql query in c#?

I have this SQL query:
select
sum(h.nbHeures)
from
Heures h
join
HeuresProjets hp on h.HpGuid=hp.HPId
join
ActivityCodes ac on h.Code=ac.ActivityId
join
MainDoeuvre mdo on ac.ActivityId=mdo.CodeGuid
where
hp.ExtraGuid = '61E931C8-3268-4C9C-9FF5-ED0213D348D0'
and mdo.NoType = 1
It runs in less than a second, which is good. My project uses LINQ to Entities to get data. This query (very similar to the SQL) is terribly slow, taking more than a minute.
var result = (from hp in this.HeuresProjets
join h in ctx.Heures on hp.HPId equals h.HpGuid
join ac in ctx.ActivityCodes on h.Code equals ac.ActivityId
join mdo in ctx.MainDoeuvre on ac.ActivityId equals mdo.CodeGuid
where hp.ExtraGuid == this.EntityGuid && mdo.NoType == (int)spType
select h.NbHeures).Sum();
total = result;
I tried using nested loops instead. It's faster but still slow (~15 seconds).
foreach (HeuresProjets item in this.HeuresProjets)
{
foreach (Heures h in ctx.Heures.Where(x => x.HpGuid == item.HPId))
{
if (h.ActivityCodes != null && h.ActivityCodes.MainDoeuvre.FirstOrDefault() != null && h.ActivityCodes.MainDoeuvre.First().NoType == (int)type)
{
total += h.NbHeures;
}
}
}
Am I doing something obviously wrong? If there's no way to optimize this I'll just call a stored procedure, but I would really like to keep the logic in the code.
EDIT
I modified my query according to IronMan84's advice.
decimal total = 0;
var result = (from hp in ctx.HeuresProjets
join h in ctx.Heures on hp.HPId equals h.HpGuid
join ac in ctx.ActivityCodes on h.Code equals ac.ActivityId
join mdo in ctx.MainDoeuvre on ac.ActivityId equals mdo.CodeGuid
where hp.ExtraGuid == this.EntityGuid && mdo.NoType == (int)spType
select h);
if(result.Any())
total = result.Sum(x=>x.NbHeures);
This almost works. It runs fast and gives back a decimal, but:
1. It's not the right value.
2. The result is clearly cached, because it returns the exact same value with different parameters.
From looking at your code I'm thinking that your query is grabbing every single record from those tables that you're joining on (hence the long amount of time). I'm seeing you using this.HeuresProjets, which I'm assuming is a collection of database objects that you already had grabbed from the database (and that's why you're not using ctx.HeuresProjets). That collection, then, has probably already been hydrated by the time you get to your join query. In which case it becomes a LINQ-To-Objects query, necessitating that EF go and grab all of the other tables' records in order to complete the join.
Assuming I'm correct in my assumption (and let me know if I'm wrong), you might want to try this out:
var result = (from hp in ctx.HeuresProjets
join h in ctx.Heures on hp.HPId equals h.HpGuid
join ac in ctx.ActivityCodes on h.Code equals ac.ActivityId
join mdo in ctx.MainDoeuvre on ac.ActivityId equals mdo.CodeGuid
where hp.ExtraGuid == this.EntityGuid && mdo.NoType == (int)spType
select h).Sum(h => h.NbHeures);
total = result;
Also, if this.HeuresProjets is a filtered list of only specific objects, you can then just add to the where clause of the query to make sure that the IDs are in this.HeuresProjets.Select(hp => hp.HPId)