Consider the sample table:
Col1, Col2, Col3
1 , x , G
1 , y , H
2 , z , J
2 , a , K
2 , a , K
3 , b , E
I want the result below, i.e. the distinct rows:
1 , x , G
1 , y , H
2 , z , J
2 , a , K
3 , b , E
I tried
var Result = Context.Table.Select(C =>
new {
Col1 = C.Col1,
Col2 = C.Col2,
Col3 = C.Col3
}).Distinct();
and
Context.Table.GroupBy(x=>new {x.Col1,x.Col2,x.Col3}).Select(x=>x.First()).ToList();
The results are as expected. However, my table has 35 columns and 1 million records, and it will keep growing. The query currently takes 22-30 seconds. How can I improve the performance and get it down to 2-3 seconds?
Using Distinct is the way to go. I'd say the first approach you tried is the correct one, but do you really need all 1 million rows? See which Where conditions you can add, or maybe take just the first x records:
var Result = Context.Table.Select(c => new
    {
        Col1 = c.Col1,
        Col2 = c.Col2,
        Col3 = c.Col3
    })
    .Where(c => /* some condition to narrow results */)
    .Take(1000) // the number of records you want; note that Take is applied before Distinct here
    .Distinct();
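In the snippet above, Take runs before Distinct, so it de-duplicates only the first x rows rather than returning x distinct rows. A minimal Python sketch of the difference, with a plain list standing in for the table (the sample rows are the ones from the question; the `distinct` helper is just for illustration):

```python
from itertools import islice

# Sample rows from the question: (Col1, Col2, Col3).
rows = [
    (1, "x", "G"),
    (1, "y", "H"),
    (2, "z", "J"),
    (2, "a", "K"),
    (2, "a", "K"),
    (3, "b", "E"),
]


def distinct(items):
    """Yield items in first-seen order, skipping duplicates."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# Take first, then de-duplicate: only the first 5 rows are examined.
take_then_distinct = list(distinct(islice(rows, 5)))

# De-duplicate first, then take: up to 5 distinct rows are returned.
distinct_then_take = list(islice(distinct(rows), 5))

print(len(take_then_distinct), len(distinct_then_take))
```

Which order you want depends on whether the limit is meant to cap the rows scanned or the rows returned.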
What you might be able to do is use the row number to select in batches. Something like:
public <return type> RetrieveBulk(int fromRow, int toRow)
{
    return Context.Table.Where(record => record.Rownum >= fromRow && record.Rownum < toRow)
        .Select(c => new
        {
            Col1 = c.Col1,
            Col2 = c.Col2,
            Col3 = c.Col3
        })
        .Distinct()
        .ToList(); // ToList() runs the query here instead of deferring it to the caller
}
You can then call it like this:
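List<Task<return type>> selectTasks = new List<Task<return type>>();
for (int i = 0; i < 1000000; i += 1000)
{
    int fromRow = i; // copy the loop variable so each lambda captures its own value
    selectTasks.Add(Task.Run(() => RetrieveBulk(fromRow, fromRow + 1000)));
}
Task.WaitAll(selectTasks.ToArray()); // ToArray() (System.Linq) matches the Task[] overload of WaitAll
// Then merge the batches using an efficient structure such as a HashSet, so that de-duplicating across batches is O(n) rather than O(n^2)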
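The merge step mentioned in the last comment could look roughly like this. It is only a sketch in Python, assuming an in-memory list `ROWS` in place of the table and a hypothetical `retrieve_bulk` that mirrors RetrieveBulk; the point is that a set union de-duplicates across batches with hash lookups:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the table: the sample rows from the question.
ROWS = [
    (1, "x", "G"),
    (1, "y", "H"),
    (2, "z", "J"),
    (2, "a", "K"),
    (2, "a", "K"),
    (3, "b", "E"),
]


def retrieve_bulk(from_row, to_row):
    """Hypothetical batch fetch: distinct rows with index in [from_row, to_row)."""
    return set(ROWS[from_row:to_row])


def distinct_in_batches(total_rows, batch_size):
    """Fetch batches concurrently and merge them into one set of distinct rows."""
    ranges = [
        (start, start + batch_size)
        for start in range(0, total_rows, batch_size)
    ]
    merged = set()
    with ThreadPoolExecutor() as pool:
        for batch in pool.map(lambda r: retrieve_bulk(*r), ranges):
            merged |= batch  # set union: duplicates across batches collapse
    return merged


result = distinct_in_batches(len(ROWS), batch_size=2)
print(sorted(result))
```

Note that a duplicate pair can straddle two batches, which is why the per-batch Distinct alone is not enough and the final merge is needed.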
I have a table like the following in Google BigQuery. I am trying to get all possible unique combinations (all subsets except the empty subset) of the Item column, partitioned by Group.
Group Item
1 A
1 B
1 C
2 X
2 Y
2 Z
I am looking for an output like the following:
Group Item
1 A
1 B
1 C
1 A,B
1 B,C
1 A,C
1 A,B,C
2 X
2 Y
2 Z
2 X,Y
2 Y,Z
2 X,Z
2 X,Y,Z
I have tried to adapt the accepted answer to this question to incorporate Group, to no avail:
How to get combination of value from single column?
Consider the approach below:
CREATE TEMP FUNCTION generate_combinations(a ARRAY<STRING>)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
var combine = function(a) {
var fn = function(n, src, got, all) {
if (n == 0) {
if (got.length > 0) {
all[all.length] = got;
} return;
}
for (var j = 0; j < src.length; j++) {
fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
} return;
}
var all = []; for (var i = 1; i < a.length; i++) {
fn(i, a, [], all);
}
all.push(a);
return all;
}
return combine(a)
''';
with your_table as (
select 1 as _Group,'A' as Item union all
select 1, 'B' union all
select 1, 'C' union all
select 2, 'X' union all
select 2, 'Y' union all
select 2, 'Z'
)
select _group, item
from (
select _group, generate_combinations(array_agg(item)) items
from your_table
group by _group
), unnest(items) item
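For checking the expected result outside BigQuery, the same idea (collect the items per group, then emit every non-empty subset) can be sketched in Python with itertools.combinations. The function name `subsets_per_group` is made up for this example:

```python
from itertools import combinations

# Sample data from the question: (Group, Item).
rows = [
    (1, "A"), (1, "B"), (1, "C"),
    (2, "X"), (2, "Y"), (2, "Z"),
]


def subsets_per_group(rows):
    """Return (group, 'item,item,...') for every non-empty subset of each group's items."""
    groups = {}
    for group, item in rows:
        groups.setdefault(group, []).append(item)

    result = []
    for group, items in groups.items():
        # Sizes 1..len(items) cover every subset except the empty one.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                result.append((group, ",".join(combo)))
    return result


result = subsets_per_group(rows)
for group, item in result:
    print(group, item)
```

A group of n items yields 2^n - 1 rows, so this grows quickly for large groups.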
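Try this (note that it only covers groups of up to three items, since it combines single items, pairs, and a running concatenation):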
with _data as
(
select 1 as _Group,'A' as Item union all
select 1 as _Group,'B' as Item union all
select 1 as _Group,'C' as Item union all
select 2 as _Group,'X' as Item union all
select 2 as _Group,'Y' as Item union all
select 2 as _Group,'Z' as Item
)
select distinct _Group ,Item from
(
select _Group,
Item
from _data
union all
select _Group,
string_agg(Item ,',') over(partition by _Group order by Item ) as item
from _data
union all
select a._Group ,
concat(a.item,',',b.item)
from _data a left join _data b on a._group = b._group and a.Item < b.Item
)
where item is not null
order by _group
Given data such as:
Month ValueA
1 T
2 T
3 T
4 F
Is there a way to make a measure that would find, for each month, whether the last three values were all True?
So the output would be (F,F,T,F)?
That would probably mean that my actual problem is solvable, which is to find, from:
Month ValueA ValueB ValueC
1 T F T
2 T T T
3 T T T
4 F T F
the count of those booleans for each row, so the output would be (0, 0, 2 [A and C], 1 [B]).
EDIT:
Okay, I managed to solve the first part with this:
Previous =
VAR PreviousDate =
MAXX(
FILTER(
ALL( 'Table' ),
EARLIER( 'Table'[Month] ) > 'Table'[Month]
),
'Table'[Month]
)
VAR PreviousDate2 =
MAXX(
FILTER(
ALL( 'Table' ),
EARLIER( 'Table'[Month] ) - 1 > 'Table'[Month]
),
'Table'[Month]
)
RETURN
IF(
CALCULATE(
MAX( 'Table'[Value] ),
FILTER(
'Table',
'Table'[Month] = PreviousDate
)
) = "T"
&& CALCULATE(
MAX( 'Table'[Value] ),
FILTER(
'Table',
'Table'[Month] = PreviousDate2
)
) = "T"
&& 'Table'[Value] = "T",
TRUE,
FALSE
)
But is there a way to use it with an unknown number of columns, without hard-coding every column name? Like a loop or something.
I would reshape the data table in Power Query (unpivoting the ValueX columns) and change T/F to 1/0. Then have a dimension table with a relationship on Month.
Then add a measure like this:
Three Consec T =
var maxMonth = MAX('Data'[Month])
var tempTab =
FILTER(
dimMonth;
'dimMonth'[MonthNumber] <= maxMonth && 'dimMonth'[MonthNumber] > maxMonth -3
)
var sumMonth =
MAXX(
'dimMonth';
CALCULATE(
SUM('Data'[OneOrZero]);
tempTab
)
)
return
IF(
sumMonth >= 3;
"3 months in a row";
"No"
)
You can then build a visual in which a slicer selects the time window and a table shows whether there have been 3 consecutive Ts or not.
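To sanity-check the expected numbers from the question independently of DAX, here is a small Python sketch of the same logic. It assumes the months are sorted and contiguous, and the helper name `last_n_true` is made up for this example. It works for any number of Value columns without naming them individually:

```python
# Sample data from the question, one list per Value column (months 1..4).
table = {
    "ValueA": ["T", "T", "T", "F"],
    "ValueB": ["F", "T", "T", "T"],
    "ValueC": ["T", "T", "T", "F"],
}


def last_n_true(values, n=3):
    """For each row, True if that row and the n-1 rows before it are all 'T'."""
    flags = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        # Rows with fewer than n months of history cannot qualify.
        flags.append(len(window) == n and all(v == "T" for v in window))
    return flags


# One flag list per column, regardless of how many columns there are.
per_column = {name: last_n_true(vals) for name, vals in table.items()}

# Per month: how many columns have three consecutive Ts ending at that month.
row_count = len(next(iter(table.values())))
counts = [
    sum(flags[i] for flags in per_column.values())
    for i in range(row_count)
]

print(per_column["ValueA"])
print(counts)
```

Looping over the dictionary plays the same role as unpivoting the columns in Power Query: the column names become data instead of code.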
I have a data set with these columns:
FMID,County,WIC,WICcash
Here is a sample of the data:
1002267,Douglas,Y,N
21005876,Douglas,Y,N
1001666,Douglas,N,Y
I have grouped the data based on County and have filtered the data based on County = 'Douglas'. Here is the output:
(Douglas,{(1002267,Douglas,Y,N),(21005876,Douglas,Y,N),(1001666,Douglas,N,Y)})
Now, wherever the WIC and WICcash columns have the value Y, I want to take the combined count of those values from both columns.
Here, combining the WIC and WICcash columns, I have 3 Y values, so my output should be
Douglas 3
How can I achieve this?
Below is the code I have written so far:
load_data = LOAD 'PigPrograms/Markets/DATA_GOV_US_Farmers_Market_DataSet.csv' USING PigStorage(',') as (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
group_markets_by_county = GROUP load_data BY County;
filter_county = FILTER group_markets_by_county BY group == 'Douglas';
DUMP filter_county;
To look inside a bag, you can use a nested FOREACH.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = GROUP A by County;
describe B; /* B: {group: chararray,A: {(FMID: long,County: chararray,WIC: chararray,WICcash: chararray)}} */
C = FOREACH B {
FILTER_WIC_Y = FILTER A by WIC == 'Y';
COUNT_WIC_Y = COUNT(FILTER_WIC_Y);
FILTER_WICcash_Y = FILTER A by WICcash == 'Y';
COUNT_WICcash_Y = COUNT(FILTER_WICcash_Y);
GENERATE group, COUNT_WIC_Y + COUNT_WICcash_Y as count;
}
dump C;
Or you can map 'Y' and 'N' to 1 and 0 and add them up.
A = LOAD 'input3.txt' AS (FMID:long,County:chararray, WIC:chararray, WICcash:chararray);
B = FOREACH A GENERATE FMID, County, (WIC == 'Y' ? 1 : 0 ) as wic, (WICcash == 'Y' ? 1 : 0 ) as wiccash;
C = GROUP B by County;
D = FOREACH C GENERATE group, SUM(B.wic) + SUM(B.wiccash) as count;
dump D;
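The second variant (turn each 'Y' into 1, then sum per county) is easy to verify on the sample data with a few lines of Python. This is only an illustration of the logic, not a replacement for the Pig script, and `count_y_by_county` is a name invented here:

```python
from collections import Counter

# Sample records from the question: (FMID, County, WIC, WICcash).
records = [
    (1002267, "Douglas", "Y", "N"),
    (21005876, "Douglas", "Y", "N"),
    (1001666, "Douglas", "N", "Y"),
]


def count_y_by_county(records):
    """Sum the number of 'Y' values in WIC and WICcash for each county."""
    counts = Counter()
    for _fmid, county, wic, wic_cash in records:
        # Each comparison is a bool, which adds as 1 or 0.
        counts[county] += (wic == "Y") + (wic_cash == "Y")
    return counts


counts = count_y_by_county(records)
print(counts["Douglas"])
```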
I am trying to construct a double count left join query in Slick 3.1 similar to this:
SELECT
COUNT(p.id) AS replies,
COUNT(f.id) AS images
FROM posts AS p
LEFT JOIN files AS f
ON p.id = f.post_id
WHERE thread = :thread_id
Join part of the query is quite simple and looks like this:
val joinQ = (threadId: Rep[Long]) =>
(postDAO.posts joinLeft fileRecordDAO.fileRecords on (_.id === _.postId))
.filter(_._1.thread === threadId)
Using joinQ(1L).map(_._1.id).length generates COUNT(1), which counts all rows; that is not the result I want. Using joinQ(1L).map(_._1.id).countDistinct generates COUNT(DISTINCT p.id), which is closer to what I'm looking for, but trying to do two of these generates this monstrosity:
select x2.x3, x4.x5
from (select count(distinct x6.`id`) as x3
from `posts` x6
left outer join `files` x7 on x6.`id` = x7.`post_id`
where x6.`thread` = 1) x2,
(select
count(distinct (case when (x8.`id` is null) then null else 1 end)) as x5
from `posts` x9
left outer join `files` x8 on x9.`id` = x8.`post_id`
where x9.`thread` = 1) x4
Here's the double countDistinct code:
val q = (threadId: Rep[Long]) => {
val join = joinQ(threadId)
val q1 = join.map(_._1.id).countDistinct // map arg type is
val q2 = join.map(_._2).countDistinct // (Post, Rep[Option[FileRecord]])
(q1, q2)
}
Any ideas? :)
COUNT is an aggregate function, so you also need to group by some field (e.g. id). Try the following query:
def joinQ(threadId: Long) = (postDAO.posts joinLeft fileRecordDAO.fileRecords on (_.id === _.postId))
.filter(_._1.thread === threadId)
val q = (threadId: Long) => {
joinQ(threadId).groupBy(_._1.id).map {
case (id, qry) => (id, qry.map(_._1.id).countDistinct, qry.map(_._2.map(_.id)).countDistinct)
}
}
It generates the following SQL:
SELECT x2.`id`,
Count(DISTINCT x2.`id`),
Count(DISTINCT x3.`id`)
FROM `posts` x2
LEFT OUTER JOIN `files` x3
ON x2.`id` = x3.`post_id`
WHERE x2.`thread` = 10
GROUP BY x2.`id`
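To see what shape of result the grouped SQL gives compared to an ungrouped aggregate, here is a small sketch using Python's sqlite3 module. The table contents are invented for illustration, and SQLite is used only as a convenient stand-in for the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, thread INTEGER);
    CREATE TABLE files (id INTEGER PRIMARY KEY, post_id INTEGER);
    -- Three posts in thread 1; post 1 has two files, post 2 has one, post 3 none.
    INSERT INTO posts VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO files VALUES (10, 1), (11, 1), (12, 2);
""")

# Ungrouped: one row with totals for the whole thread.
totals = conn.execute("""
    SELECT COUNT(DISTINCT p.id), COUNT(DISTINCT f.id)
    FROM posts AS p
    LEFT JOIN files AS f ON p.id = f.post_id
    WHERE p.thread = ?
""", (1,)).fetchone()

# Grouped by post id: one row per post.
per_post = conn.execute("""
    SELECT p.id, COUNT(DISTINCT p.id), COUNT(DISTINCT f.id)
    FROM posts AS p
    LEFT JOIN files AS f ON p.id = f.post_id
    WHERE p.thread = ?
    GROUP BY p.id
    ORDER BY p.id
""", (1,)).fetchall()

print(totals)
print(per_post)
conn.close()
```

With GROUP BY on the post id you get one row per post, so per-thread totals would still need to be summed from those rows.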
I am trying to translate this SQL into a Slick 3.1 style collection query (single call). This SQL (Postgres) returns what I am looking for:
select
minDate.min as lastModified,
(select count("id") from "Items" where "orderId" = 1) as totalItemCount,
(select count("id") from "Items" where "orderId" = 1 and "dateModified" >= minDate.min) as addedCount
from
(select min("dateModified") as "min" from "Items" where "orderId" = 1 and "state" = 'new') as minDate
For a specified set of Items (selected by orderId), it returns:
the date the items were last modified
the total number of items
the number of items added since lastModified
But after many attempts, I can't figure out how to translate this into a single Slick-style query.
This code:
import scala.slick.driver.PostgresDriver
case class Item(id: Int, orderId: Int, state: String, dateModified: Int)
object SlickComplexQuery {
def main(args: Array[String]) = {
val driver = PostgresDriver
import driver.simple._
class ItemsTable(tag: Tag) extends Table[Item](tag, "Items"){
def id = column[Int]("id")
def orderId = column[Int]("orderId")
def state = column[String]("state")
def dateModified = column[Int]("dateModified")
def * = (id, orderId, state, dateModified) <> (Item.tupled, Item.unapply)
}
val items = TableQuery[ItemsTable]
val query1 = items
.filter(i => i.orderId === 1 && i.state === "new")
.map(_.dateModified)
.min
val query2 = items
.filter(_.orderId === 1)
.map(_.id)
.length
val query3 = items
.filter(i => i.orderId === 1 && i.dateModified >= query1)
.map(_.id)
.length
val query = Query(query1, query2, query3)
results in the following query:
select x2.x3, x4.x5, x6.x7
from (select min(x8.x9) as x3
from (select x10."dateModified" as x9
from "Items" x10
where (x10."orderId" = 1) and (x10."state" = 'new')) x8) x2,
(select count(1) as x5
from (select x11."id" as x12
from "Items" x11
where x11."orderId" = 1) x13) x4,
(select count(1) as x7
from (select x14."id" as x15
from "Items" x14, (select min(x16.x17) as x18
from (select x19."dateModified" as x17
from "Items" x19
where (x19."orderId" = 1) and (x19."state" = 'new')) x16) x20
where (x14."orderId" = 1) and (x14."dateModified" >= x20.x18)) x21) x6
This query is much like yours; Slick 2.0 was used.
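If you want to check that a generated query returns the same three values as the hand-written SQL from the question, a quick test harness can help. The sketch below runs the original SQL shape against an in-memory SQLite database from Python; the sample rows are invented, and SQLite stands in for Postgres only for the purpose of this check:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (
        id INTEGER PRIMARY KEY,
        orderId INTEGER,
        state TEXT,
        dateModified INTEGER
    );
    -- Invented sample data: four items in order 1, one in order 2.
    INSERT INTO Items VALUES
        (1, 1, 'done', 5),
        (2, 1, 'new', 10),
        (3, 1, 'new', 20),
        (4, 1, 'done', 15),
        (5, 2, 'new', 1);
""")

# Same structure as the SQL in the question: a derived table for the
# minimum date, plus two correlated scalar subqueries for the counts.
row = conn.execute("""
    SELECT
        minDate.min_modified AS lastModified,
        (SELECT COUNT(id) FROM Items
          WHERE orderId = :order_id) AS totalItemCount,
        (SELECT COUNT(id) FROM Items
          WHERE orderId = :order_id
            AND dateModified >= minDate.min_modified) AS addedCount
    FROM (SELECT MIN(dateModified) AS min_modified
            FROM Items
           WHERE orderId = :order_id AND state = 'new') AS minDate
""", {"order_id": 1}).fetchone()

print(row)
conn.close()
```

Running the Slick-generated SQL against the same data and comparing the rows is a cheap way to confirm the translation is equivalent.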