Spark JDBC read for Greenplum - how to read a large table

The code is as follows:
Dataset<Row> invoke = spark.read().format("jdbc")
        .option("url", url)
        .option("dbtable", query)
        .option("partitionColumn", "id")
        .option("numPartitions", "5")
        .option("lowerBound", Long.MIN_VALUE)
        .option("upperBound", Long.MAX_VALUE)
        .load();
When the Greenplum table is large, an OOM error occurs.
How can I handle the large table?
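A likely cause: with lowerBound = Long.MIN_VALUE and upperBound = Long.MAX_VALUE, Spark computes an enormous partition stride, so in practice almost every row of the real id range lands in a single partition and one executor pulls the whole table. A minimal sketch of the usual remedy, assuming id is a numeric column; "my_table" stands in for the real table name and the partition/fetch numbers are only illustrative:

// Sketch only: first look up the real id range, then use it as the
// partition bounds so rows spread evenly across partitions.
// "my_table" is a placeholder; the numbers below are illustrative.
Row bounds = spark.read().format("jdbc")
        .option("url", url)
        .option("dbtable", "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM my_table) AS b")
        .load()
        .first();

Dataset<Row> invoke = spark.read().format("jdbc")
        .option("url", url)
        .option("dbtable", query)
        .option("partitionColumn", "id")
        .option("lowerBound", bounds.getLong(0))   // real MIN(id) instead of Long.MIN_VALUE
        .option("upperBound", bounds.getLong(1))   // real MAX(id) instead of Long.MAX_VALUE
        .option("numPartitions", "20")             // more, smaller partitions
        .option("fetchsize", "10000")              // stream rows instead of buffering them all
        .load();

The fetchsize option can also matter on its own for Postgres-protocol drivers such as Greenplum's, since without it the driver may buffer each partition's entire result set in memory.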

Related

How to optimize querying multiple unrelated tables in SQLite?

I have a scenario where I have to iterate through multiple tables in a fairly big SQLite database. The tables store information about planet positions in the sky over the years, so e.g. for Mars I have tables Mars_2000, Mars_2001, and so on. The table structure is always the same:
|id:INTEGER|date:TEXT|longitude:REAL|
The thing is that for a certain task I need to iterate through these tables, which costs a lot of time (for more than 10 queries it's painful).
I suppose that if I merge all the per-year tables into one big table, performance might be better, as one query over one big table beats 50 over smaller tables. I wanted to make sure this would work, as the database is humongous (around 20 GB) and reshaping it would take a while.
Is the plan I just described viable? Is there any other solution for such a case?
It might be helpful, so I attach the function that produces my SQL query, which is unique for each table:
pub fn transition_query(
    select_param: &str,              // usually an asterisk
    table_name: &str,                // table I'd like to query
    birth_degree: &f64,              // constant number
    wanted_degree: &f64,             // another constant number
    orb: &f64,                       // another constant number
    upper_date_limit: DateTime<Utc>, // cast to an SQL-like string
    lower_date_limit: DateTime<Utc>, // cast to an SQL-like string
) -> String {
    let parsed_upper_date_limit = CelestialBodyPosition::parse_date(upper_date_limit);
    let parsed_lower_date_limit = CelestialBodyPosition::parse_date(lower_date_limit);
    return format!("
        SELECT *, (SECOND_LAG > 60 OR SECOND_LAG IS NULL) AS TRANSIT_START,
               (SECOND_LEAD > 60 OR SECOND_LEAD IS NULL) AS TRANSIT_END, time FROM (
            SELECT
                *,
                UNIX_TIME - LAG(UNIX_TIME, 1) OVER (ORDER BY time) AS SECOND_LAG,
                LEAD(UNIX_TIME, 1) OVER (ORDER BY time) - UNIX_TIME AS SECOND_LEAD FROM (
                    SELECT {select_param},
                        DATE(time) AS day_scoped_date,
                        CAST(strftime('%s', time) AS INT) AS UNIX_TIME,
                        longitude
                    FROM {table_name}
                    WHERE ((-{orb} <= abs(realModulo(longitude - {birth_degree} - {wanted_degree}, 360))
                            AND abs(realModulo(longitude - {birth_degree} - {wanted_degree}, 360)) <= {orb})
                        OR (-{orb} <= abs(realModulo(longitude - {birth_degree} + {wanted_degree}, 360))
                            AND abs(realModulo(longitude - {birth_degree} + {wanted_degree}, 360)) <= {orb}))
                        AND time < '{parsed_upper_date_limit}' AND time > '{parsed_lower_date_limit}'
            )
        ) WHERE (TRANSIT_START AND NOT TRANSIT_END) OR (TRANSIT_END AND NOT TRANSIT_START);
    ");
}
I solved the issue programmatically. The whole thing was done with Rust and the r2d2_sqlite library. I'm still doing a lot of queries, but now they're done in threads. That allowed me to reduce execution time from 25 s to around 3 s. Here's the code:
use std::sync::mpsc;
use std::thread;
use r2d2_sqlite::SqliteConnectionManager;
use r2d2;

let manager = SqliteConnectionManager::file("db_path");
let pool = r2d2::Pool::builder().build(manager).unwrap();
let mut result: Vec<CelestialBodyPosition> = vec![]; // vector of structs
let (tx, rx) = mpsc::channel(); // allows asynchronous communication
let mut children = vec![]; // vector of join handles (not sure if needed at all)
for query in queries {
    let pool = pool.clone(); // on each iteration I clone the connection pool
    let inner_tx = tx.clone(); // and the sender, as each thread should have a separate one
    children.push(thread::spawn(move || {
        let conn = pool.get().unwrap();
        add_real_modulo_function(&conn); // this adds a custom SQLite function I needed
        let mut sql = conn.prepare(&query).unwrap();
        // run the query and map the result to my internal type
        let positions: Vec<CelestialBodyPosition> = sql
            .query_map(params![], |row| {
                Ok(CelestialBodyPosition::new(row.get(1)?, row.get(2)?))
            })
            .unwrap()
            .map(|position| position.unwrap())
            .collect();
        // send the partial result to the receiver
        inner_tx.send(positions).unwrap();
    }));
}
// the first sender has to be dropped, otherwise the program will wait for its input
drop(tx);
for received in rx {
    result.extend(received); // combine all results
}
return result;
As you can see, no optimization happened on the SQLite side, which kind of makes me feel I'm doing something wrong, but for now it's all right. It might be good to exert some more control over the number of spawned threads.

How to use BigQueryIO.write() to write into a specific partition, not the table itself

I am trying to use BigQueryIO.write() to insert rows one by one into a table name:
val tableName = new SerializableFunction[ValueInSingleWindow[KVOfTableRowAndString], TableDestination] {
  override def apply(input: ValueInSingleWindow[KVOfTableRowAndString]): TableDestination = {
    new TableDestination(input.getValue.getValue, EmptyTableDescription)
  }
}
BigQueryIO
  .write()
  .to(tableName)
  .withFormatFunction(tableRow)
  .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
  .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
  .withExtendedErrorInfo()
but I am getting an error because of this:
Too many partitions produced by query, allowed 4000, query produces at least 10000 partitions
Is there anything I can do to load only some of the partitions while writing to BigQuery, as the table's partitions exceed the limit?
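One approach that may help for ingestion-time (day-)partitioned tables, using the Beam Java API that the Scala snippet above calls into: have the function passed to .to(...) return a table spec with a $YYYYMMDD partition decorator, so each row is appended to exactly one partition. A rough sketch; the table spec "my-project:my_dataset.my_table" and the extractDay helper are hypothetical placeholders, not taken from the question:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Sketch: route each element to a single day-partition via the "$YYYYMMDD"
// decorator. The table spec and extractDay(...) are hypothetical placeholders.
SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> toPartition =
        input -> {
            String day = extractDay(input.getValue()); // e.g. "20190101"
            return new TableDestination(
                    "my-project:my_dataset.my_table$" + day,
                    "partition " + day);
        };

With WRITE_APPEND and CREATE_NEVER as in the question, each write then touches only the decorated partition rather than the table as a whole.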

Many-to-many query joins in Aqueduct

I have an A -> AB <- B many-to-many relationship between two ManagedObjects (A and B), where AB is the junction table.
When querying A from the db, how do I join B values to the AB join objects?
Query<A> query = await Query<A>(context)
  ..join(set: (a) => a.ab);
It gives me a list of A objects containing AB join objects, but the AB objects don't include full B objects, only b.id (not the other fields from class B).
Cheers
When you call join, a new Query<T> is created and returned from that method, where T is the joined type. So if a.ab is of type AB, Query<A>.join returns a Query<AB> (it is linked to the original query internally).
Since you have a new Query<AB>, you can configure it like any other query, including initiating another join, adding sorting descriptors and where clauses.
There are some stylistic syntax choices to be made. You can condense this query into a one-liner:
final query = Query<A>(context)
  ..join(set: (a) => a.ab).join(object: (ab) => ab.b);
final results = await query.fetch();
This is OK if the query remains as-is, but as you add more criteria to a query, the difference between the dot operator and the cascade operator becomes harder to track. I often pull the join query into its own variable. (Note that you don't call any execution methods on the join query):
final query = Query<A>(context);
final join = query.join(set: (a) => a.ab)
  ..join(object: (ab) => ab.b);
final results = await query.fetch();

Has anyone noticed that EF Core 1.0 2015.1 has made queries very inefficient?

After upgrading to ASP.NET Core 2015.1 I have noticed that a lot of EF queries have become much slower to run.
I have done some investigation and found that many of the queries with where filters now get evaluated in code, rather than passing the filters to SQL as part of a WHERE clause to run with the query.
We ended up having to rewrite a number of our queries as stored procedures to get the performance back. Note that these used to be efficient prior to the 2015.1 release. Something obviously changed, and a lot of queries now select everything from a table and then filter the data in code. This approach is terrible for performance, e.g. reading a table with lots of rows only to discard all but maybe 2 of them.
I have to ask what changed, and whether anyone else is seeing the same thing?
For example: I have a ForeignExchange table along with a ForeignExchangeRate table which are linked via ForeignExchangeid = ForeignExchangeRate.ForeignExchangeId
await _context.ForeignExchanges
    .Include(x => x.ForeignExchangeRates)
    .Select(x => new ForeignExchangeViewModel
    {
        Id = x.Id,
        Code = x.Code,
        Name = x.Name,
        Symbol = x.Symbol,
        CountryId = x.CountryId,
        CurrentExchangeRate = x.ForeignExchangeRates
            .FirstOrDefault(y => (DateTime.Today >= y.ValidFrom)
                && (y.ValidTo == null || y.ValidTo >= DateTime.Today)).ExchangeRate.ToFxRate(),
        HistoricalExchangeRates = x.ForeignExchangeRates
            .OrderByDescending(y => y.ValidFrom)
            .Select(y => new FxRate
            {
                ValidFrom = y.ValidFrom,
                ValidTo = y.ValidTo,
                ExchangeRate = y.ExchangeRate.ToFxRate(),
            }).ToList()
    })
    .FirstOrDefaultAsync(x => x.Id == id);
And I use this to get the data for editing a foreign exchange rate
So the generated SQL is not as expected. It produces the following two SQL statements to get the data:
SELECT TOP(1) [x].[ForeignExchangeId], [x].[ForeignCurrencyCode], [x].[CurrencyName], [x].[CurrencySymbol], [x].[CountryId], (
    SELECT TOP(1) [y].[ExchangeRate]
    FROM [ForeignExchangeRate] AS [y]
    WHERE ((@__Today_0 >= [y].[ValidFrom]) AND ([y].[ValidTo] IS NULL OR ([y].[ValidTo] >= @__Today_1))) AND ([x].[ForeignExchangeId] = [y].[ForeignExchangeId])
)
FROM [ForeignExchange] AS [x]
WHERE [x].[ForeignExchangeId] = @__id_2
and
SELECT [y0].[ForeignExchangeId], [y0].[ValidFrom], [y0].[ValidTo], [y0].[ExchangeRate]
FROM [ForeignExchangeRate] AS [y0]
ORDER BY [y0].[ValidFrom] DESC
The second query is the one that causes the slowness. If the table has many rows, it essentially fetches the whole table and filters the data in code.
This changed in the latest release, as this used to work in the RC versions of EF.
One other query I used to have was the following
return await _context.CatchPlans
    .Where(x => x.FishReceiverId == fishReceiverId
        && x.FisherId == fisherId
        && x.StockId == stockId
        && x.SeasonStartDate == seasonStartDate
        && x.EffectiveDate >= asAtDate
        && x.BudgetType < BudgetType.NonQuotaed)
    .OrderBy(x => x.Priority)
    .ThenBy(x => x.BudgetType)
    .ToListAsync();
and this query ended up doing a table read (the entire table, which was in the tens of thousands of rows) to get a filtered subset of between 2 and 10 records. Very inefficient. This was one query I had to replace with a stored procedure, which reduced it from approximately 1.5-3.0 seconds down to milliseconds. And note this used to run efficiently before the upgrade.
This is a known issue in EF Core 1.0. The workaround right now is to convert all your critical queries to synchronous ones; the problem is with async queries. They'll sort this issue out in the EF Core 1.1.0 release, but it has not been released yet.
Here is the test done by an EF Core dev team member. You can find more info here: EF Core 1.0 RC2: async queries so much slower than sync
Another suggestion I would like to make: try your queries with .AsNoTracking(). That too will improve query performance.
.AsNoTracking()
Sometimes you may want to get entities back from a query but not have those entities be tracked by the context. This may result in better performance when querying for large numbers of entities in read-only scenarios.

Perform query and record count using CriteriaBuilder

I am populating a PrimeFaces datatable lazily with a complex custom dynamic query built using CriteriaBuilder, performing an SQL query on a database.
For that I need to perform a record count using this query, and I also need to run the query itself to get the records in the specified interval for the datatable.
So I thought I could do something like this:
CriteriaBuilder criteriaBuilder = this.getEntityManager().getCriteriaBuilder();
CriteriaQuery<Object> criteriaQuery = criteriaBuilder.createQuery();
Root<RSip> from = criteriaQuery.from(RSip.class);
Path<Long> eSipss = from.join("idESip").get("ss");
CriteriaQuery<Object> select = criteriaQuery.select(from);
List<Predicate> predicates = new ArrayList<>();
// many predicates added according to the user's input
predicates.add(...);
predicates.add(...);
predicates.add(...);
// the query is now ready
select.where(predicates.toArray(new Predicate[predicates.size()]));
// but I also need to perform the record count using SQL COUNT, as the dataset returned can be very large
CriteriaQuery<Object> selectCount = criteriaQuery.select(criteriaBuilder.count(fromCount));
// and then perform both selects like this:
// record count:
TypedQuery<Object> typedQueryCount = this.getEntityManager().createQuery(selectCount);
List<Object> recordCount = typedQueryCount.getResultList();
// query:
TypedQuery<Object> typedQuery = this.getEntityManager().createQuery(select);
List<Object> records = typedQuery.getResultList();
The problem is that the second query also returns the record count, not the actual records... What can I be doing wrong?
If there is some other way to do this, I'm happy to read your answer!
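For what it's worth, the likely culprit in the snippet above is that select and selectCount both point at the same underlying criteriaQuery: CriteriaQuery.select(...) mutates the query in place and returns it, so the second call replaces the row selection with count(...) before either query executes. A minimal sketch of the usual pattern, built on the question's RSip entity, using two independent CriteriaQuery objects. Predicate construction is elided; predicates have to be rebuilt against each query's own Root, and first/pageSize are hypothetical datatable paging values:

CriteriaBuilder cb = this.getEntityManager().getCriteriaBuilder();

// Query 1: the page of actual rows for the datatable.
CriteriaQuery<RSip> rowsQuery = cb.createQuery(RSip.class);
Root<RSip> rowsRoot = rowsQuery.from(RSip.class);
rowsQuery.select(rowsRoot).where(/* predicates built on rowsRoot */);
List<RSip> records = this.getEntityManager().createQuery(rowsQuery)
        .setFirstResult(first)    // hypothetical page offset from the datatable
        .setMaxResults(pageSize)  // hypothetical page size from the datatable
        .getResultList();

// Query 2: a separate CriteriaQuery (with its own Root) for the count.
CriteriaQuery<Long> countQuery = cb.createQuery(Long.class);
Root<RSip> countRoot = countQuery.from(RSip.class);
countQuery.select(cb.count(countRoot)).where(/* same predicates rebuilt on countRoot */);
Long recordCount = this.getEntityManager().createQuery(countQuery).getSingleResult();

Because each CriteriaQuery has its own Root, the Predicate objects cannot be shared between the two; a small helper that builds the predicate list for a given Root keeps the two queries in sync.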