Apache Spark Performance Issue - sql

We thought of using Apache Spark to match records faster, but we are finding it far less efficient than SQL matching with a SELECT statement.
Using,
JavaSparkContext javaSparkContext = new JavaSparkContext(new SparkConf().setAppName("AIRecordLinkage").setMaster("local[*]"));
Dataset<Row> sourceFileContent = spark.read().jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);
We are able to import around 1.8 million records into the Spark environment, which are stored in a Dataset object.
Now, using the filter function:
targetFileContent.filter(col("TARGETUPC").equalTo(upcValue))
The above filter statement is in a loop where upcValue gets updated for approximately 46k IDs.
This program runs for several hours, but when we tried the same matching with a SQL IN operator containing all 46k UPC IDs, it executed in less than a minute.
Configuration:
Spark-sql 2.11
Spark-core 2.11
JDK 8
Windows 10, single node, 4 cores, 3 GHz, 16 GB RAM.
C drive -> 12 GB free space.
Eclipse -> Run configuration -> -Xms15000m.
Kindly help us analyze and understand whether there are any mistakes, and suggest what needs to be done to improve the performance.
@Component("upcExactMatch")
public class UPCExactMatch {
@Autowired
private Environment environment;
@Autowired
private LoadCSV loadCSV;
@Autowired
private SQLHandler sqlHandler;
public ArrayList<Row> perform(){
ArrayList<Row> upcNonMatchedItemIDs=new ArrayList<Row>();
ArrayList<Row> upcMatchedItemIDs=new ArrayList<Row>();
JavaSparkContext javaSparkContext = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(javaSparkContext);
SparkSession sparkSession = SparkSession.builder().appName("JavaStopWordshandlerTest").getOrCreate();
try{
Dataset<Row> sourceFileContent =loadCSV.load(sourceFileName,sourceFileLocation,javaSparkContext,sqlContext);
// load target from database
Dataset<Row> targetFileContent = sparkSession.read().jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);
System.out.println("File counts :"+sourceFileContent.count()+" : "+targetFileContent.count());
ArrayList<String> upcMatched = new ArrayList<String>();
ArrayList<String> partNumberMatched = new ArrayList<String>();
List<Row> sourceFileContents = sourceFileContent.collectAsList();
int upcColumnIndex=-1;
int itemIDColumnIndex=-1;
int partNumberTargetIndex=-1;
String upcValue="";
StructType schema = targetFileContent.schema();
List<Row> data = Arrays.asList();
Dataset<Row> upcMatchedRows = sparkSession.createDataFrame(data, schema);
for(Row rowSourceFileContent: sourceFileContents){
upcColumnIndex=rowSourceFileContent.fieldIndex("Vendor UPC");
if(!rowSourceFileContent.isNullAt(upcColumnIndex)){
upcValue=rowSourceFileContent.get(upcColumnIndex).toString();
upcMatchedRows=targetFileContent.filter(col("TARGETUPC").equalTo(upcValue));
if(upcMatchedRows.count() > 0){
for(Row upcMatchedRow: upcMatchedRows.collectAsList()){
partNumberTargetIndex=upcMatchedRow.fieldIndex("PART_NUMBER");
if(partNumberTargetIndex != -1){
upcMatched.add(upcValue);
partNumberMatched.add(upcMatchedRow.get(partNumberTargetIndex).toString());
System.out.println("Source UPC : "+upcValue +"\tTarget part number :"+ upcMatchedRow.get(partNumberTargetIndex));
}
}
}
}
}
for(int i=0;i<upcMatched.size();i++){
System.out.println("Matched Exact UPC ids are :"+upcMatched.get(i) + "\t:Target\t"+partNumberMatched.get(i));
}
}catch(Exception e){
e.printStackTrace();
}finally{
sparkSession.stop();
sqlContext.clearCache();
javaSparkContext.close();
}
return upcMatchedItemIDs;
}
}

Try doing an inner join between the two DataFrames to get the matched records.
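For illustration, a minimal sketch of that join (untested; the column names "Vendor UPC", "TARGETUPC" and "PART_NUMBER" and the two DataFrame variables are taken from the question, so adjust as needed):
// One join replaces the ~46k filter/count/collect round trips
Dataset<Row> matched = sourceFileContent
    .filter(sourceFileContent.col("Vendor UPC").isNotNull())
    .join(targetFileContent,
        sourceFileContent.col("Vendor UPC").equalTo(targetFileContent.col("TARGETUPC")),
        "inner")
    .select(sourceFileContent.col("Vendor UPC"), targetFileContent.col("PART_NUMBER"));
matched.show();
This lets Spark plan a single job (often a broadcast join, since the source side is only ~46k rows) instead of launching a separate job, and re-reading the JDBC table, for every UPC value. Collecting the 46k UPCs and using col("TARGETUPC").isin(...) would also work, but the join keeps everything distributed.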

Related

Order of the iterations of entries in an Ignite cache and seek method

What is the ordering of the keys in an Ignite cache (without using indexing) and is it possible to do the equivalent of the following RocksDB snippet
try (final RocksIterator rocksIterator =
rocksDB.newIterator(columnFamilyHandleList.get(1))) {
for (rocksIterator.seek(prefixKey);
i.e. jump to the next entry starting with a given byte[] or String?
The way you'd do that in Ignite is by using SQL.
var query = new SqlFieldsQuery("select x,y,z from table where z like ? order by x").setArgs("prefix%");
try (var cursor = cache.query(query)) {
for (var r : cursor) {
Long id = (Long) r.get(0);
BigDecimal value = (BigDecimal) r.get(1);
String name = (String) r.get(2);
}
}

Hibernate Search manual indexing throw a "org.hibernate.TransientObjectException: The instance was not associated with this session"

I use Hibernate Search 5.11 in my Spring Boot 2 application to provide full-text search.
This library requires documents to be indexed.
When my app is launched, I try to manually re-index the data of an indexed entity (MyEntity.class) every five minutes (for a specific reason, due to my server context).
MyEntity.class has a property attachedFiles, which is a HashSet filled via a @OneToMany() join, with lazy loading enabled:
@OneToMany(mappedBy = "myEntity", cascade = CascadeType.ALL, orphanRemoval = true)
private Set<AttachedFile> attachedFiles = new HashSet<>();
I coded the required indexing process, but an exception is thrown on "fullTextSession.index(result);" when the attachedFiles property of a given entity contains one or more items:
org.hibernate.TransientObjectException: The instance was not associated with this session
In debug mode, a message like "Unable to load [...]" appears for the entity's HashSet value in this case.
If the HashSet is empty (not null, just empty), no exception is thrown.
My indexing method:
private void indexDocumentsByEntityIds(List<Long> ids) {
final int BATCH_SIZE = 128;
Session session = entityManager.unwrap(Session.class);
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
CriteriaBuilder builder = session.getCriteriaBuilder();
CriteriaQuery<MyEntity> criteria = builder.createQuery(MyEntity.class);
Root<MyEntity> root = criteria.from(MyEntity.class);
criteria.select(root).where(root.get("id").in(ids));
TypedQuery<MyEntity> query = fullTextSession.createQuery(criteria);
List<MyEntity> results = query.getResultList();
int index = 0;
for (MyEntity result : results) {
index++;
try {
fullTextSession.index(result); //index each element
if (index % BATCH_SIZE == 0 || index == ids.size()) {
fullTextSession.flushToIndexes(); //apply changes to indexes
fullTextSession.clear(); //free memory since the queue is processed
}
} catch (TransientObjectException toEx) {
LOGGER.info(toEx.getMessage());
throw toEx;
}
}
}
Does someone have an idea?
Thanks!
This is probably caused by the "clear" call you have in your loop.
In essence, what you're doing is:
load all entities to reindex into the session
index one batch of entities
remove all entities from the session (fullTextSession.clear())
try to index the next batch of entities, even though they are not in the session anymore... ?
What you need to do is to only load each batch of entities after the session clearing, so that you're sure they are still in the session when you index them.
There's an example of how to do this in the documentation, using a scroll and an appropriate batch size: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#search-batchindex-flushtoindexes
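In rough terms, that scroll-based approach looks like this (a sketch adapted to the MyEntity name used in the question, assuming the same entityManager field as above; not a verbatim copy of the documentation):
private void reindexAllWithScroll() {
    final int BATCH_SIZE = 128;
    Session session = entityManager.unwrap(Session.class);
    FullTextSession fullTextSession = Search.getFullTextSession(session);
    fullTextSession.setFlushMode(FlushMode.MANUAL);
    fullTextSession.setCacheMode(CacheMode.IGNORE);
    Transaction transaction = fullTextSession.beginTransaction();
    // ScrollableResults avoids loading every entity into memory at once
    ScrollableResults results = fullTextSession.createCriteria(MyEntity.class)
        .setFetchSize(BATCH_SIZE)
        .scroll(ScrollMode.FORWARD_ONLY);
    int index = 0;
    while (results.next()) {
        index++;
        fullTextSession.index(results.get(0)); // index each loaded element
        if (index % BATCH_SIZE == 0) {
            fullTextSession.flushToIndexes(); // apply changes to the indexes
            fullTextSession.clear(); // free memory once the batch is processed
        }
    }
    fullTextSession.flushToIndexes(); // flush the last partial batch
    transaction.commit();
}
Each batch is loaded, indexed, flushed and cleared before the next one is fetched, so nothing is indexed after being detached from the session.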
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
Thanks for the explanations @yrodiere, they helped me a lot!
I chose your alternative solution:
Alternatively, you can just split your ID list in smaller lists of 128 elements, and for each of these lists, run a query to get the corresponding entities, reindex all these 128 entities, then flush and clear.
...and everything works perfectly!
Nicely spotted!
See the code solution below:
private List<List<Object>> splitList(List<Object> list, int subListSize) {
List<List<Object>> splittedList = new ArrayList<>();
if (!CollectionUtils.isEmpty(list)) {
int i = 0;
int nbItems = list.size();
while (i < nbItems) {
int maxLastSubListIndex = i + subListSize;
int lastSubListIndex = (maxLastSubListIndex > nbItems) ? nbItems : maxLastSubListIndex;
List<Object> subList = list.subList(i, lastSubListIndex);
splittedList.add(subList);
i = lastSubListIndex;
}
}
return splittedList;
}
private void indexDocumentsByEntityIds(Class<Object> clazz, String entityIdPropertyName, List<Object> ids) {
Session session = entityManager.unwrap(Session.class);
List<List<Object>> splittedIdsLists = splitList(ids, 128);
for (List<Object> splittedIds : splittedIdsLists) {
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
Transaction transaction = fullTextSession.beginTransaction();
CriteriaBuilder builder = session.getCriteriaBuilder();
CriteriaQuery<Object> criteria = builder.createQuery(clazz);
Root<Object> root = criteria.from(clazz);
criteria.select(root).where(root.get(entityIdPropertyName).in(splittedIds));
TypedQuery<Object> query = fullTextSession.createQuery(criteria);
List<Object> results = query.getResultList();
int index = 0;
for (Object result : results) {
index++;
try {
fullTextSession.index(result); //index each element
if (index == splittedIds.size()) {
fullTextSession.flushToIndexes(); //apply changes to indexes
fullTextSession.clear(); //free memory since the queue is processed
}
} catch (TransientObjectException toEx) {
LOGGER.info(toEx.getMessage());
throw toEx;
}
}
transaction.commit();
}
}

Kafka Spark streaming HBase insert issues

I'm using Kafka to send a file with 3 columns, and Spark Streaming 1.3 to insert it into HBase.
This is what my HBase table looks like:
ROW COLUMN+CELL
zone:bizert column=travail:call, timestamp=1491836364921, value=contact:numero
zone:jendouba column=travail:Big data, timestamp=1491835836290, value=contact:email
zone:tunis column=travail:info, timestamp=1491835897342, value=contact:num
3 row(s) in 0.4200 seconds
And this is how I read the data with Spark Streaming (I'm using spark-shell):
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
val ssc = new StreamingContext(sc, Seconds(10))
val topicSet = Set ("zed")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "xx.xx.xxx.xx:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val lines = stream.map(_._2) // keep only the message values
lines.foreachRDD(rdd => { (!rdd.partitions.isEmpty)
lines.saveAsTextFiles("hdfs://xxxxx:8020/user/admin/zed/steams3/")
})
This code works when saving data into HDFS, although it also saves a lot of empty data to HDFS.
Before writing this question I searched here and through some other questions like mine, but I didn't find a good solution.
Could you propose the best way to do this?
This is how my code looks now:
val sc = new SparkContext("local", "Hbase spark")
val tableName = "notz"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
lines.foreachRDD(rdd => { (!rdd.partitions.isEmpty)
if(!admin.isTableAvailable(tableName)) {
print("Creating GHbase Table")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("zone"
.getBytes()))
admin.createTable(tableDesc)
}else{
print("Table already exists!!")
}
val myTable = new HTable(conf, tableName)
// i'm blocked here
})

Spark Streaming + Spark SQL

I am trying to process logs via Spark Streaming and Spark SQL. The main idea is to keep a "compacted" dataset in Parquet format for "old" data, converted to a DataFrame as needed for queries; the compacted dataset is loaded with:
SQLContext sqlContext = JavaSQLContextSingleton.getInstance(sc.sc());
DataFrame compact = null;
compact = sqlContext.parquetFile("hdfs://auto-ha/tmp/data/logs");
As the uncompacted dataset (I compact the dataset daily) is composed of many files, I would like to keep the current day's data within a DStream so that those queries run fast.
I have tried the DataFrame approach without results:
DataFrame df = JavaSQLContextSingleton.getInstance(sc.sc()).createDataFrame(lastData, schema);
df.registerTempTable("lastData");
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
@Override
public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
DataFrame df = JavaSQLContextSingleton.getInstance(v1.context()).createDataFrame(v1, schema);
......drop old data from lastData table
df.insertInto("lastData");
}
});
Using this approach I do not get any results if I query the temp table in a different thread for example.
I have also tried to use the RDD transform method; more specifically, I tried to follow the Spark example where I create an empty RDD and then union the DStream RDD contents with the empty RDD:
JavaRDD<Row> lastData = sc.emptyRDD();
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
@Override
public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
lastData.union(v1).filter(let only recent data....);
}
});
This approach does not work either, as I do not get any contents in lastData.
Could I use windowed computations or updateStateByKey for this purpose?
Any suggestions?
Thanks for your help!
Well I finally got it.
I use the updateStateByKey function and return an absent state if the timestamp is older than 24 hours, like this:
final static Function2<List<Long>, Optional<Long>, Optional<Long>> RETAIN_RECENT_DATA
= (List<Long> values, Optional<Long> state) -> {
Long newSum = state.or(0L);
for (Long value : values) {
newSum += value;
}
//current milis uses UTC
if (System.currentTimeMillis() - newSum > 86400000L) {
return Optional.absent();
} else {
return Optional.of(newSum);
}
};
Then on each batch I register the DataFrame as temp table:
finalsum.foreachRDD((JavaRDD<Row> rdd, Time time) -> {
if (!rdd.isEmpty()) {
HiveContext sqlContext1 = JavaSQLContextSingleton.getInstance(rdd.context());
if (sqlContext1.cacheManager().isCached("alarm_recent")) {
sqlContext1.uncacheTable("alarm_recent");
}
DataFrame wordsDataFrame = sqlContext1.createDataFrame(rdd, schema);
wordsDataFrame.registerTempTable("alarm_recent");
wordsDataFrame.cache();//
wordsDataFrame.first();
}
return null;
});
You could use mapWithState with Spark 1.6.
The mapWithState function is much more efficient and easier to implement.
Take a look at this link.
mapWithState supports useful functionality such as state timeouts and an initialState RDD, which come in handy when maintaining a stateful DStream.
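For example, a rough, untested sketch of the 24-hour retention above with mapWithState (it assumes a JavaPairDStream<String, Long> of keyed event timestamps, here called keyedTimestamps, and the same Guava Optional as in the updateStateByKey snippet above):
// Keep, per key, the most recent timestamp; Spark drops idle keys after the timeout
Function3<String, Optional<Long>, State<Long>, Tuple2<String, Long>> retainRecent =
    (String key, Optional<Long> newTimestamp, State<Long> state) -> {
        long latest = Math.max(newTimestamp.or(0L), state.exists() ? state.get() : 0L);
        if (!state.isTimingOut()) {
            state.update(latest); // updating a key that is timing out would throw
        }
        return new Tuple2<>(key, latest);
    };
JavaMapWithStateDStream<String, Long, Long, Tuple2<String, Long>> recent =
    keyedTimestamps.mapWithState(
        StateSpec.function(retainRecent)
            .timeout(Durations.minutes(24 * 60))); // state removed after 24h without new data
recent.stateSnapshots() then gives a JavaPairDStream with only the keys still inside the 24-hour window, which can be turned into a DataFrame and registered as a temp table in the same way as above.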
Thanks
Manas

Linq2SQL vs NHibernate performance (have I gone mad?)

I have written the following tests to compare the performance of Linq2SQL and NHibernate, and I find the results to be somewhat strange. The mappings are straightforward and identical for both. Both are running against a live DB. Although I'm not deleting Campaigns in the Linq case, that shouldn't affect performance by more than 10 ms.
Linq:
[Test]
public void Test1000ReadsWritesToAgentStateLinqPrecompiled()
{
Stopwatch sw = new Stopwatch();
Stopwatch swIn = new Stopwatch();
sw.Start();
for (int i = 0; i < 1000; i++)
{
swIn.Reset();
swIn.Start();
ReadWriteAndDeleteAgentStateWithLinqPrecompiled();
swIn.Stop();
Console.WriteLine("Run ReadWriteAndDeleteAgentState: " + swIn.ElapsedMilliseconds + " ms");
}
sw.Stop();
Console.WriteLine("Total Time: " + sw.ElapsedMilliseconds + " ms");
Console.WriteLine("Average time to execute queries: " + sw.ElapsedMilliseconds / 1000 + " ms");
}
private static readonly Func<AgentDesktop3DataContext, int, EntityModel.CampaignDetail>
GetCampaignById =
CompiledQuery.Compile<AgentDesktop3DataContext, int, EntityModel.CampaignDetail>(
(ctx, sessionId) => (from cd in ctx.CampaignDetails
join a in ctx.AgentCampaigns on cd.CampaignDetailId equals a.CampaignDetailId
where a.AgentStateId == sessionId
select cd).FirstOrDefault());
private void ReadWriteAndDeleteAgentStateWithLinqPrecompiled()
{
int id = 0;
using (var ctx = new AgentDesktop3DataContext())
{
EntityModel.AgentState agentState = new EntityModel.AgentState();
var campaign = new EntityModel.CampaignDetail { CampaignName = "Test" };
var campaignDisposition = new EntityModel.CampaignDisposition { Code = "123" };
campaignDisposition.Description = "abc";
campaign.CampaignDispositions.Add(campaignDisposition);
agentState.CallState = 3;
campaign.AgentCampaigns.Add(new AgentCampaign
{
AgentState = agentState
});
ctx.CampaignDetails.InsertOnSubmit(campaign);
ctx.AgentStates.InsertOnSubmit(agentState);
ctx.SubmitChanges();
id = agentState.AgentStateId;
}
using (var ctx = new AgentDesktop3DataContext())
{
var dbAgentState = ctx.GetAgentStateById(id);
Assert.IsNotNull(dbAgentState);
Assert.AreEqual(dbAgentState.CallState, 3);
var campaignDetails = GetCampaignById(ctx, id);
Assert.AreEqual(campaignDetails.CampaignDispositions[0].Description, "abc");
}
using (var ctx = new AgentDesktop3DataContext())
{
ctx.DeleteSessionById(id);
}
}
NHibernate (the loop is the same):
private void ReadWriteAndDeleteAgentState()
{
var id = WriteAgentState().Id;
StartNewTransaction();
var dbAgentState = agentStateRepository.Get(id);
Assert.IsNotNull(dbAgentState);
Assert.AreEqual(dbAgentState.CallState, 3);
Assert.AreEqual(dbAgentState.Campaigns[0].Dispositions[0].Description, "abc");
var campaignId = dbAgentState.Campaigns[0].Id;
agentStateRepository.Delete(dbAgentState);
NHibernateSession.Current.Transaction.Commit();
Cleanup(campaignId);
NHibernateSession.Current.BeginTransaction();
}
Results:
NHibernate:
Total Time: 9469 ms
Average time to execute 13 queries: 9 ms
Linq:
Total Time: 127200 ms
Average time to execute 13 queries: 127 ms
Linq lost by a factor of 13.5! Even with precompiled queries (both read queries are precompiled).
This can't be right. Although I expected NHibernate to be faster, this is just too big a difference, considering the mappings are identical and NHibernate actually executes more queries against the DB.
Update: I have refactored a project to use NHibernate instead of Linq2Sql, and the performance gain seems to be a lot less (about 20-30%) than in the test working on the same mappings. Does anyone have some real world examples of their own?
Run a profiler, both on the .NET code and on the SQL Server database. Also, identify what SQL statements are being run under the covers for both scenarios. Where is the time being lost for LinqToSql? If the underlying SQL statements are different, why? It's very likely you can tweak both ORMs to be faster. They should be in roughly the same ballpark performance-wise for simple tests. This feels like a configuration problem.