Why is it slow to apply a Spark pipeline to dataset with many columns but few rows? - apache-spark-ml

I am testing spark pipelines using a simple dataset with 312 (mostly numeric) columns, but only 421 rows.
It is small, but it takes 3 minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. This seems much too long for such a tiny dataset.
Similar pipelines run quickly on datasets that have fewer columns and more rows. It's something about the number of columns that is causing the slow performance.
Here is a list of the stages in my pipeline:
000_strIdx_5708525b2b6c 048_bucketizer_888b0055c1ad 096_bucketizer_e677659ca253
001_strIdx_ec2296082913 049_bucketizer_974e0a1433a6 097_bucketizer_396e35548c72
002_bucketizer_3cbc8811877b 050_bucketizer_e848c0937cb9 098_bucketizer_78a6410d7a84
003_bucketizer_5a01d5d78436 051_bucketizer_95611095a4ac 099_bucketizer_e3ae6e54bca1
004_bucketizer_bf290d11364d 052_bucketizer_660a6031acd9 100_bucketizer_9fed5923fe8a
005_bucketizer_c3296dfe94b2 053_bucketizer_aaffe5a3140d 101_bucketizer_8925ba4c3ee2
006_bucketizer_7071ca50eb85 054_bucketizer_8dc569be285f 102_bucketizer_95750b6942b8
007_bucketizer_27738213c2a1 055_bucketizer_83d1bffa07bc 103_bucketizer_6e8b50a1918b
008_bucketizer_bd728fd89ba1 056_bucketizer_0c6180ba75e6 104_bucketizer_36cfcc13d4ba
009_bucketizer_e1e716f51796 057_bucketizer_452f265a000d 105_bucketizer_2716d0455512
010_bucketizer_38be665993ba 058_bucketizer_38e02ddfb447 106_bucketizer_9bcf2891652f
011_bucketizer_5a0e41e5e94f 059_bucketizer_6fa4ad5d3ebd 107_bucketizer_8c3d352915f7
012_bucketizer_b5a3d5743aaa 060_bucketizer_91044ee766ce 108_bucketizer_0786c17d5ef9
013_bucketizer_4420f98ff7ff 061_bucketizer_9a9ef04a173d 109_bucketizer_f22df23ef56f
014_bucketizer_777cc4fe6d12 062_bucketizer_3d98eb15f206 110_bucketizer_bad04578bd20
015_bucketizer_f0f3a3e5530e 063_bucketizer_c4915bb4d4ed 111_bucketizer_35cfbde7e28f
016_bucketizer_218ecca3b5c1 064_bucketizer_8ca2b6550c38 112_bucketizer_cf89177a528b
017_bucketizer_0b083439a192 065_bucketizer_417ee9b760bc 113_bucketizer_183a0d393ef0
018_bucketizer_4520203aec27 066_bucketizer_67f3556bebe8 114_bucketizer_467c78156a67
019_bucketizer_462c2c346079 067_bucketizer_0556deb652c6 115_bucketizer_380345e651ab
020_bucketizer_47435822e04c 068_bucketizer_067b4b3d234c 116_bucketizer_0f39f6de1625
021_bucketizer_eb9dccb5e6e8 069_bucketizer_30ba55321538 117_bucketizer_d8500b2c0c2f
022_bucketizer_b5f63dd7451d 070_bucketizer_ad826cc5d746 118_bucketizer_dc5f1fd09ff1
023_bucketizer_e0fd5041c841 071_bucketizer_77676a898055 119_bucketizer_eeaf9e6cdaef
024_bucketizer_ffb3b9737100 072_bucketizer_05c37a38ce30 120_bucketizer_5614cd4533d7
025_bucketizer_e06c0d29273c 073_bucketizer_6d9ae54163ed 121_bucketizer_2f1230e2871e
026_bucketizer_36ee535a425f 074_bucketizer_8cd668b2855d 122_bucketizer_f8bf9d47e57e
027_bucketizer_ee3a330269f1 075_bucketizer_d50ea1732021 123_bucketizer_2df774393575
028_bucketizer_094b58ea01c0 076_bucketizer_c68f467c9559 124_bucketizer_259320b7fc86
029_bucketizer_e93ea86c08e2 077_bucketizer_ee1dfc840db1 125_bucketizer_e334afc63030
030_bucketizer_4728a718bc4b 078_bucketizer_83ec06a32519 126_bucketizer_f17d4d6b4d94
031_bucketizer_08f6189c7fcc 079_bucketizer_741d08c1b69e 127_bucketizer_da7834230ecd
032_bucketizer_11feb74901e6 080_bucketizer_b7402e4829c7 128_bucketizer_8dbb503f658e
033_bucketizer_ab4add4966c7 081_bucketizer_8adc590dc447 129_bucketizer_e09e2eb2b181
034_bucketizer_4474f7f1b8ce 082_bucketizer_673be99bdace 130_bucketizer_faa04fa16f3c
035_bucketizer_90cfa5918d71 083_bucketizer_77693b45f94c 131_bucketizer_d0bd348a5613
036_bucketizer_1a9ff5e4eccb 084_bucketizer_53529c6b1ac4 132_bucketizer_de6da796e294
037_bucketizer_38085415a4f4 085_bucketizer_6a3ca776a81e 133_bucketizer_0395526346ce
038_bucketizer_9b5e5a8d12eb 086_bucketizer_6679d9588ac1 134_bucketizer_ea3b5eb6058f
039_bucketizer_082bb650ecc3 087_bucketizer_6c73af456f65 135_bucketizer_ad83472038f7
040_bucketizer_57e1e363c483 088_bucketizer_2291b2c5ab51 136_bucketizer_4a17c440fd16
041_bucketizer_337583fbfd65 089_bucketizer_cb3d0fe669d8 137_bucketizer_d468637d4b86
042_bucketizer_73e8f6673262 090_bucketizer_e71f913c1512 138_bucketizer_4fc473a72f1d
043_bucketizer_0f9394ed30b8 091_bucketizer_156528f65ce7 139_vecAssembler_bd87cd105650
044_bucketizer_8530f3570019 092_bucketizer_f3ec5dae079b 140_nb_f134e0890a0d
045_bucketizer_c53614f1e507 093_bucketizer_809fab77eee1 141_sql_a8590b83c826
046_bucketizer_8fd99e6ec27b 094_bucketizer_6925831511e6
047_bucketizer_6a8610496d8a 095_bucketizer_c5d853b95707
There are 2 string columns that are converted to ints with StringIndexerModel.
Then there are bucketizers that bin all the numeric columns into 2 or 3 bins each. Is there a way to bin many columns at once with a single stage? I did not see one.
Next there is a VectorAssembler to combine all the columns into one vector column for the NaiveBayes classifier.
Lastly there is a simple SQLTransformer to cast the prediction column to an int.
Here is what the metadata for the two StringIndexers looks like:
{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1492551461778,"sparkVersion":"2.1.1","uid":"strIdx_5708525b2b6c","paramMap":{"outputCol":"ADI_IDX__","handleInvalid":"skip","inputCol":"ADI_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1492551462004,"sparkVersion":"2.1.1","uid":"strIdx_ec2296082913","paramMap":{"outputCol":"State_IDX__","inputCol":"State_CLEANED__","handleInvalid":"skip"}}
The bucketizers all look very similar. Here is what the metadata for a few of them looks like:
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462636,"sparkVersion":"2.1.1","uid":"bucketizer_bd728fd89ba1","paramMap":{"outputCol":"HH_02_BINNED__","inputCol":"HH_02_CLEANED__","handleInvalid":"keep","splits":["-Inf",7521.0,12809.5,20299.0,"Inf"]}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462711,"sparkVersion":"2.1.1","uid":"bucketizer_e1e716f51796","paramMap":{"splits":["-Inf",6698.0,13690.5,"Inf"],"handleInvalid":"keep","outputCol":"HH_97_BINNED__","inputCol":"HH_97_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462784,"sparkVersion":"2.1.1","uid":"bucketizer_38be665993ba","paramMap":{"splits":["-Inf",4664.0,7242.5,11770.0,14947.0,"Inf"],"outputCol":"HH_90_BINNED__","handleInvalid":"keep","inputCol":"HH_90_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462858,"sparkVersion":"2.1.1","uid":"bucketizer_5a0e41e5e94f","paramMap":{"splits":["-Inf",6107.5,10728.5,"Inf"],"outputCol":"HH_80_BINNED__","inputCol":"HH_80_CLEANED__","handleInvalid":"keep"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551462931,"sparkVersion":"2.1.1","uid":"bucketizer_b5a3d5743aaa","paramMap":{"outputCol":"HHPG9702_BINNED__","splits":["-Inf",8.895000457763672,"Inf"],"handleInvalid":"keep","inputCol":"HHPG9702_CLEANED__"}}
{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1492551463004,"sparkVersion":"2.1.1","uid":"bucketizer_4420f98ff7ff","paramMap":{"splits":["-Inf",54980.5,"Inf"],"outputCol":"MEDHI97_BINNED__","handleInvalid":"keep","inputCol":"MEDHI97_CLEANED__"}}
Here is the metadata for the NaiveBayes model:
{"class":"org.apache.spark.ml.classification.NaiveBayesModel","timestamp":1492551472568,"sparkVersion":"2.1.1","uid":"nb_f134e0890a0d","paramMap":{"modelType":"multinomial","probabilityCol":"_class_probability_column__","smoothing":1.0,"predictionCol":"_prediction_column_","rawPredictionCol":"rawPrediction","featuresCol":"_features_column__","labelCol":"DAYPOP_BINNED__"}}
and for the final SQLTransformer:
{"class":"org.apache.spark.ml.feature.SQLTransformer","timestamp":1492551472804,"sparkVersion":"2.1.1","uid":"sql_a8590b83c826","paramMap":{"statement":"SELECT *, CAST(_prediction_column_ AS INT) AS `_*_prediction_label_column_*__` FROM __THIS__"}}
Why does the pipeline get extremely slow with more than a couple hundred columns (and only a few rows), while millions of rows (with fewer columns) perform fine?
In addition to being slow to apply, this pipeline is also slow to create: the fit and evaluate steps take a few minutes each.
Is there anything that can be done to make it faster?
I get similar results using 2.1.1RC, 2.1.2(tip) and 2.2.0(tip). Spark 2.1.0 gives a Janino 64k limit error when trying to build this pipeline (see https://issues.apache.org/jira/browse/SPARK-16845).

Related

difflib.get_close_matches is not giving any output when I compared 2 columns in pandas dataframe

I have 2 pandas dataframes, one with clean city names (df2) and another with unclean city names (df1).
sample values:
df2.city_name: bangalore
df1.city_name: bongolor
I tried the below code to get the closest match for each city name:
import difflib

city_names = df2.city_name.to_list()
matches = []
for i in df1.city_name:
    matches.append(difflib.get_close_matches(i, city_names))
This is running for a very long time (more than an hour, so I stopped it).
I tried the fuzzywuzzy process function as well; please find the code below:
from fuzzywuzzy import process

list1 = df1.city_name.to_list()
list2 = df2.city_name.to_list()
mat1 = []
for i in list1:
    mat1.append(process.extract(i, list2, limit=10))
df1['match'] = mat1
This was also taking a very long time so I killed it.
Is there an optimized way to compare the column values and get the closest city name?
Note: df1.city_name has 3.3M values and df2.city_name has 2.7k.
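One way to cut the runtime dramatically, sketched below with made-up sample values: since df1 has 3.3M rows but presumably far fewer distinct spellings, match each unique name once and then map the results back to the rows.

```python
import difflib

# Hypothetical stand-ins for df1.city_name / df2.city_name.
unclean = ["bongolor", "bongolor", "mumbay", "bongolor"]  # df1-like, many duplicates
clean = ["bangalore", "mumbai", "delhi"]                  # df2-like reference list

# Match each *unique* unclean name once, then map results back to the rows.
# With 3.3M rows but far fewer distinct spellings, this reduces the number
# of get_close_matches calls from millions to (roughly) thousands.
lookup = {
    name: (difflib.get_close_matches(name, clean, n=1, cutoff=0.6) or [None])[0]
    for name in set(unclean)
}
matched = [lookup[name] for name in unclean]
```

In pandas the last line would be df1.city_name.map(lookup); the same dedupe-then-map trick applies unchanged to the fuzzywuzzy version.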

How to code a simple algorithm to fetch list of data through pagination in a fresh new application?

I'm making a clone of a social app, using GraphQL as my backend. My problem is that every time I query a list of data it returns the same result. When I release the app the user base will be very small, so the amount of data is small as well. I'm facing the issue described below:
1. My data in the database looks like:
id=1 title=hello1
id=2 title=hello2
id=3 title=hello3
2. When I query data through pagination with limit=3, I get a list of items like:
Query 1
id=1 title=hello1
id=2 title=hello2
id=3 title=hello3
3. When I add new items to the database, they are inserted in between the existing items, like below:
id=1 title=hello1
id=4 title=hello4
id=2 title=hello2
id=3 title=hello3
id=5 title=hello5
4. So the next fresh query result (limit=3) will be:
Query 2
id=1 title=hello1
id=4 title=hello4
id=2 title=hello2
Previously our query result was id=1, 2 & 3; now it is id=1, 4 & 2, so the user gets almost the same result, since id=1 and 2 are in the new list.
If I save the pagination nextToken/cursor (id=3) from the first query (query 1), then after new data is added the next query will start from id=5, because id=5 comes after id=3. Looking at the new dataset, that query misses id=4: the nextToken was saved at id=3, so the query resumes from id=5.
If your suggestion is to add a sort key on created_at: once I add such a filter the data set becomes so selective that it might limit the amount of data in the feed, and a feed should be able to query unlimited data.
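For reference, the created_at suggestion the question anticipates is usually implemented as keyset (cursor) pagination. A minimal Python sketch (all names hypothetical): order the feed newest-first by an immutable key such as (created_at, id) and resume strictly after the cursor's key, so items inserted mid-scroll can only land before already-served pages and a saved cursor neither repeats nor skips rows.

```python
from dataclasses import dataclass

@dataclass
class Post:
    id: int
    created_at: int  # epoch seconds; hypothetical schema
    title: str

def fetch_page(posts, limit, cursor=None):
    """Keyset pagination: sort newest-first by (created_at, id) and
    resume strictly after the cursor's sort key."""
    key = lambda p: (-p.created_at, -p.id)
    ordered = sorted(posts, key=key)
    if cursor is not None:
        ordered = [p for p in ordered if key(p) > key(cursor)]
    page = ordered[:limit]
    return page, (page[-1] if page else None)

posts = [Post(1, 100, "hello1"), Post(2, 200, "hello2"), Post(3, 300, "hello3")]
page1, cursor = fetch_page(posts, limit=2)            # newest first: ids 3, 2
posts.append(Post(4, 400, "hello4"))                  # new item arrives mid-scroll
page2, _ = fetch_page(posts, limit=2, cursor=cursor)  # id 1; id 4 is not re-served
```

Note this deliberately accepts the trade-off discussed above: id=4 is only seen on a *fresh* first page, never spliced into a page the user has already scrolled past.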

How to run SPARQL queries in R (WBG Topical Taxonomy) without parallelization

I am an R user and I am interested to use the World Bank Group (WBG) Topical Taxonomy through SPARQL queries.
This can be done directly on the API https://vocabulary.worldbank.org/PoolParty/sparql/taxonomy , but it can also be done through R by using the function load.rdf (to load the taxonomy.rdf rdfxml file downloaded from https://vocabulary.worldbank.org/ ) and then sparql.rdf to perform the query. These functions are available in the "rrdf" package.
These are the three lines of code:
taxonomy_file <- load.rdf("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
result_query_1 <- sparql.rdf(taxonomy_file, query)
What I obtain in result_query_1 is exactly what I get through the API.
However, the load.rdf function uses all the cores available on my computer, not only one. It is somehow parallelizing the load task over all the cores on my machine, and I do not want that. I haven't found any option on that function to force serial execution.
Therefore, I am trying to find other solutions. For instance, I have tried "rdf_parse" and "rdf_query" of the package "rdflib" but without any encouraging result. These are the code lines I have used.
taxonomy_file <- rdf_parse("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
result_query_2 <- rdf_query(taxonomy_file , query = query)
Is there any other function that performs this task? The objective of my work is to run several queries simultaneously using foreach.
Thank you very much for any suggestion you could provide me.

How to set up a pr.job_type.Murnaghan job that for each volume reads the structure output of another Murnaghan job?

For the energy-volume (Ene-Vol) calculations of non-cubic structures, one has to relax the structures at all volumes.
Suppose that I start with a pr.job_type.Murnaghan() job whose ref_job_relax is a cell-shape and internal-coordinates relaxation. Let's call the Murnaghan job R1, with 7 volumes, i.e. R1-V1, ..., R1-V7.
After one or more rounds of relaxation (R1...RN), one has to perform a static calculation to acquire a precise energy. Let's call the final static round S.
For the final round, I want to create a pr.job_type.Murnaghan() job that reads all the required setup configuration from the ref_job_static except the input structures.
Then for each volume S-Vn it should read the corresponding output structure of RN-Vn, e.g. R1-V1-->S-V1, ..., R1-V7-->S-V7 if there were only one round of relaxation.
I am looking for an implementation like below:
murn_relax = pr.create_job(pr.job_type.Murnaghan, 'R1')
murn_relax.ref_job = ref_job_relax
murn_relax.run()
murn_static = pr.create_job(pr.job_type.Murnaghan, 'S', continuation=True)
murn_static.ref_job = ref_job_static
murn_static.structures_from(prev_job='R1')
murn_static.run()
The Murnaghan object has two relevant functions:
get_structure() https://github.com/pyiron/pyiron_atomistics/blob/master/pyiron_atomistics/atomistics/master/murnaghan.py#L829
list_structures() https://github.com/pyiron/pyiron_atomistics/blob/master/pyiron_atomistics/atomistics/master/murnaghan.py#L727
The first returns the predicted equilibrium structure and the second returns the structures at the different volumes.
In addition you can get the IDs of the children and iterate over those:
structure_lst = [
    pr.load(job_id).get_structure()
    for job_id in murn_relax.child_ids
]
to get a list of converged structures.

How to evenly distribute data in apache pig output files?

I've got a pig-latin script that takes in some xml, uses the XPath UDF to pull out some fields and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
The above script works fine and produces the data that I'm looking for. However, it generates over 1,900 part-files with only a few MB in each file. I learned about the default_parallel option, so I set that to 128 to try to get 128 part-files. I ended up having to add a piece to force a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data, and I now get 128 part-files. My problem now is that the data is not evenly distributed among the part-files: some have 8 GB, others have 100 MB. I should have expected this when grouping by RANDOM() :).
My question is: what is the preferred way to limit the number of part-files while still keeping them evenly sized? I'm new to pig/pig latin and assume I'm going about this in completely the wrong way.
p.s. the reason I care about the number of part-files is because I'd like to process the output with spark and our spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the pig script but for now my "solution" is to repartition the data within the spark process that works on the output of the pig script. I use the RDD.coalesce function to rebalance the data.
From the first code snippet, I am assuming it is a map-only job, since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB (size given in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
pig.maxCombinedSplitSize - setting this property makes sure each mapper reads around 1 GB of data, and the above code works as an identity-mapper job, which helps you write data in 1 GB part-file chunks.