Apache Spark to add numbers from an ArrayList

I was looking for a Spark program that adds the elements of an existing Integer ArrayList. I went through all the transformations and actions in Apache Spark but couldn't find a suitable one to just add the elements.
If someone could tell me how to write the code for the above, i.e. add the elements of an ArrayList in Spark, that would be great.
Thanks.

If you have an RDD[Int] as shown below:
val myRdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
you could do the following to add up the elements:
myRdd.reduce(_+_)
res1: Int = 21
Alternatively, you could use fold with 0 as the neutral element:
myRdd.fold(0)(_+_)
res6: Int = 21
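Since the question starts from an existing Integer ArrayList, here is a minimal sketch of converting such a Java list into an RDD before reducing (assuming a Spark shell where sc is available; the javaList variable and its contents are just illustrative):
import scala.collection.JavaConverters._
// a Java ArrayList like the one in the question (hypothetical example data)
val javaList = new java.util.ArrayList[Integer]()
javaList.add(1); javaList.add(2); javaList.add(3)
javaList.add(4); javaList.add(5); javaList.add(6)
// convert the Java Integers to Scala Ints, parallelize, then sum
val listRdd = sc.parallelize(javaList.asScala.map(_.intValue).toSeq)
listRdd.reduce(_ + _)   // Int = 21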
Hope it helps.

Related

Passing DataFrame from notebook to another with pyspark

I'm trying to call a DataFrame that I created in notebook1, to use it in my notebook2 in Databricks Community Edition with PySpark, and I tried this code: dbutils.notebook.run("notebook1", 60, {"dfnumber2"})
but it shows this error:
py4j.Py4JException: Method _run([class java.lang.String, class java.lang.Integer, class java.util.HashSet, null, class java.lang.String]) does not exist
Any help, please?
The actual problem is that you pass the last parameter ({"dfnumber2"}) incorrectly - with this syntax it's a set, not a map. You need to use the syntax {"table_name": "dfnumber2"} to represent it as a dict/map.
But if you look into the documentation of dbutils.notebook.run, you will see the following phrase:
To implement notebook workflows, use the dbutils.notebook.* methods. Unlike %run, the dbutils.notebook.run() method starts a new job to run the notebook.
But jobs aren't supported on the Community Edition, so it won't work anyway.
Create a global temp view and pass the table name as an argument to your next notebook:
dfnumber2.createOrReplaceGlobalTempView("dfnumber2")
dbutils.notebook.run("notebook1", 60, {"table_name": "dfnumber2"})
In your notebook1 you can then do:
table_name = dbutils.widgets.get("table_name")
dfnumber2 = spark.sql("select * from global_temp." + table_name)

Talend - Dynamic Column Name (Enterprise version)

Can anyone help me solve this case?
I have many files to process; two of them are shown in the screenshot below, together with my expected output.
I use this transformation on Talend: tFileList---tInputExcel---tUnpivotRow---tMap---tPostgresqlOutput
The output is different from my expected output. This is a screenshot of the output.
Can anyone help me reach my expected output, which is shown in my first picture above?
This will be pretty hard. You'd have to handle that as a text file, and whenever you find the "store" value in the first column, you'd update your types with the values from that row.
Here's how I'd start:
Basically, the tJavaFlex begin part would contain:
String col1Type;
String col2Type;
String colNType;
main part:
if (input_row.col0.equalsIgnoreCase("store")) {
    col1Type = input_row.col1;
    col2Type = input_row.col2;
    colNType = input_row.colN;
    continue; /* so this record will be ignored by the rest of the components */
}
output_row.col1Type = col1Type;
output_row.col1Value = Integer.valueOf(input_row.col1); /* because we have text and need numbers :( */
I think using "propagate results" will save you from writing out all the other fields.
And from there it would be very simple, as you have key-type-value-type-value-type-value results.

How to evenly distribute data in apache pig output files?

I've got a Pig Latin script that takes in some XML, uses the XPath UDF to pull out some fields, and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
The script above works fine and produces the data I'm looking for. However, it generates over 1,900 part files with only a few MB in each file. I learned about the default_parallel option, so I set it to 128 to try and get 128 part files. I ended up having to add a piece to force a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data, and I now get 128 part files. My problem now is that the data is not evenly distributed among the part files. Some have 8 GB, others have 100 MB. I should have expected this when grouping them by RANDOM() :).
My question is: what would be the preferred way to limit the number of part files yet still have them evenly sized? I'm new to Pig/Pig Latin and assume I'm going about this in the completely wrong way.
P.S. The reason I care about the number of part files is that I'd like to process the output with Spark, and our Spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the Pig script, but for now my "solution" is to repartition the data within the Spark process that works on the output of the Pig script. I use the RDD.coalesce function to rebalance the data.
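For reference, a minimal sketch of that Spark-side workaround (the HDFS paths and the target of 128 partitions are just illustrative assumptions):
// read the Pig output from a hypothetical path, shrink it to 128 partitions, write it back out
val raw = sc.textFile("hdfs:///path/to/pig/output")
// coalesce reduces the partition count without a full shuffle;
// use coalesce(128, shuffle = true) if the input partitions are badly skewed
val balanced = raw.coalesce(128)
balanced.saveAsTextFile("hdfs:///path/to/pig/output-balanced")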
From the first code snippet, I am assuming it is a map-only job, since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB (size given in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
pig.maxCombinedSplitSize - setting this property makes sure each mapper reads around 1 GB of data, and the code above works as an identity-mapper job, which helps you write the data out in roughly 1 GB part-file chunks.

scilab : index in variable name loop

I would like to read some images with Scilab, and I use the imread function like this:
im01=imread('kodim01t.jpg');
im02=imread('kodim02t.jpg');
im03=imread('kodim03t.jpg');
im04=imread('kodim04t.jpg');
im05=imread('kodim05t.jpg');
im06=imread('kodim06t.jpg');
im07=imread('kodim07t.jpg');
im08=imread('kodim08t.jpg');
im09=imread('kodim09t.jpg');
im10=imread('kodim10t.jpg');
I would like to know if there is a way to do something like below in order to optimize the code:
for i = 1:5
im&i=imread('kodim0&i.jpg');
end
Thanks in advance.
I see two possible solutions: using execstr, or using some kind of list/matrix.
Execstr
First, create a string of the command to execute with msprintf, and then execute it with execstr. Note that in the msprintf conversion the right number of leading zeros is inserted by the %02d format specifier described here.
for i = 1:5
    // doubled single quotes escape the quote character inside a Scilab string
    cmd = msprintf('im%02d=imread(''kodim%02dt.jpg'');', i, i);
    execstr(cmd);
end
List/Matrix
This is probably the more intuitive option, using an indexable container such as a list.
// This list could be generated using msprintf from example above
file_names_list = list("kodim01t.jpg", "kodim02t.jpg" ,"kodim03t.jpg");
// Create empty list to contain images
opened_images = list();
for i = 1:length(file_names_list)
    // Open the image and insert it at the end of the list
    opened_images($+1) = imread(file_names_list(i));
end

How to create secondary index in Cassandra Hector API programmatically

I have been trying to create an index using the set of lines below.
KeyspaceDefinition fromCluster = cluster.describeKeyspace(KEYSPACE);
ColumnFamilyDefinition cfDef = fromCluster.getCfDefs().get(0);
BasicColumnFamilyDefinition columnFamilyDefinition = new BasicColumnFamilyDefinition(cfDef);
BasicColumnDefinition columnDefinition = new BasicColumnDefinition();
columnDefinition.setName(StringSerializer.get().toByteBuffer("A_NO"));
columnDefinition.setIndexName("A_NO_idx");
columnDefinition.setIndexType(ColumnIndexType.KEYS);
columnDefinition.setValidationClass(ComparatorType.UTF8TYPE.getClassName());
columnFamilyDefinition.addColumnDefinition(columnDefinition);
But I am unable to do so. I am actually creating the columns dynamically and storing data in them dynamically, and after that, for better querying, I am trying to put an index on some particular columns. Any suggestions on how to do that?
It's actually quite simple. You just have to create the secondary index while defining your column family. In the code above, all the manipulation is done on an index object that has to be created at definition time. The steps for adding an index are:
List<ColumnDef> columns = new ArrayList<ColumnDef>();
columns.add(newIndexedColumnDef("columnName", "UTF8Type"));
List<ColumnDefinition> columnMetadata = ThriftColumnDef.fromThriftList(columns);
cdefs.add(cf_def); // cf_def is your column family definition
The helper method's code is from KeyspaceCreationTest:
public ColumnDef newIndexedColumnDef(String column_name, String comparer) {
    // se is a StringSerializer used to encode the column name
    ColumnDef cd = new ColumnDef(se.toByteBuffer(column_name), comparer);
    cd.setIndex_name(column_name);
    cd.setIndex_type(IndexType.KEYS);
    return cd;
}
References for comparer can be found here.
I hope it helps.