I am experimenting with Java 8 streams and collections in Jython to see whether they are any more efficient than the same operations implemented in pure Jython. It seems to me that they could be (any comments on this are also appreciated).
I started with a few examples, beginning with counting:
from java.util.function import Function
from java.util import ArrayList
from java.util.stream import Collectors
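letters = ArrayList(['a', 'b', 'a', 'c'])
cnt = letters.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
Printing cnt as a dictionary gives: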
{u'a': 2L, u'b': 1L, u'c': 1L}
So far so good. Next, I found an example of using filter on streams in Java:
List<String> strings = Arrays.asList("abc", "", "bc", "efg", "abcd", "", "jkl");
// get the count of empty strings
long count = strings.stream().filter(string -> string.isEmpty()).count();
How would this translate to Jython? Specifically, how can one write a Java lambda expression like string -> string.isEmpty() in Jython?
Using filter on a stream requires an object of type Predicate (java.util.function.Predicate).
For the Java code:
List<String> strings = Arrays.asList("abc", "", "bc", "efg", "abcd", "", "jkl");
// get the count of empty strings
long count = strings.stream().filter(string -> string.isEmpty()).count();
the equivalent in Jython is to first subclass Predicate and implement its test method:
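from java.lang import String
from java.util import ArrayList
from java.util.function import Predicate
from java.util.stream import Collectors

class pred(Predicate):
    # Adapts a plain Python callable to java.util.function.Predicate
    def __init__(self, fn):
        self.fn = fn
    def test(self, s):
        return self.fn(s)

def mytest(s):
    # Keeps the non-empty strings; drop the "not" to match the Java example exactly
    return not String(s).isEmpty()  # or simply len(s) > 0

strings = ArrayList(["abc", "", "bc", "efg", "abcd", "", "jkl"])
count = strings.stream().filter(pred(mytest)).count()
print(strings.stream().filter(pred(mytest)).collect(Collectors.toList()))
This then prints:
[abc, bc, efg, abcd, jkl]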
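For the efficiency comparison raised in the question, a pure-Python baseline for the same two operations might look like the sketch below. It uses only the standard library (collections.Counter should also be available in Jython 2.7), and the names empty_count and non_empty are illustrative, not taken from the original posts.

```python
from collections import Counter

# Counterpart of Collectors.groupingBy(Function.identity(), Collectors.counting())
letters = ['a', 'b', 'a', 'c']
cnt = Counter(letters)

# Counterpart of stream().filter(...).count() and of collecting the filtered items
strings = ["abc", "", "bc", "efg", "abcd", "", "jkl"]
empty_count = sum(1 for s in strings if not s)   # number of empty strings
non_empty = [s for s in strings if s]            # the strings that are not empty

print(dict(cnt))
print(empty_count)
print(non_empty)
```

Timing this against the stream-based version on realistic input sizes would be the way to answer the efficiency question; crossing between Python and Java objects has a cost of its own, so the result is not obvious in advance.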
I'm looking to access some fields on a Kafka ConsumerRecord. I'm able to receive the event data, which is a Java object, i.e. ConsumerRecord(topic = test.topic, partition = 0, leaderEpoch = 0, offset = 0, CreateTime = 1660933724665, serialized key size = 32, serialized value size = 394, headers = RecordHeaders(headers = [], isReadOnly = false), key = db166cbf1e9e438ab4eae15093f89c34, value = {"eventInfo":...}).
I'm able to access the eventInfo values, which come back as a JSON string. I'm fairly new to Kotlin and Kafka, so I'm not sure this is correct: I basically want to access the fields in value, but I can't get rid of an error that appears when I try to use mapper.readValue:
None of the following functions can be called with the arguments supplied.
import com.afterpay.shop.favorites.model.Product
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import org.apache.avro.generic.GenericData.Record
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.springframework.kafka.annotation.KafkaListener
import org.springframework.kafka.support.Acknowledgment
import org.springframework.stereotype.Component
@Component
class KafkaConsumer {
    @KafkaListener(topics = ["test.topic"], groupId = "group-id")
    fun consume(consumerRecord: ConsumerRecord<String, Any>, ack: Acknowledgment) {
        val mapper = jacksonObjectMapper()
        val value = consumerRecord.value()
        // Compile error here: value is Any, and readValue has no overload that accepts Any
        val record = mapper.readValue(value, Product::class.java)
    }
}
Is this the correct way to accomplish this?
First, change ConsumerRecord<String, Any> to ConsumerRecord<String, Product>, then change value.deserializer in your consumer config/factory to use JsonDeserializer.
Then consumerRecord.value() will already be a Product instance, and you don't need an ObjectMapper.
Otherwise, if you use StringDeserializer, change Any to String so that the mapper.readValue argument types are correct.
How do I annotate a Django queryset with a Regex capture group without using RawSQL so that I later can use that value for filtering and sorting?
For example, in PostgreSQL I could make the following query:
CREATE TABLE foo (id varchar(100));
INSERT INTO foo (id) VALUES ('disk1'), ('disk10'), ('disk2');
SELECT "foo"."id",
       CAST((regexp_matches("foo"."id", '^(.*\D)([0-9]*)$'))[2] AS integer) AS grp2
FROM "foo"
ORDER BY "grp2";
You can get this working with a custom Func class, but I wanted to implement it in a cleaner way: as a regular function that can be combined with other functions and annotations for further processing, like a building block in the Django ORM ecosystem.
I would like to start with a "beta version" of the class, which looks like this:
from django.db.models.expressions import Func, Value

class RegexpMatches(Func):
    function = 'REGEXP_MATCHES'

    def __init__(self, source, regexp, flags=None, group=None, output_field=None, **extra):
        template = '%(function)s(%(expressions)s)'

        # Wrap a plain string so Django treats it as a literal, not as a field name
        if not hasattr(regexp, 'resolve_expression'):
            regexp = Value(regexp)

        if group:
            # Select a single capture group from the array that REGEXP_MATCHES returns
            template = '({})[{}]'.format(template, str(group))

        expressions = (source, regexp)
        if flags:
            if not hasattr(flags, 'resolve_expression'):
                flags = Value(flags)
            expressions += (flags,)

        self.template = template
        super().__init__(*expressions, output_field=output_field, **extra)
and a fully working example for an admin interface:
from django.contrib.admin import ModelAdmin, register
from django.db.models import IntegerField
from django.db.models.functions import Cast
from django.db.models.expressions import Func, Value
from .models import Foo
# RegexpMatches is the class defined above

@register(Foo)
class FooAdmin(ModelAdmin):
    list_display = ['id', 'required_field', 'required_field_string']

    def get_queryset(self, request):
        qs = super().get_queryset(request)
        return qs.annotate(
            required_field=Cast(RegexpMatches('id', r'^(.*\D)([0-9]*)$', group=2), output_field=IntegerField()),
            required_field_string=RegexpMatches('id', r'^(.*\D)([0-9]*)$', group=2)
        )

    def required_field(self, obj):
        return obj.required_field

    def required_field_string(self, obj):
        return obj.required_field_string
As you can see, I've added two annotations: one is output as a number and the other as a plain string. The difference isn't visible in the admin interface, but it shows in the SQL that is executed:
SELECT "test_foo"."id" AS Col1,
((REGEXP_MATCHES("test_foo"."id", '^(.*\D)([0-9]*)$'))[2])::integer AS "required_field", (REGEXP_MATCHES("test_foo"."id", '^(.*\D)([0-9]*)$'))[2] AS "required_field_string"
FROM "test_foo"
GitHub gist with better source code formatting: https://gist.github.com/phpdude/50675114aaed953b820e5559f8d22166
From Django 1.8 onwards, you can use Func() expressions.
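from django.db.models import Func, IntegerField

class EndNumeric(Func):
    function = 'REGEXP_MATCHES'
    # Raw string so that \D reaches PostgreSQL unchanged
    template = r"(%(function)s(%(expressions)s, '^(.*\D)([0-9]*)$'))[2]::integer"

qs = Foo.objects.annotate(
    grp2=EndNumeric('id', output_field=IntegerField()),
).values('id', 'grp2').order_by('grp2')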
Reference: Get sorted queryset by specified field with regex in django
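To see what the annotation computes, independent of Django and PostgreSQL, the same pattern can be tried in plain Python. This is only a sketch: the helper name end_numeric is hypothetical, and Python's re module stands in for PostgreSQL's regexp_matches, whose regex dialect differs in some details.

```python
import re

# Same pattern as in the SQL: group 1 ends in a non-digit, group 2 is the trailing digits
PATTERN = re.compile(r'^(.*\D)([0-9]*)$')

def end_numeric(value):
    """Return the trailing number of value as an int, or None if there is none."""
    match = PATTERN.match(value)
    if match is None or match.group(2) == '':
        return None
    return int(match.group(2))

ids = ['disk1', 'disk10', 'disk2']

# Sorting by the extracted integer gives numeric rather than lexicographic order
by_number = sorted(ids, key=end_numeric)
print(by_number)
```

Sorting on the raw strings would place disk10 before disk2, which is why the cast to integer matters in the queryset as well.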
I was using the code below to extract the strings I needed in Spark SQL. Now I am working with more data on Spark with Hadoop and want to extract strings in the same way. I tried the same code, but it does not work.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// User-defined function that collects every hashtag found in a string
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  // Return value assumed; the original snippet is cut off before this point
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
How do I solve this problem?
First off, you're not using Spark the way it's meant to be used. Your DataFrame isn't partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...)
That way, each batch of words is assigned to a different partition, and partitions can run on different JVMs and/or cluster nodes (depending on the total number of partitions and the size of the data). In your solution, the entire sentence is assigned to a single partition, so there is neither parallelism nor distribution.
Second, you don't have to use a UDF (try to avoid those in general).
In order to find your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
Hope it helps :-)
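One thing worth keeping in mind: a filter with rlike keeps or drops whole rows, while the UDF in the question pulls out the matching substrings themselves. The difference can be sketched in plain Python (not Spark), using the same pattern. The first row is a shortened version of the sentence from the question and the second row is made up for illustration.

```python
import re

HASHTAG = re.compile(r'#\w+')

rows = [
    "#always_nidhi #YouTube no i dnt understand bt i loved the music",
    "no hashtags in this row",
]

# Filtering (analogous to rlike): keep only the rows that contain a match
matching_rows = [row for row in rows if HASHTAG.search(row)]

# Extraction (analogous to the UDF): collect the matched substrings themselves
extracted = [HASHTAG.findall(row) for row in matching_rows]

print(matching_rows)
print(extracted)
```

If the goal is the hashtags rather than the rows that contain them, filtering alone is not enough and some extraction step is still needed.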
I'm generating small DataFrames in a for loop. In each iteration, I pass the generated DataFrame to a function that returns a Double. This simple process (which I thought the garbage collector could easily handle) blows up my memory. When I look at the Spark UI, each iteration adds a new "SQL{1-500}" entry (my loop runs 500 times). My question is: how do I drop this SQL object before generating a new one?
My code is something like this:
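for (i <- 1 to 500) {
  val data = (1 to 1000).map(_ => Random.nextInt(1000))
  val dataframe = createDataFrame(data)
  myFunction(dataframe)
}

def myFunction(df: DataFrame) = {
  // ... computes and returns a Double (body not shown in the post)
}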
I tried to solve this problem by dataframe.unpersist() and sqlContext.clearCache() but neither of them worked.
You have two places where I suspect something fishy is happening:
In the definition of myFunction: you really need to put the = before the body of the definition. I have had typos like that compile but produce really weird errors (note that I changed your myFunction for debugging purposes).
It is better to fill your Seq with something you know and then apply foreach or similar.
(You also need to replace random.nexInt with Random.nextInt. In addition, you can only create a DataFrame from a Seq whose element type is a subtype of Product, such as a tuple, and you need to call createDataFrame on sqlContext.)
This code works with no memory issues:
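Seq.fill(500)(0).foreach { i =>
  val data = {1 to 1000}.map(_.toDouble).toList.zipWithIndex
  val dataframe = sqlContext.createDataFrame(data)
  def myFunction(df: DataFrame) = {
    // Body assumed for illustration: any action on the DataFrame, here a count
    df.count()
  }
  myFunction(dataframe)
}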
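Edit: parallelizing the computation (across 10 partitions) and returning the RDD of counts. Be aware that Spark does not generally support creating DataFrames inside an RDD transformation, so this variant may fail outside of simple local tests:
sc.parallelize(Seq.fill(500)(0), 10).map { i =>
  val data = {1 to 1000}.map(_.toDouble).toList.zipWithIndex
  val dataframe = sqlContext.createDataFrame(data)
  def myFunction(df: DataFrame) = {
    // Body assumed for illustration: any action on the DataFrame, here a count
    df.count()
  }
  myFunction(dataframe)
}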
Edit 2: the difference between declaring myFunction with = and without = is that the first is a (usual) function definition, while the second is a procedure definition, which is only used for methods that return Unit. See explanation. This point is illustrated in spark-shell:
scala> def myf(df:DataFrame) = df.count()
myf: (df: org.apache.spark.sql.DataFrame)Long
scala> def myf2(df:DataFrame) { df.count() }
myf2: (df: org.apache.spark.sql.DataFrame)Unit
I'm trying to persist a groovy map to a file. My current attempt is to write the string representation out and then read it back in and call evaluate on it to recreate the map when I'm ready to use it again.
The problem I'm having is that the toString() method of the map removes vital quotes from the values of the elements. When my code calls evaluate, it complains about an unknown identifier.
This code demonstrates the problem:
m = [a: 123, b: 'test']
print "orig: $m\n"
s = m.toString()
print " str: $s\n"
m2 = evaluate(s)
print " new: ${m2}\n"
The first two print statements almost work -- but the quotes around the value for the key b are gone. Instead of showing [a: 123, b: 'test'], it shows [a: 123, b: test].
At this point the damage is done. The evaluate call chokes when it tries to evaluate test as an identifier and not a string.
So, my specific questions:
Is there a better way to serialize/de-serialize maps in Groovy?
Is there a way to produce a string representation of a map with proper quotes?
Groovy provides the inspect() method, which returns an object as a parseable string:
// serialize
def m = [a: 123, b: 'test']
def str = m.inspect()
// deserialize
m = Eval.me(str)
Another way to serialize a groovy map as a readable string is with JSON:
// serialize
import groovy.json.JsonBuilder
def m = [a: 123, b: 'test']
// Pass the map to the builder; an empty JsonBuilder has no content to serialize
def builder = new JsonBuilder(m)
println builder.toString()
// deserialize
import groovy.json.JsonSlurper
def slurper = new JsonSlurper()
m = slurper.parseText('{"a": 123, "b": "test"}')
You can use myMap.toMapString()