Group lines of text to a map using Kotlin.Collections functions - kotlin

Let's say I have a text file with contents like this:
[John]
likes: cats, dogs
dislikes: bananas
other info: is a man baby
[Amy]
likes: theater
dislikes: cosplay
[Ghost]
[Gary]
age: 42
Now, this file is read to a String. Is there a way to produce a Map<String, List<String>> that would have the following content:
Key: "John"
Value: ["likes: cats, dogs", "dislikes: bananas", "other info: is a man baby"]
Key: "Amy"
Value: ["likes: theater", "dislikes: cosplay"]
Key: "Ghost"
Value: []
Key: "Gary"
Value: ["age: 42"]
That is to say, is there a sequence of Kotlin.Collections operators that would take the key from the brackets and take all the following lines as the value for that key, collecting all these key-value pairs into a map? The number of lines belonging to any given entry is unknown beforehand: there might be any number of property lines, including zero.
I'm aware this is trivial to implement without Kotlin.Collections; the question is, is there a (possibly elegant) way of doing it with the Collections operations?

You can do it like this:
text.split("\n\n")
    .associate {
        val lines = it.split("\n")
        lines[0].drop(1).dropLast(1) to lines.drop(1)
    }
Here, we first divide the entire text into a list by splitting on two consecutive newlines, so that each element contains the data for one person. This assumes the entries in the text are separated by a blank line.
Next we use associate to convert the list into a map by mapping each list element to a map entry. To get the map entry, we first split the person's data string into lines. The key is lines[0].drop(1).dropLast(1), i.e. the first line with its first ([) and last (]) characters removed. The value is the list of all lines except the first one.
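A minimal runnable sketch of the steps above. The function name `parseBlankLineSeparated` and the shortened sample text are made up for illustration, and the input is assumed to separate entries with exactly one blank line and to have no trailing newline.

```kotlin
// Sketch only: assumes entries are separated by exactly one blank line
// and that the text does not end with a newline.
fun parseBlankLineSeparated(text: String): Map<String, List<String>> =
    text.split("\n\n")
        .associate { block ->
            val lines = block.split("\n")
            // "[John]" -> "John"; every following line becomes part of the value
            lines[0].drop(1).dropLast(1) to lines.drop(1)
        }

fun main() {
    // Hypothetical sample in the blank-line-separated layout
    val text = """
        [John]
        likes: cats, dogs
        dislikes: bananas

        [Ghost]

        [Gary]
        age: 42
    """.trimIndent()

    val result = parseBlankLineSeparated(text)
    println(result["John"])
    println(result["Ghost"])
}
```

An entry with no property lines, such as [Ghost], ends up with an empty list because lines.drop(1) on a one-element list is empty.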

This might work: it splits the content before each [ and then takes the remaining lines of each group as the value.
text.split("\\s(?=\\[)".toRegex())
    .map { it.split("\n").filter(String::isNotEmpty) }
    .associate {
        it.first().replace("[\\[\\]]".toRegex(), "") to it.drop(1)
    }
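A self-contained variant of this idea, run against the layout from the question (no blank lines between entries). The name `parseSections` is hypothetical. Two details differ from the snippet above and are my own choices: the split uses \n instead of \s in front of the lookahead, and the brackets are stripped with removeSurrounding.

```kotlin
// Sketch only: a new section starts at every line that begins with '['.
fun parseSections(text: String): Map<String, List<String>> =
    text.split("\\n(?=\\[)".toRegex())           // split at a newline followed by '['
        .map { section -> section.split("\n").filter(String::isNotEmpty) }
        .filter { it.isNotEmpty() }                 // guards against empty input
        .associate { lines ->
            // "[John]" -> "John"; the rest of the section is the value
            lines.first().removeSurrounding("[", "]") to lines.drop(1)
        }

fun main() {
    val text = listOf(
        "[John]", "likes: cats, dogs", "dislikes: bananas", "other info: is a man baby",
        "[Amy]", "likes: theater", "dislikes: cosplay",
        "[Ghost]",
        "[Gary]", "age: 42"
    ).joinToString("\n")

    parseSections(text).forEach { (key, value) -> println("$key -> $value") }
}
```

Because \s matches any whitespace, the original pattern would also split inside a property line that contains a space followed by [, which is why this sketch splits on \n only.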

Related

How to load a key with values of different types in tfrecord?

I have some third-party generated tfrecord files. I just found there is a specific key that has different value types in these tfrecord files, shown as follows.
key: "similarity"
value {
float_list {
value: 0.3015786111354828
}
}
key: "similarity"
value {
bytes_list {
value: ""
}
}
When I try to decode this key-value pair in tfrecord, I encounter a problem. I cannot find the suitable type for this key similarity. When I use tf.string or tfds.features.Text() in tfds.features.FeaturesDict for decoding, it returns the error
Data types don't match. Data type: float but expected type: string
When I use tf.float64 in tfds.features.FeaturesDict for decoding, it returns the error
Data types don't match. Data type: string but expected type: float
I wonder if there is anything in tfds.features or tf.train.Example that allows me to decode both float and string?
Or is there something like tfds.decode.SkipDecoding() that allows me to read this key similarity and decide how to decode it afterwards? I am aware that tfds.builder().as_dataset() has that option, but I cannot find one in tf.data.TFRecordDataset. I have tried simply removing the entry corresponding to the key similarity, but then the data read from the tfrecord dataset simply drops the entry similarity.
Thanks a lot!

How to chain filter expressions together

I have data in the following format
ArrayList<Map.Entry<String,ByteString>>
[
{"a":[a-bytestring]},
{"b":[b-bytestring]},
{"a:model":[amodel-bytestring]},
{"b:model":[bmodel-bytestring]},
]
I am looking for a clean way to transform this data into the format (List<Map.Entry<ByteString,ByteString>>) where the key is the value of a and value is the value of a:model.
Desired output
List<Map.Entry<ByteString,ByteString>>
[
{[a-bytestring]:[amodel-bytestring]},
{[b-bytestring]:[bmodel-bytestring]}
]
I assume this will involve the use of filters or other map operations, but I am not familiar enough with Kotlin yet to know.
It's not possible to give an exact, tested answer without access to the ByteString class — but I don't think that's needed for an outline, as we don't need to manipulate byte strings, just pass them around. So here I'm going to substitute Int; it should be clear and avoid any dependencies, but still work in the same way.
I'm also going to use a more obvious input structure, which is simply a map:
val input = mapOf("a" to 1,
"b" to 2,
"a:model" to 11,
"b:model" to 12)
As I understand it, what we want is to link each key without :model with the corresponding one with :model, and return a map of their corresponding values.
That can be done like this:
val output = input.filterKeys { !it.endsWith(":model") }
    .map { it.value to input["${it.key}:model"] }.toMap()
println(output) // Prints {1=11, 2=12}
The first line filters out all the entries whose keys end with :model, leaving only those without. Then the second creates a map from their values to the input values for the corresponding :model keys. (Unfortunately, there's no good general way to create one map directly from another; here map() creates a list of pairs, and then toMap() creates a map from that.)
I think if you replace Int with ByteString (or indeed any other type!), it should do what you ask.
The only thing to be aware of is that the output is a Map<Int, Int?> — i.e. the values are nullable. That's because there's no guarantee that each input key has a corresponding :model key; if it doesn't, the result will have a null value. If you want to omit those, you could call filterValues{ it != null } on the result.
However, if there's an ‘orphan’ :model key in the input, it will be ignored.
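If the nullable values are unwanted, one possible variant drops unmatched keys while building the pairs, so the result type has no nulls. This is only a sketch: `pairWithModels` is a hypothetical name, and Int again stands in for ByteString.

```kotlin
// Sketch only: links each key without ":model" to its ":model" counterpart
// and skips keys that have no counterpart, so the result is Map<V, V>.
fun <V : Any> pairWithModels(input: Map<String, V>): Map<V, V> =
    input.filterKeys { !it.endsWith(":model") }
        .mapNotNull { (key, value) ->
            // null when there is no matching ":model" entry; mapNotNull drops it
            input["$key:model"]?.let { model -> value to model }
        }
        .toMap()

fun main() {
    val input = mapOf(
        "a" to 1,
        "b" to 2,
        "a:model" to 11,
        "b:model" to 12,
        "c" to 3            // no "c:model", so it is left out of the result
    )
    println(pairWithModels(input))
}
```

Calling filterValues { it != null } on the earlier result removes the null entries but leaves the declared type as Map<Int, Int?>; mapNotNull avoids that. If the data arrives as a list of entries, as in the question, entries.associate { it.key to it.value } turns it into a map first.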

Kotlin functional programming keep reference of previous object in List

I have a list of Person objects which are related to each other by a spouse relation, according to the order in which they appear in the list.
enum class Gender {
MAN, WOMAN
}
data class Person(val name: String, val age: Int, val gender: Gender)
In the list, each Person with gender MAN is the spouse of the following Person with gender WOMAN (and vice versa), and the entries alternate in gender, with a MAN coming first.
The list should ideally be like [MAN, WOMAN, MAN, WOMAN, MAN, WOMAN] (obviously it will be a list of Person objects; for simplicity I am showing a list of Gender here), but it could also be like [WOMAN, MAN, WOMAN, MAN, WOMAN, MAN]. In the latter case, the first appearing WOMAN is the spouse of the last appearing MAN.
How could this second case be handled in Kotlin using functional programming?
My current approach involves checking whether the first Person has gender WOMAN; if so, I remove the first and last objects in the list and then add them at the end, but this is not a fully functional programming solution.
Can anyone guide me on that?
Thanks
What do you mean by fully functional approach?
Similar to what you mentioned, you can fix the order by a simple statement like this:
val correctList = if (list.first().gender == Gender.MAN) list else list.drop(1) + list.first()
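A runnable sketch of that rotation, assuming the Gender and Person definitions from the question. The helper names `manFirst` and `couples`, the isEmpty() guard, and the chunked(2) pairing step are my additions for illustration.

```kotlin
enum class Gender { MAN, WOMAN }
data class Person(val name: String, val age: Int, val gender: Gender)

// Rotate the list by one when it starts with a WOMAN, so that a MAN comes first.
// The isEmpty() guard is an addition: first() throws on an empty list.
fun manFirst(list: List<Person>): List<Person> =
    if (list.isEmpty() || list.first().gender == Gender.MAN) list
    else list.drop(1) + list.first()

// Group the rotated list into (man, woman) couples; a trailing unpaired entry is dropped.
fun couples(list: List<Person>): List<Pair<Person, Person>> =
    manFirst(list)
        .chunked(2)
        .filter { it.size == 2 }
        .map { (man, woman) -> man to woman }

fun main() {
    // Hypothetical sample data starting with a WOMAN
    val people = listOf(
        Person("Ann", 30, Gender.WOMAN),
        Person("Bob", 31, Gender.MAN),
        Person("Cat", 28, Gender.WOMAN),
        Person("Dan", 29, Gender.MAN)
    )
    couples(people).forEach { (man, woman) -> println("${man.name} + ${woman.name}") }
}
```

With this input the first WOMAN (Ann) is moved to the end and paired with the last MAN (Dan), which matches the rule stated in the question.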
If you want a more general approach, you can do something like this:
// separate the people into a list of gender=MAN and a list of everyone else
// the result is a Pair, so I'm destructuring that into two variables
val (men, women) = people.partition { it.gender == Gender.MAN }
// use zip to take a person from each list and combine them into a list of pairs
men.zip(women)
// use flatMap to turn each pair into a list of two people
// just using map would create a list of lists, flatMap flattens that into
// a single list of people, alternating gender=MAN, gender=WOMAN
.flatMap { it.toList() }
This way it doesn't matter how your original list is ordered, it can start with any element and you can have the different types completely mixed up - BABBAABA will still come out as ABABABAB. So it's a general way to combine mixed data streams - partition separates them into groups, and zip lets you take an element from each group and do something with them.
Here I'm just letting zip create Pairs, and then flatMap turns those back into an ordered list (if that's what you want). You could also do a forEach on each pair instead (say if you wanted to set a value on each Person to link them to each other), or zip can take a transform function too.
Also, zip terminates when one of the lists runs out (e.g. for AAA and BB you'll get two pairs), so this works for generating complete pairs of elements. If you also needed to handle elements without a "partner", you'd need to do a bit more work.
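The partition and zip steps can be put together as a small runnable sketch, again assuming the Gender and Person definitions from the question. The function names `pairUp`, `interleave` and `unpaired` are hypothetical, and `unpaired` is one possible way to do the "bit more work" for leftover elements.

```kotlin
enum class Gender { MAN, WOMAN }
data class Person(val name: String, val age: Int, val gender: Gender)

// Split by gender, then pair the n-th man with the n-th woman.
fun pairUp(people: List<Person>): List<Pair<Person, Person>> {
    val (men, women) = people.partition { it.gender == Gender.MAN }
    return men.zip(women)
}

// Flatten the pairs back into a single MAN, WOMAN, MAN, WOMAN, ... list.
fun interleave(people: List<Person>): List<Person> =
    pairUp(people).flatMap { it.toList() }

// Everyone zip left out because the other group ran out first.
fun unpaired(people: List<Person>): List<Person> {
    val (men, women) = people.partition { it.gender == Gender.MAN }
    return men.drop(women.size) + women.drop(men.size)
}

fun main() {
    // Hypothetical mixed-up input: W, M, M, W, M
    val people = listOf(
        Person("Ann", 30, Gender.WOMAN),
        Person("Bob", 31, Gender.MAN),
        Person("Dan", 29, Gender.MAN),
        Person("Cat", 28, Gender.WOMAN),
        Person("Eli", 40, Gender.MAN)
    )
    println(interleave(people).map { it.name })
    println(unpaired(people).map { it.name })
}
```

One thing to keep in mind: zip pairs by position within each group, so for a list that starts with a WOMAN it pairs the first man with the first woman. That differs from the rotation approach, where the first woman is paired with the last man.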

How to read data from data bag within a PIG script

I have a databag which is the following format
{([ChannelName#{ (bigXML,[])} ])}
The DataBag consists of only one item, which is a tuple.
The tuple consists of only one item, which is a map.
The map is between channel names and values.
Here the value is of type DataBag, which consists of only one tuple.
That tuple consists of two items: one is a chararray (a very big string) and the other is a map.
I have a UDF that emits the above bag.
Now I need to invoke another UDF, passing it the only tuple within the DataBag stored against a given channel in the map.
Assuming there were no data bag, only a tuple such as
([ChannelName#{ (bigXML,[])} ])
I can access the data using $0.$0#'StdOutChannel'
Now with the tuple inside a bag
{([ChannelName#{ (bigXML,[])} ])}
If I do $0.$0.$0#'StdOutChannel' (prepending $0), I get the following error:
ERROR 1052: Cannot cast bag with schema bag({bytearray}) to map
How can I access data within a data bag?
Try to break this problem down a little.
Let's say you get your inner bag:
MYBAG = $0.$0#'StdOutChannel';
First, can you ILLUSTRATE or DUMP this?
What can you do with this bag? Usually FOREACH over the tuples inside.
A = FOREACH MYBAG {
GENERATE $0 AS MyCharArray, $1 AS MyMap
};
ILLUSTRATE A; -- or if this doesn't work
DUMP A;
Can you try this interactively, and maybe edit your question with more details based on the results?
Some editing hints for StackOverflow:
put backticks around your code (`ILLUSTRATE`)
indent code blocks by 4 spaces on each line

MongoDB or CouchDB or something else?

I know this is another question on this topic but I am a complete beginner in the NoSQL world so I would love some advice. People at SO told me MySQL might be a bad idea for this dataset so I'm asking this. I have lots of data in the following format:
TYPE 1
ID1: String String String ...
ID2: String String String ...
ID3: String String String ...
ID4: String String String ...
which I am hoping to convert into something like this:
TYPE 2
ID1: String
ID1: String
ID1: String
ID1: String
ID2: String
ID2: String
This is the most inefficient way but I need to be able to search by both the key and the value. For instance, my queries would look like this:
I might need to know what all strings a given ID contains and then intersect the list with another list obtained for a different ID.
I might need to know what all IDs contain a given string
I would love to achieve this without transforming Type 1 into Type 2 because of the sheer space requirements, but would like to know if either MongoDB or CouchDB or something else (someone suggested NoSQL, so I started Googling and found these two are very popular) would help me out in this situation. I have a 14-node cluster I can leverage, but would love some advice on which one is the right database for this use case. Any suggestions?
A few extra things:
The input will mostly be static. I will create new data but will not modify any of the existing data.
The ID is 40 bytes in length whereas the strings are about 20 bytes
MongoDB will let you store this data efficiently in Type 1. Depending on your use, it will look like one of these (data is in JSON):
Array of Strings
{ "_id" : 1, "strings" : ["a", "b", "c", "d", "e"] }
Set of KV Strings
{ "_id" : 1, "s1" : "a", "s2" : "b", "s3" : "c", "s4" : "d", "s5" : "e" }
Based on your queries, I would probably use the Array of Strings method. Here's why:
I might need to know what all strings a given ID contains and then intersect the list with another list obtained for a different ID.
This is easy, you get one Key Value look-up for the ID. In code, it would look something like this:
db.my_collection.find({ "_id" : 1});
I might need to know what all IDs contain a given string
Similarly easy:
db.my_collection.find({ "strings" : "my_string" })
Yeah, it's that easy. I know that "strings" is technically an array, but MongoDB will recognize the field as an array and will look through its elements to find the value. Docs for this are here.
As a bonus, you can index the "strings" field and you will get an index on the array. So the find above will actually perform relatively fast (with the obvious trade-off that the index will be very large).
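To make the two lookups concrete, here is a plain in-memory Kotlin model of the Type 1 layout plus a string-to-IDs index. It is only an illustration of the idea and does not use any MongoDB API; the names `commonStrings` and `buildInvertedIndex` and the sample values are made up.

```kotlin
// Query 1: strings shared by two IDs, read straight from the Type 1 layout.
fun commonStrings(docs: Map<String, List<String>>, id1: String, id2: String): Set<String> =
    docs[id1].orEmpty() intersect docs[id2].orEmpty()

// Query 2 support: map each string to the set of IDs whose array contains it.
// This plays the role of an index over the array elements.
fun buildInvertedIndex(docs: Map<String, List<String>>): Map<String, Set<String>> =
    docs.flatMap { (id, strings) -> strings.map { it to id } }
        .groupBy({ it.first }, { it.second })
        .mapValues { it.value.toSet() }

fun main() {
    // Hypothetical Type 1 data: one entry per ID, holding all of its strings
    val docs = mapOf(
        "id1" to listOf("a", "b", "c"),
        "id2" to listOf("b", "c", "d")
    )
    println(commonStrings(docs, "id1", "id2"))
    println(buildInvertedIndex(docs)["b"])
}
```

The data itself stays in the compact Type 1 form; only the index holds one entry per (string, ID) combination, which is where the size trade-off mentioned above comes from.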
In terms of scaling, a 14-node cluster may almost be overkill. However, Mongo does support auto-sharding and replica sets. They even work together; here's a blog post from a 10gen member to get you started (10gen makes Mongo).