Rich string with cell alignment formatting using xlsxwriter

I'm writing an HTML parser that generates an XLSX file from an HTML table. The table contains colored data such as:
<td>Some <mark color="red"><b>coloured, bolded</b></mark> text</td>
During parsing, I generate an array of tokens ready for passing to write_rich_string or write_string depending on how many strings are generated by the HTML parser.
There are quite a few cases where the HTML parser generates an array of 2 strings and a format, to be written to a cell, like:
['string 1', 'string2', format]
I cannot use write_string because there is more than 1 string. But I cannot use write_rich_string either, because write_rich_string pops the format and chokes on an array of 2 strings. Passing the following data to write_rich_string does not raise any issue, which feels strange in comparison:
['string1', 'string2', 'string3', format]
Am I missing something?

A workaround could have been to join string1 and string2 and feed the result to write_string, but I thought this made the code unnecessarily complex.
Instead, I decided to use a 3rd, user-invisible string. This is easily achievable thanks to the zero-width space (\u200b):
string_parts = [...]
count = len(string_parts)
if count > 2:
    wb.write_rich_string(row, col, *string_parts)
elif count == 2:
    # Pad with an invisible zero-width space so write_rich_string
    # receives the minimum number of fragments it requires.
    string_parts = ['\u200b'] + string_parts
    wb.write_rich_string(row, col, *string_parts)
elif count == 1:
    wb.write_string(row, col, string_parts[0])
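For context, here is a minimal, self-contained sketch of the same idea; the workbook, worksheet, file name and format objects are hypothetical and only serve the illustration:

import xlsxwriter

wb = xlsxwriter.Workbook('demo.xlsx')
ws = wb.add_worksheet()
red_bold = wb.add_format({'bold': True, 'font_color': 'red'})
cell_fmt = wb.add_format({'align': 'center'})

# Three or more string fragments (plus an optional trailing cell format) are accepted:
ws.write_rich_string(0, 0, 'Some ', red_bold, 'coloured, bolded', ' text', cell_fmt)

# Two fragments plus the cell format would be rejected, so pad with a zero-width space:
parts = ['string 1', 'string2', cell_fmt]
ws.write_rich_string(1, 0, '\u200b', *parts)

wb.close()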

Related

Convert String to array and validate size on Vertica

I need to execute a SQL query which converts a String column to an array and then validates the size of that array.
I was able to do it easily with PostgreSQL, e.g.:
select
  cardinality(string_to_array('a$b','$')),
  cardinality(string_to_array('a$b$','$')),
  cardinality(string_to_array('a$b$$$$$','$'))
But for some reason, converting a String to an array on Vertica is not that simple. I saw these links:
https://www.vertica.com/blog/vertica-quick-tip-dynamically-split-string/
https://forum.vertica.com/discussion/239031/how-to-create-an-array-in-vertica
and more, but none of them helped.
I also tried:
select REGEXP_COUNT('a$b$$$$$','$')
But I get an incorrect value: 1.
How can I convert a String to an array on Vertica and get its length?
$ has a special meaning in a regular expression. It represents the end of the string.
Try escaping it:
select REGEXP_COUNT('a$b$$$$$', '[$]')
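To see the same effect outside SQL, here is a quick Python check of the two patterns (purely illustrative; Vertica's REGEXP_COUNT follows the same regular-expression rules):

import re

s = 'a$b$$$$$'

# '$' is the end-of-string anchor, so it matches exactly once:
print(len(re.findall(r'$', s)))    # 1

# '[$]' matches the literal dollar character, once per delimiter:
print(len(re.findall(r'[$]', s)))  # 6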
You could create a UDx scalar function (UDSF) in Java, C++, R or Python. The input would be a string and the output would be an integer. https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ExtendingVertica/UDx/ScalarFunctions/ScalarFunctions.htm
This will allow you to use language-specific array logic on the strings passed in. For example, in Python you could include this logic:
input_list = input.split("$")
filtered_input_list = list(filter(None, input_list))
list_count = len(filtered_input_list)
These examples are a good starting point for writing UDx's for Vertica. https://github.com/vertica/UDx-Examples
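As a rough sketch only, this is approximately how that split-and-count logic could be wired into a Vertica Python UDSF. The class and method names below follow the patterns shown in the linked documentation and the UDx-Examples repository; treat the exact signatures as assumptions to verify there:

import vertica_sdk

class CountTokens(vertica_sdk.ScalarFunction):
    """Count the non-empty '$'-separated tokens in a varchar argument."""

    def processBlock(self, server_interface, arg_reader, res_writer):
        while True:
            input_list = arg_reader.getString(0).split("$")
            filtered_input_list = list(filter(None, input_list))
            res_writer.setInt(len(filtered_input_list))
            res_writer.next()
            if not arg_reader.next():
                break

class CountTokensFactory(vertica_sdk.ScalarFunctionFactory):
    def createScalarFunction(self, srv):
        return CountTokens()

    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addVarchar()
        return_type.addInt()

    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addInt()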
I wasn't able to convert to an array, but I am able to get the length of the values.
What I do is convert the string to rows and use count. It's not the best performance-wise, but this way I can also do manipulation such as filtering each value between delimiters, and I don't need to use [] for characters like $.
select (select count(1)
        from (select StringTokenizerDelim('a$b$c','$') over ()) t)
Returns 3.

Parsing multiple values with Google Refine

I've a CSV column with content like this (just an example):
[{"qual"=>"05-Admmin "name"=>"CLARK C COHO"}, {"qual"=>"20-Soc Con", "name"=>"ALPHA S A"}, {"qual"=>"20-Soc Con", "name"=>"JACK SA"}
I would like to automatically extract the values from the "name" field and separate them by commas, resulting in something like this: CLARK C COHO, ALPHA S A, JACK SA, and so on.
I know that I can get a specific value with this code:
value.parseJson()[0].name
I've been reading the documentation but I can't figure out how to loop this over all the entries.
Any tips?
EDIT:
Here is another example of the column. The content really look like this:
[{"qual"=>"49-SocAdm", "name"=>"ALVARO R L"}, {"qual"=>"49-SocAdm", "name"=>"GABRIEL G L"}]
The data in your CSV is not in JSON format. I do not know exactly what it is: a kind of key-value format, but I do not know which one. In addition, it sometimes lacks a comma or a bracket. We could try to transform it into valid JSON, but it is easier to extract the information using regular expressions. Here is an example with Python/Jython.
import re
pattern = re.compile(r'"name"=>"(.+?)"', re.M)
return ", ".join(pattern.findall(value))

Find records where length of array equal to - Rails 4

In my Room model, I have an attribute named available_days, which is being stored as an array.
For example:
Room.first.available_days
=> ["wed", "thurs", "fri"]
What is the best way to find all Rooms where the size of the array is equal to 3?
I've tried something like
Room.where('LENGTH(available_days) = ?', 3)
with no success.
Update: the data type for available_days is a string, but in order to store an array, I am serializing the attribute from my model:
app/models/room.rb
serialize :available_days
Can't think of a purely SQL way of doing it for SQLite, since available_days is a string.
But here's one way of doing it without loading all records at once.
rooms = []
Room.in_batches(of: 10).each_record do |r|
  rooms << r if r.available_days.length == 3
end
p rooms
If you're using Postgres you can parse the serialized string into an array type, then query on the length of the array. I expect other databases have similar approaches. How to do this depends on how the text is being serialized, but the Rails 4 default is YAML, so I expect your data is encoded like this:
---
- first
- second
The following SQL removes the leading ---\n- as well as the final newline, then splits the remaining string on the '\n- ' separators into an array. It's not strictly necessary to clean up the extra characters just to find the length, but if you want to do other operations you may find it useful to have a cleaned-up array (no leading characters or trailing newline). This will only work for simple YAML arrays of simple strings.
Room.where("ARRAY_LENGTH(STRING_TO_ARRAY(RTRIM(REPLACE(available_days,'---\n- ',''),'\n'), '\n- '), 1) = ?", 3)
As you can see, this approach is rather complex. If possible you may want to add a new structured column (array or jsonb) and migrate the serialized string into a typed column to make this easier and more performant. Rails supports jsonb serialization for Postgres.

pyspark.sql DataFrame: understanding functions

I am taking a MOOC.
It has one assignment where a column needs to be converted to lower case. sentence = lower(column) does the trick, but initially I thought the syntax should be sentence = column.lower(). I looked at the documentation and couldn't figure out the problem with my syntax. Would it be possible to explain how I could have figured out that my syntax was wrong by searching the online documentation and function definitions?
I am especially confused because this link shows that string.lower() does the trick for regular Python string objects.
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    sentence = lower(column)
    return sentence

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)

(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
You are correct. When you are working with a string, if you want to convert it to lowercase, you should use str.lower().
And if you check the String page in the Python Documentation, you will see it has a lower method that should work as you expect:
a_string = "StringToConvert"
a_string.lower() # "stringtoconvert"
However, in the Spark example you provided, in your function removePunctuation you are NOT working with a single string; you are working with a Column. A Column is a different object than a string, which is why you should use a method that works with a Column.
Specifically, you are working with this pyspark.sql method. The next time you are in doubt about which method you need, double-check the datatype of your objects. Also, if you check the list of imports, you will see that the code is calling the lower function from pyspark.sql.functions.
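A short contrast, just to illustrate the difference (the SparkSession and DataFrame below are hypothetical, built only for this demo):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('Hi, you!',)], ['sentence'])

'Hi, you!'.lower()                        # str method on a plain Python string
df.select(lower(col('sentence'))).show()  # lower() from pyspark.sql.functions on a Column
# df.select(col('sentence').lower())      # fails: Column does not provide a .lower() method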
This is how I managed to do it:
lowered = lower(column)
np_lowered = regexp_replace(lowered, '[^\w\s]', '')
trimmed_np_lowered = trim(np_lowered)
return trimmed_np_lowered
Or, as a single expression:
return trim(lower(regexp_replace(column, "\p{Punct}", ""))).alias('sentence')

How to cut out part of a string and insert it into another string in VB.NET?

I have an unknown number of strings (assume 3):
<li><a href="a.html">A</a></li>
<li><a href="b.html">B</a></li>
<li><a href="c.html">C</a></li>
I want to cut out a.html, b.html and c.html, then put them into the following structure, given a string MyLink = "http://ccc.com/":
randomlinks[0]="http://ccc.com/a.html"
randomlinks[1]="http://ccc.com/b.html"
randomlinks[2]="http://ccc.com/c.html"
What functions in VB.NET allow me to do that?
If your strings always have this exact format, String.Split is your friend.
myString.Split(""""c)
With myString containing the first of your strings, this will yield a three-element array with the following entries:
<li><a href=
a.html
>A</a></li>
How to proceed from there should be obvious and is left as an exercise to the reader. :-)
If the strings don't always have this exact format, an HTML parsing engine is probably the right tool.