How to show Feature Names in Graphviz? - pandas

I'm building a tree in Graphviz and I can't seem to get the feature names to show up. I have defined a list with the feature names like so:
names = list(df.columns.values)
Which prints:
['Gender',
'SuperStrength',
'Mask',
'Cape',
'Tie',
'Bald',
'Pointy Ears',
'Smokes']
So the list is being created, later I build the tree like so:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=names)
But the final image still lists the features as X[0], X[1], and so on:
How can I get the actual feature names to show up? (Cape instead of X[3], etc.)

I can only imagine this has to do with passing the names as an array of the values. It works fine if you pass the columns directly:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=df.columns)
If needed, you can also slice the columns:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=df.columns[5:])
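For reference, here is a minimal end-to-end sketch of the working call. It assumes the superhero DataFrame df from the question plus a hypothetical target Series y, and writes the dot source to a file instead of the ddata buffer used above:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Train on the feature columns only; y is a hypothetical target Series aligned with df.
X = df[['Gender', 'SuperStrength', 'Mask', 'Cape', 'Tie', 'Bald', 'Pointy Ears', 'Smokes']]
tree = DecisionTreeClassifier().fit(X, y)

# feature_names must line up, in order, with the columns the tree was trained on;
# the node labels will then read 'Cape', 'Mask', ... instead of X[3], X[2], ...
export_graphviz(tree, out_file='tree.dot', filled=True, rounded=True,
                special_characters=False, impurity=False,
                feature_names=X.columns)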

Related

Open Refine: Exporting nested XML with templating

I have a question regarding the templating option for XML in Open Refine. Is it possible to export data from two columns in a nested XML structure if both columns contain multiple values that need to be split first?
Here's an example to illustrate better what I mean. My columns look like this:
Column1                                                            | Column2
https://d-nb.info/gnd/119119110;https://d-nb.info/gnd/118529889    | Grützner, Eduard von;Elisabeth II., Großbritannien, Königin
https://d-nb.info/gnd/1037554086;https://d-nb.info/gnd/1245873660  | Müller, Jakob;Meier, Anina
Each semicolon-separated value in Column1 has a corresponding value at the same position in Column2, and my desired output would look like this:
<rootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
<recordRootElement>
...
<edm:Agent rdf:about="https://d-nb.info/gnd/1037554086">
<skos:prefLabel xml:lang="zxx">Müller, Jakob</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/1245873660">
<skos:prefLabel xml:lang="zxx">Meier, Anina</skos:prefLabel>
</edm:Agent>
...
</recordRootElement>
</rootElement>
(note: in my initial posting, the position of the root element was not indicated and it looked like this:
<edm:Agent rdf:about="https://d-nb.info/gnd/119119110">
<skos:prefLabel xml:lang="zxx">Grützner, Eduard von</skos:prefLabel>
</edm:Agent>
<edm:Agent rdf:about="https://d-nb.info/gnd/118529889">
<skos:prefLabel xml:lang="zxx">Elisabeth II., Großbritannien, Königin</skos:prefLabel>
</edm:Agent>
)
I managed to split the values separated by ";" for both columns like this:
{{forEach(cells["Column1"].value.split(";"),v,"<edm:Agent rdf:about=\""+v+"\">"+"\n"+"</edm:Agent>")}}
{{forEach(cells["Column2"].value.split(";"),v,"<skos:prefLabel xml:lang=\"zxx\">"+v+"</skos:prefLabel>")}}
but I can't find out how to nest the split skos:prefLabel into the edm:Agent element. Is that even possible? If not, I would work with separate columns or another workaround, but I wanted to check first whether there is a more direct way.
Thank you!
Kristina
I am going to expand on the answer from RolfBly, using the Templating Exporter from OpenRefine.
I make the following assumptions:
There is some other column to the left of Column1 acting as a record-identifying column (see first screenshot).
The columns actually have proper names (URI and Name below).
The columns URI and Name are the only columns with multiple values. Otherwise we might produce empty XML elements with the following recipe.
We will use the information about records available via GREL to determine whether to write a <recordRootElement> or not.
Recipe:
Split first Name and then URI on the separator ";" via "Edit cells" => "Split multi-valued cells".
Go to "Export" => "Templating..."
In the prefix field use the value
<?xml version="1.0" encoding="utf-8"?>
<rootElement>
Please note that I skipped the namespace imports for edm, skos, rdf and xml.
In the row template field use the value:
{{if(row.index - row.record.fromRowIndex == 0, '<recordRootElement>', '')}}
<edm:Agent rdf:about="{{escape(cells['URI'].value, 'xml')}}">
<skos:prefLabel xml:lang="zxx">{{escape(cells['Name'].value, 'xml')}}</skos:prefLabel>
</edm:Agent>
{{if(row.index - row.record.fromRowIndex == row.record.rowCount - 1, '</recordRootElement>', '')}}
The row separator field should just contain a linebreak.
In the suffix field use the value:
</rootElement>
Disclaimer: If you're keen on using only OpenRefine, this won't be the answer you were hoping for. There may be ways in OR that I don't know of. That said, here's how I would do it.
Edit: The trick is to keep URL and literal side by side on one line. b2m's answer does just that: split going from right to left, not from left to right. You can then skip steps 2 and 3 below and get the result in the image.
1. Split each column into 2 columns by separator ";". You'll get 4 columns: 1 and 3 belong together, and 2 and 4 belong together. I'm assuming this will be the case consistently in your data.
2. Export 1 and 3 to a file, and 2 and 4 to another file, in any convenient format, using the custom tabular exporter.
3. Concatenate those two files into one single file using an editor (I use Notepad++), or any other method you may prefer. Several roads lead to Rome here. The result in OR would look something like this.
You then have all sorts of options to put text strings in front, between and after your two columns.
In OR, you could use Transform on the URL column to build your XML using the code below
(note the \n for newline; that's just a line feed, so you may want to use \r\n for carriage return + line feed if you're using Windows).
'<edm:Agent rdf:about="' + value + '">\n<skos:prefLabel xml:lang="zxx">' + cells.Name.value + '</skos:prefLabel>\n</edm:Agent>'
to get your XML in one column, like so
which you can then export using the custom tabular exporter again. Or instead you could use Add column based on this column in a similar manner, if you want to retain your URL column.
You could even do this in the editor without re-importing the file back into OR, but that's beyond the scope of this answer.
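For completeness, here is a minimal Python sketch (outside OpenRefine) of the same pairing logic: split both columns on ";", zip the values positionally, and emit the nested elements. The file name data.csv and the header names Column1/Column2 are assumptions taken from the question; namespace declarations are skipped as in the answers above.
import csv
from xml.sax.saxutils import escape

# Read the two-column table; each cell may hold several ";"-separated values.
with open("data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print("<rootElement>")
for row in rows:
    uris = row["Column1"].split(";")
    names = row["Column2"].split(";")
    print("<recordRootElement>")
    # zip() pairs the n-th URI with the n-th name, which gives the nesting
    # the question asks for.
    for uri, name in zip(uris, names):
        print(f'<edm:Agent rdf:about="{escape(uri)}">')
        print(f'<skos:prefLabel xml:lang="zxx">{escape(name)}</skos:prefLabel>')
        print("</edm:Agent>")
    print("</recordRootElement>")
print("</rootElement>")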

Extract key value pair from JSON string dynamically (Redshift)

I am working with a column which stores the data for camera effects usage in a JSON string.
The values inside it look something like this:
{"camera": {"GIFs": ["floss_dance", "kermit"], "filters": ["blur"], "GIF_count": 2, "Filter_count": 1}}
If I want to extract data for GIFs, I use this code:
json_extract_path_text(camera_effects, 'camera', 'GIFs') which will yield the result as ["floss_dance", "kermit"]
If I want to extract particular GIF names, I use json_extract_array_element_text(camera_effects, 'camera', 'GIFs') which will give me floss_dance and kermit in separate rows.
My issue is that we keep adding new effects to our camera, which means the number of keys under 'camera' will keep increasing, so json_extract_path_text(camera_effects, 'camera', 'name_of_effect') is not dynamic.
Is there a way to extract a list of all key:value pairs that exist for 'camera' and then use the keys as columns and the values as rows?
Note: I am using Redshift SQL

Read specific file names in ADF pipeline

I have a requirement where blob storage has multiple files named file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these I have to read only the files numbered 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Get Metadata output:
Pass the Get Metadata output child items to ForEach activity.
@activity('Get Metadata1').output.childItems
Add If Condition activity inside ForEach and add the true case expression to copy only required files to sink.
@and(greater(int(substring(item().name,5,1)),4),lessOrEquals(int(substring(item().name,5,1)),7))
When the If Condition is True, add copy data activity to copy the current item (file) to sink.
Source:
Sink:
Output:
I took a slightly different approach, using a Filter activity and the endsWith function:
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results; it depends on what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest:
Declare two parameters for the lower and upper limits of the range.
Create a ForEach loop and pass the parameters to build the range [lowerlimit, upperlimit].
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to create a dynamic expression like:
@concat('file_', item(), '.csv')
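Purely to illustrate the filtering logic outside of ADF, here is a small Python sketch of the same range check the expressions above perform; the file names are the ones from the question, and the pipeline itself would of course use the ADF expressions:
# Hypothetical illustration of the range filter applied by the pipeline expressions.
files = [f'file_{i}.csv' for i in range(1, 8)]

lower, upper = 5, 7  # the known range from the question

def in_range(name: str) -> bool:
    # mirrors int(substring(item().name, 5, 1)) compared against the limits
    number = int(name.split('_')[1].split('.')[0])
    return lower <= number <= upper

selected = [name for name in files if in_range(name)]
print(selected)  # ['file_5.csv', 'file_6.csv', 'file_7.csv']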

Django Concat columns with integers and strings

I have run into an issue attempting to concatenate values from two different columns in a query, namely that some of them contain numbers. This means that the built-in Concat does not work, as it requires strings or chars.
Considering how one can cast variables to other data types in SQL, I don't see why I wouldn't be able to do that in Django.
cast(name as varchar(100))
I would assume that one would do it as follows in Django, using the Concat function in combination with Cast:
queryset.annotate(new_col=Concat('existing_text_col', Cast('existing_integer_col', TextField())).get())
The above obviously does not work, so does anyone know how to actually do this?
The use case, if anyone wonders, is sending Jenkins URLs that are saved as fragments as one whole URL. So one URL would be:
base_url: www.something.com/
url_fragment: name/
url_number: 123456
I ended up writing a serializer that inherits from the base serializer containing the URL fragments and a lot of other things. In it I made a MethodField for my complete URL and defined a getter function that loads the different fragments and concatenates them. I also redeclared the fragment fields as None.
The code inside the new serializer is:
complete = serpy.MethodField("get_copmlete")
serverUrl = serpy.Field(attr=None, call=False, required=False)
jobName = serpy.Field(attr=None, call=False, required=False)
buildNumber = serpy.Field(attr=None, call=False, required=False)
def get_complete(self, obj):
return obj.server_url + obj.job_name + '/' + str(obj.build_number)
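For reference, the annotation route from the question can also work. This is a sketch, assuming Django's Cast and Concat database functions, a hypothetical model MyModel, and the URL fields from the use case above; giving Concat an explicit output_field avoids the mixed-type complaint:
from django.db.models import TextField
from django.db.models.functions import Cast, Concat

# MyModel is hypothetical; base_url and url_fragment are text fields,
# url_number is an integer field, as in the use case above.
annotated = MyModel.objects.annotate(
    complete_url=Concat(
        'base_url',
        'url_fragment',
        Cast('url_number', output_field=TextField()),
        output_field=TextField(),
    )
)
print(annotated.first().complete_url)  # e.g. www.something.com/name/123456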

How to get name of streams in MinibatchSource?

How can I get the names of each of the streams in a MinibatchSource?
Can I get the names associated with the stream information returned by stream_infos()?
minibatch_source.stream_infos()
I also have a follow-up-question:
The result from:
print(reader_train.streams.keys())
is
dict_keys(['labels', 'features'])
How do these names relate to the construction of the MinibatchSource, which is done like this?
return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
    features = StreamDef(field='image', transforms=transforms),  # first column in map file is referred to as 'image'
    labels = StreamDef(field='label', shape=num_classes)         # and second as 'label'
)))
I would have thought that my streams would be named ‘image’ and ‘label’, but they were named ‘labels’ and ‘features’.
I guess those names are somehow default names?
For your original question:
minibatch_source.streams.keys()
See for example this tutorial under the section "A Brief Look at Data and Data Reading".
For your follow-up question: the names returned by keys() are the argument names you used in StreamDefs(). This is all you need in your program. If you define your MinibatchSource like this:
return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
    image = StreamDef(field='image', transforms=transforms),  # first column in map file is referred to as 'image'
    label = StreamDef(field='label', shape=num_classes)       # and second as 'label'
)))
then the names will match. You can choose any names you want but the value of the field inside StreamDef() should match the source (which depends on your input data and the Deserializer you are using).
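A short sketch of this, with hypothetical map_file, num_classes, and transforms values; the printed names are simply the keyword arguments given to StreamDefs():
from cntk.io import MinibatchSource, ImageDeserializer, StreamDef, StreamDefs
import cntk.io.transforms as xforms

# Hypothetical inputs: a CNTK image map file and the number of label classes.
map_file = 'train_map.txt'
num_classes = 10
transforms = [xforms.scale(width=32, height=32, channels=3)]

reader = MinibatchSource(ImageDeserializer(map_file, StreamDefs(
    image = StreamDef(field='image', transforms=transforms),
    label = StreamDef(field='label', shape=num_classes)
)))

print(reader.streams.keys())        # dict_keys(['image', 'label'])
image_stream = reader.streams.image  # stream information for the 'image' stream, looked up by that name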