Domain Names to Webpage Titles in OpenRefine - openrefine

I have a column in Excel of domain names (like stackoverflow.com) and would like to create a corresponding column with the title of the domains (like "Stack Overflow").
I uploaded the Excel file into OpenRefine. I believe the best way to do this would be to call the "Add column by fetching URLs on column" function. But I don't know what expression to use.

The way I do it is as follows:
(1) Have visitable URLs in the source column. I.e., http://stackoverflow.com instead of just the domain name.
(2) Apply "Add column by fetching URLs..." as you said. (If you're hitting pages on the same domain over and over, make sure you set a reasonable delay.)
(3) Using this first new column, create a second new column based on newCol1 by parsing the HTML that's returned:
value.parseHtml().select("title")[0].toString()
Notes:
(a) You need the toString() else you'll see blank values in the new column after you apply the function.
(b) You don't have to create a second new column; you could just apply a transform using the same formula as above.
(c) I've also tried using a split:
value.split("<title>")[1].split("</title>")[0]
I don't have my results handy at the moment, but I believe that also worked.
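For readers who want to check the parsing step outside OpenRefine, here is a minimal Python sketch of the same idea: pull the text of the first title element out of fetched HTML. It is not GREL, and the names TitleParser and extract_title, as well as the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical page content standing in for a fetched URL.
sample = "<html><head><title>Stack Overflow</title></head><body></body></html>"
print(extract_title(sample))
```

A real parser like this is more forgiving than splitting on the literal tag text, for example when the title tag carries attributes or different casing.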

Related

Creating a feature class in ArcGIS 10

I am trying to create a feature class from another feature class, as shown below:
arcpy.CreateFeatureclass_management(path, name, "POLYGON")
In ArcGIS, this creates the fields Shape, Shape_Length, and Shape_Area. I then added additional fields to the new feature class.
cursor = arcpy.da.SearchCursor("old featureclass", ["SHAPE@", "*"])
insert = arcpy.da.InsertCursor("new featureclass", ["*"])
for i in cursor:
    insert.insertRow(i)
I am getting an error:
Sequence size must match size of the row
This is because the new feature class has the additional fields I mentioned above. Then I tried
for i in cursor:
    append newly array with (ShapeLength, and ShapeArea)
    insert.insertRow(newlyarray)
It worked fine, but Shape_Area and Shape_Length are returning zero. I've also tried Calculate Field for the area, and that didn't work either.
Can someone please help me with this issue? The geometry is a polygon, but the shape area and shape length won't populate based on the pre-existing shape.
I think what you want to do is:
Get the fields list from the old shapefile,
Add the desired fields to the new file,
then iterate over each field (or just the ones you want to copy) of each row from the old file and copy the value into its corresponding field in the new file.
Then you can add your brand-new fields whenever you like, and this copying function won't be affected as long as it refers to fields by name.
Another method would be to literally copy the old shapefile using the copy tool, then edit the copied file.
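The copy-by-name idea can be sketched in plain Python, independent of arcpy. The helper remap_row and the field names NAME and OWNER are hypothetical. With arcpy, the two field lists would be the ones passed to arcpy.da.SearchCursor and arcpy.da.InsertCursor. Shape_Length and Shape_Area are deliberately left out of the lists, on the assumption that the geodatabase maintains them from the geometry.

```python
def remap_row(row, src_fields, dst_fields, defaults=None):
    """Reorder one source row to match dst_fields, matching fields by name.

    Fields that exist only in the destination get a value from `defaults`
    (or None), so extra fields no longer break the row size.
    """
    defaults = defaults or {}
    by_name = dict(zip(src_fields, row))
    return tuple(by_name.get(name, defaults.get(name)) for name in dst_fields)


# Hypothetical field lists: the new feature class has one extra field, OWNER.
src_fields = ["SHAPE@", "NAME"]
dst_fields = ["SHAPE@", "NAME", "OWNER"]

# Placeholder string standing in for a geometry object read from a cursor.
row = ("<polygon geometry>", "Parcel 1")
print(remap_row(row, src_fields, dst_fields, {"OWNER": "unknown"}))
```

Each remapped tuple has exactly as many values as the destination field list, which is what the "Sequence size must match size of the row" error is complaining about.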

SSRS - How to show external image based on URL inside column

I am trying to show images for products inside a basic report. The image needs to be dynamic, meaning the image should change based on the SKU value.
Right now I am inserting an image into a table, setting it to external, and I've tried:
=Fields!URL.Value
=http://externalwebservername/sku= & Fields!SKU.Value
="http://externalwebservername/sku=" & Fields!SKU.Value
I do not get any images in my table.
My stored proc has all the data, including a URL with the image I want to show. Here is a sample of what the URL looks like:
http://externalwebservername/sku=123456
If I enter the URL in the field without "=" it will show that ONE image only.
How should I set up the expression to properly show the external image based on a dynamic URL? Running SQL 2016
Alan's answer should work, but in our environment we have strict proxy/firewall rules, so the two servers could not contact each other.
Instead we are navigating to the file stored on our storage system.
We altered the URL column to point to file path in the stored procedure. Insert image, set Source to External and Value set to [URL].
URL= file://server\imagepath.jpg
As long as the account executing the report has permissions to access the URLs then your 3rd expression should have worked.
I put together a simple example as follows.
I created a new blank report then added a Data Source. It doesn't matter where this points, we won't use it directly.
Then I created a dataset (Dataset1) with the following SQL to give me list of image names.
SELECT '350x120' AS suffix
UNION SELECT '200x100'
UNION SELECT '500x500'
Actually, these are just parameters for the website http://placehold.it/ which will generate images based on the size you request, but that's not relevant for this exercise.
We'll be showing three images from the following URLs
http://placehold.it/350x120
http://placehold.it/200x100
http://placehold.it/500x500
Next, create a table, I used 3 columns to give me more testing options. Set the DataSetName to DataSet1 if it isn't already.
In the first column the expression is just =Fields!suffix.Value
In the second column I added an image, set its source property to External and the Value to ="http://placehold.it/" & Fields!suffix.Value
I then added a 3rd column with the same expression as the image Value so I could see what was being used as the image URL. I also added an action that goes to the same URL, just to check the URL did not have any unprintable characters in it that might cause a problem.
The basic report design looks like this.
The rendered result looks like this.

@DbLookup and formatting on web

I have been developing a web application using Domino, in which I use @DbLookup to look up a field from the Notes client. This works fine, but the formatting of the value is lost on the web.
For example, in the Lotus Notes client the field value is formatted as below:
I am one, I am two, I am one , I am two, labbblallalalalalalalalalalalalalalalalalalaallllal
Labbbaalalalallalalalalalaalallaal
Hello there, labblalalallalalalllaalalalalalalalalalalalalalalalalalalalalalalala
Now when I retrieve the value of the field on the web, the second value follows immediately after the first, and so forth. I was expecting a line feed between them, which is not happening.
The field above is a multi-valued field. Also, on the web I have used computed text which does the db lookup from the Notes client.
Please help me with what else I could try, or an alternate solution for this case.
Thanks
HD
Your multi-valued field has display options associated with it and the Notes client honors those. Obviously, your options are set up to display entries separated by newlines.
The computed text that you are using for the web does not have options like that, and the field options are irrelevant because you aren't displaying the field. Your code has to insert the @NewLines. That's pretty easy because @DbLookup returns a list, and if you concatenate a list and a scalar, the scalar will be appended to each element of the list. (Look at the third example under "concatenation, pairwise" here to see what I mean.)
The way you've worded your question is a little unclear to me, but what you need in your computed text formula is either something like this:
list := @DbLookup(etc., etc.);
list + @NewLine;
Or something like this:
multiValueFieldContainingListWithDbLookupResult + @NewLine;
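To make the list-plus-scalar behaviour concrete, here is a rough Python analogue. It is not formula language; append_to_each is a hypothetical helper and the values are made-up stand-ins for the lookup result.

```python
def append_to_each(values, suffix):
    """Mimic list + scalar concatenation: the scalar is appended to every element."""
    return [value + suffix for value in values]


# Made-up stand-in for the list returned by the lookup.
lookup_result = ["I am one", "I am two"]

# Each element now ends with a newline, so the entries render on separate lines.
print("".join(append_to_each(lookup_result, "\n")), end="")
```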
I used @Implode(Dblookupreturnedvalue;"");
thanks All :)

nested field in Solr 5.2

I'm new to Solr and I have a very specific problem that I need to solve:
I have a csv file that contains my Solr document. Now, I do have a column (field) that's not only multiValued, but also contains 'subfields'
for example
"id":"0101",
"addMaterials":[{"name":"Mat1", "property":"prop1"},
{"name":"Mat2","property":"prop2"},
{"name":"Mat3","property":"prop3"}],
"mainProperty":"mainproperty1",
"URL":"http://www.mySite..."
where id, addMaterials, mainProperty, and URL are my main fields while 'name' and 'property' are my subfields. I know that Solr is designed to handle denormalized documents but denormalizing is not a possible solution for my application.
What I'm thinking is to just separate my data set and move the fields (that have subfields) to another document and somehow make a new field to link it to the original document (e.g. fromIdField).
Is there any other solution to do this? My minimum goal is to index the values of addMaterials field (even without indexing the subfields)
from:
"addMaterials":[{"name":"Mat1", "property":"prop1"},
{"name":"Mat2","property":"prop2"},
{"name":"Mat3","property":"prop3"}],
to
"addMaterials":{"name":"Mat1", "property":"prop1"}
"addMaterials":{"name":"Mat2", "property":"prop2"}
"addMaterials":{"name":"Mat3", "property":"prop3"}
Thanks in advance.
I have found a solution to my problem. Instead of separating my data set, I kept the addMaterials field as a multiValued field and ignored the subfields, so I only have one multiValued field to be indexed. What I did was to use the update/csv request handler of Solr to index my csv file and put },{ as the separator for my addMaterials multiValued field. The indexed document looks like this:
"addMaterials": ["[{\"name\":\"Mat1\", \"property\":\"prop1\"",
"\"name\":\"Mat2\", \"property\":\"prop2\"",
"\"name\":\"Mat3\", \"property\":\"prop3\"}]"]
I indexed my document using this:
curl "http://localhost:8983/solr/<coreName>/update/csv?
stream.file=C:/userName/Solr/solr-5.2.0/documentFolder/myFile.csv&
f.addMaterials.split=true&
f.addMaterials.separator=\},\{&
stream.contentType=text/plain;charset=utf-8"
Also, this assumes that the addMaterials field is a multiValued field, so make sure you modify your schema before indexing your document using the procedure above. Otherwise, Solr will give an error saying that the field is not a multiValued field.
Of course, if you need to query against the sub-fields then I guess you can use the !join command/function of Solr.
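The effect of the },{ separator on the addMaterials value can be previewed with a short Python sketch. This only imitates the split on a sample string modelled on the example above; it does not call Solr, and the variable names are illustrative.

```python
# Sample cell value for the addMaterials column, modelled on the example above.
raw = (
    '[{"name":"Mat1", "property":"prop1"},'
    '{"name":"Mat2","property":"prop2"},'
    '{"name":"Mat3","property":"prop3"}]'
)

# Splitting on the literal text "},{" yields one value per material.
# The outer "[{" and "}]" stay attached to the first and last values.
values = raw.split("},{")

for value in values:
    print(value)
```

Each resulting string becomes one value of the multiValued field, which is why the first and last indexed values still carry the leading [{ and trailing }].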

Remove sub-string from data in sql table column

I have a table that has a bunch of URLs within a certain column. We no longer want a certain URL within the table, and instead of manually updating each data record I was curious whether there is a way to remove just a certain type of URL through an update query.
For instance, a data record with the following URLs exists:
Presentation (PowerPoint File)<br> Presentation (Webcast)
and I want to remove the smil url so the data only shows:
Presentation (PowerPoint File)<br>
I want to remove the entire "smil" url from this string (from ), and every other smil url from the other records (the other records are similar with a different smil file name). Some of the records could have more than two urls, BUT the "smil" url is always the last one.
Preserving some of the comment history so future readers understand the decision points before implementing the solution
Does it always follow the pattern of text<br>text?
There are a few times where there are two urls and they exclude the <br>, and then there are a few times where it is just the smil url within the data.
You haven't clearly defined what a "smil" url is. Is it one with smil in it anywhere? With the file suffix being .smil? With /smil/ in the path? Some combination of these?
The problem you're going to have is that to properly solve this, you'll need to be able to have some insight into the html fragments. That's usually a .NET thing; the string matching in TSQL is likely to be insufficient for your needs. You could try taking multiple passes at it. If it follows the text<br>text pattern, you could left(myCol, charindex('<br>', myCol)) where myCol like '%smil%' and keep taking passes at it until you've found all the patterns.
@billinkc: I see where you are going. I was wondering whether it would be possible to remove everything from the start of <a href="xxx, since those "smil" links all start with that character string.
And there'd never be the case of streaming<br>foo? If so, then yeah, search for the <a href="http: using charindex/patindex (can never remember which) and then slice it out with left/substring.
@billinkc: Yup, that will always be the case. The "streaming" url is ALWAYS last. OK, this was easier than I thought, I just needed some outside eyes. Thank you.
Given that we know we don't have to worry about anything useful existing after the smil url and that the url will always be an external link, we can safely use a left/substring approach like
DECLARE @Source table
(
    SourceUrl varchar(200)
)
INSERT INTO @Source
    (SourceUrl)
VALUES
    ('Presentation (PowerPoint File)<br> Presentation (Webcast)');
-- INSPECT THIS, IF APPROPRIATE THEN
SELECT
    S.SourceUrl AS Before
,   CHARINDEX('<a href="http://', S.SourceUrl) AS WhereFound
,   LEFT(S.SourceUrl, CHARINDEX('<a href="http://', S.SourceUrl) -1) AS After
FROM
    @Source AS S
WHERE
    S.SourceUrl LIKE '%smil%';
-- Only run this if you like the results of the above
UPDATE
    S
SET
    SourceUrl = LEFT(S.SourceUrl, CHARINDEX('<a href="http://', S.SourceUrl) -1)
FROM
    @Source AS S
WHERE
    S.SourceUrl LIKE '%smil%';
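The same left/substring logic can be tried out on a few strings before touching the table. Below is a minimal Python sketch of it; the function name strip_trailing_link and the sample value (including the example.com address) are made up for illustration and are not taken from the real data.

```python
MARKER = '<a href="http://'


def strip_trailing_link(value):
    """Drop everything from the first external anchor onward, smil rows only."""
    if "smil" not in value:
        return value  # mirrors the WHERE ... LIKE '%smil%' filter
    position = value.find(MARKER)
    if position == -1:
        return value  # no external anchor found, leave the row untouched
    return value[:position]


# Hypothetical row value; the real records hold different file names.
sample = (
    'Presentation (PowerPoint File)<br> '
    '<a href="http://example.com/talk.smil">Presentation (Webcast)</a>'
)
print(repr(strip_trailing_link(sample)))
```

The sketch leaves a value alone when the marker is missing. In the T-SQL version, a row that contains "smil" but no matching anchor would make CHARINDEX return 0 and LEFT receive a length of -1, so a similar check on the CHARINDEX result may be worth adding to the WHERE clause.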