I have created a semantic text search engine. However, I cannot find a labeled dataset with which I can evaluate the information retrieval performance of my system.
Is there any publicly available labeled text collection? I need it to evaluate the retrieval results (recall, precision, F1 score, ...).
Thanks.
I do research in this direction. In all my research, I have used the AOL dataset, which consists of ~20M web queries collected from ~650k users over three months (March 1, 2006 to May 31, 2006). The data is sorted by anonymous user ID and arranged sequentially.
The dataset includes {AnonID, Query, QueryTime, ItemRank, ClickURL}. More details can be found in the link mentioned above. I am interested in how you implemented your engine and, if possible, in seeing its code. I am also curious about its performance on the AOL dataset.
You can find the dataset in my git repository. Thanks!
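Once you have the files, loading one takes a couple of lines; a minimal sketch with pandas, assuming the standard tab-separated layout (the file name below is a placeholder):

import pandas as pd

# Placeholder file name; the collection ships as several tab-separated files
# with a header row of AnonID, Query, QueryTime, ItemRank, ClickURL.
df = pd.read_csv("aol-queries-01.txt", sep="\t", parse_dates=["QueryTime"])
print(df.head())
print(df["AnonID"].nunique(), "users and", len(df), "queries loaded")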
MarkLogic 9.0.8.2
We have around 20M records in MarkLogic.
For one of our business requirements, we need to generate additional data for each XML document, which end users will then search.
Since we can't change the original documents, we need input on the best way to manage this additional data. These are the options we have considered:
Create a separate collection and store the additional data in a separate XML document with the same unique number as the original XML. When a user searches, we query this collection, retrieve the corresponding original documents, and send them back in the response.
Store the additional data in the original document's properties.
We also need to create an element range index so that searches still work when end users supply range operators.
<abc>
  <xyz>
    <quan>qty1</quan>
    <value1>1.01325E+05</value1>
    <unit>Pa</unit>
  </xyz>
  <xyz>
    <quan>qty2</quan>
    <value1>9.73E+02</value1>
    <value2>1.373E+03</value2>
    <unit>K</unit>
  </xyz>
  <xyz>
    <quan>qty3</quan>
    <value1>1.8E+03</value1>
    <unit>s</unit>
  </xyz>
  <xyz>
    <quan>qty4</quan>
    <value1>3.6E+03</value1>
    <unit>s</unit>
  </xyz>
</abc>
We need to process the data from the value1 element. Users will then search for something like:
qty1 >= minvalue AND qty1 <= maxvalue
qty2 >= minvalue AND qty2 <= maxvalue
qty3 >= minvalue AND qty3 <= maxvalue
So when a user searches on qty1, the query should only match the value1 from the xyz element whose quan is qty1, and so on.
So we would like to know:
What is the best approach to store data like this?
What kind of index should we create to implement this?
I would recommend wrapping the original data in an envelope, which allows adding extra data in a header. It also allows creating a canonical view of the relevant pieces of the data: either store that as the instance and the original as an 'attachment' (a sub-property, not an attached binary), or keep the instance as-is and put the canonical values for indexing in the header.
There is a lengthy blog article on the topic that discusses the pros and cons in detail: https://www.marklogic.com/blog/envelope-design-pattern/
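For illustration, a wrapped document might look something like the sketch below; the header element names are invented for this example, not a prescribed schema:

<envelope>
  <header>
    <!-- canonical values generated from the original data, one element per quantity -->
    <qty1-value>1.01325E+05</qty1-value>
    <qty2-min>9.73E+02</qty2-min>
    <qty2-max>1.373E+03</qty2-max>
    <qty3-value>1.8E+03</qty3-value>
    <qty4-value>3.6E+03</qty4-value>
  </header>
  <instance>
    <!-- the original <abc> document, stored unchanged -->
  </instance>
</envelope>

With the canonical values lifted into the header, you can put element range indexes (of type double, given values like 1.01325E+05) on those header elements and run range queries against them directly, which avoids the problem that value1 means something different depending on the sibling quan value.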
HTH!
Grtjn's answer would be the recommended solution, as it is more performant to keep all of the information inside the document itself rather than having to query across both the document and its properties, but it does require changing the document.
Options 1 and 2 could both work.
Properties documents already exist, so using them doesn't add fragments, but the properties must conform to the schema.
Creating a sidecar document provides more flexibility, but because you are creating new documents, it will increase the number of fragments.
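If you do go the sidecar route (option 1), the companion document can stay very small. A rough sketch, with invented element names, keyed on the same unique number as the original and stored in its own collection:

<abc-search-data>
  <!-- same unique number as the original document, used to fetch it after the search -->
  <doc-id>12345</doc-id>
  <qty1-value>1.01325E+05</qty1-value>
  <qty2-min>9.73E+02</qty2-min>
  <qty2-max>1.373E+03</qty2-max>
</abc-search-data>

The element range indexes would then go on these sidecar elements, and the search step resolves doc-id back to the original document before building the response.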
How do I link my activities variable to only the corresponding KPIs variable?
Using guidance from a number of sources, but primarily the genius of Jeffery Shafer articulated through the SuperDataScience video, I built a Sankey diagram for my work. For the most part it works; however, I have been trying to figure out how to adjust my Sankey diagram model to line up each activity with ONLY the corresponding KPIs, and am having no luck.
The data structure looks like this:
You'll note I changed the binary values to "" and 2 instead of 0 and 1, as it makes the visual calculations easier. For the "Viz" variable, I have "Activity" for the raw data set; I then copy/paste/replicate the data to mirror it (required for the model), but with "KPI" for the mirrored rows.
In the following image, you'll see my main issue is that the smallest represented activity still shows as corresponding to all KPIs when in fact it does not. I want each activity to line up only with its corresponding KPIs, as some activities don't correspond with all, or even any, KPIs.
Finally, here is the model very similar to what the above video link shows:
Can someone help provide insight into how I can adjust the model to fit activities linking only to corresponding KPIs? I appreciate any insight. Thanks!
I have a solution to the issue, thanks to a helpful Tableau support member named Anthony. It was in the data structure. The data was structured so that every "Activities" value was associated with every "KPI" value, rather than only with its corresponding "KPI" values as Tableau requires. To achieve the desired result, the data needs to be restructured to contain a row only for each valid "Activities" and "KPI" combination. See the visual below, where the invalid rows are removed:
[screenshots: the data table before and after removing the non-corresponding Activity/KPI rows]
Once the table is restructured, the model produces the desired visual result. It works like a charm!
Good luck out there!
I've successfully migrated thousands of news items and other content from Sitefinity 5 to WordPress after hours of excruciating analysis and sheer luck with guessing, but a few items are still left over: specifically, the pages. I know a lot of the content is stored in very obscure ways, but there has to be somebody who has done this before and can steer me in the right direction.
My research (and text-search against the DB) has found the page titles etc., but when I search for the content I get nothing. My gut tells me that the content is being stored in binary form; can anyone confirm whether this is the case?
Sitefinity documentation is only helpful if you're a .net developer who has a site set up in Visual Studio (as far as I've seen).
This is probably the most obfuscated manner of storing content that I've ever encountered. After performing text searches against the database I've finally found where the content is stored but it's not a simple process to get it out.
The pages' master record appears to be sf_page_node; these are the related tables:
sf_object_data (page_id is related to sf_page_node.content_id)
sf_draft_pages (page_id is related to sf_page_node.content_id)
sf_page_data (content_id is related to sf_page_node.content_id)
sf_control_properties (control_id is related to sf_object_data.id)
So you could get the info you need with a query like this:
select *
from [sf_page_node]
join sf_object_data on sf_page_node.content_id = sf_object_data.page_id
join sf_control_properties on sf_object_data.id = sf_control_properties.control_id
Other things to consider:
the parent_id field is related to the sf_page_node table, so if you're writing a script, be sure to query this as well
the page may have a banner image: you will pick up a "place_holder" value of 'BannerHolder' with a caption of "Image". The image itself may be stored as a blob in sf_media_content, so you should handle it separately. The "nme" value of 'ImageId' will have a GUID in the "val" column; you can query sf_media_content with this value as "content_id". The actual binary data is stored in sf_chunks, which relates on "file_id".
My revised query, taking into account what I'll need to migrate the content, is below:
select
original.content_id,
original.url_name_,
original.title_,
parent.id,
parent.url_name_,
parent.title_,
place_holder,
sf_object_data.caption_,
sf_control_properties.nme,
val
from [sf_page_node] original
join sf_object_data on original.content_id = sf_object_data.page_id
join sf_control_properties on sf_object_data.id = sf_control_properties.control_id
join sf_page_node parent on original.parent_id = parent.id
I hope this helps someone!
You don't need the version items in this case; as you already found out, they store the previous versions of the pages in binary format.
The current live pages' data is available in the sf_control_properties and sf_object_data tables. Join these together with sf_page_data and sf_page_node and you will get the full picture.
Depending on your requirements, it may be easier to do a GET request to each page and parse the returned html response.
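If you go that route, here is a minimal sketch in Python; the URL list and the content selector are assumptions you would replace with your own:

import requests
from bs4 import BeautifulSoup

# Hypothetical list of live page URLs, e.g. built from sf_page_node.url_name_
urls = ["https://example.com/about-us", "https://example.com/contact"]

for url in urls:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # The selector is a placeholder; inspect your theme to find the element
    # that wraps the page's main content.
    main = soup.select_one("div.sfContentBlock") or soup.body
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    print(title)
    print(main.get_text(separator="\n", strip=True))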
I want to combine the functionality of the Kibana Terms graph (creating buckets based on the unique values of a particular attribute) and the Histogram graph (separating data into buckets based on queries and then plotting the data over time).
Overall, I want to create a Histogram, but I only want to create the Histogram based on the results of one query, not multiple queries like it's being done in the Kibana demo app. Instead, I want each bucket to be dynamically created per unique value of my particular field. For example, consider the following data returned by my query:
{"myValueType": "New York"}
{"myValueType": "New York"}
{"myValueType": "New York"}
{"myValueType": "San Francisco"}
{"myValueType": "San Francisco"}
Also assume that each record has a timestamp field for separating histogram data by date. For that particular date, I want the data to be communicated as a count of 3 into the New York bucket and a count of 2 into the San Francisco bucket. However, I am only able to show a count of 5 for my one linked query. When I configure the Histogram, I am able to specify a field to use for my timestamp, but not to create buckets from. I could've sent a field to compute a total/min/max/mean, but this field would've had to be numeric, so that is not the solution either.
If I were to use a Term Graph to create a pie or bar graph, I am indeed able to separate my data into buckets based on the unique values of my specified field (in this case, "myValueType"), but this would total up the data for all-time, not split up the data by timestamp. Although this is good information to know, it is not ideal because I wouldn't be able to detect trends in my data.
I am looking for a solution that will do one of the following:
Let me dynamically create queries in my Kibana dashboard to create "buckets" in a Histogram
Allow me to run an Elasticsearch terms aggregation to split up my data into buckets based on "myValueType" and integrate those results into my Histogram
Customize the JSON of my dashboard, but this doesn't look possible to me
Create my own custom panel, but this is not desirable
Link a Kibana "topN" query in Kibana. Actually, this has proven to be a workaround for my problem, because the topN query dynamically creates one query per unique value/term of the specified field. However, the problem is that I can only assign one colour to this topN query, and each unique term is placed in a bucket that uses a different shade of that colour. Ideally, every bucket in my Histogram would have a completely different colour associated with it. Imagine how difficult it will be to distinguish unique terms as the number of buckets grows.
If all else fails, I make one query per unique value from my search field. This will allow me to have one unique colour per bucket, but as the number of unique terms in the "myValueType" field changes, I need to keep adding/removing queries from Kibana, which can get quite messy.
I'm sure there is something that I am missing here. Please help me out. Many thanks.
A highly related SOF question: Is it Possible to Use Histogram Facet or Its Curl Response in Kibana
This would be a great feature. It looks like it will be supported in Kibana4, but there doesn't seem to be much more info out there than that.
For reference: https://github.com/elasticsearch/kibana/issues/1249
Maybe a little late, but it is actually possible in the newest beta release.
kibana 4 beta 3 installation download
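Under the hood this boils down to an Elasticsearch terms aggregation with a nested date histogram. Here is a rough sketch of running the equivalent query directly with Python's requests; the host, index name, field names, and interval syntax are assumptions that depend on your Elasticsearch version:

import json
import requests

# One bucket per unique myValueType, each split into a daily date histogram.
body = {
    "size": 0,
    "aggs": {
        "by_value_type": {
            "terms": {"field": "myValueType"},
            "aggs": {
                "over_time": {
                    "date_histogram": {"field": "@timestamp", "interval": "1d"}
                }
            }
        }
    }
}

resp = requests.post("http://localhost:9200/logstash-*/_search",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(body))
for bucket in resp.json()["aggregations"]["by_value_type"]["buckets"]:
    print(bucket["key"])
    for day in bucket["over_time"]["buckets"]:
        print("  ", day["key_as_string"], day["doc_count"])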
There are some articles that are written in several parts; for example, I got these articles from IBM developerWorks:
Distributed data processing with Hadoop, Part 1: Getting started
Distributed data processing with Hadoop, Part 2: Going further
Distributed data processing with Hadoop, Part 3: Application development
I will index those three articles separately. When someone searches for certain keywords, it is possible that Part 3 is at the top of the hit list while Part 1 is 32nd. Therefore, if I list results page by page, Part 1 and Part 3 will be displayed on different pages.
How can I make sure that hits from the same series are displayed together?
I guess in SQL, we can use "group by".
I believe what you are asking for is Field Collapsing, which is currently a trunk feature in Solr, and will be incorporated into the next Solr version.
If you want to roll your own, one possible way to do it is:
Add a "series id" field to each document that is a member of a series. You will have to ensure that this gets incremented for every new series.
Make an initial query to Lucene, and get a hit list.
For each hit, check to see if it has a series id; if it does, make another query by the series id in order to retrieve all the members of the series.
An alternative is to store the ids of all the series members in a field inside each member's document.
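To make the two-pass idea concrete, here is a rough, engine-agnostic sketch in Python; the in-memory document list and the naive search() function are stand-ins for your Lucene index and query code:

# Toy in-memory "index": in practice these would be Lucene documents,
# and search() would be an IndexSearcher query. Everything here is illustrative.
DOCS = [
    {"id": 1, "title": "Hadoop, Part 1: Getting started", "series_id": "hadoop-dw"},
    {"id": 2, "title": "Hadoop, Part 2: Going further", "series_id": "hadoop-dw"},
    {"id": 3, "title": "Hadoop, Part 3: Application development", "series_id": "hadoop-dw"},
    {"id": 4, "title": "Intro to MapReduce", "series_id": None},
]

def search(keyword):
    """Stand-in for the real index query: naive substring match on the title."""
    return [d for d in DOCS if keyword.lower() in d["title"].lower()]

def search_grouped(keyword):
    """Initial query, then expand each hit with a series_id into its full series."""
    results, seen_series = [], set()
    for hit in search(keyword):
        sid = hit.get("series_id")
        if sid is None:
            results.append([hit])                      # standalone document
        elif sid not in seen_series:
            seen_series.add(sid)
            # Second query: fetch every member of the series so they stay together.
            results.append([d for d in DOCS if d["series_id"] == sid])
    return results

for group in search_grouped("hadoop"):
    print([d["title"] for d in group])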