Is there a way to index a doc to Elasticsearch with a specific _id field? - api

I'm looking to simulate a state where I have a specific _id field inside an index.
Let's assume I want to take the EXACT same log from index1 in my example and index it into index2.
Like so:
This is my index1
{
  _index: "index-number-one",
  _type: "doc",
  _id: "S0meSpec!f!cID",
  _score: 1,
  _source: {
    message: "message1",
    type: "type1",
    tags: [
      "_bla"
    ],
    number: 3
  }
}
Now I want that exact same log in my index2
{
  _index: "index-number-two",
  _type: "doc",
  _id: "S0meSpec!f!cID",
  _score: 1,
  _source: {
    message: "message1",
    type: "type1",
    tags: [
      "_bla"
    ],
    number: 3
  }
}
I couldn't find an API in Elasticsearch that can insert a doc into an index with a specific _id field.
If this action isn't possible, so that the Elasticsearch cluster won't end up with duplicate _id values, I can imagine it's because they want to keep the ability to search for a doc by its _id field, which needs to be unique. In that case, assume that I don't mind deleting the entire doc from index1 (maybe saving it aside as some variable in my code), but in the end I need the doc in index2 to have the EXACT same _id that index1 once had.
And if there's a way to edit an existing _id field, that would also solve my problem.
Can anyone please shed any light on how to achieve that goal?

Answering my own question:
I found that it can be done with a POST request to the index, like so:
POST test-index-1234/abctype/Som3Cust0mID
{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}
And the outcome in ES:
{
  _index: "test-index-1234",
  _type: "abctype",
  _id: "Som3Cust0mID",
  _score: 1,
  _source: {
    user: "kimchy",
    post_date: "2009-11-15T14:12:12",
    message: "trying out Elasticsearch"
  }
}
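As a side note, the same request should also work as a PUT, and if you want the call to fail rather than silently overwrite an existing document with that ID, the index API accepts op_type=create. A sketch, reusing the hypothetical index, type, and ID from above:
PUT test-index-1234/abctype/Som3Cust0mID?op_type=create
{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}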

It is definitely possible to do this. IDs are unique per index, not per cluster.
Check out the Reindex API: it copies one index into another and keeps the document IDs.
It is also possible to change the ID using a script inside the reindex call.
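For example, a minimal reindex call using the index names from the question (an untested sketch):
POST _reindex
{
  "source": { "index": "index-number-one" },
  "dest": { "index": "index-number-two" }
}
Document _ids are preserved by default. To rewrite them on the way over, a script can set the metadata field; the new ID value here is made up for illustration:
POST _reindex
{
  "source": { "index": "index-number-one" },
  "dest": { "index": "index-number-two" },
  "script": {
    "source": "ctx._id = 'S0meOtherID'"
  }
}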

Related

FaunaDB: how to fetch a custom column

I'm just learning FaunaDB and FQL and having some trouble (mainly because I come from MySQL). I can successfully query a table (e.g. users) and fetch a specific user. This user has a property users.expiry_date, which is a FaunaDB Time() type.
What I would like to do is know if this date has expired by using the function LT(Now(), users.expiry_date), but I don't know how to create this query. Do I have to create an Index first?
So in short, just fetching one of the users documents gets me this:
{
  id: 1,
  username: 'test',
  expiry_date: Time("2022-01-10T16:01:47.394Z")
}
But I would like to get this:
{
  id: 1,
  username: 'test',
  expiry_date: Time("2022-01-10T16:01:47.394Z"),
  has_expired: true
}
I have this FQL query now (ignore oauthInfo):
Query(
  Let(
    {
      oauthInfo: Select(['data'], Get(Ref(Collection('user_oauth_info'), refId))),
      user: Select(['data'], Get(Select(['user_id'], Var('oauthInfo'))))
    },
    Merge({ oauthInfo: Var('oauthInfo') }, { user: Var('user') })
  )
)
How would I do the equivalent of the MySQL query SELECT users.*, IF(users.expiry_date < NOW(), 1, 0) AS is_expired FROM users in FQL?
Your use of Let and Merge shows that you are thinking about FQL in a good way. These are functions that can go a long way toward making your queries more organized and readable!
I will start with some notes, but they will be relevant to the final answer, so please stick with me.
The Query function
https://docs.fauna.com/fauna/current/api/fql/functions/query
First, you should not need to wrap anything in the Query function, here. Query is necessary for defining functions in FQL that will be run later, for example, in the User-Defined Function body. You will always see it as Query(Lambda(...)).
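For instance, Query(Lambda(...)) appears when creating a UDF. A minimal sketch (the function name and body are made up for illustration):
CreateFunction({
  name: "double",
  body: Query(Lambda("x", Add(Var("x"), Var("x"))))
})
// later: Call(Function("double"), 2) => 4
A plain read query like yours, by contrast, is sent to the client directly and needs no Query wrapper.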
Fauna IDs
https://docs.fauna.com/fauna/current/learn/understanding/documents
Remember that Fauna assigns unique IDs for every Document for you. When I see fields named id, that is a bit of a red flag, so I want to highlight that. There are plenty of reasons that you might store some business-ID in a Document, but be sure that you need it.
Getting an ID
A Document in Fauna is shaped like:
{
  ref: Ref(Collection("users"), "101"), // <-- "id" is 101
  ts: 1641508095450000,
  data: { /* ... */ }
}
In the JS driver you can use this ID via documentResult.ref.id (other drivers can do this in similar ways).
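A quick sketch with the JS driver (the client setup and the "101" ID are assumptions for illustration):
const faunadb = require('faunadb')
const q = faunadb.query
const client = new faunadb.Client({ secret: 'YOUR_SECRET' }) // hypothetical secret

client.query(q.Get(q.Ref(q.Collection('users'), '101')))
  .then((doc) => console.log(doc.ref.id)) // "101"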
You can access the ID directly in FQL as well. You use the Select function.
Let(
  {
    user: Get(Select(['user_id'], Var('oauthInfo'))),
    id: Select(["ref", "id"], Var("user"))
  },
  Var("id")
)
More about the Select function.
https://docs.fauna.com/fauna/current/api/fql/functions/select
You are already using Select and that's the function you are looking for. It's what you use to grab any piece of an object or array.
Here's a contrived example that gets the zip code for the 3rd user in the Collection:
Let(
  {
    page: Map(
      Paginate(Documents(Collection("user"))),
      Lambda("ref", Get(Var("ref")))
    )
  },
  Select(["data", 2, "data", "address", "zip"], Var("page"))
)
Bring it together
That said, your Let function is a great start. Let's break things down into smaller steps.
Let(
  {
    oauthInfo_ref: Ref(Collection('user_oauth_info'), refId),
    oauthInfo_doc: Get(Var("oauthInfo_ref")),
    // make sure that user_oauth_info.user_id is a full Ref, not just a number
    user_ref: Select(["data", "user_id"], Var("oauthInfo_doc")),
    user_doc: Get(Var("user_ref")),
    user_id: Select("id", Var("user_ref")),
    // calculate expired: true when expiry_date is in the past
    expiry_date: Select(["data", "expiry_date"], Var("user_doc")),
    has_expired: LT(Var("expiry_date"), Now())
  },
  // if the data does not overlap, Merge is not required.
  // you can build plain objects in FQL
  {
    oauthInfo: Var("oauthInfo_doc"), // entire Document
    user: Var("user_doc"), // entire Document
    has_expired: Var("has_expired") // an extra field
  }
)
Instead of returning the auth info and user as separate fields, if you do want to Merge them and/or add additional fields, feel free to do that:
// ...
  Merge(
    Select("data", Var("user_doc")), // just the data
    {
      user_id: Var("user_id"), // added field
      has_expired: Var("has_expired") // added field
    }
  )
)

Creating an index for all active items

I have a collection of documents that follow this schema {label: String, status: Number}.
I want to introduce a new field, deleted_at: Date, that will record whether a document has already been deleted. This seems like a perfect use case for an index, so that I can search for all undeleted tasks.
CreateIndex({
  name: "activeTasks",
  source: Collection("tasks"),
  terms: [
    { field: ["data", "deleted_at"] }
  ]
})
And then filter by undefined / null value in shell:
Paginate(Match(Index("activeTasks"), null))
Paginate(Match(Index("activeTasks"), undefined))
It returns nothing, even for documents where I explicitly set deleted_at to null.
That's not my point, though. I want to get the documents that do not have deleted_at defined at all, so that I do not have to update the whole collection.
PS. When I add a document with deleted_at: "test" and query for it, the shell does return the expected result.
What am I missing?
The reason is that FaunaDB doesn't support reading empty/null values the way you think it does. You need to use a special binding to do that.
Make sure to check out https://docs.fauna.com/fauna/current/tutorials/indexes/bindings.html#empty for a more thorough explanation and examples.
My understanding of how bindings work would yield the following code. I haven't tested it though and I'm not sure it works.
You need a special binding index:
CreateIndex({
  name: "activeTasks",
  source: [{
    collection: Collection("tasks"),
    fields: {
      null_deleted_at: Query(
        Lambda(
          "doc",
          Equals(Select(["data", "deleted_at"], Var("doc"), null), null)
        )
      )
    }
  }],
  terms: [{ binding: "null_deleted_at" }]
})
Usage:
Map(
  Paginate(Match(Index("activeTasks"), true)),
  Lambda("X", Get(Var("X")))
)
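Since the binding evaluates to false whenever deleted_at holds an actual value, the same index should also return the deleted tasks if you match on false (a corollary of the binding above, equally untested):
Paginate(Match(Index("activeTasks"), false))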

How to obtain data in a table from Wikipedia API?

I'm trying to get all the content from Wikipedia:Unusual_articles, and I'm able to get the table of contents by calling this endpoint:
https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=sections&page=Wikipedia:Unusual_articles
and the data I got back looks something like this:
{
  title: "Wikipedia:Unusual articles",
  pageid: 154126,
  sections: [
    {
      toclevel: 1,
      level: "2",
      line: "Places and infrastructure",
      number: "1",
      index: "T-1",
      fromtitle: "Wikipedia:Unusual_articles/Places_and_infrastructure",
      byteoffset: null,
      anchor: "Places_and_infrastructure"
    },
    {
      toclevel: 2,
      level: "3",
      line: "Americas",
      number: "1.1",
      index: "T-2",
      fromtitle: "Wikipedia:Unusual_articles/Places_and_infrastructure",
      byteoffset: null,
      anchor: "Americas"
    },
    ...
But I'm not able to get the content of a particular section. For example, under Americas there is a table with a link and a short description for each entry; is there a way to obtain the link and short description from the API?
You can get the content of every page section by using the MediaWiki API with action=parse, in two steps. First you have to get all sections from the page with:
https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=Wikipedia:Unusual_articles
From the response you see that the section Americas has index=T-2 (T means it comes from a transcluded page) and that its source is fromtitle=Wikipedia:Unusual_articles/Places_and_infrastructure. Now we use these index and fromtitle values to get the content of the section with:
https://en.wikipedia.org/w/api.php?action=parse&page=Wikipedia:Unusual_articles/Places_and_infrastructure&section=2&prop=...
where:
prop=wikitext - gives the original section wikitext that was parsed.
prop=text - gives the parsed section text of the wikitext.
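Putting it together for the Americas section, the full request with prop=wikitext (swap in prop=text for the parsed HTML) would look like:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Wikipedia:Unusual_articles/Places_and_infrastructure&section=2&prop=wikitext
The wikitext of that section contains each entry's link and short description.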

Arangodb dynamic index on object keys

ArangoDB 2.8b3
I have documents with a property "specification" that can contain 1-100 keys, like:
document {
  ...
  specification: {
    key1: "value",
    ...
    key10: "value"
  }
}
The task is to query quickly by a specification key:
FOR Doc IN MyCollection FILTER Doc.specification['key1'] == "value" RETURN Doc
I tried creating hash indexes with the fields "specification", "specification.*", "specification[*]", and "specification[*].*".
The index is never used. Is there any solution without reorganizing the structure, or are there plans for the future?
No, we currently don't have any smart idea of how to handle indices for structures like that. Memory usage would also increase, since the attribute names would have to be present in the index for each indexed value.
What we will release with 2.8 is the ability to use indices on array structures:
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
with documents like:
{ tags: [ "foobar", "bar", "anotherTag" ] }
Using AQL queries like this:
FOR doc IN posts
  FILTER 'foobar' IN doc.tags[*]
  RETURN doc
You can also index attributes of objects stored inside arrays:
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*].value" ] });
db.posts.insert({
  tags: [
    { key: "key1", value: "foobar" },
    { key: "key2", value: "baz" },
    { key: "key3", value: "quux" }
  ]
});
The following query will then use the array index:
FOR doc IN posts
  FILTER 'foobar' IN doc.tags[*].value
  RETURN doc
However, the asterisk can only be used for array accesses; it can't substitute for key matches in objects.
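If reorganizing the structure does become an option later, one way to piggyback on the 2.8 array index (an untested sketch; the spec_flat attribute name is made up) is to maintain a flattened "key=value" array alongside the object and index that instead:
db.MyCollection.ensureIndex({ type: "hash", fields: [ "spec_flat[*]" ] });
db.MyCollection.insert({
  specification: { key1: "value", key10: "value" },
  spec_flat: [ "key1=value", "key10=value" ] // kept in sync by the application
});
Queries can then use the indexed array:
FOR doc IN MyCollection
  FILTER 'key1=value' IN doc.spec_flat[*]
  RETURN doc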

MongoDB: How to retrieve newly constructed data instead of the original documents in the collection?

I have a collection in which documents are all in this format:
{"user_id": ObjectId, "book_id": ObjectId}
It represents the relationship between users and books, which is one-to-many; that means a user can have more than one book.
Now I have three book_ids, for example:
["507f191e810c19729de860ea", "507f191e810c19729de345ez", "507f191e810c19729de860efr"]
I want to query for the users who have all three of these books. The result I want is not the documents in this collection but a newly constructed array of user_ids. It seems complicated and I have no idea how to build the query; please help me.
NOTE:
The reason why I didn't use a structure like:
{"user_id": ObjectId, "book_ids": [ObjectId, ...]}
is because in my system books increase frequently and there is no limit on their number; in other words, a user may read thousands of books, so I think it's better to store them the traditional way.
This question is not restricted to MongoDB; you can answer it in relational-database terms.
Using a regular find you cannot get back the user_id values that own all of the book_ids, because you normalized (flattened) your collection.
You can do it if you use the aggregation framework:
db.collection.aggregate([
  {
    $match: {
      book_id: {
        $in: [
          "507f191e810c19729de860ea",
          "507f191e810c19729de345ez",
          "507f191e810c19729de860efr"
        ]
      }
    }
  },
  {
    $group: {
      _id: "$user_id",
      count: { $sum: 1 }
    }
  },
  {
    $match: {
      count: 3
    }
  },
  {
    $group: {
      _id: null,
      users: { $addToSet: "$_id" }
    }
  }
]);
What this does is filter into the pipeline only the documents that match one of the three book_id values; it then groups by user_id and counts how many matches each user got. Users with three matches pass to the next pipeline stage, which collects them into an array of user_ids. This solution assumes that each (user_id, book_id) pair appears at most once in the original collection.
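If the book IDs arrive in a variable rather than as a literal list, the count check in the second $match should follow the list's length. A quick mongo-shell sketch (bookIds is a hypothetical variable holding the IDs from the question):
var bookIds = [
  "507f191e810c19729de860ea",
  "507f191e810c19729de345ez",
  "507f191e810c19729de860efr"
];
db.collection.aggregate([
  { $match: { book_id: { $in: bookIds } } },
  { $group: { _id: "$user_id", count: { $sum: 1 } } },
  { $match: { count: bookIds.length } },
  { $group: { _id: null, users: { $addToSet: "$_id" } } }
]);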