Redis - handling changes to data structures

I have been experimenting with Redis, and I really like the scalability that it brings to the table. However, I'm wondering how to handle changes to data structures for a system that's already in production.
For example, say I am collecting information about a user, using the user_id as the key and dumping the other data about the user as comma-separated values:
user_id: name, email, etc.
Now, say after about 100,000 records, I realise that I need to query by email. How would I take a snapshot of the existing data and create a new index for it?

Using CSV is not a great idea if you want to support changes. If everything is in one key, you need a serializer that handles missing/new values; alternatively, you can use a Redis hash, which gives you named subkeys. Either way you can add/remove fields, with the only requirement being that your code knows what to do if it reads a record without the new value.
To allow lookup by email you need to add an index - basically a key (or list) for each email with the user id as the value. You will need to populate this index by getting all keys once, then making sure you update it when emails change.
You could iterate over all keys and store them with a different id, but that is probably more trouble than it is worth.
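A minimal sketch of the index backfill and maintenance described above. It assumes records stay as CSV strings under keys like "user:<id>" and that the index lives under "email:<address>"; both prefixes, the field order (name, email) and the helper names are hypothetical. FakeRedis is only an in-memory stand-in so the sketch runs without a server; it mimics the redis-py methods set, get, delete and scan_iter. A real client returns bytes unless it is created with decode_responses=True.

```python
import csv
import io


class FakeRedis:
    """Tiny in-memory stand-in for the few redis-py methods used below."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)

    def scan_iter(self, match):
        # Only a trailing "*" wildcard is supported, which is all this sketch needs.
        prefix = match.rstrip("*")
        return [key for key in self.data if key.startswith(prefix)]


def parse_record(raw):
    """Split one CSV value into (name, email); the field order is an assumption."""
    name, email = next(csv.reader(io.StringIO(raw), skipinitialspace=True))[:2]
    return name, email


def build_email_index(r):
    """One-off backfill: walk every user key and store email -> user id."""
    for key in r.scan_iter(match="user:*"):
        _, email = parse_record(r.get(key))
        r.set("email:" + email, key[len("user:"):])


def find_by_email(r, email):
    """Resolve the email through the index, then load the record itself."""
    user_id = r.get("email:" + email)
    if user_id is None:
        return None
    return user_id, parse_record(r.get("user:" + user_id))


def change_email(r, user_id, new_email):
    """Update the record and keep the index in step with it."""
    name, old_email = parse_record(r.get("user:" + user_id))
    r.set("user:" + user_id, name + "," + new_email)
    r.delete("email:" + old_email)
    r.set("email:" + new_email, user_id)


r = FakeRedis()
r.set("user:1", "Ann, ann@example.com")
r.set("user:2", "Bob, bob@example.com")
build_email_index(r)
```

Every code path that changes an email has to go through something like change_email, otherwise the index and the records drift apart.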

From my understanding of Redis, this would require something which Redis is not designed to do. You would have to loop through all your records (using keys *), then change the order of the data and make a new key. Personally, I would recommend using a list instead of a comma-separated string, since a list can be reordered from inside Redis. A Redis list looks like the following:
"Colum" => [0] c.mcgaley#gmail.com
[1] password
[2] Something
I am building an app in which I encountered the same problem. I solved it by having a list for all the user's info, and then having a key with the user's email whose value is the user's id. So my database would look something like this:
"Colum" => [0] c.mcgaley#gmail.com
[1] password
[2] Something
"c.mcgaley#gmail.com" => "Colum"
So I could query the ID or the Email and still get the information I needed.
Sorry that I was not able to directly answer your question. Just hope this helped.

Related

How to manage additional processed data in MarkLogic

MarkLogic 9.0.8.2
We have around 20M records in MarkLogic.
For one of our business requirements, we need to generate additional data for each XML document, and users will then search this data.
As we can't change the original documents, we need input on the best way to manage the additional data. These are the options we have thought of:
1: Create a separate collection and store the additional data in a separate XML document with the same unique number as the original. When a user searches, we search in this collection, then retrieve the original documents and send the response back.
2: Store the additional data in the original document's properties.
We also need to create an element range index so that searches work when the end user supplies range operators.
<abc>
<xyz>
<quan>qty1</quan>
<value1>1.01325E+05</value1>
<unit>Pa</unit>
</xyz>
<xyz>
<quan>qty2</quan>
<value1>9.73E+02</value1>
<value2>1.373E+03</value2>
<unit>K</unit>
</xyz>
<xyz>
<quan>qty3</quan>
<value1>1.8E+03</value1>
<unit>s</unit>
</xyz>
<xyz>
<quan>qty4</quan>
<value1>3.6E+03</value1>
<unit>s</unit>
</xyz>
</abc>
We need to process the data from the value1 element. Users will then search for something like
qty1 >= minvalue AND qty1<=maxvalue
qty2 >= minvalue AND qty2<=maxvalue
qty3 >= minvalue AND qty3<=maxvalue
So when a user searches for qty1, the search should only match data from the element whose quan is qty1, and so on.
So we would like to know:
What is the best approach to store data like this?
What kind of index should I create to implement this?
I would recommend wrapping the original data in an envelope, which allows adding extra data in the header. It could also allow creating a canonical view on the relevant pieces of the data, and either store that as instance, and original as 'attachment' (sub-property, not an attached binary), or keep the instance as-is, and put canonical values for indexing in the header.
There is a lengthy blog article on the topic that discusses the pros and cons in great detail: https://www.marklogic.com/blog/envelope-design-pattern/
HTH!
Grtjn's answer would be the recommended solution, as it is more performant to keep all the information inside the document itself than to query across both the document and its properties, but it would require changes to the document.
Option 1 & 2 could both work.
Properties documents already exist, so it doesn't add fragments, but the properties must conform to the schema.
Creating a sidecar document provides more flexibility, but because you are creating new documents, it will increase the number of fragments.
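The envelope idea can be sketched outside MarkLogic as a plain XML transformation. The element names envelope, headers and instance follow the usual naming of the pattern but are assumptions here, as is the choice to emit one header element per quantity (qty1, qty2, ...) so that each quantity could get its own range index. The sample input is a shortened copy of the document from the question.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<abc>
  <xyz><quan>qty1</quan><value1>1.01325E+05</value1><unit>Pa</unit></xyz>
  <xyz><quan>qty2</quan><value1>9.73E+02</value1><value2>1.373E+03</value2><unit>K</unit></xyz>
</abc>"""


def wrap_in_envelope(original_xml):
    """Return an <envelope> holding canonical header values plus the untouched original."""
    original = ET.fromstring(original_xml)
    envelope = ET.Element("envelope")

    # Headers: one element per quantity, holding value1 as a plain decimal number.
    headers = ET.SubElement(envelope, "headers")
    for entry in original.findall("xyz"):
        canonical = ET.SubElement(headers, entry.findtext("quan"))
        canonical.text = repr(float(entry.findtext("value1")))

    # Instance: the original document, kept exactly as it was parsed.
    instance = ET.SubElement(envelope, "instance")
    instance.append(original)
    return envelope


envelope = wrap_in_envelope(ORIGINAL)
print(ET.tostring(envelope, encoding="unicode"))
```

With this layout a query for qty1 only ever touches the qty1 header element, which is the separation the question asks for; whether that maps cleanly onto the range index configuration in a given MarkLogic setup would still need to be checked.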

How to implement a key lookup for generated keys table in pentaho Kettle

I just started to use Pentaho Kettle for integration. Seems great so far, quite intuitive compared to Talend, which I was also investigating.
I am trying to migrate some customers without their keys. So I have their email addresses.
The customer may already exist in the database, so what I need to do is:
If the customer exists, add its id to the imported field and continue.
But if the customer doesn't exist I need to get the next Hibernate key from the table Hibernate_Sequences and set it as the id.
But I don't want to always allocate a key, so I want to conditionally execute a step to allocate the next key.
So what I want is for the flow to execute the db procedure, which allocates the next key and returns it, only if there's no value in id from the "lookup id" step.
Is this possible?
Just posting my updated flow: the answer was to use a filter rows component, which splits the data on true/false. I really had trouble getting the id out of the database stored proc because of a bug, so I had to use decimal and then convert back to integer (which I also couldn't figure out how to do, so I used a JavaScript component).
Yes, it is. As per the official documentation (I left only the relevant information), "Lookup values are added as new fields onto the stream". So you just need to add a "Filter rows" step from the Flow section and check the "id" field, which is supposed to be added by the "Existing Id Lookup" step.
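The logic of that flow (look up, filter on a missing id, allocate only on the false branch) can be sketched in plain code. This is not Kettle configuration; assign_ids, existing_ids and allocate_key are hypothetical names, and the counter merely stands in for the stored procedure that returns the next key.

```python
import itertools


def assign_ids(rows, existing_ids, allocate_key):
    """Fill in row["id"], calling allocate_key() only for unknown emails."""
    for row in rows:
        # Lookup step: the id is added to the row, or None when nothing matches.
        row["id"] = existing_ids.get(row["email"])
        # Filter step: only rows without an id reach the allocation.
        if row["id"] is None:
            row["id"] = allocate_key()
    return rows


counter = itertools.count(100)  # stand-in for the key-allocating procedure
rows = assign_ids(
    [{"email": "a@example.com"}, {"email": "new@example.com"}],
    existing_ids={"a@example.com": 7},
    allocate_key=lambda: next(counter),
)
```

Only the second row triggers an allocation, so no keys are consumed for customers that already exist.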

What are the security risks if I disclose database field name to web user interface?

I want to make the program simpler, so I use the table's field names as the names of the HTML inputs.
That way I can save some time mapping input names to database field names.
But are there security risks if users know my field names?
(Suppose SQL injection has been handled in the server program.)
Update 1:
I am not trying to get around the field name validation.
I just don't want to do something like this:
$uid=$_POST['user_id'];
$ufname=$_POST['user_first_name'];
$ulname=$_POST['user_last_name'];
If I do this
$user_id=$_POST['user_id'];
$user_first_name=$_POST['user_first_name'];
$user_last_name=$_POST['user_last_name'];
I can save coding time, I don't need to think of two names for one piece of data, and I reduce bugs.
I can also do something like this to save more time, as I only type each name once:
$validField=array("user_id","user_first_name","user_last_name");
foreach ($validField as $field) {
$orm[$field]=$field;
}
This can also validate the field names,
so I think there is no way for hackers to get at my unpublished fields.
I can save some time for mapping input name to database field name.
If you save time mapping input names to database field names, you would need to spend a roughly equivalent time validating that the field names are, in fact, among the fields that the users can access in your database. There is no way around this validation, because otherwise your DB is exposed to hacks that try and get your unpublished fields, such as IDs and hashes. This is pretty bad, so you would need to build that validation layer.
On the other hand, if you map from meaningless IDs to meaningful ones, then you do not need validation, because it is your program that produced the meaningful IDs. Essentially, the validation step is built into the process.
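A minimal sketch of the validation layer discussed above, written in Python rather than the PHP of the question. The field list is taken from the question; pick_allowed and the sample form data are hypothetical. Only whitelisted names are copied, so any other submitted key never reaches the code that builds the query.

```python
ALLOWED_FIELDS = ("user_id", "user_first_name", "user_last_name")


def pick_allowed(form):
    """Copy only whitelisted keys from submitted form data; ignore the rest."""
    return {field: form[field] for field in ALLOWED_FIELDS if field in form}


submitted = {
    "user_id": "42",
    "user_first_name": "Ann",
    "password_hash": "x",  # not whitelisted, so it is dropped
}
record = pick_allowed(submitted)
```

Iterating over the whitelist instead of over the submitted keys is what makes this safe: the set of field names the program can touch is fixed in code, whatever the client sends.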

What exactly is the session_hash?

I have a question regarding one of the shopify data elements.
During the webhook stage, I receive the orders data.
Inside this data, under the "client_details" section, there is a field called "session_hash".
I was wondering what kind of data it is (e.g. the session_id), and which hash function you use to generate it.
The session_hash variable is a secure hash of the session id we use to identify individual customers (as you correctly surmise).
You can use it to identify a particular user session, but that's about it. You can't deconstruct the hash to obtain the original session id for obvious security reasons.

What is the best way to fake a SQL array or list?

I'm building a chatroom application, and I want to keep track of which users are currently in the chatroom. However, I can't just store this array of users (or maybe a list would be better) in a field in one of my records in the Chatroom table.
Obviously, an array is not one of the SQL data types, which leads me to this issue: what is the best way to fake/mock array functionality in a SQL database?
It seems there are 3 options:
1: Store the list/array of users as a string separated by commas, and just do some parsing when I want to get it back to an array
2: Since the maximum number of users allowed is 10, just have 10 extra fields on each Chatroom record representing the users who are currently there
3: Have a new table Userchats, which has two fields, a reference to the chatroom, and a user name
I dunno, which is the best? I'm also open to other options. I'm also using Rails, which seems irrelevant here, but may be of interest.
Option 3 is the best. This is how you do it, in a relational schema. It is also the most flexible and future-proof option.
It can grow more easily in width (extra columns such as a date joined, a channel status, or a timestamp of when the user last talked) and in length (extra rows when you decide there can now be 15 users in a room instead of 10).
The proper way to do this is to add an extra table representing an instance of a user being in a chatroom. In most cases, this is probably what you will want to do, since it gives you more flexibility in the types of queries you can do (for instance: list all chatrooms a particular user is in, find the average number of people in each chatroom, etc.) You would just need to add a new table - something like chat_room_users, with a chat_room_id, and a user_id.
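A small sketch of that join table, using SQLite through Python's sqlite3 module so it runs anywhere. The table and column names come from the answer above; the sample rows and the two helper functions are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chat_room_users (
        chat_room_id INTEGER NOT NULL,
        user_id      INTEGER NOT NULL,
        PRIMARY KEY (chat_room_id, user_id)  -- a user is in a room at most once
    );
""")
conn.executemany(
    "INSERT INTO chat_room_users (chat_room_id, user_id) VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10)],
)


def rooms_for_user(user_id):
    """List all chatrooms a particular user is in."""
    rows = conn.execute(
        "SELECT chat_room_id FROM chat_room_users"
        " WHERE user_id = ? ORDER BY chat_room_id",
        (user_id,),
    )
    return [room_id for (room_id,) in rows]


def average_occupancy():
    """Average number of users per chatroom that has at least one user."""
    (avg,) = conn.execute(
        "SELECT AVG(n) FROM"
        " (SELECT COUNT(*) AS n FROM chat_room_users GROUP BY chat_room_id)"
    ).fetchone()
    return avg
```

A user leaving a room is then a single DELETE on this table, and raising the limit from 10 to 15 users needs no schema change.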
If you're dead set on not adding an extra table, then Rails (or more specifically ActiveRecord) does have some functionality to store data structures like arrays in a SQL column. Just set up your column as a string or text type in a Rails migration, and add:
serialize :users
You can then use this column as a normal Ruby array / object, and ActiveRecord will automatically serialize / deserialize this object as you work with it. Keep in mind that there are a lot of tradeoffs with this approach: you will never be able to query which users are in a particular room using SQL, and will instead need to pull all the data down to Ruby before working with it.