Any way to use strings as the scores in a Redis sorted set (zset)? - redis

Or maybe the question should be: What's the best way to represent a string as a number, such that sorting their numeric representations would give the same result as if sorted as strings? I devised a way that could sort up to 9 characters per string, but it seems like there should be a much better way.
In advance, I don't think using Redis's lexicographical commands will work. (See the following example.)
Example: Suppose I want to presort all of the names linked to some ID so that I can use ZINTERSTORE to quickly get an ordered list of IDs based on their names (without using redis' SORT command). Ideally I would have the IDs as the zset's members, and the numeric representation of each name would be the zset's scores.
Does that make sense? Or am I going about it wrong?

You're trying to use an order-preserving hash function to generate a score for each ID. While it appears you've written one, you've already found out that the score's range limits you to the first 9 characters (it would be interesting to see your function, btw).
Instead of that approach, here's one that IMO would be simpler: use set members of the form <name>:<id> and set every score to 0. That way you can rely on lexicographical ordering, and something like split(':') will get the ID back out of each member.
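A minimal sketch of that <name>:<id> encoding, with the lexicographic ordering simulated in plain Python rather than against a live server (with real Redis you'd ZADD each member with score 0 and read them back with ZRANGEBYLEX). The names and IDs are made up for illustration.

```python
def encode_member(name: str, member_id: str) -> str:
    # Pack name and id into one zset member; the name comes first so it
    # drives the lexicographic sort.
    return f"{name}:{member_id}"

def ids_in_name_order(members):
    # Redis orders equal-score members lexicographically by member string;
    # sorted() reproduces that here. rsplit on the last colon so a name
    # that itself contains ':' doesn't break the decoding.
    return [m.rsplit(":", 1)[1] for m in sorted(members)]

members = [encode_member(n, i) for n, i in
           [("carol", "17"), ("alice", "42"), ("bob", "7")]]
print(ids_in_name_order(members))  # → ['42', '7', '17'] (IDs in name order)
```

One caveat with this scheme: since every score is 0, ZINTERSTORE will also produce all-zero scores, so the final read has to be lexicographic (ZRANGEBYLEX) as well.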

Related

combine multiple excel files with similar names

I have a somewhat general question about combining multiple excel files together. Normally, I would use pd.read_excel to read the files then concat to join. However, I have some cases where the field names are not exactly the same but similar. For example,
one sheet would have fields like: Apple, Orange, Size, Id
another sheet would be: Apples, orange, Sizes, #
I have used the rename-columns function, but with it I have to check and compare every name in each file. I wonder if there's any way to combine them without going through all the field names. Any thoughts? THANKS!
Define what it means for two strings to be the same, then you can do the renaming automatically (you'll also need to determine what the "canonical" form of the string is - the name that you'll actually use in the final dataframe). The problem is pretty general, so you'll have to decide based on the sort of column names that you're willing to consider the same, but one simple thing might be to use a function like this:
def compare_columns(col1: str, col2: str) -> bool:
    return col1.lower() == col2.lower()
Here you'd be saying that any two columns with the same name up to differences in case are considered equal. You'd probably want to define the canonical form for a column to be all lowercase letters.
Actually, now that I think about it, since you'll need a canonical form for a column name anyway, the easiest approach would probably be, instead of comparing names, to just convert all names to canonical form and then merge like usual. In the example here, you'd rename all columns of all your dataframes to be their lowercase versions, then they'll merge correctly.
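The "convert to canonical form, then merge" flow can be sketched with plain dicts standing in for DataFrames (with pandas you'd do df.rename(columns=canonical) on each frame and then pd.concat). The transforms beyond lowercasing here (crude de-pluralizing, mapping "#" to "id") are illustrative guesses for the example column names above, not a recommendation.

```python
def canonical(name: str) -> str:
    # Assumed canonical form: trimmed, lowercased, special cases mapped,
    # then a crude trailing-'s' strip for longer names. Tune to taste.
    fixups = {"#": "id"}  # hypothetical special-case mapping
    name = fixups.get(name.strip().lower(), name.strip().lower())
    return name.rstrip("s") if len(name) > 3 else name

# The two "sheets" from the question, as column -> values dicts.
sheet1 = {"Apple": [1], "Orange": [2], "Size": [3], "Id": [4]}
sheet2 = {"Apples": [5], "orange": [6], "Sizes": [7], "#": [8]}

def rename(sheet):
    return {canonical(k): v for k, v in sheet.items()}

# After canonicalizing, both sheets expose the same column names,
# so a normal concat/merge would line them up correctly.
print(sorted(rename(sheet1)) == sorted(rename(sheet2)))  # → True
```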
The hard part will be deciding what transforms to apply to each name to get it into canonical form. Any transformation you apply risks combining data that wasn't meant to be combined (even just changing the case), so you'll need to decide what's reasonable to change based on what you expect from your column names.
As #ako said, you could also do this with something like Levenshtein distance, but I think that will be trickier than just settling on a set of transforms to apply to each column name. With Levenshtein or similar, you'll not only need to decide which name to rename to, you'll also have to track all the names that map to a canonical name and compute the distance to the closest member of that group when deciding whether a new name belongs to it. For example, say you have "Apple", "Aple", and "Ale", and you merge names with an edit distance of 1 or less: "Apple" and "Aple" should be merged, as should "Aple" and "Ale". "Apple" and "Ale" normally shouldn't be (their distance is 2), but because each merges with "Aple", they now merge with each other too.
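That transitive-merge behavior can be demonstrated in a few lines: single-link grouping with edit distance ≤ 1 pulls "Apple" and "Ale" into one group via "Aple", even though their direct distance is 2. The grouping function is a deliberately naive sketch, not a production clusterer.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def group(names, max_dist=1):
    # Single-link grouping: a name joins the first group containing ANY
    # member within max_dist - this is exactly what makes the merge
    # transitive ("Ale" joins via "Aple", not via "Apple").
    groups = []
    for name in names:
        hit = next((g for g in groups
                    if any(levenshtein(name, m) <= max_dist for m in g)),
                   None)
        hit.append(name) if hit else groups.append([name])
    return groups

print(group(["Apple", "Aple", "Ale"]))  # → [['Apple', 'Aple', 'Ale']]
```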
You could also look into autocorrect to try to convert things like "Aple" to "Apple" without needing "Ale" to also merge in; I'm sure there's some library for doing autocorrect in Python. Additionally, there are NLP tools that will help you if you want to do stemming to try to merge things like "Apples" and "Apple".
But it'll all be tricky. Lowercasing things probably works, though =)

How to make criteria with array field in Hibernate

I'm using Hibernate and Postgres and defined a character(1)[] column type.
So I don't know how to write a criteria query that finds a value in the array.
Like this query
SELECT * FROM cpfbloqueado WHERE bloqueados @> ARRAY['V']::character[]
I am not familiar with Postgres and its types but you can define your own type using custom basic type mapping. That could simplify the query.
There are many threads here on SO regarding Postres array types and Hibernate, for instance, this one. Another array mapping example that could be useful is here. At last, here is an example of using Criteria with user type.
A code example could be:
List result = session.createCriteria(Cpfbloqueado.class)
        .setProjection(Projections.projectionList()
                .add(Projections.property("characterColumn.attribute"), PostgresCharArrayType.class)
        )
        .setResultTransformer(Transformers.aliasToBean(Cpfbloqueado.class))
        .add(...) // add where restrictions here
        .list();
Also, if it is not important for the implementation, you can define a max length in the entity model by annotating your field with @Column(length = 1).
Or if you need to store an array of characters with length of 1 it is possible to use a collection type.
I hope I got the point right, however, it would be nice if the problem domain was better described.
So you have an array of single characters... The problem is that in PG that is not a fixed-length type. I hit this problem myself, though around 10 years ago. At the time I had that column mapped as a string, and that way I was able to process the internal data: simply split on commas and do whatever is needed.
If you hate that approach, as I did... look for columns of type text[] instead; that type is more common, so it is much easier to find existing material. Please look at this sample project:
https://github.com/phstudy/jpa-array-converter-sample

Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

I am going to switch to the newest (4.10.2) version of Lucene and I'd like to make some optimizations in my index and code.
I would like to use DocValuesField to get values but also for filtering and sorting.
So here I have some questions:
If I'd like to use a range filter (FieldCacheRangeFilter) I need to store the value in an XxxDocValuesField,
but if I want to use a terms filter (FieldCacheTermsFilter) I need to store the value in a SortedDocValuesField.
So it looks like if I want to use range and terms filters I need to have two different fields. Am I right? Am I using it correctly?
Another thing is sorting. I can choose between SortedNumericSortField and SortField. The first requires SortedNumericDocValues, the other a NumericDocValuesField. Is there any (big) difference in performance?
Should I use SortedNumericSortField (adding another field to the index)?
And the last one: am I right that all corresponding DocValuesFields are removed from the index when a doc is deleted? I saw an IndexWriter method for updating a doc value but no delete method for doc values.
Regards
Piotr

Oracle: max value column alphanumeric

The problem: I cannot get the latest code, which I need in order to generate the next one.
Example:
Last:
D11.0602.166
Next:
D11.0603.166
I've tried:
MAX
TRANSLATE
CONVERT
MID
VAL
What you have there is a "smart key": one attribute comprising three elements. Smart keys are dumb, because they're a pain in the neck to work with.
So the correct solution will be to split that attribute into three separate attributes and make it a composite key instead.
In the meantime you could use regular expressions to find the highest value of the middle component...
select max(regexp_replace(dumb_key
           , '([A-Z][0-9]{2})\.([0-9]{4})\.([0-9]{3})'
           , '\2'))
from your_table
/
No doubt there are all sorts of other complexities you haven't explained, which means this probably isn't the complete solution. But it should be a starter for ten.
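To show why splitting the smart key makes life easier, here is an illustrative sketch (in Python rather than PL/SQL) that parses the three elements, bumps the middle one, and reassembles the code. The format assumptions (letter + 2 digits, 4 digits, 3 digits, dot-separated) come from the D11.0602.166 example above.

```python
import re

def next_code(code: str) -> str:
    # Split the smart key into its three elements, increment the middle
    # one, and pad it back to 4 digits.
    prefix, middle, suffix = re.fullmatch(
        r"([A-Z]\d{2})\.(\d{4})\.(\d{3})", code).groups()
    return f"{prefix}.{int(middle) + 1:04d}.{suffix}"

print(next_code("D11.0602.166"))  # → D11.0603.166
```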

How to order by a part of a column?

I want to ORDER BY using part of a column. For example, some records in the user column look like this:
vip1,vip2,vip21,vip10,vip100
If I do an ORDER BY user, the order would be
vip1,vip10,vip100,vip2,vip21
How can I make the result come out as follows instead?
vip1,vip2,vip10,vip21,vip100
Thanks.
What RDBMS?
For SQL Server to get numeric sorting rather than lexicographic sorting it would be something like
ORDER BY CAST(RIGHT(col, LEN(col)-3) AS INT)
For MySQL
ORDER BY CAST(RIGHT(col, LENGTH(col)-3) AS UNSIGNED)
but why are you storing the vip part at all if it is the same for all rows?
You can also replace the user name ('vip') with nothing ('') and add zero, then sort. Not any more efficient, just more generic.
Is the "prefix" part always exactly 3 characters? If not, this gets complicated. Umm, with Postgres you could
order by substring(userid from '^[a-z]+'),
         regexp_replace(userid, '^[a-z]+', '')::int
I think that would work; I haven't tried it. Anyway, the point is, if you have some function that will do regular expressions, you could pull off the leading string of alphas, then peel off what's left and convert it to int.
If you're really embedding a number in an alpha field, a better alternative is: Don't do that. If this is two different logical data items, then make it two fields. It's a lot easier to put fields together than to take them apart.
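The "split alpha prefix from numeric suffix" idea that all the SQL variants above implement can be sketched as a Python sort key, which makes the difference between the two orderings easy to see:

```python
import re

def natural_key(s: str):
    # Split a leading run of lowercase letters from the trailing digits,
    # compare the prefix as text and the suffix as a number.
    prefix, digits = re.fullmatch(r"([a-z]*)(\d*)", s).groups()
    return (prefix, int(digits) if digits else -1)

users = ["vip1", "vip10", "vip100", "vip2", "vip21"]
print(sorted(users))                   # lexicographic: vip1, vip10, vip100, vip2, vip21
print(sorted(users, key=natural_key))  # numeric: vip1, vip2, vip10, vip21, vip100
```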