Best practice when running out of progressive file names

Since this problem is not tied to any specific language I will keep it as general as possible. I have a function that takes a path P (not a directory) and a number N as parameters, and behaves like so:
If P doesn't exist, it creates a file at P;
If P exists, it appends a progressive numeric suffix (P_0000, P_0001, ...; the number of digits depends on the number of digits of N) and checks whether the new name exists, moving on to the next suffix until it finds a free one or reaches N.
My question is: is there a standard best practice for handling the case where all possible progressive names are already taken? Do I simply not create a file? Do I overwrite the first one and tell the user? What's the least dangerous approach?
I know this is a general question, and because of this I don't even know how to search for advice (nor how to properly tag the question) other than by asking here. Thanks in advance to all who answer.
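For illustration, here is a minimal sketch of the described behavior in Java. It is only one possible shape: the name createUnique and the choice to fail with an exception when every name is taken are mine, matching the "simply not create a file" option, which tends to be the least dangerous default since nothing is overwritten and the caller decides how to recover. It leans on Files.createFile being atomic:
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class UniqueFiles {
    // Creates P, or P_0000 .. P_(N-1) with a zero-padded suffix sized to N.
    // When every name is taken it fails loudly instead of overwriting.
    static Path createUnique(Path p, int n) throws IOException {
        try {
            return Files.createFile(p);  // atomic: throws if P already exists
        } catch (FileAlreadyExistsException firstTaken) {
            int width = String.valueOf(n - 1).length();
            for (int i = 0; i < n; i++) {
                Path candidate = p.resolveSibling(
                        p.getFileName() + "_" + String.format("%0" + width + "d", i));
                try {
                    return Files.createFile(candidate);
                } catch (FileAlreadyExistsException taken) {
                    // name in use, try the next suffix
                }
            }
        }
        throw new IOException("all " + n + " progressive names are taken");
    }
}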

Related

RedisBloom: Option to add items (bit strings) as is with no hashing?

I'm considering Redis for my next project (in-memory, fast), but now I have the problem of figuring out how, and whether at all, it could actually achieve my goal. The goal is to store a "large" number (millions) of fixed-length bit strings and then search the database with an input (query) bit string. The search should return everything that fulfills the condition below:
query & value = query
e.g. if all bits set in the query are also set in the value, return that key. This is essentially a Bloom filter, although in my domain of work it isn't usually called that.
I found the RedisBloom module, but I already have my Bloom filters (bit strings) available from an external program and would simply like to use RedisBloom to store them and search over them (the exists command). Therefore, in my case the "Add" command should take the input as is and not hash it again.
Is that possible? And if not, are there other suggestions?
Nope, that isn't possible as RedisBloom is a "black box" in that sense - it manages its own data structures.
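If you only need the storage plus the query & value = query check, one workaround is to skip RedisBloom entirely and store the raw bit strings as ordinary Redis values, doing the AND on the client. A minimal sketch using the Jedis client (the bs: key prefix is an arbitrary choice of mine, and the linear scan over millions of keys will be slow; this only illustrates the storage model, not an indexed search):
import java.util.ArrayList;
import java.util.List;
import redis.clients.jedis.Jedis;

public final class BitStringStore {
    // Returns the keys whose stored bit string contains every bit set in
    // query, i.e. (query AND value) == query, checked byte by byte.
    static List<String> search(Jedis jedis, byte[] query) {
        List<String> hits = new ArrayList<>();
        for (String key : jedis.keys("bs:*")) {  // SCAN would be kinder in production
            byte[] value = jedis.get(key.getBytes());
            if (value == null || value.length != query.length) {
                continue;  // skip anything that is not a same-length bit string
            }
            boolean contained = true;
            for (int i = 0; i < query.length; i++) {
                if ((query[i] & value[i]) != query[i]) {
                    contained = false;
                    break;
                }
            }
            if (contained) {
                hits.add(key);
            }
        }
        return hits;
    }
}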

Apache Lucene: Creating an index between strings and doing intelligent searching

My problem is as follows: Let's say I have three files. A, B, and C. Each of these files contains 100-150M strings (one per line). Each string is in the format of a hierarchical path like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
We can think of file A as corresponding to circuit signals in a hardware modeling language, file B to circuit signals in a schematic netlist, and file C to circuit signals in a layout (for manufacturing).
Now a signal will have a mapping between File A <-> File B <-> File C. For example in this case, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me, and I don't expect the matcher to figure this out for me.
Now say I give '/arbiter/par0/unit1/sigA'. In this case, the matcher should return a direct match from file A, since the string is found there. For files B and C a direct match is not possible, so it should return the best possible matches (by edit distance, say): in this example, /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of searching for the full string, I could also give something like *par0*unit1*sigA, and it should give me all the possible matches from files A/B/C.
I am looking for solutions, and came across Apache Lucene. However, I am not totally sure if this would work. I am going through the docs to get some idea.
My main requirements are the following:
There will be 3 text files with full paths to signals. (I can adjust the format to make it more compact if that helps build the index more quickly.)
Building the index should be fairly fast (taking at most a couple of hours). The files above are static (no modifications).
Searching should be comprehensive. It is OK if it takes ~1s per search, but the matching should support direct matches, regex matches, and edit-distance matching. The main challenge is that each file can have 100-150 million signals.
Can someone tell me whether such a use case can be easily addressed by Lucene? What would be the correct way to go about building an index and doing quick searches? I would like to write some proof-of-concept code and test the performance. Thanks.
I think, based on your requirements, the best first step would be a PoC with a given test set of entries. Based on that, it should be possible to evaluate whether the indexing time you'd like to achieve is realistic. Because you only use static information it's easier: you don't have to care about topics like NRT (near-real-time search).
Personally, I have never used Lucene on such a big data set, but I think Lucene is able to handle it.
How I would do it:
Read tutorials and best practices about Lucene, indexing, and searching, and understand how it works.
Define a data set for indexing, let's say 1000 lines from each file.
Define your Lucene document structure. This is really important, because your searches will be built on it. Take care with analyzer tasks such as tokenization: whether you need it, and how. If you need full-text search, consider a TextField. (See the sketch after this list.)
Write code for simple indexing
Run small tests with indexing and inspect your index with Luke
Write code for simple searching
Define queries and your expected results; execute searches and check the results.
Try to structure your code: separate indexing and searching. It will be easier to refactor.
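To make those steps concrete, here is a minimal PoC sketch (the field names path and tokens, the index directory signal-index, and the sample query terms are my own choices; exact class names can vary slightly between Lucene versions):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public final class SignalIndexPoc {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("signal-index"));

        // Indexing: one document per line. "path" stays untokenized for
        // exact and wildcard matching; "tokens" is analyzed (slashes
        // replaced by spaces) so fuzzy matching works per path segment.
        try (IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            String line = "/arbiter/par0/unit1/sigA";  // read from files A/B/C in reality
            Document doc = new Document();
            doc.add(new StringField("path", line, Field.Store.YES));
            doc.add(new TextField("tokens", line.replace('/', ' '), Field.Store.NO));
            writer.addDocument(doc);
        }

        // Searching: the three kinds of match from the question.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs exact = searcher.search(
                    new TermQuery(new Term("path", "/arbiter/par0/unit1/sigA")), 10);
            TopDocs wild = searcher.search(
                    new WildcardQuery(new Term("path", "*par0*unit1*sigA")), 10);
            TopDocs fuzzy = searcher.search(  // "arbitr" is one edit from "arbiter"
                    new FuzzyQuery(new Term("tokens", "arbitr"), 2), 10);
            System.out.println(exact.scoreDocs.length + " exact, "
                    + wild.scoreDocs.length + " wildcard, "
                    + fuzzy.scoreDocs.length + " fuzzy");
        }
    }
}
Note that a leading-wildcard query like *par0*unit1*sigA is expensive in Lucene; at 100-150 million entries you may need tricks such as additionally indexing reversed paths or n-grams to keep such searches fast.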

Getting the exact edited data from SQL Server

I have two tables:
Articles(artID, artContents, artPublishDate, artCategoryID, publisherID).
ArticleUpdated(upArtID, upArtContents, upArtEditedData, upArtPublishDate, upArtCategory, upArtOriginalArticleID, upPublisherID)
A user logs in to the application and updates an article's contents in the artContents column. I want to know:
Which changes did the user make to the article's contents?
I want to store both versions of the article, the original version and the edited version!
What should I do to accomplish the above two tasks?
Are any changes to the tables necessary?
What query gets the exact edited data of artContents?
(By "exact edited data" I mean: there may be 5000 characters in the column, and the user may edit 200 characters in the middle or somewhere else among the column's characters; I want exactly those edited characters, before the edit and after the edit.)
Note: I am using ASP.NET with C# for development.
You are not going to be able to extract the exact edits using SQL alone. You need an algorithm such as the Unix diff on files (which works at the line level). At the character level, the algorithm would be some variation of Levenshtein distance. If diff meets your needs, you could download it, write a stored procedure to call it, and then use it in the database. This would be rather expensive.
The part of your question about maintaining the different versions is much easier. I would add two columns, EffDate and EndDate, to each record. You can get the most recent version by looking for EndDate IS NULL, and you can find the version active at any given time. MERGE is generally useful for maintaining such a table.
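To illustrate the character-level idea, here is a minimal sketch (not Levenshtein itself) that isolates a single contiguous edit by trimming the common prefix and suffix of the two versions; edits scattered in several places need a real diff (LCS/Myers-style) algorithm:
public final class EditSpan {
    // Returns {removedText, insertedText} for one contiguous edit, found
    // by trimming the common prefix and common suffix of the two versions.
    static String[] changedSpan(String before, String after) {
        int prefix = 0;
        while (prefix < before.length() && prefix < after.length()
                && before.charAt(prefix) == after.charAt(prefix)) {
            prefix++;
        }
        int endBefore = before.length(), endAfter = after.length();
        while (endBefore > prefix && endAfter > prefix
                && before.charAt(endBefore - 1) == after.charAt(endAfter - 1)) {
            endBefore--;
            endAfter--;
        }
        return new String[] {
            before.substring(prefix, endBefore),  // what the edit removed
            after.substring(prefix, endAfter)     // what the edit inserted
        };
    }
}
For example, changedSpan("an old article", "a new article") returns {"n old", " new"}: the removed and inserted characters, which you could store alongside the edit position (the prefix length).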
Basically, this type of requirement needs custom logging.
The example you have provided (a column holding 5000 characters where the user edits 200 characters in the middle or somewhere else, and you want exactly those edited characters, before and after the edit) can also involve the user updating particular words in different places in the text.
You can use NLog (http://nlog-project.org/) for logging; it's a fast and robust tool that we normally use for .NET logging.
You can also take a look at:
http://www.codeproject.com/Articles/38756/Two-Simple-Approaches-to-WinForms-Dirty-Tracking
Asp.net Event for change tracking of entities
What would be the best way to implement change tracking on an object
The above links should clear the air on how to do it.
You would obviously need to track and store every change.

(start, end) vs. (start, length) in API design

I've seen two alternative conventions used when specifying a range of indexes, e.g.
subString(int startIndex, int length);
vs.
subString(int startIndex, int endIndex);
They are obviously equivalent in terms of what you can do with them, the only difference being whether you specify the ending index or the length of the range.
I'm assuming that in all cases startIndex would be inclusive, and endIndex exclusive.
Are there any compelling reasons to prefer one over the other when defining an API?
I'd prefer the length one simply because it gives me one less question to ask/look up in the documentation.
For the endIndex-based one: is that an inclusive or exclusive end point?
(For either variant, the same question could be asked about startIndex, but it would be a perverse API that makes it exclusive).
How to disambiguate positional arguments:
use longer names: subStringFromUpto(startIndex, stopIndex)
use a uniform convention across the whole library
Haven't we found anything better after all these years?
Ah yes, in Smalltalk maybe, since the question is tagged language-agnostic...
aString copyFrom: startIndex to: stopIndex.
aString substringOfLength: length startingAt: startIndex.
Less ambiguity, but maybe we'll have to wait another 30 years for wider adoption of such a style (it probably looks too simple to be serious).
This is a good question, and I think the preference for which to use comes down to the most common use cases. Most use cases are equally simple with either API, but consider this one:
You want to get a substring that starts at 5 and ends at the end of the string. Using the index-based version (assuming its second index is exclusive), it's as simple as:
str.subString(5, str.length());
With the length-based API:
str.subString(5, str.length() - 5);
That second approach is much less succinct and obvious. However, this can be solved by simply stating that if the length would overrun the remaining string, the call gracefully supports that (e.g. str.subString(5, str.length()); would grab everything from index 5 to the end even though it may be asking for more characters than are left). Ruby does this with its String#slice method, in addition to supporting advanced things like negative indices.
In my opinion, the index-based approach is more concrete, especially when negative indices aren't allowed. This makes it very obvious what to expect from the API, which can be a good thing, making it harder to shoot yourself in the foot. However, a well-documented API, like Ruby's, makes it easy to empower the programmer and can make for some graceful substring-ing.
I also find that in general, when I'm performing substring operations, I often know my beginning and end points. With the length-based approach, that requires an additional calculation when calling the API (e.g. substring(startIndex, endIndex - startIndex)).
Someone should do a study of typical call sites to find out which approach yields more succinct code (and therefore probably correct code).
I like the argument that using 'length' you don't have to look at the documentation, but you may already be looking at the documentation to determine whether the 2nd integer is the 'end' or the 'length'. If you name it endExclusive, then it's just as self-documenting.
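For what it's worth, the two conventions are one adapter apart. A small Java sketch (the names byRange and byLength are hypothetical), with the Ruby-style clamping from above applied to the length form:
public final class Substrings {
    // Java's native convention: start inclusive, end exclusive.
    static String byRange(String s, int start, int endExclusive) {
        return s.substring(start, endExclusive);
    }

    // Length-based convention, clamped so "the rest of the string"
    // never throws, mirroring Ruby's forgiving behavior.
    static String byLength(String s, int start, int length) {
        return s.substring(start, Math.min(start + length, s.length()));
    }
}
With the clamped form, byLength("abcdef", 4, 100) returns "ef" instead of throwing.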

Should we put units of measurements in attribute names? [closed]

I think most of us agree that it's a good idea to use a descriptive name for variables, object attributes, and database columns. If you want to store something's name, you may as well call the attribute Name so people know what to put in it.
Where the unit of measurement isn't immediately apparent, I think you should go a step further and include the unit of measurement in the name. Length_mm, for example, should help remind developers that they'd better convert the length to mm if the user just entered it in inches.
My database administrator, however, just told me that including units of measurement in database column names is “frowned upon”. I think that's just nuts, but perhaps there's some risk DBAs know about that I don't.
Throw me a line, here: should we embed units of measurement in our attribute names? Why? Why not?
If you have a consistent UOM for things, then your DBA's policy is OK.
For example, if timespans are ALWAYS in minutes, etc.
If the UOM could change, then you should store it in another column, alongside the quantity.
That said, I tend to side with you on this. Clarity trumps most things, including this. I'd rather see DurationMinutes than Duration and have to guess what the UOM is.
Yes. You should.
The key, as Charles Bretana pointed out, is legibility, and that the other users of your table, or the developers following you, know what you're using.
I would absolutely involve the units/measurement in a field name - in my business you can't guess what you'll find from the context or name: a field entitled MarketValue - is that in millions, thousands or units? US Dollars, Euros, pounds, $CURRENCY? Is that value a percentage, a ratio? Absolute or relative? Daily, monthly, calendar year, financial year? That timestamp, what time zone is it?
Your first, last and only task when providing data is to ensure that it isn't used incorrectly because the consumer wasn't able to find out enough about it. As developers, throwing "Metre", "USD", "GMT", "Percent" or whatever into a field name isn't the least bit smelly.
There are enormous smells that need resolving before the tiny whiff of field naming needs standardising.
This is why the Mars Climate Orbiter was lost: one team supplied thruster impulse data in pound-force seconds while the software expected newton-seconds (or something like that).
Although "Never say 'Never' or 'Always'" is, in general, a good rule of thumb, here I will bend my rule and say I think you should "always" make it clear what units a numeric value is in.
The convention of naming all my columns in the format:
{name}_in_{unit}
helped on one project; since I was using SI units, it actually ended up letting me infer the column data type and generally simplified my writing style.
length_in_m
speed_in_ms-1
color_in_nm
there were a few exceptions that I handled either with at_time or number_of:
started_at_time
updated_at_time
number_of_rotations
I think this is a good idea anywhere since there is always room for ambiguity.
For example, with the high-performance timer class we use, I keep having to check whether the GetElapsed() method returns seconds or milliseconds or something else. If it were called GetElapsedMilliseconds(), that would save the confusion.
The only downside being if you wanted to change your mind ... but in that case any clients would need to know about the change anyway.
F# has an interesting twist on this, allowing measurement units to be specified in the type system. See this blog post, and another Stack Overflow question discussing "Are units of measurement unique to F#?"
I've done a lot of database work, and I would not frown upon that at all, nor have I heard of frowning on it.
It's better than extended properties, which are not apparent to the casual developer. It's better than a separate document, because many developers won't read it, and certainly not in great detail. If the units are fixed, then having them in the name sounds like a good idea. If that changes, then when the unit column is added, change the name of the measurement column.
Where the unit of measurement isn't immediately apparent, I think you should go a step further and include the unit of measurement in the name. Length_mm, for example, should help remind developers that they'd better convert the length to mm if the user just entered it in inches.
You could go even a step further (in your code, not in the database) and have a Length type, which takes care of the measurement unit and of possible conversions. This is the approach of the "Quantity" pattern in Martin Fowler's "Analysis Patterns" book.
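A minimal sketch of that idea in Java (the class shape here is my own illustration of the pattern, not Fowler's code): the unit lives in exactly one place, and every call site names the unit it supplies or wants.
public final class Length {
    private final double mm;  // one canonical unit: millimetres

    private Length(double mm) { this.mm = mm; }

    public static Length ofMillimetres(double mm) { return new Length(mm); }
    public static Length ofInches(double inches)  { return new Length(inches * 25.4); }

    public double toMillimetres() { return mm; }
    public double toInches()      { return mm / 25.4; }

    public Length plus(Length other) { return new Length(mm + other.mm); }
}
The attribute can then simply be called length, and a call like Length.ofInches(2.5).toMillimetres() leaves no room for unit confusion.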
Do not put units of measurement (or column type) in your database column names.
Many databases have the ability to document/comment columns in some way (in SQL Server it is sp_addextendedproperty); I would suggest that is a more appropriate place.
For Python datetimes, consider using objects from the datetime package. Doing so captures the unit implicitly, to microsecond resolution. There is then no basis for including the unit in the variable name.
If you must use an int or float instead, it is strongly recommended to suffix the unit name abbreviation to the variable name. For example, instead of the variable name diff, use diff_secs for seconds, diff_ms for milliseconds, diff_µs for microseconds, or diff_ns for nanoseconds.
We don't put units of measurement in column names in our database. We do, however, have a data dictionary document where all of the columns and relationships are described.
The ideal approach is, if possible, to use a type that leaves no ambiguity as to the measurement. For example in .NET rather than saying int periodInSeconds you'd be much better off using TimeSpan period.
The F# language actually has units of measurement as part of the type system so you can declare types in units such as 10<m/s> and 5<s> and even perform calculations on them so something like 10<m/s> * 5<s> would result in 50<m>. See here for more info.
So I'd say if possible use a type that conveys your intention, but if that isn't possible then you should probably encode the measurement into the name. It's better and more obvious than a comment.
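The Java analogue of that .NET example, for what it's worth, is java.time.Duration, which carries the unit in the type rather than in the name:
import java.time.Duration;

public final class TypedPeriods {
    public static void main(String[] args) {
        int periodInSeconds = 10;                  // caller must remember "seconds"
        Duration period = Duration.ofSeconds(10);  // the unit is explicit in the API
        System.out.println(period.toMillis());     // conversions are spelled out: 10000
    }
}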
You definitely want units of measurement somewhere. I don't know if the column names are a good place or if the schema is better. Ask your database administrator:
Where is the information about units of measure stored?
How can I get access to the units programmatically?
If the answers are "it isn't" or "you can't", complain bitterly; they have no right to deny you your naming convention. Otherwise, all may be happier if you work within the system.
P.S. I really like the support for units of measure that they've put into F#.
I have to say, I hate "descriptive" variable names becoming "incredibly verbose" variable names.
My preferred alternative is to use nothing but the unit-of-measure names in short functions, e.g.:
function velocity(m, s) {
  return m / s;
}
You don't need to say "length_m" because in this context, it's obvious that only lengths are measurable in metres.
Having said that, if I were writing a system where unit-of-measure errors were really dangerous, I'd probably use the type system and define a Length class which always converts itself into a standard unit for any calculation. Maybe even different subclasses for Feet, Metres, etc.
No: the name of the attribute is separate from its unit of measurement.
If you call a variable length_mm, then you are tied to mm.
What if you use a 32-bit int to store length_mm? Eventually the length in mm may exceed whatever the limit on 32-bit ints is (about 2.1 billion for a signed int). You can't switch over to m, because you tied your length variable to length_mm.
I think putting units in your identifiers is a huge design smell. It almost surely means that you chose the wrong language: if units are so important to the project, you'd better be using a language whose type system is capable of representing them.