Clamav logical signature generation - malware

I am trying to generate ClamAV signatures for a malware dataset that I have.
I first identified some strings that are prominent in one class of malware, and generated an ldb (logical) signature from them using the method below.
The signature consists of the signature name, the engine version, and Target set to 0. These are followed by x sub-signatures (here x is 100), combined with logical OR. All the strings are converted to their hex representation. Below is an example of what is generated.
ramnit.Signature;Engine:0-500,Target:0;0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99;636f6e6e6;686b65795;363530393;52656c656;633a5c5c7;436f6e766;313937313;6c6f63616;576169744;363337363;686b65795;353238363;736c65657;633a5c5c7;636f6e6e6;686b65795;633a5c5c7;737663686;363030363;633a5c5c7;313935353;633a5c5c7;636f6e6e6;6765746d6;536574437;313933393;686b65795;633a5c5c7;323232363;353537363;686b65795;686b65795;686b65795;686b65795;686b65795;686b65795;686b65795;686b65795;353130363;64656c657;633a5c5c7;633a5c5c7;686b65795;53656e644;6b7975666;6c6f63616;494d41474;686b65795;686b65795;686b65795;696573716;737663686;313237303;363033353;363039383;686b65795;686b65795;633a5c5c7;686b65795;333139313;686b65795;437265617;686b65795;476574546;353631323;633a5c5c7;686b65795;496e74657;686b65795;686b65795;686b65795;686b65795;3f7365745;633a5c5c7;476574537;527063426;686b65795;686b65795;566572517;353630353;686b65795;4f70656e5;353138343;4c6f6f6b7;633a5c5c7;476574546;363139393;633a5c5c7;686b65795;353638333;676574707;6f6c65333;5065656b4;343230353;536574576;5c5c3f3f5;5265674f7;633a5c5c7;686b65795;686b65795
The problem is this: with 65 or fewer sub-signatures everything works fine, but as soon as there are more, loading fails with the following error.
LibClamAV Error: cli_loadldb: The number of subsignatures (== 65) doesn't match the IDs in the logical expression (== 100)
LibClamAV Error: Problem parsing database at line 1
LibClamAV Error: Can't load ramnit.ldb: Malformed database
ERROR: Malformed database
Are ldb signatures limited to only 65 conditions? If not, what causes this issue and how can I solve it?

Why on earth do you have so many repeated subsignatures?! Once you drop all of those, you will basically cut your rule in half. You could even make multiple versions of this rule, breaking it down to 10-20 subsigs each.
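A minimal sketch of that suggestion in Python, assuming the hex strings are already available as a list. The function name build_ldb_rules, the numeric suffix appended to the signature name, and the default of 20 subsignatures per rule are illustrative choices, not anything ClamAV prescribes. The sketch only removes repeats and splits the remainder into smaller rules; it does not check that each string is a well-formed ClamAV hex signature.

```python
def build_ldb_rules(name, hex_subsigs, max_per_rule=20, engine="0-500", target=0):
    """Drop repeated subsignatures, then emit one ldb line per chunk."""
    # dict.fromkeys keeps the first occurrence of each string, in order
    unique = list(dict.fromkeys(hex_subsigs))
    rules = []
    for part, start in enumerate(range(0, len(unique), max_per_rule)):
        chunk = unique[start:start + max_per_rule]
        # Subsignature IDs restart at 0 in every rule: 0|1|2|...
        logic = "|".join(str(i) for i in range(len(chunk)))
        # ".<part>" is a hypothetical naming scheme to keep rule names unique
        rules.append(
            f"{name}.{part};Engine:{engine},Target:{target};{logic};{';'.join(chunk)}"
        )
    return rules


if __name__ == "__main__":
    for line in build_ldb_rules("ramnit.Signature", ["aa", "bb", "aa", "cc"], max_per_rule=2):
        print(line)
```

Each returned line can be written to the .ldb file on its own row; because the logic in every rule is a plain OR, splitting the list does not change which files match overall.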

ftell/fseek fail when near end of file

I am reading a text file (which happens to be a PDS member, FB 80), opened with
hFile = fopen(filename,"r");
and have read up to the point in the file where there is only an empty line left. There I remember the position:
FilePos = ftell(hFile);
I then read the last line, which only contains a '\n' character. After that, seeking back to the remembered position
fseek(hFile, FilePos, SEEK_SET);
fails with:
errno=(27) EDC5027I The position specified to fseek() was invalid.
The position specified to fseek() was returned by ftell() a few lines earlier. It has the value 841 in the specific error case I have seen. Checking through the debugger, this is also the value returned by ftell a few lines earlier. It has not been corrupted.
The same code works at other positions in the file, and only fails at the point where there is a single empty line left to read when the position is remembered.
My understanding of how ftell/fseek should work is succinctly captured by another answer on SO.
The value returned from ftell on a text stream has no predictable relationship to the number of characters you have read so far. The only thing you can rely on is that you can use it subsequently as the offset argument to fseek or fseeko to move back to the same file position.
It would seem that I cannot rely on the one thing I should be able to rely on.
My question is, why does fseek fail in this way?
As z/OS has some file formats that are unique you might find the answer in this Knowledge Center article.
Given that you are processing a PDS member I would suspect that this is record level I/O which is handled differently than stream I/O which is more common in distributed implementations.
I do not know why fseek fails in this way, but if your common usage pattern is to use ftell to get the position and then fseek to go to that position, I strongly suggest using fgetpos and fsetpos instead for data set I/O. Not only will you avoid this problem that you are finding, but it is also better performing for certain data set characteristics.

What is SBLineEntry.GetColumn()?

SBLineEntry is a proxy object in the LLDB Python interface. SBLineEntry.GetColumn() returns a position within a line, but I am not sure what unit it is measured in.
In the C++ source it resolves to the LineEntry.column value, but that does not document the unit of measurement either.
At first I thought it was a UTF-8 code unit offset. That does not seem to be the case, because when I measure it, it looks like a UTF-16 code unit offset. I still could not find any definition for this value.
What is this value?
Raw byte offset in source code file?
UTF-8 code unit offset?
UTF-16 code unit offset?
Something else?
That's a good question! If the debug information is DWARF (except for Windows systems, it is), lldb is providing the DW_LNS_set_column data from the DWARF line table as the number returned by SBLineEntry::GetColumn(). The DWARF5 specification doesn't say what this integer is counting -- it says only,
The DW_LNS_set_column opcode takes a single unsigned LEB128 operand and stores it in the column register of the state machine.
You're probably seeing that clang puts the UTF-16 code unit offset in the DWARF, but the standard doesn't require that. This would be a reasonable clarification request to file with the DWARF standards committee, http://dwarfstd.org
For Rust programs, I think it's a Unicode scalar value offset.
Here's an open issue about column number. It says span_start function produces the column number.
span_start calls lookup_char_pos.
lookup_char_pos calls bytepos_to_file_charpos.
bytepos_to_file_charpos
They are repeating the word "char", and in Rust, "char" means Unicode Scalar Value.
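One way to tell the candidate units from the question apart empirically is to compute all of them for a source line containing non-ASCII characters and compare them with what GetColumn() reports. The sketch below is plain Python and does not call LLDB; the helper name column_offsets is made up for illustration, and the offsets are 0-based, so add 1 when comparing against 1-based column numbers.

```python
def column_offsets(line, char_index):
    """Offset of line[char_index] measured in three different units."""
    prefix = line[:char_index]
    return {
        # UTF-8 code units; equal to the raw byte offset if the file is UTF-8
        "utf8_code_units": len(prefix.encode("utf-8")),
        # UTF-16 code units: characters outside the BMP count as two
        "utf16_code_units": len(prefix.encode("utf-16-le")) // 2,
        # Python strings are indexed by code point, i.e. one per scalar value
        "unicode_scalar_values": len(prefix),
    }


if __name__ == "__main__":
    # U+00E9 needs 2 bytes in UTF-8; U+1F600 needs 4 bytes in UTF-8
    # and a surrogate pair in UTF-16.
    line = 'let s = "\u00e9\U0001F600"; foo()'
    print(column_offsets(line, line.index("foo")))
```

A line with only ASCII characters gives the same number in all three units, so the test line needs at least one character outside the Basic Multilingual Plane to separate UTF-16 code units from scalar values.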

SonarLint - questions about some of the rules for VB.NET

The large majority of SonarLint rules that I've come across in Java seemed plausible and justified. However, ever since I started using SonarLint for VB.NET, I've come across several rules that left me questioning their usefulness, or even whether they are working correctly.
I'd like to know if this is simply a problem of me using some VB.NET constructs in a suboptimal way or whether the rule really is flawed.
(Apologies if this question is a little longer. I didn't know if I should create a separate question for each individual rule.)
I found that the following rules leave some cases unconsidered, which then show up as false positives:
S1871: Two branches in the same conditional structure should not have exactly the same implementation
I found this one to bring up a lot of false-positives for me, because sometimes the order in which the conditions are checked actually does matter. Take the following pseudo code as example:
If conditionA() Then
    doSomething()
ElseIf conditionB() AndAlso conditionC() Then
    doSomethingElse()
ElseIf conditionD() OrElse conditionE() Then
    doYetAnotherThing()
'... feel free to have even more cases in between here
Else
    doSomething() 'Non-compliant
End If
If I wanted to follow this Sonar rule and still make the code behave the same way, I'd have to add the negated version of each ElseIf-condition to the first If-condition.
Another example would be the following switch:
Select Case i
    Case 0 To 40
        value = 0
    Case 41 To 60
        value = 1
    Case 61 To 80
        value = 3
    Case 81 To 100
        value = 5
    Case Else
        value = 0 'Non-compliant
End Select
There shouldn't be anything wrong with having that last case in a switch. True, I could have initialized value beforehand to 0 and ignored that last case, but then I'd have one more assignment operation than necessary. And the Java ruleset has conditioned me to always put a default case in every switch.
S1764: Identical expressions should not be used on both sides of a binary operator
This rule does not seem to take into account that some functions may return different values every time you call them, for instance collections where accessing an element removes it from the collection:
stack.Push(stack.Pop() / stack.Pop()) 'Non-compliant
I understand if this is too much of an edge case to make special exceptions for it, though.
The following rules I am not actually sure about:
S3385: "Exit" statements should not be used
While I agree that Return is more readable than Exit Sub, is it really bad to use a single Exit For to break out of a For or a For Each loop? The SonarLint rule for Java permits the use of a single break; in a loop before flagging it as an issue. Is there a reason why the default in VB.NET is more strict in that regard? Or is the rule built on the assumption that you can solve nearly all your loop problems with LINQ extension methods and lambdas?
S2374: Signed types should be preferred to unsigned ones
This rule basically states that unsigned types should not be used at all because they "have different arithmetic operators than signed ones - operators that few developers understand". In my code I am only using UInteger for ID values (because I don't need negative values and a Long would be a waste of memory in my case). They are stored in List(Of UInteger) and only ever compared to other UIntegers. Is this rule even relevant to my case (are comparisons part of these "arithmetic operators" mentioned by the rule) and what exactly would be the pitfall? And if not, wouldn't it be better to make that rule apply to arithmetic operations involving unsigned types, rather than their declaration?
S2355: Array literals should be used instead of array creation expressions
Maybe I don't know VB.NET well enough, but how exactly would I satisfy this rule in the following case where I want to create a fixed-size array where the initialization length is only known at runtime? Is this a false-positive?
Dim myObjects As Object() = New Object(someOtherList.Count - 3) {} 'Non-compliant
Sure, I could probably just use a List(Of Object). But I am curious anyway.
Thanks for raising these points. Note that not all rules apply every time. There are cases where we need to balance false positives, false negatives and real cases. Take the rule about identical expressions on both sides of an operator, for example. Is it a bug to have the same operands? No, it's not. If it were, the compiler would report it. Is it a bad smell, and is it usually a mistake? Yes, in many cases. See this for example in Roslyn. Should we tune this rule to exclude some cases? Yes, we should; there's nothing wrong with 2 << 2. So there's a lot of balancing that needs to happen, and we try to settle for an implementation that brings the most value for the users.
For the points you raised:
Two branches in the same conditional structure should not have exactly the same implementation
This rule generally states that having two blocks of code match exactly is a bad sign. Copy-pasted code should be avoided for many reasons, for example if you need to fix the code in one place, you'll need to fix it in the other too. You're right that adding negated conditions would be a mess, but if you extract each condition into its own method (and call the negated methods inside them) with proper names, then it would probably improve the readability of your code.
For the Select Case, again, copy pasted code is always a bad sign. In this case you could do this:
Select Case i
...
Case 0 To 40
Case Else
value = 0 ' Compliant
End Select
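Or simply remove the 0-40 case. That may be the safer option: VB.NET's Select Case does not fall through, so the empty Case 0 To 40 above does nothing and leaves value at whatever it held before.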
Identical expressions should not be used on both sides of a binary operator
I think this is a corner case. See the first paragraph of the answer.
"Exit" statements should not be used
It's almost always true that by choosing another type of loop, or changing the stop condition, you can get away without using any "Exit" statements. It's good practice to have a single exit point from loops.
Signed types should be preferred to unsigned ones
This is a legacy rule from SonarQube VB.NET, and I agree with you that it shouldn't be enabled by default in SonarLint. I created the following ticket in our JIRA: https://jira.sonarsource.com/browse/SLVS-1074
Array literals should be used instead of array creation expressions
Yes, it seems to be a false positive, we shouldn't report on array creations when the size is explicitly specified. https://jira.sonarsource.com/browse/SLVS-1075

Terminal Symbol vs Token in Lex or Flex

I am studying YACC and the concept of a terminal symbol vs a token keeps coming up. Could someone explain to me what the difference is or point me to an article or tutorial that might help?
They are really two names for the same thing, but usually "terminal" is used to describe what the parser is working with, while "token" is used to describe the corresponding sequence of symbols in the source.
In a parser generator like yacc, the grammar of the language is defined in terms of an "alphabet" of "terminals". The word "alphabet" is a little confusing because they are strings, not letters. But from the parser's perspective, every terminal is an indivisible unit indistinguishable from any other use of the same kind of terminal. So the source code:
total = 17 + subtotal;
will be presented to the parser as something like:
ID EQUALS NUMBER PLUS ID SEMICOLON
There is a correspondence between the stream of terminals which the parser sees and substrings of the input language. So we say that the "token" total is an instance of the "terminal" ID. There may be an unlimited number of potential tokens corresponding to a given terminal (or there may be just one, as with the terminal EQUALS) but what the parser actually works with is a smallish finite set of terminals.
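The correspondence can be sketched with a tiny hand-written scanner in Python, standing in for what a lex/flex specification would generate. The terminal names follow the example above; the regular expressions and the function name tokenize are assumptions made for this illustration only.

```python
import re

# One (terminal, pattern) pair per kind of token the parser knows about.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("EQUALS", r"="),
    ("PLUS", r"\+"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),  # whitespace is matched but never handed to the parser
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (terminal, token_text) pairs for the given source string."""
    pairs = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            pairs.append((match.lastgroup, match.group()))
        pos = match.end()
    return pairs


if __name__ == "__main__":
    for terminal, text in tokenize("total = 17 + subtotal;"):
        print(terminal, repr(text))
```

The first element of each pair is what the parser sees; the second is the token text in the source. Both total and subtotal come out as the same terminal, ID.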

FORTRAN77 How to throw error for the following: division symbol, timeout, very big floating value:

1. for certain symbols like (/), (,) and (;) while taking input
2. timeout error while taking input
3. very big floating value as input
4. improper inputs like - 4/3
I found out how to time-out the program after a specific time:
https://gcc.gnu.org/onlinedocs/gcc-4.3.2/gfortran.pdf (find: alarm)
If I interpret your question correctly, you want to check user input for correct values and interpret lists and fractions. I'm assuming you mean from console, a la
read(*,*) character-variable
The other option is to use formatted input, for example
read(*,'(i4)') integer-variable
which would read an integer with four digits.
This method would possibly already remove some of your problems, because the user input has to match the specified format or the program reports a runtime error. It is possible to specify the number of input values as well (separated by whitespace, ',' or ';'). Hence if you know beforehand how many values you are getting, the user can enter lists. If you make the requirements clearer, it will be easier to help. Fortran is a bit finicky for I/O.
If you really need the input to be of a general not-defined-at-compile-time type, you'll have to parse the string. This is also true if you want the user to be able to enter fractions like '4/3'.
I'm not aware of a method to restrict the time which a user has to enter values. It may be possible, but I've never seen it.
For too-big or improper values you can, for example, wrap the read statement in an endless do loop and exit once the number(s) is/are correct:
do
read(*,'(i6)') x
if ( (x.lt.1e5).and.(x.ge.0) ) exit
end do
This would request an integer x from the user until the input is smaller than 100 000 and at least 0.
edit after discussion in comments:
The following code may be what you want:
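implicit none
integer :: x             ! the integer requested from the user
character(len=10) :: y   ! receives whatever is in columns 11-20
y=''
print*,'Enter one integer:'
do
   ! i10 reads the integer from columns 1-10, a reads the following text
   read(*,'(i10,a)') x,y
   ! accept only if columns 11-20 are blank and x is below 100 000
   if( (y.eq.'').and.(x.lt.1e5) ) exit
   print*,'Enter one valid integer, smaller than 100 000, only:'
end do
print*,x
end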
It just keeps reading until the input contains exactly one integer smaller than 100 000. If you want a better user experience, you can catch 'very' invalid input (the kind that makes the program complain and stop) with the iostat specifier of the read statement.
One thing though: on my two available compilers (GCC 4.4.7 and Intel fortran compiler 11.0) the forward slash '/' is not a valid integer input and the program stops. If that is different for your compiler the code above should still work, but I can't test that.