Lexing/tokenization delimited strings - tokenize

I'm writing a hand-written lexer for a small language but have one weird requirement that I'm not sure how to handle.
I need to support delimited strings where the delimiter can be any character. Strings are most likely to be delimited with double quotes (e.g. "hello"), but it could just as easily be /hello/ or ,hello,
Some sample input lines might be:
x = /abc/
y = "abc" + ,def,
z = zabcz
The last case is a bit pathological, but technically possible.
I'm trying to work out whether there's any way I can do this in the tokenization phase in the general case. Any thoughts or suggestions would be grand.

Here are solutions in C++ and JS. Both assume a single assignment per line with exactly one space on each side of the "=", as in the first sample line.
C++:
#include <array>
#include <string>
using namespace std;
// Lexically analyze one line of the form: name = <delim>text<delim>
// Returns {variable name, string value with the delimiters stripped}.
array<string, 2> lex_argument(const string &code) {
    array<string, 2> variable_info;
    size_t equal_location = code.find('=');
    // Bail out if there is no '=' or nothing usable on either side of it.
    if (equal_location == string::npos || equal_location < 1 ||
        equal_location + 2 >= code.length())
        return variable_info;
    string variable_name = code.substr(0, equal_location - 1);
    string variable_value = code.substr(equal_location + 2);
    variable_info[0] = variable_name;
    /* In the case of a string, these two characters hold the opening and closing delimiters */
    char string_variable_characters[2] = {variable_value.front(), variable_value.back()};
    if (variable_value.length() >= 2 &&
        string_variable_characters[0] == string_variable_characters[1]) {
        variable_value.erase(0, 1);   // drop the opening delimiter
        variable_value.pop_back();    // drop the closing delimiter
        variable_info[1] = variable_value;
    }
    return variable_info;
}
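and in JS:
function lex_argument(code) {
    var equalLocation = code.indexOf("=");
    var variableInfo = [null, null];
    variableInfo[0] = code.substring(0, equalLocation - 1); // variable name
    variableInfo[1] = code.substring(equalLocation + 2);    // value, delimiters included
    var stringDelimiters = [variableInfo[1].charAt(0), variableInfo[1].charAt(variableInfo[1].length - 1)];
    if (variableInfo[1].length >= 2 && stringDelimiters[0] === stringDelimiters[1]) {
        variableInfo[1] = variableInfo[1].slice(1, -1);     // strip both delimiters
    }
    return variableInfo;
}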
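For the general case (several strings per line, as in y = "abc" + ,def,), one possible approach is sketched below in C. It rests on an assumption the question does not state: the lexer knows when it is at a position where an operand is expected, and treats whatever non-space character appears there as the opening delimiter. The name scan_delimited is hypothetical, and escape sequences inside the string are not handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan one delimited string starting at *src.
 * Assumption: the caller only invokes this where the grammar expects an
 * operand, so the first non-space character is taken as the delimiter.
 * Copies the contents into out, advances *src past the closing delimiter
 * and returns the delimiter character. Returns 0 if the input is
 * exhausted, the string is unterminated, or it does not fit in out. */
char scan_delimited(const char **src, char *out, size_t outsize)
{
    const char *p = *src;

    while (isspace((unsigned char)*p))
        p++;

    char delim = *p;
    if (delim == '\0')
        return 0;                           /* nothing left to scan */

    const char *end = strchr(p + 1, delim); /* find the closing delimiter */
    if (end == NULL)
        return 0;                           /* unterminated string */

    size_t len = (size_t)(end - (p + 1));
    if (len + 1 > outsize)
        return 0;                           /* contents do not fit */

    memcpy(out, p + 1, len);
    out[len] = '\0';
    *src = end + 1;
    return delim;
}
```

With this shape, z = zabcz works only because the scanner is called after the "=" has been consumed; an identifier-first lexer would otherwise claim zabcz as a name.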

Related

How to differentiate each of the character in string

I have a question about VB.NET.
Can I identify each of the characters in a string?
For example, I have the string "Hello!. Good Afternoon!".
Can I trim the period symbol out of this string?
Thank you
You should look at the methods of the String class, as they support different forms of string manipulation.
At its simplest, the Replace() method can be used to replace all occurrences of a period character with an empty string.
Alternatively, you can use the IndexOf() method to locate a specific string (e.g. the period) and the Remove() method to remove that character.
According to my Magic 8-Ball, you actually want to:
Remove consecutive punctuation from a string:
With a Regex, we find all punctuation in the string.
The index of each Match goes into an int[].
We iterate through the array to check whether each index is consecutive to the previous punctuation index.
We delete the punctuation starting from the last one, because starting with the first would shift the remaining indexes.
Code:
string Input = "....Thalassius! vero ea--*/-*/-- tempestate+- fectus";
string Output = Input;
var regex = new Regex(@"[^\w\s]|_"); // *1.
var matches = regex.Matches(Input);
var MatchesIndex = matches.Cast<Match>()
    .Select(match => match.Index)
    .ToArray(); // *2.
int last = -2; // no previous punctuation yet, so the first match is never treated as consecutive
List<int> toDelete = new List<int>();
for (int i = 0; i < MatchesIndex.Length; i++) // *3.
{
if ( MatchesIndex[i] == last + 1)
toDelete.Add(MatchesIndex[i]);
last = MatchesIndex[i];
}
foreach (int i in toDelete.OrderByDescending(x => x)) // *4.
Output = Output.Remove(i, 1);
Console.WriteLine("Input : " + Input);
Console.WriteLine("Output : " + Output);
C# Snippet
You can learn more about the regex used, thanks to @John Kugelman.

conversion of subject name into x509_name format

I have the subject name of a CA certificate in the form CN=CA1, O=DEVANG.
I want to convert it into X509_NAME format.
Are there any APIs to help me convert it?
How can I compare two such names?
There do not seem to be any helper functions available in OpenSSL to do this. It looks like the only way to achieve what you want is by parsing the string and building up the X509_NAME_ENTRY elements one by one. You could use strsep for that, resulting in something like this code (which does not do any error checking and is error prone with regard to variations in the name format):
#define TEST_NAME "CN=CA1, O=DEVANG"
X509_NAME *x509name = X509_NAME_new();
char *x509nameString = strdup(TEST_NAME);
char *toFree = x509nameString;
char *x509nameEntryString = strsep(&x509nameString, ",");
char *x509nameEntryTypeString;
char *x509nameEntryValueString;
while (NULL != x509nameEntryString) {
x509nameEntryValueString = x509nameEntryString;
x509nameEntryTypeString = strsep(&x509nameEntryValueString, "=");
X509_NAME_add_entry_by_txt(x509name, x509nameEntryTypeString,
MBSTRING_ASC, (unsigned char *)x509nameEntryValueString, -1, -1, 0);
/* Need to skip spaces */
while ((NULL != x509nameString) &&
(' ' == *x509nameString)) {
x509nameString = &x509nameString[1];
}
x509nameEntryString = strsep(&x509nameString, ",");
}
/* See the result, just FYI */
X509_NAME_print_ex_fp(stdout, x509name, 0, XN_FLAG_ONELINE);
free(toFree);
For comparing two X509_NAME instances, the function X509_NAME_cmp() is available.
I do hope somebody has a better answer...

Determine types from a variadic function's arguments in C

I'd like a step by step explanation on how to parse the arguments of a variadic function
so that when calling va_arg(ap, TYPE); I pass the correct data TYPE of the argument being passed.
Currently I'm trying to code printf.
I am only looking for an explanation preferably with simple examples but not the solution to printf since I want to solve it myself.
Here are three examples which look like what I am looking for:
https://stackoverflow.com/a/1689228/3206885
https://stackoverflow.com/a/5551632/3206885
https://stackoverflow.com/a/1722238/3206885
I know the basics of what typedef, struct, enum and union do but can't figure out some practical application cases like the examples in the links.
What do they really mean? I can't wrap my brain around how they work.
How can I pass the data type from a union to va_arg like in the links examples? How does it match?
with a modifier like %d, %i ... or the data type of a parameter?
Here's what I've got so far:
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include "my.h"
typedef struct s_flist
{
char c;
(*f)();
} t_flist;
int my_printf(char *format, ...)
{
va_list ap;
int i;
int j;
int result;
int arg_count;
char *cur_arg = format;
char *types;
t_flist flist[] =
{
{ 's', &my_putstr },
{ 'i', &my_put_nbr },
{ 'd', &my_put_nbr }
};
i = 0;
result = 0;
types = (char*)malloc( sizeof(*format) * (my_strlen(format) / 2 + 1) );
fparser(types, format);
arg_count = my_strlen(types);
while (format[i])
{
if (format[i] == '%' && format[i + 1])
{
i++;
if (format[i] == '%')
result += my_putchar(format[i]);
else
{
j = 0;
va_start(ap, format);
while (flist[j].c)
{
if (format[i] == flist[j].c)
result += flist[i].f(va_arg(ap, flist[i].DATA_TYPE??));
j++;
}
}
}
result += my_putchar(format[i]);
i++;
}
va_end(ap);
return (result);
}
char *fparser(char *types, char *str)
{
int i;
int j;
i = 0;
j = 0;
while (str[i])
{
if (str[i] == '%' && str[i + 1] &&
str[i + 1] != '%' && str[i + 1] != ' ')
{
i++;
types[j] = str[i];
j++;
}
i++;
}
types[j] = '\0';
return (types);
}
You can't get actual type information from va_list. You can get what you're looking for from format. What it seems you're not expecting is: none of the arguments know what the actual types are, but format represents the caller's idea of what the types should be. (Perhaps a further hint: what would the actual printf do if a caller gave it format specifiers that didn't match the varargs passed in? Would it notice?)
Your code would have to parse the format string for "%" format specifiers, and use those specifiers to branch into reading the va_list with specific hardcoded types. For example, (pseudocode) if (fspec was "%s") { char* str = va_arg(ap, char*); print out str; }. Not giving more detail because you explicitly said you didn't want a complete solution.
You will never have a type as a piece of runtime data that you can pass to va_arg as a value. The second argument to va_arg must be a literal, hardcoded specification referring to a known type at compile time. (Note that va_arg is a macro that gets expanded at compile time, not a function that gets executed at runtime - you couldn't have a function taking a type as an argument.)
A couple of your links suggest keeping track of types via an enum, but this is only for the benefit of your own code being able to branch based on that information; it is still not something that can be passed to va_arg. You have to have separate pieces of code saying literally va_arg(ap, int) and va_arg(ap, char*) so there's no way to avoid a switch or a chain of ifs.
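To illustrate the enum idea without touching printf itself, here is a minimal sketch in C. Everything in it is hypothetical (the enum arg_type tags, the format_tagged function, the convention that the caller passes a tag before each value). It shows the point made above: the tag only selects a branch, and each branch still spells out its own hardcoded type for va_arg.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical tags: the caller states each argument's type explicitly. */
enum arg_type { ARG_INT, ARG_STR, ARG_END };

/* Writes each tagged argument into out, followed by a space.
 * Arguments come as (tag, value) pairs and the list ends with ARG_END.
 * Returns the number of characters written. */
int format_tagged(char *out, size_t outsize, ...)
{
    va_list ap;
    size_t used = 0;
    int tag;

    if (outsize == 0)
        return 0;
    out[0] = '\0';

    va_start(ap, outsize);
    while ((tag = va_arg(ap, int)) != ARG_END) {
        int n;
        switch (tag) {
        case ARG_INT:   /* the type given to va_arg is written out literally */
            n = snprintf(out + used, outsize - used, "%d ", va_arg(ap, int));
            break;
        case ARG_STR:
            n = snprintf(out + used, outsize - used, "%s ", va_arg(ap, char *));
            break;
        default:        /* unknown tag: the next argument's type is unknown */
            n = -1;
            break;
        }
        if (n < 0 || (size_t)n >= outsize - used)
            break;      /* stop on error or when the buffer is full */
        used += (size_t)n;
    }
    va_end(ap);
    return (int)used;
}
```

A call such as format_tagged(buf, sizeof buf, ARG_INT, 42, ARG_STR, "abc", ARG_END) fills buf with the two values in order. In a printf-style function the format string plays the role of the tags.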
The solution you want to make, using the unions and structs, would start from something like this:
typedef union {
int i;
char *s;
} PRINTABLE_THING;
int print_integer(PRINTABLE_THING pt) {
// format and print pt.i
}
int print_string(PRINTABLE_THING pt) {
// format and print pt.s
}
The two specialized functions would work fine on their own by taking explicit int or char* params; the reason we make the union is to enable the functions to formally take the same type of parameter, so that they have the same signature, so that we can define a single type that means pointer to that kind of function:
typedef int (*print_printable_thing)(PRINTABLE_THING);
Now your code can have an array of function pointers of type print_printable_thing, or an array of structs that have print_printable_thing as one of the structs' fields:
typedef struct {
char format_char;
print_printable_thing printing_function;
} FORMAT_CHAR_AND_PRINTING_FUNCTION_PAIRING;
FORMAT_CHAR_AND_PRINTING_FUNCTION_PAIRING formatters[] = {
{ 'd', print_integer },
{ 's', print_string }
};
int formatter_count = sizeof(formatters) / sizeof(FORMAT_CHAR_AND_PRINTING_FUNCTION_PAIRING);
(Yes, the names are all intentionally super verbose. You'd probably want shorter ones in the real program, or even anonymous types where appropriate.)
Now you can use that array to select the correct formatter at runtime:
for (int i = 0; i < formatter_count; i++)
if (current_format_char == formatters[i].format_char)
result += formatters[i].printing_function(current_printable_thing);
But the process of getting the correct thing into current_printable_thing is still going to involve branching to get to a va_arg(ap, ...) with the correct hardcoded type. Once you've written it, you may find yourself deciding that you didn't actually need the union nor the array of structs.
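The branching step described above might look roughly like the following sketch. The helper fetch_printable and the driver fetch_demo are hypothetical names introduced only for illustration, and passing a va_list by pointer is just one way to share it between functions. It is a sketch of the idea, not a complete printf.

```c
#include <stdarg.h>

typedef union {
    int   i;
    char *s;
} PRINTABLE_THING;

/* Fills *pt from the next variadic argument, chosen by the format character.
 * Each case names its type literally, because va_arg needs a compile-time type.
 * Returns 1 on success, 0 for an unknown format character. */
int fetch_printable(char format_char, va_list *ap, PRINTABLE_THING *pt)
{
    switch (format_char) {
    case 'd':
        pt->i = va_arg(*ap, int);
        return 1;
    case 's':
        pt->s = va_arg(*ap, char *);
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical driver: format_chars holds one character per argument,
 * e.g. "ds" for an int followed by a string. Stores up to the length of
 * format_chars values in out and returns how many were fetched. */
int fetch_demo(PRINTABLE_THING *out, const char *format_chars, ...)
{
    va_list ap;
    int count = 0;

    va_start(ap, format_chars);
    for (; *format_chars != '\0'; format_chars++) {
        if (!fetch_printable(*format_chars, &ap, &out[count]))
            break;
        count++;
    }
    va_end(ap);
    return count;
}
```

Note that the switch already knows which type it is handling, which is why the union and the table can turn out to be unnecessary: each case could call its printing function directly.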

Lex Yacc / Flex Bison variables

I was just wondering how any of you would implement multi-character variables in C using Flex and Bison / Lex and Yacc?
And if so, can you provide a simple example?
I am attempting to write an interpreter for a language and I can't seem to find a good way to implement variables. So far the methods I've tried have either failed or made any program with a lot of variables really slow (it can take minutes to execute a program that just assigns 1000 variables and does nothing else).
Thanks for your time,
Francis
In a lexer provided by ADAIC for Ada, the following method is used. I find it very useful for lexing multi-character literals such as reserved words and variables. It (along with the corresponding Bison grammar and some other material) is available in the ADAIC docs.
%%
[a-zA-Z](_?[a-zA-Z0-9])* return(lk_keyword(yytext));
%%
# define NUM_KEYWORDS 69
KEY_TABLE key_tab[NUM_KEYWORDS] =
{
{"ABORT", ABORT},
{"ABS", ABS},
....
....
....
};
lk_keyword(str)
char *str;
{
int min;
int max;
int guess, compare;
min = 0;
max = NUM_KEYWORDS-1;
guess = (min + max) / 2;
to_upper(str);
for (guess=(min+max)/2; min<=max; guess=(min+max)/2) {
if ((compare = strcmp(key_tab[guess].kw, str)) < 0) {
min = guess + 1;
} else if (compare > 0) {
max = guess - 1;
} else {
return key_tab[guess].kwv;
}
}
return identifier;
}
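The same lookup can also be written with the standard library's bsearch instead of a hand-rolled loop. The sketch below is a self-contained illustration under stated assumptions: the token codes, the three-entry table, and the name lookup_keyword are all made up for the example, and the input is assumed to be upper-cased already, as to_upper does above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real parser would use the ones Bison generates. */
enum { IDENTIFIER = 258, ABORT, ABS, ACCEPT };

struct keyword {
    const char *kw;   /* keyword spelling */
    int kwv;          /* token code returned to the parser */
};

/* Must stay sorted by kw, because bsearch assumes an ordered table. */
static const struct keyword key_tab[] = {
    { "ABORT",  ABORT  },
    { "ABS",    ABS    },
    { "ACCEPT", ACCEPT },
};

/* bsearch passes the search key first and a table element second. */
static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct keyword *)elem)->kw);
}

/* Returns the keyword's token code, or IDENTIFIER if str is not a keyword. */
int lookup_keyword(const char *str)
{
    const struct keyword *hit = bsearch(str, key_tab,
                                        sizeof key_tab / sizeof key_tab[0],
                                        sizeof key_tab[0], compare_keyword);
    return hit ? hit->kwv : IDENTIFIER;
}
```

Anything that falls through as IDENTIFIER is a variable name. Looking its value up in a hash table rather than a linear list is the usual way to keep programs with many variables fast.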

C preprocessor on Mac OSX/iPhone, usage of the '#' key?

I'm looking at some open source projects and I'm seeing the following:
NSLog(@"%s w=%f, h=%f", #size, size.width, size.height)
What exactly is the meaning of '#' right before the size symbol? Is that some kind of prefix for C strings?
To elaborate on dirkgently's answer, this looks like the implementation of a macro that takes an NSSize (or similar) argument, and prints the name of the variable (which is what the # is doing; converting the name of the variable to a string containing the name of the variable) and then its values. So in:
NSSize fooSize = NSMakeSize(2, 3);
MACRO_NAME_HERE(fooSize);
the macro would expand to:
NSLog(@"%s w=%f, h=%f", "fooSize", fooSize.width, fooSize.height);
and print:
fooSize w=2.000000, h=3.000000
(similar to NSStringFromSize, but with the variable name)
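Since the actual macro is not shown in the question, here is a guess at its shape, written as plain C so it builds without Cocoa. The Size struct stands in for NSSize, snprintf stands in for NSLog, and the name FORMAT_SIZE is invented for the example.

```c
#include <stdio.h>

/* Stand-in for NSSize so the sketch builds as plain C. */
typedef struct {
    double width;
    double height;
} Size;

/* #size turns the argument's source text into a string literal, so the
 * variable's name is printed alongside its fields.
 * Assumes buf is a char array (not a pointer), because sizeof is used. */
#define FORMAT_SIZE(buf, size) \
    snprintf((buf), sizeof (buf), "%s w=%f, h=%f", \
             #size, (size).width, (size).height)
```

Given Size fooSize = { 2, 3 }; and char buf[64];, the call FORMAT_SIZE(buf, fooSize) writes the name fooSize followed by both fields into buf.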
The official name of # is the stringizing operator. It takes its argument and surrounds it in quotes to make a C string constant, escaping any embedded quotes or backslashes as necessary. It is only allowed inside the definition of a macro -- it is not allowed in regular code. For example:
// This is not legal C
const char *str = #test
// This is ok
#define STRINGIZE(x) #x
const char *str1 = STRINGIZE(test); // equivalent to str1 = "test";
const char *str2 = STRINGIZE(test2"a\""); // equivalent to str2 = "test2\"a\\\"\"";
A related preprocessor operator is the token-pasting operator ##. It takes two tokens and pastes them together to get one token. Like the stringizing operator, it is only allowed in macro definitions, not in regular code.
// This is not legal C
int foobar = 3;
int x = foo ## bar;
// This is ok
#define TOKENPASTE(x, y) x ## y
int foobar = 3;
int x = TOKENPASTE(foo, bar); // equivalent to x = foobar;
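One detail worth knowing when combining these operators with other macros: the argument of # or ## is not macro-expanded first. A common idiom is to add one extra level of macro so the argument gets expanded before it is stringized. The sketch below uses the STRINGIZE and TOKENPASTE macros from above; EXPAND_AND_STRINGIZE and VERSION are hypothetical names for the example.

```c
#define STRINGIZE(x) #x
#define TOKENPASTE(x, y) x ## y

/* The extra level lets x be macro-expanded before # is applied. */
#define EXPAND_AND_STRINGIZE(x) STRINGIZE(x)

#define VERSION 42

/* STRINGIZE(VERSION)            yields the string "VERSION"
 * EXPAND_AND_STRINGIZE(VERSION) yields the string "42"      */
```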
Is this the body of a macro definition? If so, the # could be used to stringize the following identifier, i.e. to print "string" (without the quotes).