Conversation
NEW: A) The First and Follow methods in NonTerminals. B) TokenSet.cs C) Various changes implemented to help with A) and B) (marked with //SAM CHANGE for easy location) Notes: The newFromUnion() method in TokenSet.cs has two implementations: One, based on the C compiler, is the one currently in use. This one I believe to be more efficient. The other has been commented out. This one is more readable in my opinion. These changes should be ready to be committed into master. For now I will leave them in a separate branch for any desired review before that happens.
NonTerminals.cs
Outdated
|
|
||
| uint Count() { | ||
| uint total = 0; | ||
| foreach (String s in Enum.GetNames(typeof(Production))) { |
There was a problem hiding this comment.
In C# we can use the Length() method to get the number of items in an enumeration:
Enum.GetNames(typeof(Production)).Length;
| p += alternateSetOffset; | ||
| } /* end if */ | ||
|
|
||
| return tokenset; |
There was a problem hiding this comment.
This always returns null.
We need initialisers for all the FIRST sets, ideally in an array where the index matches the value of p.
see
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-48
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-231
once we have the initialisers, we can get the set from the array
index = Convert.ToUInt32(p) + alternateSetOffset;
...
tokenset = firstSet[index];
...| } /* end if */ | ||
|
|
||
| return tokenset; | ||
| } /* end FOLLOW */ |
There was a problem hiding this comment.
This too, always returns null.
We need initialisers for all the FOLLOW sets, in the same manner as for the FIRST sets.
see
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-139
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-325
and accordingly ...
index = Convert.ToUInt32(p) + alternateSetOffset;
...
tokenset = followSet[index];
...|
|
||
| TokenSet FIRST(Production p) { | ||
| TokenSet tokenset = null; | ||
|
|
There was a problem hiding this comment.
we should add uint index = 0;
NonTerminals.cs
Outdated
|
|
||
| if (IsConstParamDependent(p)) | ||
| { | ||
| p += alternateSetOffset; |
There was a problem hiding this comment.
this should be index = Convert.ToUInt32(p) + alternateSetOffset;
There was a problem hiding this comment.
Is there a particular reason you like Convert.ToUInt32(p) over (uint)p?
Should run a find & replace and change all of the code to one or the other?
NonTerminals.cs
Outdated
| } /* end if */ | ||
| if (IsVariantRecordDependent(p) && CompilerOptions.VariantRecords()) | ||
| { | ||
| p += alternateSetOffset; |
|
|
||
| TokenSet FOLLOW(Production p) { | ||
| TokenSet tokenset = null; | ||
|
|
NonTerminals.cs
Outdated
| TokenSet tokenset = null; | ||
|
|
||
| if (IsConstParamDependent(p)) { | ||
| p += alternateSetOffset; |
NonTerminals.cs
Outdated
| p += alternateSetOffset; | ||
| } /* end if */ | ||
| if (IsVariantRecordDependent(p) && !CompilerOptions.VariantRecords()) { | ||
| p += alternateSetOffset; |
| lastOptionDependent = lastNoVariantRecDependent; | ||
| public static int alternateSetOffset = (int)lastOptionDependent - (int)firstOptionDependent + 1; | ||
|
|
||
| /* -------------------------------------------------------------------------- |
There was a problem hiding this comment.
We should have a private array with the initialisers for all FIRST and FOLLOW sets here. Ideally all initialisation will take place at compile time. The FIRST() and FOLLOW() functions should then compute the index of the appropriate set in the private array, fetch it from the array and return it.
There was a problem hiding this comment.
I'm not sure where to pull the values for the first and follow sets from. It's possible I'm just completely missing something obvious. Can you point me where I need to be?
There was a problem hiding this comment.
You need to write standalone programs that generate the code. The C versions are
https://bitbucket.org/trijezdci/m2c-rework/src/tip/gen_first_sets.c
https://bitbucket.org/trijezdci/m2c-rework/src/tip/gen_follow_sets.c
The C versions of the generated code are
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-first-set-inits.h
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-follow-set-inits.h
In C we pull them in via #include directives, but in C# there is no such thing. We could use templates but for now let's simply copy-paste the generated code in.
| return null; | ||
| } /* end if */ | ||
| return token.ToString(); | ||
| } /* end LexemeForResword */ |
There was a problem hiding this comment.
Although this matters less for reserved words, for we use interned strings for all lexemes within the compiler. For consistency we should also apply this to reserved words.
The reason for interning strings is that at any point within the compiler we can compare two strings with a single identity test instead of having to string-compare them.
For example, the identifiers of modules and functions in Modula-2 are repeated at the end of their declarations:
PROCEDURE Foobar ( baz : Bam );
...
END Foobar;When checking that the END part matches the procedure header, we can simply test for identity:
ident1 = Lexer.CurrentLexeme();
...
ident2 = Lexer.CurrentLexeme();
...
if (ident1 == ident2) {
/* identifiers match */
}If we use the To.String() method, the strings won't be interned.
In the C version, I defined private arrays that contain the strings, then calculate the index for the array, fetch the string and return it. Even though C# has methods for interning strings, it is preferable to use the same approach as in the C version. Ideally, the string arrays should be initialised at compile time.
There was a problem hiding this comment.
I created the array lexemeTable in TokenSet.cs to do the same job as your string arrays. I simply referenced lexemeTable in my fix, but because it's static I could easily move lexemeTable into Terminals.cs and then reference it as Terminals.lexemeTable in TokenSet if you would rather.
trijezdci
left a comment
There was a problem hiding this comment.
Good work thus far, but we need initialisers for the sets. For details see my comments within the code.
TokenSet.cs
Outdated
|
|
||
| /* out-of-range guard */ | ||
|
|
||
| "\0" |
There was a problem hiding this comment.
this guard is only needed in C, not in C#.
| public static int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | ||
|
|
||
| public struct opaque | ||
| { |
There was a problem hiding this comment.
the style we use throughout the project is ...
foobar {
...
} /* end foobar */
TokenSet.cs
Outdated
|
|
||
| public static int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | ||
|
|
||
| public struct opaque |
There was a problem hiding this comment.
We should use a more descriptive name here, possibly also change the struct to something else.
An opaque type is a type whose name is defined in a public interface and whose implementation is hidden. Instances of the type are opaque pointers to hidden data, that is, the pointer is exposed to clients but the structure that it points to is unknown to them.
The C syntax to declare opaque types is not self explanatory. It isn't immediately obvious what the type declaration means. For this reason we use a naming convention in C that makes it explicit in the name that the type is an opaque type. Thus a type we might otherwise have called tokenset_t is then called tokenset_opaque_t instead.
Both the structure and the naming needs to be appropriately adjusted in the C# version. In C# the primary means to hide data is the class. The struct is simply a completely hidden type, it is not exposed to the client. The instance of the hidden type is an instance variable.
The advantage of keeping the struct is that it can be mapped to the initialisation data which we want to be struct based so we can initialise it using a structured literal like { 0x0, 0x0, 0x3, 2 }. Otherwise we could simply declare separate instance vars, one for each segment, one for the count.
private struct TokenSetBits {
uint bits64to95;
uint bits32to63;
uint bits0to31;
uint count;
} /* end TokenSetBits */alternatively
private struct TokenSetBits {
uint[SegmentCount] segments;
uint count;
} /* end TokenSetBits */which allows us to address the segments by index (desirable), but then SegmentCount has to be a compile time constant and the value needs to be calculated at compile time (absolute must), not at runtime.
Terminals.cs
Outdated
| } /* end switch */ | ||
| break; | ||
|
|
||
| case /* length 6 */ 6 : |
There was a problem hiding this comment.
I had added reserved word OPAQUE here. You need to copy the section for all the reswords of length 6.
imp/NonTerminals.cs
Outdated
|
|
||
| TokenSet[] followSet = new TokenSet[Count()]; | ||
| TokenSet[] firstSet = new TokenSet[Count()]; | ||
| public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; |
There was a problem hiding this comment.
The constants firstConstParamDependent, lastConstParamDependent, firstNoVariantRecDependent, lastNoVariantRecDependent, firstOptionDependent, lastOptionDependent and alternateSetOffset are only used by the methods of this class, we don't need to expose them to clients. They ought to be private.
imp/NonTerminals.cs
Outdated
| TokenSet[] followSet = new TokenSet[Count()]; | ||
| TokenSet[] firstSet = new TokenSet[Count()]; | ||
| public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; | ||
|
|
There was a problem hiding this comment.
Please set your editor to expand tab into two spaces [ASCII(32)], do not let it insert any TABs [ASCII(9)].
Lines should be broken as follows
public Foobar foobar =
foo = Foobar.Foo,
bar = Foobar.Bar,
baz = Foobar.Baz;
etc
imp/NonTerminals.cs
Outdated
| public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; | ||
|
|
||
| #region followSets | ||
| TokenSet[] followSets = { |
There was a problem hiding this comment.
FIRST sets should come first, FOLLOW sets last.
imp/NonTerminals.cs
Outdated
| new TokenSet( /* bits: */ 0x58000004, 0x00000242, 0x00000050, /* counter: */ 9 ), /* typeDeclarationTail */ | ||
| }; | ||
| #endregion | ||
|
|
There was a problem hiding this comment.
Please remove all the //SAM CHANGE comments. They have no documentary value for users and maintainers. For any temporary comments during review, we can use Github annotation features.
imp/TokenSet.cs
Outdated
| } /* end TokenSet */ | ||
|
|
||
| /* --------------------------------------------------------------------------- | ||
| * private constructor TokenSet (uint[]) |
There was a problem hiding this comment.
We already have constructor methods newFromList() and newFromUnion(). As you can see in line 170, I have made the default constructor TokenSet() private in order to prevent clients of the class from ever being able to use any other constructors than newFromList() and newFromUnion().
Your comment says "private" but your code declares it public. But even as a private method it makes no sense because the two constructors newFromList() and newFromUnion() use the private default constructor in lines 201 and 248.
| dataStored.segments[2] = TSB[2]; | ||
| dataStored.elemCount = TSB[3]; | ||
| } | ||
|
|
There was a problem hiding this comment.
line 205 says /* allocate new set */ but that is not true because newSet is already allocated in line 201.
| dataStored.segments[2] = TSB[2]; | ||
| dataStored.elemCount = TSB[3]; | ||
| } | ||
|
|
There was a problem hiding this comment.
line 206 allocates a new array of segments, but the array is already allocated in line 161.
There was a problem hiding this comment.
I do not like the name dataStored. It is entirely non-descriptive.
If you had no idea what this code is about and all you saw was a single line of code that references the struct like
newSet.dataStored.segments[segmentIndex] = 0;
does the "dataStored" part in there add any information or does it add confusion?
Either use a descriptive name, like bits ...
newSet.bits.segments[segmentIndex] = 0;
newSet.bits.elemCount = 0;
... or get rid of the struct and use just segments and elemCount ...
newSet.segments[segmentIndex] = 0;
newSet.elemCount = 0;
The latter is generally preferable UNLESS we need to assign a compound literal directly to the struct, like so
newSet.bits = { 0x0, 0x0, 0x11, 2 };
I have not seen your code representing gen_first_sets.c and gen_follow_sets.c. It depends on how you implement those whether you want segments and elemCount to be within a struct or separate instance variables.
trijezdci
left a comment
There was a problem hiding this comment.
It is not clear to me how you generated the FIRST and FOLLOW set literals. Did you copy-paste them from the C code? If so, that is not what we want. The project needs to be self contained, thus it needs to have its own standalone programs to generate the FIRST and FOLLOW set literals.
|
|
||
| public static const int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | ||
| public const int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1; | ||
|
|
There was a problem hiding this comment.
Constants and variables within an implementation of a class should always be private. If a client of the class needs to read the constant or variable, then there should be a public function method to return its value. If a client of the class needs to write to a variable, then there should be a public mutator method to overwrite its value. Such public methods will then need to be added to the public interface of the class.
| * Expects TSB[segmentCount] to be the number of tokens, and returns null if | ||
| * the count does not match. | ||
| * ------------------------------------------------------------------------ */ | ||
|
|
There was a problem hiding this comment.
/* --------------------------------------------------------------------------
* constructor newFromRawData( bits95to64, bits63to32, bits31to0, counter )
* --------------------------------------------------------------------------
* Returns a newly allocated tokenset object from raw data passed in as
* three data segments of 32 bits from highest to lowest, followed by a bit
* counter. Returns null if the input data is inconsistent.
* ----------------------------------------------------------------------- */There was a problem hiding this comment.
from highest to lowest
do you want me to change my code to run through the segments in reverse, or do you just want the bit values to be printed out in a different order?
| } | ||
| TokenSet newSet = new TokenSet(); | ||
| uint counter = 0; | ||
|
|
There was a problem hiding this comment.
First we need to test if the number of arguments is correct, return null if it is not.
There was a problem hiding this comment.
Then, we check the last argument against the largest number of bits that can be represented by segmentCount*32 bits. If the counter is greater, then we can already tell it is wrong, forego any further expensive checks and return null.
| uint counter = 0; | ||
|
|
||
| for (int i = 0; i < segmentCount; i++) { | ||
|
|
There was a problem hiding this comment.
Next, we iterate only over the first count-1 arguments, because the last argument is the counter.
There was a problem hiding this comment.
Added a check;
if(args.Length != segmentCount + 1) {
return null;
} /* end if */
so segmentCount should always be the same as args.Length - 1.
Regardless, changed the code to:
for (int segmentIndex = 0; segmentIndex < args.Length - 1; segmentIndex++) {
| uint counter = 0; | ||
|
|
||
| for (int i = 0; i < segmentCount; i++) { | ||
|
|
There was a problem hiding this comment.
Next, we only iterate over the first count-1 arguments because the last argument is the counter.
There was a problem hiding this comment.
Further, we should use self explanatory names. The loop variant is an index to address segments. A self explanatory name would be segmentIndex.
imp/TokenSet.cs
Outdated
|
|
||
| for (int i = 0; i < segmentCount; i++) { | ||
|
|
||
| for (int j = 0; j < 32; j++) { |
There was a problem hiding this comment.
This loop variant is an index to address individual bits, a self explanatory name would be bitIndex.
imp/TokenSet.cs
Outdated
| * ------------------------------------------------------------------------ */ | ||
|
|
||
| public TokenSet(params uint[] TSB) | ||
| public static TokenSet newFromRawData(params uint[] TSB) |
There was a problem hiding this comment.
TSB is a non-descriptive name, also it goes against C# naming conventions to have a parameter in all-uppercase. Since the variadic argument list includes both the bit patterns and a bit counter, there isn't really any single name that will be specific, so we have to settle for a generic name. I generally use args in such cases.
imp/NonTerminals.cs
Outdated
|
|
||
| new TokenSet( /* bits: */ 0x58000004, 0x00000242, 0x00000050, /* counter: */ 9 ), /* typeDeclarationTail */ | ||
| TokenSet.newFromRawData( /* bits: */ 0xb0000008, 0x00000484, 0x00000140, /* counter: */ 9 ), /* typeDeclarationTail */ | ||
| }; |
There was a problem hiding this comment.
All the bit pattern values have doubled relative to the C version, indicating that they were left-shifted by one. A probably cause might be an off-by-one index to address the bits. That is to say, any bit stored at position n should have been stored at position n-1.
There was a problem hiding this comment.
Unless I'm mistaken, this is because the ARGLIST token does not exist in the C compiler. The addition of said token early on in the enumeration shifts every following token's index in the enum by one.
Am I correct?
There was a problem hiding this comment.
It appears that the values also shift left again after the index of Backslash, another token which isn't included in the C compiler.
Moving utils into master
| MOD, MODULE, NOT, OF, OPAQUE, OR, POINTER, PROCEDURE, QUALIFIED, RECORD, | ||
| REPEAT, RETURN, SET, THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH, | ||
| MOD, MODULE, NOT, OF, OPAQUE, OR, POINTER, PROCEDURE, QUALIFIED, RECORD, REPEAT, | ||
| RETURN, SET, THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH, |
There was a problem hiding this comment.
please keep the original formatting -- the sources are formatted to 79 columns to ensure that there will be no vertical scrollbars even when viewed on a small screen laptop.
| public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1; | ||
|
|
||
| #region followSets | ||
| TokenSet[] followSets = { |
There was a problem hiding this comment.
FIRST sets should come first, FOLLOW sets last.
|
|
||
| namespace org.m2sf.m2sharp { | ||
|
|
||
| using System; |
There was a problem hiding this comment.
please follow the original formatting:
- strictly NO TABs, only spaces.
- two spaces indentation.
- maximum 79 characters per line
NEW:
A) The First and Follow methods in NonTerminals.
B) TokenSet.cs
C) Various changes implemented to help with A) and B) (marked with //SAM CHANGE for easy location)
Notes:
The newFromUnion() method in TokenSet.cs has two implementations:
One, based on the C compiler, is the one currently in use. This one I believe to be more efficient.
The other has been commented out. This one is more readable in my opinion.
These changes should be ready to be committed into master. For now I will leave them in a separate branch for any desired review before that happens.