Ubytec Language Overview
Ubytec is a structured, statically-typed language that compiles to a low-level bytecode (and currently to x86-64 assembly). It combines high-level constructs (modules, types, functions) with a stack-based instruction set. This overview describes the syntax, grammar, semantics, opcodes, and type rules as implemented in the current codebase. Features present in the grammar or schema but not fully functional in the interpreter are marked as WIP (work in progress) or planned for future support.
Module Structure
Modules are the top-level compilation units. A source file generally defines a single module using the `module` keyword. Modules can be nested (sub-modules declared inside a parent module). The module declaration includes a header with metadata and required dependencies. The syntax is:

```
[global] module(name: "ModuleName", version: "x.y", author: "AuthorName"[, requires: ["Dep1", "Dep2", ...]]) { ... }
```

The `global` modifier on a module is optional and is the only modifier allowed at module level. (If present, it may designate an entry point or special visibility, though currently it has no effect beyond passing validation.) Header arguments inside `(...)` provide the module's name, version, author, and an optional `requires` list of other module names. The `requires` field is a comma-separated list within square brackets. All header keys are required except `requires`.

The module's body is enclosed in `{ }`. Inside, you can declare types, functions, and even nested sub-modules. Specifically, a module may contain:

- Type definitions: classes, structs, records, interfaces, enums.
- Function and action definitions (similar to free functions).
- Field and property declarations (module-level variables).
- At most one each of a global context and local context block (see below).
- Nested `module` declarations for sub-modules.
For example:
```
module(name: "Example", version: "0.1", author: "A.User", requires: ["LibX"]) {
    global {
        field t_int32 GlobalCounter 0   // global variable
        func InitGlobals() { ... }
    }

    // Module-level function
    func Main() -> t_int32 {
        // Function body...
        return
    }

    // Nested type
    class Helper { ... }
}
```
In the above, `GlobalCounter` is a module-global field initialized to 0, and `Main` is the entry function.

Main Function: If a module defines a function named `Main`, the compiled program will treat it as the entry point. At runtime, `_start` will call `Main` and then exit. (Currently, the return value of `Main` is not propagated to the process exit code – the program always exits with 0, which is a planned improvement.)
Global and Local Context Blocks
Ubytec provides context blocks to define groups of declarations with specific storage duration or linkage:
Global Context (`global { ... }`): Inside a module or a type, a `global {}` block defines static members – data or functions associated with the module/type rather than with instances. For a module, this is the place to declare truly global data or initialization functions. For a class or struct, the global context contains static fields, properties, or functions belonging to the type (like static members). Only one `global` block is allowed per module or per type. In the parser, only fields, properties, functions, and actions are allowed inside a `global` block. Example:

```
class Config {
    global {
        field t_char8 Delimiter ','   // static field
        func LoadDefaults() { ... }   // static method
    }
    // ... instance members ...
}
```

In code generation, global-context fields are placed in the data segment, properties in the BSS (zero-initialized) segment, and global-context functions are output in the text segment.
Local Context (`local { ... }`): A `local {}` block defines local (automatic) variables or inner functions that exist per instance or scope. At most one local block is allowed per module or type. In a module, a `local` block might hold state that is allocated at program startup and freed on exit (though currently it behaves similarly to a function's local frame). In classes/structs, a `local` block can declare instance-specific helper variables or inner functions. The parser allows only variable declarations (`<type> name ...`), inner `func`, and `action` definitions inside a local block. Example:

```
module(name: "M", version: "1.0", author: "X") {
    local {
        t_int32 tempBufferSize 1024
        func helper() { ... }
    }
    // ...
}
```

In code generation, a `local` context causes allocation on the stack at runtime for its variables: when the local context is entered, space for all declared variables is reserved on the stack (by decrementing RSP). The variables are referenced at fixed offsets from the stack frame. Any `func` or `action` inside the local block is treated as an inner function that can use these variables (though closures of this form are not fully implemented, they are parsed and compiled as nested functions). The local context ends by restoring the stack pointer (in the current output, the stack is not explicitly restored until function exit or program end, which is an area to be refined – WIP).
Note: The `global` and `local` modifiers can also appear before individual member declarations (e.g. `global field x` or `local func f`). In the current implementation, such modifiers are parsed and stored but have the same effect as placing the member in the corresponding context block. For example, a `global` field in a class is effectively static. However, using explicit context blocks is recommended for clarity. The interpreter's validation ensures you do not combine `Global` and `Local` flags on the same member.
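For illustration, the following sketch (based on the rules above; the class and member names are invented) shows the two equivalent ways of declaring a static field:

```
class CounterA {
    global field t_int32 Count 0    // member-level 'global' modifier: static field
}

class CounterB {
    global {
        field t_int32 Count 0       // same effect via a global context block
    }
}
```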
Type Definitions
Ubytec supports several kinds of user-defined types: classes, structs, records, interfaces, and enums. All types can be nested within modules or other types (except interfaces cannot contain nested type definitions).
Classes
A `class` defines a reference type with support for inheritance-related modifiers (though inheritance of classes is not implemented yet, these modifiers are parsed for future use). Class syntax:

```
[modifiers] class ClassName {
    [local { ... }]
    [global { ... }]
    members...
}
```

Classes can include an optional leading `local {}` or `global {}` block (but not both in the current parser – if both appear, only the first is recognized). A local block here might define per-instance helper data (though instance fields can also just be declared as normal fields), and a global block defines static members.

Class Members: Inside the class body (after any initial context block), you can declare:

- Fields (`field` declarations) – instance fields by default.
- Properties (typed variables with optional get/set accessors).
- Methods (`func` functions) – instance methods.
- Actions (`action` functions) – instance routines with no return value.
- Nested types: classes, structs, records, interfaces, enums (classes can be nested arbitrarily).
- Interfaces – though embedding an interface inside a class is parsed, it is unusual. (Interfaces inside classes are allowed by the grammar as nested types.)

The parser in `ParseClass` iterates over tokens and dispatches to the appropriate parse routine based on keywords. For example, `field` introduces a field, a recognized type name introduces a property, `func` a method, etc.
Modifiers: Classes and class members support a range of modifiers:

- Access control: `public`, `private`, `protected`, `internal`, `secret`. At most one access modifier can be applied. By default (no modifier), classes and members are internal to the module.
- Inheritance: `abstract`, `virtual`, `override`, `sealed`. These are parsed and validated for consistency (e.g. an abstract class cannot be marked sealed, an override method must be in an abstract class, etc.), but actual base-class inheritance is not yet implemented (WIP). In practice, you can declare an abstract class with abstract methods (which must have no body), and the compiler will enforce that the class cannot be `sealed` and that non-abstract members in an abstract class are marked `virtual`. However, since there is no mechanism to derive classes yet, these serve mainly as placeholders for future extension.
- Storage: `global` (static) or `local` can mark a class or member as described earlier. A class cannot be both global and local. A member cannot have both modifiers simultaneously.
- Other: `const` and `readonly` can apply to fields (with the same meaning as in C# – a `const` is a compile-time constant, and a `readonly` field can only be assigned once). A field cannot be both const and readonly. If a field is `const`, it must have an initializer (the current compiler doesn't explicitly check this, but it will error at compile time if no value is provided). Classes themselves cannot be const or readonly (invalid).
Example:
```
public class Player {
    field t_int32 id                    // an instance field
    field readonly t_int32 maxScore 100

    property t_char8 grade {            // property with custom getter
        func get {
            // returns grade based on maxScore
            if maxScore >= 100 -> t_char8 {
                return 'A'
            }
            return 'B'
        }
    }

    func ResetScore() {
        this.maxScore = 0               // (assignment WIP – see notes)
    }
}
```
In this example, `Player` is a public class. It has an `id` field, a readonly field `maxScore` initialized to 100, and a property `grade` with a custom `get` accessor (no `set` or `init`, so it is read-only). The `ResetScore` method attempts to reset `maxScore` – however, because `maxScore` is readonly, this would be disallowed (the compiler would throw an error if it tried to assign it outside of a constructor or initialization, which currently would appear as a runtime `Exception` from the validation logic).

Note on Assignment (WIP): The example above highlights that direct assignment to variables in code (e.g., `this.maxScore = 0`) is not fully implemented. The Ubytec interpreter does parse the `=` operator as an assignment, but storing new values to variables at runtime currently requires using the low-level `STORE` opcode (which is defined but not yet functional in code). As of now, modifying a variable after declaration is limited – this is a planned feature. In practice, one would use field initializers or compute values within expressions until assignment statements are supported.
Structs
A `struct` is similar to a class but represents a value type. Structs are declared with the `struct` keyword and share much of the class syntax:

```
[modifiers] struct StructName {
    [local { ... }]
    [global { ... }]
    members...
}
```

Structs can contain fields, properties, funcs, and actions (but not other struct/class/record types in the current parser – any attempt to declare a nested type inside a struct is skipped or raises an error, as nested types are not explicitly allowed in structs). They may also have at most one `local` and/or one `global` block (the parser will accept one of each, in any order, at the start of the struct body).

Structs do not allow an explicitly declared constructor method (there is no special constructor syntax in Ubytec yet).

Inheritance-related modifiers (`abstract`, etc.) are not applicable to structs (they would be flagged as invalid if used).

The default behavior of a struct is to be allocated on the stack (when used as a local) or in-line inside other structures. However, since the runtime is still stack-machine oriented, the distinction between value type and reference type is not strongly enforced beyond not supporting inheritance. You should treat structs as intended for small plain-old-data. (Copy semantics, etc., are not explicitly implemented yet.)
Example:
```
struct Point {
    t_int32 x
    t_int32 y

    func magnitudeSquared() -> t_int32 {
        return x*x + y*y
    }
}
```
This struct `Point` has two fields and a method. A struct can be instantiated by value (for example, as a local variable), but Ubytec has no `new` operator – a variable of struct type can be declared and its fields set directly (though again, direct assignment is limited).
Records
A `record` is a special kind of class designed for immutable data containers (like a tuple or a data-transfer object). Records in Ubytec allow a positional parameter list that generates auto-properties. Syntax:

```
[modifiers] [type] record RecordName( Type1 name1, Type2 name2, ... ) {
    [local { ... }]
    [global { ... }]
    // additional properties, funcs, actions (no fields allowed)
}
```

You can optionally prefix a record with the `type` keyword (e.g. `type record R(...)`) when defining it at the module level. This is accepted by the parser as a "module-level alias" marker, but currently it has no semantic effect; it is essentially ignored during parsing.

Positional parameters: The constructor-like parameter list after the record name declares a set of positional properties. For each `(Type Name)` pair in the list, the record will automatically include a public property of that type with that name. These are analogous to primary constructor parameters or auto-implemented properties. They are read-only by default (since no setter is provided) and cannot be fields (records do not allow explicit fields). For each such parameter, a `Property` is created internally with an empty AccessorContext (meaning it has an implicit getter).

A record may have an optional body in `{ }`. If present, the body cannot contain `field` declarations – any attempt to put `field` inside a record will cause a compile exception. The body can contain:

- Additional properties (explicitly declared property blocks).
- Functions (`func`) and actions.
- At most one `local` and/or `global` context (similar to classes/structs).

These additional properties in the body are often called non-positional or "extra" properties. They might be used for derived values or other data that wasn't captured positionally.

Immutability: Ubytec doesn't enforce immutability of records at runtime yet, but by convention records should be used with either only getters or no methods that mutate state. The parser explicitly prohibits fields in records to encourage the use of properties (which at least require deliberate accessor code to mutate).
Example:
```
record Pair(t_int32 A, t_int32 B) {
    // A and B are auto-properties (read-only by default)
    func sum() -> t_int32 {
        return A + B
    }
}
```
Here `Pair` is a record with positional properties `A` and `B`. It also has a method `sum`. There is no explicit constructor; to instantiate a record, you would typically declare it as a literal if that becomes supported, or simply treat it as a value container (since records don't have identity or behaviors beyond carrying data). Records are meant to be compared by value (in the future, equality might be defined in terms of property values). The compiler currently detects bit-field patterns for enums (all member values powers of two), but there is no special handling of this kind for records yet.

(Planned:) In the future, records may get additional generated methods (like `ToString`, `Equals`) and allow a concise construction syntax. For now, they behave as plain classes with restrictions.
Interfaces
Interfaces in Ubytec declare abstract method signatures and properties without implementation. Syntax:
```
[modifiers] interface InterfaceName {
    // property signatures (Type Name;)
    // method signatures (func Name(...)[-> ReturnType];)
    // action signatures (action Name(...);)
}
```

Interface members can only be properties or function/action signatures. They cannot include any `field` declarations or nested type definitions. If the parser encounters `field` or a type keyword inside an `interface`, it will throw an error.

A property in an interface is written the same way as in a class, except with no body. For example, `t_int32 Size;` would declare a property `Size` of type `int32` with no implementation. The Ubytec parser actually treats any bare type token followed by an identifier in an interface as a property declaration.

Functions in interfaces are declared with the `func` keyword and a signature, but no body (not even an empty `{}`). The parser enforces that interface methods have `Definition is null` (no body). Syntactically, you write `func Name(params...)[-> ReturnType];` – in practice, the parser will accept a function header and then require that the next token is not a `{` but a terminator (which can be an implicit end of block or a semicolon).

Actions in interfaces are similar: declared with `action Name(params...);` and no body.

Modifiers: Interfaces can have access modifiers (public, etc.) and also the `global` or `local` modifier (though global/local on an interface is not particularly meaningful – if used, it is validated but doesn't change behavior). Other modifiers like abstract/virtual are redundant for interfaces (all interface members are implicitly abstract). The compiler will reject an interface marked `sealed`, `override`, etc., as invalid.

Interfaces cannot have `global` or `local` context blocks inside them (if attempted, the block would be seen as an unknown token and likely error).

Example:
```
interface IReadable {
    func Read() -> t_int32;
    property t_int32 Position;
}
```
This interface declares a method `Read()` and a property `Position`. Any class implementing `IReadable` (Ubytec doesn't have an `implements` syntax yet, but presumably a future version would allow class declarations to state that they implement interfaces) would need to provide a `func Read { ... }` and a `property t_int32 Position { get; set; }` (or at least a getter) to satisfy the interface. Currently, interface usage is limited since there is no mechanism to enforce implementation or to use an interface as a type constraint – these are planned features.
Enums
An `enum` defines a set of named constant values. Syntax:

```
[modifiers] enum EnumName [:: UnderlyingType] {
    Name1 [= const_expr],
    Name2 [= const_expr],
    ...
}
```

You can specify an underlying integral type after a `::`. If omitted, the default underlying type is `t_byte` (8-bit unsigned). The underlying type must be one of the built-in integer types (e.g., `t_int32`, `t_uint16`, etc.). In the parser, if an underlying type is provided, the first token after `::` is consumed as a type and its primitive part is used as the underlying type.

The enum members are listed within `{ }`. Each member is a name with an optional `= value`. If a value is not provided, it is auto-assigned. Auto-assignment starts from 0 for the first member and increments by 1 for each subsequent member that doesn't have an explicit value. If a member has an explicit value, subsequent auto-values continue from that value + 1. The parser uses a `long` (`autoVal`) to track the next implicit value.

The value expressions in enums can be decimal, hex (`0x`), or binary (`0b`) numeric literals. They are parsed as constants at compile time (no references to other constants are allowed, just literal numbers). If a literal is out of range for a 64-bit signed integer, it might overflow the `long` (the implementation currently doesn't guard against overflow explicitly).

A trailing comma after the last enum member is allowed (the parser will simply skip over a comma if it finds one before the closing `}`).

After parsing the members, the compiler performs a check: if every member's value is either 0 or a power of two, it marks the enum as a bit-flag enum. This is stored as a boolean `isBitField` on the Enum object and can be used by tools or future semantic checks to allow bitwise combination of enum values. (If any value is neither zero nor a power of two, `isBitField` will be false.)

Enums cannot include methods or other members – only the value list. Any attempt to put a function or other declaration inside an enum results in a parser error or is ignored.
Example:
```
enum Color :: t_uint16 {
    Red = 1,
    Green = 2,
    Blue = 4,
    White
}
```
Here `Color` is an enum with underlying type `uint16`. We explicitly set Red=1, Green=2, Blue=4. The member `White` has no explicit value, so it gets auto-assigned 5 (the last explicit value + 1). Since not all values are powers of two (White=5 is not), this enum would not be flagged as a bitfield (`isBitField` would be false). If White had been 8 instead, all values 1, 2, 4, 8 would be powers of two (plus 0 implicitly if not used) and `isBitField` would be true.

Modifiers: Enums can have access modifiers (public, etc.). They cannot be `abstract`, `sealed`, or have `local`/`global` (those don't make sense for enums). The parser will error if illegal modifiers are present. Const/readonly have no meaning on an enum either (each member is implicitly const).

At runtime, an enum is essentially a primitive value of the underlying type. There is currently no enforcement of using an enum type in place of an integer; however, in the AST each enum member is represented as a pair of name and `long` value, and the enum itself as a distinct type category. Future versions may introduce type-checking to prevent mixing enums and ints without casts.
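For comparison, here is a hedged sketch of an enum whose members are all zero or powers of two; per the check described above, the parser would set `isBitField` to true for it (the enum name and members are invented for this example):

```
enum FilePermission :: t_byte {
    None    = 0,
    Read    = 0b001,
    Write   = 0b010,
    Execute = 0b100,
}
```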
Variables and Properties
Fields (Variables)
Fields are named variables that can be defined at module scope, in classes/structs (instance or static), or in global contexts. They are declared with the `field` keyword, a type, and a name, plus an optional initializer. For example: `field t_int32 count 5`.
A field declared at module level, or in a class with no `global` modifier, is an instance field (for classes/structs) or a module variable. Fields in a `global {}` block, or with the `global` modifier, are static (there is a single storage location for the field, not tied to an instance).

The syntax is always `field <Type> <Name> [initial_value]`. The initial value is optional; if omitted, the field's value defaults to 0 / false / null depending on type. If provided, it must be a constant expression (literal), because it is assigned at compile time.

Type notation: All built-in primitive types in Ubytec source are prefixed with `t_`. For example, `t_int32` is a 32-bit int, `t_bool` a boolean, `t_float64` a double-precision float. (User-defined class/struct names and enum names are used as-is, without prefix.) The `field` parser consumes a token with scope `storage.type.*`, which means the type can also include nullability/array annotations (see Type System below).

Initializers: If an initializer is present, the parser captures it as a token (it doesn't evaluate expressions here – only simple constants are allowed). The field is then stored in the AST with an initial value string if provided. At code generation, if a field has an initializer:

- For numeric or boolean values, the value is embedded in the data section.
- For a string literal, the compiler emits the string bytes and a null terminator in the data section.
- If there is no initializer, the field is allocated in BSS (zero-initialized memory) if it's a property, or left as 0 in the data section for a field.

In the output assembly, a field named `X` gets a label like `X_guid: dq 5` (for a 64-bit value initialized to 5, as an example) or an appropriate data-size directive based on type. For example, a `t_int32` field initialized to 10 would compile to `fieldName_guid: dd 10`. The compiler picks `db`, `dw`, `dd`, or `dq` for 8-bit, 16-bit, 32-bit, or 64-bit storage respectively. Larger types (like 128-bit) are stored as multiple `dq` entries. A string initializer `"Hello"` would produce `fieldName_guid: db "Hello", 0`.

Const fields: If a field is marked `const`, it must have an initializer, and the field's value is considered a compile-time constant. The compiler will replace references to that field with the constant value (this is not fully implemented yet, but that is the concept). `const` fields are placed in the data section like normal fields (they are not stored in a read-only section specifically, though that could be a future enhancement). Note that currently there is no separate syntax for constants; you just use the `const` modifier on a field.

Readonly fields: Marking a field `readonly` means it should only be assigned in an initializer or constructor. The compiler enforces this by raising an exception if a `readonly` field is assigned outside of those contexts (since constructors aren't explicitly supported yet, effectively any attempt to assign to a readonly field after definition would be invalid). This is partially enforced in the Validate logic for Field (no direct enforcement yet beyond disallowing conflicting const+readonly).

Memory layout: All fields in the same module or type are stored sequentially in memory (in the order declared). The exact layout in memory is handled by the assembler; Ubytec does not support explicit alignment directives in the source, but the compiler's `MemoryRegionTemplate` definitions show that fields and variables are intended to be aligned according to their size (e.g., 4-byte alignment for int32). In practice, NASM aligns `dd` and `dq` on natural boundaries by default (and the compiler emits each field on its own line, so alignment is handled).

Field references: To use a field inside code, the language currently lacks a high-level syntax. One must either rely on future support for variables in expressions or use opcodes like `LOAD`/`STORE` (see Bytecode section). For example, if you have a global field `X` and you want to push it onto the stack, you would use `LOAD` with X's label or address (this is low-level; a planned feature is to allow simply using `X` in expressions and have the compiler insert the load). This is work in progress.
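As a rough, hedged sketch of the data-section mapping described above (the field names are invented, real labels carry a GUID suffix, and the exact directives depend on the compiler version):

```
module(name: "DataDemo", version: "0.1", author: "Example") {
    field t_byte  flags 1      // ~ flags_<guid>: db 1
    field t_int16 port  8080   // ~ port_<guid>:  dw 8080
    field t_int32 count 10     // ~ count_<guid>: dd 10
    field t_int64 total 0      // ~ total_<guid>: dq 0
}
```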
Properties
A property in Ubytec is like a high-level variable with encapsulated accessors (similar to properties in C#). Properties are declared by specifying a type and a name without the `field` keyword; the parser treats such a declaration as a property when it is not preceded by `field`. For instance, `t_int32 count` (at module or class scope) declares a property named `count` of type `int32`. You can then optionally provide a property body with `get`, `set`, or `init` accessor functions.
The parser rule is: if it encounters a token of scope `storage.type.*` followed by an identifier, not preceded by `field`, in a context where a field could appear, it treats it as a property declaration. It then looks ahead: if the next non-whitespace token after the identifier is a `{`, it parses a property body; otherwise, it treats the declaration as an auto-property with no explicit accessors.

Auto-properties: If you write `t_type Name;` with no `{ }`, the compiler will create a property with default getters and setters. Internally, it synthesizes an `AccessorContext` with no custom functions, which means:

- If this property is in a class/struct (instance property), it behaves like a field that can be gotten and set freely.
- If it's in an interface, or declared with no body and no modifiers in a class, it is by default a read-write property. (However, since a backing field is not generated automatically yet, an auto-property is currently functionally equivalent to a public field in terms of generated code – it is stored as a field in the `.bss` segment if at module/class level, and simple get/set would just read/write that memory. This is an implementation detail; conceptually it's a property.)

Accessor definitions: If you include a `{ ... }` after the property name, you can define one or more accessor functions:

- `func get { ... }` – defines the getter. This must be a function with no parameters that returns the property's type. The name must be exactly `get` (case-insensitive), otherwise an error is thrown.
- `func set { ... }` – defines the setter. This must be a function taking one parameter of the property's type and returning `t_void`. The parameter name is arbitrary (often not used). The name must be `set`.
- `func init { ... }` – defines an initializer accessor, intended to be like a setter that can only be called once during object initialization. It follows the same signature rules as `set` (one parameter of the property's type, returns void). Use of `init` is by convention; currently it is treated similarly to `set`, with the restriction that it cannot coexist with a `set` accessor on the same property.

Inside the property body, no other members are allowed except these `func` accessors. The parser will skip any other tokens or throw an error if something else appears there.

The compiler wraps the get/set/init functions into an `AccessorContext` object. During validation, it ensures:

- At most two accessor functions are present (you can have `get` alone, `get`+`set`, `get`+`init`, `set` alone, etc., but not all three, and not both set and init together).
- If a `get` is present, its return type must match the property's type.
- If a `set` is present, it must return void and take exactly one argument of the property's type.
- If an `init` is present, it must return void, take one argument of the property's type, and you cannot also have a `set` (init and set are mutually exclusive since both serve to assign).
- Duplicate accessors (two gets, etc.) are not allowed.

These rules are enforced in the `AccessorContext.ValidateInternal` logic. Any violation will throw an exception at compile time.
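As an illustrative (not canonical) sketch of one combination these rules allow – a `get` paired with an `init` and no `set` – with invented names and placeholder bodies:

```
class Session {
    property t_int64 StartTime {
        func get {
            return 0            // placeholder getter body returning the property's type
        }
        func init {
            // one-time assignment during initialization;
            // a 'set' accessor may not also be declared on this property
        }
    }
}
```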
Backing storage: In the current implementation, each property, even one with custom accessors, has an implicit backing field allocated in the BSS section (uninitialized storage). The property's get/set functions operate on that storage. The compiled output for a property is essentially a pair of labeled blocks: one in BSS for the value, and one in text for the get/set code. For example, a property `Foo` in a class will produce something like:

```
Foo_GUID: resq 1            ; reserve 8 bytes (if type is 64-bit) for backing store (.bss)
...
prop_Foo_GUID_start:        ; Property: Foo (ID: GUID), Type: <TypeName>
    ... get function code ...
    ... set/init function code ...
prop_Foo_GUID_end:
```

(The property is compiled similarly to a function, with labeled start/end and its accessors in between.)

For auto-properties (no explicit get/set), the compiler still creates a backing field and would generate default get/set if needed. Currently, an auto-property is treated as having an AccessorContext with no functions, which `Property.Compile` will still output as a labeled section with just start and end and no code in between. This effectively allocates space without providing any code – meaning reads/writes to it are done as if it were a field. This is a bit of an artifact of the current compiler design; future versions may optimize auto-properties differently.

Usage: Accessing a property uses the same syntax as a variable (just use its name). The compiler will route reads through the get accessor and writes through the set. However, note that currently in the bytecode, property access isn't fully distinguished from field access unless you manually call the accessor. In practice, if you refer to `obj.Foo` in code, the compiler might inline that as a field reference or a call, depending on context (this area is WIP; right now there's no direct method call for a get in generated code – a read of a property is effectively compiled as a `LOAD` of its backing field, unless the get accessor has custom code, in which case the custom code is compiled but you would have to explicitly call it). This means that custom accessor logic is actually emitted but not automatically invoked. For now, to ensure custom logic runs, you might call the accessor function explicitly (like `obj.getFoo()`, which is not a high-level syntax but could be done via a low-level call). This is another area for planned improvements in the code generator.

Example:
```
property t_int32 Counter {
    func get {
        return [this].value     // pseudocode: load backing store
    }
    func set {
        if (value < 0) {
            return              // ignore negative
        }
        [this].value = value    // set backing store
    }
}
```
This property has a custom getter and setter that ensure `Counter` never goes negative (any attempt to set a negative value is ignored). In the compiled output, a backing field for `Counter` will be reserved (for instance as `Counter_GUID: dq 0` in BSS) and the get/set code will be placed in the text section. The syntax `[this].value` above is not actual Ubytec syntax – it represents what the get/set would do in assembly (load or store the underlying memory). Ubytec currently doesn't provide a direct way to reference "the value being set", except that in a `set`/`init` accessor the single parameter represents the incoming value (the compiler internally names it and enforces its type). In the code, you could just write:

```
func set {
    if (Arguments[0] < 0) return    // using the first argument implicitly (not actual syntax)
    // else store Arguments[0] into backing
}
```

There isn't a clean syntax for accessing the backing field from within the property; however, since the compiler is responsible for generating the setter's final assembly, you conceptually just use the parameter and assign it. This limitation in syntax will likely be improved (e.g., allowing `this.Counter` or similar within the set).

Summary: A property without accessors acts like a public field (with potential for future encapsulation), and a property with accessors acts like a pair of methods with an implicit backing store. The implemented validation ensures properties don't use invalid modifier combinations (e.g., marking a property `const` or `sealed` is disallowed, as those don't apply to properties) and that they have at most one visibility modifier.
Local Variables
Local variables (as opposed to fields) are those declared within function bodies or within a `local {}` context. In function bodies, you declare a variable by simply writing a type and a name, similar to a property, but inside a function. For example, in a function, `t_int32 i = 0` declares a local variable `i`.

However, Ubytec's current grammar does not use a separate keyword for local declarations; instead, it treats any statement starting with a type as either a declaration of a local variable or an inline opcode (depending on context). In the statement parser, if a line begins with a type token, it is recognized as an inline variable declaration opcode (`VAR`). Essentially, `t_type name value` inside a function becomes a `VAR` operation that allocates space on the stack for that variable and optionally initializes it.
Key points:
- Local variable syntax: the same `t_<Type> <name> [initialValue]` form. For instance, `t_int32 x 5` within a function will be parsed as a local variable definition.
- Locals are allocated on the stack when encountered. In the bytecode, encountering a local var triggers the generation of a `VAR` opcode, which reserves space and possibly sets an initial value. The interpreter merges this information into the function's active local variable list (so subsequent references to `x` can be resolved).
- Implementation detail: When the statement parser sees a type token at the start of a line, it collects the type, the variable name, and an immediate value token if present. It also accumulates any modifiers on the declaration (like `readonly`, etc.) into a bitflag for that var (most of which are not yet meaningfully enforced for locals). It then constructs a `VariableExpressionFragment` containing the type, name, and value, and wraps that in a `VAR` opcode object. This `VAR` is marked as an `IOpCode` with opcode 0x10 (the byte for `VAR`) and stored.
- All local variables in a function are allocated together at function entry in the current codegen. The function compiler calculates the total size needed for locals (summing the sizes of each declared local) and subtracts that from RSP once at the top of the function. It also generates label references for each local so they can be used within the function body (essentially treating them like fixed offsets in the stack frame).
- The `VAR` opcodes in the AST are used to track variable presence and initial value. When generating assembly, the compiler doesn't output a specific instruction for `VAR` (there's no runtime opcode doing allocation; it's handled by the function prologue). Instead, it uses the info to lay out the stack frame. The initial value, if any, should then be stored to that stack slot. However, at the moment, initializers for local variables are not automatically emitted as moves – this appears to be incomplete (the `VAR` opcode itself holds the initial value, but the assembly generator does not explicitly output code to set the stack memory to that value). This means a local like `t_int32 x 5` might allocate space but not actually initialize it to 5, unless the compiler handles it by converting that into a `PUSH 5` or similar. This is a known gap (WIP). As a result, be cautious: local initializers might be ignored in the current state, effectively leaving `x` undefined. This will likely be fixed so that the constant is stored.
- Local variables are mutable by default (unless you use `const` or `readonly` modifiers, which the parser will capture similarly to fields). A `const` local is conceptually possible (the compiler could treat it like an immediate), but since local assignment isn't fully implemented, this distinction is minor.
- To use local variables in expressions, you simply write their name. The compiler needs to translate that into load/store operations. At present, referencing a local variable name in an expression will be parsed as either an opcode or an operand depending on context. If you write `x = 10;`, the parser sees `x` (not a keyword or type), tries to find it in the opcode map (fails), and likely throws an error (assignment statements to locals are not properly handled yet). If you write an expression like `x + 1`, the parser would treat `x` as an operand (possibly an "entity.name.var.reference"), but there is no code to actually load x's value from the stack – the current implementation doesn't generate that, as the high-level expression parser is incomplete. Therefore, local variable usage is very limited at the moment. You can declare them, but using them in calculations may not do what you expect unless combined with explicit opcodes (like using `PUSH` and `POP` around them).
In summary, local variables are recognized and allocated, but assignment to them and their use in expressions is only partially implemented. They primarily serve as named stack slots. Future improvements will likely allow `=` to assign to them and allow reading them as part of expressions.
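A minimal sketch of local declarations as described above (the function and names are invented; note the WIP caveat that initializers may not actually be emitted yet):

```
func Accumulate() -> t_int32 {
    t_int32 total 0     // parsed as a VAR opcode: a named stack slot
    t_int32 step  5     // initializer may currently be ignored (WIP)
    // until assignment and expression support land, values are manipulated
    // with explicit stack opcodes (PUSH, ADD, STORE) rather than via 'total = ...'
    return
}
```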
Expressions and Operators
Ubytec supports a variety of operators, largely modeled on C/C#/Java syntax, for use in expressions and control-flow conditions. Expression evaluation is based on a stack-machine model: operands are pushed to an evaluation stack, and operations consume them and push results. In the current implementation, infix expressions are parsed and then internally converted to a postfix sequence of opcodes. For example, the expression `2 + 3 * 4` would be translated to push 2, push 3, push 4, multiply, then add (if the expression parser were fully implemented).

However, it is important to note that full expression parsing is still a work in progress. Simple literal comparisons and arithmetic in conditions do work, but assignment and function calls in expressions are not yet functional.
Below is a list of the operators and their semantics as per the current design (with opcodes where applicable).
Literal Constants
- Integers: An integer literal written in decimal (e.g. `42`) is by default interpreted as a 32-bit signed integer. The compiler will try to parse it into an `int` or `byte`, and if it doesn't fit those, it currently fails (it does not yet automatically pick a 64-bit type, which is a limitation). You can also specify integers in hexadecimal (prefix `0x`) or binary (prefix `0b`). Hex and binary literals are parsed and converted to an integer internally. For example, `0xFF` -> 255, `0b1010` -> 10. Negative numbers are typically expressed with a unary negation operator (e.g. `-5` is parsed as the constant `5` with a `NEG` opcode applied).
- Booleans: The literals `true` and `false` (case-insensitive) are recognized. They are stored as 1 (true) or 0 (false) of type `t_bool`. For instance, a `constant.boolean.ubytec` token is turned into numeric 1 or 0 in operands.
- Characters: A character literal is written in single quotes, e.g. `'A'`. This will be converted to its ASCII/Unicode code (e.g. 65 for 'A'). In the AST, it is treated as an integer of type `t_char8` or `t_char16` depending on context (the default is `char8` if it fits).
- Strings: A string literal is written in double quotes, e.g. `"Hello"`. In the current compiler, string literals can appear as initializers for fields or possibly as operands for certain ops, but there is no full string type. A string literal in a field initializer is stored in memory (as described in the Fields section). In code, if a string literal appears, the parser flags it (scope `string.quoted.double.ubytec`), but `ProcessOperand` does not yet enqueue it to the operand queue (no direct handling). Practically, you would not use string literals in expressions yet, except to assign to a field or pass to a syscall, etc. They are effectively pointers to static data (char arrays terminated by 0).
- Null: `null` is a literal representing a null reference. It is a keyword that corresponds to the `NULL` opcode (0x0F), which likely pushes a null pointer (0) onto the stack. The parser treats `null` as a keyword and generates a `NULL` opcode with no operands. In contexts like comparing to null or assigning null, it functions as expected (0 for a pointer). The type of null can be any reference type; currently there is no type inference for null, but semantically it can be assigned to any class type or to a nullable value type.
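An illustrative sketch collecting the literal forms above, used as field initializers since literals inside full expressions are still limited (the field names are invented):

```
field t_int32 answer  42        // decimal
field t_int32 mask    0xFF      // hexadecimal (255)
field t_byte  bits    0b1010    // binary (10)
field t_bool  enabled true      // stored as 1
field t_char8 letter  'A'       // stored as 65
```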
Arithmetic Operators
- Addition (`+`), Subtraction (`-`), Multiplication (`*`), Division (`/`), Modulo (`%`): These work on numeric types (integer or float). Each maps to an opcode: `ADD` (0x20), `SUB` (0x21), `MUL` (0x22), `DIV` (0x23), `MOD` (0x24), respectively. The expectation is that the two top values on the stack are taken, the operation is applied, and the result is pushed. The type of the result follows the usual rules (two ints give an int, two floats give a float, etc.). Mixed-type arithmetic might require explicit casts – the compiler does have conversion rules (widening, etc.) defined in `ValidateImplicitCast` and `ValidateExplicitCast`, but automatic promotion is not fully implemented. So currently, you should operate on matching types.
- Negation (unary `-`): This uses the `NEG` opcode (0x27), which negates the top-of-stack number (two's complement for integers, arithmetic negation for floats). You write `-x` in source. The parser will typically treat the `-` in front of a literal or variable as a unary operator and generate a `NEG` after pushing the operand.
- Increment `++` and Decrement `--`: These exist as tokens (scopes `operator.increment.ubytec` and `operator.decrement.ubytec`). They map to `INC` (0x25) and `DEC` (0x26) respectively in bytecode. However, their usage in source is tricky, since the grammar does not have explicit prefix/postfix rules implemented. They are recognized if used in a statement like `++x;` or `x++`, but the actual effect depends on the parse. The current compiler likely only supports them in a simple statement context (and would just emit an INC/DEC opcode targeting that variable). Because variable reference handling is incomplete, `++x` might not properly increment a local variable at runtime (it would push x, increment the value, and push the incremented value onto the stack, but not store it back unless combined with a store). This is WIP. So, while the `INC` and `DEC` opcodes exist, use of `++`/`--` is not reliably functioning on user variables yet.
- Absolute value (`ABS`): There is an `ABS` opcode (0x28) to get the absolute value. There is no high-level operator symbol for it; it would be used via a function or intrinsic call (perhaps a planned `abs()` function). If needed, one could use the opcode directly in inline assembly form. In code, you might not use `ABS` explicitly, but know that it exists for generating absolute values of numbers (turning negative to positive).
- Exponentiation (`**`): The grammar defines an `operator.exponentiation.ubytec` for a power operator (perhaps `**` or `^` in some contexts). The opcode map does not list a dedicated opcode for exponentiation (it might be intended to be handled by a runtime library call or an Extended opcode in the future). Currently, using `**` in source might produce tokens but no direct handling – consider this feature planned but not implemented. For now, raising to a power would require calling a library function or using repeated multiplication.
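For illustration, a hedged sketch of how a simple arithmetic condition would lower to the opcodes above under the stack-machine model (variable loading is still WIP, so the sequence actually emitted may differ):

```
// Ubytec source (inside a function body)
if (a + b * 2) > 10 {
    // ...
}

// conceptual postfix lowering of the condition:
//   PUSH a ; PUSH b ; PUSH 2 ; MUL ; ADD ; PUSH 10 ; GT
```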
Bitwise and Logical Operators
Bitwise AND (`&`), OR (`|`), XOR (`^`), NOT (`~` or `not`): These operate on integral types (and on booleans, where the bitwise ops can serve as logical ops since true/false are treated as 1/0). Opcodes: `AND` (0x30), `OR` (0x31), `XOR` (0x32), `NOT` (0x33). The parser actually treats the textual keywords `and`, `or`, `xor` as bitwise operators (scoped as `keyword.operator.bitwise.ubytec`) and the symbols `&`, `|`, `^` likely as `operator.bitwise-and`, etc. Either form may be accepted. For example, `a & b` or `a and b` should both result in an `AND` opcode. Similarly, `~x` or `not x` would result in a `NOT` opcode (bitwise complement of all bits) on x.

Bit shifts: Left shift `<<` and right shift `>>` are supported. `SHL` (0x34) is shift-left, `SHR` (0x35) is shift-right. It is not explicitly specified, but `SHR` is likely an arithmetic right shift (preserving the sign of signed numbers). The language also defines tokens for unsigned shifts (perhaps `<<<` and `>>>` or some similar syntax). The scope names `operator.unsigned-left-shift.ubytec` and `operator.unsigned-right-shift.ubytec` appear, but there are no distinct opcodes for them in the 0x30–0x3F range. This suggests either a plan to combine a flag with `SHL`/`SHR` or to handle `>>>` as a special case of `SHR` for unsigned operands. As of now, using `>>>` in code might simply map to `SHR` as well. We mark unsigned shift as WIP. Use `<<` and `>>` as normal; if you need a logical (zero-fill) right shift on a signed number, ensure you cast to an unsigned type (e.g., a larger type) before shifting.

Logical AND (`&&`) and OR (`||`): These are short-circuiting boolean operators. In the grammar, `&&` is recognized as `operator.logical-and.ubytec` and `||` as `operator.logical-or.ubytec`. There is no specific opcode for logical-and/or, because short-circuit logic is typically implemented with jumps. The current compiler likely handles `&&` and `||` during condition compilation by branching: `a && b` is compiled like: evaluate `a`; if false, skip evaluating `b` and the result is false; if true, evaluate `b` and that result is the result. `a || b` similarly: if `a` is true, the result is true immediately; otherwise evaluate `b`.

In practice, the AST builder might translate `&&` into something like: if the first operand is 0, skip the second and push 0; else evaluate the second. This would involve the `IF`, `BRANCH`, and `END` opcodes. Indeed, the presence of structured `IF` and `END` opcodes facilitates a short-circuit implementation. So logically, `&&` and `||` do work in `if` conditions. For example, an `if (x && y)` might generate something akin to:

```
IF x != 0 {
    IF y != 0 {
        ...body...
    }
}
```

with an appropriate `else` to set the condition false. This is not explicitly in code, but the design allows it. As an expression (for assignment), `&&`/`||` returning a boolean isn't fully tested but should yield 1 or 0 as well.

Logical NOT (`!`): The logical negation operator for booleans is `!`. It is recognized as `operator.negation.ubytec`. Likely, the compiler will implement `!expr` by comparing expr to 0 and pushing 1 if zero, 0 if nonzero. There is no single `NOT` opcode for logical not (since `NOT` is bitwise). So `!` might be implemented as a small sequence (e.g., an `EQZ` if it existed, or using `== 0`). In fact, the grammar could treat `!x` as `x == false` and output an `EQ` (equality) opcode with 0. The simplest way is to use the comparison opcodes: `EQ` with operand 0. (An `EQZ` opcode could be an extended idea but is not present.) For now, you can use `!` in conditions and it should produce the correct result.
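A small hedged sketch combining these forms (the names are invented; as noted, variable references in expressions are still limited, so treat this as illustrative syntax only):

```
if (flags and 0b0100) != 0 && !(mode == 3) {
    // 'and' lowers to AND (0x30), '!=' to NEQ, '==' to EQ,
    // and '&&' / '!' are realized with structured IF blocks and comparisons
}
```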
Comparison Operators
The comparison operators yield a boolean result (`t_bool`, represented as 1 or 0). They are:

- Equality `==` and Inequality `!=`: Map to opcodes `EQ` (0x40) and `NEQ` (0x41). These compare the two top stack values (after pushing the operands) and push 1 if equal (`EQ`) or 1 if not equal (`NEQ`), otherwise 0. They work for numeric types, booleans, and also for pointers/references (where equality means same address or both null). For user-defined reference types, `==` currently just compares pointers (no deep equality override yet).
- Less than `<`, Less or equal `<=`, Greater than `>`, Greater or equal `>=`: Map to `LT` (0x42), `LE` (0x43), `GT` (0x44), `GE` (0x45). These perform signed comparisons for numeric types (and presumably work for chars as numbers). The result is 1 if the comparison is true, 0 if false.

All these comparison opcodes are designed to consume two values and push a boolean. They do not short-circuit (they are binary operators that always evaluate both sides).

The parser identifies these via tokens like `operator.less-than.ubytec`, etc. During parsing of a condition like `a < b`, it will likely produce: push `a`, push `b`, then an `LT` opcode. If combined in complex expressions, the order of evaluation and insertion of these is governed by operator precedence (which the grammar likely encodes in the regex patterns or via the AST builder's logic). Precedence in Ubytec follows typical rules: arithmetic `* / + -` > comparisons > logical `&&` > logical `||`. The compiler's internal `Parse` function uses a single pass to create opcodes, and the existence of an `OpcodeMap` with e.g. `LT` suggests it directly inserts the `LT` instruction when encountering `<`.

Example: `if (x != 0 && y < 5) { ... }`. The condition involves `!=`, `<`, and `&&`. The compiler will:

- Parse `x != 0` to a sequence: push x, push 0, `NEQ`.
- Parse `y < 5` to: push y, push 5, `LT`.
- Parse the `&&` between them to implement short-circuiting. Likely it wraps these in an `IF` structure under the hood. The final bytecode might be something like:

```
PUSH x; PUSH 0; NEQ     ; result of first condition
IF (pop value as condition)
    PUSH y; PUSH 5; LT
ELSE
    PUSH 0              ; if first was false, second is skipped -> result false
END
```

and then an `IF` outside to decide the whole `if` statement. While this is not explicitly shown in code, the design using structured `IF`/`ELSE`/`END` opcodes supports this pattern.

Note: There is currently no direct support for comparisons of strings or complex types – those would require user code (e.g., iterating or a custom function). Enums are compared by their underlying values (so you can use `==` etc. on enums directly). Booleans compared with `==`/`!=` just compare 0/1.
Other Operators
Assignment (`=`): As mentioned earlier, the assignment operator is recognized (`operator.assign.ubytec`) but not fully implemented in code generation. Normally, `=` would not have its own opcode but would translate to a store operation. The plan is that `a = b` results in evaluating `b` and then storing it into `a`'s location. In bytecode, this would be done via a `STORE` instruction. Indeed, there are opcodes reserved for memory access: `LOAD` (0x50) and `STORE` (0x51). The idea is:

- `LOAD var` pushes the value of variable `var` onto the stack.
- `STORE var` pops the top of the stack and writes it into `var`.

The compiler currently does not automatically emit `LOAD`/`STORE` for variables when you use them in expressions or assignments – this is a work in progress. If you try `x = y;` in code, the parser will catch it, but the resulting opcodes may be incomplete. The commented-out block in `ProcessOperand` hints at how it might work: the plan was to use an `@` symbol to indicate a variable reference and replace it with the variable's value, but that is not finalized. For now, think of `=` as a placeholder. In practice, to set a variable you would use a `STORE`: e.g., you might have to resort to inline assembly like:

```
// hypothetical example
PUSH 10
STORE i
```

to assign 10 to `i`. This low-level approach might be the only way until assignment is finished.

Member access (`.`) and scope resolution (`::`): Ubytec uses `.` for accessing members of objects (like `obj.field` or `obj.method()`), and `::` appears in the grammar mostly for enum underlying types and possibly, in the future, for static accesses (`Type::Member`). The scope name `punctuation.scope.ubytec` is used for the `::` after an enum name. Member access `.` is not explicitly shown in the grammar scopes, but it is likely handled contextually (as part of parsing an identifier or function call). At runtime, non-static field access would involve an offset from an object pointer – that support depends on the implementation of classes and is incomplete. Since inheritance and virtual dispatch are not there, and all fields in an object are at fixed offsets, the compiler could compute those offsets for field access. But currently, there is no codegen for `obj.field` beyond treating it as perhaps a `LOAD` from the object's base pointer plus an offset. This is WIP. Similarly, `obj.method()` would require calling the method with `obj` as context (like passing `this`). In short, member access works for static references (like calling a global function, or accessing a global field via module name if multi-module support were there), but for actual object instances there is missing infrastructure (since allocating objects and calling methods on them isn't fully implemented).

Conditional operator (`? :`): There is no mention of a ternary conditional operator in the grammar or code. It is likely not supported in this version of Ubytec.

Pipeline (`|>`), Pipe-In (`<|`), Pipe-Out (`|<`), Spread (`...`), Optional chaining (`?.`), Null coalescence (`??`), etc.: These appear as tokens in the grammar (scopes like `operator.pipe.ubytec`, `operator.pipe-in.ubytec`, `operator.spread.ubytec`, `operator.optional-chaining.ubytec`, `operator.nullable-coalescence.ubytec`). They hint at higher-level language features (functional pipelining, variadic spread, safe navigation, etc.). At present, these are not implemented in the interpreter logic. They were likely reserved for planned features:

- `|>` and `|<` could be for function composition or data piping (as seen in F# or Elixir).
- `??` would return the left operand if it is not null, otherwise the right operand.
- `?.` would provide safe navigation when an object may be null (to avoid a NullReference, returning null instead).
- Spread (`...`) would expand arrays/tuples into arguments.

None of these have corresponding opcodes or handling in the ASTCompiler (they would require multiple steps or variadic handling). They are considered planned/WIP. Using them in code currently will not work (they might be tokenized, but the compiler wouldn't know how to compile them to assembly).

Exception handling and others: The grammar lists `keyword.exception.ubytec`, `keyword.threading.ubytec`, `keyword.system.ubytec`, `keyword.ml.ubytec`, `keyword.quantum.ubytec`, `keyword.power.ubytec`, `keyword.audio.ubytec`, `keyword.security.ubytec`, `keyword.vector.ubytec`, etc. These suggest potential domains for future keywords (for example, `throw`/`try` for exceptions, threading operations, system calls, machine learning, quantum computing primitives, vector/SIMD operations, etc.). Currently, none of these keywords have concrete syntax defined or implemented semantics. They are placeholders in the grammar, likely to reserve those words or categorize tokens if they appear. In the current state:

- There is no `try/catch` or `throw`. The only "exception" mechanism is the `TRAP` opcode, which causes a runtime trap (like a `ud2` illegal instruction).
- No threading keywords are active; concurrency isn't implemented.
- `syscall` is a recognized keyword (for making system calls). In the assembly output, the compiler uses the `syscall` instruction to exit. It is possible to invoke a system call in Ubytec by writing `syscall` in code – the grammar recognizes it (scope `keyword.syscall.ubytec`), and the ASTCompiler would likely treat it as a direct assembly instruction. In fact, `syscall` appears as just a word that is passed through as an I/O operation. There is no higher-level API around it, so it is essentially inline assembly. This is how the program exits (the compiler literally outputs `mov eax, 60; xor edi, edi; syscall` for exit). Similarly, other raw instructions could be exposed via keywords in the future (like `interrupt` or privileged instructions under `keyword.system`).

All these specialized areas are planned and currently inactive.
Bytecode and Opcode Reference
Under the hood, Ubytec code is compiled into a sequence of opcodes (bytecode instructions), some of which correspond directly to high-level statements or operations, and others are used internally for structured control flow. The bytecode is designed such that it can be assembled to real machine code (as is done now targeting NASM x86-64) or potentially interpreted by a virtual machine. Each opcode is a 1-byte value, sometimes followed by operands (immediate data like constants or indices). The implemented opcodes and their functionality are listed below. (If an opcode is marked WIP, it is defined but not fully utilized in code generation yet.)
Control flow opcodes (structured):
- `0x00 TRAP`: Triggers a trap/interrupt. This is used to signal an unrecoverable situation. In assembly it compiles to an undefined instruction (`ud2`), which crashes the program if executed. (Think of it as a deliberate crash or breakpoint – usable for debugging or as a stub for "not implemented".)
- `0x01 NOP`: No operation. Does nothing and continues execution. Can be used as a filler or placeholder.
- `0x02 BLOCK`: Starts a structured block. `BLOCK` may take a type operand indicating the block's result type (or `t_void` if it yields nothing). It opens a new scope for control flow. In high-level terms, `block` groups a series of instructions (like `{ }` in a high-level language, but also used to implement constructs such as switch-case or try-catch structures). In assembly output, a `BLOCK` might correspond to a label marking the start of the block.
- `0x03 LOOP`: Starts a loop block. This is like `BLOCK` but indicates a loop construct, typically used with `BREAK`/`CONTINUE`. In a `while` or `for` loop, the compiler might use `LOOP` to mark the point to jump back to. It can also carry a type, like `BLOCK` does.
- `0x04 IF`: Begins an `if` conditional block. It expects a condition value (boolean) on the stack – if the value is nonzero (true), the block executes; if zero (false), execution jumps to the corresponding `ELSE` or `END`. The `IF` opcode internally holds the condition expression parsed after it (the compiler sets up the condition before emitting `IF`). In assembly, `IF` is implemented as a conditional jump past the block if the condition is false.
- `0x05 ELSE`: Begins the `else` portion of an `if` block. It is only valid after an `IF` (structured). When execution reaches an `ELSE` coming from the `if` part, it jumps to after the `END` of the if-block (skipping the else-part). In assembly, `ELSE` corresponds to an unconditional jump taken if the `if` part was executed, plus a label marking the start of the else-block. The interpreter tracks the matching `IF` for each `ELSE`.
- `0x06 END`: Closes a `BLOCK`, `LOOP`, `IF`, or other block structure. Every structured block opener (`BLOCK`, `LOOP`, `IF`, `ELSE`, `SWITCH`, etc.) has a matching `END`. `END` may also end a `BRANCH` construct (see below). In assembly, `END` corresponds to a label marking the exit of the structure.
- `0x07 BREAK`: Exits from a loop or switch. This causes an immediate jump to the instruction after the matching `END` of the nearest enclosing `LOOP` or `SWITCH` – the equivalent of `break` in C-like languages. It only functions properly inside a `LOOP` or `SWITCH` block; used elsewhere, it might be ignored or throw (the current compiler doesn't explicitly check context before emitting it, but logically it should appear within a loop). Status: Implemented in the opcode map and parser (it is recognized), but no special compile-time check is done beyond structured control handling.
- `0x08 CONTINUE`: Jumps to the next iteration of the nearest loop (to the loop's `LOOP` point). Effectively, execution jumps to just before the `END` of a `LOOP`; in a `while`, this means the condition is checked again. Like `BREAK`, it should only be used inside a loop. Status: In the opcode map and grammar; structured handling treats it like `BREAK`.
- `0x09 RETURN`: Returns from the current function, optionally with a value. Semantically, it breaks out of the entire function's execution. In implementation, encountering a `RETURN` would generate a jump to the function's end label (`func_name_end`). In the current assembly output, the compiler always appends a `ret` instruction at the function end, so a `RETURN` opcode encountered in bytecode should jump to that `ret`. The interpreter's design currently doesn't output a jump for the `RETURN` opcode (no special compile in `CompileBlockNode` aside from grouping it) – this is an area to check; the function compiler should likely convert `RETURN` opcodes into actual `jmp` instructions to the end. Conceptually, use `return expr;` in code to break out. Status: Present in bytecode; works in simple cases (the sample main uses an implicit return).
- `0x0A BRANCH`: A general "branch table" construct for switches. It opens a block like a combination of `IF` and `LOOP`, specifically for `SWITCH` handling. The idea is that `SWITCH` pushes a value, and a series of `BRANCH` opcodes then jump to the matching case. In practice, `BRANCH` likely holds an array of label offsets (`LabelIDxs` in the AST schema) for case labels, and its first operand may be the number of cases. Upon a `SWITCH`, the compiler would generate one or more `BRANCH` instructions representing `case` labels to jump to; each `BRANCH` acts like an `IF` on the switch value (or a jump table using the value as an index). However, the current parser doesn't handle a `case` keyword (there is none in the grammar), so how `BRANCH` is used is not obvious. The compiler may be intended to implement switch with multiple `IF`/`ELSE` blocks, with `BRANCH` reserved for a future jump-table optimization. In the opcode handling, `BRANCH` is treated similarly to `IF` (it inherits variables into its scope, etc.). Status: Present but not triggered by any high-level syntax in the current implementation (no `case` keyword), so effectively WIP.
- `0x0B SWITCH`: Marks the start of a switch-case structure. In source, `switch` would be followed by a value and a block of cases. The Ubytec grammar has a `keyword.control.flow.ubytec` scope for `switch` and `default` and would presumably parse `switch(cond) { ... }` into a `SWITCH` opcode plus an internal structure for the cases. The `SWITCH` opcode likely takes the switch condition as input and prepares a branch table. In the AST, a `SWITCH` node is a block node with a child for each case (with `BRANCH` opcodes), a child for `DEFAULT` if present, and a closing `END`. The current compiler does put `SWITCH` in the opcode map and pushes it onto the block stack like other blocks, but since `case` handling isn't explicitly coded, using `switch` may not fully work; for now, a switch can be simulated with an if-else ladder. Status: Partially implemented; the `switch` keyword is recognized and the structure exists, but case labels are WIP.
- `0x0C WHILE`: Opens a structured loop with a condition at the start (a typical `while` loop). In high-level code, `while(cond) { body }` is handled by a single `WHILE` opcode that includes the condition expression and an implicit loop-back. The compiler could lower a `while` into a combination of `LOOP` and `IF` opcodes, but it has a dedicated `WHILE` opcode for convenience; the `WHILE` opcode in the AST carries the loop condition (it has a `Condition` field). The typical lowering would turn `WHILE cond { ... }` into:

  ```
  LOOP (block)
    IF cond {
      ...body...
      BRANCH (back to LOOP start)
    }
  END
  ```

  However, `WHILE` is treated as a first-class opcode, possibly to simplify codegen. The interpreter pushes `WHILE` onto the block stack like an `IF`/`LOOP` combination and pops it on the matching `END`. In assembly output, a `WHILE` would generate the loop condition check and a jump back at `END`. The actual implementation of `WHILE` in assembly is not explicitly printed; presumably the Compile step outputs a label at the `WHILE` start and a conditional jump at the end. Status: Implemented in the parser and opcode map; should work for basic loops. For example, `while x != 0 { ...body... }` would compile to something logically like:

  ```
  start_loop:
      cmp [x], 0
      je end_loop
      ...body...
      jmp start_loop
  end_loop:
  ```

  The `WHILE` and `END` opcodes encapsulate that pattern.
- `0x0D CLEAR`: Clears the evaluation stack. It is a utility to discard any temporary values and reset the stack depth to what it was at block entry. In practice, the compiler might emit `CLEAR` at the end of a block that is supposed to have void type, to make sure nothing is left on the stack; it could also be used by the programmer to drop all runtime stack data (though typically not exposed directly). In the current code, `CLEAR` is listed in the map and treated as a neutral instruction (no stack effect besides clearing). In assembly, it would translate to adjusting RSP if needed. Since expression stack management is mostly static, `CLEAR` might rarely be needed; it is likely WIP or used defensively. Use of `clear` in source is not documented, but one could imagine using it to abandon any computed results (a stack reset).
- `0x0E DEFAULT`: Marks the default case in a switch. It would be used inside a `SWITCH` block to indicate the start of the default block (executed if none of the cases matched). The parser likely expects a `default:` label and translates it to a `DEFAULT` opcode. Similar to `ELSE` for `if`, `DEFAULT` takes no condition and serves as a jump target. The OpcodeMap includes `DEFAULT`, and `CompileBlockNode` treats it as a neutral operation (likely just a label for the default section). Status: The keyword `default` is recognized and the structure exists, but since `switch`/`case` isn't fully operational, `DEFAULT` is effectively WIP as well.
- `0x0F NULL`: Not actually control flow, but listed here in sequence. `NULL` pushes a null reference (zero) onto the stack and is used to implement the `null` literal. In assembly, it might be just `xor rax, rax; push rax` (or similar). It takes no operands. This opcode is fully implemented; using `null` in an expression will generate it.
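As a hedged source-level sketch of these control-flow constructs (assuming `if`/`else` accept a condition the same way `while` does; the exact statement syntax in the current parser may differ):

```
// assuming x is an in-scope t_int32 local that the elided body eventually changes
while x != 0 {        // WHILE (0x0C) carries the condition; END (0x06) closes the loop
    if x == 1 {       // IF (0x04) tests the condition
        break         // BREAK (0x07) jumps past the loop's END
    } else {          // ELSE (0x05)
        continue      // CONTINUE (0x08) jumps back to the loop head
    }                 // END of the if/else
}                     // END of the while
```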
Stack and data opcodes:
- `0x10 VAR`: Declares an inline local variable (already covered in the Variables section). The `VAR` opcode carries the variable's type, name, and initial value (packed in a `VariableExpressionFragment`). At runtime, it doesn't produce machine instructions except reserving stack space (which is handled in the function prologue); it's more like metadata. Multiple `VAR` opcodes in a row can appear to allocate multiple locals. As a result, `VAR` is primarily handled at compile time. Status: Implemented in the parser and used for local declarations.
- `0x11 PUSH`: Pushes an immediate value onto the stack. The `PUSH` opcode takes a byte-array operand (which can represent an arbitrary literal of any size). It is used for constants that aren't handled by specific opcodes – for example, pushing a large constant or a pointer might use `PUSH`. In current usage, numeric literals are often enqueued as operands to other opcodes (like `ADD` or comparisons) rather than a standalone `PUSH`, but you can explicitly write `push 5` in Ubytec code to push the value 5 (though that's more akin to assembly embedding). Status: Defined, but code generation of expressions might inline pushes implicitly. Compiling a `VAR` with an initial value uses `PUSH` internally on the value and then stores it.
- `0x12 POP`: Pops the top of the stack, discarding a value. It's used to remove unwanted results. For instance, if a function returns a value you don't need, the compiler might emit a `POP` to remove it, or pop residue at the end of a block returning void. You can also explicitly call `pop` in code to drop one stack value. Status: Present (though `POP`'s `Compile` is a `NotImplemented` stub, the assembly equivalent would be an actual `pop` from the CPU stack).
- `0x13 DUP`: Duplicates the top stack value (pops one value and pushes it twice, effectively). Used when you need to use a value twice without reloading it. For example, computing `x + x` might push `x`, then `DUP` it, then `ADD`. Status: Implemented as an opcode, but compile logic does not yet generate it automatically in any known case (since expressions are simple). It can be used manually.
- `0x14 SWAP`: Swaps the two topmost stack values. If the stack has `... A B` (A at a lower position, B on top), after `SWAP` it will have `... B A`. Use case: to reorder arguments or results without using memory.
- `0x15 ROT`: Rotates the top three values: it takes the third value down and brings it to the top, pushing the others down. Stack `... A B C` becomes `... B C A` (where A was at position 3, moved to top).
- `0x16 OVER`: Copies the second value to the top. Stack `... A B` becomes `... A B A`. (This is like a `DUP` of the second element down.)
- `0x17 NIP`: Removes the second value, leaving only the top. Stack `... A B` becomes `... B` (A is removed). Essentially it "nips out" the one below the top.
- `0x18 DROP`: Removes an element at a given stack index. In Ubytec, `DROP(byte index)` is defined: `DROP 0` would drop the top (same as `POP`), `DROP 1` would drop the second-from-top (like `NIP`), `DROP 2` would drop the third, and so on. This is a generalized `NIP`. If the general form is not needed, one can use `POP` or `NIP`, but `DROP n` gives flexibility.
- `0x19 TwoDUP`: Duplicates the top two values as a pair. Stack `... A B` -> `... A B A B`.
- `0x1A TwoSWAP`: Swaps the two top pairs. Stack `... A B C D` -> `... C D A B` (where A B is pair 1, C D is pair 2).
- `0x1B TwoROT`: Rotates the top three pairs (six values). For instance `... P Q A B C D` -> `... A B C D P Q` (assuming P Q is pair 1, A B pair 2, C D pair 3).
- `0x1C TwoOVER`: Copies the second pair to the top. Stack `... A B C D` -> `... A B C D A B`.
- `0x1D PICK`: Copies an arbitrary stack element to the top. `PICK n` takes the value at depth n (0 = top, 1 = second, etc.) and pushes a copy of it on top without removing the original. E.g., with stack `... X Y Z`, `PICK 2` (where 0 = Z, 1 = Y, 2 = X) results in `... X Y Z X`. Forth's `pick` is 0-indexed from the top or 1-indexed depending on dialect – here 0 likely means the top (so `PICK 0` equals `DUP`).
- `0x1E ROLL`: Moves an arbitrary stack element to the top, removing it from its original place. `ROLL n` takes the value at depth n and lifts it to the top, shifting down all values that were above it. E.g., with stack `... A B C D`, `ROLL 2` (where 0 = D, 1 = C, 2 = B, 3 = A) would take B out and push it on top -> `... A C D B`.
(All the above stack ops are analogous to those in Forth/stack-VM terminology. These opcodes are implemented in the bytecode specification, but many are not yet utilized by high-level code generation – they are available for manual stack manipulation and will be used by the compiler as needed once expression optimization and more complex codegen are implemented. Currently, you would mostly see the simpler ones, like POP or DUP, in potential output.)
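As an illustration of the stack effects above, here is a hedged sketch of manual stack manipulation, assuming the lowercase mnemonics can be written bare in source the way `push`/`pop`/`dup` are described; comments show the stack after each line, top on the right:

```
push 1      // 1
push 2      // 1 2
push 3      // 1 2 3
rot         // 2 3 1      (third value moved to the top)
over        // 2 3 1 3    (copy of the second value pushed on top)
swap        // 2 3 3 1
nip         // 2 3 1      (value below the top removed)
drop 2      // 3 1        (element at depth 2 removed)
pop         // 3
pop         // (empty)
```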
Arithmetic opcodes:
- `0x20 ADD`: Pops two values, adds them, pushes the result. Works on numeric types (integers, floats). If the operands are integers and there's risk of overflow, it is not trapped (no overflow checking in bytecode). For pointer types, addition could be defined (e.g., pointer + int to offset), but that's not explicitly supported now.
- `0x21 SUB`: Subtracts the top from the second-from-top and pushes the result. (If the stack was `... X Y`, it does X - Y.)
- `0x22 MUL`: Multiplies the two top values.
- `0x23 DIV`: Divides the second-from-top by the top and pushes the quotient. For integers this is truncated division toward 0 (like C). Division by zero will cause a trap at runtime (currently not caught – it will likely produce a CPU exception, though a later check could throw).
- `0x24 MOD`: Remainder of the second-from-top divided by the top (X mod Y). The sign of the result follows C/C++ semantics (for positive divisors, the remainder has the sign of the dividend; for negative divisors or a negative dividend, behavior should mimic the hardware `idiv` remainder).
- `0x25 INC`: Increments the top of the stack by 1 (pop, add one, push the result). If implemented as an in-place op, it might just adjust memory when directly tied to a variable, but in a stack machine it's easier to treat it like a normal add of the constant 1.
- `0x26 DEC`: Decrements the top of the stack by 1.
- `0x27 NEG`: Negates (arithmetic negation) the top of the stack. If the top is an int X, it pushes -X.
- `0x28 ABS`: Absolute value. If the top is negative, it is negated; otherwise it is left unchanged, so the result is non-negative.
(These arithmetic ops, when used on floats vs. ints, require knowing the type. The bytecode doesn't carry type info in the opcode itself; the responsibility is on the compiler, and ultimately the code generator, to use the correct machine instruction. For example, `ADD` will produce either an integer add (`add`) or a floating-point add (`addss`/`addsd`, etc.) depending on the types of the operands. The AST carries type info (the UType of the operands), so the code generator uses that. The validator ensures type compatibility – e.g., you can't add an int32 and a float64 without a cast. Implicit numeric widening (e.g., int to float) might be allowed; the `CanConvert` logic in `Types` suggests it differentiates implicit vs. explicit conversions, but currently the parser does not insert conversion opcodes. Probably, if you add a float and an int, the int would be up-cast to float by the parser inserting a type code in the operand (the operand queue can carry type bytes for a literal). In the future, an explicit `CAST` opcode may be introduced for runtime conversions. For now, ensure operands are of the same type to avoid issues.)
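A hedged sketch of same-typed integer arithmetic using the bare mnemonics (their availability as inline keywords, like `push`/`pop`, is an assumption); comments show the stack after each step:

```
push 10     // 10
push 3      // 10 3
div         // 3          (10 / 3, truncated toward 0)
push 4      // 3 4
mul         // 12
neg         // -12
abs         // 12
```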
Bitwise opcodes:
- `0x30 AND`: Bitwise AND of the two top integers/booleans (pops two, pushes the bitwise AND).
- `0x31 OR`: Bitwise OR.
- `0x32 XOR`: Bitwise XOR (exclusive or).
- `0x33 NOT`: Bitwise NOT (one's complement) of the top value.
- `0x34 SHL`: Arithmetic left shift. Takes the two top values: the second is the value, the top is the shift amount (this is the likely order). So if the stack has X (value) and Y (shift), after `SHL` it pushes (X << Y). Bits shifted out are discarded. Left shift of a signed int is the same as for an unsigned one (the bits just move left).
- `0x35 SHR`: Right shift. Probably arithmetic (preserving the sign bit) for signed values; if X is unsigned or considered as such, it's logical. The distinction isn't encoded in the opcode, so presumably the compiler chooses the interpretation based on the operand type (logical if unsigned, arithmetic if signed).
- (There are no direct opcodes for rotate or bit extraction; those could be done via combinations or extended opcodes if needed.)
Comparison opcodes:
- `0x40 EQ`: Checks equality of two values. Pops two, pushes 1 if they are equal, 0 if not. For numeric types this is the usual comparison. For reference types (pointers, object references), it checks whether the addresses are equal. Note: to compare strings by content, one would need a loop or a library function; `EQ` on two string references only tells whether they point to the same string.
- `0x41 NEQ`: Inequality (not equal). Pushes 1 if the two values are not equal, else 0.
- `0x42 LT`: Less-than (signed). Pops two values and pushes 1 if the first-pushed is less than the second-pushed, else 0. That is, if you push X, then push Y, then execute `LT`, it checks X < Y: Y sits on top as the right operand, with X below it as the left operand, mirroring `cmp X, Y` in assembly. Code should interpret the operand order accordingly.
- `0x43 LE`: Less-or-equal.
- `0x44 GT`: Greater-than.
- `0x45 GE`: Greater-or-equal.
- (There is no explicit opcode for unsigned comparisons. When comparing as unsigned, the compiler might adjust the values or use a special instruction via an extended opcode. Presently, all comparisons assume a signed interpretation for integers. Floats are treated as their own category, but `LT` etc. would be used for them too, using float compare instructions at the machine level.)
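A hedged sketch of the comparison order (first-pushed as the left operand, top of stack as the right operand), with the same inline-mnemonic assumption as above:

```
push 2      // 2
push 5      // 2 5
lt          // 1          (2 < 5, so true)
push 0      // 1 0
eq          // 0          (1 == 0 is false)
```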
Memory access opcodes (planned):
- `0x50 LOAD`: Load from memory. In context, `LOAD` likely takes an operand indicating what to load:
  - It could take an address from the stack: perhaps `LOAD` with no immediate pops an address and pushes the value at that address (the size depending on type info from context).
  - Or it could be `LOAD varIndex`, loading a local or global variable by index or reference. The AST schema shows that an `Operation` can have a `Variables` field referencing a `SyntaxExpression`, which could be how `LOAD`/`STORE` refer to variable addresses or names. The commented-out code in `ProcessOperand` for `@variable` suggests an intention that `@name` might be replaced with the actual value; possibly that was scrapped in favor of explicit opcodes. In any case, `LOAD` is meant to push the value from a memory location. Example usage: with a pointer, or an array base and offset, on the stack, a future extended `LOAD` might pop both and push that memory content (i.e., `LOAD` could handle both address and offset as a multi-word operand). Status: Present in the OpcodeMap but not actively used by the compiler yet.
- `0x51 STORE`: Store to memory. Likely pops a value and an address and writes the value to that address; alternatively, it might use an immediate or context to know where to store. In a typical stack machine one would write `addr value STORE` (store value at addr). If it is oriented toward variables instead, it might carry an immediate reference. The AST `Operation` has an Extended form for opcode 0xFF with `ExtensionGroup` and `ExtendedOpCode`; it's possible that user-defined or external memory operations will use extended opcodes, whereas 0x50/0x51 remain reserved for simple pointer-based load/store. Status: In the map, not used in high-level code generation yet.
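A purely hypothetical sketch of the pointer-style usage speculated above (these opcodes are not wired up yet, and `some_addr` is a placeholder, not real syntax):

```
push some_addr   // address of the target cell (placeholder – hypothetical)
load             // 0x50 LOAD: would pop the address and push the value stored there
push some_addr   // address again
push 42          // value to write
store            // 0x51 STORE: would pop value and address and write 42 to the cell
```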
(The memory model of Ubytec is still evolving. Eventually, one can expect the ability to allocate memory blocks, access array elements, etc., using these opcodes. For now, memory access is mostly through global and local variables handled by the compiler directly, without explicit LOAD/STORE in the bytecode except possibly for unstructured pointer usage.)
Runtime Execution Model
When Ubytec bytecode is executed (either by the generated native code or a future VM), the following model applies:
Evaluation Stack: There is a LIFO stack used for evaluating expressions and holding temporary values (distinct from the call stack). Most opcodes manipulate this stack – pushing or popping values. The stack has a defined size (the compiler can compute the maximum depth needed). Overflowing or underflowing it would be an error (in a VM, a trap; in generated code, it would mean mismanaging the CPU stack).
Call Stack & Frames: Each function call creates a new frame (activation record). Parameters and local variables are allocated within that frame. In the generated assembly, the CPU stack is used for both the evaluation stack and locals (the implementation simply moves RSP for locals and uses it as the evaluation area too). The `LocalContext` and function prologue manage RSP accordingly. A VM implementation might separate the evaluation stack from the call stack for simplicity.

Modules and Globals: Global variables (module fields/properties) reside in a separate memory area (the data segment). They are identified by labels and accessed directly by address in assembly; a VM could instead assign each global an index or memory offset. The module's `global` context, if present, is intended to run at program start, but in the current code it is only emitted as a labeled, referenced block – the module compile does not automatically call global-context functions at startup; `_start` only calls `Main`. If initialization code is needed, the user should call it from `Main` for now, or it should be integrated (automatically invoking global-context code before main is possibly WIP; see the sketch below).

Function Calls: A function call in Ubytec (syntax `funcName(args...)`) is currently handled much like in C: the caller pushes arguments, then performs a `call`. The generated assembly reserves stack space for the arguments in the function prologue and then writes each argument into that space (for simplicity, it actually subtracts the space and moves each argument into `[RSP+offset]`), then jumps to the function code. On return, it is the caller's job to clean up (the function itself restores RSP), akin to the cdecl calling convention (caller cleans the stack). In the bytecode, a function call would ideally be an opcode like `CALL funcIndex`, but the compiler currently does not define a `CALL` opcode; instead, a call is inlined as an assembly `call`, so calls are treated at a higher level rather than as an opcode in the AST. There is a hint: a `keyword.function.call.ubytec` scope exists, which likely corresponds to encountering an identifier followed by `(` (so the grammar flags it as a call), and the compiler handles it by outputting an assembly call to the function's label. In a VM, one would implement calls by pushing a return address and jumping, or by a special opcode. For now, calls are not represented explicitly in the JSON AST except as part of the syntax tree for functions (the AST likely just has a `SyntaxNode` with an `Operation` whose `$type` is `"FuncCall"` or similar – that schema hasn't been confirmed; it might be under ExpressionFragments).
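A hedged sketch combining the two points above – explicit global-context initialization and a plain call site (names are hypothetical; member syntax follows the module examples earlier in this overview, and the call is lowered to an assembly `call`, not a bytecode opcode):

```
// inside a module body
global {
    field t_int32 Ready 0
    func Init() {
        // ...initialize module globals here...
    }
}

func Main() -> t_int32 {
    Init()      // global-context code is not run automatically, so call it explicitly;
                // the parser flags this as keyword.function.call.ubytec and the
                // compiler emits `call Init` in the generated assembly
    return
}
```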
Return Values: Functions can return a value, which is left on the evaluation stack by a `RETURN` opcode or by falling off the end of the function with a value on the stack. The calling code expects that on return the result is available (in assembly, typically in a register or at the top of the stack). The current assembly output conventionally places returned values (if any) in the RAX register before `ret`; in practice, the compiled code emits:

```
... function body ...
; Function body end
ret
```

The compiler always emits the `ret` but does not explicitly move anything into RAX, which implies that if a function is meant to return something, the value must already be in RAX at that point. The function compile appends `ASTCompiler.CompileAST(new SyntaxTree(Definition))` to produce assembly for the body; if the function's `Definition` (a `SyntaxSentence`) leaves a value on the stack, nothing explicitly moves it to RAX or cleans up the stack. Under the x86-64 calling convention the callee is responsible for placing the return value in RAX, so a `pop rax` (or similar) would be needed if the result sits on top of the stack. This is a known discrepancy: the value might happen to be in RAX if it was computed last by certain instructions, but with many stack operations that is not guaranteed. Returning values from non-void functions should therefore be considered not fully wired up (the `RETURN` opcode would presumably be the place to move the value into RAX and jump to the end).

`Func.Validate` enforces that a non-abstract function must have a body (`Definition` not null), and presumably the compiler should also ensure that a non-void function actually executes a return. If you omit a return in a non-void function, the code currently falls off the end and executes `ret` with whatever happens to be in RAX (likely the value of the last expression, or 0).

So the convention for now is: to return a value, either ensure the last operation leaves it in RAX or use an explicit `return value;` – the explicit `return` will compile to the necessary moves. Until that is fully implemented, this remains a WIP area; see the sketch below.
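A hedged sketch of the explicit-return form recommended above (the exact expression syntax accepted by the current parser may differ):

```
func Five() -> t_int32 {
    return 5        // explicit return, so the compiler can place the result in RAX
}
```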
Program Startup and Termination: The main module (the one compiled to an executable) gets a generated `_start` entry. It sets up any required sections and calls `Main` if present. After `Main` returns, or immediately if there is no `Main`, it executes an OS exit system call (`mov eax, 60; xor edi, edi; syscall` on Linux, exiting the process with code 0). The program therefore ends when main returns (or right away if there is no main). Memory allocated on the stack is freed by normal process teardown. There is no GC or heap unless the program explicitly invokes syscalls or library calls for allocation (library support may come in the future). Modules can also be compiled as libraries; in that case, `Main` is not invoked.

Error handling: If an invalid operation occurs (division by zero, trap, etc.), the current runtime simply crashes (or yields a CPU exception). There is no catch/throw yet. The `TRAP` opcode can be used to intentionally abort.
In summary, the runtime model currently is a straightforward stack machine executing in a single thread, with structured control flow mapping to labels and jumps in generated code. As the implementation matures, more safety checks and advanced features (like a garbage-collected heap for objects, multi-module linking, exceptions, etc.) will be added. This overview reflects the current state where features not fully implemented are marked as such, and the existing functionality can be used as a guide for compiler and toolchain authors targeting Ubytec.
References:
- The Ubytec grammar and token scopes define the syntax for all constructs (e.g., `keyword.control.flow.ubytec` covers `if`, `else`, `while`, `switch`, etc., and `storage.type.single.ubytec` covers basic types).
- The interpreter's parsing logic in `HighLevelParser.cs` shows how modules, contexts, and members are recognized and enforces certain rules (only one global/local context, etc.).
- The AST schema (`.ubc.ast.json`) documents the structure of the syntax tree, including how expressions and operations are represented (e.g., `ConditionExpressionFragment` for binary conditions with an `Operand` like "==" or "<", and the presence of fields like `BlockType`, `Condition`, and `LabelIDxs` in structured opcodes).
- The bytecode mapping (`OpcodeFactory` and `ASTCompiler`) provides the opcode assignments and the intended behavior of each operation (for instance, mapping 0x04 to `IF`, 0x0C to `WHILE`, etc., and illustrating stack ops like `PUSH` 0x11, `POP` 0x12, and so on). Many opcodes are defined as placeholders and currently throw `NotImplementedException` if used (they exist so the design is visible, but not all produce assembly yet).
- The function compilation in `Func.Compile` demonstrates how local variables and the function body are compiled to assembly, including reserving stack space for locals and arguments and appending the assembled body instructions, with a final `ret`. Similarly, `Action.Compile` shows the approach for an action (void function), which is simpler (no return value to handle; it just ensures a `ret`).
- Module compilation in `Module.Compile` shows how the different sections are laid out: `.data` for fields and global-context fields, BSS for uninitialized properties, and `.text` for code, followed by the global label `_start`, which calls `Main` and exits. This illustrates the overall program structure the compiler produces.
This overview should serve as a technical reference to the current Ubytec language implementation. It is intended for compiler and tool developers who need exact details on syntax and semantics actually supported. As Ubytec is under active development, some aspects marked as WIP will evolve – keep an eye on repository updates and the schema for changes. With this information, one can generate or analyze Ubytec code confident that it will reflect the real behavior of the interpreter/compiler in its current state.