| layout | title | permalink |
|---|---|---|
| page | Internals | /internals/ |
- toc {:toc}
High-Level Design
{: class="thumbnail center-block img-responsive" }
The diagram above shows the interaction of the system's major components: Parser and Runtime. The Parser translates the input ECMAScript application into byte-code of a specified format (refer to the Bytecode and Parser page for details). The prepared byte-code is executed by the Runtime engine, which interprets it (refer to the Virtual Machine and ECMA pages for details).
Parser
The parser is implemented as a recursive descent parser. It does not build any kind of Abstract Syntax Tree; it converts the source JavaScript code directly into byte-code.
The parser consists of five major parts:
- lexer
- parser
- opcodes dumper
- syntax errors checker
- serializer
These four components (all except the parser itself) are initialized during the parser_init call (jerry-core/parser/js/parser.cpp).
This initializer requires the following two subsystems to be initialized first: the memory allocator and the serializer. The need for the allocator is clear. The serializer resets the internal bytecode_data structure (jerry-core/parser/js/bytecode-data.h). Currently bytecode_data is a singleton. During parsing it is filled with the data needed for further execution:
- Byte-code - array of opcodes (`bytecode_data.opcodes`).
- Literals - array of literals (`bytecode_data.literals`).
- Strings buffer (`bytecode_data.strings_buffer`) - literals of type `LIT_STR` contain pointers to strings, which are located in this buffer.
The following is a brief review of the mentioned components. See a more detailed description in the following chapters.
- Lexer. The lexer splits the input file (given as the first parameter of the parser_init call) into a sequence of tokens. These tokens are then matched on demand.
- Opcodes dumper. This component does the necessary checks and preparations, and dumps opcodes using the serializer.
- Serializer. The serializer puts opcodes, prepared by the dumper, into a continuous array that represents the current scope's code. It also provides an API for accessing byte-code.
- Syntax error checker. This is a bunch of simple die-on-error checks.
After initialization, parser_parse_program (jerry-core/parser/js/parser.cpp) should be called. This function performs the following steps (the so-called parsing steps) for all scopes (global code and functions):
1. Initialize a scope.
2. Do pre-parser stage.
3. Parse the scope code.

After every scope is processed, the parser merges all scopes into a single byte-code array.
Two new entities were introduced - scopes and pre-parser.
- There are two types of scopes in the parser: global scope and function declaration scope. Notice that function expressions do not create a new scope in terms of the parser. The reason why is described below. The parsing process starts on the global scope. If a function declaration occurs during the process, a new scope is created and pushed to the stack of current scopes; then steps 1-3 of parsing are performed. Note that only global scope parsing merges all scopes into a byte-code. All scopes are stored in a tree to represent their hierarchy.
- Pre-parser. This step performs hoisting of variable declarations. First, it dumps `reg_var_decl` opcodes. Then it goes through the script and looks for variable declaration lists. For every variable found in the scope (not in a sub-scope or function expression) it dumps a `var_decl` opcode. After this step the byte-code of the scope starts with an optional 'use strict' marker, then `reg_var_decl` and several (optional) `var_decl`s.
Due to some limitations of the parser, some parsing functions take this_arg and/or prop as parameters. They are further used to dump the prop_setter opcode. During parsing all necessary data is stored in either stacks or scope trees. After parsing the whole program, the parser merges all scopes into a single byte-code, hoisting function declarations in the process. This task, the so-called post-parser, is performed by the scopes_tree_raw_data function (jerry-core/parser/js/scopes-tree.c). For further information about the post-parser, see the opcodes dumper section.
Lexer
The lexer splits the input string into a sequence of tokens. The token structure (./jerry-core/parser/js/lexer.h) consists of three elements: token type, location of the token and optional data:
{% highlight cpp %}
typedef struct
{
  locus loc;
  token_type type;
  literal_index_t uid;
} token;
{% endhighlight %}
The location of a token (locus) is just the index of the token's first character in the string that represents the program. Token types are listed in the lexer.h header file (token_type enum). Depending on the token type, the token-specific data (uid field) has a different meaning.
| Token type | 'uid' meaning |
|---|---|
| TOK_KEYWORD | Keyword id, like KW_DO, KW_CONST, etc. (see 'keyword' enum in lexer.h). |
| TOK_NAME, TOK_STRING, TOK_NUMBER | Literal index in the stack of literals. |
| TOK_BOOL | 0 - 'false', 1 - 'true' |
| TOK_SMALL_INT | Value of small integer (0-255). |
| Other (punctuators) | Not used. |
The token matching algorithm is straightforward: look at the first character of the new token, recognize the type, and then just match the rest. Comments and space characters (except new line) are ignored, so they produce no token. The algorithm uses two pointers: buffer and token_start. The first one points to the next character of the input; the other points to the first character of the token being matched, the so-called current token.
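The two-pointer matching described above can be sketched as follows. This is a minimal illustration rather than the engine's lexer: `sketch_token`, `sketch_token_type` and `sketch_next_token` are hypothetical names, and only names, decimal integers, new lines and single-character punctuators are recognized (comments, keywords and literals are left out).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Hypothetical, simplified token kinds (the real list lives in lexer.h).
enum sketch_token_type { SK_EOF, SK_NEWLINE, SK_NAME, SK_SMALL_INT, SK_PUNC };

struct sketch_token
{
  size_t loc;             // index of the token's first character
  sketch_token_type type; // recognized from the first character
  size_t length;          // number of characters matched
};

// Scans one token starting at *buffer_p and advances *buffer_p past it.
static sketch_token
sketch_next_token (const char *source, const char **buffer_p)
{
  assert (source != NULL && buffer_p != NULL && *buffer_p >= source);
  const char *buffer = *buffer_p;

  // Space characters other than new line produce no token.
  while (*buffer == ' ' || *buffer == '\t')
  {
    buffer++;
  }

  const char *token_start = buffer; // first character of the current token
  sketch_token_type type;

  if (*buffer == '\0')
  {
    type = SK_EOF;
  }
  else if (*buffer == '\n')
  {
    type = SK_NEWLINE;
    buffer++;
  }
  else if (isalpha ((unsigned char) *buffer) || *buffer == '_')
  {
    type = SK_NAME;
    while (isalnum ((unsigned char) *buffer) || *buffer == '_')
    {
      buffer++;
    }
  }
  else if (isdigit ((unsigned char) *buffer))
  {
    type = SK_SMALL_INT;
    while (isdigit ((unsigned char) *buffer))
    {
      buffer++;
    }
  }
  else
  {
    type = SK_PUNC; // everything else is treated as a one-character punctuator
    buffer++;
  }

  *buffer_p = buffer;
  sketch_token tok = { (size_t) (token_start - source), type, (size_t) (buffer - token_start) };
  return tok;
}
```

Calling `sketch_next_token` repeatedly with the same `buffer_p` walks the input token by token; the returned `loc` plays the role of the locus.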
The lexer remembers two tokens during scan: current and previously seen. It also allows buffering one token to be rescanned (lexer_save_token) and setting scan position to any location in the file (lexer_seek).
The parser uses the lexer to scan the file twice: during the pre-parsing and parsing stages.
Currently the lexer does not support any encoding except ASCII. Also the lexer does not support regular expressions.
Opcodes dumper
The opcodes dumper is a fairly high-level wrapper around the serializer. It was introduced to separate parsing from dumping opcodes. To understand how the opcodes dumper works, one should be acquainted with the byte-code layout (see the corresponding description).
The main data structure of the dumper is an operand (jerry-core/parser/js/opcodes-dumper.h). An operand can represent either a variable (i.e. literal) or a temporary register (tmp). The main complication in the dumper is the difference between these two types.
Byte-code is divided into blocks of fixed size (BLOCK_SIZE in jerry-core/parser/js/bytecode-data.h) and each block has independent encoding of variable names, which are represented by 8 bit numbers - uids.
Operands are encoded as uids in each opcode (see the opcode_t structure).
As byte-code decomposition into blocks is not possible until parsing is finished, uids can't be calculated on the fly. Therefore literal operands are encoded by literal indexes (literal_index_t - index in the global literals array) during parsing. In the post-parser stage these indexes are converted to block specific uids.
During parsing, a scopes tree structure is constructed (see scopes_tree_int in jerry-core/parser/js/scopes-tree.h). Each tree node consists of its byte-code and a list of child scopes. While the final byte-code is a plain array of opcode_t structures, byte-code in tree nodes is represented by a list of op_meta structures. The op_meta structure wraps opcode_t with an array of 3 values (result, operand_1 and operand_2), which holds literal indexes, so that literal operands can be encoded.
In each dump_* function (jerry-core/parser/js/opcodes-dumper.h) the dumper checks the operand type and dumps the appropriate op_meta to the scopes tree using the serializer. The dumper also keeps opcode counters of rewritable opcodes inside a bunch of stacks. In functions with names like dump_*_for_rewrite, it dumps an op_meta and pushes the opcode counter of that op_meta to a stack. In functions with names like rewrite_*, it pops an opcode counter from the stack, retrieves the op_meta by the dematerializer and rewrites the necessary fields of the opcode.
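The dump-for-rewrite pattern can be illustrated with a small sketch. All names here (`sketch_dumper`, `sketch_dump_jump_for_rewrite`, `sketch_rewrite_jump`) are hypothetical, the opcode is reduced to a flag and an offset, and storing a relative offset is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint16_t sketch_opcode_counter_t; // position of an opcode in the scope's code

struct sketch_op
{
  bool is_jump;
  uint16_t jump_offset; // meaningful only when is_jump is true
};

struct sketch_dumper
{
  std::vector<sketch_op> code;                     // opcodes dumped so far
  std::vector<sketch_opcode_counter_t> rewritable; // stack of opcode counters to patch
};

// Dumps an ordinary opcode that never needs rewriting.
static void
sketch_dump_op (sketch_dumper &d)
{
  sketch_op op = { false, 0 };
  d.code.push_back (op);
}

// Dumps a jump whose target is not known yet and remembers its position.
static void
sketch_dump_jump_for_rewrite (sketch_dumper &d)
{
  sketch_op jump = { true, 0 }; // offset is unknown at this point
  d.rewritable.push_back ((sketch_opcode_counter_t) d.code.size ());
  d.code.push_back (jump);
}

// Pops the most recent rewritable jump and points it at the current end of code.
static void
sketch_rewrite_jump (sketch_dumper &d)
{
  assert (!d.rewritable.empty ());
  sketch_opcode_counter_t oc = d.rewritable.back ();
  d.rewritable.pop_back ();

  assert (d.code[oc].is_jump);
  d.code[oc].jump_offset = (uint16_t) (d.code.size () - oc);
}
```

Using a stack means nested constructs are patched in the reverse order of their creation, which matches how nested statements are closed during parsing.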
The post-parser merges scopes into a single byte-code. For each scope it first dumps a header of the scope, which consists of optional func_decl with function_end opcode pair, optional ‘use strict’ marker, reg_var_decl and optional var_decls. Then it recursively dumps sub-scopes. Finally, it dumps the remainder of opcodes. The byte-code is split into blocks with fixed size; each block has its own counter of literals. While dumping opcodes the post-parser replaces LITERAL_TO_REWRITE markers with this counter’s value.
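A rough sketch of the per-block literal numbering is given below. It assumes one literal operand per opcode and ignores registers and scope headers; `sketch_assign_block_uids` and `sketch_literal_index_t` are hypothetical names, not part of the engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

typedef uint32_t sketch_literal_index_t; // hypothetical stand-in for literal_index_t

// Converts global literal indexes (one per opcode) into block-local 8-bit uids.
// Each block of block_size opcodes starts with a fresh counter, so the same
// literal may receive different uids in different blocks.
static std::vector<uint8_t>
sketch_assign_block_uids (const std::vector<sketch_literal_index_t> &operands,
                          size_t block_size)
{
  assert (block_size > 0 && block_size <= 256); // uids must fit into 8 bits

  std::vector<uint8_t> uids;
  std::map<sketch_literal_index_t, uint8_t> block_map; // literal index -> uid in current block
  uint8_t counter = 0;

  for (size_t i = 0; i < operands.size (); i++)
  {
    if (i % block_size == 0)
    {
      block_map.clear (); // a new block: the encoding starts over
      counter = 0;
    }

    std::map<sketch_literal_index_t, uint8_t>::iterator it = block_map.find (operands[i]);
    if (it == block_map.end ())
    {
      it = block_map.insert (std::make_pair (operands[i], counter++)).first;
    }

    uids.push_back (it->second);
  }

  return uids;
}
```

With a block size of 3, the literal sequence 700, 12, 700, 12, 700 is numbered 0, 1, 0 in the first block and 0, 1 in the second, where the two literals swap uids.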
Serializer
The serializer dumps literals collected by the lexer to bytecode_data, and is used by the dumper to dump or rewrite op_metas in the current scope.
Syntax Errors Checker
This component just checks for syntax errors defined in the specification. It uses stacks to store necessary data, for example argument names.
Byte-code
Every instruction of bytecode consists of an opcode and up to three operands. An operand (idx) can be either a "register" or a string literal specifying the identifier to evaluate (i.e. a variable; "storage idx"). The general structure of an instruction is shown on the picture.
| opcode | idx | idx | idx |
Special kinds of instructions are described below.
Arithmetic/bitwise-logic/logic/comparison/shift
Arithmetic instruction can have the following structure:
|opcode|dst|left|right|
|opcode|dst|value|-|
where dst/left/right/value identify an operand.
Control (jumps)
Control instructions utilize two bytes to encode jump location. Destination offset is contained inside offset_high and offset_low fields.
|opcode|offset-high|offset-low|-|
|opcode|cond value|offset-high|offset-low|
A conditional jump checks the cond value field, which identifies an operand, and performs the jump if the operand has a true value.
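As a rough sketch, a 16-bit offset could be split across the two fields as shown here. The helper names are hypothetical, and the sketch takes no position on whether the engine treats the offset as relative or absolute.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the two idx bytes that hold a jump offset.
struct sketch_jump_fields
{
  uint8_t offset_high; // upper 8 bits of the offset
  uint8_t offset_low;  // lower 8 bits of the offset
};

// Rebuilds the 16-bit offset from its two halves.
static uint16_t
sketch_decode_offset (sketch_jump_fields f)
{
  return (uint16_t) (((uint16_t) f.offset_high << 8) | f.offset_low);
}

// Splits a 16-bit offset into the two fields of a jump instruction.
static sketch_jump_fields
sketch_encode_offset (uint16_t offset)
{
  sketch_jump_fields f = { (uint8_t) (offset >> 8), (uint8_t) (offset & 0xFF) };
  assert (sketch_decode_offset (f) == offset); // round-trip sanity check
  return f;
}
```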
Assignment
Assignment instructions perform assignment of immediate value (contained inside instruction) to the operand, which is marked as idx on the picture.
|op_assignment|dst|type|value|
where
dst - "storage idx", identifies where to store the value
type - specifies value type
value - depends on type field
Type of the immediate value is encoded in the type field of instruction. The following values are supported:
- "simple value" (see ECMA types encoding)
- small integer/negative small integer
- number literal/negative number literal
- string value, initialized by string literal ("literal idx")
- "Storage idx"
Exit
Exit instruction serves to stop the execution and exit with a specified status.
|op_exit|status(0/1)|-|-|
Exit instruction is employed in following cases:
- at script end (exit with "successful" status);
- in script assertion fail handling code (exit with "fail" status)
Native call (intrinsic call)
The native call instruction is used to call intrinsics. Arguments are not encoded directly inside this instruction; instead they follow it as special "meta" instructions (see the corresponding section). The id of the desired intrinsic is encoded in the intrinsic id field.
|op_native_call|dst|intrinsic_id|arg_list|
where
dst - "storage idx"
arg_list - number of arguments
Function call/Constructor call
Function/constructor call instructions are used to perform calls to functions and constructors. The destination operand is encoded in the dst field. The operand name_idx specifies the name of the function to call. Arguments are encoded the same way as in the native call instruction.
|opcode|dst|name_idx|arg_list|
where
dst - "storage idx"
name_idx - "storage idx" (which value to call)
arg_list - number of arguments
Function declaration
Function declarations are represented by the special kind of instructions. Function name and number of arguments are located in name_idx and arg_list fields respectively.
|opcode|name_idx|arg_list|-|
where
name_idx - literal idx
arg_list - number of arguments
Function expression
Very similar to function declaration. But additionally contains destination (dst) field and name operand is optional, because anonymous functions are possible.
|opcode|dst|name_idx|arg_list|
where
dst - "storage idx"
name_idx - literal idx (can be unspecified for anonymous function expression)
arg_list - number of arguments
Return from function/eval
Return instructions perform an unconditional return from function/eval code. A return value can be specified (idx field).
|op_ret|-|-|-|
|op_retval|idx|-|-|
where
idx - "storage idx"
"Meta" (special marker opcode)
|op_meta|type|arg1|arg2|
Meta instructions are usually utilized as continuations of other instructions. Depending on type field, meta instruction can have the following meaning:
- 'this' argument (for calls in a.f() form, a = this), put right after call opcode
- `varg` - encodes an argument for calls and array declarations (`arg1` - storage idx) / parameter name for function decl/expr (`arg1` - literal idx, i.e. string)
- varg_prop_data / varg_prop_getter / varg_prop_setter - name (literal idx) and value/getter/setter (storage idx) of a property (see also: object declaration)
- end_with / function_end / end_of_try_catch_finally - end offset of 'with' block/function/try_catch_finally sequence
- catch / finally - start of catch/finally block and offset to the end of the block
- strict code - placed at the start of a scope's code if the source code contains 'use strict' at the beginning
Delete
JavaScript delete operator is represented with delete instruction in the bytecode. There are two types of delete instruction, applied either to element of lexical environment or to object's property.
|op_delete_var|dst|name|-|
|op_delete_prop|dst|base_value|name|
where
dst - "storage idx"
name - literal idx
base_value - "storage idx"
This binding (evaluate "this")
This binding instruction writes value of "this" to the dst operand.
|op_this|dst|-|-|
where
dst - "storage idx"
typeof (typeof operation)
Typeof instruction executes JavaScript operator with the same name. Result is written to the dst operand.
|op_typeof|dst|value|-|
where
dst and value - "storage idx"
with block
To specify bounds of "with" block, a pair of instructions is used. "With" instruction specifies its start.
| op_with | value | - | - |
where
value - "storage idx" (evaluated expression - argument of with)
Followed by a number of arbitrary instructions, the block ends with end_with meta instruction.
|op_with|
|...|
|op_meta (end_with)|
try block
A try block consists of a try instruction, followed by a number of arbitrary instructions; the meta instructions catch and/or finally, which start the catch and finally blocks respectively; and the meta instruction end_try_catch_finally, which finishes the whole construction.
| op_try_block | offset_high | offset_low | - |
where
offset_high and offset_low - offset of the end of try block
|op_try_block|
|...|
|op_meta (catch)|
|...|
|op_meta (finally)|
|...|
|op_meta (end_try_catch_finally)|
Object declaration
The object declaration instruction represents an object literal in the JavaScript specification. It consists of an op_obj_decl instruction, followed by a list of prop_data, prop_getter and prop_setter meta instructions. A series of instructions which evaluate property values can precede the meta instructions. The number of meta instructions, i.e. the number of properties, is specified in the prop_num field.
| op_obj_decl | dst | prop_num | - |
where
dst - "storage idx" (where to save the created object)
prop_num - number of properties
|op_obj_decl|
|... (intermediate evaluation of value/function expression, etc.)|
|op_meta (prop_data/ prop_getter/ prop_setter)|
Arguments and array declaration
The strategy described in the previous section is also used for encoding arguments in function/constructor calls and elements in array declarations. See the corresponding pictures.
Virtual machine
Virtual machine executes bytecode by interpreting instructions one by one. Bytecode is a continuous array of instructions, divided into blocks of fixed size. Main loop of interpreter calls opfunc_* for every instruction. This function returns completion value and position of the next instruction.
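The dispatch scheme can be sketched with a tiny, hypothetical instruction set. None of the names below come from the engine, and as a simplification each handler advances the position stored in the context instead of returning it together with the completion value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical miniature instruction set and completion types.
enum sketch_opcode { SK_OP_LOAD_SMALL_INT, SK_OP_ADD, SK_OP_EXIT, SK_OP_COUNT };
enum sketch_completion { SK_COMPLETION_NORMAL, SK_COMPLETION_EXIT };

struct sketch_instr
{
  uint8_t op;
  uint8_t idx[3]; // up to three operands
};

struct sketch_ctx
{
  size_t pos;      // current position (instruction to execute)
  int32_t regs[8]; // stack frame: "register" values
};

typedef sketch_completion (*sketch_opfunc_t) (const sketch_instr &, sketch_ctx &);

static sketch_completion
sketch_opfunc_load_small_int (const sketch_instr &in, sketch_ctx &ctx)
{
  ctx.regs[in.idx[0]] = in.idx[1]; // dst <- immediate value
  ctx.pos++;
  return SK_COMPLETION_NORMAL;
}

static sketch_completion
sketch_opfunc_add (const sketch_instr &in, sketch_ctx &ctx)
{
  ctx.regs[in.idx[0]] = ctx.regs[in.idx[1]] + ctx.regs[in.idx[2]]; // dst <- left + right
  ctx.pos++;
  return SK_COMPLETION_NORMAL;
}

static sketch_completion
sketch_opfunc_exit (const sketch_instr &in, sketch_ctx &ctx)
{
  (void) in;
  (void) ctx;
  return SK_COMPLETION_EXIT; // request to finish execution
}

// One handler per opcode, indexed by the opcode value.
static const sketch_opfunc_t sketch_opfuncs[SK_OP_COUNT] =
{
  sketch_opfunc_load_small_int,
  sketch_opfunc_add,
  sketch_opfunc_exit
};

// Interpretation loop: call the handler of every instruction until exit is requested.
static void
sketch_run_loop (const sketch_instr *code, size_t code_len, sketch_ctx &ctx)
{
  while (true)
  {
    assert (ctx.pos < code_len && code[ctx.pos].op < SK_OP_COUNT);

    if (sketch_opfuncs[code[ctx.pos].op] (code[ctx.pos], ctx) == SK_COMPLETION_EXIT)
    {
      break;
    }
  }
}
```

A table of handlers keeps the loop itself free of opcode-specific logic; adding an instruction only means adding a handler and a table entry.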
{: class="thumbnail center-block img-responsive" }
Instruction can have up to three operands which are represented by idx values. Meaning of idx value depends on opcode and can be the following:
- id of a temporary variable (register)
- id of literal (queried from the serializer, specific to every block of bytecode)
- type of assigned value, id of number/string literal or simple value in `op_assignment`
- type of meta and corresponding arguments in `op_meta`
- idx pair may represent opcode position
During the execution every function of the source code has associated interpreter context, which consists of the following items:
- current position (byte-code instruction to execute)
- 'this' binding (ecma-value)
- lexical environment
- `is_strict` flag (is current execution code strict)
- `is_eval_code` flag (is current execution mode eval)
- `min_reg_num`, `max_reg_num` - range of `idx`'s used for "registers"
- stack frame (array of "register" values)
Main routines of the virtual machine are:
- `run_int` - starts execution of Global code (main program).
- `run_int_from_pos` - executes specified code scope (global/function/eval), expects the following arguments: starting position, 'this' binding, lexical environment.
- `run_int_loop` - interpretation loop.
ECMA
ECMA component of the engine is responsible for the following notions:
- Data representation
- Runtime representation
- GC
Data representation
The major structure for data representation is ECMA_value. Lower two bits of this structure encode value tag, which determines the type of the value:
- simple
- number
- string
- object
{: class="thumbnail center-block img-responsive" }
The immediate value is placed in higher bits. "Simple value" is an enumeration, which consists of the following elements:
- undefined
- null
- true
- false
- empty
- array_redirect (implementation defined, currently unused, for array storage optimization)
For other value types higher bits of ECMA_value structure contain compressed pointer to the real value.
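The tagging scheme can be sketched as follows. The numeric tag values, the 32-bit container and all `sketch_` names are assumptions made for illustration; only the "two low bits hold the tag, higher bits hold the payload" layout comes from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tag numbering; the real values may differ.
enum sketch_value_tag
{
  SK_TAG_SIMPLE = 0,
  SK_TAG_NUMBER = 1,
  SK_TAG_STRING = 2,
  SK_TAG_OBJECT = 3
};

typedef uint32_t sketch_ecma_value_t; // container width is an assumption

static const unsigned SK_TAG_BITS = 2;
static const uint32_t SK_TAG_MASK = (1u << SK_TAG_BITS) - 1;

// Packs a tag and a payload (simple value id or compressed pointer) into one value.
static sketch_ecma_value_t
sketch_make_value (sketch_value_tag tag, uint32_t payload)
{
  assert (payload < (1u << (32 - SK_TAG_BITS))); // payload must fit in the higher bits
  return (payload << SK_TAG_BITS) | (uint32_t) tag;
}

// Extracts the tag from the two lowest bits.
static sketch_value_tag
sketch_get_tag (sketch_ecma_value_t value)
{
  return (sketch_value_tag) (value & SK_TAG_MASK);
}

// Extracts the immediate value or compressed pointer from the higher bits.
static uint32_t
sketch_get_payload (sketch_ecma_value_t value)
{
  return value >> SK_TAG_BITS;
}
```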
Compressed pointers
Compressed pointers were introduced to save heap space. They are possible because the heap size is currently limited to 256 KB, which requires 18 bits to address. ECMA values in the heap are aligned to 8 bytes, which saves three more bits, so a compressed pointer consumes only 15 bits.
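The arithmetic above (18 bits of offset minus 3 alignment bits gives 15 bits) can be checked with a small sketch. The static heap array and the function names are hypothetical, and the representation of a NULL pointer is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const size_t SK_HEAP_SIZE = 256 * 1024; // 2^18 bytes
static const unsigned SK_ALIGNMENT_LOG = 3;    // 8-byte alignment
static const unsigned SK_CP_WIDTH = 15;        // 18 - 3 bits

// Hypothetical heap area; uint64_t elements keep it 8-byte aligned.
static uint64_t sketch_heap_storage[SK_HEAP_SIZE / 8];

// Turns a pointer into the heap into a 15-bit offset counted in 8-byte units.
static uint16_t
sketch_compress_pointer (const void *p)
{
  const uint8_t *base = (const uint8_t *) sketch_heap_storage;
  const uint8_t *byte_p = (const uint8_t *) p;
  assert (byte_p >= base && byte_p < base + SK_HEAP_SIZE);

  size_t offset = (size_t) (byte_p - base);
  assert ((offset & ((1u << SK_ALIGNMENT_LOG) - 1)) == 0); // must be 8-byte aligned

  return (uint16_t) (offset >> SK_ALIGNMENT_LOG);
}

// Restores the full pointer from the compressed form.
static void *
sketch_decompress_pointer (uint16_t cp)
{
  assert (cp < (1u << SK_CP_WIDTH));
  return (uint8_t *) sketch_heap_storage + ((size_t) cp << SK_ALIGNMENT_LOG);
}
```

The last 8-byte chunk of the heap sits at byte offset 0x3FFF8 and compresses to 0x7FFF, the largest 15-bit value.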
{: class="thumbnail center-block img-responsive" }
ECMA data elements are allocated in pools (pools are allocated on the heap). The chunk size of a pool is 8 bytes (this reduces fragmentation).
Number
There are two possible representations of numbers:
- 4-byte (float, compact profile - low memory consumption, but hardware limitations)
- 8-byte (double, full profile)
Several references to single allocated number are not supported. Each reference holds its own copy of a number.
String
String values are encoded by 8-byte structure, which contains the following fields:
- references counter - each stack (and non_stack) reference is counted (upon overflow, string is duplicated)
- is_stack_allocated - some temporary strings are stack_allocated to reduce loading of memory (perf)
- container - type of actual string storage/encoding
- hash - hash, calculated from two last characters (for faster comparison (perf))
- literal identifier - actual string is in the literal storage
- magic_string_id - string is equal to one of engine's magic strings
- uint32 - string is represented with unsigned integers (useful for array indexing)
- number_cp (compressed pointer to number) - string is represented with floating point number
- collection_cp - string is stored in one or several pool's chunks (see also: chars collection, collection header, collection chunk)
- concatenation_1_cp, concatenation_2_cp - pointers to two strings (parts of concatenation)
Object / Lexical environment
Object and lexical environment structures, 8 bytes each, have common (GC) header:
- Stack refs counter
- Next object/lexical environment in list of objects/lexical environments
- GC's visited flag
- is_lexenv flag
Remaining fields of these structures are different and are shown on the picture.
{: class="thumbnail center-block img-responsive" }
Property of an object / description of a lexical environment variable
While objects consist of properties, lexical environments consist of variables. Both kinds of units are linked into lists. Unit types can differ:
- named data (property or variable)
- named accessor (property)
- internal (implementation defined)
All these units occupy 8 bytes and have common header:
- type - 2 bit
- next property/variable in the object/lexical environment (compressed pointer)
The remaining parts are different:
{: class="thumbnail center-block img-responsive" }
Collections
ECMA runtime utilizes collections for intermediate calculations. Collection consists of a header and a number of linked chunks, which hold collection values.
Header occupies 8 bytes and consists of:
- compressed pointer to the next chunk
- number of elements
- rest space, aligned down to byte, is for the first chunk of data in collection
The chunk layout is as follows:
- compressed pointer to the next chunk
- rest space, aligned down to byte, is for data stored in corresponding part of the collection
Internal properties:
- Class - class of the object (ECMA-defined)
- Prototype - is stored in object description
- Extensible - is stored in object description
- Scope - lexical environment (function's variable space)
- ParametersMap - arguments object -0 code of the function
- Code - where to find bytecode of the function
- native code - where to find code of native function
- native handle - some uintptr_t associated with the object
- FormalParameters - collection of pointers to ecma_string_t (the list of formal parameters of the function)
- PrimitiveValue for String - for String object
- PrimitiveValue for Number - for Number object
- PrimitiveValue for Boolean - for Boolean object
- built-in related:
- built-in id - id of built-in object
- built-in routine id - id of built-in routine
- "non-instantiated" mask - which built-in properties were not instantiated yet (lazy instantiation)
- extension object identifier
LCache
LCache is a cache for property variable search requests.
{: class="thumbnail center-block img-responsive"}
Entry of LCache has the following layout:
- object pointer
- property name (pointer to string)
- property pointer
A cache row is selected by the string's hash. When a property access occurs, all entries of the row are searched by comparing the object pointer and the property name with the corresponding entry fields; a full comparison is used for the property name.
If corresponding entry was found, its property pointer is returned (may be NULL - in case when there is no property with specified name in given object). Otherwise, object's property set is iterated fully and corresponding record is registered in LCache (with property pointer if it was found or NULL otherwise).
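The lookup and registration steps can be sketched like this. The cache dimensions, the hash function, the eviction policy and every `sketch_` name are assumptions made for the example; only the entry layout and the hit/miss behavior follow the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

static const size_t SK_ROWS = 4;            // hypothetical number of rows
static const size_t SK_ENTRIES_PER_ROW = 2; // hypothetical row length

struct sketch_lcache_entry
{
  const void *object_p;  // NULL marks a free entry (assumption)
  const char *prop_name; // property name
  const void *prop_p;    // may be NULL: "no such property" is cached too
};

static sketch_lcache_entry sketch_lcache[SK_ROWS][SK_ENTRIES_PER_ROW];

// Hypothetical hash: sum of the last two characters, reduced to a row number.
static size_t
sketch_row_of (const char *name)
{
  size_t len = strlen (name);
  unsigned hash = 0;
  if (len >= 1) hash += (unsigned char) name[len - 1];
  if (len >= 2) hash += (unsigned char) name[len - 2];
  return hash % SK_ROWS;
}

// Returns true on a cache hit and stores the cached property pointer (possibly NULL).
static bool
sketch_lcache_lookup (const void *object_p, const char *name, const void **prop_p_out)
{
  assert (object_p != NULL && name != NULL && prop_p_out != NULL);
  const sketch_lcache_entry *row = sketch_lcache[sketch_row_of (name)];

  for (size_t i = 0; i < SK_ENTRIES_PER_ROW; i++)
  {
    // Pointer comparison for the object, full comparison for the name.
    if (row[i].object_p == object_p && strcmp (row[i].prop_name, name) == 0)
    {
      *prop_p_out = row[i].prop_p;
      return true;
    }
  }

  return false;
}

// Registers the result of a full property search (prop_p is NULL if nothing was found).
static void
sketch_lcache_insert (const void *object_p, const char *name, const void *prop_p)
{
  assert (object_p != NULL && name != NULL);
  sketch_lcache_entry *row = sketch_lcache[sketch_row_of (name)];
  size_t slot = 0; // if the row is full, the first entry is overwritten (assumed policy)

  for (size_t i = 0; i < SK_ENTRIES_PER_ROW; i++)
  {
    if (row[i].object_p == NULL)
    {
      slot = i;
      break;
    }
  }

  row[slot].object_p = object_p;
  row[slot].prop_name = name;
  row[slot].prop_p = prop_p;
}
```

Caching a NULL property pointer lets repeated lookups of a missing property skip the full iteration over the object's property set.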
Runtime
ECMA-defined runtime operations are implemented mostly by routines with one of the following signatures:
ecma_completion_value_t ecma_op_* ([ecma_value_t arguments])
or
ecma_property_t * ecma_op_[find/get]*_property (objs, name string, ...)
However, there could be some combinations.
Completion value
Many algorithms/routines described in ECMA return a value of "completion" type, that is triplet of the following form:
{: class="thumbnail center-block img-responsive" }
Jerry introduces two additional completion types:
- exit - produced by the `exitval` opcode, indicates a request to finish execution
- meta - produced by meta instruction, used to catch meta opcodes in interpreter loop without explicit comparison on every iteration (for example: meta 'end_with')
Value management and ownership
Every value stored by the engine is associated with a virtual "ownership" (that is, the responsibility to manage the value and free it when it is no longer needed, or to pass ownership of the value on to somewhere else).
| Value type | "Alloc" op | "Free" op |
|---|---|---|
| Number | ecma_alloc_number | ecma_dealloc_number |
| String | ecma_copy_or_ref_ecma_string | ecma_deref_ecma_string |
| Object | ecma_ref_object (for on_stack references); on_heap references are managed by GC | ecma_deref_object (for on_stack references); on_heap references are managed by GC |
| Property | (ownership is strongly connected with corresponding object) | (ownership is strongly connected with corresponding object) |
| Simple value | no memory management | no memory management |
| ecma_value (including values contained in completion value) | ecma_copy_value | ecma_free_value |
Initially, a value is allocated by its owner (i.e. with ownership). A value passed as an argument is passed without ownership. A value returned from a function is returned with ownership. The rules for a completion value are the same as the rules for the values contained in it.
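These rules can be illustrated with a reference-counted string. The sketch below is hypothetical: `sketch_string` and the `sketch_` functions only mimic the roles of the "Alloc" and "Free" operations from the table and are not the engine's API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference-counted string.
struct sketch_string
{
  unsigned refs;     // number of owners
  const char *chars; // string contents
};

// "Alloc"-like operation: the returned reference is owned by the caller.
static sketch_string *
sketch_copy_or_ref_string (sketch_string *str_p)
{
  assert (str_p != NULL && str_p->refs > 0);
  str_p->refs++;
  return str_p;
}

// "Free"-like operation: gives up one ownership.
static void
sketch_deref_string (sketch_string *str_p)
{
  assert (str_p != NULL && str_p->refs > 0);
  str_p->refs--;
}

// The argument is passed without ownership, so the routine must not free it.
// The result is returned with ownership, so the caller has to free it.
static sketch_string *
sketch_op_identity (sketch_string *arg_p)
{
  return sketch_copy_or_ref_string (arg_p);
}
```

A caller that owns a string with one reference sees the counter rise to two after `sketch_op_identity` and fall back to one after calling `sketch_deref_string` on the result.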
Opcode handler structure
Most opcode handlers consist of the following steps:
1. Decode instruction (i.e. extract `idx`-s)
2. Read input values from variables/registers
3. Perform necessary type conversions
4. Perform calls to runtime
5. Save execution result to output variable
6. Increment opcode position counter
7. Return completion value.
Steps 2-5 can produce exceptions. In this case execution continues after the corresponding FINALIZE mark, and the completion value is set to the thrown exception.
Exception handling
Operations that could produce exceptions should be performed in one of the following ways:
- wrapped into an ECMA_TRY_CATCH block:
  `ECMA_TRY_CATCH (value_returned_from_op, op (...), ret_value_of_the_whole_routine_handler)`
  `...`
  `ECMA_FINALIZE (value_returned_from_op);`
  `return ret_value;`
- `ret_value = op (...);`
- manual handling (for special cases like interpretation of opfunc_try_block).