jerryscript/04.internals.md
Sung-Jae Lee 2f9f4dba85 Fix invalid link in 'Internals' page.
JerryScript-DCO-1.0-Signed-off-by: Sung-Jae Lee sjlee@mail.com
2015-06-29 19:42:18 +09:00

---
layout: page
title: Internals
permalink: /internals/
---

* toc
{:toc}

High-Level Design

![High-Level Design]({{ site.baseurl }}/img/engines_high_level_design.jpg){: class="thumbnail center-block img-responsive" }

The diagram above shows the interaction of the major components of the system: Parser and Runtime. The Parser translates the input ECMAScript application into byte-code of the specified format (refer to the Bytecode and Parser pages for details). The prepared byte-code is executed by the Runtime engine, which interprets it (refer to the Virtual Machine and ECMA pages for details).

Parser

The parser is implemented as a recursive descent parser. It does not build any kind of Abstract Syntax Tree; it converts the JavaScript source code directly into byte-code.

The parser consists of five major parts:

  • lexer
  • parser
  • opcodes dumper
  • syntax errors checker
  • serializer

These four components (all except the parser itself) are initialized during the parser_init call (jerry-core/parser/js/parser.cpp).

This initializer requires two subsystems to be initialized beforehand: the memory allocator and the serializer. The need for the allocator is clear. The serializer resets the internal bytecode_data structure (jerry-core/parser/js/bytecode-data.h). Currently bytecode_data is a singleton. During parsing it is filled with the data needed for further execution:

  • Byte-code - array of opcodes (bytecode_data.opcodes).
  • Literals - array of literals (bytecode_data.literals).
  • Strings buffer (bytecode_data.strings_buffer) - literals of type LIT_STR contain pointers to strings, which are located in this buffer.

The following is a brief review of the mentioned components. More detailed descriptions are given in the following chapters.

  • Lexer The lexer splits the input file (given as the first parameter of the parser_init call) into a sequence of tokens. These tokens are then matched on demand.
  • Opcodes dumper This component performs the necessary checks and preparations, and dumps opcodes using the serializer.
  • Serializer The serializer puts opcodes, prepared by the dumper, into a continuous array that represents the current scope's code. It also provides an API for accessing the byte-code.
  • Syntax error checker This is a set of simple die-on-error checks.

After initialization, parser_parse_program (jerry-core/parser/js/parser.cpp) should be called. This function performs the following steps (the so-called parsing steps) for all scopes (global code and functions):

  1. Initialize a scope.
  2. Do pre-parser stage.
  3. Parse the scope code.

After every scope is processed, the parser merges all scopes into a single byte-code array.

Two new entities were introduced above: scopes and the pre-parser.

  • There are two types of scopes in the parser: the global scope and function declaration scopes. Note that function expressions do not create a new scope in terms of the parser. The reason why is described below. Parsing starts in the global scope. If a function declaration occurs during the process, a new scope is created and pushed onto a stack of current scopes; then steps 1-3 of parsing are performed for it. Note that only global scope parsing merges all scopes into the byte-code. All scopes are stored in a tree that represents their hierarchy.
  • Pre-parser. This step performs hoisting of variable declarations. First, it dumps reg_var_decl opcodes. Then it goes through the script and looks for variable declaration lists. For every variable found in the scope (not in a sub-scope or function expression) it dumps a var_decl opcode. After this step, the byte-code of the scope starts with an optional 'use strict' marker, followed by reg_var_decl and zero or more var_decl opcodes.

Due to some limitations of the parser, some parsing functions take this_arg and/or prop as parameters. These are later used to dump the prop_setter opcode. During parsing all necessary data is stored in either stacks or scope trees. After the whole program is parsed, the parser merges all scopes into a single byte-code, hoisting function declarations in the process. This task, the so-called post-parser, is performed by the scopes_tree_raw_data function (jerry-core/js/scopes-tree.c). For further information about the post-parser, see the opcodes dumper section.

Lexer

The lexer splits the input string into a sequence of tokens. The token structure (./jerry-core/parser/js/lexer.h) consists of three elements: the token type, the location of the token and optional data:

{% highlight cpp %}
typedef struct
{
  locus loc;            /* location of the token */
  token_type type;      /* token type (see token_type enum) */
  literal_index_t uid;  /* token-specific data */
} token;
{% endhighlight %}

The location of a token (locus) is simply the index of the token's first character in the string that represents the program. Token types are listed in the lexer.h header file (token_type enum). Depending on the token type, the token-specific data (uid field) has a different meaning.

| Token type | 'uid' meaning |
|---|---|
| TOK_KEYWORD | Keyword id, like KW_DO, KW_CONST, etc. (see 'keyword' enum in lexer.h). |
| TOK_NAME, TOK_STRING, TOK_NUMBER | Literal index in the stack of literals. |
| TOK_BOOL | 0 - 'false', 1 - 'true' |
| TOK_SMALL_INT | Value of small integer (0-255). |
| Other (punctuators) | Not used. |

The token matching algorithm is straightforward: look at the first character of the new token, recognize its type, and then match the rest. Comments and space characters (except new lines) are ignored, so they produce no tokens. The algorithm uses two pointers: buffer and token_start. The first points to the next character of the input; the second points to the first character of the token being matched, the so-called current token.
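The two-pointer scan can be sketched roughly as follows. This is a minimal illustration and not the engine's lexer: the sketch_token type, the next_token function and the reduced token set are assumptions made for the example, and comment skipping is left out. Only the pointer names buffer and token_start come from the description above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical, reduced token set used only in this sketch.
enum sketch_token_type { SK_EOF, SK_NEWLINE, SK_NAME, SK_NUMBER, SK_PUNCT };

struct sketch_token
{
  sketch_token_type type;
  size_t loc;        // index of the token's first character (the "locus")
  std::string text;  // characters matched for this token
};

// Scans one token starting at *buffer and advances *buffer past it.
// `src` is the NUL-terminated program text that *buffer points into.
static sketch_token
next_token (const char *src, const char **buffer)
{
  assert (src != NULL && buffer != NULL && *buffer != NULL);

  // Spaces and tabs (but not new lines) produce no token.
  while (**buffer == ' ' || **buffer == '\t')
  {
    (*buffer)++;
  }

  const char *token_start = *buffer;  // first character of the current token
  sketch_token tok;
  tok.loc = (size_t) (token_start - src);

  // Look at the first character, recognize the type, then match the rest.
  unsigned char c = (unsigned char) **buffer;
  if (c == '\0')
  {
    tok.type = SK_EOF;
    return tok;
  }
  else if (c == '\n')
  {
    (*buffer)++;
    tok.type = SK_NEWLINE;
  }
  else if (std::isalpha (c) || c == '_' || c == '$')
  {
    while (std::isalnum ((unsigned char) **buffer) || **buffer == '_' || **buffer == '$')
    {
      (*buffer)++;
    }
    tok.type = SK_NAME;
  }
  else if (std::isdigit (c))
  {
    while (std::isdigit ((unsigned char) **buffer))
    {
      (*buffer)++;
    }
    tok.type = SK_NUMBER;
  }
  else
  {
    (*buffer)++;  // every other character is treated as a one-character punctuator
    tok.type = SK_PUNCT;
  }

  tok.text.assign (token_start, (size_t) (*buffer - token_start));
  return tok;
}
```

Calling next_token repeatedly on the same buffer pointer yields the tokens in order, which corresponds to the "matched on demand" behaviour: nothing is scanned until the parser asks for the next token.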

The lexer remembers two tokens during a scan: the current one and the previously seen one. It also allows buffering one token to be rescanned (lexer_save_token) and setting the scan position to any location in the file (lexer_seek).

The parser uses the lexer to scan the file twice: once during the pre-parsing stage and once during the parsing stage.

Currently the lexer does not support any encoding except ASCII. Also the lexer does not support regular expressions.

Opcodes dumper

The opcodes dumper is a fairly high-level wrapper around the serializer. It was introduced to separate parsing from dumping opcodes. To understand how the opcodes dumper works, one should be familiar with the byte-code layout (see the corresponding description).

The main data structure of the dumper is the operand (jerry-core/parser/js/opcodes-dumper.h). An operand can represent either a variable (i.e. a literal) or a temporary register (tmp). The most awkward part of the dumper is handling the difference between these two types.

Byte-code is divided into blocks of fixed size (BLOCK_SIZE in jerry-core/parser/js/bytecode-data.h) and each block has independent encoding of variable names, which are represented by 8 bit numbers - uids. Operands are encoded as uids in each opcode (see the opcode_t structure). As byte-code decomposition into blocks is not possible until parsing is finished, uids can't be calculated on the fly. Therefore literal operands are encoded by literal indexes (literal_index_t - index in the global literals array) during parsing. In the post-parser stage these indexes are converted to block specific uids.
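The conversion from global literal indexes to block-specific uids can be sketched as below. This is only an illustration under simplifying assumptions: every instruction is taken to reference exactly one literal, SKETCH_BLOCK_SIZE is a made-up value (the real BLOCK_SIZE is defined in bytecode-data.h), and the names assign_block_uids, sketch_lit_index_t and sketch_uid_t are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

typedef uint32_t sketch_lit_index_t;  // stand-in for literal_index_t
typedef uint8_t sketch_uid_t;         // 8-bit, block-local identifier

const size_t SKETCH_BLOCK_SIZE = 4;   // hypothetical block size, in instructions

// For each instruction (one literal operand per instruction in this sketch)
// returns the uid of its literal inside the block the instruction belongs to.
static std::vector<sketch_uid_t>
assign_block_uids (const std::vector<sketch_lit_index_t> &literal_operands)
{
  std::vector<sketch_uid_t> uids;
  std::map<sketch_lit_index_t, sketch_uid_t> block_map;

  for (size_t pos = 0; pos < literal_operands.size (); pos++)
  {
    if (pos % SKETCH_BLOCK_SIZE == 0)
    {
      block_map.clear ();  // every block numbers its literals independently
    }

    const sketch_lit_index_t lit = literal_operands[pos];
    std::map<sketch_lit_index_t, sketch_uid_t>::iterator it = block_map.find (lit);
    if (it == block_map.end ())
    {
      assert (block_map.size () < 256);  // a uid has to fit into 8 bits
      const sketch_uid_t uid = (sketch_uid_t) block_map.size ();
      it = block_map.insert (std::make_pair (lit, uid)).first;
    }
    uids.push_back (it->second);
  }

  return uids;
}
```

In this sketch the same literal can receive different uids in different blocks, which is why the uids cannot be computed before the final position of every instruction is known.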

During parsing, a scopes tree structure is constructed (see scopes_tree_int in jerry-core/parser/js/scopes-tree.h). Each tree node comprises its byte-code and a list of child scopes. While the final byte-code is a plain array of opcode_t structures, the byte-code in tree nodes is represented by a list of op_meta structures. The op_meta structure wraps opcode_t with an array of 3 values (result, operand_1 and operand_2) that hold literal indexes, so that literal operands can be encoded.

In each dump_* function (jerry-core/parser/js/opcodes-dumper.h) the dumper checks the operand type and dumps the appropriate op_meta to the scopes tree using the serializer. The dumper also keeps opcode counters of rewritable opcodes in a set of stacks. Functions with names like dump_*_for_rewrite dump an op_meta and push its opcode counter onto a stack. Functions with names like rewrite_* later pop an opcode counter from the stack, retrieve the op_meta by the dematerializer, and rewrite the necessary fields of the opcode.
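The dump-then-rewrite pattern can be illustrated with a forward jump whose target is unknown at the moment it is dumped. The sketch below is an assumption-laden simplification: sketch_op, sketch_dumper and the three functions are hypothetical names, and a single stack stands in for the dumper's several stacks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stack>
#include <vector>

const uint8_t SK_OP_NOP = 0;
const uint8_t SK_OP_JMP_DOWN = 1;

// Hypothetical stand-in for an op_meta entry.
struct sketch_op
{
  uint8_t opcode;
  uint16_t jump_offset;  // stays 0 until the jump is rewritten
};

struct sketch_dumper
{
  std::vector<sketch_op> code;
  std::stack<size_t> jumps_to_rewrite;  // opcode counters of pending jumps
};

static void
dump_nop (sketch_dumper &d)
{
  sketch_op op = { SK_OP_NOP, 0 };
  d.code.push_back (op);
}

// Dumps a jump with a placeholder offset and remembers where it was dumped.
static void
dump_jump_for_rewrite (sketch_dumper &d)
{
  d.jumps_to_rewrite.push (d.code.size ());
  sketch_op op = { SK_OP_JMP_DOWN, 0 };
  d.code.push_back (op);
}

// Patches the most recently dumped pending jump so that it targets the
// position where the next opcode will be dumped.
static void
rewrite_jump (sketch_dumper &d)
{
  assert (!d.jumps_to_rewrite.empty ());
  const size_t jump_pos = d.jumps_to_rewrite.top ();
  d.jumps_to_rewrite.pop ();

  assert (d.code[jump_pos].opcode == SK_OP_JMP_DOWN);
  d.code[jump_pos].jump_offset = (uint16_t) (d.code.size () - jump_pos);
}
```

Using a stack means that nested constructs are patched innermost first, which matches the order in which a recursive descent parser finishes them.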

The post-parser merges scopes into a single byte-code. For each scope it first dumps the header of the scope, which consists of an optional func_decl with function_end opcode pair, an optional use strict marker, reg_var_decl and optional var_decls. Then it recursively dumps sub-scopes. Finally, it dumps the remaining opcodes. The byte-code is split into blocks of fixed size; each block has its own counter of literals. While dumping opcodes, the post-parser replaces LITERAL_TO_REWRITE markers with this counter's value.

Serializer

The serializer dumps literals collected by the lexer to bytecode_data, and is used by the dumper to dump or rewrite op_metas in the current scope.

Syntax Errors Checker

This component just checks for syntax errors defined in the specification. It uses stacks to store the necessary data, for example argument names.

Byte-code

Every bytecode instruction consists of an opcode and up to three operands. An operand (idx) can be either a "register" or a string literal specifying an identifier to evaluate (i.e. a variable); such an operand is referred to as "storage idx" below. The general structure of an instruction is shown on the picture.

| opcode | idx | idx | idx |


Special kinds of instructions are described below.

Arithmetic/bitwise-logic/logic/comparison/shift

An arithmetic instruction can have one of the following structures:

|opcode|dst|left|right|

|opcode|dst|value|-|

where dst/left/right/value identify an operand.

Control (jumps)

Control instructions use two bytes to encode the jump location. The destination offset is contained in the offset_high and offset_low fields.

|opcode|offset-high|offset-low|-|

|opcode|cond value|offset-high|offset-low|

A conditional jump checks the cond value field, which identifies an operand, and performs the jump if the operand has a true value.
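Splitting an offset across the two one-byte fields can be sketched as follows. The sketch assumes that offset_high holds the upper eight bits and offset_low the lower eight bits of a 16-bit offset, which is the natural reading of the field names; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair of operand bytes carrying a jump offset.
struct sketch_jump_fields
{
  uint8_t offset_high;
  uint8_t offset_low;
};

// Splits a 16-bit offset into two bytes (assumed order: high byte first).
static sketch_jump_fields
encode_jump_offset (uint16_t offset)
{
  sketch_jump_fields f;
  f.offset_high = (uint8_t) (offset >> 8);
  f.offset_low = (uint8_t) (offset & 0xFF);
  return f;
}

// Reassembles the 16-bit offset from its two bytes.
static uint16_t
decode_jump_offset (sketch_jump_fields f)
{
  return (uint16_t) (((uint16_t) f.offset_high << 8) | f.offset_low);
}

// Usage: an offset of 0x0123 is stored as high = 0x01 and low = 0x23.
static void
jump_offset_example ()
{
  sketch_jump_fields f = encode_jump_offset (0x0123);
  assert (f.offset_high == 0x01 && f.offset_low == 0x23);
  assert (decode_jump_offset (f) == 0x0123);
}
```

With two bytes, a jump in this scheme can span at most 65535 instructions in one direction.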

Assignment

Assignment instructions assign an immediate value (contained inside the instruction) to the operand, which is marked as dst on the picture.

|op_assignment|dst|type|value|

where:

  • dst - "storage idx", identifies where to store the value
  • type - specifies the value type
  • value - depends on the type field

Type of the immediate value is encoded in the type field of instruction. The following values are supported:

  • "simple value" (see ECMA types encoding)
  • small integer/negative small integer
  • number literal/negative number literal
  • string value, initialized by string literal ("literal idx")
  • "storage idx"

Exit

Exit instruction serves to stop the execution and exit with a specified status.

|op_exit|status(0/1)|-|-|

Exit instruction is employed in the following cases:

  • at script end (exit with "successful" status);
  • in script assertion failure handling code (exit with "fail" status)

Native call (intrinsic call)

Native call instruction is used to call intrinsics. Arguments are not encoded directly inside this instruction; instead they follow it as special "meta" instructions (see the corresponding section). The id of the desired intrinsic is encoded in the intrinsic_id field.

|op_native_call|dst|intrinsic_id|arg_list|

where:

  • dst - "storage idx"
  • arg_list - number of arguments

Function call/Constructor call

Function/constructor call instructions are used to perform calls to functions and constructors. The destination operand is encoded in the dst field. The name_idx operand specifies the function to call. Arguments are encoded in the same way as in the native call instruction.

|opcode|dst|name_idx|arg_list|

where:

  • dst - "storage idx"
  • name_idx - "storage idx" (which value to call)
  • arg_list - number of arguments

Function declaration

Function declarations are represented by a special kind of instruction. The function name and the number of arguments are located in the name_idx and arg_list fields respectively.

|opcode|name_idx|arg_list|-|

where:

  • name_idx - literal idx
  • arg_list - number of arguments

Function expression

A function expression is very similar to a function declaration, but it additionally contains a destination (dst) field, and the name operand is optional because anonymous functions are possible.

|opcode|dst|name_idx|arg_list|

where:

  • dst - "storage idx"
  • name_idx - literal idx (can be unspecified for an anonymous function expression)
  • arg_list - number of arguments

Return from function/eval

Return instructions perform an unconditional return from function/eval code. A return value can be specified (idx field).

|op_ret|-|-|-|

|op_retval|idx|-|-|

where idx - "storage idx"

"Meta" (special marker opcode)

|op_meta|type|arg1|arg2|

Meta instructions are usually utilized as continuations of other instructions. Depending on type field, meta instruction can have the following meaning:

  • 'this' argument (for calls in a.f() form, a = this), put right after call opcode
  • varg (encodes an argument for calls and array declarations (arg1 - storage idx) / parameters name for function decl/expr (arg1 - literal idx, i.e. string))
  • varg_prop_data / varg_prop_getter / varg_prop_setter - name (literal idx) and value/getter/setter (storage idx) of a property (see also: object declaration)
  • end_with / function_end / end_of_try_catch_finally - end offset of 'with' block/function/try_catch_finally sequence
  • catch / finally - start of catch/finally block and offset to the end of the block
  • strict code - placed at the start of a scope's code if the source code contains 'use strict' at the beginning

Delete

The JavaScript delete operator is represented by a delete instruction in the bytecode. There are two types of delete instruction, applied either to an element of a lexical environment or to an object's property.

|op_delete_var|dst|name|-|

|op_delete_prop|dst|base_value|name|

where:

  • dst - "storage idx"
  • name - literal idx
  • base_value - "storage idx"

This binding (evaluate "this")

This binding instruction writes value of "this" to the dst operand.

|op_this|dst|-|-|

where dst - "storage idx"

typeof (typeof operation)

Typeof instruction executes JavaScript operator with the same name. Result is written to the dst operand.

|op_typeof|dst|value|-|

where dst and value - "storage idx"

with block

To specify bounds of "with" block, a pair of instructions is used. "With" instruction specifies its start.

| op_with | value | - | - |

where value - "storage idx" (evaluated expression - argument of with)

Followed by a number of arbitrary instructions, the block ends with end_with meta instruction.

| op_with |
| ... |
| op_meta (end_with) |

try block

A try block consists of a try instruction followed by a number of arbitrary instructions; then a catch meta instruction, a finally meta instruction, or both, which start the catch and finally blocks respectively; and finally an end_try_catch_finally meta instruction, which finishes the whole construction.

| op_try_block | offset_high | offset_low | - |

where offset_high and offset_low - offset of the end of try block

| op_try_block |
| ... |
| op_meta (catch) |
| ... |
| op_meta (finally) |
| ... |
| op_meta (end_try_catch_finally) |

Object declaration

An object declaration instruction represents an object literal as defined in the JavaScript specification. It consists of an op_obj_decl instruction followed by a list of prop_data, prop_getter and prop_setter meta instructions. A series of instructions that evaluate property values can precede the meta instructions. The number of meta instructions, i.e. the number of properties, is specified in the prop_num field.

| op_obj_decl | dst | prop_num | - |

where:

  • dst - "storage idx" (where to save the created object)
  • prop_num - number of properties

| op_obj_decl |
| ... (intermediate evaluation of value/function expression, etc.) |
| op_meta (prop_data / prop_getter / prop_setter) |

Arguments and array declaration

The strategy described in the previous section is also used for encoding arguments in function/constructor calls and elements in array declarations. See the corresponding pictures.

| op_with | value | - | - |

where value - "storage idx" (evaluated expression - argument of with)

| op_with |
| ... |
| op_meta (end_with) |

Virtual machine

The virtual machine executes bytecode by interpreting instructions one by one. The bytecode is a continuous array of instructions, divided into blocks of fixed size. The main loop of the interpreter calls the corresponding opfunc_* function for every instruction. This function returns a completion value and the position of the next instruction.

![Bytecode storage]({{ site.baseurl }}/img/bytecode_storage.jpg){: class="thumbnail center-block img-responsive" }

Instruction can have up to three operands which are represented by idx values. Meaning of idx value depends on opcode and can be the following:

  • id of a temporary variable (register)
  • id of a literal (queried from the serializer, specific to every block of bytecode)
  • type of assigned value, id of number/string literal or simple value in op_assignment
  • type of meta and corresponding arguments in op_meta
  • idx pair may represent opcode position

During the execution every function of the source code has associated interpreter context, which consists of the following items:

  • current position (byte-code instruction to execute)
  • 'this' binding (ecma-value)
  • lexical environment
  • is_strict flag (is current execution code strict)
  • is_eval_code flag (is current execution mode eval)
  • min_reg_num, max_reg_num - range of idx's used for "registers"
  • stack frame (array of "register" values)

Main routines of the virtual machine are:

  • run_int - starts execution of Global code (main program).
  • run_int_from_pos - executes specified code scope (global/function/eval), expects the following arguments: starting position, 'this' binding, lexical environment.
  • run_int_loop - interpretation loop.
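The interpretation loop can be pictured with the following sketch. It is not the engine's code: the three opcodes, the sketch_ctx layout, the handler table and run_loop are hypothetical, the "registers" hold plain integers instead of ECMA values, and the completion value is reduced to a two-valued enum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const size_t SK_REG_COUNT = 8;

struct sketch_instr { uint8_t opcode; uint8_t idx[3]; };

enum sketch_completion { SK_COMPLETION_NORMAL, SK_COMPLETION_EXIT };

// Reduced interpreter context: current position and "register" values.
struct sketch_ctx
{
  size_t pos;
  int32_t regs[SK_REG_COUNT];
};

typedef sketch_completion (*sketch_opfunc_t) (const sketch_instr &, sketch_ctx &);

// regs[dst] = small integer immediate stored in the instruction
static sketch_completion
opfunc_assign_smallint (const sketch_instr &in, sketch_ctx &ctx)
{
  assert (in.idx[0] < SK_REG_COUNT);
  ctx.regs[in.idx[0]] = in.idx[1];
  ctx.pos++;
  return SK_COMPLETION_NORMAL;
}

// regs[dst] = regs[left] + regs[right]
static sketch_completion
opfunc_add (const sketch_instr &in, sketch_ctx &ctx)
{
  assert (in.idx[0] < SK_REG_COUNT && in.idx[1] < SK_REG_COUNT && in.idx[2] < SK_REG_COUNT);
  ctx.regs[in.idx[0]] = ctx.regs[in.idx[1]] + ctx.regs[in.idx[2]];
  ctx.pos++;
  return SK_COMPLETION_NORMAL;
}

// Requests the end of execution; the position is left on the exit opcode.
static sketch_completion
opfunc_exit (const sketch_instr &in, sketch_ctx &ctx)
{
  (void) in;
  (void) ctx;
  return SK_COMPLETION_EXIT;
}

enum { SK_OP_ASSIGN, SK_OP_ADD, SK_OP_EXIT, SK_OP_COUNT };

static const sketch_opfunc_t sketch_handlers[SK_OP_COUNT] =
{
  opfunc_assign_smallint, opfunc_add, opfunc_exit
};

// Calls the handler of the current instruction until one of them asks to exit.
static void
run_loop (const sketch_instr *code, size_t code_len, sketch_ctx &ctx)
{
  while (true)
  {
    assert (ctx.pos < code_len);
    const sketch_instr &in = code[ctx.pos];
    assert (in.opcode < SK_OP_COUNT);

    if (sketch_handlers[in.opcode] (in, ctx) == SK_COMPLETION_EXIT)
    {
      break;
    }
  }
}
```

Each handler updates the position itself, so jump-like opcodes could set it to an arbitrary target instead of incrementing it.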

ECMA

ECMA component of the engine is responsible for the following notions:

  • Data representation
  • Runtime representation
  • GC

Data representation

The major structure for data representation is ECMA_value. Lower two bits of this structure encode value tag, which determines the type of the value:

  • simple
  • number
  • string
  • object

![ECMA value representation]({{ site.baseurl }}/img/ecma_value.jpg){: class="thumbnail center-block img-responsive" }

The immediate value is placed in higher bits. "Simple value" is an enumeration, which consists of the following elements:

  • undefined
  • null
  • true
  • false
  • empty
  • array_redirect (implementation defined, currently unused, for array storage optimization)

For other value types higher bits of ECMA_value structure contain compressed pointer to the real value.
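Packing a tag into the two lowest bits can be sketched as follows. The container width (32 bits here), the numeric tag assignments and the function names are assumptions chosen for illustration; the text only fixes that the tag occupies the lower two bits and the payload the higher bits.

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t sketch_value_t;  // assumed container width

// Assumed numbering, following the order of the list above.
enum sketch_tag
{
  SK_TAG_SIMPLE = 0,
  SK_TAG_NUMBER = 1,
  SK_TAG_STRING = 2,
  SK_TAG_OBJECT = 3
};

const unsigned SK_TAG_WIDTH = 2;
const sketch_value_t SK_TAG_MASK = (1u << SK_TAG_WIDTH) - 1u;

// Stores the payload (simple value id or compressed pointer) above the tag.
static sketch_value_t
make_value (sketch_tag tag, uint32_t payload)
{
  assert (payload <= (UINT32_MAX >> SK_TAG_WIDTH));  // payload must leave room for the tag
  return (payload << SK_TAG_WIDTH) | (sketch_value_t) tag;
}

static sketch_tag
get_tag (sketch_value_t value)
{
  return (sketch_tag) (value & SK_TAG_MASK);
}

static uint32_t
get_payload (sketch_value_t value)
{
  return value >> SK_TAG_WIDTH;
}
```

Because the tag is part of the value itself, checking the type of a value needs no memory access beyond reading the value.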

Compressed pointers

Compressed pointers were introduced to save heap space. They are possible because the heap size is currently limited to 256 KB, which requires 18 bits to address. ECMA values in the heap are aligned to 8 bytes, which saves three more bits, so a compressed pointer takes only 15 bits.
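The arithmetic behind the 15 bits can be sketched as below. The static sketch_heap array and the two functions are hypothetical, and the sketch makes no provision for a NULL pointer, which a real implementation has to represent somehow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const size_t SK_HEAP_SIZE = 256 * 1024;  // 2^18 bytes
const unsigned SK_ALIGN_LOG = 3;         // 8-byte alignment drops 3 bits

alignas (8) static uint8_t sketch_heap[SK_HEAP_SIZE];

typedef uint16_t sketch_cp_t;  // only the low 15 bits are used

// Converts an 8-byte aligned heap address into a 15-bit heap offset.
static sketch_cp_t
compress_pointer (const void *ptr)
{
  const uint8_t *byte_ptr = static_cast<const uint8_t *> (ptr);
  assert (byte_ptr >= sketch_heap && byte_ptr < sketch_heap + SK_HEAP_SIZE);

  const size_t offset = (size_t) (byte_ptr - sketch_heap);
  assert ((offset & ((1u << SK_ALIGN_LOG) - 1u)) == 0);  // must be 8-byte aligned

  return (sketch_cp_t) (offset >> SK_ALIGN_LOG);  // 18 - 3 = 15 significant bits
}

// Restores the full address from the compressed form.
static void *
decompress_pointer (sketch_cp_t cp)
{
  assert (cp < (1u << 15));
  return sketch_heap + ((size_t) cp << SK_ALIGN_LOG);
}
```

The last aligned slot of the heap, at byte offset 262136, compresses to 32767, the largest 15-bit value.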

![Heap and ECMA elements]({{ site.baseurl }}/img/ecma_compressed.jpg){: class="thumbnail center-block img-responsive" }

ECMA data elements are allocated in pools (pools are allocated on the heap). The chunk size of a pool is 8 bytes, which reduces fragmentation.

Number

There are two possible representations of numbers:

  • 4-byte (float, compact profile - lower memory consumption, but with hardware limitations)
  • 8-byte (double, full profile)

Multiple references to a single allocated number are not supported: each reference holds its own copy of the number.

String

String values are encoded by 8-byte structure, which contains the following fields:

  • references counter - each stack (and non_stack) reference is counted (upon overflow, string is duplicated)
  • is_stack_allocated - some temporary strings are stack-allocated to reduce memory load (perf)
  • container - type of actual string storage/encoding
  • hash - hash, calculated from two last characters (for faster comparison (perf))
  • literal identifier - actual string is in the literal storage
  • magic_string_id - string is equal to one of engine's magic strings
  • uint32 - string is represented with unsigned integers (useful for array indexing)
  • number_cp (compressed pointer to number) - string is represented with floating point number
  • collection_cp - string is stored in one or several pool's chunks (see also: chars collection, collection header, collection chunk)
  • concatenation_1_cp, concatenation_2_cp - pointers to two strings (parts of concatenation)

Object / Lexical environment

Object and lexical environment structures, 8 bytes each, have common (GC) header:

  • Stack refs counter
  • Next object/lexical environment in list of objects/lexical environments
  • GC's visited flag
  • is_lexenv flag

Remaining fields of these structures are different and are shown on the picture.

![Object/Lexical environment structures]({{ site.baseurl }}/img/ecma_object.jpg){: class="thumbnail center-block img-responsive" }

Property of an object / description of a lexical environment variable

While objects consist of properties, lexical environments consist of variables. Both kinds of units are linked into lists. Units can be of different types:

  • named data (property or variable)
  • named accessor (property)
  • internal (implementation defined)

All these units occupy 8 bytes and have common header:

  • type - 2 bit
  • next property/variable in the object/lexical environment (compressed pointer)

The remaining parts are different:

![Object property/lexical environment variable]({{ site.baseurl }}/img/ecma_object_property.jpg){: class="thumbnail center-block img-responsive" }

Collections

ECMA runtime utilizes collections for intermediate calculations. Collection consists of a header and a number of linked chunks, which hold collection values.

Header occupies 8 bytes and consists of:

  • compressed pointer to the next chunk
  • number of elements
  • rest space, aligned down to byte, is for the first chunk of data in collection

The chunk layout is as follows:

  • compressed pointer to the next chunk
  • rest space, aligned down to byte, is for data stored in corresponding part of the collection

Internal properties:

  • Class - class of the object (ECMA-defined)
  • Prototype - is stored in object description
  • Extensible - is stored in object description
  • CScope - lexical environment (function's variable space)
  • ParametersMap - arguments object -0 code of the function
  • Code - where to find bytecode of the function
  • native code - where to find code of native function
  • native handle - some uintptr_t associated with the object
  • FormalParameters - collection of pointers to ecma_string_t (the list of formal parameters of the function)
  • PrimitiveValue for String - for String object
  • PrimitiveValue for Number - for Number object
  • PrimitiveValue for Boolean - for Boolean object
  • built-in related:
    • built-in id - id of built-in object
    • built-in routine id - id of built-in routine
    • "non-instantiated" mask - which built-in properties were not instantiated yet (lazy instantiation)
    • extension object identifier

LCache

LCache is a cache for property/variable search requests.

![LCache]({{ site.baseurl }}/img/ecma_lcache.jpg){: class="thumbnail center-block img-responsive"}

Entry of LCache has the following layout:

  • object pointer
  • property name (pointer to string)
  • property pointer

The cache row is determined by the hash of the property name string. When a property access occurs, all entries of the row are searched by comparing the object pointer and the property name with the entry's fields; a full comparison is used for the property name.

If a matching entry is found, its property pointer is returned (it may be NULL when the given object has no property with the specified name). Otherwise, the object's property set is iterated fully and a corresponding record is registered in LCache (with the property pointer if the property was found, or NULL otherwise).
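The lookup procedure can be sketched as follows. All names, the cache dimensions, the hash function and the replacement policy are assumptions for illustration; invalidation of entries when properties are added or deleted is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const size_t SK_LCACHE_ROWS = 8;     // hypothetical dimensions
const size_t SK_LCACHE_ROW_LEN = 2;

struct sketch_property { const char *name; int value; sketch_property *next; };
struct sketch_object { sketch_property *first_property; };

struct sketch_lcache_entry
{
  const sketch_object *object;  // object pointer
  const char *name;             // property name (must outlive the entry)
  sketch_property *property;    // property pointer, NULL caches "not found"
  bool used;
};

static sketch_lcache_entry sketch_lcache[SK_LCACHE_ROWS][SK_LCACHE_ROW_LEN];
static unsigned sketch_full_searches = 0;  // counts cache misses, for demonstration

// Hypothetical hash: sum of the characters, reduced to a row number.
static size_t
row_of (const char *name)
{
  size_t sum = 0;
  for (const char *p = name; *p != '\0'; p++)
  {
    sum += (unsigned char) *p;
  }
  return sum % SK_LCACHE_ROWS;
}

static sketch_property *
find_property (const sketch_object *obj, const char *name)
{
  assert (obj != NULL && name != NULL);
  const size_t row = row_of (name);

  // 1. Search the row: object pointer plus full comparison of the name.
  for (size_t i = 0; i < SK_LCACHE_ROW_LEN; i++)
  {
    const sketch_lcache_entry &e = sketch_lcache[row][i];
    if (e.used && e.object == obj && std::strcmp (e.name, name) == 0)
    {
      return e.property;  // may be NULL
    }
  }

  // 2. Miss: iterate the object's property list fully.
  sketch_full_searches++;
  sketch_property *found = NULL;
  for (sketch_property *p = obj->first_property; p != NULL; p = p->next)
  {
    if (std::strcmp (p->name, name) == 0)
    {
      found = p;
      break;
    }
  }

  // 3. Register the result; take the first free slot, else overwrite slot 0.
  size_t slot = 0;
  for (size_t i = 0; i < SK_LCACHE_ROW_LEN; i++)
  {
    if (!sketch_lcache[row][i].used)
    {
      slot = i;
      break;
    }
  }
  sketch_lcache_entry entry = { obj, name, found, true };
  sketch_lcache[row][slot] = entry;
  return found;
}
```

Caching the NULL result as well means that repeated lookups of a missing name also avoid the full iteration.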

Runtime

ECMA-defined runtime operations are implemented mostly by routines with one of the following signatures:

ecma_completion_value_t ecma_op_* ([ecma_value_t arguments])
or
ecma_property_t * ecma_op_[find/get]*_property (objs, name string, ...)

However, combinations of these forms also occur.

Completion value

Many algorithms/routines described in ECMA return a value of the "completion" type, which is a triplet of the following form:

![ECMA completion]({{ site.baseurl }}/img/ecma_completion.jpg){: class="thumbnail center-block img-responsive" }

Jerry introduces two additional completion types:

  • exit - produced by exitval opcode, indicates request to finish execution
  • meta - produced by meta instruction, used to catch meta opcodes in interpreter loop without explicit comparison on every iteration (for example: meta 'end_with')

Value management and ownership

Every value stored by the engine is associated with a virtual "ownership" (that is, the responsibility to manage the value and free it when it is no longer needed, or to pass ownership of the value on to somewhere else).

| Value type | "Alloc" op | "Free" op |
|---|---|---|
| Number | ecma_alloc_number | ecma_dealloc_number |
| String | ecma_copy_or_ref_ecma_string | ecma_deref_ecma_string |
| Object | ecma_ref_object (for on_stack references); on_heap references are managed by GC | ecma_deref_object (for on_stack references); on_heap references are managed by GC |
| Property | (ownership is strongly connected with corresponding object) | (ownership is strongly connected with corresponding object) |
| Simple value | no memory management | no memory management |
| ecma_value (including values contained in completion value) | ecma_copy_value | ecma_free_value |

Initially, a value is allocated by its owner (i.e. with ownership). A value passed as an argument is passed without ownership. A value returned from a function is returned with ownership. The rules for completion values are the same as the rules for the values contained in them.

Opcode handler structure

Most opcode handlers consist of the following steps:

  1. Decode instruction (i.e. extract idx-s)
  2. Read input values from variables/registers
  3. Perform necessary type conversions
  4. Perform calls to runtime
  5. Save execution result to output variable
  6. Increment opcode position counter
  7. Return completion value.

Steps 2-5 can produce exceptions. In this case execution continues after the corresponding FINALIZE mark, and the completion value is set to a "throw" completion.
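The seven steps can be traced in the following sketch of an addition handler. It is a simplified illustration, not one of the engine's handlers: values are either numbers or opaque objects, the hypothetical to_number conversion throws for objects, and plain control flow stands in for the ECMA_TRY_CATCH / ECMA_FINALIZE macros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum sketch_kind { SK_NUMBER, SK_OBJECT };
struct sketch_value { sketch_kind kind; double number; };

enum sketch_completion_type { SK_NORMAL, SK_THROW };

const size_t SK_REG_COUNT = 4;
struct sketch_instr { uint8_t opcode; uint8_t dst; uint8_t left; uint8_t right; };
struct sketch_ctx { size_t pos; sketch_value regs[SK_REG_COUNT]; };

// Hypothetical conversion: in this sketch it fails (throws) for objects.
static sketch_completion_type
to_number (const sketch_value &value, double *out)
{
  if (value.kind != SK_NUMBER)
  {
    return SK_THROW;
  }
  *out = value.number;
  return SK_NORMAL;
}

static sketch_completion_type
opfunc_addition (const sketch_instr &in, sketch_ctx &ctx)
{
  // 1. Decode the instruction (extract the idx-s).
  const uint8_t dst = in.dst, left = in.left, right = in.right;
  assert (dst < SK_REG_COUNT && left < SK_REG_COUNT && right < SK_REG_COUNT);

  // 2. Read input values from the "registers".
  const sketch_value left_value = ctx.regs[left];
  const sketch_value right_value = ctx.regs[right];

  // 3. Perform type conversions; either of them may throw.
  double left_num = 0.0, right_num = 0.0;
  sketch_completion_type ret = to_number (left_value, &left_num);
  if (ret == SK_NORMAL)
  {
    ret = to_number (right_value, &right_num);
  }

  if (ret == SK_NORMAL)
  {
    // 4. Call the runtime (here: plain addition).
    // 5. Save the result to the output "register".
    const sketch_value result = { SK_NUMBER, left_num + right_num };
    ctx.regs[dst] = result;
  }

  // Finalization point: reached on both the normal and the throw path.
  // 6. Increment the opcode position counter.
  ctx.pos++;

  // 7. Return the completion value.
  return ret;
}
```

When the conversion throws, the destination register is left untouched and the caller sees the throw completion.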

Exception handling

Operations that could produce exceptions should be performed in one of the following ways:

  • wrapped into an ECMA_TRY_CATCH block: ECMA_TRY_CATCH (value_returned_from_op, op (...), ret_value_of_the_whole_routine_handler) ... ECMA_FINALIZE (value_returned_from_op); return ret_value;
  • assigned directly to the return value: ret_value = op (...);
  • manual handling (for special cases like interpretation of opfunc_try_block).