Updated section ECMA of JerryScript Internals documentation ending with subsection LCache. JerryScript-DCO-1.0-Signed-off-by: István Kádár ikadar@inf.u-szeged.hu
21 KiB
| layout | title | permalink |
|---|---|---|
| page | Internals | /internals/ |
- toc {:toc}
High-Level Design
{: class="thumbnail center-block img-responsive" }
On the diagram above is shown interaction of major components of JerryScript: Parser and Virtual Machine (VM). Parser performs translation of input ECMAScript application into the byte-code with the specified format (refer to Bytecode and Parser page for details). Prepared bytecode is executed by the Virtual Machine that performs interpretation (refer to Virtual Machine and ECMA pages for details).
Parser
The parser is implemented as a recursive descent parser. The parser converts the JavaScript source code directly into byte-code without building an Abstract Syntax Tree. The parser depends on the following subcomponents.
Lexer
The lexer splits input string (ECMAScript program) into sequence of tokens. It is able to scan the input string not only forward, but it is possible to move to an arbitrary position. The token structure described by structure lexer_token_t in ./jerry-core/parser/js/js-lexer.h.
Scanner
Scanner (./jerry-core/parser/js/js-parser-scanner.h) pre-scans the input string to find certain tokens. For example, scanner determines whether the keyword for defines a general for or a for-in loop. Reading tokens in a while loop is not enough because a slash (/) can indicate the start of a regular expression or can be a division operator.
Expression Parser
Expression parser is responsible for parsing JavaScript expressions. It is implemented in ./jerry-core/parser/js/js-parser-expr.c.
Statement Parser
JavaScript statements are parsed by this component. It uses the [Expression parser](#expression parser) to parse the constituent expressions. The implementation of Statement parser is located in ./jerry-core/parser/js/js-parser-statm.c.
Function parser_parse_source carries out the parsing and compiling of the input EcmaScript source code. When a function appears in the source parser_parse_source calls parser_parse_function which is responsible for processing the source code of functions recursively including argument parsing and context handling. After the parsing, function parser_post_processing dumps the created opcodes and returns an ecma_compiled_code_t* that points to the compiled bytecode sequence.
The interactions between the major components shown on the following figure.
{: class="thumbnail center-block img-responsive" }
Byte-code
This section describes the compact byte-code (CBC) byte-code representation. The key focus is reducing memory consumption of the byte-code representation without sacrificing considerable performance. Other byte-code representations often focus on performance only so inventing this representation is an original research.
CBC is a CISC like instruction set which assigns shorter instructions for frequent operations. Many instructions represent multiple atomic tasks which reduces the byte code size. This technique is basically a data compression method.
Compiled code format
The memory layout of the compiled byte code is the following.
{: class="thumbnail center-block img-responsive" }
The header is a cbc_compiled_code structure with several fields. These fields contain the key properties of the compiled code.
The literals part is an array of ecma values. These values can contain any EcmaScript value types, e.g. strings, numbers, function and regexp templates. The number of literals is stored in the literal_end field of the header.
CBC instruction list is a sequence of byte code instructions which represents the compiled code.
Byte-code Format
The memory layout of a byte-code is the following:
{: class="thumbnail center-block img-responsive" }
Each byte-code starts with an opcode. The opcode is one byte long for frequent and two byte long for rare instructions. The first byte of the rare instructions is always zero (CBC_EXT_OPCODE), and the second byte represents the extended opcode. The name of common and rare instructions start with CBC_ and CBC_EXT_ prefix respectively.
The maximum number of opcodes is 511, since 255 common (zero value excluded) and 256 rare instructions can be defined. Currently around 230 frequent and 120 rare instructions are available.
There are three types of bytecode arguments in CBC:
-
byte argument: A value between 0 and 255, which often represents the argument count of call like opcodes (function call, new, eval, etc.).
-
literal argument: An integer index which is greater or equal than zero and less than the
literal_endfield of the header. For further information see next section Literals (next). -
relative branch: An 1-3 byte long offset. The branch argument might also represent the end of an instruction range. For example the branch argument of
CBC_EXT_WITH_CREATE_CONTEXTshows the end of a with statement. More precisely the position after the last instruction.
Argument combinations are limited to the following seven forms:
- no arguments
- a literal argument
- a byte argument
- a branch argument
- a byte and a literal arguments
- two literal arguments
- three literal arguments
Literals
Literals are organized into groups whose represent various literal types. Having these groups consuming less space than assigning flag bits to each literal.
(In the followings, the mentioned ranges represent those indicies which are greater than or equal to the left side and less than the right side of the range. For example a range between ident_end and literal_end fields of the byte-code header contains those indicies, which are greater than or equal to ident_end
and less than literal_end. If ident_end equals to literal_end the range is empty.)
The two major group of literals are identifiers and values.
-
identifier: A named reference to a variable. Literals between zero and
ident_endof the header belongs to here. All of these literals must be a string or undefined. Undefined can only be used for those literals which cannot be accessed by a literal name. For examplefunction (arg,arg)has two arguments, but theargidentifier only refers to the second argument. In such cases the name of the first argument is undefined. Furthermore optimizations such as CSE may also introduce literals without name. -
value: A reference to an immediate value. Literals between
ident_endandconst_literal_endare constant values such as numbers or strings. These literals can be used directly by the Virtual Machine. Literals betweenconst_literal_endandliteral_endare template literals. A new object needs to be constructed each time when their value is accessed. These literals are functions and regular expressions.
There are two other sub-groups of identifiers. Registers are those identifiers which are stored in the function call stack. Arguments are those registers which are passed by a caller function.
There are two types of literal encoding in CBC. Both are variable length, where the length is one or two byte long.
- small: maximum 511 literals can be encoded.
One byte encoding for literals 0 - 254.
byte[0] = literal_index
Two byte encoding for literals 255 - 510.
byte[0] = 0xff
byte[1] = literal_index - 0xff
- full: maximum 32767 literal can be encoded.
One byte encoding for literals 0 - 127.
byte[0] = literal_index
Two byte encoding for literals 128 - 32767.
byte[0] = (literal_index >> 8) | 0x80
byte[1] = (literal_index & 0xff)
Since most functions require less than 255 literal, small encoding provides a single byte literal index for all literals. Small encoding consumes less space than full encoding, but it has a limited range.
Byte-code Categories
Byte-codes can be placed into four main categories.
Push Byte-codes
Byte-codes of this category serve for placing objects onto the stack. As there are many instructions representing multiple atomic tasks in CBC, there are also many instructions for pushing objects onto the stack according to the number and the type of the arguments. The following table list a few of these opcodes with a brief description.
| byte-code | description | | CBC_PUSH_LITERAL | Pushes the value of the given literal argument. | | CBC_PUSH_TWO_LITERALS | Pushes the value of the given two literal arguments. | | CBC_PUSH_UNDEFINED | Pushes an undefined value. | | CBC_PUSH_TRUE | Pushes a logical true. | | CBC_PUSH_PROP_LITERAL | Pushes a property whose base object is popped from the stack, and the property name is passed as a literal argument. |
Call Byte-codes
The byte-codes of this category perform calls in different ways.
| byte-code | description | | CBC_CALL0 | Calls a function without arguments. The return value won't be pushed onto the stack. | | CBC_CALL1 | Calls a function with one argument. The return value won't be pushed onto the stack. | | CBC_CALL | Calls a function with n arguments. n is passed as a byte argument. The return value won't be pushed onto the stack. | | CBC_CALL0_PUSH_RESULT | Calls a function without arguments. The return value will be pushed onto the stack. | | CBC_CALL1_PUSH_RESULT | Calls a function with one argument. The return value will be pushed onto the stack. | | CBC_CALL2_PROP | Calls a property function with two arguments. The base object, the property name, and the two arguments are on the stack. |
Arithmetic, Logical, Bitwise and Assignment Byte-codes
The opcodes of this category perform arithmetic, logical, bitwise and assignment operations according to the different
| byte-code | description | | CBC_LOGICAL_NOT | Negates the logical value that popped from the stack. The result is pushed onto the stack. | | CBC_LOGICAL_NOT_LITERAL | Negates the logical value that given in literal argument. The result is pushed onto the stack. | | CBC_ADD | Adds two values that are poped from the stack. The result is pushed onto the stack. | | CBC_ADD_RIGHT_LITERAL | Adds two values. The left one popped from the stack, the right one is given as literal argument. | | CBC_ADD_TWO_LITERALS | Adds two values. Both are given as literal arguments. | | CBC_ASSIGN | Assigns a value to a property. It has three arguments: base object, property name, value to assign. | | CBC_ASSIGN_PUSH_RESULT | Assigns a value to a property. It has three arguments: base object, property name, value to assign. The result will be pushed onto the stack. |
Branch Byte-codes
Branch byte-codes are used to perform conditional and unconditional jumps in the byte-code. The arguments of these instructions are 1-3 byte long relative offsets. The number of bytes is part of the opcode, so each byte-code with a branch argument has three forms. The direction (forward, backward) is also defined by the opcode since the offset is an unsigned value. Thus, certain branch instructions has six forms. Some examples can be found in the following table.
| byte-code | description | | CBC_JUMP_FORWARD | Jumps forward by the 1 byte long relative offset argument. | | CBC_JUMP_FORWARD_2 | Jumps forward by the 2 byte long relative offset argument. | | CBC_JUMP_FORWARD_3 | Jumps forward by the 3 byte long relative offset argument. | | CBC_JUMP_BACKWARD | Jumps backward by the 1 byte long relative offset argument. | | CBC_JUMP_BACKWARD_2 | Jumps backward by the 2 byte long relative offset argument. | | CBC_JUMP_BACKWARD_3 | Jumps backward by the 3 byte long relative offset argument. | | CBC_BRANCH_IF_TRUE_FORWARD | Jumps if the value on the top of the stack is true by the 1 byte long relative offset argument. |
Virtual Machine
Virtual machine is an interpreter which executes byte-code instructions one by one. The function that starts the interpretation is vm_run in ./jerry-core/vm/vm.c. vm_loop is the main loop of the virtual machine, which has the peculiarity that it is non-recursive. This means that in case of function calls it does not calls itself recursively but returns, which has the benefit that it does not burdens the stack as a recursive implementation.
ECMA
ECMA component of the engine is responsible for the following notions:
- Data representation
- Runtime representation
- Garbage collection (GC)
Data representation
The major structure for data representation is ECMA_value. The lower two bits of this structure encode value tag, which determines the type of the value:
- simple
- number
- string
- object
{: class="thumbnail center-block img-responsive" }
In case of number, string and object the value contains an encoded pointer, and simple value is a pre-defined constant which can be:
- undefined
- null
- true
- false
- empty (uninitialized value)
Compressed pointers
Compressed pointers were introduced to save heap space. {: class="thumbnail center-block img-responsive" }
These pointers are 8 byte alligned 16 bit long pointers which can address 512 Kb of memory which is also the maximum size of the JerryScript heap.
ECMA data elements are allocated in pools (pools are allocated on heap) Chunk size of the pool is 8 bytes (reduces fragmentation).
Number
There are two possible representation of numbers according to standard IEEE 754:
- 4-byte (float, compact profile)
- 8-byte (double, full profile)
{: class="thumbnail center-block img-responsive" }
Several references to single allocated number are not supported. Each reference holds its own copy of a number.
String
Strings in JerryScript are not just character sequences, but can hold numbers and so-called magic ids too. For common character sequences there is a table in the read only memory that contains magic id and character sequence pairs. If a string is already in this table, the magic id of its string is stored, not the character sequence itself. Using numbers speeds up the property access. These techniques save memory.
Object / Lexical environment
An object can be a conventional data object or a lexical environment object. Unlike other data types, object can have references (called properties) to other data types. Because of circular references, reference counting is not always enough to determine dead objects. Hence a chain list is formed from all existing objects, which can be used to find unreferenced objects during garbage collection. The gc-next pointer of each object shows the next allocated object in the chain list.
Lexical environments (link) are implemented as objects in JerryScript, since lexical environments contains key-value pairs (called bindings) like objects. This simplifies the implementation and reduces code size.
{: class="thumbnail center-block img-responsive" }
The objects are represented as following structure:
- Reference counter - number of hard (non-property) references
- Next object pointer for the garbage collector
- GC's visited flag
- type (function object, lexical environment, etc.)
Properties of objects
{: class="thumbnail center-block img-responsive" }
Objects have a linked lists that contains their properties. This list actually contains property pairs, in order to save memory described in the followings: A property is 7 bit long and its type field is 2 bit long which consumes 9 bit which does not fit into 1 byte but consumes 2 bytes. Hence, placing together two properties (14 bit) with the 2 bit long type field fits into 2 bytes.
If the number of property pairs reach a limit (currently this limit defined to 16), the first element of the property pair list is a hashmap (called property hashmap), which is used to find a property instead of finding it by linear search.
Collections
Collections are array-like data structures, which are optimized to save memory. Actually, a collection is a linked list whose elements are not single elements, but arrays which can contain multiple elements.
Internal properties
Internal properties are special properties that carry meta-information that cannot be accessed by the JavaScript code, but important for the engine itself. Some examples of internal properties are listed below:
- Class - class (type) of the object (ECMA-defined)
- Code - points where to find bytecode of the function
- native code - points where to find the code of a native function
- PrimitiveValue for Boolean - stores the boolean value of a Boolean object
- PrimitiveValue for Number - stores the numeric value of a Number object
LCache
LCache is a hashmap for finding a property specified by an object and by a property name. The object-name-property layout of the LCache presents multiple times in a row as it is shown in the figure below.
{: class="thumbnail center-block img-responsive"}
When a property access occurs, a hash value is extracted form the demanded property name and than this hash is used to index the LCache. After that, in the indexed row the specified object and property name will be searched.
It is important to note, that if the specified property is not found in the LCache, it does not mean that it does not exist. If the property is not found, it will be searched in the property-list of the object, and if it is found there, the property will be placed into the LCache.
Completion value
Many algorithms/routines described in ECMA return a value of "completion" type, that is triplet of the following form:
{: class="thumbnail center-block img-responsive" }
Jerry introduces two additional completion types:
- exit - produced by
exitvalopcode, indicates request to finish execution - meta - produced by meta instruction, used to catch meta opcodes in interpreter loop without explicit comparison on every iteration (for example: meta 'end_with')
Value management and ownership
Every value stored by engine is associated with virtual "ownership" (that is responsible for manage the value and free it when is is not needed, or pass ownership of the value to somewhere)
| Value type| "Alloc" op | "Free" op| | Number| ecma_alloc_number| ecma_dealloc_number| | String| ecma_copy_or_ref_ecma_string| ecma_deref_ecma_string| | Object|ecma_ref_object (for on_stack references) on_heap references are managed by GC|ecma_deref_object (for on_stack references) on_heap references are managed by GC| | Property|(ownership is strongly connected with corresponding object)|(ownership is strongly connected with corresponding object)| | Simple value|no memory management|no memory management| | ecma_value (including values contained in completion value)|ecma_copy_value| ecma_free_value|
Initially, value is allocated by its owner (i.e with ownership). Value, passed in argument is passed without ownership. Value, returned from function is returned with ownership. Rules for completion value are the same as rules for values, contained in them.
Opcode handler structure
Most opcode handlers consists of the following steps:
- Decode instruction (i.e. extract
idx-s) - Read input values from variables/registers
- Perform necessary type conversions
- Perform calls to runtime
- Save execution result to output variable
- Increment opcode position counter
- Return completion value.
Steps 2-5 can produce exceptions. In this case execution is continued after corresponding FINALIZE mark, and completion value is set to throw exception.
Exception handling
Operations that could produce exceptions should be performed in one of the following ways:
- wrapped into ECMA_TRY_CATCH block:
ECMA_TRY_CATCH (value_returned_from_op, op (... ),- `ret_value_of_the_whole_routine_handler)``
...ECMA_FINALIZE(value_returned_from_op);return ret_value;
ret_value = op(...);- manual handling (for special cases like interpretation of opfunc_try_block).