MG Encoding - Technical document v4.4

(specifications - rationale - parsing algorithm)

Instructions on how to encode and decode MediaGlyphs sentences are provided, together with specifications of file locations for displaying and linking images and explanation pages. An algorithm and sample Perl code are added at the bottom of the document.

Purpose of the encoding

Storage and transmission of codes representing glyphs, glyph-combinations and phonetic names.

Total Alphabet

[0-9] [a-z] [A-Z] [] {} @ ^ + = _

Alphabet explained

the 10 digits "0-9"
the 26 lowercase letters "a-z"
the 26 uppercase letters "A-Z"
LEFT SQUARE BRACKET "["
RIGHT SQUARE BRACKET "]"
LEFT CURLY BRACKET "{"
RIGHT CURLY BRACKET "}"
COMMERCIAL AT "@"
CIRCUMFLEX ACCENT "^"
PLUS SIGN "+"
EQUALS SIGN "="
LOW LINE "_"

Glyph subset alphabet

[0-9] [a-z] [A-Z] {} @

Special symbols subset alphabet

[] ^ + =

Punctuation symbols that can appear in MG sentences

, . ; : ( ) ' " -

Additional symbols used

The " " (NO BREAK SPACE) can be used for human legibility but is squashed and ignored when parsing.

The "_" (LOW LINE) is used for compatibility with filesystems that do not differentiate between uppercase and lowercase letters. It is used in this way: all uppercase letters are followed by "_" to differentiate the filenames.
Hence "aa.png" is different from "A_A_.png" which is different from "aA_.png" and so on.
For transmission and storage of codes, the "_" is not needed, but for filenames (html pages, png files...) it is necessary.
Hence all MG encoded strings will be "escaped" by adding "_" and "unescaped" by removing it, as needed.

When parsing MG codes, " " and "_" are eliminated.
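The escaping rule above can be sketched in Python as follows. The function names mg_escape and mg_unescape are illustrative and not part of the specification; the sketch also assumes that both the plain space and the no-break space are to be dropped when unescaping.

```python
def mg_escape(code: str) -> str:
    """Append "_" after every uppercase ASCII letter (filename-safe form)."""
    return "".join(ch + "_" if "A" <= ch <= "Z" else ch for ch in code)


def mg_unescape(code: str) -> str:
    """Drop "_" and spaces, giving the form used for storage and parsing."""
    # Assumption: plain spaces and no-break spaces (U+00A0) are both ignored.
    return code.replace("_", "").replace(" ", "").replace("\u00a0", "")


print(mg_escape("aA"))       # aA_
print(mg_unescape("A_A_"))   # AA
```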

Rationale

Two symbols from the "glyph subset alphabet" are required to specify a glyph.
E.g.: 7O qc w6 eN @f {j l{
all specify single glyphs.
NOTE: "@" will be used only as first symbol specifying a glyph, not appearing in second position.

There are hence 4160 (64*64 + 1*64) possible combinations of the "glyph subset alphabet": the 64 symbols other than "@" can fill either position, and "@" adds 64 more codes as a first symbol. This allows a maximum of 4160 single glyphs. We don't expect to reach this maximum number, and instead we plan to keep the number of single glyphs around 2000.
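The count can be checked by enumeration. This is only a small Python sketch; the variable names are illustrative and not part of the specification.

```python
import string

# Symbols allowed in either position of a glyph code (64 in total).
BOTH_POSITIONS = string.digits + string.ascii_lowercase + string.ascii_uppercase + "{}"
# "@" is allowed only as the first symbol.
FIRST_ONLY = "@"

codes = [a + b for a in BOTH_POSITIONS + FIRST_ONLY for b in BOTH_POSITIONS]
print(len(BOTH_POSITIONS), len(codes))  # 64 4160
```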

The first symbol (of the two that specify a glyph) indicates the category that the glyph belongs to.
Hence "ja" and "ji" are glyphs in the same category ("numerals").

The symbols from the "special symbols subset alphabet" all have a meaning that affects parsing, because they are involved in specifying composites, glyphs shifted into another category, phrases...

The trivial case: An MG string containing only symbols from the "glyph subset alphabet" would be easily parsed by splitting it into consecutive substrings of length 2, and these would be the codes specifying the glyphs and directly pointing to the image files (.png).

E.g.: "@baH@kQC@bbt" (MG)
(equivalent to "@b aH @k QC @b bt" and to "@baH_@kQ_C_@bbt")
would encode 6 consecutive glyphs (5 unique) whose images are located in the "l/" directory, with filenames: "@b.png" "aH_.png" "@k.png" "Q_C_.png" "bt.png"
('l' stands for 'library', short for 'image library').
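The trivial case can be sketched in Python as follows. The helper names split_glyphs and image_path are hypothetical; the path layout ("l/" plus the escaped code plus ".png") follows the example above.

```python
def split_glyphs(mg: str) -> list:
    """Split a string made only of glyph-alphabet symbols into 2-symbol codes."""
    clean = mg.replace(" ", "").replace("_", "")
    if len(clean) % 2:
        raise ValueError("expected an even number of glyph symbols")
    return [clean[i:i + 2] for i in range(0, len(clean), 2)]


def image_path(code: str) -> str:
    """Build the image path: "l/" + code with "_" after uppercase + ".png"."""
    escaped = "".join(c + "_" if "A" <= c <= "Z" else c for c in code)
    return "l/" + escaped + ".png"


codes = split_glyphs("@baH@kQC@bbt")
print(codes)                    # ['@b', 'aH', '@k', 'QC', '@b', 'bt']
print(image_path("QC"))         # l/Q_C_.png
```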

Things become slightly more complicated with the special symbols.

Explanation of special symbols and their syntax

[ ... ]

The square brackets specify a "phrase" inside the main sentence.
The nested "phrase" needs to be isolated and parsed in the same way as the main sentence. Phrases are commonly used arrangements of glyphs that have a translation associated (and hence an explanation page).
Form: [...]
The standard html display for phrases encloses the glyphs between square brackets ([...]) but other possibilities of displaying can be devised.
E.g.: "[@b@f]" (@b@f means "my, of me", which has its own translation and database entry in the phrases dictionary).
Note: a phrase can contain all other nested structures, i.e. other special symbols, including more "[]"
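Because phrases can nest, the closing bracket has to be found by counting depth rather than by searching for the first "]". A minimal Python sketch, with a hypothetical function name and assuming that no square brackets occur inside phonetic names:

```python
def matching_bracket(s: str, start: int) -> int:
    """Return the index of the "]" that closes the "[" at s[start]."""
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "[":
            depth += 1
        elif s[i] == "]":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced square brackets")


s = "[@b[@f]aH]bt"
end = matching_bracket(s, 0)
print(s[1:end])   # @b[@f]aH
```

The extracted substring is then parsed in the same way as the main sentence.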

+^ ... +^

The codes in between two occurrences of the "+^" combination (PLUS SIGN followed by CIRCUMFLEX ACCENT) specify a number of glyphs forming a "multicomposite" (a composite of more than two glyphs).
Hence ONLY an even number of symbols from the "glyph subset alphabet" should appear between two "+^".
The part inside the special symbols can hence be trivially decomposed in substrings of length 2, each linking to an image.
Form: +^ABCDEF+^, where AB CD and EF are the glyph codes.
The standard html display for composites encloses the glyphs between curly brackets ({...}) but other possibilities of displaying can be devised.
The explanation pages are located in the "x/oi/" subdirectory.
E.g.: "+^xI1eq3+^" (it means "cosmogony")
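A multicomposite can be extracted as in the following Python sketch. The function name and the (codes, next position) return convention are illustrative choices, not part of the specification.

```python
def parse_multicomposite(s: str, pos: int):
    """Parse "+^...+^" starting at s[pos]; return (glyph codes, next position)."""
    if not s.startswith("+^", pos):
        raise ValueError("not a multicomposite")
    end = s.find("+^", pos + 2)
    if end == -1:
        raise ValueError("unterminated multicomposite")
    body = s[pos + 2:end]
    if len(body) % 2:
        raise ValueError("odd number of glyph symbols")
    return [body[i:i + 2] for i in range(0, len(body), 2)], end + 2


print(parse_multicomposite("+^xI1eq3+^", 0))  # (['xI', '1e', 'q3'], 10)
```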

++...

A double plus sign is followed by three glyph symbols.
They indicate a shift in category, a "reclassification" of a glyph into a new category.
Form: ++XAB, where X is the category code and AB the glyph code.
The category code (also belonging to the "glyph subset alphabet") should be parsed so that the category image is shown. The glyph code is a normal substring of length two.
Category images are located in the "r/" directory ('r' stands for 'radicals').
The standard html display shows them as images having half the size of the other glyphs.
E.g.: "++x6j" (meaning "serene" for weather, with "x" being the "natural world" category)
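A category shift can be split into its two parts as in this Python sketch (hypothetical function name; the return convention is an illustrative choice):

```python
def parse_category_shift(s: str, pos: int):
    """Parse "++XAB" at s[pos]; return (category code, glyph code, next position)."""
    if not s.startswith("++", pos) or len(s) < pos + 5:
        raise ValueError("malformed category shift")
    category = s[pos + 2]          # X: one symbol naming the new category
    glyph = s[pos + 3:pos + 5]     # AB: an ordinary two-symbol glyph code
    return category, glyph, pos + 5


print(parse_category_shift("++x6j", 0))  # ('x', '6j', 5)
```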

^....

A circumflex accent signals the presence of a composite of the first kind.
This composite has two glyphs coming from two different glyph categories.
It can be parsed very easily: after the "^", 4 symbols from the "glyph subset alphabet" have to be taken and these specify the two glyphs that form the composite.
Form: ^ABCD, where AB and CD are the glyph codes.
The standard html display for composites encloses the glyphs between curly brackets ({...}) but other possibilities of displaying can be devised.
E.g.: ^2kbt (meaning "speak on the phone, call")
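A composite of the first kind can be decoded as in this Python sketch (the function name is hypothetical):

```python
def parse_composite_first_kind(s: str, pos: int):
    """Parse "^ABCD" at s[pos]; return ([AB, CD], next position)."""
    if s[pos:pos + 1] != "^" or len(s) < pos + 5:
        raise ValueError("malformed composite of the first kind")
    body = s[pos + 1:pos + 5]
    return [body[:2], body[2:]], pos + 5


print(parse_composite_first_kind("^2kbt", 0))  # (['2k', 'bt'], 5)
```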

+...

A plus sign specifies the second kind of composite glyph.
This composite also has two glyphs, but they come from the same glyph category.
One glyph code symbol is hence redundant and does not appear in the encoded string.
To parse: take the three symbols after "+" and combine them so as to produce the two substrings formed by "the first and the second symbol" and "the first and the third symbol", as shown below.
Form: +ABC, where AB and AC are the glyph codes.
The standard html display for composites encloses the glyphs between curly brackets ({...}) but other possibilities of displaying can be devised.
E.g.: "+7nq" (meaning "mediaglyph")
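The expansion of the shared category symbol can be sketched in Python like this (the function name is hypothetical):

```python
def parse_composite_second_kind(s: str, pos: int):
    """Parse "+ABC" at s[pos]; return ([AB, AC], next position)."""
    if s[pos:pos + 1] != "+" or len(s) < pos + 4:
        raise ValueError("malformed composite of the second kind")
    a, b, c = s[pos + 1:pos + 4]
    # The category symbol A is shared by both glyphs.
    return [a + b, a + c], pos + 4


print(parse_composite_second_kind("+7nq", 0))  # (['7n', '7q'], 4)
```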

== ... ==

The double equals sign combination encloses phonetic names. Hence everything between == is to be read as ASCII letters, not as glyph codes.
UTF-7 (an ASCII-only encoding of Unicode) is used for phonetic names.
UTF-7 encodes Unicode special letters using: [a-z][A-Z][0-9]+-/
Some slight modifications (escaping) of UTF-7 are needed for filesystem compatibility, in order to use these strings as filenames.
- uppercase followed by "_" (like the glyph codes)
- "/" replaced with "{"
- "'" replaced with "}"
In addition, phonetic names are usually prefixed by a language code.
Form: ==LNG:string==, where LNG is the language code and what follows the ":" is the UTF-7 escaped form of the original name.
The standard html display works as follows: if an image has been created for the phonetic name, display that png image (it usually holds the name in its original form and language, plus its pronunciation in the IPA alphabet). Otherwise, convert the name to UTF-8 and let the browser display it.
Location for images: the "l/uu/LNG/png" directory, if there is a language code, otherwise the "l/uu/png" directory. No "=" appear in the filenames.
E.g. "==eng:James==" (English language, name of the city of James)
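The handling of a phonetic name can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that the language code never contains ":" and that the filename escaping is applied to the plain UTF-7 string only when a filename is built; the three substitutions are taken literally from the list above. Python's built-in "utf-7" codec is used for the conversion to Unicode.

```python
def split_phonetic(token: str):
    """Split "==LNG:string==" into (language code or None, UTF-7 string)."""
    body = token[2:-2]
    lang, sep, name = body.partition(":")
    return (lang, name) if sep else (None, body)


def escape_phonetic(name: str) -> str:
    """Apply the three filesystem substitutions to a UTF-7 string."""
    out = []
    for ch in name:
        if "A" <= ch <= "Z":
            out.append(ch + "_")   # uppercase followed by "_"
        elif ch == "/":
            out.append("{")
        elif ch == "'":
            out.append("}")
        else:
            out.append(ch)
    return "".join(out)


lang, name = split_phonetic("==eng:James==")
print(lang, name, escape_phonetic(name))      # eng James J_ames
print(name.encode("ascii").decode("utf-7"))   # James
```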

= ... =

What is between single equal signs is a name created with glyphs (or combinations of glyphs).
Hence the "=...=" case is equal to the "[...]" case, with the codes inside being treated as a subphrase and parsed accordingly.
Form: =...=
The standard html display for glyphnames encloses the glyphs between equal signs (=...=) but other possibilities of displaying can be devised.
E.g.: "=lp+eqp=" (meaning "The Little Prince", character name and book title: this is a glyph name containing one single glyph and one composite of the second kind)

More examples: sample sentences

The HTML and the encoded string of the following sample sentences can be compared:

Test and compare

It is possible to compare the parsing output of a new program with that of the existing display system.

Parsing algorithm

Remove " " and "_" from the encoded string

For the whole length of the encoded string do:

if you encounter a punctuation sign: extract and format it
if you encounter "[": find the correct matching "]", extract the substring delimited by them (which could contain more []) and parse it again
if you encounter "+^": find the next "+^" and extract the substring delimited by them, divide it in substrings of length 2, those are the glyphs
if you encounter "++": extract 3 characters after, the first of them is a category code, the other two specify one glyph
if you encounter "^": extract 4 characters after it, divide in two substrings to obtain the codes that specify two glyphs
if you encounter "+": extract 3 characters after it, take the first and the second, then the first and the third: those are the codes specifying two glyphs
if you encounter "==": find the next "==" and extract the substring delimited by them, parse it as a phonetic name (check if png exists, otherwise treat as utf8 letters)
if you encounter "=": find the next "=" and extract the substring delimited by them, and parse it again
else: take the next two characters, they specify a glyph
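
The steps above can be put together in a short recursive parser. The following Python sketch is one possible reading of the algorithm, not the reference implementation: the function names and the tagged-tuple output format are assumptions made for illustration, and phonetic names are returned as raw strings without checking for a png.

```python
import string

GLYPH = set(string.digits + string.ascii_letters + "{}@")
PUNCT = set(",.;:()'\"-")


def take(s, i, n):
    """Return the n glyph-alphabet symbols at s[i:i+n] or raise ValueError."""
    chunk = s[i:i + n]
    if len(chunk) != n or not set(chunk) <= GLYPH:
        raise ValueError(f"expected {n} glyph symbols at offset {i}")
    return chunk


def find_end(s, mark, start):
    """Return the index of the next `mark` at or after `start`."""
    j = s.find(mark, start)
    if j == -1:
        raise ValueError(f"missing closing {mark!r}")
    return j


def close_bracket(s, i):
    """Return the index of the "]" matching the "[" at s[i]."""
    depth = 0
    for j in range(i, len(s)):
        depth += (s[j] == "[") - (s[j] == "]")
        if depth == 0:
            return j
    raise ValueError("unbalanced square brackets")


def parse(s, top=True):
    """Parse an MG string into a list of tagged tuples (illustrative format)."""
    if top:  # first step: spaces and "_" carry no meaning when parsing
        s = s.replace(" ", "").replace("\u00a0", "").replace("_", "")
    out, i = [], 0
    while i < len(s):
        if s[i] in PUNCT:
            out.append(("punct", s[i]))
            i += 1
        elif s[i] == "[":
            j = close_bracket(s, i)
            out.append(("phrase", parse(s[i + 1:j], False)))
            i = j + 1
        elif s.startswith("+^", i):
            j = find_end(s, "+^", i + 2)
            body = take(s, i + 2, j - i - 2)
            if len(body) % 2:
                raise ValueError("odd number of symbols in multicomposite")
            codes = [body[k:k + 2] for k in range(0, len(body), 2)]
            out.append(("multicomposite", codes))
            i = j + 2
        elif s.startswith("++", i):
            out.append(("shift", take(s, i + 2, 1), take(s, i + 3, 2)))
            i += 5
        elif s[i] == "^":
            body = take(s, i + 1, 4)
            out.append(("composite", [body[:2], body[2:]]))
            i += 5
        elif s[i] == "+":
            a, b, c = take(s, i + 1, 3)
            out.append(("composite", [a + b, a + c]))
            i += 4
        elif s.startswith("==", i):
            j = find_end(s, "==", i + 2)
            out.append(("phonetic", s[i + 2:j]))
            i = j + 2
        elif s[i] == "=":
            j = find_end(s, "=", i + 1)
            out.append(("glyphname", parse(s[i + 1:j], False)))
            i = j + 1
        else:
            out.append(("glyph", take(s, i, 2)))
            i += 2
    return out


print(parse("=lp+eqp="))
# [('glyphname', [('glyph', 'lp'), ('composite', ['eq', 'ep'])])]
```

The two-character markers ("+^", "++", "==") are tested before their one-character counterparts, in the same order as the list above, so that "+^" is never mistaken for a composite of the second kind.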