Notes on WST StructuredDocument
-------------------------------

Created:    2010/11/26
References: WST 3.1.x, Eclipse 3.5 Galileo

To manipulate XML documents in refactorings, we sometimes use the WST/SSE
"StructuredDocument" API. There isn't much documentation on this out there,
so this is a short explanation of how it works, based entirely on
_empirical_ evidence. As such, it must be taken with a grain of salt.

Examples of usage can be found in
  sdk/eclipse/plugins/com.android.ide.eclipse.adt/src/com/android/ide/eclipse/adt/internal/refactorings/


1- Get a document instance
--------------------------

To get a document from an existing IFile resource:

    // May throw IOException or CoreException.
    IModelManager modelMan = StructuredModelManager.getModelManager();
    IStructuredDocument sdoc = modelMan.createStructuredDocumentFor(file);

Note that IStructuredDocument and all the associated interfaces we'll use
below are located under org.eclipse.wst.sse.core.internal.provisional,
meaning they _might_ change later.

Also note that this parses the content of the file on disk, not the content
of an editor buffer with pending unsaved modifications.

There is a counterpart for non-existent resources:

    IModelManager.createNewStructuredDocumentFor(IFile)

However, our goal so far has been to _parse_ existing documents, find
the place that we wanted to modify and then generate a TextFileChange
for a refactoring operation. Consequently this document doesn't say
anything about using this model to modify content directly.


2- Structured Document overview
-------------------------------

The IStructuredDocument is organized in "regions", which are little pieces
of text.

The document contains a list of region collections, each one being
a list of regions. Each region has a type, as well as text.

Since we use this to parse XML, let's look at this XML example:

<?xml version="1.0" encoding="utf-8"?> \n
<resources> \n
    <color/>
    <string name="my_string">Some Value</string>  <!-- comment -->\n
</resources>


This will result in the following regions and sub-regions:
(all the constants below are located in DOMRegionContext)

XML_PI_OPEN
    XML_PI_OPEN:<?
    XML_TAG_NAME:xml
    XML_TAG_ATTRIBUTE_NAME:version
    XML_TAG_ATTRIBUTE_EQUALS:=
    XML_TAG_ATTRIBUTE_VALUE:"1.0"
    XML_TAG_ATTRIBUTE_NAME:encoding
    XML_TAG_ATTRIBUTE_EQUALS:=
    XML_TAG_ATTRIBUTE_VALUE:"utf-8"
    XML_PI_CLOSE:?>

XML_CONTENT
    XML_CONTENT:\n

XML_TAG_NAME
    XML_TAG_OPEN:<
    XML_TAG_NAME:resources
    XML_TAG_CLOSE:>

XML_CONTENT
    XML_CONTENT:\n + whitespace before color

XML_TAG_NAME
    XML_TAG_OPEN:<
    XML_TAG_NAME:color
    XML_EMPTY_TAG_CLOSE:/>

XML_CONTENT
    XML_CONTENT:\n + whitespace before string

XML_TAG_NAME
    XML_TAG_OPEN:<
    XML_TAG_NAME:string
    XML_TAG_ATTRIBUTE_NAME:name
    XML_TAG_ATTRIBUTE_EQUALS:=
    XML_TAG_ATTRIBUTE_VALUE:"my_string"
    XML_TAG_CLOSE:>

XML_CONTENT
    XML_CONTENT:Some Value

XML_TAG_NAME
    XML_END_TAG_OPEN:</
    XML_TAG_NAME:string
    XML_TAG_CLOSE:>

XML_CONTENT
    XML_CONTENT: (2 spaces before the comment)

XML_COMMENT_TEXT
    XML_COMMENT_OPEN:<!--
    XML_COMMENT_TEXT: comment
    XML_COMMENT_CLOSE:-->

XML_CONTENT
    XML_CONTENT: \n after comment

XML_TAG_NAME
    XML_END_TAG_OPEN:</
    XML_TAG_NAME:resources
    XML_TAG_CLOSE:>

XML_CONTENT
    XML_CONTENT:


3- Iterating through regions
----------------------------

To iterate through all regions, we need to process the list of top-level regions and then
iterate over inner regions:

    for (IStructuredDocumentRegion collection : sdoc.getStructuredDocumentRegions()) {
        // process the inner regions of this region collection
        for (int i = 0; i < collection.getNumberOfRegions(); i++) {
            ITextRegion region = collection.getRegions().get(i);
            String type = region.getType();
            // the text must be queried from the outer collection
            String text = collection.getText(region);
        }
    }

Each "region collection" basically matches one XML tag, with sub-regions for all the tokens
inside a tag.

Note that an XML_CONTENT region is the text between tags, often just whitespace; this is what
is known as a Text node in the W3C DOM.

Also note that each outer region has a type, and the inner regions reuse the same type names.
So for example an outer XML_TAG_NAME region collection is a proper XML tag, and it will contain
an XML_TAG_OPEN, an XML_TAG_CLOSE and also an inner XML_TAG_NAME that is the tag name itself.
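
Since the outer type is the same XML_TAG_NAME for start tags, end tags and empty tags, one way
to tell them apart is to look at the first and last inner region types, as seen in the dump of
section 2. The sketch below is plain Java with no WST dependency: the class TagKindSketch and
its Kind enum are hypothetical, and the type names are written as string literals where real
code would presumably compare against the DOMRegionContext constants.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of WST: classifies a tag by its inner region types. */
final class TagKindSketch {
    enum Kind { START_TAG, END_TAG, EMPTY_TAG, UNKNOWN }

    /** innerTypes is the list of inner region types of one outer XML_TAG_NAME collection. */
    static Kind classify(List<String> innerTypes) {
        if (innerTypes.isEmpty()) {
            return Kind.UNKNOWN;
        }
        String first = innerTypes.get(0);
        String last = innerTypes.get(innerTypes.size() - 1);
        if ("XML_END_TAG_OPEN".equals(first)) {
            return Kind.END_TAG;                      // </string>
        }
        if ("XML_TAG_OPEN".equals(first)) {
            if ("XML_EMPTY_TAG_CLOSE".equals(last)) {
                return Kind.EMPTY_TAG;                // <color/>
            }
            if ("XML_TAG_CLOSE".equals(last)) {
                return Kind.START_TAG;                // <resources>
            }
        }
        return Kind.UNKNOWN;                          // e.g. a tag still being typed
    }

    public static void main(String[] args) {
        // Inner types of the <color/> collection from the dump in section 2.
        System.out.println(classify(Arrays.asList(
                "XML_TAG_OPEN", "XML_TAG_NAME", "XML_EMPTY_TAG_CLOSE")));
    }
}
```

With the real API, the list of types would be collected from getRegions() in the loop shown
at the start of this section.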

Surprisingly, the inner regions do not have many access methods we can use on them, except their
type and start/length/end. Length and end each come in two variants:
- getLength() and getEnd() take any whitespace into account.
- getTextLength() and getTextEnd() exclude some typical trailing whitespace.

Regarding the trailing whitespace, empirical evidence shows that in the XML case here, the
only case where it matters is in a tag such as <string name="my_string">: for the
XML_TAG_NAME region, getLength is 7 (string + space) and getTextLength is 6 (string, no space).
Spacing between XML elements is its own collapsed region.
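
To make the 7 versus 6 arithmetic concrete, here is a toy tokenizer for a single start tag.
It is NOT the WST tokenizer: the class TrailingWhitespaceSketch and its Region type are
hypothetical stand-ins, and the only behavior it imitates is the one observed above, namely
that whitespace following a token is counted in that token's length but not in its text length.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not WST: splits one start tag into regions with trailing whitespace attached. */
final class TrailingWhitespaceSketch {
    static final class Region {
        final String type;
        final int start;       // offset relative to the start of the tag
        final int textLength;  // the token itself
        final int length;      // the token plus its trailing whitespace

        Region(String type, int start, int textLength, int length) {
            this.type = type;
            this.start = start;
            this.textLength = textLength;
            this.length = length;
        }
    }

    static List<Region> tokenizeStartTag(String tag) {
        List<Region> out = new ArrayList<>();
        boolean sawTagName = false;
        int i = 0;
        while (i < tag.length()) {
            int start = i;
            char c = tag.charAt(i);
            String type;
            if (c == '<') {
                type = "XML_TAG_OPEN"; i++;
            } else if (c == '>') {
                type = "XML_TAG_CLOSE"; i++;
            } else if (tag.startsWith("/>", i)) {
                type = "XML_EMPTY_TAG_CLOSE"; i += 2;
            } else if (c == '=') {
                type = "XML_TAG_ATTRIBUTE_EQUALS"; i++;
            } else if (c == '"') {
                int close = tag.indexOf('"', i + 1);
                i = (close < 0) ? tag.length() : close + 1;
                type = "XML_TAG_ATTRIBUTE_VALUE";
            } else {
                // A name runs until the next delimiter or whitespace.
                while (i < tag.length() && "<>=/\" \t\r\n".indexOf(tag.charAt(i)) < 0) {
                    i++;
                }
                if (i == start) {
                    type = "UNDEFINED"; i++;   // fallback of this sketch, always makes progress
                } else {
                    type = sawTagName ? "XML_TAG_ATTRIBUTE_NAME" : "XML_TAG_NAME";
                    sawTagName = true;
                }
            }
            int textEnd = i;
            // Trailing whitespace extends the region length, not its text length.
            while (i < tag.length() && Character.isWhitespace(tag.charAt(i))) {
                i++;
            }
            out.add(new Region(type, start, textEnd - start, i - start));
        }
        return out;
    }
}
```

For <string name="my_string"> this yields six regions, and the second one (XML_TAG_NAME) has
start 1, text length 6 and length 7, matching the values observed above.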

If you want the text of the inner region, you actually need to query it from the outer region.
The outer IStructuredDocumentRegion (the region collection) contains lots more useful access
methods, some of which return details on the inner regions:
- getText     : without the whitespace.
- getFullText : with the whitespace.
- getStart / getLength / getEnd : type-dependent offset, including whitespace.
- getStart / getTextLength / getTextEnd : type-dependent offset, excluding "irrelevant" whitespace.
- getStartOffset / getEndOffset / getTextEndOffset : relative to document.

Empirical evidence shows that there is no discernible difference between the getStart/getEnd
values and those returned by getStartOffset/getEndOffset. Follow the javadoc anyway when
choosing between them.

All offsets start at zero.

Given a region collection, you can also browse regions either using the getRegions() list, or
using getFirst/getLastRegion, or using getRegionAtCharacterOffset(). Iterating the region
list seems the most useful scenario. There's no actual iterator provided for inner regions.
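
Once the right inner region is found, the offsets can be turned into a text replacement, which
is what we eventually hand to a TextFileChange. The sketch below only does the arithmetic on a
plain String. It assumes that the inner region start is relative to the start of its region
collection (an assumption of this sketch, to be checked against the javadoc), and the class
AttributeEditSketch and its methods are hypothetical, not WST or LTK API.

```java
/** Hypothetical helper: turns region offsets into a replacement of an attribute value. */
final class AttributeEditSketch {
    /** Document offset of the value content, skipping the opening quote. */
    static int valueContentOffset(int collectionStartOffset, int regionStart) {
        return collectionStartOffset + regionStart + 1;
    }

    /** Length of the value content, without the two surrounding quotes. */
    static int valueContentLength(int regionTextLength) {
        return regionTextLength - 2;
    }

    /** Applies a (offset, length, text) replacement, the same triple a text edit would carry. */
    static String replace(String doc, int offset, int length, String newText) {
        return doc.substring(0, offset) + newText + doc.substring(offset + length);
    }

    public static void main(String[] args) {
        String doc = "<resources>\n    <string name=\"my_string\">Some Value</string>\n</resources>";
        int collectionStart = doc.indexOf("<string");  // start of the region collection
        int regionStart = 13;                          // XML_TAG_ATTRIBUTE_VALUE, relative to the tag
        int regionTextLength = 11;                     // "my_string" including both quotes
        int offset = valueContentOffset(collectionStart, regionStart);
        int length = valueContentLength(regionTextLength);
        System.out.println(replace(doc, offset, length, "new_name"));
    }
}
```

In a refactoring, the same offset, length and text would presumably go into a ReplaceEdit added
to the TextFileChange instead of being applied to a String.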

There are a few other methods available in the region classes. This was not an exhaustive list.


----