HaXML TODO
==========

Refactoring
-----------

Lex.hs:
- Bad smell "Duplicated Code".  Look for sequences like:
    | isXmlSpaceChar s = accumulateUntil (c:cs) tok (s:acc) pos (white s p) ss k
    | isXmlChar s      = accumulateUntil (c:cs) tok (s:acc) pos (addcol 1 p) ss k
    | otherwise        = lexerror ("(accumulateUntil) illegal character") p
  and replace with a single function call (taking args: function to process
  char, function to process continuation, rest of string ... ?)
  (Could be tricky getting the various continuations right.  Maybe not.)
- eliminate unused reLex functions.
- function "prefixes" can be deleted:  use List.isPrefixOf

Combinators.hs:
- should consistently use QNames for element and attribute names.
- tagWith uses (String->Bool) predicate;  should really be QName->Bool
  (or match only attributes without namespace in QName?)

Validate.hs:
- Factor out content modifier handling in Validate.hs (function checkCP)
  (But only when validation regression test is available.)

General
- Investigate storing [internal?] GEs in tokenized form, to avoid re-lexing.
  (But only when test suite is more comprehensive.)
- Review use of space handling in CString content values.  I'd like to get
  rid of the space-preservation flag, and always preserve whitespace
  internally in element content.
  [later] this was found to be causing text-merging problems, so I now
  always set it True in the XML parser.  This strengthens the case for
  getting rid of the flag.
- look into attribute normalization (mentioned in XML spec, somewhere).

SubstituteGEFilter.hs and Namespace.hs:
- duplicated function "atip".

XmlBase.hs and XmlLang.hs:
- factor out common code
- also with namespace handling

Create single definition of xml namespace URI and namespace.  (Types.hs?)


Test cases
----------

Extend use of W3C conformance suite, especially for external entities and
namespaces.


Functional changes
------------------

Parse.hs:
- when storing base URI, apply logic from HTTP module to make this into a
  fully qualified file: URI.

Lex.hs:
- Preserve all spaces in XML tree text content.
  This is more subtle than I had thought.  Currently the lexer drops
  whitespace-only sequences between elements and entity references.
  This probably works OK, most of the time, but logically the whitespace
  should be kept and possibly dropped when GE substitution has been
  performed -- specifically only dropping whitespace-only sequences between
  elements after entities have been replaced.  Function
  'condenseContentText' might be a candidate for doing this.
  (Strictly, it appears that whitespace cannot be discarded until
  validation is performed, because even whitespace-only sequences where
  #PCDATA is expected should be preserved.  Cf. 2.4)
  [Later]
  Some of the RDF tests indicate that ALL whitespace should be kept.
  It appears that the correct approach is to include all the whitespace
  (merging any adjacent text), and then to ignore certain whitespace-only
  values when validating.  An RDF/XML application must be prepared to skip
  whitespace between elements, etc.
  The code to change is xmlContent in Parse.hs, but this will also require
  re-working about 20-30 test cases.

Create xml:space processing filter (cf. xml spec section 2.10)?

ExtEntity.cpphs:
- Figure out how to handle Windows filenames with '\' path separators.
  Map any '\' in base/name to '/'?
  E.g. file2 relative to path\file1 should yield:  path/file2

(Longer term:)
Replace the CErr option in the XML content model with a monadic structure
for CFilter and friends, assuming that this can be shown to obey the monad
laws.


Done
----

Refactoring:
/ add CVS tags; add to my CVS
/ move external entity access to separate module
/ copy all CPPed modules to .cpphs
/ new token class to decouple parser combinators from token details
/ rework monad structure in HMJ combinator module to support better
  diagnostics
/ extend type definitions to support namespaces
/ Change type definitions to allow non-expanded parameter entity in DTD
  (cf. XML spec 1.0, 2ed, section 2.8, production [28].)
* Remove 'extpe'?
  (NO, may be used for parsing external PEs.)
* Move PI handling to get external entity module, so that character
  handling can be separated.
  (Not really:  could use sniffing, meanwhile the Unicode module from HXML
  Toolbox + keeping all the parsing in one module seems like an easier way
  to go.)
/ Route all entity access (including the initial document) via the
  ExtEntity module (which can handle character mapping).
  Two interfaces: (a) read file/resource, and (b) convert file/resource.
  getEntity/mapEntity?
/ Add parser context to signal whether or not PE expansion is being
  performed.  (if not, PErefs in entity content are illegal)
/ Make provision for case insensitive keyword matching (e.g. for ).
/ Make provision for base URI handling:  add document name to parser
  state, and add base URI (document name) to external entity definitions,
  to be used when the definitions are dereferenced.
/ Fixed up some knock-on compilation errors in Combinators, Pretty caused
  by the changes to support namespaces.  (The filter combinators aren't
  covered by the regression tests;  I'm not sure what effect this might
  have.)
* Make provision for namespaces in the parse tree
* Make provision for language tagging in the parse tree
* Make provision for base URIs (xml:base) in the parse tree
/ The last three items have been achieved by adding a type parameter to
  the content model for a user-supplied value that is stored with each
  Element in the parse tree.  The parser simply returns a tree using ()
  for such values.

UTF-8 handling of out-of-range characters:
/ decide how to handle diagnostics back to program
/ Fixed this by returning a Null Unicode character when invalid UTF-8 is
  encountered.  This is invalid in XML, and causes an invalid-character
  error to be raised by the lexer.

Rework parameter entity handling:
/ rename 'peRef' production as 'peRefer'
/ define new null 'peRef' function as 'id'
/ add 'peRefer' productions where permitted by XML:
    PEReference: [28a] declSep, [9] EntityValue
    EntityValue: [73] EntityDef, [74] PEDef
    (peReference in EntityValue only in external subset)
/ TEST
/ similarly, redefine 'blank' as 'id'
/ replace calls to peRefer by separate use of 'pereference' production
  leaving pereferences unsubstituted in the parse tree
/ add context indicating whether or not internal or external subset is
  being processed
/ leave PE unexpanded in DTD declarations and entity values
/ TEST (some output may change)
  testXmlFormat21 output changes because no entity substitution
/ create filter for internal parameter entity substitution
/ create new tests for external entity processing
/ create new filter to do external entity processing.
/ TEST

ExtEntity.cpphs:
/ support URIs, not just filenames
/ support filename relative to supplied base
* modify and extend the code to deal with character encoding
  UTF-8, UTF-16 based on byte order marks.  xml encoding decl ignored.
/ isolate the "replacement text" as specified by the XML specification.
  (actually, done as part of external entity substitution in SubstitutePE)
/ support http: retrieval via supplied URI.
/ Coded, test case needed.
* provide safe functions operating in the IO monad
  I think this may be not possible given the code organization.

Lex.hs:
/ Don't export posInNewCxt.  Instead, provide new lexing function for PE
  substitution, as needed.
  Rework SubstitutePE substitute function so that at the appropriate time
  the replacement string is re-tokenized so that char entities can be
  recognized as markup.
  There is probably a better strategy than that currently used.
  Finalize details when appropriate test cases are encountered.

TestXml.hs:
/ Test character reference in attribute value
/ Test substitution of external parsed general entity

Create general entity substitution filter (including external GEs).
/ add examples from XML spec appendix D as test cases.
/ add test case with a GE reference in an attribute.

ExtEntity.hs:
/ return Either String String, rather than just a string, so that I/O and
  other errors can be handled more cleanly.

Create XML-validation test suite, and test XML validation functions.

Create namespace processing filter

In SubstituteGEFilter.hs, merge resulting adjacent text elements so that
multiple CStrings are never consecutive.  (This has strange effects on
pretty printing, among other things.  I think it's also closer to what the
XML infoset specification describes.  This will require regeneration of
some test data.)
* This has been done, also for attributes so that xmlns attributes are
  recognized properly.  Surprisingly, the formatted output used for the
  test cases is not affected.  I think this may be because of space
  handling:  successive text content is merged only if the same space
  handling is applied in each case.

Types.hs:
/ Consider the future of the EVPERef case in entity values:
  should it be scrapped?  YES
* EntityVal simplifies to String (GERefs not parsed and PERefs already
  substituted)?
  NO: entity value may still contain GERef

Parse.hs:
/ Factor out common code introduced for attribute value and entity value
  parsing.
* Store entity definitions as simple text (with char refs replaced).
  (Still stores GE references)
/ Eliminate flattenEV function.
  (Note: SubstitutePE.hs has flattenEvToks.)
  Currently used by GE substitution:  moved to GE module.
/ Remove all references to PE handling from parser
  (Note:  the parser now calls a separate PE processing module.
  There is a theoretical option for a non-validating parser to parse XML
  source with PEs not processed, but to do this would require
  reinstatement of some parser productions.
  The data type definitions and pretty printer retain this theoretical
  possibility.)
/ Remove 'blank' function
/ Regroup functions in parser module

XParserUtils.hs, Parse.cpphs, SubstitutePE.hs:
/ merge common code from Parse.cpphs and SubstitutePE.hs, move to
  XParserUtils

Include base URI in XML content (needed for Infoset and xml:base
processing)
* The filename/base URI is included as an additional field in the Prolog
  value.  Originally, I placed it in the Document value, but then decided
  this would probably break too many existing HaXml applications.
  Judging that relatively few applications actually examine the Prolog,
  I'm hoping that changing this structure is not too disruptive.

Parse.cpphs:
/ reorganize element/content parsing to improve diagnostics
  (combinator "many" is too blunt an instrument here)
  (define new combinator manythen, use for element content)
* manyThen doesn't help.  Instead, use '+++ fatalerror "msg"' when it is
  known that no alternative parse is possible.  This prevents backtracking
  to the start of the input data.

Detect recursive substitution of entities -- nested substitution is
handled when the outermost entity is substituted, so recursion would
result in blowing the stack.

Create xml:base processing filter.

Create xml:lang processing filter.
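
Note on the recursive-substitution item above: one way to catch the
recursion before it blows the stack is to carry the list of entity names
currently being expanded and fail as soon as a name repeats.  The sketch
below is only an illustration of that idea, not the HaXml code: Piece,
EntityTable and expandEntity are hypothetical names, and the entity table
is a simplified stand-in for the real entity definitions.

```haskell
import qualified Data.Map as Map

-- Hypothetical, simplified view of an entity's replacement text:
-- literal text interleaved with references to other general entities.
data Piece = Text String | Ref String
  deriving (Eq, Show)

-- Hypothetical entity table: entity name -> replacement pieces.
type EntityTable = Map.Map String [Piece]

-- Expand a named entity, tracking the names currently being expanded.
-- Returns Left with a diagnostic if a name is undefined or recursive.
expandEntity :: EntityTable -> String -> Either String String
expandEntity table = go []
  where
    -- 'stack' holds the entities being expanded, innermost first.
    go stack name
      | name `elem` stack =
          Left ("recursive entity reference: "
                ++ concatMap (++ " -> ") (reverse stack) ++ name)
      | otherwise =
          case Map.lookup name table of
            Nothing     -> Left ("undefined entity: " ++ name)
            Just pieces -> concat <$> mapM (piece (name : stack)) pieces

    piece _     (Text s) = Right s
    piece stack (Ref n)  = go stack n
```

Returning Either String String matches the error-reporting style already
adopted for ExtEntity.hs; whether the check belongs in the GE substitution
filter or in a separate pass is left open.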