// region: lmake_md_to_doc_comments include README.md A //!
//! # reader for microXml
//!
//! **reader for microXml - the simplified subset of xml**\
//! ***[repo](https://github.com/bestia-dev/reader_for_microxml); version: 2.0.2 date: 2021-01-13 authors: bestia.dev***
//!
//! [![crates.io](https://img.shields.io/crates/v/reader_for_microxml.svg)](https://crates.io/crates/reader_for_microxml) [![Documentation](https://docs.rs/reader_for_microxml/badge.svg)](https://docs.rs/reader_for_microxml/) [![crev reviews](https://web.crev.dev/rust-reviews/badge/crev_count/reader_for_microxml.svg)](https://web.crev.dev/rust-reviews/crate/reader_for_microxml/) [![RustActions](https://github.com/bestia-dev/reader_for_microxml/workflows/rust/badge.svg)](https://github.com/bestia-dev/reader_for_microxml/) [![latest doc](https://img.shields.io/badge/latest_docs-GitHub-orange.svg)](https://bestia-dev.github.io/reader_for_microxml/reader_for_microxml/index.html) [![Licence](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/bestia-dev/reader_for_microxml/blob/main/LICENSE)
//!
//! [![Lines in Rust code](https://img.shields.io/badge/Lines_in_Rust-278-green.svg)](https://github.com/bestia-dev/reader_for_microxml/)
//! [![Lines in Doc comments](https://img.shields.io/badge/Lines_in_Doc_comments-208-blue.svg)](https://github.com/bestia-dev/reader_for_microxml/)
//! [![Lines in Comments](https://img.shields.io/badge/Lines_in_comments-64-purple.svg)](https://github.com/bestia-dev/reader_for_microxml/)
//! [![Lines in examples](https://img.shields.io/badge/Lines_in_examples-222-yellow.svg)](https://github.com/bestia-dev/reader_for_microxml/)
//! [![Lines in tests](https://img.shields.io/badge/Lines_in_tests-287-orange.svg)](https://github.com/bestia-dev/reader_for_microxml/)
//!
//! There are many xml parsers/readers/tokenizers/lexers around, but I need something very small and simple for my simple html templates in wasm.\
//! I found the existence of a standard (or W3C proposal) for *MicroXml* - dramatically simpler than the full Xml standard. Perfect for my use-case: I have small simple html files that are microXml compatible.
//!
//! ## microXml
//!
//! MicroXML is a subset of XML. It is dramatically simpler.\
//! <https://www.xml.com/articles/2017/06/03/simplifying-xml-microxml/>\
//! <https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html>\
//! MicroXML is actually well-formed Xml.\
//! In the data model of MicroXml there are no CData, namespaces, declarations, processing instructions,...\
//! An example of everything that can be done in well-formed microXml:
//!
//! ```xml
//! <memo lang="en" date="2017-05-01">
//!     I <em>love</em> microXML!<br />
//!     <!-- some comment -->
//!     It's so clean &amp; simple.
//! </memo>
//! ```
//!
//! MicroXml can be only in utf-8. I am lucky, because Rust Strings are internally utf-8 and are automatically checked for correctness.\
//! MicroXml should go through normalization: CR & CRLF should be converted to LF, but I don't do that here. Also decoding xml control characters `&quot;`, `&amp;`,... or decoding numeric character references (the `&#x...;` form) is not inside the reader. This is left for a higher library to choose what to do with it.\
//! MicroXml can contain Comments, but they are not official microXml data. But I need them for my templating project.\
//! Whitespaces are completely preserved in Text Nodes. For me they are significant. Also newlines and Tabs. This is different from full Xml whitespace processing.\
//! All other whitespaces are ignored - they are insignificant.
//!
//! ## reader
//!
//! This ReaderForMicroXml obviously cannot read a complicated full XML.\
//! This `reader_for_microxml` is used for small html fragments.\
//! They must be well-formed microXml.\
//! These fragments are meant for html templating for dodrio.\
//! Because of the small size of fragments, I can put all the text in memory in a string.\
//! Only basic malformed input produces errors. I am not trying to return errors for every possible way microXml can be malformed.\
//! The speed is not really important, but the size of the code is, because it will be used in WebAssembly. Every code is too big for Wasm!\
//! The crate has `#![no_std]`, `#![forbid(unsafe_code)]`, NO dependencies, NO allocations.
//!
//! ## iterator
//!
//! The reader is an iterator.\
//! It implements the `Iterator` trait.\
//! Use this syntax to process all tokens:\
//! `for result_token in reader_iterator {`\
//! or\
//! `let x: Option<Result<Token, &str>> = reader_iterator.next();`
//!
//! ## Tests
//!
//! Run 16 tests with:\
//! `cargo make test`
//!
//! ## Examples
//!
//! Find examples in the repository on github.\
//! Run them with:\
//! `cargo make run_rel1`\
//! `cargo make run_rel2`\
//! it is a shortcut to:\
//! `cargo run --example microxml_tree examples/t2.html`
//!
//! ```rust
//! // read xml and write to screen
//! use reader_for_microxml::*;
//!
//! fn main(){
//!     let str_xml = r#"<html>test</html>"#;
//!     let mut reader_iterator = ReaderForMicroXml::new(str_xml);
//!     let result = read_xml_to_debug_string(&mut reader_iterator);
//!     println!("Result: {}", result)
//! }
//!
//! fn read_xml_to_debug_string(reader_iterator: &mut ReaderForMicroXml) -> String {
//!     let mut result = String::new();
//!     // reader_iterator is iterator Option<Result<Token,&str>>
//!     // the first option is used for the iterator to know where is the end
//!     // then the Result can have a Token or an Error
//!     for result_token in reader_iterator {
//!         match result_token {
//!             Ok(token) => match token {
//!                 Token::StartElement(name) => {
//!                     result.push_str(&format!("Start: \"{}\"\n", name));
//!                 }
//!                 Token::Attribute(name, value) => {
//!                     result.push_str(&format!("Attribute: \"{}\" = \"{}\"\n", name, value));
//!                 }
//!                 Token::TextNode(txt) => {
//!                     result.push_str(&format!("Text: \"{}\"\n", txt));
//!                 }
//!                 Token::Comment(txt) => {
//!                     result.push_str(&format!("Comment: \"{}\"\n", txt));
//!                 }
//!                 Token::EndElement(name) => {
//!                     result.push_str(&format!("End: \"{}\"\n", name));
//!                 }
//!             },
//!             Err(err_msg) => {
//!                 // a format string is required: panic!(err_msg) is an error in Rust 2021
//!                 panic!("{}", err_msg);
//!             }
//!         }
//!     }
//!     //return
//!     result
//! }
//! ```
//!
//! ## used in projects
//!
//! <https://github.com/bestia-dev/cargo_crev_web>\
//! <https://github.com/bestia-dev/dodrio_templating>\
//! <https://github.com/bestia-dev/mem6_game>
//!
//! ## cargo crev reviews and advisory
//!
//! It is recommended to always use [cargo-crev](https://github.com/crev-dev/cargo-crev)\
//! to verify the trustworthiness of each of your dependencies.\
//! Please, spread this info.\
//! On the web use this url to read crate reviews. Example:\
//! <https://web.crev.dev/rust-reviews/crate/num-traits>
//!
//! ## Ideas for the future
//!
//! ### Speed
//!
//! The speed could probably be improved if I use Vec\<u8\> instead of CharIndices. That could work because all the xml delimiters are ASCII characters. A property of the UTF-8 encoding is that ASCII characters can in no way be misinterpreted inside a string. Their bytes always have the first bit set to 0.\
//! All other unicode characters are multi-byte and all these bytes MUST start with bit 1.\
//! So there is no way of having them confused.\
//! <https://betterexplained.com/articles/unicode/>\
//! <https://naveenr.net/unicode-character-set-and-utf-8-utf-16-utf-32-encoding/>
//!
//! ## References
//!
//! <https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html>\
//! <https://www.xml.com/articles/2017/06/03/simplifying-xml-microxml/>\
//! <https://github.com/tafia/quick-xml>\
//! <https://github.com/RazrFalcon/roxmltree>
//!
// endregion: lmake_md_to_doc_comments include README.md A //!

#![no_std]
#![forbid(unsafe_code)]

pub struct PosChar {
    pub pos: usize,
    pub ch: char,
}

/// struct Reader for MicroXml - the Class
/// Rust has Structs + Traits, but for me it is just like Class/Object.
/// Just without inheritance.
/// All the fields are internal and not public.
/// The only way to interact is through methods.
pub struct ReaderForMicroXml<'a> {
    /// reference to the xml string (no allocation)
    input: &'a str,
    /// Iterator CharIndices over the input string
    indices: core::str::CharIndices<'a>,
    /// I need to know the TagState for programming as a state machine
    tag_state: TagState,
    /// the last read character from the indices iterator
    last_char: PosChar,
    /// for significant whitespace (in TextNode beginning)
    start_of_text_node_before_whitespace: usize,
}

/// The reader_for_microxml returns tokens.
/// The caller will manage these tokens. So they must be public.
/// The string slices are references to the original string with microXml text
#[derive(Clone, Debug)]
pub enum Token<'a> {
    /// Start of xml element
    StartElement(&'a str),
    /// End of xml element
    EndElement(&'a str),
    /// Attribute
    Attribute(&'a str, &'a str),
    /// Text node between `StartElement` and `EndElement`.
    TextNode(&'a str),
    /// comment node
    Comment(&'a str),
}

/// internal enum: Tags are strings inside delimiters `<` and `>` like `<div>` or `</div>`
enum TagState {
    /// outside of tag
    OutsideOfTag,
    /// inside of tag
    InsideOfTag,
    /// reached normal end of file
    EndOfFile,
}

impl PosChar {
    pub fn set(&mut self, tup: (usize, char)) {
        self.pos = tup.0;
        self.ch = tup.1;
    }
}

impl<'a> ReaderForMicroXml<'a> {
    /// Constructor. String is immutably borrowed here. No allocation.
    pub fn new(input: &str) -> ReaderForMicroXml {
        // CharIndices is an iterator that returns a tuple: (pos, ch).
        // I convert this into PosChar{pos, ch} for easier coding.
        // The "byte" position for using the string slice and the character.
        // This is a complication because one utf-8 character can have more bytes.
        // And the slices are defined by "bytes position", not by "character position".
        // Very important distinction!
        let mut indices = input.char_indices();
        let mut last_char = PosChar { pos: 0, ch: ' ' };
        // read the first character only when there is one
        if !input.is_empty() {
            // unwrap because it cannot error if the string is not empty
            last_char.set(indices.next().unwrap());
        }
        ReaderForMicroXml {
            input,
            indices,
            tag_state: TagState::OutsideOfTag,
            last_char,
            start_of_text_node_before_whitespace: 0,
        }
    }

    /// Reads the next token (internal).
    /// The internal function can understand when the Eof is in a correct position
    /// and stops the propagation of Option None.
    #[allow(clippy::integer_arithmetic, clippy::nonminimal_bool)]
    fn read_token_internal(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        match &self.tag_state {
            TagState::OutsideOfTag => {
                if self.start_of_text_node_before_whitespace == 0 {
                    self.start_of_text_node_before_whitespace = self.last_char.pos;
                }
                self.move_over_whitespaces()?;
                // Tags can look like this:
                // Start Tags: < xxx >, < xxx attr="val" >, < xxx />
                // End Tags: </xxx>
                // Comments: <!-- xxx -->
                // start delimiter is <
                if self.last_char.ch == '<' {
                    self.tag_state = TagState::InsideOfTag;
                    self.move_next_char()?;
                    self.move_over_whitespaces()?;
                    // if it is not comment or end tag, must be the element name
                    if !(self.last_char.ch == '!' || self.last_char.ch == '/') {
                        self.read_element_name()
                    } else if self.last_char.ch == '!' {
                        // this is a comment <!-- xxx -->
                        // comments are not data in MicroXml standard
                        // but I need them for my templating project
                        self.read_comment()
                    } else {
                        // the end element looks like this </xxx>
                        self.read_end_element()
                    }
                } else {
                    // the text node is between elements so looks like this
                    // > text <
                    self.read_text_node()
                }
            }
            TagState::InsideOfTag => {
                self.move_over_whitespaces()?;
                // InsideOfTag (after name) can be > or attributes or self_closing
                // < xxx >, < xxx attr="val" >, < xxx />
                // if it is not self-closing or > then must be an attribute
                if self.last_char.ch == '>' {
                    // here must be the end of start tag >
                    self.move_next_char()?;
                    self.tag_state = TagState::OutsideOfTag;
                    self.start_of_text_node_before_whitespace = 0;
                    // recursive calling
                    return self.read_token_internal();
                } else if self.last_char.ch == '/' {
                    // self-closing element
                    self.move_next_char()?; // to >
                    self.move_over_whitespaces()?;
                    if self.last_char.ch != '>' {
                        return Some(Err("Error: Tag has / but not />"));
                    } else {
                        self.move_next_char()?;
                        self.tag_state = TagState::OutsideOfTag;
                        self.start_of_text_node_before_whitespace = 0;
                        return Some(Ok(Token::EndElement("")));
                    }
                } else {
                    // attribute
                    self.read_attribute()
                }
            }
            TagState::EndOfFile => {
                //return None to stop the iterator
                None
            }
        }
    }

    /// Reads the element name
    /// Propagation of Option None if is Eof
    fn read_element_name(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // start of tag name < xxx >
        self.move_over_whitespaces()?;
        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // read until delimiter space, / or >
            if self.last_char.ch.is_whitespace()
                || self.last_char.ch == '/'
                || self.last_char.ch == '>'
            {
                end_pos = self.last_char.pos;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        self.move_over_whitespaces()?;
        self.tag_state = TagState::InsideOfTag;
        // unwrap because I am confident that start_pos or end_pos are correct
        return Some(Ok(Token::StartElement(
            self.input.get(start_pos..end_pos).unwrap(),
        )));
    }

    /// Reads the attribute name and value.
    /// Return Option None if Eof.
    fn read_attribute(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        self.move_over_whitespaces()?;
        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // delimiters are whitespace or =
            if self.last_char.ch.is_whitespace() || self.last_char.ch == '=' {
                end_pos = self.last_char.pos;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        // unwrap because I am confident that start_pos or end_pos are correct
        let attr_name = self.input.get(start_pos..end_pos).unwrap();

        // region: skip delimiters: whitespace, =, "
        self.move_over_whitespaces()?;
        if self.last_char.ch == '=' {
            self.move_next_char()?;
        }
        self.move_over_whitespaces()?;
        if self.last_char.ch == '"' {
            self.move_next_char()?;
        } else {
            return Some(Err("Error: Attribute does not have the char = ."));
        }
        // endregion

        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // end delimiter is "
            if self.last_char.ch == '"' {
                end_pos = self.last_char.pos;
                self.move_next_char()?;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        self.move_over_whitespaces()?;
        // unwrap because I am confident that start_pos or end_pos are correct
        let attr_value = self.input.get(start_pos..end_pos).unwrap();
        // return
        Some(Ok(Token::Attribute(attr_name, attr_value)))
    }

    /// reads end element
    fn read_end_element(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // end tag for element </ xxx >
        // we are already at the / char
        self.move_next_char()?;
        self.move_over_whitespaces()?;
        let start_pos = self.last_char.pos;
        let end_pos;
        loop {
            // read until space or >
            if self.last_char.ch.is_whitespace() || self.last_char.ch == '>' {
                end_pos = self.last_char.pos;
                break;
            } else {
                self.move_next_char()?;
            }
        }
        self.move_over_whitespaces()?;
        if self.last_char.ch == '>' {
            // after the End element is possible to have a correct Eof
            if let Some(()) = self.move_next_char() {
                //dbg!(self.last_char.pos);
                self.start_of_text_node_before_whitespace = self.last_char.pos;
                if let Some(()) = self.move_over_whitespaces() {
                    self.tag_state = TagState::OutsideOfTag;
                } else {
                    self.tag_state = TagState::EndOfFile;
                }
            } else {
                self.tag_state = TagState::EndOfFile;
            }
            return Some(Ok(Token::EndElement(
                // unwrap because I am confident that start_pos or end_pos are correct
                self.input.get(start_pos..end_pos).unwrap(),
            )));
        } else {
            return Some(Err("End Element does not have > ."));
        }
    }

    /// Reads text node
    /// I don't do any encoding/decoding here, because I need it "as is" for html templating.
    /// I preserve all the "significant" whitespaces because I will use this for templating.
    /// And because there is no hard standard for trailing spaces in xml text node.
    /// If reached Eof propagates Option None.
    fn read_text_node(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // text element looks like this > some text <
        // it has significant whitespace start
        let start_pos = self.start_of_text_node_before_whitespace;
        // reset it to 0, because I don't need it more here
        // and this is the signal to store a new one.
        self.start_of_text_node_before_whitespace = 0;
        let mut end_pos;
        loop {
            //dbg!(self.last_char.ch);
            end_pos = self.last_char.pos;
            // end delimiter is < or end of file
            if self.last_char.ch == '<' {
                self.tag_state = TagState::OutsideOfTag;
                break;
            } else {
                if self.move_next_char().is_none() {
                    // Eof: the text node runs to the end of the input.
                    // Use the byte length instead of pos + 1,
                    // because the last char can be multi-byte.
                    end_pos = self.input.len();
                    self.tag_state = TagState::EndOfFile;
                    break;
                }
            }
        }
        // unwrap because I am confident that start_pos or end_pos are correct
        //dbg!(end_pos);
        return Some(Ok(Token::TextNode(
            self.input.get(start_pos..end_pos).unwrap(),
        )));
    }

    /// Comments are not data for MicroXml standard,
    /// But I need them as data for my templating project.
    /// The Option is returned only because of Option None propagation because of Eof.
    fn read_comment(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // comments look like this <!-- xxx -->
        // we should be now at the second character <!
        self.move_next_char()?; // skip char !
        self.move_next_char()?; // skip char -
        self.move_next_char()?; // skip char -
        let start_pos = self.last_char.pos;
        let end_pos;
        // read until end of comment -->
        let mut ch1 = ' ';
        let mut ch2 = ' ';
        loop {
            let ch3 = self.last_char.ch;
            // end delimiter -->
            if ch1 == '-' && ch2 == '-' && ch3 == '>' {
                end_pos = self.last_char.pos - 2;
                self.move_next_char()?;
                break;
            } else {
                ch1 = ch2;
                ch2 = ch3;
                self.move_next_char()?;
            }
        }
        // it is possible to have a comment in between 2 text nodes
        self.start_of_text_node_before_whitespace = 0;
        self.tag_state = TagState::OutsideOfTag;
        // unwrap because I am confident that start_pos or end_pos are correct
        return Some(Ok(Token::Comment(
            self.input.get(start_pos..end_pos).unwrap(),
        )));
    }

    // region: methods for iterator

    /// Moves the iterator and stores the last_char.
    /// Iterator next() of CharIndices is consuming the char.
    /// There is no way back to the same char.
    /// But often I need to get again the same character of the last operation.
    /// I tried with peekable.peek(), but it gives a reference and this was a problem.
    /// So now move_next_char() stores the character in the field last_char
    /// for repeated use.
    /// Anytime it can reach the End of File (Eof),
    /// then it propagates the Option None to the caller with the ? syntax.
    /// Only the caller knows if the Eof here is ok or it is an unexpected error.
    /// The `()` inside the Option is only a dummy,
    /// only because I need to propagate the Option None because of Eof
    fn move_next_char(&mut self) -> Option<()> {
        // Eof can be reached anytime. I will propagate None to the caller with ?
        self.last_char.set(self.indices.next()?);
        // returns a dummy only because of Option None propagation with ?
        Some(())
    }

    /// Skips all whitespaces if there are any
    /// and stops when the last_char is not whitespace.
    /// The caller saves the whitespace beginning position before calling, because
    /// it must know if the whitespaces are insignificant. For example TextNode.
    /// If found Eof, propagates Option None.
    fn move_over_whitespaces(&mut self) -> Option<()> {
        loop {
            if !self.last_char.ch.is_whitespace() {
                return Some(());
            } else {
                self.move_next_char()?;
            }
        }
    }
    // endregion
}

impl<'a> Iterator for ReaderForMicroXml<'a> {
    type Item = Result<Token<'a>, &'static str>;
    /// Reads the next token: StartElement, Attribute, Text, EndElement
    fn next(&mut self) -> Option<Result<Token<'a>, &'static str>> {
        // return
        self.read_token_internal()
    }
}