
1 Fixing and editing Word HTML

1.1 Word TOC HTML

1.1.1 Summary

1.1.2 Survey of the types of TOC entries in the Word HTML:

1.1.2.1 HTML TOC entries, based on the possible TOC options

1.1.2.2 TOC paragraph with one or more tags:  "<span lang= ..."

1.1.2.3 TOC paragraph with multiple anchor tags: "<a..."

1.1.3 Experiments

1.1.3.1 Overview

1.1.3.2 Testing the different TOC-Option combinations

1.1.3.3 TOC-experiment-01.docx

1.1.3.4 TOC-experiment-02.docx

1.1.3.5 TOC-experiment-03.docx

1.1.3.6 Word doc created via File : New

1.1.3.7 Word's default Normal.dot file

1.1.3.8 example-book-formatting.doc

1.2 Word unordered lists

1.2.1 Summary

1.2.2 Survey of the types of unordered lists

1.2.2.1 From my docs

1.2.2.2 Downloaded doc

1.2.3 Unordered lists: my Word app

1.2.3.1 Word HTML bullet symbols

1.2.3.1.1 Wingdings'>§<

1.2.3.1.1.1 Found in testing, had to fix my code to handle this

1.2.3.1.2 >·<

1.2.3.1.3 Courier New"'>o<

1.2.3.2 Indentation

1.2.4 Unordered lists:  downloaded .doc file

1.2.5 Other unordered lists

1.3 Word ordered-lists

1.3.1 Summary of ordered-list list-items’ HTML

1.3.2 Survey

1.3.2.1 Ordered-lists: my Word app

1.3.2.1.1 Indentation problems

1.3.2.2 headings-bullets-default-norma-dot.htm

1.3.2.3 Ordered-lists:  downloaded .doc file

1.3.2.4 Word multi-level lists

1.3.2.4.1 downloaded .doc file

1.3.2.4.1.1 Generalized HTML

1.3.2.4.1.2 Examples

1.3.2.4.2 Other files

1.4 Invisible paragraph

1.4.1 Bug in Word HTML

1.4.2 Investigation, for creating a fix

1.4.2.1 PublicWWW.com searches

1.4.2.2 HTML for setting the background color

1.4.3 Word color-related formatting

1.4.3.1 Info from MS's HTML specs

1.5 Word HTML structure

1.5.1 Overall

1.5.2 Top-most HTML tags, including <meta> tags

1.5.3 div's

1.6 Unconventional characters

2 Shortcomings and bugs in Word's HTML

2.1 Word messages about HTML formatting problems

2.2 Problems from spurious section-breaks in Word doc

2.3 Note:  Word HTML uses <ol> and <li>

2.4 Equation page-layout bug

2.5 Images can become blurred

2.6 Text boxes converted to illegible images, and layout problems

2.7 Text-box not included in Word HTML document

2.8 Excessive space between lines

2.9 References to page-numbers

2.10 Things HTML can't do, but Word can

3 Word HTML:  not a bug

3.1 Effects of "justified" text

 

WWN Development Document

 

Word HTML Bugs :

bugs, bug-fixes, and problems

 

 

Word’s Navigation pane shows the table-of-contents (View : Show : Navigation pane).

 

·         Contents:

o      For Word HTML bugs

§  Bugs, bug-fixes, and HTML editing

o      Shortcomings in Word’s HTML, due to document formatting and layout

 

This document was created by the WWN author for his own use in developing WWN.  It is included in the WWN repo, as other developers may find it useful.

 

1  Fixing and editing Word HTML

1.1  Word TOC HTML

·         TOC info is also in:  word-html--table-of-contents.docx

1.1.1  Summary

·         Summary

o      The objective:  change link formatting (underline, color change), extract html to put in a TOC div

o      The required change is to add CSS class specifications (the details still need to be described)

o      The general format of the TOC-entry paragraphs is shown below.

 

·         General format of TOC-entry paragraphs with hyperlinks

o      The paragraph has one child tag, and that child tag can have children

 

<p class=MsoToc1>[nested span opening-tags]<a href="#_Toc68878247">[span lang= opening-tag][text][span lang= closing-tag][page-number span tags, with display:none;]</a>[nested span closing-tags]</p>

 

<p class=MsoToc1>[anchor tags with no text]<a href="#_Toc68878247">[text]</a></p>

 

o      All TOC-entry paragraphs start with <p class=MsoToc[1-9]>

o      Not shown:  there can potentially be whitespace after opening-tags and/or before closing-tags, e.g., due to newlines.

o      [nested span opening-tags]

§  Can be none, one or two of these span tags

§  Two types of span opening-tags:

·         <span lang=DE>

·         <span class=MsoHyperlink>

§  If there is only one, I have only seen <span class=MsoHyperlink>

§  If both, they are nested and the types can be in any order

<span lang=DE><span class=MsoHyperlink>...</span></span>

<span class=MsoHyperlink><span lang=DE>...</span></span>

§  If one or both, they contain the <a>...</a> tag

·         The span closing-tags go after the </a>:  [nested span closing-tags]*

 

·         General format of TOC-entry paragraphs without hyperlinks:

o      See below
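The general format above can be sketched as a small parser. This is only an illustration, not WWN's code: the function name parse_toc_entries and the regular expressions are hypothetical, and the sketch assumes filtered Word HTML in which each TOC entry is one <p class=MsoToc[1-9]> paragraph.

```python
import re
from html import unescape

# One TOC-entry paragraph; the class value may be quoted or unquoted.
TOC_PARA_RE = re.compile(
    r'<p\s+class="?MsoToc([1-9])"?[^>]*>(.*?)</p>',
    re.IGNORECASE | re.DOTALL,
)
# The hyperlink anchor; name-only anchors (no href) do not match.
TOC_LINK_RE = re.compile(
    r'<a\s+[^>]*href="#([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
# The hidden page-number spans (style contains display:none).
HIDDEN_SPAN_RE = re.compile(
    r"<span\b[^>]*display:\s*none[^>]*>.*?</span>",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def parse_toc_entries(html):
    """Return (level, target, text) tuples; target is None if there is no link."""
    entries = []
    for level, body in TOC_PARA_RE.findall(html):
        link = TOC_LINK_RE.search(body)
        if link:
            target, inner = link.group(1), link.group(2)
        else:
            target, inner = None, body
        inner = HIDDEN_SPAN_RE.sub("", inner)      # drop undisplayed page numbers
        text = " ".join(unescape(TAG_RE.sub("", inner)).split())
        entries.append((int(level), target, text))
    return entries
```

A regex is used here only to keep the sketch short; an HTML parser would be more robust against unusual nesting.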

 

1.1.2  Survey of the types of TOC entries in the Word HTML:

1.1.2.1  HTML TOC entries, based on the possible TOC options

·         HTML TOC entries, based on the possible TOC options

o      For a particular TOC, all TOC entries are of the same type (same HTML elements)

 

·         TOC options:  hyperlinks, TOC includes no page numbers:

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68878247">1 L1</a></span></p>

 

·         TOC options:  unknown

o      The TOC entry has an anchor element (<a>), but NOT "<span class=MsoHyperlink>"

o      This TOC entry has a hyperlink, but I can't figure out what creates this type of TOC entry

§  It may be a combination of:  the headings used, other document content, and TOC options

§  The TOC options appear to be:  hyperlinks, page-numbers right-aligned

o      Example:  (headings-and-bullets.docx, headings-and-bullets.htm), (sys-admin.docx, sys-admin.htm)

<p class=MsoToc1><a href="#_Toc67590875">1 Level 1<span style='color:windowtext;

display:none;text-decoration:none'>. </span><span

style='color:windowtext;display:none;text-decoration:none'>1</span></a></p>

 

·         TOC options:  hyperlinks, TOC includes page numbers

o      Note: 

§  When page-numbers are included, there are two <span> elements after the TOC-entry's text.

§  Those <span> elements' text does not get displayed, and can be ignored.

·         Due to style='...display:none;...'

§  These <span> elements may be created by MS Word for its own use in opening the *.htm file, and displaying the TOC.

 

o      TOC options:  hyperlinks, TOC includes page numbers, right-align page numbers

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68878729">1 L1<span

style='color:windowtext;display:none;text-decoration:none'>. </span><span

style='color:windowtext;display:none;text-decoration:none'>1</span></a></span></p>

 

o      TOC options:  hyperlinks, TOC includes page numbers, page numbers not right-aligned

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68879103">1 L1<span

style='color:windowtext;display:none;text-decoration:none'> </span><span

style='color:windowtext;display:none;text-decoration:none'>1</span></a></span></p>

 

·         TOC options:  no hyperlinks

o      TOC options:  no hyperlinks, no page numbers

<p class=MsoToc1>1 L1</p>

o      TOC options:  no hyperlinks, page numbers

<p class=MsoToc1>1 L1................................................................................................................................................. 1</p>
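The survey above suggests that the TOC options can be guessed from a single entry's HTML. The following is a rough sketch under that assumption; classify_toc_entry and its return keys are hypothetical names, and the rules are inferred only from the samples in this section (they are not taken from any Word specification).

```python
import re

# Hidden page-number spans (style contains display:none); captures their text.
HIDDEN_SPAN_RE = re.compile(
    r"<span\b[^>]*display:\s*none[^>]*>(.*?)</span>",
    re.IGNORECASE | re.DOTALL,
)
# A dot leader followed by a page number, as in the no-hyperlink samples.
LEADER_RE = re.compile(r"\.{3,}\s*\d+\s*</p>", re.IGNORECASE)


def classify_toc_entry(paragraph_html):
    """Guess the TOC options behind one <p class=MsoToc...> paragraph."""
    has_link = re.search(r"<a\s[^>]*href=", paragraph_html, re.IGNORECASE) is not None
    hidden = HIDDEN_SPAN_RE.findall(paragraph_html)
    if has_link:
        page_numbers = len(hidden) >= 2
        # In the samples, the first hidden span holds ". " when page numbers
        # are right-aligned, and a single space when they are not.
        right_aligned = hidden[0].startswith(".") if page_numbers else None
    else:
        page_numbers = LEADER_RE.search(paragraph_html) is not None
        right_aligned = None
    return {
        "hyperlinks": has_link,
        "page_numbers": page_numbers,
        "right_aligned": right_aligned,
    }
```

The "unknown" entry type (an anchor without <span class=MsoHyperlink>) is classified the same way as the other hyperlinked entries, since the sketch only looks at the anchor and the hidden spans.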

 

1.1.2.2  TOC paragraph with one or more tags:  "<span lang= ..."

·         TOC paragraph with one or more tags:  "<span lang= ..."

o      "<span lang= ..." in 4 places below

1.1.2.3  TOC paragraph with multiple anchor tags: "<a..."

<p class="MsoToc1"><a class="tocAnchor" name="_Toc74303531"></a><a name="_Toc74303626"></a><a name="_Toc74304313"></a><a name="_Toc74304544"></a><a href="#_Toc74313249">1 L1-1</a></p>

 

1.1.3  Experiments

1.1.3.1  Overview

·         This section records experiments I did to determine what HTML Word creates for TOC entries.

 

·         Scope of experiments

o      Normal.dot files used:

§  Used my file, with my Word customizations

§  Used MS Word's default file

o      TOC options used

§  Options that varied across experiments

·         With and without:  page numbers, hyperlinks

·         Print and web layouts

§  Options used in all experiments

·         Format: from template

·         Options:  used default

·         Modify: used default

·         Save as:  Web, filtered

o      Experiment files are in my directory:  Word-TOC-entry-HTML-types

 

1.1.3.2  Testing the different TOC-Option combinations

·         Word new

o      Show page numbers:  off

§  Use hyperlinks: off

·         Web layout (v1)

<p class=MsoToc1>1 L1</p>

·         Print layout (v2)

<p class=MsoToc1>1 L1</p>

§  Use hyperlinks: on

·         Web layout (v3)

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68878247">1 L1</a></span></p>

·         Print layout (v7)

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68879418">1 L1</a></span></p>

o      Show page numbers:  on

§  Use hyperlinks: off

·         Web layout (v4)

<p class=MsoToc1>1 L1................................................................................................................................................. 1</p>

§  Use hyperlinks: on

·         Web layout

o      Right align page numbers: on (v5)

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68878729">1 L1<span

style='color:windowtext;display:none;text-decoration:none'>. </span><span

style='color:windowtext;display:none;text-decoration:none'>1</span></a></span></p>

o      Right align page numbers: off (v6)

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68879103">1 L1<span

style='color:windowtext;display:none;text-decoration:none'> </span><span

style='color:windowtext;display:none;text-decoration:none'>1</span></a></span></p>

 

·         XYPlorer new

o      Show page numbers:  off

§  Use hyperlinks: off

·         Web layout (v)

·         Print layout (v)

§  Use hyperlinks: on

·         Web layout (v)

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68899583">1 L1</a></span></p>

·         Print layout (v)

o      Show page numbers:  on

§  Use hyperlinks: off

·         Web layout (v)

§  Use hyperlinks: on

·         Web layout

o      Right align page numbers: on (v)

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68899914">1 L1<span

§  Print layout (v10)

<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc68901825">1 L1<span

o      Right align page numbers: off (v)

 

1.1.3.3  TOC-experiment-01.docx

·         Summary

o      TOC entries have links

o      HTML TOC entries

<p class=MsoToc2><span class=MsoHyperlink><a href="#_Toc68857208">1.2 Level 1.2</a></span></p>

 

<p class=MsoToc3><span class=MsoHyperlink><a href="#_Toc68857209">1.2.1 Level

1.2.1</a></span></p>

 

·         Config

 

·         HTML

 

 

1.1.3.4  TOC-experiment-02.docx

·         Summary

o      TOC entries do not have links. 

o      HTML TOC entries

<p class=MsoToc1>1 level 1</p>

 

·         Config

 

 

·         HTML

 

 

·         Chrome

o      No hyperlinks

 

1.1.3.5  TOC-experiment-03.docx

·         Summary

o      "Show page numbers" results in extra HTML, which is not displayed

o      The Web-Layout and Print-Layout views differ

§  Note this in user manual

 

·         TOC creation options used

 

·         TOC views

o      Web Layout

 

o      Print Layout

 

·         HTML

 

1.1.3.6  Word doc created via File : New

·         Other tests may have created the Word doc via "File:New" or the XYPlorer new-item Word doc

 

·         Test 1

 

·         Test 2

 

·         Test 3

o      Notes

§  Multiple names created for a header, and TOC has multiple href's to multiple names

§  Multiple names probably due to repeatedly creating and deleting a TOC in the doc

§  toc-test--.docx

o      TOC entry

<p class="MsoToc1"><a class="tocAnchor" name="_Toc74303531"></a><a name="_Toc74303626"></a><a name="_Toc74304313"></a><a name="_Toc74304544"></a><a href="#_Toc74313249">1 L1-1</a></p>

 

o      Heading entries

 

 

 

 

 

 

1.1.3.7  Word's default Normal.dot file

 

1.1.3.8  example-book-formatting.doc

·         The file is .doc, from the Internet

·         "<span lang= ..." in 4 places here

 

1.2  Word unordered lists

1.2.1  Summary

·         <p class=[class-name] ...>, possible class names:

MsoListParagraphCxSpFirst

MsoListParagraphCxSpMiddle

MsoListParagraphCxSpLast

MsoListParagraph

MsoNormal

--Found later during testing of downloaded Word docs

MsoBodyText

ScrollListBullet  (not supported in my code, needs more investigation)

O-BodyText   (not supported in my code, needs more investigation)

 

<p class=ScrollListBullet><span style='font-family:Symbol'>·<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>Step 3 allows you to choose the installation folder.  The default

is <i>C:\Program Files\NYSPRL</i>.</p>

 

 

·         <p... style=[style_spec]>

<p... style='margin-left:[.*]in;text-indent:[.*]in'>

o      [style_spec] usually has "margin-left:[.*]in;", but not always

§  If omitted, it's defined in the Word <style> section, for the paragraph class

o      Indentation seems ok and consistent

o      <p class=ScrollListBullet>

§  p tag does not have other attributes, e.g., no style=.  It's the only bullet like that

 

·         <p ...><span style='font-family:[font_name]'>[bullet_symbol]<span style=>[&nbsp;]</span></span>

o      outer span can be preceded by, and enclosed by:  <span class=MsoHyperlink>

o      outer span can be preceded by, and enclosed by: <font size=2 face=Symbol>

o      outer span:

§  style can also have "font-size:10.0pt;", "color:windowtext;", "text-decoration:none"

§  can have: lang=EN-GB

o      3 font symbols

§  <span style='font-family:Wingdings'>§

§  <span style='font-family:Symbol'>·

§  <span style='font-family:"Courier New"'>o

o      inner span:

§  "&nbsp;"'s can have a space after them, before </span>

 

·         Algorithm for list-item detection

o      paragraph has expected class-name, and style with "text-indent:"

o      Get strings for paragraph

o      First string:

§  a bullet-symbol

§  enclosed in a span, with style and expected font-family

o      Second string

§  "&nbsp;"'s and possibly whitespace

§  enclosed in a span with style

§  the span's parent is in the bullet-symbol's parent
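The detection steps above could be sketched as follows. This is an illustration only, not WWN's implementation: StringCollector and is_bullet_list_item are hypothetical names, the sketch uses Python's standard html.parser, and it assumes it is given the HTML of a single paragraph. The unsupported classes (ScrollListBullet, O-BodyText) are deliberately left out.

```python
from html.parser import HTMLParser

LIST_CLASSES = {
    "MsoListParagraphCxSpFirst", "MsoListParagraphCxSpMiddle",
    "MsoListParagraphCxSpLast", "MsoListParagraph", "MsoNormal", "MsoBodyText",
}
# Bullet symbol mapped to the (lower-cased) font family Word uses to draw it.
BULLET_FONTS = {"§": "wingdings", "·": "symbol", "o": "courier new"}
ASCII_WS = " \t\r\n"          # excludes U+00A0, which is what &nbsp; decodes to
VOID_TAGS = {"br", "img", "hr"}


class StringCollector(HTMLParser):
    """Collects each text string together with the stack of tags open around it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # (tag, attrs-dict) for each open element
        self.strings = []     # (text, copy of stack)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy nesting.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Skip strings that are only newlines/spaces between tags.
        if self.stack and data.strip(ASCII_WS):
            self.strings.append((data, list(self.stack)))


def is_bullet_list_item(paragraph_html):
    parser = StringCollector()
    parser.feed(paragraph_html)
    parser.close()
    if len(parser.strings) < 2:
        return False
    (symbol, sym_stack), (padding, pad_stack) = parser.strings[0], parser.strings[1]

    # Paragraph: expected class name, and a style with "text-indent:".
    tag, p_attrs = sym_stack[0]
    if tag != "p" or p_attrs.get("class") not in LIST_CLASSES:
        return False
    if "text-indent:" not in (p_attrs.get("style") or ""):
        return False

    # First string: a bullet symbol in a span whose style names the expected font.
    font = BULLET_FONTS.get(symbol.strip(ASCII_WS))
    sym_tag, sym_attrs = sym_stack[-1]
    style = (sym_attrs.get("style") or "").lower()
    if font is None or sym_tag != "span" or "font-family:" not in style or font not in style:
        return False

    # Second string: only &nbsp;'s and whitespace, in a styled span that is
    # nested directly inside the bullet symbol's span.
    pad_tag, pad_attrs = pad_stack[-1]
    return (
        padding.strip() == ""         # str.strip() also removes U+00A0
        and pad_tag == "span"
        and "style" in pad_attrs
        and pad_stack[:-1] == sym_stack
    )
```

Comparing the tag stacks is one way to express "the span's parent is the bullet-symbol's parent"; it also accepts the variant where the outer span is enclosed by <span class=MsoHyperlink>.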

 

1.2.2  Survey of the types of unordered lists

1.2.2.1  From my docs

·         Symbol types found

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;text-indent:-.25in'>

<span style='font-family:Wingdings'>§<span style='font:7.0pt "Times New Roman"'>&nbsp;

</span></span>

[text]</p>

 

<p class=MsoListParagraphCxSpFirst style='margin-left:.25in;text-indent:-.25in'>

<span style='font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>

[text]</p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:.75in;text-indent:-.25in'>

<span style='font-family:"Courier New"'>o<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;

</span></span>

[text]</p>

 

 

1.2.2.2  Downloaded doc

<p class=MsoNormal style='margin-left:.5in;text-indent:-.25in'>

 

<span lang=EN-GB style='font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>

<span lang=EN-GB>

[text]</p>

 

 

1.2.3  Unordered lists: my Word app

1.2.3.1  Word HTML bullet symbols

1.2.3.1.1  Wingdings'>§<

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;text-indent:-.25in'><span

style='font-family:Wingdings'>§<span style='font:7.0pt "Times New Roman"'>&nbsp;

</span></span>level 3</p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;text-indent:-.25in'><span

style='font-size:10.0pt;font-family:Wingdings'>§<span style='font:7.0pt "Times New Roman"'>&nbsp;

</span></span>  <span style='font-size:10.0pt;font-family:"Courier New"'>pip

install pprintpp </span></p>

 

>>> Space after &nbsp;

>>> Inside <span class=MsoHyperlink>

>>> "Wingdings;" followed by:  color:windowtext;text-decoration:none

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;text-indent:-.25in'><span

class=MsoHyperlink><span style='font-family:Wingdings;color:windowtext;

text-decoration:none'>§<span style='font:7.0pt "Times New Roman"'>&nbsp; </span></span></span><a

href="https://pymotw.com/2/ConfigParser/">https://pymotw.com/2/ConfigParser/</a></p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;text-indent:-.25in'><span

class=MsoHyperlink><span style='font-family:Wingdings;text-decoration:none'>§<span

style='font:7.0pt "Times New Roman"'>&nbsp; </span></span></span><a

href="https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/chkdsk">https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/chkdsk</a></p>

 

>>> full html:  D:\Documents\Professional-projects\My-web-site-development\Word-to-HTML\Word-to-HTML-experiments\headings-and-bullets--full-html.htm

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;mso-add-space:

auto;text-indent:-.25in;mso-list:l4 level3 lfo41'><![if !supportLists]><span

style='font-family:Wingdings;mso-fareast-font-family:Wingdings;mso-bidi-font-family:

Wingdings'><span style='mso-list:Ignore'>§<span style='font:7.0pt "Times New Roman"'>&nbsp;

</span></span></span><![endif]>level 3</p>

 

1.2.3.1.1.1  Found in testing, had to fix my code to handle this

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;text-indent:-.25in'><span

class=MsoHyperlink><span style='font-family:Wingdings;text-decoration:none'>§<span

style='font:7.0pt "Times New Roman"'>&nbsp; </span></span></span><a

href="https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/chkdsk">https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/chkdsk</a></p>

 

>>> "class=MsoListParagraph"

<p class=MsoListParagraph style='margin-left:1.25in;text-indent:-.25in'><span

style='font-family:Wingdings'>§<span style='font:7.0pt "Times New Roman"'>&nbsp;

</span></span>Click &quot;Expand All&quot; type of link to see all downloads</p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;text-indent:-.25in'><span

style='font-size:10.0pt;font-family:Wingdings'>§<span style='font:7.0pt "Times New Roman"'>&nbsp;

</span></span> <span style='font-size:10.0pt;font-family:"Courier New"'>127.0.0.1

1www.jimyuill.org</span></p>

 

1.2.3.1.2  >·<

<p class=MsoListParagraphCxSpFirst style='margin-left:.25in;text-indent:-.25in'><span

style='font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>YouTube video shows how to install new ssd and hdd</p>

 

·         Note:  p style doesn't have "margin-left:.25in;"

 

<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in'><span

style='font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>level 1</p>

 

1.2.3.1.3  Courier New"'>o<

<p class=MsoListParagraphCxSpMiddle style='margin-left:.75in;text-indent:-.25in'><span

style='font-family:"Courier New"'>o<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;

</span></span>level 2</p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:.75in;text-indent:-.25in'><span

style='font-family:"Courier New"'>o<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;

</span></span><a href="https://www.youtube.com/watch?v=uEIlmHyJ8V4">https://www.youtube.com/watch?v=uEIlmHyJ8V4</a>

</p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:.75in;text-indent:-.25in'><span

style='font-size:10.0pt;font-family:"Courier New"'>o<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;

</span></span><span style='font-size:10.0pt;font-family:"Courier New"'>pip

install yamllint</span></p>

 

1.2.3.2  Indentation

·         Summary

o      Indentation seems ok and consistent

 

·         From:  headings-and-bullets-01.htm

 

·         This list starts not indented; it's created by my VBA macro (F11)

L1: 'margin-left:.25in;text-indent:-.25in'

L2: 'margin-left:.75in;text-indent:-.25in'

L3: 'margin-left:1.25in;text-indent:-.25in'

L4: 'margin-left:1.75in;text-indent:-.25in'

L5: 'margin-left:2.25in;text-indent:-.25in'

L6: 'margin-left:2.75in;text-indent:-.25in'

L7: 'margin-left:3.25in;text-indent:-.25in'

L8: 'margin-left:3.75in;text-indent:-.25in'

L9: 'margin-left:4.25in;text-indent:-.25in'

 

·         This list starts indented; it's created by default

L1: 'text-indent:-.25in'

L2: 'margin-left:1.0in;text-indent:-.25in'

L3: 'margin-left:1.5in;text-indent:-.25in'

L4: 'margin-left:2.0in;text-indent:-.25in'

L5: 'margin-left:2.5in;text-indent:-.25in'

L6: 'margin-left:3.0in;text-indent:-.25in'

L7: 'margin-left:3.5in;text-indent:-.25in'

L8: 'margin-left:4.0in;text-indent:-.25in'

L9: 'margin-left:4.5in;text-indent:-.25in'
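Because the margins in both tables grow by .5in per level, a bullet's level can be recovered from its style. A minimal sketch, assuming that step and a known first-level margin: bullet_level is a hypothetical helper, and the .5in first-level margin used for the default list is an inference from the sequence (Word defines the actual value in its <style> section).

```python
import re

MARGIN_RE = re.compile(r"margin-left:\s*(\d*\.?\d+)in")


def bullet_level(style, first_level_margin, step=0.5):
    """Return the 1-based list level implied by a paragraph's style string."""
    match = MARGIN_RE.search(style)
    if match is None:
        # No inline margin-left: the margin comes from the paragraph class,
        # which in the samples above only happens at level 1.
        return 1
    margin = float(match.group(1))
    return round((margin - first_level_margin) / step) + 1
```

For example, bullet_level("margin-left:1.25in;text-indent:-.25in", 0.25) corresponds to the L3 row of the macro-created list.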

1.2.4  Unordered lists:  downloaded .doc file

·         example-book-formatting.doc

o      Deeper levels are not possible (not supported by older Word?)

o      level-1:

<p class=MsoNormal style='margin-left:.5in;text-indent:-.25in'><span

lang=EN-GB style='font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span><span lang=EN-GB>Id facilis reformidans eum</span></p>

1.2.5  Other unordered lists

·         MS-tutorial--Deep Learning for Signal and Information Processing.docx

 

<p class=MsoBodyText style='margin-left:26.25pt;text-indent:-26.25pt'><span

style='font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span><b><i>Definition 1</i></b><i>:</i> A class of machine learning ...

</p>

 

·         pesticide-report.htm

o      The ScrollListBullet paragraphs may be spurious;  they apparently contain no useful text

o      ScrollListBullet3 does not have a bullet symbol

<p class=ScrollListBullet3 style='margin-left:.75in;text-indent:0in'>&nbsp;</p>

 

<p class=ScrollListBullet><span style='font-family:Symbol'>·<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>&nbsp;</p>

 

·         software-install-guide.html

o      ScrollListBullet

§  <p...> has no text-indent

o      ScrollListBullet3

§  seems to be spurious and is just used to hold a table

 

<p class=ScrollListBullet3 style='margin-left:.75in;text-indent:0in'>

 

<p class=ScrollListBullet><span style='font-family:Symbol'>·<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>Step 3 allows you to choose the installation folder.  The default

is <i>C:\Program Files\NYSPRL</i>.</p>

 

·         media-use-terms.html

 

<p class=O-BodyText style='margin-left:.75in;text-align:justify;text-indent:

-.25in;line-height:normal'><font size=2 face=Symbol><span lang=EN

style='font-size:11.0pt;font-family:Symbol'>·<font size=1 face="Times New Roman"><span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></font></span></font><font

size=2 face=Calibri><span lang=EN style='font-size:11.0pt;font-family:"Calibri",sans-serif'>in

logos, trademarks, services marks or any other branding or identifiers.</span></font></p>

 

1.3  Word ordered-lists

·         An example of a typical list-item paragraph:

o      The line labels (e.g., Line A:) are not in the HTML

Line A:  <p class=MsoListParagraphCxSpMiddle style='margin-left:1.5in;text-indent:-1.5in'>

Line B:  <span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;</span>i.<span

Line C:  style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;</span>List-item text</p>

 

·         There are two commonly-found bugs in these list-items:

o      The list-item symbol is often not properly indented

o      The text after the list-item symbol is often not properly indented

§  (it does not have the proper number of spaces between the symbol and the start of the text)

1.3.1  Summary of ordered-list list-items’ HTML

·         <p class=[class-name] ...>

o      Possible class names:

MsoListParagraphCxSpFirst

MsoListParagraphCxSpMiddle

MsoListParagraphCxSpLast

MsoNormal

o      class-name NOT observed, but used with unordered lists

MsoListParagraph

---- Found in later testing

MsoBodyText

 

·         <p... style=...> starts with:

o      always has style with "text-indent:"

§  The 'text-indent:...' value is inconsistent and often incorrect

§  Using 'text-indent:-.25in' for all levels fixes the problem

o      style usually has "margin-left:", but not always

<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in'>

<p class=MsoListParagraphCxSpLast style='margin-left:4.0in;text-indent:-.25in'>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.0in;text-indent:-.25in'>

<p class=MsoListParagraphCxSpLast style='margin-left:4.0in;text-indent:-.25in'>

<p class=MsoNormal style='margin-left:.5in;text-indent:-.25in'>

 

·         Summary:  headings-bullets-default-norma-dot.htm

<p...>[item-number"."]<span style=>[&nbsp;'s]</span>[text]

<p...><span style=>[&nbsp;'s]</span>[item-number"."]<span style=>[&nbsp;'s]</span>[text]

 

·         Summary:  multi-level lists, from downloaded .doc file

<p...><span lang= >[item_number"."]<span style=>["&nbsp;"s]</span></span>

<p...><span lang= ><span style=>[&nbsp;]</span>[item_number"."]<span style=>[&nbsp;]</span></span>

 

·         Item-number formats

o      From Word ribbon:

§  "item-number" can be followed by:  ), ], .

§  "item-number" can be preceded by: [
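The item-number formats listed above can be matched with one regular expression. This is a sketch, not WWN's code: parse_item_number is a hypothetical name, and it assumes the marker string (e.g., "1.", "a)", "[1]") has already been separated from the rest of the paragraph.

```python
import re

# Optional leading "[", a number or letters (a, b, i, ii, ...), then one of: . ) ]
ITEM_NUMBER_RE = re.compile(r"^\[?([0-9]+|[A-Za-z]+)[.)\]]$")


def parse_item_number(marker):
    """Return the bare item number of a list marker, or None if it is not one."""
    match = ITEM_NUMBER_RE.match(marker.strip())
    return match.group(1) if match else None
```

The pattern is intentionally loose (it would accept "[1)"), since only the separators above have been observed so far.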

 

·         MS-tutorial--Deep Learning for Signal and Information Processing.docx

o      <p...>[item-number")"]<span

<p class=MsoBodyText style='margin-left:.5in;text-indent:-.25in'>1)<span

 

·         headings-and-bullets.docx

 

<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in">1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>A</p>

 

<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in">1.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>One</p>

 

<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in">[1]<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp; </span>First</p>

1.3.2  Survey

1.3.2.1  Ordered-lists: my Word app

1.3.2.1.1  Indentation problems

·         Summary

o      The 'text-indent:...' value is inconsistent and often incorrect

o      Using 'text-indent:-.25in' for all levels fixes the problem

 

·         From:  headings-and-bullets-01.htm

 

l-1: 'text-indent:-.25in'

l-2: 'margin-left:1.0in;text-indent:-.25in'

l-3: 'margin-left:1.5in;text-indent:-1.5in'

l-4: 'margin-left:2.0in;text-indent:-.25in'

l-5: 'margin-left:2.5in;text-indent:-.25in'

l-6: 'margin-left:3.0in;text-indent:-3.0in'

l-7: 'margin-left:3.5in;text-indent:-.25in'

l-8: 'margin-left:4.0in;text-indent:-.25in'

l-9: 'margin-left:4.5in;text-indent:-4.5in'
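The fix noted in the summary (use 'text-indent:-.25in' at every level) could be applied with a substitution like the one below. It is a sketch under that assumption: normalize_text_indent is a hypothetical name, and it only rewrites the style value. It does not remove the long run of &nbsp;'s that Word puts before the item number in the affected levels.

```python
import re

# Matches e.g. "text-indent:-1.5in", "text-indent:-122.2pt", "text-indent:0in".
TEXT_INDENT_RE = re.compile(r"text-indent:\s*-?[\d.]+(?:in|pt|cm)")


def normalize_text_indent(style, indent="-.25in"):
    """Return the style string with its text-indent value replaced."""
    return TEXT_INDENT_RE.sub("text-indent:" + indent, style)
```

Applied to the l-3 row above, 'margin-left:1.5in;text-indent:-1.5in' becomes 'margin-left:1.5in;text-indent:-.25in'; rows that already use -.25in are unchanged.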

1.3.2.2  headings-bullets-default-norma-dot.htm

·         Summary:  headings-bullets-default-norma-dot.htm

<p...>[item-number"."]<span style=>[&nbsp;'s]</span>[text]

<p...><span style=>[&nbsp;'s]</span>[item-number"."]<span style=>[&nbsp;'s]</span>[text]

 

<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in'>1.<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>L-1</p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.0in;text-indent:-.25in'>a.<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>L-2</p>

 

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.5in;text-indent:-1.5in'><span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span>i.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span>L-3</p>

 

<p class=MsoListParagraphCxSpLast style='margin-left:4.0in;text-indent:-.25in'>b.<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>L-8</p>

1.3.2.3  Ordered-lists:  downloaded .doc file

·         Summary:

<p...><span lang= >[item-number"."]<span style=>[&nbsp;'s]</span></span>

 

·         example-book-formatting.doc

o      Deeper levels are not possible (not supported by older Word?)

o      level-1:

 

<p class=MsoNormal style='margin-left:.5in;text-indent:-.25in'><span

lang=EN-GB>1.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span><span lang=EN-GB>Id facilis reformidans eum</span></p>

 

1.3.2.4  Word multi-level lists

1.3.2.4.1  downloaded .doc file

·         Summary

o      Formatting problems, worse in lower levels

o      Bullets' HTML start with:

<p class=MsoNormal style='margin-left:50.2pt;text-indent:-.25in'>

o      Experiments to fix it were not successful

§  These did not change the bullet's starting <p ...> element:

·         Saving as .docx

·         Copying original contents to a new .docx

 

·         example-book-formatting--multi-level-list-added.doc

o      I added multi-level list to:  example-book-formatting.doc

·         example-book-formatting--multi-level-list-added--v2.docx

o      I created this from the .doc version

 

·         Doc:

·         WHWN HTML

 

 

1.3.2.4.1.1  Generalized HTML

·         Summary

o      [item_number"."] and <span> are siblings

<p...><span lang= >[item_number"."]<span style=>["&nbsp;"s]</span></span>

<p...><span lang= ><span style=>[&nbsp;]</span>[item_number"."]<span style=>[&nbsp;]</span></span>

 

<p... ><span lang=EN-GB>1.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></span>

 

<p... ><span lang=EN-GB><span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span>i.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span>

1.3.2.4.1.2  Examples

·         Summary:  from downloaded .doc file

<p...><span lang= >[item_number"."]<span style=>[&nbsp;'s]</span></span>

<p...><span lang= ><span style=... >[&nbsp;'s]</span>[item_number"."]<span style=... >[&nbsp;'s]</span></span>

 

<p class=MsoNormal style='margin-left:50.2pt;text-indent:-.25in'><span

lang=EN-GB>1.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span><span lang=EN-GB>level 1</span></p>

 

<p class=MsoNormal style='margin-left:86.2pt;text-indent:-.25in'><span

lang=EN-GB>a.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span><span lang=EN-GB>level 2</span></p>

 

<p class=MsoNormal style='margin-left:122.2pt;text-indent:-122.2pt'><span

lang=EN-GB><span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span>i.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span></span><span lang=EN-GB>level 3</span></p>

 

1.3.2.4.2  Other files

·         MS-tutorial--Deep Learning for Signal and Information Processing.docx

 

<p class=MsoBodyText style='margin-left:.5in;text-indent:-.25in'>1)<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><b>Generative

deep architectures</b>, which are intended to capture high-order correlation of ... </p>

 

1.4  Invisible paragraph

1.4.1  Bug in Word HTML

·         Bug:  paragraph text can be formatted as 'color:white', but the background is also white, so the text is invisible

o      I've only seen this bug in a paragraph immediately after a table

 

·         From Python Mammoth test files

o      test-Word-files\python-mammoth-Word-files\tables.docx

<p class=MsoNormal><span lang=EN-GB style='color:white'>Below</span></p>

 

·         I reproduced the bug in a Word file

o      table--testing-for-omitted-paragraph-after-table--v1.docx

o      There are no other color specifications in the HTML

<p class=MsoNormal><span style='color:white'>paragraph 2</span></p>

 

·         Found in

o      pesticide-report.docx

§  http://sbdocs.psur.cornell.edu/download/attachments/4784135/OptionsDAndOUserGuide.docx?version=1&modificationDate=1449157135050&api=v2

o      Word:

 

o      Word HTML

 

 

·         https://hogwartslive.com/privacy.html

o      color:white is benign, since it's just for a space

o      after a table

</table>

</div>

<p class=MsoNormal style='text-align:justify;line-height:115%'><span

lang=en-US style='font-size:10.0pt;line-height:115%;font-family:"Arial",sans-serif;

color:white'>&nbsp;</span></p>

 

·         http://nigeria-law.org/

o      color:white is benign, since it's just for a space

o      after a table

</table>

<p align=center style='margin:0cm;text-align:center'><span style='color:white;

mso-color-alt:windowtext'>&nbsp;</span></p>

 

·         http://jurnal.org/articles/2011/mat7.html

o      Word HTML, filtered, but other divs were added, e.g., for ads

o      Minor problem, just for a "_"

<html>

<head>

<title>Методы конечных разностей и конечных элементов в задачах электромагнитной совместимости</title>

<noindex>

<meta http-equiv=Content-Type content="text/html; charset=windows-1251">

<meta name=Generator content="Microsoft Word 14 (filtered)">

<style>

 

<p class=MsoNormal style='margin-left:0cm;text-align:justify;text-indent:35.45pt'>10.<span

style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

</span>Berenger J.-P. A perfectly matched layer for the absorption of

electromagnetic waves // J.Comput.Phys. – 1994. – Vol. 114, № 2. – P.<span

style='color:white'>_</span>185-200</p>

 

·         http://webstratus.com/

o      Word HTML, but not filtered

o      Minor problem, only affects a "."

o      After an <img tag

<img

border=0 width=431 height=106 src="default_files/image001.png"

alt="Stratus-Header_Traffic" v:shapes="Picture_x0020_1"></span><![endif]></span></a></p>

 

<p class=MsoNormal align=center style='text-align:center'><span

style='color:black;mso-themecolor:text1'>For questions or issues with Stratus <b

style='mso-bidi-font-weight:normal'>Traffic <span class=GramE>And</span>

Billing</b>, please email: <a href="mailto:trafficarhelp@cumulus.com"><span

style='color:black;mso-themecolor:text1'>trafficarhelp@cumulus.com</span></a></span><span

style='color:white;mso-themecolor:background1'>.<o:p></o:p></span></p>

 

·         http://www.ellington.k12.mo.us/ALUMNI.HTM

o      Word HTML, but not filtered

o      In a table cell <td

 

<td><![endif]>

    <div v:shape="_x0000_s1027" style='padding:3.6pt 7.2pt 3.6pt 7.2pt'

    class=shape>

    <p class=MsoNormal><span style='color:white;mso-themecolor:background1'>$10.00

    each<o:p></o:p></span></p>

    </div>
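A first step toward a fix is simply locating the affected text. The following is a minimal Python sketch, using only the standard-library html.parser, that collects text whose nearest inline color declaration is white. The names WhiteTextFinder and find_white_text are hypothetical. It assumes the color is always set in an inline style attribute (as in every example above), and it treats whitespace-only runs such as a lone &nbsp; as benign.

```python
from html.parser import HTMLParser
import re

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input", "col", "area", "base", "wbr"}
# "color" must start the style or follow ";", so mso-color-alt and
# background-color are not mistaken for the text color.
COLOR_RE = re.compile(r"(?:^|;)\s*color\s*:\s*([^;]+)", re.IGNORECASE)
WHITE = {"white", "#fff", "#ffffff"}


class WhiteTextFinder(HTMLParser):
    """Collect text whose nearest inline 'color' declaration is white."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # (tag, effective text color) per open element
        self.white_runs = []  # raw text found inside white elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = COLOR_RE.search(style)
        inherited = self.stack[-1][1] if self.stack else None
        color = match.group(1).strip().lower() if match else inherited
        self.stack.append((tag, color))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1][1] in WHITE:
            self.white_runs.append(data)


def find_white_text(word_html):
    """Return the non-blank text runs that are styled color:white."""
    finder = WhiteTextFinder()
    finder.feed(word_html)
    finder.close()
    return [run.strip() for run in finder.white_runs if run.strip()]
```

On the reproduced bug this returns ["paragraph 2"], while the &nbsp;-only paragraphs after a table return an empty list. It does not look at the background color, so legitimate white-on-dark text is reported too.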

1.4.2  Investigation, for creating a fix

1.4.2.1  PublicWWW.com searches

·          "color:white" <table>

o      Bug is often with a paragraph just after a table

 

1.4.2.2  HTML for setting the background color

·         Legitimate uses of color:white

o      <p> uses color:white, but intended, as background color is not white

 

·         https://www.bible.ca/canon.htm  

o      Pure Word HTML (filtered), legit use of color:white

<meta name=Generator content="Microsoft Word 15 (filtered)">

o      Table cell:

  <td width=623 valign=top style='width:467.5pt;border:solid windowtext 1.0pt;

  background:black;padding:0in 5.4pt 0in 5.4pt'>

  <p class=MsoNoSpacing align=center style='text-align:center'><b><span

  style='font-size:18.0pt;color:white'>The Canon of the Bible</span> [html removed]</p>

  </td>

 

·         http://mrcophth.com/

o      Not pure Word HTML, apparently from MS FrontPage

o      color:white appears to be a bug

<body bgcolor=white lang=EN-US link="#3333FF" vlink="#3333FF" style='tab-interval:

.5in' alink="#CC33CC">

<div class=Section1>

<p class=MsoNormal><span class=GramE><span style='font-size:7.5pt;color:white'>z</span></span><span

style='font-size:7.5pt;color:white'> Singapore National Eye Centre (SNEC) <span

class=SpellE>Moorfields</span> Eye Hospital World Ophthalmology Congress</span>

<o:p></o:p></p>

 

·         http://wwlln.net/

o      Full Word HTML (not filtered), legit use of color:white

 

<body bgcolor=black lang=EN-US link="#CC33CC" vlink="#CC33CC" style='tab-interval:

.5in' alink="#ff0000">

<p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span

style='color:white'>&nbsp; <o:p></o:p></span></p>

<td width="22%" style='width:22.0%;background:black;padding:0in 0in 0in 0in'>

 

·         http://sbsc.com.vn/ContactUs.aspx

o      Not pure Word HTML

o      This HTML might not be generated by Word

o      color:white set in <a> tag

<td style='background-color: #2d7caf;color:#fff; padding: 5px 10px; border-top:solid 1px White;font-size:14px;line-height:18px;margin-bottom:14px;font-family:Arial,sans-serif'><a style='color:white;'
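The survey above suggests a rule for separating the bug from legitimate uses: white text is only a problem when no enclosing element sets a non-white background. A minimal Python sketch of that rule follows. The names WhiteOnWhiteFinder and invisible_text are hypothetical. It assumes backgrounds come only from an inline background / background-color style or a bgcolor attribute, takes only the first token of a background shorthand, and ignores stylesheet rules, so it is a heuristic rather than a real CSS cascade.

```python
from html.parser import HTMLParser
import re

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input", "col", "area", "base", "wbr"}
COLOR_RE = re.compile(r"(?:^|;)\s*color\s*:\s*([^;]+)", re.IGNORECASE)
BACKGROUND_RE = re.compile(
    r"(?:^|;)\s*background(?:-color)?\s*:\s*([^;\s]+)", re.IGNORECASE
)
WHITE = {"white", "#fff", "#ffffff"}


class WhiteOnWhiteFinder(HTMLParser):
    """Track inherited text color and background for each open element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []     # (tag, text color, background) per open element
        self.findings = []  # (text, background) for each white text run

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        # Start from the parent's values, then apply this element's overrides.
        _, color, background = self.stack[-1] if self.stack else (None, None, None)
        color_match = COLOR_RE.search(style)
        if color_match:
            color = color_match.group(1).strip().lower()
        background_match = BACKGROUND_RE.search(style)
        if background_match:
            background = background_match.group(1).lower()
        elif attrs.get("bgcolor"):
            background = attrs["bgcolor"].lower()
        self.stack.append((tag, color, background))

    def handle_endtag(self, tag):
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1][1] in WHITE:
            self.findings.append((text, self.stack[-1][2]))


def invisible_text(word_html):
    """Return white text runs whose nearest background is unset or white."""
    finder = WhiteOnWhiteFinder()
    finder.feed(word_html)
    finder.close()
    return [text for text, bg in finder.findings if bg is None or bg in WHITE]
```

With this rule the bible.ca table cell (background:black) and the wwlln.net page (bgcolor=black) are not flagged, while the mrcophth.com paragraph (bgcolor=white) is.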

 

1.4.3  Word color-related formatting

1.4.3.1  Info from MS's HTML specs

·         Word Borders and Shading

o      Microsoft Word allows the application of border and shading properties to text, paragraphs, sections, tables, and table cells. The following CSS style attributes correspond to Word formatting elements (listed as attribute : value in Word).

§  background-color : The shading fill behind the text or art.

§  background : The shading fill of the object.

o      To preserve these effects, Word uses the following HTML style attributes (listed as style : Word property). Note that each attribute has multiple values; for example, border-top uses a string to define the values (in order) of width, style, and color.

§  background : Fill color of the element.

o      color : CSS; category Text; see the CSS Level 2 Recommendation

o      mso-background : Office only; category Cell Formatting; values auto, <color>, windowtext

o      mso-background-source : Office only; category Cell Formatting; value auto

 

·         Microsoft Word, Microsoft Excel, and Microsoft PowerPoint allow saving the background color or image to HTML.

o      Word

§  If the background is a color, Word implements the bgcolor attribute of the Body element using standard HTML colors

 

1.5  Word HTML structure

1.5.1  Overall

·         The overall Word HTML structure is described in the Word HTML specs:

o      Office HTML and XML File Formats

§  When a Microsoft Office document is saved as a Web page, a main HTML file and a number of related files are created.

o      Page Layout and Section Breaks

§  Many important Microsoft Word page layout settings are stored on a section-by-section basis within a document.

 

·         There can be multiple div sections

o      From the Word HTML specs

<head><style> <!--

@page { document-level settings }

 

@page Section1 { first section settings }

div.Section1 { page: Section1; }

 

 

@page Section2 { second section settings }

div.Section2 { page: Section2; }

 

 

@page Section3 { third section settings }

div.Section3 { page: Section3; }

 

--></style></head>

<body>

 

<div class=Section1>first section data goes here</div>

 

<div class=Section2 >second section data goes here</div>

 

<div class=Section3>third section data goes here</div>

 

</body>
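To see how many sections a given Word HTML file actually contains, the section div's can be listed by class name. A minimal Python sketch follows. The names SECTION_DIV_RE and section_classes are hypothetical. It assumes the class is either SectionN (as in the spec excerpt above) or WordSectionN (as in the Word 15 files in these notes), and that class is written as an attribute of the <div> start tag.

```python
import re

# Accepts both class=Section1 and class=WordSection1, quoted or unquoted.
SECTION_DIV_RE = re.compile(
    r"<div\b[^>]*\bclass=[\"']?((?:Word)?Section\d+)\b", re.IGNORECASE
)


def section_classes(word_html):
    """Return the section class names of all section div's, in document order."""
    return SECTION_DIV_RE.findall(word_html)
```

A result with more than one entry indicates that the Word doc contains section breaks.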

 

 

1.5.2  Top-most HTML tags, including <meta> tags

·         table--testing-for-omitted-paragraph-after-table--v1.html

o      From my Word doc

<html>

 

<head>

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

<meta name=Generator content="Microsoft Word 15 (filtered)">

<style>

 

·         charset=

o      charset=utf-8

o      charset=windows-1252

 

·         test-Word-files\python-mammoth-Word-files--Word-HTML\comments.html

o      Word file has review comments

o      Word added a <script... element in <head>

<html>

 

<head>

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

<meta name=Generator content="Microsoft Word 15 (filtered)">

 

<style id="dynCom" type="text/css"><!-- --></style>

<script language="JavaScript"><!--

function msoCommentShow(anchor_id, com_id)

{

      if(msoBrowserCheck())

 

·         http://dandanplay.com/bsintro.htm

<html>

<head>

<meta http-equiv=Content-Type content="text/html; charset=utf-8">

<meta name=Generator content="Microsoft Word 15 (filtered)">

<style>

 

·         http://izapya.com/policy_en.html

<html>

 

<head>

<meta http-equiv=Content-Type content="text/html; charset=utf-8">

<meta name=Generator content="Microsoft Word 15 (filtered)">

 <title>Zapya Privacy Policy</title>

<style>

<!--

 /* Font Definitions */

[html removed]

-->

</style>

</head>

<body lang=ZH-CN link="#0563C1" vlink="#954F72" style="padding:1rem;">

<div class=WordSection1>

·         http://izapya.com/v3/about_us.html

o      viewport does not appear to be generated by MS Word

o      Its common use seems to be in mobile-device support (small screen size) and with HTML email

<html>

 

<head>

<meta http-equiv=Content-Type content="text/html;  charset=utf-8">

<meta name=Generator content="Microsoft Word 15 (filtered)">

<meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=1.0,user-scalable=0">

<title>About Us</title>

<style>

<!--

 /* Font Definitions */

 

·         http://nastooh.ir/docs/portfolio/newsagency-analytics-report.htm

o      Arabic characters in title

<html>

 

<head>

<meta http-equiv=Content-Type content="text/html; charset=UTF-8">

<meta name=Generator content="Microsoft Word 15 (filtered)">

<title>&#1576;&#1575;&#1586;&#1578;&#1575;&#1576;

&#1605;&#1575;&#1607;&#1740;&#1575;&#1606;&#1607;

&#1582;&#1576;&#1585;&#1711;&#1586;&#1575;&#1585;&#1740;&#8204;&#1607;&#1575;</title>

<style>

<!--

 /* Font Definitions */

 

·         test-Word-files\other-Word-files--Word-HTML\example-book-formatting-v2.html

<html>

 

<head>

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

<meta name=Generator content="Microsoft Word 15 (filtered)">

<title>Emma</title>

<style>
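The <meta> tags in the examples above carry the two facts that matter most when reading a Word HTML file: the charset and whether the file was saved as "filtered". A minimal Python sketch for extracting them follows. The names head_info, CHARSET_RE and GENERATOR_RE are hypothetical. It assumes the Generator content value is double-quoted, as in every example above, and it reports the first charset it finds.

```python
import re

META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)
GENERATOR_RE = re.compile(
    r"name=[\"']?Generator[\"']?\s+content=\"([^\"]*)\"", re.IGNORECASE
)


def head_info(word_html):
    """Return charset, Generator string, and whether the HTML is 'filtered'."""
    info = {"charset": None, "generator": None, "filtered": False}
    for meta in META_RE.findall(word_html):
        charset = CHARSET_RE.search(meta)
        if charset and info["charset"] is None:
            info["charset"] = charset.group(1).lower()
        generator = GENERATOR_RE.search(meta)
        if generator:
            info["generator"] = generator.group(1)
            info["filtered"] = "(filtered)" in generator.group(1).lower()
    return info
```

The charset value could then be passed to bytes.decode() when reading the file, since both windows-1252 and utf-8 occur in the files surveyed here.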

 

1.5.3  div's

·         Word features that create extra div's

o      footnotes, byte-order-mark, review comments

 

·         test-Word-files\python-mammoth-Word-files--Word-HTML\footnote-hyperlink.html

o      Footnotes result in two additional div's, one nested in the other

<div class=WordSection1>

 

<p class=MsoNormal><a href="#_ftn1" name="_ftnref1" title=""><span

class=MsoFootnoteReference><span lang=EN-GB><span class=MsoFootnoteReference><span

lang=EN-GB style='font-size:11.0pt;line-height:115%;font-family:"Calibri",sans-serif'>[1]</span></span></span></span></a></p>

 

</div>

 

<div><br clear=all>

 

<hr align=left size=1 width="33%">

 

<div id=ftn1>

 

<p class=MsoFootnoteText><a href="#_ftnref1" name="_ftn1" title=""><span

class=MsoFootnoteReference><span lang=EN-GB><span class=MsoFootnoteReference><span

lang=EN-GB style='font-size:10.0pt;line-height:115%;font-family:"Calibri",sans-serif'>[1]</span></span></span></span></a><span

lang=EN-GB> <a href="http://www.example.com">Example</a></span></p>

 

</div>

 

</div>

 

·         Byte order mark results in an additional div

o      test-Word-files\python-mammoth-Word-files--Word-HTML\utf8-bom.html

o      XML byte order mark - Google Search

o      Byte order mark - Wikipedia

§  https://en.wikipedia.org/wiki/Byte_order_mark

 

<div class=WordSection1>

 

<div style='border:none black 1.0pt;padding:0in 0in 0in 0in'>

 

<p class=MsoNormal>This XML has a byte order mark.</p>

 

</div>

 

</div>
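The extra div's described above are easiest to see as an outline of nesting depths. A minimal Python sketch using the standard-library html.parser follows. The names DivOutline and div_outline are hypothetical. It labels each div by its class, or by its id when there is no class, and assumes the div tags are properly balanced.

```python
from html.parser import HTMLParser


class DivOutline(HTMLParser):
    """Record the nesting depth and class/id of every <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # (depth, label) per <div>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attrs = dict(attrs)
            label = attrs.get("class") or attrs.get("id") or ""
            self.outline.append((self.depth, label))
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def div_outline(word_html):
    """Return [(depth, label), ...] for all div's in the document."""
    parser = DivOutline()
    parser.feed(word_html)
    parser.close()
    return parser.outline
```

For the footnote example this gives a top-level WordSection1, then an unlabeled top-level div with ftn1 nested inside it. For the byte-order-mark example it gives WordSection1 with one unlabeled div nested inside.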

1.6  Unconventional characters

·         How can I clean extra code out of Word HTML

o      https://www.moorecreative.com/Articles/Detail/tabid/522/ArticleId/18/How-can-i-clean-extra-code-out-of-Word-HTML.aspx

o      This app is one of our favorites because it also converts all non standard characters (like curly quotes, em and en dashes, Macintosh character issues, etc) into the proper ASCII.

 

·         Smart quotes

o      Accessibility at Penn State | Cautions on Converting Word to HTML

§  https://accessibility.psu.edu/microsoftoffice/microsoftword/wordhtml/

§  If Smart Quotes are turned on, then they will be converted to a Unicode numeric character or left intact. Older browsers and screen readers may not be able to decipher these curly symbols. This issue also affects apostrophes and lengthened hyphens.
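If plain ASCII punctuation is wanted, as the first article recommends, the substitution itself is small. A minimal Python sketch follows. The names SMART_TO_ASCII and to_ascii_punctuation are hypothetical, and the set of characters covered is my assumption of the common cases (curly quotes, en and em dashes, ellipsis, non-breaking space), not a complete list.

```python
# Map common "smart" punctuation to plain ASCII equivalents.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
})


def to_ascii_punctuation(text):
    """Replace curly quotes, long dashes, etc. with ASCII characters."""
    return text.translate(SMART_TO_ASCII)
```

This should be applied to decoded text, not to raw HTML source where the characters may still be numeric entities such as &#8217;. Replacing the non-breaking space would also change Word's &nbsp; list padding, so that entry may need to be dropped depending on the use.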

 

2  Shortcomings and bugs in Word's HTML

Some of these problems are described in the WWN User’s Guide

2.1  Word messages about HTML formatting problems

·         This GUI was displayed when saving .doc file as HTML

·         computer-concepts-instructors-manual.docx

o      http://virgil.azwestern.edu/~cvb/CIS120/Book%20Notes/Chapter.04.docx

 

 

2.2  Problems from spurious section-breaks in Word doc

·         computer-concepts-instructors-manual.docx

o      http://virgil.azwestern.edu/~cvb/CIS120/Book%20Notes/Chapter.04.docx

 

·         Spurious section breaks caused at least two problems in the Word HTML

o      Removing the section breaks fixed the problem in Word HTML

 

·         The section breaks generated this Word HTML:

<span style='font-size:11.0pt;font-family:"Sylfaen",serif'><br clear=all

style='page-break-before:auto'>

</span>

 

·         Lines added before and after list

o      Word

 

o      Word HTML

 

·         Also causes left margin to be wrong for whole HTML doc
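Since removing the section breaks in Word fixed both problems, it can help to check a Word HTML file for the markup shown above before debugging further. A minimal Python sketch follows. The names SECTION_BREAK_RE and count_section_breaks are hypothetical. It assumes the break always appears as a <br> with clear=all followed by page-break-before:auto, which is the only form I have seen, in this one file.

```python
import re

# <br clear=all style='page-break-before:auto'>, possibly split across lines.
SECTION_BREAK_RE = re.compile(
    r"<br\b[^>]*\bclear=[\"']?all\b[^>]*page-break-before\s*:\s*auto[^>]*>",
    re.IGNORECASE,
)


def count_section_breaks(word_html):
    """Count <br> tags that look like Word section-break markup."""
    return len(SECTION_BREAK_RE.findall(word_html))
```

A plain <br clear=all>, such as the one Word emits before the footnote separator, is not counted.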

 

 

2.3  Note:  Word HTML uses <ol> and <li>

2.4  Equation page-layout bug

·         Summary

o      When displayed in HTML, one equation is misplaced on the page

§  Word doc used old Word equation editor

§  The "(3)" for the equation was additional text

o      Fix

§  I used Word's feature to upgrade the equation for use with current Word equation editor

·         Right click on equation

§  I edited the equation to put the (3) in with the equation.

§  See:  physics-tutorial--fixed.docx

 

·         physics-tutorial.docx

o      https://www.niu.edu/brown/_pdf/physics374_spring2021/l4-1-21.docx

 

·         Word

 

·         Word HTML:

 

·         Fixed, Word HTML

 

2.5  Images can become blurred

·         Summary

o      Source image is actually a text box, with a picture and Figure caption as text

§  Problem:  text boxes are converted to graphics images which are blurred

§  Solution:

·         Should be able to use a table instead of text box, to fix blurred Figure caption?

·         Image can be moved outside of the text box

o      Image blurred when document converted to Word HTML

§  Problem: 

·         Embedded image in document is png

·         When doc is converted to HTML, image is converted to gif

·         Apparently gif conversion is blurry

§  Solution

·         Convert png to jpg, and embed the jpg image instead

·         When converting doc to HTML, Word will save the file as a jpg, and it displays legibly

o      Note:  file is ".doc"

 

·         Test file: 

o      MS-tech-report--Sequential File Programming Patterns.doc

o      MS-tech-report--Sequential File Programming Patterns.htm

 

·         In Word:

o      This is actually a text box with a png graphic image inside of it

 

·         In Word HTML, in browser

o      Picture is now a linked gif file

 

 

·         Fix attempt:  replace text box with just the image

o      Now, left-side fonts don't display well.

 

·         MS-tutorial--Deep Learning for Signal and Information Processing.docx

o      The web-page images are much clearer here

o      Image techniques used:

§  Use a link to get the image (e.g., gif) from an external file

§  Embedded image is jpg
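Because the blurring seems tied to images that Word saved as gif, a quick count of image types in the Word HTML shows which files are at risk. A minimal Python sketch follows. The names IMG_SRC_RE and image_types are hypothetical. It assumes the file type can be read from the extension in the src attribute.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Captures the src value of each <img>, quoted or unquoted.
IMG_SRC_RE = re.compile(r"<img\b[^>]*\bsrc=[\"']?([^\"'\s>]+)", re.IGNORECASE)


def image_types(word_html):
    """Count <img> tags by file extension, e.g. {'.gif': 2, '.jpg': 1}."""
    return Counter(
        PurePosixPath(src).suffix.lower() for src in IMG_SRC_RE.findall(word_html)
    )
```

A high count for ".gif" in a document whose embedded images were png would match the conversion problem described in the summary.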

2.6  Text boxes converted to illegible images, and layout problems

·         Summary

o      legibility

§  An easy fix:  use a table instead, with one cell

·         This was done later in the document, though with 2 cells, and it worked

§  Changing font-size in text box did not fix legibility

o      layout

§  Fixed by changing the text box layout specs

o      Note:  file is ".doc"

 

·         Test file: 

o      MS-tech-report--Sequential File Programming Patterns.doc

o      MS-tech-report--Sequential File Programming Patterns.htm

 

·         In Word

 

·         In Word HTML, in browser

 

·         Fix layout

o      Changed position relative to text.  It was incorrect in the Word doc, but the problem didn't show up there.

 

 

 

 

2.7  Text-box not included in Word HTML document

·         MS-tutorial--Deep Learning for Signal and Information Processing.docx

o      This figure is at the end of the document, and appears to be constructed of text-boxes within a text-box

o      It is not in the Word HTML

2.8  Excessive space between lines

·         MS-tutorial--Deep Learning for Signal and Information Processing.docx

 

·         Word doc

 

·         Word HTML

 

 

2.9  References to page-numbers

·         In the text, references to page-numbers are not meaningful in HTML

o      MS-tech-report--Sequential File--original.doc
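Page-number references can at least be located automatically so they can be reworded or turned into links. A minimal Python sketch follows. The names PAGE_REF_RE and find_page_references are hypothetical. It assumes references are written as "page N", "pages N", "p. N" or "pp. N", and it will miss other phrasings.

```python
import re

# "page 12", "pages 3", "p. 7", "pp. 40" (case-insensitive).
PAGE_REF_RE = re.compile(r"\b(?:pages?|pp?\.)\s*(\d+)", re.IGNORECASE)


def find_page_references(text):
    """Return the page numbers mentioned in the text, as strings."""
    return PAGE_REF_RE.findall(text)
```

This is meant to run on the visible text of the document, after tags have been stripped.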

 

2.10  Things HTML can't do, but Word can

·         HTML was originally not designed to be a layout preserving format. If you want to offer a newsletter which was created using Word on a web page, best option may be to create a PDF from that and allow your audience to download it.

o      https://stackoverflow.com/questions/8104230/converting-word-newsletter-to-html?rq=1

 

3  Word HTML:  not a bug

3.1  Effects of "justified" text

·         For text that is "justified", the page width can cause the beginning of a line to be indented an extra amount

o      For "Justified" text, for each line in a paragraph (except last), the last letter is on the right margin.

o      Changing the page width via the slider can change that indentation

 

 

·         MS-tutorial--Deep Learning for Signal and Information Processing.docx