4.1 From Internet search of Word files
4.2 From Internet search of Word-generated HTML files
4.2.1 Can be processed by BeautifulSoup
4.2.2 Cannot be processed by BeautifulSoup (decode errors)
5.1 Summary of features to test
5.2 Word files from the Internet
5.2.1 Example formatting for Kindle
5.2.2 Microsoft technical-report on programming
5.2.3 Microsoft paper on programming
5.2.4 University computer-science lab instructions
5.2.5 University Python computer-lab instructions
5.2.6 University physics-lab instructions
5.2.7 University Python tutorial
5.2.8 Agricultural technical-report
WWN Development Document
WWN Testing : tests performed, and Word docs used
Word’s Navigation pane shows the table-of-contents (View : Show : Navigation pane).
· Contents:
o Documentation of testing performed, includes
§ Regression tests that could be used for WWN enhancements
§ Browsers tested
This document was created by the WWN author for his own use
in developing WWN. It is included in the WWN repo, as other developers may
find it useful.
· See: D:\Documents\Professional-projects\Reference-info\Software-development\Python\python--language--notes.docx
· undefined key:value specified
· 6/26 code review and test of all schema definitions and code that processes the parameter-file, especially looking for needed dictionary-key checks before accessing a key, and processing of required and optional key/value pairs, e.g., omitted keys and/or values
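A minimal sketch of the kind of key checking described above, for use as a regression-test aid. The key names and defaults are hypothetical; they are not taken from the actual WWN parameter-file schema.

```python
# Hypothetical key names, for illustration only; the real WWN schema may differ.
REQUIRED_KEYS = {"input_html_path", "output_html_path"}
OPTIONAL_DEFAULTS = {"document_title": "", "toc_depth": 3}


def validate_parameters(params):
    """Check a parameter dictionary before any key is accessed.

    Returns (clean_params, errors). Errors cover missing required keys,
    required keys with an empty value, and keys the schema does not define.
    """
    errors = []
    for key in sorted(REQUIRED_KEYS):
        if key not in params:
            errors.append(f"missing required key: {key}")
        elif params[key] in (None, ""):
            errors.append(f"required key has no value: {key}")

    known = REQUIRED_KEYS | set(OPTIONAL_DEFAULTS)
    for key in sorted(params):
        if key not in known:
            errors.append(f"undefined key: {key}")

    # Start from the optional defaults, then overlay the supplied known values.
    clean = dict(OPTIONAL_DEFAULTS)
    clean.update(
        {k: v for k, v in params.items() if k in known and v not in (None, "")}
    )
    return clean, errors
```

Collecting all errors, rather than stopping at the first one, lets a single test run report every problem in a parameter-file.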
· TOC in different locations in the doc
o multiple TOCs
· Testing: linters for css and html
o https://nvuillam.github.io/mega-linter/descriptors/css/
o https://nvuillam.github.io/mega-linter/descriptors/html/
· GitHub - mwilliamson/python-mammoth: Convert Word documents (.docx files) to HTML
o https://github.com/mwilliamson/python-mammoth
o The following features are currently supported:
§ Headings.
§ Lists.
§ Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
§ Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
§ Footnotes and endnotes.
§ Images.
§ Bold, italics, underlines, strikethrough, superscript and subscript.
§ Links.
§ Line breaks.
§ Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
§ Comments.
· HTML validator
o https://rocketvalidator.com/html-validation
· Chrome web-page downloader:
o Tab Save - Chrome Web Store
§ https://chrome.google.com/webstore/detail/tab-save/lkngoeaeclaebmpkgapchgjdbaekacki
· Useful set of Word files used for testing Word -> HTML features
· Notes
o In testing the generated HTML, links that are relative to the web-server don't work, e.g., links to images
o Testing is simple: just compare source web-page with the generated web-page
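The broken relative links noted above could be worked around, for testing only, by rewriting them against the URL of the source web-page. The sketch below is a rough regex-based approach, not part of WWN; the function name and the image file-name in the usage example are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Matches src=... and href=... attributes, quoted or unquoted
# (Word's filtered HTML often omits the quotes).
_LINK_ATTR = re.compile(
    r'''(\b(?:src|href)\s*=\s*)(["']?)([^"'\s>]+)\2''', re.IGNORECASE
)


def absolutize_links(html, base_url):
    """Rewrite relative src/href values as absolute URLs based on base_url."""

    def fix(match):
        prefix, quote, target = match.groups()
        if target.startswith("#"):
            # In-page anchors (e.g., TOC links) must stay relative.
            return match.group(0)
        return f"{prefix}{quote}{urljoin(base_url, target)}{quote}"

    return _LINK_ATTR.sub(fix, html)
```

Usage, with a hypothetical image file-name:

```python
page = '<img src="canon_files/image001.jpg"><a href="#_Toc1">x</a>'
print(absolutize_links(page, "https://www.bible.ca/canon.htm"))
```

Quoted values that contain spaces are left unchanged by this sketch.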
· https://www.bible.ca/canon.htm
o <meta name=Generator content="Microsoft Word 15 (filtered)">
o Uses color:white, but properly
· https://static.sparemin.com/static/doc/201808291932/media-use-terms.html
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 15 (filtered)">
· http://dandanplay.com/bsintro.htm
o <meta name=Generator content="Microsoft Word 15 (filtered)">
o Asian text, bullets, TOC, tables, hyperlinks
o title:
<p class=MsoTitle style='border:none;padding:0cm'><span style='font-family:
宋体'>弹弹</span><span lang=EN-US>play </span><span style='font-family:宋体'>展会版使用文档</span></p>
· https://www.bible.in.ua/security/
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1251">
<meta name=Generator content="Microsoft Word 14 (filtered)">
<style>
<!--
· http://izapya.com/policy_en.html
o Could be a useful test.
o Two div sections, one nested
o Uses arrow symbols for list items
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=Generator content="Microsoft Word 15 (filtered)">
· http://nastooh.ir/docs/portfolio/newsagency-analytics-report.htm
o Could be a useful test.
o Multiple div sections, including: WordSection2, WordSection3
<meta name=Generator content="Microsoft Word 15 (filtered)">
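The pages listed above declare their generator and, in some cases, a non-UTF-8 charset (windows-1251, windows-1252) in meta tags. A possible way to check both before parsing, and to avoid decode errors, is to read them from the raw bytes first. This is only a sketch using the standard library; the function name is an assumption, and it is not how WWN itself reads the files.

```python
import re

# Both patterns run on bytes, because the charset is not yet known.
_META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
_META_GENERATOR = re.compile(
    rb'<meta\s+name\s*=\s*["\']?Generator["\']?\s+content\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def inspect_word_html(raw_bytes):
    """Return (generator, charset, text) for a downloaded HTML file.

    generator is None when no Generator meta tag is found.
    charset falls back to utf-8 when none is declared.
    """
    head = raw_bytes[:4096]  # the meta tags appear near the top of the file

    charset_match = _META_CHARSET.search(head)
    charset = charset_match.group(1).decode("ascii") if charset_match else "utf-8"

    generator_match = _META_GENERATOR.search(head)
    generator = (
        generator_match.group(1).decode("ascii", "replace")
        if generator_match
        else None
    )

    try:
        text = raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrongly declared charset: decode leniently instead of failing.
        text = raw_bytes.decode("utf-8", errors="replace")
    return generator, charset, text
```

The decoded text can then be passed to the HTML parser as a string, so the parser does not have to guess the encoding.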
· How do I find out what version of Word was used to create a Word document?
With the document open, enter the following in the VBA editor's Immediate window:
Print ActiveDocument.CompatibilityMode
A version number will appear on the next line, where
11 = Word 2003
12 = Word 2007
14 = Word 2010
15 = Word 2013
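For .docx files, a scripted alternative might be to read the AppVersion property that Word stores in docProps/app.xml inside the file's zip container. The sketch below assumes that the major number of AppVersion follows the same numbering as the list above, and that it reflects the application that last saved the file, which is not necessarily the one that created it. It does not apply to the older binary .doc format. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
_EP_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"

# Same numbering as the list above; other values are reported as unknown.
WORD_VERSIONS = {11: "Word 2003", 12: "Word 2007", 14: "Word 2010", 15: "Word 2013"}


def docx_app_version(docx_file):
    """Return (major_version, word_name) for a .docx path or file object.

    Returns (None, None) when the file has no AppVersion property.
    word_name is None for version numbers not in WORD_VERSIONS.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("docProps/app.xml"))
    version_text = root.findtext(f"{_EP_NS}AppVersion")
    if version_text is None:
        return None, None
    major = int(version_text.split(".")[0])
    return major, WORD_VERSIONS.get(major)
```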
· Google search
o filetype:docx "table of contents" site:www.microsoft.com/en-us/research/wp-content/uploads/
§ also: filetype:docx
· HTML modifications and fixes
o Unordered lists (multi-level)
§ 2 symbols replaced, to be viewable in Firefox
§ Indentation fixed
§ Spacing after bullet symbol fixed
o Ordered lists (multi-level)
§ Indentation fixed
§ Spacing after bullet symbol fixed
o White-colored text
§ Removed style "color:white"
§ Test files that have style "color:white" are noted as such.
· HTML modification to table of contents:
o Color does not change when clicked
o Not underlined
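As an illustration of the white-colored-text fix listed above, the following sketch removes "color:white" declarations from inline style attributes while leaving other declarations, such as background-color, in place. It is a regex-based approximation written for these notes, not the code used by create_web_page.py. For test files where the white text is intentional, this step would have to be skipped.

```python
import re

# Matches a quoted style attribute; the value may contain the other quote type,
# e.g., style='font-family:"Sylfaen",serif'.
_STYLE_ATTR = re.compile(
    r'''(style\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE | re.DOTALL
)


def remove_white_color(html):
    """Drop 'color:white' from every inline style attribute in html."""

    def fix(match):
        prefix, quote, css = match.groups()
        kept = [
            decl.strip()
            for decl in css.split(";")
            # Compare with whitespace removed, so 'color: white' also matches.
            if decl.strip() and re.sub(r"\s+", "", decl).lower() != "color:white"
        ]
        return f"{prefix}{quote}{';'.join(kept)}{quote}"

    return _STYLE_ATTR.sub(fix, html)
```

Note that the sketch also normalizes the remaining declarations (surrounding whitespace and any trailing semicolon are dropped).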
· Document description: Provides an example of Word formatting for a Kindle book. Created in the UK, using Word 2003.
· Source:
o https://ebookconversion.paulbrookes.net/wp-content/uploads/2011/09/basicstyles.doc
§ That Word document's link is on this web-page. The link text is: "basicstyles.doc".
· https://ebookconversion.paulbrookes.net/tag/formatting-2/
o Local file: example-book-formatting--original.doc
· Bibliographic info:
o Title: Sequential File Programming Patterns and Performance with .NET
o Date: 2004
o Pages: 15
· Testing-related features:
o Testing relevance: example of the typical layout for a book that will be converted to Kindle format; created with Word 2003
o Word-file features (.doc)
§ Created by Word 2003
§ TOC
· No existing TOC
· Current TOC-format setting: Word's default TOC styles
§ Other features: heading-levels 1 to 3, ordered and unordered lists
o Word HTML features (.html)
§ Some paragraphs have span-tags with attributes "lang=DE" and "lang=EN-GB".
· In fixing the Word HTML, the HTML parsing should be resilient enough to ignore these tags and make needed corrections.
<p class="MsoToc1"><span lang="DE"><span class="MsoHyperlink tocAnchor"><a class="tocAnchor" href="#_Toc75212936"><span lang="EN-GB">CHAPTER
I</span></a></span></span></p>
· Variations of the document that I created, for testing:
o Local file-name: example-book-formatting--added-features.doc
§ Added a multi-level ordered-list. This is for testing the code that fixes the Word HTML, for ordered-lists.
· Program tests, using file: example-book-formatting--added-features.doc
o Tests for the program: generate_word_html.docm
§ Verify: TOC created
o Tests for the program: create_web_page.py
§ Inspect created web-page for the following. All should be OK.
· TOC levels and formatting
· HTML fixes
· Text formatting vs Word Doc
· Description: Microsoft programming technical-report, using Word 2003
· Source:
o https://www.microsoft.com/en-us/research/wp-content/uploads/2004/12/tr-2004-136.doc
o Local file-name: MS-tech-report--Sequential File--original.doc
· Bibliographic info:
o Title: Sequential File Programming Patterns and Performance with .NET
o Date: 2004
o Pages: 15
· Testing-related features:
o Testing-relevance: Microsoft-published technical-report on programming; 14-pages; Word 2003
o Word-file features
§ Created by Word 2003
§ TOC
· Existing TOC after title
· Current TOC-format setting: Word's default TOC styles
§ Other features: tables, text-boxes, figures, source-code, API specs, only heading-level one is used
o Word HTML-file features
§ Document formatting problems due to problems with Word's HTML-generation
· Blurred images, and problems with page-layout of text-boxes
· These problems can be fixed by changing the Word document. This is described elsewhere in the system documentation.
· These problems can be ignored in the system function-test
· Variations of the document that I created, for testing:
o Local file-name: MS-tech-report--Sequential File--fixed.doc
§ Contains proof-of-concept fixes for the document-formatting problems due to problems with Word's HTML-generation
· Program tests, using file: MS-tech-report--Sequential File--original.doc
o generate_word_html.docm
§ Verify: TOC created
§ Note: the Word-file has Track Changes on, and the VBA code will turn it off
o create_web_page.py
§ Inspect created web-page for the following. All should be OK:
· TOC levels and formatting
· HTML fixes
· Text formatting vs Word Doc (except for problems due to Word's HTML-generation)
· Description: Microsoft publication-quality paper on programming, Word 2013
· Source:
o Local file-name: MS-tutorial--Deep Learning --original.docx
· Bibliographic info:
o Title: Deep Learning For Signal And Information Processing
o Date: 2013
o Pages: 97
· Testing-related features:
o Testing-relevance: long programming paper for publication, Word 2013
o Word-file features
§ Created by Word 2013
§ TOC
· Existing TOC after title
· Current TOC-format setting: Word's default TOC styles
§ Other features: heading levels 1-2, lots of equations
o Word HTML-file features
§ Document formatting problems due to problems with Word's HTML-generation
· Equations that are centered in the Word-file might not be centered in the HTML
· Excessive spacing between some lines (e.g., a paragraph after one of the equations)
· For some bullets, there is too much space between the bullet-symbol and the text. This appears to be caused by the text format-setting that forces the word at the end of the line to be at the right margin.
· Figure at the end of the document does not appear in the HTML.
§ It appears that these problems can be fixed by changing the Word document. This is described elsewhere in the system documentation.
§ These problems can be ignored in the system function-test
· Variations of the document that I created, for testing:
o Local file-name: MS-tutorial--Deep Learning--fixed.docx
§ Contains proof-of-concept fixes for the document-formatting problems due to problems with Word's HTML-generation
· Program tests, using file: MS-tutorial--Deep Learning --original.docx
o generate_word_html.docm
§ Verify: TOC created
o create_web_page.py
§ Inspect created web-page for the following. All should be OK:
· TOC levels and formatting
· HTML fixes
· Text formatting vs Word Doc (except for problems due to Word's HTML-generation)
· Description: university computer-lab instruction manual
· Source:
o http://virgil.azwestern.edu/~cvb/CIS120/Book%20Notes/Chapter.04.docx
o Local file-name: computer-concepts--original.html
· Bibliographic info:
o Title: Computer Concepts : Chapter Four: Operating Systems and File Management
o Date: 2011
o Pages: 20
· Testing-related features:
o Testing-relevance: university computer-lab instruction manual
o Word-file features
§ Created by Word 2003
§ TOC
· Document has a TOC, but it's not a Word-generated TOC. It was made by the author.
· Current TOC-format setting: Word's default TOC styles
§ Other features:
· Uses headings 1 and 8; lots of: tables, lists, highlighting;
· The unconventional formatting indicates the author had limited understanding of proper Word use
o Word HTML-file features
§ Body has 2 instances of style "color:white". They are intentional, and are in the top H1 headers, which have a gray background.
§ In the Word-file, much of the text starts left of the left-margin. In the generated HTML, text that is left of the left-margin is not visible. This formatting problem appears to be fixable, by changing the Word styles that specify the left-alignment incorrectly.
§ In the Word-file, there are two spurious section-breaks that cause formatting problems in the HTML. In particular, they created extra blank lines for some of the questions. These problems are fixable, by deleting the spurious section-breaks.
· Variations of the document that I created, for testing:
o Local file-name: computer-concepts--fixed.docx
§ Removed the 2 spurious section breaks that were causing formatting problems.
· Program tests, using file: computer-concepts--original.html
o generate_word_html.docm
§ Verify: TOC creation and formatting
§ The TOC formatting has problems, but they are due to the document's flawed formatting
· Heading levels used are 1 and 8
· Many of the heading instances have one-off formatting changes made to them
§ When TOCs are generated by Word, one-off heading changes are incorporated into the TOC
· Described here:
o http://wordfaqs.ssbarnhill.com/TOCTips.htm
o Section: "Effect of direct formatting"
· This appears to be why the TOC is bold
o create_web_page.py
§ Inspect created web-page for the following. All should be OK:
· TOC levels and formatting
o Each TOC entry has a bold tag (<b>). The document's headings have custom formatting, which Word incorporates in the TOC, i.e., bold
<p class=MsoToc1><span class=MsoHyperlink><a href="#_Toc74846524"><b><span
style='font-family:"Sylfaen",serif'>Computer Concepts</span></b></a></span></p>
· HTML fixes
· Text formatting vs Word Doc (except for the formatting problems described above)
· Description: University Python computer-lab instructions, with related formatting
· Source:
o Local file-name: python-lab-instructions--original.docx
· Bibliographic info:
o Title: Python Lab
o Date: 2020
o Pages: 2
· Testing-related features:
o Testing-relevance: standard formatting for a programming-related doc
o Word-file features
§ Created by Word 2010
§ TOC
· There are no headings, so no TOC is possible
§ Other features:
· Standard simple-features used, no lists
· Spurious underlines
o Before and after "Part 0: Overview" is a paragraph with one space, and underline formatting on.
o The underline does not show-up in Word, but it does in the HTML
o Word HTML-file features
§ Nothing noteworthy
· Program tests, using file: python-lab-instructions--original.docx
o generate_word_html.docm
§ Verify: No TOC created
o create_web_page.py
§ Inspect created web-page for the following. Should be OK.
· Text formatting vs Word Doc
o Note: the 2 spurious underlines are OK, in the HTML
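To confirm during testing that only the expected spurious underlines are present, the generated HTML could be scanned for underlined runs that contain nothing but spaces. The sketch below assumes that Word writes the underline as a u element, which should be verified against the actual HTML. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class EmptyUnderlineFinder(HTMLParser):
    """Counts <u> elements whose text is only whitespace or non-breaking spaces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as '\xa0'
        self.depth = 0          # nesting level inside <u> elements
        self.buffer = []        # text collected inside the outermost <u>
        self.empty_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "u":
            if self.depth == 0:
                self.buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "u" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0 and "".join(self.buffer).strip(" \t\r\n\xa0") == "":
                self.empty_count += 1

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def count_empty_underlines(html):
    """Return the number of underlined runs that contain no visible text."""
    finder = EmptyUnderlineFinder()
    finder.feed(html)
    finder.close()
    return finder.empty_count
```

For this test file, the count would be compared against the 2 spurious underlines noted above.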
· Description: University physics-lab instructions. Contains python code, equations, ordered-list.
· Source:
o https://www.niu.edu/brown/_pdf/physics374_spring2021/l4-1-21.docx
o Local file-name: physics-tutorial--original.docx
· Bibliographic info:
o Title: Physics 374 – Junior Physics Lab
o Date: Spring 2021
o Pages: 5
· Testing-related features:
o Testing-relevance:
o Word-file features
§ Created by Word 2013
§ TOC
· There are no headings, so no TOC is possible
§ Other features:
· Lots of equations; ordered-list and its list-symbols use letters
· Some of the blank lines have underline formatting turned-on. It's not visible in Word, but is visible in HTML
o Word HTML-file features
§ One misaligned equation.
· It uses the old Word equation editor. The "(3)" for the equation was additional text
· I fixed it by using Word's feature to upgrade the equation for use with the current Word equation-editor (right click on equation). Also, I edited the equation to put the (3) in with the equation.
§ For the ordered list, the text after the list-symbol is slightly misaligned. It is indented too much, by about 1 character-width. This is a tolerable error, for now.
· Variations of the document that I created, for testing:
o Local file-name: physics-tutorial--fixed.docx
§ Has the fix for the misaligned equation
· Program tests, using file: physics-tutorial--original.docx
o generate_word_html.docm
§ Verify: No TOC created
o create_web_page.py
§ Inspect created web-page for the following. All should be OK, with the exceptions described above.
· HTML fixes
· Text formatting vs Word Doc
· Description: University Python tutorial, with related formatting. 23 pages
· Source:
o http://brahe.canisius.edu/~meyer/PYTHONICTEMPLE/Very%20quick%20tutorial%20on%20Python.docx
o Local file-name: python-tutorial--original.docx
· Bibliographic info:
o Title: Very quick tutorial on Python
o Date: 2014
o Pages: 23
· Testing-related features:
o Testing-relevance: standard formatting for a programming-related doc
o Word-file features
§ Created by Word 2010
§ TOC
· There are no headings, so no TOC is possible
§ Other features:
· Spurious underlines
o There are some paragraphs with one space, and underline formatting on. The underline does not show-up in Word, but it does in the HTML
o Word HTML-file features
§ Minor formatting problems in the spacing between header and paragraph, but it appears they can be fixed in the Word file.
· Program tests, using file: python-tutorial--original.docx
o generate_word_html.docm
§ Verify: No TOC created
o create_web_page.py
§ Inspect created web-page for the following. All should be OK, with the exceptions described above.
· HTML fixes
· Text formatting vs Word Doc
· Description: agricultural technical-report from Cornell Univ.
· Source:
o Local file-name: pesticide-report--original.docx
· Bibliographic info:
o Title: Options D and O User Guide
o Date: 2015
o Pages: 12
· Testing-related features:
o Testing relevance: tech-report with creative and unconventional hacked formatting
o Word-file features
§ Created by Word 2010
§ TOC
· Current TOC-format setting: Word's default TOC styles
· There's an existing TOC. It does not include the spurious heading at the top of the file, hidden under the graphic.
§ Other features:
· Has table of contents, figures, unconventional hacked formatting
· When opening the Word file, the formatting for first page initially looks flawed, as it opens in Web Layout, but it looks OK when changed to Print Layout
o Word HTML-file features
§ The TOC gets its formatting from the author-modified headings, e.g., color
· This is normal for Word's HTML TOC-generation
§ The graphic's formatting becomes flawed, but this problem can be ignored. It should be possible to fix the formatting.
· Variations of the document that I created, for testing:
o Local file-name: pesticide-report--fixed.docx
§ Deleted spurious heading at the top of the file, under the graphic.
§ Updated existing table-of-contents
· Program tests, using file: pesticide-report--fixed.docx
o Tests for the program: generate_word_html.docm
§ Verify: TOC created OK
o Tests for the program: create_web_page.py
§ Inspect created web-page for the following. All should be OK, except for the problems described above.
· TOC levels and formatting
· HTML fixes
· Text formatting vs Word Doc
· Description: Agricultural software-installation guide from Cornell Univ.
· Source:
o Local file-name: software-install-guide--original.docx
· Bibliographic info:
o Title: PRL Software Installation Guide
o Date: 2015
o Pages: 5
· Description:
o Testing-relevance: software documentation from an organization whose mission is not software development
o Notes:
§ Apparently produced by the same organization that created the test-file "Agricultural technical-report from Cornell Univ."
§ Both documents have similar formatting, and formatting bugs, e.g., unused header under the graphic at the top of the document.
o Word-file features
§ Created by Word 2010
§ TOC
· Current TOC-format setting: Word's default TOC styles
· There's an existing TOC. It does not include the spurious heading at the top of the file, hidden under the graphic.
§ Other features:
· Has table of contents, figures, unconventional hacked formatting
· When opening the Word file, the formatting for first page initially looks flawed, as it opens in Web Layout, but it looks OK when changed to Print Layout
· Spurious bullets can be seen in the HTML view. They can be found in the Word file, via the Word Style Manager.
o Word HTML-file features
§ The TOC gets its formatting from the author-modified headings, e.g., color
· This is normal for Word's HTML TOC-generation
§ The graphic's formatting becomes flawed, but this problem can be ignored. It should be possible to fix the formatting.
· Variations of the document that I created, for testing:
o Local file-name: software-install-guide--fixed.docx
§ Deleted spurious heading at the top of the file, under the graphic.
§ Updated existing table-of-contents
§ Deleted spurious bullets
· Program tests, using file: software-install-guide--fixed.docx
o Tests for the program: generate_word_html.docm
§ Verify: TOC created OK
· The TOC gets formatting from modified headings, e.g., color
· The graphic's formatting becomes flawed, but this problem can be ignored. The graphics can be reformatted so that TOC can be added without problem.
o Tests for the program: create_web_page.py
§ Inspect created web-page for the following. All should be OK, except for the problems described above.
· TOC levels and formatting
· HTML fixes
· Text formatting vs Word Doc
Browser testing (on 3/2021):
* Windows 10 with:
* Firefox 87.0
* IE version 2004, build 19041
* Edge version 89.0.774.54
* Chrome version 89.0.4389.82
* Ubuntu
* Firefox 86.0.1
* Mac OS (the testing service just showed a screenshot)
* Firefox 82.0
* Chrome 87.0
* jQuery
https://jquery.com/browser-support/
Current Active Support
Desktop
Chrome: (Current - 1) and Current
Edge: (Current - 1) and Current
Firefox: (Current - 1) and Current, ESR
Internet Explorer: 9+
Safari: (Current - 1) and Current
Opera: Current
Mobile
Stock browser on Android 4.0+
Safari on iOS 7+
* jQuery UI
https://jqueryui.com/browser-support/#current-active-support
jQuery UI 1.12.x supports the following browsers:
Chrome: (Current - 1) or Current
Firefox: (Current - 1) or Current
Safari: (Current - 1) or Current
Opera: (Current - 1) or Current
IE: 11
Edge: (Current - 1) or Current