EFTidyCOM – HTML Code cleaner ATL Component by thatsalok

Introduction

Before going into detail, I want you to know what EfTidy actually is. EfTidy is a wrapper component for the Tidy library. If you don't know what Tidy is, here is a short description.

TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools.
–By Mr. Dave Raggett

So What Am I Doing with This Library

Recently one of my company's clients asked us to build an ATL wrapper class for the new Tidy library, because the last ATL component or ActiveX wrapper for Tidy was built in 2002. My company assigned me the task of creating the ATL component. After I completed it, my boss told me, "Alok, this is an open source component and other programmers deserve to use it." So here I am, presenting the component with supporting source code and a brief overview of each function.

Component Reference

EfTidy contains four interfaces:

  • IEfTidyAttr   (2 properties)
  • IEfTidyNode   (1 property and 4 methods)
  • ITidyOption   (66 properties)
  • ITidyCom      (5 methods and 4 properties)

EfTidy also contains five enumerations:

CharEncodingType

typedef [public] enum tagCharEncodingType
{
    ASCII, LATIN1, RAW, UTF8, ISO2022, MAC, WIN1252, UTF16LE,
    UTF16BE, UTF16, BIG5, SHIFTJIS
} CharEncodingType;

OutputType

typedef [public] enum tagOutputType
{
    XmlOut,   /**< Create output as XML */
    XhtmlOut, /**< Output extensible HTML */
    HtmlOut   /**< Output plain HTML, even for XHTML input. */
} OutputType;

IndentScheme

typedef [public] enum IndentScheme
{
    NOINDENT = 0,
    INDENTBLOCKS,
    AUTOINDENT
} IndentScheme;

DoctypeModes

typedef [public] enum {
    DoctypeOmit,   /**< Omit DOCTYPE altogether */
    DoctypeAuto,   /**< Keep DOCTYPE in input. Set version to content */
    DoctypeStrict, /**< Convert document to HTML 4 strict content model */
    DoctypeLoose,  /**< Convert document to HTML 4 transitional content model */
    DoctypeUser    /**< Set DOCTYPE FPI explicitly */
} DoctypeModes;

EfTidyMainNode

typedef [public] enum {
    TIDY_ROOT, // Return Tidy ROOT node
    TIDY_HTML, // Return Tidy HTML node
    TIDY_HEAD, // Return Tidy HEAD node
    TIDY_BODY  // Return Tidy BODY node
} EfTidyMainNode;

Now let's take each interface one by one:

1. ITidyCom

First, look at each method and property in this interface and the function it performs.

Property/Method Name | Parameters | Get/Put | Description
TidyFiletoMem (method) | [in] BSTR sourceFile, [out, retval] BSTR* result | n/a | Reads input from a file and writes output to memory
TidyFileToFile (method) | [in] BSTR sourceFile, [in] BSTR destFile | n/a | Reads input from a file and writes output to a file
TidyMemToMem (method) | [in] BSTR sourceStr, [out, retval] BSTR* result | n/a | Takes input as a buffer and writes output to memory
TidyMemtoFile (method) | [in] BSTR buffer, [in] BSTR destFile | n/a | Takes input as a buffer and writes output to a file
TotalWarnings (property) | [out, retval] long *pVal | Get | Returns the total number of warnings after any of the four operations above
TotalErrors (property) | [out, retval] long *pVal | Get | Returns the total number of errors after any of the four operations above
ErrorWarning (property) | [out, retval] BSTR *pVal | Get | Returns a buffer containing human-readable errors and warnings
Option (property) | [out, retval] ITidyOption **pVal | Get | Returns the ITidyOption object used to set options for the Tidy library
EfTidyNode (method) | [in] EfTidyMainNode Type, [out, retval] IEfTidyNode **ppNewEfTidyNode | n/a | An HTML page has a tree structure. This method returns a Tidy node that lets you read each tag and its attributes. This is the latest addition to the Tidy library.

2. ITidyOption

Here is the list of properties and methods for the ITidyOption interface.

Property/Method Name | Parameter | Get/Put | Description
LoadConfigFile (method) | BSTR | n/a | Load option settings from a configuration file
ResetToDefaultValue (method) | void | n/a | Reset options to default settings
Doctype | BSTR | BOTH | Doctype declaration generated by Tidy
TidyMark | VARIANT_BOOL | BOTH | Add meta element indicating tidied doc
HideEndTag | VARIANT_BOOL | BOTH | Suppress optional end tags
EncloseText | VARIANT_BOOL | BOTH | If yes, text at body level is wrapped in <p>
EncloseBlockText | VARIANT_BOOL | BOTH | If yes, text in blocks is wrapped in <p>
LogicalEmphasis | VARIANT_BOOL | BOTH | Replace i by em and b by strong
DefaultAltText | BSTR | BOTH | Default text for alt attribute
Clean | VARIANT_BOOL | BOTH | Replace presentational clutter by style rules
DropFontTags | VARIANT_BOOL | BOTH | Discard presentation tags
DropEmptyParas | VARIANT_BOOL | BOTH | Discard empty p elements
Word2000 | VARIANT_BOOL | BOTH | Draconian cleaning for Word2000
FixBadComment | VARIANT_BOOL | BOTH | Fix comments with adjacent hyphens
FixBackslash | VARIANT_BOOL | BOTH | Fix URLs by replacing \ with /
NewEmptyTags | BSTR | BOTH | Declared empty tags
NewInlineTags | BSTR | BOTH | Declared inline tags
NewBlockLevelTags | BSTR | BOTH | Declared block tags
NewPreTags | BSTR | BOTH | Declared pre tags
OutputType | OutputType *pVal | BOTH | Sets the output type: XML, XHTML or plain HTML
InputAsXML | VARIANT_BOOL | BOTH | Treat input as XML
ADDXmlDecl | VARIANT_BOOL | BOTH | Add <?xml ?> for XML docs
AddXmlSpace | VARIANT_BOOL | BOTH | If set to yes, adds xml:space attribute as needed
Bare | VARIANT_BOOL | BOTH | Make bare HTML
AssumeXmlProcins | VARIANT_BOOL | BOTH | If set to yes, PIs must end with ?>
CharEncoding | CharEncodingType | BOTH | Input and output character encoding
InCharEncoding | CharEncodingType | BOTH | Input character encoding (if different)
OutCharEncoding | CharEncodingType | BOTH | Output character encoding (if different)
NumericsEntities | VARIANT_BOOL | BOTH | Use numeric entities for symbols
QuoteMarks | VARIANT_BOOL | BOTH | Output " marks as &quot;
QuoteNBSP | VARIANT_BOOL | BOTH | Output non-breaking space as entity
QuoteAmpersand | VARIANT_BOOL | BOTH | Output naked ampersand as &amp;
OutputTagInUpperCase | VARIANT_BOOL | BOTH | Output tags in upper case, not lower case
OutputAttrInUpperCase | VARIANT_BOOL | BOTH | Output attributes in upper case, not lower case
WrapScriptlets | VARIANT_BOOL | BOTH | Wrap within JavaScript string literals
WrapAttVals | VARIANT_BOOL | BOTH | Wrap within attribute values
WrapSection | VARIANT_BOOL | BOTH | Wrap within section tags
WrapAsp | VARIANT_BOOL | BOTH | Wrap within ASP pseudo elements
WrapJste | VARIANT_BOOL | BOTH | Wrap within JSTE pseudo elements
WrapPhp | VARIANT_BOOL | BOTH | Wrap within PHP pseudo elements
Indent | IndentScheme | BOTH | Indent content of appropriate tags
IndentSpace | long | BOTH | Indentation of n spaces
WrapLen | long | BOTH | Set wrap margin for output
TabSize | long | BOTH | Expand tabs to n spaces
IndentAttributes | long | BOTH | Newline plus indent before each attribute
BreakBeforeBR | VARIANT_BOOL | BOTH | Output newline before <br> or not
LiteralAttribs | VARIANT_BOOL | BOTH | If true, attributes may use newlines
MarkUp | VARIANT_BOOL | BOTH |
ShowWarnings | VARIANT_BOOL | BOTH | On/Off
Quiet | VARIANT_BOOL | BOTH | No 'Parsing X', guessed DTD or summary
KeepTime | VARIANT_BOOL | BOTH | If yes, last modified time is preserved
ErrorFile | BSTR | BOTH | File name to write errors to
GnuEmacs | VARIANT_BOOL | BOTH | If true, format error output for GNU Emacs
FixUrl | VARIANT_BOOL | BOTH | Applies URI encoding if necessary
BodyOnly | VARIANT_BOOL | BOTH | Output BODY content only
HideComments | VARIANT_BOOL | BOTH | Hides all (real) comments in output
DoctypeMode | DoctypeModes | BOTH | Set the doctype mode for output

3. IEfTidyNode

Here is the list of properties and methods for the IEfTidyNode interface.

Property/Method Name | Parameter | Get/Put | Description
Name | BSTR *pVal | Get | Returns the name of the current tag
GetFirstChildNode | IEfTidyNode | n/a | Returns the first child node
GetNextChildNode | IEfTidyNode | n/a | Use this to enumerate the remaining child tags
GetFirstAttribute | IEfTidyAttr | n/a | Returns the first attribute of the current tag
GetNextAttribute | IEfTidyAttr | n/a | Returns the remaining attributes one by one

4. IEfTidyAttr

Here is the list of properties for the IEfTidyAttr interface.

Property/Method Name | Parameter | Get/Put | Description
Name | BSTR *pVal | Get | Name of attribute
Value | BSTR *pVal | Get | Value of attribute
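
The first/next methods on IEfTidyNode suggest a cursor-style traversal: ask the node for its first child or attribute, then keep asking for the next one until nothing comes back. The Python sketch below only illustrates that pattern. MockNode and MockAttr are hypothetical in-memory stand-ins, not the COM objects, and the sketch assumes that GetNextChildNode and GetNextAttribute are called on the parent node and return nothing once the list is exhausted.

```python
class MockAttr:
    """Hypothetical stand-in for IEfTidyAttr: exposes Name and Value."""

    def __init__(self, name, value):
        self.Name = name
        self.Value = value


class MockNode:
    """Hypothetical stand-in for IEfTidyNode with first/next cursor methods."""

    def __init__(self, name, attrs=(), children=()):
        self.Name = name
        self._attrs = [MockAttr(n, v) for n, v in attrs]
        self._children = list(children)
        self._attr_pos = 0
        self._child_pos = 0

    def GetFirstChildNode(self):
        # Rewind the child cursor, then hand out the first child (or None).
        self._child_pos = 0
        return self.GetNextChildNode()

    def GetNextChildNode(self):
        if self._child_pos >= len(self._children):
            return None
        node = self._children[self._child_pos]
        self._child_pos += 1
        return node

    def GetFirstAttribute(self):
        # Rewind the attribute cursor, then hand out the first attribute.
        self._attr_pos = 0
        return self.GetNextAttribute()

    def GetNextAttribute(self):
        if self._attr_pos >= len(self._attrs):
            return None
        attr = self._attrs[self._attr_pos]
        self._attr_pos += 1
        return attr


def walk(node, depth=0, out=None):
    """Return one indented line per node, listing its attributes."""
    if out is None:
        out = []
    parts = [node.Name]
    attr = node.GetFirstAttribute()
    while attr is not None:
        parts.append(f'{attr.Name}="{attr.Value}"')
        attr = node.GetNextAttribute()
    out.append("  " * depth + " ".join(parts))
    # Each node keeps its own cursor, so recursing into a child
    # does not disturb the position in the parent's child list.
    child = node.GetFirstChildNode()
    while child is not None:
        walk(child, depth + 1, out)
        child = node.GetNextChildNode()
    return out


head = MockNode("head", children=[
    MockNode("meta", attrs=[("name", "generator")]),
    MockNode("title"),
])
print("\n".join(walk(head)))
```

The same loop shape (first, then next until nothing is returned) is what the Visual Basic test project uses later with Do Until ... Is Nothing.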

Using the code

Like most COM components, EfTidy was developed for use with Visual Basic and other COM-friendly languages, so all the code shown here is in Visual Basic. I am going to use some test cases to explain how the component works.

I used Test.htm (included with the project) to test EfTidy's responses.

Here is what Test.htm contains:

<html> 
     <head>
		<title>tidy Library</title> 
	</head>

	 <body> 
	      <blockquote> 
	      <p> </p> --(1)

	  <p><font size="5" color=   

   "#FF00FF">Tidy Library</font></p></blockquote><P><p><font size="5" color="#FF00FF"></font></p>

   <table border="1" cellpadding="0" cellspacing="0" 
		style="border-collapse: collapse" bordercolor="#111111" width="100%" 
		id="AutoNumber1">

   <tr> 
	<td width="50%" style="border-left-style: 
		solid; border-left-width: 1; border-right-style: none; border-right-width: 
		medium; border-top-style: solid; border-top-width: 1; border-bottom-style: 
		none; border-bottom-width: medium"> --(2)
	</td>

	<td width="50%" style="border-left-style: none; border-left-width: medium;
		 border-right-style:solid; border-right-width: 1; border-top-style: solid;
		 border-top-width: 1;border-bottom-style: none; border-bottom-width: medium">
	</td> 

		</tr>
	 </table> 
 <b>Tidy  --- (3)		
 </h1> <tidy> ---(4)  
 </body> 
</html>

In Test.htm I have added the following mistakes:

  • a dummy <tidy> tag at (4)
  • a stray </h1> with no opening <h1> tag at (4)
  • an empty paragraph <p> tag at (1)
  • an unclosed <b> tag at (3)
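
To make these mistakes concrete, here is a minimal Python sketch that finds stray and unclosed tags with a simple stack of open tags. It is not how Tidy works internally (Tidy's parser is far more thorough and knows the HTML content model). TagChecker, find_tag_problems and VOID_TAGS are names made up for this illustration, and the sample string is a shortened version of the end of Test.htm.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (partial list, enough for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatches."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.open_tags:
            # End tag with no matching start tag, like the stray </h1>.
            self.problems.append(f"unexpected </{tag}>")
            return
        # Pop back to the matching start tag; anything opened after it
        # was never closed.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.problems.append(f"missing </{top}> before </{tag}>")


def find_tag_problems(html):
    """Return a list of human-readable tag problems found in html."""
    checker = TagChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


sample = "<body><b>Tidy </h1> <tidy> </body>"
for problem in find_tag_problems(sample):
    print(problem)
```

For the sample fragment the sketch reports the stray </h1> and the unclosed <tidy> and <b> tags. It will give false positives on valid HTML that omits optional end tags such as </p>, which is one reason a real cleaner like Tidy is needed.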

Test Case # 1 using ITidyCom

First, create an instance of the component. Here is how to do that:

    ' Declared at module level so that the button handlers below can use it
    Dim TidyCOMObj As EFTIDYLib.tidyCom

    Private Sub Form_Load()
        Set TidyCOMObj = New EFTIDYLib.tidyCom
    End Sub

Now clean the Test.htm file using this object. The code for that is:

    Private Sub cmdMemtoMem_Click()
        ' Clean test.htm and write the result to test1.htm
        TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"

        ' Check the number of errors in the HTML
        txtError = TidyCOMObj.TotalErrors
        ' Check the number of warnings in the HTML
        txtWarning = TidyCOMObj.TotalWarnings
    End Sub

Here is the result produced by Tidy. The listing shows what test1.htm (created by EfTidyCom) contains:

<html> 
<head> 
 <meta name="generator" 
 content= "HTML Tidy for Windows (vers 1st September 2004), see www.w3.org"> 

	<title>tidy Library</title> 
 </head>
<body> 
	<blockquote> 
		<p> </p> 
		<p><font size="5" color="#FF00FF">Tidy Library</font>
		</p> 
	</blockquote> 

	<p><font size="5" color= "#FF00FF">	</font></p> 

	<table border="1" cellpadding="0" cellspacing="0" style= "border-collapse: 
				collapse" bordercolor="#111111" width="100%" id= "AutoNumber1">
	<tr> 
		<td width="50%" style= "border-left-style: solid; border-left-width: 1; 
			   border-right-style: none; border-right-width: medium; 
			   border-top-style: solid; border-top-width: 1; border-bottom-style: none;
			    border-bottom-width: medium">
	 	</td> 

	<td width="50%" style= "border-left-style: none;border-left-width: medium;
	    border-right-style: solid; border-right-width: 1;border-top-style: solid; 
	    border-top-width: 1; border-bottom-style: none;border-bottom-width: medium"> 

		</td> 
	</tr> 
 </table> 
<b>Tidy</b> --(1) 
</body> 
</html>

In the cleaned HTML page above, the dummy <tidy> tag and the stray </h1> have been removed near (1), and </b> has been added after "Tidy" at (1). Here is the summary of errors and warnings produced by EfTidyCom, showing the detail of each action it performed:

		line 1 column 1 - Warning: missing <!DOCTYPE> declaration
		line 22 column 10 - Warning: discarding unexpected </h1>
		line 23 column 1 - Error: <tidy> is not recognized!
		line 23 column 1 - Warning: discarding unexpected <tidy>
		line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
		line 15 column 1 - Warning: <table> lacks "summary" attribute
		Info: Document content looks like HTML Proprietary
			5 warnings, 1 error were found!
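
Each entry in the report above follows the pattern "line N column M - Severity: message", so the text is easy to post-process. As a rough sketch (in Python, working on the report as a plain string; parse_report and REPORT_LINE are names invented here and are not part of EfTidy), the entries can be split into fields and counted by severity:

```python
import re
from collections import Counter

# Matches entries such as: line 22 column 10 - Warning: discarding unexpected </h1>
REPORT_LINE = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+) - "
    r"(?P<severity>Warning|Error): (?P<message>.*)$"
)


def parse_report(text):
    """Return one dict per 'line N column M - Severity: message' entry."""
    entries = []
    for raw in text.splitlines():
        match = REPORT_LINE.match(raw.strip())
        if match:
            entry = match.groupdict()
            entry["line"] = int(entry["line"])
            entry["column"] = int(entry["column"])
            entries.append(entry)
    return entries


report = """\
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Error: <tidy> is not recognized!
line 23 column 1 - Warning: discarding unexpected <tidy>
line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
5 warnings, 1 error were found!
"""

entries = parse_report(report)
counts = Counter(entry["severity"] for entry in entries)
print(counts["Warning"], "warnings,", counts["Error"], "errors")
```

The counts obtained this way agree with the summary line of the report (5 warnings and 1 error), which is also what the TotalWarnings and TotalErrors properties are described as returning.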

Test Case # 2 using ITidyCom

Now apply some options to Test.htm to get custom output. I am using these options:

  • Clean = TRUE (move inline styles into separate style classes)
  • DoctypeMode = DoctypeUser (to enable the custom doctype string)
  • Doctype = "Ef Tidy library" (the custom doctype string)
  • OutputType = XhtmlOut (output type)
  • NewInlineTags = "tidy" (make our dummy <tidy> tag legal)

Here is the code to achieve the above:

Private Sub cmdMemtoMem_Click()
    TidyCOMObj.Option.Clean = True
    TidyCOMObj.Option.NewInlineTags = "tidy"
    TidyCOMObj.Option.OutputType = XhtmlOut

    ' Our string is shown in the cleaned HTML
    ' only if the doctype mode is DoctypeUser
    TidyCOMObj.Option.DoctypeMode = DoctypeUser
    TidyCOMObj.Option.Doctype = "Ef Tidy library"

    ' Clean test.htm with the options above and write test1.htm
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    txtError = TidyCOMObj.TotalErrors
    txtWarning = TidyCOMObj.TotalWarnings
End Sub

Here is the result produced by Tidy. The listing shows what test1.htm (created by EfTidyCom) contains after applying our options:

<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)  
<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
	<meta name="generator" 
	content="HTML Tidy for Windows (vers 1st September 2004), see www.w3.org" />

<title>tidy Library</title>

<style type="text/css">  --(2)
/*<![CDATA[*/
 table.c4 {border-collapse: collapse}
 td.c3 {border-left-style: none; border-left-width: medium; border-right-style: solid; 
        border-right-width: 1; border-top-style: solid; border-top-width: 1; 
        border-bottom-style: none; border-bottom-width: medium}
 td.c2 {border-left-style: solid; border-left-width: 1; border-right-style: none; 
        border-right-width: medium; border-top-style: solid; border-top-width: 1;
        border-bottom-style: none; border-bottom-width: medium}
 h2.c1 {color: #FF00FF}
/*]]>*/
</style>
</head>
<body>
	<blockquote>
	<p> </p>
		<h2 class="c1">Tidy Library</h2>
	</blockquote>

	<h2 class="c1">

	</h2>
	<table border="1" cellpadding="0" cellspacing="0" class="c4"
				bordercolor="#111111" width="100%" id="AutoNumber1">
		<tr>
			<td width="50%" class="c2"> </td> ----(3)
			<td width="50%" class="c3"> </td>
		</tr>
	</table>

	<b>Tidy <tidy></tidy></b> ----(4)


</body>
</html>

Now let's see what Tidy cleaned for us:

  • In (1) our custom string "Ef Tidy library" is visible
  • In (2) and (3) the inline styles are cleaned and classes are created for them
  • In (4) our <tidy> tag becomes legal, though it does nothing in the actual HTML page

Here is the summary of all the errors and warnings:

		line 1 column 1 - Warning: missing <!DOCTYPE> declaration
		line 22 column 10 - Warning: discarding unexpected </h1>
		line 23 column 1 - Warning: <tidy> is not approved by W3C
		line 23 column 1 - Warning: missing </tidy> before </body>
		line 22 column 2 - Warning: missing </b> before </body>
		line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
		line 15 column 1 - Warning: <table> lacks "summary" attribute
		Info: Document content looks like HTML Proprietary

		7 warnings, 0 errors were found!

Test Case # 3 using IEfTidyNode and IEfTidyAttr

These two interfaces help you gather node-by-node and attribute-by-attribute information from the tree structure of the HTML cleaned by the Tidy library. Here is the code for finding the <head> tag and enumerating all of its attributes.

  Note: always use these two interfaces on HTML cleaned by Tidy.

Private Sub cmdGetNode_Click()
    ' Assumes TidyDoc (an ITidyCom object) already holds HTML cleaned
    ' by one of the four ITidyCom methods, and that tidyNode and atr
    ' are declared at module level (tidyNode is reused by the next handler).

    ' Get the <head> node
    Set tidyNode = TidyDoc.EfTidyNode(TIDY_HEAD)

    If Not tidyNode Is Nothing Then
        ' Display the node name
        txtNodeName = tidyNode.Name

        ' Enumerate all attributes of <head>, if any
        Set atr = tidyNode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & "   " & atr.Value
            Set atr = tidyNode.GetNextAttribute
        Loop
    End If
End Sub

Next, to enumerate the children of the head node and get the attributes of each one, I find the first child here. The code for that is:

Private Sub cmdGetFirstChildNode_Click()
    Dim localnode As EfTidyNode

    ' tidyNode is the <head> node obtained in cmdGetNode_Click
    Set localnode = tidyNode.GetFirstChildNode

    If Not localnode Is Nothing Then
        ' Read the name only after checking that a child exists
        txtNodeName = localnode.Name

        Set atr = localnode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & "   " & atr.Value
            Set atr = localnode.GetNextAttribute
        Loop
    End If
End Sub

Wait a minute: I took a nice snapshot after clicking the button for the code above.

Here I have given only a small overview of the Tidy library and EfTidyCom. For more information about the Tidy library, visit the Tidy home page: http://tidy.sourceforge.com

Author Comment

I know there is much scope for improvement in this component, especially in the IEfTidyNode and IEfTidyAttr interfaces. I promise these improvements will be there in the next version/update of the library.

Files Listing With Project

Source file contains:

  • TidyLib (original Tidy Library) source code

Project file contains:

  • Release version of the EfTidy component
  • Visual Basic test project for ITidyCom and ITidyOption (with source code)
  • Visual Basic test project for IEfTidyNode and IEfTidyAttr (with source code)
  • Test.htm

Update History

  • 28 November 2004 : EfTidy version 1.0 Introduced.

Special Thanks

  • My boss, Mr. Saurabh Gupta, Director, Efextra eSolutions Pvt Ltd
  • Paul E. Bible, for his CCOMString class
  • The Tidy SourceForge group, for this nice library, i.e. the Tidy library