Parse-O-Matic


Note: Because this script parses HTML, and thus contains HTML, it had to be altered in order to
include it in this HTML file. Thus, if you want to make a copy of the script, you should highlight
everything below and do a copy-and-paste. If you try to copy it from "View Source", it will not
be a valid script since (for example) the "less-than" characters had to be replaced with &lt; symbols.
;===============================================================================
;  
; Script to filter an HTML file and extract raw text.
;
; The HTML file should be properly formatted, with a <head> or at least a
; <body> tag. The script can usually cope with files that lack these tags,
; but it is best if they are included.
;
; The output is not word-wrapped -- this script makes no attempt to render
; the appearance of the original web page in plain text. It simply extracts
; the text and tries to get rid of any extraneous material, such as scripts.
; The idea here is to eliminate just about all formatting. If you wanted to
; retain the format, you could load the HTML page into a word processing
; program such as Word.
;
; Most special HTML symbols, such as &nbsp; (no-break space) are translated,
; though in that particular case it is simply translated to an ordinary
; space.
;
; You can use wildcards in the Input File box to process an entire folder.
; Only files with the .htm or .html extensions are processed.
;
; This script is designed for use with the Parse-O-Matic Power Tool,
; which is available from www.parse-o-matic.com.
;
;===============================================================================
; Configuration
;===============================================================================
Config
;-----------------------------------------------------------------------------
; Interface
;-----------------------------------------------------------------------------
$CfgEnableOptionX = 'N'
$CfgEnableOptionY = 'N'
$CfgEnableOptionZ = 'N'
;-----------------------------------------------------------------------------
; Files
;-----------------------------------------------------------------------------
$CfgDefaultIFN = 'Index.html'
$CfgDefaultOFN = 'Output.txt'
;-----------------------------------------------------------------------------
; Documentation
;-----------------------------------------------------------------------------
$CfgCopyright = 'Copyright © 2005-2008 by Pyroto, Inc.'
$CfgVersion = '1.00.00'
$CfgProgrammer = 'Timothy Campbell'
AtSym = $40 ; Anti-spam gimmick
$CfgEmail = 'info' AtSym 'parse-o-matic.com'
$CfgLicense = 'This script may be used by anyone who has a valid ' >>
'Advanced Scripting License from Pyroto, Inc. ' >>
', or is evaluating one of our ' >>
'Parse-O-Matic products (for up to 30 days).'
End
;===============================================================================
; TaskInit
;===============================================================================
TaskInit
BlankLineTags = >>
' <BR><BR>' >> ; Two breaks = a blank line
' <DIV' >>
' <H1 <H2 <H3 <H4' >>
' </H1 </H2 </H3 </H4' >>
' <HR' >>
' <OL' >>
' <P' >>
' <TABLE' >>
' <TR' >>
' <UL'
CRLF = $0D$0A ; Carriage Return ($0D) and Line Feed ($0A)
CRLF2 = CRLF CRLF ; Two CRLF's
NumInpFiles = 0
OnlyHTML = 'Only files with the .htm or .html extension are processed'
SepLine = Padded '' 80 'Left' '='
ValidDataStart = ' <BODY <FORM <HEAD <HTML'
End
;===============================================================================
; TaskDone
;===============================================================================
TaskDone
ShowNote ''
End
;===============================================================================
; FileInit
;===============================================================================
FileInit
ShortFName = Parse $ActualIFN '>*\' ''
ShowNote ShortFName
If $ActualIFN ^ '.htm' FileOkay = 'Y'
Otherwise FileOkay = 'N'
FirstLog = 'Y'
HadLF = 'Y' ; Avoid starting with null line
LeftOver = ''
ValidData = 'N'
SawOkayFile = 'N'
Inc NumInpFiles
If NumInpFiles #> 1 OutNull
OutEnd SepLine
OutEnd $ActualIFN
OutEnd SepLine
OutNull
End
;===============================================================================
; Main
;===============================================================================
; Skip invalid files
;-------------------------------------------------------------------------------
Begin FileOkay = 'N'
LogMsgLF
LogMsg OnlyHTML
OutEnd OnlyHTML
NextStep
Else
SawOkayFile = 'Y'
End
;-------------------------------------------------------------------------------
; Prefix any unresolved data
;-------------------------------------------------------------------------------
Begin LeftOver <> ''
$Data = LeftOver $Data
LeftOver = ''
End
;-------------------------------------------------------------------------------
; Ignore null lines
;-------------------------------------------------------------------------------
If $Data = '' Done
;-------------------------------------------------------------------------------
; Look for scripts
;-------------------------------------------------------------------------------
Call ScriptCheck 'script'
Call ScriptCheck 'noscript'
;-------------------------------------------------------------------------------
; Are we seeing actual HTML?
;-------------------------------------------------------------------------------
Begin ValidData = 'N'
ScanPosn $Ignore $Ignore $Data ValidDataStart
If $Success = 'N' ScanPosn $Ignore $Ignore $Data BlankLineTags
Begin $Success = 'Y'
ValidData = 'Y'
Else
Call LabelLog
LogMsg $Data
Done
End
End
;-------------------------------------------------------------------------------
; Parse out the HTML
;-------------------------------------------------------------------------------
Line = $Data
Begin
ScanPosn FromPosn $Ignore Line ' <[A-Z] <[a-z] </ <! <?' 'First RegExp'
Begin FromPosn = 0 ; Did we find an HTML tag?
Break ; No tags found, so bail out
Else
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; We found a tag
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insertion = '' ; For special tag processing
LineLeft = Parse Line 1 FromPosn 'Cut' ; Get up to the start of tag
$Ignore = Parse LineLeft -1 -1 'Cut' ; Remove the < character
FullTag = Parse Line '' '1*>' 'Cut Include' ; Look for the end of tag
Begin FullTag = '' ; Didn't find the end of tag?
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Didn't find the end of this tag
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
LeftOver = $Data ; Restore the line
Done ; Save line for next pass
Else
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Found full tag; see if it needs to be treated specially
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
FullTag = '<' FullTag
ScanPosn $Ignore $Ignore FullTag BlankLineTags
Begin $Success = 'Y'
Insertion = CRLF ; Embed a CRLF
Else
ScanPosn $Ignore $Ignore FullTag '/<LI /<LI>'
Begin $Success = 'Y'
Insertion = '- '
Else
ScanPosn $Ignore $Ignore FullTag '/<BR>'
If $Success = 'Y' Insertion = ' '
End
End
End
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Tag successfully removed
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Line = LineLeft Insertion Line ; Stitch the line back together
End ; We found an HTML tag
Again ; Loop while we have tags
;-------------------------------------------------------------------------------
; Output
;-------------------------------------------------------------------------------
Begin Line <> ''
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Translate HTML symbols. This script would probably run somewhat faster if
; we put these symbols in a lookup file and used the MassChange command, but
; that is beyond the scope of this demonstration script. Also, we only
; include a few of the frequently-used numeric symbols.
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Change Line '&#153;' '™'
Change Line '&#160;' ' '
Change Line '&#163;' '£'
Change Line '&#164;' '¤'
Change Line '&#165;' '¥'
Change Line '&#169;' '©'
Change Line '&#174;' '®'
Change Line '&#176;' '°'
Change Line '&#177;' '±'
Change Line '&Aacute;' 'Á'
Change Line '&aacute;' 'á'
Change Line '&acirc;' 'â'
Change Line '&Acirc;' 'Â'
Change Line '&acute;' '´'
Change Line '&AElig;' 'Æ'
Change Line '&aelig;' 'æ'
Change Line '&agrave;' 'à'
Change Line '&Agrave;' 'À'
Change Line '&amp;' '&'
Change Line '&aring;' 'å'
Change Line '&Aring;' 'Å'
Change Line '&atilde;' 'ã'
Change Line '&Atilde;' 'Ã'
Change Line '&Auml;' 'Ä'
Change Line '&auml;' 'ä'
Change Line '&bdquo;' '„'
Change Line '&brvbar;' '¦'
Change Line '&bull;' '•'
Change Line '&ccedil;' 'ç'
Change Line '&Ccedil;' 'Ç'
Change Line '&cedil;' '¸'
Change Line '&cent;' '¢'
Change Line '&circ;' 'ˆ'
Change Line '&copy;' '©'
Change Line '&curren;' '¤'
Change Line '&dagger;' '†'
Change Line '&Dagger;' '‡'
Change Line '&deg;' '°'
Change Line '&divide;' '÷'
Change Line '&eacute;' 'é'
Change Line '&Eacute;' 'É'
Change Line '&Ecirc;' 'Ê'
Change Line '&ecirc;' 'ê'
Change Line '&Egrave;' 'È'
Change Line '&egrave;' 'è'
Change Line '&emsp;' ' '
Change Line '&ensp;' ' '
Change Line '&ETH;' 'Ð'
Change Line '&eth;' 'ð'
Change Line '&euml;' 'ë'
Change Line '&Euml;' 'Ë'
Change Line '&euro;' '€'
Change Line '&fnof;' 'ƒ'
Change Line '&frac12;' '½'
Change Line '&frac14;' '¼'
Change Line '&frac34;' '¾'
Change Line '&gt;' '>'
Change Line '&hellip;' '…'
Change Line '&Iacute;' 'Í'
Change Line '&iacute;' 'í'
Change Line '&Icirc;' 'Î'
Change Line '&icirc;' 'î'
Change Line '&iexcl;' '¡'
Change Line '&igrave;' 'ì'
Change Line '&Igrave;' 'Ì'
Change Line '&image;' 'I'
Change Line '&iquest;' '¿'
Change Line '&iuml;' 'ï'
Change Line '&Iuml;' 'Ï'
Change Line '&laquo;' '«'
Change Line '&ldquo;' '“'
Change Line '&lsaquo;' '‹'
Change Line '&lsquo;' '‘'
Change Line '&lt;' '<'
Change Line '&macr;' '¯'
Change Line '&mdash;' '—'
Change Line '&micro;' 'µ'
Change Line '&middot;' '·'
Change Line '&minus;' '-'
Change Line '&nbsp;' ' '
Change Line '&ndash;' '–'
Change Line '&not;' '¬'
Change Line '&Ntilde;' 'Ñ'
Change Line '&ntilde;' 'ñ'
Change Line '&Oacute;' 'Ó'
Change Line '&oacute;' 'ó'
Change Line '&Ocirc;' 'Ô'
Change Line '&ocirc;' 'ô'
Change Line '&OElig;' 'Œ'
Change Line '&oelig;' 'œ'
Change Line '&Ograve;' 'Ò'
Change Line '&ograve;' 'ò'
Change Line '&ordf;' 'ª'
Change Line '&ordm;' 'º'
Change Line '&Oslash;' 'Ø'
Change Line '&oslash;' 'ø'
Change Line '&otilde;' 'õ'
Change Line '&Otilde;' 'Õ'
Change Line '&Ouml;' 'Ö'
Change Line '&ouml;' 'ö'
Change Line '&para;' '¶'
Change Line '&permil;' '‰'
Change Line '&plusmn;' '±'
Change Line '&pound;' '£'
Change Line '&quot;' '"'
Change Line '&raquo;' '»'
Change Line '&rdquo;' '”'
Change Line '&real;' 'R'
Change Line '&reg;' '®'
Change Line '&rsaquo;' '›'
Change Line '&rsquo;' '’'
Change Line '&sbquo;' '‚'
Change Line '&scaron;' 'š'
Change Line '&Scaron;' 'Š'
Change Line '&sect;' '§'
Change Line '&shy;' '­'
Change Line '&sup1;' '¹'
Change Line '&sup2;' '²'
Change Line '&sup3;' '³'
Change Line '&szlig;' 'ß'
Change Line '&thinsp;' ' '
Change Line '&thorn;' 'þ'
Change Line '&THORN;' 'Þ'
Change Line '&tilde;' '˜'
Change Line '&times;' '×'
Change Line '&trade;' '™'
Change Line '&uacute;' 'ú'
Change Line '&Uacute;' 'Ú'
Change Line '&ucirc;' 'û'
Change Line '&Ucirc;' 'Û'
Change Line '&ugrave;' 'ù'
Change Line '&Ugrave;' 'Ù'
Change Line '&uml;' '¨'
Change Line '&uuml;' 'ü'
Change Line '&Uuml;' 'Ü'
Change Line '&yacute;' 'ý'
Change Line '&Yacute;' 'Ý'
Change Line '&yen;' '¥'
Change Line '&yuml;' 'ÿ'
Change Line '&Yuml;' 'Ÿ'
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Handle leading CRLF's
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Change Line CRLF2 CRLF ; Remove multiple linefeeds
NeedLFBefore = 'N'
Begin Line[1 2] = CRLF
$Ignore = Parse Line 1 2 'Cut'
If HadLF = 'N' NeedLFBefore = 'Y'
End
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Handle trailing CRLF's
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NeedLFAfter = 'N'
Begin Line Len>= 2
LastTwoChars = Parse Line -2 -1
Begin LastTwoChars = CRLF
$Ignore = Parse Line -2 -1 'Cut'
NeedLFAfter = 'Y'
End
End
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Remove spaces on the edges, since many HTML coders tend to indent their text
; to highlight its structure.
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TrimChar Line
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Send the line to the output file
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Begin Line = ''
If HadLF = 'N' OutNull
HadLF = 'Y'
Done
End
If NeedLFBefore = 'Y' OutNull
OutEnd Line
HadLF = 'N'
Begin NeedLFAfter = 'Y'
OutNull
HadLF = 'Y'
End
End
Done
;===============================================================================
; Subroutines
;===============================================================================
Procedure LabelLog
Begin FirstLog = 'Y'
FirstLog = 'N'
LogMsgLF
LogMsg '-----------------------------'
LogMsg 'Header and Script Information'
LogMsg '-----------------------------'
End
End
Procedure ScriptCheck
EndList = ' </no' ScriptCheck '>' ; e.g. ' </noscript>'
;-----------------------------------------------------------------------------
; Have we exited a multi-line script section?
;-----------------------------------------------------------------------------
Begin ValidData = 'N'
ScanPosn SCFrom SCTo $Data EndList
Begin $Success = 'Y'
ValidData = 'Y'
SCTag = Parse $Data SCFrom SCTo 'Cut'
Call LabelLog
LogMsg SCTag
TrimChar $Data
If $Data = '' Done
End
End
;-----------------------------------------------------------------------------
; Look for a script starting, and maybe ending on the same line
;-----------------------------------------------------------------------------
StartList = ' <' ScriptCheck ; e.g. ' <script'
ScanPosn $Ignore $Ignore $Data StartList
Begin $Success = 'Y'
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Found the start of the script
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ScanPosn $Ignore $Ignore $Data EndList
Begin $Success = 'N'
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Script does not end on the same line
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Call LabelLog
LogMsg $Data
ValidData = 'N'
Done
Else
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Script ends on the same line
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ExtFrom = '1*<' ScriptCheck
ExtTo = '>*</no' ScriptCheck
Extract = Parse $Data ExtFrom ExtTo
Call LabelLog
LogMsg Extract
End
End
End
;===============================================================================
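
For readers who want to see the overall technique outside of Parse-O-Matic, here is a minimal sketch in Python (not the script's own language, and not a translation of it). It only loosely mirrors what the script above does: script and noscript sections are dropped, certain block-level tags become line breaks, list items get a "- " prefix, a single BR becomes a space, and HTML symbols are translated. The names html_to_text, BLOCK_TAGS, TAG_RE and SCRIPT_RE are made up for this illustration. Unlike the script, the sketch treats closing block tags as line breaks too, discards all blank lines, and reads the whole document at once instead of line by line.

```python
import html
import re

# Tags that force a line break (an assumption modeled on the script's BlankLineTags).
BLOCK_TAGS = ("div", "h1", "h2", "h3", "h4", "hr", "ol", "p", "table", "tr", "ul")

# An opening or closing tag, or a <! ... > / <? ... > declaration.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|<[!?][^>]*>")

# A whole <script>...</script> or <noscript>...</noscript> section.
SCRIPT_RE = re.compile(r"<(script|noscript)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def _replace_tag(match):
    """Return the text that should take the place of one tag."""
    name = (match.group(2) or "").lower()
    closing = match.group(1) == "/"
    if name in BLOCK_TAGS:
        return "\n"        # block-level tag: start a new line
    if name == "li" and not closing:
        return "\n- "      # list item: new line with a dash prefix
    if name == "br":
        return " "         # single break: just a space
    return ""              # any other tag is simply removed


def html_to_text(source):
    """Strip tags and translate HTML symbols, returning plain text."""
    body = SCRIPT_RE.sub("", source)
    body = TAG_RE.sub(_replace_tag, body)
    # Translate symbols such as &amp; and turn no-break spaces into ordinary spaces.
    body = html.unescape(body).replace("\xa0", " ")
    lines = [line.strip() for line in body.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<body><h1>Title</h1><p>Fish &amp; chips</p></body>"
    print(html_to_text(sample))
```

The standard-library function html.unescape covers the full set of named and numeric symbols, which is why the sketch needs no long translation table like the one in the script.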



Editing a Text File

Text files can generally be loaded by a text editor program (such as Windows Notepad, or NoteTab from Fookes Software), and most word-processing programs can load them as well. However, when you save a text file loaded this way it may lose its original format. For example, you might load a classic Mac text file (in which each line ends with CR), but if you edit and save it using a Windows text editor you might find that each line in the file now ends with CRLF. This might cause problems later, if the next program to use the file does not know how to deal with CRLF-delimited files. In such a case, the extra LF may appear in the program as a strange-looking character at the beginning of each line (starting with the second line).

Worse problems can arise if you edit a text file in a word-processing program. When you save the file, you must ensure that you save it as a text file rather than a word-processing file. In Word 2002 you can select "File/Save As", and then select "Plain Text (*.txt)". If you inadvertently save a text file in a word-processing format, it will contain a lot of additional information it did not have before. This will probably render it useless to the next program that tries to use it, since that program expects an ordinary text file. Fortunately, it will probably be easy to load the file back into the word-processing program and save it again, this time making sure to specify a text file format.

Some Examples of Text File Extensions

A file whose name ends with the characters .txt is almost certainly a text file. Other extensions typical of text files include .me (as in a file named Read.Me) and .htm, which is an HTML file, as used by web pages.

Windows files with the .ini extension are also text files, so they could be loaded into a text editor program. However, just because you can do this does not mean that you should. An ini file typically contains the settings for a program, and if you alter the file the program might stop working.

Files with the .csv extension are comma-separated-value files. These can be loaded into a text editor, but your operating system may be configured to open them in a spreadsheet if you double-click on them.

To summarize the foregoing: many programs save data in text files, but not all text files are supposed to be loaded into a text editor program.
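
If you are unsure which line-ending convention a file uses, you can check before editing it. The following is a minimal sketch in Python, offered as an illustration only. The function names detect_line_ending and convert_line_endings are made up for this example, and the sketch assumes the file is small enough to read into memory as bytes.

```python
def detect_line_ending(data: bytes) -> str:
    """Report the most common line ending in data: 'CRLF', 'CR', 'LF' or 'none'."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # CR characters not followed by LF
    lf = data.count(b"\n") - crlf   # LF characters not preceded by CR
    counts = {"CRLF": crlf, "CR": cr, "LF": lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"


def convert_line_endings(data: bytes, newline: bytes) -> bytes:
    """Rewrite every line ending in data as the given newline sequence."""
    # Normalize everything to LF first, then expand to the requested ending.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)


if __name__ == "__main__":
    sample = b"first line\r\nsecond line\r\n"
    print(detect_line_ending(sample))
    print(convert_line_endings(sample, b"\r"))
```

Working on bytes rather than text keeps the language's own newline translation out of the picture, so the result reflects what is actually stored in the file.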




 


Copyright © 1986-2010 Pyroto, Inc. All rights reserved.