Tuesday, May 20, 2008

Processing XML in SNOBOL4

This post presents a quick way to process XML files in SNOBOL4 using a SAX-like method.

I couldn't find an XML parsing or processing tool for SNOBOL4, so I decided to write one as a way to learn more about the language. I chose a SAX-style approach because it is simpler to implement and lets me focus on the text-processing part of the code.

Since SNOBOL4 consumes its input line by line, I wrote a function that flattens the whole input into a single string. This removes the need to handle line breaks, but it is also inefficient, because the entire content of the XML file ends up in one long string. Note that the lines are joined without a separator, so text that spans several lines is glued together.


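* Read every line of standard input and return them joined into one string.
        Define('ReadAll()content,tcontent') :(RA_END)
ReadAll
        content = ''
RA_LOOP
* INPUT fails at end of file, which ends the loop
        tcontent = INPUT :F(ERA_LOOP)
        content = content tcontent :(RA_LOOP)
ERA_LOOP
        ReadAll = content :(RETURN)
RA_END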



With that problem solved, the XML processing function looks as follows:


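* Scan inputStr starting at offset iPos. Calls fTStart(name,attsTable) for
* each opening tag, fTEnd(name) for each closing tag and fText(text) for text.
        Define('ReadXml(inputStr,iPos,fTStart,fTEnd,fText)fPos,name,closing,attsString,attsTable,text') :(RX_END)
ReadXml
        &anchor = 0
Init
XmlDirectiveL
* Skip processing instructions such as <?xml version="1.0"?>
        inputStr POS(iPos) '<?' ARB '?>' @fPos :F(TagStartL)
        iPos = fPos :(Init)

TagStartL
        inputStr POS(iPos) '<' SPAN(TagChar) $ name ARB $ attsString ('/>' | '>') $ closing @fPos :F(EndTagL)
        attsTable = ReadAttributes(attsString)
        iPos = fPos
        APPLY(fTStart,name,attsTable) :(Init)

EndTagL
        inputStr POS(iPos) '</' SPAN(TagChar) $ name '>' @fPos :F(BlanksL)
        iPos = fPos
        APPLY(fTEnd,name) :(Init)

BlanksL
        inputStr POS(iPos) SPAN(Blank) @fPos :F(TextL)
        iPos = fPos :(Init)
TextL
* NOTANY('<') requires at least one character of text, so an unrecognized
* '<' (a comment, for example) ends the scan instead of looping forever
        inputStr POS(iPos) (NOTANY('<') BREAK('<')) $ text @fPos :F(RXXS_END)
        iPos = fPos
        APPLY(fText,text) :(Init)

RXXS_END :(RETURN)
RX_END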


This code uses the iPos variable to keep track of the position in the string where the last matched XML construct ended. The '@' operator followed by a variable stores the current cursor position of the match in that variable.
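
The cursor bookkeeping can be tried in isolation. The following is a minimal sketch, not part of the parser; the string and the variable names s and cur are made up for the illustration. Each match starts at the offset stored in cur and then overwrites cur with the cursor position reached at the end of the match.

```snobol4
        &ANCHOR = 0
        s = 'ab<tag>cd'
        cur = 0
* The match starts at offset 0; @cur stores the cursor reached after 'ab'
        s POS(cur) BREAK('<') @cur
        OUTPUT = cur
* The match resumes at offset 2; @cur stores the cursor reached after '<tag>'
        s POS(cur) '<' BREAK('>') '>' @cur
        OUTPUT = cur
END
```

This should print 2 and then 7, the offsets just past 'ab' and just past '<tag>'.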

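Each section of this function, marked by the labels XmlDirectiveL, TagStartL, EndTagL, BlanksL and TextL, matches one kind of XML construct. Opening tags, closing tags and text invoke the callback functions passed in the fTStart, fTEnd and fText parameters; the call is made with the APPLY function. Processing instructions and blanks are simply skipped. A self-closing tag such as <cinco id="42"/> triggers only the fTStart callback.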
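
As a small illustration of the callback mechanism, here is a sketch in which a variable holds the name of the function to call and APPLY performs the call. OnStart, OnEnd and handler are hypothetical names chosen for this example; the function names are passed with the '.' operator, in the same way the usage examples in this post pass their callbacks.

```snobol4
        DEFINE('OnStart(name)')                 :(ONSTART_END)
OnStart OUTPUT = 'start ' name                  :(RETURN)
ONSTART_END
        DEFINE('OnEnd(name)')                   :(ONEND_END)
OnEnd   OUTPUT = 'end ' name                    :(RETURN)
ONEND_END
* handler holds the name of a function, chosen at run time
        handler = .OnStart
        APPLY(handler, 'uno')
        handler = .OnEnd
        APPLY(handler, 'uno')
END
```

Because the caller only sees a name, ReadXml does not need to know anything about the handlers it is given.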

The ReadAttributes function, together with the character sets used by the patterns, is defined as follows:


* Characters allowed in tag and attribute names
        TagChar = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:-_.'
        AttNameChar = TagChar
        Blank = " "

* Parse a string of name="value" pairs into a table indexed by attribute name
        Define("ReadAttributes(text)result,attsText,iPOS,fPOS,name,value") :(ratts)
ReadAttributes
        result = table()
        attsText = trim(text)
        iPOS = 0
ratts_loop
        attsText POS(iPOS) ARBNO(' ') SPAN(AttNameChar) $ name ARBNO(' ') '=' ARBNO(' ') '"' BREAK('"') $ value '"' @fPOS :F(ratts_loop_end)
* Store the value under the attribute name
        result[name] = value
        iPOS = fPOS :(ratts_loop)
ratts_loop_end
        ReadAttributes = result :(Return)
ratts
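
The core of the attribute pattern can also be tried on its own. This is a simplified sketch under stated assumptions: it only accepts lowercase attribute names and double-quoted values, and the variable att and the label NOMATCH are invented for the example. The '$' operator assigns the text matched by the preceding pattern element to the variable that follows it.

```snobol4
        &ANCHOR = 0
        att = ' id="42"'
* Skip blanks, capture the attribute name, then capture the quoted value
        att ARBNO(' ') SPAN('abcdefghijklmnopqrstuvwxyz') $ name '="' BREAK('"') $ value '"' :F(NOMATCH)
        OUTPUT = name ' -> ' value              :(END)
NOMATCH OUTPUT = 'no match'
END
```

For the sample string this should print id -> 42.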



An example of the use of these functions is the following:


-include "xmlp.sno"

        Define('MyTSFunc(name,attributesTable)') :(MTS_END)
MyTSFunc
        OUTPUT = "Into " name
        OUTPUT = "id=" attributesTable["id"] :(RETURN)
MTS_END

        Define('MyTEFunc(name)') :(MTE_END)
MyTEFunc
        OUTPUT = "Out of " name :(RETURN)
MTE_END

        Define('MyTTFunc(text)') :(MTT_END)
MyTTFunc
        OUTPUT = "Text: " text :(RETURN)
MTT_END

        OUTPUT = "XML Test"
        Txt = ReadAll()
        ReadXml(Txt,0,.MyTSFunc,.MyTEFunc,.MyTTFunc)
END



Given the following input:

<uno>
<dos id="3">
asdf
<tres id="4">
h hh
</tres>
<cuatro>
iasdl
</cuatro>
<cinco id="42"/>
</dos>
</uno>


The program generates:



XML Test
Into uno
id=
Into dos
id=3
Text: asdf
Into tres
id=4
Text: h hh
Out of tres
Into cuatro
id=
Text: iasdl
Out of cuatro
Into cinco
id=42
Out of dos
Out of uno
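
The empty "id=" lines in this output appear because looking up a key that was never stored in a table yields the null string. A minimal sketch of that behavior follows; the table atts, the key 'href' and the label DONE are made up for the illustration.

```snobol4
        atts = TABLE()
        atts['id'] = '3'
        OUTPUT = 'id=' atts['id']
* 'href' was never stored, so the lookup yields the null string
        OUTPUT = 'href=' atts['href']
* IDENT succeeds when both arguments are identical, which detects the missing key
        IDENT(atts['href'], '')                 :F(DONE)
        OUTPUT = 'no href attribute'
DONE
END
```

A handler can use the same IDENT test to skip tags that lack the attribute it is interested in.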


The benefit of using a SAX-like approach is that the code can be reused in other programs. For example, the following program prints the titles and links from an OPML file exported from Google Reader.


-include "xmlp.sno"


        Define('TagVisitHandler(name,attributesTable)theUrl,title') :(TVH_END)
TagVisitHandler
        name "outline" :F(Return)
        title = attributesTable["text"]
        theUrl = attributesTable["htmlUrl"]
        ident(theUrl , '') :s(return)
        OUTPUT = "Link for " title " : " theUrl :(RETURN)
TVH_END

        Define('MiTEFunc(name)') :(MTE_END)
MiTEFunc :(RETURN)
MTE_END

        Define('MiTTFunc(text)') :(MTT_END)
MiTTFunc :(RETURN)
MTT_END

        Txt = ReadAll()
        ReadXml(Txt,0,.TagVisitHandler,.MiTEFunc,.MiTTFunc)
END



The documentation at SNOBOL4.ORG was used as a reference.