Saturday, December 14, 2013

Exporting Blogger Posts to individual html files

A few months back I exported my blogger blog, producing an atom xml file. I looked at this in XML notepad 2007 and it didn't seem very helpful. I also opened in MS Excel, from there at least I was able to extract a list of titles and keywords, but otherwise a bit messy. I had been experimenting writing own parser for my own simple xml files, the atom xml however seemed too complicated especially the element which contains the html formatted post.

I looked around for off-line editors, but none of the ones I could find seemed to be worthy of the title editor, as they are mostly for creation and posting of new posts, rather than editing and managing existing. What I wanted was to extract the individual posts and edit and possibly update the original posts or post again as a new post: noting that this blog is catharsis and basic brain dump of ideas.

Any case more recently I have been experimenting with the MSXML COM automation object, this I used in my simple barrier program to provide the data for the drop down list and associated load tables. So having had some success programming MSXML in both MS Excel/vba and then translating to vb.net, I thought I would take a look at ripping the blogger atom xml file apart with MSXML. {yeh! I'm now aware that I probably shouldn't use MSXML in vb.net}

Which turned out not to be all that complicated in the first instance. The xml file contains a lot of elements called <entry>, not all of these are posts. From looking at the file with XML notepad as far as I could tell those posts which have element <type> with value "html" are the posts and pages of the blog. The following code extracts the appropriate nodes.

Set nList = xmldoc.selectNodes("//entry[content/@type='html']")

Admittedly working out how to select the appropriate nodes wasn't so simple. Currently having problems understanding what I actually grab from the xml data, and what the text property is going to return, or value property returns.

Having got these nodes the next step was to find the child nodes with the required data and then write to either a plain text file or html file. For the initial test I just grabbed the data and wrote to a plain text file. Before writing the program I also copy/pasted the content node data using XML notepad and pasted to notepad and saved as simple text, renamed and viewed in google chrome. The whole xml file can also be viewed in Google chrome: writing a chrome application may be a alternative possibility for extracting the blog posts, but creating local files with a browser is slightly problematic
.
Once confirmed that could grab the posts, I added a few lines of additional code to create a more completely formatted individual html file. Only I was missing the post titles and getting those was more problematic. Well! initially I had everything for the <entry> element printed in the individual files, or otherwise just grabbed the content. So more research on XPath at w3chools.com, and more experimenting with select nodes, to arrive at the search string shown above.

With that search string have to process the child nodes of the node list to get to the <content> and <title> elements. Once I got the titles back into the posts, I figured I had a lot of files and I didn't know what they were as I merely gave them numbers. So I further modified the program to create an index file. So having got all that working on Friday, I figured the next step would be to do something with the categories or keywords assigned to each post. This posed another problem as not every <category> element contains a keyword, nor does it have a value, it only has attributes. Trying to find information on how to access the attributes wasn't easy.

Anycase there are two attributes scheme and term. It seems that "term" contains the keyword if the "scheme" attribute contains "http://www.blogger.com/atom/ns#". Therefore needed to be able to test the value of the attributes and extract the keywords.

Having figured out how to do this, I wrote the keywords to a table in Excel and counted the frequency of use in the blog posts. The code I used is as follows, it will need modifying to use elsewhere, but it my be useful as outline of basic process of getting information from the blogger atom xml.  Note that some of the code has wrapped to display into the width of the blog post, and the code needs to be on a single line.
        
Sub xmlSimpleList2()
  Dim wrkbk As Workbook
  Dim wrkSht As Worksheet
  Dim WrkRng As Range
  Dim fpath1 As String
  Dim keyWordTable As Range
  Dim keyWord As String

  Dim xmldoc As MSXML2.DOMDocument
  Dim xmlNode As MSXML2.IXMLDOMNode

  Dim tmpNode As MSXML2.IXMLDOMNode
  
  Dim nList As MSXML2.IXMLDOMNodeList
  Dim tmpList As MSXML2.IXMLDOMNodeList
  
  Dim oNode As MSXML2.IXMLDOMNode
  Dim oAttributes As MSXML2.IXMLDOMNamedNodeMap
  
  Dim tagName As String
  Dim tagValue As String
  Dim i As Integer, j As Integer
  Dim fp As Integer, fp1 As Integer
      
  Set wrkbk = ThisWorkbook
  Set wrkSht = wrkbk.Worksheets("xmlData")
  Set WrkRng = wrkSht.Range("A1:A1")
  Set keyWordTable = ThisWorkbook.Worksheets("keyWords").Range("A1:A1") 'row zero is header
   
    Set xmldoc = New MSXML2.DOMDocument

    xmldoc.async = False
    If xmldoc.Load(myDocName) Then
      fp1 = fopen(ThisWorkbook.Path & "\BlogPosts\" & "index.htm", "wt")
      Print #fp1, "<html>"
      Print #fp1, "<body>"
      Print #fp1, "<ol>"

        'Test XPath
        i = 1
        j = 1
' Set nList = xmldoc.selectNodes("//content[@type='html']")
        Set nList = xmldoc.selectNodes("//entry[content/@type='html']")
        
' Final text entity is part of the tree, but attributes are not
        For Each oNode In nList
          tagName = oNode.nodeName
' Debug.Print i, tagName
         WrkRng.Offset(i, 0).Value = tagName
          
         fpath1 = ThisWorkbook.Path & "\BlogPosts\" & "post" & StrLPadc(CStr(i), 4, "0") & ".htm"
         fp = fopen(fpath1, "wt")
         Print #fp, "<html>"
         Print #fp, "<body>"

          If oNode.hasChildNodes Then
' Debug.Print "Number of child Nodes: " & oNode.childNodes.Length
            For Each xmlNode In oNode.childNodes
            
              Select Case UCase(xmlNode.nodeName)
                Case UCase("title")
                  Debug.Print xmlNode.Text
                  WrkRng.Offset(i, 1).Value = xmlNode.Text
                  Print #fp, "<h1>"
                  Print #fp, xmlNode.Text
                  Print #fp, "</h1>"
                  
                  Print #fp1, "<li><a href=" & """" & fpath1 & """" & ">" & xmlNode.Text & "</a></li>"
                  
                Case UCase("content")
                  Debug.Print "content"
                  Print #fp, xmlNode.Text
                  
                Case UCase("category")
                  Debug.Print "category"
                  Set oAttributes = xmlNode.Attributes
                  Set tmpNode = oAttributes.getNamedItem("scheme")
                  
                  If tmpNode.Text = bloggerNameSpace Then
                    j = j + 1
                    Set tmpNode = oAttributes.getNamedItem("term")
                    keyWord = tmpNode.Text
                    WrkRng.Offset(i, j).Value = keyWord
                    Call updateKeyWordList(keyWordTable, keyWord)
                  End If
                  
              End Select
              
            Next xmlNode
          End If
           
          Print #fp, "</body>"
          Print #fp, "</html>"
          Close #fp
          i = i + 1
          j = 1
        Next oNode
        
        Print #fp1, "</ol>"
        Print #fp1, "</body>"
        Print #fp1, "</html>"
        Close #fp1
        
    End If  'xml doc loaded
    
    Set xmldoc = Nothing
    
    Debug.Print "All Done!"
    
End Sub