I had a goal to write a simple Python function that takes either a URL or an HTML document, plus an XPath query, and returns the queried element. This was surprisingly difficult.
After reading through some blogs I decided to go with lxml. I also installed Beautiful Soup, which can parse HTML that is pretty far from well-formed. I needed something as system-compatible and OS-independent as possible. I was testing on Ubuntu 8.10.
sudo apt-get install python-lxml
sudo apt-get install python-beautifulsoup
I'll go step by step through the lines of code I wrote and give comments.
The code below first imports the parse function from lxml.html. parse takes either a URL or a file name as its parameter. One note about passing a URL: websites do not necessarily serve the same HTML that you see in your browser, hence the second version, where I saved the HTML in a file and parsed that instead. Calling .getroot() on the resulting document object returns the HtmlElement at the root of the document (<html>).
from lxml.html import parse
root = parse('http://www.hotels.com').getroot()
or
root = parse('hotels_test.html').getroot()
x = root.xpath('/html/body/div[2]/div')
The above command is the magical statement that I wanted to return the node in the document corresponding to that XPath expression. HOWEVER, in my tests lxml did not seem to handle the position predicates ([#]) the way I expected. This is a big monkey wrench in my plans and it requires a significant workaround. The naive method is to replace every step containing '[#]' with the descendant operator ('//'). This does grow the size of the result set.
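For reference, here is a minimal sketch of the kind of function described at the top of this post. The name query_xpath and the "contains a '<'" test for telling markup apart from a URL or file name are assumptions made for illustration only; parse and document_fromstring come from lxml.html.

```python
from lxml import html


def query_xpath(source, xpath):
    """Return the list of nodes in `source` that match `xpath`.

    `source` may be a URL, a file name, or a string of HTML markup.
    Treating anything that contains '<' as markup is a simplifying
    assumption of this sketch, not something lxml does for you.
    """
    if "<" in source:
        # Raw markup: parse the string directly.
        root = html.document_fromstring(source)
    else:
        # URL or file name: let lxml open and parse it.
        root = html.parse(source).getroot()
    return root.xpath(xpath)


page = "<html><body><div><p>hello</p></div><div><p>world</p></div></body></html>"
matches = query_xpath(page, "/html/body/div/p")
print([m.text for m in matches])
```

Note that xpath() returns a list of matching nodes, so a caller that wants a single element still has to pick one out of that list.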
I am currently thinking about a solution that fits elegantly into my methodology. I would love suggestions! Below are some websites and resources I used and recorded.
- http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
- http://codespeak.net/lxml/xpathxslt.html
- http://codespeak.net/lxml/lxmlhtml.html
- http://codespeak.net/lxml/dev/api/index.html [API]
- http://www.somebits.com/weblog/tech/python/xpath.html
- http://www.boddie.org.uk/python/HTML.html
Update:
It turns out that to properly specify which one out of a number of matched elements an XPath refers to, you need to put parentheses around the part of the XPath expression that the position applies to.
Instead of:
x = root.xpath('/html/body/div[2]/div[3]/li/ul')
we have:
x = root.xpath('((/html/body/div)[2]/div)[3]/li/ul')
This seems to work fine.
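The difference between the two forms can be seen on a small document. This is only an illustration with made-up markup, not the hotels.com page: in XPath, a position written directly on a step (li[2]) is counted among the children of each parent, while a position written after a parenthesised expression is counted over the whole set of matches.

```python
from lxml import html

# Made-up markup for illustration: two lists with two items each.
page = """
<html><body>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
</body></html>
"""
root = html.document_fromstring(page)

# li[2]: the second <li> child of *each* <ul>, so one match per list.
per_parent = root.xpath("/html/body/ul/li[2]")

# (.../li)[2]: collect every matching <li> first, then take the second overall.
whole_set = root.xpath("(/html/body/ul/li)[2]")

print([li.text for li in per_parent])  # ['a2', 'b2']
print([li.text for li in whole_set])   # ['a2']
```

So the parenthesised form always yields at most one node per position, which is usually what you want when the goal is "the queried element".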
Update 2:
I was grabbing XPaths via Firebug... but it seems that either Firefox or Firebug changes the underlying DOM for some reason. Now I need to make sure I either get the transformed HTML or an XPath query which is built from the source HTML.
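One possible way to get an XPath that is built from the source HTML, rather than from the browser's DOM, is to let lxml generate it. The sketch below uses a made-up document and assumes you can already locate the element you care about by some other query; getpath() is a method of lxml's ElementTree objects.

```python
from lxml import html

# Made-up source markup for illustration.
source = "<html><body><div>one</div><div><p>target</p></div></body></html>"
root = html.document_fromstring(source)
tree = root.getroottree()

# Locate the element by any convenient query, then ask lxml for its path.
target = root.xpath("//p")[0]
path = tree.getpath(target)
print(path)  # /html/body/div[2]/p

# The generated path resolves back to the same element in this tree.
assert tree.xpath(path)[0] is target
```

Because the path is computed from the tree lxml parsed, it matches what lxml sees, whatever the browser did to its own copy of the page.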

2 comments:
Hello Christian,
I'm sorry I have to use this medium to contact you. I couldn't get hold of an email address. Mine is ubongitina@yahoo.com or I could come back to your blog to check your reply.
I am a Masters project student at London Metropolitan University, London, and my project, which is due this May, is in line with one of your interests. My focus, however, is using the Neural Network (NN) toolbox in Matlab to accomplish this. I have not been able to find good material on this so far, as it seems people no longer use this toolbox, or do not use it alone, for data mining applications. I have instead seen many using different processes or data-mining-specific software (I was able to get a book relating to NN, but it's over 12 years old). I want to ask whether using the neural network toolbox for a data mining application (like customer ranking or sales forecasting) is possible, since you have knowledge in this area. If you know of any material, or a publication of yours, that could be useful to me, please let me know.
Also, I couldn't help but notice that we have similarities. Apart from working in similar areas, I'm a big fan of the TED talks. My favorites so far are Sixth Sense http://www.ted.com/talks/pattie_maes_demos_the_sixth_sense.html, How education kills creativity http://www.ted.com/index.php/talks/ken_robinson_says_schools_kill_creativity.html and Jeff Han's multi-touch interface design http://www.ted.com/index.php/talks/jeff_han_demos_his_breakthrough_touchscreen.html, to mention a few. I think these talks really help to inspire the mind. Again, I too am a Son of the Almighty and am thankful for my redemption through the blood of His precious Son, Jesus. I notice you have barely above seventy-three days to your wedding and I wish you the best (you must be seriously preparing now) and pray that God, the author of Life, romance, bliss, fruitfulness, Joy, companionship and Love, grant you and your wife (to-be) these and much more, Amen.
I really appreciate your help.
Ubong.
Hello nice to hear from you! (I'll reply from here and also shoot you an email so it doesn't look like I'm leaving you hanging.)
I do not use Matlab personally, but I do know Matlab has APIs to Java, C++, Perl and whatever language you need. So I am pretty sure there are tools to support you using Matlab functions. Here is a site I found from a Google search: Matlab from Java.
Aren't those talks great! I want to make it to a session at some point. Six thousand may be worth the experience... It's tough looking for creativity.
Thanks for your prayer and well wishes! I actually finally picked out my Tux yesterday, I'm really excited!
Thanks brother, continue to be the salt in this world!!