snipt

Python

Read all links from a remote HTML page

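from urllib.request import urlopen  # Python 3; Python 2 used "from urllib import urlopen"

from lxml import html


def read_all_links(url):
    """Print the text of every link whose href starts with '/taxonomy/'."""
    # Download the page and parse it into an lxml element tree.
    with urlopen(url) as response:
        page = html.fromstring(response.read())

    # text() selects the link labels, not the href values.
    for link in page.xpath("//a[starts-with(@href, '/taxonomy/')]/text()"):
        print(link)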

Description

The XPath expression keeps only the links whose href starts with "/taxonomy/" and prints the text of each one, not its URL.
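If the URLs themselves are wanted rather than the link text, the same prefix filter can be sketched with only the standard library. This is a swapped-in variant, not the snippet's lxml approach: the names PrefixLinkParser and extract_links, and the prefix and base_url parameters, are hypothetical and introduced here for illustration. Unlike the XPath text() step, it also collects text nested inside child tags of the link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PrefixLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags whose href starts with a prefix.

    Hypothetical helper, not part of the original snippet.
    """

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the matching <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only record text while inside a matching link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_text, prefix="/taxonomy/", base_url=""):
    """Return (absolute_url, text) pairs for links whose href starts with prefix."""
    parser = PrefixLinkParser(prefix)
    parser.feed(html_text)
    parser.close()
    # urljoin resolves site-relative hrefs against the page URL.
    return [(urljoin(base_url, href), text) for href, text in parser.links]
```

For a remote page, the HTML could be fetched first, for example with urllib.request.urlopen(url).read().decode("utf-8", "replace"), and then passed in along with base_url=url. The decoding shown assumes the page is UTF-8.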
2019-07-20T12:18:49
May 09, 2013 at 04:50 AM