How to extract URLs from an HTML page in Python

I am writing a web crawler in Python. I don't know how to parse a page and extract the URLs from the HTML. Where should I go to learn how to write such a program?

In other words, is there a simple Python program that can be used as a template for a general web crawler? Ideally it should use modules that are relatively easy to use, and it should include plenty of comments describing what each line of code does.

Source/Author: user2189704 | 2013-03-20

Take a look at the example code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all the links on that page. Hope this helps.

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# parse the HTML, then turn the normalised document back into one string
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """
    Find the first link in the given HTML text.

    :param page: html of web page (here: Python home page)
    :return: the first url in that page and the index where it ends,
             or (None, 0) if there is no further link
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote


# repeatedly take the first link, then continue searching after it
while True:
    url, n = getURL(page)
    if url is None:
        break
    print(url)
    page = page[n:]

Output:

/
#left-hand-navigation
#content-body
/search
/about/
/news/
/doc/
/download/
/getit/
/community/
/psf/
http://docs.python.org/devguide/
/about/help/
http://pypi.python.org/pypi
/download/releases/2.7.3/
http://docs.python.org/2/
/ftp/python/2.7.3/python-2.7.3.msi
/ftp/python/2.7.3/Python-2.7.3.tar.bz2
/download/releases/3.3.0/
http://docs.python.org/3/
/ftp/python/3.3.0/python-3.3.0.msi
/ftp/python/3.3.0/Python-3.3.0.tar.bz2
/community/jobs/
/community/merchandise/
/psf/donations/
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
http://pycon.org/#calendar
http://www.psfmember.org

...
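
Many of the links in this output are relative (`/about/`, `#content-body`), and a crawler needs absolute URLs before it can fetch them. One way to do that, as a minimal sketch assuming Python 3, is `urllib.parse.urljoin` from the standard library. The helper name `to_absolute` and the sample list are illustrative only.

```python
from urllib.parse import urljoin


def to_absolute(base_url, links):
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


# A few of the links from the output above
sample = ["/", "#content-body", "/about/", "http://docs.python.org/devguide/"]
for absolute in to_absolute("http://www.python.org/", sample):
    print(absolute)
```

Passing the URL of the page on which a link was found as `base_url` also handles links such as `page.html` or `../index.html`.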

Source/Author: Shankar

You can use BeautifulSoup, as many others have also suggested. It can parse HTML, XML, etc. To see some of its features, see here.

Example:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'

# download the page
conn = urllib.request.urlopen(url)
html = conn.read()

# parse it and collect all <a> tags
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')

for tag in links:
    # some anchors have no href attribute
    link = tag.get('href', None)
    if link is not None:
        print(link)

Source/Author: pradyunsg

import re
import urllib.request
from urllib.parse import urljoin

tocrawl = {"http://www.facebook.com/"}  # URLs still to visit
crawled = set()                          # URLs already visited
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
        print(crawling)
    except KeyError:
        # nothing left to crawl
        break
    try:
        response = urllib.request.urlopen(crawling)
    except Exception:
        # skip pages that cannot be fetched
        continue
    msg = response.read().decode('utf-8', errors='replace')
    # print the page title, if there is one
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print(title)
    # print the meta keywords, if there are any
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print(keywordlist)
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        # resolve relative links ('/path', '#anchor', 'page.html') against the current URL
        link = urljoin(crawling, link)
        if link not in crawled:
            tocrawl.add(link)

Reference: Python Web Crawler in Less Than 50 Lines (slow or no longer working; it does not load for me)
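
Since the question asks for a template, here is one possible shape for it, as a minimal sketch assuming Python 3 and only the standard library. It is not taken from the referenced article. The names `LinkParser`, `fetch`, `crawl`, `max_pages` and `fetch_page` are illustrative choices. Unlike the loop above, it stops after a fixed number of pages and visits them in breadth-first order.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (illustrative default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from start_url, visiting at most max_pages pages."""
    queue = deque([start_url])   # URLs waiting to be visited
    seen = {start_url}           # URLs already queued, to avoid duplicates
    visited = []                 # URLs fetched successfully, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # make the link absolute and drop any #fragment
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Calling `crawl("http://www.python.org/", max_pages=5)` would return the list of pages it managed to fetch. Passing a different `fetch_page` function makes it possible to try the crawler on canned HTML without touching the network.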

Source/Author: Scy

You can use BeautifulSoup. Follow the documentation and see what fits your requirements. The documentation also contains code snippets showing how to extract URLs.
```
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc: the page's HTML as a string

soup.find_all('a')  # Finds all <a> tags in the HTML doc; read each tag's 'href' to get the URL.
```
Source/Author: Sushant Gupta

To parse pages, take a look at the BeautifulSoup module. It is easy to use and lets you parse pages with HTML. You can extract URLs from the HTML simply by doing str.find('a')

Do not use regular expressions to parse HTML.
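
To illustrate why, here is a small comparison, as a sketch assuming Python 3. The sample HTML string and the class name `HrefCollector` are made up for the example. The regex is similar to the link regex in the crawler above, and it is compared with the standard library's `html.parser`.

```python
import re
from html.parser import HTMLParser

# Two perfectly valid links: one has another attribute before href,
# the other is written in upper case.
html = '<a class="nav" href="/about/">About</a> <A HREF="/news/">News</A>'

# A regex that expects href to be the first attribute of a lowercase <a> tag
link_regex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')
regex_links = link_regex.findall(html)


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags, whatever the attribute order or case."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


collector = HrefCollector()
collector.feed(html)
parser_links = collector.hrefs

print(regex_links)   # the regex misses both links
print(parser_links)  # the parser finds both
```

The regex could be patched for these two cases, but every patch tends to expose another variation in how real pages write their tags, which is what a parser already deals with.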

Source/Author: TerryA
