BeautifulSoup: HTML table to CSV

Good evening, I have used BeautifulSoup to extract some data from a website as follows:

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

table = soup.findAll('table', attrs={ "class" : "table-horizontal-line"})

print table

This gives the following output:

[<table width="70%" class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
<th>What was the fine for?</th>
<th>Compensation</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257">&#160;</a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
<td>Attempting to mislead the Japanese regulatory and tax authorities</td>
<td>&#160;</td>
</tr>
<tr>
<td>£750,000</td>
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td>
<td>17/12/02</td>
<td>Breaches of money laundering rules</td>
<td>&#160;</td>
</tr>
<tr>
<td>£1,000,000</td>
<td><a href="/pages/library/communication/pr/2002/118.shtml">Abbey Life Assurance Company ltd</a></td>
<td>04/12/02</td>
<td>Mortgage endowment mis-selling and other failings</td>
<td>Compensation estimated to be between £120 and £160 million</td>
</tr>
<tr>
<td>£1,350,000</td>
<td><a href="/pages/library/communication/pr/2002/087.shtml">Royal &#38; Sun Alliance Group</a></td>
<td>27/08/02</td>
<td>Pension review failings</td>
<td>Redress exceeding £32 million</td>
</tr>
<tr>
<td>£4,000</td>
<td><a href="/pubs/final/ft-inv-ins_7aug02.pdf" target="_blank">F T Investment &#38; Insurance Consultants</a></td>
<td>07/08/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£75,000</td>
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td>
<td>18/06/02</td>
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td>
<td>&#160;</td>
</tr>
<tr>
<td>£120,000</td>
<td><a href="/pages/library/communication/pr/2002/051.shtml">Ward Consultancy plc</a></td>
<td>14/05/02</td>
<td>Pension review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£140,000</td>
<td><a href="/pages/library/communication/pr/2002/036.shtml">Shawlands Financial Services ltd</a> - formerly Frizzell Life &#38; Financial Planning ltd)</td>
<td>11/04/02</td>
<td>Record keeping and associated compliance breaches</td>
<td>&#160;</td>
</tr>
<tr>
<td>£5,000</td>
<td><a href="/pubs/final/woodwards_4apr02.pdf" target="_blank">Woodward's Independent Financial Advisers</a></td>
<td>04/04/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
</table>]

I would like to export this to CSV while keeping the table structure as it is displayed on the website. Is this possible, and if so, how?

Thanks in advance for the help.

You might want to look at this solution: sebsauvage.net/python/html2csv.py . Found it by googling "html to csv python" 🙂
Thanks, although that solution seems quite complicated? I am hoping there is a simpler way, given that I already have all the data in a relatively clean format ... if not, I will try to follow this one 🙂

Source / author: merlin_1980 | 2013-01-05

Here is a basic approach you can try. It assumes that the headers are all in <th> tags and that all subsequent data is in <td> tags. This works for the single case you provided, but I'm sure adjustments will be needed for other cases 🙂 The general idea is that once you have found your table (here using find to pull the first one), you get the headers by iterating over all th elements and storing their text in a list. Then you create a rows list that holds one list per table row. It is populated by finding all td elements under each tr tag and taking their text, encoded as UTF-8 (from Unicode). You then open a CSV file, write the headers first, and then write all of the rows, using (row for row in rows if row) to drop any empty rows, such as the header row, which contains no td cells:

In [117]: import csv

In [118]: from bs4 import BeautifulSoup

In [119]: from urllib2 import urlopen

In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"})

In [122]: headers = [header.text for header in table.find_all('th')]

In [123]: rows = []

In [124]: for row in table.find_all('tr'):
   .....:     rows.append([val.text.encode('utf8') for val in row.find_all('td')])
   .....: 

In [125]: with open('output_file.csv', 'wb') as f:
   .....:     writer = csv.writer(f)
   .....:     writer.writerow(headers)
   .....:     writer.writerows(row for row in rows if row)
   .....: 

In [126]: cat output_file.csv
Amount,Company or person fined,Date,What was the fine for?,Compensation
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities, 
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules, 
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings, 
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")", 
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings, 
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches, 
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings,

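For readers on Python 3, the steps above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: bs4 is installed, the built-in "html.parser" backend is acceptable, and the HTML is parsed from a string instead of being fetched (the HTML constant is a shortened excerpt of the table from the question). The helper names table_to_rows and rows_to_csv are made up for illustration and are not part of BeautifulSoup.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened excerpt of the table from the question (assumption: the real
# page has the same structure; fetching it is left out on purpose).
HTML = """
<table width="70%" class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257">&#160;</a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
</tr>
<tr>
<td>£750,000</td>
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td>
<td>17/12/02</td>
</tr>
</table>
"""


def table_to_rows(html):
    """Return (headers, rows) for the first matching table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "table-horizontal-line"})
    # Python 3 strings are already Unicode, so no .encode('utf8') is needed.
    headers = [th.get_text().strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped here
            rows.append(cells)
    return headers, rows


def rows_to_csv(headers, rows):
    """Serialise the headers and rows to a CSV-formatted string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


headers, rows = table_to_rows(HTML)
csv_text = rows_to_csv(headers, rows)
print(csv_text)
```

Printing csv_text shows the result without relying on the IPython-only cat shortcut. To write a file instead, Python 3 expects text mode, for example open('output_file.csv', 'w', newline='', encoding='utf-8'), not the 'wb' mode used in the Python 2 session. Note that the sketch strips surrounding whitespace from each cell, which the original session does not do.
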
Thanks, this looks like the perfect solution. However, I seem to keep getting a SyntaxError on the 'cat output_file.csv' line; it just says invalid syntax?
Oh sorry, I should have mentioned that this is an IPython-specific thing (I was basically just trying to show the contents of the file). If you got that far, you should find the saved file in the current directory.
Thanks very much 🙂 I didn't think of looking in the directory and opening the file manually!
No problem, I should have been clearer 🙂 Good luck with everything!
This helped me too: I didn't know that a text attribute was accessible on a BeautifulSoup Tag
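
On the text attribute mentioned in the last comment: as a small illustration, here is how it behaves on a single cell. This is a sketch assuming bs4 on Python 3 with the built-in "html.parser" backend; the one-cell fragment cell_html is a trimmed copy of the first data cell from the question, made up for this example.

```python
from bs4 import BeautifulSoup

# One-cell fragment modelled on the first data row in the question
# (illustrative only, not fetched from the site).
cell_html = '<td><a name="_Hlk74714257">&#160;</a>£4,000,000</td>'
cell = BeautifulSoup(cell_html, "html.parser").find("td")

raw = cell.text                    # property returning the same string as cell.get_text()
clean = cell.get_text(strip=True)  # strips each text fragment and drops empty ones

print(repr(raw))
print(repr(clean))
```

The raw value keeps the non-breaking space that comes from the &#160; entity, which is why the first amount in the CSV output above appears with a leading space. Passing strip=True (or calling .strip() on the result) is one way to remove it.
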
