How to extract text from reasonably sane HTML?

My question is much like this question, but I have more constraints:

I know the documents are reasonably sane
They are very regular (they all came from the same source)
I want about 99% of the visible text
About 99% of what is viable at all is text (they are more or less RTF converted to HTML)
I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this, or am I better off just breaking out RegexBuddy and C#?

I'm open to command-line or batch-processing tools as well as C/C#/D libraries.

Anything but regexes.
If it were anything but such constrained input, I'd never even consider regex 🙂

Author: BCS | 2010-01-21

c# d html text-extraction

6

You need to use the HTML Agility Pack.

You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
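
A minimal sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced and the project allows C# 9 top-level statements. The sample markup and the choice of the `body` element are illustrative assumptions, not part of the original answer; real documents would typically be loaded with `doc.Load(path)`.

```csharp
// Assumes the HtmlAgilityPack NuGet package and C# 9+ top-level statements.
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample input, used only for illustration.
const string sampleHtml =
    "<html><head><title>t</title></head>" +
    "<body><p>Hello <b>world</b> &amp; friends</p></body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);

// Descendants("body") lazily enumerates every <body> element below the root;
// FirstOrDefault() yields null when the document has none.
var body = doc.DocumentNode.Descendants("body").FirstOrDefault();

// InnerText concatenates the text of all nested nodes. Entities are decoded
// in a separate step with HtmlEntity.DeEntitize.
string text = body == null ? string.Empty : HtmlEntity.DeEntitize(body.InnerText);

Console.WriteLine(text);
```

Picking a single container element keeps `<head>` content such as the title out of the result; for documents with several regions of interest, the same `Descendants` call can be filtered with `Where` instead.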

You mean I have to learn LINQ? (Surprisingly, this is really the first thing I've run into where LINQ sounds like the right way to go, but then again, I'm not normally in this domain.)
You don't have to learn LINQ, but LINQ makes it much easier to use. I'd guess that using LINQ would make your code at least 120% shorter, and easier to understand too.
Wow, my code is -20 lines of code! 😉
+1 The Agility Pack is so much better than writing your own DOM-processing program.
As it happens, LINQ was not the simplest solution, but only because there is a sample project, html2text, that did 90% of what I wanted, and the last 1% was trivial to add as a few lines of if(...) return; (OTOH the documentation was not so good.)

Author: SLaks

This code, which I hacked up today with the HTML Agility Pack, extracts unformatted, trimmed text.

public static string ExtractText(string html)
{
    if (html == null)
    {
        throw new ArgumentNullException("html");
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    var chunks = new List<string>(); 

    foreach (var item in doc.DocumentNode.DescendantNodesAndSelf())
    {
        if (item.NodeType == HtmlNodeType.Text)
        {
            if (item.InnerText.Trim() != "")
            {
                chunks.Add(item.InnerText.Trim());
            }
        }
    }
    return String.Join(" ", chunks);
}

If you want to keep some level of formatting, you can build on the sample that ships with the source.

public string Convert(string path)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(path);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public void ConvertTo(HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlNodeType.Comment:
            //don't output comments
            break;

        case HtmlNodeType.Document:
            ConvertContentTo(node, outText);
            break;

        case HtmlNodeType.Text:
            //script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            //get text
            html = ((HtmlTextNode) node).Text;

            //is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            //check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html));
            }
            break;

        case HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    //treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
                ConvertContentTo(node, outText);
            }
            break;
    }
}


private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
    foreach (HtmlNode subnode in node.ChildNodes)
    {
        ConvertTo(subnode, outText);
    }
}

Author: Sam Saffron

It is relatively simple if you load the HTML into C# and then use mshtml.dll or the WebBrowser control in C#/WinForms. You can then treat the entire HTML document as a tree and traverse it, capturing the InnerText of each object.

Alternatively, you can use document.all, which flattens the tree, and then iterate over the result, again capturing the InnerText.

Here is an example:

        WebBrowser webBrowser = new WebBrowser();
        webBrowser.Url = new Uri("url_of_file"); //can be remote or local
        webBrowser.DocumentCompleted += delegate
        {
            HtmlElementCollection collection = webBrowser.Document.All;
            List<string> contents = new List<string>();

            /*
             * Adds all inner-text of a tag, including inner-text of sub-tags
             * ie. <html><body><a>test</a><b>test 2</b></body></html> would do:
             * "test test 2" when collection[i] == <html>
             * "test test 2" when collection[i] == <body>
             * "test" when collection[i] == <a>
             * "test 2" when collection[i] == <b>
             */
            for (int i = 0; i < collection.Count; i++)
            {
                if (!string.IsNullOrEmpty(collection[i].InnerText))
                {
                    contents.Add(collection[i].InnerText);
                }
            }

            /*
             * <html><body><a>test</a><b>test 2</b></body></html>
             * outputs: test test 2|test test 2|test|test 2
             */
            string contentString = string.Join("|", contents.ToArray());
            MessageBox.Show(contentString);
        };

Hope that helps!

Googling for mshtml.dll mostly gives a page or two of bug reports, bug fixes and errors. Do you have a link to the documentation?
I have just edited my post with an example that uses the WebBrowser control.
Good example, this works well for me. +1
Unfortunately, this method does not work on Server Core systems, because they do not have the WebBrowser component installed.

Author: AlishahNovin

2

Here is the code I use:
```
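using System.Text.RegularExpressions;
using System.Web;

public static string ExtractText(string html)
{
    // Replace every tag with a space so adjacent words do not run together.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    string s = reg.Replace(html, " ");
    // Decode entities such as &amp; into their characters.
    s = HttpUtility.HtmlDecode(s);
    return s;
}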
```
This may be acceptable in some cases. Note, however, that any right angle bracket inside a comment or CDATA block would break this regex, not to mention that the regex could mangle the contents of <script> and <style> tags. Moreover, although (to my limited knowledge) the standard requires angle brackets in attribute values to be encoded, modern browsers tolerate things like <div data:tree="parent>child">Some text</div>, which would also break your regex.
What is the purpose of using the IgnoreCase option for the regex here?
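
The failure modes raised in the comment above can be reproduced with a short sketch. It assumes C# 9 top-level statements and uses only the base class library; the helper name `StripTags` is hypothetical, `WebUtility.HtmlDecode` stands in for `HttpUtility.HtmlDecode` so that no System.Web reference is needed, and the whitespace collapsing is an addition that only makes the output easier to compare.

```csharp
// Assumes C# 9+ top-level statements; uses only the .NET base class library.
using System;
using System.Net;
using System.Text.RegularExpressions;

// Same tag-stripping pattern as the answer, followed by entity decoding and
// whitespace collapsing (the collapsing step is not part of the answer).
static string StripTags(string html)
{
    string s = Regex.Replace(html, "<[^>]+>", " ");
    s = WebUtility.HtmlDecode(s);
    return Regex.Replace(s, @"\s+", " ").Trim();
}

// Well-behaved markup: the pattern does what is intended.
Console.WriteLine(StripTags("<p>Hello <b>world</b></p>"));
// prints: Hello world

// A '>' inside an attribute value ends the match early, so part of the
// attribute leaks into the output.
Console.WriteLine(StripTags("<div data:tree=\"parent>child\">Some text</div>"));
// prints: child">Some text

// Only the <script> tags are removed; the script body is kept as if it were text.
Console.WriteLine(StripTags("<script>var x = 1;</script><p>Hello</p>"));
// prints: var x = 1; Hello
```

For input as regular as the question describes, none of these cases may ever occur, which is why the regex can be good enough there and still unsafe for arbitrary pages.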

Author: Paul
2

You can use NUglify, which supports extracting text from HTML:
```
var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   //prints: This is a text
```
Because it uses a custom HTML5 parser, it should be quite robust (especially if the document contains no errors), and it is very fast (no regexp involved, just a pure recursive-descent parser).

FWIW, this works great with a minimum of fuss. Thanks!

Author: xoofx
1

Here you can download a tool, with its source, that converts back and forth between HTML and XAML: XAML/HTML converter.

It contains an HTML parser (such a thing obviously has to be much more tolerant than your standard XML parser), and you can traverse the HTML much as you would XML.

Author: herzmeister
1

From the command line, you can use the Lynx text browser like this:

If you want to download a web page as formatted output (i.e. without HTML tags, but as it would appear in Lynx), enter:
```
lynx -dump URL > filename
```
If there are any links on the page, their URLs are listed at the end of the dumped output.

You can disable the list of links with -nolist. For example:
```
lynx -dump -nolist http://stackoverflow.com/a/10469619/724176 > filename
```
Author: Hugo

Here is the best way:

  public static string StripHTML(string HTMLText)
    {
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }

select link from google where query = "Html RegEx" limit 1 -> stackoverflow.com/questions/1732348

Author: Ashraf

Here is a class I developed to accomplish the same thing. All the available HTML parsing libraries were far too slow, and regex was far too slow as well. The functionality is explained in the comments in the code. In my benchmarks, this code is a little more than 10x faster than the equivalent HTML Agility Pack code when tested on the Amazon landing page (the equivalent code is included below).

///<summary>
///The fast HTML text extractor class is designed to, as quickly and as ignorantly as possible,
///extract text data from a given HTML character array. The class searches for and deletes
///script and style tags in a first and second pass, with an optional third pass to do the same
///to HTML comments, and then copies remaining non-whitespace character data to an output array.
///All whitespace encountered is replaced with a single space in order to avoid multiple
///whitespace in the output.
///
///Note that the returned text content still may have named character and numbered character
///references within that, when decoded, may produce multiple whitespace.
///</summary>
public class FastHtmlTextExtractor
{

    private readonly char[] SCRIPT_OPEN_TAG = new char[7] { '<', 's', 'c', 'r', 'i', 'p', 't' };
    private readonly char[] SCRIPT_CLOSE_TAG = new char[9] { '<', '/', 's', 'c', 'r', 'i', 'p', 't', '>' };

    private readonly char[] STYLE_OPEN_TAG = new char[6] { '<', 's', 't', 'y', 'l', 'e' };
    private readonly char[] STYLE_CLOSE_TAG = new char[8] { '<', '/', 's', 't', 'y', 'l', 'e', '>' };

    private readonly char[] COMMENT_OPEN_TAG = new char[3] { '<', '!', '-' };
    private readonly char[] COMMENT_CLOSE_TAG = new char[3] { '-', '-', '>' };

    private int[] m_deletionDictionary;

    public string Extract(char[] input, bool stripComments = false)
    {
        var len = input.Length;
        int next = 0;

        m_deletionDictionary = new int[len];

        //Wipe out all text content between style and script tags.
        FindAndWipe(SCRIPT_OPEN_TAG, SCRIPT_CLOSE_TAG, input);
        FindAndWipe(STYLE_OPEN_TAG, STYLE_CLOSE_TAG, input);

        if(stripComments)
        {
            //Wipe out everything between HTML comments.
            FindAndWipe(COMMENT_OPEN_TAG, COMMENT_CLOSE_TAG, input);
        }

        //Wipe all remaining tags now.
        while(next < len)
        {
            next = SkipUntil(next, '<', input);

            if(next < len)
            {
                var closeNext = SkipUntil(next, '>', input);

                if(closeNext < len)
                {
                    m_deletionDictionary[next] = (closeNext + 1) - next;
                    WipeRange(next, closeNext + 1, input);
                }

                next = closeNext + 1;
            }
        }

        //Collect all non-whitespace and non-null chars into a new
        //char array. All whitespace characters are skipped and replaced
        //with a single space char. Multiple whitespace is ignored.
        var lastSpace = true;
        var extractedPos = 0;
        var extracted = new char[len];

        for(next = 0; next < len; ++next)
        {
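            if(m_deletionDictionary[next] > 0)
            {
                //Jump to the last char of the deleted range; the loop's own
                //++next then lands on the first char after it.
                next += m_deletionDictionary[next] - 1;
                continue;
            }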

            if(char.IsWhiteSpace(input[next]) || input[next] == '\0')
            {
                if(lastSpace)
                {
                    continue;
                }

                extracted[extractedPos++] = ' ';
                lastSpace = true;
            }
            else
            {
                lastSpace = false;
                extracted[extractedPos++] = input[next];
            }
        }

        return new string(extracted, 0, extractedPos);
    }

    ///<summary>
    ///Does a search in the input array for the characters in the supplied open and closing tag
    ///char arrays. Each match where both tag open and tag close are discovered causes the text
    ///in between the matches to be overwritten by Array.Clear().
    ///</summary>
    ///<param name="openingTag">
    ///The opening tag to search for.
    ///</param>
    ///<param name="closingTag">
    ///The closing tag to search for.
    ///</param>
    ///<param name="input">
    ///The input to search in.
    ///</param>
    private void FindAndWipe(char[] openingTag, char[] closingTag, char[] input)
    {
        int len = input.Length;
        int pos = 0;

        do
        {
            pos = FindNext(pos, openingTag, input);

            if(pos < len)
            {
                var closenext = FindNext(pos, closingTag, input);

                if(closenext < len)
                {
                    m_deletionDictionary[pos - openingTag.Length] = closenext - (pos - openingTag.Length);
                    WipeRange(pos - openingTag.Length, closenext, input);
                }

                if(closenext > pos)
                {
                    pos = closenext;
                }
                else
                {
                    ++pos;
                }
            }
        }
        while(pos < len);
    }

    ///<summary>
    ///Skips as many characters as possible within the input array until the given char is
    ///found. The position of the first instance of the char is returned, or if not found, a
    ///position beyond the end of the input array is returned.
    ///</summary>
    ///<param name="pos">
    ///The starting position to search from within the input array.
    ///</param>
    ///<param name="c">
    ///The character to find.
    ///</param>
    ///<param name="input">
    ///The input to search within.
    ///</param>
    ///<returns>
    ///The position of the found character, or an index beyond the end of the input array.
    ///</returns>
    private int SkipUntil(int pos, char c, char[] input)
    {
        if(pos >= input.Length)
        {
            return pos;
        }

        do
        {
            if(input[pos] == c)
            {
                return pos;
            }

            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    ///<summary>
    ///Clears a given range in the input array.
    ///</summary>
    ///<param name="start">
    ///The start position from which the array will begin to be cleared.
    ///</param>
    ///<param name="end">
    ///The end position in the array, the position to clear up-until.
    ///</param>
    ///<param name="input">
    ///The source array wherein the supplied range will be cleared.
    ///</param>
    ///<remarks>
    ///Note that the second parameter is called end, not length. This parameter is meant to be
    ///a position in the array, not the amount of entries in the array to clear.
    ///</remarks>
    private void WipeRange(int start, int end, char[] input)
    {
        Array.Clear(input, start, end - start);
    }

    ///<summary>
    ///Finds the next occurrence of the supplied char array within the input array. This search
    ///ignores whitespace.
    ///</summary>
    ///<param name="pos">
    ///The position to start searching from.
    ///</param>
    ///<param name="what">
    ///The sequence of characters to find.
    ///</param>
    ///<param name="input">
    ///The input array to perform the search on.
    ///</param>
    ///<returns>
    ///The position of the end of the first matching occurrence. That is, the returned position
    ///points to the very end of the search criteria within the input array, not the start. If
    ///no match could be found, a position beyond the end of the input array will be returned.
    ///</returns>
    public int FindNext(int pos, char[] what, char[] input)
    {
        do
        {
            if(Next(ref pos, what, input))
            {
                return pos;
            }
            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    ///<summary>
    ///Probes the input array at the given position to determine if the next N characters
    ///matches the supplied character sequence. This check ignores whitespace.
    ///</summary>
    ///<param name="pos">
    ///The position at which to check within the input array for a match to the supplied
    ///character sequence.
    ///</param>
    ///<param name="what">
    ///The character sequence to attempt to match. Note that whitespace between characters
    ///within the input array is acceptable.
    ///</param>
    ///<param name="input">
    ///The input array to check within.
    ///</param>
    ///<returns>
    ///True if the next N characters within the input array matches the supplied search
    ///character sequence. Returns false otherwise.
    ///</returns>
    public bool Next(ref int pos, char[] what, char[] input)
    {
        int z = 0;

        do
        {
            if(char.IsWhiteSpace(input[pos]) || input[pos] == '\0')
            {
                ++pos;
                continue;
            }

            if(input[pos] == what[z])
            {
                ++z;
                ++pos;
                continue;
            }

            return false;
        }
        while(pos < input.Length && z < what.Length);

        return z == what.Length;
    }
}

Equivalent in HtmlAgilityPack:

//Where m_whitespaceRegex is a Regex with [\s].
//Where sampleHtmlText is a raw HTML string.

var extractedSampleText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(sampleHtmlText);

if(doc != null && doc.DocumentNode != null)
{
    foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
    {
        script.Remove();
    }

    foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
    {
        style.Remove();
    }

    var allTextNodes = doc.DocumentNode.SelectNodes("//text()");
    if(allTextNodes != null && allTextNodes.Count > 0)
    {
        foreach(HtmlNode node in allTextNodes)
        {
            extractedSampleText.Append(node.InnerText);
        }
    }

    var finalText = m_whitespaceRegex.Replace(extractedSampleText.ToString(), " ");
}
