
A Simple Approach to Extracting the Main Text from HTML Documents

 

Translator's note: This article describes a broadly applicable method for extracting the genuinely useful main text from different kinds of HTML documents. Its function is similar to CSDN's recently launched 「剪影」: it removes irrelevant content such as headers, footers and sidebars, which makes it very practical. The method is simple, effective and somewhat unexpected; after reading it you may well exclaim that it could be done this way! The writing is clear and easy to follow, and although it uses an artificial neural network, FANN's clean wrapper means readers don't need to understand ANNs. All the examples are written in Python, which makes the article very readable, with a popular-science feel. Well worth a read.

You've finally got your hands on the diverse collection of HTML documents you needed. But the content you're interested in is hidden amidst adverts, layout tables or formatting markup, and various other links. Even worse, there's visible text in the menus, headers and footers that you want to filter out. If you don't want to write a complex scraping program for each type of HTML file, there is a solution.


This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…


Do you want to find out how statistics and machine learning can save you time and effort mining text?

The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn't a novel idea, but it works!) The basic process works as follows:


Parse the HTML code and keep track of the number of bytes processed.


Store the text output on a per-line, or per-paragraph basis.


Associate with each text line the number of bytes of HTML required to describe it.


Compute the text density of each line by calculating the ratio of text to bytes.


Then decide if the line is part of the content by using a neural network.


You can get pretty good results just by checking if the line's density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning — not to mention that it's easier to implement!

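To see the core test in isolation, here is a minimal sketch of the fixed-threshold idea; the 0.5 cutoff is just an example value, and the full version appears later in LineWriter.output():

def keep_line(text_length, html_bytes, threshold=0.5):
    # Density is the ratio of visible text to the HTML needed to produce it.
    density = text_length / float(html_bytes)
    # Lines denser than the threshold are treated as real content.
    return density > threshold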

Let's take it from the top…


Converting the HTML to Text


What you need is the core of a text-mode browser, which is already set up to read files with HTML markup and display raw text. By reusing existing code, you won't have to spend too much time handling invalid XML documents, which are very common, as you'll quickly realise.


As a quick example, we'll be using Python along with a few built-in modules: htmllib for the parsing and formatter for outputting formatted text. This is what the top-level function looks like:


import htmllib, formatter, StringIO
from formatter import AbstractFormatter

def extract_text(html):
    # Derive from formatter.AbstractWriter to store paragraphs.
    writer = LineWriter()
    # Default formatter sends commands to our writer.
    formatter = AbstractFormatter(writer)
    # Derive from htmllib.HTMLParser to track parsed bytes.
    parser = TrackingParser(writer, formatter)
    # Give the parser the raw HTML data.
    parser.feed(html)
    parser.close()
    # Filter the paragraphs stored and output them.
    return writer.output()

The TrackingParser itself overrides the callback functions for parsing start and end tags, as they are given the current parse index in the buffer. You don't have access to that normally, unless you start diving into frames in the call stack — which isn't the best approach! Here's what the class looks like:


class TrackingParser(htmllib.HTMLParser):
    """Try to keep accurate pointer of parsing location."""

    def __init__(self, writer, *args):
        htmllib.HTMLParser.__init__(self, *args)
        self.writer = writer

    def parse_starttag(self, i):
        index = htmllib.HTMLParser.parse_starttag(self, i)
        self.writer.index = index
        return index

    def parse_endtag(self, i):
        self.writer.index = i
        return htmllib.HTMLParser.parse_endtag(self, i)

The LineWriter class does the bulk of the work when called by the default formatter. If you have any improvements or changes to make, most likely they'll go here. This is where we'll put our machine learning code in later. But you can keep the implementation rather simple and still get good results. Here's the simplest possible code:


class Paragraph:
    def __init__(self):
        self.text = ''
        self.bytes = 0
        self.density = 0.0

class LineWriter(formatter.AbstractWriter):
    def __init__(self, *args):
        self.last_index = 0
        # Current position in the parse buffer, updated by TrackingParser.
        self.index = 0
        self.lines = [Paragraph()]
        formatter.AbstractWriter.__init__(self)

    def send_flowing_data(self, data):
        # Work out the length of this text chunk.
        t = len(data)
        # We've parsed more text, so increment index.
        self.index += t
        # Calculate the number of bytes since last time.
        b = self.index - self.last_index
        self.last_index = self.index
        # Accumulate this information in current line.
        l = self.lines[-1]
        l.text += data
        l.bytes += b

    def send_paragraph(self, blankline):
        """Create a new paragraph if necessary."""
        if self.lines[-1].text == '':
            return
        self.lines[-1].text += '\n' * (blankline + 1)
        self.lines[-1].bytes += 2 * (blankline + 1)
        self.lines.append(Paragraph())

    def send_literal_data(self, data):
        self.send_flowing_data(data)

    def send_line_break(self):
        self.send_paragraph(0)

This code doesn't do any outputting yet; it just gathers the data. We now have a bunch of paragraphs in an array, we know their length, and we know roughly how many bytes of HTML were necessary to create them. Let's see what emerges from our statistics.

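To peek at the gathered numbers, a short driver like the following will do; 'article.html' is a placeholder for any saved page, and it assumes the classes above are in scope:

writer = LineWriter()
parser = TrackingParser(writer, AbstractFormatter(writer))
parser.feed(open('article.html').read())
parser.close()
# Print the text length and HTML byte count of every stored paragraph.
for l in writer.lines:
    print len(l.text), l.bytes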

Examining the Data


Luckily, there are some patterns in the data. In the raw output below, you'll notice there are definite spikes in the number of HTML bytes required to encode lines of text, notably around the title, both sidebars, headers and footers.


While the number of HTML bytes spikes in places, it remains below average for quite a few lines. On these lines, the text output is rather high. Calculating the density of text to HTML bytes gives us a better understanding of this relationship.

The patterns are more obvious in this density value, so it gives us something concrete to work with.

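As a made-up illustration of the kind of values involved: a paragraph of body text might yield 250 characters of output from about 280 bytes of markup, while a navigation line might yield 15 characters from 300 bytes of link markup.

# Invented numbers, purely for illustration.
content_density = 250 / 280.0   # roughly 0.89: mostly text, little markup
sidebar_density = 15 / 300.0    # 0.05: mostly markup, very little text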

Filtering the Lines


The simplest way we can filter lines now is by comparing the density to a fixed threshold, such as 50% or the average density. Finishing the LineWriter class:


    def compute_density(self):
        """Calculate the density for each line, and the average."""
        total = 0.0
        for l in self.lines:
            # Guard against the empty trailing paragraph (zero bytes).
            l.density = len(l.text) / float(l.bytes) if l.bytes else 0.0
            total += l.density
        # Store for optional use by the neural network.
        self.average = total / float(len(self.lines))

    def output(self):
        """Return a string with the useless lines filtered out."""
        self.compute_density()
        output = StringIO.StringIO()
        for l in self.lines:
            # Check density against threshold.
            # Custom filter extensions go here.
            if l.density > 0.5:
                output.write(l.text)
        return output.getvalue()

This rough filter typically gets most of the lines right. All the header, footer and sidebar text is usually stripped, as long as it's not too long. However, if there are long copyright notices, comments, or descriptions of other stories, then those are output too. Also, if there are short lines around inline graphics or adverts within the text, these are not output.


To fix this, we need a more complex filtering heuristic. But instead of spending days working out the logic manually, we'll just grab loads of information about each line and use machine learning to find patterns for us.


Supervised Machine Learning


Here's an example of an interface for tagging lines of text as content or not:


The idea of supervised learning is to provide examples for an algorithm to learn from. In our case, we give it a set of documents that were tagged by humans, so we know which lines must be output and which lines must be filtered out. For this we'll use a simple neural network known as the perceptron. It takes floating point inputs, filters the information through weighted connections between "neurons", and outputs another floating point number. Roughly speaking, the number of neurons and layers affects the ability to approximate functions precisely; we'll use both single-layer perceptrons (SLP) and multi-layer perceptrons (MLP) for prototyping.

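To make the "weighted connections" concrete, here is a minimal sketch of what a single-layer perceptron computes; the weights, bias and inputs below are made up, and the real training is left to FANN later in the article:

import math

def slp(inputs, weights, bias):
    # Weighted sum of the inputs, squashed into the 0..1 range by a sigmoid.
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

# Made-up weights for the density, HTML bytes and text length of one line.
print slp([0.85, 120, 200], [4.0, -0.01, 0.005], -1.5)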

To get the neural network to learn, we need to gather some data. This is where the earlier LineWriter.output() function comes in handy; it gives us a central point to process all the lines at once and make a global decision about which lines to output. Starting with intuition and experimenting a bit, we discover that the following data is useful for deciding how to filter a line (a sketch of how these features can be gathered follows the list):


Density of the current line.


Number of HTML bytes of the line.


Length of output text for this line.


These three values for the previous line,


… and the same for the next line.

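Here is a sketch of how such a feature vector could be assembled from the Paragraph objects stored in LineWriter; the helper name and the zero padding at the document edges are my own choices rather than part of the original code:

def line_features(lines, i):
    """Nine inputs: density, HTML bytes and text length for the
    previous, current and next line (zeros past the document edges)."""
    features = []
    for j in (i - 1, i, i + 1):
        if 0 <= j < len(lines):
            l = lines[j]
            features.extend([l.density, float(l.bytes), float(len(l.text))])
        else:
            # Pad with zeros before the first line and after the last one.
            features.extend([0.0, 0.0, 0.0])
    return features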

For the implementation, we'll be using Python to interface with FANN, the Fast Artificial Neural Network Library. The essence of the learning code goes like this:


from pyfann import fann, libfann

# This creates a new single-layer perceptron with 1 output and 3 inputs.
obj = libfann.fann_create_standard_array(2, (3, 1))
ann = fann.fann_class(obj)

# Load the data we described above.
patterns = fann.read_train_from_file('training.txt')
ann.train_on_data(patterns, 1000, 1, 0.0)

# Then test it with different data.
for datin, datout in validation_data:
    result = ann.run(datin)
    print 'Got:', result, ' Expected:', datout

Trying out different data and different network structures is a rather mechanical process. Don't use too many neurons or you may train too well for the set of documents you have (overfitting); conversely, try to have enough to solve the problem well. Here are the results, varying the number of lines used (1L-3L) and the number of attributes per line (1A-3A):

The interesting thing to note is that 0.5 is already a pretty good guess at a fixed threshold (see the first set of columns). The learning algorithm cannot find a much better solution when comparing the density alone (1 Attribute in the second column). With 3 Attributes, the next SLP does better overall, though it gets more false negatives. Using multiple lines also increases the performance of the single layer perceptron (fourth set of columns). And finally, using a more complex neural network structure works best overall, making 80% fewer errors in filtering the lines.


Note that you can tweak how the error is calculated if you want to punish false positives more than false negatives.


Conclusion

Extracting text from arbitrary HTML files doesn't necessarily require scraping the file with custom code. You can use statistics to get pretty amazing results, and machine learning to get even better ones. By tweaking the threshold, you can avoid the worst false positives that pollute your text output. But it's not so bad in practice; where the neural network makes mistakes, even humans have trouble classifying those lines as "content" or not.


Now all you have to figure out is what to do with that clean text content!


This article comes from a CSDN blog. Please credit the source when reposting: http://blog.csdn.net/lanphaday/archive/2007/08/13/1741185.aspx