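Web scraping: remove a word if it is in the first 20 characters of a document?

I scraped a bunch of speeches from http://www.millercenter.org. The speeches are trimmed and formatted just the way I want, except for one small thing. Every document (all 911 of them) has the word 'transcript' at the beginning, and I don't want it there because I am going on to do some NLP. I have not been able to remove it, and I have tried both the replace and remove methods. I even tried targeting the part of the HTML that appears at the start of each document, <h2>Transcript</h2>, by extending my find call.
Here is a sample of what I see in the files: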
transcript
to the senate and house of representatives
i lay before congress several dispatches from his
and
transcript
the period for a new election of a citizen to administer the executive government
Here is my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML (the div with id 'transcript');
# a third positional argument to find() is read as `recursive`, not as a second filter
chester_3752 = chester_3752.find('div', {'id': 'transcript'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')
Like I said, that final replace call does not seem to work. Thoughts?
Does the string always start with 'transcript'? – pelumi
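
For what the title asks (drop the word only when it sits within the first 20 characters), a minimal sketch could look like the following. It assumes the speech has already been extracted into a plain string, as in the code above. The function name strip_leading_word and its parameters word and window are made up for illustration and are not part of BeautifulSoup or any other library. It does not explain why the replace call above fails; it only shows one way to limit the removal to the start of the text.

```python
import re

def strip_leading_word(text, word='transcript', window=20):
    """Remove `word` only if its first occurrence ends within the
    first `window` characters of `text`; otherwise return `text` unchanged."""
    # Whole-word, case-insensitive match, so 'Transcript' is caught too.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    match = pattern.search(text)
    # The first match is the earliest one, so if it lies outside the
    # window there is nothing to remove at the start of the document.
    if match is None or match.end() > window:
        return text
    # Cut the word out and drop the whitespace left at the very start.
    return (text[:match.start()] + text[match.end():]).lstrip()

sample = 'transcript\nto the senate and house of representatives'
cleaned = strip_leading_word(sample)
```

Unlike a blanket replace('transcript', ''), this leaves any later mention of the word inside the speech untouched, which may matter for the NLP step.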