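I'm looking to extract every instruction ID from this page, using seemingly legitimate XPath queries against content fetched with urllib2: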
import urllib2
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib2.urlopen(url)
content = response.read()
root = lh.fromstring(content)
all_instruction_ids = root.xpath(XPATH_ALL_INSTRUCTION_IDS)
I've tried countless XPaths given to me by Chrome's developer tools, Firebug, and other browser add-ons:
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/.'
#XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/text()'
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a[contains(normalize-space(), "")]'
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a'
XPATH_ALL_INSTRUCTION_IDS = ".//*[@id='content']/div/div/div[2]/table/tbody/tr[2]/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "//form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "id('content')/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "/html/body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]//a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/*/a"
None of them return the expected results when passed to the xpath() method of the element returned by lxml.html.fromstring().
On closer inspection, the image is also present in the live source (it's just not displayed), so it must be something else. – Pyderman
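One possible explanation, which the question does not confirm, is that every XPath above contains a tbody step. Browsers add tbody to the DOM they display even when the served HTML has none, whereas lxml parses the markup as it was sent. The sketch below illustrates this on a small made-up table; SAMPLE_HTML, its nesting, and the ID values are assumptions for illustration, not the real page.

```python
import lxml.html as lh

# Hypothetical markup standing in for the fetched page. The real page's
# structure is not shown in the question, and the IDs are invented.
SAMPLE_HTML = """<html><body>
<div id="content">
<table>
<tr><td><font><a href="/a">AB 12345</a></font></td><td>First title</td></tr>
<tr><td><font><a href="/b">CD 67890</a></font></td><td>Second title</td></tr>
</table>
</div>
</body></html>"""

root = lh.fromstring(SAMPLE_HTML)

# Browser-style path: includes the tbody step that a browser's DOM would show,
# but which is absent from this source markup.
with_tbody = root.xpath('//*[@id="content"]/table/tbody/tr/td[1]/font/a/text()')

# Same path with the tbody step removed, matching the markup as lxml parsed it.
without_tbody = root.xpath('//*[@id="content"]/table/tr/td[1]/font/a/text()')

# Looser path using descendant steps, which tolerates both structures.
tolerant = root.xpath('//*[@id="content"]//table//tr/td[1]//a/text()')

print(with_tbody)
print(without_tbody)
print(tolerant)
```

If this is the cause, dropping tbody from the copied paths or using the looser descendant form should be enough. If the results are still empty, printing part of content would show whether urllib2 received the same markup the browser did.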