2017-05-06 78 views
1

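Quanteda: applying a multi-level Yoshikoder dictionary

I use quanteda for dictionary-based quantitative text analysis. I am building my own dictionary with Lowe's Yoshikoder. I can apply my Yoshikoder dictionary in quanteda (see below), but the function only takes the first level of the dictionary into account. I need to see the values for every category, including all subcategories (at least 4 levels). How can I do this?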

# load my Yoshikoder dictionary with multiple levels 
mydict <- dictionary(file = "mydictionary.ykd", 
format = "yoshikoder", concatenator = "_", tolower = TRUE, encoding = "auto") 

# apply dictionary 
mydfm <- dfm(mycorpus, dictionary = mydict) 
mydfm 
# problem: shows only results for the first level of the dictionary 

Answers

1

dfm_lookup (and tokens_lookup) have a levels argument that defaults to 1:5. Try applying the lookup as a separate step:

mydfm <- dfm(mycorpus) 
dfm_lookup(mydfm, dictionary = mydict) 

Or:

mytoks <- tokens(mycorpus) 
mytoks <- tokens_lookup(mytoks, dictionary = mydict) 
dfm(mytoks) 
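
To make the role of the levels argument concrete, here is a minimal sketch using a small in-memory nested dictionary. The dictionary keys, the patterns, and the sample sentence are invented for illustration and are not taken from the question's files; it assumes a quanteda version in which dfm_lookup handles nested dictionaries correctly.

```r
library(quanteda)

# Hypothetical two-level dictionary (illustration only)
toy_dict <- dictionary(list(
  economy = list(
    state  = c("tax*", "budget*"),
    market = c("privat*")
  ),
  culture = c("music*")
))

toy_toks <- tokens("The budget raises taxes, privatisation expands, and music thrives.")
toy_dfm  <- dfm(toy_toks)

# levels = 1 collapses every match into its top-level key
by_top <- dfm_lookup(toy_dfm, dictionary = toy_dict, levels = 1)

# levels = 1:2 keeps the second level as separate features
by_nested <- dfm_lookup(toy_dfm, dictionary = toy_dict, levels = 1:2)

featnames(by_top)
featnames(by_nested)
```

Comparing the two sets of feature names shows whether the nested keys survive the lookup, which is a quick check to run on your own dictionary after loading it.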

Update:

This is now fixed in v0.9.9.55.

> library(quanteda) 
# Loading required package: quanteda 
# quanteda version 0.9.9.55 
# Using 7 of 8 cores for parallel computing 

> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd") 
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE) 
# Creating a dfm from a corpus ... 
# ... tokenizing texts 
# ... lowercasing 
# ... found 14 documents, 5,058 features 
# ... applying a dictionary consisting of 19 keys 
# ... created a 14 x 19 sparse dfm 
# ... complete. 
# Elapsed time: 0.422 seconds. 

> mydict 
# Dictionary object with 9 primary key entries and 2 nested levels. 
# - Economy: 
#  - +State+: 
#  - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow* 
#  - =State=: 
#  - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work* 
#  - -State-: 
#  - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher* 
#  - Institutions: 
#  - Radical: 
#  - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice* 
#  - Neutral: 
#  - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote* 
#  - Conservative: 
#  - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike* 
#  - Values: 
#  - Liberal: 
#  - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex* 
#  - Conservative: 
#  - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition* 
#  - Law and Order: 
#  - Liberal: 
#  - harassment, non-custodial 
#  - Conservative: 
#  - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan* 
#  - Environment: 
#  - Pro: 
#  - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl* 
# 
#  ... 
+0

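Thank you for your help. I tried both of your suggestions. The first one runs, but it still applies the dictionary with only 2 keys (meaning only the first level, even if I set levels = 5). The second one results in the following error: "Error in qatd_cpp_tokens_lookup(x, keys, entries_id, keys_id, FALSE): not compatible with requested type." – Sera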

+0

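In any case, I think the problem lies in the earlier step (loading the dictionary): the dictionary is loaded as a list of only 2 elements ("2 keys"), so only the first level of the dictionary gets counted...? I also tried the Laver and Garry dictionary in Yoshikoder format, with the same problem. However, if the Laver and Garry dictionary is loaded in Wordstat format, all levels are taken into account... – Sera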

+0

Could you send the dictionary file and the dfm object so that we can test? –

1

While I fix this in quanteda, try this replacement function, which collapses the categories:

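library(quanteda)  # provides dictionary()
library(xml2)

# Read a Yoshikoder .ykd file and collapse its nested categories into flat
# keys, joining the category names along each path with `sep`.
read_dict_yoshikoder <- function(path, sep = ">") {
    doc <- xml2::read_xml(path)
    # every <pnode> element is a pattern (a dictionary value)
    pats <- xml2::xml_find_all(doc, ".//pnode")
    pnode_names <- xml2::xml_attr(pats, "name")
    # build the category path of one pattern from its named ancestors
    get_pnode_path <- function(pn) {
        pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
        paste0(rev(na.omit(pars)), collapse = sep)
    }
    pnode_paths <- lapply(pats, get_pnode_path)
    # group the patterns by their category path
    lst <- split(pnode_names, unlist(pnode_paths))
    dictionary(lst)
}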

Usage:

read_dict_yoshikoder("laver-garry-ajps.ykd") 

Dictionary object with 19 key entries. 
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre* 
- Laver and Garry>Culture>Popular: media 
- Laver and Garry>Culture>Sport: angler* 
- Laver and Garry>Environment>Con: produc* 
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl* 
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci* 

... 
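
The same collapsing idea can be tried without a .ykd file. The following is a minimal base-R sketch, assuming the categories are already available as a plain nested list; flatten_keys and the sample list are hypothetical names made up for this illustration and are not part of quanteda or Yoshikoder.

```r
# Recursively flatten a nested named list into a flat named list whose
# names are the category paths joined by `sep` (hypothetical helper).
flatten_keys <- function(x, prefix = NULL, sep = ">") {
  out <- list()
  for (key in names(x)) {
    path <- paste(c(prefix, key), collapse = sep)
    if (is.list(x[[key]])) {
      # a sub-category: descend and carry the path along
      out <- c(out, flatten_keys(x[[key]], prefix = path, sep = sep))
    } else {
      # a leaf: a character vector of patterns
      out[[path]] <- x[[key]]
    }
  }
  out
}

# Invented sample categories, loosely modelled on the output above
nested <- list(
  Culture     = list(High = c("art", "music*"), Popular = "media"),
  Environment = list(Pro = c("green", "recycl*"))
)

flat <- flatten_keys(nested)
names(flat)
```

The resulting flat list can then be passed to quanteda's dictionary(), giving one key per full category path.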
+0

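Thank you for providing this alternative solution. However, when running the function, there seems to be a problem with collapse = sep. The error says: Error in paste0(rev(na.omit(pars)), collapse = sep): promise already under evaluation: recursive default argument reference or earlier problems? Called from: paste0(rev(na.omit(pars)), collapse = sep) – Sera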

+0

Oops. It should be fixed now. @Sera – conjugateprior

+0

Thanks, this works as an alternative solution while the Yoshikoder reader in quanteda has not yet been fixed! – Sera