Quanteda: applying a Yoshikoder multi-level dictionary

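I am using quanteda for dictionary-based quantitative text analysis, and I am building my own dictionary with Lowe's Yoshikoder. I can apply my Yoshikoder dictionary in quanteda (see below), but the function only takes the first level of the dictionary into account. I need to see the values for every category, including all subcategories (at least 4 levels). How can I do that?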

# load my Yoshikoder dictionary with multiple levels 
mydict <- dictionary(file = "mydictionary.ykd",
                     format = "yoshikoder", concatenator = "_",
                     tolower = TRUE, encoding = "auto")

# apply dictionary 
mydfm <- dfm(mycorpus, dictionary = mydict) 
mydfm 
# problem: shows only results for the first level of the dictionary


2017-05-06 Sera

dfm_lookup (and tokens_lookup) have a levels argument that defaults to 1:5. Try applying the lookup as a separate step:

mydfm <- dfm(mycorpus) 
dfm_lookup(mydfm, dictionary = mydict)

Or:

mytoks <- tokens(mycorpus) 
mytoks <- tokens_lookup(mytoks, dictionary = mydict) 
dfm(mytoks)
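
To illustrate what the levels argument controls, here is a minimal sketch. It assumes a recent quanteda release, where a dfm is built from a tokens object, and it uses a small made-up nested dictionary and two made-up texts (dict, txt, and the key names are illustrative, not taken from the question).

```r
library(quanteda)

# Hypothetical two-level dictionary, for illustration only
dict <- dictionary(list(
  economy = list(
    state  = c("tax*", "welfare"),
    market = c("privat*", "compet*")
  ),
  environment = list(pro = c("green", "recycl*"))
))

# Hypothetical example texts
txt <- c(d1 = "Taxes fund welfare while competition drives privatisation.",
         d2 = "Green policies encourage recycling.")
toks <- tokens(txt, remove_punct = TRUE)

# levels = 1 collapses every subcategory into its top-level key
m1 <- dfm(tokens_lookup(toks, dictionary = dict, levels = 1))

# levels = 1:2 keeps the first two levels as separate keys
m2 <- dfm(tokens_lookup(toks, dictionary = dict, levels = 1:2))

featnames(m1)
featnames(m2)
```

Comparing the two sets of feature names shows whether the lower levels of a dictionary were actually read: if both calls return the same keys, the nesting was probably lost when the dictionary was loaded.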

Update:

This is now fixed in v0.9.9.55.

> library(quanteda) 
# Loading required package: quanteda 
# quanteda version 0.9.9.55 
# Using 7 of 8 cores for parallel computing 

> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd") 
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE) 
# Creating a dfm from a corpus ... 
# ... tokenizing texts 
# ... lowercasing 
# ... found 14 documents, 5,058 features 
# ... applying a dictionary consisting of 19 keys 
# ... created a 14 x 19 sparse dfm 
# ... complete. 
# Elapsed time: 0.422 seconds. 

> mydict 
# Dictionary object with 9 primary key entries and 2 nested levels. 
# - Economy:
#   - +State+:
#     - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow*
#   - =State=:
#     - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work*
#   - -State-:
#     - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher*
# - Institutions:
#   - Radical:
#     - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice*
#   - Neutral:
#     - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote*
#   - Conservative:
#     - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike*
# - Values:
#   - Liberal:
#     - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex*
#   - Conservative:
#     - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition*
# - Law and Order:
#   - Liberal:
#     - harassment, non-custodial
#   - Conservative:
#     - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan*
# - Environment:
#   - Pro:
#     - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
# 
#  ...
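
Since the fix is tied to a specific release, it can help to confirm which version is installed before re-running the lookup. A small sketch using base R's version helpers; the variable names fixed_in and has_fix are illustrative.

```r
# Version in which the answer reports the Yoshikoder reader as fixed
fixed_in <- package_version("0.9.9.55")

# Only query the installed version if quanteda is actually available
if (requireNamespace("quanteda", quietly = TRUE)) {
  has_fix <- utils::packageVersion("quanteda") >= fixed_in
  message("Installed quanteda includes the fix: ", has_fix)
} else {
  message("quanteda is not installed")
}
```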


2017-05-06 22:20:36

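Thanks for your help. I tried both of your suggestions. The first one runs, but it still applies the dictionary with only 2 keys (meaning only the first level, even if I set levels = 5). The second one results in the following error: "Error in qatd_cpp_tokens_lookup(x, keys, entries_id, keys_id, FALSE) : not compatible with requested type." – Sera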

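In any case, I think the problem lies in the earlier step (loading the dictionary), because the dictionary is loaded as a list of only 2 elements ("2 keys"), so only the first level of the dictionary gets counted...? I also tried the Laver and Garry dictionary in Yoshikoder format, with the same problem. However, if the Laver and Garry dictionary is loaded in Wordstat format, all levels are taken into account... – Sera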
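
One way to check whether nesting survived loading is to measure the depth of the dictionary's underlying list. The sketch below uses a hypothetical base R helper, list_depth, and a made-up nested list; for a real quanteda dictionary the assumption is that as.list(mydict) returns the nested list to pass in.

```r
# Hypothetical helper: number of nested list levels above the value vectors
list_depth <- function(x) {
  if (!is.list(x) || length(x) == 0) return(0L)
  1L + max(vapply(x, list_depth, integer(1)))
}

# Made-up two-level structure standing in for a loaded dictionary
nested <- list(
  Economy = list(State = c("tax*", "welfare"), Market = "compet*"),
  Culture = list(High = "opera*")
)

list_depth(nested)          # two levels of keys
list_depth(list(a = "x"))   # a flat dictionary has one level

# With quanteda loaded (assumption): list_depth(as.list(mydict))
```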

Could you send the dictionary file and the dfm object so that we can test? –

While I fix it in quanteda, try this replacement function, which collapses the nested categories into single keys:

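library(quanteda)  # provides dictionary()
library(xml2)

# Read a Yoshikoder .ykd file, collapsing each nested category path
# into a single key such as "Parent>Child>Grandchild".
read_dict_yoshikoder <- function(path, sep = ">") {
    doc <- xml2::read_xml(path)
    # Each pattern (word or glob) is stored as a <pnode> element.
    pats <- xml2::xml_find_all(doc, ".//pnode")
    pnode_names <- xml2::xml_attr(pats, "name")
    # Build a pattern's key from the names of its ancestors, outermost first.
    get_pnode_path <- function(pn) {
        pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
        paste0(rev(na.omit(pars)), collapse = sep)
    }
    pnode_paths <- lapply(pats, get_pnode_path)
    # Group the patterns by collapsed path and build a flat dictionary.
    lst <- split(pnode_names, unlist(pnode_paths))
    dictionary(lst)
}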

Usage:

read_dict_yoshikoder("laver-garry-ajps.ykd") 

Dictionary object with 19 key entries. 
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre* 
- Laver and Garry>Culture>Popular: media 
- Laver and Garry>Culture>Sport: angler* 
- Laver and Garry>Environment>Con: produc* 
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl* 
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci* 

...
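
To try the path-collapsing idea without a .ykd file at hand, here is a self-contained sketch that applies the same steps to an inline XML string. The XML is a simplified, made-up stand-in for the Yoshikoder format: it assumes categories are cnode elements and patterns are pnode elements, each with a name attribute. The category names and patterns are illustrative only.

```r
library(xml2)

# Simplified, hypothetical stand-in for the contents of a .ykd file
ykd <- '<dictionary>
  <cnode name="Demo">
    <cnode name="Culture">
      <cnode name="High"><pnode name="art"/><pnode name="opera*"/></cnode>
      <cnode name="Popular"><pnode name="media"/></cnode>
    </cnode>
  </cnode>
</dictionary>'

doc  <- read_xml(ykd)
pats <- xml_find_all(doc, ".//pnode")

# For each pattern, join the names of its ancestors, outermost first
paths <- unlist(lapply(pats, function(pn) {
  pars <- xml_attr(xml_parents(pn), "name")
  paste(rev(pars[!is.na(pars)]), collapse = ">")
}))

# One list element per collapsed category path
lst <- split(xml_attr(pats, "name"), paths)
str(lst)
```

The resulting flat list can be passed to quanteda's dictionary(), as the function above does. The trade-off is that the hierarchy then lives only in the key names, so totals for a parent category have to be computed by matching on the key prefix.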


2017-05-07 19:00:23 conjugateprior

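Thank you for this alternative solution. However, when running the function, there seems to be a problem with collapse = sep. The error says: Error in paste0(rev(na.omit(pars)), collapse = sep) : promise already under evaluation: recursive default argument reference or earlier problems? Called from: paste0(rev(na.omit(pars)), collapse = sep) – Sera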

Oops. It should be fixed now. @Sera – conjugateprior

Thanks, this works as an alternative solution while the Yoshikoder reader in quanteda has not been fixed yet! – Sera
