2015-04-02 125 views
0

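Extracting the text between each pair of quotation marks from a file in Notepad++

My file contains more than 2000 abstracts with more than 18000 sentences in total, each beginning with an opening tag and ending with a closing tag. I want to extract information from it using Notepad++. A schematic of my file is as follows: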

<abstract> 
<sentence>Activation of the <cons lex="CD28_surface_receptor" sem="G#protein_family_or_group"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> surface receptor</cons> provides a major costimulatory signal for <cons lex="T_cell_activation" sem="G#other_name">T cell activation</cons> resulting in enhanced production of <cons lex="interleukin-2" sem="G#protein_molecule">interleukin-2</cons> (<cons lex="IL-2" sem="G#protein_molecule">IL-2</cons>) and <cons lex="cell_proliferation" sem="G#other_name">cell proliferation</cons>.</sentence>
<sentence>In <cons lex="primary_T_lymphocyte" sem="G#cell_type">primary T lymphocytes</cons> we show that <cons lex="CD28" sem="G#protein_molecule">CD28</cons> ligation leads to the rapid intracellular formation of <cons lex="reactive_oxygen_intermediate" sem="G#inorganic">reactive oxygen intermediates</cons> (<cons lex="ROI" sem="G#inorganic">ROIs</cons>) which are required for <cons lex="CD28-mediated_activation" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons>-mediated activation</cons> of the <cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons>/<cons lex="CD28-responsive_complex" sem="G#protein_complex"><cons lex="CD28" sem="G#protein_molecule">CD28</cons>-responsive complex</cons> and <cons lex="IL-2_expression" sem="G#other_name"><cons lex="IL-2" sem="G#protein_molecule">IL-2</cons> expression</cons>.</sentence>
<sentence>Delineation of the <cons lex="CD28_signaling_cascade" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> signaling cascade</cons> was found to involve <cons lex="protein_tyrosine_kinase_activity" sem="G#other_name"><cons lex="protein_tyrosine_kinase" sem="G#protein_family_or_group">protein tyrosine kinase</cons> activity</cons>, followed by the activation of <cons lex="phospholipase_A2" sem="G#protein_molecule">phospholipase A2</cons> and <cons lex="5-lipoxygenase" sem="G#protein_molecule">5-lipoxygenase</cons>.</sentence>
<sentence>Our data suggest that <cons lex="lipoxygenase_metabolite" sem="G#protein_family_or_group"><cons lex="lipoxygenase" sem="G#protein_molecule">lipoxygenase</cons> metabolites</cons> activate <cons lex="ROI_formation" sem="G#other_name"><cons lex="ROI" sem="G#inorganic">ROI</cons> formation</cons> which then induce <cons lex="IL-2" sem="G#protein_molecule">IL-2</cons> expression via <cons lex="NF-kappa_B_activation" sem="G#other_name"><cons lex="NF-kappa_B" sem="G#protein_molecule">NF-kappa B</cons> activation</cons>.</sentence>
<sentence>These findings should be useful for <cons lex="therapeutic_strategies" sem="G#other_name">therapeutic strategies</cons> and the development of <cons lex="immunosuppressants" sem="G#other_name">immunosuppressants</cons> targeting the <cons lex="CD28_costimulatory_pathway" sem="G#other_name"><cons lex="CD28" sem="G#protein_molecule">CD28</cons> costimulatory pathway</cons>.</sentence>
</abstract> 

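I want to extract the text between the quotation marks. In other words, I want to delete everything except the text inside double quotes, throughout the whole file. For example, my expected output looks like this: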

CD28_surface_receptor G#protein_family_or_group CD28 G#protein_molecule 
primary_T_lymphocyte G#cell_type 

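I used .*"(.*)".* in Find what and then Replace All with \1. That extracts only the last quoted string on each line, but I want to extract every quoted string on every line of the document, because my file has many more strings in double quotes.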
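The behaviour described above comes from greedy matching: the leading .* consumes as much of the line as it can, so only the final pair of quotes is left for the capture group. The following is a minimal sketch that reproduces this with Python's re module rather than Notepad++ itself; the sample line is taken from the excerpt above, and the variable names are illustrative only.

```python
import re

# One annotated element from the excerpt, with two quoted attribute values.
line = '<cons lex="CD28" sem="G#protein_molecule">CD28</cons>'

# The leading greedy .* backtracks only as far as it must, so the opening
# quote that gets matched is the second-to-last one on the line.
greedy = re.sub(r'.*"(.*)".*', r'\1', line)

print(greedy)  # only the last quoted value survives
```

Here the first quoted value, CD28, is swallowed by the leading .* and never reaches the capture group.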

+1

Why are you posting a duplicate? http://stackoverflow.com/questions/29409502/extracting-text-between-quotation-marks-in-notepad – deceze 2015-04-02 12:32:24

+0

I got logged out and can't remember my password – 2015-04-02 12:34:08

+0

This problem of mine has not been solved yet – 2015-04-02 12:35:17

Answers

3

You can use [^"]*"([^"]+)"[^"]* in Find what and replace with \1\r\n, which puts each quoted string on its own line.


Or, to make them tab-separated, replace with \1\t instead.

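For readers who want to check the pattern outside Notepad++, here is a minimal sketch of the same find-and-replace using Python's re module. The function name extract_quoted and its sep parameter are hypothetical helpers introduced for illustration, and "\n" stands in for the \r\n used in the Notepad++ replacement.

```python
import re

# Same pattern as in the answer: skip non-quote text, capture one quoted
# string, then skip the non-quote text that follows it.
PATTERN = re.compile(r'[^"]*"([^"]+)"[^"]*')


def extract_quoted(text: str, sep: str = "\n") -> str:
    """Replace every match with its captured group followed by sep."""
    return PATTERN.sub(lambda m: m.group(1) + sep, text)


sample = '<cons lex="CD28" sem="G#protein_molecule">CD28</cons>'
print(extract_quoted(sample))        # one value per line
print(extract_quoted(sample, "\t"))  # tab-separated values
```

Because [^"] also matches line breaks, the pattern runs across line boundaries, so the original line structure of the file is not preserved in the output.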

+0

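Thanks, this worked great for me, I had a similar problem. One question: if I want the output to include the quotation marks, how would I change the regex? Edit: using "\1"\r\n works, wow, regex is simple! ... – gakera 2015-05-13 15:14:23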

+1

Either add them to the replacement string where they should go, or move them inside the '(...)' capture group: '[^"]*("[^"]+")[^"]*' – 2015-05-13 15:15:32

+0

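Thanks :D I upvoted this question, even though it's a bit Engrishian (Indian Engrish). I wanted to ask a similar question, but this poor guy has already been hammered for asking a duplicate :P – gakera 2015-05-13 15:19:28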
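The two ways of keeping the quotation marks mentioned in the comments above can be compared side by side. This is a small sketch in Python's re module, not Notepad++; the sample text and the names v1 and v2 are made up for illustration.

```python
import re

text = 'a lex="CD28" sem="G#cell_type" b'

# Variant 1: keep the original capture group and add the quotes back
# in the replacement (the "\1"\r\n idea).
v1 = re.sub(r'[^"]*"([^"]+)"[^"]*', lambda m: '"' + m.group(1) + '"\n', text)

# Variant 2: move the quotes inside the capture group, so the replacement
# stays a plain group reference.
v2 = re.sub(r'[^"]*("[^"]+")[^"]*', lambda m: m.group(1) + '\n', text)

print(v1)
print(v2)
```

Both variants give the same output here; which one to use is mostly a matter of whether it is easier to edit the pattern or the replacement string.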