Find and replace based on unknown characters

I've been stumped trying to find a way to do a position-based find and replace in Python. Basically, what I'm looking to do is go into a document and replace

<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>

with

<gco:DateTime>2016-04-20T11:27:34</gco:DateTime>

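Everything after the decimal point must be removed. The catch is that this applies to multiple timestamps in the XML file, and each timestamp is completely different. I've read a little about regular expressions, and they seem like a possible approach. Any help would be greatly appreciated.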

Edit: example of the XML file format:

<?xml version="1.0" encoding="utf-8"?> 
<?xml-stylesheet type='text/xsl' href='http://ngis/ngis/metadata/StyleSheet/xslt/nGIS_Metadata.xslt'?> 
<gmd:MD_Metadata xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:gfc="http://www.isotc211.org/2005/gfc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gmi="http://www.isotc211.org/2005/gmi" xmlns:gmd="http://www.isotc211.org/2005/gmd"> 
    <gmd:fileIdentifier> 
     <gco:CharacterString>BF244A7CB62491BC74B001BE5DEAA213AAFB9DBA</gco:CharacterString> 
    </gmd:fileIdentifier> 
    <gmd:language> 
     <gco:CharacterString>English</gco:CharacterString> 
       <gmd:date> 
       <gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime> 
       </gmd:date>

@Parfait

Source

2016-06-07 MapZombie

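Regular expressions will solve this and other similar problems, and you should keep reading up on them. In this specific case, parsing and formatting the date is also a good approach. –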

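I would further caution you against doing too much XML processing without a library such as 'lxml' or 'ElementTree' to actually parse it into a proper tree, though you may get away with it if all your transformations are as uncomplicated as this one. – holdenweb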

It cannot be stressed enough (this is perhaps the highest-voted SO answer): [do not regex HTML/XML files](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Parfait
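The advice in these comments (parse the XML into a tree, then edit the element text) can be sketched with the standard library's ElementTree. This is only an illustration under stated assumptions: the helper name `trim_fractional_seconds` and the second timestamp in the sample are made up for the example, and the namespace URI is taken from the sample file in the question. Note that the standard-library parser discards processing instructions such as the xml-stylesheet line by default, so a full round trip of the real file may be better served by lxml.

```python
import xml.etree.ElementTree as ET

# Namespace URI for the gco prefix, copied from the sample file in the question.
GCO_NS = "http://www.isotc211.org/2005/gco"


def trim_fractional_seconds(xml_text):
    """Parse xml_text and cut every gco:DateTime value at its first period.

    Hypothetical helper for illustration; returns the parsed root element.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter("{%s}DateTime" % GCO_NS):
        # Leave elements alone if they are empty or have no fractional part.
        if elem.text and "." in elem.text:
            elem.text = elem.text.split(".", 1)[0]
    return root


# Small stand-in document; the second timestamp is invented for the example.
sample = (
    '<gmd:date xmlns:gmd="http://www.isotc211.org/2005/gmd" '
    'xmlns:gco="http://www.isotc211.org/2005/gco">'
    "<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>"
    "<gco:DateTime>2015-01-02T03:04:05.1-07:00</gco:DateTime>"
    "</gmd:date>"
)

root = trim_fractional_seconds(sample)
trimmed = [e.text for e in root.iter("{%s}DateTime" % GCO_NS)]
print(trimmed)
```

Because the lookup is by qualified element name, periods elsewhere in the document (for example in namespace URLs) are never touched.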

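Consider XSLT (a special-purpose declarative language designed to transform XML documents), which has a very handy function (shared with its sibling, XPath), substring-before(), that lets you extract the data before the period that introduces the timestamp's fractional seconds. Python's lxml module can run XSLT 1.0 scripts.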

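The script below parses both the XML and XSLT files. Specifically, the XSLT runs the identity transform to copy the document as is, and then extracts the date and time, up to the period, from every <gco:DateTime>. Note that only the gco namespace needs to be defined in the XSLT header: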

XSLT script (save externally as an .xsl file to be referenced in Python)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" 
       xmlns:gco="http://www.isotc211.org/2005/gco"> 
<xsl:output version="1.0" encoding="UTF-8" indent="yes" /> 
<xsl:strip-space elements="*"/> 

    <!-- Identity Transform --> 
    <xsl:template match="@*|node()"> 
    <xsl:copy> 
     <xsl:apply-templates select="@*|node()"/> 
    </xsl:copy> 
    </xsl:template> 

    <xsl:template match="gco:DateTime"> 
    <xsl:copy> 
     <xsl:copy-of select="substring-before(., '.')"/>     
    </xsl:copy> 
    </xsl:template> 

</xsl:transform>

Python script

import lxml.etree as ET 

# LOAD XML AND XSL 
dom = ET.parse('Input.xml') 
xslt = ET.parse('XSLTScript.xsl') 

# TRANSFORM XML 
transform = ET.XSLT(xslt) 
newdom = transform(dom) 

# CONVERT TO STRING 
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True) 

# OUTPUT TREE TO FILE (tostring with an encoding returns bytes, so open in binary write mode)
with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)

Output

<?xml version="1.0"?> 
<?xml-stylesheet type='text/xsl' href='http://ngis/ngis/metadata/StyleSheet/xslt/nGIS_Metadata.xslt'?><gmd:MD_Metadata xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:gfc="http://www.isotc211.org/2005/gfc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gmi="http://www.isotc211.org/2005/gmi" xmlns:gmd="http://www.isotc211.org/2005/gmd"> 
    <gmd:fileIdentifier> 
    <gco:CharacterString>BF244A7CB62491BC74B001BE5DEAA213AAFB9DBA</gco:CharacterString> 
    </gmd:fileIdentifier> 
    <gmd:language> 
    <gco:CharacterString>English</gco:CharacterString> 
    <gmd:date> 
     <gco:DateTime>2016-04-20T11:27:34</gco:DateTime> 
    </gmd:date> 
    </gmd:language> 
</gmd:MD_Metadata>

Source

2016-06-08 00:42:34 Parfait

Thanks Parfait, this is awesome. Really appreciate it! – MapZombie

My files all begin with <?xml version="1.0" encoding="utf-8"?> <?xml-stylesheet type='text/xsl' href='http://xxxxx.com'?> – MapZombie

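Please post a snippet of the actual XML (with all of its headers, since you have a namespace, 'gco', that should be defined). You should not start from the third line. – Parfait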

One way:

s = "<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>" 
split_on_dot = s.split('.') 
split_on_angle = split_on_dot[1].split('<') 
new_s = "".join([split_on_dot[0], "<", split_on_angle[1]]) 

>>> new_s 
'<gco:DateTime>2016-04-20T11:27:34</gco:DateTime>' 
>>>

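This relies on the period being the only period in the input string. I'm not good with regular expressions. I think they are overused, but I'm sure someone will show you how to do it with a regex. Just remember that Python itself has great string manipulation.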
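For completeness, here is a hedged sketch of the regular-expression route mentioned in this answer. It is not code from any of the posters: the pattern and the helper name `strip_fractions` are assumptions made for illustration. The pattern keys on the shape of the timestamp (date, 'T', time, then a period and digits with an optional UTC offset), so other periods in the file, such as those in namespace URLs, are left alone. It treats the file as plain text, which the comments on the question caution against, and it would rewrite any matching timestamp, not only those inside <gco:DateTime>.

```python
import re

# Assumed timestamp shape: YYYY-MM-DDTHH:MM:SS, fractional seconds, optional zone.
TIMESTAMP = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # group 1: the part to keep
    r"\.\d+"                                   # fractional seconds to drop
    r"(?:Z|[+-]\d{2}:\d{2})?"                  # optional 'Z' or +hh:mm / -hh:mm
)


def strip_fractions(text):
    """Drop fractional seconds and zone offset from every matching timestamp."""
    return TIMESTAMP.sub(r"\1", text)


line = "<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>"
print(strip_fractions(line))

# A line with periods but no timestamp passes through unchanged.
header = 'xmlns:gml="http://www.opengis.net/gml/3.2"'
print(strip_fractions(header) == header)
```

Applied to the whole file contents read as one string, the same call handles all of the timestamps in a single pass.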

Source

2016-06-07 22:46:16

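Thanks joel, I need this to be able to parse multiple unknown dates per file. There are about 6 datestamps in this format in each file, and each one is formatted consistently, with only one period. – MapZombie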

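Then, fine, but heed @holdenweb's comment about XML parsing. Once you have the elements you want to change, my answer takes care of things. Stephen Holden introduced me to Python, he taught –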
