2017-02-26 40 views

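What is a safe way to extract Python code blocks from docx files and run them in a sandbox?

I have about 6000 to 6500 Microsoft Word .docx files containing variously formatted answer scripts, each in the sequence:

Python programming question, in bold

Answer in the form of complete, correctly indented, single-spaced, self-sufficient code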

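Unfortunately, there seems to be no fixed pattern separating the code blocks from normal text. Some examples from the first 50 or so files:

  1. The entire question is in bold, after which the code starts abruptly, in bold/italics

  2. The question is put in comments, after which the code continues

  3. The question is completely missing; only a numbered list marks where the code starts

  4. The question is completely missing; C/Python style comments mark where the code starts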

For now, I am extracting the entire unformatted text through python-docx like this:

from docx import Document

doc = Document(infil)

# For Unicode handling.
new_paragraphs = []
for paragraph in doc.paragraphs:
    new_paragraphs.append(paragraph.text.encode("utf-8"))

# convert() is a helper defined elsewhere (not shown); it must return str
# for the join below to work.
new_paragraphs = [convert(p) for p in new_paragraphs]

with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)
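Since bold text is one of the few signals that marks a question, it may help to keep that information instead of flattening everything to plain text. The following is only a sketch under assumptions: it reads `word/document.xml` directly with the standard library rather than going through python-docx, it only sees bold applied directly to a run (bold inherited from a paragraph or character style is missed), and the function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _run_is_bold(run):
    """True if the run has direct bold formatting (<w:b/> not switched off)."""
    bold = run.find(f"{W}rPr/{W}b")
    if bold is None:
        return False
    return bold.get(f"{W}val", "true") not in ("0", "false")


def paragraphs_with_bold(docx_file):
    """Yield (text, all_bold) for every paragraph that has visible text.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for para in root.iter(f"{W}p"):
        texts, flags = [], []
        for run in para.iter(f"{W}r"):
            text = "".join(t.text or "" for t in run.iter(f"{W}t"))
            texts.append(text)
            if text.strip():  # whitespace-only runs do not affect the flag
                flags.append(_run_is_bold(run))
        if flags:
            yield "".join(texts), all(flags)
```

A paragraph flagged `all_bold` is a candidate question line; in files that follow pattern 1 the code itself may also be bold, so the flag is a hint to combine with other checks rather than a decision on its own.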

Once extracted, I will run them using the PyPy Sandboxing feature, which I understand is safe, and then assign points as in a contest.
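To make the "run and assign points" step concrete, here is a minimal scoring sketch. It is not a sandbox: it launches an ordinary interpreter in a child process with a timeout, which gives no protection against malicious code. The `interpreter` parameter is a hypothetical hook where a sandboxed interpreter command would be substituted, and `run_snippet`, `score` and the test-case format are names invented for this example.

```python
import subprocess
import sys


def run_snippet(source, stdin_text="", timeout=5.0, interpreter=(sys.executable,)):
    """Run source in a child interpreter; return its stdout, or None on failure.

    NOTE: a plain child process is NOT a security boundary. `interpreter`
    is a placeholder for whatever sandboxed command is actually used.
    """
    try:
        result = subprocess.run(
            [*interpreter, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # infinite loops and slow programs earn nothing
    if result.returncode != 0:
        return None  # crashed or exited with an error
    return result.stdout


def score(source, cases, points_per_case=1, timeout=5.0):
    """Award points for each (stdin_text, expected_stdout) case that matches."""
    total = 0
    for stdin_text, expected in cases:
        out = run_snippet(source, stdin_text, timeout=timeout)
        if out is not None and out.strip() == expected.strip():
            total += points_per_case
    return total
```

Comparing stripped output is a deliberate simplification; a real contest grader might need to be stricter or looser about whitespace.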

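What I am completely stuck on is how to detect the start and end of the code programmatically. Most language detection APIs are of no use, since I already know the language. This question: How to detect source code in a text? suggests using linters and syntax highlighters such as Google Code Prettifier, but they do not solve the problem of separating individual programs.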
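Because the language is known, one cheap heuristic is to let Python's own parser decide: treat the longest run of consecutive lines that parses as a candidate program. This is a sketch of that idea, not a proven solution; `find_code_blocks` is a made-up name, and the rule that a block must contain more than bare names or literals is an assumption added so that a stray single word is not mistaken for code.

```python
import ast


def _parses_as_code(text):
    """True if text is valid Python containing more than bare names or literals."""
    try:
        tree = ast.parse(text)
    except (SyntaxError, ValueError):
        return False
    for node in tree.body:
        trivial = isinstance(node, ast.Expr) and isinstance(
            node.value, (ast.Name, ast.Constant)
        )
        if not trivial:
            return True
    return False


def find_code_blocks(lines):
    """Return (start, end) pairs, end exclusive, of the longest parsable runs."""
    blocks = []
    i, n = 0, len(lines)
    while i < n:
        if not lines[i].strip():
            i += 1  # never start a block on a blank line
            continue
        for end in range(n, i, -1):  # try the longest candidate first
            if not lines[end - 1].strip():
                continue  # never end a block on a blank line
            if _parses_as_code("\n".join(lines[i:end])):
                blocks.append((i, end))
                i = end
                break
        else:
            i += 1  # nothing starting here parses; treat the line as prose
    return blocks
```

Comment lines that precede code are absorbed into the block, which suits patterns 2 and 4. Two programs with no prose between them are merged into one block, and a prose line that happens to be valid Python can still slip in, so the result needs a second check.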

A suitable solution from this programmers.se question seems to be training Markov chains, but I would like some second opinions before starting on such a large project.
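For a sense of scale, a Markov-chain classifier does not have to be a large project. The sketch below trains one character-bigram chain on code lines and one on prose lines, then labels a line by whichever chain gives it the higher average log-probability. Everything here is illustrative: the class and function names, the smoothing constant and the handful of training lines are all made up, and a usable model would need many hand-labelled lines from the real files.

```python
import math
from collections import Counter

ALPHABET_SIZE = 256  # assumed smoothing constant, not tuned


class BigramModel:
    """Character-level first-order Markov chain with add-one smoothing."""

    def __init__(self, samples):
        self.pair_counts = Counter()   # counts of (char, next_char)
        self.first_counts = Counter()  # counts of char as the first of a pair
        for line in samples:
            for a, b in zip(line, line[1:]):
                self.pair_counts[a, b] += 1
                self.first_counts[a] += 1

    def avg_log_prob(self, line):
        """Average log-probability per transition; -inf if there are none."""
        pairs = list(zip(line, line[1:]))
        if not pairs:
            return float("-inf")
        total = 0.0
        for a, b in pairs:
            p = (self.pair_counts[a, b] + 1) / (self.first_counts[a] + ALPHABET_SIZE)
            total += math.log(p)
        return total / len(pairs)


def looks_like_code(line, code_model, prose_model):
    """True if the code chain explains the line better than the prose chain."""
    return code_model.avg_log_prob(line) > prose_model.avg_log_prob(line)


# Tiny hypothetical training sets, for illustration only.
CODE_SAMPLES = [
    "for i in range(10):",
    "    print(i * i)",
    "def add(a, b):",
    "    return a + b",
    "x = [n for n in data if n > 0]",
]
PROSE_SAMPLES = [
    "Write a program that adds two numbers.",
    "The answer should be given as complete code.",
    "Explain what the following function does.",
]
```

With these toy sets, `looks_like_code("    print(a + b)", BigramModel(CODE_SAMPLES), BigramModel(PROSE_SAMPLES))` comes out true, mostly because of the indentation and punctuation transitions that never occur in the prose lines.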

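This extraction code will also be provided to all students after the evaluation.

I apologize if the question is too broad or the answer too obvious.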

Answer


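Hmm, so you are looking for some kind of formatting pattern? That sounds odd to me. Is there some kind of text or string pattern you can exploit? I am not sure whether this helps, but the VBA script below searches all Word documents in a folder and enters "Yes" in every cell whose column matches a search term you specify in Row 1. It also adds hyperlinks in column A, so you can click a link to open the file instead of searching for it.

Script: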

Sub OpenAndReadWordDoc() 

    Rows("2:1000000").Select 
    Range(Selection, Selection.End(xlDown)).Select 
    Selection.ClearContents 
    Range("A1").Select 

    ' assumes that the previous procedure has been executed 
    Dim oWordApp As Word.Application 
    Dim oWordDoc As Word.Document 
    Dim blnStart As Boolean 
    Dim r As Long 
    Dim sFolder As String 
    Dim strFilePattern As String 
    Dim strFileName As String 
    Dim sFileName As String 
    Dim ws As Worksheet 
    Dim c As Long 
    Dim n As Long 

    '~~> Establish an Word application object 
    On Error Resume Next 
    Set oWordApp = GetObject(, "Word.Application") 
    If Err() Then 
     Set oWordApp = CreateObject("Word.Application") 
     ' We started Word for this macro 
     blnStart = True 
    End If 
    On Error GoTo ErrHandler 

    Set ws = ActiveSheet 
    r = 1 ' startrow for the copied text from the Word document 
    ' Last column 
    n = ws.Range("A1").End(xlToRight).Column 

    sFolder = "C:\Users\your_path_here\" 

    '~~> This is the extension you want to go in for 
    strFilePattern = "*.doc*" 
    '~~> Loop through the folder to get the word files 
    strFileName = Dir(sFolder & strFilePattern) 
    Do Until strFileName = "" 
     sFileName = sFolder & strFileName 

     '~~> Open the word doc 
     Set oWordDoc = oWordApp.Documents.Open(sFileName) 
     ' Increase row number 
     r = r + 1 
     ' Enter file name in column A 
     ws.Cells(r, 1).Value = sFileName 

     ActiveCell.Offset(1, 0).Select 
     ActiveSheet.Hyperlinks.Add Anchor:=Sheets("Sheet1").Range("A" & r), Address:=sFileName, _
     SubAddress:="A" & r, TextToDisplay:=sFileName 

     ' Loop through the columns 
     For c = 2 To n 
      If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
        MatchWholeWord:=True, MatchCase:=False) Then 
       ' If text found, enter Yes in column number c 
       ws.Cells(r, c).Value = "Yes" 
      End If 
     Next c 
     oWordDoc.Close SaveChanges:=False 

     '~~> Find next file 
     strFileName = Dir() 
    Loop 

ExitHandler: 
    On Error Resume Next 
    ' close the Word application 
    Set oWordDoc = Nothing 
    If blnStart Then 
     ' We started Word, so we close it 
     oWordApp.Quit 
    End If 
    Set oWordApp = Nothing 
    Exit Sub 

ErrHandler: 
    MsgBox Err.Description, vbExclamation 
    Resume ExitHandler 
End Sub 

Function GetDirectory(path) 
    GetDirectory = Left(path, InStrRev(path, "\")) 
End Function 
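
If you would rather stay in Python, roughly the same keyword scan can be sketched without Word or Excel. This is an approximation of the macro above under assumptions, not a port of it: it reads only the main document body from `word/document.xml`, handles `.docx` but not the older `.doc` format, assumes the search terms are ordinary words, and returns a dictionary instead of filling a worksheet. The function names are made up.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return "\n".join(
        "".join(t.text or "" for t in para.iter(f"{W}t"))
        for para in root.iter(f"{W}p")
    )


def keyword_matrix(folder, keywords):
    """Map each .docx file name to {keyword: found}, whole word, ignoring case."""
    patterns = {
        k: re.compile(rf"\b{re.escape(k)}\b", re.IGNORECASE) for k in keywords
    }
    result = {}
    for path in sorted(Path(folder).glob("*.docx")):
        text = docx_text(path)
        result[path.name] = {k: bool(p.search(text)) for k, p in patterns.items()}
    return result
```

Searching for terms such as `def`, `import` or `print` this way gives a quick overview of which files contain code at all before any finer-grained extraction.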