Converting a string to a list of words?

I'm trying to convert a string to a list of words using Python. I want to take something like the following:

string = 'This is a string, with words!'

and convert it to something like this:

list = ['This', 'is', 'a', 'string', 'with', 'words']

Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

Source

2011-05-31 rectangletangle

Try this:

import re

mystr = 'This is a string, with words!'
# Raw string so that \w reaches the regex engine unchanged
wordList = re.sub(r"[^\w]", " ", mystr).split()

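How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function.

So in our case:

The pattern is any non-alphanumeric character.

[\w] means any alphanumeric character and, for ASCII text, is equal to the character set [a-zA-Z0-9_]

a to z, A to Z, 0 to 9 and underscore.

So we match any non-alphanumeric character and replace it with a space.

Then we split() it, which splits the string on whitespace and converts it to a list.

So 'hello-world'

becomes 'hello world'

with re.sub

and then ['hello', 'world']

after split()

Let me know if any doubts come up.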
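The two steps described above can be looked at separately. This is only a minimal sketch; the sample string and the variable names `sample`, `spaced` and `words` are illustrative and not part of the answer.

```python
import re

sample = "hello-world, again!"

# Step 1: replace every non-word character (anything outside \w) with a space.
spaced = re.sub(r"[^\w]", " ", sample)

# Step 2: split on runs of whitespace, so repeated spaces yield no empty strings.
words = spaced.split()

print(repr(spaced))  # 'hello world  again '
print(words)         # ['hello', 'world', 'again']
```

Because split() is called without arguments, the doubled space left behind by ", " does not produce an empty entry.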

Source

2011-05-31 00:13:53 Bryan

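Remember to also handle apostrophes and hyphens, since they aren't included in '\w'. – Shule 2014-07-30 05:29:26

You might want to handle formatted apostrophes and non-breaking hyphens as well. – Shule 2014-07-30 05:57:42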
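One possible way to act on these comments is sketched below. It rests on assumptions that are not in the thread: that "formatted apostrophe" means U+2019 and "non-breaking hyphen" means U+2011, and that a word may contain single apostrophes or hyphens between runs of word characters. The names `NORMALIZE`, `WORD_RE` and `split_words` are hypothetical.

```python
import re

# Assumption: map the typographic apostrophe (U+2019) and the non-breaking
# hyphen (U+2011) to their ASCII counterparts before tokenizing.
NORMALIZE = str.maketrans({"\u2019": "'", "\u2011": "-"})

# A word is a run of word characters, optionally joined by single
# apostrophes or hyphens, so "I'm" and "custom-built" stay whole.
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")

def split_words(text):
    """Return the words in text, keeping inner apostrophes and hyphens."""
    return WORD_RE.findall(text.translate(NORMALIZE))

print(split_words("I\u2019m a custom\u2011built string, with words!"))
# ["I'm", 'a', 'custom-built', 'string', 'with', 'words']
```

Leading or trailing quotes and dashes are still dropped, because every match has to start and end with a word character.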

Well, you could use

import re 
list = re.sub(r'[.!,;?]', ' ', string).split()

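Note that list is the name of a built-in type and string is the name of a standard-library module, so you probably don't want to use those as your variable names.

Source

2011-05-31 00:10:30 Cameron

A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".

Source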

2011-05-31 00:14:40 tofutim

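To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch: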

>>> import nltk 
>>> paragraph = u"Hi, this is my first sentence. And this is my second." 
>>> sentences = nltk.sent_tokenize(paragraph) 
>>> for sentence in sentences: 
...  nltk.word_tokenize(sentence) 
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.'] 
[u'And', u'this', u'is', u'my', u'second', u'.']

Source

2011-05-31 00:15:21

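Using string.punctuation for completeness:

import re
import string
# re.escape keeps characters such as ], \ and ^ literal inside the character class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()

This handles newlines as well.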
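The same idea can be sketched without a regex by deleting the characters through a translation table. This is a swapped-in technique (str.translate), not the answer's code, and the names `DELETE_PUNCT` and `strip_punct_words` are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punct_words(s):
    """Delete punctuation, then split on any whitespace (including newlines)."""
    return s.translate(DELETE_PUNCT).split()

print(strip_punct_words("This is a string,\nwith words!"))
# ['This', 'is', 'a', 'string', 'with', 'words']

# Punctuation is deleted rather than replaced by a space, so words that are
# separated only by punctuation are fused together.
print(strip_punct_words("one,two"))
# ['onetwo']
```

The second call shows the trade-off of deleting punctuation instead of replacing it with a space.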

Source

2011-05-31 00:24:02 mtrw

Should be the accepted answer. – Epoc 2017-02-08 11:41:10

The simplest way:

>>> import re 
>>> string = 'This is a string, with words!' 
>>> re.findall(r'\w+', string) 
['This', 'is', 'a', 'string', 'with', 'words']

Source

2011-05-31 02:19:14 JBernardo

I think this is the simplest way for anyone else stumbling on this post, given the late response:

>>> string = 'This is a string, with words!' 
>>> string.split() 
['This', 'is', 'a', 'string,', 'with', 'words!']

Source

2012-12-06 00:22:28 gilgamar

+19

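You need to separate and eliminate the punctuation from the words (e.g., "string," and "words!"). As is, this does not meet the OP's requirements. – Levon 2012-12-06 00:31:45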
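One way to address this objection while keeping the plain split() is to strip punctuation from the ends of each token afterwards. The sketch below is not from the thread; `split_and_strip` is a hypothetical helper, and it only removes ASCII punctuation as defined by string.punctuation.

```python
import string

def split_and_strip(text):
    """Split on whitespace, then strip punctuation from both ends of each token."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    # Tokens made up only of punctuation become empty strings; drop them.
    return [tok for tok in tokens if tok]

print(split_and_strip("This is a string, with words!"))
# ['This', 'is', 'a', 'string', 'with', 'words']

print(split_and_strip("well - that's custom-built"))
# ['well', "that's", 'custom-built']
```

Punctuation inside a token, such as the apostrophe in "that's", is left alone because str.strip only touches the ends.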

-2

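You can try doing this:

# Both arguments of maketrans must have the same length: ',' and '!' each map to a space
tryTrans = str.maketrans(",!", "  ")
mystr = "This is a string, with words!"
mystr = mystr.translate(tryTrans)
listOfWords = mystr.split()

Source

2013-08-12 13:49:25 user2675185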

This is from my attempt at a coding challenge that couldn't use regex:

outputList = "".join((c if c.isalnum() or c=="'" else ' ') for c in inputStr).split()

The role of the apostrophe seems interesting.

Source

2015-05-28 06:30:26 guest201505281433

list=mystr.split(" ",mystr.count(" "))

Source

2015-08-11 15:14:35 sanchit

Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:

import re 
import string 

def extract_words(s): 
    return [re.sub('^[{0}]+|[{0}]+$'.format(string.punctuation), '', w) for w in s.split()] 

>>> str = 'This is a string, with words!' 
>>> extract_words(str) 
['This', 'is', 'a', 'string', 'with', 'words'] 

>>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.''' 
>>> extract_words(str) 
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']

Source

2017-06-08 09:55:37

You can eliminate every special character except letters like this:

def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! ' 
L = wordsToList(s) 
print(L) # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I'm not sure if this is fast or optimal, or even the right way to program it.
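Whether an approach is fast is easiest to settle by measuring it. Below is a rough sketch using timeit to compare three of the techniques mentioned in this thread. The function names, the repeated sample text and the repetition counts are arbitrary choices for illustration, and the timings depend on the machine and the Python version, so none are quoted here.

```python
import re
import string
import timeit

# Arbitrary test input: the question's sample sentence repeated many times.
text = "This is a string, with words! " * 1000

def with_findall(s):
    return re.findall(r"\w+", s)

def with_sub_split(s):
    return re.sub(r"[^\w]", " ", s).split()

# Map every ASCII punctuation character to a space.
_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def with_translate(s):
    return s.translate(_TABLE).split()

for fn in (with_findall, with_sub_split, with_translate):
    seconds = timeit.timeit(lambda: fn(text), number=100)
    print(f"{fn.__name__}: {seconds:.4f}s")
```

The three functions return the same list for this input, but they are not equivalent in general: for example, only the translate version keeps non-ASCII punctuation inside words.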

Source

2017-07-30 15:22:07
