JSFiddleJavaScript正则表达式匹配某些字符串,但未能对其他看似相同的字符串
我使用Facebook的API,从我县的警察局网页日常犯罪报告来拉。他们遵循大多也是标准化的格式,用下面的模式是什么,我都会响起来的,和一些讨厌的矛盾:
- 标题是3-4线后跟两个新行字符
\n\n
之间(代码削减了这一点,不属于下面的输出结果) - 不同类型的犯罪组合在一起,第一行是描述犯罪类型的大写字符串。每个类别由两个新行字符
\n\n
分隔。 - 实际的罪行遵循上述类别标题,每个(大部分时间)由一个新行字符
\n
- 分离作为一个“伪影”的任何的它们从复制和粘贴,几次有各种Unicode字符替换连字符,包括
\u2013
,\u2014
和\u2015
- 报道开始字符串“BEAT”,或在极少数情况下所有罪行“垮掉”
是我遇到的问题是,有时下面的代码捕获一个类别标题det在上面#2中,但在其他帖子中,(似乎)完全相同的字符串和情况并没有被捕获。我使用在服务角代码可以看出下面
me.parsePosts = function() {
var posts = facebookService.getRandomPosts(); // Just a method to return 5 random reports for now
angular.forEach(posts, function(post) {
// Some reports are incorrectly double spaced and inconsistent
// with spacing and capitalization
var fixedPost = post.message
.replace(/^Beat/, 'BEAT') // They were a little inconsistent back in the day
.replace('\n\n###', '') // All posts end with a useless ###
.replace('\u2013', '-') // Pesky unicode characters!
.replace('\u2014', '-')
.replace('\u2015', '-')
.replace('\n\nARRESTED', '\nARRESTED') // would help if this was consistent
.replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'), // same with the reports...
postSplit = fixedPost.split('\n\n'), // split up the post into potential categories
header = postSplit.splice(0,1); // I don't want the standard header of the post
// Pass in postSplit .join()'d back together for debugging
me.getCategoriesFromPost(postSplit, postSplit.join('\n\n'));
});
};
me.getCategoriesFromPost = function(postArray, post) {
var categoryRegexp = /[A-Z\-&\/: ]+$/,
categories = [], uniqCategories = [];
angular.forEach(postArray, function(a) {
var split = a.split('\n'), // Extract the category from the list of crimes
potentialCategory = split[0].trim(); // There's often an unwanted trailing space
if (potentialCategory.match(categoryRegexp)) {
categories.push(potentialCategory);
}
});
// Every blue moon they repost a category twice, I just want one
// and I'll merge the two together afterwards
uniqCategories = categories.filter(function(a,b) {
return categories.indexOf(a) == b;
});
console.log(uniqCategories); // log off all the categories in the post
console.log(post); // Display the actual post so i can visibly verify it all worked
};
因此,作为一个例子,在一个交:
console.log(uniqCategories);
(original raw text as received from facebookService.getRandomPosts()
):
BURGLARY COMMERCIAL
BEAT E1 SPRINT WIRELESS, 7300 ASSATEAGUE DR, 3/19 0426: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole electronics. 14-25638
BEAT D6 MONTPELIER LIQUORS, 7500 MONTPELIER RD, 3/19 0513: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole liquor, lottery tickets, and an ATM machine. 14-25641
BEAT D4 MACY’S, 10300 LITTLE PATUXENT PKWY, 3/19 0501: Two unknown male suspects, wearing masks, gained entry to the business by breaking the glass door. The suspects were interrupted by a store employee and fled without taking anything. 14-25642
SUSPECT VEHICLE: black Dodge pickup
BURGLARY NON COMMERCIAL
BEAT B3 6600 ASPERN DR, 3/17 2354: Four suspects gained entry to the residence via unknown means. No sign of forced entry. 14-25220
ARRESTED:
Karlin Lamont Harris, 23, of Pirch Way in Elkridge, charged with fourth-degree burglary
Steven Lee Hubbard, 29, of Edgewater, charged with fourth-degree burglary
Jessie Tyler Holt, 22, of Pine Tree Rd in Jessup, charged with fourth-degree burglary
Brittney Victoria McEnaney, 26, of Pasadena, charged with fourth-degree burglary
BEAT C1 6900 BENDBOUGH CT, 3/18 1400: Unknown suspect(s) gained entry to the residence via the front door. No sign of forced entry. The suspect(s) stole jewelry. 14-25392
BEAT B4 7100 DEEP FALLS WAY, 3/18 1100-1440: Unknown suspect(s) gained entry to the residence by forcing a rear basement window. The suspect(s) stole jewelry and electronics. 14-25404
VEHICLE THEFT & ATTEMPTS
BEAT E2 7-11, 9600 WASHINGTON BLVD, 3/18 0409:
05 Acura Tag 1AV8629 14-25277 (Keys left in vehicle.)
而且console.log(post);
返回
["BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]
还在另一篇文章中,console.log(uniqCategories);
(original raw text as received from facebookService.getRandomPosts()
):
ROBBERY COMMERCIAL
BEAT B3 ZIPS DRY CLEANING, 6500 OLD WATERLOO RD, 3/22 1900: An unknown suspect entered the business through an unlocked rear door. The suspect threatened an employee and demanded cash. The employee complied. The suspect fled the business. 14-26959
SUSPECT: B/M, 5’8-5’9, black hoodie and pants, backpack
ROBBERY NON COMMERCIAL
BEAT E7 7-11 PARKING LOT, 9100 MAIER RD, 03/23 1632: Suspect stole cash from an acquaintance and caused an abrasion with an unknown sharp object. Police are investigation the possibility it may be drug related. 14-27243
SUSPECT: B/M, 5’8, 200 lbs, dreadlocks
BURGLARY COMMERCIAL
BEAT E1 MEGATELECOM, 8600 WASHINGTON BLVD #106, 3/22 0933: Unknown suspect(s) gained entry to the business by breaking a window. The suspect(s) stole electronics. 14-26793
BEAT F3 CATTAIL CREEK COUNTRY CLUB, 3600 CATTAIL CREEK DR, 03/22 1600- 03/23 0630: Unknown suspect(s) gained entry to a garage through an unlocked door. The suspect(s) stole golf carts. 14-27127
BURGLARY NON COMMERCIAL
BEAT E2 9300 BREAMORE CT, 03/21 1210 ATTEMPT: Two suspects attempted to gain entry via a rear slider. The resident yelled and the suspects fled, but were later caught by police. 14-26458
ARRESTED:
Travis Donte Mackell, 23, of Baltimore, charged with fourth-degree burglary
Maurice Debuiel Aye, 26, of Baltimore, charged with fourth-degree burglary
BEAT D3 5500 COLUMBIA RD, 3/21: An unknown suspect gained entry to the residence through an unlocked rear slider. The suspect woke the resident, who ultimately got the suspect to leave. It appears he may have entered the wrong residence. 14-26712
SUSPECT: B/M, 5’8, 200 lbs
BEAT B4 7500 HEARTHSIDE WAY, 3/22 1700- 1800: Three unknown black male suspects stole a bicycle, which was unsecured on a bike rack. 14-27185
BEAT E3 9100 BRYANT AVE, 3/23 2213: Unknown suspects gained entry to the residence by prying open the kitchen window. Nothing appeared to be taken. 14-27308
BEAT B3 8000 KEETON RD, 3/23 1930- 2230: Unknown suspect(s) gained entry to the residence through an unlocked window. The suspect(s) stole a computer and jewelry. 14-27314
BEAT A3 9000 FREDERICK RD, 3/23 0205: The suspect kicked in an acquaintance’s door after a verbal altercation and assaulted him. 14-27361
ARRESTED: Michael Wilson Sittig, 34, of Frederick Road in Ellicott City, charged with second-degree assault, third- and fourth-degree burglary, malicious destruction of property, and disorderly conduct
VEHICLE THEFT & ATTEMPTS
BEAT D2 5100 ELIOTS OAK DR, 03/22 2130- 3/23 0700:
12 Hyundai Sonata Red MD 5AN2945 14-27135
和console.log(post)
只返回:
["ROBBERY COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]
我希望它返回["ROBBERY COMMERCIAL", "ROBBERY NON COMMERCIAL", "BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]
在这种情况下,很显然,我的代码前者的实例相匹配的BURGLARY COMMERCIAL
和BURGLARY NON COMMERCIAL
,但不是后者。是什么赋予了?另外,请随时纠正我,告诉我我在.replace()
的墙上做的都是错的,如果有的话,还有更好的办法。感谢一帮帮忙!
这可能有助于在循环的开始记录了'POST'(任何修改都做之前),看看是否还有别的东西......就像一个标签?或者打勾(而不是单引号),这会在一个奇怪的地方切断内容。难以确定 – ochi
我已将从FB接收到的原始文本添加到修剪和修改输出上方的帖子,并通过指向Pastebin的链接将此问题进一步解决。对于它的价值,我对这些字符串重新定义了'JSON.stringify()',并且在类别之前只看到了'\ n \ n'。 – Scott
你能提供一个JSfiddle,所以我们可以运行一些测试吗? –