Regular expressions
The following are my own experiments; some are excerpts from other sources.
Contents
- 1 Regex for ASCII Characters
- 2 Matching strings - substrings
- 3 Matching HTML Tags
- 4 Numbers
- 5 Remove all control characters
- 6 Matching Date format
- 7 Match string with a character as optional
- 8 Match last word with an optional forward slash
- 9 Match a quoted text zero or more times
- 10 From Tiny MCE editor ip, url, ssn, cc, isbn, zip, phone, hexcolor and user
- 11 Remove characters from string
- 12 Matching first and last character
- 13 Match Capital Letters with optional underscore
- 14 Extracting hostname from HTTP_REFERRER
- 15 Matching newline in multiline text without using DOT with an alternate
- 16 Find the length of a string or a range in length with backreference in regex
- 17 Capturing and not capturing - Subpatterns
- 18 Regular expressions that i should remember
- 19 Regex in C language using regcomp and regexec
- 20 Match a html content which has many children
- 21 Match extension that does not contain certain extension
- 22 Reference
Regex for ASCII Characters
A regular expression can match a character by its code with the escape \xNN, where NN is a two-digit hexadecimal code from 00 to FF. Strictly speaking, ASCII covers only 00 to 7F; the codes 80 to FF belong to extended 8-bit character sets. A range of characters can be matched by writing two such codes, separated by a hyphen, inside square brackets.
/[\x00-\xFF]/
The expression above matches every character from NUL (hex 00) to ÿ (hex FF, decimal 255), as shown in this article or this list. These characters are divided into three groups:
- 33 control characters (hex codes 00 to 1F, plus 7F)
- 95 printable characters (hex codes 20 to 7E)
- 128 extended characters (hex codes 80 to FF)
The control characters (00 to 1F and 7F) can often be omitted. Excluding them requires two character ranges:
/[\x20-\x7E\x80-\xFF]/
To strip everything except printable ASCII from a string:
value = value.replace(/[^\x20-\x7E]/g, '');
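A minimal sketch of the printable-ASCII filter in use. The helper name stripNonPrintable and the sample string are my own assumptions, not part of the original notes.

```javascript
// Hypothetical helper: keep only printable ASCII (hex 20 to 7E).
function stripNonPrintable(value) {
  return value.replace(/[^\x20-\x7E]/g, '');
}

// The accented letter, the tab and the NUL byte all fall outside 20-7E.
console.log(stripNonPrintable('caf\u00e9\tbar\x00!')); // prints "cafbar!"
```

To keep the extended range as well, the character class from the previous pattern, [^\x20-\x7E\x80-\xFF], could be used instead.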
Matching strings - substrings
Here I am trying to extract the value of the src attribute of any tag.
The target has the form src="...", and I want the text represented by the three dots.
I go a bit further and include http://www.youtube.com in the pattern, because I want to extract the request URI of a YouTube video.
$arr = array();
// \1 makes the closing quote match the opening one; the dots in the host name are escaped.
preg_match('/src=(["\'])http:\/\/www\.youtube\.com\/(.*?)\1/', $url, $arr);
print_r($arr);
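The same extraction can be sketched in JavaScript. The function name youtubePath and the sample markup are assumptions for illustration; the pattern uses a backreference so that the closing quote has to be the same kind as the opening one.

```javascript
// Hypothetical helper: return the part of the src URL after the YouTube host, or null.
function youtubePath(html) {
  // Group 1 captures the opening quote, group 2 the request URI; \1 requires the same closing quote.
  const m = html.match(/src=(["'])http:\/\/www\.youtube\.com\/(.*?)\1/);
  return m ? m[2] : null;
}

console.log(youtubePath('<embed src="http://www.youtube.com/v/abc123?fs=1">')); // prints "v/abc123?fs=1"
console.log(youtubePath('<img src="http://example.com/a.png">'));              // prints null
```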
Matching HTML Tags
Matching string between two strings
I wouldn't use regex for this either, but if you must, this expression should work:
<customtag>(.+?)</customtag>
If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>
Source http://stackoverflow.com/questions/299942/regex-matching-html-tags-and-extracting-text
Remove Tags Two
Remove JavaScript and CSS:
<(script|style).*?</\1>
Remove tags:
<.*?>
---
Source http://stackoverflow.com/questions/181095/regular-expression-to-extract-text-from-html
Remove script and tags like that
By default the dot does not match a newline, so the versions above stop at line breaks. The following also matches across newlines (written here for \r\n line endings):
<script(.|\r\n)*?</script>
Use other tag names in place of script. Combining this with the regex above gives:
<(script|style)(.|\r\n)*?</\1>
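A rough sketch of both removal steps combined, for quick text extraction only. The function name stripTags is hypothetical, and I have swapped (.|\r\n) for [\s\S], which matches any character including a lone \n; this is not a robust HTML parser.

```javascript
// Hypothetical helper: drop script/style blocks first, then any remaining tags.
function stripTags(html) {
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, '') // whole blocks, across newlines
    .replace(/<[^>]*>/g, '');                         // remaining tags
}

console.log(stripTags('<p>Hi</p><script>\nvar a = 1;\n</script><b>there</b>')); // prints "Hithere"
```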
Numbers
Numbers with trailing dot
I want to find all numbers ending with a dot, for example 1., 12. and 123. I tried the following and it worked.
- [0-9]+[\.]
One digit and a dot
To match just one digit followed by a dot, I used the following.
- [0-9][\.]
Other Options
I tried the following and it matched only numbers such as 1.0 and 34.34, where the dot is required and is followed by more digits.
- [0-9]+[\.][0-9]+
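The number patterns above can be compared side by side. The sample text and variable names are assumptions; the negative lookahead in the second pattern is my own addition to keep the integer part of a decimal from being reported as a number with a trailing dot.

```javascript
const text = 'Items 1. 12. 123. and the price 34.34';

// Original pattern: digits followed by a dot (also picks up "34." from 34.34).
const withDot = text.match(/[0-9]+\./g);

// Variant with a lookahead: the dot must not be followed by another digit.
const trailingDotOnly = text.match(/[0-9]+\.(?![0-9])/g);

// Digits on both sides of the dot.
const decimals = text.match(/[0-9]+\.[0-9]+/g);

console.log(withDot);         // [ '1.', '12.', '123.', '34.' ]
console.log(trailingDotOnly); // [ '1.', '12.', '123.' ]
console.log(decimals);        // [ '34.34' ]
```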
Remove all control characters
The pattern [\x00-\x1f] matches the control characters 00 to 1F, including the NUL character. The g flag is needed to remove every occurrence rather than only the first.
str = str.replace(/[\x00-\x1f]/g, '');
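A small sketch showing why the g flag matters when removing control characters. The variable names and sample strings are made up; the last line also covers DEL (7F), which sits outside the 00 to 1F range.

```javascript
const raw = 'a\x00b\x01c';

// Without g, replace() stops after the first match.
const firstOnly = raw.replace(/[\x00-\x1f]/, '');

// With g, every control character is removed.
const allRemoved = raw.replace(/[\x00-\x1f]/g, '');

// Adding \x7f to the class also removes DEL.
const withoutDel = 'x\x7fy'.replace(/[\x00-\x1f\x7f]/g, '');

console.log(JSON.stringify(firstOnly));  // "ab\u0001c"
console.log(JSON.stringify(allRemoved)); // "abc"
console.log(JSON.stringify(withoutDel)); // "xy"
```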
Matching Date format
- I am trying to match the format 12/21/2010
- in the first set (month), the first digit should not be more than 1
- in the second set (day), the first digit should not be more than 3
- in the third set (year), 20 stays constant for the next 90 years
- the last two digits can vary
These are the steps I kept in mind to create this, and it is working:
fom.xdate.value.match(/(0|1)[0-9]\/(0|1|2|3)[0-9]\/20[0-9][0-9]/)
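The pattern above only limits the first digit of each set, so a value like 19/39/2010 still passes, and without anchors it can match inside a longer string. A stricter sketch, assuming the same MM/DD/20YY format, is below; the names datePattern and isValidDateFormat are hypothetical, and it still does not check the number of days per month.

```javascript
// Month 01-12, day 01-31, year 2000-2099, anchored to the whole string.
const datePattern = /^(0[1-9]|1[0-2])\/(0[1-9]|[12][0-9]|3[01])\/20[0-9]{2}$/;

function isValidDateFormat(value) {
  return datePattern.test(value);
}

console.log(isValidDateFormat('12/21/2010')); // true
console.log(isValidDateFormat('19/39/2010')); // false
console.log(isValidDateFormat('02/30/2010')); // true (format only, not a calendar check)
```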
Match string with a character as optional
var protocol = window.location.href.match(/https?:\/\//) // alert protocol
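Here the forward slashes are the regex delimiters, so each literal slash is preceded by a backslash. When I checked w3schools, window.location.protocol returns http: or https: without the two trailing slashes; that property holds only the scheme and its colon, and the slashes belong to the rest of the URL.
I also tried http(s):// in the regex, which worked for https URLs only because the group is required, and http(s)?://, which makes the s optional.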
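A quick sketch comparing the variants outside the browser. The helper name schemePrefix and the example URLs are assumptions; the standard URL class is used here in place of window.location so the snippet also runs in Node.

```javascript
// Hypothetical helper: return "http://" or "https://" from the start of a URL, or null.
function schemePrefix(href) {
  const m = href.match(/^https?:\/\//);
  return m ? m[0] : null;
}

console.log(schemePrefix('https://example.com/a')); // prints "https://"
console.log(schemePrefix('http://example.com/a'));  // prints "http://"
console.log(schemePrefix('ftp://example.com/a'));   // prints null

// Without the ?, the group is required, so plain http does not match.
console.log(/http(s):\/\//.test('http://example.com/a')); // prints false

// The protocol property holds the scheme and the colon only.
console.log(new URL('https://example.com/a').protocol); // prints "https:"
```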
Match last word with an optional forward slash
- window.location.href.match(/titles\/?$/)
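A small check of the optional trailing slash, using plain strings instead of window.location.href. The function name and URLs are made up. Note that the pattern is not anchored on the left, so a path ending in "subtitles" would also match; \/titles\/?$ would be one way to tighten it.

```javascript
// Hypothetical helper: true when the URL ends in "titles" with or without a final slash.
function endsWithTitles(href) {
  return /titles\/?$/.test(href);
}

console.log(endsWithTitles('http://example.com/titles'));    // true
console.log(endsWithTitles('http://example.com/titles/'));   // true
console.log(endsWithTitles('http://example.com/titles/42')); // false
```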
Match a quoted text zero or more times
Matches a single quoted word
"\w*"
Matches any quoted string
"([^"\\]|\\.)*"
Match all hyperlinks
I modified this from the pattern above, but there should be a more direct way to find all hyperlinks within quotes.
"http://([^"\\]|\\.)*"
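A short sketch of the quoted-string patterns with the g flag, so every occurrence is returned. The sample text and variable names are assumptions. The alternation [^"\\]|\\. accepts either an ordinary character or a backslash followed by any character, which is what lets an escaped quote stay inside the match.

```javascript
const sample = 'say "hi" and "a \\"nested\\" quote" plus "http://example.com/x"';

// Any double-quoted string, with backslash escapes allowed inside.
const quotedStrings = sample.match(/"([^"\\]|\\.)*"/g);

// Only quoted strings that start with http://
const quotedLinks = sample.match(/"http:\/\/([^"\\]|\\.)*"/g);

console.log(quotedStrings.length); // 3
console.log(quotedLinks);          // [ '"http://example.com/x"' ]
```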
From Tiny MCE editor ip, url, ssn, cc, isbn, zip, phone, hexcolor and user
ip url ssn cc isbn zip phone hexcolor user
Remove characters from string
PHP
echo preg_replace("/[^a-z0-9]+/i",'',$value);
Javascript
val = val.replace(/[^a-z 0-9]+/gi, '');
Here g replaces all occurrences and i makes the match case insensitive. The space inside the brackets means spaces are kept, unlike in the PHP version.
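The difference the space inside the character class makes can be seen in a two-line sketch. The input string and variable names are made up.

```javascript
const input = 'Hello, World! 123';

// No space in the class: spaces are removed along with punctuation.
const lettersAndDigits = input.replace(/[^a-z0-9]+/gi, '');

// Space in the class: spaces survive, punctuation is removed.
const keepSpaces = input.replace(/[^a-z 0-9]+/gi, '');

console.log(lettersAndDigits); // prints "HelloWorld123"
console.log(keepSpaces);       // prints "Hello World 123"
```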
Matching first and last character
I have a string like '[sample]' and I want to match [ as the first character and ] as the last, which gives [sample]. If (.*) is included as a capturing group, you get [sample],sample in an array.
str.match(/^\[.*\]$/)
str.match(/^\[(.*)\]$/)
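A minimal sketch of both calls on the example string, with variable names of my own choosing. match() returns null when the brackets are missing, so the result should be checked before indexing.

```javascript
const whole = '[sample]'.match(/^\[.*\]$/);
const captured = '[sample]'.match(/^\[(.*)\]$/);
const missing = 'sample]'.match(/^\[(.*)\]$/);

console.log(whole[0]);             // prints "[sample]"
console.log(Array.from(captured)); // [ '[sample]', 'sample' ]
console.log(missing);              // prints null
```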
Match Capital Letters with optional underscore
[A-Z_]{2,}
I want it to match a minimum of two characters. I also want to check how other tools impose case sensitivity: in Notepad++ this works only when I tick the Match case option manually, and I hope other regex engines do not behave this way.
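For comparison, JavaScript regexes are case sensitive unless the i flag is given, so no extra option is needed there. The sample sentence and variable names below are assumptions.

```javascript
const line = 'set MAX_SIZE and max_size, also A';

// Case sensitive by default: only runs of two or more capitals/underscores.
const constants = line.match(/[A-Z_]{2,}/g);

// With the i flag the class matches lowercase letters too.
const ignoreCase = line.match(/[A-Z_]{2,}/gi);

console.log(constants);  // [ 'MAX_SIZE' ]
console.log(ignoreCase); // [ 'set', 'MAX_SIZE', 'and', 'max_size', 'also' ]
```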
Extracting hostname from HTTP_REFERRER
Matching newline in multiline text without using DOT with an alternate
Find the length of a string or a range in length with backreference in regex
Capturing and not capturing - Subpatterns
Regular expressions that i should remember
Regex in C language using regcomp and regexec
Match a html content which has many children
(function () {
  var form = document.getElementById('retailform');
  // [\s\S] matches any character including newlines, so the match can span child elements.
  var match = form.innerHTML.match(/<fieldset(.*?)gene_div(.*?)[\s\S]+<\/fieldset>/);
  if (!match) return; // nothing to copy if the fieldset is not found
  var ele = document.createElement('div');
  ele.innerHTML = match[0];
  form.appendChild(ele);
  document.getElementById('keyword_search').disabled = false;
})();
Match extension that does not contain certain extension
^(?!.*\.pdf).*$
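A sketch of the negative lookahead in use, with made-up file names. As written, the pattern rejects any string containing .pdf anywhere; the second variant, which is my own assumption about the intent, anchors the check to the end of the name and ignores case.

```javascript
// Rejects any string that contains ".pdf" at any position.
const noPdfAnywhere = /^(?!.*\.pdf).*$/;

// Variant: rejects only names that end in ".pdf", case insensitive.
const noPdfExtension = /^(?!.*\.pdf$).*$/i;

console.log(noPdfAnywhere.test('report.docx'));        // true
console.log(noPdfAnywhere.test('report.pdf'));         // false
console.log(noPdfAnywhere.test('my.pdf.backup.txt'));  // false
console.log(noPdfExtension.test('my.pdf.backup.txt')); // true
console.log(noPdfExtension.test('REPORT.PDF'));        // false
```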