Fedora Linux Support Community & Resources Center
  #1  
Old 29th September 2009, 11:49 PM
Vector Offline
Banned
 
Join Date: Jul 2006
Location: Transgression
Age: 34
Posts: 1,183
Question Regex, or, umm, Analysis problem...

Ok, the regex itself really is not my issue, but working out the right matching logic is. Let me explain what i am trying to do...

I've got a list of over 200 temporary email providers. This kind of service typically generates a random email @ their domain.com and you can give it to a website when you register, and check it to click email validation links. This is bad for honest guys like me who would never sell or even disclose any user's information.

The problem is that the format of domain names makes it hard to determine (procedurally) whether a name is a FQDN or a subdomain. I want to block by DOMAIN/FQDN, and not just by subdomain (i know, i'm a d*ck). The problem arises when trying to assess whether part of the domain name is a subdomain or not. For example, example.com.it is a FQDN, but something like example.ask.me is not. In both cases, the last section has 2 characters and the middle section has 3. I could run a check against country codes (CCs), but then other sites like example.eat.it would slip through... I'm thinking about giving up, matching against the entire domain part of the email address, and just updating my list a little more frequently...

Thanx
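
To make the ambiguity concrete, here is a minimal Python sketch of the obvious heuristic: keep only the last two labels. The function name `naive_registrable` is made up for illustration and is not something from the thread.

```python
def naive_registrable(domain: str) -> str:
    """Guess the registrable domain by keeping only the last two labels."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

# Domains taken from the post above.
for name in ("mail.example.com", "example.ask.me", "example.com.it"):
    print(name, "->", naive_registrable(name))
```

The first two guesses are what the post asks for, but `example.com.it` collapses to `com.it`, which would block every address under `.com.it`. Label counts and lengths alone cannot separate the two cases.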
  #2  
Old 30th September 2009, 12:15 AM
pete_1967 Offline
Clueless in a Cuckooland
 
Join Date: Mar 2006
Location: Here now, elsewhere tomorrow.
Posts: 4,368
Some good examples: http://www.regular-expressions.info/email.html and http://regexlib.com/DisplayPatterns.aspx
__________________
A Drink is Not Just For Christmas - SaskyCom :thumb:


“Give a man a fish; you have fed him for today. Teach a man to fish; and you have fed him for a lifetime” so now go and...
RTFM FIRST: http://docs.fedoraproject.org/ & http://rute.2038bug.com/index.html.gz
  #3  
Old 30th September 2009, 12:20 AM
Vector Offline
Oh no, validating email is not an issue. I've been doing that with regex and validation links for a long time. The issue here is more of a semantic one, as opposed to syntactic.....

but thanx
  #4  
Old 30th September 2009, 12:25 AM
pete_1967 Offline
Well, pretty much the only way to do that is to check the last part against valid extensions (e.g. org, gov, mil, com etc.) and if it doesn't match, deny. In other words: is the trouble worth the benefit?

Last edited by pete_1967; 30th September 2009 at 12:30 AM.
  #5  
Old 30th September 2009, 12:30 AM
Vector Offline
Yeah, that was what i was afraid was my only option: matching known extensions. Hmmmm, i may just go back to plan A, and match the entire domain section; but my dumbass already removed the subdomain of each domain........
I wish they would standardize the way domain names worked, to prevent issues like this... I don't like seeing a ".uk", etc at the end of a domain (or rather, beginning, as it were), as it only complicates miscellaneous operations such as stripping subdomains out.

thanx

Last edited by Vector; 30th September 2009 at 12:36 AM.
  #6  
Old 30th September 2009, 12:34 AM
pete_1967 Offline
Or use this:
Code:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
If an address doesn't match, it's not a valid address. Simple: checking for a valid address gives the same result as checking for an invalid one, and the regex above matches addresses against RFC 2822.

or this one that checks for top-level domains:
Code:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b
It's always easier to decide what you allow, and drop everything else, than other way around.
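
The allowlist idea can also be sketched without a regex. The Python below is a minimal illustration, assuming the generic TLDs listed in the second regex (a 2009 snapshot, far from complete today); `ALLOWED_TLDS` and `has_allowed_tld` are hypothetical names. Like the `[A-Z]{2}` branch of the regex, it accepts any two-letter label as a country code without checking that the code exists. Note that the regex mixes `[A-Z]` with lowercase classes, so it appears to assume case-insensitive matching; the sketch lowercases the domain instead.

```python
# Generic TLDs copied from the second regex in the post (a 2009 snapshot).
ALLOWED_TLDS = {"com", "org", "net", "gov", "mil", "biz", "info",
                "mobi", "name", "aero", "jobs", "museum"}

def has_allowed_tld(email: str) -> bool:
    """Return True if the address ends in an allowlisted or two-letter TLD."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or "." not in domain:
        return False
    tld = domain.lower().rsplit(".", 1)[-1]
    # Any two-letter ASCII label is treated as a country code, as [A-Z]{2} does.
    return tld in ALLOWED_TLDS or (len(tld) == 2 and tld.isalpha() and tld.isascii())

print(has_allowed_tld("user@mail.example.co.uk"))
print(has_allowed_tld("user@example.invalid"))
```

This only answers "is the ending allowed"; it says nothing about where the subdomain stops, which is the original question.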

Last edited by pete_1967; 30th September 2009 at 12:37 AM.
  #7  
Old 30th September 2009, 12:38 AM
Vector Offline
Quote:
Originally Posted by pete_1967 View Post
It's always easier to decide what you allow, and drop everything else, than other way around.
True, true.... I may adopt that into the algorithm. You know, another pain in the *ss is the fact that you can even have 4.2 extensions, like example.mobi.eu
  #8  
Old 30th September 2009, 12:44 AM
pete_1967 Offline
Which means you need to change the 2nd regex above to check for those, and of course you can't have .biz.biz or .us.us, for example.
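
One way to handle two-level extensions without growing the regex is to compare against a set of known suffixes and keep one label to the left of the longest match. The sketch below assumes a tiny hand-written set whose entries mirror examples from this thread; they are assumptions for illustration, not registry data, and `KNOWN_SUFFIXES` and `registrable_domain` are hypothetical names.

```python
from typing import Optional

# Hand-written, illustrative suffixes based on examples from this thread.
KNOWN_SUFFIXES = {"com", "org", "net", "it", "me", "eu", "uk",
                  "com.it", "co.uk", "mobi.eu"}

def registrable_domain(domain: str) -> Optional[str]:
    """Return the longest known suffix plus one label, or None if unknown."""
    labels = domain.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so "com.it" wins over "it".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            # i == 0 means the input is itself a bare suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None

for name in ("mail.example.com.it", "example.ask.me", "x.example.mobi.eu"):
    print(name, "->", registrable_domain(name))
```

In practice the suffix data would have to come from a maintained source such as the Public Suffix List rather than a hand-typed set, since the list of valid endings keeps changing.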

You need to decide when too much is too much and accept that some people won't use their real email address, but what do you have to lose if they don't? Me registering with 1234567890@yahoo.com and forgetting the account right after isn't any different from me using a throw-away temporary address kindly provided by my ISP in unlimited numbers.
  #9  
Old 30th September 2009, 01:00 AM
Vector Offline
Yes, but my ( http://eInformationOrganizer.com ) website does not need to be sending hundreds of Schedule notification reminders to users, if half of them aren't going to real email addresses. There are many cases when input like this harms other users of the site, by slowing it down, causing it to waste too much time and effort providing free services to non-existent email addresses, etc, especially when you pay your hosting provider for a limited number of outgoing emails per day.

Last edited by Vector; 30th September 2009 at 01:13 AM.
  #10  
Old 30th September 2009, 02:06 AM
Vector Offline
Ehhh, i think i got a more reasonable solution: do it backwards.

I know, this is not the way to do it "by the book", but hey, i'm the guy that always gets the book thrown at him. Anyway, in the db table you have just the FQDN to blacklist, and no subdomains. Instead of stripping the input and matching it against the blacklist table, it is easier to match the blacklist table against the input. Since YOU (manually) stripped each entry before putting it into the blacklist table, leaving only the FQDN and nothing more, you can check whether each FQDN from the blacklist table matches the user input. This is easier than doing it the other way (matching user input against the db table) and requires no regular expressions, etc.; but it is not really 'proper'. This is as simple as a foreach loop in your stored procedure, and shouldn't be too intensive, as there probably will never be more than 300 entries in the table and it only runs when a user registers or changes their email address.

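for example (pseudocode):
Code:
foreach (record){
    // domain = the part of input after the "@", lowercased
    // match the record itself or any subdomain of it, so that
    // "notexample.com" does not trigger on "example.com"
    if (domain == record || ends_with(domain, "." + record)){
        return something;
    }
}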
You see, this way, if you have example.com in the db, then you match input like asdfasfasssssf@assfffffassfas.example.com, etc.

Of course this code would be in your sp, not php.
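
Since the post says the check would live in the database, here is a rough sketch of the same "match the table against the input" idea expressed as a single SQL query. It is driven from Python's built-in `sqlite3` module only so that it is self-contained; the table name `blacklist`, the column `domain`, and the function `is_blacklisted` are assumptions, not the poster's schema. The query requires a dot before the stored domain, so `notexample.com` does not match `example.com` the way a plain substring test would.

```python
import sqlite3

def is_blacklisted(conn: sqlite3.Connection, email: str) -> bool:
    """Check the address's domain against every row of the blacklist table."""
    domain = email.rsplit("@", 1)[-1].lower()
    row = conn.execute(
        # Match the stored domain itself, or any subdomain ending in ".<domain>".
        "SELECT 1 FROM blacklist"
        " WHERE ? = domain OR ? LIKE ('%.' || domain) LIMIT 1",
        (domain, domain),
    ).fetchone()
    return row is not None

# Hypothetical in-memory table standing in for the real blacklist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blacklist (domain TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO blacklist VALUES (?)",
                 [("example.com",), ("mailtemp.test",)])

print(is_blacklisted(conn, "asdf@mail.example.com"))  # True
print(is_blacklisted(conn, "user@notexample.com"))    # False
```

One caveat of this sketch: `LIKE` treats `%` and `_` in the stored domain as wildcards, so entries would need to be restricted to plain hostname characters.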

Last edited by Vector; 30th September 2009 at 02:13 AM.
  #11  
Old 30th September 2009, 04:39 AM
aleph Offline
Banned (for/from) behaving just like everybody else!
 
Join Date: Jul 2007
Location: Nanjing, China
Posts: 1,332
You don't have to go over the whole blacklist. Doing a sequential search in a big list is not a good idea once the list grows very long. If you organize your blacklist into an efficient structure, then search can be made fast even for a large database.

Don't go over the database unless strictly necessary! Construct *queries* from user input and *search* the database. You don't care about anything other than whether a record *is there*, and this kind of search can be made very efficient. (Interestingly, many search problems are equivalent to regex matching problems, which you asked about in your first post, so we aren't really adding anything new here.)

There are probably many ways to implement a good searchable database, depending on the properties of your data. Simple hash tables could be fine if the size is not too large, while prefix trees could be useful for larger databases to avoid key collision.
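
Read as "build the lookup keys from the input", the advice above might look like the following Python sketch. A `set` gives hash-based membership tests; `BLACKLIST`, `candidate_domains`, and `is_blacklisted` are hypothetical names, and the two blacklist entries are placeholders.

```python
# Hypothetical blacklist of bare domains; a set gives hash-based lookups.
BLACKLIST = {"example.com", "mailtemp.test"}

def candidate_domains(domain: str) -> list:
    """List the domain and each parent domain, stopping before the bare TLD."""
    labels = domain.lower().rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]

def is_blacklisted(email: str) -> bool:
    """Look up each candidate instead of scanning the whole blacklist."""
    domain = email.rsplit("@", 1)[-1]
    return any(c in BLACKLIST for c in candidate_domains(domain))

print(candidate_domains("a.b.example.com"))
print(is_blacklisted("x@a.b.example.com"))
```

The number of lookups depends on how many labels the address has, not on how many entries the blacklist holds, and no suffix knowledge is needed because every parent domain is tried.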

BTW I don't think you're using the word "FQDN" to mean what it really means... From what I read on Wikipedia, what you wanted to do was to NOT block by FQDN alone.
__________________
Code:
from rlyeh import cthulhu
cthulhu.fhtagn()
  #12  
Old 30th September 2009, 04:43 AM
Vector Offline
I'm pretty good with dba (mine has about 100 tables and 1000 stored procedures, because it is comprised of, and manages several Information Management Systems that i've developed), so i've got the query and table structure parts covered. As far as FQDN, that is everything NOT including subdomain. For my.example.com, the example.com is the FQDN and the my. is the subdomain. The table for the blacklisted FQDNs will consist only of the FQDN itself, and should be no larger than 300 records. This is simple and fast, only backward from the normal method.
  #13  
Old 30th September 2009, 05:16 AM
aleph Offline
Well, yes, "example.com" is an FQDN but so is "my.example.com", if I understand the idea correctly.
  #14  
Old 30th September 2009, 05:25 AM
Vector Offline
Well, the "my." would be a CNAME (Canonical Name) or "alias" to the FQDN. You *could* say that "my." is the FQDN for that particular host, but i would disagree, even if Wikipedia says so. In almost *any* case where FQDN is mentioned, they are talking about the root/parent domain name ("example"), with its TLD (".com"); everything else is a child/sub/host. Anyway, in this case, it is only the FQDN that i'm interested in. If you play with DNS records, you'll get a better idea of what i mean.

Thanx
  #15  
Old 1st October 2009, 09:08 AM
barue Offline
Registered User
 
Join Date: Sep 2009
Location: Florida
Posts: 24
IMO validation is nice and definitely needed to an extent, but...

Certain verification systems require different levels of validation conceptually, not just semantically.

In this case, if you do an MX record lookup you will get the same hostnames/IPs of the mail servers regardless of whether it's a FQDN or a subdomain.

For example:
mxlookup: john@this.is.sub.domain.com
mxlookup: john@domain.com

Both will only return MX records if they are valid, and both will return the same MX records if "this.is.sub" is a valid subdomain of "domain.com".

Also, AFAIK, this is the only way to identify email addresses which may be using DNS or IP redirection to mask their true domain (spoofing).