19

Possible Duplicates:
What is regularity?
In laymans terms: what is the definition of a regular and an irregular language?

I've read that you can't parse HTML with regular expressions because HTML is not a regular language. I tried searching Wikipedia, but I didn't understand a word of what the various related articles said. Can someone explain, in simpler terms, what's a regular (or non-regular) language, and why non-regular languages can't be parsed with regexes?

9

Regular expression (at least in their standard form) aren't able to handle recursion: there is no way to write a regex that checks the balancing of an ARBITRARY level of nested tags. The problem is that HTML has a recursive form, so a parser for it should (in theory!) be prepared to an arbitrary level of nesting. In contrast, regex doen't have this ability: with regex you can neither check if the parentheses in an arithmetic expression match!

4

When discussing computability there are three important levels of abstraction of machines:

Finite state automaton

A finite state automaton is a simple state machine. It can be proved that deterministic and non-deterministic state machines are equally powerful, so the details of those can be ignored for now. The languages that a finite state automaton can parse are called regular languages. A regular language typically lack the nesting of tags with a start and end marker that HTML has.

Push-down automaton

If a data storage in the form of a stack is added to a finite state automaton you get a push-down automaton which can parse HTML as it can "remember" the opening tag when finding the closing one. The languages which are parseable by a PDA are called context free languages. Common programming languages such as C, C++ etc belongs in this class.

Turing Machines

Turing machines are the next step; they are an abstract model for a general computer that be used to compute "anything".

3

You might want to take a look at

  1. Regular Language in laymans terms - SO
  2. What is Regularity -SO
2

Actually one of the definitions of a regular language is that the language could be parsed by regular expressions. :-)

So, joking aside let me try to explain it in simpler terms:

Basically there are three main classes in the hierarchy of languages. The simplest is the regular language, which could be parsed with regular expressions. The second is context free grammar, that includes things like mathematical expressions like ( a + b ) * c (most of the computer languages would fall into this category), and the last is the class of languages that could be processed by Turing machines or through recursion. Turing machine is a simplest form of theoretical computer that can perform the same computation that any modern computer can, albeit not as efficiently. The last class of language include all the languages that could be parsed by machine.

So the fundamental difference between these things is the way that parsing machines are constructed.

The regular expression parse has no memory to store intermediate results. The parsers for context free grammars also contain a stack where they could store a result of computation, and the Turing machine contains the stack and a tape that it could write intermediate results and move it left or right.

So going back to the original questions about HTML parsing. The main reason that HTML could not be parsed with regular expressions is that a regular expression could return only a finite number of results. Take a look at the typical HTML code:

   <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Imte 3</li>
   </ul>

You can try to parse it with something like /(<ul>)(<li>.*</li>)+(</ul)>, but in the result of the 2nd matching group is still HTML code that you have to parse again. Also between these <li> tags there could be an infinite amount of HTML code of different structure. That something has to interpret and re-parse again.

However, you can write an HTML parser that will use regular expressions to separate the tags, attributes and text nodes. But then you would have to augment your parser with some code that will interpret results of this breakdown and call your regular expression again to parse out the inner parts of these tags.

Also, you life will be complicated by the fact that unlike XML, there is no requirement for HTML is to be well formed. If you forget to close the tag, most of the browsers would make an educated guess and will try to display the page anyway. To keep things even more complicated the original HTML did not require to close some tags like <p> <br> <hr> etc. at all.

Here are some examples of HTML that would be really hard to parse, but it is pretty common on the interwebs:

<div>
   Blah blah <p>
   Blah blah <p>
   <i>Blah
   </p>
</div>

Question which of the two <p> tags is closed by the </p> tag?

Hope this helps, and good luck.

2

A regular language is the most restrictive kind of language in the Chomsky Hierarchy that defines 4 classes of languages with different characteristics that need 4 differend kind of automaton to be recognised.

Regular expressions have the power of a finite state machine, that's not powerful as a Turing Machine o a Pushdown automaton.

Regular languages are defined by rules regarding their productions, you should know what a grammar is, and you have rules of the form A -> B | C | D where A, B, C, D can be terminal symbols or non-terminal (able to produce something with their productions).

For example:

S -> A | B | C
A -> a A | a
B -> AB | B
C -> b | c

this is a grammar where capital letters are non terminals and small letters are symbols.

Regular Languages, as said before, have some restrictions:

  • first of all, the left part of a production MUST BE a single terminal: A -> b is allowed, BB -> c or Aa -> c | D aren't. Because left part is not free anymore, the symbol is now related to the context in which you can find it.
  • every rules must have in their right part a terminal with a non-terminal on the left or on the right of the terminal one. But not on both ways. For exampe you can't have A -> cD and D -> Bb in the same grammar.
  • you can have epsilon-rules (rules that produce empty string) but they can't be placed at the end of a production. For example: D -> epsilon is ok but you can't have a rule like A -> cD.

Mainly these restrictions apply to the kind of language that regular languages are able to formulate. For example a regular language is not able to recognise a string that is similar to this one: aaabbbaaa or aabbaa where count of left as is the same of right as because rules don't permit you to formalize this thing (and that one of the reasons because you can't parse HTML with regular expressions).

This because there is a pairing between a regular expression and a Deterministic finite automaton. Usually this is what happens when you want to match a regular expression: a specific, optimized, automaton is built but a DFA can recognise just regular languages.

2

What a lot of contentious answers for a question that has a definitive answer.

I'll throw my hat in the ring - hopefully this is helpful. A regular language is anything that can be matched via a regular expression, which in turn can be constructed directly from a state machine (or a finite automaton).

Let's start with some definitions:

  • An alphabet is a set of characters.

  • A word over an alphabet is a string of characters from that alphabet (repetitions are ok).

  • A language over an alphabet is a set of words over that alphabet.

Now, some examples:

  • Any word in the alphabet can be part of a regular language.

  • Any word repeated 0 or more times can be part of a regular language. This is called a Kleene-closure and is written like (abc). (abc) includes the empty string, "abc", "abcabc", "abcabcabc", ad infinitum.

  • Any word repeated a specific number of times can be part of a regular language. This may be written like "abc"{3} and represents "abcabcabc".

  • The concatenation of any two words in a regular language can be part of a regular language.

  • The disjunction (or) of any two words in a regular language can be part of a regular language. This may be written like (abc)|(def) and represents either "abc" or "def".

The best way I can think of to describe why these are all regular is this: if you had a program that examined some word to determine whether it was part of your language, it could look at that word one character at a time without remembering the previous characters it had seen, and when it got to the end it would know for sure whether that word was part of the language or not. The way it does this is via a state machine (mentioned earlier), which consists of a set of states and transition rules between those states. A transition rule tells you what to do if you are "in" a given state and you see a particular character next.

For example, the language containing only the word "abc" would have states 1, 2, 3, and 4. We call state 1 the "start state" because we always start there, and state 4 is the "accept state" - if at the end of the word we are at the accept state, then we know the word is in our language. We also have a null state - once you get to this state, you can never get out no matter what.

Since this language only accepts "abc", we have the following transitions:

  • 1: a -> 2
  • 2: b -> 3
  • 3: c -> 4

For each state, we also have an implicit transition: if you see any other character besides the one mentioned in the rule above, you go straight to the null state (because we know your word does not match).

So, this is all you have to work with with a regular language - at any given time, while processing the word, you only know which state you are in and what the next character is.

It should also help to give an example of what is NOT regular, and the quintessential example here is the language (a^n)(b^n) - this means "some number of a's followed by the same number of b's". Why isn't this regular? Suppose you wanted to make a machine as described above that would accept all words of the language. This would entail scanning some number of a characters and then making sure that the rest of the word contained the same number of b characters. This, however, is impossible, because at any given point we only know what state we are in and what character comes next. In particular, we cannot "remember" how many a characters we have seen. To be able to do so requires that we have an infinite amount of states - one new state to record each possible number of a characters that we have seen. Since we cannot have infinitely many states in our machine, we cannot construct the machine, and such this language would not be regular.

Sorry the post was so long, but I think if I had no knowledge of the topic, I would want everything broken down to this level for me. Feel free to ask for clarifications if anything doesn't make sense.

1 accepted

In terms of regular expressions, a regular language is one that can be fully described using sets of strings. One of the examples in Wikipedia:

{"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}

HTML is one example of a language which is a higher order of complexity than that. Each <foo> is a single entity, a node in the language. Nodes in turn have children and grandchildren, etc. For a regular language/expression, an entity/node is a single character. In a regular language nodes do not have a hierarchy, there is simply a sequence of them. A character does not have parents or children, just adjacent characters.

Edit: Another way of explaining it: we store HTML in a string. But we eventually store that string as 0s and 1s so it is no more helpful to think of HTML as a string than it is to think of it as 0s and 1s - those are just the underlying implementation mechanism. The language itself is described purely as a hierarchy of nodes. Using regular expressions to parse HTML is working at the wrong level - there is a higher order of meaning embedded in the sets of strings which a string-set-reader can't describe.

0

The regular languages are the ones that can be recognized by the set of finite automatons. Others say the languages recognized by regular expressions. These two are equivalent as one is representable by the other and vice-versa.

To parse html you'd have to know the current 'depth' in the html tree. This is something that finite automatons can't do as they have no 'memory' to remember things.

0

In simple terms, a regular language can be entirely represented by a (complex) regular expression. This why you can parse it with Regex. A non-regular language can't.

-4

Putting it in simple terms, a regular language is a "simple" class of languages (basically anything you can recognize with a regex). A CFG has more complex rules, such that you can define much more words in the language. For example, there is no regex rule you can use to find words in the language {ab, aabb, aaabbb, ...} = {a^nb^n}. However you can define a context-free grammar which will recognize words of this sort.

Technically speaking, you can parse HTML with a regex (i.e. you can find a tag "HELLO"), but the point is that you cannot correctly parse the entire document using regex rules (i.e. you cannot find the HELLO tag and its matching closing tag).

For more information read up on Regular Languages and Context-Free Grammars

-5

I can't think of any reason properly written HTML won't be regex-able. However, improperly written HTML (as, lets face it, most HTML is) might give you trouble, since even HTML with a missing tag here or there will most likely get rendered just fine on a browser, it will be non-regex-able. As to what a regular language is, no idea.