You want to grep all the hrefs from a file.
You write a simple grep, then realize you actually want the non-greedy version, so you add the Perl flag. Then you realize you don't need it, because a negated character class does the job. But now you don't get the closing ", although you don't actually want it, nor the prefix; you just need the value:
grep -o 'href=".*"'
grep -oP 'href=".*?"'
grep -o 'href="[^"]*'
And now the hardships really start. OK, let's put the flag back and try to group the value we want, but it doesn't matter: grep doesn't care about groups at all, -o always prints the whole match. So let's go all out with a lookbehind (after looking up whether it's <= or =<), and while we are at it, should we just add the lookahead too, so the OCD is satisfied:
grep -oP 'href="([^"]*)'
grep -oP '(?<=href=")[^"]*'
grep -oP '(?<=href=").*?(?=")'
Wait a second, what about \K? They said it is the solution for everything (it basically drops everything matched so far). So it seems we are back to a very good level of verbosity. Let's add some final touches and remove the trailing slash, if any, and we are back to the lookahead:
grep -oP 'href="\K[^"]*'
grep -oP 'href="\K.*?(?=/?")'
This process can't be the best we can have. There should be a way to convey what we want without these constraints.
Let's try increasing the number of available operations while reducing the ways they can be composed. By making the operations plain named things, like functions (with autocomplete), we avoid most of the negatives of having more operations. Less composition means reduced efficiency, but we can just compile it back down to a state machine (even at compile time). So instead of writing a state machine, we will just write code that looks like code. Prototype:
(i) => {
  if (i.drop_start('href="')) { // drop = not in the match
    if (i.match_until('"')) {
      i.drop_trailing('/') // no if = optional
      return i
    }
  }
  return null
}
As long as the function is pure and all the calls happening in it belong to the implementation, it should be possible to compile it to a state machine, optimize it at that level, and even produce the equivalent regex. It also becomes possible to debug the matching, because during development we can just keep the code instead of the compiled version.
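To make the "keep the code during development" mode concrete, here is a minimal sketch of an interpreted cursor that the prototype could run against. Only drop_start, match_until and drop_trailing come from the prototype; the Cursor class, its start/end/value fields and the findAll helper are assumptions made up for illustration, not a finished design.

```javascript
// Hypothetical interpreted cursor. The three method names come from the
// prototype; the class, its fields and findAll are assumed for illustration.
class Cursor {
  constructor(text, pos) {
    this.text = text;
    this.start = pos; // where the reported match begins
    this.end = pos;   // where the reported match ends (exclusive)
  }

  // Consume `prefix` at the current position without adding it to the match.
  drop_start(prefix) {
    if (!this.text.startsWith(prefix, this.end)) return false;
    this.end += prefix.length;
    this.start = this.end;
    return true;
  }

  // Extend the match up to, but not including, the next `stop`.
  match_until(stop) {
    const at = this.text.indexOf(stop, this.end);
    if (at === -1) return false;
    this.end = at;
    return true;
  }

  // Remove one `suffix` from the end of the match if it is there.
  drop_trailing(suffix) {
    const hit = this.end - suffix.length >= this.start &&
      this.text.endsWith(suffix, this.end);
    if (hit) this.end -= suffix.length;
    return hit;
  }

  get value() {
    return this.text.slice(this.start, this.end);
  }
}

// Try the matcher at every position, like grep -o does along a line.
function findAll(text, matcher) {
  const out = [];
  let pos = 0;
  while (pos < text.length) {
    const m = matcher(new Cursor(text, pos));
    if (m) {
      out.push(m.value);
      pos = Math.max(m.end, pos + 1); // always make progress
    } else {
      pos += 1;
    }
  }
  return out;
}

// The prototype, unchanged apart from being given a name.
const href = (i) => {
  if (i.drop_start('href="')) {
    if (i.match_until('"')) {
      i.drop_trailing('/');
      return i;
    }
  }
  return null;
};

const html = '<a href="https://example.com/docs/">docs</a> <a href="/about">about</a>';
const hrefs = findAll(html, href);
console.log(hrefs); // [ 'https://example.com/docs', '/about' ]
```

Because every step is an ordinary method call, a breakpoint or a console.log inside drop_trailing shows exactly where a match went wrong, which is the debugging benefit described above.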
And of course, if the above code can produce the state machine, the following semantically equivalent code can be used as well:
(i) => {
  if (i.drop_start('href="') && i.match_until('"')) {
    i.drop_trailing('/')
    return i
  }
  return null
}
Or, going all the way to the other end:
Match.drop_start('href="').match_until('"').optionally.drop_trailing('/')
But keep in mind that this is not a regex builder: there should be no limitations as long as you use the functions provided, and a provided function could just as well be match_xml_attribute_value. Now, who is going to create this?
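As a possible starting point, the chained form could simply record its steps and leave the rest to a backend. The sketch below is a rough illustration under assumptions: Match, drop_start, match_until, optionally and drop_trailing are the names used above, while Chain, add, toRegexSource and escapeRe are invented here. Printing a regex is only one backend, chosen because it is short, and it deliberately understands nothing beyond the href example.

```javascript
// Escape regex metacharacters so literals are matched verbatim.
const escapeRe = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical recorder: every call returns a new Chain with one more step.
class Chain {
  constructor(steps = [], nextOptional = false) {
    this.steps = steps;
    this.nextOptional = nextOptional; // set by `.optionally` for the next step
  }

  add(op, arg) {
    return new Chain([...this.steps, { op, arg, optional: this.nextOptional }]);
  }

  get optionally() { return new Chain(this.steps, true); }
  drop_start(s) { return this.add('drop_start', s); }
  match_until(s) { return this.add('match_until', s); }
  drop_trailing(s) { return this.add('drop_trailing', s); }

  // One possible backend. This sketch only understands the shape used above:
  // drop_start, then match_until, then at most one optional drop_trailing.
  toRegexSource() {
    const [a, b, c, ...rest] = this.steps;
    const ok = a && b &&
      a.op === 'drop_start' && !a.optional &&
      b.op === 'match_until' && !b.optional &&
      (!c || (c.op === 'drop_trailing' && c.optional)) &&
      rest.length === 0;
    if (!ok) throw new Error('step sequence not supported by this sketch');
    const tail = c ? `(?:${escapeRe(c.arg)})?` : '';
    // Lookbehind drops the prefix, lookahead drops the suffix and the stop.
    return `(?<=${escapeRe(a.arg)})[\\s\\S]*?(?=${tail}${escapeRe(b.arg)})`;
  }
}

const Match = new Chain();

const hrefChain = Match.drop_start('href="').match_until('"').optionally.drop_trailing('/');
const source = hrefChain.toRegexSource();
console.log(source); // (?<=href=")[\s\S]*?(?=(?:/)?")

const page = '<a href="https://example.com/docs/">docs</a> <a href="/about">about</a>';
const found = page.match(new RegExp(source, 'g'));
console.log(found); // [ 'https://example.com/docs', '/about' ]
```

The recorded steps are plain data, so the same chain could just as well be handed to an interpreter for debugging or to a state machine compiler; nothing in the description ties it to regex syntax.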