FOP - Filter Orderer and Preener

Michael
Contributor
Posts: 4124
Joined: Sun Aug 23, 2009 8:08 pm

Post by Michael »

FOP, unlike other similar programs, validates the commit comment to ensure that it matches a certain style using the below regular expression:

Code: Select all

^(A|M|P)\:\s(\((.+)\)\s)?(.*)$
The caret (^) anchors the match at the beginning of the comment. The expression then checks for one of three letters indicating the type of change ((A|M|P)) and requires a colon, which has been escaped with a backslash (\:) so that it is taken literally (a colon has no special meaning at this position, so the escape is harmless rather than necessary), followed by exactly one whitespace character (\s).

There is then the option of writing a comment ((\((.+)\)\s)?). This comment is optional, as indicated by the question mark at the end (there must be either one or zero instances of the contents of the brackets), and must be surrounded by round brackets, which have again been escaped. Inside the round brackets, there must be at least one character of any type (.+), and the regular expression will match as much as it can (it is "greedy"). The unescaped brackets surrounding this section form a capturing group, which allows this part of the comment to be extracted later for validation. After the bracketed comment, if it is present, there must be exactly one whitespace character.

The final section of the regular expression ((.*)$) captures all remaining characters (which may be of any type, as indicated by the full stop) until the end of the commit comment ($). This section, which is also enclosed in unescaped (capturing) brackets so that it can be validated later, should contain the address of a page relevant to the alteration in the commit.
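
To see how these pieces fit together, here is a minimal Python sketch that applies the expression to a few comments. Only the pattern itself comes from the post; the helper name split_comment and the sample comments are made up for illustration, and this is not FOP's own validation code.

```python
import re

# The pattern exactly as quoted in the post above.
COMMIT_PATTERN = re.compile(r"^(A|M|P)\:\s(\((.+)\)\s)?(.*)$")

def split_comment(comment):
    """Return (change type, optional tag, remainder), or None if there is no match.

    Hypothetical helper for illustration only.
    """
    match = COMMIT_PATTERN.match(comment)
    if match is None:
        return None
    # Group 1: indicator letter, group 3: text inside the round brackets,
    # group 4: everything after the optional bracketed comment.
    return match.group(1), match.group(3), match.group(4)

print(split_comment("A: (NSFW) http://example.com/page"))
# ('A', 'NSFW', 'http://example.com/page')
print(split_comment("A: [NSFW] http://example.com/page"))
# ('A', None, '[NSFW] http://example.com/page')
print(split_comment("A: (a) (b) http://example.com/"))
# ('A', 'a) (b', 'http://example.com/')  <- greedy (.+) runs to the last ") "
print(split_comment("M: "))
# ('M', None, '')
print(split_comment("X: http://example.com/"))
# None
```

Note that the second sample still matches the pattern: the square-bracket tag simply ends up in the final group, so any rejection has to come from the later validation of that group rather than from the regular expression itself.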

Note that only an indicator character is required if the comment indicates that a modification (M:) is being made, and that if a problem or addition is indicated (A: or P:), there must have been changes to the subscription files before they were sorted by FOP.

This strict validation (it is intended to be strict for consistency) means that it is no longer possible to use the tag "[NSFW]" or "[EasyTest]" in a commit comment, instead requiring the tag to be surrounded with round brackets, although the regular expression could be adjusted to permit this if desired. MonztA, what do you think?
MonztA
EasyList Author
Posts: 8121
Joined: Thu Jul 26, 2007 4:19 pm
Location: Germany

Post by MonztA »

Khrin is the only one who uses [ and ]. Khrin, do you mind using round brackets from now on?
Khrin
EasyList Author
Posts: 3562
Joined: Fri Mar 26, 2010 8:50 pm

Post by Khrin »

Yes, it's not a problem. My major concerns were about URLs that contain square brackets.
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

I don't know if this is intentional, but by looking at this regex I suspect I'll be able to commit with this comment:

Code: Select all

A: 
(e.g., A/M/P + : + ' ')
I guess the .* should also be .+
Michael
Contributor
Posts: 4124
Joined: Sun Aug 23, 2009 8:08 pm

Post by Michael »

Famlam wrote:I don't know if this is intentional, but by looking at this regex I suspect I'll be able to commit with this comment:

Code: Select all

A: 
You won't - the URL is later validated by the urlparse function, which originates from the urllib.parse module. Generally, a protocol (e.g. http), domain name (e.g. example.com) and path (e.g. /) are required for the comment to be accepted. Full details of the requirements are present in the checkcommit function of FOP.
Famlam wrote:I guess the .* should also be .+
Personally, I don't think that it really matters which it is - the address will be found to be invalid anyway if it is an empty string.
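
The kind of check described here can be sketched with urlparse as follows. This is only an illustration, under the assumption that a protocol, domain name and path must all be present; the function name looks_like_address is hypothetical, and FOP's real checks may differ in detail.

```python
from urllib.parse import urlparse

def looks_like_address(text):
    """Hypothetical check: require a scheme, a network location and a path."""
    parts = urlparse(text)
    return bool(parts.scheme and parts.netloc and parts.path)

print(looks_like_address("http://example.com/"))  # True
print(looks_like_address("http://example.com"))   # False: the path is empty
print(looks_like_address(""))                     # False: nothing to parse
```

Under this assumption, an empty address is rejected regardless of whether the regular expression ends in .* or .+, which is the point being made above.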
Michael
Contributor
Posts: 4124
Joined: Sun Aug 23, 2009 8:08 pm

Post by Michael »

Famlam has just spotted another flaw in FOP: it did not detect changes to the order of filters. This has been resolved in FOP 2.997.
Michael
Contributor
Posts: 4124
Joined: Sun Aug 23, 2009 8:08 pm

Post by Michael »

Just to let everyone know, I will release FOP 3 later today, which will likely be my last revision to the program.
Michael
Contributor
Posts: 4124
Joined: Sun Aug 23, 2009 8:08 pm

Post by Michael »

"[T]oday" appears to have been somewhat stretched in definition, but I have finally released FOP 3.0 regardless.
hermawanadhis
ABPindo Author
Posts: 68
Joined: Sun Jun 05, 2011 5:16 am

Post by hermawanadhis »

Please advise me on how to organize http://code.google.com/p/indonesianadbl ... ce/browse/
Do I still need generate_subscriptions.pl? How can I use FOP on a Windows system, so that we can produce subscriptions, TPL and others?

Thanks for your answer.
Ares2
Emeritus Contributor
Posts: 4572
Joined: Thu Sep 27, 2007 12:49 pm

Post by Ares2 »

hermawanadhis wrote:Do I still need generate_subscriptions.pl? [...] so that we can produce subscriptions, TPL and others
The Perl script has been replaced with combineSubscriptions.py, which can be found here: https://hg.adblockplus.org/sitescripts/ ... iptions.py

If you manage to get it to work, it will

1. Join several local and/or remote files with %include statements to a final subscription file
2. Include a checksum in subscription files
3. Create Internet Explorer Tracking Protection Lists .tpl
4. Compress the subscriptions with deflate to .gz (requires the program "p7zip")

hermawanadhis wrote:How can I use FOP on a Windows system?
I'm not sure how to use it on Windows (maybe another author can explain what you need?), but generally you just run it from the command line in the same folder as your Mercurial or Git repository. Before using it, make sure you take a look at project-specific options such as COMMITPATTERN (forces a certain style for commit messages) and IGNORE (files that should not be sorted).
Crits
Liste FR Author
Posts: 682
Joined: Sun Dec 18, 2011 6:21 pm
Location: France

Post by Crits »

@Famlam

In the latest revision of FOP, blocking/whitelisting rules are apparently no longer lowercased automatically.

Could you explain what has motivated this change? In particular, what is the effect of the case of the filters (not CSS selectors) when they are used by ad-blockers? I'm just curious :D
arflech
Senior Member
Posts: 77
Joined: Thu Feb 24, 2011 5:49 pm

Post by arflech »

I know that for some websites, the case of the path component of the URL matters, and of course the case of the query string matters; compare, for example, the following three URLs...
valid and working: https://www.youtube.com/watch?v=soKy1J6yjc4
"not available": https://www.youtube.com/watch?v=soky1J6yjc4 (first K changed to k, making a valid but nonexistent video ID)
"account has been terminated": https://www.youtube.com/wAtch?v=soKy1J6yjc4 (first a changed to A, causing YouTube to interpret it as the userpage for the banned user "wAtch")

Note, however, that once YouTube has parsed a link as a userpage, the username component is then case-insensitive; https://www.youtube.com/google and https://www.youtube.com/GOogle go to the same place, canonically known as https://www.youtube.com/user/Google (so I guess that "banned user" was some wanker who tried to gum up the works by calling hirself "watch").

In fact, the only parts of a URL that are reliably case-insensitive are the scheme and hostname, although they are canonically lowercase.
(FTR, the parts of a CSS selector that are reliably case-insensitive are the element names, attribute names, pseudo-classes, and pseudo-elements.)
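
As a small illustration of that last point about URLs, the sketch below lowercases only the scheme and hostname and leaves the path and query string untouched. The helper name lowercase_safe_parts is made up for this example, and it assumes the URL carries no user:password part in front of the hostname (that part would be case-sensitive).

```python
from urllib.parse import urlsplit, urlunsplit

def lowercase_safe_parts(url):
    """Hypothetical helper: lowercase the scheme and hostname only."""
    parts = urlsplit(url)
    # Assumption: netloc holds just host[:port], with no credentials.
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,      # case may matter, keep as-is
        parts.query,     # case may matter, keep as-is
        parts.fragment,
    ))

print(lowercase_safe_parts("HTTPS://WWW.YouTube.com/watch?v=soKy1J6yjc4"))
# https://www.youtube.com/watch?v=soKy1J6yjc4
```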
Crits
Liste FR Author
Posts: 682
Joined: Sun Dec 18, 2011 6:21 pm
Location: France

Post by Crits »

I've got an issue with the FOP script, maybe due to the latest changes:
||ad.adserverplus.com^$image,~image,popup
The location of its "~" is changed by the script, thus the image,~image workaround is not useful anymore:
||ad.adserverplus.com^$~image,image,popup

Do I need to specify an option to the script to avoid this?
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

Hi, this is an issue with Python 3.3.
I'll try to fix it ASAP (but likely next week). In the meantime, you can apply the bandaid in https://hg.adblockplus.org/easylist/dif ... 36b/FOP.py
EDIT: just fixed it right now, since I just remembered that this doesn't have to take long :)
https://hg.adblockplus.org/easylist/rev/f39ee75f7a40
Crits
Liste FR Author
Posts: 682
Joined: Sun Dec 18, 2011 6:21 pm
Location: France

Post by Crits »

That fixed it, thanks! (both of your commits :D )
Crits
Liste FR Author
Posts: 682
Joined: Sun Dec 18, 2011 6:21 pm
Location: France

Post by Crits »

FOP removes the wildcard of exception rules like:
@@/flash-ads/*$domain=a.com|b.com

(yet it doesn't remove the wildcard present in /flash-ads/*$domain=a.com|b.com)

However, according to ABP, the exception rule without the wildcard is considered as "slow".

FOP 3.5
Python 3.3.0

Thanks!
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

fanboy
EasyList Author
Posts: 12220
Joined: Wed Sep 05, 2007 8:17 pm

Post by fanboy »

would be nice if FOP would sort #@# filters..
MonztA
EasyList Author
Posts: 8121
Joined: Thu Jul 26, 2007 4:19 pm
Location: Germany

Post by MonztA »

It does, but we decided to let it sort by the actual filter rather than the excepted domain.
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

(so you can see which filters are bad or good in a blink of an eye, as well as consistency with the @@ rules that also do not sort upon the domain but upon the filter)
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

I've updated FOP so that it automatically combines the domains of two rules when they should be combined:
https://hg.adblockplus.org/easylist/rev/24d213c89378

This should make commits like https://hg.adblockplus.org/easylist/rev/445cae1a331b belong to the past :)
MonztA
EasyList Author
Posts: 8121
Joined: Thu Jul 26, 2007 4:19 pm
Location: Germany

Post by MonztA »

Yay!
hermawanadhis
ABPindo Author
Posts: 68
Joined: Sun Jun 05, 2011 5:16 am

Post by hermawanadhis »

Famlam wrote:I've updated FOP so that it automatically combines the domains of two rules when they should be combined:
https://hg.adblockplus.org/easylist/rev/24d213c89378

This should make commits like https://hg.adblockplus.org/easylist/rev/445cae1a331b belong to the past :)
Thanks for the v3.8 update. After updating, my element hiding filters are not sorted automatically, for example:

Code: Select all


wowkeren.com###Ad728
lintas.me###Advertisement
mivo.tv###BFAslidein
icinema3satu.com###FloatAlamindawa
indogamers.com###IklanIDGS
chip.co.id###TopBannerBg
173.199.167.192,forumbokep.com,idfl.us,kampus.us,krucil.com,krucil.net,semprot.com,sodasusu.com###ad_global_below_navbar
Combining filters works fine, but the others are not sorted alphabetically.
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

It now sorts the items based upon the filter rather than the first character. Since it combines domains now, sorting upon the domain doesn't really help anymore in finding filters (since a filter for domain xyz could be located around the 'x', but it could also be located around the 'a' if the filter is combined to anysite,xyz##filter).
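
A simplified sketch of this behaviour is shown below. It is not FOP's actual code: the function name combine_and_sort is made up, the sketch assumes every rule is a domain-specific element hiding rule of the form domains##selector, and it does not reproduce FOP's own conditions for when two rules should be combined.

```python
from collections import defaultdict

def combine_and_sort(rules):
    """Hypothetical helper: merge domains per selector, then sort by selector."""
    domains_by_selector = defaultdict(set)
    for rule in rules:
        # Split at the first "##"; anything after it is the selector.
        domains, selector = rule.split("##", 1)
        domains_by_selector[selector].update(domains.split(","))
    return [
        "{domains}##{selector}".format(domains=",".join(sorted(domains)), selector=selector)
        for selector, domains in sorted(domains_by_selector.items())
    ]

rules = [
    "xyz.example###ad",
    "lintas.me###Advertisement",
    "anysite.example###ad",
    "wowkeren.com###Ad728",
]
for rule in combine_and_sort(rules):
    print(rule)
# wowkeren.com###Ad728
# lintas.me###Advertisement
# anysite.example,xyz.example###ad
```

In this output the rule for xyz.example is listed under its selector and after anysite.example, so sorting on the first domain would no longer help in locating it.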
Lain_13
RU AdList Author
Posts: 1041
Joined: Fri Aug 20, 2010 11:20 am

Post by Lain_13 »

Famlam
I prefer to use grep for this purpose: hg grep string_to_search -r tip
You can even use regular expressions.

BTW, I came here today to share the changes I made to FOP when I started using it to sort things in RU AdList.

1. I've modified the sort function to handle the case ~popup,popup (block everything including popups); I also prefer to have the third-party option first in the list.

Code: Select all


def sortfunc (option):
    # For identical options, the inverse always follows the non-inverse option ($image,~image instead of $~image,image) with exception for popup filter
    if option[0] == "~": return option[1:] + "{"
    if option == "popup": return option + "}"
    # Also let third-party will always be first in the list
    if option.find("third-party") > -1: return "0" + option
    return option

def filtertidy (filterin):
    """ Sort the options of blocking filters and make the filter text
    lower case if applicable."""
    optionsplit = re.match(OPTIONPATTERN, filterin)

    if not optionsplit:
        # Remove unnecessary asterisks from filters without any options and return them
        return removeunnecessarywildcards(filterin)
    else:
        # If applicable, separate and sort the filter options in addition to the filter text
        filtertext = removeunnecessarywildcards(optionsplit.group(1))
        optionlist = optionsplit.group(2).lower().replace("_", "-").split(",")

        domainlist = []
        removeentries = []
        for option in optionlist:
            # Detect and separate domain options
            if option[0:7] == "domain=":
                domainlist.extend(option[7:].split("|"))
                removeentries.append(option)
            elif option.strip("~") not in KNOWNOPTIONS:
                print("Warning: The option \"{option}\" used on the filter \"{problemfilter}\" is not recognised by FOP".format(option = option, problemfilter = filterin))
        # Sort all options other than domain alphabetically with a few exceptions
        optionlist = sorted(set(filter(lambda option: option not in removeentries, optionlist)), key = sortfunc)

        # If applicable, sort domain restrictions and append them to the list of options
        if domainlist:
            optionlist.append("domain={domainlist}".format(domainlist = "|".join(sorted(set(domainlist), key = lambda domain: domain.strip("~")))))

        # Return the full filter
        return "{filtertext}${options}".format(filtertext = filtertext, options = ",".join(optionlist))
Actual changes:

Code: Select all


def sortfunc (option):
    # For identical options, the inverse always follows the non-inverse option ($image,~image instead of $~image,image) with exception for popup filter
    if option[0] == "~": return option[1:] + "{"
    if option == "popup": return option + "}"
    # Also let third-party will always be first in the list
    if option.find("third-party") > -1: return "0" + option
    return option
and

Code: Select all


        # Sort all options other than domain alphabetically with a few exceptions
        optionlist = sorted(set(filter(lambda option: option not in removeentries, optionlist)), key = sortfunc)
instead of

Code: Select all


        # Sort all options other than domain alphabetically
        # For identical options, the inverse always follows the non-inverse option ($image,~image instead of $~image,image)
        optionlist = sorted(set(filter(lambda option: option not in removeentries, optionlist)), key = lambda option: (option[1:] + "~") if option[0] == "~" else option)
2. I've extended the list of commands performed on commit to better handle situations where commits were made from different sources, and to merge changes automatically. I'm not sure I did it right, but it works for me in most cases.

Code: Select all


REPODEF = collections.namedtuple("repodef", "name, directory, locationoption, repodirectoryoption, checkchanges, difference, pull, checkupdate, update, merge, commit, push")
GIT = REPODEF(["git"], "./.git/", "--work-tree=", "--git-dir=", ["status", "-s", "--untracked-files=no"], ["diff"], ["pull"], ["update", "--check"], ["update"], ["merge"], ["commit", "-m"], ["push"])
HG = REPODEF(["hg"], "./.hg/", "-R", None, ["stat", "-q"], ["diff"], ["pull"], ["update", "--check"], ["update"], ["merge"], ["commit", "-m"], ["push"])
and

Code: Select all


    # Allow users to abort the commit process if they do not approve of the changes
    except (KeyboardInterrupt, SystemExit):
        print("\nCommit aborted.")
        return

    print("Comment \"{comment}\" accepted.".format(comment = comment))
    try:
        print("\nConnecting to server. Please enter your password if required.")
        # Update the server repository as required by the revision control system
        for command in repository[6:]:
            if command == repository.commit:
                # Build a new list so the shared repository definition is not modified in place
                command = command + [comment]
            command = basecommand + command
            subprocess.Popen(command).communicate()
            print()
    except(subprocess.CalledProcessError):
        print("Unexpected error with the command \"{command}\".".format(command = command))
        raise subprocess.CalledProcessError("Aborting FOP.")
    except(OSError):
        print("Unexpected error with the command \"{command}\".".format(command = command))
        raise OSError("Aborting FOP.")
    print("Completed commit process successfully.")
Actual change in the second part:

Code: Select all


        print("\nConnecting to server. Please enter your password if required.")
        # Update the server repository as required by the revision control system
        for command in repository[6:]:
            if command == repository.commit:
                # Build a new list so the shared repository definition is not modified in place
                command = command + [comment]
instead of

Code: Select all


        # Commit the changes
        command = basecommand + repository.commit + [comment]
        subprocess.Popen(command).communicate()
        print("\nConnecting to server. Please enter your password if required.")
        # Update the server repository as required by the revision control system
        for command in repository[7:]:
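
For point 1 of this post, the effect of the two sort keys can be compared with a short standalone sketch. The key functions below mirror the two variants quoted above; the names original_key and modified_key and the sample option list are made up for illustration.

```python
def original_key(option):
    # Key from the "instead of" snippet: the inverse follows the non-inverse option.
    return (option[1:] + "~") if option[0] == "~" else option

def modified_key(option):
    # Same logic as the sortfunc shown in this post.
    if option[0] == "~": return option[1:] + "{"
    if option == "popup": return option + "}"
    if option.find("third-party") > -1: return "0" + option
    return option

options = ["popup", "~image", "third-party", "image", "~popup"]

print(sorted(options, key=original_key))
# ['image', '~image', 'popup', '~popup', 'third-party']
print(sorted(options, key=modified_key))
# ['third-party', 'image', '~image', '~popup', 'popup']
```

The modified key relies on "{" sorting before "}" so that ~popup lands ahead of popup, while a leading "0" pushes third-party to the front. An inverted ~third-party would not be moved to the front, because the "~" test returns first.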
Khrin
EasyList Author
Posts: 3562
Joined: Fri Mar 26, 2010 8:50 pm

Post by Khrin »

One of these filters blocks FOP:

Code: Select all

##.home-ad728
##.pc-ad
##.sidebar-ad-cont
searchenginejournal.com###bg-atag
searchenginejournal.com###bg-takeover-unit
searchenginejournal.com##.footer-unit
||searchenginejournal.com^*/sponsored-
||searchenginejournal.com^*-takeover-
Everything goes fine until, after the filters are sorted, what seems to be an error message appears and the window closes so quickly that I'm unable to read it. After that, it's impossible to make any changes until a new copy of the repository is downloaded.
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

The problem is the third letter in 'atđhe.net###between_links', which was added earlier this week and ends up in the diff because it's alphabetically close to the bg-atag stuff. Just wondering why FOP didn't complain earlier (or wasn't FOP used to add it?). Anyway, I'll fix it...
fanboy
EasyList Author
Posts: 12220
Joined: Wed Sep 05, 2007 8:17 pm

Post by fanboy »

I used FOP when I added it.. didn't see any complaints with the script at the time
Famlam
EasyList Author
Posts: 1782
Joined: Sun May 09, 2010 11:37 am
Location: The Netherlands

Post by Famlam »

Strange. I could reproduce it every time it was in the diff.
Anyway, it should be fixed (or actually: bandaided) now. I couldn't find a simple way to, for instance, replace the problematic characters with '?', so it just returns the unformatted diff. (Don't forget to hg pull before trying it out ;) )
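
For reference, one way such a '?' replacement could look in Python is sketched below. This is not what FOP does, and it rests on an assumption the thread does not confirm, namely that the failure came from printing a character the console encoding cannot represent. The helper name printable is hypothetical.

```python
import sys

def printable(text, encoding=None):
    """Hypothetical helper: swap characters the target encoding lacks for '?'."""
    # Assumption: fall back to ASCII when the console encoding is unknown.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

print(printable("atđhe.net###between_links", encoding="ascii"))
# at?he.net###between_links
```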
easyfan
Guest

Post by easyfan »

Can FOP order https://github.com/easylist/easylist/bl ... c_hide.txt by domain name?
With Python 3.5.2 it did not order them.