There are enough raw patterns in here to justify organizing the file.
Now that whitespace and comments are supported, I've been dividing it into sections.
More critical problems should be near the top, as I would rather the script identify a file as a backdoor than as a spammer.
I don't know the history behind a lot of these or the implications of the code, so I'm sure I miscategorized many. There are also many that I have not done yet.
The pattern files are large and complex enough to justify some whitespace and comments to explain what each entry is.
Added logic to check whether a line is empty or its first character is '#' before using it as a pattern. Empty and commented lines are simply skipped.
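As a rough illustration of the skip rule described above, here is a minimal Python sketch. It is not the scanner's own code (the scanner is written in PHP); the function name `load_patterns` and the sample entries are hypothetical. It follows the stated rule literally, so a line containing only spaces would still be treated as a pattern.

```python
def load_patterns(lines):
    """Return usable patterns, skipping empty lines and '#' comment lines."""
    patterns = []
    for raw in lines:
        # Drop the trailing newline only; other whitespace is left untouched.
        line = raw.rstrip("\r\n")
        # Skip the line if it is empty or its first character is '#'.
        if line == "" or line[0] == "#":
            continue
        patterns.append(line)
    return patterns


# Hypothetical pattern-file content, for illustration only.
sample = ["# backdoors\n", "\n", "eval(base64_decode(\n", "# spam\n", "viagra\n"]
print(load_patterns(sample))
```

Only the two non-comment, non-empty entries survive, so section headings and blank separator lines can be added freely to the pattern file.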
cat php-malware-scanner-master/whitelist.txt | sort -k 2,2 -k 1,1 | less
More of an OCD thing than anything, but might as well sort primarily by file path, secondarily by hash value.
cat whitelist.txt | sort -k 2 | less
No reason this shouldn't be sorted perfectly to keep like files together.
No white list rules changed... just plain sorting.
preg_replace should be shortened to just replace, as it will also match str_replace, str_ireplace, ereg_replace, eregi_replace and, I'm sure, many others. This should increase the number of hits.
'preg_replace' base64 strings: (removed)
cHJlZ19yZXBsYWNl
ByZWdfcmVwbGFjZ
wcmVnX3JlcGxhY2
'replace' base64 strings: (added)
cmVwbGFjZ
JlcGxhY2
yZXBsYWNl
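The effect of this change can be checked with a short Python sketch (not part of the scanner). The two fragment lists are copied from this message; the helper `matches` and the sample payloads are hypothetical. Each payload is base64-encoded and tested against the old and the new fragments.

```python
import base64

# Fragments copied from the lists above.
OLD_FRAGMENTS = ["cHJlZ19yZXBsYWNl", "ByZWdfcmVwbGFjZ", "wcmVnX3JlcGxhY2"]
NEW_FRAGMENTS = ["cmVwbGFjZ", "JlcGxhY2", "yZXBsYWNl"]

FUNCTIONS = ["preg_replace", "str_replace", "str_ireplace",
             "ereg_replace", "eregi_replace"]


def matches(fragments, payload):
    """True if any fragment occurs in the base64 encoding of payload."""
    encoded = base64.b64encode(payload.encode("ascii")).decode("ascii")
    return any(fragment in encoded for fragment in fragments)


for name in FUNCTIONS:
    payload = name + "($a, $b, $c);"  # made-up call, for illustration only
    print(name, matches(OLD_FRAGMENTS, payload), matches(NEW_FRAGMENTS, payload))
```

The shorter fragments match every function in the list, while the old ones only fire when the payload literally contains preg_replace. The trade-off is that shorter fragments are also more likely to match harmless text.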
JHZpc2l0Y291bnQgPSAkSFRUUF9DT09LSUVf is correct. It is the encoded version of "$visitcount = $HTTP_COOKIE_"
I seem to have added a couple of extra characters that should not have been there. Not sure where they came from.
Because base64 converts from an 8-bit to a 6-bit character system (3 input bytes become 4 output characters), you can get 3 different base64 strings from a single ASCII string, depending on the position of its first character modulo 3.
for example:
base64_encode("system");
base64_encode(" system");
base64_encode("( system");
The above 3 input strings all produce very different base64 signatures even though they all contain the same keyword 'system'. This is because the first letter of system, 's', falls at byte offsets 0, 1 and 2 respectively.
I updated several of the base64 samples to include their offset counterparts, as the originals would only catch about 1 in 3 of the matches actually present.
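One way to derive the offset counterparts is sketched below in Python (an illustration, not the tool used for this commit; the function name `base64_fragments` is hypothetical). For each of the three alignments it pads the keyword with filler bytes, encodes it, and trims the leading and trailing characters whose value depends on the neighbouring bytes, leaving only the part that is stable wherever the keyword appears.

```python
import base64


def base64_fragments(keyword):
    """Return the stable base64 fragment of keyword for byte offsets 0, 1, 2."""
    data = keyword.encode("ascii")
    # Leading base64 characters that mix in bits of the filler bytes.
    lead_chars = {0: 0, 1: 2, 2: 3}
    fragments = []
    for offset in range(3):
        padded = b"\x00" * offset + data
        encoded = base64.b64encode(padded).decode("ascii").rstrip("=")
        # If the input does not end on a 3-byte boundary, the last character
        # also carries bits of whatever follows the keyword, so drop it.
        if len(padded) % 3 != 0:
            encoded = encoded[:-1]
        fragments.append(encoded[lead_chars[offset]:])
    return fragments


print(base64_fragments("replace"))
# ['cmVwbGFjZ', 'JlcGxhY2', 'yZXBsYWNl']
```

For 'replace' this reproduces the three strings listed earlier, and the same call can be used for any other keyword in the pattern file.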
Added extra patterns for scanning, taken from samples I found on my server.
Added an extra check that looks for googlebot and htaccess, which is useful for cleaning up leftover files.