Example usage:
I want to see if a giant block of base64 code contains any references to the string 'base64'.
The naive approach is to convert the string to its base64 equivalent, YmFzZTY0.
There are two problems with this approach. The first is that the encoded string will be different depending on where the first character, 'b', sits in the plain text before encoding. Possible offsets are 0 bits, 2 bits or 4 bits. The above example only calculates the 0 bit offset. There should be 3 separate base64 strings to look for.
The second problem is that base64 uses a 6 bit encoding, so its characters don't line up with 8 bit bytes. This leads to character bleeding at the beginning and end of a string, where the encoded characters change depending on the immediate context. This script calculates the longest constant string that is guaranteed to be present. Unfortunately that requires trimming characters, which can often lead to very short strings.
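A rough Python sketch of the idea described above, not the actual script. The function name base64_fingerprints and the use of zero bytes as filler are assumptions made for illustration. It simulates the three alignments by prepending 0, 1 or 2 filler bytes, then trims every base64 character that shares bits with the filler or with whatever would follow the string:

```python
import base64

def base64_fingerprints(text):
    """Return the 3 base64 substrings guaranteed to appear when `text`
    is embedded somewhere in base64-encoded data (hypothetical helper)."""
    data = text.encode("ascii")
    fingerprints = []
    for lead in range(3):
        # `lead` filler bytes shift the text to byte offset 0, 1 or 2.
        padded = b"\x00" * lead + data
        encoded = base64.b64encode(padded).decode("ascii").rstrip("=")
        # Leading characters that contain filler bits: 0, 2 or 3 of them.
        start = (lead * 8 + 5) // 6
        # Only characters made entirely of known bits are kept at the end.
        end = (len(padded) * 8) // 6
        fingerprints.append(encoded[start:end])
    return fingerprints

print(base64_fingerprints("replace"))
```

Called on 'replace', this gives cmVwbGFjZ, JlcGxhY2 and yZXBsYWNl, one string per offset.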
Found a bug in my base64 converter
My base64 conversion script is supposed to find the longest string that is guaranteed to be present if the input plain text string is somewhere in the original plain text code. However, there was an off-by-one error which made some patterns 1 character longer than they should have been. Short patterns (i.e. 4 chars) were prone to false positives because they really were 3 character patterns, which is too short to be useful. Long patterns were likely missing results.
Should be fixed now.
Shortened the base64 fingerprints of 'base64_decode' to just 'base64'. This will also catch cases of base64_encode, which isn't quite so bad but is still worth finding.
Interesting note from the php.net manual on create_function:
Caution
This function internally performs an eval() and as such has the same security issues as eval(). Additionally it has bad performance and memory usage characteristics.
This file just contains a list of internal PHP 7 functions (probably incomplete, depending on extensions etc.) and their 3 base64 fingerprints. It is designed to be used either as a pattern file to explore potential patterns that may be effective, or simply as a reference to translate between plain text PHP and the 3 different base64 versions.
This is a file of base64 patterns that represent strings that would be present if any of the functions in php7 were encoded to base64. I'll probably add structure later by grouping them with their plain text translation.
This file is useful to swap out with patterns_raw.txt to gain additional insights into other strings to search for in base64.
There are enough raw patterns in here to justify organizing the file.
Now that whitespace and comments are supported, I've been dividing it into sections.
More critical problems should be near the top, as I would rather the script identify a file as a backdoor than as a spammer.
I don't know the history behind a lot of these or the implication of the code, so I'm sure I mis-categorized many. There are also many that I have not done yet.
The pattern files are large and complex enough to justify some whitespace and comments to explain what each entry is.
Added logic to check if the line is empty or if the first character is equal to '#' before using it as a pattern. Simply skips over empty and commented lines.
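A minimal Python sketch of that skip logic, for illustration only. The scanner's real code is not shown here, and the name load_patterns is made up. This version only strips line endings, so whether leading whitespace before a '#' should count as a comment is left as an open assumption:

```python
def load_patterns(lines):
    """Return usable patterns, skipping empty lines and '#' comments."""
    patterns = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip the line if it is empty or its first character is '#'.
        if line == "" or line[0] == "#":
            continue
        patterns.append(line)
    return patterns

sample = ["# 'replace' fingerprints", "cmVwbGFjZ", "", "JlcGxhY2", "yZXBsYWNl"]
print(load_patterns(sample))
```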
cat php-malware-scanner-master/whitelist.txt | sort -k 2,2 -k 1,1 | less
More of an OCD thing than anything, but might as well sort primarily by file path, secondarily by hash value.
cat whitelist.txt | sort -k 2 | less
No reason this shouldn't be sorted perfectly to keep like files together.
No white list rules changed... just plain sorting.
preg_replace should be shortened to just 'replace', as it will also match str_replace, str_ireplace, ereg_replace, eregi_replace and many others, I'm sure. This should increase the number of hits.
'preg_replace' base64 strings: (removed)
cHJlZ19yZXBsYWNl
ByZWdfcmVwbGFjZ
wcmVnX3JlcGxhY2
'replace' base64 strings: (added)
cmVwbGFjZ
JlcGxhY2
yZXBsYWNl
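A quick Python check of why the shorter patterns catch more, written as a sketch rather than as part of the scanner. The helper name matching_patterns and the sample PHP snippets are made up. In these samples 'replace' starts at byte 5, 4 and 6, so each one lands on a different one of the three offsets:

```python
import base64

# The three 'replace' fingerprints listed above.
REPLACE_PATTERNS = ["cmVwbGFjZ", "JlcGxhY2", "yZXBsYWNl"]

def matching_patterns(plain_text):
    """Encode plain_text and report which fingerprints occur in the result."""
    encoded = base64.b64encode(plain_text.encode("ascii")).decode("ascii")
    return [p for p in REPLACE_PATTERNS if p in encoded]

samples = [
    "preg_replace($a, $b, $c);",
    "str_replace($a, $b, $c);",
    "eregi_replace($a, $b, $c);",
]
for sample in samples:
    print(sample, matching_patterns(sample))
```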
JHZpc2l0Y291bnQgPSAkSFRUUF9DT09LSUVf is correct. It is the encoded version of "$visitcount = $HTTP_COOKIE_".
I seem to have added a couple more characters than I should have. Not sure where they came from.
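One way to double check a pattern like this is a simple round trip. The Python below is only a sketch of such a check, and the variable names are made up. It works directly here because the plain text is 27 bytes, a multiple of 3, so the pattern is a complete base64 string with no padding:

```python
import base64

pattern = "JHZpc2l0Y291bnQgPSAkSFRUUF9DT09LSUVf"
plain = "$visitcount = $HTTP_COOKIE_"

# Decode the pattern and compare it with the expected plain text.
decoded = base64.b64decode(pattern).decode("ascii")
print(decoded == plain)

# Encoding the plain text should give the pattern back unchanged.
reencoded = base64.b64encode(plain.encode("ascii")).decode("ascii")
print(reencoded == pattern)
```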
Because base64 converts from an 8 bit to a 6 bit character system, you can get 3 unique base64 strings from a single ascii string depending on the position of the first character.
for example:
base64_encode("system");
base64_encode(" system");
base64_encode("( system");
The above 3 input strings all produce very different base64 signatures even though they all contain the same keyword 'system'. This is because the first letter of system, 's', falls at indices 0, 1 and 2 respectively.
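For reference, the same three inputs run through Python's base64 module, which should produce the same output as PHP's base64_encode for these ASCII strings. The helper name encode is just for illustration:

```python
import base64

def encode(text):
    """Base64-encode an ASCII string and return the result as text."""
    return base64.b64encode(text.encode("ascii")).decode("ascii")

for text in ["system", " system", "( system"]:
    print(repr(text), encode(text))
# 'system' c3lzdGVt
# ' system' IHN5c3RlbQ==
# '( system' KCBzeXN0ZW0=
```

None of the three outputs contains either of the other two, so a single pattern taken from the first encoding alone would miss the second and third cases.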
I updated several of the base64 samples to include their offset counterparts as the originals would only catch about 1 in 3 of the actual present matches.
Added extra patterns for scanning, taken from samples I found on my server.
Added extra-check. It checks for googlebot and htaccess, which is useful for cleaning up leftover files.