A friend with a WordPress website asked for help cleaning up from spam about ~400 thousands comments (about ~200M), he had guests enabled to comment.
I chosen to use python with curses (ncurses) and this pybayesantispam python module.
The curses GUI is a list with an abstract from the comment content and a column with the current spam rating, a cursor to select a comment and some key to tag it as (s)pam or (h)am, the bayesian module require some manual training, (n)ext (p)revious to browse the comments table.
The code is ugly, had a very limited time, but I been impressed how fast got a good result.
The current code include GeoIP and show the country code but isn’t used to filter the spam, in this case was an overhead, but would have been easy to add more weight factors: create a whitelist of emails from manually tagged ham, blacklist from manually tagged spam, guests users have id 0, blacklist/whitelist of ip from manually tagged comments.
1) import the whole CSV as a table, look for load data infile
(or just use some tool like phpMyAdmin), here will be vit_orig
.
2) create a table for each column to normalize, I usually use something like an id uint(11) auto-increment primary-key, descr varchar(255)
3) populate the created tables using the import, for each table, vit_tipo
in the example, use something like:
INSERT INTO vit_tipo( descr )
SELECT tipo
FROM vit_orig
GROUP BY tipo
4), create the new table (vit_rel_mmmist
in the example) with the needed <table>_id
relations, then populate it with ids regenerated finding text from the original vit_orig
and the normalized created at point 2:
INSERT INTO vit_rel_mmmist (marca_id, modello_id, misura_id, indice_id, stagione_id, tipo_id)
SELECT ma.id , mo.id , mi.id , ind.id, st.id, ti.id
FROM vit_orig2
LEFT JOIN vit_marca ma ON marca=ma.descr
LEFT JOIN vit_modello mo ON modello=mo.descr
LEFT JOIN vit_misura mi ON misura=mi.descr
LEFT JOIN vit_indice ind ON indice=ind.descr
LEFT JOIN vit_stagione st ON stagione=st.descr
LEFT JOIN vit_tipo ti ON tipo=ti.descr
In the example marca, modello, misura, indice, stagione, tipo
are the column names of vit_orig
.
Setting timezone for apache/php/mysql (logs timestamp also)
# apache2 on debian (it get system tz)
dpkg-reconfigure tzdata
# this doesn't work (maybe useful for .htaccess/vhosts?)
echo "SetEnv TZ Europe/Rome" > /etc/apache2/conf-available/tz-rome.conf
service apache2 reload
# php
echo 'date.timezone = "Europe/Rome"' > /etc/php5/apache2/conf.d/30-tz-rome.ini
# mysql (untested)
default-time-zone = "Europe/Rome"
~root:
apt-get install lighttpd mysql-server php5-common php5-cgi php5 php5-mysql
lighty-enable-mod fastcgi-php
service lighttpd force-reload
chown www-data:www-data /var/www
chmod 775 /var/www
usermod -a -G www-data pi
~pi:
newgrp www-data # needed only once after usermod
echo '<?php phpinfo(); ' > /var/www/index.php