xenogenesi::blog
memento
3d adt android apache2 app apt aria2 build bullet cflags chromium codeigniter debian demoscene dependencies dpkg driver emulator freeglut gcc gfx git glut htaccess javascript json kernel linux make metalink minimal mysql opengl php python raspbian realtime rpi specs template toolchain update-alternatives video wifi wordpress

delete thousands of spam comments from WordPress

A friend with a WordPress website asked for help cleaning up from spam about ~400 thousands comments (about ~200M), he had guests enabled to comment.

I chosen to use python with curses (ncurses) and this pybayesantispam python module.

The curses GUI is a list with an abstract from the comment content and a column with the current spam rating, a cursor to select a comment and some key to tag it as (s)pam or (h)am, the bayesian module require some manual training, (n)ext (p)revious to browse the comments table.

The code is ugly, had a very limited time, but I been impressed how fast got a good result.

The current code include GeoIP and show the country code but isn’t used to filter the spam, in this case was an overhead, but would have been easy to add more weight factors: create a whitelist of emails from manually tagged ham, blacklist from manually tagged spam, guests users have id 0, blacklist/whitelist of ip from manually tagged comments.

py-curses-wp-spam.tgz

Import and normalize CSV file in MySQL

1) import the whole CSV as a table, look for load data infile (or just use some tool like phpMyAdmin), here will be vit_orig.

2) create a table for each column to normalize, I usually use something like an id uint(11) auto-increment primary-key, descr varchar(255)

3) populate the created tables using the import, for each table, vit_tipo in the example, use something like:

INSERT INTO vit_tipo( descr ) 
SELECT tipo
FROM vit_orig
GROUP BY tipo

4), create the new table (vit_rel_mmmist in the example) with the needed <table>_id relations, then populate it with ids regenerated finding text from the original vit_orig and the normalized created at point 2:

INSERT INTO vit_rel_mmmist (marca_id, modello_id, misura_id, indice_id, stagione_id, tipo_id)
SELECT ma.id , mo.id , mi.id , ind.id, st.id, ti.id
FROM vit_orig2
LEFT JOIN vit_marca ma ON marca=ma.descr
LEFT JOIN vit_modello mo ON modello=mo.descr
LEFT JOIN vit_misura mi ON misura=mi.descr
LEFT JOIN vit_indice ind ON indice=ind.descr
LEFT JOIN vit_stagione st ON stagione=st.descr
LEFT JOIN vit_tipo ti ON tipo=ti.descr

In the example marca, modello, misura, indice, stagione, tipo are the column names of vit_orig.

apache2/php/mysql (logs) timezone

Setting timezone for apache/php/mysql (logs timestamp also)

# apache2 on debian (it get system tz)
dpkg-reconfigure tzdata

# this doesn't work (maybe useful for .htaccess/vhosts?)
echo "SetEnv TZ Europe/Rome" > /etc/apache2/conf-available/tz-rome.conf
service apache2 reload

# php
echo 'date.timezone = "Europe/Rome"' > /etc/php5/apache2/conf.d/30-tz-rome.ini

# mysql (untested)
default-time-zone = "Europe/Rome"

lighttpd+mysql+php on Raspbian

~root:

apt-get install lighttpd mysql-server php5-common php5-cgi php5 php5-mysql
lighty-enable-mod fastcgi-php
service lighttpd force-reload
chown www-data:www-data /var/www
chmod 775 /var/www
usermod -a -G www-data pi

~pi:

newgrp www-data # needed only once after usermod
echo '<?php phpinfo(); ' > /var/www/index.php