RO CSVI

Debugging Google Base Export

Many people have reported issues with the Google Base upload after exporting products using CSV Improved. In most cases it is not CSV Improved exporting a bad file but the file containing characters Google Base does not like. These characters are often not even easily visible but they can be discovered using the right tools.

Procedure

This guide is based on a Linux operating system, same or similar tools might be available on Windows but no guarantees.

Manual mode: Locating the offending character

The first step is to check if your XML file is well-formed. This means it only checks if there are no open tags as a bonus we also get a message if it finds offending characters. The same characters Google doesn't like.

To do this, log into your system and go to where your XML file is. Once there, execute the following command:

xmlwf myfile.xml

In case the file is perfectly fine, you will receive no message. This means your file is ready to upload to Google Base. If there is a problem with the file you will receive a message that includes both the line number and the character position of the offending character. This is important information because now it is clear where the problem is. The message looks like this:

myfile.xml:7:291: not well-formed (invalid token)

In this example the offending character is on line 7 and position 291.

Semi-automatic: Locating the offending character

For this task I have written an analyze script in bash and PHP. What the code does is look for any errors the xmlwf generates and analyze the XML file for the error. The offending character is then put into a logfile. Here is the bash script:

#!/bin/sh
# Set some variables
FOLDER='/home/mysite/jobs/'
FILE='myfile.xml'
LOGFILE='filecheck.log'
ANALYZEFILE='analyze.log'
ANALYZESCRIPT='analyze.php'
 
# Lets go to the appropiate folder
cd $FOLDER
 
# Check if a new file has been created
if [ -e $FILE ]; then {
# Check if the file is well formed
`/usr/bin/xmlwf $FOLDER/$FILE | grep -q 'not well-formed'`
 
if [ $? -eq 0 ] ; then {
echo "XML not valid" >> $FOLDER/$LOGFILE
message=`/usr/bin/xmlwf $FOLDER/$FILE`
echo $message >> $FOLDER/$LOGFILE
 
# Analyze the error
TEXT=${message%%:*}; rest=${message#*:}
# Get the line number
LINEPOS=${rest%%:*}; rest=${rest#*:}
# Get the position where the error is
CHARPOS=${rest%%:*};
 
# Put the offending line in a file to be analyzed
`sed $LINEPOS'q;d' $FOLDER/$FILE > $ANALYZEFILE`
 
# Analyze the file
`/usr/bin/php -f $FOLDER/$ANALYZESCRIPT analyzefile="$ANALYZEFILE" charpos="$CHARPOS" >> $FOLDER/$LOGFILE`
}
fi
}
fi
 
# Add a linebreak
echo "" >> $FOLDER/$LOGFILE

Here is the PHP script:

/* Get the variables */
list(,$analyzefile) = explode("=", $argv[1]);
list(,$charpos) = explode("=", $argv[2]);
list(,$all) = explode("=", $argv[3]);
 
/* Read the file to be analyzed */
$content = file_get_contents($analyzefile);
 
if (!$all) {
/* Output the offending character */
echo "Position: ".$charpos." ";
echo "Character: ".substr($content, $charpos, 1);
echo " ";
echo "Ordinal: ".ord(substr($content, $charpos, 1));
echo "\n";
}
else {
$counter = 1;
while (strlen($content) > 0) {
echo "Position: ".$counter." ";
$counter++;
echo "Character: ".substr($content, 0, 1);
echo " ";
echo "Ordinal: ".ord(substr($content, 0, 1));
echo "\n";
$content = substr($content, 1, strlen($content));
}
}
?>

Recommended is to save the bash script to a file called checkfile.sh and the PHP script to a file named analyze.php.

Using the scripts

Save both files in the same location and copy the offending XML file to the location to. The bash script has a few settings:

# Set some variables
FOLDER='/home/mysite/jobs/'
FILE='myfile.xml'
LOGFILE='filecheck.log'
ANALYZEFILE='analyze.log'
ANALYZESCRIPT='analyze.php'

Adjust the variables as needed. The logfiles will be created if they do not exist.

FOLDER = location where the scripts and XML file are
FILE = the name of the XML file to check
LOGFILE = file where the result is saved
ANALYZEFILE = temporary file used to store the offending line
ANALYZESCRIPT = PHP script that analyzes the ANALYZEFILE

To run the script execute the following command:

sh +x checkfile.sh

The result can be found in the file called filecheck.log. This file can be viewed with the following command:

vi filecheck.log

The output can look like this:

XML not valid
/home/mysite/jobs/myfile.xml:7:291: not well-formed (invalid token)
Position: 291 Character:  Ordinal: 160

Now it is easy to find this character in the offending XML file by opening the XML file like this:

vi myfile.xml

This opens the VI editor and show the myfile.xml file. Jump to the problem line 7 with this command:

:7

Press enter after that. Now it is time to look for the strange character that was shown in the filecheck.log.

Alternatively the analyze.log file can be opened which contains only the single line with the offending character, like this:

vi analyze.log

Now it might be possible to remove the problem character here from the XML file, it is better to remove it from the source (database/VirtueMart/spreadsheet), otherwise a newly created XML file will still contain this problem. The advantage now is, you know where to look for the file.