One of the first steps in loading data into BigQuery is extracting it from your local data source and sanitizing it for import. Initially I encountered a lot of errors during the import; you can also see that in my ETL demo post. These errors happened because BigQuery has parsing rules for the files it imports. In this post I will provide some tips on how to clean your file before uploading it to Google Cloud Storage and importing it into BigQuery.
ASCII char 0 – NUL
If you are generating files with BCP on SQL Server, or on Unix or Linux systems, there is a high chance they contain this character. When you extract data from a source system such as a database, try removing these NUL characters. BigQuery treats NUL as an end-of-line character, so it will throw an error when loading files that contain it, reporting that the row containing the character does not have values for all columns. You can get rid of NUL in one of the following ways:
- If a column value in the RDBMS you are using contains this character, use a LEN(column_name) = 0 condition to find it and replace the column value with NULL or an empty string.
- If you want to replace the NUL character manually, you can do it for small files by opening them in an editor like Notepad++ that shows the hex codes of invisible characters, then doing a find and replace of \0 with an empty string.
- For large files you can replace the character using string functions in a script (see the Python sketch after this list).
- In Linux you can use the tr command as shown in this example: tr < inputfile.csv -d '\000' > sanitizedfile.csv
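Here is a minimal Python sketch of the same idea for large files; the file names inputfile.csv and sanitizedfile.csv are just placeholders.

```python
# Strip NUL (0x00) bytes from a file before uploading it for a BigQuery load.
# Reads and writes in binary chunks so large files do not need to fit in memory.
CHUNK_SIZE = 1024 * 1024  # 1 MB

with open("inputfile.csv", "rb") as src, open("sanitizedfile.csv", "wb") as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        dst.write(chunk.replace(b"\x00", b""))
```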
Delimiter in the column value
Sometimes the delimiter used to separate columns also appears inside one of the column values. This causes an error that more than n columns were provided for import. In your source system, either replace the delimiter character inside the values with some other character or choose a delimiter that will not appear in your column values. One common scenario is using , as the delimiter while column values such as addresses contain a , in them. In such cases enclose the column values within double quotes; note that this will increase your file size.
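As an illustration, here is a minimal Python sketch (the column names and values are made up) that uses the csv module to quote values containing the delimiter:

```python
import csv

# Example rows where the address column contains the comma delimiter.
rows = [
    ["1", "Jane Doe", "12 Main St, Springfield"],
    ["2", "John Roe", "34 Oak Ave, Shelbyville"],
]

with open("customers.csv", "w", newline="") as f:
    # QUOTE_MINIMAL wraps only the values that contain the delimiter,
    # a quote character, or a newline in double quotes.
    writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
```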
Improper closing of quotes
If your column values happen to contain quotes, BigQuery can throw an error, because it treats a quote as the start of a quoted column value. If the quotes are balanced, say a column value or a substring within it is enclosed in matching double or single quotes, that is fine. If there is only a single " or ' in the column value, it throws an error. The solution is to enclose that column value within the other kind of quotes: if your column value has a single ', enclose it within double quotes; if it has a single ", enclose it within single quotes.
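Another common approach, sketched below with made-up values, is to let a CSV writer escape an embedded double quote by doubling it and wrapping the field in double quotes, a convention BigQuery's CSV parser also understands. Python's csv module does this by default:

```python
import csv

# A value that contains a lone double quote.
rows = [["3", 'She said "hello', "56 Pine Rd"]]

with open("quoted.csv", "w", newline="") as f:
    # With the default doublequote=True, an embedded " is written as ""
    # and the whole field is wrapped in double quotes.
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)

# The affected field is written as: "She said ""hello"
```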
Unicode Replacement Character
If your export process writes data only in ASCII encoding, you may run into this problem: Unicode characters get replaced in the file with the Unicode replacement character �. This causes an error when BigQuery parses such files. Replace this character with a space, or fix your extraction process so it exports Unicode characters properly.
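A minimal sketch of the replacement approach (the file names are placeholders, and the input file is assumed to be readable as UTF-8):

```python
# Replace the Unicode replacement character (U+FFFD) with a space.
with open("exported.csv", encoding="utf-8") as src, \
     open("cleaned.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace("\ufffd", " "))
```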
Troubleshooting
These are the most frequently encountered problems. BigQuery lets you use most characters as the delimiter and also provides options to specify whether values are quoted, whether jagged rows are allowed, and other useful settings. If for some reason you still get an error while importing, set the number of allowed error rows to 0 and run the import; BigQuery will then show the byte offset at which the error occurred, which will usually be the beginning of the row that caused the failure.
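Here is a minimal sketch of that approach using the google-cloud-bigquery client library; the bucket, file, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",
    skip_leading_rows=1,
    max_bad_records=0,  # fail on the first bad row so the byte offset is reported
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/sanitizedfile.csv",   # placeholder GCS path
    "my-project.my_dataset.my_table",     # placeholder table id
    job_config=job_config,
)
load_job.result()  # raises on a load error; load_job.errors has the details
```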
If you need sample files with a large number of records to test your imports, you can use the Wikipedia analytics files from https://dumps.wikimedia.org/other/analytics/
All the best with these tips.
If you need help or suggestions for your products, you can get in touch with me here or post your question to my social accounts below.