This is a brief overview of Google BigQuery targeting programmers. The data collected and stored by the enterprises are now outgrowing their current hardware and software infrastructure. To handle Big Data and get real-time reports from Big Data, Google offers a SAAS solution. It is called BigQuery. In simple terms this is just a DBMS that can handle large volumes of data. Note that I intentionally stated it as DBMS and not RDBMS. This is also the database that Google uses to index internet.

Bime Analytics showcases the power of BigQuery by querying more than 1 billion rows.

To begin with I will explain what BigQuery is not and what it cannot do.

  • BigQuery tables are not transactional tables. They are only reporting tables.
  • You cannot update records in BigQuery table.
  • You cannot delete selected records in BigQuery table.
  • You cannot establish relationship for integrity between tables.

The reason is, unlike traditional RDBMS where the data is stored at row level, the BigQuery database system stores the rows in columnar structure. Because it uses columnar structure it stores only distinct values for that particular columns and maps the record index with column indexes. The storage size of data is minimized when stored in columnar format. During query execution it looks only at the specified columns for filtering, grouping etc and hence the performance is high compared to traditional RDBMS.

Because it uses columnar structure it is very expensive to insert records 1 by 1 as in a transactional database. It is also because of the columnar structure the update operation is expensive and hence it is not allowed. Same is the case with delete operation. To solve the update and delete problem you can follow a work around. The work around is to move the data that you need to a temp table from the original table and then drop the original table and then copy the data from temp table to original table.

Some of the key features of BigQuery are explained below.

  • DataTypes – The biggest advantage in BigQuery is that it supports JSON data types for columns and record types for columns. Yes you can store child table records as a column inside the parent table. This breaks the norm of normalization but it is needed.
  • SQL Support – You can retrieve the data you need using SQL like programming language that is supported by BigQuery. You can execute SQL statements from browser by opening the BigQuery console. As stated earlier the language is not SQL and it does not have all the functions supported by the ANSI SQL or other SQL standard. You can learn the SQL syntax and supported functions from here. You will see some DDL, DML, DCLs missing in this SQL. Some of the highlights of BigQuery SQL are listed below.
    • SQL Supports REGEX based search for string columns.
    • Union tables by specifying table names as wild card characters.
    • Query a field value from a JSON data type column.
  • API Integration – You can connect to BigQuery and execute jobs to load data from files to the table or move data from one table to another by using Google BigQuery API. You can find the complete reference of BigQuery API here. It has client libraries that are available in variety of programming languages like Java, Python, PHP, C#.

I will explain the features of BigQuery and how to import data and export data and compare it against Hadoop in the upcoming articles.

Credits and References

Credits and References