I have been using Google BigQuery for over 18 months, and it is remarkably simple to use for big data workloads. In this post I will show how to load data from a file into a BigQuery table via the GUI tool provided by Google. In the next post I will show how to design and develop an ETL process for Google BigQuery via their API. I have designed this post as a slideshow: click on an image in any step, read the caption below the image, and use the left and right buttons to see all the screenshots for that step. Before I begin, I will quote a funny answer I gave a coworker when he asked me why I chose BigQuery; you need to be a big data nerd to appreciate the sarcasm.
Sign up for a Google Cloud account using your Gmail account; if you just want to try it out, sign up for a trial account. You need to add your credit card even for a trial account. Google provides $300 of initial credit for trial accounts, which is enough to try out most of the services in Google Cloud.
After signing up for Google Cloud, the first step is to create a bucket (Google's equivalent of a folder) in Google Cloud Storage. This is where users will upload their files for processing. The bucket name must be unique across all of Google Cloud Storage, because the bucket and the files inside it can be shared with others via an HTTP URL. Buckets cannot be nested. Follow the instructions in the screenshots to create your first bucket.
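If you prefer to script this step instead of using the GUI, here is a minimal sketch using the google-cloud-storage Python client (`pip install google-cloud-storage`). The project ID and bucket name below are placeholders; remember that your bucket name must be globally unique.

```python
# Minimal sketch: create a Cloud Storage bucket with the Python client.
# "my-project-id" and "my-unique-bucket-name" are placeholders.
from google.cloud import storage

client = storage.Client(project="my-project-id")
# Bucket names are global across Cloud Storage, so pick something unique.
bucket = client.create_bucket("my-unique-bucket-name", location="US")
print(f"Created bucket {bucket.name} in {bucket.location}")
```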
This was my biggest challenge: where do I find a large amount of data for testing? Luckily, Wikipedia shares its data; you can download dumps from http://dumps.wikimedia.org. I have prepared two test files to help you follow along: a small file with 30 records and a big file with 1 million records. I used a page view log file for my testing; a typical page view file has over 5 million records in it.
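If you want to prepare your own test files from a raw dump, the sketch below trims a gzipped page view dump down to a small CSV suitable for a first load. It assumes the dump's space-separated layout of four fields (domain code, page title, view count, response bytes); the file names and the record limit are placeholders you can adjust.

```python
# Hedged sketch: convert the first N records of a gzipped Wikipedia
# page view dump into a CSV test file. Assumes four space-separated
# fields per line; "pageviews-sample.gz" is a placeholder file name.
import csv
import gzip

SOURCE = "pageviews-sample.gz"  # hypothetical local dump file
TARGET = "pageviews_small.csv"
LIMIT = 30  # 30 records, matching the small test file

with gzip.open(SOURCE, "rt", encoding="utf-8", errors="replace") as src, \
        open(TARGET, "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["domain", "page_title", "views", "bytes"])
    for i, line in enumerate(src):
        if i >= LIMIT:
            break
        parts = line.rstrip("\n").split(" ")
        if len(parts) == 4:  # skip malformed lines
            writer.writerow(parts)
```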
Now do the same kind of processing for the big file with 1 million records. Since we expect some rows to fail during import, we allow up to 100,000 bad rows in the import job. Once the data is imported, run a simple SQL-like query to get the result you want. Always check the validator before running a query: it shows the estimated data volume the query will scan. The more data a query scans, the higher the billing, so select only the columns you need.
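For readers who would rather do this step from code than the GUI, here is a hedged sketch using the google-cloud-bigquery Python client (`pip install google-cloud-bigquery`). The project, bucket, dataset, and table names are placeholders; it loads the big file with the same 100,000 bad-record allowance, then dry-runs a query so you can see the estimated bytes scanned before paying for it.

```python
# Sketch: load a CSV from Cloud Storage with a bad-record allowance,
# then dry-run a query to estimate the bytes it would scan.
# Project, bucket, dataset, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")

load_cfg = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    max_bad_records=100_000,  # tolerate up to 100,000 failed rows
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-unique-bucket-name/pageviews_big.csv",  # hypothetical path
    "my_dataset.pageviews",
    job_config=load_cfg,
)
load_job.result()  # block until the load completes

# Dry run: BigQuery bills by bytes scanned, so estimate cost first.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = (
    "SELECT page_title, views FROM my_dataset.pageviews "
    "ORDER BY views DESC LIMIT 10"
)
estimate = client.query(query, job_config=dry_cfg)
print(f"Query would scan {estimate.total_bytes_processed:,} bytes")
```

The dry run mirrors the GUI validator: it reports how much data the query would process without actually running it, which is why selecting only the columns you need keeps the bill down.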
Credits and References
- Cost based on bucket types – Pricing details
- Geographic Location of Buckets – Geo locations
- BigQuery SQL Reference – https://cloud.google.com/bigquery/query-reference
- BigQuery Pricing Reference – https://cloud.google.com/bigquery/pricing