Tiny transfers
27 Aug 2017
It’s hard to pin down why JSON is so popular for data transfer. But one thing’s for sure: it is not the most efficient format for the job.
The nested nature of JSON makes it prone to redundant storage, and transferring redundant JSON files only puts strain on both the server’s and the client’s bandwidth.
It is always a good idea to trim down your JSON data before sending it to the other side.
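As a quick illustration, here is a made-up fragment shaped like the doodle data we will work with below (it is not an actual API response):
[
  {
    "title": "Some doodle",
    "url": "//www.google.com/logos/doodles/2016/some-doodle.gif",
    "countries": ["India", "United Kingdom", "United States"]
  },
  {
    "title": "Another doodle",
    "url": "//www.google.com/logos/doodles/2016/another-doodle.gif",
    "countries": ["India", "United States"]
  }
]
The country names and the URL prefix are repeated verbatim in every entry; that repetition is exactly what we will squeeze out in the rest of this post.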
Working dataset
The first thing we need is a rich dataset. Google’s doodle database is a perfect fit. Use the two scripts below to download all doodles from the year 2016 …
# fetch.sh
# Download the doodle JSON for every month of 2016.
DATA_PATH="raw"
mkdir -p "$DATA_PATH"

for (( MONTH = 1; MONTH <= 12; MONTH++ )); do
  ZERO_MONTH=$(printf %02d "$MONTH")
  URL="https://www.google.com/doodles/json/2016/$ZERO_MONTH?full=1"
  FILEPATH="$DATA_PATH/2016-$ZERO_MONTH.json"

  if [[ -f "$FILEPATH" ]]; then
    # A failed or empty month comes back as a 2-byte file (an empty "[]"); re-download those.
    FILESIZE=$(wc -c < "$FILEPATH")
    NULL_FILESIZE=2
    if [[ $FILESIZE -eq $NULL_FILESIZE ]]; then
      wget -c "$URL" -O "$FILEPATH"
    else
      echo "SKIP: $FILEPATH"
    fi
  else
    wget -c "$URL" -O "$FILEPATH"
  fi
done
… and consolidate them into a single file.
// aggregate.js
const fs = require('fs');
const path = require('path');

const rawDirPath = path.join(__dirname, 'raw');
const allDoodlesPath = path.join(__dirname, 'doodles.all.json');

// Read every monthly file and concatenate the doodle arrays.
const allDoodles = fs
  .readdirSync(rawDirPath)
  .reduce((_allDoodles, fileName) => {
    const filePath = path.join(rawDirPath, fileName);
    const fileDoodles = JSON.parse(fs.readFileSync(filePath, 'utf8'));
    return _allDoodles.concat(fileDoodles);
  }, []);

fs.writeFileSync(allDoodlesPath, JSON.stringify(allDoodles));
If everything goes well, you’ll end up with the following files. The exact sizes might differ (more on this later).
$ du -sh raw/*
724K raw/2016-01.json
592K raw/2016-02.json
1.2M raw/2016-03.json
644K raw/2016-04.json
660K raw/2016-05.json
872K raw/2016-06.json
488K raw/2016-07.json
1.9M raw/2016-08.json
1.1M raw/2016-09.json
512K raw/2016-10.json
644K raw/2016-11.json
836K raw/2016-12.json
$ du -sh doodles.all.json
8.1M doodles.all.json
Understanding the data
Before we can begin the cleaning process, we need to understand the structure of our data.
// structure.js
const fs = require('fs');
const path = require('path');

const allDoodlesPath = path.join(__dirname, 'doodles.all.json');
const allDoodles = require(allDoodlesPath);

// Record the constructor of every key we come across,
// skipping nulls (next_doodle / prev_doodle can be null).
const keys = {};
allDoodles.forEach(doodle => {
  Object.keys(doodle).forEach(k => {
    if (doodle[k] !== null) {
      keys[k] = doodle[k].constructor;
    }
  });
});

console.log(keys);
This gives us the following list.
alternate_url : String,
blog_text : String,
call_to_action_image_url : String,
collection_id : Number,
countries : Array,
doodle_args : Array,
doodle_type : String,
height : Number,
hires_height : Number,
hires_url : String,
hires_width : Number,
history_doodles : Array,
id : Number,
is_animated_gif : Boolean,
is_dynamic : Boolean,
is_global : Boolean,
is_highlighted : Boolean,
name : String,
next_doodle : Object,
persistent_id : Number,
prev_doodle : Object,
query : String,
related_doodles : Array,
run_date_array : Array,
share_text : String,
standalone_html : String,
tags : Array,
title : String,
translations : Object,
url : String,
width : Number,
youtube_id : String,
We can work on this initial list of keys to work out more semantic types - URLs, IDs, self references, etc.
history_doodles : Array<Doodle>,
next_doodle : Doodle,
prev_doodle : Doodle,
related_doodles : Array<Doodle>,
alternate_url : URL,
call_to_action_image_url : URL,
hires_url : URL,
standalone_html : URL,
url : URL,
Next, we will extract data instances where duplicates are a possibility. This happens with collection attributes (like arrays) over a closed set (like countries of the world).
countries : Array<Country>,
tags : Array<Tag>,
Normalisation
Now we can perform the actual normalisation. There are 3 steps:
- Extract unique instances for all Models:
- Doodle
- Country
- Tag
- Replace redundant instances with unique IDs
- Save all model instances as separate JSON files
// normalise.js
const fs = require('fs');
const path = require('path');

const allDoodles = require(path.join(__dirname, 'doodles.all.json'));

// generateDoodleHash(doodle) is assumed to return a stable, unique
// hash for a doodle (see the full gist linked at the end).

// step 1: collect unique doodles, countries and tags
const uniqueDoodles = {};
const uniqueCountriesSet = new Set();
const uniqueTagsSet = new Set();

allDoodles.forEach(doodle => {
  doodle._id = generateDoodleHash(doodle);
  uniqueDoodles[doodle._id] = doodle;
  doodle.countries.forEach(country => {
    country = country.trim().toLowerCase();
    uniqueCountriesSet.add(country);
  });
  doodle.tags.forEach(tag => {
    tag = tag.trim().toLowerCase();
    uniqueTagsSet.add(tag);
  });
});

// step 2: replace nested doodles with their hashes, and
// countries/tags with indices into the unique lists
const uniqueCountries = Array.from(uniqueCountriesSet);
const uniqueTags = Array.from(uniqueTagsSet);

allDoodles.forEach(doodle => {
  if (doodle.next_doodle !== null) {
    doodle.next_doodle = generateDoodleHash(doodle.next_doodle);
  }
  if (doodle.prev_doodle !== null) {
    doodle.prev_doodle = generateDoodleHash(doodle.prev_doodle);
  }
  doodle.related_doodles = doodle.related_doodles.map(relatedDoodle =>
    generateDoodleHash(relatedDoodle)
  );
  doodle.history_doodles = doodle.history_doodles.map(historyDoodle =>
    generateDoodleHash(historyDoodle)
  );
  doodle.countries = doodle.countries.map(country =>
    uniqueCountries.indexOf(country.trim().toLowerCase())
  );
  doodle.tags = doodle.tags.map(tag =>
    uniqueTags.indexOf(tag.trim().toLowerCase())
  );
});

// step 3: save each model as its own JSON file
function writeJSON(filepath, json, pretty = false) {
  fs.writeFileSync(filepath, JSON.stringify(json, null, pretty ? 2 : 0));
}

writeJSON('doodles.all.norm.json', allDoodles);
writeJSON('countries.json', uniqueCountries);
writeJSON('tags.json', uniqueTags);
We can also apply the same process to other properties, such as the different types of URLs. Find all the possible origins and common pathnames, and store them as a separate JSON file. This means we can further reduce file sizes by replacing long, repeating URL prefixes with short indices.
All doodles can have any of the following URL types:
alternate_url
call_to_action_image_url
hires_url
standalone_html
url
And all of those URLs have the following common starting paths:
https://lh3.googleusercontent.com
https://www.google.com/logos/doodles
https://www.google.com/logos
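These prefixes don’t have to be eyeballed. Here is a rough sketch of how one might surface them by counting how often each URL prefix appears; the file name find-prefixes.js and the exact regular expression are illustrative, not part of the original pipeline.
// find-prefixes.js
const path = require('path');

const allDoodles = require(path.join(__dirname, 'doodles.all.json'));

const linkTypes = [
  'alternate_url',
  'call_to_action_image_url',
  'hires_url',
  'standalone_html',
  'url',
];

// Count how often each prefix appears: the origin for absolute and
// protocol-relative URLs, or the first path segment for relative ones.
const prefixCounts = {};
allDoodles.forEach(doodle => {
  linkTypes.forEach(linkType => {
    const link = doodle[linkType];
    if (typeof link !== 'string' || link === '') return;
    const match = link.match(/^(?:https?:)?\/\/[^/]+|^\/[^/]+/);
    if (match) {
      prefixCounts[match[0]] = (prefixCounts[match[0]] || 0) + 1;
    }
  });
});

// The most frequent prefixes are the best candidates for the prefix table.
console.log(
  Object.keys(prefixCounts)
    .sort((a, b) => prefixCounts[b] - prefixCounts[a])
    .map(prefix => `${prefixCounts[prefix]}\t${prefix}`)
    .join('\n')
);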
// normalise.js
// ...
const linkTypes = [
  'alternate_url',
  'call_to_action_image_url',
  'hires_url',
  'standalone_html',
  'url',
];
const urlPrefixes = [
  'lh3.googleusercontent.com',
  'www.google.com/logos',
  'www.google.com/logos/doodles',
];

allDoodles.forEach(doodle => {
  linkTypes.forEach(linkType => {
    const link = doodle[linkType];
    // Check the most specific prefix first, otherwise
    // '//www.google.com/logos' would also swallow the doodles URLs.
    switch (true) {
      case link.startsWith('https://lh3.googleusercontent.com'):
        doodle[linkType] = link.replace('https://lh3.googleusercontent.com', 0);
        break;
      case link.startsWith('//www.google.com/logos/doodles'):
        doodle[linkType] = link.replace('//www.google.com/logos/doodles', 2);
        break;
      case link.startsWith('//www.google.com/logos'):
        doodle[linkType] = link.replace('//www.google.com/logos', 1);
        break;
      case link.startsWith('/logos'):
        doodle[linkType] = link.replace('/logos', 1);
        break;
    }
  });
});

writeJSON('doodles.all.norm.json', allDoodles);
writeJSON('urls.json', urlPrefixes);
Measuring performance
$ du -sh *.json
4.0K countries.json
8.1M doodles.all.json
1.9M doodles.all.norm.json
12K tags.json
4.0K urls.json
The raw data is 8.1MB. After normalising, we get 1.9MB + 4.0KB + 12KB + 4.0KB, which is still roughly 1.9MB.
We have reduced our transfer size by over 4 times.
Cleaning up the gunk
There’s still more work we can do. Right now, we are transferring all 32 key-value pairs, most of which we might not even need.
Once we decide what attributes we must keep, we can create a schema for our data.
Schema: a representation of a plan or theory in the form of an outline or model.
// normalise.js
// ...
const schema = [
/*
'alternate_url',
'blog_text',
'call_to_action_image_url',
'collection_id',
'countries',
'doodle_args',
'doodle_type',
'height',
'hires_height',
'hires_width',
'history_doodles',
'id',
'is_animated_gif',
'is_dynamic',
'is_global',
'is_highlighted',
'name',
'persistent_id',
'query',
'related_doodles',
'share_text',
'standalone_html',
'tags',
'translations',
'width',
'youtube_id',
*/
'hires_url',
'next_doodle',
'prev_doodle',
'run_date_array',
'title',
'url',
'_id', // unique ID for each doodle
];
const cleanDoodles = allDoodles.map(doodle => schema.map(key => doodle[key]));
writeJSON('doodles.clean.json', cleanDoodles);
writeJSON('schema.json', schema);
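To make the new shape concrete: each entry in doodles.clean.json is now a plain array whose positions follow the schema. The values below (including the hash format and the date layout) are made up purely for illustration:
// one cleaned doodle, annotated (illustrative values)
const cleanDoodle = [
  '0/abc123',               // hires_url, prefix replaced by index 0
  '1a2b3c',                 // next_doodle (hash)
  '4d5e6f',                 // prev_doodle (hash)
  [2016, 3, 14],            // run_date_array
  'Some doodle',            // title
  '1/2016/some-doodle.gif', // url, prefix replaced by index 1
  '7g8h9i',                 // _id
];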
Measuring performance
Needless to say, removing such a large portion of our data has a strong impact on file size.
Going from 1.9MB to 116KB makes it more than 16 times smaller.
$ du -sh doodles.clean.json
116K doodles.clean.json
Packaging and Compression
Packaging
The final part is packaging and compression. At the end of normalisation, we end up with doodles.clean.json and the following metadata files:
schema.json
countries.json
tags.json
urls.json
Fetching these 4 files requires 4 separate round trips - from the client to the server and back again. On a slow or intermittent connection, the chances of any of those requests failing are high, and we cannot start work on the client side until we have all the parts.
We can bypass this small issue by writing everything to a single file:
// normalise.js
// ...
writeJSON('meta.json', {
schema,
countries: uniqueCountries,
tags: uniqueTags,
urls: urlPrefixes,
});
// 12K
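For completeness, here is a rough sketch of how the receiving side could put a doodle back together once it has meta.json and doodles.clean.json; the file name expand.js and the details of the URL expansion are illustrative, not part of the original pipeline.
// expand.js
const path = require('path');

const meta = require(path.join(__dirname, 'meta.json'));
const cleanDoodles = require(path.join(__dirname, 'doodles.clean.json'));

const doodles = cleanDoodles.map(values => {
  // Zip the schema keys back together with each row of values.
  const doodle = {};
  meta.schema.forEach((key, i) => {
    doodle[key] = values[i];
  });

  // Expand the numeric URL prefix back into a protocol-relative URL.
  ['hires_url', 'url'].forEach(linkType => {
    const link = doodle[linkType];
    if (typeof link === 'string' && /^\d\//.test(link)) {
      const prefix = meta.urls[Number(link[0])];
      doodle[linkType] = '//' + prefix + link.slice(1);
    }
  });

  return doodle;
});

console.log(doodles[0]);
Because the schema travels with the data, the client never has to hard-code the key order.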
Compression
We can now compress our final JSON file with bzip2, or use an algorithm with a higher compression ratio, like LZMA or brotli.
$ du -sh doodles.clean.json
116K doodles.clean.json
$ bzip2 -kf doodles.clean.json
$ du -sh doodles.clean.json.bz2
24K doodles.clean.json.bz2
Measuring performance
Packaging reduces the number of requests to the server and compression reduces our final size by another 4.8 times.
To put things in perspective, we started out with one huge 8.1MB file and ended up with 2 files that add up to roughly 40KB.
This is a 200 times reduction.
The end
Here, we worked with 386 doodles released in the year 2016 alone, and although this gives us 200 times better transfers, going from 8MB to 40KB isn’t all that helpful.
But Google has been releasing doodles since 1998. There are 3245 doodles at the time of writing this post, and they add up to 60MB.
If we apply the same pipeline to the entire dataset, we end up with 52KB + 340KB of metadata and compressed data.
This is still 150 times better and, more importantly, under 1MB.
This is HUGE!!
The rationale behind this is pretty straightforward: as the data size increases, so does redundancy. Reducing that redundancy through normalisation and filtering, and finally applying compression, will always bring down the size of redundant data. And higher redundancy means better gains.
The entire code used here is available in this gist.