Friday, July 27, 2012

Varnish Cache Purge, Ban and Ban Lurker

Let's walk through some basics of Varnish before getting into purge and ban.

Varnish?
From the docs: Varnish is a web application accelerator.
Varnish can cache and serve all your static and rendered content [CSS, JS, images, rendered PHP pages, HTML].
It reduces the load on your web servers, even under high traffic.
It can even act as a load balancer [given a proper director configuration].

Varnish uses VCL [Varnish Configuration Language] to override the defaults and tweak Varnish to your use case.

Varnish caches content [cache objects] against a key.
By default the key is Hash(host name + request URL).
We can override this by editing the vcl_hash subroutine in the VCL file.
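For instance, a minimal vcl_hash sketch [Varnish 3.x syntax, which is what this post targets; the language-cookie part is just an assumption to show how extra data can go into the key]:

sub vcl_hash {
    # default behaviour: hash the URL plus the host [or server IP if no Host header]
    hash_data(req.url);
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }
    # hypothetical extra: vary the cache key on a "lang" cookie
    if (req.http.Cookie ~ "lang=") {
        hash_data(regsub(req.http.Cookie, ".*lang=([^;]+).*", "\1"));
    }
    return (hash);
}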

How long do cache objects live in Varnish?
In Varnish every cache object is stored with a TTL value.
Every object is automatically removed from the cache once it expires.
The default TTL can be configured globally by starting varnishd with the -t option.
It can also be overridden per object in VCL via beresp.ttl.
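A minimal sketch of overriding the TTL [again Varnish 3.x vcl_fetch; the URL pattern and the 7-day value are arbitrary assumptions]:

sub vcl_fetch {
    # give static assets a longer lifetime than the default TTL
    if (req.url ~ "\.(png|jpg|gif|css|js)$") {
        set beresp.ttl = 7d;
    }
    return (deliver);
}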

What if I have to manually invalidate a cache object?
That is where purge and ban come to the rescue :)

Purge?
Invalidates [removes] the specified cache object actively.
Method?
curl -X PURGE <url>
Means?
Hit Varnish with the request method set to PURGE. You can use any equivalent of curl.
Can anybody purge my content?
Use an acl purge {} directive to whitelist the IPs/IP ranges from which purge requests may be sent.
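Putting that together, a minimal sketch of the usual Varnish 3.x purge handling [the ACL entries are placeholders for your own network]:

acl purge {
    "localhost";
    "192.168.1.0"/24;   # assumption: your admin/app network
}

sub vcl_recv {
    if (req.request == "PURGE") {
        if (!client.ip ~ purge) {
            error 405 "Not allowed.";
        }
        return (lookup);
    }
}

sub vcl_hit {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
}

sub vcl_miss {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
}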

BAN?
Invalidates cache objects passively. Supports regex.
Consider ban as a filter over already available cache objects.
Method?
ban req.http.host == "example.com" && req.url ~ "\.png$"
Means?
Filter out all .png objects from example.com.
The above is the varnishadm CLI syntax; inside VCL [say, in vcl_recv] the equivalent is a call to the ban() function, as in the sketch below.
The authentication mechanism is the same as for purge [an ACL].
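A minimal sketch of accepting bans over HTTP [Varnish 3.x; the "BAN" method name and reusing the purge ACL are conventions chosen here, not anything Varnish mandates]:

sub vcl_recv {
    if (req.request == "BAN") {
        if (!client.ip ~ purge) {
            error 405 "Not allowed.";
        }
        # ban everything on this host whose URL matches the request URL as a regex
        ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url);
        error 200 "Ban added.";
    }
}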

Purge vs ban: how do they differ?
Purge:
Invalidates the cached object actively [the object's TTL is set to 0 and it is removed the moment the purge request arrives].
Ban :
A ban is a filter maintained by Varnish, not a one-shot command.
It is always applied before delivering an object from the cache.
There may be multiple bans active in the same Varnish instance.
A ban applies only to objects that were already in the cache at the time it was created.
It will never prevent new objects from being cached or delivered.
Too many bans in the list will consume too much CPU.
Long-lived cache objects [think near-infinite TTL] that get no hits remain untouched by bans and keep consuming memory.

Why does it consume CPU & memory?
Before every request is served, the object may need to be matched against multiple entries in the ban list.
Matching here means a regex match, so it consumes CPU.
On heavy-traffic systems, the frequency of these ban-filter checks can become a concern.
A ban is purely a filter.
It takes care of removing objects that are actively getting hits and that match the ban list.
But it does not take care of idle cache objects with high TTL values, even if they match a ban.
Hence the memory they consume is never released until their TTL expires, although we have already invalidated them.

How to overcome this?
Use Ban Lurker.

What problem does this solve?
1. Banned objects can be discarded in the background.
2. The size of the ban list can be reduced.
The ban lurker is a Varnish thread that actively walks the cache and invalidates objects against the ban list.
It is a feature that is off by default [enable it with: param.set ban_lurker_sleep 0.1].
Read more about ban lurker here
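A quick sketch of how this is typically used in Varnish 3.x [the x-url/x-host header names are just a convention]: the lurker can only evaluate bans that reference obj.* fields, so request data is usually copied onto the object at fetch time.

sub vcl_fetch {
    # store URL and host on the object so the ban lurker can match on them later
    set beresp.http.x-url  = req.url;
    set beresp.http.x-host = req.http.host;
}

Then, from varnishadm:

param.set ban_lurker_sleep 0.1
# a lurker-friendly ban: tests only obj.* fields
ban obj.http.x-host == "example.com" && obj.http.x-url ~ "\.png$"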

Final Point:
Purge will not refresh the invalidated object from the backend; that happens only on the next cache miss.
In case you want to force a cache miss and refresh the content from the backend, you need to set
req.hash_always_miss to true.
In that case Varnish will miss the current object in the cache, thus forcing a fetch from the backend.
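A minimal sketch [Varnish 3.x; the "REFRESH" method name and reusing the purge ACL are just assumptions for illustration]:

sub vcl_recv {
    if (req.request == "REFRESH") {
        if (!client.ip ~ purge) {
            error 405 "Not allowed.";
        }
        # treat it as a normal GET, but skip the cached copy and fetch fresh content
        set req.request = "GET";
        set req.hash_always_miss = true;
    }
}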

Wednesday, July 11, 2012

npm - package manager for node, and package.json: an overview

npm: node package manager [not an acronym, officially :P]

Until I got to know about this guy, shipping code across servers used to be a tough task.
He is the one to look for in case you develop an application in node and host it across servers.
package.json is his weapon ;)
In this article I'm just planning to touch on npm basics and the use of package.json. Fair warning: it is a bit tl;dr ;)

What is npm?
From here : npm is a package manager for node. You can use it to install and publish your node programs. It manages dependencies and does other cool stuff.
Basically, node uses a CommonJS-style module system.
Every module is an independent piece of JavaScript code which can be plugged in and out of the core of your application.
Modules can be custom-built or built for a generic purpose, like redis, mysql, async, log4js.
It is always good to check whether any pre-built modules are available for our need before we start building our own.
npm does it for you like a charm :)
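A tiny sketch of the CommonJS style [file names are hypothetical]:

// greeter.js - exports a single function
module.exports = function (name) {
    return 'Hello ' + name;
};

// app.js - pulls the module in via require()
var greet = require('./greeter');
console.log(greet('npm'));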

How do I install npm?
npm by default is shipped along with node.
So, zero step installation.
sudo apt-get install node
npm help
To search for a package, just issue the following command:
Ex:
#npm search keywords
npm search redis 
Locate yours and install it via
npm install package-name
Where will my installed packages go?
npm installation can be done in two modes: local [default] or global.
Local:
npm install redis would follow
if(cwd == node_modules)
  install in ./redis directory
else
  install in ./node_modules/redis directory
Global:
npm install -g redis installs into {prefix}/lib/node_modules [prefix is typically /usr/local]
So, that is it?
Wait, we still have our main picture :) npm help or npm help <action> will help a lot.

package.json => It should be pure JSON, not a JavaScript object literal.
Actually package.json has many, many options.
I suggest keeping npm help json or this as a reference.
Check out node_redis, async

Let us consider async's package.json for example
{ "name": "async"
, "description": "Higher-order functions and common patterns for asynchronous code"
, "main": "./index"
, "author": "Caolan McMahon"
, "version": "0.1.22"
, "repository" :
  { "type" : "git"
  , "url" : "http://github.com/caolan/async.git"
  }
, "bugs" : { "url" : "http://github.com/caolan/async/issues" }
, "licenses" :
  [ { "type" : "MIT"
    , "url" : "http://github.com/caolan/async/raw/master/LICENSE"
    }
  ]
, "devDependencies":
  { "nodeunit": ">0.0.0"
  , "uglify-js": "1.2.x"
  , "nodelint": ">0.0.0"
  }
}

Try
npm search async

NAME(name)            DESCRIPTION(description)                                          AUTHOR(author)    DATE              KEYWORDS(keywords)
async                 Higher-order functions and common patterns for asynchronous code =caolan            2012-07-03 12:17

Some important candidates I use:
"name" => unique; represents your module in the npm global repo
"devDependencies" => installed only if "npm install --dev" is done
"repository" => where to look for the source code of your module, in case of an npm-published module
"version" => very important param; should be in x.y.z format; used in the hash that locates your module in the global node repo
"dependencies" => the modules your module depends on
"scripts" => "start" => what should happen when you run "npm start" in your application folder
          => "test" => what should happen when you run "npm test" in your application folder
[Check Acquiring Fame]
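For example, a minimal package.json for a hypothetical app [names and version ranges are just placeholders] pulling these fields together, in the same comma-first style as async's:

{ "name": "my-app"
, "version": "0.0.1"
, "dependencies":
  { "redis": "~0.7.0"
  , "async": "0.1.x"
  }
, "devDependencies":
  { "nodeunit": ">0.0.0" }
, "scripts":
  { "start": "node index.js"
  , "test": "nodeunit test/" }
}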

Why dependencies?
Basically, any module we use in an application can have dependencies of its own.
For example: node_redis has a hiredis dependency.

It would be difficult for a programmer to resolve all those dependencies manually.
Hence, to make our life easier, on
npm install redis
npm picks up the internal dependencies of the module from its package.json and resolves them.
This process continues recursively until the dependency tree is satisfied.
Check out this SO post :)

Why scripts => start/test?
We follow different procedures for deploying different services.
For Ex:
nohup node index.js &
forever start index.js
node index.js
Hence it is hard to remember which command matches which service.
So, specifying the start-up script in your package.json makes your life as easy as this:
npm start
The same applies to test.
Hence the command to start a node application stays the same across your environments.

Why version?
npm indexes your module based on a hash of (name + version) in order to resolve version-based dependencies.
For Ex:
In the dependencies section I can specify:
"*" => anything is ok for me
">0.6.7" => anything greater than 0.6.7 is ok for me
"~0.6.0" => anything >= 0.6.0 and < 0.7.0 is ok for me
"0.6.7" => I need exactly 0.6.7
So, to handle all such dependencies, npm indexes the version along with the name of the module.

How do we build our projects?
Basically, we rely heavily on the dependencies attribute.
The local node framework we designed for our system allows developers to work independently on their modules.
The modules people work on are hosted independently on git and are just added as dependencies in the package.json of the whole application [see the sketch below].
For deployment we just push the package.json to our servers and run
npm install && npm start :)
All our developers' custom-built modules [from git] and their internal dependencies [from the npm registry] are resolved recursively by npm.
So, we hardly push any code to the live servers; npm takes care of building the whole application in no time :)
Uploading code with pre-resolved dependencies means shipping a hell of a lot of code and binaries.
Sometimes modules have dependencies on the compile environment.
Also, our modules internally depend on different versions of the same module, so we did not want to go with a global installation.
Every local module gets its dependencies resolved at its own level, hence no need to worry about version clashes [we optimize our package.json a bit though].
So, we left it to npm + package.json to do the task :P They are doing a really great job :)
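A sketch of what that package.json-driven deployment can look like [repo URL and module names are hypothetical; npm can resolve git URLs listed as dependency values]:

{ "name": "our-app"
, "version": "1.0.0"
, "dependencies":
  { "our-auth-module": "git://github.com/our-org/our-auth-module.git"
  , "redis": "~0.7.0"
  }
, "scripts": { "start": "node index.js" }
}

On the server, the whole build then boils down to:

npm install && npm start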