Snowplow Analytics and log streaming into BigQuery for real-time dashboards?


#1

Hi everyone,

I was very excited when I learned about Fastly’s new log streaming into Google BigQuery feature.

I am currently working with a large publisher. They don’t use Fastly yet (unfortunately) but are currently rolling out Snowplow Analytics. Like most tracking solutions, Snowplow collects data by requesting a little tracking pixel whose URL contains a list of key-value parameters, like this:

/i?stm=1504563596539&e=pv&url=...&page=...&tv=js-2.7.2&tna=...&aid=...&p=web&tz=America%2FNew_York&lang=en-US&cs=UTF-8&...

In general, any analytics solution needs a server-side collector component that receives GET or POST requests and passes the data along to the subsequent processes, e.g. Snowplow’s enrichment component and eventually a data sink.

Based on Fastly’s new feature, my idea is to point the tracking requests to Fastly and stream the logs into BigQuery to build real-time dashboards. However, the tutorial doesn’t mention a way to extract the URL’s payload so that not only the default fields but also the custom URL parameters that Snowplow works with are streamed to BigQuery.

What do you think about this idea? Does anyone know if it is possible, maybe using VCL?

Thanks,
Ian


#2

Hi Ian

Yes, it’ll be possible using VCL. From what you’ve said, all that’s needed is to extract the value of each of the parameters you want to store and add them to the log format. The only prerequisite is knowing the parameter names beforehand, and looking at the protocol, that appears to be the case.

You can use VCL like this to achieve it (either via the UI, as a VCL snippet, or using Custom VCL):

# set a header with the value of the 'tv' parameter
set req.http.snowplow-tv = subfield(req.url.qs, "tv", "&");

You can then pick this up in the log by adding it to the format that produces your JSON output. The exact format string differs depending on whether the logging endpoint is configured to use logging format version 1 or 2.
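For example, with a version 2 endpoint, a minimal JSON format string along these lines would include the header (the field names are just illustrative; %{...}V interpolates a VCL expression):

{ "timestamp": "%{begin:%Y-%m-%dT%H:%M:%S%z}t", "url": "%{json.escape(req.url)}V", "tv": "%{json.escape(req.http.snowplow-tv)}V" }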


#3

Hi Justin,

Thank you very much for your quick response, that’s awesome news!

Snowplow requests can contain a large number of URL parameters. Is there a limit on the number of headers?

Since Snowplow’s URL parameter values can contain JSON objects, is there a character limit on a header’s value?

Or is there an overall limit on the total header payload, especially in conjunction with log streaming requests?

Certain limits wouldn’t be a problem, because the idea is to use log streaming only for a few key metrics, e.g. pageviews, sessions and visitors. In addition to the log streaming, we would set things up so that the entire URL payload still reaches our upstream server, the so-called Snowplow Collector. A sketch of what I have in mind follows below.
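For instance, something like this in a VCL snippet should cover those few metrics (parameter names taken from the Snowplow tracker protocol; the header names are just placeholders):

# extract only the few Snowplow parameters needed for the real-time dashboard
set req.http.snowplow-e    = subfield(req.url.qs, "e", "&");    # event type, "pv" = pageview
set req.http.snowplow-duid = subfield(req.url.qs, "duid", "&"); # domain user ID (visitors)
set req.http.snowplow-vid  = subfield(req.url.qs, "vid", "&");  # visit index (sessions)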

Best regards,
Ian


#4

Hi Ian

Yes, there is a limit on the number of headers. We’ve listed them here. The maximum number of headers is 96 (including internal ones), so in practice you have more than 80 to work with. I don’t believe there is a limit on a single header, but there is a 64 KB limit on the total size of all headers, which is the same as the log line limit.


#7

Hi Justin,

Basic log streaming to BigQuery is working great; now I’d like to set up the headers. Because we will have quite a large number of URL parameters to set, I’d like to use a VCL snippet.

When we take a request URL like https://...global.ssl.fastly.net/i?...&duid=d0f6acca-c6cb-41c5-a768-8f296bd69d8b&... we can use [\?|\&]duid\=([^&]+) to get the value of the duid parameter.

I added your code example from above (set req.http.tv = subfield(req.url.qs, "tv", "&");) to a VCL snippet that runs in the hit subroutine (vcl_hit), but unfortunately the value in BigQuery is (null), which might be a -1 from VCL?

I also tried this: set req.http.tv = regsub(req.url, "[\?|\&]tv\=([^&]+)", "\1");

Thanks in advance,
Ian


#8

Hi Ian

Apologies for the delay in getting back to you here.
I just tested the solution end-to-end and it’s working for me.

One difference is that your snippet is in vcl_hit (which only runs on cache hits, not on every request) while mine is in vcl_recv.

If changing it to the above doesn’t fix it for you, I’d recommend opening a support ticket so we can look into the specifics of your case, and if there’s a wider issue that we need to address I’ll swing back here and update this post.
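For reference, here’s a minimal Custom VCL sketch of the setup that worked for me (in a snippet of type recv you’d only include the set lines; the snowplow-* header names are just placeholders):

sub vcl_recv {
#FASTLY recv

  # vcl_recv runs for every request, unlike vcl_hit which only runs on
  # cache hits, so the headers are always populated before logging happens
  set req.http.snowplow-tv   = subfield(req.url.qs, "tv", "&");
  set req.http.snowplow-duid = subfield(req.url.qs, "duid", "&");

  return(lookup);
}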


#9

Hi Justin,

Thank you very much for your response, and sorry for not updating you on this. I had already moved the code to vcl_recv and it is now working perfectly; we love it!

Best regards,
Ian