Conference in NOLA
I decided to post this first on ASUG.com, as the SCN site is undergoing, let's say, growing. Departing from my normal practice of writing up notes about all the sessions I attended, the people I met, and miscellany, I'm going to focus on the 2-part presentation by Dr. Claudia Imhoff today at the BU & Analytics conference.
In a town known for excess (different from Lost Wages), starting a presentation with the challenge of a "drinking game," where we must take a shot every time the phrase "big data" is uttered, seems like risky business. Claudia pulled it off, except for those who came in late with no clue. She went through a short tech evolution chart, and I played along, naming not only an acoustic coupler but also the acronym expansion for modem. Her slide showing the difference between papal installations over a recent 8-year period clearly illustrated the rapid pace of change.
Next (in my notes), she talked about the cost ratio between collecting and analyzing data; the former is rapidly getting cheaper while the latter is expensive. A term she wants us to toss out is "unstructured data." In essence, everything that exists has some structure, so a better definition would be "multi-structured."
Another turn of phrase I liked was not saying "data lake" or "data swamp" or worse, but to use "data refinery" in the sense that huge amounts of crude information might be captured somewhere, with the intent of doing something with it. Data lake implies it just sits there.
We had a short Q&A about the phrase "unrelated data," with weather and traffic as the example. My comment was that there may be correlation and/or causation: bad weather will slow down traffic. Claudia amended her statement to indicate the data sources are unrelated, and the causation is not there. We later got into a debate about "one source of the truth." Maybe you had to be there.
My primary takeaway from her talk was to mentally separate business intelligence platforms into 3 categories. The first is what we would now call an enterprise data warehouse; I later asked and was told BW and Business Objects fit there. The second is an investigative computing platform, in the sense of a sandbox that can be played with to find general trends. The third is operational.
The slide shot at the end shows how this plays out at eBay, which tracks every customer click (as does every other online site in one way or another). Their EDW is 14PB, their investigative layer has 36PB, and the Hadoop side has 50PB.
It's not a simple environment. They have built out systems only to need to replace/repair them when the need crystallized.
The best analogy she provided for these separate analytic areas was that the first is like a dress, the second like a pattern, and the third like the cloth.
Two other minor takeaways:
- There is a trend toward ETL containers, which I liken to portable source code to perform a specific business function (a transport).
- There is a case study for the UK mobile company O2 where they tried to target customers on the Channel Tunnel train before they switched off their roaming en route to France.
I have not found a link to that study; if someone finds it please advise.