Open Source Cloud: Blog Post

Handle Big Data with Speed and Efficiency

Here's how two part-time DBAs maintain mobile app ad platform Tapjoy’s massive data needs

The next BriefingsDirect Voice of the Customer big data case study discussion examines how mobile app advertising platform Tapjoy handles fast and massive data -- some two dozen terabytes per day -- with just two part-time database administrators (DBAs).

Examine how Tapjoy’s data-driven business of serving 500 million global mobile users -- or more than 1.5 million ad engagements per day, a data volume of 120 terabytes -- runs with extreme efficiency.

To learn more about how high scale and complexity meet minimal labor for building user and advertiser loyalty, we're joined by David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Mobile advertising has really been a major growth area, perhaps more than any other type of advertising. We hear a lot about advertising waning, but not mobile app advertising. How do Tapjoy and its platform help contribute to the success of what we're seeing in the mobile app ad space?

Abercrombie: The key to Tapjoy’s success is engaging the users and rewarding them for engaging with an ad. Our advertising model is that you engage with an ad and then you typically get some sort of reward: a virtual currency in the game you're playing or some sort of discount.

We actually have the kind of ads that lead users to seek us out to engage with the ads and get their rewards.

Gardner: So this is quite a bit different than a static presented ad. This is something that has a two-way street, maybe multiple directions of information coming and going. Why the analysis? Why is that so important? And why the speed of analysis?

Abercrombie: We have basically three types of customers. We have the app publishers who want to monetize and get money from displaying ads. We have the advertisers who need to get their message out and pay for that. Then, of course, we have the users who want to engage with the ads and get their rewards.

The key to Tapjoy’s success is being able to balance the needs of all of these disparate users. We can’t charge the advertisers too much for their ads, even though the monetizers would like that. It’s a delicate balancing act, and that can only be done through big-data analysis, careful optimization, and careful monitoring of the ad network assets and operation.

Gardner: Before we learn more about the analytics, tell us a bit more about what role Tapjoy plays specifically in what looks like an ecosystem play for placing, evaluating, and monetizing app ads? What is it specifically that you do in this bigger app ad function?

Ad engagement model

Abercrombie: Specifically what Tapjoy does is enable this rewarded ad engagement model, so that the advertisers know that people are going to be paying attention to their ads and so that the publishers know that the ads we're displaying are compatible with their app and are not going to produce a jarring experience. We want everybody to be happy -- the publishers, the advertisers, and the users. That’s a delicate compromise that’s Tapjoy’s strength.

Gardner: And when you get an end user to do something, to take an action, that’s very powerful, not only because you're getting them to do what you wanted, but you can evaluate what they did under what circumstances and so forth. Tell us about the model of the end user specifically. What is it about engaging with them that leads to the data -- which we will get to in a moment?

Abercrombie: In our model of the user, we talk about long-term value. So even though it may be a new user who has just started with us, maybe their first engagement, we like to look at them in terms of their long-term value, both to the publishers and the advertiser.

We don’t want people who are just engaging with the ad and going away, getting what they want and not really caring about it. Rather, we want good users who will continue their engagement and continue this process. Once again, that takes some fairly sophisticated machine-learning algorithms and very powerful inferences to be able to assess the long-term value.

As an example, we have our publishers who are also advertisers. They're advertising their app within our platform and for them the conversion event, what they are looking for, is a download. What we're trying to do is to offer them users who will not only download the game once to get that initial payoff reward, but will value the download and continue to use it again and again.

So all of our models are designed with that end in mind -- to look at the long-term value of the user, not just the immediate conversion at this instant in time.

Gardner: So perhaps it’s a bit of a misnomer to talk about ads in apps. We're really talking about a value-add function in the app itself.

Abercrombie: Right. The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising. If it’s another app, they want good users for whom that app is relevant and useful.

That’s really the way we look at it. That’s the way to enhance the overall experience in the long-term. We're not just in it for the short-term. We're looking at developing a good solid user base, a good set of users who engage thoroughly.

Gardner: And as I said in my set-up, there's nothing hotter in all of advertising than mobile apps and how to do this right. It’s early innings, but clearly the stakes are very high.

A tough business

Abercrombie: And it’s a tough business. People are saturated. Many people don’t want ads. Some of the business models are difficult to master.

For instance, there may be a sequence of multiple ad units. There may be a video followed by another ad to download something. It becomes a very tricky thing to balance the financing here. If it was just a simple pass-through and we take a cut, that would be trivial, but that doesn't work in today's market. There are more sophisticated approaches, which do involve business risk.

If we reward the user, based on the fact that they're watching the video, but then they don't download the app, then we don't get money. So we have to look very carefully at the complexity of the whole interaction to make it as smooth and rewarding as possible, so that the thing works. That's difficult to do.
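
The risk described here can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and every number used with them are hypothetical, not Tapjoy figures. It assumes the network pays the user a reward for every video watched and is paid by the advertiser only when a download follows.

```python
def expected_margin_per_view(p_download: float,
                             payout_per_download: float,
                             reward_cost_per_view: float) -> float:
    """Expected profit per rewarded video view.

    All inputs are hypothetical illustration values, not Tapjoy data.
    The reward is paid on every view; revenue arrives only on a download.
    """
    return p_download * payout_per_download - reward_cost_per_view


def break_even_download_rate(payout_per_download: float,
                             reward_cost_per_view: float) -> float:
    """Download rate at which the expected margin is exactly zero."""
    return reward_cost_per_view / payout_per_download
```

Under these made-up inputs, a 2.00 payout per download and a 0.04 reward per view break even at a 2 percent download rate; below that rate every rewarded view loses money, which is why the whole interaction has to be modeled rather than just the first step.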

Gardner: So we're in a dynamic, fast-growing, fairly fresh, new industry. Knowing what's going to happen before it happens is always fun in almost any industry, but in this case, it seems with those high stakes and to make that monetization happen, it’s particularly important.

Tell me now about gathering such large amounts of data, being able to work with it, and then allowing analysis to happen very swiftly. How do you go about making that possible?

Abercrombie: Our data architecture is relatively standard for this type of clickstream operation. There is some data that can be put directly into a transactional database in real time, but typically, that's only when you get to the very bottom of the funnel, the conversion stuff. But all that clickstream stuff gets written as JSON-formatted log files, gets swept up by a queuing system, and then put into our data systems.
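
As a rough illustration of that split, the following sketch parses JSON-formatted log lines and separates bottom-of-the-funnel conversion events from the rest of the clickstream. The field names (`type`, `user_id`, `ts_ms`), the `CONVERSION_TYPES` set, and the function name are assumptions made for the example; the interview does not describe Tapjoy's actual log schema or routing code.

```python
import json
from typing import Iterable

# Hypothetical event types treated as "bottom of the funnel".
CONVERSION_TYPES = {"conversion"}


def route_log_lines(lines: Iterable[str]):
    """Split JSON-formatted log lines into conversions and clickstream events.

    Returns (conversions, clickstream, rejected). Malformed lines are kept
    in `rejected` rather than silently dropped.
    """
    conversions, clickstream, rejected = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if not isinstance(event, dict):
            rejected.append(line)
            continue
        if event.get("type") in CONVERSION_TYPES:
            conversions.append(event)   # would go to the transactional database
        else:
            clickstream.append(event)   # would be handed to the queuing system
    return conversions, clickstream, rejected
```

In a real pipeline the two lists would be replaced by writes to a database and a queue; plain lists keep the sketch self-contained.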

Our legacy system involved a homegrown queuing system, dumping data into HDFS. From there, we would extract and load CSVs into Vertica. As with so many other organizations, we're moving to more real-time operations. Our queuing system has evolved from a couple of different homegrown applications, and now we're implementing Apache Kafka.

We use Spark as part of our infrastructure, as sort of a hub, if you will, where data is farmed out to other systems, including a real-time, in-memory SQL database, which is fairly new to us this year. Then, we're still putting data in HDFS, and that's where the machine learning occurs. From there, we're bringing it into Vertica.

In Vertica -- and our Vertica cluster has two main purposes -- there is the operational data store, which has the raw, flat tables that are one row for every event, with the millisecond timestamps and the IDs of all the different entities involved.

From that operational data store, we do a pure SQL ETL extract into kind of an old-school star schema within Vertica, the same database.
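
To show what a pure SQL extract from a flat event table into a star-schema fact table can look like, here is a minimal sketch. It uses Python's built-in SQLite as a stand-in for the warehouse, and the table names, column names, and sample rows are all hypothetical; the real Vertica schema is not described in the interview.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the warehouse here
conn.executescript("""
CREATE TABLE ods_event (          -- operational data store: one row per event
    event_ts_ms   INTEGER,        -- millisecond timestamp
    event_type    TEXT,
    user_id       INTEGER,
    app_id        INTEGER,
    ad_id         INTEGER
);
CREATE TABLE fact_daily_ad (      -- fact table: one row per day, app, and ad
    day_key       TEXT,
    app_id        INTEGER,
    ad_id         INTEGER,
    impressions   INTEGER,
    conversions   INTEGER
);
""")
conn.executemany(
    "INSERT INTO ods_event VALUES (?, ?, ?, ?, ?)",
    [
        (1467936000000, "impression", 1, 10, 100),  # 2016-07-08 UTC
        (1467936001000, "impression", 2, 10, 100),
        (1467936002000, "conversion", 1, 10, 100),
        (1468022400000, "impression", 3, 10, 100),  # 2016-07-09 UTC
    ],
)

# The ETL step is a single SQL statement: aggregate raw events to the fact grain.
conn.execute("""
INSERT INTO fact_daily_ad (day_key, app_id, ad_id, impressions, conversions)
SELECT date(event_ts_ms / 1000, 'unixepoch'),
       app_id,
       ad_id,
       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END),
       SUM(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END)
FROM ods_event
GROUP BY date(event_ts_ms / 1000, 'unixepoch'), app_id, ad_id
""")

fact_rows = conn.execute(
    "SELECT day_key, app_id, ad_id, impressions, conversions "
    "FROM fact_daily_ad ORDER BY day_key"
).fetchall()
```

Because source and target live in the same database, no data leaves the engine during the transform; the only moving part is the SQL itself.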

Pure SQL

So our business intelligence (BI) ETL is pure SQL and goes into a full-fledged snowflake schema, moderately denormalized with all the old-school bells and whistles, the type 1, type 2, slowly changing dimensions. With Vertica, we're able to denormalize that data warehouse to a large degree.
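
For readers unfamiliar with the terms, a type 1 change overwrites a dimension attribute in place, while a type 2 change closes the current row and adds a new one so that history is preserved. The sketch below shows both on a hypothetical `dim_app` table, again with SQLite as a stand-in; the columns and helper functions are illustrative assumptions, not Tapjoy's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_app (
    app_key     INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    app_id      INTEGER,     -- natural key from the source system
    app_name    TEXT,        -- type 1 attribute: overwritten in place
    category    TEXT,        -- type 2 attribute: history is preserved
    valid_from  TEXT,
    valid_to    TEXT,        -- NULL marks the current row
    is_current  INTEGER
)
""")
conn.execute(
    "INSERT INTO dim_app (app_id, app_name, category, valid_from, valid_to, is_current) "
    "VALUES (10, 'Puzle Game', 'puzzle', '2016-01-01', NULL, 1)"  # name typo is deliberate
)


def apply_type1_name(conn, app_id, new_name):
    """Type 1: overwrite the attribute on every row; no history is kept."""
    conn.execute("UPDATE dim_app SET app_name = ? WHERE app_id = ?", (new_name, app_id))


def apply_type2_category(conn, app_id, new_category, change_date):
    """Type 2: close the current row and insert a new current row."""
    row = conn.execute(
        "SELECT app_name, category FROM dim_app WHERE app_id = ? AND is_current = 1",
        (app_id,),
    ).fetchone()
    if row is None or row[1] == new_category:
        return  # unknown app, or nothing changed
    conn.execute(
        "UPDATE dim_app SET valid_to = ?, is_current = 0 "
        "WHERE app_id = ? AND is_current = 1",
        (change_date, app_id),
    )
    conn.execute(
        "INSERT INTO dim_app (app_id, app_name, category, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, ?, NULL, 1)",
        (app_id, row[0], new_category, change_date),
    )


apply_type1_name(conn, 10, "Puzzle Game")                  # fix the typo in place
apply_type2_category(conn, 10, "strategy", "2016-07-01")   # keep the old category row

history = conn.execute(
    "SELECT app_name, category, valid_from, valid_to, is_current "
    "FROM dim_app ORDER BY app_key"
).fetchall()
```

After both calls the table holds two rows for the same app: the closed "puzzle" row and the current "strategy" row, both carrying the corrected name.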

Sitting on top of that we have a BI tool. We use MicroStrategy, for which we have defined our various metrics and our various attributes, and it’s very adept at knowing exactly which fact table and which dimensions to join.

So we have sort of a hybrid architecture. I'd say that we have all the way from real-time, in-memory SQL, Hadoop and all of its machine learning and our algorithmic pipelines, and then we have kind of the old-school data warehouse with the operational data store and the star schema.

Gardner: So a complex, innovative, custom architectural approach to this and yet I'm astonished that you are running and using Vertica in multiple ways with two part-time DBAs. How is it possible that you have minimal labor, given this topology that you just described?

Abercrombie: Well, we found Vertica very easy to manage. It has been very well-behaved, very stable.

For instance, we don’t even really use the Management Console, because there is not enough to manage. Our cluster is about 120 terabytes. It’s only on eight nodes and it’s pretty much trouble free.

One of the part-time DBAs deals with more operating-system-level stuff -- patches, cluster recovery, those sorts of issues. And the other part-time DBA is me. I deal more with data structure design, SQL tuning, and Vertica training for our staff.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

When we first started out, we tried running Vertica in Amazon EC2. Mind you, this was four or five years ago. Amazon EC2 was not where it is today. It failed. It was very difficult to manage. There were perplexing problems that we couldn’t solve. So we moved our Vertica and essentially all of our big-data data systems out of the cloud onto dedicated hardware, where they are much easier to manage and much easier to bring the proper resources.

Then, at one time in our history, when we built a dedicated hardware cluster for Vertica, we failed to heed properly the hardware planning guide and did not provision enough disk I/O bandwidth. In those situations, Vertica is unstable, and we had a lot of problems.

But once we got the proper disk I/O, it has been smooth sailing. I can’t even remember the last time we even had a node drop out. It has been rock solid. I was able to go on a vacation for three weeks recently and know that there would be no problem, and there was no problem.

Gardner: The ultimate key performance indicator (KPI), "I was able to go on vacation."

Fairly resilient

Abercrombie: Exactly. And with the proper hardware design, HPE Vertica is fairly resilient against out-of-control queries. There was a time when half my time was spent monitoring for slow queries, but again, with the proper hardware, it's smooth sailing. I don’t even bother with that stuff anymore.

Our MicroStrategy BI tool writes very good SQL. Part of the key to our success with this BI portion is designing the Vertica schema and the MicroStrategy metadata layer to take advantage of each other’s strengths and avoid each other’s weaknesses. So that really was key to the stable, exceptional performance we get. I basically get no complaints of slow queries from my BI tool. No problem.

Gardner: The right kind of problem to have.

Abercrombie: Yes.

Gardner: Okay, now that we have heard quite a bit about how you are doing this, I'd like to learn, if I could, about some of the paybacks when you do this properly, when it is running well, in terms of SQL queries, ETL load times reduction, the ability for you to monetize and help your customers create better advertising programs that are acceptable and popular. What are the paybacks technically and then in business terms?

Abercrombie: In order to get those paybacks, a key element was confidence in the data, the results that we were shipping out. The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.
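
One common form of ETL quality control is reconciling counts between the raw store and the loaded fact table. The interview does not say which checks Tapjoy runs, so the following is only a sketch of that general idea; the function name, the per-day dictionaries, and the tolerance parameter are assumptions.

```python
def reconcile_counts(source_counts: dict, fact_counts: dict, tolerance: float = 0.0) -> list:
    """Compare per-day event counts from the raw store against the fact table.

    Returns a list of (day, source_count, fact_count) for days whose relative
    difference exceeds `tolerance`. An empty list means the load passed.
    """
    failures = []
    for day in sorted(set(source_counts) | set(fact_counts)):
        src = source_counts.get(day, 0)
        fact = fact_counts.get(day, 0)
        baseline = max(src, 1)  # avoid division by zero on empty days
        if abs(src - fact) / baseline > tolerance:
            failures.append((day, src, fact))
    return failures
```

A load job could call this after each batch and refuse to publish to the BI layer whenever the returned list is non-empty.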

What that also means is that as a product is under development and when it’s not ready yet, the instrumentation isn’t ready, that stuff doesn’t make it into our BI tool. You can only get that stuff from ad hoc.

So the benefit has been a very clear understanding of the day-to-day operations of our ad network, both for our internal monitoring to know when things are behaving properly, when the instrumentation is working as expected, and when the queues are running, but also for our customers.

Because of the flexibility we get from a traditional BI system with 500 metrics over a couple of dozen dimensions, our customers, the publishers and the advertisers, get incredible detail, customized exactly the way they need for ingestion into their systems or to help them understand how Tapjoy is serving them. Again, that comes from confidence in the data.

Gardner: When you have more data and better analytics, you can create better products. Where might we look next to where you take this? I don’t expect you to pre-announce anything, but where can you now take these capabilities as a business and maybe even expand into other activities on a mobile endpoint?

Flexibility in algorithms

Abercrombie: As we expand our business and move into new areas, what we really need is flexibility in our algorithms and the way we deal with some of our real-time decision making.

So one area that’s new to us this year is the in-memory SQL database like MemSQL. Some of our old real-time ad optimization was based on pre-calculating data and serving it up through HBase KeyValue, but now, where we can do real-time aggregation queries using SQL, that is easy to understand, easy to modify, very expressive and very transparent. It gives us more flexibility in terms of fine-tuning our real-time decision-making algorithms, which is absolutely necessary.
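
The contrast between the two approaches can be sketched in a few lines. A plain dictionary plays the role of the pre-calculated key-value store and SQLite plays the role of the in-memory SQL database; neither is the actual system named above, and the table, columns, and sample events are hypothetical.

```python
import sqlite3

events = [
    # (ad_id, country, event_type) -- hypothetical sample data
    (100, "US", "impression"),
    (100, "US", "conversion"),
    (100, "KR", "impression"),
    (200, "US", "impression"),
    (200, "US", "impression"),
    (200, "US", "conversion"),
]

# Approach 1: pre-calculate one metric per key and serve it from a key-value map.
# The question (conversion rate per ad) is fixed at the time the batch job runs.
precomputed = {}
for ad_id in {e[0] for e in events}:
    imps = sum(1 for e in events if e[0] == ad_id and e[2] == "impression")
    convs = sum(1 for e in events if e[0] == ad_id and e[2] == "conversion")
    precomputed[ad_id] = convs / imps if imps else 0.0

# Approach 2: keep the raw events in a SQL table and aggregate at query time.
conn = sqlite3.connect(":memory:")  # stand-in for an in-memory SQL database
conn.execute("CREATE TABLE ad_event (ad_id INTEGER, country TEXT, event_type TEXT)")
conn.executemany("INSERT INTO ad_event VALUES (?, ?, ?)", events)


def conversion_rate(conn, ad_id, country=None):
    """Aggregate on demand; a new filter changes the query, not a batch job."""
    sql = (
        "SELECT SUM(CASE WHEN event_type = 'conversion' THEN 1 ELSE 0 END), "
        "       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) "
        "FROM ad_event WHERE ad_id = ?"
    )
    params = [ad_id]
    if country is not None:
        sql += " AND country = ?"
        params.append(country)
    convs, imps = conn.execute(sql, params).fetchone()
    return (convs or 0) / imps if imps else 0.0
```

With the key-value map, asking for the rate by country would mean recomputing and reloading a new set of keys. With the SQL table it is one extra `WHERE` clause, which is the kind of flexibility the answer above is pointing at.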

As an example, we acquired a company in Korea called 5Rocks that does app tech and that tracks the users within the app, like what level they're on, or what activities they're doing and what they enjoy, with an eye towards in-app purchase optimization.

And so we're blending the in-app purchase optimization along with traditional ad network optimization, and the two have different rules and different constraints. So we really need the flexibility and expressiveness of our real-time decision making systems.

Gardner: One last question. You mentioned machine learning earlier. Do you see that becoming more prominent in what you do and how you're working with data scientists, and how might that expand in terms of where you employ it?

Abercrombie: Tapjoy started with machine learning. Our data scientists work in machine learning. Our predictive algorithm team is about six times larger than our traditional Vertica BI team. Mostly what we do at Tapjoy is predictive analytics and various machine-learning things. So we wouldn't be alive without it. And we expanded. We're not shifting in one direction or another. It's apples and oranges, and there's a place for both.

Sponsor: Hewlett Packard Enterprise.

