The Road to Success with Big Data – Expectations vs. the Reality

Big Data is complex. Big Data technologies are maturing rapidly, but in many ways they are still in an adolescent phase. While Hadoop dominates the charts for Big Data, in recent years we have seen a variety of technologies born out of the early movers in this space, such as Google, Yahoo, Facebook and Cloudera. To name a few:

  • MapReduce: Programming model in Java for parallel processing of large data sets in Hadoop clusters
  • Pig: A high-level scripting language to create data flows from and to Hadoop
  • Hive: SQL-like access for data in Hadoop
  • Impala: SQL query engine that runs inside Hadoop for faster query response times

It’s clear that interaction with Hadoop has matured beyond pure programming in Java into abstraction layers that look and feel like SQL. Much of this is driven by the scarcity of big data talent, and therefore the mantra of “the more we make Big Data feel like structured data, the better adoption it will gain.”
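
To make that SQL-like layer concrete, here is a minimal sketch of querying Hive from Java over JDBC, assuming a HiveServer2 endpoint; the host, credentials and sales table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 exposes a JDBC endpoint, so Hadoop data can be
        // queried much like any relational database.
        // Host, credentials and the "sales" table are hypothetical.
        String url = "jdbc:hive2://hadoop-master.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL; under the covers it compiles
             // into batch jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, COUNT(*) AS orders FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + ": " + rs.getLong("orders"));
            }
        }
    }
}
```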

But wait, not so fast. You can make Hadoop act like a SQL data store, but there are consequences, as Chris Deptula from OpenBI explains in his blog, A Cautionary Tale for Becoming too Reliant on Hive. For more complex queries, choosing Hive means forgoing the flexibility and speed of pure programming or of a visual interface to MapReduce.
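
For contrast, here is a rough sketch of the “pure programming” side of that tradeoff: the same per-region count as the HiveQL one-liner above, written as Hadoop MapReduce classes in Java (job-driver boilerplate omitted; the input format is assumed to be one comma-separated sale record per line). It is illustrative only, but it shows how much machinery a single SQL-like aggregation hides.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: assumes one comma-separated sale record per input line,
// e.g. "2013-01-15,west,249.99"; emits (region, 1) per record.
public class RegionCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text region = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        region.set(fields[1]);
        context.write(region, ONE);
    }
}

// Reducer: sums the per-region counts emitted by the mapper.
class RegionCountReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) sum += v.get();
        context.write(key, new LongWritable(sum));
    }
}
```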

This goes to show that numerous advancements in Hadoop have yet to be achieved, in this case better performance optimization in Hive. I come from the relational world, namely DB2, where we spent a tremendous amount of time making a high-performance transactional database first developed in the 1970s even more powerful in the 2000s, and that journey continues today.

Granted, the rate of innovation is much faster today than it was 10, 20, 30 years ago, but we are not yet at the finish line with Hadoop. We need to understand the realities of what Hadoop can and cannot do today, while we forge ahead with big data innovation.

Here are a few areas of opportunity for innovation in Hadoop and strategies to fill the gap:

  • High-Performance Analytics: Hadoop was never built to be a high-performance data interaction platform. Although there are newer technologies that are cracking the nut on real-time access and interactivity with Hadoop, fast analytics still need multi-dimensional cubes, in-memory and caching technology, analytic databases or a combination of them.
  • Security: There are security risks within Hadoop. It would not be in your best interest to open the gates for all users to access information within Hadoop. Until this gap is closed further, a data access layer can help you extract just the right data out of Hadoop for interaction.
  • APIs: Business applications have lived on relational data sources for a long time. However, with web, mobile and social applications, there is a need to read, write and update data in NoSQL data stores such as Hadoop. Instead of direct programming, APIs can simplify this effort for the millions of developers building the next generation of applications (see the sketch after this list).
  • Data Integration, Enrichment, Quality Control and Movement: While Hadoop stands strong in storing massive amounts of unstructured and semi-structured data, it is not the only infrastructure in today’s data management environments. Therefore, easy integration with other data sources is critical for long-term success.
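
To illustrate the API point above, here is a minimal sketch of reading a file out of HDFS over the WebHDFS REST interface rather than writing Hadoop code; the namenode host, port, path and user are hypothetical, and the cluster is assumed to have WebHDFS enabled without Kerberos.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsReadExample {
    public static void main(String[] args) throws Exception {
        // WebHDFS exposes HDFS over plain HTTP, so any language with an
        // HTTP client can read files without Hadoop libraries.
        // Host, port, path and user are hypothetical.
        URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/data/clickstream/part-00000"
                + "?op=OPEN&user.name=analyst");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // namenode redirects to a datanode
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```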

The road to success with Hadoop is full of opportunities and obstacles, and it is important to understand what is possible today and what to expect next. With all the hype around big data, it is easy to expect Hadoop to do anything and everything. However, successful companies are those that choose the combination of technologies that works best for them.

What are your Hadoop expectations?

- Farnaz Erfan, Product Marketing, Pentaho

This blog was originally posted here.

Is Hadoop Knowledge a Must-Have for Today’s Big Data Scientist?

Finding data scientists and other highly technical resources who understand the complexity of big data is one of the most common roadblocks to getting value from big data. Typically, these resources need to understand Hadoop and new programming methods to read, manipulate and model big data.

As big data analytics tools advance, wrangling these technologies will become less difficult, so the data scientists who stand out will be those who master additional skills.

To make a real business impact, data scientists must have:

1. Innate analytical skills
They must have a natural curiosity for experimenting with data and often begin analysis without a clear picture of the end goal. This is a different paradigm than solving a specific, identified problem through coding or by running a query.

2. Business finesse
Sexy dashboards ultimately fail if a business doesn’t act on what the data is indicating. To succeed, data scientists must know how to translate the impact of their insights to the business.

3. Collaboration skills
Teamwork and the ability to collaborate across an organization separate those who use data to drive change from those who merely build interesting algorithms.

Big data advancements have brought technologies such as Hadoop to democratize big data for all. However, individuals skilled at data manipulation and programming in Hadoop remain scarce. Fortunately, new, innovative and easy-to-use big data discovery applications have broadened big data access to those without deep technical skills.

So the question is: will these new types of discovery applications for big data demand a different kind of data scientist going forward, one with analytical, interpersonal and business skills? Or will an in-depth understanding of emerging technologies such as Hadoop remain the most important skill for data scientists?

- Farnaz Erfan, Product & Solution Marketing, Pentaho

This blog was originally posted on SmartData Collective.

How Predictive Analytics Saved Tesla

In the last couple of weeks, the feud between New York Times reporter John Broder and Tesla Motors’ CEO Elon Musk has played out in the media.

It all started when Broder took a highway trip between Washington, D.C. and Boston, cruising in Tesla’s Model S luxury sedan. The purpose of the trip was to range-test the car between two new supercharging stations. The roughly 200-mile leg was well under the Model S’s 265-mile estimated range, but the trip was nonetheless filled with anxiety for Broder. Fearful of not reaching his charging destination, he turned off battery-draining amenities such as the radio and heater (in 30-degree weather) to finally reach his destination, feet and knuckles “frozen”.

In rebutting Broder’s claims, Tesla’s chief executive, Elon Musk, charged that the story was faked and that Mr. Broder intentionally caused his car to fail. On the Tesla blog, he released graphs and charts based on driving logs that contest many of the details of Mr. Broder’s article.

With the logs now published, one thing is clear: Tesla’s use of predictive analytics helped it warn Broder about what lay ahead. By projecting the remaining range from energy consumption, Tesla signaled Broder to charge the vehicle in time. Had Tesla not been able to call its log files as witness, this futuristic motor tech company could have suffered serious brand damage.
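
Tesla has not published its algorithm, but the arithmetic behind such a warning is simple to sketch: project the remaining range from the energy left in the pack and the recent rate of consumption, then warn when the projection no longer covers the distance to the next charger. All figures and thresholds below are illustrative assumptions, not Tesla’s.

```java
public class RangePredictor {
    // Projects remaining range from battery state and recent consumption.
    // All parameters are illustrative, not Tesla's actual figures.
    static double projectedRangeMiles(double remainingKwh, double recentWhPerMile) {
        return remainingKwh * 1000.0 / recentWhPerMile; // kWh -> Wh, then Wh / (Wh/mile)
    }

    static boolean shouldWarn(double projectedMiles, double milesToCharger,
                              double safetyMarginMiles) {
        // Warn the driver when the projection no longer covers the distance
        // to the next charging station plus a safety margin.
        return projectedMiles < milesToCharger + safetyMarginMiles;
    }

    public static void main(String[] args) {
        // Hypothetical: 20 kWh left, consuming 380 Wh/mile in cold weather.
        double range = projectedRangeMiles(20.0, 380.0); // ~52.6 miles
        System.out.printf("Projected range: %.1f miles%n", range);
        System.out.println("Warn driver: " + shouldWarn(range, 61.0, 10.0));
    }
}
```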

What’s interesting is that Tesla’s story is not unique. Today, virtually everything we use (an appliance, a mobile phone, an application) generates some sort of machine-generated data, and the truth lives behind that data. Such data, when analyzed and mined properly, provides indicators that solve problems ahead of time.

Having real-time access to machine-generated data to foresee problems and improve performance is exactly why NetApp uses Pentaho. Using Hadoop and Pentaho Business Analytics to process and drive insights from 2-5 TB of incoming data per week, NetApp has built a solution that sends alerts and notifications ahead of actual hardware failures. The solution has helped NetApp predict appliance interruptions for its E-Series storage units, offering new ways to exceed customer SLAs and protect the brand’s image.

Tesla, NetApp or other, if you run a data-driven business, the more your company can act on that data to improve your application, service or product performance, the better off your customers and the better your brand will be.

Pentaho Business Analytics gives companies fast and easy ways to collect, analyze and predict data patterns. Pentaho’s customers see the value of analytics across many different facets and use cases. NetApp’s use case will be featured at the upcoming Strata conference on Thursday, February 28, 2013.

Join us to find out more.

- Farnaz Erfan, Product and Solution Marketing, Pentaho

Looking to the Future of Business Analytics with Pentaho 4.8

Last week Pentaho announced Pentaho 4.8, another milestone in delivering the future of analytics. It has been an exciting ride, and feedback from our partners and customers has kept us energized and ready to push further into the future.

Pentaho 4.8 is a true testament to what the future of analytics needs. That future is driven by the data problems businesses face every day, and it depends on the users of that information and their expectations for solving those problems.

Let me give you a good example. I recently had the pleasure of meeting with one of our customers, BeachMint. BeachMint is a fashion and style ecommerce company that uses celebrities and celebrity stylists to promote its retail business.

This rapidly growing online retailer needs to keep tabs on its large Twitter and Facebook communities to track customer sentiment and social influence. It then uses the social data to define customer cohorts and design marketing campaigns that best target each cohort.

For BeachMint, insight into data is extremely important. On one hand, the volume and variety of data, in this case unstructured social data and click-through ad feeds, have increased its complexity. On the other hand, the speed at which that data is created has accelerated rapidly. For example, in addition to analyzing the impact of customer sentiment on purchasing behavior, BeachMint also needs up-to-the-minute information on the activity of key promotional codes, to immediately identify those that leak out.

Pentaho understands these data challenges and user expectations. This release takes full advantage of Pentaho’s tightly coupled Data Integration and Business Analytics platform to simplify data exploration, discovery and visualization for all users and all data types, and to deliver this information to users immediately, sometimes even at the micro-second level. In this release Pentaho delivers:

- Pentaho Mobile – the only Mobile BI application with the power to instantly create new analysis on the go.

- Pentaho Instaview – the industry’s first instant and interactive big data visualization application.

Want to find out more? Register for the Pentaho 4.8 webinar and see for yourself.

- Farnaz Erfan, Product Marketing, Pentaho

Is Your Big Data Hot or Not?

Data is the most strategic asset for any business. However, massive volume and variety have made catching data at the right time and place, and discovering what’s hot (and needs more attention) and what’s not, a bit trickier these days.

Heat grids are ideal for seeing a range of values in data: they provide a gradient scale, showing changes in data intensity through color. For example, you can see what’s hot in red, what’s normal in green, and everything else in various shades in between. Let me give you two examples of how companies have used heat grids to see whether their data is hot or not:

Example #1 – A retailer is looking at week-by-week sales of a new fashion line to understand how each product line is performing as items get continually discounted throughout the season. Data is gathered from thousands of stores across the country and then entered into a heat grid graph that includes:

  • X axis – week 1 through 12, beginning from the launch of a new campaign (e.g. Nordstrom’s Summer Looks)
  • Y axis – product line (e.g. shoes, dresses, skirts, tops, accessories)
  • Color of the squares – % of discount (e.g. dark red = 70%, red = 60%, orange = 50%, yellow = 30%, green = full price)
  • Size of the squares – # of units sold

Looking at this graph, the retailer can easily see that most shoes sell at the beginning of the season – even without heavy discounts. This helps the retailer predict inventory levels to keep up with the demand for shoes.

It also shows that accessories almost never sell at regular prices, nor do they sell well when discounts exceed 70%. Knowing this, the retailer can control its capital spending by not overstocking this item. The retailer can also increase profit per square foot of store space by selling accessories earlier in the season, avoiding steep markdowns and inventory overstocks at the end of the season.

Example #2 – A digital music streaming service provider is using analytics to assess the performance of its sales channels (direct vs. sales through social media sites such as Facebook and Twitter) to guide future marketing and development spend. For that, the company uses a heat grid (see the sketch after this list) to map out:

  • X axis – various devices (iPhone, iPad, Android Smartphone, Android Tablet, Blackberry)
  • Y axis – various channels (direct site, Facebook, Twitter, …)
  • Color of the circles – # of downloads (0-100 = red, 100-1000 = orange, 1000-10000 = yellow, 10000+ = green)
  • Size of the circles – app usage hours per day – the bigger the size, the more usage
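
To show how mechanical this visual encoding is, here is a small sketch that maps one cell’s raw measures onto the grid’s color and size, using the download bins listed above; the radius scaling and the sample figures are illustrative assumptions.

```java
public class HeatGridEncoder {
    // Maps download counts to the color bins from the example above.
    static String color(long downloads) {
        if (downloads < 100) return "red";
        if (downloads < 1000) return "orange";
        if (downloads < 10000) return "yellow";
        return "green";
    }

    // Maps daily usage hours to a circle radius in pixels; the scaling
    // factor is an arbitrary illustrative choice.
    static double radiusPx(double usageHoursPerDay) {
        return 4.0 + 3.0 * usageHoursPerDay;
    }

    public static void main(String[] args) {
        // One cell of the grid: the iPhone app sold through Facebook.
        long downloads = 8500;    // hypothetical figure
        double usageHours = 2.5;  // hypothetical figure
        System.out.println("cell(iPhone, Facebook): color=" + color(downloads)
                + ", radius=" + radiusPx(usageHours) + "px");
    }
}
```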

This graph helps the music service provider analyze data from millions of records to quickly understand the popularity and usage patterns of their application on different devices, sold through different channels.

Heat grids can be used in a variety of other contexts, such as survey scales, product rating analysis, customer satisfaction studies, risk analysis and more. Are you ready to find out whether your big data is hot or not? Check out this 3-minute video to learn how heat grids can help you.

Understanding buyers and users and their behavior is helping many companies, including ideeli, one of the most popular online retailers, and Travian Games, a top German MMO (massively multiplayer online) game publisher, gain better insight from their hottest asset: their big data!

What is your hottest business asset?

- Farnaz Erfan, Product Marketing, Pentaho

This blog was originally posted on Smart Data Collective.

4 Questions to Ask Before You Define Your Cloud BI Strategy

These days, when it comes to enterprise software, it seems it is all about the cloud. Software applications such as Salesforce, Marketo and Workday have made quite a name for themselves in this space. Can Business Intelligence follow the same path to success? Does it make sense to house your BI in the cloud? I believe it depends. Let’s explore why.

There are four criteria that impact the decision between a cloud and an on-premise BI strategy. Let’s take a look at how they affect your approach.

Question 1: Where is the data located?

Your BI strategy should vary depending on the location of your data. If your data is distributed, some of it may already be in the cloud (e.g. web data and clickstreams) and some on-premise, such as corporate data. For real-time or near-real-time analytics, you need to deploy your BI as close to the source as possible. For example, when analyzing supply chain data out of an on-premise SAP system, where your database, application and infrastructure all sit on-premise, it is expensive and frankly impractical to move the data to the cloud before you start analyzing it.

Your data can also be geographically distributed. Unless your cloud infrastructure is co-located with your data geo zones, your BI experience can suffer from data latency and long refresh intervals.

Question 2: What are the security levels of data?

It’s important to acknowledge that data security levels are different in the cloud. You may not be able to put all your analytics outside of the company firewall. According to Cisco’s 2012 Global Cloud Networking survey, 72% of respondents cited data protection security as the top obstacle to a successful implementation of cloud services.

Question 3: What are the choice preferences of your users?

Customer preference is extremely important today. The balance of power has shifted, and users and customers are now the ones who decide whether an on-premise or a cloud deployment is suitable for them. What’s more, each customer’s maturity model is different. As an application provider or business process automation provider, you need to cater to your individual customers’ business needs.

Question 4: What operational SLAs does your Cloud BI vendor oblige you to?

Your operational SLAs can depend on cloud infrastructure providers, binding you to service quality levels different from what you need. Pure cloud BI vendors provide their BI software over the public Internet through a utility pricing and delivery scheme. As attractive as this model is when resources are limited, it’s not for everyone. In most cases, the SaaS BI vendor depends on IaaS vendors (such as Amazon, Savvis, OpSource, etc.) for storage, hardware and networks. As a result, the SaaS BI vendors’ operational processes have to align with the infrastructure vendors’ processes for housing, running, and backing up the BI software. Depending on your BI strategy, these nested and complex SLAs may or may not be the right choice.

Large enterprises, and even growth-minded mid-market companies, typically develop an IT strategy that is provider-agnostic and has the flexibility to be hosted on-premise or in the cloud. This strategy helps companies avoid lock-in and inflexibility down the road.

As cloud technology remains one of the hottest trends in IT today, it is important to assess whether cloud is the right choice for BI. The reality is that it depends. The center of gravity for BI is still on premise; however, it will move to the cloud over time mostly through the embedded BI capabilities of enterprise SaaS applications. Successful organizations will be the ones that can navigate the boundary between the two strategies and provide greater flexibility and choice by offering a product that can be deployed on-premise, in the cloud, or a hybrid of both.

What is your Business Intelligence Cloud strategy?

- Farnaz Erfan, Product Marketing, Pentaho

This blog was originally posted on Smart Data Collective.

The Diary of a Construction Manager in Love with His BI Tool

Hi, my name is Bob and I am a construction manager. I oversee all aspects of managing a construction project’s operations, including budgets, staffing, and compliance for the entire project.

In my 10+ years of experience, I have never had a Business Intelligence (BI) tool. I had to create spreadsheets to track daily activities, calculate risks and build formulas to measure impact. Given the size of the projects I worked on, this was extremely complex. As a result, I spent a lot of my time putting out fires that I knew could have been prevented if I had the right information.

Recently my company introduced BI to our team. Since I’m using BI for the first time, I decided to create an activity log similar to a diary of my project.

Let me share some highlights with you:

October 28, 2011

We are 4 weeks into the project. We have the crew working on the ground. The foundation is done. The structural engineer has finished his design. We are ready to roll.

January 11, 2012

This morning I received an alert about my preventative vs. corrective maintenance ratio. My monthly work mix by type looks like this: preventative 36%, repair 24%, rebuild 5% and modify 35%. My preventative work has gone down from an optimal 40% to 36%, and my repair work has increased correspondingly.

When I drill down into the repairs, I see that we are responding to a higher-than-normal number of heating and insulation work items. I am going to talk to Edward, my HVAC contractor, about it.

February 29, 2012

I have been monitoring our electrical work. Our average cost per square foot is 13% below the industry average. This is a breakthrough, thanks to the changes I have made since I began monitoring the project with BI and making data-driven decisions. BI lets me watch these costs on an ongoing basis, so I can take preventative action to stay below the industry average, protect our funding and even justify additional headcount.

March 16, 2012

Productivity rate is one of my favorite indicators because it provides real-time info about the performance of my team. On average, our productivity rate stays at optimal levels. However, the plumbing trade group’s actual cost is exceeding its estimated cost. This will affect my cost-to-complete and margins, as I have to pay overtime for this contractor.

But I don’t have to worry… my BI tool lets me drill into this indicator to see whether the reason is labor- or supply-related. Drill-through was something a spreadsheet could never let me do.

March 30, 2012

Two weeks have passed since I shifted resources for plumbing. Our productivity rates have improved since then, and the project is looking on time and on budget.

With 40 more days to go, I want to make sure we deliver on time and meet our SLA with the building owners. I see no bottlenecks. Cycle time, the average time to complete an activity, shows that we are actually 4 days ahead of schedule.

May 21, 2012

I’m very happy to report that we are done with the construction. The ROI on this project was greater than we expected and my client is very happy. Next weekend is the Memorial Day weekend. I have the time and money I need to take a nice vacation with my wife and son.

- As told by Bob, a fictional construction manager.

Even though the story is fictional, it’s based on reality. Business users and project managers, such as facility managers, supply chain logistics specialists, even dairy farmers, use Pentaho business intelligence to make their jobs easier and to make smarter, data-driven decisions, just like our fictional friend, Bob.

Who knew BI could be so handy for construction managers?

What is your secret BI story? Drop me a line.

- Farnaz Erfan, Product Marketing, Pentaho

This blog was originally posted on Smart Data Collective.