HadoopDevelopment: July 2015

Thursday, 16 July 2015

Hadoop with R

Introduction

Apache Hadoop provides a robust and economical platform for storing and processing big data. The R programming language is used by many data analysts for statistical analysis. In this article, I talk about putting the two together to form a powerful platform for big data analysis.

Apache Hadoop

Apache Hadoop has become synonymous with Big Data. Nobody talks about Big Data without doing something with Hadoop. Hadoop helps to complete your job faster by distributing the computations to a cluster of commodity machines. This makes it possible for organizations to cut their data management costs by as much as 90% and yet build a fault-tolerant data processing system.

Hadoop has two core components: HDFS for distributed storage and MapReduce for distributed processing. The Hadoop architecture can be represented by the diagram below.

A Hadoop cluster consists of two types of nodes (machines): one master node and multiple worker nodes. The NameNode and DataNode are processes that belong to HDFS, the Hadoop Distributed File System. The JobTracker and TaskTracker belong to MapReduce, the distributed processing system of Hadoop. User jobs are divided into two types of tasks, mappers and reducers. Mappers filter the data and convert it into key-value pairs. Reducers process each key and produce an aggregated output. Mappers read their input from HDFS and store their output on the local file system. Reducers fetch the output of the mappers and store the final output in HDFS. Since mappers and reducers follow a shared-nothing architecture, Hadoop provides a highly scalable parallel processing architecture.
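The map, group and reduce flow described above can be imitated on a single machine in plain R. The following is only a minimal sketch under stated assumptions: the records and the helper name map_record are invented for illustration, and nothing here talks to a real Hadoop cluster.

```r
# Toy records in "StoreId,date,sales" form; the values are made up for illustration.
records <- c(
  "NYT1,2010-01-01,1221",
  "NYT1,2010-01-02,1206",
  "BOS2,2010-01-01,900",
  "BOS2,2010-01-02,950"
)

# Map: turn one input record into a (key, value) pair.
map_record <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  list(key = fields[1], value = as.numeric(fields[3]))
}
pairs <- lapply(records, map_record)

# Shuffle: group all values that share a key (Hadoop does this between the two phases).
keys <- vapply(pairs, function(p) p$key, character(1))
values <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce: aggregate the values of each key independently of the other keys.
totals <- vapply(grouped, sum, numeric(1))
print(totals)
```

Because each key is reduced without looking at any other key, the reduce step is the part that a cluster can spread over many machines.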

R Programming language

The R programming language is widely used for statistical computing. With the increased interest in data analytics, usage of R has increased significantly. It is estimated that more than 70% of data scientists use R for statistical analysis. R is open source and free; it is released under the GNU General Public License. R has outperformed many expensive commercial products for statistical processing. The language itself is easy to learn and comes with many libraries that provide functions to model and analyze data, including extensive libraries for prediction and machine learning.

R provides many built-in functions for machine learning and predictive modeling. One of my favorites is the Holt-Winters model, which models time series data that has some randomness, a trend and seasonality. It is also called triple exponential smoothing. For example, if you have data in a file with a single column holding the sales per day of a particular store for the last five years, you can build a Holt-Winters model as below:

>salesTS <- ts(sales,frequency=52,start=c(2010,1))

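>hw<-HoltWinters(salesTS, seasonal="add", alpha=0.3,beta=0.2) # gamma is a smoothing weight between 0 and 1 and is estimated when omitted; the season length comes from the frequency of salesTS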

>p<-predict(hw,8,prediction.interval=TRUE)

>plot(p)

You will get a graph like the one below, which shows the 8 future points along with their upper and lower bounds:

[Figure: R prediction plot]
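To try the model without a sales file, the sketch below fits Holt-Winters to synthetic data. Everything about the data is an assumption made for illustration (20 weeks of invented daily sales with a weekly pattern); alpha and beta reuse the values from the example, and gamma is left out so that R estimates it.

```r
set.seed(42)  # reproducible synthetic data

# Invented daily sales: 20 weeks with a weekly pattern, a mild upward trend and noise.
n_days <- 140
weekly_pattern <- c(0, 20, 40, 60, 80, 150, 120)
sales <- 1000 + 0.5 * seq_len(n_days) +
  rep(weekly_pattern, length.out = n_days) +
  rnorm(n_days, sd = 15)

# frequency = 7 declares a 7-day season.
sales_ts <- ts(sales, frequency = 7)

# alpha and beta use the article's values; gamma is omitted so R estimates it.
fit <- HoltWinters(sales_ts, seasonal = "additive", alpha = 0.3, beta = 0.2)

# Forecast the next 7 days with upper and lower prediction bounds.
forecast <- predict(fit, n.ahead = 7, prediction.interval = TRUE)
print(fit$gamma)
print(forecast)
```

The season length is taken from the frequency of the time series, while alpha, beta and gamma are smoothing weights that must lie between 0 and 1.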

R-Hadoop

Now, can we put the power of Hadoop and the convenience of R together? R-Hadoop is one such attempt. You write your mapper and reducer functions in R and submit the jobs to Hadoop, which in turn distributes the work to R running on each machine in the cluster. The architecture can be represented by the diagram below:

You initiate your MapReduce job through the R-Hadoop server, which submits the job to the JobTracker. The JobTracker schedules the map and reduce tasks on the TaskTrackers running on each worker node. The map tasks run the mapper code in the R-Hadoop installation on the worker node. The R-Hadoop mapper receives its input as keys and values, processes the data and emits keys and values again for the reducer. The reducer task collects the keys and values and calls the R-Hadoop reduce function once per key with a list of values. R-Hadoop does not parallelize the algorithm itself; it distributes the work by distributing the keys and values. Suppose you have to run the above prediction for 200 stores and each store takes 10 minutes. Distributed on a Hadoop cluster with enough parallel task slots, all 200 stores can be processed within an hour.
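The arithmetic behind the 200-store example can be worked out in a few lines of R. This is a back-of-the-envelope sketch that assumes every store takes exactly 10 minutes and ignores scheduling and startup overhead.

```r
stores <- 200            # number of keys (stores), from the example above
minutes_per_store <- 10  # time to process one store, from the example above
deadline_minutes <- 60   # target wall-clock time

# A task slot processes its stores one after another, so it can finish
# this many stores before the deadline.
stores_per_slot <- floor(deadline_minutes / minutes_per_store)

# Number of parallel slots needed so that every store fits within the deadline.
slots_needed <- ceiling(stores / stores_per_slot)

serial_minutes <- stores * minutes_per_store
cat("serial run:", serial_minutes, "minutes\n")
cat("parallel slots needed:", slots_needed, "\n")
```

Under these assumptions a serial run takes 2,000 minutes, while 34 parallel slots bring the job within the hour.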

Installation and setup

Though R-Hadoop is not difficult to set up, it takes a lot of trial and error to make it work properly. The following steps set it up correctly:

On each machine in the cluster:

1. Do a package installation of R (on Ubuntu, you can add the line deb http://ftp.osuosl.org/pub/cran/bin/linux/ubuntu precise/ to /etc/apt/sources.list and then use apt-get to install r-base and r-base-dev)

2. Start R with sudo R and add the following packages:

install.packages(c("codetools", "Rcpp", "plyr", "stringi", "magrittr", "stringr", "reshape2", "caTools", "functional", "digest", "RJSONIO"))

3. Quit to the Linux command line and download the rmr2 package from any of the mirrors. The following is one of them:

wget http://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz

4. Install the package using the command below:

sudo R CMD INSTALL rmr2_3.3.1.tar.gz

If it reports that some package is missing or outdated, reinstall that package and try again.

The following steps need to be executed on the master node only:

5. Install RStudio Server on the master node. I found the following instructions for installing RStudio Server very useful:

sudo apt-get install gdebi-core

wget http://download2.rstudio.org/rstudio-server-0.99.464-i386.deb

sudo gdebi rstudio-server-0.99.464-i386.deb

This automatically starts RStudio Server, so after making the configuration changes below you will have to restart the server.

6. This is an important step for connecting the server to Hadoop. Find your Hadoop installation path and the Hadoop streaming jar file, and set two environment variables as below:

Edit the /etc/R/Renviron.site file and add the lines below at the end:

#following required for R-Hadoop

HADOOP_CMD=/usr/local/hadoop/bin/hadoop

HADOOP_STREAMING=/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar

Your hadoop path and jar file may be different based on your hadoop version.
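After restarting R, it can help to confirm that both variables are visible and point at real files. The helper below is a hypothetical convenience function written for this article, not part of rmr2; it only uses base R.

```r
# Report whether each environment variable is set and points at an existing file.
check_hadoop_env <- function(vars = c("HADOOP_CMD", "HADOOP_STREAMING")) {
  paths <- unname(Sys.getenv(vars, unset = NA))
  is_set <- !is.na(paths)
  data.frame(
    variable = vars,
    path = paths,
    is_set = is_set,
    # file.exists("") is FALSE, so unset variables are reported as missing files.
    file_exists = file.exists(ifelse(is_set, paths, "")),
    stringsAsFactors = FALSE
  )
}

print(check_hadoop_env())
```

If either row shows FALSE, the path in Renviron.site is the first thing to check.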

7. Now restart RStudio Server for the above changes to take effect:

sudo rstudio-server restart

8. You can connect to RStudio Server from a browser on any machine using the IP address of the server machine and port 8787. I use Firefox and it comes up fine. It asks for an ID and password, which are a user ID and password on the Linux system.

9. You will get a screen like the one below when you log in, and you are all set to use R-Hadoop:

I have used the Linux user id spider that was created using sudo adduser spider.

Running a sample program from RStudio

We can submit Hadoop jobs from RStudio. We need to write a mapper function and a reducer function and then call the mapreduce function in the rmr2 package to submit the job to Hadoop.

Step 1.

Create the input files in HDFS. I will use a file with retail sales data for multiple stores as the input file, in the format below:

StoreId,date of sale,total daily sales

NYT1,2010-01-01,1221

NYT1,2010-01-02,1206

NYT1,2010-01-03,1001

NYT1,2010-01-04,1193

NYT1,2010-01-05,1067

NYT1,2010-01-06,1077

NYT1,2010-01-07,1131

NYT1,2010-01-08,1250

NYT1,2010-01-09,1261

NYT1,2010-01-10,1009

hadoop fs -mkdir data/in

hadoop fs -put sales.csv data/in/sales.csv

Step 2.

Write the mapper in Rstudio:

library(rmr2)

mapper = function(k, line) {

line[[1]] <- as.character(line[[1]]) # <- convert factors to plain character strings

keyval(line[[1]], line[[3]]) # <- create keyvalue pair output from mapper

}

Note that the mapper gets one input split of data as a list. So line above is not a single line but a list of lines. Since R is good at vector processing, it makes sense not to call the mapper for each line of input.
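The mapper logic can be tried on a small chunk before any job is submitted. The sketch below assumes that the csv input format hands each split to the mapper as a data frame with one column per field; local_keyval is a stand-in defined here so the code runs without rmr2, and it is not the real rmr2 keyval function.

```r
# Stand-in for rmr2's keyval(), defined only so this sketch runs without rmr2.
local_keyval <- function(key, val) list(key = key, val = val)

# Same logic as the mapper above, written against the stand-in.
mapper_local <- function(k, chunk) {
  local_keyval(as.character(chunk[[1]]), chunk[[3]])
}

# One input split as a data frame, using rows from the sample file.
chunk <- read.csv(
  text = "NYT1,2010-01-01,1221\nNYT1,2010-01-02,1206\nNYT1,2010-01-03,1001",
  header = FALSE,
  stringsAsFactors = TRUE  # mimic factor columns, which the mapper converts
)

out <- mapper_local(NULL, chunk)
str(out)
```

The whole split is handled in one vectorized call: the first column becomes the vector of keys and the third column the vector of values.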

Step 3.

Write the reducer in Rstudio. We will use my favorite HoltWinters for triple exponential smoothing and prediction.

reducer = function(key, sales.list) {

# Reject lists that are too small for the algorithm

if( length(sales.list) < 100 ) return(NULL) # in R, return must be called with parentheses to exit the function

valTS <-ts( as.numeric(sales.list), frequency=7,start=c(2010,1)) #<- convert to time series data

myModel<-HoltWinters(valTS, seasonal="add",alpha=0.3,beta=0.2) #<- model using HoltWinters. The 7-day season comes from the frequency of valTS; gamma, the seasonal smoothing weight between 0 and 1, is estimated when omitted.

predictSales<-predict(myModel,7,prediction.interval=TRUE) #<- predict next 7 day sales with upper and lowerbounds

keyval(key, predictSales) #output the predicted values along with key

}
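The per-store logic of the reducer can also be checked locally. This is a sketch under assumptions: forecast_store is a hypothetical helper that mirrors the reducer without rmr2, the sales figures are invented, and gamma is left for R to estimate.

```r
# Forecast the next 7 days for one store, or return NULL when the series is too short.
forecast_store <- function(sales, min_points = 100) {
  if (length(sales) < min_points) return(NULL)
  series <- ts(as.numeric(sales), frequency = 7)
  model <- HoltWinters(series, seasonal = "additive", alpha = 0.3, beta = 0.2)
  predict(model, n.ahead = 7, prediction.interval = TRUE)
}

set.seed(1)
# Invented daily sales for one store: 120 days with a weekly pattern and noise.
sales <- 1100 + rep(c(0, 30, 60, 90, 120, 200, 150), length.out = 120) +
  rnorm(120, sd = 20)

too_short <- forecast_store(sales[1:50])  # fewer than 100 points: rejected
prediction <- forecast_store(sales)       # enough points: 7 rows of forecasts

print(is.null(too_short))
print(dim(prediction))
```

Each row of the prediction holds the forecast and its upper and lower bounds for one day, which is the shape the reducer passes on as the value for the store key.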

Step 4.

Finally, submit the mapreduce job:

mapreduce(input="/user/spider/data/in",

input.format=make.input.format("csv", sep = ",", mode="text"),

output="/user/spider/data/out",

output.format=make.output.format("csv", sep = ",", mode="text"),

map=mapper,

reduce=reducer)

Note that absolute paths are specified for the input and output.

You will see mapreduce job executing like below:

You can check the output using hadoop. For just one store, it will look like below:

hadoop fs -cat data/out/part*

NYT1,2338.51949369753,2432.89119831428,2244.14778908078

NYT1,2116.48417153055,2216.78491253985,2016.18343052126

NYT1,2251.52104468871,2359.36936564803,2143.67272372938

NYT1,2183.65383703299,2300.62907811683,2066.67859594915

NYT1,2193.71659308069,2321.31048815327,2066.12269800812

NYT1,2228.88574017237,2368.47932443543,2089.2921559093

NYT1,2330.63829440411,2483.49715245174,2177.77943635647

Each line contains the store ID, the predicted value, the upper bound and the lower bound for one of the next 7 days. If the input contains multiple stores, each store will have 7 lines.
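Once the output is copied out of HDFS, it can be read back into R for further work. The sketch below uses three of the output lines shown above; the column names are chosen here for readability, and numbering the rows as days ahead assumes the rows of a store stay in forecast order.

```r
# Output lines copied from the sample run above (first three of the seven rows).
output_lines <- c(
  "NYT1,2338.51949369753,2432.89119831428,2244.14778908078",
  "NYT1,2116.48417153055,2216.78491253985,2016.18343052126",
  "NYT1,2251.52104468871,2359.36936564803,2143.67272372938"
)

# The job output has no header line, so the column names are supplied here.
forecast <- read.csv(
  text = paste(output_lines, collapse = "\n"),
  header = FALSE,
  col.names = c("store_id", "predicted", "upper", "lower"),
  stringsAsFactors = FALSE
)

# Number the rows of each store as day 1, 2, ... ahead.
forecast$day_ahead <- ave(seq_along(forecast$store_id), forecast$store_id,
                          FUN = seq_along)

# Sanity check: the bounds should bracket the predicted value.
forecast$bounds_ok <- forecast$lower <= forecast$predicted &
  forecast$predicted <= forecast$upper
print(forecast)
```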

Behind the scenes

The MapReduce job is submitted to Hadoop by RStudio Server. Hadoop in turn uses the streaming jar with the mapper and reducer functions. The mapper function runs in R (a separate instance from RStudio Server) and its output key-value pairs are sent to the reducer. The reducer function runs in another instance of R. The input for the reducer is consolidated from the mappers so that all the values for a key are grouped together. Finally, the reducer output is stored back in HDFS.

Advantages of R-hadoop

R-Hadoop distributes your R jobs across multiple machines in the cluster. This enables parallel processing when the same R function has to be run on many keys, for example when the same analysis has to be done for 10,000 customers of a bank, 5,000 stores of a retail chain, thousands of credit card customers or millions of customer transactions. Though the individual algorithm is not distributed, each key can be processed in parallel, leading to significant time savings.

Disadvantages of R-hadoop

Since each map or reduce task runs in a separate R instance, the overhead per task is higher. Also, if you have an algorithm that runs on a large amount of data for hours, R-Hadoop does not help in parallelizing the algorithm itself.

R-hadoop and EMR

EMR is the Elastic MapReduce service provided by Amazon Web Services. EMR allows one to provision a Hadoop cluster on demand and release the resources once the job is done. EMR provides bootstrap scripts that let you install any required software before the MapReduce job is started. Using the bootstrap scripts, one can set up R-Hadoop on the cluster, including the R server, and submit jobs automatically or through the browser. We did this for an enterprise so that they could also install graphic analysis libraries along with R and run R jobs on Hadoop to get the analysis results through EMR.

Conclusion

R-Hadoop is very convenient for distributing your R analysis so that the processing for multiple keys is spread across the cluster. Data scientists who are well versed in R will find R-Hadoop very easy to use. For cases where the algorithm itself has to be parallelized, R-Hadoop may not be useful, and alternatives such as the Spark machine learning library may be used.

Saturday, 11 July 2015

BigData Visualization Tools

Big Data is more valuable when visualized and analyzed

Data visualizations are everywhere today. From creating a visual representation of data points to impress potential investors, report on progress, or even visualize concepts for customer segments, data visualizations are a valuable tool in a variety of settings. When it comes to big data, weak tools with basic features don’t cut it. The following 39 tools (listed in no particular order) are some of the best, most comprehensive, sophisticated-yet-flexible visualization tools available — and all are capable of handling big data.

Many of these tools are Open-Source, free applications that can be used in conjunction with one another or with your existing design applications, using JavaScript, JSON, SVG, Python, HTML5, or drag-and-drop functionality with no programming required at all. Others are comprehensive business intelligence platforms capable of sophisticated data analysis and reporting, complete with a multitude of ways to visualize your data. Whether you need to analyze data and determine the best ways to present it to clients or partners, or you have a visual layout in mind and need a tool to bring your concept to life — there’s a tool on this list to serve your needs.


1. Polymaps
Need to display complex data sets over maps? Polymaps is a free JavaScript library and a joint project from SimpleGeo and Stamen. This complex map overlay tool can load data at a range of scales, offering multi-zoom functionality at levels ranging from country all the way down to street view.

Key Features:

Uses Scalable Vector Graphics (SVG)
Show data at country, state, city, neighborhood, and street views
Basic CSS rules control design
Imagery in spherical Mercator tile format

Cost: FREE

2. NodeBox

@Nodebox

A family of open-source tools developed by the Experimental Media Research Group, NodeBox offers capabilities ranging from a cross-platform graphics library to a Mac app that creates 2D visuals coded with Python.

Key Features:

Integrates with standard design applications
Cross-platform, node-based GUI
NodeBox1 – Mac app for Python-coded, 2D visuals
Import data in a variety of formats, including Excel
Animation-capable
Build generative designs with minimal programming skills

Cost: FREE

3. Flot

@flotcharts

A JavaScript plotting library for jQuery, Flot is a browser-based application compatible with most common browsers — including Internet Explorer, Chrome, Firefox, Safari and Opera. Flot supports a variety of visualization options for data points, interactive charts, stacked charts, panning and zooming, and other capabilities through a variety of plugins for specific functionality.

Key Features:

Supports lines, plots, and filled areas in any combination
Use combinations of display elements in the same data series
Plot categories and textual data
Add HTML with standard DOM manipulation
Produce interactive visualizations with a toggling series
Direct canvas access for drawing custom shapes

Cost: FREE

4. Processing

@ProcessingOrg

Processing was originally created as a means to teach computer fundamentals in a visual context, but is now used by students, designers, researchers, artists and hobbyists to create learning modules, prototypes and for actual production. Users can create simple or complex images, animations, and interactions.

Key Features:

2D, 3D and PDF output
Interactive programs
Open GL integration
More than 100 libraries for add-on functionality
Create interactions, textures, motion and animation

Cost: FREE

5. Processingjs.org

@Processingjs

The sister site of Processing, Processing.js is the tool you need to transition your complex data visualizations, graphics, charts and other visuals to a viable web format without any extensions or plugins. That means you can write code using the standard Processing language and insert it into your website, while Processing.js makes it functional without additional coding requirements.

Key Features:

Allows Processing code to be run by any HTML5 browser
Integrate animated and interactive visualizations into any web page
No major additional coding necessary

Cost: FREE

6. Tangle

Tangle is a JavaScript library and tool that takes visualizations beyond the visual, allowing designers and developers to create reactive programs that provide a deeper understanding of data relationships. One example is a web-based conversion calculator that converts currency or measurements.

Key Features:

Allow readers to change parameters
Based on defining variables, formats and classes
Create charts, graphs and other data visualizations using Tangle classes
Capable of creating dynamic displays
Create controls and views using multiple variables simultaneously

Cost: FREE

7. D3.js

A JavaScript library for creating data visualizations with an emphasis on web standards. Using HTML, SVG and CSS, bring documents to life with a data-driven approach to DOM manipulation — all with the full capabilities of modern browsers and no constraints of proprietary frameworks.

Key Features:

Bind arbitrary data to DOM
Create interactive SVG bar charts
Generate HTML tables from data sets
Variety of components and plugins to enhance capabilities
Built-in reusable components for ease of coding

Cost: FREE

8. FF Chartwell

@FontFont

FF Chartwell transitions simple strings of numbers into editable data visualizations for further customization using OpenType features. It’s an extension that can be used with a standard design suite, such as Adobe Creative Suite, to simplify the process of designing charts and graphs.

Key Features:

Use simple data strings to generate charts and graphs
Useful for creating components of a larger infographic
No-code functionality saves time
Integrates with design applications
Multiple types of visualizations

Cost:

All 7 weights – $129
Individual weights – $25 each (bars, vertical, lines, pies, radar, rings, rose)

9. Google Maps

@GoogleMaps

Google Maps offers several APIs for developers, such as Google Earth, Google Maps Images, and Google Places. These tools enable developers to build interactive visual mapping programs for any application or website.

Key Features:

Embed maps into web pages
Pull data about establishments, places of interest and other locations
Enable web visitors to utilize Google Earth within the constraints of your site

Cost: Contact for a quote

10. SAS Visual Analytics

@SASsoftware

SAS Visual Analytics is a tool for exploring data sets of all sizes visually for more comprehensive analytics. With an intuitive platform and automatic forecasting tools, SAS Visual Analytics allows even non-technical users to explore the deeper relationships behind data and uncover hidden opportunities.

Key Features:

Deploy on-premise or in a public or private cloud
Drag-and-drop autocharting chooses the best layout for data
Pop-up boxes identify potentially important correlations
Scenario analysis enables predictions based on variable changes
Save views as reports, images or SAS mobile apps
Create web-based, interactive reports
Easy integration of action elements for users to manipulate data

Cost:

Free demo with full features (no ability to save reports between sessions)
Call for a quote

11. Raphael

@RaphaelJS

A JavaScript library for creating vector graphics on the web, Raphael uses SVG and VML so that every graphic created is also a DOM object. Raphael’s goal is to enable vector graphics creation with cross-browser compatibility.

Key Features:

Include Raphael.js in a web page for functionality
Create a variety of charts, graphs and other data visualizations
Multi-chart capabilities

Cost: FREE

12. Inkscape

@Inkscape

Inkscape offers functionality similar to that of more expensive applications, such as Corel Draw and Illustrator, yet it’s an Open Source editor for vector graphics. Inkscape supports many advanced SVG features for ease of use and encourages developer collaboration in a community environment.

Key Features:

Handles complex graphic tasks similar to standard software
Native SVG format
Create website mockups
Bitmap import and display capabilities
Files stored as vector graphics

Cost: FREE

13. Leaflet

@LeafletJS

An Open-Source JavaScript library, Leaflet is a tool for creating mobile-friendly, interactive maps. Developed by Vladimir Agafonkin and a team of contributors, Leaflet was designed with the goals of simplicity, performance and usability.

Key Features:

Works on all major desktop and mobile browsers
Various plugins for extended capabilities
Incorporate interactive features
Multiple available map layers
CSS3 features for streamlined user interaction
Eliminates tap delay on mobile devices

Cost: FREE

14. Crossfilter

Exploring large multivariate data sets in a browser is made possible by Crossfilter, a JavaScript library that’s capable of handling data sets with more than a million records. Crossfilter uses semantic versioning and creates data visualizations easily using values, objects and other components and commands for customization. It was actually built to power analytics for Square Register to enable merchants to manipulate sales and purchase data.

Key Features:

Uses semantic versioning
Explore large multivariate datasets
Fast incremental filtering and reducing
Improves performance of live histograms

Cost: FREE

15. OpenLayers

Insert a dynamic map on any web page with OpenLayers. It implements a JavaScript API for building web-based geographic applications and works in most modern web browsers with no server-side dependencies. It’s open-source software with a new edition in the works, OpenLayers 3, which incorporates the most recent HTML5 and CSS features and enhanced 3D capabilities.

Key Features:

Works in most modern web browsers
No server-side dependencies
Creates embeddable, dynamic maps
Functional zoom, geo-location and dozens of other functions

Cost: FREE

16. Kartograph
A Python library and JavaScript library in one, Kartograph caters to developers who want to create Illustrator-friendly SVG maps and interactive maps that will work across all major browsers.

Key Features:

Two libraries: Python and JavaScript
Kartograph.js creates interactive maps in minutes
Stand-alone; no server required
Kartograph.py creates compact SVGs using Visvalingam simplification
Layer data sets on maps for multi-layer visualization

Cost: FREE

17. Microsoft Excel
Microsoft Excel is widely noted for its data manipulation and analysis capabilities, but it’s often used to create powerful data visualizations. The latest edition of Excel is packed with visualization tools, including recommended charts, quick analysis of the different ways to display your data, and a multitude of control options to change the look and layout of your visualizations.

Key Features:

Perform data analysis and create visualizations in the same program
Compare various ways to represent your data
Change title, layout and other format options
Excel recommends the best visualization for your data
Compatible with Microsoft Office products

Cost:

Stand-alone – $109.99
Complete Office Home & Professional Suite – $219.99
Complete Office Professional 2013 Suite – $399.99

18. Modest Maps
A free, extensible library for developers who want to incorporate interactive maps into their applications, Modest Maps is a collaborative project by Stamen, Bloom and MapBox.

Key Features:

Used as the foundation for building mapping tools
Used with several extensions, such as MapBox.js, HTMAPL, and Easey
Designed to provide basic controls

Cost: FREE

19. CartoDB
Visualize hundreds to millions of data points with CartoDB, which allows you to upload data and visualize it within minutes. It also enables geospatial analysis to explore, refine and obtain insights from your data.

Key Features:

Explore data and get insights
Edit data directly on maps
Compatible with PostGIS for more powerful analysis
CartoCSS for advanced styling
Supports raster and vector data

Cost:

Newbie Server – Free (up to 5 tables)
Magellan Plan – $29 per month (up to 10 tables)
John Snow Plan – $49 per month (up to 20 tables)
Coronelli Plan – $149 per month (unlimited tables)

20. Google Charts
Google Charts offers a variety of data visualization formats, ranging from simple scatter plots to hierarchical treemaps. Visualizations are fully customizable, and you can connect to your data in real time through dynamic data.

Key Features:

Take advantage of the same charts Google uses
Assemble multiple charts into intuitive dashboards
Cross-browser compatibility
Cross-platform portability (iOS and Android devices)
Choose from a variety of charts

Cost: FREE

21. Gephi

@Gephi

Gephi is an Open-Source application that runs on Windows, Linux and Mac OS. The platform allows users to both visualize and explore data, including complex analysis of links, social networks, and more for a greater understanding of data relationships.

Key Features:

Plugins for greater customization
Deep data analysis to examine relationships
Built-in 3D rendering engine
Real-time visualization
Dynamic filtering
Intuitive interface with built-in workflow organization

Cost: FREE

22. Flare

An ActionScript library for creating data visualizations that run in Adobe Flash Player, Flare is an Open-Source application that’s been used by multiple well-known organizations and publishers to create powerful visualizations, including Slate, the IBM Visual Communication Lab, and ABC News.

Key Features:

Capable of complex, interactive graphics
Supports data management, visual encoding, animation and interaction
Variety of visualization formats from timelines to multi-layer graphs illustrating relationships

Cost: FREE

23. Envision.js
Create fast and interactive HTML5 visualizations with Envision.js, a library capable of displaying real-time data, time series, finance visualizations, AJAX-driven financial charts and custom visualizations, including fractals.

Key Features:

Built-in templates for various charts and graphs
Incorporate Visualizations, Interactions and Components for customization
Custom flotr chart types

Cost: FREE

24. Miso
An Open-Source tool in development, Miso incorporates Datasets, Storyboards and d3.charts for interactive storytelling and data visualization. Miso is a joint project between The Guardian and Bocoup, with support from Global Development and The Bill and Melinda Gates Foundation.

Key Features:

High-quality interactive storytelling
Data visualization content
JavaScript client-side data management and transformation library
Create reusable charts with D3.js

Cost: FREE

25. The R Project
The R Project for Statistical Computing runs on UNIX, Windows and Mac OS. Designed for statistical computing and graphics, it’s considered a different implementation of S and contains some native S code that remains unaltered within R, although there are some significant differences.

Key Features:

Data manipulation, calculation and graphical display
Integrated tools for instant analysis
Conditions, loops, user-defined recursive functions, and input and output facilities
Define new functions for increased capabilities

Cost: FREE

26. Tableau Public

@tableau

Tableau is an easy-to-use tool for quickly creating interactive data visualizations and embedding them on your website. Designed to be used by developers and non-developers alike, Tableau is used by bloggers, journalists, researchers, advocates, professors and students.

Key Features:

Once online, others can download and manipulate visualizations
Desktop application but completed graphics are stored on a public server
Store up to 50MB of data (with free plan)
Drag-and-drop interface; no programming skills required

Cost:

Public Edition – Free
Personal Edition – $999
Professional Edition – $1,999

27. Timeline JS

@knightlab

Build interactive timelines in 40 different languages with Timeline JS, an Open-Source tool capable of pulling in media from multiple sources. With built-in support for Twitter, Flickr, Google Maps, YouTube, Vine and other applications, Timeline JS has a lot of functionality — which can be extended further by those with JSON capabilities for custom installations.

Key Features:

Build timelines using Google Spreadsheet data
Simply upload a spreadsheet and generate embed code
Embed audio and video in timelines from 3rd-party apps
WordPress plugin
Feed data from a database with JSON

Cost: FREE

28. Quadrigram

@quadrigram

Quadrigram allows users to create completely customized visualizations using their own data and various components from a built-in library of everything from charts and graphs to quadrification and stacked flow. Based on a Visual Programming Language (VPL), Quadrigram can pull multiple data sources to create endless variations of prototypes and data visualizations.

Key Features:

Complete library of interactive visualizations
Build animations, dashboards and more
Sketch ideas and create rapid prototypes
Cloud-based computing for quick data processing
Server-side integration of R and Gephi
Leverage multiple publicly-available datasets

Cost (prices converted from Euros):

Academic – $8.09 per month (1 user, 10MB storage)
Personal – $25.63 per month (1 user, 1GB storage)
Professional – $79.60 per month (1-2 users, 5GB storage)
Workgroup – $335.93 per month (1-10 users, 50GB)
Enterprise – Contact for a quote

29. Prefuse
Prefuse is a data visualization tool that has been used by the IBM Visual Communication Lab to create visualizations for its Many Eyes tool. The Prefuse toolkit provides a visualization framework for Java, while the Prefuse Flare toolkit offers visualization and automation tools for ActionScript and Adobe Flash Player.

Key Features:

Data modeling, interaction and visualization
Optimized data structures for a variety of visual layouts
Supports animation, dynamic search and database connectivity
Uses Java 2D graphics library

Cost: FREE

30. Many Eyes
Many Eyes is an experiment created by IBM Research and the IBM Cognos Software Group. This tool provides a platform for creating a variety of visualizations to illustrate data point relationships, compare sets of values, create line and stack graphs, analyze text or view the various parts of a whole in a pie chart or treemap.

Key Features:

Choose from a multitude of ways to display data
Upload data sets for public use
Displays data using Java and Flash
Get feedback through user ratings
Full control to delete your data sets and visualizations
Use existing data sets from other users or use your own

Cost: FREE

31. Cytoscape

@Cytoscape

Visualize complex networks and integrate them with any type of attribute data with Cytoscape. With special features for specific areas of analysis, such as bioinformatics, the semantic web and social network analysis, Cytoscape is packed with features to create fascinating graphic representations of data relationships.

Key Features:

Apps for problem domains
Advanced analysis and modeling using apps
Visualize human-curated pathway datasets
Visualize social networks for interpersonal relationships
Use in combination with other tools (e.g. R, NetworkX)

Cost: FREE

32. NetworkX
NetworkX is a Python library capable of creating graphs, digraphs and multigraphs from data sets comprising multiple media formats. Python is a multi-platform language, which makes the resulting data visualizations more cross-compatible.

Key Features:

Study the structure, dynamics and functions of complex networks
Nodes can contain any media type, such as images and XML
Edges capable of holding arbitrary data, such as weights or a time-series
Generators for various graph types – classic graphs, random graphs, synthetic networks

Cost: FREE
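
The features listed above can be tried in a few lines. The following is a minimal sketch, assuming the third-party networkx package is installed; the node names, attributes and edge data are made up for illustration only.

```python
# Assumes the third-party networkx package is installed (pip install networkx).
import networkx as nx

# Undirected graph: nodes can be any hashable object and may carry attributes.
g = nx.Graph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
# Edges can hold arbitrary data, such as a weight or a short time series.
g.add_edge("alice", "bob", weight=0.8, messages_per_day=[3, 5, 2])

# Directed graph: edge direction matters.
d = nx.DiGraph()
d.add_edge("alice", "bob")

# Multigraph: parallel edges between the same pair of nodes are allowed.
m = nx.MultiGraph()
m.add_edge("alice", "bob", channel="email")
m.add_edge("alice", "bob", channel="phone")

# Generators: a classic complete graph and a seeded random graph.
k5 = nx.complete_graph(5)
rnd = nx.erdos_renyi_graph(20, 0.2, seed=42)

summary = {
    "weight": g["alice"]["bob"]["weight"],
    "directed_back_edge": d.has_edge("bob", "alice"),
    "parallel_edges": m.number_of_edges("alice", "bob"),
    "k5_edges": k5.number_of_edges(),
    "random_nodes": rnd.number_of_nodes(),
}
print(summary)
```

The seed passed to the random generator is only there to make repeated runs produce the same graph.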

33. Arbor.js
Arbor is built with web workers and jQuery, creating a data visualization tool for use with canvas, SVG, or positioned HTML elements. Arbor is designed to enable developers to create code that emphasizes the uniqueness of their data sets rather than the physics required for various layouts.

Key Features:

Capable of handling real-time color and value tweens
Force-directed layout algorithm plus abstractions
Actual screen-drawing is up to the user

Cost: FREE
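
To give a feel for what a force-directed layout does, here is a rough Python sketch of the general technique: nodes push each other apart, edges pull their endpoints together like springs, and the positions settle over many small steps. This is not Arbor's code or API; the function name, parameters and constants are all assumptions chosen for illustration. As with Arbor, the sketch only computes coordinates and leaves the drawing to the caller.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=300, repulsion=0.05,
                          stiffness=0.1, rest_length=1.0, max_move=0.2, seed=0):
    """Return {node: (x, y)} from a basic spring-and-repulsion simulation.

    Hypothetical helper for illustration; `nodes` is a list, `edges` a list
    of (node, node) pairs.
    """
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}

    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}

        # Every pair of nodes repels with an inverse-square force.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / (dist * dist)
                fx, fy = push * dx / dist, push * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy

        # Each edge acts as a spring pulling its endpoints toward rest_length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = stiffness * (dist - rest_length)
            fx, fy = pull * dx / dist, pull * dy / dist
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy

        # Move each node along its net force, capping the step for stability.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy)
            scale = min(1.0, max_move / mag) if mag > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale

    return {n: (p[0], p[1]) for n, p in pos.items()}


# Lay out a three-node path a - b - c and print the resulting coordinates.
layout = force_directed_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

The returned coordinates can then be handed to whatever drawing surface is in use, such as canvas, SVG or positioned HTML elements.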

34. iCharts

@iCharts

iCharts is a web-based application capable of producing compelling data visualizations for the web. Incorporate charts and graphs into a website or application or distribute completed visualizations through social media or the iCharts ChartChannel.

Key Features:

Brand visualizations with your company logo
Add tags and descriptions for better discovery
Enable 3rd-party sites to re-embed visualizations to expand reach
Enable social sharing
Create interactive, explorable charts
Activate custom forms for lead generation
Analytics reports on chart views, shares and embeds

Cost:

Basic – Free (public charts only)
Gold – $25 per month (private charts)
Platinum – $75 per month (branded charts)
Enterprise – Contact for a quote (full features)

35. Databoard
One of the latest tools from Google, Databoard is a part of Google’s Think platform, geared to business owners. Explore insights directly from Google research studies to find data quickly, and create custom infographics to embed in your website or share on your social networks.

Key Features:

Explore Google research studies for data
Instantly generate graphic components
Build custom graphics by incorporating multiple components
Focused primarily on mobile data

Cost: FREE

36. Q Research Software

@qstatistics

A powerful database for both research and data visualizations, Q Research Software is a valuable tool for preparing market research reports complete with targeted accompanying visualizations. Export graphics to Word, Excel and PowerPoint, output CSV files and PDFs, and choose from dozens of tools and components to build fully customized visualizations.

Key Features:

Editable Office graphics
Multiple chart types (line, bubble, pie, column, etc.)
Histograms and scatter plots
Update tables with real-time data
Create variables, apply filters and perform statistical testing

Cost:

Standard License – $1,499 per year (all features)
Transferable License – $4,497 per year (install on multiple computers)

37. Dapresy

@dapresy

Designed for research analysts, Dapresy allows users to build infographics for slides and dashboards with a drag-and-drop interface for ease of use. It’s a comprehensive platform that handles the entire reporting process from data analysis to visually appealing presentation tools and dashboards.

Key Features:

Simply import fieldwork, Dapresy handles data processing
Charts, tables, cross-tabulations and comprehensive statistical analysis
Build dynamic elements for the marketing dashboard
Pack data from a 200-slide presentation into a few dynamic Dapresy slides
Idea Box for inspiration

Cost: Contact for a quote

38. Visualize Free
Based on InetSoft's commercial visualization software, Visualize Free is a free alternative for sifting through multiple data sets and variants to identify trends and manipulate data with a few simple clicks.

Key Features:

Upload your data in Excel or CSV format
Drag-and-drop components to build visualizations
Sandboxes for analysis and sales data
Share publicly or privately

Cost: FREE

39. Jolicharts

@Jolicharts

Embed charts and graphs into your applications with Jolicharts, which is compatible with multiple data sources and can handle the complexity of connecting them. With integrated elastic calculation capabilities, Jolicharts can handle big data with ease.

Key Features:

Drag-and-drop interface to create stunning dashboards
Export dashboards to XLS, PDF or JPG formats
Filter to securely separate user data
REST-based API for compatibility with any application
Cloud-based application keeps your data and visualizations accessible
HTML5 dashboards for accessing data from any device

Cost (prices converted from Euros):

Forever Free Plan (5 databases per dashboard, up to 50MB calculation power)
2GB – $39.12 per month
10GB – $86.34 per month
50GB – $295.45 per month
250GB – $565.27 per month