Traditionally, “Big Data” referred to solving computationally intensive problems using massive amounts of data. Because of the costs involved, Big Data was initially limited to government agencies and large academic institutions that had access to the most advanced computing resources and sought to solve complex challenges such as predicting weather patterns or mapping DNA sequences. In the ’80s and ’90s, however, certain consumer industries, such as telephone companies and credit card providers, learned to mine their massive databases of call records and charge receipts to find “nuggets” of information. Subtle trends and aggregated statistics gave marketing analysts insight into how to price their services or predict when a cardmember was primed to purchase that trip to Hawaii. More recently, the advent of cloud storage and processing services offered by Amazon and Google, coupled with large-scale open source data platforms such as Hadoop, has drastically reduced the costs of capturing, storing and analyzing large amounts of data.
The sheer abundance of available data suggests that the importance of “Big Data” will only continue to grow. For example, consider that at the end of 2011, an estimated 30 billion individual items were being tracked with unique RFID tags, and Twitter and Facebook were creating a combined 40 TB of log data every day. In addition, the increased willingness of companies and consumers to share data about their daily operations and activities creates an enormous amount of user-generated data. By combining traditional transactional data (e.g., sales, calls, trades, etc.) with user-supplied data that often includes references to products, places and events, the dimensionality of the data increases along with its sheer volume. No longer are companies simply drilling down into their data, but now companies are asking questions that cut “across” datasets looking for trends and opportunities that, absent the marriage of multiple data sources, would have otherwise gone unrecognized. Moreover, companies that have access to much of the consumer data (posts, tweets, check-ins, networking data, etc.) are realizing the value of the data. The availability of valuable data and the important trends that an analysis of such data can reveal have created opportunities for companies to monetize access to the data. All of this has brought Big Data to the forefront of current IT trends.
The increased focus on Big Data has brought with it challenging questions about how companies operating in this realm can protect their proprietary intellectual property and, in particular, what is “patentable.” Conventional database and storage patents typically focused on the hardware (e.g., high-throughput network storage systems) or on the database management software used to implement the transactional and/or analytical processes that stored and accessed the data. But as companies move to standard, off-the-shelf (often open source) platforms and cloud-provisioned hardware and storage, the IP created by this new breed of companies becomes more nebulous, more difficult to identify, and even harder to patent. And even after overcoming the somewhat esoteric (and often controversial) patentability requirements for business methods implemented in software, a claimed process must still be novel (no one has practiced the process before) and non-obvious (not simply a combination of known processes operating as intended).
One approach to identifying potentially patentable subject matter in a “Big Data” environment is to break the process down into three phases: ingestion and cohesion, analysis, and provision. By way of example, consider a large retail chain that collects internal data from its point-of-sale, supply chain and customer database systems and external data from market research companies, social media sites (e.g., its Facebook page), third-party credit card companies, and its suppliers and vendors. Collecting and organizing this volume of data on a daily basis is certainly a challenge. Whether through custom programming interfaces, proprietary data services, or other custom data ingestion processes, the mere task of getting all the data into one place at one time, and in a format that allows data from disparate sources to be used together, can be fertile ground for patents. Processes for normalizing disparate data, filling in “missing” data and formatting data into a common construct for easier storage and/or analysis are just a few possible areas to consider. Using the example above, information about the retailer’s customers’ purchasing and payment histories may come from data sources that use seemingly incompatible formats and are organized along different dimensions (time, product, source, etc.), thus requiring a system that ingests data structured in various formats and uses various analytics processes to generate normalized, structured metadata describing that data.
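The kind of normalization described above can be sketched in a few lines. The sources, field names and date formats below are hypothetical, invented purely for illustration; a production ingestion pipeline would handle far more sources, types and edge cases.

```python
from datetime import datetime

# Hypothetical raw records from two disparate sources: a point-of-sale
# feed and a third-party credit card provider, each with its own field
# names and date formats (all names here are illustrative assumptions).
pos_record = {"sku": "A-100", "sold_at": "2012-03-01", "amt": "19.99"}
card_record = {"product_id": "A-100", "txn_date": "03/02/2012", "total": 21.5}

# Per-source mappings from native field names into a common schema.
FIELD_MAPS = {
    "pos":  {"sku": "product", "sold_at": "date", "amt": "amount"},
    "card": {"product_id": "product", "txn_date": "date", "total": "amount"},
}
DATE_FORMATS = {"pos": "%Y-%m-%d", "card": "%m/%d/%Y"}

def normalize(record, source):
    """Map a raw record into the common schema, coercing dates and
    amounts to uniform types and recording provenance metadata."""
    mapped = {FIELD_MAPS[source][k]: v for k, v in record.items()}
    return {
        "product": mapped.get("product"),   # None if the source omits it
        "date": datetime.strptime(mapped["date"], DATE_FORMATS[source]).date(),
        "amount": float(mapped["amount"]),
        "source": source,                   # provenance metadata
    }

rows = [normalize(pos_record, "pos"), normalize(card_record, "card")]
```

Once both records share one schema and one set of types, data that arrived in incompatible formats can be stored, joined and analyzed together.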
Once the data is stored and structured in a usable form, the task of making sense of it begins. For example, the retailer may want to understand the correlation between an uptick in activity on its Facebook page (posts, likes, comments, etc.) about a particular product and subsequent sales of that product. Conversely, poor ratings or complaints may foreshadow customer service calls or the discontinuation of a product. While database marketing and analysis methods have been used for many years to uncover trends and predict purchasing habits, the application of these techniques to such disparate data sources, or the processes for finding and extracting the key data in a timely and cost-efficient manner, may be new. The complexity and volume of the data may require novel query techniques, such as mapping meaningful data entities to underlying database elements or breaking complex queries into more manageable, interdependent queries.
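Decomposing a cross-dataset question into smaller, interdependent steps might look like the following sketch, using invented daily engagement and sales figures: one sub-step joins the two series on a one-day lag, and a second computes their correlation. The numbers and the one-day lag are illustrative assumptions, not data from the article.

```python
from datetime import date, timedelta

# Hypothetical daily counts: social engagement (posts, likes, comments
# mentioning a product) and next-day unit sales of that product.
engagement = {date(2012, 3, d): e for d, e in
              [(1, 40), (2, 55), (3, 90), (4, 30), (5, 70)]}
sales =      {date(2012, 3, d): s for d, s in
              [(2, 110), (3, 130), (4, 210), (5, 95), (6, 170)]}

def lagged_pairs(eng, sls, lag_days=1):
    """Sub-query 1: join the two series on a lag, pairing each day's
    engagement with sales lag_days later."""
    return [(eng[d], sls[d + timedelta(days=lag_days)])
            for d in eng if d + timedelta(days=lag_days) in sls]

def pearson(pairs):
    """Sub-query 2: Pearson correlation of the joined pairs."""
    n = len(pairs)
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(lagged_pairs(engagement, sales))  # close to 1.0 here
```

Each sub-step is simple on its own; the claimed novelty in systems like those described above typically lies in how such steps are generated, ordered and combined over disparate sources at scale.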
In addition to companies that collect data for their own use, many now recognize the value their data has to others, whether for market research, benchmarking studies or other analysis. In some instances, providing this data may in fact be a company’s only source of revenue. The need to track where the data originated, manage what restrictions are placed on its use, and deliver the data in a manner that complies with those restrictions can lead to innovative, patentable subject matter. This can include techniques for anonymizing or aggregating the data so that personally identifiable data or other proprietary information is not compromised. Systems for graphically representing large volumes of data, providing access to the data or transmitting the data are also fertile ground for patentable subject matter. Patenting how the data is provided to third parties can often prove the most valuable means of protecting a process, because such techniques are usually the only aspects of these implementations that are “customer facing” and easily identified when copied by others.

In summary, while many of the technical challenges of storing and processing massive amounts of data have been addressed for years, only recently has the ability to capture, store, process and provide data across so many domains and with such speed been fully realized. As companies collect and build these databases and begin to recognize the value the data has not only to their business but to others, the need to protect such systems becomes even more important.
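One simple form of the anonymizing aggregation discussed above is to report only group-level statistics and suppress any group too small to hide an individual. The sketch below assumes invented transactions and a three-record threshold; it is illustrative only and is not drawn from the cited patent.

```python
from collections import defaultdict

# Hypothetical customer-level transactions; "zip" is a quasi-identifier
# that could single out an individual in a small group.
transactions = [
    {"zip": "10001", "amount": 25.0},
    {"zip": "10001", "amount": 40.0},
    {"zip": "10001", "amount": 31.0},
    {"zip": "94105", "amount": 500.0},   # lone record: identifiable
]

def aggregate_anonymized(rows, key="zip", min_group=3):
    """Aggregate spend per group, suppressing any group with fewer
    than min_group records before the data is shared."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r["amount"])
    return {k: {"n": len(v), "total": sum(v)}
            for k, v in groups.items() if len(v) >= min_group}

report = aggregate_anonymized(transactions)
# The lone "94105" record is suppressed; only group totals survive.
```

The design choice here is the suppression threshold: only aggregates over sufficiently large groups leave the system, so a data recipient sees trends without being able to reconstruct any one customer’s activity.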
 See, for example, U.S. Patent No. 7,822,768, “System and Method for Automating Data Normalization Using Text Analytics”.
 See, for example, U.S. Patent No. 7,949,685, “Modeling and Implementing Complex Data Access Operations Based on Lower Level Traditional Operations”.
 See, for example, U.S. Patent No. 7,444,655, “Anonymous Aggregated Data Collection”.