Ultimately my report was a bit of a rehash of concepts we discussed in class, with some personal investigations into certain aspects of Cloud Computing. That works pretty well as an essay on the topic to share with people, so I'm happy to throw it on here and give people something that's hopefully a bit easier to read and understand than what's usually available on the subject.
That may be a fickle hope, but either way it's another resource, and one I can attribute my name to. At the end you'll find the sources cited, in case you'd like to do some further reading.
The class overall was what I will diplomatically call interesting; I learned a lot from it, but I'm not sure how much of it was intended as the lessons were largely in the indomitable human spirit and my capability to force information out of the internet much as one would force an unpeeled grapefruit through a sieve.
Hope you like the essay I wrote, have yourself a lovely day, and see you Wednesday for something more!
(Note: The conversion from a Microsoft Word document to a blog post has made it a bit messy; you can also grab the original file to avoid the mess if you like.)
Advanced Techniques for Cloud Data Storage and Collaborative Systems by Mac Clevinger
As computer technology continues to expand and develop to further degrees of complexity, its usefulness as a tool for handling, processing, and analyzing data becomes increasingly crucial to keeping up with the vast amounts of information produced by day-to-day activities. However, performing these functions efficiently, or even storing the vast amounts of data that are produced, is not a trivial task; it requires specialized structures, both physical and virtual, to do effectively. The common approach is a trend in computer science called Cloud Computing, in which numerous computing machines communicate with each other over a network to solve a common task.
These machines are typically grouped together into specially built Datacenters capable of storing vast amounts of data and performing rapid computations on it. Because the machines performing the computations sit on the same physical network as those storing the data, large amounts of data can be processed without the overhead usually associated with moving information from storage to the computing device. This paradigm eliminates the need for a business or other private entity to possess its own computational or data-storing structure: one can procure the services of an existing Datacenter for storage and computational needs until the point where it becomes more efficient to host one's own.
What this means to a casual consumer is that it has become very easy to get access to large-scale data storage and immense computational power without having to personally construct either of these structures or understand how they work in the back-end; one can simply open an account with a business and begin storing terabytes of data, or gain access to analytics capable of interpreting data from thousands of sources over a long period of time, for nothing more than a monthly fee.
How, then, are these Datacenters created? How are they maintained? What are the rules behind their management? The relative power of these structures when compared to the typical personal computer is astonishing, and far greater than simply multiplying the processing prowess of a single machine by however many are working simultaneously, which suggests a coordination of their efforts that generates a power much greater than their mere sum. Computations aside, they are dealing with amounts of data that would be impossible for a non-automated approach to fully handle. How is this data stored, and later accessed, in a manner that does not permit an unwelcome actor to manipulate or read it?
An important aspect to consider at the forefront of designing a cloud computing infrastructure is the scope of its operation: who is the target user, and for what purpose are they using it? There are plenty of commercial ventures that offer services to the public through the cloud, be they storage or computational power, but there also exist more privatized networks intended for exclusively 'in-house' usage. A company may collect its own data and need to store or process it in a way that requires the cloud computing architecture to do so efficiently (or at all), but would neither want nor need it to be accessible from outside the premises of their work. Such a usage could still be contracted out to a third party, utilizing another's datacenter for a fee, while keeping the brunt of the operation away from public access; the primary difference being described is how readily available access to the service is, or whether the cloud in question is Private or Public.
When planning the architecture of the datacenter, having an idea of the intended usage is vital for configuring it appropriately for its future interaction with users. Is the service going to be publicly available, with a scaling number of users too numerous to manage individually? Given that such a service is accessed from numerous locations outside the datacenter by unknown entities, considerations of security become much more vital, with so many vectors from which an attack or incidental damage could come, especially since the service needs to be easily accessed by its users without allowing flaws in its security. In a more privatized setting, where only whitelisted parties known to the service provider may access the service and all other attempts are refused, the vulnerability to a foreign agent is handled by being far more stringent about who has permission and how they are permitted to interact with the system.
However, issues then arise if an individual who possesses permissions decides to exploit the system they are already embedded within, and given the difficulties of foreign access to the system due to the nature of its security, trying to get back into a compromised system that has been designed not to permit that kind of access is certainly a concern to be aware of at the outset. Public and Private cloud configurations are important distinctions to consider in the implementation of a cloud computing architecture, ones that vastly impact the user experience and the problems one may face as the manager of such a system.
From a distance, Cloud Computing looks to be a simple communion between numerous machines, but many questions arise when the mind turns towards its implementation. What is to follow is a report and analysis on how these ground-level concepts are designed and utilized, specifically looking at methods of data storage and ensuring their protection/integrity, and methods of coordinating computations between numerous machines. This will be done with a focus on how one would begin to approach their own implementation of this within a range of contexts, with explanations of the theoretical components followed by commentary on practical means of implementation.
A problem that has arisen over the past decade is one of an imbalance of growth rates: the development of physical data storage, or how much stuff we can store in a given space, increases much more quickly than does our ability to process that data. A rough rule of thumb, often discussed alongside Moore's Law, suggests that data storage capacity doubles roughly every two years, while the computational capabilities of new machines grow at a slower rate.
The amount of data that can be produced and needs to be stored is practically infinite as we find more ways to utilize and interpret information, which leaves us in the position of being able to gather and store more and more data without being able to process it at that same increasing rate. Thus, we need efficient methods of data storage to best complement our processing practices so that this difference is alleviated to some degree; we cannot inefficiently search through troves of data for the input to every operation we try to perform.
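To see how quickly a gap in growth rates compounds, here is a tiny illustration; the doubling periods are invented round numbers chosen only to show the shape of the problem, not measured figures:

```python
# Illustrative only: assume stored data doubles every 2 years while
# processing capability doubles every 3 years, and watch the gap grow.
# Both rates are invented round numbers, not measurements.
data, compute = 1.0, 1.0
for year in range(0, 13, 2):
    print(f"year {year:2d}: data x{data:6.1f}, compute x{compute:6.1f}, "
          f"gap x{data / compute:4.1f}")
    data *= 2 ** (2 / 2)        # doubles every 2 years -> x2 per 2-year step
    compute *= 2 ** (2 / 3)     # doubles every 3 years -> ~x1.59 per step
# By year 12 the data has grown 64x while compute has only grown 16x.
```

Even a modest difference in doubling time leaves processing several times behind storage within a decade, which is exactly the gap that efficient storage and processing methods have to compensate for.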
Mentioned earlier was the idea of data locality: that the data one wishes to process should be stored close to the device performing the computations, to reduce the time it takes to transfer data from storage to computation. For moderate volumes of data, transferring from an external storage location to the machine processing it is feasible, but at the massive scales that cloud computing is well suited to handle, even a very minor transportation cost is magnified to the point of severely impacting efficiency.
This leads to Datacenters being required to both host and process their large volumes of data in the same location; splitting up the two tasks comes at the cost of overall efficiency. (That is, one could establish or access pre-existing datacenters in a way that divides the brunt of the computations and the storage requirements from one another, but doing so would lose the advantage of data locality. If one were building their own datacenter, any design would need to have the unity of storage and computation as a core component if it were ever expected to handle massive amounts of data processing.)
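To put rough numbers on why that matters, here is a back-of-envelope sketch; the dataset size, link speed, disk count, and per-disk speed are all illustrative assumptions rather than figures from any particular datacenter:

```python
# Rough comparison of shipping a dataset over a network versus reading
# it from disks that sit in the same racks as the compute nodes.
# All figures below are illustrative assumptions, not measurements.

DATASET_TB = 10        # assumed dataset size in terabytes
NETWORK_GBPS = 1       # assumed wide-area link speed in gigabits/second
LOCAL_DISKS = 100      # assumed number of disks read in parallel locally
DISK_MBPS = 150        # assumed sequential read speed per disk (MB/s)

dataset_bits = DATASET_TB * 1e12 * 8
network_seconds = dataset_bits / (NETWORK_GBPS * 1e9)

dataset_mb = DATASET_TB * 1e6
local_seconds = dataset_mb / (LOCAL_DISKS * DISK_MBPS)

print(f"Over the network:    {network_seconds / 3600:.1f} hours")   # ~22.2 hours
print(f"Local parallel read: {local_seconds / 60:.1f} minutes")     # ~11.1 minutes
```

Even with generous assumptions about the link, shipping the data takes hours while reading it in place takes minutes, which is the whole argument for keeping storage and computation together.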
This aspect of data locality works very well in cases where an entity is both storing massive amounts of data and processing it, but an advertised feature of cloud computing is that of employing a foreign entity as a kind of storage-or-computation-for-hire, utilizing structures that already exist, for a fee, when that is more efficient than constructing your own such facility.
In cases where one, as a private entity, possesses data that must then be transferred to the hired datacenter for processing, any advantages from data locality are lost in the context of rapid computational transactions. (That is, employing the facility for computations rather than for storage or a combination of the two. Having to send your data over a network to have it processed and returned is a far cry from the machine accessing it locally, and the gap only grows as more data is added.)
Despite the benefit of efficiency from data locality, it still relies on the proper context to be worth implementing or designing one's structure around: it is likely to be utilized whenever data storage is a key feature of a service, but not so prominent when data is guaranteed to come from an outside source and not be stored for an indefinite amount of time.
While the way that data is stored in geographical relation to the machines processing it is important, so too is the manner in which that data is moved, be it over a local, physical network or through the cloud. (The cloud being an abstraction of the numerous points two machines will route through to communicate with one another.) The way that data can be accessed is tied to the way it has been stored: does the system need to parse through arbitrarily designed files and interpret them according to a custom-built script, or are the files already set up in a widely recognizable format that plays well with consistent, pattern-based processing?
The two cases described are those of unstructured and structured data, describing how the data is composed within storage. Structured data is often found in relational databases, where all data stored is grouped into distinct entities that follow a shared organizational scheme for their contents. Supposing one were storing people's names: if the data were structured, there would be consistency across all entries in where the person's first and last names sit, so if the first name of every entity were desired, one could simply access each entity once and extract just that field. Unstructured data, on the other hand, is exactly what it sounds like: lacking in this kind of consistency.
While the data may be recognizable to the casual human eye, it lacks the consistency needed to create an algorithm that accesses just the relevant information, and as such is a slower and sometimes less desirable paradigm. However, much of the data that exists which we want to process is already in this format (such as one's backlog of emails, or every Facebook or Twitter post one has written), so efficient approaches to parsing colossal amounts of unstructured data are very much desired.
Approaches for parsing structured data are, of course, also desired, given that giant data remains giant no matter how well organized it is. However, due to its structured nature, writing straightforward algorithms to parse it is trivial, whereas something as simple as counting the number of occurrences of a word across numerous word documents (which are unstructured data) becomes a much more complicated task to do quickly.
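To make the contrast concrete, here is a small sketch; the records and documents are invented for the example:

```python
from collections import Counter

# Structured data: every record follows the same scheme, so extracting
# one field is a single, predictable access per record.
people = [
    {"first_name": "Ada",   "last_name": "Lovelace"},
    {"first_name": "Grace", "last_name": "Hopper"},
]
first_names = [p["first_name"] for p in people]       # ['Ada', 'Grace']

# Unstructured data: free-form text has no fields to index into, so even
# a simple question ("how often does each word appear?") means scanning
# every word of every document.
documents = [
    "Cloud storage keeps growing and growing.",
    "Processing that storage is the hard part.",
]
word_counts = Counter(
    word.strip(".,").lower() for doc in documents for word in doc.split()
)

print(first_names)
print(word_counts.most_common(3))
```

With two documents the scan is instant; the trouble described above only appears when "every document" means billions of them.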
Before discussing approaches for unstructured data processing, in what scenarios would one wish to use structured or unstructured data paradigms when designing their own storage methodology for a datacenter? This problem is expanded when one considers that they will likely be writing the code that informs how to access that stored data and would preferably like a scalable approach that either knows what it will be dealing with no matter the input or is capable of handling wildly varying file formats that center around a common theme. (A bank transaction report being stored versus someone’s essay being checked for plagiarism, for example; the former is required to follow a consistent format while the latter is likely to vary wildly.)
An important point that should be made, though, is that the two are not mutually exclusive in their usage. Structured data describes how the fields for its contents should be set up and is rigid in that regard, but one of those fields could hold a collection of unstructured data assigned to that entity. For example, one's social media account may have the common fields of username, password, and preferred settings, but then contain a library of messages which don't follow any format, so considerations of a hybrid approach are also important based on the context of what problems are being solved with the constructed datacenter.
The correct usage is highly contextual and likely not even a discrete matter, but if the possible fields for an entry, whether filled automatically or by a user, are of a known, limited size that allows every index of your data to be known, then a structured approach is likely appropriate. However, for much of the analytics surrounding multimedia, or any field that produces a variable amount of variable-length, variable-format file data, unstructured is the necessary approach and must be handled appropriately.
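A small sketch of that hybrid shape, with invented field names and contents:

```python
# A hybrid record: rigid, structured fields alongside one field that
# holds free-form, unstructured text. Field names and contents are
# invented for this example.
account = {
    "username": "mclevinger",              # structured: always present
    "password_hash": "d41d8cd98f00b204",   # structured: fixed format
    "settings": {"theme": "dark"},         # structured: known keys
    "messages": [                          # unstructured: arbitrary text
        "See you Wednesday for something more!",
        "Did anyone actually get Hadoop working on their laptop?",
    ],
}

# The structured fields can be queried predictably...
print(account["username"], account["settings"]["theme"])
# ...while the unstructured field has to be scanned and interpreted.
print(sum(len(message.split()) for message in account["messages"]))
```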
This leads to the question: how does one efficiently process variably sized files of a generally known format in quantities that scale from trivial to impossible by normal approaches? The usual approach would be to parse it line by line, collate information, and then parse through that aggregate and interpret it in a fashion that may not be excessively complicated, but is made so by the sheer volume of the task. One machine would take far too long to do this, but this branch of computer science does not deal with one machine.
Cloud computing is the coordination of numerous machines to perform tasks far more effectively than each on their own, and that idea is applicable to handling the problem of massive input very easily: break the input apart, aggregate it across numerous systems, and then process the recombined data once it has been rendered into a usable format. This process is called MapReduce: you first map your input to distinct keys and then reduce them by performing an aggregating function on them. (Such as counting the occurrences of a term by making each distinct term be one of these keys, and then summing these matching keys.)
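Here is a minimal, single-process sketch of that map, shuffle, and reduce flow; the input lines are invented, and a real cluster would of course spread the map and reduce calls across many machines rather than run them in one loop:

```python
# A single-process sketch of the MapReduce idea: map each input line to
# (word, 1) pairs, group the pairs by key, then reduce each group by
# summing. Only the shape of the data flow is shown here.
from collections import defaultdict

def map_phase(line):
    # Emit a (key, value) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Aggregate all values that share the same key.
    return (word, sum(counts))

documents = ["the cloud stores the data", "the cloud processes the data"]

# "Shuffle" step: group every emitted value under its key.
grouped = defaultdict(list)
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(results)   # e.g. ('the', 4), ('cloud', 2), ('data', 2), ...
```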
In a coordinated system of machines, these individual tasks can be split up between machines which perform these functions repeatedly and aggregate their results into a common pool that can be split up again until the defined function is finished. The MapReduce concept can be implemented in a variety of ways; one commonly used open-source implementation, Hadoop, greatly simplifies the networking between the different machines. Despite the simplification, however, the data being unstructured means that there is no silver bullet for handling it across different cases.
The programmer must still define how the files are read in, and how their contents are handled, using (for a first-time user of Hadoop) an unfamiliar library of methods to perform these actions. This can be a difficult matter for a programmer, as there are many unexplained intricacies to getting Hadoop to work which vary greatly across different operating systems. However, once working, it is a very effective tool for processing large data utilizing numerous computing machines at a very rapid rate, so long as data locality is assured for its operations.
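To give a feel for what that definition work looks like, here is a sketch in the style of Hadoop Streaming, one common way to plug ordinary scripts into Hadoop: the framework feeds input splits to the mapper on standard input and expects tab-separated key/value lines on standard output. The word-count pair below is the conventional example rather than code from the report, and it glosses over all cluster configuration.

```python
# mapper.py - reads raw input lines from stdin and emits one
# tab-separated "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

Because the framework sorts the mapper output by key before it reaches the reducer, equal keys arrive on consecutive lines and a running total per key is enough:

```python
# reducer.py - sums the counts for each word; relies on the input
# arriving sorted by key, as Hadoop Streaming guarantees.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```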
Consider that data is being taken from a storage location, split up, and directed towards numerous computers which, after processing, direct their output to other machines, and that this process can be repeated many times for large scales of data; it is paramount that the operations moving the data from one location to another be quick and consistent. Faulty connections over a wireless network, or simply long travel times, will slow this process down immensely, so having data locality to ensure that the data can be moved almost instantaneously is vital.
If that can be assured, and the programmer has defined efficient methodologies for processing their file input that are robust and capable of dealing with what variety there can be amongst that input, then even when their stored data is unstructured and vast it will be processed at a consistently rapid rate due to the advantages granted by cloud computing’s coordination of many machines and the Hadoop architecture.
On the matter of data security, there are many facets of the problem to consider. The obvious concern is that some third party will access data meant to be kept secret from everyone besides the service provider and its user, be that a commercial trust between customer and their service provider or a private datacenter utilized by some corporate entity that is meant to keep the data ‘in-house’, as it were. This problem alone makes up a large amount of the research done into the development of architectures that maintain security in the increasingly complex world of computer technology, but it is by no means the only way in which the maintaining of one’s data can go awry.
Merely ensuring the integrity of the data being stored is a difficult task in its own right within a context where so much information is moved rapidly over long distances alongside a wealth of other data, and even that is before considering a hostile entity tampering with the data maliciously. Whatever the cause, losing the data you intended to store is not a preferable outcome, but perhaps worse yet is the problem of detection: being able to know whether or not the data being looked at is still intact and has not been altered to serve someone else’s hostile interests.
If the system does not even know that something is wrong, incredible damage can occur before anyone realizes that there was a manipulation, and even more before the matter is fixed. Having some way to detect when an outside party has accessed your data in some way, be it to spy, manipulate, or destroy, is essential to handling those cases when they come along, and given how common these events have become and how easily they can be perpetrated against ill-prepared bodies, to not protect against them is to invite disaster. Outside of the abuse of your architecture, there are also concerns to be raised over its mere exploitation.
If the common user of your datacenter has access to its most fundamental aspects, then it becomes exceptionally simple for a user to wreak havoc without having to 'hack' in, and given enough time such a strike could even occur unintentionally, simply because someone made a mistake and there was no contingency in place. Further, if there were no boundaries put on users' access to the system, they could manipulate one another's data with ease, which in the majority of services would not be a preferable outcome.
Thus, it becomes vital that a system of permissions be put in place to ensure the privacy of data, along with controls so that only those who should be able to perform certain actions can do so; a mistake in the logical permissions could mean that malicious entities do not need to break in at all, instead being invited in and welcomed to tear the datacenter apart. (Or worse, exploit it for some manner of profit and be long gone before the system notices the issue, if it ever does.)
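As a toy illustration of the kind of permission boundary being described (the roles and actions below are invented for the example, not drawn from any particular system):

```python
# A toy access-control check: every action on stored data must pass a
# permission lookup before it is performed. Roles, users, and actions
# here are invented for the example.
PERMISSIONS = {
    "admin":  {"read", "write", "delete", "configure"},
    "member": {"read", "write"},
    "guest":  {"read"},
}

def is_allowed(role, action):
    """Return True only if the role explicitly grants the action."""
    return action in PERMISSIONS.get(role, set())

assert is_allowed("member", "read")
assert not is_allowed("member", "configure")   # no silent escalation
assert not is_allowed("unknown", "read")       # unknown roles get nothing
```

The design choice worth noting is the default: an unknown role or unlisted action is denied rather than allowed, so a mistake in the table fails closed instead of inviting anyone in.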
In the event that an individual becomes the host of a datacenter offering any kind of service, all of these concerns are theirs to handle, for the sake of maintaining the trust of their users, preserving the efficacy of their architecture, and avoiding the great cost that repairing these intrusions and attacks would incur. However, handling the matters of Data Integrity, Data Theft/Loss, and Privacy is no trivial task.
Cloud computing is able to alleviate some of the incidental concerns previously expressed merely by virtue of its design; data is often stored redundantly on multiple machines as a safeguard against sudden failure wiping out vital information, so in many cases of confusion about data loss or manipulation there are back-ups available locally to revert to or check against, provided the programmer has decided to build such a feature into the network.
System snapshots are also a functionality available to cloud networks, wherein the network periodically collates the current state of each machine and its processes and saves that information as a safety net to revert to in case of emergency. A quite severe problem is raised here, however, one that comes up on many occasions where many machines need to be coordinated to work at the 'same' time: the actual clocks in each machine are not synchronized and cannot be made to be.
The idea of clock drift is one that has been around for centuries, and in a context where we measure operations in nanoseconds, the impracticality of keeping system clocks in time with one another becomes a very large issue. There is simply no way around the fact that internal clocks vary, so how is it possible to take a snapshot of the entire system architecture at the 'same' time, given the latency of communication between a variable number of machines?
The required approach, the Chandy-Lamport Global Snapshot Algorithm, considers the causality of each action performed by each machine instead of trying to match time-stamps: some first machine saves its own state and then sends a special marker message to its neighbours; any machine receiving the marker for the first time saves its own state and passes the marker along, while a machine that has already saved its state simply notes that nothing more is in flight on that connection.
This approach gets an approximate image of the entire system, and does so with as little interference as possible, to ensure accuracy if the snapshot is needed to reset the state of the network. Ideally, this produces a collection of saved states, all of which maintain logical consistency with one another, which roughly translates to 'if A received a message from B, then B sent a message to A.' If this is not the case in the produced snapshot, then it is not a viable option to reboot into in order to fix a memory loss or corruption issue.
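To make the bookkeeping concrete, here is a much-simplified, single-threaded sketch of the marker-passing idea; the process names, states, and message values are invented, channels are plain in-memory queues, and everything a real implementation worries about (sockets, concurrency, failures) is ignored:

```python
# A simplified, single-threaded sketch of the marker-passing idea behind
# the Chandy-Lamport snapshot algorithm. "State" is just a number per
# process and channels are in-memory FIFO queues.
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state            # live application state (a counter)
        self.saved_state = None       # state captured for the snapshot
        self.recording = {}           # channels still being recorded
        self.channel_snapshot = {}    # finalized in-flight messages per channel
        self.inbox = {}               # sender name -> FIFO queue

    def start_snapshot(self, network):
        self.saved_state = self.state
        self.recording = {sender: [] for sender in self.inbox}
        network.broadcast_marker(self.name)

    def receive(self, sender, message, network):
        if message == MARKER:
            if self.saved_state is None:
                # First marker seen: capture state, forward markers, and
                # note the channel it arrived on as empty.
                self.saved_state = self.state
                self.recording = {s: [] for s in self.inbox if s != sender}
                self.channel_snapshot[sender] = []
                network.broadcast_marker(self.name)
            else:
                # Marker closes this channel: whatever arrived since the
                # state was captured is the channel's in-flight content.
                self.channel_snapshot[sender] = self.recording.pop(sender, [])
        else:
            self.state += message
            if sender in self.recording:
                self.recording[sender].append(message)

class Network:
    """Fully connected processes with FIFO delivery per channel."""
    def __init__(self, processes):
        self.procs = {p.name: p for p in processes}
        for p in processes:
            p.inbox = {q.name: deque() for q in processes if q is not p}

    def send(self, src, dst, message):
        self.procs[dst].inbox[src].append(message)

    def broadcast_marker(self, src):
        for dst in self.procs:
            if dst != src:
                self.send(src, dst, MARKER)

    def deliver_all(self):
        delivered = True
        while delivered:
            delivered = False
            for proc in self.procs.values():
                for src, queue in proc.inbox.items():
                    if queue:
                        proc.receive(src, queue.popleft(), self)
                        delivered = True

# Tiny demo: A has sent 5 to B, and the snapshot starts while it is in flight.
a, b = Process("A", 10), Process("B", 20)
net = Network([a, b])
net.send("A", "B", 5)
a.start_snapshot(net)
net.deliver_all()
print(a.name, a.saved_state, a.channel_snapshot)   # A 10 {'B': []}
print(b.name, b.saved_state, b.channel_snapshot)   # B 25 {'A': []}
```

In the little demo at the bottom, the value 5 is still in flight from A to B when the snapshot starts; the recorded states (A at 10, B at 25) plus the recorded channel contents still account for everything the system contained, which is exactly the kind of logical consistency described above.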
While ensuring that data is kept accurate server-side is beneficial to the server-owner's confidence in the data's integrity, some means must also be put in place for the sake of the user of the service, be it a distant customer or a manager of the server accessing it from the outside. A simple approach to proving that the data has not been tampered with is to use public-key cryptography: generate a pair of matching keys, one public and one private, compute a 'fingerprint' of the file and keep it on the local machine, then send the file to the server along with the public key.
Then, any time the user wants to ensure the file is unchanged, they can request the portion corresponding to the 'fingerprint' they stored locally and check it against that fingerprint; if they match, they can assume the file retains its integrity and continue to trust the service provider; otherwise, they know the data has been tampered with. This approach is called Provable Data Possession, or PDP, and is one of many such approaches to permitting an outside entity to verify that their data has not been tampered with.
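As an illustration of the fingerprinting half of that idea, here is a small sketch; it is not the full PDP protocol (it skips the public-key machinery and simply keeps per-block hashes on the client), and the block size, in-memory 'server', and tampering step are all invented for the example:

```python
# A simplified spot-check in the spirit of the fingerprint idea above.
# NOT the full PDP protocol: the client just hashes each block before
# upload, keeps the hashes locally, and later challenges the server for
# a few random blocks to compare.
import hashlib
import os
import random

BLOCK_SIZE = 4096

def fingerprint_blocks(data: bytes):
    """Split the file into fixed-size blocks and hash each one."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [hashlib.sha256(block).hexdigest() for block in blocks]

def verify(stored_fingerprints, fetch_block, samples=3):
    """Challenge the server for `samples` random blocks and compare hashes."""
    for i in random.sample(range(len(stored_fingerprints)), samples):
        block = fetch_block(i)                       # returned by the server
        if hashlib.sha256(block).hexdigest() != stored_fingerprints[i]:
            return False                             # tampering or loss detected
    return True

# Demo with an in-memory "server"; a real service would answer the
# block requests over the network.
original = os.urandom(BLOCK_SIZE * 10)
fingerprints = fingerprint_blocks(original)          # kept on the client
server_copy = bytearray(original)                    # handed to the server

fetch = lambda i: bytes(server_copy[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
print(verify(fingerprints, fetch))                   # True: data intact

server_copy[0] ^= 0xFF                               # simulate tampering
print(verify(fingerprints, fetch, samples=10))       # False: change detected
```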
However, there is a flaw common amongst approaches to ensuring user-end data integrity: they only work with static data. If the service provided is one that stores data and performs some kind of function on it that changes the data, then it becomes no easy task to assure the user that the data, which no longer looks like what they initially stored, is in fact theirs. Numerous competing approaches, such as Basic Proof of Retrievability (POR), Proof of Retrievability for Large Files, HAIL (High-Availability and Integrity Layer), and POR Based on Selecting Random Bit in Data Blocks, all require that the data be static. The static requirement is advantageous in that noticing differences between the saved data's state and the original state is simpler, but these approaches only work in such narrow cases and do not extend to more complex ones.
To conclude this report on the methodologies behind cloud data storage: it has summarized the problems one would face in approaching the implementation of the complicated structures defined by the cloud computing architecture, namely the forms that data can be stored in, structured and unstructured, and how best to begin parsing that data when the need for storage expands into the need to process it quickly and efficiently: the MapReduce paradigm for unstructured data, while the benefits of structured data exempt it from needing a framework such as Hadoop to be processed efficiently.
Following that, the matters of data security, in the forms of Data Integrity, Data Theft/Loss, and Privacy, were raised and discussed, with the primary focus placed on ensuring Integrity from the perspectives of both the host and the user: the Chandy-Lamport Snapshot Algorithm for backing up the host's network, and the Provable Data Possession method for assuring the user that their data remains intact.
While there remains an incredible amount of depth to yet explore in the field of cloud computing for matters as simple as the storage of data and leveraging its networked aspect for computational efficiency, this serves as a low-level look to explore the early concepts that would need to be implemented to begin developing one’s own such structures for private or public commercial uses.
Assunção, Marcos D., Calheiros, Rodrigo N., Bianchi, Silvia, Netto, Marco A. S., Buyya, Rajkumar. “Big Data Computing and Clouds: Trends and Future Directions.” Journal of Parallel and Distributed Computing, vol. 79-80, 2015, pp. 3-15.
Sangat, Prajwol, Indrawan-Santiago, Maria, Taniar, David. “Sensor data management in the cloud: Data storage, data ingestion, and data retrieval.” Concurrency and Computation Practice and Experience, vol. 30, no. 1, 2018.
Ju, Jiehui, Wu, Jiyi, Fu, Jianqing, Lin, Zhijie. “A Survey on Cloud Storage.” Journal of Computers, vol. 6, no. 8, 2011, pp. 1764-71.
Giri, Mahesh S., Gaur, Bhupesh, Tomar, Deepak. “A Survey on Data Integrity Techniques in Cloud Computing.” International Journal of Computer Applications, vol. 122, no. 2, 2015, pp. 27-32.
Saxena, R., Dey, S. “Collaborative Approach for Data Integrity Verification in Cloud Computing.” Recent Trends in Computer Networks and Distributed Systems Security (SNDS 2014), Communications in Computer and Information Science, vol. 420, Springer, Berlin, Heidelberg, 2014.
Taylor, Christine. “Structured vs. Unstructured Data.” Datamation, Mar. 2018, https://www.datamation.com/big-data/structured-vs-unstructured-data.html
Chandy, K. Mani, Lamport, Leslie. “Distributed Snapshots: Determining Global States of Distributed Systems.” ACM Transactions on Computer Systems, vol. 3, no. 1, 1985, pp. 63-75.