• SC@RUG 2019

    April 2, 2019 at Rijksuniversiteit Groningen
    BERNOULLIBORG, room 253

A comparison between various techniques for smoothness assessment of shapes in computer visualization.

Kevin Gevers and Martijn Luinstra

Abstract In scientific visualization, the smoothness and quality of a shape are often assessed by projecting certain lines onto the shape in question. In this paper, we will discuss different techniques that can be used for this assessment: zebra stripes, reflection lines, highlight lines, and isophotes. For each of these techniques, the resulting projection must be assessed by a human observer to determine the smoothness and quality of the object in question. With zebra stripes, parallel black and white stripes are projected onto the object to reveal how they are distorted by the curvature of the shape. Reflection lines work the same way, except that a photograph or other 2D texture is reflected on the surface of the object. Highlight lines are a simple technique in which a light source is reflected on the object, creating a flare. Isophotes are computed contours of a constant angle between the surface normal and the eye direction.

For each technique, we will look into the efficiency and complexity of the computation, the quality of the result, and what it says about the smoothness of the object. At the end of each technique's discussion, we will state the cases for which the technique works well and those for which it is not a good choice. We expect that properly explaining how each technique works and discussing its positive and negative properties will give a good indication of which technique to choose for which case.
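The isophote computation mentioned above (contours of a constant angle between the surface normal and the eye direction) can be sketched in a few lines. This is our own illustrative code, not taken from the paper; the function name and the banding scheme are assumptions:

```python
import numpy as np

def isophote_bands(normals, view_dir, n_bands=8):
    """Assign each surface normal to an alternating 0/1 band based on
    the angle between the normal and the view direction (isophotes)."""
    v = np.asarray(view_dir, dtype=float)
    v = v / np.linalg.norm(v)
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)
    cos_theta = np.clip(n @ v, -1.0, 1.0)
    theta = np.arccos(cos_theta)                     # angle in [0, pi]
    band = np.floor(theta / np.pi * n_bands).astype(int)
    return band % 2                                  # alternating bands
```

On a smooth surface the band boundaries form smooth curves; kinks or breaks in the bands indicate curvature defects.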

The Role Of Data Provenance In Visual Storytelling.

Oodo Hilary Kenechukwu and Shubham Koyal

Abstract This paper inquires into visual storytelling and the function of data provenance in visual narration. The introduction of graphic demonstration in modern storytelling requires data provenance in order to unveil effective, scalable images used in telling an interesting story. Data provenance plays a significant role in creating insightful storytelling because of its mining attributes and its ability to explain the origin of the data. Data provenance can be described as the track record of data from its origin, detailing the contents of that data. An interesting story, on the other hand, can better be told through media such as video, photographs, and graphics: the human brain understands a story told through a visual medium faster than one told through text.

Let us consider how data are filtered while tracing species of monkeys. Through the filming and documentation process, useful information can be established on the origin of different species of monkeys. In processing information that contains images, provenance is useful in ensuring that the right properties, such as the framing and composition of the images, are obtained, thereby producing images sharp enough for documentation.

Our research will focus on the different impacts of data provenance on modern storytelling visualization. It will also look at the application of the "rule of thirds" in a visual story and how it connects to data provenance. By the end of the research, the reader should easily understand data provenance and its connection with visual storytelling.

Comparing Phylogenetic Trees: an overview of state-of-the-art methods.

Hidde Folkertsma and Ankit Mittal

Abstract Tree-structured data, specifically ordered rooted trees (i.e. trees with a root node and ordered child nodes for each node), are commonly found in many research areas, including computational biology, transportation and medical imaging. For these research areas, comparison of multiple such trees is an important task. The goal of comparing trees is to simultaneously find similarities and differences in these trees and reveal useful insights about the relationship between them. In biology, a prominent example of a comparison task is the comparison of phylogenetic trees. These trees encode evolutionary relationships among biological species. Comparing them can be useful in various fields, such as epidemiology, where tree comparison has been used to study various species of the Ebola virus.

Comparison tasks are seldom well-defined, because researchers often do not know exactly what they are looking for. This renders a purely algorithmic approach ineffective. However, since the amount of data to be processed is usually very large, visual inspection alone is not feasible either. Current state-of-the-art methods therefore combine algorithmic analysis with visual inspection by a domain expert, combining the strengths of the two approaches.

In our proposed paper we will review several state-of-the-art methods for the ordered rooted tree comparison task. In particular, we will focus on methods for comparing phylogenetic trees. For each method, we will briefly explain how it approaches the comparison problem, show a subset of its results and discuss its strengths and limitations.

Technical debt decision making: Choosing the right moment for resolving technical debt.

Ronald Kruizinga and Ruben Scheedler

Abstract In developing software, making the right decision at the right time is challenging. Change management teams, the teams that decide which features are included in new iterations and which are postponed, often need to take technical debt into account. This is a form of debt incurred by development compromises in maintainability and functionality. Technical debt often comes with interest: it increases the general cost of changing the system in the future. It thus makes the work of change management teams even harder, as they have another complication to consider: should we refactor a software element for a long-term benefit, or postpone doing so at the risk of increasing the debt? Moreover, technical debt can be expressed in multiple ways, such as the work needed to refactor a software element, the time and manpower required, and the financial costs, estimated at an average of $3.61 per line of code.

Choosing if and when to incur technical debt is critical for an optimal development process and is an important topic in technical debt decision making. Within technical debt research, finding strategies to formalize decision making has gained more focus recently. There are multiple strategies to decide whether and when to tackle technical debt, ranging from highly formalized, mathematical approaches to cost-benefit matrices. In addition, change management teams need to consider not only the effects of the debt on the module in which it occurs but also the costs incurred on the software elements that depend on it.

We examine several strategies and discuss various methods for inferring the right moment and the right decision-making approach to use when encountering technical debt. We compare the strategies on a variety of aspects, ranging from accuracy and completeness to feasibility and workload. We thereby provide a complete overview of our selected methods from a more practical viewpoint than previous research.

An overview of Technical Debt and Different Methods Used for its Analysis.

Anamitra Majumdar and Abhishek Patil

Abstract Technical debt is a widely used term in software development for the cost of restructuring code as a result of flaws present in the software system. It is caused by focusing on short-term benefits rather than the long-term life of the software. Most of the time, developers do not worry much about the overall health of the software during development and use low-quality code to meet their goals quicker. This shortcut results in code smells, bugs, performance issues, security loopholes and unreadable code. Ignoring these problems is the same as going into debt, i.e., choosing to ignore the problems so as to "borrow" time and push out releases quicker. Many instances of debt are also incurred unintentionally, as in the case of updates or patches to a software system. These debts put software systems at risk and pressure developers to revisit the same code to work on it again. Quantification of technical debt is necessary, as it gives developers as well as stakeholders an idea of the time and resources required to manage the debt. It also gives an overall perspective on why one should be careful when going into debt and what can be done to alleviate the problem.

In this paper, we will discuss in detail what technical debt actually is, its types, how it impacts the software life cycle in the long run, and how developers work on paying back the accumulated debt. We will then review several approaches that researchers have used in the past to measure different types of technical debt in a large number of open source software projects, and also study their effects on maintainability and debt payback. One of these approaches uses SonarQube, an open source tool which performs static analysis of code to detect bugs, code smells and security vulnerabilities, to estimate the time required to fix the debt. Another approach involves measuring instances of self-admitted technical debt, i.e., cases where developers have consciously incurred debt in their software and documented it in code comments. In the end, we conclude by putting forward our own ideas on how to reduce the problem of technical debt during the development process.

An Analysis of Domain Specific Languages and Language-Oriented Programming.

Abhisar Kaushal and Lars Doorenbos

Abstract Conventional programming paradigms are not always congenial for developers to come up with efficient solutions, primarily because the existing languages provide a generalized approach towards solving problems rather than a specific approach to a particular field. Each domain has its own peculiarities and characteristics and a mainstream programming language might not be the best choice for developers working in that domain.

To overcome this, Language-Oriented Programming (LOP) comes into play: it involves first creating one or more Domain Specific Languages (DSLs) with which the problem is then tackled. Using such a DSL allows developers to focus better on domain specific tasks. Currently, DSLs embedded in a mainstream programming language are the norm. Although this modus operandi is an improvement over traditional programming, it comes with its own set of limitations, precisely because the DSL is embedded in a mainstream language.

In this paper we evaluate the advantages and disadvantages of using DSLs, using examples from the video processing and audio synthesis domains as a basis. The main drawbacks include the cost of designing, implementing, and maintaining a DSL. These have to be weighed against the benefits, such as being able to represent a problem in the terms of the domain and the reduction in system complexity. We also shed light on the "middle-out" approach employed in LOP, how it is different from the traditional approaches, and how it can be recursively applied to reduce system complexity even further by creating multiple DSLs.
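As a hedged illustration of the embedded-DSL idea discussed above, the following sketch (our own; the `Pipeline` class and the audio-style stage names are invented for illustration) reuses the host language's operator overloading to give pipelines a domain-flavoured syntax:

```python
class Pipeline:
    """Tiny embedded DSL: processing stages are composed with `|`,
    mirroring how an embedded DSL borrows the host language's syntax."""
    def __init__(self, *stages):
        self.stages = list(stages)

    def __or__(self, stage):
        # `p | stage` returns a new pipeline with the stage appended
        return Pipeline(*self.stages, stage)

    def run(self, signal):
        for stage in self.stages:
            signal = stage(signal)
        return signal

# Invented audio-style stages: each returns a function over a sample list.
def gain(factor):
    return lambda xs: [x * factor for x in xs]

def clip(limit):
    return lambda xs: [max(-limit, min(limit, x)) for x in xs]

chain = Pipeline() | gain(2.0) | clip(1.0)
```

Here `chain.run([0.2, 0.8])` yields `[0.4, 1.0]`: the embedding costs nothing to implement, but the DSL inherits the host language's error messages and semantics, which is exactly the kind of limitation the abstract refers to.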

A Brief History of Concurrency: Lessons from Three Turing Lectures.

Michael Yuja and Bogdan Bernovici

Abstract Since its inception, the field of computer science has seen the work of great minds drastically change the modern world for the better. Their work is so deeply ingrained in our society that we may barely notice it is there. In this paper, we will review the works of three exceptional computer scientists. They are recipients of the Association for Computing Machinery's A.M. Turing Award, which is often regarded as the Nobel Prize of computer science. We examine the lessons offered by Edsger Dijkstra, Robin Milner, and Leslie Lamport upon reception of the award, which they won in 1972, 1989 and 2013, respectively.

First, we summarize their lectures and the lessons that they offer from their personal lives as well as their work. We dive into the topics of abstraction, software correctness, and concurrency. We highlight Dijkstra's insistence on approaching the task of programming as a high intellectual challenge, and on writing programs correctly instead of debugging them into correctness. We give an overview of Milner's Calculus of Communicating Systems (CCS), an abstraction for formal modelling and program verification, which was built on these same ideas. Moreover, we review Lamport's work as an extension of these ideas into more practical applications of concurrent algorithms. As Lamport boldly suggests in one of his writings, programmers will soon need to add the mathematical skills of formal verification and software correctness to their toolbox, or risk becoming obsolete in the field.

Finally, we draw a comparison across the years since each of these lectures was published and explore the historical significance of their work. Even though Dijkstra and Milner researched these topics more than 30 years ago, we expect to find that their lessons remain widely applicable today.

Selecting a Logical System for Compliance Regulations

Michael van de Weerd and Zhang Yuanqing

Abstract Due to the rapidly changing requirements in modern businesses and the complex rules imposed by governments and other regulators, there is a need for services that automatically verify whether business processes are in compliance with these norms. Using a logical language, these norms can be defined such that the verification services are able to understand them and verify whether the business processes comply.

Many tools and languages, such as the deontic and temporal logic families, have been developed, but the field lacks clear directions regarding the pros and cons of the different methods and how to use them in a practical situation. In our systematic review we explore the different options for logical languages available to professionals. Next, we will define the properties of regulations in order to categorize them into several use cases. One or more concrete use cases will be defined such that all categories are covered. With these use cases, we will demonstrate and empirically compare the usage of the logical languages (or a selection thereof, if time constraints require us to narrow down).

Finally, we will discuss our experience using each logical language and propose the categories of use cases in which each excels according to our comparison. We expect that each logical language will have its own specialty. If possible, we will identify use case categories that cannot be covered by the selected logical languages, as these might present an opportunity for further research or development.

Distributed Constraint Optimization Problems: Review of recent complete algorithms.

Elisa Oostwal and Sofie Lövdal

Abstract Constraint optimization problems (COPs) are a class of problems in which variables need to satisfy a number of constraints. Unlike in constraint satisfaction problems (CSPs), where the goal is to find some set of assignments that satisfies the constraints, the goal of a COP is to find the set of assignments that optimizes an objective function. In problems with a large number of variables or constraints it quickly becomes infeasible to examine all possible solutions. However, multi-agent systems can be used to solve these problems by distributing the variables and constraints among the agents, effectively splitting the (global) COP into smaller (local) COPs. The agents then communicate their state in order to find a solution. Since the agents in such a system operate autonomously, defining a model for communication between the agents is a non-trivial task. Distributed constraint optimization (DCOP) is a research field in artificial intelligence which provides the needed methods for coordination and distributed problem solving.
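To make the COP formulation above concrete, here is a minimal brute-force solver (our own sketch, not a DCOP algorithm from the literature): it enumerates every assignment, which is exactly the exhaustive search that becomes infeasible as the number of variables grows.

```python
from itertools import product

def solve_cop(domains, constraints, objective):
    """Exhaustive COP solver: among assignments satisfying all
    constraints, return one that maximizes the objective function."""
    names = list(domains)
    best, best_val = None, float("-inf")
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            val = objective(assignment)
            if val > best_val:
                best, best_val = assignment, val
    return best, best_val
```

The search space is the product of all domain sizes, so runtime is exponential in the number of variables; DCOP algorithms distribute this work among agents that each own a subset of the variables.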

The DCOP algorithms can be divided roughly into two categories: complete algorithms which guarantee convergence to an optimal solution, and incomplete algorithms which merely provide bounds on solution quality. The former class is useful for academic purposes and in other scenarios where solution quality needs to be guaranteed. The latter class is suitable in large-scale problems as well as for real-time applications, where resources are limited.

In this paper we restrict ourselves to the set of complete algorithms. Special attention is given to state-of-the-art algorithms that have been proposed in recent years. Their computational complexity, spatial complexity, and applicability will be discussed. Since the performance of a DCOP algorithm depends heavily on the problem at hand, we will create a guideline for which algorithm is best suited to which type of problem.

An overview of data science versioning practices and methods.

Kayleigh Boekhoudt and Thom Carretero Seinhorst

Abstract Version control is important in software development. It makes it easy to manage collaboration between developers and to keep track of changes. In a general workflow, all source code created in a software project is contained in a (de)centralized repository to which all project members have access. This process allows for verifiability, quality and access control. Moreover, it allows for reverting unintentional changes. Data science version control is relatively new and has room for improvement. Researchers want to collect, analyze and collaborate on data sets, to retrieve insights or to distil scientific knowledge from them. Data evolves over time, often at rapid rates. Data science therefore needs its own versioning system so data scientists can better collaborate, test, share and reuse.

The goal of this project is to analyse the methods available today for managing not only source code versions but also data versions. In software development, there are many well-known version control systems (VCSs) available, for instance Git and its hosting platform GitHub. There are numerous less familiar methods available for data science versioning, such as the Dataset Version Control System (DSVC) and the platform DataHub.

We will discuss the different methods and platforms at a high level by comparing the ideas behind the methods and their functionalities. We will also analyse a few methods at a deeper level, examine their models, benefits and challenges, and give suggestions for possible improvements. At the end of the research, we will recommend data science versioning methods based on the project they are needed for.

A different approach to the selection of an optimal hyperparameter optimisation method

Derrick Timmerman and Wesley Seubring

Abstract Several methods have been developed to improve the performance of hyperparameter optimization for machine learning models. The naive way to find the optimal hyperparameters is to search through an entire grid of parameter values, also called Grid search. A more efficient approach is Random search, in which parameter values are randomly sampled and evaluated. There is also a wide range of more advanced hyperparameter optimization methods available (e.g. spectral analysis and Bayesian optimization), which could improve optimization performance even further.
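The Random search idea described above fits in a few lines of code. This is our own illustrative sketch (function and parameter names are assumptions, not from any specific library):

```python
import random

def random_search(objective, space, n_iter=50, seed=0):
    """Sample hyperparameter configurations uniformly at random from
    `space` (name -> list of candidate values) and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Grid search would instead iterate `itertools.product` over all value lists; for the same evaluation budget, random search samples more distinct values per individual dimension, which is why it is often the more efficient choice.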

From the perspective of machine learning, the optimization of hyperparameters is time-consuming and computationally expensive. Our research will therefore compare the most promising but more obscure hyperparameter optimization methods, as well as the well-known Grid search and Random search methods. These findings can serve as a guide in the field of machine learning, to choose the most efficient hyperparameter optimization method.

The different hyperparameter optimization methods will be discussed and evaluated based on their efficiency and usability in practice. This is done by comparing the different methods and their results after being applied to a hyperparameter optimization problem. Ideally, the method should be fast, scalable and relatively easy to apply in practice.

Reproducibility in Scientific Workflows: An Overview.

Konstantina Gkikopouli and Ruben Kip

Abstract Provenance, data on the chronology of steps in a workflow, is an increasingly significant component in the life cycle of a scientific workflow, as it captures in detail all the events that occurred in the system. Based on this gathered data, different actions can be performed (e.g. replaying the execution history of a process or repairing the scientific workflow) that permit the reproducibility of different scientific applications. However, the storage and computing capacity needed to record all the data required to make a workflow reproducible remains an unsolved issue. We summarize the current usage of scientific workflows and their problems by creating an overview of recent papers, possibly demonstrating a potential solution. We mainly focus on the integration of scientific workflows into cloud computing. This integration assures the re-execution of scientific workflows and their reproducibility in the cloud.

Predictive monitoring for Decision Making in Business Processes.

Ana Roman and Hayo Ottens

Abstract The achievement of business goals is always desired for any service or establishment, whether it concerns, for example, the health of patients in hospitals, the maintenance of a company's image, or the making of profit in profit-making organizations. These business goals are achieved by (series of) business processes. In order to achieve these goals, the processes should be monitored at runtime, but their outcomes can also be predicted to prevent unwanted scenarios. Instead of evaluating business processes upon completion, it is desirable to intervene before a sub-optimal action is taken. Possible options and their risks should be taken into account in order to make a well-informed decision. Information systems currently used to support business processes keep their historical data in the form of logs. These logs can be used to hint towards a more optimal action in a situation where a decision is to be made.

In this paper we present multiple methodologies to 1) predict the outcome of a (sub)process and 2) give a risk analysis for possible decisions. Upon process execution, the business goals and constraints can be defined in the form of a model. Next, a framework is used to monitor and identify the input data values which can cause the business goal to become unreachable. We will elaborate on using existing outcome-oriented predictive business process monitoring techniques as well as a way to predict risks from log data. Lastly, we will point out weaknesses of these methods and present questions for future research.

A Comparison of Peer-to-Peer Energy Trading Architectures.

Anton Laukemper and Carolien Braams

Abstract Peer-to-peer energy trading (P2P DET) enables people to trade their own generated energy from renewable energy sources (RES) among end users in a distribution network. These end users, who both produce and consume energy, are called prosumers. The rise of RES like photovoltaic panels and wind turbines presents numerous opportunities; for example, the risk of power outages is reduced through multiple redundant power sources. However, it also poses technological challenges in the design of electric grids, concerning issues such as privacy, security and reliability of power supply. Recent research has shown various methods to address these challenges by finding new techniques to balance supply and demand of electricity, and to transmit it from prosumer to prosumer using a smart grid.

This paper gives an overview of the various aspects that come with a distributed grid of power producers and compares the approaches of recent works. We point out that a possible architecture of a distributed peer-to-peer smart grid consists of two parts: a system that dynamically responds to variations in demand, and a system that securely and reliably sends power packages through the grid. The main contribution of this paper is a qualitative comparison of a selection of possible P2P DET platforms.

Ensuring correctness of communication-centric software systems.

Rick de Jonge and Mathijs de Jager

Abstract Software systems can be proven to work as intended, but this is a difficult task for communication-centric software systems such as a client-server setup. The current literature offers a multitude of ways to ensure the correctness of such a system. Conversation calculus and the π-calculus have been used to represent the flow of such a system, while research using session types has taken a more programmatic approach to ensuring correctness. These approaches have to be able to express different modes of communication: for example, linear as opposed to non-linear connections, and dynamically established communications as opposed to already opened connections.

Since there are many different approaches, each constructed with its own goal, we want to find out which similarities and differences exist between them. Each approach had to tackle different issues to ensure correctness in communication-centric systems, some of which are shared and others unique. From this, we can see which general aspects each approach covers and which aspects have not been covered by any of them. These uncovered aspects make a starting point for new research in this area.

We will consolidate this research by summarizing the most-used approaches, comparing their solutions to common aspects, and explaining their individual usefulness. In this comparison we will see what obstacles each approach had to overcome, and which approach is best suited to ensure correctness for which programming designs and languages. We conclude with a global overview of the systems covered by these approaches.

A Comparative Study of Random Forest and its Probabilistic Variant.

Zahra Putri Fitrianti and Codrut-Andrei Diaconu

Abstract Machine learning algorithms have become an important tool in data analysis. They can be used for various tasks, classification being one of the most common, where we aim to predict the labels/classes of data points using their attributes. An example of such an algorithm is the Random Forest (RF), a combination of tree predictors such that each tree uses a random subset of features and a random subset of the training data points [1]. However, some algorithms do not necessarily perform well when the data has noise, e.g. errors introduced by measurement tools or missing values. In order to take these uncertainties into account, the Probabilistic Random Forest (PRF) was designed to treat the features and labels as probability density functions [2]. This makes the model more robust compared to the standard RF, which uses deterministic values.

The aim of this paper is to compare the PRF method with the standard RF by performing tests on benchmark data sets and analyzing the results in terms of both performance (e.g. accuracy) and computational time.

We will test both algorithms on two kinds of data sets: real and synthetic. First, we will compare the two algorithms on clean data sets, and then we will consider two types of noise: measurement noise, which we will simulate by injecting noise into features and/or labels, and missing data, for which we will discard some of the features for a subset of the data points. Then, various performance measures will be computed as the main criteria of the comparison. We expect to observe better performance from PRF compared to standard RF when the data set has a higher fraction of noise. In addition, we expect the PRF to yield similar or perhaps slightly worse results than the standard RF on clean data, because it is a more complex algorithm.
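The measurement-noise simulation described above can be sketched as follows. This is our own illustrative code (the function name, the corrupted fraction and the noise scale are assumptions, not the paper's actual protocol):

```python
import numpy as np

def inject_measurement_noise(X, noise_frac=0.2, sigma=1.0, seed=0):
    """Add Gaussian noise to a random fraction of the feature entries,
    simulating measurement errors while leaving the rest untouched."""
    rng = np.random.default_rng(seed)
    X_noisy = np.array(X, dtype=float, copy=True)
    mask = rng.random(X_noisy.shape) < noise_frac   # entries to corrupt
    X_noisy[mask] += rng.normal(0.0, sigma, size=int(mask.sum()))
    return X_noisy, mask
```

A PRF can be given the per-entry uncertainties (e.g. `sigma` for the masked entries) as input densities, whereas a standard RF sees only the corrupted point values; that difference is what the comparison aims to quantify.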

Comparison of data-independent Locality-Sensitive Hashing (LSH) vs. data-dependent Locality-Preserving Hashing (LPH) for hashing-based approximate nearest neighbor search.

Jarvin Mutatiina and Chriss Santi

Abstract Determining the data most similar to a query is an important aspect of many applications, such as image retrieval, document search, and pattern recognition. However, with high-dimensional data, performing a nearest neighbor search using traditional methods like k-d trees becomes complex [1] and degenerates into a brute-force linear search, as each data element might have to be traversed. This can be computationally intensive and time consuming, e.g. for queries over the internet. Approximate nearest neighbor search with the aid of hashing functions limits the search space and is well suited to such high-dimensional scenarios.

Several hashing algorithms have been proposed to aid approximate nearest neighbor search in high-dimensional data. In this research we focus on comparing two top contenders: Locality-Sensitive Hashing (LSH) and Locality-Preserving Hashing (LPH).

Our investigation will focus on how and when to use LSH and LPH as approximations for hashing-based nearest neighbor search. Another aspect of the investigation will be data (in)dependence and its influence on the approximations. LSH is a data-independent technique, i.e. the hashing of the data is random and does not preserve the similarity structure of the inputs. On the other hand, the data-dependent LPH maintains the relative similarity structure of the input data [3]. We will conduct experiments that investigate and document results on scenarios where the two hashing approximations are applicable and most relevant.
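To illustrate the data-independent nature of LSH, here is a minimal random-hyperplane (sign-random-projection) sketch. This is our own example, not code from the cited works:

```python
import numpy as np

def lsh_signature(x, hyperplanes):
    """Data-independent LSH: the signature of `x` is the sign pattern
    of its projections onto random hyperplanes, so vectors at a small
    angle tend to share most signature bits."""
    return ((np.asarray(hyperplanes) @ np.asarray(x)) >= 0).astype(int)

rng = np.random.default_rng(42)
planes = rng.standard_normal((16, 3))    # 16 random hyperplanes in R^3

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.99, 0.01, 0.0])          # almost the same direction as a
c = -a                                   # the opposite direction
sig_a, sig_b, sig_c = (lsh_signature(v, planes) for v in (a, b, c))
```

Note that the hyperplanes are drawn without ever looking at the data; a data-dependent LPH scheme would instead learn the projections from the input distribution.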

Automatic Fracture Detection in CT-scans of the Cervical Spine.

Kareem Al-Saudi and Frank te Nijenhuis

Abstract When a trauma patient arrives at a hospital's Emergency Department (ED), a cervical (neck) computed tomography (CT) scan is often performed as a routine examination to check for spinal fractures. This scan is examined by a radiologist to ensure that there are no fractured vertebrae. The high workload of radiologists, however, means that whenever possible, attempts should be made to automate some of the more mundane radiological tasks. We believe the detection of fractures of the cervical spine on CT-scans is one of the tasks amenable to automation.

Currently, deep learning techniques are being applied to many areas of medicine, particularly in the field of radiology. Convolutional neural networks (CNNs) are the current go-to solution for machine-learning-related imaging problems. Automatic fracture detection in the cervical spine, however, has received little attention, and we hope to expand the scientific knowledge of this part of radiological machine learning.

We will analyze the current literature in the field of deep learning so that we may propose a solution to the problem of automatic fracture detection. Furthermore, we will write about the specific issues associated with the use of deep learning in medicine.

The most straightforward approach to solving this problem would be to apply a pretrained CNN to a CT-scan dataset. We want to predict whether certain preprocessing techniques, such as a segmentation step, can improve on the performance of a "naive" application of a CNN.

An Overview of Runtime Verification in Various Applications

Neha Rajendra Bari Tamboli and Sankar Vigneshwari

Abstract Runtime verification is an approach based on extracting information from a running system under inspection to check whether an execution satisfies or violates a given property. Particular properties of interest include data-race and deadlock freedom, violations of which can have a huge impact on system performance and, in some cases, result in critical misdiagnosis. Runtime verification reduces the complexity of traditional static or semi-automated verification techniques, such as model checking. Since runtime verification monitors only one or a few execution traces, it gives better outcomes in less execution time.
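As a minimal, hedged illustration of the monitoring idea (our own sketch; real runtime-verification frameworks generate monitors from formal property specifications), the following checks a single execution trace for a simple lock-discipline property related to the data-race and deadlock freedom properties mentioned above:

```python
def check_lock_discipline(trace):
    """Monitor one execution trace, given as (event, lock) pairs:
    report a violation if a lock is acquired again before release."""
    held = set()
    for step, (event, lock) in enumerate(trace):
        if event == "acquire":
            if lock in held:
                return False, step       # property violated at this step
            held.add(lock)
        elif event == "release":
            held.discard(lock)
    return True, None
```

Because only the observed trace is inspected, the check is cheap compared with model checking the full state space, at the cost of only covering the executions that actually occur.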

We will examine how runtime verification is applied in various domains, such as financial transaction systems, medical care practice and distributed systems, survey how it is used as an alternative to other techniques, and briefly discuss the workflow of the runtime verification method in these applications. We would also like to discuss what commonalities are observed between these applications in terms of runtime verification and how it improves their efficiency and accuracy (which can be critical in medical health care). This will be achieved by comparing and analyzing three papers, one each on distributed systems, medical care practice and financial transaction systems.
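To make the general workflow concrete, here is a minimal sketch of a runtime monitor, assuming the trace is a list of (action, lock) events; the property checked — no lock is acquired twice without an intervening release — is only an illustrative stand-in for the data-race and deadlock-freedom properties the abstract mentions:

```python
def monitor(trace):
    """Check each event of an execution trace against the property:
    a lock may not be acquired again before it has been released.
    Returns (True, None) if the trace satisfies the property, or
    (False, i) where i is the index of the first violating event."""
    held = set()
    for i, (action, lock) in enumerate(trace):
        if action == "acquire":
            if lock in held:          # double acquire: property violated
                return False, i
            held.add(lock)
        elif action == "release":
            held.discard(lock)
    return True, None

good = [("acquire", "a"), ("release", "a"), ("acquire", "a")]
bad = [("acquire", "a"), ("acquire", "a")]
print(monitor(good))  # -> (True, None)
print(monitor(bad))   # -> (False, 1)
```

Real runtime-verification frameworks generate such monitors automatically from formal property specifications rather than hand-coding them as done here.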

An overview of prospect tactics and techniques in the microservice management landscape.

Edser Apperloo and Mark Timmerman

Abstract In the past years, the industry has been moving towards using microservices rather than monolithic codebases. These services offer flexibility and reusability and are easier to scale than traditional software. A lot of research has gone into the capabilities of microservices. Their use, however, introduces new issues regarding their management: monitoring, health management and service discovery are still open subjects of research. Even though a lot of progress has been made in these fields, the rapidly changing requirements of our current digital society, together with large market shareholders, prevent the microservice architectural style from reaching its true potential. In addition, it becomes increasingly difficult to assert that the microservices and their APIs implement the proposed contracts.

In this paper, an overview is given of the microservice architectural style and the existing technologies and heuristics for the two aforementioned challenges: managing large sets of microservices and asserting that those services implement the proposed contracts. Benefits and pitfalls are identified, and the current microservice landscape is described together with key challenges, their implications and a set of current tools to face them.

Reviewing a number of papers from recent years on the microservice landscape, we identified the following trends. Current solutions make use of external global management applications such as Kubernetes, Amazon CloudWatch or RightScale. As the set of services grows and the need for management and orchestration increases, the management application itself increasingly needs to scale. As microservices keep gaining influence, the need for less vendor lock-in and for distributed management will increase. To facilitate this, we expect a future trend towards distributed management applications in which microservices manage themselves or a few others in order to guarantee scalability and reliability. If this expectation holds, new problems that may arise include communication between these management services and a clear separation between the microservice logic and its management layer.

Dynamic Updates in Distributed Data Pipelines

Sjors Mallon and Niels Meima

Abstract Distributed data processing systems are the standard for large-scale data processing tasks. The increasing growth of data gathering systems leads to new problems and requirements for such tasks. Dynamic updating (updating running code on the fly, without the need to shut the application down) is useful in these systems, as current state-of-the-art systems require the ability to dynamically update the algorithms, parameters and data sources that are part of the data processing pipeline. Such a pipeline is a sequence of software elements that each transform the data in some way. Recent work shows an initial basic idea for dynamically updating the above-mentioned segments of processing pipelines without much overhead.

Our research extends initial proposals and identifies key problems and limitations of these distributed data processing pipelines. We propose concrete steps to handle the identified problems and limitations, with the aim of mitigating their impact on the data processing system. We also propose additional requirements for establishing a system that is more robust than current offerings.

We review and summarize current solutions and implementations based on their usability, generalizability and overhead, and compare them to identify shared and missing features. From this comparison, we propose a list of requirements and implementation steps to improve current offerings. We will also look at current perspectives on mitigating the impact of the downtime that dynamic updates will eventually cause.
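The core idea of dynamically updating a pipeline stage can be sketched in a few lines; the toy `Pipeline` class below is an illustrative assumption, not the design proposed in the paper, and real systems must additionally handle in-flight records and state migration:

```python
import threading

class Pipeline:
    """A toy data pipeline whose stages can be updated on the fly.
    Each stage is a plain function; swap_stage atomically replaces
    one stage while records keep flowing through process()."""

    def __init__(self, stages):
        self._stages = list(stages)
        self._lock = threading.Lock()

    def swap_stage(self, index, new_stage):
        with self._lock:              # dynamic update: no shutdown needed
            self._stages[index] = new_stage

    def process(self, record):
        with self._lock:
            stages = list(self._stages)  # snapshot of the current pipeline
        for stage in stages:
            record = stage(record)
        return record

pipe = Pipeline([lambda x: x + 1, lambda x: x * 2])
print(pipe.process(3))                  # (3 + 1) * 2 = 8
pipe.swap_stage(1, lambda x: x * 10)    # hot-swap the second stage
print(pipe.process(3))                  # (3 + 1) * 10 = 40
```

The indirection through a mutable stage table is what makes the update possible without restarting; distributing that table consistently across workers is where the real difficulty lies.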

About the Event

What is SC@RUG 2019

The school of computing science and cognition at the University of Groningen organizes the sixteenth student colloquium conference SC@RUG on 2 April 2019, bringing together master students in Computing Science in Groningen and its staff. SC@RUG 2019 is devoted to research in computing science, with an emphasis on overviews of research. Previous SC@RUG editions have featured a broad range of presentations in the form of surveys, tutorials and case studies, and we hope to extend that range even further this year.

Schedule

Part 1 (April 2)
09:00
Doors open
09:05-09:20
Predictive monitoring for Decision Making in Business Processes

Ana Roman and Hayo Ottens

09:23-09:38
Reproducibility in Scientific Workflows: An Overview.

Konstantina Gkikopouli and Ruben Kip

09:41-09:56
Technical debt decision making: Choosing the right

Ronald Kruizinga and Ruben Scheedler

09:59-10:14
An overview of Technical Debt and Different Methods Used for its Analysis

Anamitra Majumdar and Abhishek Patil

10:17-10:32
An Analysis of Domain Specific Languages and Language-Oriented Programming.

Abhisar Kaushal and Lars Doorenbos

10:32-10:50
Break
10:50-11:05
A Brief History of Concurrency: Lessons from Three Turing Lectures.

Michael Yuja and Bogdan Bernovici

11:08-11:23
Ensuring correctness of communication-centric software systems.

Rick de Jonge and Mathijs de Jager

11:26-11:41
An Overview of Runtime Verification in Various Applications

Neha Rajendra Bari Tamboli and Sankar Vigneshwari

11:44-11:59
An overview of prospect tactics and techniques in the microservice management landscape

Edser Apperloo and Mark Timmerman

12:02-12:17
Dynamic Updates in Distributed Data Pipelines

Sjors Mallon and Niels Meima

12:20-12:35
An overview of data science versioning practices and methods.

Kayleigh Boekhoudt and Thom Carretero Seinhorst

Part 2 (April 2)
12:35-13:15
Lunch break
13:15-13:45
Keynote speaker
13:45-14:00
A different approach to the selection of an optimal hyperparameter optimisation method.

Derrick Timmerman and Wesley Seubring

14:03-14:18
A Comparison of Peer-to-Peer Energy Trading Architectures.

Anton Laukemper and Carolien Braams

14:21-14:36
Selecting a Logical System for Compliance Regulations.

Michael van de Weerd and Zhang Yuanqing

14:39-14:54
Distributed Constraint Optimization Problems: Review of recent complete algorithms.

Elisa Oostwal and Sofie Lövdal

14:54-15:10
Break
15:10-15:25
A Comparative Study of Random Forest and its Probabilistic Variant.

Zahra Putri Fitrianti and Codrut-Andrei Diaconu

15:28-15:43
Automatic Fracture Detection in CT-scans of the Cervical Spine.

Kareem Al-Saudi and Frank te Nijenhuis

15:46-16:01
Comparison of data-independent Locality-Sensitive Hashing (LSH) vs. data-dependent Locality-Preserving hashing (LPH) for hashing-based approximate nearest neighbor search.

Jarvin Mutatiina and Chriss Santi

16:04-16:19
The Role Of Data Provenance In Visual Storytelling.

Oodo Hilary Kenechukwu and Shubham Koyal

16:22-16:37
Comparing Phylogenetic Trees: an overview of state-of-the-art methods.

Hidde Folkertsma and Ankit Mittal

16:37
Choosing the best presentation & closing by the honorary chair

Sponsors

Thanks to our sponsors

Location

Check out the location information

BERNOULLIBORG

Address: NIJENBORGH 9, 9747 AG, Groningen

Get Directions