Data profiling

Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to:

* Find out whether existing data can be easily used for other purposes
* Improve the ability to search data by tagging it with keywords or descriptions, or by assigning it to a category
* Assess data quality, including whether the data conforms to particular standards or patterns
* Assess the risk involved in integrating data in new applications, including the challenges of joins
* Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies
* Assess whether known metadata accurately describes the actual values in the source database
* Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
* Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality.

== Introduction ==
Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content, relationships, and derivation rules of the data. Profiling helps not only to understand anomalies and assess data quality, but also to discover, register, and assess enterprise metadata. The result of the analysis is used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design.
== How data profiling is conducted ==
Data profiling uses methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation; aggregates such as count and sum; and additional metadata obtained during profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition. The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates.

Different analyses are performed for different structural levels. For example, single columns can be profiled individually to understand the frequency distribution of different values and the type and use of each column. Embedded value dependencies can be exposed in a cross-column analysis. Finally, overlapping value sets that may represent foreign-key relationships between entities can be explored in an inter-table analysis.

Normally, purpose-built tools are used for data profiling to ease the process. The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.

== When is data profiling conducted? ==
According to Kimball, data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken immediately after candidate source systems have been identified and DW/BI business requirements have been satisfied. The purpose of this initial analysis is to clarify at an early stage whether the correct data is available at the appropriate level of detail and whether anomalies can be handled subsequently. If this is not the case, the project may be terminated.
Additionally, more in-depth profiling is done prior to the dimensional modeling process in order to assess what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system design process in order to determine the appropriate data to extract and which filters to apply to the data set. Data profiling may also be conducted in the data warehouse development process after data has been loaded into staging, the data marts, etc. Conducting data profiling at these stages helps ensure that data cleaning and transformations have been done correctly and in compliance with requirements.

== Benefits and examples ==
Data profiling can improve data quality, shorten the implementation cycle of major projects, and improve users' understanding of data. Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling. It can also improve data accuracy in corporate databases.
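As a rough illustration of the single-column analysis described under "How data profiling is conducted", the following Python sketch computes a few of the listed statistics for one column using only the standard library. The function names `value_pattern` and `profile_column`, the pattern encoding (digits as `9`, letters as `A`), and the sample values are assumptions made for this example, not the interface of any particular profiling tool.

```python
from collections import Counter
from statistics import mean


def value_pattern(value: str) -> str:
    """Abstract a string into a pattern: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Collect simple single-column statistics; None marks a missing value."""
    present = [v for v in values if v is not None]
    frequencies = Counter(present)
    lengths = [len(str(v)) for v in present]
    profile = {
        "count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(frequencies),
        "is_unique": len(frequencies) == len(present),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "most_common": frequencies.most_common(1)[0] if frequencies else None,
        "patterns": Counter(value_pattern(str(v)) for v in present),
    }
    # Numeric summaries are added only when the column holds numbers.
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric:
        profile.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return profile


# Hypothetical postal-code column with one missing value, one duplicate
# and one malformed entry (letter O typed instead of the digit 0).
postal_codes = ["12345", "12345", "9021O", None, "30301"]
report = profile_column(postal_codes)
```

In this sample, the pattern counts separate the malformed entry `9021O` from the well-formed codes without any knowledge of postal codes, which is the kind of problem the collected metadata is meant to expose.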

Google Tasks

Google Tasks is a task management application developed by Google and included with Google Workspace. Initially included as a feature in Gmail and Google Calendar, Google Tasks launched as a core product with a standalone app in 2018. It is available for Android and iOS, as well as in the right-hand side panel of Google Workspace apps on the web and in Google Calendar.

== History and development ==
Google Tasks began as an integration within other apps in G Suite (now Google Workspace), allowing to-do items to be created in Calendar and Gmail. Upon graduating to a core service on June 28, 2018, Google Tasks launched as a dedicated mobile app in which tasks can be sorted into lists, managed, and completed. In 2022, Google Tasks added the ability to create tasks from Google Chat messages.

Anytime algorithm

In computer science, an anytime algorithm is an algorithm that can return a valid solution to a problem even if it is interrupted before it ends. The algorithm is expected to find better and better solutions the longer it keeps running.

Most algorithms run to completion: they provide a single answer after performing some fixed amount of computation. In some cases, however, the user may wish to terminate the algorithm prior to completion. The amount of computation required may be substantial, for example, and computational resources might need to be reallocated. Most algorithms either run to completion or provide no useful solution information. Anytime algorithms, however, are able to return a partial answer, whose quality depends on the amount of computation they were able to perform. The answer generated by an anytime algorithm is an approximation of the correct answer.

== Names ==
An anytime algorithm may also be called an "interruptible algorithm". Anytime algorithms differ from contract algorithms, which must declare a time in advance; in an anytime algorithm, a process can just announce that it is terminating.

== Goals ==
The goal of anytime algorithms is to give intelligent systems the ability to produce results of better quality in return for longer turn-around time. They are also supposed to be flexible in time and resources. They are important because artificial intelligence (AI) algorithms can take a long time to complete results; an anytime algorithm is designed to deliver a usable result in a shorter amount of time. They are also intended to reflect that the system is dependent on, and restricted to, its agents and how they work cooperatively. An example is the Newton–Raphson iteration applied to finding the square root of a number.
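The Newton–Raphson square-root example can be sketched as a Python generator: every iteration yields a usable approximation, so the caller may stop at any point and keep the latest estimate. The function names, the starting guess, and the step budget used in place of a wall-clock deadline are assumptions made for this illustration.

```python
from itertools import islice


def anytime_sqrt(a: float):
    """Yield successively better approximations of sqrt(a) for a > 0."""
    x = a if a >= 1 else 1.0  # assumed starting guess, at or above the true root
    while True:
        yield x
        x = 0.5 * (x + a / x)  # Newton-Raphson step for f(x) = x^2 - a


def interrupted_sqrt(a: float, steps: int) -> float:
    """Return the estimate available after `steps` refinement steps.

    A step budget stands in for an external interrupt so the example is
    deterministic; a real system might stop on a timer or a signal instead.
    """
    estimate = None
    for estimate in islice(anytime_sqrt(a), steps + 1):
        pass
    return estimate


rough = interrupted_sqrt(2.0, 2)    # interrupted early
refined = interrupted_sqrt(2.0, 6)  # allowed to run longer
```

Both calls return a valid approximation of the square root of 2; the second is simply closer, which is the trade of computation time for quality that the section describes.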
Another example that uses anytime algorithms is a trajectory problem in which a moving object is aimed at a target: the object keeps moving through space while waiting for the algorithm to finish, and even an approximate answer can significantly improve its accuracy if given early.

What makes anytime algorithms unique is their ability to return many possible outcomes for any given input. An anytime algorithm uses many well-defined quality measures to monitor progress in problem solving and to distribute computing resources. It keeps searching for the best possible answer within the amount of time that it is given. It may not run until completion, and it may improve the answer if it is allowed to run longer. This approach is often used for large decision-set problems, where a conventional algorithm would generally not provide useful information unless it is allowed to finish. While this may sound similar to dynamic programming, the difference is that the answer is fine-tuned through random adjustments rather than sequentially.

Anytime algorithms are designed so that they can be told to stop at any time and will return the best result found so far. This is why they are called interruptible algorithms. Certain anytime algorithms also maintain the last result, so that if they are given more time, they can continue from where they left off to obtain an even better result.

== Decision trees ==
When the decider has to act, there must be some ambiguity. There must also be some idea about how to resolve this ambiguity. This idea must be translatable to a state-to-action diagram.

== Performance profile ==
The performance profile estimates the quality of the results based on the input and the amount of time that is allotted to the algorithm. The better the estimate, the sooner the result would be found. Some systems have a larger database that gives the probability that the output is the expected output. One algorithm can have several performance profiles.
Most of the time, performance profiles are constructed using mathematical statistics on representative cases. For example, in the traveling salesman problem, the performance profile was generated using a user-defined special program to generate the necessary statistics. In this example, the performance profile is the mapping of time to the expected results. This quality can be measured in several ways:

* certainty: where probability of correctness determines quality
* accuracy: where error bound determines quality
* specificity: where the amount of particulars determines quality

== Algorithm prerequisites ==
* Initial behavior: While some algorithms start with immediate guesses, others take a more calculated approach and have a start-up period before making any guesses.
* Growth direction: How the quality of the program's "output" or result varies as a function of the amount of time ("run time").
* Growth rate: Amount of increase with each step. Does it change constantly, such as in a bubble sort, or does it change unpredictably?
* End condition: The amount of runtime needed.
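A performance profile of the kind described under "Performance profile" can be estimated empirically. The sketch below assumes a Newton–Raphson square-root iteration as the anytime algorithm and a hypothetical set of representative inputs; it maps the number of refinement steps to the mean quality of the answer. Quality follows the accuracy measure and is defined here, as an assumption, as one minus the relative error, floored at zero.

```python
from math import sqrt


def newton_sqrt_estimates(a: float, max_steps: int):
    """Return the estimates of sqrt(a) after 0..max_steps Newton-Raphson steps."""
    x = a if a >= 1 else 1.0  # assumed starting guess
    estimates = [x]
    for _ in range(max_steps):
        x = 0.5 * (x + a / x)
        estimates.append(x)
    return estimates


def quality(estimate: float, a: float) -> float:
    """Accuracy-style quality in [0, 1]: 1 minus the relative error, floored at 0."""
    exact = sqrt(a)
    return max(0.0, 1.0 - abs(estimate - exact) / exact)


def performance_profile(cases, max_steps: int):
    """Map each step count to the mean quality over the representative cases."""
    runs = [newton_sqrt_estimates(a, max_steps) for a in cases]
    return [
        sum(quality(run[step], a) for run, a in zip(runs, cases)) / len(cases)
        for step in range(max_steps + 1)
    ]


# Hypothetical representative inputs; a real profile would use cases drawn
# from the workload the algorithm is expected to face.
representative_cases = [2.0, 10.0, 50.0, 400.0]
profile = performance_profile(representative_cases, max_steps=8)
```

The resulting list plays the role of the mapping from time to expected quality: `profile[k]` is the quality to expect if the algorithm is interrupted after `k` steps, and it also exposes the growth direction and growth rate listed among the prerequisites.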

SQLf

SQLf is an extension of SQL that applies fuzzy set theory to express flexible (fuzzy) queries against traditional (or "regular") relational databases. Among the known extensions proposed to SQL, it is at present the most complete, because it allows the use of diverse fuzzy elements in all the constructions of the SQL language.

SQLf is the only known proposal of a flexible query system that allows linguistic quantification over sets of rows in queries, achieved by extending the SQL nesting and partitioning structures with fuzzy quantifiers. It also allows the use of quantifiers to qualify the number of search criteria satisfied by single rows.

Several mechanisms are proposed for query evaluation, the most important being the one based on the derivation principle. This consists of deriving classic queries that produce, given a threshold t, a t-cut of the result of the fuzzy query, so that the additional processing cost of using a fuzzy language is reduced.

== Basic block ==
The fundamental querying structure of SQLf is the multi-relational block. The design of this structure is based on the three basic operations of the relational algebra (projection, Cartesian product, and selection) and the application of fuzzy set concepts. The result of a SQLf query is a fuzzy set of rows, that is, a fuzzy relation instead of a regular relation.

A basic block in SQLf consists of a SELECT clause, a FROM clause and an optional WHERE clause. The semantics of this query structure are:

* The SELECT clause corresponds to the projection. It specifies the relations’ attributes (or attribute expressions) that will be selected. The resulting table is a fuzzy set and is given in decreasing order of satisfaction degree. The SELECT clause also specifies a calibration that is intended to restrict the set of rows retrieved. There are two kinds of calibration: quantitative and qualitative. In quantitative calibration the user specifies the number of results to be retrieved, so that the query retrieves the rows with the highest membership degrees up to the number of required answers. In qualitative calibration the user specifies a minimum level of satisfaction that any retrieved row must have.
* The FROM clause corresponds to the Cartesian product. The query is evaluated on the Cartesian product of the relations specified in this clause.
* The WHERE clause corresponds to the selection. It specifies the condition for which the satisfaction degree will be calculated. Rows that do not satisfy the condition at all are rejected. This condition is a fuzzy predicate that may involve any attribute of the relations.

For example, a SELECT query can return a list of hotels that are cheap: it retrieves all rows from the Hotels table that satisfy the fuzzy predicate cheap, defined by the fuzzy set μ=(∞, ∞, 25, 30), and sorts the result in descending order by membership degree.
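The evaluation of such a query can be mimicked outside the database. The following Python sketch is not SQLf syntax; it only illustrates the semantics of a basic block under several assumptions: the trapezoid is read as a left-shoulder set with full membership up to a price of 25 that falls linearly to zero at 30, the `Hotels` rows and their `price` values are invented for the example, and all function names are hypothetical.

```python
def cheap(price: float) -> float:
    """Membership degree in the fuzzy set 'cheap' (assumed left-shoulder trapezoid)."""
    if price <= 25:
        return 1.0
    if price >= 30:
        return 0.0
    return (30 - price) / 5


def fuzzy_select(rows, predicate, attribute, top_k=None, threshold=None):
    """Evaluate a basic block: selection by a fuzzy predicate, then calibration.

    top_k models quantitative calibration and threshold models qualitative
    calibration. Rows with degree 0 are rejected, and the result is ordered
    by decreasing satisfaction degree.
    """
    graded = [(predicate(row[attribute]), row) for row in rows]
    graded = [(mu, row) for mu, row in graded if mu > 0]
    if threshold is not None:
        graded = [(mu, row) for mu, row in graded if mu >= threshold]
    graded.sort(key=lambda pair: pair[0], reverse=True)
    return graded[:top_k] if top_k is not None else graded


def derived_price_bound(threshold: float) -> float:
    """Crisp bound: price <= bound gives the t-cut of 'cheap' for 0 < t <= 1."""
    return 30 - 5 * threshold


# Invented sample data standing in for the Hotels table.
hotels = [
    {"name": "A", "price": 20},
    {"name": "B", "price": 26},
    {"name": "C", "price": 29},
    {"name": "D", "price": 45},
]
result = fuzzy_select(hotels, cheap, "price")
best_two = fuzzy_select(hotels, cheap, "price", top_k=2)
at_least_half = fuzzy_select(hotels, cheap, "price", threshold=0.5)
```

The helper `derived_price_bound` hints at the derivation principle: for a threshold t, the fuzzy condition can be replaced by the ordinary condition `price <= 30 - 5t`, which a regular SQL engine could evaluate directly before any membership degrees are computed.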

Generative literature

Generative literature is poetry or fiction that is automatically generated, often using computers. It is a genre of electronic literature and is also related to generative art. John Clark's Latin Verse Machine (1830–1843) is probably the first example of mechanised generative literature, while Christopher Strachey's love letter generator (1952) is the first digital example. With the large language models (LLMs) of the 2020s, generative literature is becoming increasingly common.

== Definitions ==
Hannes Bajohr defines generative literature as literature involving "the automatic production of text according to predetermined parameters, usually following a combinatory, sometimes aleatory logic, and it emphasizes the production rather than the reception of the work (unlike, say, hypertext)." In his book Electronic Literature, Scott Rettberg connects generative literature to avant-garde literary movements like Dada, Surrealism, Oulipo and Fluxus. Bajohr argues that conceptual art is also an important reference.

== Paradigms of generative literature ==
Bajohr describes two main paradigms of generative literature: the sequential paradigm, where the text generation is "executed as a sequence of rule-steps" and employs linear algorithms, and the connectionist paradigm, which is based on neural nets. The latter leads to what Bajohr calls an algorithmic empathy: "a non-anthropocentric empathy aimed not at the psychological states of the artists but at understanding the process of the work’s material production."

== Poetry generation ==
The first examples of automated generative literature are poetry: John Clark's mechanical Latin Verse Machine (1830–1843) produced lines of hexameter verse in Latin, and Christopher Strachey's love letter generator (1952), programmed on the Manchester Mark 1 computer, generated short, satirical love letters. Examples of generative poetry using artificial neural networks include David Jhave Johnston's ReRites.
== Narrative generation ==
Story generators have often followed specific narratological theories of how stories are constructed. An early example is Grimes' Fairy Tales, the "first to take a grammar-based approach and the first to operationalize Propp's famous model." Mike Sharples and Rafael Pérez y Pérez's book Story Machines gives a detailed history of story generation.

Storyland by Nanette Wylde is an example of generative narrative. Jonathan Baillehache compares Storyland to Surrealist writing. Baillehache states, "When compared to earlier uses of chance operation in literature, a piece like this one resembles some of the automatic writings produced by André Breton and Philippe Soupault in their collective work The Magnetic Fields. . . The difference between Nanette Wylde’s Storyland and Breton and Soupault’s Magnetic Fields is that the former is produced according to a computational algorithm involving randomizers and user interaction, and the latter by two free-wheeling human subjects."

Computer Graphics: Principles and Practice

Computer Graphics: Principles and Practice is a textbook written by James D. Foley, Andries van Dam, Steven K. Feiner, John Hughes, Morgan McGuire, David F. Sklar, and Kurt Akeley and published by Addison–Wesley. First published in 1982 as Fundamentals of Interactive Computer Graphics, it is widely considered a classic standard reference book on the topic of computer graphics. It is sometimes known as the bible of computer graphics (due to its size).

== Editions ==
=== First Edition ===
The first edition, published in 1982 and titled Fundamentals of Interactive Computer Graphics, discussed the SGP library, which was based on ACM's SIGGRAPH CORE 1979 graphics standard, and focused on 2D vector graphics.

=== Second Edition ===
The second edition, published in 1990, was completely rewritten and covered 2D and 3D raster and vector graphics, user interfaces, geometric modeling, anti-aliasing, advanced rendering algorithms and an introduction to animation. The SGP library was replaced by SRGP (Simple Raster Graphics Package), a library for 2D raster primitives and interaction handling, and SPHIGS (Simple PHIGS), a library for 3D primitives, both of which were written specifically for the book.

=== Second Edition in C ===
In the second edition in C, published in 1995, all examples were converted from Pascal to C. New implementations of the SRGP and SPHIGS graphics packages in C were also provided.

=== Third Edition ===
A third edition covering modern GPU architecture was released in July 2013. Examples in the third edition are written in C++, C#, WPF, GLSL, OpenGL, G3D, or pseudocode.

== Awards ==
The book won a Front Line Award (Hall of Fame) in 1998.

On a Red Station, Drifting

On a Red Station, Drifting is a 2012 science fiction novella by Aliette de Bodard. Set in her Xuya Universe, it focuses on two women aboard a space station with a failing artificial intelligence. It received critical acclaim, becoming a finalist for the 2012 Nebula Award for Best Novella, the 2013 Hugo Award for Best Novella, and the 2013 Locus Award for Best Novella.

== Plot ==
Lê Thi Linh is a magistrate of the Dai Viet Empire who is forced to flee her planet after criticizing the Emperor’s wartime policies. At the same time, rebel groups seize control of her planet and kill most of her subordinates. Linh seeks refuge with her distant relatives on Prosper Station. Prosper is controlled by an artificial intelligence called the Honoured Ancestress. Lê Thi Quyen, Linh’s cousin by marriage, manages the day-to-day operations of Prosper while her husband is away at war. Quyen and Linh immediately fall into conflict.

Quyen’s brother-in-law Huu Hieu sells his mem-implants, which are copies of their ancestors’ consciousnesses. Meanwhile, the Honoured Ancestress experiences increasingly severe technical problems. Hieu and Linh become close. Hieu plans to use the money from the sale of the implants to leave Prosper and marry his lover on a different station. Linh is upset knowing that she will never be able to leave.

A visiting cousin, Lady Oanh, provides schematics for the repair of the Honoured Ancestress. In an effort to hurt Quyen, Linh writes an unflattering poem at a banquet honoring Oanh. In doing so, she reveals that Hieu is trying to leave Prosper. Hieu attempts suicide out of shame, but Linh rescues him. Quyen is able to repair the Honoured Ancestress, restoring her functionality at the expense of erasing many of her memories.

The Emperor’s Embroidered Guard arrives at Prosper Station in search of Linh. Linh finds the missing mem-implants and returns them to Quyen. Quyen and Linh briefly reconcile before Linh is arrested and removed from Prosper Station.
== Major themes ==
A review in Kirkus stated that the novel's "familiar setting" was a "departure point" for the novel to explore its themes. The novel explores family ties; almost everyone on Prosper Station is related in some fashion. Additionally, the use of ancestors' mem-implants further explores the concept of family ties, with some descendants being considered more "worthy" than others due to their higher number of implants. The novel also explores questions of worth, as those who fail at ability tests are often forced to become the "lesser partners" in marriages and are discriminated against due to their perceived lack of achievement. The author notes that it is interesting that gender plays no role in the question of worth, and that the majority of the men in the story are actually the "lesser partner" in their marriage.

== Style ==
The novel is divided into three sections. Liz Bourke wrote that each section builds thematically "towards an emotional crescendo".

== Reception ==
Writing for Locus, Liz Bourke praised the novel's exploration of interpersonal conflict between Linh and Quyen, writing that it "essentially subverts the popularly-understood derogatory overtones of 'domestic conflict'". Bourke also praised the story's tension, calling it "so well-strung the prose practically vibrates under its influence". A review for Kirkus stated that the novel is a "beautifully realized story and the characters, plot, theme and writing are expertly crafted."

=== Awards ===